Bioinformatics Flashcards
What is bioinformatics?
* development and use of software tools/programming code to analyse complex biological data.
* a bioinformatician is the person who develops the software/algorithms/models.
* an evolutionary biologist (for example) is the person who uses it.
* transdisciplinary combination of biology, software engineering, programming, information engineering, computer science, mathematics and particularly statistics.
* rapidly evolving field – some methodologies (not specific programmes but general approaches to analysis) from 2019 are already obsolete.
* used in physiology, biochemistry, genetics, all ‘-omic’ and ‘meta-omic’ methods, evolutionary biology, taxonomy and systematics, structural biology, drug design, personalised biomedicine.
Microbiological origins
The technology moves very quickly; however, sequences in databases can be unreliable and misleading, as they have not necessarily gone through specific checks.
What are -omic methods?
Describe:
- genomics
- transcriptomics
- proteomics
- ‘-omic’ methods study the whole of something in an organism:
- genomics is the study of the sum of the chromosomal DNA of an organism (the genome).
- transcriptomics is the study of the sum of the mRNA in an organism under a specific growth condition (the transcriptome). It is always comparative to a control (e.g. a tumour cell compared with a normal cell).
- proteomics is the study of the sum of the protein content of an organism under a specific growth condition (the proteome). It is not a true “-ome”: you can’t recover all the proteins of the cell.
- Others have appeared: metabolomics (sum of all metabolites in an organism under a specific growth condition; impossible to capture completely), metallomics (sum of all metals in an organism under a specific growth condition) – they get VERY contrived (“surfaceomics…”).
- Nowadays there is usually some comparative element, e.g. comparative genomics.
What are meta-omic methods?
meta = beyond (everything)
* ‘meta-omic’ methods study the whole of something in an ecosystem:
* metagenomics is the study of the sum of the chromosomal DNA of all organisms of all Domains in any given ecosystem.
* metatranscriptomics is the study of the sum of the mRNA of all organisms of all Domains in any given ecosystem.
* metaproteomics is the study of the sum of the protein content of all organisms of all Domains in any given ecosystem.
* DO NOT confuse metagenomics with sequencing a single gene from a whole ecosystem – some people call that ‘metagenetics’ or ‘microbiomics’, to me it’s just ‘molecular ecology’!
* single-gene sequencing of this kind is a targeted ‘meta’ approach (the data have been filtered or sieved).
What are pan-omic methods?
Only pangenomics.
* ‘pan-omic’ methods study the whole of something across every sample of a given species:
* pangenomics is the comparative study of the sum of the chromosomal DNA of every strain of a given species versus every strain of other species, e.g. every strain of Thermithiobacillus tepidarius versus every strain of Thermithiobacillus plumbiphilus – usually to look for evolutionary changes and to assist taxonomy and systematics, or to look at the evolution of a specific trait such as pathogenicity – very common in medical microbiology.
* can also be applied to e.g. the DNA of every single known museum specimen of Bos taurus subsp. taurus L. to look at the evolution of populations within the subspecies.
Basic principles of alignments
- If you have two sequences of something (DNA, mRNA, rRNA, protein, Roman alphabet, Greek alphabet, Georgian script…), you can compare them using various methods based on sequence alignment, either to search or to analyse the data.
e.g. if we take the sequences WHICH and WITCH, we can compare them easily as they are the same length – no need to align: WHICH, WITCH
The W and CH (marked in red) are conserved (conservative) positions, in which the sequence is the same in everything in our dataset (two sequences in total). The rest of the letters (H/I and I/T) are non-conserved (non-conservative) positions, in which the sequences differ across the dataset.
Describe the alignments further
Positions can be classed as:
- conserved = all sequences have it
- semi-conserved = most sequences have it
In the alignment display:
- stars mark conserved positions
- dashes mark non-conserved positions
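The star/dash marking can be sketched in a few lines of Python. This is a minimal illustration, assuming the sequences are already aligned to the same length; the function name `conservation_line` is made up for this example, and it follows the card's convention of a star for a conserved column and a dash for a non-conserved one.

```python
def conservation_line(aligned):
    """Return '*' for each conserved column and '-' for each non-conserved one.

    Assumes every sequence in `aligned` already has the same length.
    Gap characters are compared like any other character here.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must already be aligned to the same length")
    # zip(*aligned) walks the alignment column by column.
    return "".join("*" if len(set(column)) == 1 else "-" for column in zip(*aligned))


seqs = ["WHICH", "WITCH"]
for seq in seqs:
    print(seq)
print(conservation_line(seqs))  # prints *--**
```

For WHICH and WITCH, the first, fourth and fifth columns (W, C, H) are starred and the two middle columns are dashed.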
What happens if the sequences are not all the same length?
The computer can create “gaps” to nudge the sequences into alignment.
Sometimes dashes are used to show the gaps.
The gaps are only there to help with the calculation; they are not part of the sequence itself.
27 sites, of which 21 are conserved
Sequence identity = 21/27*100 = 77.78 % identity
(identity is used for DNA and RNA mainly but also protein)
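The identity calculation can be sketched as below. This is a minimal example, assuming two sequences that are already aligned, with "-" as the gap character, and assuming the denominator is the total number of alignment sites (as in the 21/27 example above). Some tools use a different denominator, so check what your software reports. The function name `percent_identity` is illustrative only.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of alignment sites at which two aligned sequences are identical.

    Assumes both sequences are already aligned (same length, '-' for gaps).
    A column where both sequences have a gap does not count as identical.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to the same length")
    sites = len(seq_a)
    identical = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return 100.0 * identical / sites


print(percent_identity("WHICH", "WITCH"))   # 3 of 5 sites identical -> 60.0
print(round(21 / 27 * 100, 2))              # the card's example -> 77.78
```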
Sequence similarity counts identical positions AND similar positions – usually for amino acids, and defined on the basis of chemical properties. DON’T CONFUSE IDENTITY AND SIMILARITY!
Similarity is used more when sequences are semi-conserved.
Don’t use ‘similarity’ in a genetic context unless you actually know the sequences share similarity in this sense.
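To see how identity and similarity diverge, here is a small sketch. The amino acid grouping in `GROUPS` is an assumption made purely for illustration (there is no single agreed grouping, and real tools usually use a substitution matrix instead), and the two short peptide strings are invented.

```python
# Illustrative grouping of amino acids by broad chemical property (an assumption).
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def identity_and_similarity(seq_a, seq_b):
    """Return (% identity, % similarity) for two aligned protein sequences.

    Similarity counts identical positions AND positions whose residues
    fall in the same illustrative group. Gapped columns count as neither.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to the same length")
    identical = 0
    similar = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        if a == b:
            identical += 1
            similar += 1
        elif a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
            similar += 1
    sites = len(seq_a)
    return 100.0 * identical / sites, 100.0 * similar / sites


# Invented peptides: M and L match exactly; K/R, V/I and D/E are "similar".
print(identity_and_similarity("MKVLD", "MRILE"))  # (40.0, 100.0)
```

The same pair of sequences scores 40 % identity but 100 % similarity under this grouping, which is why the two terms must not be swapped.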
What does an exclamation mark mean in an equation?
It is a factorial: multiply the number by every whole number below it, down to 1, e.g. 99! means 99 × 98 × 97 × 96 × … × 1.
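As a quick check of the arithmetic, the factorial can be written as a loop. This is just an illustrative sketch; Python's standard library already provides `math.factorial`, which is used here only to confirm the result.

```python
import math


def factorial(n):
    """Multiply n by every whole number below it, down to 1."""
    result = 1
    for value in range(n, 1, -1):
        result *= value
    return result


print(factorial(5))                          # 5 x 4 x 3 x 2 x 1 = 120
print(factorial(99) == math.factorial(99))   # True
```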
What are pairwise alignments (every pair in a dataset)?
They allow you to match regions in sequences to identify probable structural and functional similarities.
Pairwise alignments – every pair in a dataset
In real datasets, various alignment algorithms are used, such as CLUSTALW, MUSCLE, MAFFT etc – all have benefits and weaknesses.
Understanding how they work is important if you are to select the right one – a ‘cookie-cutter’ approach of copying a paper doesn’t work – their dataset might be bigger, more diverse etc than yours.
MUSCLE = good for large datasets
CLUSTALW = better for smaller datasets
MAFFT = slow but most accurate for large datasets
Pairwise alignments – every pair in a dataset (part 2)
An evolutionary algorithm (a substitution model) is then used to put some numbers (and statistics) on the level of relatedness between each pair.
There are many such algorithms (some for amino acids, some for nucleotides), and they are selected by model-testing against the aligned dataset.
Model-testing can be done on the basis of various information criteria, e.g. BIC, AIC, AICc – the specific one used must be selected to match the diversity of the dataset, and is not arbitrarily “AICc is best” etc. Generally, BIC is best for more diverse data, such as a gene from every member of the Eukarya; as datasets get less diverse, AIC takes over, with AICc for very non-diverse datasets. E.g. if doing an alignment of the same protein across a whole Class or Phylum, use BIC; if across a single genus, use AICc. That said, it depends how conserved the sequence is!
The output is a matrix of numbers – not very interesting or useful on its own, but it can be used to generate phylogenetic trees to look at levels of relatedness by branch length.
When model-testing, pick the model that has the lowest value of the chosen information criterion.
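The “lowest value wins” rule can be sketched with the standard definitions of the three criteria: AIC = 2k − 2 ln L, AICc = AIC + 2k(k + 1)/(n − k − 1), and BIC = k ln(n) − 2 ln L, where k is the number of free parameters, n is the sample size (here taken to be the number of alignment sites, which is an assumption) and ln L is the log-likelihood of the fitted model. The log-likelihood values below are invented for illustration, and the parameter counts cover substitution parameters only; real model-testing software also counts other parameters such as branch lengths.

```python
import math


def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    # Small-sample correction added to AIC; requires n > k + 1.
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood


def score_models(fits, n_sites):
    """fits maps model name -> (log-likelihood, number of free parameters)."""
    return {
        name: {
            "AIC": aic(lnl, k),
            "AICc": aicc(lnl, k, n_sites),
            "BIC": bic(lnl, k, n_sites),
        }
        for name, (lnl, k) in fits.items()
    }


def best_model(scores, criterion):
    """Return the model with the lowest value of the chosen criterion."""
    return min(scores, key=lambda name: scores[name][criterion])


# Hypothetical fits to a 500-site nucleotide alignment (numbers are made up).
fits = {"JC69": (-2150.0, 0), "HKY85": (-2105.0, 4), "GTR": (-2103.0, 8)}
scores = score_models(fits, n_sites=500)
for criterion in ("AIC", "AICc", "BIC"):
    print(criterion, best_model(scores, criterion))
```

With these invented numbers all three criteria happen to agree; on real data they can disagree, which is why the choice of criterion matters.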
BLAST – the most common informatic tool
Basic local alignment search tool (BLAST) was published in 1990 and is a tool for searching a sequence database using a query sequence. It could be used to get a rough idea of what a species is based on a marker gene (e.g. the 16S rRNA gene), or what type of enzyme a protein is.
It is just a search tool – you need a database! Most people who say “BLAST” mean “NCBI BLAST” or “EBI BLAST”, i.e. the BLAST search tool for interrogating the GenBank database of DNA and protein sequence data. You can have a local BLAST database on your own PC just to search your own data. There are specialist databases out there too.
Proper phraseology: “…the GenBank™ database was interrogated using the BLASTn algorithm (Altschul et al., 1990) using the putative cbbL sequence from Thiomicrorhabdus heinhorstii…”
Will a protein have high identity with other sequences?
Not necessarily. Amino acid sequences often do not have a high percentage identity, because many amino acids do similar things, so they will have a higher similarity than identity.
What is the e-value?
The e-value is a measure of match quality. It tells you whether a match could be happening by chance: the higher the number, the more likely it is a nonsense match, i.e. the same sequence turning up randomly, probably because the sequence is too short.
The longer the sequence, the better the quality, as shorter sequences will probably occur in many sequences on Earth.
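A rough sketch of why short matches give high e-values, assuming the simplified relation E = m × n × 2^(−S′), where m is the query length, n is the total length of the database and S′ is the bit score of the match. Real BLAST applies effective-length and other corrections, so the numbers below (all invented) are illustrative only, and `expect_value` is a made-up helper name.

```python
def expect_value(query_length, database_length, bit_score):
    """Simplified e-value: expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


database_length = 1_000_000_000  # hypothetical database of 10^9 residues

# A short query can only reach a low bit score, so many chance hits are expected.
short_hit = expect_value(query_length=20, database_length=database_length, bit_score=30)

# A long, well-matched query reaches a high bit score, so chance hits are negligible.
long_hit = expect_value(query_length=500, database_length=database_length, bit_score=400)

print(f"short query: E = {short_hit:.3g}")
print(f"long query:  E = {long_hit:.3g}")
```

Under this relation each extra bit of score halves the e-value, while a bigger database raises it, so the same hit looks weaker against a larger database.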