Bioinformatics 7: Sequence alignment and its significance Flashcards
The 2 types of homolog and their differences?
ortholog - separated via speciation event
paralog - separated via duplication event
What is meant by the orthology conjecture?
Orthologs are more likely to show functional conservation than paralogs
i.e. orthologous genes usually remain related in function
paralogous genes arise by duplication and can then diverge (the 2 copies start with the same function, so one is free to change)
What is meant by ‘chance similarity’?
Similarity between two sequences that arises purely by chance
not because they are related structurally or functionally
2 ways DNA sequences might differ?
Mismatches
Gaps
created by substitutions and indels (insertions/deletions) respectively
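What is a dotplot in the context of alignment?
Matrix of 2 sequences, marked wherever the row and column residues match
Diagonal runs of marks show regions of similarity; alignment algorithms look for the path through the matrix that gives the most likely evolutionary pathway between the 2 sequences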
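A minimal sketch of the dotplot idea, assuming exact single-residue matches and no windowing or filtering (real dotplot tools usually add both). The function names and the example sequences are illustrative only.

```python
def dotplot(seq_a, seq_b):
    """Return a matrix (list of rows) with True where seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Draw the dotplot as text: '*' for a match, '.' otherwise."""
    matrix = dotplot(seq_a, seq_b)
    lines = ["  " + " ".join(seq_b)]  # header row: second sequence
    for residue, row in zip(seq_a, matrix):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Similar sequences give a long run of '*' along the main diagonal.
    print(render("GATTACA", "GATCACA"))
```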
Which is more common - indels or substitutions? How does this affect alignment?
Substitutions are far more common than indels
-> alignment algorithms must take this into account (a gap should cost more than a mismatch)
thus ‘quality’ of alignments is assessed via a scoring scheme (e.g. matches +ve, mismatches 0, gaps -ve) -> algorithms maximise the score
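As a rough illustration of scoring an alignment, the sketch below sums per-column scores over two already-aligned rows. The values +1 / 0 / -1 are assumptions chosen to mirror the "+ve, 0, -ve" scheme on this card, and "-" is assumed to mark a gap.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Score two aligned rows of equal length, column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # indel: penalised
        elif a == b:
            score += match      # identical residues: rewarded
        else:
            score += mismatch   # substitution: neutral in this scheme
    return score


if __name__ == "__main__":
    print(score_alignment("GATTACA", "GACTACA"))  # 6 matches, 1 mismatch -> 6
    print(score_alignment("GATTACA", "GA-TACA"))  # 6 matches, 1 gap -> 5
```

With these assumed values a single gap costs more than a single mismatch, which reflects indels being the rarer event.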
Types of gap penalty?
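Constant - same penalty for any gap, regardless of its length
Proportional (linear) - penalty increases in proportion to gap length
Affine - larger penalty for opening a gap + smaller penalty for each position it is extended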
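The three penalty types can be sketched as functions of gap length. The numeric defaults are arbitrary assumptions, and the affine form shown (opening cost covers the first position, extension cost applies to each further position) is one of several conventions in use.

```python
def constant_gap(length, penalty=2):
    """Same cost for any gap, however long."""
    return penalty if length > 0 else 0


def proportional_gap(length, per_position=2):
    """Cost grows linearly with the number of gapped positions."""
    return per_position * length


def affine_gap(length, gap_open=5, gap_extend=1):
    """Opening cost for the first position, smaller cost for each extra one."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, constant_gap(n), proportional_gap(n), affine_gap(n))
```

Under the affine scheme one long gap is cheaper than several short ones of the same total length, which suits the idea that a single indel event can insert or delete many residues at once.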
How would a penalty applied to an amino acid substitution vary in severity?
If the replacement amino acid has similar chemical properties (so function is likely preserved) -> low penalty
If completely different, the change is more likely to be deleterious -> high penalty
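A toy sketch of property-based substitution penalties. The four groups are a rough textbook classification and the penalty values are assumptions; real tools use empirically derived substitution matrices (e.g. the BLOSUM or PAM families) rather than a grouping like this.

```python
# Rough chemical classes (an assumption for illustration, one-letter codes).
GROUPS = {
    "nonpolar": "AVLIMFWPG",
    "polar": "STCYNQ",
    "basic": "KRH",
    "acidic": "DE",
}

# Map each amino acid to its class name.
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def substitution_penalty(original, replacement, low=1, high=3):
    """0 for no change, low within a class, high across classes."""
    if original == replacement:
        return 0
    if CLASS_OF[original] == CLASS_OF[replacement]:
        return low
    return high


if __name__ == "__main__":
    print(substitution_penalty("L", "I"))  # both nonpolar -> 1
    print(substitution_penalty("L", "D"))  # nonpolar to acidic -> 3
```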
How do heuristic algorithms work and why are they used over dynamic programming algorithms? Example of one?
Dynamic programming guarantees the optimal alignment but is too slow for searching large databases; heuristics trade some sensitivity for speed
Heuristic methods assume high scoring alignments contain short regions of exact matches
-> they break queries into short ‘words’ and look for matches above a threshold
-> initial hits examined to see if they can be extended
-> alignment then scored to quantify similarity
e.g. BLAST
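The word-matching step can be sketched as follows, assuming exact word matches only (BLAST's protein searches also accept similar words scoring above a threshold, which is not shown here). The function names and sequences are made up for illustration.

```python
def words(seq, w):
    """All overlapping words of length w, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    index = {}
    for j, word in words(subject, w):
        index.setdefault(word, []).append(j)  # word -> positions in subject
    return [(i, j) for i, word in words(query, w) for j in index.get(word, [])]


if __name__ == "__main__":
    print(find_seeds("ACGTAC", "TTACGTT"))  # [(0, 2), (1, 3), (3, 1)]
```

Indexing the subject once and looking each query word up in the index is what avoids comparing every position against every other position.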
How does BLAST work?
Basic Local Alignment Search Tool
word (W) size: 3 (proteins), 11 (DNA)
-> searches only for word matches above threshold, T
-> matches above T are extended in both directions (forming HSPs) until mismatches or gaps cause the alignment score to fall drastically
-> neighbouring HSPs are joined, HSPs in low identity regions are not joined
High-scoring segment pairs (HSPs) along query are reported + ordered by score
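A simplified sketch of the extension step: starting from a word hit, extend to the right without gaps and stop once the score has fallen more than `x_drop` below the best score seen. Only rightward, ungapped extension is shown, and the scoring values and `x_drop` are assumptions, not BLAST's actual defaults.

```python
def extend_right(query, subject, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit at query[i:i+w] / subject[j:j+w] to the right.

    Returns (best_score, best_length) of the extended segment pair.
    """
    score = best = w * match   # the seed word itself matches exactly
    best_len = w
    qi, sj = i + w, j + w
    while qi < len(query) and sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        if score > best:
            best, best_len = score, qi - i + 1
        elif best - score > x_drop:
            break              # score has fallen too far: stop extending
        qi += 1
        sj += 1
    return best, best_len


if __name__ == "__main__":
    # Seed ACG at (0, 0); the run of mismatches after position 5 ends it.
    print(extend_right("ACGTACGGTT", "ACGTACCCCC", 0, 0, 3))  # (6, 6)
```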
Types of BLAST searches and their uses?
blastn: nucleotide query vs nucleotide db (what gene is this?)
blastp: protein query vs protein db (what protein is this?)
blastx: translated nucleotide query vs protein db (does this DNA code for a known protein?)
tblastn: protein query vs translated nucleotide db (what DNA might encode this protein?)
tblastx: translated nucleotide query vs translated nucleotide db (does this DNA code for a novel protein?)
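The list above can be read as a lookup on three things: query type, database type, and whether nucleotide sequences are translated so the comparison happens at the protein level. A small sketch of that lookup follows; the function and argument names are hypothetical and are not part of any BLAST interface.

```python
# (query type, db type, nucleotide side(s) translated?) -> program name
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "nucleotide", True): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_program(query_type, db_type, translate=False):
    """Pick the BLAST flavour for a query/database combination."""
    key = (query_type, db_type, translate)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for combination {key}")
    return BLAST_PROGRAMS[key]


if __name__ == "__main__":
    print(choose_program("nucleotide", "protein", translate=True))  # blastx
```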
What is FASTA? How does it compare to BLAST?
Heuristic alignment search algorithm
AND a text sequence format (single description line starting with ">", followed by lines of sequence data)
- generally more sensitive to distant relationships but slower than BLAST
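For the file-format sense of FASTA, a minimal parser sketch: each record is a description line beginning with ">" followed by one or more lines of sequence. The record names in the example are invented, and details such as comment lines or validation of residue letters are deliberately left out.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 hypothetical example\nGATTA\nCA\n>seq2\nACGT\n"
    print(parse_fasta(example))
    # [('seq1 hypothetical example', 'GATTACA'), ('seq2', 'ACGT')]
```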
In the context of a search result, what is a P value and an E value?
P value = probability of observing an alignment score at least this high between 2 unrelated sequences of the same length + composition
E (Expect) value = number of matches scoring at least this high that would be expected to occur by chance in a db of that size (lower E = more significant)
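The two values are linked by the standard Karlin-Altschul statistics used for local alignment scores: E = K * m * n * exp(-lambda * S) and P = 1 - exp(-E), where m and n are the query and database lengths and S is the raw score. The sketch below assumes those formulas; K and lambda depend on the scoring system, so the numbers used in the example are placeholders rather than values for any particular matrix.

```python
import math


def e_value(score, query_len, db_len, k, lam):
    """Expected number of chance hits scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, given the expected count e."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for tiny e


if __name__ == "__main__":
    # Placeholder parameters, for illustration only.
    e = e_value(score=60, query_len=250, db_len=1_000_000, k=0.1, lam=0.3)
    print(e, p_value(e))
```

For small E the two are nearly equal (P is approximately E), and E scales with database size: the same score is less significant in a larger db.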