9. MSAs Flashcards
What are the two different applications of substitution matrices?
1. using identities or a substitution matrix to detect similarities of closely or distantly related sequences
2. using a substitution matrix to optimize an alignment
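Sketch for the two scoring styles named above, applied to an already aligned pair. The function names, the gap penalty, and the toy scoring rule (identity 2, transition 1, transversion -1) are assumptions made for illustration, not values from a published matrix:

```python
PURINES = {"A", "G"}


def percent_identity(a, b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def toy_substitution(x, y):
    """Hypothetical DNA scoring rule: similar (non-identical) residues still score > 0."""
    if x == y:
        return 2
    if (x in PURINES) == (y in PURINES):
        return 1  # transition (purine<->purine or pyrimidine<->pyrimidine)
    return -1  # transversion


def matrix_score(a, b, gap=-2):
    """Score an aligned pair column by column with the toy rule and a flat gap penalty."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        else:
            total += toy_substitution(x, y)
    return total
```

An alignment program would search for the gap placement that maximizes a score like `matrix_score`; `percent_identity` only summarizes an alignment that already exists.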
Comparative sequence analysis:
starting with: seq A + seq B –>
similarity / homology? compute (optimal) alignment
Comparative sequence analysis:
starting with: one sequence + many sequences –>
find database sequences that are similar (homologous) to the query sequence
Comparative sequence analysis:
starting with: homologous sequences, not aligned –>
compute a multiple sequence alignment
Comparative sequence analysis:
starting with: homologous sequences, aligned + many sequences –>
model the alignment, find additional family members
Multiple sequence alignments (MSAs)
How can we collect homologous sequences for this?
collect (putatively) homologous sequences
- BLAST
- clustering approaches, …
What can an MSA be used for?
use the MSA to do further analysis
- description of variable & conserved regions
- phylogenetic inference of sequences
- test for signatures of selection
- predict protein structure and function
- PCR-primer design
- …
Multiple sequence alignments (MSAs)
what is this?
arranging sequences such that residues within a column
* result in an optimal or reasonable score for a given scoring scheme
* show maximal similarity
* are homologous (positional homology)
* play a common functional role
* are in equivalent positions in the corresponding structures
What is positional homology and how is this relevant for MSAs?
positional homology
* aligned residues share a common ancestral residue in the ancestral sequences
* changes in the columns correspond to mutations
MSA, in the context of evolutionary analysis:
* a hypothesis about the positional homology of residues in homologous sequences
Challenges for good MSAs
biological?
- biological accuracy: criterion for accuracy?
- reconcile multiple pairwise alignments into an MSA - different possible alignments
- highly divergent sequences
Challenges for good MSAs
non biological?
large datasets
- fast heuristics are needed to align thousands (millions?) of sequences
- accuracy of large-scale approaches?
computational
- mathematical accuracy: no fast solution exists, all approaches use heuristics
MSAs - de novo alignment approaches (+ examples)?
which ones most relevant to us?
- multiple “local” alignments (Dialign)
- *progressive (iterative) alignment (Clustal, MAFFT) *
- divide & conquer for huge data sets (PASTA)
- *meta-alignments / consensus methods (M-Coffee) *
- (machine learning?)
Types of progressive alignment? + examples
- consistency-based approach (T-Coffee, MAFFT)
- phylogeny-aware alignments (Prank)
- very fast heuristics (MAFFT)
MSAs - reference or seed-based methods?
- probabilistic approaches (HMMs: HMMer)
Progressive alignment
steps?
1. compute a pairwise distance matrix
2. use alignment scores to compute a guide tree (not a phylogeny!)
3. align closely related sequences, progressively add more distantly related sequences
- sub-alignments are “frozen”
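Sketch of steps 1 and 2, giving the order in which step 3 would align things. Assumptions: the sequences have equal length, so a simple fraction of differing positions stands in for real pairwise alignment scores, and average-linkage (UPGMA-style) clustering stands in for whatever a given program uses. `p_distance` and `guide_tree_order` are hypothetical names:

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions (stand-in for a pairwise alignment score)."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree_order(seqs):
    """Return the merge order of a simple average-linkage guide tree.

    seqs: dict mapping sequence name -> sequence.
    Each merge is a pair of tuples listing the names in the two joined groups.
    """
    # Step 1: pairwise distance matrix.
    dist = {
        frozenset((a, b)): p_distance(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }
    clusters = {name: (name,) for name in seqs}

    def average_distance(c1, c2):
        members1, members2 = clusters[c1], clusters[c2]
        total = sum(dist[frozenset((m, n))] for m in members1 for n in members2)
        return total / (len(members1) * len(members2))

    # Step 2: repeatedly join the two closest groups.
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: average_distance(*pair),
        )
        merges.append((clusters[c1], clusters[c2]))
        merged = clusters.pop(c1) + clusters.pop(c2)
        clusters["+".join(merged)] = merged
    return merges


example = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "TTGTACCA",
    "D": "TTGTTCCT",
}
order = guide_tree_order(example)
```

Here `order` lists the closest pair first and the join of the two resulting groups last, which is the order a progressive aligner would follow. The tree only schedules the alignment steps; it is not meant as a phylogeny.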
Progressive alignment
What do we need to do for subalignments?
What is often variable?
* compute profiles for subalignments - summary/statistical information about conservation/residues in each column
* often: variable substitution matrix - at each step: based on distance between sequences to be compared
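Sketch of the simplest kind of profile: per-column residue frequencies of a subalignment. Real programs typically add sequence weights and pseudocounts; those are left out here, and `column_profile` is a hypothetical name:

```python
from collections import Counter


def column_profile(alignment):
    """Return one dict per column mapping residue (or '-') to its relative frequency."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


profile = column_profile(["AC-T", "ACGT", "TCGT"])
```

A fully conserved column shows up as a single residue with frequency 1.0, a variable column as several residues with lower frequencies. Aligning a sequence or another profile against this summary is what lets the frozen subalignment be treated as one unit.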
For a progressive alignment, what is the sum-of-pairs score?
What does it use?
What assumption does it make?
What is the WSP?
sum of scores of all induced pairwise alignments
assumes statistical independence for all columns
uses a substitution matrix
weighted sum of pairs (WSP): pairwise scores are adjusted for biased phylogenetic distribution
- identity: 1
- mismatch: -1
- gap: -2
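Sketch of the unweighted sum-of-pairs score using the example values from this card (identity 1, mismatch -1, gap -2). Scoring a gap paired with a gap as 0 is an assumption of this sketch; conventions differ. Function names are hypothetical:

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols taken from the same column."""
    if x == "-" and y == "-":
        return 0  # assumption: gap-gap pairs do not contribute
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment):
    """Sum pair scores over every column and every pair of rows.

    Columns are scored independently, which is the independence assumption
    named on this card.
    """
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


sp = sum_of_pairs(["AC-T", "ACGT", "TCGT"])
```

Column by column this gives -1, 3, -3, 3, so `sp` is 2. A weighted version would multiply each pair's score by a weight for that pair of sequences before summing.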
Progressive alignment
- first implementation?
- frequently used implementations?
(* basic progressive alignment )
- first implementation: 1987
- frequently used implementations
- 1994: CLUSTAL W
- 1997: CLUSTAL X
- 2011: CLUSTAL O
Progressive alignment - improvements on original implementation:
challenge: errors that are frozen in subalignments
solution?
solution: iterative refinement (most programs)
Progressive alignment: improvements on original implementation:
- challenge: suboptimal pairwise alignments in MSA
solution?
solution: consistency scores (T-Coffee)
Progressive alignment: improvements on original implementation:
- challenge: evolutionarily correct (aware) MSAs
solution: modify gap costs (Prank)
Progressive alignment: improvements on original implementation:
- challenge: guide tree takes long for huge data sets
solution: fast clustering of sequences (Clustal-O)
Progressive alignment
major weakness?
solution?
major weakness:
* alignment errors cannot be corrected once they are introduced
solution: repair errors during post-processing
* iterative refinement
* implemented in most programs
Progressive alignment
Iterative refinement?
- remove and re-align a single sequence
- partition the sequences (randomly or tree-based), re-align within groups, then align groups
- re-align after each profile-alignment (seq-profile, profile-profile)
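Sketch of the first strategy (remove one sequence, re-align it, keep the result only if the score improves). The aligner and the scoring function are passed in as hypothetical callables `realign` and `score`, since the actual re-alignment step depends on the program; only the bookkeeping is shown:

```python
def drop_gap_only_columns(rows):
    """Remove columns that contain only gaps (they appear once a row is taken out)."""
    if not rows:
        return []
    keep = [
        i for i, column in enumerate(zip(*rows))
        if any(c != "-" for c in column)
    ]
    return ["".join(row[i] for i in keep) for row in rows]


def refine_once(alignment, index, realign, score):
    """One refinement step for the sequence at position `index`.

    realign(rest, seq, index) -> new alignment (list of rows), user supplied.
    score(alignment) -> number, higher is better, user supplied.
    """
    rest = drop_gap_only_columns(alignment[:index] + alignment[index + 1:])
    seq = alignment[index].replace("-", "")
    candidate = realign(rest, seq, index)
    # Accept the candidate only if it scores strictly better than the input.
    return candidate if score(candidate) > score(alignment) else alignment


# Toy stand-ins, for demonstration only.
def naive_realign(rest, seq, index):
    """Put the ungapped sequence back without gaps (valid only if lengths match)."""
    return rest[:index] + [seq] + rest[index:]


def conserved_columns(alignment):
    """Count columns in which all rows carry the same symbol."""
    return sum(len(set(column)) == 1 for column in zip(*alignment))


start = ["AC-GT", "AC-GT", "A-CGT"]
refined = refine_once(start, 2, naive_realign, conserved_columns)
```

In this toy case the misplaced gap in the third row is removed and the number of fully conserved columns rises from 3 to 4. Looping `refine_once` over all rows until the score stops improving gives the usual iterative scheme.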