9. MSAs Flashcards
What are the two different applications of substitution matrices?
1. using identities or a substitution matrix to detect similarities between closely or distantly related sequences
2. using a substitution matrix to optimize an alignment
Comparative sequence analysis:
starting with: seq A + seq B –>
similarity / homology? compute (optimal) alignment
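A minimal sketch of what "compute (optimal) alignment" can mean for two sequences. The card does not name an algorithm, so this assumes global alignment by dynamic programming (Needleman-Wunsch) with a linear gap cost. The function name and the default scores (match 1, mismatch -1, gap -2) are illustrative assumptions, and a real analysis would normally use a substitution matrix instead of a flat match/mismatch score.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap cost).

    Hypothetical helper for illustration; the scores are toy values.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Only the score is returned here; recovering the alignment itself would additionally need a traceback through the full matrix.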
Comparative sequence analysis:
starting with: one sequence + many sequences –>
find database sequences that are similar (homologous) to the query sequence
Comparative sequence analysis:
starting with: homologous sequences, not aligned –>
compute a multiple sequence alignment
Comparative sequence analysis:
starting with: homologous sequences, aligned + many sequences –>
model the alignment, find additional family members
Multiple sequence alignments (MSAs)
How can we collect homologous sequences for this?
collect (putatively) homologous sequences
- BLAST
- clustering approaches, …
What can an MSA be used for?
use the MSA to do further analysis
- description of variable & conserved regions
- phylogenetic inference of sequences
- test for signatures of selection
- predict protein structure and function
- PCR-primer design
- …
Multiple sequence alignments (MSAs)
what is this?
arranging sequences such that residues within a column
* result in an optimal or reasonable score for a given scoring scheme
* show maximal similarity
* are homologous (positional homology)
* play a common functional role
* are in equivalent positions in the corresponding structures
What is positional homology and how is this relevant for MSAs?
positional homology
* aligned residues share a common ancestral residue in the ancestral sequences
* changes in the columns correspond to mutations
MSA, in the context of evolutionary analysis:
* a hypothesis about the positional homology of residues in homologous sequences
Challenges for good MSAs
biological?
- biological accuracy: criterion for accuracy?
- reconcile multiple pairwise (pw) alignments into an MSA: different alignments are possible
- highly divergent sequences
Challenges for good MSAs
non biological?
large datasets
- fast heuristics are needed to align thousands (millions?) of sequences
- accuracy of large-scale approaches?
computational
- mathematical accuracy: no fast exact solution exists, so all approaches use heuristics
MSAs - de novo alignment approaches (+ examples)?
which ones most relevant to us?
- multiple “local” alignments (Dialign)
- *progressive (iterative) alignment (Clustal, MAFFT)*
- divide & conquer for huge data sets (PASTA)
- *meta-alignments / consensus methods (M-Coffee)*
- (machine learning?)
Types of progressive alignment? + examples
- consistency-based approach (T-Coffee, MAFFT)
- phylogeny-aware alignments (Prank)
- very fast heuristics (MAFFT)
MSAs - reference or seed-based methods?
- probabilistic approaches (HMMs: HMMer)
Progressive alignment
steps?
1. compute a pairwise distance matrix
2. use alignment scores to compute a guide tree (not a phylogeny!)
3. align closely related sequences, progressively add more distantly related sequences
- sub-alignments are “frozen”
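The guide-tree step above can be sketched as follows. This is only an illustration under stated assumptions: it uses average-linkage (UPGMA-style) clustering, which the card does not specify, and actual tools may use a different clustering method. The function name `guide_tree` and the distances in the example are hypothetical.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Average-linkage (UPGMA-style) clustering; returns a nested tuple.

    dist maps frozenset({name_a, name_b}) to a pairwise distance.
    Hypothetical helper for illustration only.
    """
    # each cluster is (subtree, list of leaf names it contains)
    clusters = [(n, [n]) for n in names]

    def avg_dist(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # pick the two clusters with the smallest average distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# made-up distances for four sequences
toy = {
    frozenset("AB"): 0.10, frozenset("CD"): 0.20,
    frozenset("AC"): 0.60, frozenset("AD"): 0.70,
    frozenset("BC"): 0.65, frozenset("BD"): 0.70,
}
print(guide_tree(["A", "B", "C", "D"], toy))
```

The nested tuple gives the order of alignment: the innermost pairs are aligned first, then the resulting sub-alignments are aligned to each other.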
Progressive alignment
What do we need to do for subalignments?
What is often variable?
* compute profiles for subalignments: summary/statistical information about the conservation of residues in each column
* often variable: the substitution matrix, chosen at each step based on the distance between the sequences to be compared
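One simple way to picture a profile is a table of residue frequencies per column. The sketch below assumes exactly that; the function name `column_profile` is hypothetical, gaps are counted as just another symbol (an assumption), and real aligners typically add sequence weights and other refinements.

```python
from collections import Counter


def column_profile(alignment):
    """Per-column residue frequencies for equal-length aligned sequences.

    Assumption: the gap character '-' is counted like any other symbol.
    """
    n = len(alignment)
    profile = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        profile.append({res: c / n for res, c in counts.items()})
    return profile


for col in column_profile(["AC-", "ACG", "TCG"]):
    print(col)
```

A column with a single residue at frequency 1.0 is fully conserved; columns with several residues at lower frequencies are variable.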
For a progressive alignment, what is the sum-of-pairs score?
What does it use?
What assumption does it make?
What is the WSP?
sum of scores of all induced pairwise alignments
assumes statistical independence for all columns
uses a substitution matrix
weighted sum of pairs (WSP): pairwise scores are weighted to correct for a biased phylogenetic distribution of the sequences
- identity: 1
- mismatch: -1
- gap: -2
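A small sketch of the sum-of-pairs score using the scores listed above (identity 1, mismatch -1, gap -2) in place of a full substitution matrix. The function names are hypothetical, scoring a gap-gap pair as 0 is an assumption, and the optional `weights` argument only illustrates where WSP weights would enter; how such weights are derived is not covered here.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    if x == "-" and y == "-":
        return 0  # assumption: gap-gap pairs contribute nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment, weights=None):
    """Sum the scores of all induced pairwise alignments, column by column.

    weights (optional, hypothetical) maps a sequence index pair (i, j),
    i < j, to a weight; without it every pair has weight 1.
    """
    total = 0
    for column in zip(*alignment):  # columns are scored independently
        for (i, x), (j, y) in combinations(enumerate(column), 2):
            w = 1 if weights is None else weights[(i, j)]
            total += w * pair_score(x, y)
    return total


print(sum_of_pairs(["AC-", "ACG", "TCG"]))
```

Summing column by column reflects the independence assumption: the score of one column never depends on its neighbours.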
Progressive alignment
- first implementation?
- frequently used implementations?
(* basic progressive alignment )
- first implementation: 1987
- frequently used implementations
- 1994: CLUSTAL W
- 1997: CLUSTAL X
- 2011: Clustal Omega (CLUSTAL O)