9. MSAs Flashcards
What are the two different applications of substitution matrices?
1. using identities or a substitution matrix to detect similarities of closely or distantly related sequences
2. using a substitution matrix to optimize an alignment
Comparative sequence analysis:
starting with: seq A + seq B –>
similarity / homology? compute (optimal) alignment
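A minimal sketch of computing an optimal global alignment score for seq A + seq B with dynamic programming (Needleman-Wunsch). The function name and the linear gap cost are assumptions for illustration; the default scores reuse the identity/mismatch/gap values (1 / -1 / -2) given later in these notes, not a real substitution matrix.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap cost)."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))  # three matches and one gap
```

Only the score is returned here; recovering the alignment itself would need a traceback through the full matrix.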
Comparative sequence analysis:
starting with: one sequence + many sequences –>
find database sequences that are similar (homologous) to the query sequence
Comparative sequence analysis:
starting with: homologous sequences, not aligned –>
compute a multiple sequence alignment
Comparative sequence analysis:
starting with: homologous sequences, aligned + many sequences
model the alignment, find additional family members
Multiple sequence alignments (MSAs)
How can we collect homologous sequences for this?
collect (putatively) homologous sequences
- BLAST
- clustering approaches, …
What can an MSA be used for?
use the MSA to do further analysis
- description of variable & conserved regions
- phylogenetic inference of sequences
- test for signatures of selection
- predict protein structure and function
- PCR-primer design
- …
Multiple sequence alignments (MSAs)
what is this?
arranging sequences such that residues within a column
* result in an optimal or reasonable score for a given scoring scheme
* show maximal similarity
* are homologous (positional homology)
* play a common functional role
* are in equivalent positions in the corresponding structures
What is positional homology and how is this relevant for MSAs?
positional homology
* aligned residues share a common ancestral residue in the ancestral sequences
* changes in the columns correspond to mutations
MSA, in the context of evolutionary analysis:
* a hypothesis about the positional homology of residues in homologous sequences
Challenges for good MSAs
biological?
- biological accuracy: criterion for accuracy?
- reconcile multiple pw alignments into a MSA - different possible alignments
- highly divergent sequences
Challenges for good MSAs
non biological?
large datasets
- fast heuristics are needed to align thousands (millions?) of sequences
- accuracy of large-scale approaches?
computational
- mathematical accuracy: no fast solution exists, all approaches use heuristics
MSAs - de novo alignment approaches (+ examples)?
which ones most relevant to us?
- multiple “local” alignments (Dialign)
- *progressive (iterative) alignment (Clustal, MAFFT) *
- divide & conquer for huge data sets (PASTA)
- *meta-alignments / consensus methods (M-Coffee) *
- (machine learning?)
Types of progressive alignment? + examples
- consistency-based approach (T-Coffee, MAFFT)
- phylogeny-aware alignments (Prank)
- very fast heuristics (MAFFT)
MSAs - reference or seed-based methods?
- probabilistic approaches (HMMs: HMMer)
Progressive alignment
steps?
1. compute a pairwise distance matrix
2. use alignment scores to compute a guide tree (not a phylogeny!)
3. align closely related sequences, progressively add more distantly related sequences
- sub-alignments are “frozen”
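Steps 1 and 2 can be sketched as follows. This is an illustration only: the distance is a simple k-mer (Jaccard) distance used as a stand-in for distances derived from pairwise alignment scores, and the clustering is plain average linkage (UPGMA-style). The function names are hypothetical and not taken from any aligner.

```python
from itertools import combinations


def kmer_distance(a, b, k=2):
    """1 - Jaccard similarity of the k-mer sets (assumed stand-in distance)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0


def guide_tree_order(seqs, k=2):
    """Return the order in which clusters are merged (closest first)."""
    names = list(seqs)
    # Step 1: pairwise distance matrix, stored per unordered name pair.
    dist = {frozenset(p): kmer_distance(seqs[p[0]], seqs[p[1]], k)
            for p in combinations(names, 2)}

    def average_distance(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    # Step 2: repeatedly join the two closest clusters.
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
for left, right in guide_tree_order(seqs):
    print(left, "+", right)
```

The merge order is what step 3 would follow: the closest pair is aligned first, and more distant sequences or sub-alignments are added later.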
Progressive alignment
What do we need to do for subalignments?
What is often variable?
★compute profiles for subalignments - summary/statistical information about conservation/residues in each column
* often: variable substitution matrix - at each step: based on distance between sequences to be compared
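A minimal sketch of what a profile for a sub-alignment could look like: per-column relative residue frequencies, with gaps counted as their own symbol. Real aligners add sequence weighting and pseudocounts; the function name and the plain-frequency representation are assumptions for illustration.

```python
from collections import Counter


def column_profile(alignment):
    """Per-column relative frequencies of residues (gap '-' included)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


sub_alignment = ["ACG-", "ACGT", "AGGT", "A-GT"]
for i, freqs in enumerate(column_profile(sub_alignment)):
    print(i, freqs)
```

A fully conserved column shows a single residue with frequency 1.0, while variable columns spread the frequency over several residues or gaps.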
For a progressive alignment, what is the sum-of-pairs score?
What does it use?
What assumption does it make?
What is the WSP?
sum of scores of all induced pairwise alignments
assumes statistical independence for all columns
uses a substitution matrix
weighted sum of pairs (WSP): pw scores are adjusted for biased phylogenetic distribution
- identity: 1
- mismatch: -1
- gap: -2
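The sum-of-pairs score with the simple scheme above (identity 1, mismatch -1, gap -2) can be sketched like this. Two points are assumptions rather than something stated in these notes: a gap aligned with a gap scores 0, and the optional weighting multiplies each pair score by the product of the two sequence weights.

```python
from itertools import combinations


def sum_of_pairs(alignment, weights=None, match=1, mismatch=-1, gap=-2):
    """Sum the scores of all induced pairwise alignments, column by column."""
    n = len(alignment)
    weights = weights or [1.0] * n  # unweighted SP if no weights are given
    total = 0.0
    for column in zip(*alignment):
        for i, j in combinations(range(n), 2):
            x, y = column[i], column[j]
            if x == "-" and y == "-":
                score = 0          # assumption: gap-gap pairs are neutral
            elif x == "-" or y == "-":
                score = gap
            elif x == y:
                score = match
            else:
                score = mismatch
            total += weights[i] * weights[j] * score
    return total


msa = ["ACG", "ACG", "A-G"]
print(sum_of_pairs(msa))                     # plain sum of pairs
print(sum_of_pairs(msa, weights=[0.5, 0.5, 1.0]))  # weighted variant (WSP)
```

Down-weighting the two identical sequences reduces how much their agreement dominates the score, which is the idea behind adjusting for a biased phylogenetic distribution.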
Progressive alignment
- first implementation?
- frequently used implementations?
(* basic progressive alignment )
- first implementation: 1987
- frequently used implementations
- 1994: CLUSTAL W
- 1997: CLUSTAL X
- 2011: CLUSTAL O
Progressive alignment - improvements on original implementation:
challenge: errors that are frozen in subalignments
solution?
solution: iterative refinement (most programs)
Progressive alignment: improvements on original implementation:
- challenge: suboptimal pw alignments in MSA
solution?
(* basic progressive alignment )
solution: consistency scores (T-Coffee)
Progressive alignment: improvements on original implementation:
- challenge: evolutionarily correct (aware) MSAs
(* basic progressive alignment )
solution: modify gap costs (Prank)
Progressive alignment: improvements on original implementation:
- challenge: guide tree takes long for huge data sets
solution: fast clustering of sequences (Clustal-O)
Progressive alignment
major weakness?
solution?
major weakness:
* alignment errors cannot be corrected once they are introduced
solution: repair errors during post-processing
* iterative refinement
* implemented in most programs
Progressive alignment
Iterative refinement?
- remove and re-align a single sequence
- partition the sequences (randomly or tree-based), re-align within groups, then align groups
- re-align after each profile-alignment (seq-profile, profile-profile)
T-Coffee
What type of approach / objective function?
What can it also incorporate?
Position-specific library?
consistency-based: uses consistency as an objective function
best alignment: the one that agrees most with pw alignments
(cf the one with the highest sum of pairs score in progressive alignment)
Consistency:
- evaluates consistency with pairs of residues found in optimal local alignments and heuristic global alignments
- does not score gaps explicitly
can also incorporate extraneous information (e.g., structural constraints)
position-specific library
* similarity of the pair of sequences (sequence fragments) the residue pair comes from
* consistency of that residue pair with all other residue pairs
* score for aligning xi and yj
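A simplified sketch of how a consistency score for a residue pair xi, yj could be built from a library of pairwise weights: the direct weight is increased whenever both residues are also aligned to the same residue of a third sequence. The data layout and function name are hypothetical, and this sketch only re-scores pairs already present in the library, so it should be read as an illustration of the idea, not as T-Coffee's implementation.

```python
from collections import defaultdict


def extend_library(primary):
    """primary: {((seq_a, i), (seq_b, j)): weight} from pairwise alignments."""
    # Store every pair in both directions so partners can be looked up.
    partners = defaultdict(dict)
    for (r1, r2), weight in primary.items():
        partners[r1][r2] = weight
        partners[r2][r1] = weight

    extended = {}
    for (r1, r2), weight in primary.items():
        score = weight
        for middle, w1 in partners[r1].items():
            if middle[0] in (r1[0], r2[0]):
                continue  # the intermediate residue must be in a third sequence
            w2 = partners[middle].get(r2)
            if w2 is not None:
                score += min(w1, w2)  # support is limited by the weaker link
        extended[(r1, r2)] = score
    return extended


library = {
    (("A", 0), ("B", 0)): 88,
    (("A", 0), ("C", 0)): 77,
    (("C", 0), ("B", 0)): 100,
    (("A", 1), ("B", 1)): 88,
}
for pair, score in extend_library(library).items():
    print(pair, score)
```

The pair A0/B0 gains support because both residues align to C0, whereas A1/B1 keeps its original weight since no third sequence confirms it.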
Gaps in MSAs:
To make biological sense, how should a deletion be penalised? insertion?
What do most methods do?
Why is this a problem? solution?
What new problem does this present?
- deletion in a MSA: should be penalized only where it occurs
- insertion in a MSA: should be penalized only once
most methods don’t distinguish between insertions and deletions in MSAs: all gaps are considered deletions
problem: high penalties for a single insertion
solution: reduce the gap costs in regions already containing gaps, increase gap costs near existing gaps
new problem! encouraging overlap of gaps:
- collapse of independent and nearby insertions
- can lead to alignment over-compression: 2 independent insertion events are stacked into the same columns
➡ violates positional homology
➡ incorrect alignment
Gaps in MSAs: PRANK
what result? how?
computes ancestral sequences, marks insertions so that they will not be:
- further penalized
- (mis)matched
during later alignment steps
–> improved results with denser sampling, when guide tree = true tree
–> better evolutionary awareness: eg for evolution through short insertions and deletions
gaps in MSAs
ClustalW vs Prank
ClustalW
shrinkage/expansion through overlapping point mutations
Prank
evolution through short insertions and deletions –> better evolutionary awareness
What did we learn about large alignments?
Alignments of up to 1k sequences / up to ca. 8/10k columns: accurate alignments & phylogenies can be computed
- if the best aligners are used and/or
- evolutionary rate of indels is low
Any more seqs/cols: most aligners failed to complete
But low-accuracy methods complete –> alignments & trees are highly inaccurate
eg:
MAFFT, Clustal-Omega, PASTA
* fast pairwise comparison using clustering, guide sub-trees
* decreased accuracy: 60% agreement between methods (M Chatzou et al., assigned reading)
What did we learn about huge alignments?
Applications?
Approaches?
applications
* MSA, phylogenetics, evolutionary analysis
* in the context of protein structure prediction
approaches
* divide & conquer approaches (e.g., Sate, PASTA, SATCHMO-JS, PROMALS, MAPGAPS): divide sequences into subsets of at most size X, align sequences in each subset, merge subsets into a final alignment
* seed-based approach (e.g., UPP, MAFFT-Sparsecore, regressive): select subset, align, compute pHMM, use pHMM to align all remaining sequences to it
Evaluation of alignment accuracy/usefulness
dilemma? solution?
dilemma!
* > 100 alignment programs are available
* heuristics! co-optimal alignments!
* errors in sequence alignments cannot be avoided
solution:
tolerate but quantify errors / uncertainty
➜ carefully select the alignment approach/software
➜ evaluate the computed alignment
* how good is the entire alignment? how good are specific regions?
* how useful is the alignment for the intended purpose?
* does the alignment have to be reduced/masked?
Selecting a MSA method:
what do we need to ask?
can the method reconstruct the (near) correct alignment?
➜ true alignment generally unknown!
- method’s published strengths & weaknesses
- faster or more accurate?
- for few or lots of sequences?
- designed for structural or evolutionary analysis?
- tested against a benchmark data set?
- e.g., BAliBASE (structure-based alignments) or simulated alignments?
Types of (problematic) alignments
- short, long
- highly divergent
- extensions
- insertions
- orphans
- subfamilies
- repeats
- motifs
- lots of sequences
- …
How do we evaluate alignments?
usually evaluate by column
- (in)consistently aligned?
- between methods: M-Coffee
- (within methods: HoT, GUIDANCE)
or evaluate by sequence or sequence region
* non-homologous sequence?
* homologous sequence, misaligned?
* non-homologous sequence stretch
(e.g., assembly or annotation error)
evaluate by overall score?
What are meta-alignments?
what problem do they solve?
how?
example?
different methods lead to different alignments (lead to different conclusions)
we can compute several MSAs and select the “best”, or generate a consensus
–> meta-methods: compute a MSA that is consistent with the original alignments (M-Coffee)
M-Coffee
What is it an example of?
What does it do?
Idea?
Approach?
Example of meta-alignment
It combines alternative MSAs into one final output
Idea:
- errors produced by independent approaches should not be consistent
- agreement suggests correctness
- correlated methods violate M-Coffee’s assumption: method selection is important!
Approach (T-Coffee based)
- library = multiple sequence alignments
- compile MSAs into a single new MSA
- score (color/numeric) as described for T-Coffee
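The "agreement suggests correctness" idea can be sketched by counting how many of the alternative MSAs place the same two residues in one column. This is an illustration under assumptions: each MSA is a dict of name to aligned string, residues are identified by their ungapped index, and the function names are hypothetical. It is not M-Coffee's actual scoring.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(msa):
    """Residue pairs sharing a column; a residue is (name, ungapped index)."""
    names = sorted(msa)
    position = {name: -1 for name in names}
    pairs = set()
    for column in zip(*(msa[name] for name in names)):
        residues = []
        for name, char in zip(names, column):
            if char != "-":
                position[name] += 1
                residues.append((name, position[name]))
        pairs.update(combinations(residues, 2))
    return pairs


def pair_support(msas):
    """Fraction of the input MSAs that contain each aligned residue pair."""
    counts = Counter()
    for msa in msas:
        counts.update(aligned_pairs(msa))
    return {pair: n / len(msas) for pair, n in counts.items()}


msa_method_1 = {"a": "AC-G", "b": "ACTG"}
msa_method_2 = {"a": "ACG-", "b": "ACTG"}
for pair, support in sorted(pair_support([msa_method_1, msa_method_2]).items()):
    print(pair, support)
```

Pairs supported by every input alignment get 1.0, while pairs on which the methods disagree get lower values and mark uncertain regions.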
evaluate the alignment:
what is the consequence of low alignment scores?
for further analysis (e.g., phylogenetic inference):
mask or remove sequences or alignment regions / columns that likely violate positional homology
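Removing low-scoring columns before further analysis can be sketched as below. The per-column scores and the cutoff are hypothetical inputs (for example, consistency or confidence values exported by an evaluation tool); the function name is an assumption for illustration.

```python
def mask_columns(alignment, column_scores, min_score):
    """Keep only the columns whose score reaches min_score."""
    length = len(column_scores)
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("need exactly one score per alignment column")
    keep = [i for i, score in enumerate(column_scores) if score >= min_score]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


alignment = {"a": "ACGT", "b": "A-GT"}
scores = [0.9, 0.2, 0.8, 1.0]  # hypothetical per-column reliability values
print(mask_columns(alignment, scores, min_score=0.5))
```

The choice of cutoff is a trade-off: a strict cutoff removes more unreliable columns but also discards signal that downstream analyses might need.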
Multiple sequence alignments:
What kind of datasets?
homologous coding sequences (DNA, protein) –> linear & mostly global alignments
entire genomes (or genomic scaffolds)
RNA families
alignment-free sequence comparison
Multiple sequence alignments of whole genomes ?
What do we need to consider for these?
What’s an example of a software to do this?
Which other software does it incorporate and for what?
What output format?
entire genomes (or genomic scaffolds)
- must take into consideration inversion, translocation, duplication
- identify homologous blocks, then align these
MUGSY
* compute pairwise alignments (MUMmer)
* identify & collect collinear regions (graph-based)
* combine regions into MSAs (T-Coffee)
* output in MAF format
What do we need to consider for MSAs of RNA families?
secondary structure conservation over sequence conservation
Multiple sequence alignments
Name 4 types of datasets
- homologous coding sequences (DNA, protein)
- entire genomes (or genomic scaffolds)
- RNA families
- alignment-free sequence comparison
MSAs:
dataset: homologous coding sequences (DNA, protein)
what type of alignments do we want?
linear & mostly global alignments
MSAs:
dataset: entire genomes (or genomic scaffolds)
what do we need to consider?
what do we identify for alignment?
- take into consideration inversion, translocation, duplication
- identify homologous blocks, then align these
MSAs:
dataset/ aim to identify: RNA families
what do we need to prioritise?
secondary structure conservation over sequence conservation
EXAM QUESTION
In the lectures, we went over the concept of a guide tree on two occasions. Describe the use of guide trees in each of those contexts. (2019)
2 topics where guide tree was mentioned and how (2020)
Guide trees for MSAs
Many progressive (iterative) alignment methods use guide trees to generate an MSA:
- compute a pairwise distance matrix
- use alignment scores to compute a guide tree which tells us which sequence to align next
The guide tree is not a phylogeny! It only guides the order in which sequences are aligned
EXAM QUESTION
List the steps for progressive alignment, and its main disadvantage. Describe an improvement for it, and the software that uses such an improvement. (2019)
Main steps basic progressive alignment + disadvantage, how to overcome (2020)
- compute pairwise distance matrix
- alignment scores –> compute guide tree
- align closely related seqs, progressively add more distantly related seqs
major weakness:
alignment errors cannot be corrected once introduced, because sub-alignments are “frozen”.
solution: repair errors during post-processing = iterative refinement
iterative methods work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA.
implemented in most programs (eg MAFFT)