9. MSAs Flashcards

1
Q

What are the two different applications of substitution matrices?

A

1. using identities or a substitution matrix to detect similarities of closely or distantly related sequences

2. using a substitution matrix to optimize an alignment

2
Q

Comparative sequence analysis:

starting with: seq A + seq B –>

A

similarity / homology? compute (optimal) alignment

3
Q

Comparative sequence analysis:

starting with: one sequence + many sequences –>

A

find database sequences that are similar (homologous) to the query sequence

4
Q

Comparative sequence analysis:

starting with: homologous sequences, not aligned –>

A

compute a multiple sequence alignment

5
Q

Comparative sequence analysis:

starting with: homologous sequences, aligned + many sequences –>

A

model the alignment, find additional family members

6
Q

Multiple sequence alignments (MSAs)

How can we collect homologous sequences for this?

A

collect (putatively) homologous sequences
- BLAST
- clustering approaches, …

7
Q

What can an MSA be used for?

A

use the MSA to do further analysis
- description of variable & conserved regions
- phylogenetic inference of sequences
- test for signatures of selection
- predict protein structure and function
- PCR-primer design
- …

8
Q

Multiple sequence alignments (MSAs)

what is this?

A

arranging sequences such that residues within a column
* result in an optimal or reasonable score for a given scoring scheme
* show maximal similarity
* are homologous (positional homology)
* play a common functional role
* are in equivalent positions in the corresponding structures

9
Q

What is positional homology and how is this relevant for MSAs?

A

positional homology
* aligned residues share a common ancestral residue in the ancestral sequences
* changes in the columns correspond to mutations

MSA, in the context of evolutionary analysis:
* a hypothesis about the positional homology of residues in homologous sequences

10
Q

Challenges for good MSAs

biological?

A
  • biological accuracy: what is the criterion for accuracy?
  • reconciling multiple pairwise (pw) alignments into an MSA: different alignments are possible
  • highly divergent sequences
11
Q

Challenges for good MSAs

non biological?

A

large datasets
- fast heuristics are needed to align thousands (millions?) of sequences
- accuracy of large-scale approaches?

computational
- mathematical accuracy: no fast exact solution exists (finding the optimal MSA under sum-of-pairs scoring is NP-hard), so practical approaches use heuristics

12
Q

MSAs - de novo alignment approaches (+ examples)?

which ones most relevant to us?

A
  • multiple “local” alignments (Dialign)
  • *progressive (iterative) alignment (Clustal, MAFFT)*
  • divide & conquer for huge data sets (PASTA)
  • *meta-alignments / consensus methods (M-Coffee)*
  • (machine learning?)
13
Q

Types of progressive alignment? + examples

A
  • consistency-based approach (T-Coffee, MAFFT)
  • phylogeny-aware alignments (Prank)
  • very fast heuristics (MAFFT)
14
Q

MSAs - reference or seed-based methods?

A
  • probabilistic approaches (HMMs: HMMer)
15
Q

Progressive alignment

steps?

A

1. compute a pairwise distance matrix

2. use alignment scores to compute a guide tree (not a phylogeny!)

3. align closely related sequences, progressively add more distantly related sequences
- sub-alignments are “frozen”
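
Steps 1 and 2 can be sketched roughly as follows. This is a toy illustration, not how any particular program does it: the k-mer (Jaccard) distance is an assumed stand-in for alignment-based distances, the tree is built by simple average linkage (UPGMA-style), and the names `kmer_distance` and `guide_tree_order` are made up for this example. The output is only the merge order, i.e. which sequences or groups would be aligned first.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """1 - Jaccard similarity of the k-mer sets (assumed stand-in distance)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0


def guide_tree_order(seqs, k=2):
    """Return the merge order of an average-linkage guide tree.

    seqs: dict mapping unique sequence names to sequences.
    Each merge is a pair of clusters (tuples of names).
    """
    names = list(seqs)
    # Step 1: pairwise distance matrix
    dist = {frozenset((x, y)): kmer_distance(seqs[x], seqs[y], k)
            for x, y in combinations(names, 2)}

    def avg(c1, c2):
        # average distance over all leaf pairs of two clusters
        return sum(dist[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    # Step 2: repeatedly join the two closest clusters
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTTTGGGG"}
merges = guide_tree_order(seqs)
print(merges)  # closest pair is merged first
```

The merge order only guides the alignment; as the card says, it should not be read as a phylogeny.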

16
Q

Progressive alignment

What do we need to do for subalignments?
What is often variable?

A

* compute profiles for subalignments: summary/statistical information about conservation/residues in each column
* often: variable substitution matrix, chosen at each step according to the distance between the sequences being compared
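
A minimal sketch of what such a profile could look like, assuming the simplest possible summary: the relative frequency of each symbol (including the gap character `-`) per column. Real aligners typically add sequence weights and pseudocounts; the function name `column_profile` is made up for this example.

```python
from collections import Counter


def column_profile(alignment):
    """Per-column relative frequencies of symbols in a sub-alignment.

    alignment: list of equal-length aligned strings ('-' marks a gap).
    Returns one dict {symbol: frequency} per column.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in zip(*alignment):  # iterate over columns
        counts = Counter(col)
        profile.append({sym: n / len(col) for sym, n in counts.items()})
    return profile


profile = column_profile(["ACG", "A-G", "TCG"])
print(profile[2])  # fully conserved column
```

A fully conserved column has a single symbol with frequency 1.0; variable columns spread the frequency over several symbols.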

17
Q

For a progressive alignment, what is the sum-of-pairs score?

What does it use?

What assumption does it make?

What is the WSP?

A

sum-of-pairs (SP) score: sum of the scores of all induced pairwise alignments

uses a substitution matrix

assumes statistical independence of all columns

weighted sum of pairs (WSP): pw scores are adjusted for biased phylogenetic distribution

example scoring scheme:
- identity: 1
- mismatch: -1
- gap: -2
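
As a worked illustration, the score can be computed with the identity/mismatch/gap values listed above. This is a sketch under assumptions: gap-gap pairs are scored 0 (a common convention, not stated on the card), gaps are scored per position rather than with affine costs, and `pair_weights` is a hypothetical way to pass WSP weights per sequence pair.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols."""
    if a == "-" and b == "-":
        return 0  # assumption: gap-gap pairs are ignored
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, pair_weights=None):
    """Sum the scores of all induced pairwise alignments.

    pair_weights: optional dict {(i, j): weight} with i < j for the
    weighted variant (WSP); missing pairs get weight 1.
    """
    pair_weights = pair_weights or {}
    total = 0
    for (i, s), (j, t) in combinations(enumerate(alignment), 2):
        w = pair_weights.get((i, j), 1)
        total += w * sum(pair_score(a, b) for a, b in zip(s, t))
    return total


aln = ["AC-G", "ACTG", "A-TG"]
print(sum_of_pairs(aln))                 # unweighted SP score
print(sum_of_pairs(aln, {(0, 2): 0.5}))  # down-weight one sequence pair
```

Each column is scored independently of its neighbours, which is exactly the independence assumption named on the card.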

18
Q

Progressive alignment
- first implementation?
- frequently used implementations?

(* basic progressive alignment )

A
  • first implementation: 1987
  • frequently used implementations:
    • 1994: CLUSTAL W
    • 1997: CLUSTAL X
    • 2011: CLUSTAL O (Clustal Omega)
19
Q

Progressive alignment - improvements on original implementation:

challenge: errors that are frozen in subalignments

solution?

A

solution: iterative refinement (most programs)

20
Q

Progressive alignment: improvements on original implementation:

  • challenge: suboptimal pw alignments in MSA

solution?

(* basic progressive alignment )

A

solution: consistency scores (T-Coffee)

21
Q

Progressive alignment: improvements on original implementation:

  • challenge: evolutionarily correct (phylogeny-aware) MSAs

(* basic progressive alignment )

A

solution: modify gap costs (Prank)

22
Q

Progressive alignment: improvements on original implementation:

  • challenge: computing the guide tree takes too long for huge data sets

A

solution: fast clustering of sequences (Clustal-O)

23
Q

Progressive alignment
major weakness?
solution?

A

major weakness:
* alignment errors cannot be corrected once they are introduced
solution: repair errors during post-processing
* iterative refinement
* implemented in most programs

24
Q

Progressive alignment

Iterative refinement?

A
  • remove and re-align a single sequence
  • partition the sequences (randomly or tree-based), re-align within groups, then align groups
  • re-align after each profile-alignment (seq-profile, profile-profile)
25
Q

T-Coffee

What type of approach / objective function? What can it also incorporate? Position-specific library?

A

consistency-based: uses consistency as an objective function

best alignment: the one that agrees most with the pw alignments (cf. the one with the highest sum-of-pairs score in progressive alignment)

consistency:
- evaluates consistency with pairs of residues found in optimal local alignments and heuristic global alignments
- does not score gaps explicitly

can also incorporate extraneous information (e.g., structural constraints)

position-specific library
* similarity of the pair of sequences (sequence fragments) the residue pair comes from
* consistency of that residue pair with all other residue pairs
* score for aligning xi and yj
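
A rough sketch of how the "consistency with all other residue pairs" part could be computed, assuming the triplet-style library extension commonly described for T-Coffee: the weight of a residue pair is increased by the smaller of the two weights along every path through a residue of a third sequence. The data layout (`frozenset` keys, residues as `(sequence, position)` tuples), the example weights, and the name `extend_library` are illustrative only, not T-Coffee's actual implementation.

```python
from collections import defaultdict


def extend_library(library):
    """Consistency-extend a primary library of residue-pair weights.

    library: {frozenset({x, y}): weight}, where x and y are residues
    written as (sequence_name, position) from two different sequences.
    """
    # for every residue, remember its aligned partners and their weights
    neighbors = defaultdict(dict)
    for pair, w in library.items():
        x, y = tuple(pair)
        neighbors[x][y] = w
        neighbors[y][x] = w

    extended = defaultdict(float)
    for pair, w in library.items():
        extended[pair] += w  # start from the primary weight

    # every residue z supports the pair (x, y) of its partners
    for z, partners in neighbors.items():
        items = list(partners.items())
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                (x, wx), (y, wy) = items[a], items[b]
                if x[0] != y[0]:  # only pairs from different sequences
                    extended[frozenset((x, y))] += min(wx, wy)
    return dict(extended)


lib = {
    frozenset({("A", 1), ("B", 1)}): 88,
    frozenset({("A", 1), ("C", 1)}): 77,
    frozenset({("B", 1), ("C", 1)}): 100,
    frozenset({("A", 2), ("C", 2)}): 50,
    frozenset({("C", 2), ("B", 3)}): 60,
}
ext = extend_library(lib)
print(ext[frozenset({("A", 1), ("B", 1)})])
```

Note that a pair never seen in a pairwise alignment (here A2 and B3) can still receive a weight if a third sequence links the two residues.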
26
Q

Gaps in MSAs

To make biological sense, how should a deletion be penalised? An insertion? What do most methods do? Why is this a problem? Solution? What new problem does this present?

A

* deletion in a MSA: should be penalized only where it occurs
* insertion in a MSA: should be penalized only once

most methods don’t distinguish between insertions and deletions in MSAs: all gaps are considered deletions

problem: high penalties for a single insertion (the gaps it creates are penalized again at later alignment steps)

solution: reduce the gap costs in regions already containing gaps, increase gap costs near existing gaps

new problem! encouraging overlap of gaps:
- collapse of independent and nearby insertions
- can lead to alignment over-compression: 2 independent insertion events are placed in the same columns ➡ violates positional homology ➡ incorrect alignment
27
Q

Gaps in MSAs: PRANK

what result? how?

A

computes ancestral sequences, marks insertions so they will not be:
- further penalized
- (mis)matched during later alignment steps

--> improved results with denser sampling, when guide tree = true tree
--> better evolutionary awareness: e.g., for evolution through short insertions and deletions
28
Q

Gaps in MSAs: ClustalW vs Prank

A

* ClustalW: shrinkage/expansion through overlapping point mutations
* Prank: evolution through short insertions and deletions --> better evolutionary awareness
29
Q

What did we learn about large alignments?

A

up to ca. 1k sequences / up to ca. 8/10k columns: accurate alignments & phylogenies can be computed
- if the best aligners are used and/or
- if the evolutionary rate of indels is low

any more seqs/cols: most aligners failed to complete
- but low-accuracy methods complete --> alignments & trees are highly inaccurate

e.g., MAFFT, Clustal-Omega, PASTA
* fast pairwise comparison using clustering, guide sub-trees
* decreased accuracy: 60% agreement between methods (M Chatzou et al., assigned reading)
30
Q

What did we learn about huge alignments? Applications? Approaches?

A

applications
* MSA, phylogenetics, evolutionary analysis
* in the context of protein structure prediction

approaches
* divide & conquer approaches (e.g., Sate, PASTA, SATCHMO-JS, PROMALS, MAPGAPS): divide the sequences into subsets of at most size X, align the sequences in each subset, merge the subsets into a final alignment
* seed-based approaches (e.g., UPP, MAFFT-Sparsecore, regressive): select a subset, align it, compute a pHMM, use the pHMM to align all remaining sequences to it
31
Q

Evaluation of alignment accuracy/usefulness

dilemma? solution?

A

dilemma!
* > 100 alignment programs are available
* heuristics! co-optimal alignments!
* errors in sequence alignments cannot be avoided

solution: tolerate but quantify errors / uncertainty
➜ carefully select the alignment approach/software
➜ evaluate the computed alignment
* how good is the entire alignment? are specific regions?
* how useful is the alignment for the intended purpose?
* does the alignment have to be reduced/masked?
32
Q

Selecting a MSA method: what do we need to ask?

A

can the method reconstruct the (near) correct alignment?
➜ true alignment generally unknown!

* method’s published strengths & weaknesses
  - faster or more accurate?
  - for few or lots of sequences?
  - designed for structural or evolutionary analysis?
  - tested against a benchmark data set?
* e.g., Balibase (structure-based alignments) or simulated alignments?
33
Q

Types of (problematic) alignments

A

* short, long
* highly divergent
* extensions
* insertions
* orphans
* subfamilies
* repeats
* motifs
* lots of sequences
* ...
34
Q

How do we evaluate alignments?

A

usually evaluate by column
- (in)consistently aligned?
- between methods: M-Coffee
- (within methods: HoT, GUIDANCE)

or evaluate by sequence or sequence region
* non-homologous sequence?
* homologous sequence, misaligned?
* non-homologous sequence stretch (e.g., assembly or annotation error)

evaluate by overall score?
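
The "by column, between methods" idea can be sketched as below. This is a simplified illustration, not the scoring used by M-Coffee, HoT or GUIDANCE: for each column of one alignment, it reports the fraction of aligned residue pairs that a second alignment of the same sequences (same row order) also aligns. Function names are made up for this example.

```python
from itertools import combinations


def residue_pairs_by_column(alignment):
    """For each column, the set of aligned residue pairs.

    A residue is (row index, position in the ungapped sequence).
    """
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        residues = []
        for row, sym in enumerate(col):
            if sym != "-":
                residues.append((row, counters[row]))
                counters[row] += 1
        columns.append({frozenset(p) for p in combinations(residues, 2)})
    return columns


def column_support(msa_a, msa_b):
    """Per column of msa_a: fraction of its residue pairs also aligned in msa_b.

    Columns without any residue pair (fewer than 2 residues) get None.
    """
    pairs_b = set().union(*residue_pairs_by_column(msa_b))
    support = []
    for pairs in residue_pairs_by_column(msa_a):
        support.append(len(pairs & pairs_b) / len(pairs) if pairs else None)
    return support


msa_a = ["AC-G", "ACTG"]
msa_b = ["A-CG", "ACTG"]
print(column_support(msa_a, msa_b))
```

Columns with low support are the ones the two methods disagree on and are candidates for closer inspection.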
35
Q

What are meta-alignments? What problem do they solve? How? Example?

A

problem: different methods lead to different alignments (and so to different conclusions)

we can compute several MSAs and select the “best”, or generate a consensus

--> meta-methods: compute a MSA that is consistent with the original alignments (M-Coffee)
36
Q

M-Coffee

What is it an example of? What does it do? Idea? Approach?

A

example of a meta-alignment method

it combines alternative MSAs into one final output

idea:
- errors produced by independent approaches should not be consistent
- agreement suggests correctness
- correlated methods violate M-Coffee’s assumption: method selection is important!

approach (T-Coffee based)
- library = multiple sequence alignments
- compile the MSAs into a single new MSA
- score (color/numeric) as described for T-Coffee
37
Q

Evaluate the alignment: what is the consequence of low alignment scores?

A

for further analysis (e.g., phylogenetic inference): mask or remove sequences or alignment regions / columns that likely violate positional homology
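
Removing unreliable columns could look like the sketch below, assuming per-column reliability scores are already available from some evaluation step. The threshold value and the name `mask_columns` are hypothetical; columns without a score (`None`) are dropped as well.

```python
def mask_columns(alignment, column_scores, min_score):
    """Keep only columns whose score is at least min_score.

    alignment: list of equal-length aligned strings.
    column_scores: one score (or None) per column.
    """
    if len(column_scores) != len(alignment[0]):
        raise ValueError("need exactly one score per alignment column")
    keep = [i for i, s in enumerate(column_scores)
            if s is not None and s >= min_score]
    return ["".join(row[i] for i in keep) for row in alignment]


masked = mask_columns(["AC-G", "ACTG"], [1.0, 0.0, None, 1.0], min_score=0.5)
print(masked)
```

What counts as "too low" depends on the downstream analysis, so the cutoff should be treated as a parameter to explore rather than a fixed rule.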
38
Q

Multiple sequence alignments: what kinds of datasets?

A

* homologous coding sequences (DNA, protein) --> linear & mostly global alignments
* entire genomes (or genomic scaffolds)
* RNA families
* alignment-free sequence comparison
39
Q

Multiple sequence alignments of whole genomes

What do we need to consider for these? What's an example of software to do this? Which other software does it incorporate and for what? What output format?

A

entire genomes (or genomic scaffolds)
- must take into consideration inversion, translocation, duplication
- identify homologous blocks, then align these

MUGSY
* compute pairwise alignments (MUMmer)
* identify & collect collinear regions (graph-based)
* combine regions into MSAs (TCoffee)
* output in MAF format
40
Q

What do we need to consider for MSAs of RNA families?

A

secondary structure conservation over sequence conservation
41
Q

Multiple sequence alignments

Name 4 types of datasets

A

* homologous coding sequences (DNA, protein)
* entire genomes (or genomic scaffolds)
* RNA families
* alignment-free sequence comparison
42
Q

MSAs: dataset: homologous coding sequences (DNA, protein)

what type of alignments do we want?

A

linear & mostly global alignments
43
Q

MSAs: dataset: entire genomes (or genomic scaffolds)

what do we need to consider? what do we identify for alignment?

A

- take into consideration inversion, translocation, duplication
- identify homologous blocks, then align these
44
Q

MSAs: dataset / aim to identify: RNA families

what do we need to prioritise?

A

secondary structure conservation over sequence conservation
45
Q

EXAM QUESTION

In the lectures, we went over the concept of guide tree on two occasions. Describe the use of guide trees in each of those contexts. (2019)

2 topics where guide tree was mentioned and how (2020)

A

Guide trees for MSAs

Many progressive (iterative) alignment methods use guide trees to generate an MSA:
- compute a pairwise distance matrix
- use alignment scores to compute a guide tree, which tells us which sequence to align next

*Guide tree is not a phylogeny! It guides the order in which sequences are aligned*
46
Q

EXAM QUESTION

List the steps for progressive alignment, and its main disadvantage. Describe an improvement for it, and the software that uses such improvement. (2019)

Main steps of basic progressive alignment + disadvantage, how to overcome (2020)

A

1. compute pairwise distance matrix
2. alignment scores --> compute guide tree
3. align closely related seqs, progressively add more distantly related seqs

major weakness: alignment errors cannot be corrected once introduced, because sub-alignments are “frozen”

solution: repair errors during post-processing = iterative refinement
- works similarly to progressive methods, but repeatedly realigns the initial sequences as well as adding new sequences to the growing MSA
- implemented in most programs (e.g., MAFFT)
47