9. MSAs Flashcards

Question

T-Coffee: What type of approach / objective function does it use? What can it also incorporate? What goes into the position-specific library?

Answer 1

consistency-based: uses consistency as an objective function
best alignment: the one that agrees most with the pairwise alignments (cf. the one with the highest sum-of-pairs score in progressive alignment)
Consistency:
- evaluates consistency with pairs of residues found in optimal local alignments and heuristic global alignments
- does not score gaps explicitly
can also incorporate extraneous information (e.g., structural constraints)
position-specific library:
* similarity of the pair of sequences (sequence fragments) the residue pair comes from
* consistency of that residue pair with all other residue pairs
* score for aligning xi and yj
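The consistency idea can be illustrated with a small sketch of a triplet-style library extension. This is a simplified illustration, not T-Coffee's actual code: the function name `extend_library`, the `(sequence, position)` residue encoding, and the toy weights are all assumptions made for the example. Each residue pair keeps its direct weight and gains, for every residue in a third sequence that is aligned to both, the smaller of the two connecting weights.

```python
from collections import defaultdict
from itertools import combinations

def extend_library(primary):
    """Return consistency-extended weights for residue pairs.

    primary maps ((seq, pos), (seq, pos)) -> weight, e.g. the percent
    identity of the pairwise alignment the pair was taken from.
    """
    # Index the library as an undirected weighted graph of residues.
    nbrs = defaultdict(dict)
    for (x, y), w in primary.items():
        nbrs[x][y] = w
        nbrs[y][x] = w

    extended = {}
    for x, y in combinations(sorted(nbrs), 2):
        if x[0] == y[0]:
            continue  # residues of the same sequence are never aligned
        score = nbrs[x].get(y, 0)  # direct support, if any
        # Support through a residue z of a third sequence aligned to both.
        for z in nbrs[x].keys() & nbrs[y].keys():
            if z[0] not in (x[0], y[0]):
                score += min(nbrs[x][z], nbrs[y][z])
        if score > 0:
            extended[(x, y)] = score
    return extended

# Toy primary library for three sequences A, B, C (weights are made up).
primary = {
    (("A", 0), ("B", 0)): 88,
    (("A", 0), ("C", 0)): 77,
    (("C", 0), ("B", 0)): 100,
    (("A", 1), ("C", 1)): 77,
    (("C", 1), ("B", 2)): 100,
}
extended = extend_library(primary)
```

In this toy library the pair A0/B0 ends up with 88 + min(77, 100) = 165, and A1/B2, which has no direct entry, still receives 77 through C1. Note that gaps never appear in the scores, matching the point above that gaps are not scored explicitly.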

Answer 2

* deletion in an MSA: should be penalized only where it occurs
* insertion in an MSA: should be penalized only once
most methods don’t distinguish between insertions and deletions in MSAs: all gaps are considered deletions
problem: high penalties for a single insertion (it is penalized again at every later alignment step)
solution: reduce the gap costs in regions already containing gaps, increase gap costs near existing gaps
new problem! encouraging overlap of gaps:
- collapse of independent and nearby insertions
- can lead to alignment over-compression: 2 independent insertion events, alignment over-compressed ➡ violates positional homology ➡ incorrect alignment

Answer 3

computes ancestral sequences and marks insertions so they will not be:
- further penalized
- (mis)matched during later alignment steps
--> improved results with denser sampling, when guide tree = true tree
--> better evolutionary awareness: e.g., for evolution through short insertions and deletions

Answer 4

ClustalW: shrinkage/expansion through overlapping point mutations
PRANK: evolution through short insertions and deletions --> better evolutionary awareness

Answer 5

Alignment of up to 1k sequences / ca. up to 8/10k columns: accurate alignments & phylogenies can be computed
- if the best aligners are used and/or
- the evolutionary rate of indels is low
Any more seqs/cols: most aligners fail to complete
But low-accuracy methods complete --> alignments & trees are highly inaccurate
e.g., MAFFT, Clustal-Omega, PASTA
* fast pairwise comparison using clustering, guide sub-trees
* decreased accuracy: 60% agreement between methods (M. Chatzou et al., assigned reading)

Answer 6

applications
* MSA, phylogenetics, evolutionary analysis
* in the context of protein structure prediction
approaches
* divide & conquer approaches (e.g., SATé, PASTA, SATCHMO-JS, PROMALS, MAPGAPS): divide the sequences into subsets of at most size X, align the sequences in each subset, merge the subsets into a final alignment
* seed-based approaches (e.g., UPP, MAFFT-Sparsecore, regressive): select a subset, align it, compute a pHMM, use the pHMM to align all remaining sequences to it

Answer 7

dilemma!
* > 100 alignment programs are available
* heuristics! co-optimal alignments!
* errors in sequence alignments cannot be avoided
solution: tolerate but quantify errors / uncertainty
➜ carefully select the alignment approach/software
➜ evaluate the computed alignment
* how good is the entire alignment? how good are specific regions?
* how useful is the alignment for the intended purpose?
* does the alignment have to be reduced/masked?

Answer 8

can the method reconstruct the (near) correct alignment? ➜ true alignment generally unknown!
* method’s published strengths & weaknesses
  - faster or more accurate?
  - for few or lots of sequences?
  - designed for structural or evolutionary analysis?
  - tested against a benchmark data set?
* e.g., BAliBASE (structure-based alignments) or simulated alignments?

Answer 9

* short, long
* highly divergent
* extensions
* insertions
* orphans
* subfamilies
* repeats
* motifs
* lots of sequences
* ...

Answer 10

usually evaluate by column
- (in)consistently aligned?
- between methods: M-Coffee
- (within methods: HoT, GUIDANCE)
or evaluate by sequence or sequence region
* non-homologous sequence?
* homologous sequence, misaligned?
* non-homologous sequence stretch (e.g., assembly or annotation error)?
evaluate by overall score?
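Column-wise evaluation between methods can be sketched as follows. This is a minimal illustration and not the scoring used by M-Coffee, HoT or GUIDANCE: the function names and the two toy alignments are assumptions. For each column of one alignment it reports the fraction of aligned residue pairs that are also aligned in a second alignment of the same sequences.

```python
from itertools import combinations

def residue_columns(msa):
    """Convert gapped rows into columns of (row, residue_index), skipping gaps."""
    counters = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        members = []
        for row, seq in enumerate(msa):
            if seq[col] != "-":
                members.append((row, counters[row]))
                counters[row] += 1
        columns.append(members)
    return columns

def aligned_pairs(msa):
    """Set of all residue pairs that share a column in the alignment."""
    return {pair
            for members in residue_columns(msa)
            for pair in combinations(members, 2)}

def column_support(msa, other):
    """Per column of msa: fraction of its residue pairs also aligned in other.

    Columns with fewer than two residues get None (nothing to compare).
    """
    supported = aligned_pairs(other)
    scores = []
    for members in residue_columns(msa):
        pairs = list(combinations(members, 2))
        if pairs:
            scores.append(sum(p in supported for p in pairs) / len(pairs))
        else:
            scores.append(None)
    return scores

# Two hypothetical alignments of the same three sequences (ACGT, AGT, ACGT).
msa_a = ["ACGT", "A-GT", "ACGT"]
msa_b = ["ACGT", "AG-T", "ACGT"]
scores = column_support(msa_a, msa_b)
```

Here the third column of `msa_a` scores 1/3 because the two alignments place the G of the middle sequence differently, while the other columns score 1.0. Low-scoring columns are the inconsistently aligned ones.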

Answer 11

different methods lead to different alignments (which can lead to different conclusions)
we can compute several MSAs and select the “best”, or generate a consensus
--> meta-methods: compute an MSA that is consistent with the original alignments (M-Coffee)

Answer 12

Example of meta-alignment: it combines alternative MSAs into one final output
Idea:
- errors produced by independent approaches should not be consistent
- agreement suggests correctness
- correlated methods violate M-Coffee’s assumption: method selection is important!
Approach (T-Coffee based)
- library = multiple sequence alignments
- compile MSAs into a single new MSA
- score (color/numeric) as described for T-Coffee
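The "library = multiple sequence alignments" step might look roughly like the sketch below. It is an assumption-laden simplification: the names `pairs_in` and `build_library` are hypothetical, and weighting a residue pair by the plain number of input MSAs that align it is a stand-in for whatever weighting M-Coffee really applies.

```python
from collections import Counter
from itertools import combinations

def pairs_in(msa):
    """Yield ((row, residue_idx), (row, residue_idx)) for residues sharing a column."""
    seen = [0] * len(msa)  # residues consumed so far in each row
    for column in zip(*msa):
        members = []
        for row, char in enumerate(column):
            if char != "-":
                members.append((row, seen[row]))
                seen[row] += 1
        yield from combinations(members, 2)

def build_library(msas):
    """Count, for every residue pair, how many input MSAs align it."""
    library = Counter()
    for msa in msas:
        library.update(pairs_in(msa))
    return library

# Three hypothetical aligner outputs for the same two sequences (ACGT, AGT).
msas = [
    ["ACGT", "A-GT"],
    ["ACGT", "AG-T"],
    ["ACGT", "A-GT"],
]
library = build_library(msas)
```

Pairs supported by all three inputs get weight 3, the disputed G placement gets 2 versus 1. If two of the three aligners were near-identical methods, that 2:1 majority would reflect correlation rather than independent evidence, which is why method selection matters.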

Answer 13

for further analysis (e.g., phylogenetic inference): mask or remove sequences or alignment regions / columns that likely violate positional homology
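A very simple form of masking can be sketched like this. The gap-fraction criterion and the 0.5 threshold are assumptions chosen for illustration; a high gap fraction is only a crude proxy for unreliable columns, and dedicated masking tools use more elaborate criteria. The function name `mask_gappy_columns` is hypothetical.

```python
def mask_gappy_columns(msa, max_gap_fraction=0.5):
    """Drop columns whose gap fraction exceeds max_gap_fraction.

    Returns the reduced alignment and the indices of the columns kept,
    so that positions can be mapped back to the original alignment.
    """
    keep = [
        i for i, column in enumerate(zip(*msa))
        if column.count("-") / len(column) <= max_gap_fraction
    ]
    masked = ["".join(row[i] for i in keep) for row in msa]
    return masked, keep

# Toy alignment: columns 2 and 5 are gaps in two of three sequences.
msa = ["AC-GT-", "ACTGT-", "A--GTA"]
masked, kept = mask_gappy_columns(msa)
```

With the toy input, columns 2 and 5 are removed and the remaining alignment is `ACGT`, `ACGT`, `A-GT`. Returning the kept indices is a design choice that makes the masking traceable in downstream analysis.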

Answer 14

* homologous coding sequences (DNA, protein) --> linear & mostly global alignments
* entire genomes (or genomic scaffolds)
* RNA families
* alignment-free sequence comparison

Answer 15

entire genomes (or genomic scaffolds)
- must take into consideration inversion, translocation, duplication
- identify homologous blocks, then align these
MUGSY
* compute pairwise alignments (MUMmer)
* identify & collect collinear regions (graph-based)
* combine regions into MSAs (T-Coffee)
* output in MAF format

Answer 16

secondary structure conservation over sequence conservation

Answer 17

* homologous coding sequences (DNA, protein)
* entire genomes (or genomic scaffolds)
* RNA families
* alignment-free sequence comparison

Answer 18

linear & mostly global alignments

Answer 19

- take into consideration inversion, translocation, duplication
- identify homologous blocks, then align these

Answer 20

secondary structure conservation over sequence conservation

Answer 21

Guide trees for MSAs
Many progressive (iterative) alignment methods use guide trees to generate an MSA:
- compute a pairwise distance matrix
- use alignment scores to compute a guide tree, which tells us which sequence to align next
*Guide tree is not a phylogeny! It guides the order in which sequences are aligned*
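The first step, the pairwise distance matrix, can be sketched without running any alignment. Note the swap: the card speaks of alignment scores, whereas this example uses a k-mer based distance as a cheap stand-in. The function names, `k=3`, and the toy sequences are assumptions for illustration.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count all overlapping k-mers of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a, b, k=3):
    """1 - (shared k-mers / k-mers in the shorter sequence)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # multiset intersection
    smaller = min(sum(ca.values()), sum(cb.values()))
    return 1.0 - shared / smaller if smaller else 1.0

def distance_matrix(seqs, k=3):
    """Symmetric matrix of pairwise k-mer distances."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = kmer_distance(seqs[i], seqs[j], k)
    return dist

seqs = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]
dist = distance_matrix(seqs)
```

The first two sequences share five of six 3-mers and are therefore close, while the third shares none with either. A matrix like this only has to rank sequences by similarity well enough to decide the order of alignment, which is one reason the resulting tree should not be read as a phylogeny.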

Answer 22

1. compute pairwise distance matrix
2. alignment scores --> compute guide tree
3. align closely related seqs, progressively add more distantly related seqs
major weakness: alignment errors cannot be corrected once introduced, because sub-alignments are “frozen”.
solution: repair errors during post-processing = iterative refinement
iterative methods work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA.
implemented in most programs (e.g., MAFFT)
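Steps 2 and 3 can be illustrated by deriving a join order from a distance matrix. The sketch below uses UPGMA purely as a simple stand-in for guide-tree construction; individual programs may use other clustering methods. The function name `upgma_order`, the sequence names, and the distances are made up for the example.

```python
def upgma_order(dist, names):
    """Return the list of merges (as nested tuples) in the order UPGMA makes them."""
    # Each cluster id maps to (subtree, number of sequences in it).
    clusters = {i: (names[i], 1) for i in range(len(names))}
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merges.append((tree_a, tree_b))
        # Size-weighted average distance from the new cluster to the rest.
        new = {}
        for c in clusters:
            d_ac = d[tuple(sorted((a, c)))]
            d_bc = d[tuple(sorted((b, c)))]
            new[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        d = {key: v for key, v in d.items() if a not in key and b not in key}
        d.update(new)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    return merges

names = ["human", "chimp", "mouse", "fly"]
dist = [
    [0.0, 0.1, 0.4, 0.9],
    [0.1, 0.0, 0.5, 0.9],
    [0.4, 0.5, 0.0, 0.8],
    [0.9, 0.9, 0.8, 0.0],
]
order = upgma_order(dist, names)
```

Each entry of `order` corresponds to one progressive alignment step: first human with chimp, then mouse against that sub-alignment, then fly against the rest. Once a merge is done its sub-alignment is not revisited, which is exactly the "frozen" weakness that iterative refinement addresses.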