9. MSAs Flashcards

1
Q

What are the two different applications of substitution matrices?

A

1. using identities or a substitution matrix to detect similarities of closely or distantly related sequences

2. using a substitution matrix to optimize an alignment

2
Q

Comparative sequence analysis:

starting with: seq A + seq B –>

A

similarity / homology? compute (optimal) alignment

3
Q

Comparative sequence analysis:

starting with: one sequence + many sequences –>

A

find database sequences that are similar (homologous) to the query sequence

4
Q

Comparative sequence analysis:

starting with: homologous sequences, not aligned –>

A

compute a multiple sequence alignment

5
Q

Comparative sequence analysis:

starting with: homologous sequences, aligned + many sequences –>

A

model the alignment, find additional family members

6
Q

Multiple sequence alignments (MSAs)

How can we collect homologous sequences for this?

A

collect (putatively) homologous sequences
- BLAST
- clustering approaches, …

7
Q

What can an MSA be used for?

A

use the MSA to do further analysis
- description of variable & conserved regions
- phylogenetic inference of sequences
- test for signatures of selection
- predict protein structure and function
- PCR-primer design
- …

8
Q

Multiple sequence alignments (MSAs)

what is this?

A

arranging sequences such that residues within a column
* result in an optimal or reasonable score for a given scoring scheme
* show maximal similarity
* are homologous (positional homology)
* play a common functional role
* are in equivalent positions in the corresponding structures

9
Q

What is positional homology and how is this relevant for MSAs?

A

positional homology
* aligned residues share a common ancestral residue in the ancestral sequences
* changes in the columns correspond to mutations

MSA, in the context of evolutionary analysis:
* a hypothesis about the positional homology of residues in homologous sequences

10
Q

Challenges for good MSAs

biological?

A
  • biological accuracy: criterion for accuracy?
  • reconcile multiple pw alignments into a MSA - different possible alignments
  • highly divergent sequences
11
Q

Challenges for good MSAs

non biological?

A

large datasets
- fast heuristics are needed to align thousands (millions?) of sequences
- accuracy of large-scale approaches?

computational
- mathematical accuracy: no fast solution exists, all approaches use heuristics

12
Q

MSAs - de novo alignment approaches (+ examples)?

which ones most relevant to us?

A
  • multiple “local” alignments (Dialign)
  • *progressive (iterative) alignment (Clustal, MAFFT) *
  • divide & conquer for huge data sets (PASTA)
  • *meta-alignments / consensus methods (M-Coffee) *
  • (machine learning?)
13
Q

Types of progressive alignment? + examples

A
  • consistency-based approach (T-Coffee, MAFFT)
  • phylogeny-aware alignments (Prank)
  • very fast heuristics (MAFFT)
14
Q

MSAs - reference or seed-based methods?

A
  • probabilistic approaches (HMMs: HMMer)
15
Q

Progressive alignment

steps?

A

1. compute a pairwise distance matrix

2. use alignment scores to compute a guide tree (not a phylogeny!)

3. align closely related sequences, progressively add more distantly related sequences
- sub-alignments are “frozen”
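
Steps 1 and 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the distance is a crude k-mer (alignment-free) measure rather than a score from real pairwise alignments, the clustering is plain average linkage, and the names `kmer_distance` and `guide_tree` are hypothetical helpers, not part of any aligner. Step 3 (the actual profile alignment) is not shown.

```python
from itertools import combinations

def kmer_distance(a, b, k=2):
    """Crude distance: 1 - Jaccard similarity of the two k-mer sets (assumption)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return 1.0 - len(ka & kb) / len(ka | kb)

def guide_tree(seqs):
    """Average-linkage clustering; returns the merge order as nested tuples of names."""
    # Step 1: pairwise distance matrix, keyed by unordered name pairs.
    dist = {frozenset(p): kmer_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def avg(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1[1] for y in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    # Step 2: repeatedly join the two closest clusters.
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]

seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGGCCAA"}
tree = guide_tree(seqs)
print(tree)  # A and B are joined first, C is added last
```

The nested tuples only encode the order in which sequences or sub-alignments would be aligned; they are not meant as a phylogeny.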

16
Q

Progressive alignment

What do we need to do for subalignments?
What is often variable?

A

* compute profiles for subalignments - summary/statistical information about conservation/residues in each column
* often: variable substitution matrix - at each step: based on distance between sequences to be compared
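
A minimal sketch of what a profile can look like, assuming the simplest possible definition: per-column relative frequencies of each symbol (gaps included). Real aligners typically add sequence weights and pseudocounts; `column_profile` is a hypothetical helper name.

```python
from collections import Counter

def column_profile(alignment):
    """Return one dict per column: symbol (including '-') -> relative frequency."""
    n = len(alignment)
    return [{sym: count / n for sym, count in Counter(col).items()}
            for col in zip(*alignment)]

sub_alignment = ["ACG-T", "ACGAT", "TCG-T"]
profile = column_profile(sub_alignment)
for i, col in enumerate(profile):
    print(i, col)
```

A fully conserved column has a single entry with frequency 1.0; variable columns spread the frequency over several symbols.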

17
Q

For a progressive alignment, what is the sum-of-pairs score?

What does it use?

What assumption does it make?

What is the WSP?

A

sum of scores of all induced pairwise alignments

assumes statistical independence for all columns

uses a substitution matrix

weighted sum of pairs (WSP): pw scores are adjusted for biased phylogenetic distribution

example scoring scheme:
- identity: 1
- mismatch: -1
- gap: -2
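
The sum-of-pairs idea can be made concrete with the identity/mismatch/gap values listed on this card. The sketch below makes several assumptions: a linear gap cost, gap-gap columns scoring 0, and, for the weighted variant, a pair weight equal to the product of two per-sequence weights. The function names are hypothetical.

```python
from itertools import combinations

def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one column of an induced pairwise alignment."""
    if x == "-" and y == "-":
        return 0  # assumption: gap-gap positions are ignored
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def sum_of_pairs(alignment, weights=None):
    """Sum the scores of all induced pairwise alignments, column by column."""
    total = 0
    for i, j in combinations(range(len(alignment)), 2):
        # assumption: WSP pair weight = product of the two sequence weights
        w = 1.0 if weights is None else weights[i] * weights[j]
        total += w * sum(pair_score(x, y)
                         for x, y in zip(alignment[i], alignment[j]))
    return total

msa = ["ACG-T", "ACGAT", "TCG-T"]
print(sum_of_pairs(msa))                           # unweighted SP score
print(sum_of_pairs(msa, weights=[0.5, 1.0, 0.5]))  # down-weight rows 1 and 3
```

Because every column is scored on its own and the column scores are simply added, the code also shows where the independence assumption enters.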

18
Q

Progressive alignment
- first implementation?
- frequently used implementations?

(* basic progressive alignment )

A
  • first implementation: 1987
  • frequently used implementations
    • 1994: CLUSTAL W
    • 1997: CLUSTAL X
    • 2011: CLUSTAL O
19
Q

Progressive alignment - improvements on original implementation:

challenge: errors that are frozen in subalignments

solution?

A

solution: iterative refinement (most programs)

20
Q

Progressive alignment: improvements on original implementation:

  • challenge: suboptimal pw alignments in MSA

solution?

(* basic progressive alignment )

A

solution: consistency scores (T-Coffee)

21
Q

Progressive alignment: improvements on original implementation:

  • challenge: evolutionary correct (aware) MSAs

(* basic progressive alignment )

A

solution: modify gap costs (Prank)

22
Q

Progressive alignment: improvements on original implementation:

  • challenge: guide tree takes long for huge data sets
A

solution: fast clustering of sequences (Clustal-O)

23
Q

Progressive alignment
major weakness?
solution?

A

major weakness:
* alignment errors cannot be corrected once they are introduced
solution: repair errors during post-processing
* iterative refinement
* implemented in most programs

24
Q

Progressive alignment

Iterative refinement?

A
  • remove and re-align a single sequence
  • partition the sequences (randomly or tree-based), re-align within groups, then align groups
  • re-align after each profile-alignment (seq-profile, profile-profile)
25
Q

T-Coffee

What type of approach / objective function?

What can it also incorporate?

Position-specific library?

A

consistency-based: uses consistency as an objective function

best alignment: the one that agrees most with pw alignments
(cf the one with the highest sum of pairs score in progressive alignment)

Consistency:
- evaluates consistency with pairs of residues found in optimal local alignments and heuristic global alignments
- does not score gaps explicitly

can also incorporate extraneous information (e.g., structural constraints)

position-specific library
* similarity of the pair of sequences (sequence fragments) the residue pair comes from
* consistency of that residue pair with all other residue pairs
* score for aligning xi and yj
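
One way to picture the consistency part of the library is a triplet extension: the weight for aligning residue x_i with y_j grows whenever both are also aligned to the same residue z_k of a third sequence. The sketch below assumes the commonly described rule (add the minimum of the two supporting weights) and uses made-up primary weights; `extend_library` and the data layout are hypothetical, not T-Coffee's actual code.

```python
from collections import defaultdict
from itertools import combinations

def extend_library(primary):
    """primary: {frozenset({(seq, pos), (seq, pos)}): weight} -> extended weights."""
    neighbours = defaultdict(dict)
    for pair, w in primary.items():
        a, b = tuple(pair)
        neighbours[a][b] = w
        neighbours[b][a] = w

    extended = dict(primary)
    # Use every residue c as the intermediate that links two of its partners.
    for c, partners in neighbours.items():
        for (a, w_ac), (b, w_cb) in combinations(partners.items(), 2):
            if a[0] == b[0]:
                continue  # residues of the same sequence are never aligned
            key = frozenset((a, b))
            extended[key] = extended.get(key, 0) + min(w_ac, w_cb)
    return extended

# Made-up primary weights for residue 1 of sequences A, B and C.
primary = {
    frozenset({("A", 1), ("B", 1)}): 88,
    frozenset({("A", 1), ("C", 1)}): 77,
    frozenset({("C", 1), ("B", 1)}): 100,
}
extended = extend_library(primary)
print(extended[frozenset({("A", 1), ("B", 1)})])
```

Pairs that are supported through many third sequences accumulate high weights, which is what the progressive stage then tries to maximise.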

26
Q

Gaps in MSAs:

To make biological sense, how should a deletion be penalised? insertion?

What do most methods do?

Why is this a problem? solution?

What new problem does this present?

A
  • deletion in a MSA: should be penalized only where it occurs
  • insertion in a MSA: should be penalized only once

most methods don’t distinguish between insertions and deletions in MSAs: all gaps are considered deletions

problem: high penalties for a single insertion
solution: reduce the gap costs in regions already containing gaps, increase gap costs near existing gaps

new problem! encouraging overlap of gaps:
- collapse of independent and nearby insertions
- can lead to alignment over-compression: 2 independent insertion events are collapsed into the same columns
➡ violates positional homology
➡ incorrect alignment

27
Q

Gaps in MSAs: PRANK

what result? how?

A

computes ancestral sequences, marks insertions so they will not be:
- further penalized
- (mis)matched
during later alignment steps

–> improved results with denser sampling, when guide tree = true tree
–> better evolutionary awareness: eg for evolution through short insertions and deletions

28
Q

gaps in MSAs

ClustalW vs Prank

A

ClustalW
shrinkage/expansion through overlapping point mutations

Prank
evolution through short insertions and deletions –> better evolutionary awareness

29
Q

What did we learn about large alignments?

A

Alignments of up to ca. 1k sequences and up to ca. 8-10k columns: accurate alignments & phylogenies can be computed
- if the best aligners are used and/or
- evolutionary rate of indels is low

Any more seqs/cols: most aligners failed to complete
But low-accuracy methods complete –> alignments & trees are highly inaccurate
eg:

MAFFT, Clustal-Omega, PASTA
* fast pairwise comparison using clustering, guide sub-trees
* decreased accuracy: 60% agreement between methods (M Chatzou et al., assigned reading)

30
Q

What did we learn about huge alignments?

Applications?

Approaches?

A

applications
* MSA, phylogenetics, evolutionary analysis
* in the context of protein structure prediction

approaches
* divide & conquer approaches (e.g., SATé, PASTA, SATCHMO-JS, PROMALS, MAPGAPS): divide sequences into subsets of at most size X, align sequences in each subset, merge the subset alignments into a final alignment
* seed-based approach (e.g., UPP, MAFFT-Sparsecore, regressive): select subset, align, compute pHMM, use pHMM to align all remaining sequences to it

31
Q

Evaluation of alignment accuracy/usefulness

dilemma? solution?

A

dilemma!
* > 100 alignment programs are available
* heuristics! co-optimal alignments!
* errors in sequence alignments cannot be avoided

solution:
tolerate but quantify errors / uncertainty
➜ carefully select the alignment approach/software
➜ evaluate the computed alignment
* how good is the entire alignment? are specific regions?
* how useful is the alignment for the intended purpose?
* does the alignment have to be reduced/masked?

32
Q

Selecting a MSA method:

what do we need to ask?

A

can the method reconstruct the (near) correct alignment?
➜ true alignment generally unknown!

  • method’s published strengths & weaknesses
  • faster or more accurate?
  • for few or lots of sequences?
  • designed for structural or evolutionary analysis?
  • tested against a benchmark data set?
  • e.g., BAliBASE (structure-based alignments) or simulated alignments?
33
Q

Types of (problematic) alignments

A
  • short, long
  • highly divergent
  • extensions
  • insertions
  • orphans
  • subfamilies
  • repeats
  • motifs
  • lots of sequences
34
Q

How do we evaluate alignments?

A

usually evaluate by column
- (in)consistently aligned?
- between methods: M-Coffee
- (within methods: HoT, GUIDANCE)

or evaluate by sequence or sequence region
* non-homologous sequence?
* homologous sequence, misaligned?
* non-homologous sequence stretch
(e.g., assembly or annotation error)

evaluate by overall score?
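
A column-wise consistency check between two alignments of the same sequences can be sketched as follows. Each column is rewritten as the tuple of residue indices it contains, and a column counts as supported if the other alignment contains exactly the same tuple. This is a simplified illustration (all-or-nothing per column, same sequence order assumed), not the scoring used by M-Coffee, HoT or GUIDANCE; the function names are hypothetical.

```python
def columns_as_residue_indices(alignment):
    """Rewrite each column as a tuple of per-sequence residue indices (None = gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        ids = []
        for s, sym in enumerate(col):
            if sym == "-":
                ids.append(None)
            else:
                ids.append(counters[s])
                counters[s] += 1
        columns.append(tuple(ids))
    return columns

def column_support(msa_a, msa_b):
    """For each column of msa_a: is the identical column present in msa_b?"""
    cols_b = set(columns_as_residue_indices(msa_b))
    return [col in cols_b for col in columns_as_residue_indices(msa_a)]

msa_a = ["ACG-T", "ACGAT", "TCG-T"]   # e.g. output of method 1
msa_b = ["ACGT-", "ACGAT", "TCG-T"]   # e.g. output of method 2
support = column_support(msa_a, msa_b)
print(support)
```

Columns flagged `False` are the ones where the two methods disagree and that deserve a closer look before further analysis.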

35
Q

What are meta-alignments?

what problem do they solve?
how?
example?

A

different methods lead to different alignments (lead to different conclusions)

we can compute several MSAs and select the “best”, or generate a consensus

–> meta-methods: compute a MSA that is consistent with the original alignments (M-Coffee)

36
Q

M-Coffee

What is it an example of?

What does it do?

Idea?

Approach?

A

Example of meta-alignment

It combines alternative MSAs into one final output

Idea:
- errors produced by independent approaches should not be consistent
- agreement suggests correctness
- correlated methods violate M-Coffee’s assumption: method selection is important!

Approach (T-Coffee based)
- library = multiple sequence alignments
- compile MSAs into a single new MSA
- score (color/numeric) as described for T-Coffee

37
Q

evaluate the alignment:

what is the consequence of low alignment scores?

A

for further analysis (e.g., phylogenetic inference):

mask or remove sequences or alignment regions / columns that likely violate positional homology
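
Column masking itself is a small operation once per-column scores are available. The sketch assumes one confidence value per column (made-up numbers here) and an arbitrary cut-off of 0.5; `mask_columns` is a hypothetical helper, and real tools apply their own scores and thresholds.

```python
def mask_columns(alignment, column_scores, min_score=0.5):
    """Keep only the columns whose score reaches min_score."""
    keep = [i for i, score in enumerate(column_scores) if score >= min_score]
    return ["".join(row[i] for i in keep) for row in alignment]

msa = ["ACG-T", "ACGAT", "TCG-T"]
scores = [0.9, 1.0, 1.0, 0.2, 0.8]   # made-up per-column confidence values
masked = mask_columns(msa, scores)
print(masked)
```

Removing columns shortens every sequence by the same amount, so the masked alignment stays rectangular and can be passed on to phylogenetic inference.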

38
Q

Multiple sequence alignments:

What kind of datasets?

A

homologous coding sequences (DNA, protein) –> linear & mostly global alignments

entire genomes (or genomic scaffolds)

RNA families

alignment-free sequence comparison

39
Q

Multiple sequence alignments of whole genomes ?
What do we need to consider for these?

What’s an example of a software to do this?
Which other software does it incorporate and for what?
What output format

A

entire genomes (or genomic scaffolds)
- must take into consideration inversion, translocation, duplication
- identify homologous blocks, then align these

MUGSY
* compute pairwise alignments (MUMmer)
* identify & collect collinear regions (graph-based)
* combine regions into MSAs (T-Coffee)
* output in MAF format

40
Q

What do we need to consider for MSAs of RNA families?

A

secondary structure conservation over sequence conservation

41
Q

Multiple sequence alignments

Name 4 types of datasets

A
  • homologous coding sequences (DNA, protein)
  • entire genomes (or genomic scaffolds)
  • RNA families
  • alignment-free sequence comparison
42
Q

MSAs:
dataset: homologous coding sequences (DNA, protein)

what type of alignments do we want?

A

linear & mostly global alignments

43
Q

MSAs:
dataset: entire genomes (or genomic scaffolds)

what do we need to consider?
what do we identify for alignment?

A
  • take into consideration inversion, translocation, duplication
  • identify homologous blocks, then align these
44
Q

MSAs:
dataset/ aim to identify: RNA families
what do we need to prioritise?

A

secondary structure conservation over sequence conservation

45
Q

EXAM QUESTION

In the lectures, we went over the concept of a guide tree on two occasions. Describe the use of guide trees in each of those contexts. (2019)

2 topics where guide tree was mentioned and how (2020)

A

Guide trees for MSAs

Many progressive (iterative) alignment methods use guide trees to generate an MSA:
- compute a pairwise distance matrix
- use alignment scores to compute a guide tree which tells us which sequence to align next

Guide tree is not a phylogeny! guides the order in which sequences are being aligned

46
Q

EXAM QUESTION

List the steps for progressive alignment, and its main disadvantage. Describe an improvement for it, and the software that uses such an improvement. (2019)

Main steps basic progressive alignment + disadvantage, how to overcome (2020)

A
  1. compute pairwise distance matrix
  2. alignment scores –> compute guide tree
  3. align closely related seqs, progressively add more distantly related seqs

major weakness:
alignment errors cannot be corrected once introduced, because sub-alignments are “frozen”.

solution: repair errors during post-processing = iterative refinement
iterative methods work similarly to progressive methods, but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA.

implemented in most programs (eg MAFFT)
