4 - Multiple Sequence Alignment and profiles Flashcards

Question 1

Q

BLASTP and BLASTX filter out low complexity regions with the program SEG. Why?

Answer

A

SEG replaces low-complexity regions with X so that they cannot make matches look better than they really are.
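
A rough sketch of the masking idea, in Python. This is not SEG's actual algorithm: it assumes a simple sliding-window Shannon entropy filter, and the function names, window size and threshold are made up for illustration.

```python
import math
from collections import Counter

def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue covered by a low-entropy window with X.

    window and threshold are illustrative values, not SEG defaults.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)

# A poly-Q run is masked; windows overlapping its edges may mask a few flanking residues too.
print(mask_low_complexity("MKTAYIAKQR" + "Q" * 15 + "LVDESGHWCN"))
```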

Question 2

Q

If you perform permutation tests and get a distribution of scores far below the real score, are the aligned sequences homologous?

Answer

A

Yes

Permutation tests are more sensitive in general and are good for detecting very distant relationships
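
A minimal sketch of a permutation test, in Python. It assumes a toy Needleman-Wunsch score (match +1, mismatch -1, linear gap -2) instead of a real substitution matrix, and the function names are hypothetical. One sequence is shuffled many times and the real score is compared with the shuffled distribution as a z-score.

```python
import random
import statistics

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a toy scoring scheme (assumed values)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def permutation_z_score(a, b, n_perm=200, seed=0):
    """Return (real score, mean shuffled score, z-score of the real score)."""
    rng = random.Random(seed)
    real = nw_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_perm):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        shuffled_scores.append(nw_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    z = float("inf") if sd == 0 else (real - mean) / sd
    return real, mean, z
```

A real score many standard deviations above the shuffled mean suggests the similarity is not explained by composition alone.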

Question 3

Q

What is the twilight zone of evolutionary distance?

Answer

A

Around 15-23% identity

Question 4

Q

List the rules of thumb for the following alignments

A) Sequence > 100 AA, > 25% identical

B) Sequence > 100 AA, 15-25% identity

C) <15% identity

Answer

A

A) Sequence > 100 AA, > 25% identical
- Probably significantly homologous

B) Sequence > 100 AA, 15-25% identity
- Probably homologous, but need rigorous testing (including permutation tests)

C) <15% identity
- Not significant; look for motifs in multiple alignments, as well as tertiary structure

Question 5

Q

List two benefits to multiple alignments

Answer

A

Work better than pairwise alignment methods for detecting distant sequence relationships
Pre-requisite for estimating phylogenetic trees

Question 6

Q

Describe progressive multiple alignments

Answer

A

E.g. Clustal

A heuristic method, and therefore not guaranteed to find the optimal alignment
Requires n choose 2 pairwise alignments as a starting point

Pairwise alignments: n!/(2!(n-2)!) = n(n-1)/2
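
A quick check of the n choose 2 count in Python. The sequence names are placeholders.

```python
import math
from itertools import combinations

def n_pairwise_alignments(n: int) -> int:
    """Number of distinct sequence pairs: n! / (2! (n-2)!) = n(n-1)/2."""
    return math.comb(n, 2)

names = ["seqA", "seqB", "seqC", "seqD"]  # placeholder names
pairs = list(combinations(names, 2))      # every pair that needs aligning

print(n_pairwise_alignments(len(names)), pairs)
```

The count grows quadratically, so 100 sequences already need 4950 pairwise alignments.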

Question 7

Q

Give the steps of ClustalW

Answer

A

Pairwise alignment of all pairs of sequences to calculate a distance matrix
Build a neighbour-joining guide tree from the distance matrix
Align the two most closely related sequences using Needleman-Wunsch (NW)
Add the next most similar sequence or set of sequences according to the guide tree
The alignment is built up with each step treated as a pairwise alignment, sometimes with each member of a pair containing more than one sequence
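
A sketch of the first steps in Python, under simplifying assumptions: the toy sequences are equal-length and ungapped so the pairwise alignment step can be skipped, distance is the fraction of mismatched positions, and the neighbour-joining tree is not built. Only the closest pair (the first pair to be aligned) is picked from the matrix. All names are hypothetical.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy distance assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(seqs: dict) -> dict:
    """Distances for all n choose 2 pairs, keyed by (name1, name2)."""
    return {(p, q): p_distance(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}

def closest_pair(dist: dict) -> tuple:
    """The pair with the smallest distance, i.e. the first pair to align."""
    return min(dist, key=dist.get)

toy = {"s1": "MKTAYIAK", "s2": "MKTAYIAR", "s3": "MRTGYLAK", "s4": "QRSGHLCK"}
dist = distance_matrix(toy)
first_pair = closest_pair(dist)
print(first_pair, dist[first_pair])
```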

Question 8

Q

Give 1 advantage and 3 disadvantages of ClustalW

Answer

A

Pros
- Fast

Cons
- No objective function (optimality criterion)
- No way of quantifying whether or not the alignment is good
- Local minimum problem: if an error is introduced early, it cannot be corrected later in the procedure

Question 9

Q

How are sequences weighted with ClustalW?

Answer

A

Calculated from the guide tree
Weights are normalized, so that the largest weight is 1
Closely related sequences have a large amount of the same information, so they are downweighted
These weights are used as simple multiplication factors when deriving the score of an alignment of groups or pairs

Weights allow you to take advantage of similar sequences when you already know the phylogeny or other information that is relevant to weighting.
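
A sketch of how such weights could act as multiplication factors, in Python. It assumes the raw weights are already derived from a guide tree, uses a toy identity score in place of a substitution matrix, and averages over all cross-group pairs. Function names and numbers are illustrative.

```python
def normalize_weights(raw: dict) -> dict:
    """Scale weights so that the largest weight is 1."""
    largest = max(raw.values())
    return {name: w / largest for name, w in raw.items()}

def weighted_column_score(col_a: dict, col_b: dict, weights: dict, score) -> float:
    """Average pairwise score between two alignment columns.

    Each column maps sequence name -> residue; every pairwise score is
    multiplied by the weights of the two sequences involved.
    """
    total = 0.0
    for name_a, res_a in col_a.items():
        for name_b, res_b in col_b.items():
            total += weights[name_a] * weights[name_b] * score(res_a, res_b)
    return total / (len(col_a) * len(col_b))

# Two near-identical sequences get small weights, the divergent one the largest.
weights = normalize_weights({"human": 1.0, "chimp": 1.0, "yeast": 4.0})
identity = lambda x, y: 1.0 if x == y else 0.0  # toy score, not a real matrix
print(weighted_column_score({"human": "K", "chimp": "K"}, {"yeast": "K"}, weights, identity))
```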

Question 10

Q

How does Clustal deal with gap penalties?

Answer

A

There are gap opening penalties (GOP) and gap extension penalties (GEP).

These can be set by the user, but Clustal will attempt to adjust them according to the following criteria:

Dependence on the site properties
Dependence on the similarity of the sequences

The percent identity of the sequences is used as a scaling factor to increase the GOP for closely-related sequences and decrease it for more distantly-related sequences

Question 11

Q

Describe Clustal’s position-specific gap penalties and its treatment of positions that already contain gaps

Answer

A

Before any pair of sequences or groups is aligned, a table of GOPs is generated for each position in the two sets of sequences
The GOP is adjusted in a position-specific manner, so that it can vary over the sequences

If there are already gaps at a position, the GOP is reduced in proportion to the number of sequences with a gap at this position and the GEP is lowered by half.

Positions near existing gaps (within 8 residues) that do not themselves contain a gap have an increased GOP

These rules discourage the opening of too many gaps close together but encourage new gaps to line up exactly with existing ones
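
A sketch of these rules in Python. The cards do not give the exact formulas, so the linear scaling by the fraction of ungapped sequences and the doubling near gaps are assumptions, as are the base penalties and the function name.

```python
def position_gap_penalties(rows, base_gop=10.0, base_gep=0.2, near=8):
    """Return a (GOP, GEP) pair for each column of an existing alignment.

    rows: equal-length aligned sequences with '-' for gaps.
    All constants are illustrative, not Clustal defaults.
    """
    columns = ["".join(col) for col in zip(*rows)]
    n_seqs = len(rows)
    gap_counts = [col.count("-") for col in columns]
    gapped = [i for i, c in enumerate(gap_counts) if c > 0]
    penalties = []
    for i, n_gaps in enumerate(gap_counts):
        gop, gep = base_gop, base_gep
        if n_gaps > 0:
            # Existing gap: cheaper to open here, and extension cost halved.
            gop = base_gop * (n_seqs - n_gaps) / n_seqs
            gep = base_gep / 2
        elif any(abs(i - g) <= near for g in gapped):
            # Gap-free column close to a gap: opening is more expensive.
            gop = base_gop * 2  # hypothetical size of the increase
        penalties.append((gop, gep))
    return penalties

for gop, gep in position_gap_penalties(["MK-TAY", "MKQTAY", "MK-TAY"], near=1):
    print(round(gop, 2), gep)
```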

Question 12

Q

Describe Clustal’s treatment of gaps in protein loops

Answer

A

A run of at least 5 hydrophilic residues has a decreased GOP because such runs usually indicate loop regions in protein structures
Any gap-free position that is spanned by a run of 5 hydrophilic residues has its GOP lowered by 3x
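
A sketch of the hydrophilic-run rule in Python for a single ungapped sequence. Two assumptions: the hydrophilic residue set GPSNDQEKR (believed to be the ClustalW default, but check the documentation), and reading "lowered by 3x" as dividing the GOP by 3. The function name and base penalty are made up.

```python
import re

HYDROPHILIC = "GPSNDQEKR"  # assumed residue set, see the note above

def loop_adjusted_gop(seq: str, base_gop: float = 10.0,
                      min_run: int = 5, factor: float = 3.0) -> list:
    """Per-position GOP, lowered inside runs of hydrophilic residues."""
    gops = [base_gop] * len(seq)
    run = re.compile("[%s]{%d,}" % (HYDROPHILIC, min_run))
    for match in run.finditer(seq):
        for i in range(match.start(), match.end()):
            gops[i] = base_gop / factor  # assumed reading of "3x"
    return gops

# The run GSNDKE (6 residues) is treated as a likely loop.
print(loop_adjusted_gop("LLVVGSNDKELLIV"))
```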

Question 13

Q

Why is it better to delay the alignment of divergent sequences when making multiple alignments?

Answer

A

The most divergent sequences are usually the most difficult to align, so it is better to add them after the more easily aligned sequences have been merged.

The user can set an identity cutoff below which the alignment of a sequence is delayed until the others have been aligned.

Question 14

Q

What are the two major changes in Clustal Omega?

Answer

A

Faster method for calculating the distance matrix
Incorporates a Hidden Markov Model into the main alignment engine

Question 15

Q

Why should the output of a multiple alignment algorithm always be checked?

Answer

A

Obvious mistakes can be made
Some sequences will ruin the alignment because they are too divergent
For phylogenetic inference, you should become familiar with a manual alignment editor.

Question 16

Q

How does database searching with conserved elements of multiple sequence alignment (motifs or patterns, or profiles) improve sensitivity of database searching?

Answer

A

By upweighting important (conserved) sequence elements and downweighting less important (less conserved) sequence features.

A query is inherently similar to all sequences in an alignment, but not very similar to any single one (less than 40% identity). You therefore need some way of summarizing information from all the sequences in the multiple alignment at once:

Profiles
PSSMs
HMMs
Sequence LOGOs etc.
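
A minimal PSSM sketch in Python, to show how one summary can score a query against a whole alignment at once. It assumes a uniform background frequency and a flat pseudocount, which real profile tools do not use, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount: float = 1.0) -> list:
    """Log-odds score per residue for each alignment column.

    Assumes a uniform background (1/20) and a flat pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_window(pssm: list, window: str) -> float:
    """Sum of column scores for a query window of the same length."""
    return sum(col[res] for col, res in zip(pssm, window))

motif = ["GKS", "GKT", "GRS", "GKS"]  # toy alignment of a short motif
pssm = build_pssm(motif)
print(score_window(pssm, "GKS"), score_window(pssm, "AAA"))
```

Conserved columns (the invariant G here) contribute large positive scores for the conserved residue, which is the upweighting described above.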