7. pw, DP Flashcards
What is the difference between similarity/identity and homology?
homology: all-or-nothing condition (homologous or not homologous)
similarity / identity: quantitative measure, can be eg 20%
Can homology be observed?
cannot be observed or known, just inferred
Comparative sequence analysis: starting with seq A and seq B, what kind of analysis can we do?
similarity / homology?
compute (optimal) alignment
What does syntenic mean?
(of genes) occurring on the same chromosome.
What is a dotplot?
What signals do they give?
In bioinformatics, a dot plot is a graphical method for comparing two biological sequences and identifying regions of close similarity. It does not require a prior alignment.
It is a type of recurrence plot.
signal
- identity, similarity
- length of consecutive signals
Define the pairwise sequence alignment (genes)
the comparison & arranging of two sequences by
* searching for pairwise matches and “good mismatches” between their characters
* possibly inserting gaps in each sequence
What do we need to consider when obtaining a scoring matrix?
- observe trusted alignments of related proteins
- which residues are paired? (i.e., which substitutions have occurred?)
- different values for sequences of different evolutionary divergence!
- different scoring matrices for further diverged sequences!
Name two approaches for amino acid scoring matrices
What are their origins?
PAM (compiled by Margaret Dayhoff and her colleagues in the 1970s - very little data)
BLOSUM (Steven and Jorja Henikoff in 1992)
PAM matrices
what are they based on?
what does PAM1 imply?
Point Accepted Mutation
- based on observed amino acid substitutions in families of evolutionarily related proteins
- PAM1 implies 1 substitution per 100 amino acids, accepted by the processes of natural selection
PAM matrices
how do we get PAM250?
extrapolation of values for more distantly related proteins:
PAM250 = (PAM1)^250, i.e. the PAM1 matrix multiplied by itself 250 times
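The extrapolation can be illustrated with a toy example. The sketch below assumes a hypothetical two-letter alphabet and an invented one-step mutation matrix in which each residue changes with probability 0.01 (the real PAM1 is a 20x20 matrix of amino acid mutation probabilities). It only shows the matrix-power step; the published PAM scores are log-odds values derived from such probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power (repeated squaring)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical toy "PAM1": invented numbers, not Dayhoff's data.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)

# Probability that a position shows the same residue after 250 steps.
print(toy_pam250[0][0])
```

The diagonal does not drop to zero after 250 steps, because the same position can be hit several times or change back.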
PAM matrices
What are the guidelines for which PAM matrix to choose?
PAM250 for proteins of 20% identity
PAM120 for proteins of 40% identity
PAM60 for proteins of 60% identity
BLOSUM matrices
what does BLOSUM stand for
What is it based on
BLOcks amino acid SUbstitution Matrices
based on local alignments of divergent sequences
BLOSUM matrices
How do we get different BLOSUM matrices?
eg BLOSUM50?
different BLOSUM matrices are not extrapolated but based on observed alignments
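eg BLOSUM50 matrix is derived from alignment blocks in which sequences more than 50% identical are clustered together, so the counted substitutions come from sequences that are at most about 50% identical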
You want to compare two sequences that you believe may be distantly related.
How would you choose a BLOSUM matrix? a PAM matrix?
Choose a BLOSUM with a lower number
Choose a PAM with a higher number
(Maybe start with BLOSUM62 and then adjust)
BLOSUM matrices
What are the guidelines for how to choose a BLOSUM matrix?
eg when would you choose BLOSUM50?
guideline: a BLOSUM matrix index should approximately match the percent identity of the sequences to be aligned
-> BLOSUM50 matrix is best used for sequences that are about 50% identical
What is a good all purpose substitution matrix for proteins?
BLOSUM62
- all purpose - whether sequences are conserved or divergent.
- best for testing - then change parameters according to results
Scores in substitution matrices:
What do they mean?
How are they calculated?
which amino acids occur together in the alignment columns more often than expected by chance?
s(a, b) = log( p_ab / (q_a * q_b) )
p_ab: observed frequency of residues a and b aligned
q_a, q_b: background frequencies of residues a and b
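A small numeric sketch of the log-odds idea: the score is positive when a pair is aligned more often than expected from the frequencies of the two residues, and negative when it is aligned less often. The frequencies below are invented for illustration, and the base-2 logarithm is an assumption (published matrices also scale and round the values to integers).

```python
import math


def log_odds(p_ab, q_a, q_b, base=2):
    """Log of the observed pair frequency over the frequency expected by chance."""
    return math.log(p_ab / (q_a * q_b), base)


# Hypothetical frequencies, not taken from any real matrix.
q_a, q_b = 0.05, 0.10                     # chance expectation: 0.05 * 0.10 = 0.005
enriched = log_odds(0.02, q_a, q_b)       # pair seen 4x more often than chance
depleted = log_odds(0.00125, q_a, q_b)    # pair seen 4x less often than chance
print(enriched, depleted)
```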
Explain affine gap penalties
- score depends on the length of the contiguous gap
- gap opening penalty is larger : d
- gap extension penalty is smaller: e
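A minimal sketch of the difference to a linear gap penalty. It assumes the convention that a gap of length g costs d + (g - 1) * e (some tools instead charge d + g * e), and the values d = 10 and e = 1 are placeholders, not recommendations.

```python
def affine_gap_penalty(length, d=10, e=1):
    """Penalty for one contiguous gap: open once (d), then extend (e) per extra position."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return d + (length - 1) * e


def linear_gap_penalty(length, d=10):
    """Penalty when every gap position costs the same amount d."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return length * d


for g in (1, 2, 5):
    print(g, affine_gap_penalty(g), linear_gap_penalty(g))
```

With affine penalties one long gap is cheaper than several short ones of the same total length, which favours a few longer indels over many scattered ones.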
In which different ways can an alignment be ‘optimal’?
Which kind of optimality are we aiming for? Which can we actually achieve?
- functionally
- structurally
- evolutionarily
- algorithmically
Aim for evolutionarily optimal, but algorithmically optimal is the only one we can really achieve.
What does it mean if an alignment is functionally optimal ?
aligned residues have the same function
eg functional domains
What does it mean if an alignment is structurally optimal ?
aligned residues play a similar role / are in corresponding positions in the 3D structure
eg hydrophobic residues
What does it mean if an alignment is evolutionarily optimal ?
aligned residues are homologous, i.e. share a common ancestry
What does it mean if an alignment is algorithmically optimal ?
the highest-scoring alignment for a given substitution model and gap penalties
What problem does dynamic programming solve for pairwise alignments?
GOAL: optimal (highest-scoring) pairwise alignment
PROBLEM:
- As length of sequences increases, number of possible alignments increases exponentially!
- constructing and scoring all possible alignments and picking the best one is not an option!
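To get a feeling for the growth, the number of possible global alignments can be counted with a small recurrence. The sketch assumes that every alignment column is either a residue pair, a gap in A, or a gap in B, and that different orderings of adjacent gaps count as different alignments. The function name is made up for this example.

```python
def count_alignments(n, m):
    """Count global alignments of sequences of length n and m.

    The last column is a pair, a gap in one sequence, or a gap in the
    other, so the count is the sum of the three smaller subproblems.
    """
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (table[i - 1][j - 1]
                           + table[i - 1][j]
                           + table[i][j - 1])
    return table[n][m]


for length in (1, 2, 3, 10, 50):
    print(length, count_alignments(length, length))
```

The count itself is computed by dynamic programming: each cell reuses the stored results of smaller subproblems instead of enumerating alignments.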
What kind of problems is dynamic programming used for?
What is the basic principle?
optimization
- problems are broken into smaller, nested subproblems
- solutions to subproblems are computed and stored
- these are used to construct solutions to larger and larger portions of the original problem
How is DP applied to alignment?
build up the best alignment by using optimal alignments of smaller subsequences
What are 3 steps for DP in optimal pairwise alignment?
- initialization of the score matrix
- scoring: matrix fill (calculate alignment scores)
- traceback and deduction of the alignment
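The three steps can be sketched for a global alignment as follows. This is a minimal illustration, assuming a simple match/mismatch score and a linear gap penalty (the parameter values are arbitrary) instead of a substitution matrix and affine gaps; it returns only one of possibly several optimal alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # 1. Initialization: first row and column hold cumulative gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # 2. Matrix fill: each cell is the best of diagonal, up and left.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # 3. Traceback: walk from the bottom-right cell back to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("GATTACA", "GATACA")
print(result[0])
print(result[1])
print(result[2])
```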
What was the original algorithm designed for sequence alignment?
What kind of alignment did it do?
Needleman-Wunsch
for global alignment
Which algorithm was designed for local alignments?
Smith-Waterman
(based on Needleman-Wunsch)
How does traceback work in local alignments?
local pairwise alignment
- cells with negative scores are set to zero
- traceback starts at the highest scoring cell
- stops when 0 is encountered
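The same three rules in a minimal sketch. As an assumption for illustration it uses a simple match/mismatch score with a linear gap penalty (arbitrary values) and reports only the first highest-scoring cell it finds.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Negative values are replaced by zero.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback starts at the highest-scoring cell and stops at a zero.
    top, bottom = [], []
    i, j = best_pos
    while score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC"))
```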
What is the consequence of affine gap penalties when using DP?
consequence for dynamic programming implementation:
have to keep track of 3 scores and pointers at each cell
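A sketch of the three scores per cell for a global alignment, in the style usually attributed to Gotoh. The matrix names M, X and Y, the gap cost d + (g - 1) * e, and all parameter values are assumptions for illustration. To stay short it computes only the optimal score; a full implementation would also store one pointer per matrix for the traceback.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=1, mismatch=-1, d=3, e=1):
    """Optimal global alignment score with affine gaps (open d, extend e)."""
    n, m = len(a), len(b)
    # M: best score ending in a residue pair
    # X: best score ending with a[i-1] aligned to a gap
    # Y: best score ending with b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Y[0][j] = -d - (j - 1) * e

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Opening a gap costs d, extending an existing one costs e.
            X[i][j] = max(M[i - 1][j] - d, X[i - 1][j] - e, Y[i - 1][j] - d)
            Y[i][j] = max(M[i][j - 1] - d, Y[i][j - 1] - e, X[i][j - 1] - d)

    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("AAAGGGAAA", "AAAAAA"))
```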
What is the effect of increasing the word size when generating dotplots?
reduces the noise, as short matches are removed.
However it also reduces the signal for the areas that appear homologous.
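The effect can be tried out with a minimal word-based dotplot. This sketch places a dot wherever the words starting at position i in A and position j in B are identical; it is a simplification and not the method used by any particular program. The sequences are made up.

```python
def dotplot(a, b, word=1):
    """Return the set of (i, j) where a[i:i+word] equals b[j:j+word]."""
    dots = set()
    for i in range(len(a) - word + 1):
        for j in range(len(b) - word + 1):
            if a[i:i + word] == b[j:j + word]:
                dots.add((i, j))
    return dots


def render(a, b, dots):
    """Text rendering: rows are positions in a, columns are positions in b."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(b)))
        for i in range(len(a))
    )


a = "GATTACA"          # hypothetical example sequences
b = "TTGATTACATT"
for word in (1, 3):
    print("word size", word)
    print(render(a, b, dotplot(a, b, word)))
```

With word size 1 the diagonal is surrounded by many chance matches. With word size 3 most of them disappear, but any matching stretch shorter than the word size would disappear as well.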
What program from which package can be used for generating dotplots?
What is this useful for ?
polydot or dotmatcher (more sensitive, uses scoring matrix) from
EMBOSS - European Molecular Biology Open Software Suite
Good way to get an overview of similarities of sequences
How can you get a dotplot? What parameters are there?
Combine (concatenate) sequences in one fasta file.
Use polydot. Parameters:
- word size
- type of output
Use dotmatcher. Parameters:
- window size
- threshold
- scoring matrix
EXAM QUESTION
Sequences A and B each have a length of 1000 aa. Sequence A has an N-terminal region (front) with high similarity to a tandemly duplicated region in the middle of sequence B. Draw a dotplot presenting the similarities. (2019)
First 250 amino acids are tandemly duplicated in the middle of B (2020)
Dotplot 2 sequences (2020)