7. pw, DP Flashcards

1
Q

What is the difference between similarity/identity and homology?

A

homology: all-or-nothing condition (homologous or not homologous)

similarity / identity: quantitative measure, can be eg 20%
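
A minimal sketch of identity as a quantitative measure. It assumes the two sequences are already aligned, gaps are written as "-", and identity is counted over gap-free columns only (tools differ in the denominator they use). The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns that contain no gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


print(percent_identity("HEAGAWGHEE", "HEA-AWHHEE"))
```

The result is a number between 0 and 100. Whether that number is high enough to infer homology is a separate yes/no judgement.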

2
Q

Can homology be observed?

A

cannot be observed or known, just inferred

3
Q

Comparative sequence analysis: starting with seq A and seq B, what kind of analysis can we do?

A

ask whether they are similar and, from that, whether they are likely to be homologous

compute an (optimal) alignment

4
Q

What does syntenic mean?

A

(of genes) occurring on the same chromosome.

5
Q

What is a dotplot?

What signals do they give?

A

In bioinformatics a dot plot is a graphical method for comparing two biological sequences and identifying regions of close similarity after sequence alignment.

It is a type of recurrence plot.

signal
- identity, similarity
- length of consecutive signals

6
Q

Define the pairwise sequence alignment (genes)

A

the comparison and arrangement of two sequences by
* searching for pairwise matches and "good mismatches" between their characters
* possibly inserting gaps in each sequence

7
Q

What do we need to consider when obtaining a scoring matrix?

A
  • observe trusted alignments of related proteins
    • which residues are paired? (i.e., which substitutions have occurred?)
  • different values for sequences of different evolutionary divergence!
    • different scoring matrices for further diverged sequences!
8
Q

Name two approaches for amino acid scoring matrices

What are their origins?

A

PAM (compiled by Margaret Dayhoff and her colleagues in the 1970s - very little data)
BLOSUM (Steven and Jorja Henikoff in 1992)

9
Q

PAM matrices

what are they based on?

what does PAM1 imply?

A

Point Accepted Mutation

  • based on observed amino acid substitutions in families of evolutionarily related proteins
  • PAM1 implies 1 accepted substitution per 100 amino acids, i.e. a substitution that has been accepted by natural selection
10
Q

PAM matrices

how do we get PAM250?

A

extrapolation of values for more distantly related proteins:
PAM250 = (PAM1)^250, i.e. the PAM1 matrix multiplied by itself 250 times
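
A small sketch of the extrapolation, assuming a toy two-letter alphabet instead of 20 amino acids and made-up PAM1 probabilities. The matrix that is raised to a power is the mutation probability matrix (each row sums to 1), not the log-odds score matrix. The helper names are hypothetical.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    # Start from the identity matrix and multiply 'power' times.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy PAM1: each residue has a 1% chance of being replaced (made-up numbers).
pam1_toy = [[0.99, 0.01],
            [0.01, 0.99]]
pam250_toy = mat_pow(pam1_toy, 250)
for row in pam250_toy:
    print([round(p, 4) for p in row])
```

After 250 steps the rows still sum to 1, but the off-diagonal probabilities are much larger, because multiple substitutions can hit the same position.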

11
Q

PAM matrices

What are the guidelines for which PAM matrix to choose?

A

PAM250 for proteins of 20% identity
PAM120 for proteins of 40% identity
PAM60 for proteins of 60% identity

12
Q

BLOSUM matrices

what does BLOSUM stand for

What is it based on

A

BLOcks amino acid SUbstitution Matrices

based on local alignments of divergent sequences

13
Q

BLOSUM matrices

How do we get different BLOSUM matrices?

eg BLOSUM50?

A

different BLOSUM matrices are not extrapolated but based on observed alignments

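eg the BLOSUM50 matrix is derived from aligned blocks in which sequences that are 50% identical or more are clustered together, so the counted substitutions come from sequences that are at most about 50% identical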

14
Q

You want to compare two sequences that you believe may be distantly related.

How would you choose a BLOSUM matrix? a PAM matrix?

A

Choose a BLOSUM with a lower number

Choose a PAM with a higher number

(Maybe start with BLOSUM62 and then adjust)

15
Q

BLOSUM matrices

What are the guidelines for how to choose a BLOSUM matrix?

eg when would you choose BLOSUM50?

A

guideline: a BLOSUM matrix index should approximately match the percent identity of the sequences to be aligned

-> the BLOSUM50 matrix is best used for sequences that are about 50% identical

16
Q

What is a good all purpose substitution matrix for proteins?

A

BLOSUM62
- all purpose - whether sequences are conserved or divergent.
- best for testing - then change parameters according to results

17
Q

Scores in substitution matrices:

What do they mean?
How are they calculated?

A

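log-odds scores: which amino acids occur together in the alignment columns more often than expected by chance?

s(a, b) = log( p_ab / (q_a * q_b) )

p_ab: observed frequency of residues a and b aligned
q_a, q_b: background frequencies of residues a and b

positive score: the pair is aligned more often than expected by chance; negative score: less often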
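
A worked sketch of the log-odds score with made-up frequencies. The card only says "log", so base 2 is an assumption here (published matrices also scale and round the values). The function name is hypothetical.

```python
import math


def log_odds(p_ab: float, q_a: float, q_b: float) -> float:
    """Log-odds score: observed pair frequency vs. chance expectation."""
    expected_by_chance = q_a * q_b
    return math.log2(p_ab / expected_by_chance)


# Pair seen twice as often as expected by chance -> positive score.
print(log_odds(p_ab=0.02, q_a=0.1, q_b=0.1))
# Pair seen half as often as expected by chance -> negative score.
print(log_odds(p_ab=0.005, q_a=0.1, q_b=0.1))
```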

18
Q

Explain affine gap penalties

A
  • score depends on the length of the contiguous gap
  • gap opening penalty is larger : d
  • gap extension penalty is smaller: e
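
A small sketch of the affine penalty as a function of gap length. It assumes the convention that the first gap position costs d and each further position costs e (some tools instead charge d plus e for every position). The default values are arbitrary.

```python
def affine_gap_penalty(length: int, d: float = 10.0, e: float = 0.5) -> float:
    """Penalty for one contiguous gap of the given length."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    # Opening the gap costs d; every additional position costs e.
    return d + (length - 1) * e


print(affine_gap_penalty(1))        # a single-residue gap
print(affine_gap_penalty(4))        # one gap of length 4
print(2 * affine_gap_penalty(2))    # two separate gaps of length 2
```

With d larger than e, one long gap is cheaper than several short gaps covering the same number of positions.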
19
Q

In which different ways can an alignment be ‘optimal’?

Which kind of optimality are we aiming for? Which can we actually achieve?

A
  • functionally
  • structurally
  • evolutionarily
  • algorithmically

We aim for an evolutionarily optimal alignment, but algorithmic optimality is the only kind we can actually guarantee.

20
Q

What does it mean if an alignment is functionally optimal ?

A

aligned residues have the same function
eg functional domains

21
Q

What does it mean if an alignment is structurally optimal ?

A

aligned residues play a similar role / are in corresponding positions in the 3D structure
eg hydrophobic residues

22
Q

What does it mean if an alignment is evolutionarily optimal?

A

aligned residues are homologous, i.e. share a
common ancestry

23
Q

What does it mean if an alignment is algorithmically optimal ?

A

the highest-scoring alignment for a given substitution model and gap penalties

24
Q

What problem does dynamic programming solve for pairwise alignments?

A

GOAL: optimal (highest-scoring) pairwise alignment

PROBLEM:
- As length of sequences increases, number of possible alignments increases exponentially!
- constructing and scoring all possible alignments and picking the best one is not an option!
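
A sketch of how fast the number of possible alignments grows. It assumes every alignment column is either a residue pair, a gap in A, or a gap in B, and that different orderings of adjacent gaps count as different alignments. The function name is hypothetical.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of gapped alignments of sequences of length n and m."""
    # counts[i][j]: number of ways to align the first i and first j residues.
    # Aligning anything with an empty prefix can only be done with gaps: 1 way.
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = (counts[i - 1][j - 1]   # last column: residue pair
                            + counts[i - 1][j]     # last column: gap in B
                            + counts[i][j - 1])    # last column: gap in A
    return counts[n][m]


for length in (1, 2, 3, 5, 10, 20):
    print(length, count_alignments(length, length))
```

The counting itself already uses the dynamic programming idea: each cell is computed once from three smaller, stored subproblems.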

25
Q

What kind of problems is dynamic programming used for? What is the basic principle?

A

optimization problems
* problems are broken into smaller, nested subproblems
* solutions to subproblems are computed and stored; these are used to construct solutions to larger and larger portions of the original problem

26
Q

How is DP applied to alignment?

A

build up the best alignment by using optimal alignments of smaller subsequences

27
Q

What are the 3 steps for DP in optimal pairwise alignment?

A

1. initialization of the score matrix
2. scoring: matrix fill (calculate alignment scores)
3. traceback and deduction of the alignment

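The three steps can be sketched for a global alignment as follows. This is a minimal illustration, assuming a simple match/mismatch score instead of a substitution matrix and a linear gap penalty. The parameter values are arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # 1. Initialization: first row/column hold cumulative gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # 2. Matrix fill: best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # 3. Traceback from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))
```

Instead of storing pointers, this sketch re-derives the move during traceback by checking which neighbour produced the cell's score. When several moves tie, it reports only one of the optimal alignments.
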
28
Q

What was the original algorithm designed for sequence alignment? What kind of alignment did it do?

A

Needleman-Wunsch, for global alignment

29
Q

Which algorithm was designed for local alignments?

A

Smith-Waterman (based on Needleman-Wunsch)

30
Q

How does traceback work in local alignments?

A

local pairwise alignment
- cells with negative scores are set to zero
- traceback starts at the highest-scoring cell
- traceback stops when a 0 is encountered

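A minimal sketch of these local-alignment rules, assuming a simple match/mismatch score and a linear gap penalty. The parameter values are arbitrary.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Negative scores are replaced by zero.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback starts at the highest-scoring cell and stops at a zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```

Only the shared core of the two sequences is reported. The unrelated flanks are left out because extending into them would lower the score.
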
31
Q

What is the consequence of affine gap penalties when using DP?

A

consequence for the dynamic programming implementation: we have to keep track of 3 scores and pointers at each cell

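A sketch of the three scores per cell for a global alignment with affine gaps. It assumes a match/mismatch score, a gap of length g costing d + (g - 1) * e, and arbitrary parameter values. It computes the optimal score only and omits the pointers needed for traceback.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=1, mismatch=-1, d=3, e=1):
    """Optimal global alignment score with affine gap penalties."""
    n, m = len(a), len(b)
    # Three scores per cell, one for each way the alignment can end:
    # M: a[i] paired with b[j]; X: a[i] against a gap; Y: b[j] against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Y[0][j] = -d - (j - 1) * e

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Open a new gap (cost d) or extend an existing one (cost e).
            X[i][j] = max(M[i - 1][j] - d, X[i - 1][j] - e, Y[i - 1][j] - d)
            Y[i][j] = max(M[i][j - 1] - d, Y[i][j - 1] - e, X[i][j - 1] - d)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("ACGTACGT", "ACGT"))
```

A single score per cell is not enough, because the cost of adding a gap position depends on whether the previous column already ended in a gap.
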
32
Q

What is the effect of increasing the word size when generating dotplots?

A

reduces the noise, as short matches are removed. However, it also reduces the signal for the areas that appear homologous.

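A minimal sketch of the word-size effect. It assumes a dot is drawn wherever two words of the chosen length match exactly, which is a simplification (no scoring matrix or threshold). The function names and example sequences are made up.

```python
def dotplot(a, b, word_size=1):
    """Set of (i, j) where a[i:i+word_size] equals b[j:j+word_size]."""
    # Index every word of b by its start position.
    index = {}
    for j in range(len(b) - word_size + 1):
        index.setdefault(b[j:j + word_size], []).append(j)
    return {(i, j)
            for i in range(len(a) - word_size + 1)
            for j in index.get(a[i:i + word_size], [])}


def render(a, b, dots):
    """Text rendering: rows follow sequence a, columns follow sequence b."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(b)))
        for i in range(len(a)))


seq_a = "GATTACAGATTACA"
seq_b = "TTGATTACATT"
for w in (1, 3):
    dots = dotplot(seq_a, seq_b, word_size=w)
    print(f"word size {w}: {len(dots)} dots")
    print(render(seq_a, seq_b, dots))
```

With word size 1 every chance identity becomes a dot. With a larger word size only runs of consecutive matches survive, so short but genuine matches disappear together with the noise.
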
33
Q

What program from which package can be used for generating dotplots? What is this useful for?

A

polydot or dotmatcher (more sensitive, uses a scoring matrix) from EMBOSS (European Molecular Biology Open Software Suite)

Good way to get an overview of the similarities between sequences

34
Q

How can you get a dotplot? What parameters are there?

A

Combine (concatenate) the sequences in one FASTA file.

Use polydot. Parameters:
- word size
- type of output

Use dotmatcher. Parameters:
- window size
- threshold
- scoring matrix

35
EXAM QUESTION

* Sequence A and B have a length of 1000 aa. Seq A has an N-terminal region (front) with high similarity to a tandemly duplicated region in the middle of sequence B. Draw a dotplot presenting the similarities. (2019)
* First 250 amino acids are tandemly duplicated in the middle of B (2020)
* Dotplot of 2 sequences (2020)
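
One way to picture the expected dotplot is to simulate it. This sketch makes several assumptions that the question does not state: the similar region is exactly 250 aa and identical (not merely similar), the two tandem copies sit at positions 250-500 and 500-750 of B, and everything else is random sequence. All names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(length, rng):
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def word_dots(a, b, word_size):
    """Set of (i, j) where words of length word_size match exactly."""
    index = {}
    for j in range(len(b) - word_size + 1):
        index.setdefault(b[j:j + word_size], []).append(j)
    return {(i, j)
            for i in range(len(a) - word_size + 1)
            for j in index.get(a[i:i + word_size], [])}


rng = random.Random(0)
repeat = random_protein(250, rng)
seq_a = repeat + random_protein(750, rng)            # repeat at the N-terminus
seq_b = (random_protein(250, rng) + repeat + repeat  # tandem copies in the middle
         + random_protein(250, rng))

dots = word_dots(seq_a, seq_b, word_size=12)
# Each distinct offset (position in B minus position in A) is one diagonal.
offsets = sorted({j - i for i, j in dots})
print(offsets)
```

Under these assumptions there is no main diagonal. The plot shows two parallel diagonal segments, both covering positions 0-250 of A, one against positions 250-500 of B and one against positions 500-750 of B.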