7. pw, DP Flashcards

1
Q

What is the difference between similarity/identity and homology?

A

homology: all-or-nothing condition (homologous or not homologous)

similarity / identity: quantitative measure, can be eg 20%

2
Q

Can homology be observed?

A

No: homology cannot be observed or known directly, it can only be inferred (e.g. from sequence similarity)

3
Q

Comparative sequence analysis: starting with seq A and seq B, what kind of analysis can we do?

A

similarity / homology?

compute (optimal) alignment

4
Q

What does syntenic mean?

A

(of genes) occurring on the same chromosome.

5
Q

What is a dotplot?

What signals do they give?

A

In bioinformatics a dot plot is a graphical method for comparing two biological sequences and identifying regions of close similarity after sequence alignment.

It is a type of recurrence plot.

signal
- identity, similarity
- length of consecutive signals

6
Q

Define the pairwise sequence alignment (genes)

A

the comparison and arrangement of two sequences by
* searching for pairwise matches and "good mismatches" between their characters
* possibly inserting gaps in each sequence

7
Q

What do we need to consider when obtaining a scoring matrix?

A
  • observe trusted alignments of related proteins
    • which residues are paired? (i.e., which substitutions have occurred?)
  • different values for sequences of different evolutionary divergence!
    • different scoring matrices for further diverged sequences!
8
Q

Name two approaches for amino acid scoring matrices

What are their origins?

A

PAM (compiled by Margaret Dayhoff and her colleagues in the 1970s - very little data)
BLOSUM (Steven and Jorja Henikoff in 1992)

9
Q

PAM matrices

what are they based on?

what does PAM1 imply?

A

Point Accepted Mutation

  • based on observed amino acid substitutions in families of evolutionarily related proteins
  • PAM1 implies 1 substitution per 100 amino acids, accepted by the processes of natural selection
10
Q

PAM matrices

how do we get PAM250?

A

extrapolation of values for more distantly related proteins:
PAM250 = (PAM1)^250, i.e. the PAM1 matrix multiplied by itself 250 times
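
The extrapolation can be sketched with a toy example. This is only an illustration under stated assumptions: `toy_pam1` is a made-up 2-letter mutation probability matrix (a real PAM1 is 20x20 and estimated from data), and the helper names `mat_mul` and `mat_pow` are hypothetical. The matrix power is taken on mutation probabilities; turning the result into log-odds scores is a separate step not shown here.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1" for a 2-letter alphabet: each residue has a 1% chance
# of being replaced by the other one per PAM unit (assumed values).
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 steps the rows still sum to 1, but the probability of seeing the original residue has dropped towards the background level, which is why high-numbered PAM matrices suit distantly related sequences.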

11
Q

PAM matrices

What are the guidelines for which PAM matrix to choose?

A

PAM250 for proteins of 20% identity
PAM120 for proteins of 40% identity
PAM60 for proteins of 60% identity

12
Q

BLOSUM matrices

what does BLOSUM stand for

What is it based on

A

BLOcks amino acid SUbstitution Matrices

based on local alignments of divergent sequences

13
Q

BLOSUM matrices

How do we get different BLOSUM matrices?

eg BLOSUM50?

A

different BLOSUM matrices are not extrapolated but based on observed alignments

eg BLOSUM50 matrix is derived from alignments of sequences that are about 50% identical (sequences sharing 50% identity or more are clustered and counted as one)

14
Q

You want to compare two sequences that you believe may be distantly related.

How would you choose a BLOSUM matrix? a PAM matrix?

A

Choose a BLOSUM with a lower number

Choose a PAM with a higher number

(Maybe start with BLOSUM62 and then adjust)

15
Q

BLOSUM matrices

What are the guidelines for how to choose a BLOSUM matrix?

eg when would you choose BLOSUM50?

A

guideline: a BLOSUM matrix index should approximately match the percent identity of the sequences to be aligned

-> BLOSUM50 matrix is best used for sequences that are 50% identical

16
Q

What is a good all purpose substitution matrix for proteins?

A

BLOSUM62
- all purpose - whether sequences are conserved or divergent.
- best for testing - then change parameters according to results

17
Q

Scores in substitution matrices:

What do they mean?
How are they calculated?

A

which amino acids occur together in the alignment columns more often than expected by chance?

s(a, b) = log( p_ab / (q_a q_b) )

p_ab: observed frequency of residues a and b aligned
q_a, q_b: frequencies of residues a and b
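
A minimal sketch of the log-odds formula above. The function name `log_odds` and the example frequencies are hypothetical, and the log base 2 is an assumption (the card only says "log").

```python
import math


def log_odds(p_ab: float, q_a: float, q_b: float, base: float = 2.0) -> float:
    """Log-odds score: observed pair frequency vs. the frequency expected by chance."""
    expected = q_a * q_b  # chance of seeing a and b aligned if they were independent
    return math.log(p_ab / expected, base)


# Made-up frequencies, for illustration only.
print(log_odds(p_ab=0.02, q_a=0.1, q_b=0.1))   # pair seen more often than chance: positive
print(log_odds(p_ab=0.005, q_a=0.1, q_b=0.1))  # pair seen less often than chance: negative
```

A positive score means the pair occurs in alignment columns more often than expected by chance, a negative score means less often, and 0 means exactly as often.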

18
Q

Explain affine gap penalties

A
  • score depends on the length of the contiguous gap
  • gap opening penalty d is larger (charged once per gap)
  • gap extension penalty e is smaller (charged for each additional gap position)
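
The gap score can be sketched as a small function. This assumes one common convention, in which the first gap position costs d and every further position costs e; some tools instead charge d plus e for every position. The name `affine_gap_score` and the penalty values are hypothetical.

```python
def affine_gap_score(length: int, d: float, e: float) -> float:
    """Score of one contiguous gap: -(d + (length - 1) * e)."""
    if length <= 0:
        return 0.0  # no gap, no penalty
    return -(d + (length - 1) * e)


# With d=10 and e=1, one gap of length 4 is much cheaper than four separate gaps.
one_long_gap = affine_gap_score(4, d=10, e=1)
four_short_gaps = 4 * affine_gap_score(1, d=10, e=1)
print(one_long_gap, four_short_gaps)
```

This is the intended effect: a single long insertion or deletion is penalised less than many scattered short ones.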
19
Q

In which different ways can an alignment be ‘optimal’?

Which kind of optimality are we aiming for? Which can we actually achieve?

A
  • functionally
  • structurally
  • evolutionarily
  • algorithmically

Aim for evolutionary optimality, but algorithmic optimality is the only kind we can actually achieve.

20
Q

What does it mean if an alignment is functionally optimal?

A

aligned residues have the same function
eg functional domains

21
Q

What does it mean if an alignment is structurally optimal?

A

aligned residues play a similar role / are in corresponding positions in the 3D structure
eg hydrophobic residues

22
Q

What does it mean if an alignment is evolutionarily optimal?

A

aligned residues are homologous, i.e. share a common ancestry

23
Q

What does it mean if an alignment is algorithmically optimal?

A

the highest-scoring alignment for a given substitution model and gap penalties

24
Q

What problem does dynamic programming solve for pairwise alignments?

A

GOAL: optimal (highest-scoring) pairwise alignment

PROBLEM:
- As length of sequences increases, number of possible alignments increases exponentially!
- constructing and scoring all possible alignments and picking the best one is not an option!
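
To get a feeling for the growth, the number of possible global alignments can be counted with a short recursion. This is a sketch under one assumption: every distinct sequence of alignment columns counts as its own alignment, so "gap in A then gap in B" and "gap in B then gap in A" are counted separately. The function name `count_alignments` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Number of ways to align a sequence of length n with one of length m."""
    if n == 0 or m == 0:
        return 1  # only one option left: align the remainder against gaps
    return (count_alignments(n - 1, m - 1)   # last column pairs two residues
            + count_alignments(n - 1, m)     # last column: residue over gap
            + count_alignments(n, m - 1))    # last column: gap over residue


for length in (1, 2, 3, 5, 10, 20):
    print(length, count_alignments(length, length))
```

Already for two sequences of length 10 there are millions of alignments, so enumerating them for real proteins is hopeless.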

25
Q

What kind of problems is dynamic programming used for?

What is the basic principle?

A

optimization

  • problems are broken into smaller, nested subproblems
  • solutions to subproblems are computed and stored
  • these are used to construct solutions to larger and larger portions of the original problem
26
Q

How is DP applied to alignment?

A

build up the best alignment by using optimal alignments of smaller subsequences

27
Q

What are 3 steps for DP in optimal pairwise alignment?

A
  1. initialization of the score matrix
  2. scoring: matrix fill (calculate alignment scores)
  3. traceback and deduction of the alignment
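
The three steps can be sketched for a global alignment as follows. This is a minimal illustration, not a reference implementation: it assumes a simple match/mismatch score and a linear gap penalty (no substitution matrix, no affine gaps), and the parameter values are arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # 1. initialization: first row and column hold the cost of leading gaps
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # 2. matrix fill: each cell is the best of diagonal, up and left
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # 3. traceback: walk from the bottom-right cell back to the top-left cell
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

When several alignments share the best score, this traceback returns just one of them (it prefers the diagonal move).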
28
Q

What was the original algorithm designed for sequence alignment?
What kind of alignment did it do?

A

Needleman-Wunsch

for global alignment

29
Q

Which algorithm was designed for local alignments?

A

Smith-Waterman

(based on Needleman-Wunsch)

30
Q

How does traceback work in local alignments?

A

local pairwise alignment
- cells with negative scores are set to zero
- traceback starts at the highest scoring cell
- stops when 0 is encountered
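
A minimal sketch of these three rules, assuming a simple match/mismatch score and a linear gap penalty (arbitrary example values, not a recommended parameter set):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_a, aligned_b) for one best local hit."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # the extra 0 is what sets negative cells to zero
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # traceback starts at the highest-scoring cell and stops at the first 0
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTAAA", "CCCACGTGGG"))
```

Only the shared core of the two sequences is reported; the unrelated flanking regions are left out, which is the difference from a global alignment.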

31
Q

What is the consequence of affine gap penalties when using DP?

A

consequence for dynamic programming implementation:

have to keep track of 3 scores and pointers at each cell
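
A sketch of what "3 scores per cell" can look like for global alignment. Assumptions: match/mismatch scoring, gap opening penalty d for the first gap position and e for each further one, and only the scores are kept (a full implementation would also store one pointer per score for the traceback). The names M, X, Y and `affine_global_score` are illustrative.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=1, mismatch=-1, d=3, e=1):
    """Best global alignment score with affine gap penalties (score only, no traceback)."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -d - (i - 1) * e   # one leading gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -d - (j - 1) * e   # one leading gap of length j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] - d,   # open a new gap
                          X[i - 1][j] - e,   # extend the existing gap
                          Y[i - 1][j] - d)   # switch gap from one sequence to the other
            Y[i][j] = max(M[i][j - 1] - d,
                          Y[i][j - 1] - e,
                          X[i][j - 1] - d)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("AAACCC", "AAAGGGCCC"))
```

The separate X and Y scores are needed because the cost of adding a gap position depends on whether the previous column already ended in a gap.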

32
Q

What is the effect of increasing the word size when generating dotplots?

A

reduces the noise, as short matches are removed.

However it also reduces the signal for the areas that appear homologous.
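
The effect of the word size can be seen in a tiny sketch. This is not how any EMBOSS program is implemented; it assumes the simplest rule, a dot at (i, j) when the words of length `word` starting at position i in A and j in B are identical. The function names are hypothetical.

```python
def dotplot(a, b, word=1):
    """Set of (i, j) positions where the words a[i:i+word] and b[j:j+word] are identical."""
    return {(i, j)
            for i in range(len(a) - word + 1)
            for j in range(len(b) - word + 1)
            if a[i:i + word] == b[j:j + word]}


def render(a, b, dots):
    """Text rendering: sequence b across the top, sequence a down the side."""
    rows = ["  " + b]
    for i in range(len(a)):
        rows.append(a[i] + " " + "".join("*" if (i, j) in dots else "."
                                         for j in range(len(b))))
    return "\n".join(rows)


seq = "GATTACA"
print(render(seq, seq, dotplot(seq, seq, word=1)))  # diagonal plus off-diagonal noise
print()
print(render(seq, seq, dotplot(seq, seq, word=3)))  # only the diagonal remains
```

With word size 1 every chance identity produces a dot; with word size 3 only runs of three consecutive identities survive, so the noise disappears but short or imperfect similarities would be lost as well.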

33
Q

What program from which package can be used for generating dotplots?

What is this useful for ?

A

polydot or dotmatcher (more sensitive, uses scoring matrix) from

EMBOSS - European Molecular Biology Open Software Suite

Good way to get an overview of similarities of sequences

34
Q

How can you get a dotplot? What parameters are there?

A

Combine (concatenate) sequences in one fasta file.

Use polydot. Parameters:
- word size
- type of output

Use dotmatcher. Parameters:
- window size
- threshold
- scoring matrix

35
Q

EXAM QUESTION

Sequences A and B each have a length of 1000 aa. Seq A has an N-terminal region (front) with high similarity to a tandemly duplicated region in the middle of sequence B. Draw a dotplot presenting the similarities. (2019)

First 250 amino acids are tandem duplicated in middle of B (2020)

Dotplot 2 sequences (2020)

A