Sequence Alignment Flashcards

1
Q

What is the goal of sequence alignment?

A

The goal of sequence alignment is to identify similarities between sequences, whether it be a global similarity or local similarity. This can be used to infer the homology, function, and structural traits of a sequence. This can be done with DNA, RNA, and proteins.

2
Q

What is the difference between global and local alignment?

A

Global alignment compares the entire length of the sequences, while local alignment only compares regions of a sequence.

3
Q

When is a global alignment preferred, and when is a local alignment preferred?

A

A global alignment compares the whole sequence. Therefore global alignment is best used on sequences with similar lengths and an overall higher similarity.

Local alignments are better suited for sequences that are evolutionarily distantly related and may therefore have a very low overall similarity. The local alignment is better suited to find conserved parts of the sequences since the low overall similarity would mask any conserved part when trying to align globally.

4
Q

Which algorithms are used for global and local alignment?

A

For global alignment, the most commonly used algorithm is the Needleman-Wunsch algorithm. For local alignments, the most common is the Smith-Waterman algorithm. BLAST is also commonly used for local alignment, but it is a heuristic rather than an exact algorithm.

5
Q

What is the log-odds ratio?

A

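The log-odds ratio is the logarithm of the ratio between the frequency with which a given pair of residues is observed to be aligned in a set of aligned sequences and the frequency expected by chance from the background frequencies of the two residues. A positive score means the pair is aligned more often than expected by chance; a negative score means it is aligned less often.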
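
As a minimal sketch of this idea, the snippet below computes a log-odds score from an observed pair frequency and two background frequencies. The function name `log_odds` and all frequency values are hypothetical and chosen only for illustration; the base-2 logarithm is an assumption, and real substitution matrices also scale and round the result.

```python
import math

def log_odds(pair_freq, freq_a, freq_b):
    """Log2 of the observed pair frequency over the frequency expected by chance."""
    expected = freq_a * freq_b  # chance of pairing two independent residues
    return math.log2(pair_freq / expected)

# Hypothetical numbers: the pair is seen four times more often than chance predicts.
score = log_odds(pair_freq=0.04, freq_a=0.1, freq_b=0.1)
print(round(score, 2))
```

A pair observed less often than expected (for example `pair_freq=0.005` with the same backgrounds) gives a negative score.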

6
Q

What is a gap in a sequence alignment?

A

A gap represents indels (insertions and deletions) in a sequence and is often negatively scored when trying to find the optimal alignment.

7
Q

What is the difference between affine and linear gap penalty?

A

A linear gap penalty assigns the same negative score to every gapped position, so the total penalty grows in proportion to the gap length. This method does not differentiate between one multi-residue gap and several single-residue gaps of the same total length. It is preferred when single gaps are expected at a higher frequency than longer gaps.

The affine gap penalty introduces a gap opening and a gap extension score. The extension penalty is usually smaller than the opening penalty. A multi-residue gap is still penalized more than a single-residue gap, but extending an existing gap is treated more lightly than opening a new one. This is preferred when longer gaps are expected to be as frequent as single gaps.
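
The two schemes can be compared with a small sketch. The function names and the penalty values (-2 per residue, -10 to open, -1 to extend) are assumptions for illustration, and the affine form used here (opening covers the first gapped position) is only one of several common conventions.

```python
def linear_gap_penalty(length, per_residue=-2):
    """Every gapped position costs the same amount."""
    return per_residue * length

def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Opening a gap is expensive; each further position is cheaper."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)

# Compare a single-residue gap with a five-residue gap under both schemes.
for k in (1, 5):
    print(k, linear_gap_penalty(k), affine_gap_penalty(k))
```

Under the affine scheme, one gap of four residues is penalized far less than four separate single-residue gaps, while the linear scheme scores both cases identically.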

8
Q

What is the difference between the DNA/RNA gap penalty and the protein gap penalty?

A

For DNA sequences, the gap penalty is typically based on the number of nucleotides being inserted or deleted. For protein sequences, the gap penalty is typically based on the number of amino acids inserted or deleted and the sequence context. Some amino acids are more likely than others to appear in gaps.

9
Q

What is dynamic programming in general?

A

Dynamic programming solves optimization problems by breaking them down into smaller, overlapping subproblems and building up the solution from there. It is based on the idea of recursion: each subproblem is solved once, its solution is stored (typically in a table), and the stored solutions are combined to solve the larger problem.
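
To illustrate the table-filling idea on the alignment problem itself, the sketch below computes a global (Needleman-Wunsch style) alignment score. The function name and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions for illustration; it returns only the score, not the alignment.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Fill a DP table where each cell reuses three already-solved subproblems."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in sequence b
            left = score[i][j - 1] + gap    # gap in sequence a
            score[i][j] = max(diag, up, left)
    # The global score is read from the bottom-right cell.
    return score[-1][-1]

print(global_alignment_score("ACGT", "AGT"))
```

Each cell is computed once from its three neighbours, which is what keeps the work proportional to the size of the table.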

10
Q

What are heuristic models in general?

A

Heuristic methods are techniques that provide a good solution to a problem in a reasonable amount of time but may not guarantee the optimal solution. Heuristic methods are often based on empirical data or experience. They are designed to find a solution that is close to the optimal solution in a relatively short amount of time. Heuristic models are often faster than dynamic programming but require some additional assumptions and will sometimes miss the best match for some sequence pairs.

11
Q

What are scoring/substitution matrices?

A

A scoring matrix is a table that assigns a score to each possible pair of amino acid residues based on their similarity and likelihood of being substituted by each other.

The matrix is used to score the alignment and identify the optimal alignment based on the highest score. Different scoring matrices may be used depending on the type of aligned sequences and the desired sensitivity level.

12
Q

What is Big O notation?

A

Big O notation is a mathematical notation used to describe the performance of an algorithm in terms of how the running time or space complexity grows as the input size increases.

13
Q

What does it mean to say that an algorithm has a running time of O(n)?

A

An algorithm with a running time of O(n) has a complexity that grows linearly with the size of the input (n). This means that the running time of the algorithm will increase at a rate that is proportional to the size of the input. For example, if the input size doubles, the running time of the algorithm will also double.

14
Q

What is the Big O notation for a global alignment with the Needleman-Wunsch algorithm?

A

The time complexity of the Needleman-Wunsch algorithm for global alignment is O(mn), where m and n are the lengths of the two sequences being aligned. When the sequences are approximately the same length, this can also be written as O(n^2). The Smith-Waterman algorithm for local alignment has the same complexity.

15
Q

What is the difference between the dynamic programming matrices from global and local alignment?

A

The traceback for the best local alignment can start anywhere in the matrix, at the cell with the highest score, and ends when it reaches a cell with the value 0. The traceback for the best global alignment always starts in the bottom-right cell and ends in the top-left cell. In the local alignment, a cell takes the value 0 if all other options have a value less than 0. Taking option 0 corresponds to starting a new alignment.
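
The local-alignment rules can be sketched as follows. The function name and scoring values (match +2, mismatch -1, linear gap -2) are assumptions for illustration; the sketch returns the best score and the cell where it occurs, without performing a traceback.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman style fill: cells are floored at 0, best cell can be anywhere."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Option 0: abandon the running alignment and start a new one here.
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    return best, best_cell

print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```

In this example the shared "ACG" is found in the middle of two otherwise unrelated sequences, so the best cell is not in the bottom-right corner.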

16
Q

How does BLAST work?

A

BLAST is a heuristic tool to find a good local alignment but not necessarily the best.

  1. BLAST creates a list of seeds or words from the query sequence. These words are usually 3 amino acids long or 8-12 bases long, and they are taken from overlapping positions along the query sequence.
  2. BLAST searches a database for matches to the seeds. If a seed matches a portion of a database sequence, the algorithm extends the seed on both sides to create a longer alignment.
  3. The extension continues for as long as the alignment score keeps improving and stops once the score falls a set amount below the best score seen so far. Gapped versions of BLAST can also insert gaps (indels) into the alignment to improve the overall match.
  4. Once all of the seeds have been processed, the algorithm ranks the alignments by the quality of the match, typically measured by the alignment score calculated with a scoring matrix.
  5. The algorithm then outputs a list of the best alignments, ranked by the quality of the match.
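
The seed-and-extend idea in the steps above can be sketched as below. This is a simplified illustration, not BLAST itself: it assumes exact word matches, ungapped extension, and a simple match/mismatch score, and the names `find_seeds`, `extend_seed`, and `x_drop` are hypothetical.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where an identical word occurs."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits

def extend_seed(query, subject, i, j, word_size=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed in both directions without gaps.

    Extension in a direction stops when the running score falls
    x_drop or more below the best score seen so far.
    """
    best = current = word_size * match
    best_right = 0
    k = 0
    while i + word_size + k < len(query) and j + word_size + k < len(subject):
        same = query[i + word_size + k] == subject[j + word_size + k]
        current += match if same else mismatch
        if current > best:
            best, best_right = current, k + 1
        elif best - current >= x_drop:
            break
        k += 1
    current = best  # continue leftwards from the best right-hand extension
    best_left = 0
    k = 1
    while i - k >= 0 and j - k >= 0:
        current += match if query[i - k] == subject[j - k] else mismatch
        if current > best:
            best, best_left = current, k
        elif best - current >= x_drop:
            break
        k += 1
    # Return the score and the matched query interval [start, end).
    return best, (i - best_left, i + word_size + best_right)

query, subject = "GGACGTACGG", "TTACGTACTT"
seeds = find_seeds(query, subject)
score, (start, end) = extend_seed(query, subject, 2, 2)
print(seeds[:3], score, query[start:end])
```

The extension trims back to the best-scoring interval, so the mismatching flanks are not included in the reported segment.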
17
Q

How does FASTA work?

A
  1. FASTA creates a list of short words (k-tuples) from the query sequence. The words are usually 1-2 amino acids or 4-6 bases long and are taken from every position along the query sequence.
  2. FASTA then searches the sequence database for identical matches to these words and finds the diagonals that contain the highest density of word matches.
  3. The best-scoring regions are rescored with a substitution matrix, and nearby regions are joined into a longer alignment, allowing for gaps between them.
  4. The most promising candidates are refined with a banded Smith-Waterman alignment around the best region, and the database sequences are ranked by the resulting alignment score.
  5. The algorithm then outputs a list of the best alignments, ranked by the quality of the match.
18
Q

What is the difference between FASTA and BLAST?

A

Sensitivity: FASTA is generally somewhat more sensitive than BLAST because it uses smaller words by default, meaning that it is more likely to find matches between a query sequence and distantly related database sequences.

Speed: BLAST is generally faster than FASTA.

19
Q

What is the BLAST E-value?

A

The BLAST E-value is the number of expected hits of similar quality (score) that could be found just by chance.

An E-value of 10 means that about 10 hits with a similar score can be expected just by chance in a random database of the same size.

The smaller the E-value, the better the match.
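
As a rough sketch, the E-value can be related to a normalized bit score S' through E = m * n * 2^(-S'), where m is the query length and n is the database length. The function name is hypothetical, and the sketch assumes raw lengths; BLAST itself uses effective lengths that are corrected for edge effects.

```python
def e_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    search_space = query_length * database_length
    return search_space * 2 ** (-bit_score)

# The same bit score is less significant in a larger database.
print(e_value(30, 1024, 1024))
print(e_value(30, 1024, 2048))
```

Each additional bit of score halves the E-value, while doubling the database size doubles it.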

20
Q

What is a dot-plot used for?

A

To visually compare the similarity between two sequences. Each dot on the dot-plot represents a match between the residue in sequence x and y.

21
Q

What are a benefit and a drawback of dot plots?

A

A dot plot can identify intrasequence repeats, meaning subsequences that appear multiple times in the sequence.

A dot plot suffers from background noise, making it hard to distinguish dot patterns arising from background noise from significant dot patterns.

22
Q

How can the problem of background noise in a dot plot be solved?

A

The easiest way to overcome the background noise problem is to apply a filter that compares windows of residues instead of single residues and requires each window to reach some minimum identity score. This way, only diagonals of a certain length survive the filter and are shown in the final plot.
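
The effect of such a filter can be sketched as follows. The function name `dot_plot` and the parameters `window` and `min_matches` are hypothetical; a dot is stored as the pair of start positions of a window that passes the filter.

```python
def dot_plot(x, y, window=1, min_matches=1):
    """Return the set of (i, j) window starts with at least min_matches identities."""
    dots = set()
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            matches = sum(1 for k in range(window) if x[i + k] == y[j + k])
            if matches >= min_matches:
                dots.add((i, j))
    return dots

seq = "GATTACA"
unfiltered = dot_plot(seq, seq)                        # every single-residue match
filtered = dot_plot(seq, seq, window=3, min_matches=3)  # only runs of 3 identities
print(len(unfiltered), len(filtered))
```

With single residues, every chance match of the same letter produces a dot; with a window of three, only the main diagonal remains in this example.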

23
Q

What is the minimum percentage of identity that can reasonably be accepted as significant as a measure of homology?

A

90% of sequence pairs with identity at or greater than 30% over their whole length were pairs of structurally similar proteins. Therefore 30% is a general threshold for an initial presumption of homology.

Between 20% and 30% is called the twilight zone, where homology may exist but cannot reliably be assumed.

Below 20% is the midnight zone, where homology is very unlikely.
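
A small sketch of how these thresholds might be applied to a pairwise alignment is shown below. The function names are hypothetical, and counting identity over all alignment columns (gaps included) is an assumption; other definitions of percent identity use a different denominator.

```python
def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of all alignment columns ('-' marks a gap)."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = sum(1 for p, q in zip(aligned_a, aligned_b) if p == q and p != "-")
    return 100.0 * identical / len(aligned_a)

def identity_zone(percent):
    """Classify a percent identity using the 30% and 20% thresholds."""
    if percent >= 30:
        return "presumed homologous"
    if percent >= 20:
        return "twilight zone"
    return "midnight zone"

pid = percent_identity("AC-E", "ACDE")
print(pid, identity_zone(pid))
```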

24
Q

What is the genetic or evolutionary distance?

A

A measure of the difference between two homologous sequences from different species.

25
Q

What is the PAM substitution matrix?

A

PAM is used to score amino acid alignments and is based on the observed frequency of amino acid substitutions in proteins that have evolved over time.

26
Q

What is the BLOSUM substitution matrix?

A

BLOSUM is used to score amino acid alignments and is based on the observed frequency of amino acid substitutions in ungapped blocks of aligned protein sequences clustered at a predefined percentage identity.

The score for the BLOSUM matrix was generated by aligning highly conserved short regions and then clustering the aligned sequences into groups according to similarity so that groups were clustered together if they exceeded a specific threshold for percentage identity. Substitution frequencies for all possible pairs of amino acids were then calculated between the clustered groups.

27
Q

What is the difference between PAM120 and PAM250?

A

The number refers to the number of accepted point mutations (PAM) per 100 residues. A PAM250 substitution matrix gives the substitution frequencies expected for sequences that have accumulated an average of 250 accepted mutations per 100 residues. Because the same position can mutate more than once, this does not mean that every residue has changed.

PAM120 therefore corresponds to only 120 accepted mutations per 100 residues on average.

PAM250 is therefore better suited to distantly related sequences, whereas PAM120 is better suited to more closely related sequences.

28
Q

How can a sequence alignment be improved?

A

By incorporating expert knowledge such as known structural properties of one or more sequences.

If the structure of one of the proteins is known, the gap penalty can be increased for the regions of known secondary structure such as alpha-helices and beta-sheets, as these regions are less likely to suffer indels.

29
Q

What is a position-specific scoring matrix, and how does it differ from a standard substitution matrix?

A

A position-specific scoring matrix (PSSM, or weight matrix) is a variant of a standard substitution matrix in which the score for each residue depends on its position in the alignment, reflecting how conserved or variable that position is among proteins of the same family. A standard substitution matrix assigns the same score to a given residue pair regardless of position.
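
A minimal sketch of building such a matrix from an ungapped alignment is shown below. The function name, the toy four-letter alphabet, the uniform background frequencies, and the pseudocount of 1 are all assumptions made to keep the example short; real PSSMs use the full 20-letter amino acid alphabet and more careful weighting.

```python
import math
from collections import Counter

def build_pssm(aligned_seqs, alphabet, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column."""
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    total = len(aligned_seqs) + pseudocount * len(alphabet)
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        scores = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / total
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

# Toy family: position 0 is fully conserved, position 1 mostly conserved.
family = ["ACD", "ACE", "ACD", "AEC"]
pssm = build_pssm(family, alphabet="ACDE")
print(round(pssm[0]["C"], 2), round(pssm[1]["C"], 2))
```

The same residue (here C) receives a negative score at position 0 and a positive score at position 1, which is exactly what a position-independent substitution matrix cannot express.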

30
Q

What is a PSI-BLAST?

A

PSI-BLAST is a BLAST variant that searches a database to find alignments of protein sequences based on a position-specific scoring matrix (PSSM). PSI-BLAST first runs a cycle of a regular BLAST search, usually with the BLOSUM62 substitution matrix. This results in an initial set of related sequences, selected with a predetermined E-value threshold that is usually set very stringently.

This set is used to construct a PSSM, which replaces BLOSUM62 in the next search. PSI-BLAST then searches with the PSSM and updates it in every cycle in which new proteins are added to the set of related protein sequences.

31
Q

What is a protein sequence alignment logo?

A

A protein sequence alignment logo is a graphical representation of the conservation of amino acid residues in a multiple sequence alignment of proteins.

It is usually represented as a stack of letters, where each letter represents an amino acid residue, and the height of each letter is proportional to the relative frequency of that residue at that position in the alignment. The sequences are usually aligned so that the residues most similar to each other are in the same column.

It makes it possible to quickly identify conserved and variable positions in the alignment and can be used to infer functional and structural constraints on a protein’s sequence and potentially identify functional sites.
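
In the widely used information-content form of a logo, the total height of each stack is the information content of the column in bits, and each letter takes a share of that height equal to its relative frequency. The sketch below computes these heights; the function name is hypothetical, the alignment is assumed to be ungapped, and the small-sample correction applied by logo tools is left out.

```python
import math
from collections import Counter

def logo_heights(aligned_seqs, alphabet_size=20):
    """Return one dict per column mapping residue -> letter height in bits."""
    heights = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        freqs = {res: c / len(column) for res, c in counts.items()}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        information = math.log2(alphabet_size) - entropy  # stack height
        heights.append({res: f * information for res, f in freqs.items()})
    return heights

# Column 0 is fully conserved; column 1 has four equally frequent residues.
heights = logo_heights(["LK", "LR", "LE", "LD"])
print(round(heights[0]["L"], 2), round(sum(heights[1].values()), 2))
```

The conserved column produces a single tall letter, while the variable column produces a short stack of four equally sized letters.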