Sequence allignment Flashcards
What is the goal of sequence alignment?
The goal of sequence alignment is to identify similarities between sequences, whether it be a global similarity or local similarity. This can be used to infer the homology, function, and structural traits of a sequence. This can be done with DNA, RNA, and proteins.
What is the difference between global and local alignment?
Global alignment compares the entire length of the sequences, while local alignment only compares regions of a sequence.
When is a global alignment preferred, and when is a local alignment preferred?
A global alignment compares the whole sequence. Therefore global alignment is best used on sequences with similar lengths and an overall higher similarity.
Local alignments are better suited for sequences that are evolutionarily distantly related and may therefore have a very low overall similarity. The local alignment is better suited to find conserved parts of the sequences since the low overall similarity would mask any conserved part when trying to align globally.
Which algorithms are used for global and local alignment?
For global alignment, the most common used algorithm is the needleman-wunsch algorithm. For local aligments, the most common is the smith-waterman algorithm. BLAST is also a commonly used algorithm for local alignment.
What is the log-odds ratio?
The log-odds ratio is a statistical method that calculates the probability that a given pair of residues will align based on the frequencies of those residues in a set of aligned sequences.
What is a gap in a sequence alignment?
A gap represents indels (insertions and deletions) in a sequence and is often negatively scored when trying to find the optimal alignment.
What is the difference between affine and linear gap penalty?
A linear gap penalty is a single negative score given to gaps in an alignment. This method does not differentiate between multiple and single residue gaps. This is preferred when single gaps are expected at a higher frequency than longer gaps.
The affine gap penalty introduces a gap opening and gap extension score. The extension score is often lower than the opening. This method treats multiple residue gaps worse than a single residue gap. However, the extension of a gap is treated more lightly than opening a new gap. This is preferred when longer gaps are expected to as frequent as single gaps.
What is the difference between the DNA/RNA gap penalty and the protein gap penalty?
For DNA sequences, the gap penalty is typically based on the number of nucleotides being inserted or deleted. For protein sequences, the gap penalty is typically based on the number of amino acids inserted or deleted and the sequence context. Some amino acids are more likely than others to appear in gaps.
What is dynamic programming in general?
Dynamic programming solves optimization problems by breaking them down into smaller subproblems and building up the solution from there. It is based on the idea of recursion, in which a problem is solved by solving smaller subproblems and then combining the solutions to the subproblems to solve the larger problem.
What are heuristic models in general?
Heuristic methods are techniques that provide a good solution to a problem in a reasonable amount of time but may not guarantee the optimal solution. Heuristic methods are often based on empirical data or experience. They are designed to find a solution that is close to the optimal solution in a relatively short amount of time. Heuristic models are often faster than dynamic programming but require some additional assumptions and will sometimes miss the best match for some sequence pairs.
What are scoring/substitution matrices?
A scoring matrix is a table that assigns a score to each possible pair of amino acid residues based on their similarity and likelihood of being substituted by each other.
The matrix is used to score the alignment and identify the optimal alignment based on the highest score. Different scoring matrices may be used depending on the type of aligned sequences and the desired sensitivity level.
What is Big O notation?
Big O notation is a mathematical notation used to describe the performance of an algorithm in terms of how the running time or space complexity grows as the input size increases.
What does it mean to say that an algorithm has a running time of O(n)?
An algorithm with a running time of O(n) has a complexity that grows linearly with the size of the input (n). This means that the running time of the algorithm will increase at a rate that is proportional to the size of the input. For example, if the input size doubles, the running time of the algorithm will also double.
What is the big O-notation for a global alignment with the Smith-Waterman algorithm?
The Big O notation of the Smith-Waterman algorithm for global alignment is O(mn), where m and n are the lengths of the two sequences being aligned. Since the sequences are approximately the same length O(N^2) can also be used.
What is the difference between the dynamic programming matrices from global and local alignment?
The best local alignment can end anywhere in the matrix. The best global alignment will always start from the bottom. In the local alignment, the cells in the matrix can take the value of 0 if all other options have a value less than 0. Taking option 0 corresponds to starting a new alignment.