Part 2 - Lecture 1 - Global Pairwise Sequence Alignment Flashcards
In global pairwise sequence alignment, what does "sequence" mean?
A string over an assumed alphabet: DNA (A, C, G, T), RNA, or amino acids.
What are the four criteria that define an alignment?
- one sequence is positioned above the other
- spaces may be inserted into the sequences
- spaces may not appear on top of each other
- after inserting spaces, the sequences must have the same length
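The criteria above can be checked mechanically. A minimal sketch, assuming the two rows are given as strings and that "-" stands for an inserted space (the function name `is_alignment` is hypothetical):

```python
def is_alignment(top: str, bottom: str, gap: str = "-") -> bool:
    """Check the row-based criteria for a pairwise alignment."""
    # After inserting spaces, both rows must have the same length.
    if len(top) != len(bottom):
        return False
    # A space may not appear on top of another space.
    return all(not (a == gap and b == gap) for a, b in zip(top, bottom))


print(is_alignment("AC-GT", "A-CGT"))  # True
print(is_alignment("AC-GT", "AG-GT"))  # False: space over space in column 3
```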
What does "global" mean in global pairwise sequence alignment?
We align the sequences over their entire length.
What does "pairwise" mean in global pairwise sequence alignment?
We restrict our attention to two sequences at a time (other methods can align more sequences).
Out of these four examples which are alignments and which are not?
- all are alignments except the one in the top left corner, because it has spaces on top of each other
What should alignments reveal?
biological relationships
Why might we align sequences?
- Do known sequences align well with ours? This checks whether we have discovered a new gene.
- What about parts that do not align at all?
- We can gather biological and evolutionary insights from the parts that align well.
What is sequence similarity strong evidence of?
similar biological function
What are some sources of biological differences?
- substitution (point mutation)
- insertion or deletion of a short sequence (indel): we cannot tell whether something was inserted in one sequence or deleted from the other, so we call it an indel
What is a segmental duplication?
duplicated blocks of genomic DNA ranging in size from 1-200kb
What is an inversion?
when a section of DNA breaks off and reattaches to the chromosome in reversed order
What is a transposition?
a discrete section of DNA is moved from one location in the genome to another
What is a translocation?
One piece of a chromosome breaks off and attaches to another chromosome.
What are some sources of technical differences in alignment?
- sequencing machines make mistakes
- different technologies lead to different errors: Illumina has fewer indels and more SNVs and substitutions from PCR, while PacBio and Nanopore have more indels because they are long-read technologies
- PCR is a major factor
How are alignments scored, and what do high scores indicate?
- high scores indicate better alignments
- a score is assigned to each position separately
- identity (match) = +1
- substitution (mismatch) = -u
- indel = -delta
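The position-by-position scoring can be sketched as follows. This is an illustration under stated assumptions: the function name `score_alignment` and the "-" gap symbol are hypothetical, and the default penalties are the example values that appear elsewhere in these cards (0.15 for a substitution, 1 for an indel).

```python
def score_alignment(top, bottom, match=1.0, mu=0.15, delta=1.0, gap="-"):
    """Sum the per-position scores of an alignment given as two rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0.0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            total -= delta   # indel
        elif a == b:
            total += match   # identity (match)
        else:
            total -= mu      # substitution (mismatch)
    return total


# Three matches and one substitution, with a substitution penalty of 1.
print(score_alignment("ACGT", "AGGT", mu=1.0))  # 2.0
```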
What defines "best" when we are trying to find the best alignment?
Modeling, probability, and statistics define what "best" means.
What finds the best alignment?
algorithms
What is an alignment matrix?
If the two sequences have lengths n and m, the matrix has n + 1 rows and m + 1 columns: one row or column per letter, plus one extra row and one extra column for an added space at the beginning of each sequence.
What do alignments correspond to?
paths
What does an alignment path look like with scoring?
match is +1
indel is -delta, for example -1
substitution is -u, for example -0.15
How do you calculate the best score for an alignment?
- calculate the best score for prefixes of the two sequences
- update incrementally from there
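The prefix-by-prefix update can be sketched as a table fill. This is a minimal illustration assuming the standard dynamic-programming scheme for global alignment; the function name `best_global_score` and the choice of which sequence indexes the rows are assumptions, not taken from the lecture.

```python
def best_global_score(x, y, match=1.0, mu=0.15, delta=1.0):
    """Score of the best global alignment of x (rows) and y (columns)."""
    n, m = len(x), len(y)
    # (n + 1) x (m + 1) table; row 0 and column 0 stand for the empty prefix.
    T = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one indel per letter.
    for i in range(1, n + 1):
        T[i][0] = -delta * i
    for j in range(1, m + 1):
        T[0][j] = -delta * j
    # Each cell is updated from its three already-computed neighbors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = T[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else -mu)
            up = T[i - 1][j] - delta
            left = T[i][j - 1] - delta
            T[i][j] = max(diag, up, left)
    return T[n][m]


print(best_global_score("AGT", "ACGT", mu=1.0, delta=1.0))  # 2.0
```

The bottom-right cell holds the score for the full sequences, since the full sequences are the longest prefixes.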
If T is a function that gives the score of the best alignment, what is the first step toward formalization?
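One common way to begin, assuming the standard dynamic-programming formulation: let T(i, j) be the score of the best alignment of the first i letters of one sequence with the first j letters of the other. Aligning a prefix against the empty prefix then costs one indel per letter. Here \delta denotes the indel penalty; the indexing convention is an assumption.

\[
T(0,0) = 0, \qquad T(i,0) = -i\,\delta, \qquad T(0,j) = -j\,\delta
\]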
What is the formal notation for the general recurrence relation?
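A sketch of the usual recurrence for global alignment, under the match, substitution, and indel scores used in these cards. The symbols x, y, and s are assumptions introduced for illustration: x_i and y_j are the i-th and j-th letters of the two sequences, \mu is the substitution penalty, and \delta is the indel penalty.

\[
T(i,j) = \max
\begin{cases}
T(i-1,\,j-1) + s(x_i, y_j) \\
T(i-1,\,j) - \delta \\
T(i,\,j-1) - \delta
\end{cases}
\qquad
s(x_i, y_j) =
\begin{cases}
+1 & \text{if } x_i = y_j \\
-\mu & \text{otherwise}
\end{cases}
\]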
What is the average size of the human genome?
3 billion bases
When we calculate the score in each cell of the matrix how do we keep a record of which neighboring cell we used?
We can record it as an arrow pointing back (left), up, or diagonally back and up.
What does following the arrows allow us to do?
write out the alignment
What does moving upward mean?
inserting a space in the sequence written along the top
What does moving backward mean?
inserting a space in the sequence written on the left
What does moving diagonally mean?
no space is inserted: one letter sits on top of the other
How do you perform a traceback?
- start in the bottom right
- follow the arrows to the top left
- each arrow adds a position to the alignment
- moving past a row or column consumes that row or column
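The traceback steps above can be sketched end to end. This is an illustrative implementation under assumptions: the function name `global_align`, the "-" gap symbol, and the tie-breaking order (diagonal, then up, then left) are all hypothetical choices. Here x is written on the left (rows) and y along the top (columns).

```python
def global_align(x, y, match=1.0, mu=0.15, delta=1.0, gap="-"):
    """Return (score, aligned top sequence y, aligned left sequence x)."""
    n, m = len(x), len(y)
    T = [[0.0] * (m + 1) for _ in range(n + 1)]
    arrow = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0], arrow[i][0] = -delta * i, "up"
    for j in range(1, m + 1):
        T[0][j], arrow[0][j] = -delta * j, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else -mu
            choices = [
                (T[i - 1][j - 1] + s, "diag"),
                (T[i - 1][j] - delta, "up"),
                (T[i][j - 1] - delta, "left"),
            ]
            # Record both the best score and the neighbor it came from.
            T[i][j], arrow[i][j] = max(choices, key=lambda c: c[0])

    # Traceback: start in the bottom right, follow arrows to the top left.
    top, left = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = arrow[i][j]
        if step == "diag":    # one letter on top of the other
            top.append(y[j - 1]); left.append(x[i - 1]); i -= 1; j -= 1
        elif step == "up":    # space in the sequence along the top
            top.append(gap); left.append(x[i - 1]); i -= 1
        else:                 # space in the sequence on the left
            top.append(y[j - 1]); left.append(gap); j -= 1
    # Columns were collected back to front, so reverse them.
    return T[n][m], "".join(reversed(top)), "".join(reversed(left))


score, top_row, left_row = global_align("AGT", "ACGT", mu=1.0, delta=1.0)
print(score)     # 2.0
print(top_row)   # ACGT
print(left_row)  # A-GT
```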
What is time complexity?
- a function of sequence length: it describes how the total amount of work scales as the sequences get longer
What is the time complexity for global pairwise alignment?
- time complexity is O(n^2) for two sequences of length n, because we compute about n^2 entries in the matrix (O(nm) for lengths n and m)
- the amount of work for each individual cell is constant: we look at the three neighboring cells and decide the score from them
What is space complexity?
The space we need is proportional to the size of the alignment matrix, because all values must be stored.
What is the space complexity for global pairwise alignment?
The required space is quadratic, O(n^2): it scales like the square of the length of the sequences.
What is never less than the space complexity?
Time complexity: writing each stored value takes at least one step of work, so the time used can never be less than the space used.