Bio Background Flashcards
DNA
Deoxyribose nucleic acid, encodes genetic program of prokaryotes and eukaryotes.
Long polymer made from nucleotides or bases.
Four DNA bases
C ytosine
G uanine
A denine
T hymine
adhered to the sugar/phospate to form the nucleotide
Purines
Adenine + Guanine (AG) - pair of connected rings
Pyrimidines
Cytosine, Thymine and Uracil (RNA) - single ring
Base pairing
Double helix stabilized by:
- Hydrogen bonds
- Base stacking interactions among nucleotides
A-T (2 hydrogen bonds)
C-G (3 hydrogen bonds)
Structure
- Backbone is alternating sugars/phosphates
- Center are hydrogen bonds
- Space between strands are binding sites for transcription
- Strands are antiparallel
- 5’ start - phosphate group
- 3’ end hydroxyl group
Top strand: 5’ -> 3’ (watson)
Bottom strand: 3’ -> 5’ (crick)
Replication
Occurs from 3’ to 5’ direction
Proteins
Linear polymer of aminoacids linked by peptide bonds
20 different types of aminoacids
Aminoacids
Alanine, Cysteine, Aspartic Acid (Asp D), Glutamic Acid (Glu E), Phenylalanine, Glycine, Histidine, Isoleucine, Lysine, Leucine, Methionine, AsparagiNe, Proline, Glutamine, Arginine, Serine, Threonine, Valine, Tryptophan, Tyrosine
Protein structure
Primary - sequence of aminoacids
Secondary - Local spatial arragment due to backbone interactions, short stretches alpha helices, beta helices
Tertiary - long range 3D chain side-to-side interactions
Quaternary - Chains fold around one another
Ribbon diagrams
3D structures adopted by aminoacids
Coiled ribbon = alpha helix
arrow ribbon = beta strand
thin string = loops
Central Dogma Molecular Biology
DNA - (Transcription) > RNA - (Translation) > Protein
Aminoacid seq in RNA is determined by nucleotide seq in DNA
Gene
Region of DNA controls a hereditary characteristic.
Corresponds to single mRNA which will be translated to a protein
Eukaryotes have exons interrupted by introns (no code seq)
RNA
DNA but sugar is ribose
Instead of Thymine is Uracil
Single stranded
mRNA transcribed from DNA -> translated into protein
tRNA used in translation
rRNA helps ribosomes assemble proteins
Transcription
Initiation - RNA polymerase binds to promoter site on DNA and unzips double helix
Elongation - free nucleotides bind to template strand and thymine is changed by uracil
Termination - seq signal termination RNA transcript is released and DNA zips up again
TAA, TAG, TGA -> Stop seq
ATG -> begin seq
Pattern matching
- Naive
- Finite Automata
- KMP
- Boyer Moore
- Suffix Tree
- Suffix Array
- Generalized suffix tree and array
Pattern matching - naive
Brute force algorithm
sliding pattern over the text
n = len(T)
m = len(P)
Time complexity O((n-m+1) * m) worst case
Pattern matching - Finite automata
Sigma = alphabet
Time complexity O(n)
preprocessing: O (m|sigma|)
pattern matching -> O(n+m)
Finite-Automaton-Matcher(T,d,m)
begin
q := 0;
for i:= 1 to n do
begin
q := delta(q,T(i));
if q = m then
print “P occurs from position”+ (i-m+1)
end;
end;
worst time O(m^3 * |alpha|)
build_delta(P, alpha)
begin
for q := 0 to m do
for each a in alphabet do
begin
k := min(m+1, q+2);
repeat
k := k-1;
until P[1..k] ] (is a suffix of) P[1..q]a;
delta (q, a) := k;
end;
end
Suffix function logic:
get the longest prefix of P which is a suffix of current input like
P = at
suffix(atcat) = 2
Pattern matching - KMP
Prefix
Time complexity O(n)
prefix function O(m)
Avoid testing useless shifts, avoid precompute delta function
Run linear complexity
Good for short patterns with repetitions
Pattern matching - Boyer Moore
bcr function O(|alphabet| + m)
gsr function O(m)
total: O( (n-m+1) * m + |alphabet|)
Bad Character Rule - check the cases that allow a shift of n characters on the pattern, move to right as much as possible
Good suffix Rule - use knowledge of how many matched characters in the pattern suffix
Case 1 - a complete match exists as another prefix in P, then shift based on delta array
Case 2 - there is not a complete match, use prefix function almost shift by all remaining chars
Run sublinear complexity
Usually faster for large P and T with repetitions
Suffix tree
preprocessing O(n)
search O(m+k)
total: O(n+m+k)
Ukkonen
Implicit suffix tree
- Suffix links, connect substrings using links between internal nodes
- Edge-labels compression, use indices instead of substrings O_space(1)
Generalized Suffix Tree
Time complexity: O (|alphabet of two strings| * n)
Concatenate two strings with unique separators and apply ukknonen algorithm to build the suffix tree T
Used for common substrings
text comparison
palindromes
find largest suffix of S1 which is also prefix of S2, viceversa
Pattern matching - Suffix array
Time complexity O(n)
Traverse depth first lexical order
Binary search to find match occurrences
Complexity O (m log n)
random strings O ( m + log n)
m = |P|
mlr accelerator (minimum left right)
Assumption of sequence alignment
Life is monophyletic
Biological entities share common ancestry
Phylogenetic similarity
Homology: similarity due to evolution
Analogy: similarity due to analogous function
Homology
A is homologus to B if they relate from divergence from a common ancestor
Paralogues -> different but related functions in one species
Orthologues -> Same function in different species
Analogy
A is analogous to B if similar function but different origins
Convergent evolution -> different species in a similar env adapt to same function
Types of Mutations
Transition
A <-> G between two purines
C <->T between two pyrimidines
Transversion
(mixed one purine one pyrimidine)
C <-> G
A <-> T
Deletions
Insertions
Inversions
A-T adenine turns to thymine
G-C Guanine turns to cytosine
Maximal parsimony hyphotesis
Among all explanations, the simplest one is preferred. Explain the absence/presence of nucleotides with the min number of evolutional changes
Easier to delete n bases in one site than one base in n sites
Types of alignments
Global alignment
- comparative and evolutionary studies
Local alignment
- database searching and retrieval
- ignores distantly related biological regions and focuses on evolutionarily conserved signals of similarity
Sequence alignment
Pairwise alignment - two sequences
- exact solution
Multiple sequence alignment - 3 or more
- approximated (heuristic) solutions
Gaps
ends - missing data
internal - deletion or insertion
Types of alignment
Manual alignment
Dot Matrix
Dot Matrix
match -> diagonal step through dot
mismatch -> diagonal step through empty
gap on top seq -> vertical step
gap on left seq -> horizontal step
W/ multiple window size, stringency, alphabet size
stringency = if at least h chars are identical
adv:
- simple
- trial and error to explore
disadv:
- expensive for large seq
- may not find best
- qualitative analysis
Scoring matrices and Gap penalties
- scoring system
: gap penalty -> gap-opening penalty, gap-extension penalty
: scoring Matrix M(a,b) -> based on the additive property of the score, implies poistion independence
match (a=b)
mismatch(a!=b)
Gap penalty
Fixed gap-penalty system -> 0, 1, or a constant
Linear gap-penalty system -> gamma(g) = -g*d = gap length g by a constant d
Affine score ->
opening cost (d)
extension cost (e)
gamma(g) -d - (g-1) * e, with e < d
Logarithmic gap-penalty system -> gap-ext increases with the logarithm of the gap length
Scoring matrices
- Identity scoring -> match 1, mismatch 0
- DNA scoring -> match 3, transition 2, transversion 0
- Chemical Similarity Scoring, higher scores for amino acids based on chemical similarity: size, charge, hydrophobicity
- Observed matrices: analyze substitution frequency
Scoring matrices
DNA
M(a,b) > 0 if match <=0 if mismatch
Aminoacids
PAM -> Percent/Point Accepted Mutation
Possibility of pair caused to homology and not by chance
BLOSUM -> Substitution matrices for aminoacids
direct observation of blocks of proteins having similar functions
PAM
related proteins up to 85% very similar
PAM1 - 1 substitution in 100 amino acid residues 1%
Going through N percent mutations
PAM-N Matrix
PAM250 -> 250 evolutionary steps
Pos score common replacement
neg score unlikely replacement
Short -> short seq, strong local similarities
Long -> Long seq, weak similarities
PAM60 60% close relations
PAM120 general use 40% identity
PAM250 distant 20% identity
BLOSUM
Blocks database, based on local alignments or blocks / observed
Families of proteins with identical function
highly conserved protein domains
Identify motifs -> blocks of local alignments
BLOSUM 62 is the default matrix in BLAST 2.0
BLOSUMn based on seq that are at most n percent identical
higher n more closely related
BLOSM62 general use
BLOSUM80 close relations
BLOSUM45 distant relations
PAM vs BLOSUM
top = distantly related proteins
PAM100 ~= BLOSUM90
PAM120 ~= BLOSUM80
PAM160 ~= BLOSUM60
PAM200 ~= BLOSUM52
PAM250 ~= BLOSUM45
bottom = closely related sequences
PAM vs BLOSUM best ones
BLOSUM -> best for local alignments
BLOSUM62 -> majority of weak protein similarities
BLOSUM45 -> long and weak alignments
PAM250 -> seq 17-27% identity
BLOSUM62 -> moderately distant proteins
BLOSUM50 -> FASTA searches
Hamming Distance
permits only substitutions (positive cost)
if |A| = |B|, 0<= d(A,B) <= |A|
X =aaaccd, Y=abcccd
d(X,Y) = 2 two substitutions necessary
Levenshtein Distance (Edit distance)
permits insertions, deletions and substitutions (positive costs)
0 <= d(A,B) <= max(|A|, |B|)
X=aaaccd, Y=abccd
d(X,Y)=2 , one substitution and one deletion
Episode distance
permits only insertions (positive cost)
d(A,B) = |B| - |A| or inf
X=aaccd,Y=abbaccd
d(X,Y) =2 two insertions
Dynamic programming - Pair alignment - Edit distance
D[i,0] = i
D[0,j] = j
D[i,j] = min(
D[i-1, j-1] + f(i,j),
D[i-1, j] + 1,
D[i, j-1] + 1
)
f(i,j) = 0 if match else 1
Needleman-Wunsch - Pair alignment
Complexity - Space (nm)
Time for build O(nm)
backtrack O(n+m)
D[i,0] = i
D[0,j] = j
D[i,j] = min(
D[i-1,j-1] + f(i,j),
D[i-1, j] + 1,
D[i, j-1] + 1
)
f(i,j) = 0 if match else 1
follow path back from D[n,m] to D[0,0]
vertical step -> align a symbol in A with gap in B
horizontal step -> align a symbol in B with gap in A
diagonal step -> match or mismatch
Similarity Score - Semiglobal alignment
y = gap penalty
D[i,0] = iy
D[0,j] = jy
D[i,j] = max(
D[i-1,j-1] + ro(i,j),
D[i-1, j] + y,
D[i, j-1] + y
)
ro(i,j) = scoring matrix
B top
A left
D[0,j] = 0 gap beginning of B first column
D[n,j] = 0 gap tail of B last column
D[i,0] = 0 gap beginning of A first row
D[i,m] = 0 gap tail of A last row
Smith-Waterman - Local alignment
D[i,0] = 0
D[0,j] = 0
D[i,j] = max(
D[i-1,j-1] + ro(i,j),
D[i-1, j] + y,
D[i, j-1] + y,
0
)
ro(i,j) = scoring matrix . f.e. match 1 mismath -1 y gap -1
y = gap penalty
Needleman-Wunsch vs Smith-Waterman
SW finds segments in two seq that have similarities
First row and first column = 0
neg score = 0
begin with highest score end at 0, top left
NW aligns two complete sequences
first row and column = gap penalty
can be negative
begin with cell at (n,m) end at (0,0)