Chapter 5-Bioinformatics Flashcards
Explain the molecular clock hypothesis
- Mutations accumulate randomly over time and there is a relatively constant rate of mutation of N base pairs per year
- Most mutations are neutral, so natural selection neither favours nor disfavours them
- When an individual has progeny, these mutations are passed on to the next generation.
- Genetic difference between any two species is proportional to the time since these species last shared a common ancestor
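A rough sketch of the last point, assuming a strict clock: if each lineage accumulates substitutions at a constant rate r (per site per year), the expected distance between two species after t years is d = 2rt, so t = d / (2r). The function name and the example numbers are illustrative assumptions, not values from the chapter.

```python
def divergence_time(distance_per_site: float, rate_per_site_per_year: float) -> float:
    """Estimate years since the last common ancestor under a strict molecular clock.

    Both lineages accumulate changes independently, hence the factor of 2.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return distance_per_site / (2 * rate_per_site_per_year)


# Hypothetical example: 1% sequence divergence, 1e-9 substitutions/site/year
years = divergence_time(0.01, 1e-9)
print(f"{years:.3g} years")  # about 5 million years
```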
What is the mutation rate in humans?
Roughly 0.5-1 mutations per gigabase pair per year (on the order of 10^-8 per base pair per generation)
Define homologues
Homologues are sequences that share a common evolutionary history (originated from a common ancestor)
Define orthologues
Homologous sequences in different species that arose from a common ancestral gene during speciation and typically retain the same function
Define paralogues
Homologous sequences within a single species that arose by gene duplication
Define sequence alignment
Process of lining up 2 or more sequences to achieve maximal levels of identity for the purpose of assessing the degree of similarity and the possibility of homology
Functions of pairwise sequence alignment
- Identify identical and differing aa or nucleotides at equivalent positions
- Identify domains or motifs shared between proteins
- Evaluate if two proteins or genes have a similar sequence
Difference between pairwise and multiple sequence alignment
Pairwise: compares 2 sequences
Multiple: compares 3 or more sequences
What is local sequence alignment?
- the optimal similarity score is determined over subregions of the 2 sequences rather than over their entire length
- useful in identifying protein domains
What is global sequence alignment?
- the optimal similarity score is determined over the entire length of the 2 sequences
- useful in assessing whether genes or proteins are homologous
How to calculate alignment score
Alignment score = sum of match scores + sum of mismatch scores + sum of gap penalties
- gap penalties and (usually) mismatch scores have negative values
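A minimal sketch of this sum for two sequences that are already aligned. The scoring values (match +1, mismatch -1, gap -2), the linear gap penalty and the function name are assumptions chosen for illustration; real scoring normally takes match and mismatch values from a substitution matrix.

```python
def alignment_score(aligned_a: str, aligned_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score two already-aligned, equal-length strings; '-' marks a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            score += gap        # gap penalty (negative)
        elif x == y:
            score += match      # match score (positive)
        else:
            score += mismatch   # mismatch score (negative here)
    return score


# 5 matches, 1 mismatch, 1 gap: 5*1 + 1*(-1) + 1*(-2) = 2
print(alignment_score("GATTACA", "GA-TGCA"))
```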
What is Needleman-Wunsch alignment?
An exact dynamic programming algorithm for global sequence alignment
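A compact sketch of Needleman-Wunsch, assuming a simple scoring scheme (match +1, mismatch -1, linear gap penalty -2) instead of a substitution matrix. The function name and defaults are illustrative, and when several alignments share the optimal score only one of them is returned.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap              # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = j * gap              # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to the top-left corner
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GATTACA", "GATGCA")
print(score)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, which is why the exact method becomes slow for long sequences or database-scale searches.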
What is the Smith-Waterman alignment?
An exact dynamic programming algorithm for local sequence alignment
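A compact sketch of Smith-Waterman under assumed scores (match +2, mismatch -1, linear gap penalty -2; names and defaults are illustrative). The two changes from a global alignment are that cell scores are never allowed to drop below 0, and that the traceback starts at the highest-scoring cell and stops at the first 0.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best_score, local_a, local_b) for one optimal local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                       # start a fresh local alignment
                          H[i - 1][j - 1] + s,     # extend diagonally
                          H[i - 1][j] + gap,       # gap in b
                          H[i][j - 1] + gap)       # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Only the shared core "ACGT" is reported; the unrelated flanks are ignored
print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```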
What do the values assigned to PAM mean?
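- the number n in PAM-n refers to the number of accepted point mutations (aa substitutions) per 100 aa
PAM family extrapolation formula
PAM-n = (PAM-1)^n, i.e. the PAM-1 mutation probability matrix multiplied by itself n times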
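A toy sketch of the extrapolation, assuming the formula is read as a matrix power of the PAM-1 mutation probability matrix. A real PAM-1 matrix is 20 x 20; the 2 x 2 matrix below is made up purely to keep the example readable, and the helper names are hypothetical. Scoring matrices are then derived from such probability matrices as log-odds values, which is not shown here.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(M, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Made-up two-letter alphabet: 1% chance of substitution per PAM unit
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam250_toy = mat_pow(PAM1_TOY, 250)
# Each row still sums to 1, but far more probability has moved off the diagonal
print(pam250_toy[0])
```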
PAM vs BLOSUM
- PAM matrices are based on global alignments of closely related proteins while BLOSUM matrices are based on local (ungapped block) alignments of more divergent proteins
- All PAM matrices are extrapolated from PAM-1 while all BLOSUM matrices are based on observed alignments
Arrange the following in increasing order of divergence
BLOSUM-62, BLOSUM-45, BLOSUM-80
BLOSUM-80, BLOSUM-62, BLOSUM-45 (a lower BLOSUM number corresponds to more divergent sequences)
How should you choose the appropriate aa substitution matrix?
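- match the matrix to the expected degree of sequence divergence (e.g. a low-numbered BLOSUM or high-numbered PAM matrix for distantly related sequences)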
Which algorithms are exact?
Needleman-Wunsch and Smith-Waterman
- optimal but time-consuming to compute
Which alignment algorithms are heuristic methods?
Pairwise: BLAST
Multiple alignment: all widely used multiple alignment programmes e.g. MAFFT
- based on rules of thumb and simplifying assumptions
- not optimal, but fast
Function of BLAST
- a heuristic local alignment programme
- allows scientists to compare new sequences with databases containing many characterised genes
- results can provide valuable functional and evolutionary info
How to interpret the E-value in BLAST
- the expected number of alignments with a score at least this high that would occur by chance in a database of this size
- smaller E indicates a more significant alignment
- E < 0.02: the sequence is probably homologous
- 0.02 < E < 1: borderline; homology cannot be ruled out
- E > 1: this match is probably due to chance
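To see why a smaller E is more significant, it can help to look at how E is related to the alignment's bit score S' and the size of the search space: E = m * n * 2^(-S'), where m is the query length and n the total database length. The sketch below assumes this standard relation and ignores the effective-length corrections BLAST applies; the function name and the numbers are illustrative.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: 300-residue query against a database of 1e8 residues
for bits in (30, 40, 50):
    print(bits, expect_value(bits, 300, 100_000_000))
```

Each extra bit of score halves E, while a larger database raises E for the same score.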
Importance of multiple sequence alignment
- Links proteins at the aa level, making it possible to identify conserved features, predict functionally important residues and identify locations that affect the biochemical properties of the protein
- Basis for phylogenetic tree construction
- Allows generalisation of sequences to profiles
Uses of phylogeny
- Identifying orthologues and paralogues in gene families
- Discover population history and species history
- Estimate divergence times assuming molecular clock
Problem with multiple sequence alignment and how it can be overcome
- the high dimensionality of the solution space makes the optimal solution hard to calculate with dynamic programming
- solution: use heuristic algorithms (progressive or iterative)
Steps in CLUSTAL
- Calculate pairwise alignments for every sequence pair and calculate sequence distances
- Using distances, estimate a guide tree using neighbour joining
- Using the guide tree, progressively align the sequences in the order given by the tree, most similar first
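A simplified sketch of the distance calculation in the first step. It assumes each pair of sequences has already been pairwise aligned to equal length, and it uses the fraction of differing positions (p-distance) as the distance; the function names are hypothetical and the actual CLUSTAL distance measure may differ.

```python
def p_distance(aligned_a: str, aligned_b: str) -> float:
    """Fraction of compared positions that differ, ignoring gapped columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned: dict) -> dict:
    """Distances for every pair of named, pre-aligned sequences."""
    names = list(aligned)
    return {p: {q: p_distance(aligned[p], aligned[q]) for q in names} for p in names}


# Made-up sequences, assumed to be already aligned to each other
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAAC-T"}
for name, row in distance_matrix(toy).items():
    print(name, row)
```

A matrix of this shape is what the guide-tree step takes as input.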
Steps in neighbour-joining
- Find the 2 nodes that are closest to each other relative to their distances to all other nodes
- Replace these 2 nodes with a common ancestral node
- Compute the pairwise distances between each remaining node and the new ancestral node
- Repeat steps 1-3 until only 2 nodes remain and connect them with an edge
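The four steps above can be sketched as follows. This assumes the usual neighbour-joining criterion Q(i,j) = (n-2)*d(i,j) - sum_k d(i,k) - sum_k d(j,k) for "minimal relative distance"; the function name, the internal node labels (U1, U2, ...) and the 5-taxon distance matrix are illustrative assumptions, and ties are broken by taking the first minimal pair.

```python
def neighbour_joining(names, dist):
    """Return tree edges as (node, neighbour, branch_length) tuples."""
    d = {a: dict(dist[a]) for a in names}    # work on a copy
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Step 1: pick the pair with the smallest Q value
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                i, j = nodes[x], nodes[y]
                q = (n - 2) * d[i][j] - total[i] - total[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        # Step 2: replace the pair with a new ancestral node u
        counter += 1
        u = f"U{counter}"
        len_i = 0.5 * d[i][j] + (total[i] - total[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        edges.append((i, u, len_i))
        edges.append((j, u, len_j))
        # Step 3: distances from u to every remaining node
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    # Step 4: only 2 nodes remain, so connect them with an edge
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Hypothetical 5-taxon distance matrix used only for illustration
labels = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
example = {p: {q: rows[i][j] for j, q in enumerate(labels)}
           for i, p in enumerate(labels)}
tree_edges = neighbour_joining(labels, example)
for edge in tree_edges:
    print(edge)
```

In this example a and b are joined first, with branch lengths 2 and 3 to their common ancestor.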
Types of multiple sequence alignments
Progressive and iterative