Molecular phylogenies 1 Flashcards
The history of taxonomy and phylogeny
1735: Hierarchical tables based on morphological characteristics by Linnaeus
1868: Ladder of nature by Ernst Haeckel
1859: Phylogeny in Darwin's On the Origin of Species
Phylogeny of mathematics
1900s: The availability of DNA sequence data led to the modern era of phylogenetics.
Homology
Characteristics/loci are homologous if they are similar and have descended from a common ancestor.
This is why DNA must be aligned so that homologous sequences can be compared.
Molecular phylogenetics
Molecular phylogenetics compares DNA sequences to resolve the phylogeny of a species.
The evolutionary information in the sequences is scrambled, fragmented, hidden or lost, so mathematical and statistical methods are used to recover it.
Types of molecular phylogeny sequence comparisons
Orthologous: Sequences from different species to study speciation and extinction
Homologous: Sequences from the same species to look at population genetics
Paralogous: Sequences from the same genome to look at deletions and duplications
Benefits of molecular characters over morphological characters
Advantages
- Very common (every locus in the genome can be its own characteristic)
- Objective, easy to quantify
- Available when morphology is uninformative (e.g. micro-organisms which look similar)
- Cheap and fast
- Can be obtained without specialist training
Disadvantages
- Unavailable for extinct species
- Ancient DNA is the exception as DNA can be extracted from remains
Example of a discovery following the development of molecular phylogeny
Three domains of life, not two
Types of mutations
Transition or transversion mutations
- Transitions are more common as they occur between bases of similar shape (both one-ring or both two-ring structures)
- Transversions are less likely to conserve the biochemical properties of the original amino acid.
An HIV study found transversions had a much greater negative effect on relative fitness -> Lyons et al.
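The ring-structure rule above can be sketched in a few lines. This is an illustrative helper only; the function name and return labels are made up for this card, and it assumes plain DNA bases (A, C, G, T).

```python
PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"C", "T"}  # one-ring bases


def classify_substitution(original: str, mutated: str) -> str:
    """Label a single-base change as a transition or a transversion."""
    original, mutated = original.upper(), mutated.upper()
    bases = PURINES | PYRIMIDINES
    if original not in bases or mutated not in bases:
        raise ValueError("expected DNA bases A, C, G or T")
    if original == mutated:
        return "no change"
    # Same ring class (purine<->purine or pyrimidine<->pyrimidine) = transition.
    same_class = (original in PURINES) == (mutated in PURINES)
    return "transition" if same_class else "transversion"
```

For example, A -> G and C -> T are labelled transitions, while A -> C and G -> T are labelled transversions.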
silent/ synonymous mutations
non-synonymous mutations
insertions
deletions
Overview of the process for analysing sequences
Molecular sequence
alignment
genetic distances
Evolutionary tree of genetic distance
Evolutionary tree of time
Analyses:
- population-level processes
- Species-level processes
Sequence alignment
- Sequences must be aligned so that positionally homologous sites can be compared.
- During alignment, positional homologies are proposed for each site, inserting gaps where needed.
- In analyses you have to set penalties for opening and extending gaps, which determine the precision of the alignment (to avoid over-fitting, the gap penalty should be higher than the mismatch penalty)
- Penalties are also set for different sequence differences (e.g. more for a transversion than a transition)
- The best alignment is chosen (the alignment with the lowest total cost)
- Clustal is a common tool
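The "lowest total cost" idea can be illustrated with a minimal global alignment sketch using dynamic programming (the Needleman-Wunsch approach). This is not how Clustal works internally; the cost values (mismatch = 1, gap = 2) are assumptions, and a single linear gap cost is used rather than separate opening and extension penalties.

```python
def align(a: str, b: str, mismatch: int = 1, gap: int = 2):
    """Return (total cost, aligned a, aligned b) for the cheapest global alignment."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(substitution, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1  # gap in b
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1  # gap in a
    return cost[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Aligning "ACGT" with "AGT" under these costs places one gap opposite the C, giving "ACGT" over "A-GT" at a total cost of 2.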
Genetic distances
Once the sequences are aligned, the genetic difference (distance) between them must be measured.
Hamming distance: p = (number of different nucleotides) / (sequence length)
BUT the observed mismatches do not count every substitution, because a site can change more than once or converge on the same base (multiple hits problem). We can assume that:
- Low divergence: observed number close to actual
- High divergence: observed number smaller than actual
Nucleotide substitution models use this assumption to work out the actual number of substitutions -> distances between sequences
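The observed proportion of mismatches (p) from this card can be computed as in the sketch below. Skipping any column that contains a gap ("-") is an assumption made here; other conventions exist.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == "-" or y == "-":
            continue  # assumption: ignore columns containing a gap
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no gap-free sites to compare")
    return mismatches / compared
```

This is the uncorrected value, so for highly divergent sequences it will underestimate the real amount of change.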
Simplest model: Jukes-Cantor model
- Assumes each type of mutation occurs at the same constant rate
- Each nucleotide is equally likely to change into any of the other three
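Under these assumptions the corrected distance has a closed form, d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of mismatched sites. A minimal sketch (the function name is made up for illustration):

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor distance (expected substitutions per site) from p."""
    # The formula is undefined once p reaches 0.75, the value expected
    # for two random, unrelated sequences.
    if not 0 <= p < 0.75:
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)
```

At low divergence the corrected value stays close to p; as p grows, the correction becomes larger, which matches the multiple hits reasoning.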
Transition and transversion rates differ, and base frequencies can be unequal: HKY
Transition and transversion rates differ: Kimura 2-parameter model
- Two rate parameters -> alpha (transitions) and beta (transversions)
- Calculate P and Q -> fraction of sites with a transition and fraction of sites with a transversion
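For the Kimura 2-parameter model, P and Q feed into the distance d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The sketch below counts P and Q from two aligned sequences; ignoring columns with gaps or ambiguous characters is an assumption made here.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    transitions = transversions = compared = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: skip gaps and ambiguous characters
        compared += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    P = transitions / compared
    Q = transversions / compared
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("sequences too divergent for the K2P formula")
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)
```

With one transition and one transversion in ten sites (P = Q = 0.1), the estimate is roughly 0.234, a little above the raw proportion of 0.2.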
All rates are different: 12-parameter model
Can add a 13th parameter that takes into consideration the change in mutation rate in GC-rich regions -> but because they make more assumptions, these models perform worse than simpler models
Jukes-Cantor model
Simplest model, which uses algebra to calculate genetic distance from the number of mismatches
The model makes many assumptions.
- Evolution at each site occurs at the same rate -> incorporating a gamma model of rate variation increases accuracy greatly
- Nucleotide base frequencies are the same for all sequences
- Evolution at each site is independent
- The different types of mutations occur at the same rate.
Models can be made more sophisticated, and statistical models can be incorporated.
Different models give very different estimates of genetic distance.
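To illustrate how much the choice of model changes the estimate, the sketch below compares the raw proportion of mismatches with the Jukes-Cantor distance and with a gamma-corrected Jukes-Cantor distance, d = (3/4) a [(1 - (4/3) p)^(-1/a) - 1]. The gamma formula and the shape value a = 0.5 are assumptions brought in for illustration, not taken from this card.

```python
import math


def jc_distance(p: float) -> float:
    """Jukes-Cantor distance assuming every site evolves at the same rate."""
    return -0.75 * math.log(1 - 4 * p / 3)


def jc_gamma_distance(p: float, shape: float) -> float:
    """Jukes-Cantor distance with gamma-distributed rates across sites.

    A small shape value means strong rate variation between sites;
    a very large shape value approaches the equal-rates formula.
    """
    return 0.75 * shape * ((1 - 4 * p / 3) ** (-1 / shape) - 1)


p = 0.3  # hypothetical observed proportion of mismatched sites
estimates = {
    "p-distance": p,
    "Jukes-Cantor": jc_distance(p),
    "Jukes-Cantor + gamma (shape 0.5)": jc_gamma_distance(p, 0.5),
}
for model, distance in estimates.items():
    print(f"{model}: {distance:.3f}")
```

For the same 30% of mismatched sites, the three estimates increase in that order, with the gamma-corrected value well above the others.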
Common phylogenetic methods
Algorithmic methods: Cluster algorithms are used to transform genetic distances into a tree
- Neighbour-joining trees
- UPGMA
Optimality methods: a score is assigned to each candidate tree and the tree with the best score is selected.
- Maximum parsimony
- Maximum likelihood
UPGMA
A matrix of genetic distances is made and the two closest taxa are clustered to create one node
The distance matrix is recalculated and the next closest pair of taxa is clustered.
This process is repeated
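The clustering steps above can be sketched as follows. This is a minimal illustration that returns only the tree shape as nested tuples (no branch lengths); the distance values are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by repeatedly joining the closest pair.

    dist maps frozenset({a, b}) -> distance between taxa a and b.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of taxa in it
    d = dict(dist)
    while len(clusters) > 1:
        # Find the two closest clusters and merge them into one node.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Recalculate distances: size-weighted average of the merged clusters.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical distances for four taxa.
example = {
    frozenset(("A", "B")): 2, frozenset(("C", "D")): 4,
    frozenset(("A", "C")): 8, frozenset(("A", "D")): 8,
    frozenset(("B", "C")): 8, frozenset(("B", "D")): 8,
}
tree = upgma(["A", "B", "C", "D"], example)
```

Here A and B are joined first, then C and D, and finally the two pairs are joined at the root.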
Limitations
- Assumes constant rate of substitution -> molecular clock hypothesis
Neighbour-joining trees are able to accommodate differences in rates -> branch length is proportional to the amount of change.
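One way to see why neighbour joining copes with unequal rates is its pair-selection step. Instead of joining the pair with the smallest raw distance, it joins the pair with the lowest Q(i, j) = (n - 2) d(i, j) - sum of distances from i - sum of distances from j. The sketch below covers only this first step, using hypothetical distances from a tree ((A,B),(C,D)) in which B and D have long branches.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair of taxa that neighbour joining would join first."""
    n = len(names)
    # Total distance from each taxon to all the others.
    total = {
        i: sum(dist[frozenset((i, k))] for k in names if k != i) for i in names
    }

    def q(pair):
        i, j = pair
        return (n - 2) * dist[frozenset(pair)] - total[i] - total[j]

    return min(combinations(names, 2), key=q)


# Hypothetical distances: A and C are on short branches, B and D on long ones.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 4, frozenset(("A", "C")): 3, frozenset(("A", "D")): 5,
    frozenset(("B", "C")): 5, frozenset(("B", "D")): 7, frozenset(("C", "D")): 4,
}
closest_raw = min(combinations(names, 2), key=lambda pair: dist[frozenset(pair)])
nj_pair = nj_pick_pair(names, dist)
```

The smallest raw distance is between A and C, which a method assuming a constant rate would join first, whereas the Q criterion picks A and B, who really are neighbours in the underlying tree.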
Maximum parsimony
The tree which requires the fewest evolutionary changes to explain the observed sequences is the best tree.
This is determined by the parsimony score which is calculated for each character and summed.
Not suitable for fast-evolving or highly divergent populations with many evolutionary changes -> small differences are unlikely to be significant
Parsimony score -> minimum number of evolutionary changes required to explain the observed characters.
- Score calculated separately for each character and then summed
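A common way to get the minimum number of changes for one character on a fixed tree is Fitch's algorithm, sketched below as an illustration. Trees are written as nested pairs of taxon names, and the example sequences are hypothetical.

```python
def fitch_changes(tree, states):
    """Minimum number of changes for one character on a rooted binary tree."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):           # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                          # children agree: no change needed
            return common
        changes += 1                        # children disagree: one change
        return left | right

    visit(tree)
    return changes


def parsimony_score(tree, alignment):
    """Sum the per-character scores over every column of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    )


alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
score_1 = parsimony_score((("A", "B"), ("C", "D")), alignment)
score_2 = parsimony_score((("A", "C"), ("B", "D")), alignment)
```

The first tree needs 2 changes and the second needs 4, so the first tree is preferred under maximum parsimony.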
Maximum likelihood
The tree which is probabilistically most likely to have given rise to the observed sequences is the best tree.
Calculate P(seqs | T, B, Q), the probability of the sequences given:
- Tree topology (T)
- Branch lengths (B)
- Rate parameters of substitution model (Q)
Slower, biased for small samples and computationally intensive.
Requires substitution model which can introduce bias.
A probability is calculated for each tree and then the tree with the highest probability is chosen:
- Exhaustive search (Not possible when there are many tree options)
- Hill climbing
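Hill climbing can be illustrated on the smallest possible case: two sequences joined by a single branch, under the Jukes-Cantor model. This is a deliberate simplification (there is no topology to search, only one branch length), and the starting point, step size and tolerance are arbitrary choices.

```python
import math


def log_likelihood(d: float, n_sites: int, n_diff: int) -> float:
    """Log P(two sequences | branch length d) under Jukes-Cantor."""
    decay = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay  # a site shows the same base in both
    p_diff = 0.25 - 0.25 * decay  # a site shows one specific different base
    return (n_sites - n_diff) * math.log(0.25 * p_same) + n_diff * math.log(
        0.25 * p_diff
    )


def hill_climb(n_sites: int, n_diff: int, start=0.1, step=0.1, tol=1e-8):
    """Move the branch length uphill until no small step improves the score."""
    d = start
    best = log_likelihood(d, n_sites, n_diff)
    while step > tol:
        moved = False
        for candidate in (d + step, d - step):
            if candidate <= 0:
                continue  # branch lengths must stay positive
            score = log_likelihood(candidate, n_sites, n_diff)
            if score > best:
                d, best, moved = candidate, score, True
                break
        if not moved:
            step /= 2  # no neighbour is better: search more finely
    return d, best


# Hypothetical data: 100 aligned sites, 30 of which differ.
ml_distance, ml_score = hill_climb(100, 30)
```

For this simple case the climb should settle on about 0.383, the same value the Jukes-Cantor distance formula gives for p = 0.3. With real trees the same idea is applied to topology and many branch lengths at once, and it can get stuck on a local optimum.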