S2: W6 (Dr. Hanlie) Flashcards
Thing to note on genome size?
Size of genome doesn’t reflect the ¹complexity & ²size of organisms.
Whole genome DNA extraction attributes? (2)
• From various tissues like blood, body fluids.
• Extract whole genome of organism.
Why extract the whole genome/DNA?
It’s because you want to examine a piece of the DNA.
How to examine piece of DNA?
PCR.
What does PCR do?
It amplifies (clones) the desired piece/region of DNA.
PCR process steps? (3)
• Denaturation.
• Annealing.
• Elongation.
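A rough sketch of why PCR counts as amplification: if every template is copied in every denaturation/annealing/elongation cycle (an idealised assumption; real reactions fall short of this), the copy number doubles each cycle. The function name and the `efficiency` parameter are illustrative and not from the lecture.

```python
def pcr_copies(start_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected copy number after a given number of PCR cycles.

    efficiency = 1.0 means perfect doubling per cycle (idealised assumption).
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return start_copies * (1.0 + efficiency) ** cycles


# One starting template after 30 ideal cycles.
print(pcr_copies(1, 30))
```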
Primers attributes? (3)
• 30-50 bp.
• Forward primer (5’—3’).
• Reverse primer (3’—5’).
Purpose of primers?
To anneal to the template DNA just before your region of interest, marking where copying starts.
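A minimal sketch of how a primer pair brackets a region, assuming the usual convention that both primers are written 5'→3' and that the reverse primer is the reverse complement of the top strand at its binding site. The sequences and function names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_amplicon(template: str, forward: str, reverse: str):
    """Return the stretch of template bracketed by the two primers, or None."""
    start = template.find(forward)
    if start == -1:
        return None
    # On the top strand, the reverse primer's site reads as its reverse complement.
    rev_site = reverse_complement(reverse)
    end = template.find(rev_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


template = "TTGACCATGGCTAGCTAGGATCCAAGT"  # hypothetical template (top strand)
amplicon = find_amplicon(template, forward="GACCATG", reverse="TGGATCC")
print(amplicon)
```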
Gel electrophoresis attribute?
The bands of the ladder mark known fragment sizes (no. of bp), so you can estimate the size of your own bands.
Types of sequencing? (2)
• DNA sequencing (Sanger method).
• Next-Generation sequencing.
DNA sequencing (Sanger method) attributes? (2)
• For standard PCR.
• Electropherogram (chromatogram) results at the end.
Next-Generation sequencing?
= amplifying multiple gene regions simultaneously.
What does the choice of sequencing method depend on? (2)
• Research goals.
• Budget.
Why use molecular characters to study evolutionary patterns? (2)
• Homoplasy.
• Rare events.
Explain Homoplasy as a reason we use molecular data to study evolutionary patterns?
Not misled by convergence in morphological characters.
Explain Rare events as a reason we use molecular data to study evolutionary patterns? (2)
• Show duplications, insertions/deletions, rearrangements.
• Very informative as they are not seen in ecology & morphology.
Why is homoplasy problematic in molecular data? (2)
• Only 4 bases (A, T, G, C).
• Mutations are common (and you have to track them).
Gene/Region alignment?
= statement of homology.
Why do we do Gene/Region alignment?
It’s because we are looking to group similar things together.
Why could deletions be informative? (2)
• Give information on gene region of interest.
• Give information on species.
Gene vs Gene region vs Gene fragment?
● Gene
= theoretical region.
● Gene region
= can incorporate many genes & is relative.
● Gene fragment
= piece of a gene.
- When in doubt, just use the word “region”.
Kinds of DNA to consider using for phylogenetic study? (3)
• Mitochondrial.
• Chloroplast.
• Nuclear.
mtDNA attributes? (3)
• Maternally inherited (usually).
• Circular.
• Evolves faster than nuclear genes.
cpDNA attributes? (2)
• Maternally inherited (usually).
• Circular.
nDNA attributes? (3)
• Biparentally inherited.
• Linear chromosome.
• Evolve very slowly, thus have slow evolutionary patterns.
Why are mtDNA usually maternally inherited?
It’s because sperm lose their mitochondria during fertilization.
What does your choice of DNA depend on?
Your research question.
Eg of where choice of DNA is important?
If you want to get the family history of dogs, use nuclear DNA.
mtDNA in plants attributes? (4)
• Circular.
• Maternally inherited.
• Very unstable.
• Highly variable (many rearrangements).
What is the consequence of plant mtDNA being very unstable?
It’s not informative for phylogenetic studies.
Which DNA type is informative in plants?
cpDNA.
rRNA genes attributes? (3)
• Highly conserved coding region.
• Useful at family level & higher.
• Contains small subunit & large subunit.
How do you choose which region to use?
Depends on what research question you want to answer (eg in Forensics, Conservation).
Thing to note on “How do you choose which region to use”?
When examining the speed of mutations, look at the gene region, particularly the mtDNA & nDNA. The deeper the relationships, the slower the mutations.
When examining speed of mutations, what do we look at?
The gene regions, particularly the mtDNA & nDNA.
mtDNA attributes in terms of speed of mutation? (2)
• Faster mutation rates.
• Individual & population level.
nDNA attributes in terms of speed of mutations? (2)
• Slow mutation rates.
• Family level to deeper relationships (very old origin).
Applications of molecular phylogenetic studies? (6)
• Clarify monophyly & the classification/delimitation of taxa (eg. genera, species).
• Interpretation of morphological evolutionary patterns.
• Trace the evolutionary history of a species & explain current distribution (phylogeography).
• Provide a basis for interpretation of modes of speciation.
• Provide a basis for conservation prioritization.
• Enable tracing of sources of human diseases.
Eg of 1st Application of molecular phylogenetic studies?
Olive ridley sea turtle vs Kemp’s ridley sea turtle.
What could it mean when you have incongruence between nuclear & organelle (mt/cp) genomes?
Could be hybridization and/or introgression.
What do you mean when you say “Incongruence between nuclear & organelle genomes”?
We mean that the phylogenies of nuclear DNA & mt/cp DNA don’t match.
Why are more sequences not enough?
Chloroplast capture hypothesis.
Chloroplast capture hypothesis?
= where you have inter-species hybridization & subsequent backcrosses.
Result of Chloroplast capture hypothesis?
You have a new combination of nuclear & chloroplast genomes.
Thing to note on Incongruency?
Just know that there are many reasons for incongruency & know what to do when you do get incongruency.
What do different Phylogenetic Inferences (PI) depend on? (2)
• Data that you’re working with.
• Research question.
Types of PI methods? (4)
• Neighbour joining (NJ).
• Maximum Parsimony (MP).
• Maximum Likelihood (ML).
• Bayesian Inference (BI).
PI methods attributes? (5)
• Different techniques to reconstruct evolutionary relationships.
• Employ different algorithms.
• Common inference methods include: NJ, MP, ML & BI.
• Different criteria, assumptions & interpretations.
• Each with own pros & cons.
What do you mean by “Employ different algorithms”? (2)
We mean that they are either:
• based on underlying principles.
• based on computational requirements.
NJ attributes? (5)
• Distance method.
• Accounts for the rate of evolution.
• Based on substitution models (at nucleotide/amino acid sites) to estimate genetic distance from sequence data.
• Outdated.
• Branch lengths represent distance, not specific mutations.
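A small sketch of how a substitution model turns sequence differences into a genetic distance. It uses the Jukes-Cantor correction as the example model; the two sequences are invented, and the function names are illustrative.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites (ignoring gaps/ambiguities) that differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(a: str, b: str) -> float:
    """Jukes-Cantor distance: corrects p for multiple hits at the same site."""
    p = p_distance(a, b)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seq1 = "ACGTACGTAC"  # hypothetical aligned sequences
seq2 = "ACGTACGTAT"
print(p_distance(seq1, seq2), jc69_distance(seq1, seq2))
```

The corrected distance is slightly larger than the raw proportion of differences, because some sites may have changed more than once.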
“Neighbour” from NJ?
= involves closely or distantly related species.
Types of distance? (2)
• NJ.
• UPGMA.
UPGMA?
= distance method that assumes a constant rate of evolution (all tips end up equally distant from the root).
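A compact sketch of UPGMA on a toy distance matrix, to show what the constant-rate assumption does: each merge is placed at half the distance between the two clusters, so every tip sits at the same height. The data layout (a dict keyed by `frozenset` pairs) and the distances are assumptions made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa with UPGMA.

    names: list of tip labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (tree, heights): a nested-tuple tree and the height of each merge.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of tips it holds
    d = dict(dist)
    heights = []
    while len(sizes) > 1:
        # Join the two clusters that are currently closest together.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        heights.append(d[frozenset((a, b))] / 2.0)
        for c in sizes:
            if c not in (a, b):
                # Average distance, weighted by how many tips each side holds.
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)), heights


toy = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
}
tree, heights = upgma(["A", "B", "C"], toy)
print(tree, heights)
```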
Types of clustering? (2)
• K-means clustering.
• Hierarchical clustering.
K-means clustering?
• requires the no. of pre-defined clusters (you tell it).
Hierarchical clustering?
• gives you a dendrogram (it tells you the number of clusters).
Distance method vs Clustering method?
● Distance method
= takes into account the evolutionary processes.
● Clustering method
= doesn’t take into account the evolutionary processes.
NJ Pros? (5)
• Speed & efficiency.
• Robustness to model assumptions.
• Versatility.
• Interpretability.
• Ease of implementation.
Explain speed & efficiency as NJ pro?
It’s suitable for large datasets because of its heuristic nature, & it can handle high-dimensional distance matrices.
Explain Robustness to model assumptions as NJ pro?
It’s less sensitive to model misspecification or violations due to there being no complex evolutionary models used.
Explain Versatility as NJ pro?
You have a wide range of distance matrices.
Explain Interpretability as NJ pro?
Branch length is proportional to the estimated distances between taxa.
Explain Ease of implementation as NJ pro?
It’s simple to implement & understand, even with limited computational resources/expertise.
NJ Cons? (5)
• Sensitive to long branch attraction.
• Lack of statistical support.
• Inability to incorporate evolutionary models.
• Limited accuracy.
• Dependence on distance metrics.
Explain Sensitive to long branch attraction as NJ con?
It can give a biased phylogenetic inference as it doesn’t consider individual mutations.
Explain Lack of statistical support as NJ con?
There’s no confidence in the inferred tree topologies.
Explain Inability to incorporate evolutionary models as NJ con?
It doesn’t consider substitution rate heterogeneity among sites, which can potentially bias the estimates of branch lengths & evolutionary relationships.
Explain Limited accuracy as NJ con?
Seen in datasets with complex evolutionary patterns/high levels of sequence divergence.
Explain Dependence on distance metrics as NJ con?
The result depends on the chosen distance metric, & the appropriate metric varies with the biological system under study.
MP attributes? (6)
• Character-based approach.
• Based on optimality criterion of Parsimony (minimum tree length).
• Branch lengths represent the no. of mutations.
• Unique mutations are not informative.
• Prone to long branch attraction.
• Only synapomorphies are used for Parsimony information (relationships become important).
Steps involved in MP? (2)
• Searches for the tree topology with the lowest parsimony score (unrooted).
• Optimizes character states across taxa to construct the phylogenetic tree.
Thing to note on Steps involved in MP?
Wants the shortest route possible.
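A minimal sketch of how a parsimony score (tree length) can be counted, using Fitch's small-parsimony algorithm on a fixed tree. Trees are written as rooted nested tuples purely for convenience (an assumption of this example), and the four-taxon alignment is invented.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):  # a tip: its state is observed
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch_site(left, states)
    rset, rcost = fitch_site(right, states)
    shared = lset & rset
    if shared:  # children agree: no extra change needed
        return shared, lcost + rcost
    return lset | rset, lcost + rcost + 1  # children disagree: one change


def parsimony_length(tree, alignment):
    """Sum the minimum number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {tip: seq[i] for tip, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}  # toy data
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_length(tree_1, alignment), parsimony_length(tree_2, alignment))
```

MP would prefer `tree_1` here because it explains the same data with fewer changes.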
MP assumption?
We have evidence of every mutation.
Long branch attraction attributes? (2)
• Homoplasy on long branches looks like shared mutations.
• The more data you collect, the more strongly the wrong tree is supported.
MP pros? (5)
• Intuitive interpretation.
• Robustness to model assumptions.
• Applicability to diverse data types.
• Computationally efficient for small-medium sized datasets.
• Ease of interpretation.
Explain Intuitive interpretation as MP pro?
It’s straightforward & aims to reconstruct the tree topology with the fewest evolutionary changes.
Explain Robustness to model assumptions as MP pro?
It’s less sensitive to model misspecification as there are no complex evolutionary models.
Explain Applicability to diverse data types as MP pro?
It can analyze different types of data.
Explain Ease of interpretation as MP pro?
Accessible to researchers with limited computational resources/expertise.
MP Cons? (5)
• Sensitivity to Homoplasy.
• Potentially suboptimal solutions.
• Inability to incorporate evolutionary rates.
• Limited statistical support.
• Not suitable for evolutionary testing.
Explain Sensitivity to homoplasy as MP con?
It assumes that similar character states are due to shared ancestry.
Explain Potentially suboptimal solutions as MP con?
The search can converge on a suboptimal tree due to the heuristic nature of the search process, especially for large or complex datasets.
Explain Inability to incorporate evolutionary rates as MP con?
It doesn’t consider the rate of evolution/variation among sites, which can potentially bias the estimates of branch lengths & divergence times.
Explain Limited statistical support as MP con?
It’s challenging to assess the robustness of the inferred phylogeny.
Explain Not suitable for evolutionary testing as MP con?
It’s due to the reliance on parsimony score.
Differences between NJ & MP? (2)
• Branch lengths.
• Node support.
NJ vs MP in terms of node support?
● NJ
= bootstrap values indicating the proportion of times a particular clade is recovered in bootstrap replicates.
● MP
= assess support for nodes based on alternative analyses or additional statistical tests.
NJ node support?
Bootstrap values indicate the proportion of times a particular clade is recovered in bootstrap replicates.
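A sketch of the bootstrap idea: resample alignment columns with replacement, rebuild the tree each time, and report how often a clade comes back. The tree-building step is replaced here by a stand-in check supplied by the caller (`clade_found`), which is an assumption of this example and not a real tree search.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement to make one pseudo-replicate."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment, clade_found, replicates=100, seed=0):
    """Percentage of pseudo-replicates in which clade_found(...) is True."""
    rng = random.Random(seed)
    hits = sum(
        bool(clade_found(bootstrap_alignment(alignment, rng)))
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


def a_b_closest(aln):
    """Stand-in for a tree search: are A and B each other's closest sequence?"""
    def diff(x, y):
        return sum(p != q for p, q in zip(aln[x], aln[y]))
    return diff("A", "B") < min(diff("A", "C"), diff("B", "C"))


toy = {"A": "ACGTACGT", "B": "ACGTACGT", "C": "TGCATGCA"}  # invented data
print(bootstrap_support(toy, a_b_closest))
```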
MP Node support?
Assesses support for nodes based on alternative analyses or additional statistical tests.
NJ vs MP in terms of branch lengths?
● NJ
= represents distances, not specific mutations.
● MP
= represents the no. of mutations.
Which is better NJ or MP?
It depends on your research objectives.
Why does choosing NJ or MP depend on my research objectives? (4)
You have to consider:
• Underlying principles & model assumptions.
• Data characteristics.
• Computational requirements.
• Trade-offs between accuracy & efficiency.
NJ vs MP in terms of the underlying principles & model assumptions?
● NJ
= distance based.
● MP
= character based.
NJ vs MP in terms of computational requirements?
● NJ
= computationally efficient.
● MP
= computationally intensive.
Why is MP computationally intensive?
It’s because it involves complex heuristic search strategies.
Model evolution?
= estimates of the relative probability of substitutions.
What do the estimates of model evolution need information on? (4)
• Relative proportion of nucleotides.
• Relative frequency of transitions & transversions.
• Frequency of invariant sites.
• Differences in mutation rates between sites.
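The first three quantities can be read off an alignment directly. The sketch below computes simple observed values for them on invented sequences; these are raw counts and proportions, not the model-fitted estimates a phylogenetics program would produce, and the function names are illustrative.

```python
from collections import Counter

PURINES = {"A", "G"}


def base_frequencies(alignment):
    """Relative proportion of each nucleotide across the whole alignment."""
    counts = Counter(b for seq in alignment.values() for b in seq if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def ts_tv_counts(a, b):
    """Count transitions (A<->G, C<->T) and transversions between two sequences."""
    ts = tv = 0
    for x, y in zip(a, b):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        if (x in PURINES) == (y in PURINES):
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return ts, tv


def invariant_fraction(alignment):
    """Fraction of alignment columns in which every sequence has the same base."""
    columns = list(zip(*alignment.values()))
    return sum(len(set(col)) == 1 for col in columns) / len(columns)


aln = {"s1": "ACGTAC", "s2": "ACATAC", "s3": "ACGTTC"}  # toy alignment
print(base_frequencies(aln))
print(ts_tv_counts(aln["s1"], aln["s2"]), ts_tv_counts(aln["s1"], aln["s3"]))
print(invariant_fraction(aln))
```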
ML & BI attributes? (6)
• Incorporate model selection.
• Optimize models while constructing the trees.
• Operate based on likelihood & probability.
• Incorporate complex substitution models.
• Less sensitive to long branch attraction.
• Generally the preferred methods when computationally feasible.
Phylogenetic Inference = …?
A hypothesis. No matter how complex it is, it remains a hypothesis.
NJ use “criteria”? (2)
• Distance related.
• If you can account for long branch attraction.
ML node support attributes? (2)
• Value tells you how strongly the data support a grouping of taxa (eg sister taxa).
• Uses bootstrap method (value out of 100).
BI node support attributes? (4)
• Probabilities.
• Value of 1 is great.
• 0.9-0.95 is trustworthy.
• <0.9 is unresolved.
BI node support of < 0.9?
Unresolved. You cannot trust the relationship between species.
BI node support of 0.9-0.95?
You can trust the relationship between species.
ML & BI in terms of branch length?
Branch length represents the expected no. of substitutions per site.
Likelihood in ML?
= used to evaluate different trees.
Probability?
NJ uses which model?
Jukes-Cantor (JC) model (a simple substitution model).
Criteria for estimating the relative probabilities? (4)
Must have information on:
• Relative proportion of nucleotides (A:C:G:T).
• Relative frequency of transitions & transversions.
• Frequency of invariant sites.
• Differences in mutation rates between sites can vary (model dependent).
What make ML & BI special? (7)
• Probabilistic methods.
• Maximize the likelihood of observed sequence data under specified substitution model.
• Substitution models are more complex & flexible.
• Directly optimize model parameters during tree construction.
• Can handle wide range of evolutionary scenarios.
• Preferred for accurate model parameter estimation.
• Computationally intensive, especially for large datasets & time-consuming.
ML “method”/equations? (3)
● You’d like to know:
P (tree|data)
- i.e., what is the probability of the tree given the data.
● To do this, you need to consider ALL possible trees
= not feasible for >10 taxa.
● So, we calculate:
P (data|tree)
- i.e., probability of this data given a specific tree (which tree suits the data).
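To see why checking all possible trees is not feasible, the sketch below counts unrooted, fully bifurcating trees using the standard formula (2n - 5)!! for n taxa. The function name is illustrative.

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Number of distinct unrooted bifurcating trees: 1 x 3 x 5 x ... x (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, unrooted_tree_count(n))
```

Ten taxa already give just over two million trees, and the count grows much faster from there.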
ML equations to note? (2)
● P (tree|data)
● P (data|tree)
ML attributes? (5)
• Powerful method used extensively in statistics.
• Prefers the hypothesis (tree) under which the observed data have the highest probability.
• Very computationally intensive for phylogenies.
• Corrects multiple hits & removes the danger of long branch attraction.
• Accurately reconstructs relationships in diverged groups or groups evolving rapidly.
What must you be given for ML to produce the preferred tree? (2)
• Dataset (an alignment).
• A model of character evolution.
What is the preferred tree in ML?
= the tree that has the highest probability of having generated the observed data.
Probability of data given the tree definition?
= the best tree is the one that maximizes the likelihood of the data given the ¹tree topology, a ²set of branch lengths & an ³evolution model.
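A small worked sketch of P (data|tree) in the simplest possible case: two sequences joined by a single branch, under the Jukes-Cantor model. With only one topology, the only thing left to optimize is the branch length d, found here by a crude grid search (an assumption of this example; real software uses numerical optimizers). The sequences are invented.

```python
import math


def jc69_log_likelihood(a: str, b: str, d: float) -> float:
    """Log-likelihood of two aligned sequences separated by branch length d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # site pattern with identical bases
    p_diff = 0.25 * (0.25 - 0.25 * e)  # site pattern with one specific mismatch
    return sum(math.log(p_same if x == y else p_diff) for x, y in zip(a, b))


seq1 = "ACGTACGTAC"  # hypothetical aligned sequences
seq2 = "ACGTACGTAT"

# Try branch lengths from 0.001 to 0.999 and keep the most likely one.
grid = [i / 1000 for i in range(1, 1000)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(seq1, seq2, d))
print(best_d, jc69_log_likelihood(seq1, seq2, best_d))
```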
BI attributes? (4)
• Probability of the tree given the data.
• Estimates trees & obtains measures of uncertainty for each branch.
• Optimal hypothesis maximizes the posterior probability (by measuring all its uncertainties).
• Posterior probability for a hypothesis is proportional to likelihood multiplied by the prior probability of that hypothesis.
Posterior probability?
= the end goal.
Prior probability?
= a scientist’s beliefs before having seen the data.
Eg of prior probability?
Roll of dice.
How does ML optimize parameters?
Optimizes parameters using a numerical optimization algorithm.
How does BI optimize parameters?
Optimizes parameters using MCMC sampling.
Bayesian approach attributes? (9)
• Allows complex models of sequence evolution to be implemented.
• Doesn’t need bootstrapping to assess confidence in the nodes.
• Reports on posterior probabilities for branches.
• Feed it lots of prior information to get posterior probability.
• Specifies a model & prior distribution.
• Integrates the product of these quantities over all possible parameter values to determine posterior probability for each tree.
• Relies on MCMC to approximate probability distribution.
• Chain is constructed that moves through different trees & evolution models.
• Estimates probability that any particular tree is the true evolutionary tree for the observed data.
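A toy sketch of the MCMC idea: a chain that hops between candidate trees and ends up visiting each one in proportion to its posterior probability. It is heavily simplified (three labelled trees with made-up prior × likelihood weights, no branch lengths or model parameters), so it only illustrates the Metropolis acceptance rule, not a real phylogenetic sampler.

```python
import random
from collections import Counter


def metropolis_tree_sampler(weights, steps=20000, seed=1):
    """Approximate posterior probabilities from unnormalised weights.

    weights: dict mapping a tree label to prior x likelihood for that tree.
    """
    rng = random.Random(seed)
    trees = list(weights)
    current = trees[0]
    visits = Counter()
    for _ in range(steps):
        proposal = rng.choice(trees)  # symmetric proposal
        # Accept better trees always, worse trees with probability = ratio.
        if rng.random() < min(1.0, weights[proposal] / weights[current]):
            current = proposal
        visits[current] += 1
    return {t: visits[t] / steps for t in trees}


toy_weights = {"T1": 0.6, "T2": 0.3, "T3": 0.1}  # invented values
estimates = metropolis_tree_sampler(toy_weights)
print(estimates)
```

The visit frequencies approximate the posterior without ever computing P [data] directly, which is the reason for using MCMC.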
MCMC stands for?
Markov Chain Monte Carlo.
BI equations? (3)
● Eqn 1
P [tree|data] = P [tree & data] / P [data]
● Eqn 2
P [tree & data] = P [tree] × P [data|tree]
● Eqn 3
• obtained by substituting Eqn 2 into Eqn 1.
P [tree|data] = ( P [tree] × P [data|tree] ) / P [data]
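A worked example of Eqn 3 with invented numbers: three candidate trees with equal priors and made-up likelihoods. P [data] is obtained by summing prior × likelihood over all candidate trees, which only works here because the toy example has just three of them.

```python
def posterior(priors, likelihoods):
    """Apply Bayes theorem to a small, fully enumerated set of trees."""
    joint = {t: priors[t] * likelihoods[t] for t in priors}  # P[tree] x P[data|tree]
    p_data = sum(joint.values())                             # P[data]
    return {t: joint[t] / p_data for t in joint}             # P[tree|data]


priors = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}     # hypothetical flat prior
likelihoods = {"T1": 0.02, "T2": 0.01, "T3": 0.01}   # hypothetical P[data|tree]
post = posterior(priors, likelihoods)
print(post)
```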
P [tree|data] ?
= probability that the tree is true given the data (the posterior probability).
P [tree & data] ?
= joint probability of the particular tree & data (alignment).
Joint probability equation?
P [tree & data] = P [tree] × P [data|tree]
Joint probability?
= the product of the probability of the tree & conditional probability of the data given that tree.
Bayes theorem equation simply?
Posterior probability = (prior probability × likelihood) / P [data].
Joint probability equation simply?
Joint probability = prior probability × likelihood = the numerator of Bayes theorem.
P [tree] ?
= the prior probability.
P [data|tree] , i.e., conditional probability?
= the likelihood.