Module 3 Flashcards
What is DNA barcoding?
A way to use short, standardised gene regions to identify samples to species, acting as a common yardstick. In animals it is based on CO1 (a mitochondrial gene), with a success rate of 96-100 % in correctly assigning samples.
The sample sequence is matched against a reference database. If it is not in there, that can potentially flag a new species (sort of).
The divergence rate of the CO1 gene is at a sweet spot for distinguishing most species from each other: the gene is easy to amplify and relatively fast evolving.
IT IS IMPORTANT TO NOTE that because there is an overlap between intra and interspecific genetic distances DNA barcodes cannot be used as the sole criterion for describing new species.
What can we do with DNA barcoding?
Identify animals at all life stages, fragments or products, stomach contents (food chain analysis), cryptic species.
Applications in food control, customs, invasive species control, disease vector identification, policing, agriculture, forestry, conservation and education.
IT IS IMPORTANT TO NOTE that because there is an overlap between intra and interspecific genetic distances DNA barcodes cannot be used as the sole criterion for describing new species.
What are some critiques of DNA barcoding?
Can’t detect
- ancestral polymorphisms
- male-biased gene flow (mitochondrial is maternal)
- Gene flow after hybridisation
- Transfer of mtDNA loci to the nuclear genome
- Recent speciation
- Slowed rates of molecular evolution (like in corals)
What is the barcoding gap?
An underlying assumption that barcoding is based on, saying that there is a detectable gap between the genetic distance between species and the genetic distance within species.
Visualised as two normal distributions along an x-axis of “genetic distance” that do NOT overlap.
IN REALITY there is an overlap between intraspecific and interspecific genetic distance, making it harder to distinguish the two.
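A minimal Python sketch of how the gap could be checked, assuming the pairwise distances have already been labelled as within-species or between-species. The function name and all numbers are invented for illustration; they are not real CO1 data.

```python
def barcoding_gap(intra, inter):
    """Return min(interspecific) - max(intraspecific).

    A positive value means the two sets of distances do not overlap (a gap
    exists); zero or a negative value means they overlap.
    """
    return min(inter) - max(intra)


# Toy genetic distances (proportion of differing sites), illustrative only.
intra_clean = [0.001, 0.004, 0.010]
inter_clean = [0.030, 0.080, 0.120]
intra_messy = [0.001, 0.010, 0.035]  # one unusually divergent population

gap_clean = barcoding_gap(intra_clean, inter_clean)  # positive: a gap
gap_messy = barcoding_gap(intra_messy, inter_clean)  # negative: overlap
```

In the second case the largest within-species distance exceeds the smallest between-species distance, which is the overlap described above.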
Why is it necessary to resolve the taxonomic status of animals in wildlife management?
We have to know the tools at our disposal and the status of the things we are trying to save.
- Undiagnosed species can be left to go extinct or denied legal protection
- Incorrectly identified species can hybridize leading to outbreeding depression
- We can waste resources in populations wrongly identified as endangered or on hybrids that are not actual distinct species
- Populations that could be used to improve fitness of inbred populations can be overlooked
Causes of taxonomic uncertainty
- Morphological data is inadequate but has been widely used in the past
- Use of different species definitions means that there is no common “yardstick”
- Divergent populations part way through speciation
- Secondary contact and hybridisation
The tuatara example of taxonomic uncertainty
The tuatara is the only survivor of an ancient reptilian order and was believed to be a single species.
Studies using morphological, immunological, allozyme and mtDNA data showed that there are 3 distinct groups (ESUs), 1 of which was at risk of extinction with no legal protection or management.
Now that the taxonomy has been resolved, the three groups are managed independently and effectively.
The puma example from North America, taxonomic uncertainty
The puma was thought to comprise 8 subspecies based on morphological data, but these were later found to form only 1 ESU since none were genetically distinct.
This was shown using both microsatellites and mtDNA.
Resources were being wasted on trying to manage 8 subspecies when really there was only one.
Example: grey wolf, red wolf and coyote, taxonomic uncertainty
The red wolf was found to be a hybrid between the grey wolf and the coyote and thus needed to be recognised as such and managed accordingly.
What is the biological species concept? (BSC)
Mayr 1963
A species is a group of interbreeding populations with unique genetic identities that are reproductively isolated from other populations.
Reproductive isolation allows species to evolve independently of other species
The foundation of many other species definitions.
Limitations:
Can’t
- Identify whether geographically isolated populations belong to the same species
- Classify species in extinct populations
- Account for asexually reproducing organisms
- Clearly define species when barriers to reproduction are incomplete (hybridisation)
THE BEST way to define species in terms of conservation is likely within an evolutionary context, i.e. a species is a group that exchanges genes and shares an evolutionary trajectory
What is taxonomic inflation?
When the number of recognised species increases artificially, which in turn increases the cost of conservation efforts
What is the best way of defining a species for conservation?
THE BEST way to define species in terms of conservation is likely within an evolutionary context, i.e. a species is a group that exchanges genes and shares an evolutionary trajectory
What is an ESU?
An evolutionarily significant unit (ESU) is a population or group that has high genetic and ecological distinctiveness and should therefore be managed separately.
The ESU is informed by
- Life History (e.g. timing of reproduction)
- Behaviour (e.g. courtship displays)
- Morphology (e.g. colour pattern, horn shape, bill shape, flower shape etc.)
- Environment (e.g. habitat type, climate)
- Geography (e.g. vast distances, physical barriers)
- Molecular genetics (e.g. allele frequencies, genetic distance, phylogeography, discordance between divergence at candidate “adaptive” genes and neutral markers)
What are the three widely used frameworks for resolving taxonomic status?
- Reproductive isolation and adaptation (Waples 1991)
- Reciprocal monophyly (Moritz 1994)
- Exchangeability of populations -> genetic/ecological and recent/historical (Crandall et al. 2000)
What are other names for “clades”?
Natural group
Monophyletic group
All the tips and all the ancestors of the tips.
What is a paraphyletic group?
A group that contains a common ancestor but not all of its descendants. Not a natural group, e.g. fish. For fish to be considered monophyletic, every terrestrial vertebrate would also have to be counted as a fish.
What does the length of branches mean in a phylogram?
Branch length indicates the amount of change in the DNA along that branch. The longer the branch, the higher the level of change.
A phylogram shows more than just relationships (which is called a phylogeny)
What is a chronogram?
Also known as a timetree; the tree is ultrametric (all tips line up at the present).
A tree scaled to absolute time in Mya, calibrated with e.g. fossil data, so we can infer when divergence events happened.
What is the root in a rooted tree?
Most recent common ancestor of the entire tree, indicates the ancestral lineage, the starting point of the tree.
What are the steps to building a molecular phylogeny?
- Gather/generate DNA data
- Sequence alignment
- Substitution model
- Phylogenetic analysis
- TADA Phylogeny
What does it mean if characters are homologous?
They are similar by descent
Shared derived homologies are known as synapomorphies
E.g. comparison of structures in the limbs of humans, dogs, horses, bats, birds and seals
Can be used to reconstruct a systematic hierarchy of shared characters. These can be predictive of genomic differences.
In the DNA sequence alignment we distinguish homologies from analogies. Bad alignments lead to bad trees and interpretations.
What does it mean if characteristics are analogous?
They are similar but have evolved independently
What is essentially the point of DNA sequence alignment?
To distinguish between analogies and homologies
i.e. characters that look similar but are not due to common descent versus characters that actually are due to common descent
The example of the Quagga, homology
An effort to revive an extinct species through selective breeding.
The last Quagga died in Amsterdam zoo in 1883, now scientists have set out to “revive” the Quagga by selective breeding of zebras to create animals with the same colouration.
Problem: The outcome is not a new species. It’s just a fancy looking zebra. If you perform a genetic analysis you find that the new “Quagga” is more related to zebras than to the ancestral Quagga.
What is an autapomorphy?
Singletons: changes in the sequence of a single species. They can’t help us understand relationships, but they can help us understand the rate of change
How do distance based (Algorithmic) phylogenetic methods work?
They use a measure of similarity to group different OTUs together and then pair those groups with each other to form a hierarchy.
Are fast and computationally easy
Reduce character states down to distances
Assumes a constant rate of change/evolution
Based on “Observed Distance” (“p”)
MAY NOT REFLECT THE “ACTUAL GENETIC DISTANCE” “d”
BLAST uses this approach
Often a good first approximation of the data
Name some distance methods in phylogenetics
Within Clustering algorithm:
1. UPGMA - Unweighted Pair Group Method with Arithmetic Mean (clusters OTUs, then clusters those new groups)
2. Neighbour Joining, NJ, Shortest tree - Sequentially finds pairs of neighbours connected by a single node, aims to reduce the overall length of the tree
Within optimality criterion:
3. Minimum Evolution, ME, Shortest branch length - Reconstructs the tree with the shortest branch length, minimum distance
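To make the clustering idea concrete, here is a minimal Python sketch of UPGMA on a toy distance matrix. The function name `upgma`, the nested-tuple tree representation and the distances are assumptions made for this illustration; real analyses would use a phylogenetics package, and branch lengths are left out to keep it short.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA and return the topology as nested tuples.

    names: list of OTU labels
    dist:  square matrix (list of lists) of pairwise distances
    """
    # Each cluster id maps to (subtree, number of OTUs in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        # Distance from the merged cluster to every remaining cluster is the
        # size-weighted mean of the two old distances (the arithmetic mean
        # over all OTU pairs).
        new_d = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


# Toy distances, invented for illustration.
names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(names, dist)
```

A and B are closest, so they are joined first; C joins that pair next, and D joins last.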
What is the relationship between observed distance, p, and actual genetic distance, d?
The observed distance can underestimate the true distance, especially when the degree of divergence is high (e.g. old splits)
Multiple substitutions accumulate at the same sites (multiple hits), so the sequences become saturated and look increasingly random
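A small Python sketch of the gap between p and d, using the Jukes-Cantor (JC69) correction d = -(3/4) ln(1 - 4p/3) as one example of a correction. The choice of JC69 and the function names are assumptions for this illustration; other models give other corrections.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed distance p: proportion of aligned sites that differ."""
    diffs = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return diffs / len(seq_a)


def jc69_distance(p):
    """Estimated actual distance d under JC69.

    The correction is undefined once p reaches 0.75, the point where two
    sequences are as different as random sequences (saturation).
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTACGTTT")  # 2 of 10 sites differ
d = jc69_distance(p)                        # slightly larger than p
d_old_split = jc69_distance(0.5)            # correction grows with divergence
```

For small p the two values are nearly equal; as p grows, d pulls further ahead, which is the underestimation described above.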
What is the difference in change rate between the 1st and 2nd codon position when compared to the 3rd?
A change at the 3rd codon position has less impact on the outcome (it often doesn’t change the amino acid), so the 3rd position changes faster than the 1st and 2nd codon positions and should be treated differently when modelling evolutionary change
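A minimal Python sketch that tallies differences by codon position, assuming two aligned, in-frame coding sequences that start at codon position 1. The sequences are invented for illustration.

```python
def differences_by_codon_position(seq_a, seq_b):
    """Count differing sites at codon positions 1, 2 and 3.

    Assumes both sequences are aligned, gap-free, in frame and start at
    codon position 1.
    """
    counts = {1: 0, 2: 0, 3: 0}
    for index, (x, y) in enumerate(zip(seq_a, seq_b)):
        if x != y:
            counts[index % 3 + 1] += 1
    return counts


# Toy sequences: ATG GCT AAA GGT CTG vs ATG GCC AAG GGA CTA
a = "ATGGCTAAAGGTCTG"
b = "ATGGCCAAGGGACTA"
counts = differences_by_codon_position(a, b)
```

Here all four differences fall on 3rd positions and none changes the encoded amino acid, which is why 3rd positions are often given their own rate or partition.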
What are the differences in the change rates of purines (A and G, two rings) and pyrimidines (C and T)?
Transitions (changes within the same class of base, purine to purine or pyrimidine to pyrimidine) are more frequent
Transversions (changes between a purine and a pyrimidine) are less frequent
Name some different models of molecular evolution and what they assume
Which model you use will affect the outcome of your analysis
- Jukes-Cantor model (JC69)
All changes are equally likely, and we assume equal frequencies of all four bases
- Kimura model (K80)
Transition rate (between similar bases) differs from transversion rate (between different bases) and we assume equal frequencies
- Felsenstein model (F81)
All changes are equally likely, but we assume unequal frequencies
- Hasegawa-Kishino-Yano model (HKY85)
Ts and Tv rates differ and we assume unequal frequencies
- Tamura-Nei model (TN93)
Ts and Tv rates differ, we assume unequal frequencies, AND the two kinds of transition (between purines, A/G, and between pyrimidines, C/T) have different rates
- Generalised Time Reversible model (GTR), or T86 after Simon Tavaré
“Whole hog”: a separate rate for every pair of bases, covering every transition and transversion
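As a hedged illustration of what a model changes in practice, the Python sketch below counts transitions and transversions between two sequences and applies the K80 distance, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of sites showing a transition or a transversion. The function names and the toy sequences are assumptions for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_proportions(seq_a, seq_b):
    """Return (P, Q): proportion of sites with a transition / transversion."""
    ts = tv = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1  # same class of base
        else:
            tv += 1  # purine <-> pyrimidine
    n = len(seq_a)
    return ts / n, tv / n


def k80_distance(P, Q):
    """K80 distance; only defined while both log arguments stay positive."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


# Toy sequences: one transition (A/G) and one transversion (C/A).
P, Q = ts_tv_proportions("AAAAACCCCC", "GAAAACCCCA")
d_k80 = k80_distance(P, Q)
```

With the same observed p of 0.2, JC69 would give about 0.2326, so even on ten toy sites the choice of model shifts the estimated distance a little.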
Name the two types of Discrete Character Based phylogenetic methods (Tree-searching)
- Maximum Parsimony
- Maximum Likelihood (including Bayesian methods)
What does the Maximum Parsimony method do?
It evaluates alternative trees based on the character data, compares the number of changes and selects the tree with the fewest changes
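Counting the changes a tree requires can be sketched with Fitch's algorithm for a single character. This is an illustrative Python sketch under assumptions: trees are rooted, binary and written as nested tuples, and the function name and toy states are invented here.

```python
def fitch_changes(tree, states):
    """Minimum number of changes for one character on a rooted binary tree.

    tree:   nested 2-tuples with tip names as strings
    states: dict mapping tip name -> observed character state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:
            return shared  # no change needed at this node
        changes += 1       # the two sides disagree: one change
        return left | right

    visit(tree)
    return changes


# One toy alignment column: A and B have G, C and D have T.
states = {"A": "G", "B": "G", "C": "T", "D": "T"}
changes_tree1 = fitch_changes((("A", "B"), ("C", "D")), states)
changes_tree2 = fitch_changes((("A", "C"), ("B", "D")), states)
```

Parsimony prefers the first tree for this character because it needs one change instead of two; a full analysis sums these counts over all characters.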
What are some of the principles of parsimony?
(Mostly applied to non-molecular datasets and Pete does not usually apply this to larger datasets)
- Complex morphologies must reflect homology
- Evolution is rare
This means that parsimony does not tell us which tree is more likely to be true, but which tree is the simplest
Ignores multiple hits
Generally not used beyond initial distance trees!
What is the Maximum Likelihood method and what does it do?
Instead of comparing the tree to the data we’re comparing the data to the tree.
Ideally, this method would take your data and your evolutionary model (the one you pick) and tell you the probability of each possible tree given the data:
Pr(H | D), where H is the tree (and the model) and D is the sequences.
Because there are so many possible trees, this is not actually what happens.
INSTEAD what we calculate is
Pr(D | H), the probability of the data given the tree (and model).
We then prefer the tree with the highest value/likelihood
What is likelihood?
The probability of the data given a particular model of evolution
Reported as the natural log of the likelihood (lnL, the log-likelihood score), which is a negative value and is often written as -lnL
An lnL closer to zero means a better fit to the data
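A toy Python sketch of a log-likelihood calculation, assuming the simplest possible case: two aligned sequences, the JC69 model, and a single parameter d (substitutions per site separating them). The function name, the grid search and the counts are assumptions for illustration; real programs do this over whole trees.

```python
import math


def jc69_log_likelihood(n_same, n_diff, d):
    """lnL of two aligned sequences separated by distance d under JC69."""
    e = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * e  # site ends up with the same base
    p_diff = 0.25 - 0.25 * e  # site ends up with one specific other base
    # 0.25 is the frequency of the starting base under JC69.
    return (n_same * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


# Toy data: 100 aligned sites, 20 of which differ.
candidates = [i / 1000 for i in range(1, 1001)]
best_d = max(candidates, key=lambda d: jc69_log_likelihood(80, 20, d))
best_lnl = jc69_log_likelihood(80, 20, best_d)
poor_lnl = jc69_log_likelihood(80, 20, 0.05)
```

Every lnL here is negative; the best-fitting distance is the one whose lnL is closest to zero, and it lands next to the JC69-corrected distance for p = 0.2.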
What are the steps to generating a maximum likelihood tree?
- Alignment of the data
- Generate a tree and compare the aligned data to it (we’re also asking which model of evolution is best in this step)
- Optimise the likelihood of this tree
- Rearrange tree to generate new tree
- Compare, does new tree have higher likelihood than old tree?
- Yes: keep the new tree. No: keep the old tree
- Keep going until you can’t find tree with better likelihood
What are the advantages and disadvantages of maximum likelihood?
Advantages:
- Reliance on explicit model of molecular evolution - adaptable
- Results are conditional on model
- Test different models - Likelihood ratio test
Disadvantages:
- Takes a long time - iterative (but CPUs are getting faster)
- More sequences, more problems
What is bootstrapping?
Bootstrap support tests the strength of the phylogenetic signal in an alignment by resampling your data to see if you get the same result
It resamples the alignment sites (columns) with replacement to make alternative datasets known as pseudoreplicates, so each has the same number of sites but a different mix of them.
If a pseudoreplicate gives a different topology, that counts against the nodes that changed; nodes that are rarely recovered have marginal support
The bootstrap value tells you the number of times out of 100 or 1000 (reported as a percentage) that the same node was recovered. Above 90 and close to 100 is what we want, but above 70 is acceptable
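A minimal Python sketch of the resampling step, assuming the alignment is held as a dict of equal-length strings. The function names and the toy alignment are invented for illustration; building a tree from each pseudoreplicate is left out.

```python
import random


def bootstrap_pseudoreplicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence
    rng:       a random.Random instance (seeded for repeatability)
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def support(node_recovered):
    """Percentage of pseudoreplicate trees in which the node was recovered."""
    return 100 * sum(node_recovered) / len(node_recovered)


aln = {"sp1": "ACGTACGT", "sp2": "ACGTTCGA", "sp3": "TCGAACGT"}
rep = bootstrap_pseudoreplicate(aln, random.Random(1))

# Hypothetical outcome: the node appeared in 93 of 100 pseudoreplicate trees.
value = support([True] * 93 + [False] * 7)
```

Whole columns are sampled, so each site keeps its pattern across species; only the mix of sites changes between pseudoreplicates.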
Name three ways of rearranging a tree to generate a new one within the maximum likelihood methodology
- Nearest Neighbour interchange (NNI)
- Subtree Pruning + Regrafting (SPR)
- Tree-Bisection + Reconnection (TBR)
How does the maximum likelihood approach work?
“Hill-climbing”/heuristic approach, it keeps climbing the likelihood “hill”, can’t step down and can get stuck on a local optimum in tree space.
How do we overcome the limitation of getting stuck on local optima in tree space?
Metropolis coupled Markov Chain Monte Carlo (MC^3)
Markov Chain Monte Carlo (MCMC) - A set of algorithms that walk randomly through tree space
- Markov Chain = Movement through states (Not influenced by past states, i.e. trees)
- Monte Carlo = Random sampling of numbers
Metropolis-Coupled
Cold chain robot = The one that’s actually looking for the best tree
Hot chain robots = Helpers that scout through tree space; they can jump downhill and inform the cold chain if they find a better area.
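The difference between hill-climbing and an MCMC walk can be sketched on a toy one-dimensional "tree space" in Python. This is not MC^3 itself (there is no swapping between chains); it only illustrates the Metropolis acceptance rule and what heating does. The landscape values, the function names and the heating parameter `beta` are assumptions for this example.

```python
import math
import random

# Toy log-likelihoods for seven neighbouring "trees"; index 1 is a local
# optimum and index 5 is the global optimum. Values are invented.
LNL = [-50, -40, -45, -60, -30, -20, -25]


def hill_climb(start):
    """Only ever move uphill; stop when no neighbour is better."""
    current = start
    while True:
        neighbours = [i for i in (current - 1, current + 1)
                      if 0 <= i < len(LNL)]
        best = max(neighbours, key=lambda i: LNL[i])
        if LNL[best] <= LNL[current]:
            return current
        current = best


def acceptance_probability(lnl_old, lnl_new, beta=1.0):
    """Metropolis rule: uphill moves are always accepted, downhill moves
    sometimes. beta < 1 'heats' the chain so downhill moves are easier."""
    return min(1.0, math.exp(beta * (lnl_new - lnl_old)))


def metropolis(start, steps, beta, rng):
    """Random walk over the landscape; returns the best state visited."""
    current = best = start
    for _ in range(steps):
        proposal = current + rng.choice((-1, 1))
        if not 0 <= proposal < len(LNL):
            continue
        if rng.random() < acceptance_probability(LNL[current],
                                                 LNL[proposal], beta):
            current = proposal
        if LNL[current] > LNL[best]:
            best = current
    return best


stuck_at = hill_climb(0)                           # stops at the local optimum
cold = acceptance_probability(-40, -45, beta=1.0)  # rarely steps downhill
hot = acceptance_probability(-40, -45, beta=0.2)   # steps downhill more often
visited_best = metropolis(0, 200, 0.2, random.Random(42))
```

The hill-climber stops at index 1, while a chain that can accept downhill steps has a chance of crossing the valley towards index 5.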
What is the main feature of Bayesian statistics and phylogenetics?
The main feature of Bayesian Statistics/Phylogenetics is that it takes into account prior knowledge of the hypothesis (the tree).
Prior information:
- Tree topologies
- Each branch length
- Substitution rates/Model of evolution
- Rate heterogeneity parameter
- Nucleotide frequencies
We can either give it realistic/informative prior information (stuff we know before investigating the data) OR flat prior information (which is kind of like giving it no information, all parameter values within a bound, like 0-1, are equally likely)
Bayes theorem - Describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
What is the burn-in period of a Bayesian phylogenetic analysis?
The beginning of the MCMC analysis where it’s trying to find a good area in tree space
You have to let the analysis run for a number of generations so “the robots” can converge on the same area in tree space. The final results then discard the first X generations as “burn-in” (e.g. the first million steps)
Which topology do we pick for our tree?
After the Bayesian Phylogenetic analysis we end up with a posterior distribution of tree topologies and branch lengths. We can pick between the following topologies:
- MAP (Maximum a posteriori tree)
The topology with the maximum posterior probability (similar to the ML tree)
- Majority rule consensus tree
The tree constructed so that it contains all of the clades that occur in at least X % of the trees in the posterior distribution
- 95 % credible set
The set of tree topologies that accounts for 95 % of the posterior probability
Node support is displayed as the posterior probability (PP): 1.0 or 100 = full PP support for the node
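A small Python sketch of how the MAP topology and a credible set could be read off a posterior sample, assuming the sampled topologies (after burn-in) are available as Newick-style strings. The function name and the sample counts are invented for illustration.

```python
from collections import Counter


def summarise_topologies(sampled, credible=0.95):
    """Return (MAP topology, credible set) from sampled topologies."""
    counts = Counter(sampled)
    ranked = counts.most_common()  # most frequently sampled first
    map_topology = ranked[0][0]
    credible_set, cumulative = [], 0
    for topology, n in ranked:
        credible_set.append(topology)
        cumulative += n
        if cumulative >= credible * len(sampled):
            break
    return map_topology, credible_set


# Toy posterior sample of 100 trees for three taxa.
sample = ["((A,B),C)"] * 70 + ["((A,C),B)"] * 27 + ["((B,C),A)"] * 3
map_tree, cred_set = summarise_topologies(sample)
pp_ab = sample.count("((A,B),C)") / len(sample)  # PP of the clade (A,B)
```

The posterior probability of a topology is simply the fraction of the sample it makes up; here the two most frequent topologies together cover at least 95 % of the posterior.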
What is posterior probability?
A type of conditional probability that results from updating the prior probability with information summarised by the likelihood through an application of Bayes’ theorem
Bayes theorem - Describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
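Bayes’ theorem can be worked through for three candidate trees in a few lines of Python. The priors and likelihoods below are invented numbers chosen only to show how a flat and an informative prior change the posterior; the function name is an assumption.

```python
def posterior(priors, likelihoods):
    """Pr(H | D) = Pr(D | H) * Pr(H) / sum over all H of Pr(D | H) * Pr(H)."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalised.values())
    return {h: value / evidence for h, value in unnormalised.items()}


likelihoods = {"T1": 0.006, "T2": 0.003, "T3": 0.001}  # Pr(D | H), toy values

flat = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}         # flat prior
informative = {"T1": 0.1, "T2": 0.8, "T3": 0.1}        # strong belief in T2

post_flat = posterior(flat, likelihoods)
post_informative = posterior(informative, likelihoods)
```

With the flat prior the posterior just mirrors the likelihoods (0.6, 0.3, 0.1); with the informative prior T2 overtakes T1, which shows how a prior belief that is wrong could mislead the analysis.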
Advantages and disadvantages of Bayesian phylogenetics
Advantages:
- You can pick between a range of models with variable rates of genetic changes
- It accounts for uncertainties in the tree topologies
- Gives you not one tree but a set of most probable trees
- Good for dating lineages, can be calibrated with the fossil record
Disadvantages:
- Takes a long time
- The quality depends on how well you sample tree space
- Can be really complex and complicated to set up
- Based on prior beliefs that could be wrong
What is the molecular clock hypothesis?
For a given gene region, the rate of molecular sequence evolution (amino acid replacement, nucleotide substitution, etc.) is stochastically constant through time and across lineages
This means that sequence divergence is proportional to time giving us a uniform mutation/substitution rate and that we can calibrate a clock based on this so we can infer evolutionary history from molecular data.
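Under a strict clock the arithmetic is short enough to sketch in Python. The function names, the rate and the distances are hypothetical values for illustration; the factor 2 is there because both lineages accumulate substitutions after they split.

```python
def divergence_time(distance, rate):
    """Time since two lineages split.

    distance: substitutions per site between the two sequences
    rate:     substitutions per site per million years, per lineage
    """
    return distance / (2 * rate)


def calibrate_rate(distance, split_age):
    """Estimate the per-lineage rate from a pair with a known split age."""
    return distance / (2 * split_age)


# Hypothetical calibration: a pair that split 5 Mya differs by 0.10.
rate = calibrate_rate(0.10, 5.0)        # 0.01 substitutions/site/Myr
age = divergence_time(0.04, rate)       # about 2 Myr for another pair
```

The same rate is applied to every pair, which is exactly the assumption a strict clock makes.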
What is the “strict clock” model and what is the problem with it?
It assumes that all sequences in an alignment have the same underlying rate of substitution.
Some datasets are “clock-like”, e.g. closely related species, but we should expect rates of evolution to vary, i.e. to speed up or slow down across lineages and genes over time. This is known as rate heterogeneity
What is “rate heterogeneity” and what could be causes of it?
Rates of evolution differ between lineages and genes over time.
For lineages this could be due to differences in DNA repair efficiency, metabolic rate, generation time and population size,
and for genes it could be variation in gene and protein structure and function
How can we account for rate heterogeneity in our data when doing the analysis?
“Local clock”:
We allow lineages within a phylogeny to have different substitution rates and can assign lineages (branches) to different rate categories
“Relaxed clock”:
Every branch can have its own rate
It’s impossible to estimate a separate rate for every branch independently, so rates are predicted along the phylogeny based on a model of molecular evolution
This is what the BEAST 2 program does
Name examples of things we can calibrate our phylogeny with
Legacy rates from similar taxa/genes
Legacy calibration from previous studies
Biogeographic events
- Barrier formations (closure of the Isthmus of Panama, closure of the Tethys Seaway)
- Climatic events (Mass extinctions)
Fossil data (strata, fragments, full fossil specimens)
How can we calibrate a node in a phylogeny?
- Point calibration (If we found one fossil specimen at 50 mya we set the node there)
- Max/min age constraint (The fossil was found in a deposit that was between X and Y mya, you set the node as a range)
- Parametric Distribution (Multiple fossils relating to a node, we can create a distribution relating to the probability of the node, exponential, lognormal, normal)
What are Phylogenetic Comparative Methods (PCMs) a combination of?
Population and quantitative genetics
Paleobiology
Phylogenetics
PCM is the analytical study of species, populations, and individuals in a historical framework to elucidate the mechanisms at the origin of the diversity of life
What is the relationship between genome size and mutation rate?
The bigger the genome, the lower the mutation rate. Smaller genomes have higher mutation rates.
What is the relationship between generation time and molecular evolution rate? What is the cause for this relationship?
Shorter generation time = Faster rate of molecular evolution.
Genomes get copied more frequently for shorter generation times and therefore collect more errors per unit time
Example, rates of evolution between trees/shrubs and herbs
A paper published in Science looked at 5 clades of flowering plants to see whether there was a link between the rate of molecular evolution and life history (generation time). They built phylogenetic trees (ML, 100 bootstraps) and for each branch calculated the number of substitutions per nucleotide per million years.
They found that herbs (short generation times) had 2.7-10 times higher rates of molecular evolution compared to trees and shrubs
What are phylogenetically independent sister pairs?
Phylogenetic non-independence: Related species may share the same traits, so e.g. one single heritable trait that arose one time can be inherited by many descendants, and this can show up as a bunch of data points, even though it should only technically count as ONE data point. I.e. we risk counting one instance of change multiple times. This gives us an artificially high level of certainty about the relationship we see.
Making sure to have phylogenetically independent sister pairs means that the analysis is NOT artificially robust, but more likely to show a true relationship
Example, generation time and molecular evolution rate in invertebrates
143 species of invertebrate, 14 genes (mt and n), phylogenetically independent sister pairs.
Found that species with shorter generation times had higher rates of molecular evolution.
BEAR in mind that longer generation time is related to larger body size
Example, relationship between rate of molecular evolution, body size (weight) and poikilotherms/homeotherms
Martin and Palumbi confirmed that larger body size is related to lower rate of molecular evolution. BUT ALSO they found that homeotherms (i.e. warm blooded animals that can regulate their own body temperature) have higher rates of evolution compared to poikilotherms (i.e. cold blooded animals that rely on their environment to heat them up)
We see in the literature that both generation time and body size are indicative of the rate of evolution. But what is the driver?
It is assumed that having a large body with more cells requires higher DNA fidelity and repair. HOWEVER, correlation between DNA repair efficiency and body size has not yet been established.
In invertebrates, there is no evidence for influence of body size on substitution rate.
Metabolic rate might be the cause because of the mutagenic oxygen radicals produced
How is metabolic rate linked to molecular evolution rate?
- Metabolism, i.e. aerobic respiration, produces oxygen radicals that are mutagenic. We would expect mitochondrial DNA to be the most impacted because that is where 90 % of the oxygen in a cell is used.
- Higher metabolic rates mean that we have more energy to do things with, e.g. DNA synthesis and nucleotide replacement. Lower metabolic rate = lower turnover, less frequent repair
How could temperature be linked to molecular evolution rate?
Mearnsia, a type of plant, was sequenced from many locations (the Philippines down to New Zealand) and an ML phylogenetic tree was built.
Temperature correlated with the branch lengths of the tree, indicating that temperature affects the rate of molecular evolution, possibly through greater biologically available energy and the correlated productivity.
What is the neutral theory of evolution?
Molecular evolution is caused primarily by neutral mutations that randomly drift to fixation in a population resulting in nucleotide substitutions
How could landmass size impact rate of molecular evolution?
Apparently, the rate of molecular evolution is slower in birds living on islands/in small areas compared to birds living on the mainland or in large areas.
This is taken to mean that population size impacts the rates of both population-level mutagenesis and gene fixation. More reproducing individuals means more DNA replication error.
THIS HAS IMPORTANT IMPLICATIONS for biodiversity conservation:
If we put species in limited refugia we slow the tempo of their microevolution which can limit the potential for adaptive shifts in response to changing environments
What is the simple equation for calculating species richness?
Species richness = Speciation - Extinction
How does species richness change with latitude?
Latitudinal gradient hypothesis: There are more species in the tropics than in the polar regions even when compensating for reduced area.
However, Pete contributed to a paper on marine fishes that concluded that speciation rates were higher in the polar regions.
How does species richness change with altitude?
There are fewer species higher up, i.e. species richness decreases with increasing altitude.
Maybe partially due to the energy available to the different ecosystems, warmer lower down = more energy.
Why do extinction and speciation rates differ?
Differences in
- Population size
- Generation time
- Mechanisms of pollination and seed dispersal
- Strength of sexual selection
- Climatic effects
- Landscape heterogeneity
- etc.
What is tree topology?
The particular branching pattern of a tree
How do we read the balance of a phylogenetic tree?
If the tree is very balanced, i.e. has a balanced topology, it can indicate competition between close relatives: more competition and less diversification in large clades.
If the tree is very imbalanced it can indicate that there are heritable characters that affect diversification, e.g. key innovations that expand the available niche space and thereby increase diversification
e.g. island colonisers with no competition (these are often more diverse than mainland relatives)
e.g. sexual selection leads to increased speciation (broadcast spawners show low diversification)
How do we read the timing of diversification?
Early diversification (long branch lengths) can indicate adaptive radiation e.g. Cambrian explosion. Evolution of ecological and phenotypic diversity within a rapidly multiplying lineage. (e.g. eyes, armor or increase in oxygen levels)
Late diversification can be a case of island clades being more diverse than their closest mainland relatives, just diversifying like mad because they have no competition
What does a Lineage Through Time (LTT) plot show?
How many new lineages have arisen over time in a phylogeny with time on the x-axis and ln(no of lineages) on the y-axis. This allows us to see whether the diversification rate has been constant over time or not
Abrupt changes could indicate e.g. a climatic event or a key innovation
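A minimal Python sketch of the numbers behind an LTT plot, assuming a fully bifurcating time tree whose internal node ages (time before present) are known. The function name and node ages are invented for illustration, and plotting is left out.

```python
import math


def ltt_points(branching_times):
    """Return (node age, ln(number of lineages)) pairs, oldest first.

    branching_times: ages of the internal nodes of a bifurcating time tree;
    the oldest age is the root, where 1 lineage becomes 2.
    """
    points = []
    lineages = 1
    for age in sorted(branching_times, reverse=True):
        lineages += 1  # every split adds one lineage
        points.append((age, math.log(lineages)))
    return points


# Toy tree with 4 tips: splits at 10, 6 and 3 Mya.
points = ltt_points([6, 10, 3])
```

Under a constant diversification rate the points fall roughly on a straight line; a sudden change in slope is the kind of abrupt change mentioned above.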
What is the gamma statistic?
A method of measuring if per-lineage speciation and extinction rates have remained constant through time
Rejection might provide evidence of adaptive radiations or key adaptations
Assumes that you have a complete phylogeny in the clade you are interested in.
Null hypothesis = Constant rate of diversification
Null is rejected at the 5 % level if gamma is less than -1.645
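A hedged Python sketch of the calculation, following the formula usually attributed to Pybus and Harvey (2000) as far as I understand it. Here g_k is the length of time during which the tree has k lineages and n is the number of tips; the function name and the toy intervals are assumptions for illustration.

```python
import math


def gamma_statistic(internode_intervals):
    """Gamma statistic from internode intervals [g_2, g_3, ..., g_n]."""
    g = internode_intervals
    n = len(g) + 1  # number of tips
    if n < 3:
        raise ValueError("need at least 3 tips")
    weighted = [k * g_k for k, g_k in zip(range(2, n + 1), g)]  # k * g_k
    total = sum(weighted)                                       # T
    running = 0.0
    sum_of_partial_sums = 0.0
    for term in weighted[:-1]:  # i = 2 .. n-1
        running += term
        sum_of_partial_sums += running
    numerator = sum_of_partial_sums / (n - 2) - total / 2
    denominator = total * math.sqrt(1 / (12 * (n - 2)))
    return numerator / denominator


gamma_early = gamma_statistic([1, 5])  # splits crowded near the root
gamma_late = gamma_statistic([5, 1])   # splits crowded near the tips
gamma_zero = gamma_statistic([3, 2])   # balanced case for three tips
```

Negative values mean the splits sit closer to the root than expected under a constant rate (diversification slowed down); the one-tailed test rejects the null when gamma falls below -1.645.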
What is incomplete taxon sampling?
The assumption of the constant rate test (gamma statistic) is complete taxon sampling, where you have every single species within a clade represented in your phylogenetic tree. This is very hard to do. Yet it’s so important because it may change the outcome of your analysis
How do we address incomplete taxon sampling?
We can address it with a simulation. Make 1000 trees with 18 taxa under a constant rate of diversification (if you believe that 18 is the number of species in your clade) and then randomly prune 7 taxa from each of the trees. This allows us to build confidence intervals around our constant-rate LTT plot and see if the observed LTT plot falls within the confidence interval.
Example: leaf beetle diversity
Authors tried to collect as many species of leaf beetle as possible, but could only find 83 out of 202 known extant species, which is 41 %.
They generated sequences from both mt and nDNA, chose a substitution model, and built a lot of trees using different methods (parsimony, ML, and Bayesian inference). They used penalised likelihood (a relaxed clock method) to generate an ultrametric tree (time tree), used fossils to calibrate the tree and estimated gamma statistics.
They generated LTT plots, then generated 1000 replicate trees with 202 taxa and pruned 119 from each to get a mean LTT + 95 % confidence interval.
They identified points of significant diversification rate shifts and tried to account for these changes by using high latitude sea surface temperatures as a proxy for global climate.
They conclude that the KT boundary opened up a lot of niches, which made it possible for the leaf beetles to diversify (adaptive radiation). Furthermore, global warming made it possible for the beetles to expand latitudinally, making them more diverse, because they followed the taxonomic diversification of the tropical plant lineages that they use as hosts
A slowing of diversification was observed after the warming period as niches became saturated with leaf beetles; also, tropical plants retreated back to lower latitudes due to global cooling