Phylogenetics Final Flashcards
systematics
the inference of phylogenies, the genealogy of species, focus on the species tree (reconstructing lineage history)
coalescence
point of common ancestry of two alleles
they both come from the same parental allele
lines of descent in diploid sex pop
the most recent generation is at the top, and time runs backward toward the bottom
the graph shows one section of the genome
each pair of circles is one diploid individual
each circle is one allele (copy) of the gene
Assumptions of diploid sex pop allele descent
- equal probabilities of alleles being passed from one gen to the next (no selection, random mating)
- population size is constant over time
- alleles (from gen t) are drawn randomly with replacement from the previous generation (t-1)
probability of coalescence
given N diploid individuals in each generation, the probability that 2 alleles coalesce in the previous generation (t-1) is 1/(2N).
this is the chance that two random draws, with replacement, from the 2N alleles of the previous generation pick the same allele.
time to coalescence
the time to coalescence for two alleles is geometrically distributed with a mean of E(t) = 2N generations, where N is the number of diploid individuals in a generation.
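A minimal sketch of the two cards above, assuming the idealized population from the assumptions card (constant size, random mating, no selection). The function names are illustrative and not from the course material.

```python
def prob_coalesce_at(t, n_diploid):
    """Probability that two alleles first share a parent allele exactly t generations back."""
    p = 1.0 / (2 * n_diploid)         # chance of coalescing in any single generation
    return (1.0 - p) ** (t - 1) * p   # geometric: t-1 misses, then one hit


def expected_coalescence_time(n_diploid):
    """Mean of that geometric distribution, 1/p = 2N generations."""
    return 2 * n_diploid


# Example: N = 50 diploid individuals, so 2N = 100 alleles per generation.
p_first_generation = prob_coalesce_at(1, 50)   # 1/(2N)
mean_time = expected_coalescence_time(50)      # 2N generations
```

Doubling N halves the per-generation probability and doubles the expected waiting time, which is why coalescence is slower in larger populations.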
coalescence is slower in…
larger populations compared to smaller populations
expected time to coalescence for many alleles
approximately 4N generations (the expectation for n alleles is 4N(1 - 1/n), which approaches 4N as n grows)
properties of coalescent trees in a constant population
- coalescence is rapid while there are many alleles and slows going back in time as the number of remaining lineages n decreases
- have long trunks and short terminal branches
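The two properties above can be sketched numerically. This assumes the standard continuous-time approximation, in which the expected time with k lineages is 2N divided by the number of pairs; the function names are hypothetical.

```python
def expected_interval(k, n_diploid):
    """Expected generations during which exactly k lineages exist."""
    pairs = k * (k - 1) / 2          # number of pairs that could coalesce
    return 2 * n_diploid / pairs     # equals 4N / (k(k-1))


def expected_tmrca(n_alleles, n_diploid):
    """Expected time until all sampled alleles share one ancestor."""
    return sum(expected_interval(k, n_diploid) for k in range(2, n_alleles + 1))


# Example: 10 alleles sampled from N = 100 diploid individuals.
tips = expected_interval(10, 100)    # short interval near the tips
trunk = expected_interval(2, 100)    # long final interval (the trunk), 2N
total = expected_tmrca(10, 100)      # 4N(1 - 1/n)
```

In this example the last interval (two lineages left) takes more than half of the total expected time, which is the "long trunk, short terminal branches" shape.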
recombination causes
different genes in the same individuals to have different gene tree topologies
recombination in two ways
- independent assortment: whole chromosomes from the two parental sets are shuffled into gametes independently, so genes on different chromosomes are inherited independently
- crossing over: paired (homologous) chromosomes swap segments, which is how recombination can happen within a single chromosome
recombinational gene
a block of adjacent nucleotides that share the same gene tree
this is the ideal locus for phylogenetic inference
branching or splitting
the subdivision of an ancestral population by barriers to gene flow
population tree
a tree that contains many branching populations. all of the gene trees are embedded in this tree
reasons gene trees don't match pop tree
- deep coalescence
deep coalescence
also incomplete lineage sorting
alleles fail to coalesce in the population where their species split and instead coalesce deeper (further back) in time, in an older ancestral population, so ancestral polymorphism is carried through the split.
ancestral polymorphism
two or more alleles of a gene that were already present in the ancestral population before the species split and were passed on to the descendant species
pop tree length and width
length = generations (time)
width = effective population size (idealized population with same coal. props)
combining both:
1 coalescent unit = 2N generations, the expected time for two alleles to coalesce; branch length in coalescent units = generations / (2N)
long and narrow, coal more likely
short and wide, coal less likely
deep coalescence frequency of gene trees
the major (most frequent) gene tree topology will match the population tree, and the minor (less frequent) topologies are randomly discordant and equal in frequency to each other.
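For the simplest case (three taxa, one internal branch), the expected frequencies can be sketched with the standard coalescent result: the two ingroup lineages fail to coalesce on the internal branch with probability e^(-T), where T is the branch length in coalescent units. The function and key names are illustrative.

```python
import math


def three_taxon_gene_tree_probs(t_generations, n_diploid):
    """Expected gene tree frequencies for a 3-taxon population tree."""
    T = t_generations / (2 * n_diploid)   # internal branch in coalescent units
    p_fail = math.exp(-T)                 # no coalescence on the internal branch
    return {
        "match": 1 - (2 / 3) * p_fail,    # major topology, same as the pop tree
        "discordant_1": p_fail / 3,       # the two minor topologies are equal
        "discordant_2": p_fail / 3,
    }


# Long and narrow branch (T = 5): almost every gene tree matches.
long_narrow = three_taxon_gene_tree_probs(t_generations=10000, n_diploid=1000)
# Short and wide branch (T = 0.05): close to one third each.
short_wide = three_taxon_gene_tree_probs(t_generations=100, n_diploid=1000)
```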
what is a phylogeny debate
- phylogeny as a cloud: phylogeny is a statistical distribution with a central tendency but variance because of the diversity of gene trees that are all included
anomaly zone
where the population tree does not match the most probable gene tree
in a pectinate tree, if the internal branches are very short, all 4 lineages may fail to coalesce until before the first split (in the root population); then each of the 3 symmetric gene trees is more probable than the gene tree that matches the pectinate population tree
this is because a pectinate tree has only one possible sequence of coalescences while a symmetric tree has 2 possibilities (in a rooted 4-taxon tree)
pectinate tree (unbalanced)
tree with each taxon being individually sister to the rest of the tree
only one possible sequence of coalescence
symmetric tree (balanced)
the clades split equally
has two possible sequences of coalescence:
- one sister group coalesces and then the other
- vice versa
anomalous gene tree (AGT)
a gene tree topology that is more probable than the gene tree topology matching the pop tree
Total evidence phylogeny estimate approach
concatenation approach
combine all of the gene data in a single matrix (therefore all the gene trees)
make a single tree from all genes to map the alleles
- results well-resolved
- assumption: all genes share same history which is unlikely
- if pop tree has one or more anomaly zones, the resulting tree can be wrong
- can incorrectly support branches
Multi-species coalescent approach
inferring a population tree based on a few genes
assumption: all gene tree discordance is due to deep coalescence
3 approaches within:
1. parsimony: minimize deep coalescences
2. full-likelihood co-estimation of pop and gene trees
3. approximate or summary methods
Multiple-species coalescent Parsimony
- count minimum number of deep coalescent events on species tree required to explain gene trees
- search among candidate species trees to lower the number
drawback = no estimates of branch length or width
inconsistent when there is an anomaly zone
Likelihood multiple-species coalescent
given a candidate population tree, what is the likelihood of the gene data
measure coalescent times in expected mutations per site (time in generations x mu, where mu is the mutation rate per site per generation)
measure population size as theta, in mutations per site
theta = 4N mu
2 sequences coalesce at rate 2/theta, with expected coalescence time theta/2
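A small sketch of these units, assuming mu is the mutation rate per site per generation and N is the number of diploid individuals. The function names are hypothetical.

```python
def theta(n_diploid, mu):
    """Population size parameter in mutations per site."""
    return 4 * n_diploid * mu


def pairwise_coalescence(n_diploid, mu):
    """Coalescence rate and expected coalescence time for 2 sequences."""
    th = theta(n_diploid, mu)
    rate = 2 / th            # per unit of expected mutations per site
    expected_time = th / 2   # in expected mutations per site
    return rate, expected_time


# Example: N = 10,000 and mu = 1e-8 per site per generation.
rate, expected_time = pairwise_coalescence(10_000, 1e-8)
```

The expected time theta/2 equals 2N generations multiplied by mu, so it agrees with the 2N-generation expectation measured in generations.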
Pr( G|S) = probability of a gene tree given the population tree
Single branch probability depends on parameters
number of alleles exiting and entering that branch, times of coalescent events, branch length and width
Probability that a gene evolved in population tree
Pr(Xi|S) = integral over all gene trees Gi of Pr(Gi|S) x Pr(Xi|Gi) dGi
Pr(Gi|S) = prob of the gene tree embedded in the pop tree
Pr(Xi|Gi) = prob of the sequence data given the gene tree
do summary methods instead:
unrooted quartets have only one tree shape (no pectinate vs. symmetric distinction), so there is no anomaly zone: with a large number of gene trees, the quartet topology that matches the pop tree will have the highest probability
ASTRAL
takes the gene trees and decomposes them into quartets
quartets heuristically reassembled into the optimal population tree
Host-associate paradigm in 3 different contexts
trees within a tree
1. gene tree within a species tree
2. parasite cospeciating with its host
3. organisms diverging resulting from geological events
the associate tracks the host
Paralogy
gene duplication :
- one gene starts then is duplicated and each gene copy diverges
- genes may be lost resulting in incomplete complements of paralogs
may cause a species to show up multiple times in phyl in diff clades, one per paralog
if unrecognized, can lead to discordance among gene trees and to pop tree
Orthologs
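genes in different organisms that descend from a single gene in their common ancestor and diverged by speciation (not duplication); they usually sit at the corresponding locus and often keep the same function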
inferring phylogeny depends on accurately identifying orthologs, hard when duplications and losses are pervasive
Paralogs
genes related by duplication rather than speciation; the copies sit at different loci and often diverge in function
Horizontal gene transfer
genetic transfer without mating
can happen between close or sometimes very distant relatives
(common with host and parasite)
happens rarely: will result in alternate topology that is very unlikely
Introgression
hybridization (mating between different, closely related species) followed by back-crossing (hybrids mating with one of the parent species)
primary concordance tree with a secondary concordance tree and additional very poorly supported trees
Hybrid speciation
formation of new species from equal contribution of genes from either parent from different species (closely related)
expect 50% of loci of the new species to resolve as sister group to one parent species and the other 50% sister group to the other parent species
will result in two primary concordance trees that have equal CF
- .50 and .50 all the way down the tree to the common ancestor of the two species that made the hybrid
How to assess congruence between host/associate trees?
- are tree topologies more similar than expected by chance
- measure congruence in ages of associated clades
- likelihood that they evolved on the same tree?
- do data matrices with both fail to reject common tree
- fit associate tree with host tree, accounting for some incongruence
Processes causing incongruence in host/associate trees
- duplication (speciation) and incomplete sorting of associated lineages
- host switching and/or host range expansion
- unequal and/or different rates of molecular evo in host and assoc.
- horizontal gene transfer
Reconciliation analysis
parsimony method to find the min number of events causing incongruence between host and associate trees
relative costs of duplication, sorting (loss) and host switching are taken into account
branch lengths are not
vicariance
the geographical separation of a population, typically by a physical barrier resulting in two species
cladistic or vicariance biogeography
reconstructing geographic history from species cladograms
- assumption: areas have a treelike history of successive fragmentation events
- disjunction: occurrence of related taxa in widely separated regions (due to vicariance)
Biological species concept
(BSC)
an interbreeding community of populations that is reproductively isolated from all other communities by its physiological properties (incompatibility of parents, sterility of the hybrids or both)
- darwin was not a fan, says fertility could be individual and should not define species
Genotypic/phenotypic cluster concept
species are defined by identifying phenotypic or genotypic clusters of individuals who are more similar to each other, less similar to others
mallet said in 1990s (going back to darwin):
strong support if varieties exist in close proximity for a long time without combining
- geographic proximity, genetic clusters (based on close genetic distances), close phylogenetic distances
Species defined by pattern
- groups of indivs retain ecological/morph distinctions in sympatry (living in the same area)
- genotypic clusters (being similar to each other and different from others)
Species defined by process
- reproductive isolation
-hybridization (not working? sterile?)
-natural selection
-inherited variation
-biological species concept (BSC)
Phylogenetic species concept
monophyly of a species taxon: all descendants of a common ancestor at the species level
used to be defined as terminal splits that all trees agree upon; Baum 2009 relaxes this by defining exclusivity
Exclusivity
That species are exclusive groups :
- a set of contemporaneous (existing at the same time) organisms that form a clade for more of the genome than any conflicting grouping does (the group has a higher concordance factor, CF, than any conflicting clade)
-allows for taxa with concordance factors <50%
Species as taxa
- products of history/evolution
-defined only by the past
-“phylogenetic” species concept
Species as functional units
- participants in evolution
- predictive about the future
-trait-based species concept
-biological species concept
Naming a clade as species is semisubjective
based on semisubjective criteria:
- biological significance
-utility
-predictive power
-robustness
-precedent
Explain my own meaning of the word species
be creative but consistent
Causes of discordance
- incomplete lineage sorting (deep coalescence)
- duplication and extinction of gene copies (paralogy disruption)
- gene flow
- horizontal gene transfer
- hybridization/introgression
Reticulate evolution
formation of a species through the partial merging of two ancestor lineages
Primary concordance tree
a tree composed of clades with higher concordance factors than any conflicting clade (they show up in more gene trees)
Concordance factors
the proportion of the genome for which a given clade is true (shown at the nodes or branches)
estimated as separate numbers for the sample of genes you sequenced and for the genome as a whole
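A minimal sketch of the sample concordance factor only, assuming each gene tree has already been reduced to the set of clades it contains. The input format and function name are assumptions for illustration; genome-wide CFs need a statistical model and are not covered here.

```python
from collections import Counter


def concordance_factors(gene_trees):
    """Proportion of sampled gene trees that contain each clade.

    gene_trees: list of gene trees, each given as a collection of clades,
    where a clade is a frozenset of taxon names.
    """
    counts = Counter(clade for tree in gene_trees for clade in set(tree))
    return {clade: counts[clade] / len(gene_trees) for clade in counts}


# Example: 10 sampled genes for taxa A, B, C (plus an outgroup).
gene_trees = (
    [{frozenset("AB")}] * 7     # 7 genes group A with B
    + [{frozenset("BC")}] * 2   # 2 genes group B with C
    + [{frozenset("AC")}] * 1   # 1 gene groups A with C
)
cf = concordance_factors(gene_trees)
```

Here the clade A+B has the highest CF of the three conflicting clades, so it is the one that would appear in the primary concordance tree.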
How to calculate tree frequencies
learn in OH
D-statistic test process
a way to tell if discordance is due to incomplete lineage sorting (ILS) or introgression
involves measuring proportions of two-state SNPs that have ABBA or BABA patterns (read in the order P1, P2, P3, outgroup; A = ancestral state, B = derived state)
- there is a primary assumed correct topology (P1 and P2 are sister) on a 4-taxon pectinate tree
- create two alternative topologies of the in-group branches
- BABA supports P1 and P3 as sister
- ABBA supports P2 and P3 as sister
- if p(ABBA) = p(BABA) then the discordance is due to ILS; if they are not equal it is due to introgression
calculation of D-stat from the formula, significance by bootstrap
D- statistic from a matrix
species are rows and nucleotides are columns
read the columns of nucleotides and see which follow the ABBA and BABA patterns
then use the d-statistic formula to calculate D
D-statistic formula
D = (#ABBA - #BABA) / (#ABBA + #BABA)
values near 0 support ILS, values approaching 1 or -1 support introgression
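The matrix procedure and formula above can be sketched as follows. This assumes four aligned sequences of equal length containing only nucleotides, with the outgroup defining the ancestral state A; the function name and the toy data are made up for illustration, and the bootstrap significance test is not included.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Count ABBA and BABA columns and return (D, n_abba, n_baba)."""
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if len({a, b, c, o}) != 2:           # keep only two-state sites
            continue
        if a == o and b == c and b != o:     # pattern A B B A
            abba += 1
        elif b == o and a == c and a != o:   # pattern B A B A
            baba += 1
    if abba + baba == 0:
        raise ValueError("no ABBA or BABA sites in the alignment")
    return (abba - baba) / (abba + baba), abba, baba


# Toy matrix: species are rows, nucleotide sites are columns.
p1 = "ACTGAA"
p2 = "GTCGAT"
p3 = "GTTGCT"
og = "ACCGAA"
d, n_abba, n_baba = d_statistic(p1, p2, p3, og)
```

In the toy matrix, columns 1, 2 and 6 are ABBA and column 3 is BABA, so D = (3 - 1) / (3 + 1).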
Two classes of characters
discrete and continuous
Testing hypotheses of character evo with history vs. models
- History: questions concerning specific ancestors (species evolve before other species)
- Models: questions concerning general trends (do bilaterally symmetric flowers evolve from radially symmetric?)
Chronicle vs. Narrative
Chronicle: how did a trait evolve
Narrative: why did it evolve?
Exaptation
a trait whose evolutionary origin has no relation to its current utility
(wings are this in penguins, no longer used for flying but for swimming)
Steps of testing hypotheses of adaptation on a phylogeny
- infer the tree (parsimony or likelihood)
- score the traits of interest (character states must be homologous, share common ancestor)
- score selective regimes
- reconstruct history of character changes and of selective regimes
- assess current utility relative to ancestral state (and performance) - measure fitness difference, compare performance in focal clade to sister group that has same regime and diff state
(example: are red flowers an adaptation for bird pollination?)
shifts in character and regime are less frequent than branching events
Selective regime
all abiotic and biotic factors that determine how natural selection will act on character variation
Markov models to test categorical trait evolution
Mk2 models
testing the joint evolution of two binary states (character and regime)
- consider all possible combinations of states in a matrix (4x4)
- assume: only one trait can change at a time, transition rates with 2 changes = 0
- have a hypothesis (L1= likelihood of data and tree with hyp) and null (L0)
- hypothesis is supported if the rate is greater in the predicted direction and the likelihood ratio test, 2(logL1 - logL0), is significant
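A sketch of the likelihood ratio test step, assuming the two log-likelihoods have already been obtained from some model-fitting program and that the statistic is compared to a chi-square distribution. The degrees of freedom (difference in number of free rate parameters between the two models) must be supplied by the user; the helper only covers df = 1 or even df, where simple closed forms exist.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability (closed forms for df = 1 or even df)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df > 0 and df % 2 == 0:
        half = x / 2
        terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-half) * terms
    raise ValueError("this sketch only handles df = 1 or even df")


def likelihood_ratio_test(log_l0, log_l1, df):
    """Return (test statistic, p-value) for null L0 against hypothesis L1."""
    stat = 2 * (log_l1 - log_l0)
    return stat, chi2_sf(max(stat, 0.0), df)


# Example with made-up log-likelihoods and an assumed df of 2.
stat, p_value = likelihood_ratio_test(log_l0=-50.0, log_l1=-45.0, df=2)
```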
High diversification comes from
higher rates of speciation and/or lower rates of extinction
2 predictions of competing hypotheses for greater diversity
- the trait confers “species selection” = higher net diversification
- if rate of change to state 1 is greater than the reverse = higher diversity
BISSE
binary-state speciation-extinction model
parameters:
q01 and q10 - rates of state change (0 to 1 and 1 to 0)
lambda is the rate of speciation (one for each state)
mu is the rate of extinction (one for each state)
Phylogenetic covariance
the time (in branch length) during which two taxa have been evolving together
if it is low, the species are fairly independent of each other and have been evolving separately for longer
if high, the species are expected to be similar and can't be treated as independent data points
Brownian motion
a stochastic process (changes occurring randomly through time) where a trait takes lots of small random steps
- clades can act as traits and drift as a group
the traits will land in different places in trait space depending on when the lineages split from each other
if they split early, their positions are variable
if they split late, expected to be near
Phylogenetic independent contrasts
PIC
a statistical tool to address dependence of traits
reduces N obs to N-1 rescaled contrasts
(A*v2 + B*v1) / (v1 + v2) = X1, the new value at the node between A and B, which is the weighted average of the tips (v1 = branch length to A, v2 = branch length to B)
keep doing this process until there is only one X value left.
This X = the estimated ancestral state for the whole tree under Brownian motion
The weights downweight the taxa on longer branches (more time to drift away from the ancestor) so they have less influence
makes the contrasts all have an expected mean of 0
Results in data points that are independent of each other, with the correlation due to shared ancestry removed
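The procedure can be sketched recursively. This assumes a fully bifurcating tree written as nested tuples (left, v_left, right, v_right), with tips given as plain numbers; that representation and the function name are choices made for this example. It also applies the standard extra step of lengthening the branch below each estimated node by v1*v2/(v1 + v2), and it standardizes each contrast by the square root of v1 + v2.

```python
import math


def pic(node):
    """Return (node value, extra branch length, list of standardized contrasts)."""
    if not isinstance(node, tuple):        # a tip: its observed trait value
        return float(node), 0.0, []
    left, v1, right, v2 = node
    x1, extra1, contrasts1 = pic(left)
    x2, extra2, contrasts2 = pic(right)
    v1 += extra1                           # lengthen branches below estimated nodes
    v2 += extra2
    contrast = (x1 - x2) / math.sqrt(v1 + v2)   # standardized contrast
    value = (x1 * v2 + x2 * v1) / (v1 + v2)     # weighted average at the node
    extra = v1 * v2 / (v1 + v2)
    return value, extra, contrasts1 + contrasts2 + [contrast]


# Tree ((A:1, B:1):1, C:2) with trait values A = 1, B = 3, C = 8.
tree = ((1.0, 1.0, 3.0, 1.0), 1.0, 8.0, 2.0)
root_value, _, contrasts = pic(tree)
```

Three tips give two contrasts (N - 1), and the value left at the root is the estimated ancestral state for the whole tree.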
A contrast is…
the difference between two tip states (sister states)
the normalized contrast
corrects for differences in variance between pairs of sister groups that are separated by different amounts of branch length.
it does this by dividing the contrast by the square root of the summed lengths of the branches connecting the two sister taxa (its expected standard deviation).
Under Brownian motion, the contrast is:
expected to be 0, because the two offspring of the ancestor are expected to vary randomly around the same mean
Generalized least squares regressions (GLS)
a method of regression for fitting a model in which data are correlated
it allows
1. unequal variances
2. nonindependence
has two parameters:
a hat = the phylogenetic mean, the ancestral state estimated under brownian
σ2 = phylogenetic variance, rate of character evo under brown
covariance matrix (C) set up
diagonals have variances (root-to-tip branch length for each taxon)
off-diagonals have covariances (shared branch length for each pair of taxa)
also has the effect of downweighting the more covarying data
variance can also be calculated and is normalized by C
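A sketch of estimating the two parameters from C and the tip values, written in plain Python so it needs no extra packages. It assumes C is built from branch lengths as described above and divides by n - 1 for the variance (dividing by n instead gives the maximum-likelihood version). Function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def gls_brownian(C, y):
    """Return (phylogenetic mean a_hat, rate sigma2) under Brownian motion."""
    n = len(y)
    weights = solve(C, [1.0] * n)                    # C^-1 times a vector of ones
    a_hat = sum(w * v for w, v in zip(weights, y)) / sum(weights)
    resid = [v - a_hat for v in y]
    scaled = solve(C, resid)                         # C^-1 times the residuals
    sigma2 = sum(r * s for r, s in zip(resid, scaled)) / (n - 1)
    return a_hat, sigma2


# Tree ((A:1, B:1):1, C:2): A and B share 1 unit of history, C shares none.
C = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 2.0]]
y = [1.0, 3.0, 8.0]
a_hat, sigma2 = gls_brownian(C, y)
```

A and B covary, so together they count for less than two independent observations and the estimate sits closer to C's value than a plain average would.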
You can also use GLS to test
rate-shift hypotheses at different points in the tree
Phylogenetic evidence for viral transfer
being in a common clade or nearby clade
Processes that influence shape of virus phylogenies
- directional selection
- spatial structure and spread dynamics
- changes in population size
Serial samples, “tip-dated” tree
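advantages serially sampled viral data have over ordinary (contemporaneous) data:
1. known dates of samples calibrate clock accurately
2. yields improved estimates of nucleotide substitution rate
samples are not all at the same tip height; some tips sit further back toward the root because they were sampled earlier, which is informative because of the fast generation time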
Tree balance in viruses
unbalanced: recurrent ongoing selection, rapid spatial spread
- selection will drive evolution in a certain clade and cause other clades to drop off
balanced: lack of directional selection
- no preference for certain clades
Spatial structure
structured host pop = phylogenetic clades aligning with geographic position
unstructured = all mixed
Contemporaneous samples, ultrametric tree
ultrametric = all of the tips align and are the same distance from the root
the molecular clock estimates of node ages are increasingly uncertain toward the root
contemporaneous = occurring at the same time
Exponential inc in pop makes a tree with…
shorter trunks and longer tips
coalescence happens faster in a smaller population; the tips are in the most recent, largest population, so they have the longest branches
Constant population size leads to a tree with…
long trunk and short tips
the number of alleles is greatest at the tips and, going back in time, drops to only 2 at the root. with only 2 alleles left, waiting for that one specific pair to coalesce in a big constant population takes longer than waiting for any one of the many possible pairs to coalesce near the tips.
Species diversity
same thing as species richness
the count of how many different species are in an area
Functional diversity
(trait diversity)
how much ecological variation among the species in an area, traits used as a proxy
Phylogenetic diversity
how much phylogenetic history is represented by the species in an area
measured on a tree by the branch length connecting the species you are considering.
the root node may be included or not (likely not), so just connect the species to each other
the PD can vary when species richness is constant because the species can be further apart or closer together on the tree
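A minimal sketch of measuring PD, assuming the tree is stored as a dictionary that maps each node to its parent and the length of the branch above it. That storage format, the node names, and the function name are assumptions made for this example.

```python
def phylogenetic_diversity(parent, species, include_root=False):
    """Sum of branch lengths connecting the given species.

    parent: dict mapping node -> (parent node, length of the branch above node).
    """
    paths = []
    for sp in species:
        branches = set()
        node = sp
        while node in parent:            # walk from the tip up to the root
            branches.add(node)           # a branch is named by its lower node
            node = parent[node][0]
        paths.append(branches)
    used = set().union(*paths)
    if not include_root:
        used -= set.intersection(*paths)  # drop branches shared by every species
    return sum(parent[b][1] for b in used)


# Tree ((A:1, B:1):2, C:3)
parent = {"A": ("n1", 1.0), "B": ("n1", 1.0), "n1": ("root", 2.0), "C": ("root", 3.0)}
pd_close = phylogenetic_diversity(parent, ["A", "B"])   # two close relatives
pd_far = phylogenetic_diversity(parent, ["A", "C"])     # two distant relatives
```

Both communities have a species richness of 2, but the community of distant relatives has the larger PD.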
Phylogenetic clustering
that a community (species in an area or a niche) is filled by species from the same clade
If traits aren’t available for diversity…
use phylogenetic diversity as a proxy
because closely related species tend to be similar in their traits
Phylogenetic overdispersion
that communities are made up of species from different clades
Mean nearest taxon distance (MNTD)
the average distance between each tip in the tree to its closest relative (the whole closest clade)
measured from one species to any of the species in the closest descendant clade
then average all of those distances to get the MNTD
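A sketch of the averaging step, assuming pairwise tree distances between species are already available and that "closest relative" means the nearest other species in the same community. The distance table format and function name are hypothetical.

```python
def mntd(species, dist):
    """Mean nearest taxon distance for a community.

    dist: dict mapping frozenset({sp1, sp2}) -> tree distance between them.
    """
    nearest = []
    for sp in species:
        others = [o for o in species if o != sp]
        nearest.append(min(dist[frozenset((sp, o))] for o in others))
    return sum(nearest) / len(nearest)


# Distances from the tree ((A:1, B:1):2, C:3)
dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
}
community_mntd = mntd(["A", "B", "C"], dist)   # nearest distances are 2, 2, 6
```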
Evolutionary distinctiveness (ED)
how different the species is from everything else, more isolated = higher ED
for a single branch: ED = length of branch divided by the number of descendant species coming off of that branch
for a species: = the sum of the branch ED scores along the path from the tip to the root
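The two-step ED calculation can be sketched as below, assuming the tree is stored as a dictionary mapping each node to its parent and the length of the branch above it. The storage format and names are illustrative only.

```python
def evolutionary_distinctiveness(parent, tips):
    """ED per species: each branch length is split among its descendant species."""
    n_desc = {}
    for tip in tips:                      # count descendant species per branch
        node = tip
        while node in parent:
            n_desc[node] = n_desc.get(node, 0) + 1
            node = parent[node][0]
    ed = {}
    for tip in tips:                      # sum branch shares from tip to root
        total = 0.0
        node = tip
        while node in parent:
            total += parent[node][1] / n_desc[node]
            node = parent[node][0]
        ed[tip] = total
    return ed


# Tree ((A:1, B:1):2, C:3)
parent = {"A": ("n1", 1.0), "B": ("n1", 1.0), "n1": ("root", 2.0), "C": ("root", 3.0)}
ed = evolutionary_distinctiveness(parent, ["A", "B", "C"])
```

C sits alone on a long branch, so it gets the highest ED; the ED scores of all species add up to the total branch length of the tree.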
Global extinction risk
GE
Made by the IUCN, ranking species on their extinction risk based on a few factors
EDGE
ED + GE
takes into account the uniqueness of the species and the global extinction risk to make one score that can be used for protection
Traits tracking habitat or phylogeny more
traits tracking habitat more:
-clustered on the phylogeny by their traits
-more convergent evolution, traits came from different ancestors
traits tracking phylo more:
- overdispersed by trait
-evolution is conserved, from common ancestors
Phylogenetic clustering or overdispersion matrix
conserved, clustering of traits : phylogenetic clustering
convergent, clustering: phylogenetic overdispersion
conserved, overdispersion: phylo overdispersion
convergent, overdispersion: phylo clustering or random dispersion
Baum’s species criteria
- biological significance
- utility
- predictive power
- robustness
- precedent
semi-subjective because these criteria might conflict with each other