Module 3 Flashcards
What is Principal Component Analysis (PCA)?
- Simplify a large data set into a smaller set while still maintaining significant patterns and trends
- Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables
- These combinations are done in such a way that the principal components are uncorrelated and most of the information within the initial variables is compressed into the first components.
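A minimal sketch of the idea above, assuming NumPy and a small made-up data matrix (the function name pca and the toy data are illustrative only). It centers the variables, takes the SVD, and projects onto the leading components, so the resulting scores are uncorrelated and the first component carries the most variance.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each variable
    # SVD of the centered data: rows of Vt are the principal axes,
    # i.e. the weights of the linear combinations of the initial variables
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T              # new, uncorrelated variables
    explained = (S ** 2) / np.sum(S ** 2)   # fraction of variance per component
    return scores, components, explained[:n_components]

# Toy data (hypothetical): 100 samples, 5 variables, two of them correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 3))])

scores, components, explained = pca(X, n_components=2)
```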
PCA vs population genetic modeling
- PCA is a descriptive tool
- Statistical modelling allows us to tease apart how different processes are shaping the data
- Ex. recombination variation effects, DFE, etc.
Aspects of population history
Population sizes through time, Changes in population size, Population splits, Migration
Why demographic inference is important
- A large fraction of mutations are effectively neutral and hence evolve under genetic drift
- The majority of newly arising mutations that affect fitness are deleterious
- Natural populations have undergone complex demographic histories. The combined effects of population size changes, structure, and migration all shape patterns of within-species variation
- The efficacy of both mutation and recombination is mediated by the effective population size.
Coalescent theory
- Considers the genealogical history of genes in populations
- Uses DNA sequence data to make inferences about population size, genetic structure, and evolutionary processes
- Coalescent processes are backward in time
- Analytical approximation of neutral processes, thus extremely fast for simulation purposes.
Coalescent processes
- Coalescent events happen rapidly when there are many lineages
- Coalescent events happen much more slowly when there are few lineages
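A short sketch of why this is so, assuming the standard Kingman coalescent with time measured in units of 2N generations (an assumption, not stated in these notes). With k lineages there are k(k-1)/2 pairs that can coalesce, so the expected waiting time is 2/(k(k-1)): short when k is large, long when k is small.

```python
def expected_coalescent_times(n):
    """Expected waiting time (in units of 2N generations) while k lineages
    remain, for k = n down to 2, under the standard Kingman coalescent."""
    return {k: 2.0 / (k * (k - 1)) for k in range(n, 1, -1)}

times = expected_coalescent_times(10)

# Expected time to the most recent common ancestor is the sum of all intervals
tmrca = sum(times.values())

for k, t in times.items():
    print(f"{k:2d} lineages: expected wait {t:.4f}")
```

In this sketch the final interval (2 lineages) accounts for more than half of the expected total tree height.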
Coalescent vs diffusion approximation
- Diffusion approximation tracks allele frequency changes through time
- Coalescent theory focuses on tracing the genealogical history of sampled gene lineages backward in time
An example demographic inference pipeline
STEP 1: Mask coding regions, and regions linked to coding regions
STEP 2: Identify the number of populations in your dataset, via software such as STRUCTURE or ADMIXTURE
STEP 3: Identify a set of demographic models to use for inference
STEP 4: Run dadi to infer which demographic model fits our observed SFS best
STEP 5: Run dadi to infer the best-fitting parameters for our best-fitting model (this time assessing log-likelihood only)
STEP 6: Simulate the best-fitting model with best-fitting parameters using a coalescent-based simulator and compare the fit of the simulated SFS with the observed SFS
STEP 7: Assess how realistic the model and parameters are in the context of the population in question
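To make the "fit of an expected SFS to the observed SFS" idea in STEPS 4-6 concrete, here is a rough sketch in plain NumPy. It is not dadi's API: the function names and the observed counts are hypothetical, and the expected SFS used is the textbook constant-size neutral expectation (proportional to 1/i) rather than a fitted demographic model.

```python
import numpy as np

def neutral_sfs_proportions(n):
    """Expected unfolded SFS proportions for n sampled chromosomes under the
    standard neutral, constant-size model: class i is proportional to 1/i."""
    i = np.arange(1, n)
    w = 1.0 / i
    return w / w.sum()

def multinomial_loglik(observed, expected_props):
    """Multinomial log-likelihood of the observed SFS counts, dropping the
    constant term that does not depend on the model."""
    observed = np.asarray(observed, dtype=float)
    return float(np.sum(observed * np.log(expected_props)))

# Hypothetical observed SFS for n = 10 chromosomes (singletons, doubletons, ...)
observed_sfs = np.array([520, 250, 170, 130, 100, 85, 75, 65, 58])

props = neutral_sfs_proportions(10)
ll_neutral = multinomial_loglik(observed_sfs, props)

# Upper bound: a "saturated" model that matches the observed proportions exactly
ll_saturated = multinomial_loglik(observed_sfs, observed_sfs / observed_sfs.sum())
```

The gap between ll_saturated and ll_neutral gives a feel for how far a model's expected SFS is from the data; in a real analysis the expected SFS would come from the demographic model being tested.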
Assessing the best fit, using the log-likelihood and the Akaike information criterion (AIC)
The log-likelihood tells us how probable the observed data are under a given model and its parameters (higher means a better fit)
The AIC assesses the relative amount of information lost by a given model…
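For reference, the standard definition is AIC = 2k - 2 ln(L), where k is the number of free parameters and ln(L) is the maximized log-likelihood; lower AIC is preferred. The sketch below applies it to three made-up models (the names and log-likelihood values are hypothetical).

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical (log-likelihood, number of parameters) for three demographic models
models = {
    "constant_size": (-1250.0, 1),
    "two_epoch": (-1200.0, 2),
    "three_epoch": (-1199.5, 4),
}

aic_scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
best_model = min(aic_scores, key=aic_scores.get)
```

Here the three-epoch model has the highest log-likelihood, but its extra parameters are penalized, so the two-epoch model has the lowest AIC.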
FSC2 - fastsimcoal2
Demographic inference
Likelihood-based
SFS-based
Compares the likelihoods of a variety of models to assess which model best fits the data
The challenges of ancient DNA (aDNA) analysis
- Deamination of cytosine (i.e., it causes ‘artificial’ C-to-T transitions in the DNA)
- Most of the DNA is not from the sample you want (e.g., it is rather from microbes that colonized the bone sample after death)
- Human contamination (anthropologists, lab techs, etc.)
The primary result here is that Neanderthals are more closely related to…
Modern non-African populations (i.e., non-African populations are less diverged from Neanderthal)
Recurrent Positive Selection
- Multiple selective sweeps
- Using divergence data
dN/dS
- Ratio of the nonsynonymous substitution rate (dN, per nonsynonymous site) to the synonymous substitution rate (dS, per synonymous site)
- A dN/dS value >1 is evidence for recurrent positive selection/multiple selective sweeps.
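A bare-bones sketch of the ratio, assuming the substitution counts and site counts have already been obtained from a codon alignment (the numbers below are made up, and no correction for multiple hits at the same site is applied, which dedicated tools do handle).

```python
def dn_ds(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Uncorrected dN/dS: nonsynonymous substitutions per nonsynonymous site
    divided by synonymous substitutions per synonymous site."""
    dn = nonsyn_subs / nonsyn_sites
    ds = syn_subs / syn_sites
    if ds == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn / ds

# Hypothetical counts for one gene
omega = dn_ds(nonsyn_subs=30, nonsyn_sites=700.0, syn_subs=10, syn_sites=300.0)
print(f"dN/dS = {omega:.3f}")
```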
Recurrent positive selection things to worry about and how to check
- Check quality of sequence alignment
- Relaxed constraint
- Check alignment
- Check for premature stop codons
- Check for duplicates
Incomplete Selective Sweep
A sweep that hasn’t yet reached fixation in a population
Patterns of variation
- High LD
- Mutations at intermediate frequency
- Long haplotypes
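One common way to quantify LD between two SNPs is r squared. The sketch below assumes phased haplotypes coded as 0/1 and uses small made-up SNP vectors (the function name ld_r2 is illustrative).

```python
import numpy as np

def ld_r2(hap_a, hap_b):
    """r^2 between two biallelic SNPs, given phased 0/1 alleles per haplotype."""
    a = np.asarray(hap_a, dtype=float)
    b = np.asarray(hap_b, dtype=float)
    p_a, p_b = a.mean(), b.mean()      # derived-allele frequencies
    p_ab = (a * b).mean()              # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b               # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    return d ** 2 / denom

# Hypothetical phased data for 8 haplotypes
snp1 = [1, 1, 1, 1, 0, 0, 0, 0]
snp2 = [1, 1, 1, 1, 0, 0, 0, 0]   # always co-occurs with snp1
snp3 = [1, 0, 1, 0, 1, 0, 1, 0]   # independent of snp1

r2_linked = ld_r2(snp1, snp2)
r2_unlinked = ld_r2(snp1, snp3)
```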
Incomplete selective sweep things to worry about and how to check
- Bottleneck
- Population structure
- Low recombination could result in high LD/long haplotypes
- It could be balancing selection
- SNP ascertainment
- Meiotic drive
- Estimate a population history from neutral data
- Estimate a recombination map
Complete selective sweep things to worry about
- Bottleneck
- Structure
- Allele surfing or low recombination
Allele surfing
- Alleles that happen to be on the “edge” of an expanding range are more likely to be passed to the next generation of the expanding population as it “expands westward” (or spreads in one direction). An allele that happened by chance to be prominent at the edge is “surfing” through the generations
- “New colonization” type of model; doesn’t work very well in colonized populations where there could be admixing
- Recurrent bottlenecks in one spatial direction
Balancing selection
- Selection to maintain variation
- Selects to keep variation instead of eliminating it
- Heterozygote advantage
- Temporally-varying selection
- Frequency-dependent selection
- The individuals with the rarest allele have a fitness advantage
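As an illustration of how heterozygote advantage maintains variation, here is a small deterministic sketch. It assumes the textbook one-locus model with genotype fitnesses AA = 1 - s, Aa = 1, aa = 1 - t (the s and t values are made up); under that model the allele frequency settles at t / (s + t) instead of going to fixation or loss.

```python
def next_freq(p, s, t):
    """One generation of selection with fitnesses AA = 1-s, Aa = 1, aa = 1-t."""
    q = 1 - p
    w_bar = p * p * (1 - s) + 2 * p * q + q * q * (1 - t)  # mean fitness
    return p * (p * (1 - s) + q) / w_bar

def iterate(p0, s, t, generations):
    """Iterate the allele frequency of A for a number of generations."""
    p = p0
    for _ in range(generations):
        p = next_freq(p, s, t)
    return p

s, t = 0.1, 0.3                 # hypothetical selection coefficients
p_equilibrium = t / (s + t)     # predicted stable frequency of A
p_from_rare = iterate(0.01, s, t, generations=2000)
p_from_common = iterate(0.99, s, t, generations=2000)
```

Starting from either a rare or a common A allele, the frequency converges to the same intermediate value.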
Soft Selective Sweep
- At the end of the sweep, there are at least two haplotypes
- The sweep has generated variation rather than reduced it
- 2+ haplotypes possible
Selection On Common Standing Variation model
- A neutral/nearly neutral allele became beneficial
- Possible in humans
- Considered by those who think soft selective sweeps are frequent
Selection On Recurrent Identical Beneficial Mutations model
- Two separate mutation events happening at the same site
- Very unlikely in humans
- Present in organisms with very high mutation rates (like HIV)