Measures of Diversity Flashcards
What different ways can you measure diversity - statistics?
- Allelic diversity
- Observed and expected heterozygosity
- Measures of identity - Nei’s gene diversity
- Measures of nucleotide diversity - population mutation parameter (theta), no. segregating sites, nucleotide diversity
- Other - mismatch distribution, site frequency spectrum
How do you measure allelic diversity?
Mean number of alleles per locus
How do you measure observed and expected heterozygosity?
Observed heterozygosity (Ho): the observed proportion of heterozygous genotypes at a locus - can be measured at one locus or averaged across many
Expected heterozygosity (He): the heterozygosity expected at a locus under Hardy-Weinberg (HW) equilibrium: He = 1 - Σ(pi^2), where pi = frequency of allele ‘i’
- If in HW equilibrium - observed and expected heterozygosity should be the same/very close
- If very different - we can infer things about the population - e.g., a deficit of heterozygotes can indicate inbreeding or population substructure
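A minimal sketch of the two heterozygosity measures for a single locus. The genotype list and the function names are made up for illustration, and allele frequencies are estimated directly from the sample.

```python
from collections import Counter

def observed_heterozygosity(genotypes):
    """Proportion of individuals carrying two different alleles at the locus."""
    het = sum(1 for a, b in genotypes if a != b)
    return het / len(genotypes)

def expected_heterozygosity(genotypes):
    """He = 1 - sum(p_i^2), with p_i estimated from the sampled allele copies."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    total = len(alleles)
    counts = Counter(alleles)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Hypothetical sample: 10 diploid individuals genotyped at one locus
genotypes = [("A", "A")] * 4 + [("A", "a")] * 2 + [("a", "a")] * 4
ho = observed_heterozygosity(genotypes)
he = expected_heterozygosity(genotypes)
print(f"Ho = {ho:.2f}, He = {he:.2f}")  # Ho = 0.20, He = 0.50
```

Here Ho is well below He, i.e. there are fewer heterozygotes than HW equilibrium predicts for these allele frequencies.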
Explain Nei’s gene diversity
Nei’s diversity, H, is the probability that two alleles drawn at random from the population will be different from each other
- Can be used for diploids and haploids
- Diploids - heterozygosity, scaled by sample size
- Haploids - virtual heterozygosity
- H = [n/(n - 1)] x (1 - Σ(xi^2))
- n = sample size (number of alleles)
- xi = frequency of the ith allele
- Uses haplotypes - good for mitochondrial data
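A short sketch of Nei's gene diversity computed from a list of sampled haplotypes, following the formula on this card. The haplotype labels (H1, H2, H3) and the function name are hypothetical.

```python
from collections import Counter

def nei_gene_diversity(haplotypes):
    """H = n/(n-1) * (1 - sum(x_i^2)), where x_i is the frequency of haplotype i."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled haplotypes")
    freqs = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1.0 - sum(x * x for x in freqs))

# Hypothetical mitochondrial haplotypes from six individuals
haplotypes = ["H1", "H1", "H1", "H2", "H2", "H3"]
h = nei_gene_diversity(haplotypes)
print(f"H = {h:.3f}")  # H = 0.733
```

The n/(n-1) factor is the sample-size correction: without it, small samples would underestimate the chance that two randomly drawn alleles differ.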
Explain what population mutation parameter (theta) is
- Theta = population mutation parameter = mutation-drift parameter = neutral parameter
- Theta is the expected value of diversity under the neutral model
- Theta = 4Nu in diploids (N = effective population size, u = mutation rate per generation)
- Can be estimated via different ways and allows hypothesis testing
- Can be basis for selection tests - e.g., Tajima’s D
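As an illustration of a theta-based test, the sketch below computes Tajima's D using the standard constants from Tajima (1989). It takes three summary values as input: the sample size n, the number of segregating sites S, and the mean number of pairwise differences. The function and argument names are my own, and the example inputs are hypothetical.

```python
import math

def tajimas_d(n, num_segregating, mean_pairwise_diff):
    """Tajima's D: scaled difference between theta from pairwise differences and theta from S."""
    if n < 4 or num_segregating < 1:
        # The variance term is zero for n < 4, so D is undefined there
        raise ValueError("need at least 4 sequences and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = num_segregating
    theta_from_s = s / a1  # estimate of theta based on segregating sites
    variance = e1 * s + e2 * s * (s - 1)
    return (mean_pairwise_diff - theta_from_s) / math.sqrt(variance)

# Hypothetical inputs: 10 sequences, 10 segregating sites, 2.0 mean pairwise differences
print(tajimas_d(10, 10, 2.0))
```

D is near zero when the two estimates of theta agree, negative when there is an excess of rare variants, and positive when intermediate-frequency variants dominate.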
What is a feature of the assumptions of different theta estimators?
- They all have different assumptions
- But should give same result under neutrality
In what different ways can the population mutation parameter (theta) be estimated?
- Number of segregating sites (S)
- Site frequency spectrum
- Number of singletons (n)
- Mismatch distribution
- Mean number of pairwise differences (PI)
Explain ‘Number of segregating sites’
Number of segregating sites is the number of nucleotide positions that vary within a set of DNA sequences - written Sn (where n is the sample size)
- Weak measure because it depends on the length of the sequence
- Can be converted into proportion of segregating sites - which is less dependent on the length of sequence but still depends on sample size
- Can be used with coalescent theory - Tajima’s test compares the theta estimated from S with the one estimated from heterozygosity (mean pairwise differences)
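A sketch of counting segregating sites in a small alignment and converting the count to a theta estimate. The conversion divides S by the harmonic sum 1 + 1/2 + ... + 1/(n-1), which is Watterson's estimator; the card does not name it, so treat that choice as my assumption. The sequences are made up and assumed to be aligned and of equal length.

```python
def segregating_sites(sequences):
    """Count alignment columns that contain more than one nucleotide."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for column in zip(*sequences) if len(set(column)) > 1)

def watterson_theta(sequences):
    """Theta estimated from S, corrected for sample size n (per sequence, not per site)."""
    n = len(sequences)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(sequences) / a_n

# Hypothetical alignment of four sequences, 10 bp each
sequences = [
    "ATGCATGCAT",
    "ATGCATGCAT",
    "ATGAATGCAT",
    "ATGAATGCTT",
]
s = segregating_sites(sequences)
print("S =", s)                                   # S = 2
print("S per site =", s / len(sequences[0]))      # S per site = 0.2
print("theta (from S) =", round(watterson_theta(sequences), 3))  # 1.091
```

Dividing by the harmonic sum is what removes most of the dependence on sample size that the raw count and the proportion of segregating sites still have.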
Explain Nucleotide Diversity
The average proportion of nucleotide differences between all possible pairs of sequences in the sample
- Analogous to Nei’s gene diversity measure - but is applied to nucleotide polymorphisms
- Sample size independent
- If nucleotides evolve under neutrality, PI should be the same as theta
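A minimal sketch of nucleotide diversity (PI) as the average proportion of differing sites over all pairs of sequences. The alignment is hypothetical and assumed to be gap-free and of equal length.

```python
from itertools import combinations

def pairwise_differences(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))

def nucleotide_diversity(sequences):
    """Mean pairwise differences divided by sequence length (per-site PI)."""
    length = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / (len(pairs) * length)

# Hypothetical alignment of four sequences, 10 bp each
sequences = [
    "ATGCATGCAT",
    "ATGCATGCAT",
    "ATGAATGCAT",
    "ATGAATGCTT",
]
pi = nucleotide_diversity(sequences)
print(f"PI = {pi:.4f}")  # PI = 0.1167
```

The six pairs differ at 0, 1, 2, 1, 2 and 1 sites, so the mean is 7/6 differences per pair, or 7/60 per site.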
Explain the Mismatch Distribution
Is the distribution of pairwise differences (histogram)
- Requires discrete differences in the data
- Shape of mismatch distribution is very helpful - indicative of the history of the population - e.g., if a particular number of differences dominates, lots of sequences coalesced at the same time point
- Plotted as a histogram: the proportion of sequence pairs (y-axis) that have each number of nucleotide differences between them (x-axis)
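A sketch of building a mismatch distribution from an alignment: count the differences for every pair of sequences, then convert the counts to proportions. The alignment and function name are hypothetical, and a text histogram stands in for a plot.

```python
from collections import Counter
from itertools import combinations

def mismatch_distribution(sequences):
    """Return a list where entry k is the proportion of pairs differing at k sites."""
    diffs = [
        sum(a != b for a, b in zip(seq_a, seq_b))
        for seq_a, seq_b in combinations(sequences, 2)
    ]
    counts = Counter(diffs)
    return [counts.get(k, 0) / len(diffs) for k in range(max(diffs) + 1)]

# Hypothetical alignment of four sequences, 10 bp each
sequences = [
    "ATGCATGCAT",
    "ATGCATGCAT",
    "ATGAATGCAT",
    "ATGAATGCTT",
]
dist = mismatch_distribution(sequences)
for k, freq in enumerate(dist):
    # One '#' per 1/30 of the pairs, purely for display
    print(f"{k} differences: {'#' * round(freq * 30)} {freq:.2f}")
```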
What are the different shapes of the mismatch distribution and what do they imply about the population’s history?
- Constant sized population: has ‘spikey’ mismatch distribution - high raggedness statistic (r) + long internal branches on trees - spikes mean many sequences coalesce at the same time point
- Expanding population: characterised by a smooth, unimodal mismatch distribution and low raggedness score + short internal branches on tree - because variation occurs after expansion so variable sites are distributed on the terminal ‘tip’ branches - so most pairs of sequences will have a similar number of differences between them
- Curve moves further right as population expands
- e.g., recent expansion = not been a lot of time for new differences to accumulate - so sequences differ by only one or two nucleotides
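A sketch of a raggedness score for a mismatch distribution. It assumes that r refers to Harpending's raggedness index in its commonly given form: the sum of squared differences between the frequencies of adjacent mismatch classes, with a zero class added after the largest observed class. Both example distributions are made up.

```python
def raggedness(freqs):
    """Sum of squared differences between adjacent mismatch-class frequencies."""
    padded = list(freqs) + [0.0]  # class d+1 has frequency zero
    return sum((padded[i] - padded[i - 1]) ** 2 for i in range(1, len(padded)))

# Hypothetical distributions: entry k = proportion of pairs with k differences
smooth_unimodal = [0.05, 0.20, 0.50, 0.20, 0.05]  # expansion-like shape
spiky = [0.30, 0.00, 0.40, 0.00, 0.30]            # constant-size-like shape

print("smooth:", round(raggedness(smooth_unimodal), 4))
print("spiky: ", round(raggedness(spiky), 4))
```

The spiky distribution scores higher because its frequencies jump up and down between neighbouring classes, whereas the smooth one changes gradually.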
How do nucleotide diversity and coalescent theory link mathematically?
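- T x u - expected number of mutations on a specific branch of length T (approximately the probability of a mutation when T x u is small)
- (T1 + T2 + T3 … ) x u - expected total number of mutations in the tree
- 4N x u = 4Nu - 4N is the average branch length (in generations) separating two twigs, so 4Nu is the expected number of pairwise differences = PI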
- Under neutral model 4Nu = PI = theta
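The steps above can be written out as a short derivation. This is a sketch under the standard neutral coalescent for a diploid population, where N is the (effective) population size, u is the mutation rate per sequence per generation, and T_2 (my notation) is the time in generations for two sampled sequences to coalesce.

```latex
\begin{align}
E[T_2] &= 2N
  && \text{expected coalescence time of two sequences} \\
E[\text{branch length between two tips}] &= 2\,E[T_2] = 4N
  && \text{both lineages run back to the common ancestor} \\
E[\pi] &= u \times 4N = 4Nu = \theta
  && \text{mutations accumulate at rate } u \text{ along that path}
\end{align}
```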
Explain Site Frequency Spectrum
Compares the number of times each variant at a segregating site is present within a set of sequences
- Looking at the relative occurrence of variants with a particular frequency
- Some variants appear just once (singletons), others are present in multiple sequences
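A sketch of a folded site frequency spectrum, which counts the rarer allele at each segregating site. It uses the folded form because the ancestral state is assumed to be unknown here. Sites with more than two alleles are skipped, which is a simplification of mine, and the alignment is hypothetical.

```python
from collections import Counter

def folded_sfs(sequences):
    """Entry k = number of biallelic sites whose minor allele appears in k sequences."""
    n = len(sequences)
    sfs = [0] * (n // 2 + 1)  # the minor allele count can be at most n // 2
    for column in zip(*sequences):
        counts = Counter(column)
        if len(counts) == 2:  # biallelic segregating site
            sfs[min(counts.values())] += 1
    return sfs

# Hypothetical alignment of four sequences, 10 bp each
sequences = [
    "ATGCATGCAT",
    "ATGCATGCAT",
    "ATGAATGCAT",
    "ATGAATGCTT",
]
sfs = folded_sfs(sequences)
print("folded SFS:", sfs)       # folded SFS: [0, 1, 1]
print("singletons:", sfs[1])    # singletons: 1
```

Entry 0 is always zero (a site with no minor allele is not segregating), and entry 1 is the singleton count.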
How does the site frequency spectrum differ depending on the population?
- Recent growth/ population expansion or positive selection - variants arose recently so most sites are present at low frequencies - excess of rare variants - e.g., positive selection (selective sweep) can mimic what may happen with population expansion (potentially after a founder effect for example) - need to be able to differentiate
- Balancing selection or genetic subdivision - causes an excess of more frequent (intermediate-frequency) variants and a lack of low-frequency variants
- Example of how demographic and selection processes can lead to the same effects - need to think about how we can differentiate between the difference in cause for these effects
How can you differentiate between the effects of demographic and selection processes?
- Demographic processes - e.g., pop size - will affect all loci across the genome
- Whereas selection - will only affect a smaller number of loci - so can test for this
- So can calculate the number of singletons - mutations that are present only on the terminal branches
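One way to illustrate the logic: compute a summary statistic at many loci, read the genome-wide average as the demographic signal, and treat loci that deviate strongly from it as candidates for selection. The sketch below uses a simple z-score cutoff. The per-locus Tajima's D values, the cutoff of 2 and the function name are all hypothetical, and this is not a formal test.

```python
from statistics import mean, stdev

def flag_outlier_loci(stat_by_locus, z_cutoff=2.0):
    """Return loci whose statistic lies more than z_cutoff SDs from the mean across loci."""
    values = list(stat_by_locus.values())
    mu, sd = mean(values), stdev(values)
    return [
        locus for locus, value in stat_by_locus.items()
        if abs(value - mu) / sd > z_cutoff
    ]

# Hypothetical per-locus Tajima's D values: most loci share a negative shift
# (as expected from a demographic event), one locus stands apart
tajima_d_by_locus = {
    "locus1": -1.0, "locus2": -0.9, "locus3": -1.1, "locus4": -1.0,
    "locus5": -0.8, "locus6": -1.2, "locus7": -1.0, "locus8": -0.9,
    "locus9": -1.1, "locus10": 1.5,
}
outliers = flag_outlier_loci(tajima_d_by_locus)
print("candidate loci under selection:", outliers)
```

A shift shared by nearly all loci points to demography, while a single locus that departs from the rest is the pattern expected under selection.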