Genomics and Evolution Flashcards
Chromosome number changes via which two mechanisms?
Fusion (reduction)- the Indian muntjac, for example, has just 6 (female) or 7 (male) chromosomes due to fusion,
Fission (increase).
Chromosome structure changes via which mechanisms?
Inversions,
Translocations,
Segmental duplications.
What is a ‘pseudoautosomal region’?
A small region of homology between the sex chromosomes. Humans have two, one at each end of the X and Y chromosomes.
Platypus has 5 pairs of X and Y chromosomes, how does such an arrangement emerge?
Translocation of regions between sex chromosome and autosomes, this creates a small region of the sex chromosome which is homologous to an autosome and vice versa. During meiosis, these regions chain together, causing chromosomes to segregate as a group.
How do sex chromosomes typically evolve?
From a pair of autosomes that acquire a sex-determining gene. Recombination is suppressed in the region surrounding the gene in the heterogametic sex. The non-recombining region can expand by inversions, resulting in nearly the entire Y (or W) chromosome becoming non-recombining.
What is the impact of non-recombination on the Y chromosome?
Rapid (almost instantaneous in an evolutionary sense) degeneration and gene loss, with only a few indispensable genes remaining functional.
What do the loss of the ability to synthesise vitamin C in primates and the loss of teeth in birds have in common?
Both represent the general tendency for genes which become unnecessary to be lost.
Unnecessary genes become lost or defunctionalised all the time, but where do new genes come from?
Exon shuffling- exons are recombined into genome at new positions,
Gene duplication,
Retroposition- genes are reverse transcribed into new positions,
gene fusion/fission,
De novo origination.
What is alternative splicing?
When the same pre-mRNA is spliced in different ways, producing different mature mRNAs and therefore different proteins from a single gene.
How did alternative splicing emerge?
Alternative splicing is thought to be a by-product of splicing noise – imperfect or incorrect splicing that occasionally occurs.
What are the two main theories of intron evolution?
Introns early- evolution of introns in RNA-world, and gradually lost in prokaryotes.
Introns late- introns evolved in the ancestor of eukaryotes.
What is the 2R hypothesis?
Vertebrates originated following two rounds of whole genome duplication
How does colour vision in primates illustrate evolution by gene duplication and sub-functionalisation?
Trichromatic colour vision in primates evolved as a result of gene duplication: the L-gene (for Long wavelength) was duplicated and the resulting copies diverged slightly, giving the L- and M-genes (M for Medium wavelength).
What is the C-value paradox?
The observation that larger genomes do not correlate with higher complexity in eukaryotes, as they do in viruses and prokaryotes.
How is the C-value paradox resolved?
The high abundance of non-coding DNA in eukaryotic genomes.
How can genome size be determined for extinct animals such as dinosaurs?
The size of pores inside the bones – the larger the genome, the larger the cell and therefore the larger the pore.
How does rate of DNA loss impact genome size?
The frequency and size of deletions occurring in the genome determine how efficiently the genome is downsized. For example, one study demonstrated that the half-life of a piece of junk DNA (e.g. a pseudogene, i.e. a broken gene) in Drosophila is only 14 million years, while in the cricket Laupala it is over half a billion years; effectively, junk DNA is never removed from the Laupala genome. This results in hugely different genome sizes.
Why is mtDNA useful for phylogenetic reconstructions in humans?
mtDNA mutates more frequently than nuclear DNA and it does not recombine.
When was the mitochondrial “Eve” likely to have lived?
Molecular clock suggests mitochondrial Eve lived somewhere in Africa ~170,000 years ago.
When are humans likely to have first migrated out of Africa?
~75 kya (thousand years ago)
How can we distinguish between migration and selective sweeps when using mtDNA to understand human origins?
It is impossible using only mtDNA. So it is important to look at other parts of the genome unlinked to mtDNA to reconstruct an unbiased picture of human pre-history.
Why is it challenging to use nuclear autosomal genes when reconstructing the history of human populations?
They recombine, which makes them poorly suited to phylogeny reconstruction. Instead, principal component analysis (PCA) is used to analyse polymorphism in autosomal DNA sequences.
Why does analysis of autosomal genes provide a more robust history of human populations?
Recombination leads to independence of evolutionary histories of different genes, as such, analysis of recombining nuclear autosomal genes provides a more complete picture compared to non-recombining markers.
How are traditional land inheritance practices reflected in human evolutionary genetics?
Comparison of mtDNA (mother-to-daughter) and Y-linked markers (father-to-son) reveals a lot more isolation by distance for the Y-linked markers, indicating much lower mobility of men compared to women, reflecting daughters moving to marry into different families.
What point serves as the limit on how far back in time a phylogeny can look?
In any phylogeny one can go back in time until the most recent common ancestor (MRCA) is reached, but no deeper than that. The MRCA is effectively a ‘horizon’ for evolutionary genetic inference: one cannot look beyond it, as no information is present about older lineages.
What evidence is there of human-Neanderthal interbreeding?
There is no evidence for interbreeding in mtDNA. However, once the nuclear Neanderthal genome was sequenced, it was estimated that ~4% of our genes are of Neanderthal origin.
Where did human-neanderthal interbreeding take place?
The signal of hybridisation between humans and Neanderthals was found only in Europeans and not in Africans, which makes sense given Neanderthals lived in Europe and were absent in Africa.
How did differences in human skin pigmentation provide adaptive advantages to humans in different environmental conditions?
There is still no clarity in what exactly is advantageous in having lighter or darker skin. It is thought that darker skin reduces photolysis of folic acid in high-UV environment, while lighter skin helps production of vitamin D3.
What are two key signatures of selection?
- Loss of genetic variation (polymorphism) around the target of selection.
- New mutations that accumulate in the region of low diversity all start at very low frequency (1/population size), so after a sweep the genetic diversity in the region is likely to consist of polymorphisms at unusually low frequency. This can be detected by several statistics, the most common of which is Tajima’s D.
What determines the size of the low variation region around the locus of a selective sweep?
The size of the region affected by the sweep depends on:
- local recombination rate- greater recombination results in a shorter region of low variation.
- Speed of sweep- slow sweeps allow for more mutation and recombination during the course of the sweep, resulting in a shorter low variation region.
How can population differentiation (the degree to which populations are subdivided) be quantified?
Population differentiation can be quantified using Fst statistic (=[Ht – Hs]/Ht, where Ht is total heterozygosity across all populations and Hs is average heterozygosity within populations).
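As an illustration of the formula above, here is a minimal Python sketch. It assumes equally sized subpopulations and takes Ht as the heterozygosity of the pooled population (computed from the mean allele frequencies); the function names are illustrative only.

```python
def heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def fst(subpop_freqs):
    """Fst = (Ht - Hs) / Ht, assuming equally sized subpopulations.

    subpop_freqs: one list of allele frequencies per subpopulation,
    with alleles listed in the same order in every subpopulation.
    Undefined (division by zero) if the pooled population has no variation.
    """
    n = len(subpop_freqs)
    # Hs: average heterozygosity within subpopulations
    hs = sum(heterozygosity(f) for f in subpop_freqs) / n
    # Ht: heterozygosity of the pooled population (mean allele frequencies)
    n_alleles = len(subpop_freqs[0])
    mean_freqs = [sum(f[i] for f in subpop_freqs) / n for i in range(n_alleles)]
    ht = heterozygosity(mean_freqs)
    return (ht - hs) / ht


# Two demes fixed for different alleles: all diversity is between demes
print(fst([[1.0, 0.0], [0.0, 1.0]]))
# Two demes with identical frequencies: no differentiation
print(fst([[0.5, 0.5], [0.5, 0.5]]))
```

The first call gives an Fst of 1 and the second an Fst of 0, the two extremes of the statistic.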
What are genetic markers?
Genome regions (from single nucleotides to whole chromosomes) that are useful for measuring and investigating genetic variation in populations.
How have genetic markers used by population geneticists changed over the course of the field’s history?
The quantity and resolution of genetic markers has improved (exponentially) with the development of genetics:
- Proteins:
i. blood groups (1900),
ii. allozymes (electrophoretically-distinct proteins; 1966)
- DNA (from 1970):
i. Sequence variations (SNPs, insertions/deletions of nucleotides)
ii. Structural variations (gene duplications/losses, chromosomal arrangements)
iii. Ever-increasing array of techniques to analyse DNA: PCR, gel electrophoresis, sequencing technologies
When is a population considered polymorphic at a specific genetic locus?
If more than one allele is commonly found (typically at a frequency >1-5%) at that locus
How is the proportion of variable or “segregating” sites defined in population genetics?
What is ‘h’ in this equation?
The equation is h = 1 – Σ(p_i^2), i.e. h = 1 – the sum of the squared frequencies of all alleles, where p_i is the frequency of allele i
Heterozygosity (h) is the fraction of individuals in a population that are expected to be heterozygous
h is equivalent to the probability that any two alleles randomly sampled from the population are different. It is greatest when there are many alleles, all at equal frequency
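A short Python sketch of this calculation, assuming the input is simply a sample of allele labels (the function name is illustrative):

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """h = 1 - sum of squared allele frequencies, from a sample of alleles."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# One allele only: no two sampled alleles can differ
print(expected_heterozygosity("AAAA"))
# Two alleles at equal frequency
print(expected_heterozygosity("AABB"))
# Four alleles at equal frequency: h is higher still
print(expected_heterozygosity("ABCD"))
```

The three calls return 0, 0.5 and 0.75, showing that h rises as the number of equally frequent alleles increases.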
The Hardy-Weinberg principle almost never reflects the reality of a population, so why is it still useful?
- It predicts genotype frequencies from allele frequencies when these are stable across generations.
- The H-W Principle is an example of a null model. It describes the state of population when nothing interesting is happening.
What are the assumptions of the Hardy-Weinberg principle?
- Diploid organism with sexual reproduction (random and independent chromosome transmission to offspring)
- Non-overlapping generations
- Infinite population size (no random genetic drift)
- Random mating (no inbreeding)
- Males and females have equal allele frequencies
- A closed population (no migration)
- No mutation
- No selection
What does the Hardy-Weinberg principle teach us about genetics in the absence of evolutionary forces?
- Genotype frequencies are in equilibrium, i.e. they remain unchanged indefinitely
- This equilibrium is reached after only one generation of random mating
- If genotype frequencies are different from those predicted, then at least one evolutionary force is acting
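The points above can be checked numerically. This is a minimal sketch for one locus with two alleles (called A and a here purely for illustration), assuming all the Hardy-Weinberg conditions hold:

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for allele A at frequency p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def allele_frequency(genotypes):
    """Frequency of allele A given genotype frequencies."""
    return genotypes["AA"] + 0.5 * genotypes["Aa"]


def next_generation(genotypes):
    """Genotype frequencies after one round of random mating."""
    return hardy_weinberg(allele_frequency(genotypes))


# Start far from equilibrium: no heterozygotes at all
start = {"AA": 0.5, "Aa": 0.0, "aa": 0.5}
gen1 = next_generation(start)
gen2 = next_generation(gen1)
print(gen1)
print(gen2)
```

After a single round of random mating the frequencies reach 0.25 / 0.5 / 0.25 and do not change in the following generation.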
What is linkage disequilibrium?
Linkage disequilibrium (LD) is the non-random association of alleles at different loci. It typically arises between genes on the same chromosome, because their transmission is not independent
What are two forms of non-random mating?
Inbreeding- individuals mate with relatives more often than would occur by chance
Positive assortative mating- individuals breed preferentially based on a similar phenotype.
How does effective population size differ from census population size?
The effective population only includes those individuals who contribute to reproduction in a given generation.
How can effective population size be calculated based on sex ratio?
If a population has an unequal sex ratio, the rarer sex will contribute more offspring per capita
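The card does not state a formula. A commonly used textbook expression, added here as an assumption rather than taken from the card, is Ne = 4·Nm·Nf / (Nm + Nf), where Nm and Nf are the numbers of breeding males and females. A minimal Python sketch:

```python
def effective_size(n_males, n_females):
    """Effective population size under an unequal sex ratio.

    Uses the standard expression Ne = 4 * Nm * Nf / (Nm + Nf),
    which is an assumption added for illustration.
    """
    return 4 * n_males * n_females / (n_males + n_females)


# Equal sex ratio: Ne equals the census number of breeders
print(effective_size(50, 50))
# Same census size, skewed sex ratio: Ne is much smaller
print(effective_size(10, 90))
```

With 100 breeders in both cases, the equal sex ratio gives Ne = 100 while the 10:90 ratio gives Ne = 36, because the rarer sex limits the number of distinct parents.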
Why can migration between subpopulations result in lower heterozygosity than would occur in a single population under H-W equilibrium?
If subpopulations A and B are at Hardy-Weinberg equilibrium with different allele frequencies, the average heterozygosity will always be lower than the equivalent in a mixed population
What is the fixation index and how is it calculated?
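The fixation index (Fst) is the fraction of total genetic diversity that is due to differences between subpopulations (demes). It is calculated as Fst = (Ht – Hs)/Ht, where Ht is the total heterozygosity across all populations and Hs is the average heterozygosity within subpopulations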
What is the selection coefficient of an allele?
The increase or decrease in fitness conferred by that allele compared to another
How can the change in the frequency of an allele between two generations be expressed for haploid organisms?
∆q ≈ spq
p, q are the frequencies of the alleles P and Q, respectively
The Q allele has a fitness of 1 + s (s = selection coefficient), relative to a fitness of 1 for the P allele
∆q is the change in q from one generation to the next.
q increases when s is positive, and decreases when s is negative
The rapidity of the allele frequency change is proportional to the absolute value of s
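A small Python sketch of this recursion. It compares the approximation ∆q ≈ spq with the exact haploid change, which divides by the mean fitness 1 + sq; that exact form is a standard result added here for comparison, and the function names are illustrative.

```python
def delta_q_approx(q, s):
    """Approximate change in q per generation: s * p * q."""
    return s * (1.0 - q) * q


def delta_q_exact(q, s):
    """Exact haploid change: q' = q(1 + s) / (1 + s*q), returned as q' - q."""
    return q * (1.0 + s) / (1.0 + s * q) - q


def trajectory(q0, s, generations):
    """Allele frequency of Q over time, using the exact recursion."""
    q = q0
    freqs = [q]
    for _ in range(generations):
        q += delta_q_exact(q, s)
        freqs.append(q)
    return freqs


# Positive s: Q rises; negative s: Q falls
print(trajectory(0.1, 0.05, 100)[-1])
print(trajectory(0.1, -0.05, 100)[-1])
```

For small s the two versions of ∆q are nearly identical, which is why the simpler form is usually quoted.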
How can the change in the frequency of an allele between two generations be expressed for diploid organisms?
∆q ≈ spq [ph + q(1-h)]
In diploids, fitness is influenced by the degree of dominance (h) of an allele, as follows:
PP = 1 ; PQ = 1 + hs ; QQ = 1 + s
h ranges from 0 to 1
(s= selection coefficient)
How does the degree of dominance (dominant, additive, recessive) of an allele affect the rate of allele frequency change in diploid organisms?
Dominance
- If the selected allele is dominant (orange line), change is initially rapid but very slow as it nears fixation
- A new rare allele initially creates mostly heterozygotes. Selection can only favour these if the allele is dominant
- Near fixation, dominance allows the less-fit allele to hide in heterozygotes, making it difficult to remove
- If the selected allele is recessive (amber line), change is very slow initially but accelerates near fixation
Additivity
Change is initially rapid and reaches fixation very rapidly (green line). This is because less-fit alleles are more effectively selected against (they cannot hide)
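These patterns can be reproduced by iterating the diploid approximation ∆q ≈ spq[ph + q(1-h)]. The sketch below is illustrative only: the starting frequencies, the value s = 0.1 and the helper names are assumptions, and the approximation ignores the mean-fitness denominator.

```python
def delta_q(q, s, h):
    """Approximate per-generation change in q for dominance coefficient h."""
    p = 1.0 - q
    return s * p * q * (p * h + q * (1.0 - h))


def generations_to(q_start, q_target, s, h, max_generations=1_000_000):
    """Number of generations for q to rise from q_start to q_target."""
    q = q_start
    generations = 0
    while q < q_target and generations < max_generations:
        q += delta_q(q, s, h)
        generations += 1
    return generations


s = 0.1
for label, h in [("dominant", 1.0), ("additive", 0.5), ("recessive", 0.0)]:
    early = generations_to(0.01, 0.5, s, h)   # rare allele becoming common
    late = generations_to(0.5, 0.99, s, h)    # common allele nearing fixation
    print(label, early, late)
```

The dominant allele completes the early phase far sooner than the recessive one, the recessive allele completes the late phase far sooner than the dominant one, and the additive allele has no slow phase at either end.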
What forms of selection help to maintain genetic variation in a population?
- Balancing selection: Both alleles stably coexist with frequency that is proportional to the relative fitnesses of the two homozygotes. Typified by the case of heterozygote advantage.
- Frequency dependent selection: allele fitness is high when the allele is rare, low when common
- Fluctuating selection: allele fitness depends on an aspect of the environment that is rapidly and constantly changing
In molecular phylogenetics, how are orthology, homology and paralogy defined?
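- Homologous sequences share a common ancestor
- Orthologous sequences are homologous sequences that diverged through speciation (the same gene in different species)
- Paralogous sequences are homologous sequences that diverged through gene duplication (different genes, typically in the same genome)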
What is the difference between a transition mutation and a transversion mutation?
Transition- purine-to-purine or pyrimidine-to-pyrimidine (A -> G, C -> T etc.)
Transversion- purine-to-pyrimidine or vice versa (A -> C, G -> T etc.)
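The rule can be written as a small Python function (the name `classify_substitution` is illustrative):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Classify a change from base a to base b."""
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if a == b:
        return "identical"
    # Both purines or both pyrimidines: the chemical class is unchanged
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


print(classify_substitution("A", "G"))
print(classify_substitution("C", "T"))
print(classify_substitution("A", "C"))
```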
Outline the most common method of choosing the best alignment between two sequences in Molecular phylogenetics.
- Assign differing costs to each type of sequence difference (i.e. insertions, deletions, transitions, transversions)
- Add up these costs for each possible alignment, and identify the alignment with the lowest cost. (Applications such as Clustal and Muscle do this)
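For two sequences, the lowest-cost alignment can be found by dynamic programming instead of enumerating every alignment. The sketch below computes only the minimum cost; the cost values (transition 1, transversion 2, gap 3) are arbitrary assumptions, and this is not a description of how Clustal or Muscle are implemented.

```python
PURINES = {"A", "G"}


def substitution_cost(a, b, transition_cost=1, transversion_cost=2):
    """Cost of aligning base a against base b."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)
    return transition_cost if same_class else transversion_cost


def alignment_cost(x, y, gap_cost=3):
    """Lowest total cost of globally aligning sequences x and y."""
    n, m = len(x), len(y)
    # cost[i][j] = lowest cost of aligning x[:i] with y[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_cost
    for j in range(1, m + 1):
        cost[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution_cost(x[i - 1], y[j - 1]),
                cost[i - 1][j] + gap_cost,   # gap in y
                cost[i][j - 1] + gap_cost,   # gap in x
            )
    return cost[n][m]


print(alignment_cost("ACGT", "ACGT"))  # identical sequences
print(alignment_cost("ACGT", "AGT"))   # one deletion
print(alignment_cost("ACGT", "ATGT"))  # one transition (C/T)
```

With these costs the three calls return 0, 3 and 1 respectively.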
What is the issue with simply using the proportion of sites that are mismatched (p-distance) when measuring the genetic distance between two genetic sequences?
These two scenarios would appear identical.
What is the multiple hits problem?
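When the divergence between sequences is high, the number of observed differences underestimates the true distance, because several substitutions can occur at the same site (including back-mutation and convergence) yet at most one difference is seen.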
How do nucleotide substitution models aim to solve the issues of the multiple hits problem?
Estimate the true genetic distance by mathematically representing the stochastic process of sequence evolution over time.
What is the Jukes-Cantor nucleotide substitution model, and how does it differ from the HKY and GTR models?
The simplest nucleotide substitution model. It differs from the HKY and GTR models because it assumes that all types of substitution occur at the same rate (and that the four bases are equally frequent).
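Under the Jukes-Cantor model, the corrected distance has a closed form, d = -(3/4) ln(1 - (4/3)p), where p is the observed proportion of differing sites. The formula is a standard result added here for illustration, and the function names are illustrative.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


def jukes_cantor_distance(p):
    """Expected substitutions per site under JC: -(3/4) ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p-distance must be below 0.75 (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.01, 0.1, 0.3, 0.6):
    print(p, round(jukes_cantor_distance(p), 4))
```

The correction is negligible for small p and grows rapidly as p approaches 0.75, reflecting the growing number of hidden multiple hits.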
What do the letters a-f represent in this visualization of a nucleotide substitution model?
The relative rates of different types of mutation.
How does an amino acid substitution model differ from a nucleotide substitution model?
Models sequence variation at the level of amino acids rather than individual nucleotides. 20 possible states each amino acid can move between, rather than 4, and so a 20x20 matrix is used. The rate of movement between amino acids is obtained through large surveys of protein variation.
What assumptions are made by nucleotide substitution models?
- Evolution at each site occurs at the same rate.
- Nucleotide base frequencies are the same for all sequences.
- Evolution is independent at each site.
The assumption that evolution occurs at the same rate at all sites is a major inaccurate assumption of nucleotide substitution models. How can this be corrected?
Using models of among-site rate heterogeneity such as the gamma-distribution model
What is the difference between a rooted and unrooted phylogenetic tree?
What three groups of methods are used in constructing phylogenetic trees?
ALGORITHMIC METHODS: These methods begin with a genetic distance for each pair of sequences. A ‘clustering algorithm’ then transforms the genetic distances into a tree.
OPTIMALITY METHODS: These methods define some kind of score for each possible tree. An optimisation algorithm is then used to find the tree with the highest score.
STATISTICAL METHODS: These methods calculate a probability for each possible tree. They frame phylogeny estimation as a formal statistical problem
How does an UPGMA algorithmic method of phylogeny construction work?
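A minimal sketch of the usual UPGMA procedure, offered as an illustration rather than as the deck's own answer: repeatedly join the two closest clusters, place the new node at half their distance, and recompute distances to the merged cluster as a size-weighted average. The input format and return structure are assumptions.

```python
from itertools import combinations


def upgma(names, matrix):
    """Cluster taxa with UPGMA.

    names:  list of taxon labels
    matrix: symmetric list of lists of pairwise distances, in the same order
    Returns a nested tuple (left, right, node_height) for the rooted tree.
    """
    # Active clusters: id -> (subtree, number of taxa in the cluster)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id)
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)

    while len(clusters) > 1:
        # 1. Find the closest pair of clusters
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        del dist[(a, b)]
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # 2. Distance from the merged cluster to each remaining cluster
        #    is the average over all member taxa (size-weighted mean)
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # 3. The new node sits at half the distance between the joined clusters
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), size_a + size_b)
        next_id += 1

    tree, _ = next(iter(clusters.values()))
    return tree


example = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(example)
```

In the example, A and B are joined first at height 1.0, and C joins them at height 3.0. Because every tip ends up equidistant from the root, UPGMA implicitly assumes a constant rate of evolution across lineages.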
What are the three optimality methods of phylogeny construction?
- Maximum Parsimony: The tree which requires the fewest evolutionary changes to explain the observed sequences is the best tree. Fast, but inapplicable to fast-evolving or highly-divergent sequences.
- Maximum Likelihood: The tree which is probabilistically most likely to have given rise to the observed sequences is the best tree. Slower. The probabilities are given by a nucleotide substitution model. Most common approach for sequence data.
- Bayesian Inference: Each tree has a probability given the data. We should consider the whole probability distribution, not just focus on the single most probable tree. Slowest. Closely related to Maximum Likelihood. Most useful for testing evolutionary hypotheses.
How is a phylogeny constructed using maximum parsimony?
- For any given tree and set of characters, the parsimony score is the minimum number of evolutionary changes required to explain the observed characters.
- The most parsimonious tree is that with the lowest parsimony score. However, there may be very many trees that share this distinction.
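The parsimony score of a given tree can be computed with Fitch's algorithm, which is one standard way to do it (the card does not name a method). The sketch below assumes a rooted binary tree written as nested 2-tuples of taxon names and an alignment stored as a dict; both conventions are illustrative.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one character.

    tree:   a taxon name (leaf) or a 2-tuple of subtrees
    states: dict mapping taxon name -> character state at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children can share a state: no extra change needed here
        return common, left_changes + right_changes
    # No shared state: at least one change on a branch below this node
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum of minimum changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}
print(parsimony_score((("A", "B"), ("C", "D")), alignment))
print(parsimony_score((("A", "C"), ("B", "D")), alignment))
```

The first tree groups the taxa that share states and needs 2 changes; the second needs 4, so the first is the more parsimonious of the two.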
How is a phylogeny constructed using maximum likelihood?
Nucleotide (or amino acid) substitution models enable us to calculate P(seqs|T,B,Q), that is, the probability of the observed sequences given:
- a tree topology (T)
- a set of branch lengths (B), each of which represents a genetic distance
- rate parameters of the substitution model (Q)
The tree likelihood is proportional to this probability. Calculating this requires some fairly heavy-duty maths.
When constructing a maximum likelihood phylogeny, a tree search is used to determine the tree with the highest likelihood. If the tree has many taxa, how can a search be conducted without searching through every possible tree?
Hill climbing: searches through trees via iterative trial and error. Does not search through all possible trees, and isn’t guaranteed to find the most likely tree.
How is the uncertainty of a phylogenetic tree most commonly measured?
Bootstrapping:
- Bases are grouped by nucleotide position (alignment column), and the groups are added to a ‘pot’
- A random group is drawn from the pot for each nucleotide position, with replacement, meaning the same group can be drawn multiple times
- This is repeated hundreds or thousands of times to produce many pseudoreplicates
- Generate a tree (usually NJ or ML) from each bootstrap replicate. The frequency with which a cluster occurs in these replicates is a measure of its reliability.
- A cluster that appears in >70% of replicates is considered robust.
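The resampling step can be sketched in Python as follows. The tree-building step is left as a caller-supplied function (`build_clades`, a hypothetical name) that returns the set of clades found in the tree inferred from a given alignment; the alignment format is also an assumption.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths)
    rng:       a random.Random instance
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    # Draw one column index per position; the same column may recur
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def bootstrap_support(alignment, build_clades, n_replicates=100, seed=0):
    """Fraction of pseudoreplicates in which each clade appears.

    build_clades: hypothetical user-supplied function taking an alignment
    and returning an iterable of clades (frozensets of taxon names).
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        replicate = bootstrap_alignment(alignment, rng)
        for clade in build_clades(replicate):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: n / n_replicates for clade, n in counts.items()}
```

Sampling whole columns, rather than individual bases, keeps the states of all taxa at a site together, so each pseudoreplicate is a valid alignment of the same length as the original.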
On what observation did Zuckerkandl and Pauling base their original molecular clock model?
Number of amino acid differences between animal hemoglobins was proportional to species divergence time, as defined by the fossil record.
What are the Neutralist and Selectionist models of molecular evolution? How are they reconciled?
It is now understood that the two are not mutually exclusive: molecular evolution is driven by different forces in different regions of the genome and under different conditions.
How does substitution rate differ from mutation rate?
The substitution rate is the rate at which sequences in different populations diverge through time. The mutation rate is the rate at which individuals incorporate errors during replication.
How does the fitness impact of a mutation alter its substitution rate and the overall substitution rate of the gene it is part of?
The overall substitution rate of a gene will depend on the proportion of sites in each of these categories.
How does population size alter substitution rates?
Ns- Population size x selection coefficient
How does generation time (time between germ line replications) impact substitution rate?
For neutral mutations, the substitution rate per unit time is significantly affected by generation time, as shorter generation times provide more opportunities for mutations to accumulate.