MT Host + Genomics Flashcards
Intro to Genome Evolution
How do chromosomes evolve?
- gain or lose chromosomes through fissions or fusions
- vary in length through duplications
- acquire multiple sex chromosomes through translocations
- translocations, duplications, insertions and deletions can all affect the shape of a chromosome by making its structure tighter or looser
Intro to Genome Evolution
What does the human cell genome contain, and how is it split up?
The human cell genome has a small mitochondrial genome and a large nuclear genome
1. The nuclear genome is either transcribed or not transcribed into RNA; the transcribed part consists of protein-coding genes (mRNA, 1%) or non-coding genes such as tRNA, rRNA, snRNA, snoRNA, siRNA (3%)
2. The non-transcribed genome can be structural or non-structural
3. Structural regions are either telomeres (1%) or centromeres (10%)
4. Non-structural regions are further split into unique or repetitive DNA
5. Unique DNA contains conserved non-coding elements or non-conserved non-coding elements
6. Repetitive DNA includes satellite DNA, retroelements, transposons, LINEs, SINEs
Intro to Genome Evolution
How come closely related species like humans and chimps, and even more so humans (23) and Drosophila (4), have such different numbers of chromosomes?
This is due to chromosome evolution. About 90% of the genome within chromosomes is non-coding, so differences in chromosome number mostly reflect changes in ‘junk’ DNA
Intro to Genome Evolution
What are some questions which need to be answered about genome evolution?
- how do chromosomes evolve
- what is the origin of introns
- how does genome size evolve (by what mechanisms)
- where do new genes come from
- why is there so much JUNK DNA
Intro to Genome Evolution
What ways do new chromosomes evolve?
- Chromosomes can evolve through chromosome fusion/fission, which leads to a reduction or increase in the number of chromosomes (sex chromosomes + autosomes)
- Chromosomes can evolve via translocation (evolution of the 2nd pseudoautosomal region in humans)
- Chromosomes can evolve via inversions and segmental duplications (chromosome shape evolution)
- Chromosomes can evolve via homologous recombination in male meiosis (sex chromosomes)
Intro to Genome Evolution
What happens when chromosomes evolve via fusion/fission, and what is one example?
Two chromosomes fuse together to become one, or one chromosome splits into two
* An example of fusion is the evolution of human chromosome 2 from the chimpanzee-like ancestral chromosomes:
* The chromosomes corresponding to chimpanzee 2a and 2b fused telomere-to-telomere to form human chromosome 2
Intro to Genome Evolution
What happens when chromosomes evolve via translocation? What is an example in sex chromosomes?
Translocation is a genetic change in which a piece of one chromosome breaks off and attaches to another chromosome. Sometimes pieces from two different chromosomes trade places with each other.
1. The 2nd PAR (pseudoautosomal region) of the human X and Y chromosomes arose through a translocation from the X to the Y chromosome in the human lineage after its split from the chimpanzee lineage.
2. The PARs are the regions at the ends of the X and Y that cross over (recombine) in male meiosis
Intro to Genome Evolution
How can chromosomes evolve via inversions and segmental duplications to change chromosome shape? Does this affect gene expression?
- Gene duplication doubles the number of copies of the duplicated genes (so the chromosome gets longer)
- Inversions can cause silencing by changing how tightly the region is coiled, i.e. its chromatin structure
- This can affect gene expression massively and can even lead to the creation of new genes, or switch genes on or off by changing their epigenetic state
- Multiple inversions have occurred in humans; around 13.7% of the genome is segmentally duplicated
Intro to Genome Evolution
What is a good example of chromosome evolution via fusion in muntjac deer?
Tandem chromosome fusions in the karyotypic evolution of muntjac deer
* Karyotype: the number and visible appearance of the chromosomes in the cell nuclei of a species
* Muntiacus gongshanensis (M.gongshanensis) has the lowest chromosome number for mammals (4)
* It was closely compared with its relative M.reevesi, which has 9 chromosomes, using in situ hybridisation probes for telomeres.
* M.gongshanensis showed evidence of chromosome fusion: its telomere sequences were found fused into the middle sections of the chromosomes
Source: Huang et al., 2006
Intro to Genome Evolution
What is another example of how translocation and homologous recombination during male meiosis can lead to chromosome evolution?
Multiple sex chromosomes in the platypus: 5 X and 5 Y chromosomes form a chain of sex chromosomes during homologous recombination in male meiosis.
1. The chain formed through translocations between a sex chromosome and an autosome
2. During evolution, a translocation between the end of a Y chromosome and an autosome caused part of that Y chromosome (Y1) to become homologous with the autosome it translocated with.
3. The rest of the Y1 chromosome is still homologous to the original X1 chromosome.
4. The autosome it translocated with then became the X2 chromosome. X2 is in turn homologous to another autosome, which became the neo-Y2. As this continued, the platypus ended up with 5 X and 5 Y chromosomes, some of which were originally autosomes.
5. With each further translocation, the number of sex chromosomes increases as autosomes become neo-X/Y chromosomes, lengthening the chain of sex chromosomes that forms during homologous recombination in male meiosis
6. Surprisingly, the chromosomes in this chain still manage to segregate correctly during meiosis.
7. Multiple sex chromosomes are also seen in the dioecious plant S.diclinis
Intro to Genome Evolution
How did sex chromosomes evolve?
- Sex chromosomes evolved multiple times independently
- Sex chromosomes in birds and mammals arose independently at ~170 and 100 MYA.
- These 2 independent origins are why mammals have male heterogamety (XY males) while birds have female heterogamety (ZW females)
- Sex chromosomes evolve from a pair of autosomes that acquired a sex-determining gene and stopped recombining with each other. The non-recombining region then expands through inversions until the entire Y chromosome is non-recombining
- The regions added step by step in this expansion are called EVOLUTIONARY STRATA on sex chromosomes
- Through inversions a genomic region becomes non-recombining, starts to accumulate deleterious mutations and gradually degenerates
- They accumulate deleterious mutations because, without recombination, natural selection cannot efficiently remove the mutations from the population, so they build up
- Therefore the Y chromosome is usually entirely degenerate in most organisms
- This is Y chromosome degeneration along its evolutionary strata
Intro to Genome Evolution
What are the Y chromosome evolutionary strata and what causes the Y to degenerate? What about Y chromosome gene loss?
Y chromosomes slowly become degenerate. This happens at different rates, and studies have investigated how fast genes are lost or degenerate once a region of the Y chromosome stops recombining.
One study did this in 8 mammalian species and showed that
* genes are lost very quickly (almost immediately on a long evolutionary timescale) once a region becomes non-recombining, and only a few indispensable/highly conserved genes remain functional on the Y chromosome
* They reconstructed the evolutionary dynamics of gene loss and showed that natural selection cannot work effectively in non-recombining regions.
* This results in gradual loss of genes from Y and W chromosomes
* The number of genes on the human Y chromosome has reached a base level, with little further degeneration
* Y chromosome genes are unlikely to be lost altogether because the graph of gene loss over time is non-linear: the last remaining genes are lost very slowly (they are more conserved)
* Ongoing Y-chromosome degeneration
Nature 2014
Intro to Genome Evolution
Describe the example of vitamin C synthesis gene loss in animals
- loss of the ability to synthesise vitamin C, and of the related genes, occurred independently in different lineages
- many birds and mammals cannot synthesise vitamin C
- the inability to synthesize vitamin C is due to mutations in the L-gulono-lactone oxidase (GLO) gene that encodes the enzyme responsible for catalyzing the last step of vitamin C biosynthesis.
- It is thought that this gene is lost whenever sufficient vitamin C is present in food. This reflects a general tendency to lose genes that become unnecessary.
Current Genomics 2011
Intro to Genome Evolution
Describe gene loss for loss of teeth in birds, turtles and mammals
- genes are lost when they become unnecessary
- a genome comparison study showed that mineralized teeth in birds were lost ~120 mya
- mammals such as toothless whales lost their teeth through loss of the genes for making teeth
Meredith et al., 2004
Intro to Genome Evolution
What often causes loss of genes?
when genes become unnecessary
* teeth in some birds
* vit C synthesis in birds + mammals
Intro to Genome Evolution
How does a study on Fungi genomics show that genes are being lost and gained all the time?
Within many related species of fungi, genes are constantly being lost, duplicated, etc.
There have even been whole genome duplications
Intro to Genome Evolution
What are some mechanisms by which new genes arise?
- Exon shuffling: ectopic (abnormal) recombination of exons and domains from distant genes
e.g. jingwei
- Gene duplication: classic model of duplication with divergence
e.g. CGβ, RNASE1B
- Retroposition: new gene duplicates are created in new genomic positions by reverse transcription or other processes
e.g. PGAM3
- Gene fusion/fission: 2 adjacent genes fuse into a single gene, or a single gene splits into two genes
e.g. fatty-acid synthesis enzymes
- De novo origination: a coding region originates from a previously non-coding region
e.g. AFGPs
Intro to Genome Evolution
How do new proteins evolve?
apart from new genes -> new proteins
* many proteins evolve by ‘borrowing’ domains from other proteins
* this is done by exon shuffling: protein domains (which correspond to certain exons) are shuffled between proteins, adding a domain or function to an existing protein
* this can reduce, change, or add function to protein
* or this can be done by alternative splicing
Intro to Genome Evolution
What is an example of exon shuffling
jingwei gene in drosophila
Intro to Genome Evolution
What does exon shuffling change
it allows the evolution of new proteins
or proteins with new function/changes to protein function
Intro to Genome Evolution
Describe exon shuffling in terms of the origin of the jingwei gene in Drosophila
Nature Genetics 2003
- The jingwei gene in Drosophila originated as a duplication of the ancestral gene ‘yande’
- this was followed by exon shuffling or retroposition of the alcohol dehydrogenase (Adh) gene into the middle of yande, creating a gene fusion
- a new chimeric gene was created, consisting of 3 exons of yande and a middle exon made of the Adh coding region
- the new gene gained the function of Adh, being an alcohol dehydrogenase, but it works more effectively on longer-chain alcohol molecules
- this is the creation of a new gene, ‘jingwei’, with sub-functionalisation
Intro to Genome Evolution
What do you call it when a gene gains a completely new function, and when a gene only gains some new functions?
- neo-functionalisation: creation of a gene with completely different and new functions compared to the old gene
- sub-functionalisation: creation of a gene with a similar function to the old gene
Intro to Genome Evolution
How did introns evolve?
- 2 main theories: Intron first and Intron late
1. Intron first: introns evolved in the RNA world; introns are very ancient and are gradually lost (e.g. lost completely in bacteria)
2. Intron late: introns evolved in the ancestor of eukaryotes; they evolved in early eukaryotes and keep spreading
- across the tree of life, introns are clearly lost and gained all the time, but which theory is right is still unclear
Intro to Genome Evolution
How did alternative splicing evolve?
- major role in evolution of new gene functions/new proteins
- the existence of introns, with genes split into ‘pieces’ of exons and introns, allows alternative splicing of the same gene into different proteins depending on how the mRNA is spliced
- the evolution of alternative splicing is thought to be a by-product of splicing noise
1. imperfect or incorrect splicing occasionally occurs
2. If the resulting new combination of exons is advantageous, selection can make that splicing variant more likely to occur.
3. Selection changes the relative abundances of the resulting proteins; this can be done by changing promoters, etc.
Intro to Genome Evolution
What are the advantages of alternative splicing?
- major role in evolution of new gene functions/new proteins
- the existence of introns, with genes split into ‘pieces’ of exons and introns, allows alternative splicing of the same gene into different proteins depending on how the mRNA is spliced
- doesn’t require a completely new exon to be made; one copy is sufficient
- hence saves the energy of copying the exon
Intro to Genome Evolution
what is the 2R hypothesis
2 rounds of whole genome duplication in animals, specifically in vertebrate ancestry
* hypothesis that vertebrates originated after 2 rounds of WGD
Intro to Genome Evolution
why is gene duplication important in evolution
- Evolution by gene duplication = major source of new genes and evolutionary novelty
- Involves duplication of individual genes and sometimes entire genomes (like the fungi example)
- WGD (whole genome duplication) in animals is thought to have driven the origin of vertebrates
- WGD is much more common in plants and fungi
Intro to Genome Evolution
What happens in gene duplication, and how does it affect protein/gene function?
* Once a gene is duplicated, some functional redundancy is created; this reduces purifying selection and allows the two copies to accumulate mutations and diverge in function.
* Often duplicated copies perform very similar roles, but in a slightly different way
* This was the case for the Adh and jingwei genes in Drosophila. The two copies of a gene can specialise (and be optimised) to work in different tissues or in slightly different ways.
Intro to Genome Evolution
what are possible outcomes of genes which have been duplicated
- sub-functionalisation
- neo-functionalisation
- specialised and optimised to work in different tissues or via different mechanisms
- can also be selected against and be removed
Intro to Genome Evolution
What is an example of gene duplication leading to subfunctionalisation?
- evolution of trichromatic color vision in primates
- ancestral state is dichromatic, because ancestors were nocturnal and color vision wasn’t necessary for that
- primates are diurnal (active in the day) and rely on color vision to find ripe fruits, hence this trait is more useful
* Trichromatic colour vision in primates evolved through gene duplication: the L gene (for long wavelengths) was duplicated and the resulting copies diverged a little, giving the L and M genes (M for medium wavelengths).
- after the duplication that created the L and M opsin genes, the copies diverged to acquire different spectral sensitivities
- In humans spectral sensitivity is relatively poor compared with bees and birds, whose sensitivity covers the entire visible light spectrum almost evenly
- birds, with better vision, have evolved high spectral sensitivity across all wavelengths and a new VS opsin gene
2010 paper
Intro to Genome Evolution
What causes increases and decreases in genome size?
- gene duplication
- genome duplication
- spread of transposable elements
ONLY DELETION downsizes the genome; the frequency and size of deletions determine the efficiency of genome downsizing
Intro to Genome Evolution
Are there any links between genome size and number of proteins?
- genome expansion causes genome sizes to vary over several orders of magnitude.
- This does not mean that bigger genomes contain more genes or encode more complex organisms:
* A larger genome DOES NOT EQUAL more genes and DOES NOT EQUAL a more complex organism
- Proof: the genomes of various flowering plants (that have relatively similar ‘design/function’ and complexity) vary over three orders of magnitude, suggesting that the size of the genome has little to do with organism complexity.
Intro to Genome Evolution
What is an example showing that a larger genome does not mean a more complex organism?
Proof: the genomes of various flowering plants (that have relatively similar ‘design/function’ and complexity) vary over three orders of magnitude, suggesting that the size of the genome has little to do with organism complexity.
- the C-value paradox shows this as well: eukaryotic genome size is not linearly correlated with the number of genes, because of the existence of junk DNA
Intro to Genome Evolution
What is the C-value paradox? Define it
- The number of genes and the genome size show a good correlation in viruses and prokaryotes, but the correlation is much weaker in eukaryotes: the so-called “C-value paradox”.
Intro to Genome Evolution
What is the simplified reason for the C-value paradox?
- The reason for this is the abundance of non-coding DNA in eukaryotic genomes.
- More DNA doesn’t mean more genes in eukaryotes, because of introns and other non-coding DNA
Intro to Genome Evolution
What is an example of genome size variation?
Genome size has changed very quickly across multiple species over the last 200 my. This shows how variable genome size is, even in closely related species. It also suggests that changes in genome size go with large changes in function
Organ et al., 2007
Intro to Genome Evolution
How did they estimate the genome sizes of dinosaurs?
Genome sizes for extinct animals were estimated from the size of the cells inside the bones: the larger the genome, the larger the cell. For dinosaurs, the size of the cells was measured from the size of pores in the bones.
This revealed that genomes of dinosaurs and mammals are relatively large, while bird genomes appear to have been downsized, possibly as an adaptation to a faster lifecycle and flight.
Organ et al., 2007
Intro to Genome Evolution
What might explain why a larger genome goes with a larger cell?
Genome size has a structural role in maintaining cell size and cell shape in animals
Dinosaurs and mammals, which are physically larger, have larger cells and genomes, while birds (which are smaller) have smaller cells and genomes
Intro to Genome Evolution
How do transposable elements affect the evolution of genome size?
- Transposable elements (TEs) are a major component of most genomes:
- > 50% of the human genome and > 70% of the wheat genome is made up of various TEs
- Activation of TEs (‘jumping genes’) can increase genome size very quickly, even doubling it
- Activation of a single TE family can lead to a rapid increase in genome size, as was reported for S.latifolia and cotton
- Example 1: Silene latifolia: the spread of a TE family (SIOgr1) ~5mya increased its genome size from 2Gb to 3.5Gb (comparing S.latifolia and its closest relative S.vulgaris)
- Example 2: Gossypium (cotton): an almost threefold difference in genome size between closely related species, from 880Mb in G.raimondii to 2460Mb in G.exiguum
- this is because of the spread of the TE family Gorge 3 (which increased from 61Mb in G.raimondii to 831Mb in G.exiguum)
Filatov et al., 2008; Hawkins et al., 2009
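The Gossypium figures on this card can be checked with a little arithmetic. The Python sketch below uses only the numbers quoted above; the function and variable names are made up for illustration and do not come from the cited papers.

```python
def te_share_of_growth(genome_small_mb, genome_large_mb, te_small_mb, te_large_mb):
    """Return (fold difference in genome size,
    fraction of the size difference explained by one TE family)."""
    fold = genome_large_mb / genome_small_mb
    genome_gain = genome_large_mb - genome_small_mb   # extra DNA in the larger genome
    te_gain = te_large_mb - te_small_mb               # extra DNA from the TE family
    return fold, te_gain / genome_gain


# Figures from the card: G.raimondii vs G.exiguum, TE family Gorge 3 (all in Mb)
fold, share = te_share_of_growth(880, 2460, 61, 831)
print(f"genome size ratio: {fold:.1f}x")
print(f"share of the increase explained by Gorge 3: {share:.0%}")
```

On these numbers a single TE family accounts for roughly half of the extra DNA in the larger genome.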
Intro to Genome Evolution
What are 2 examples of how TEs (jumping genes) affect genome size?
Activation of TEs leads to a rapid increase in genome size in:
1. Example 1: Silene latifolia: the spread of a TE family (SIOgr1) ~5mya increased its genome size from 2Gb to 3.5Gb (comparing S.latifolia and its closest relative S.vulgaris)
2. Example 2: Gossypium (cotton): an almost threefold difference in genome size between closely related species, from 880Mb in G.raimondii to 2460Mb in G.exiguum, because of the spread of the TE family Gorge 3 (which increased from 61Mb in G.raimondii to 831Mb in G.exiguum)
Intro to Genome Evolution
If deletions are the only way to downsize a genome, how is this done effectively?
- **The frequency and size of deletions occurring in the genome determine how efficient genome downsizing is.**
- A study compared the frequency and size of deletions in the small genome of Drosophila and the very large genome of the cricket Laupala.
- This showed that Drosophila has frequent deletions and many of them are relatively long (>16 nucleotides).
- The cricket, on the other hand, had fewer deletions and most of them were very short.
- This study demonstrated that the half-life of a piece of junk DNA in Drosophila is only 14 million years, while in the cricket it is over half a billion years
- effectively, junk DNA is never removed from the Laupala genome.
- This is likely to be the reason why these species have such different genome sizes.
- Genomes are in a constant process of gaining and losing genes, and the rate and size of these gains and losses vary
- From a large genome, junk DNA is almost never removed
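To get a feel for what these half-lives mean, the sketch below assumes simple exponential decay (which is what a half-life implies) and asks how much of a stretch of junk DNA would still be present after a given time. The 500-million-year value is a lower bound taken from "over half a billion years"; the function name is illustrative.

```python
def fraction_remaining(years, half_life_years):
    """Fraction of a piece of junk DNA still present after `years`,
    assuming exponential decay with the given half-life."""
    return 0.5 ** (years / half_life_years)


DROSOPHILA_HALF_LIFE = 14e6   # years, from the card
LAUPALA_HALF_LIFE = 500e6     # years, lower bound ("over half a billion years")

for t in (14e6, 100e6):
    d = fraction_remaining(t, DROSOPHILA_HALF_LIFE)
    c = fraction_remaining(t, LAUPALA_HALF_LIFE)
    print(f"after {t / 1e6:.0f} My: Drosophila keeps {d:.1%}, Laupala keeps {c:.1%}")
```

Under this assumption, after 100 million years Drosophila has removed almost all of the junk DNA while the cricket still carries most of it.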
Intro to Genome Evolution
Why does it take so long to remove junk DNA from a very large genome?
Maybe because, in a large genome, junk DNA is not under enough negative selection pressure to be removed. Too much DNA to process, so not worth removing it from an evolutionary perspective?? UNSURE THOUGH!!
Intro to Genome Evolution
What are examples of extreme genome reduction?
Buchnera
* Buchnera is a mutualistic intracellular symbiont of aphids.
* Their association began about 200 million years ago, with host and symbiont lineages co-evolving in parallel since that time.
* During this coevolutionary process, Buchnera has experienced a dramatic decrease in genome size (from ~4Mb to ~0.5Mb), retaining only the genes essential for its specialized lifestyle; the majority of its biochemical pathways have been removed
* this is a better adaptation to life as an intracellular symbiont, which doesn’t need processes it can rely on the host for
* Genes are lost because selection is not keeping them intact
* Selection to keep non-essential genes is relaxed because of the symbiotic (intracellular) lifestyle
Mitochondria
* similarly, mitochondria originated as symbionts, according to the endosymbiosis theory
* mitochondria have a hugely reduced genome only 16kb long, containing essential mitochondria-specific genes
Intro to Genome Evolution
Genome comparison of Buchnera for evidence of genome reduction
- The comparison of the genomes of two Buchnera species that diverged 50 million years ago revealed very similar gene content
- Buchnera underwent genome reduction a long time ago and little has changed since then
- it is effectively in genomic stasis, with no signs of further genome reduction.
- genome unlikely to be reduced further
- Maybe because the remaining genes are VERY ESSENTIAL to life, and further reduction would bring no more evolutionary benefit for its current lifestyle and environment
Intro to Genome Evolution
What causes the evolution of multiple sex chromosomes?
Translocations between autosomes and the original sex chromosomes, which then pair through homologous recombination; this produces multiple sex chromosomes
Intro to Genome Evolution
What caused the first evolution of sex chromosomes?
A pair of autosomes acquired a sex-determining gene and then became non-recombining, as accumulating changes (such as inversions) prevented them from recombining
Intro to Genome Evolution
how do proteins evolve
alternative splicing
exon shuffling
impacts of gene duplication
divergence in functionality and results in sub functionalisation or neofunctionalisation
Intro to Genome Evolution
What does a larger genome mean?
IT DOESN’T MEAN A MORE COMPLEX ORGANISM OR MORE PROTEINS
but it can be related to the cell size and life strategy of the organism
Intro to Genome Evolution
What do transposable elements do?
when activated, they can increase genome size very rapidly by creating lots of junk DNA
Intro to Genome Evolution
What does genome size depend on, and what makes it vary?
increases: duplication, TEs, etc.
decreases: deletion, whose rate depends on deletion frequency, genome size and selection pressure. Junk DNA is often harder to remove from larger genomes
Intro to Genome Evolution
Extreme examples of genome reduction (brief)
- Genome reduction in mitochondria and in symbionts like Buchnera are extreme examples: their genomes have been reduced to an extreme degree because of their intracellular, symbiotic lifestyle
Lecture 2 (HG) intro to human evo genomics
What is the importance of population genetics in the study of human evolution?
- Addresses questions about recent evolution
- Where did humans originate?
- When did humans spread across the world?
- While spreading, have humans been adapting to a diverse set of environmental conditions?
- Have humans interbred with other closely related species, such as Neanderthals?
Lecture 2 (HG) intro to human evo genomics
How is mtDNA diversity used to research human evolutionary genetics?
mtDNA is more useful than nuclear DNA because it is non-recombining in humans
mtDNA: high copy number per cell, small genome (16kb), high mutation rate, no recombination
easy and cheap to research/sequence
* because mtDNA does not recombine, it can be used to build consistent phylogenetic trees within species
Lecture 2 (HG) intro to human evo genomics
What is the benefit of using mtDNA for studies?
mtDNA is more useful than nuclear DNA because it is non-recombining in humans
shows the ‘maternal’ side of the story
mtDNA: high copy number per cell, small genome (16kb), high mutation rate, no recombination
easy and cheap to research/sequence
* because mtDNA does not recombine, it can be used to build consistent phylogenetic trees within species
Lecture 2 (HG) intro to human evo genomics
Results of the mtDNA diversity study in human evolutionary genetics
- African genetic diversity is significantly higher than elsewhere
- The study (building a phylogenetic tree from people of different populations) concluded that the root of the mtDNA phylogeny is in Africa
- Consistent with the fact that Africa has the greatest genetic diversity
- Whole-genome mtDNA comparisons gave the same conclusion (using a molecular clock)
LIMITATION: a selective sweep of a single strongly advantageous mutation would produce the same pattern, so it might not necessarily reflect migration out of Africa
Lecture 2 (HG) intro to human evo genomics
Use of Y-chromosome to study human evolution + migration
- non-recombining
- independent evidence about the human history as it is unlinked to mtDNA.
- paternally inherited and allows us to look at the ‘male history’,
- Y- based phylogeny is consistent with mtDNA. It also has a root in Africa and African lineages are most diverse
- Supports the African origin suggested by the mtDNA study
Lecture 2 (HG) intro to human evo genomics
Use of autosomal markers to investigate human evolutionary genetics + migration
- Using autosomes (nuclear genes) is a problem, as they do recombine
- This makes it difficult to build a phylogenetic tree
- But Principal Component Analysis (PCA) can be used to analyse polymorphisms in autosomal DNA sequences.
- Recombination makes the evolutionary histories of different genes independent, so they provide more independent data
- This provided further support for the idea that Africa is the source of all modern humans
Lecture 2 (HG) intro to human evo genomics
How do nuclear genes and mtDNA differ in the information they provide about human ancestry?
- Timescale:
- Nuclear genes: deeper phylogenies for exploring human ancestry
- mtDNA only gives us the most recent common ancestor of humans, and nothing further back, because it is a single mitochondrial lineage
Lecture 2 (HG) intro to human evo genomics
How does selection impact polymorphism?
- a lower recombination rate means a stronger hitchhiking effect: alleles at genes neighbouring the selected gene are more likely to be carried along with it
- hence the spread and fixation of an adaptive allele results in loss of genetic variation around the locus of the target allele (the alleles linked to it are fixed along with the advantageous allele)
- The size of the region affected by a selective sweep depends on the recombination rate
- e.g. on the Y chromosome, where no recombination occurs, the entire chromosome loses genetic variation after an advantageous (adaptive) allele is fixed
- e.g. with frequent recombination, only a short region around the advantageous allele loses variation when that allele is fixed
Lecture 2 (HG) intro to human evo genomics
What is Fst and how does it work?
The fixation index (FST) is a measure of population differentiation due to genetic structure
FST is the proportion of the total genetic variance contained in a subpopulation relative to the total genetic variance; it ranges from 0 to 1
A low FST means a lot of gene flow, interbreeding and connectivity between subpopulations, which keeps each of them similar to the population as a whole
A high FST (above 0.15, i.e. 15%) means that the subpopulations are differentiated
Lecture 2 (HG) intro to human evo genomics
What happens to DNA polymorphism after multiple selective sweeps?
After the 1st selective sweep, genetic variation decreases in the genes surrounding the advantageous allele
As new mutations accumulate, genetic variation in that region recovers.
Lecture 2 (HG) intro to human evo genomics
How can recent selective sweeps be detected?
Using a statistic called Tajima’s D. After a sweep, ALL new mutations are at very low frequencies; this bias towards rare variants is what the statistic detects
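The card gives no formula, so the following is an added sketch of the standard Tajima (1989) statistic, which compares the mean number of pairwise differences with the number of segregating sites. The toy 0/1 haplotypes are invented for illustration; a negative D indicates an excess of rare variants, as expected after a sweep.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for a list of equal-length aligned sequences (strings).
    Returns None when there are no segregating sites (D is undefined)."""
    n = len(seqs)
    length = len(seqs[0])
    # S: number of segregating (polymorphic) sites
    S = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)
    if S == 0:
        return None
    # pi: mean number of differences over all pairs of sequences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))


# Invented toy data: every variant is a singleton (sweep-like pattern)
rare = ["1000", "0100", "0010", "0001", "0000", "0000"]
# Invented toy data: every variant is at intermediate frequency
common = ["1111", "1111", "1111", "0000", "0000", "0000"]
print(tajimas_d(rare), tajimas_d(common))
```

The singleton-only sample gives a negative D and the intermediate-frequency sample a positive D.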
Lecture 2 (HG) intro to human evo genomics
What effect do 2 contrasting conditions have on alleles in populations?
Adaptation to contrasting conditions (e.g. high/low altitude) leads to spread and fixation of different locally adaptive alleles in the populations.
This differentiation is a typical footprint of local adaptation, and it can be used to identify genetic variants evolving under this type of selection pressure
Lecture 2 (HG) intro to human evo genomics
How can the differentiation of local subpopulations be measured in a quantifiable way?
Fst statistic = (Ht - Hs)/Ht
Ht = total heterozygosity across all populations, and Hs = heterozygosity within populations
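A minimal worked example of the formula above, in Python. It assumes a single locus with two alleles, equally sized subpopulations, and expected heterozygosity 2p(1-p); these assumptions and the function names are added for illustration.

```python
def heterozygosity(p):
    """Expected heterozygosity at a two-allele locus with allele frequency p."""
    return 2 * p * (1 - p)


def fst(subpop_freqs):
    """Fst = (Ht - Hs) / Ht from the allele frequency in each subpopulation.
    Assumes equally sized subpopulations."""
    k = len(subpop_freqs)
    hs = sum(heterozygosity(p) for p in subpop_freqs) / k   # mean within-subpopulation
    ht = heterozygosity(sum(subpop_freqs) / k)              # pooled population
    if ht == 0:
        return 0.0   # no variation at all, so no differentiation to measure
    return (ht - hs) / ht


print(fst([0.5, 0.5]))   # identical subpopulations: no differentiation
print(fst([0.2, 0.8]))   # strongly differentiated subpopulations
print(fst([0.0, 1.0]))   # fixed for different alleles: complete differentiation
```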
Lecture 2 (HG) intro to human evo genomics
How can you identify whether a local population has adapted to environmental changes?
Local adaptation can be identified through population differentiation: adaptation to different environments leaves a ‘signature’, because the genetics of the populations in the 2 different environments will differ
Lecture 2 (HG) intro to human evo genomics
What is an example of human adaptation to environments?
In a study of population differentiation between Han Chinese and Tibetans, the gene EPAS1 stood out from all other genes with a very high Fst. This means EPAS1 has the strongest signal of population differentiation and is a locally adapted gene.
The EPAS1 haplotype in Tibetans is very divergent from the haplotypes in other human populations, and was revealed to have been inherited from Denisovans through early interbreeding. Even if early interbreeding was rare, the genetic diversity it provided can be very advantageous and can spread through natural selection
Lecture 2 (HG) intro to human evo genomics
what is the neutral theory?
It suggests that most evolutionary changes at the molecular level (such as changes in DNA or protein sequences) and genetic variation/diversity are not caused by natural selection acting on advantageous traits. Instead, these changes are the result of random genetic drift of mutant alleles that are neutral.
Rare beneficial mutations do occur, and a selective sweep then rapidly fixes them.
Lecture 2 (HG) intro to human evo genomics
what is balancing selection
a type of natural selection where genetic diversity is maintained within a population. Unlike directional selection, which favors a single allele and can lead to a decrease in genetic diversity over time, balancing selection ensures that multiple alleles are preserved at a particular gene locus.
Lecture 2 (HG) intro to human evo genomics
what are some mechanisms of balancing selection?
- Heterozygote Advantage (Overdominance): being a heterozygote (having 2 different alleles at a locus) gives a fitness advantage. For example, in malaria-affected parts of Africa, sickle-cell heterozygotes (HbS/HbA) are protected from malaria and do not suffer the severe sickle cell anaemia seen in HbS/HbS homozygotes
- Frequency-Dependent Selection: the fitness of a phenotype depends on its frequency relative to other phenotypes in the population. There are two types: positive frequency-dependent selection, where the fitness of a phenotype increases with its frequency, and negative frequency-dependent selection, where the fitness of a phenotype decreases as it becomes more common. Only the negative type maintains diversity. Example: coloration to avoid predation. If one coloration becomes common, predators learn to recognise it and prey more on that color, so its fitness decreases as it becomes common. This allows multiple color alleles to persist in the population
- Disruptive selection: not necessarily balancing selection, but can contribute to maintaining multiple alleles
Lecture 2 (HG) intro to human evo genomics
how can the rate of selective sweep change depending on the type of adv mutation?
- all mutations are rare to start with
- if the adv mutation is dominant, the selective sweep may occur more quickly, because it is expressed (and selected) in heterozygotes
- if the adv mutation is recessive, the sweep takes much longer, because it is only expressed in homozygotes and the allele must first become common enough for two copies to meet
- if it is linked to another dominant beneficial allele, it can spread more quickly through hitch-hiking
- it can also spread more quickly through migration, genetic drift, the founder effect, etc.
Lecture 3 HG: Molecular Phylogenetics
what is phylogenetics
Reconstructing patterns of shared ancestry among organisms (within or between species), really about ancestry
Lecture 3 HG: Molecular Phylogenetics
what is taxonomy
Taxonomy: describing, naming, identifying and classifying species, grouping organisms into groups
Lecture 3 HG: Molecular Phylogenetics
what is phylogeny
the evolutionary history, ancestry and relationships between groups of organisms
Lecture 3 HG: Molecular Phylogenetics
how do we now represent phylogenies
We combine phylogenetics and taxonomy to create phylogenies.
* modern taxonomic classification uses phylogenetic techniques too
* phylogenetic trees: diagrams that represent phylogenies
Lecture 3 HG: Molecular Phylogenetics
what was the beginning of the modern phylogenetics era
Molecular sequence data: using sequences of cytochrome c to build phylogenetic trees by comparing the mutations that separate species
Lecture 3 HG: Molecular Phylogenetics
Definition of homology
Essentially, similarity.
The state of having the same or similar relation, relative position, or structure, e.g. “many proteins show homology across their whole length” or “a region of homology with another gene”.
Lecture 3 HG: Molecular Phylogenetics
why is phylogenetics based on the principle of homology
Because when comparing organisms in terms of evolution, their characteristics can be either homologous or analogous.
Characteristics of organisms are homologous if they are similar because they descended from a common ancestor.
Characteristics are analogous if they are similar but evolved independently rather than being inherited from a common ancestor. Only homologous characteristics carry information about shared ancestry.
e.g. bird and bat wings are homologous when considered as forelimbs, but analogous as wings.
Lecture 3 HG: Molecular Phylogenetics
what is phylogenetics based on, from the very fundamental level
homology, by comparing similarities and differences
Lecture 3 HG: Molecular Phylogenetics
What is molecular phylogenetics
Using molecular sequences which contain information about evolutionary history to build phylogenetic trees.
Information is often hidden, or fragmented in DNA
hence modern phylogenetics uses statistics and technology to try to recover and interpret information from DNA about phylogeny and evolutionary history
Lecture 3 HG: Molecular Phylogenetics
what types of conclusions/descriptions do sequence comparisons give about evolutionary relationships?
- Homologous sequences: a very broad umbrella term for sequences that are related by descent from a common ancestral sequence.
- Orthologous sequences: found in different species, arising after a speciation event. They were inherited from the same ancestral sequence and then diverged as the species diverged. Orthologous genes are not separated by a gene duplication, and tend to retain similar functions
- Paralogous sequences: genes related through a gene duplication event within a genome (the copies may sit in the same species or in different species). A copy can evolve a new function: a new gene from an old gene.
Lecture 3 HG: Molecular Phylogenetics
what are simple descriptions of homologous, orthologous, paralogous genes/sequences
Homologous is the broad category indicating genetic relatedness due to common ancestry.
Orthologous genes diverge after a speciation event, leading to similar genes in different species.
Paralogous genes result from gene duplication within the same organism, potentially leading to genes with new or specialized functions.
Lecture 3 HG: Molecular Phylogenetics
Why do we use molecular characteristics for phylogenetics instead of morphological ones
- Molecular characters have many advantages over morphological ones:
- Very common
- Objective, easy to quantify
- Available when morphology is uninformative (micro-organisms)
- Cheap, fast
- Can be obtained without specialist training
However, phylogenetics has many cases where morphological and molecular data initially disagreed, and this led to progress in phylogenetics
Lecture 3 HG: Molecular Phylogenetics
What is the one significant disadvantage about molecular sequences
unavailable in extinct species or fossils
Lecture 3 HG: Molecular Phylogenetics
how were the 3 domains classified/made
The phylogenetic tree of bacteria, archaea and eukarya was constructed using rRNA sequences: 16S rRNA and 18S rRNA
Lecture 3 HG: Molecular Phylogenetics
what are examples of when molecular data and morphological data initially disagreed
- The placement of whales: morphology suggests a marine (fish-like) animal, but molecular data place them within the mammals
- Fungi classification: morphology = plant-like, molecular data = more closely related to animals
- Protists: grouped together due to morphological characteristics of being ‘animal-like’ or ‘plant-like’, but molecular data revealed that paraphyletic lineages were included. Hence it is still very much a ‘dump’ classification
Lecture 3 HG: Molecular Phylogenetics
what are the types of mutations
- transition mutation: purine to purine or pyrimidine to pyrimidine (A-G, C-T), quite common
- transversion mutation: purine to pyrimidine or vice versa, rarer (less frequent than transitions)
- silent/synonymous mutations: encoded amino acid is unchanged; around 70% of changes at the 3rd position of a codon don’t change the amino acid sequence at all (redundancy of the genetic code)
- replacement/non-synonymous mutations: encoded amino acid is changed (can be subject to selection pressure)
- insertion: addition of one or more nucleotides to a sequence
- deletion: removal of one or more nucleotides from a sequence
- indels can cause frameshift mutations (which often create premature stop codons); replacement mutations can be missense (amino acid changed) or nonsense (stop codon created)
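The transition/transversion distinction above can be checked mechanically. The following is a small sketch for DNA bases only; the function name and the "identical" label are illustrative choices, not terms from the lecture.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Classify a change between two DNA bases."""
    a, b = a.upper(), b.upper()
    if a not in PURINES | PYRIMIDINES or b not in PURINES | PYRIMIDINES:
        raise ValueError("expected DNA bases A, C, G or T")
    if a == b:
        return "identical"
    # Both purines or both pyrimidines: the change stays within a chemical class
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    # Otherwise the change crosses between purine and pyrimidine
    return "transversion"


for pair in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
    print(pair, classify_substitution(*pair))
```

Of the 12 possible changes between different bases, only 4 are transitions, yet transitions are observed more often, which is why models such as K2P give them their own rate.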
Lecture 3 HG: Molecular Phylogenetics
give a brief overview of the process of constructing a phylogenetic tree
- First obtain molecular sequences
- using alignment methods, align the sequences correctly
- then, using sequence evolution models, work out the genetic distance between the sequences
- then, using phylogenetic methods, build an evolutionary tree where time scale = genetic distance
- then use molecular clock models to build an evolutionary tree where time scale = years
- then use either coalescent theory to infer population-level processes, or macroevolution models to infer species-level processes
Lecture 3 HG: Molecular Phylogenetics
examples of population-level processes and species level processes
population level processes: changes to a population of a single species over time
* natural selection
* genetic drift
* gene flow
* mutation
* sexual selection
species level processes: changes that affect the emergence, evolution, and extinction of species.
* speciation
* extinction
* adaptive radiation
* coevolution
* hybridization
Lecture 3 HG: Molecular Phylogenetics
what are some sequence alignment methods
- BLAST can be used to search for general matches between sequences
- but for multiple sequence alignment (MSA), dedicated algorithms such as Clustal and MUSCLE are better
- Global alignment
Lecture 3 HG: Molecular Phylogenetics
why is BLAST not
Lecture 3 HG: Molecular Phylogenetics
Why do we need to align sequences
Because there are multiple ways to align and compare sequences. Depending on the alignment, the interpretation of the sequences will be different, and a poor alignment can give incorrect evolutionary histories
Lecture 3 HG: Molecular Phylogenetics
what is molecular sequence alignment
- molecular sequence alignment is based on the concept of positional homology.
- nucleotides or aa have positional homology if they exist at equivalent positions in their compared sequences
- A set of nucleotide or amino acid sequences is converted into an alignment by proposing positional homologies for each site.
- There are many possible ways to align and compare a sequence
Lecture 3 HG: Molecular Phylogenetics
what are 2 methods of sequence alignment
- multiple sequence alignment (MSA): alignment of three or more sequences of similar length.
- Global alignment: a method used to align two sequences from beginning to end, maximizing the number of matches and minimizing the number of mismatches and gaps across the entire length of the sequences. It’s a type of pairwise sequence alignment.
Lecture 3 HG: Molecular Phylogenetics
describe the Multiple Sequence alignment methods
- for multiple sequence comparisons
- helps identify conserved sequences across multiple organisms, which may be indicative of functional or structural importance.
- understanding phylogenetic relationships, predicting the function of unknown proteins, and identifying conserved motifs.
Challenge:
as number of sequences increase, more computationally challenging to align
Example:
CLUSTAL and MUSCLE
Lecture 3 HG: Molecular Phylogenetics
Global alignment
- The goal is to find the best possible alignment that includes all characters from both sequences, which is particularly useful for comparing sequences of similar length and identifying overall similarities and differences.
- The Needleman-Wunsch algorithm uses dynamic programming to consider all possible alignments systematically and select the one with the highest score based on a scoring scheme (scores for matches, mismatches and gaps).
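A compact sketch of Needleman-Wunsch global alignment follows. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions for illustration, with a simple linear gap penalty; real tools use substitution matrices and more elaborate gap costs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, so every possible alignment is accounted for without enumerating them one by one.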
Lecture 3 HG: Molecular Phylogenetics
compare and contrast MSA and Global alignment methods, and pros and cons
- MSA is used to align three or more sequences and is essential for analyzing conserved regions across multiple sequences, while global alignment is designed for comprehensively aligning two sequences from start to finish.
- MSA is widely used in evolutionary studies, functional annotation, and identification of conserved motifs across multiple sequences. Global alignment is more suited for comparing two sequences in their entirety, such as when determining the overall similarity between two genes or proteins from different species.
MSA pros:
* useful in evolutionary studies of multiple organisms and phylogenetic relationships
* identify conserved regions
MSA cons:
* Gap Penalty Ambiguity: length of indels may affect how valid it is, especially in low similarity regions
* highly divergent and varied lengths can be difficult to align
* computational limitations: if too many sequences
GA pros:
* Optimal: algorithms such as Needleman-Wunsch systematically explore all possible alignments to find the optimal one.
* Complete Alignment: It aligns two sequences from beginning to end, useful for closely related sequences of similar length.
* Scoring System: The use of a scoring system (for matches, mismatches, and gaps) allows for quantifiable comparison of alignments, making it easier to assess the quality of the alignment.
GA cons:
* only pairwise, so less suited for evolutionary and phylogeny studies
* varying lengths with large indels are difficult to compare and align
Lecture 3 HG: Molecular Phylogenetics
How do most alignment methods work?
by assigning a different “cost” to each type of sequence difference (transitions, transversions, insertions, deletions etc).
Algorithms then calculate a total cost for each possible alignment.
The alignment with the lowest total cost is chosen
Lecture 3 HG: Molecular Phylogenetics
compare and contrast clustal and MUSCLE
clustal
* Clustal is an algorithm for MSA
* it first uses a scoring system for pairwise comparison between all sequences, then creates a guide tree from these pairwise scores, and then builds the MSA progressively by following the guide tree
* can align sequences of varying length and divergence (as long as the sequences are not too diverged)
MUSCLE
* good for large datasets, with high speed and accuracy
* 3 step process: a fast draft progressive alignment, an improved progressive alignment, then iterative refinement
Lecture 3 HG: Molecular Phylogenetics
what are some limitations of sequence alignment algorithms
- if sequences are too diverged and have low similarity, alignment is less accurate
- varying lengths can be difficult
- large dataset/too many sequences = computational complexity
- too large/too many indels can be difficult and lead to low accuracy
Lecture 3 HG: Molecular Phylogenetics
Alignment to genetic distance: why do we need to measure genetic distance
- to identify whether sequences have undergone convergent or divergent evolution, and to estimate how many substitutions and mutations actually occurred at each nucleotide/amino acid position
- to measure the evolutionary process that separates the sequences, not simply count their differences and mismatches
- this requires accounting for multiple substitutions at the same site: a site that changed A-C-A looks identical to one that never changed (A-A), so simple counts such as the p-distance miss these changes
Lecture 3 HG: Molecular Phylogenetics
How do we estimate how many substitutions actually occurred when calculating genetic distances for phylogenetic tree construction
When divergence is low, the observed number of changes is similar to the true genetic distance
When divergence is high, the observed number underestimates the true genetic distance, so a substitution model is used to correct the observed differences.
Lecture 3 HG: Molecular Phylogenetics
what is the multiple hits problem
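the same position in a sequence undergoing multiple substitutions over time, so that later changes overwrite earlier ones (including back mutations and convergent changes) and the observed differences underestimate the true number of substitutions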
Lecture 3 HG: Molecular Phylogenetics
what is p-distance
p-distance = number of differing positions / total number of positions compared
proportion of differences, measure of genetic distance in evolutionary biology.
It quantifies the genetic difference between two sequences (DNA or protein sequences) by calculating the proportion of sites (nucleotide or amino acid positions) at which the sequences differ.
Limitations: The p-distance can underestimate the true evolutionary distance between sequences because it does not account for multiple substitutions at the same site (back mutations or parallel mutations).
For sequences that are highly divergent, use algorithms that account for multiple hits
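The p-distance definition above translates directly into code. This sketch assumes the two sequences are already aligned to the same length, and skips any column containing a gap character "-", which is one common convention rather than the only one.

```python
def p_distance(seq1, seq2):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gap columns are not counted as compared sites
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


print(p_distance("ACGTACGT", "ACGTACGA"))  # 1 difference in 8 sites
```

Because a site that changed twice still counts at most once, this value saturates as sequences diverge, which is the limitation described above.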
Lecture 3 HG: Molecular Phylogenetics
what is the nucleotide substitution model:
- it is a mathematical model which aims to represent the processes of mutation and natural selection at the molecular level. It considers the probabilities of changes from one nucleotide (A, T, C, or G) to another across a phylogenetic tree.
- describes the rate of nucleotide change from one to another over time
Lecture 3 HG: Molecular Phylogenetics
what are nucleotide substitution models useful for?
- Estimating Divergence Times: By calculating the rates of nucleotide substitutions, we can estimate how long ago two species or sequences diverged from a common ancestor.
- Understanding Evolutionary Forces: These models can provide insights into the forces shaping genetic evolution, such as mutation rates, selection pressures, and genetic drift.
- Evolutionary Inference: By understanding genetic distances, we can reconstruct evolutionary histories too.
Lecture 3 HG: Molecular Phylogenetics
What are some common nucleotide substitution models and how do they work?
* they assign relative rates to different types of mutations
* Jukes-Cantor (JC) Model: The simplest model, assuming that all substitutions occur at the same rate, so each nucleotide has an equal chance of changing into any other nucleotide.
* K2P model: Also quite simple, introduces 2 different rates, one for transition and one for transversion mutations
* HKY model: more advanced and more realistic. Takes unequal base frequencies into account. Like K2P, it distinguishes transitions (pyrimidine to pyrimidine, purine to purine) from transversions (purine to pyrimidine, or pyrimidine to purine).
* The General Time Reversible (GTR) model: The most general and complex model, allowing for different rates for all possible changes between nucleotides and different equilibrium nucleotide frequencies. GTR can encompass the simpler models as special cases. Supposes 6 different substitution rates, one for each pair of nucleotides.
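For the two simplest models, the corrected genetic distance has a closed form. The sketch below uses the standard Jukes-Cantor and Kimura 2-parameter distance formulas; the function and argument names are illustrative, and the inputs are assumed to be proportions measured from an alignment.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p_distance(ts, tv):
    """Kimura 2-parameter distance from the proportions of sites showing
    transitions (ts) and transversions (tv)."""
    a = 1 - 2 * ts - tv
    b = 1 - 2 * tv
    if a <= 0 or b <= 0:
        raise ValueError("K2P is undefined for these proportions (saturation)")
    return -0.5 * math.log(a * math.sqrt(b))


# Corrected distances exceed the raw proportion of differences,
# and the gap widens as divergence increases.
for p in (0.05, 0.30, 0.60):
    print(p, round(jc69_distance(p), 4))
print(round(k2p_distance(0.10, 0.05), 4))
```

HKY and GTR distances generally have no such simple formula and are usually estimated numerically, for example by maximum likelihood.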
Lecture 3 HG: Molecular Phylogenetics
What should realistic nucleotide substitution models do?
Realistic models include the relative frequency of each nucleotide, i.e. if there are lots of As, we expect to see more changes involving A than T simply because A is more abundant (the model separates the rate of change per base from the abundance of that base)
So far GTR is the most flexible and most complex, as it supposes 6 different substitution rates and takes the frequencies of the bases into account.
Lecture 3 HG: Molecular Phylogenetics
compare and contrast the simple JC model with the more complex GTR model. When would you use each?
JC model
* Pros: easy to use; good if the dataset of sequences is limited, the base frequencies are largely similar and the substitution rates are roughly the same.
* Cons: over-simplification of the reality of mutations; in large datasets it is not very realistic and doesn’t reflect evolutionary history if the rates are different
GTR model
* Pros: reflects evolutionary histories more realistically, more flexible and complex. More adaptability for different substitution rates
* Cons: overfitting: may over-interpret noisy data or small datasets. More computationally demanding
Lecture 3 HG: Molecular Phylogenetics
what are Amino acid substitution models?
- nucleotide models for DNA/RNA seqs, aa models for protein seq
- These models are used to study the evolution of protein-coding genes by describing how amino acids change over evolutionary time.
- calculating the probabilities of one amino acid being replaced by another in a protein sequence over time.
- These models account for the fact that some amino acid changes occur more frequently than others due to factors like the physicochemical properties of the amino acids and the functional constraints of the protein. The models define rates of substitution for all possible pairs of the 20 amino acids.
Lecture 3 HG: Molecular Phylogenetics
what can you infer from substitution models
Understanding these changes helps scientists infer protein function, evolutionary relationships, and the dynamics of molecular evolution.
Lecture 3 HG: Molecular Phylogenetics
how do protein substitution models work?
- These models account for the fact that some amino acid changes occur more frequently than others due to factors like the physicochemical properties of the amino acids and the functional constraints of the protein. The models define rates of substitution for all possible pairs of the 20 amino acids.
- hence it is a 20x20 matrix, for all 400 possibilities
- These rates are obtained from large surveys of protein variation (not from your particular data set).
- Equilibrium Frequencies: Most models also consider the equilibrium frequencies of the amino acids, which represent the expected frequencies of each amino acid during evolution
Lecture 3 HG: Molecular Phylogenetics
what are some common models of amino acid substitution models, and how do they work?
- JTT Model: model uses a large database of known protein sequences to deduce rates. It adjusts the substitution rates and amino acid frequencies based on empirical data, making it useful for a wide range of evolutionary distances.
- These rates are obtained from large surveys of protein variation (not from your particular data set).
Lecture 3 HG: Molecular Phylogenetics
Nucleotide substitution models vs Amino acid substitution models, pros and cons
amino acid models
* Pros: captures the functional evolution of proteins; more suitable for studying highly diverged protein sequences, as it can capture more information when nucleotide saturation (multiple hits) occurs. Simpler analysis at a larger scale, rather than looking at small synonymous changes in nucleotides. Directly reflects selection pressures at the phenotypic/protein level
* Cons: Loss of Information: losing information about synonymous changes (which don’t alter the amino acid sequence) and potentially informative patterns in codon usage or RNA secondary structure.
Nucleotide models
* Pros: Detailed Evolutionary Insights, including synonymous mutations, which can be crucial for understanding selective pressures at the molecular level. Can construct higher resolution phylogenetic tree, when the protein models don’t produce detailed enough evolutionary differences
* Cons: Saturation Issues: For highly divergent sequences, saturation (where multiple hits occur) can make nucleotide models less effective over long evolutionary timescales. Best suited to shorter timescales
Lecture 3 HG: Molecular Phylogenetics
when to use amino acid model and when to use nucleotide model?
amino acid model:
* when focus is on protein-coding genes/phenotype selections that have experienced significant evolutionary divergence,
* when interested in functional aspects of protein evolution,
* or when nucleotide sequences are so divergent that saturation obscures their evolutionary history.
nucleotide models
* when analyzing closely related sequences, (identifying phylogenetic relationships between closely related species)
* where synonymous changes are informative, or when studying non-coding DNA regions.
ultimately depends on whether interested in coding/non-coding, and how specific the researched evolutionary distance is, and how long/large the divergence/evolutionary history may be
Lecture 3 HG: Molecular Phylogenetics
what are the major biological assumptions and limitations of using substitution models
- assumptions lead to errors when assumptions aren’t true
- Evolution homogeneity: These models often assume that substitution rates are homogeneous/same across the entire sequence being studied. However, different regions of a gene or protein may evolve at different rates due to functional constraints or varying levels of selective pressure.
- Independence: Another assumption is that substitutions at different sites occur independently of one another. In reality, the evolutionary process can be influenced by interactions between sites (epistasis), where the effect of a mutation at one site depends on the sequence at another site. Especially this assumption doesn’t consider 3D shapes and structures of proteins
- Saturation in nucleotide: multiple hits problem, can’t infer long evolutionary history or hugely divergent sequences
Lecture 3 HG: Molecular Phylogenetics
How can the assumption of evolutionary homogeneity in substitution models be fixed?
Use models of among-site rate heterogeneity (usually the gamma model), which model the fact that not every site evolves at the same rate
It incorporates the realistic scenario that not all parts of a sequence evolve at the same rate. Some regions might be highly conserved due to functional constraints, while others might evolve more rapidly.
Lecture 3 HG: Molecular Phylogenetics
how does the gamma distribution model for among site variation work?
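It provides a distribution of evolutionary rates across sites. This is done by adding an extra parameter, alpha, the shape parameter of the gamma distribution: a small alpha means rates vary strongly among sites (many slow sites and a few fast ones), while a large alpha means most sites evolve at similar rates.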
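The effect of alpha can be seen by drawing per-site rates from a gamma distribution. This is a rough sketch using Python's standard library, assuming the usual convention that the mean rate is fixed at 1 (shape alpha, scale 1/alpha); many phylogenetics programs instead approximate the distribution with a few discrete rate categories.

```python
import random


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative evolutionary rates for n_sites sites (mean rate 1)."""
    rng = random.Random(seed)
    # gammavariate(shape, scale): scale = 1/alpha keeps the mean at 1
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


for alpha in (0.5, 5.0):
    rates = sample_site_rates(alpha, 20000)
    slow = sum(r < 0.1 for r in rates) / len(rates)
    print(alpha, round(mean(rates), 2), round(variance(rates), 2), round(slow, 2))
```

With a small alpha a large share of sites are nearly invariant and a few evolve very fast; with a large alpha the rates cluster around 1.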
Lecture 3 HG: Molecular Phylogenetics
What are some phylogenetic methods which allow us to turn genetic distances into an evolutionary tree (time scale = genetic distance)
Lecture 3 HG: Molecular Phylogenetics
what does constructing a phylogenetic tree with genetic distance tell us?
helps understand relationships among different species or genes
and also when their divergences occurred in terms of genetic distance
construct new branches and nodes on phylogenetic tree based on genetic distances/genetic changes
Lecture 3 HG: Molecular Phylogenetics
what are 2 ways to construct phylogenetic tree
- rooted tree: has evolutionary direction, and horizontal lines represent genetic distance
- unrooted tree: all lines represent genetic distance, and there is no evolutionary direction
Lecture 3 HG: Molecular Phylogenetics
what are phylogenetic methods
techniques used in evolutionary biology to infer the evolutionary relationships and history among groups of organisms, genes, or other units of biological interest.
These methods aim to reconstruct the “phylogeny” or evolutionary tree that represents hypotheses about the ancestral relationships and divergence events that have led to the current diversity of life.
They require other molecular information (like genetic distances) as a basis
Lecture 3 HG: Molecular Phylogenetics
Examples of phylogenetics methods
- UPGMA
- Neighbour Joining
- Maximum Parsimony
- Maximum Likelihood
- Bayesian Inference
Lecture 3 HG: Molecular Phylogenetics
what data do phylogenetic methods/models require in advance before they can generate a tree?
- Different for different models.
- UPGMA and Neighbour-Joining rely on genetic distances, and hence require a sequence alignment from which genetic distances are estimated; a matrix of genetic distances is then the input to the algorithm
- others like Maximum Likelihood, Maximum Parsimony and Bayesian Inference only require aligned sequences, but other parameters have to be set when running the algorithm
- some programs can run multiple steps together, from alignment to genetic distance to generating a tree. But choosing the best and most suitable model before constructing the tree is important
Lecture 3 HG: Molecular Phylogenetics
What additional steps should you take and be aware of when making a phylogenetic tree
- choosing right model based on sequences/dataset. i.e how divergent sequence is, how large data set is
- changing the parameters when using the models
- adding additional time estimates
- making sure prior alignment + genetic distance estimates are accurate
- adding in bootstrap value
Lecture 3 HG: Molecular Phylogenetics
what does a phylogenetic tree actually interpret?
- being able to look at divergences and homology of species
- evolutionary history and relatedness of species
Lecture 3 HG: Molecular Phylogenetics
What ‘types’ of phylogenetic methods are there, how do you classify them?
- classified by different ways
Algorithmic/distance-based methods
* These methods begin with a genetic distance for each pair of sequences. A ‘clustering algorithm’ then transforms the genetic distances into a tree.
* e.g. UPGMA, Neighbour-Joining (NJ)
Optimality methods
* These methods define some kind of score for each possible tree. An optimisation algorithm is used to find the tree with the best score, i.e. the most optimal tree
* e.g. Maximum Parsimony (MP), Maximum Likelihood (ML), Bayesian Inference
Statistical method
* These methods calculate a probability for each possible tree. They frame phylogeny estimation as a formal statistical problem
* e.g. Maximum Likelihood, Bayesian Inference
Lecture 3 HG: Molecular Phylogenetics
what is molecular clock
- molecular clock is not a separate method for constructing phylogenetic trees by itself, but rather a concept or parameter that can be integrated into various phylogenetic methods.
- hypothesis: genetic mutations accumulate at a roughly constant rate over time in a given genomic region. In other words, the amount of genetic change (mutations) that occurs is proportional to the time that has passed.
Lecture 3 HG: Molecular Phylogenetics
Types of molecular clock?
* Strict Molecular Clock: Assumes the same rate of mutation accumulation for all lineages being studied. This is a simpler but often less realistic assumption.
* Relaxed Molecular Clock: Allows for different rates of mutation accumulation in different lineages. This is more complex but often more accurate, as it accounts for the fact that different species or genes might evolve at different rates.
* Local Clock: Rate varies, but is inherited, so adjacent branches have more similar rates
Lecture 3 HG: Molecular Phylogenetics
what is the application of molecular clocks in phylogenetics study/constructing phylogenetics tree
- used to estimate the time of divergence between different species or lineages based on their genetic differences, hence constructing tree
- a parameter that can change genetic distances to time estimates in tree
- Because of this constant rate, the molecular clock can be used to estimate the time of divergence between different species or lineages.
- By comparing the genetic differences between two species and knowing the rate at which mutations accumulate, scientists can estimate how long ago their common ancestor lived.
Lecture 3 HG: Molecular Phylogenetics
How to integrate molecular clock into phylogenetics in applications?
- Rate Measurement: if a certain gene accumulates one mutation every million years on average, and two species differ by ten mutations in that gene, they likely diverged about ten million years ago (treating this as the rate at which differences accumulate between the two species). This can be done by measuring genetic distances
- Calibration: To use a molecular clock effectively, it often needs to be calibrated with independent data, such as fossil records, geological events, or known evolutionary events. These calibrations provide reference points to determine the mutation rate. For example, using a known divergence time between species
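The calibration and rate-measurement steps can be sketched as simple arithmetic under a strict clock. The sketch assumes the rate is expressed per lineage, so the distance between two species accumulates along both branches (hence the factor of 2); if the rate is instead quoted as the pairwise rate at which differences build up between two species, as in the ten-million-year example, the factor of 2 is dropped. All numbers are made up for illustration.

```python
def calibrate_rate(distance, known_divergence_time):
    """Per-lineage substitution rate from a pair with a known split time."""
    # The two lineages have each been evolving for known_divergence_time
    return distance / (2 * known_divergence_time)


def divergence_time(distance, rate_per_lineage):
    """Time since two sequences shared a common ancestor, strict clock."""
    return distance / (2 * rate_per_lineage)


# Hypothetical calibration: two species known (e.g. from fossils) to have
# split 10 million years ago differ by 0.02 substitutions per site.
rate = calibrate_rate(0.02, 10)

# Apply the calibrated rate to another pair separated by 0.05 substitutions per site.
print(divergence_time(0.05, rate))  # estimated split time in millions of years
```

The estimate is only as good as the calibration point and the assumption that the rate stayed constant in both lineages.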
Lecture 3 HG: Molecular Phylogenetics
what are some limitations of molecular clock?
- Rate Variation: Not all genes or regions of DNA evolve at the same constant rate. Some may evolve more quickly or slowly, affecting the accuracy of the clock.
- Calibration Accuracy: The accuracy of a molecular clock depends heavily on the accuracy of the calibration points used.
- Evolutionary Pressures: Natural selection and other evolutionary forces can affect mutation rates, complicating the assumption of constant rates.
Lecture 3 HG: Molecular Phylogenetics
Molecular clock integration with Phylogenetic Methods:
Distance-Based Methods (e.g., UPGMA, Neighbor Joining):
* molecular clock can be an underlying assumption.
* For instance, UPGMA assumes a strict molecular clock (constant rate of evolution across all lineages), which can be a limitation.
* Neighbor Joining does not assume a strict molecular clock, making it more flexible and broadly applicable.
Maximum Likelihood (ML) and Bayesian Inference (BI):
* incorporate the molecular clock as an optional parameter in their models.
* use “relaxed” molecular clock models, which allow for variation in the rate of mutation accumulation among different lineages. This is more realistic for many datasets.
In Bayesian analysis, particularly, the molecular clock can be integrated with prior information to estimate divergence times along with the phylogenetic relationships.
Lecture 3 HG: Molecular Phylogenetics
what are comparative methods?
- comparative methods involve comparing various biological traits (like anatomical, physiological, or molecular traits) across different species or groups.
- These methods are used to understand the evolutionary relationships and processes that have shaped these traits.
- Comparative methods can be used to test hypotheses about evolutionary processes, like adaptation, co-evolution, or the impact of environmental factors on evolution.
- used to investigate evolutionary processes after tree is constructed: for example determining divergent or convergent evolution by analysing tree
Lecture 3 HG: Molecular Phylogenetics
What is UPGMA, how does it work, and its limitations
- distance-based/algorithmic method
- hence requires a matrix of pairwise genetic distances, computed in advance from an accurate alignment
- The distances measure how different the sequences are, which is assumed to reflect evolutionary time.
- assumes a strict molecular clock: a constant rate of evolution (same rate of mutation) across all lineages
- constructs a phylogenetic tree by clustering taxa based on their pairwise distances. It begins by joining the pair of taxa with the smallest distance, then repeatedly joins the next closest pair of clusters (using the average distance between their members) until one tree remains.
Limitations:
* Cannot be compared against alternative trees to find the ‘best’ one, as this method only produces one tree.
* Assumes a constant rate of evolution, which is often unrealistic.
* Not as accurate for datasets where evolutionary rates vary.
* not accurate for highly divergent sequences
Pros:
* useful for quick analysis
* useful if the dataset fits the molecular clock assumption (little selection pressure and a constant rate of mutation, which is unlikely in practice)
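The clustering steps can be sketched in a few lines of Python. This is an illustrative toy, not a replacement for a phylogenetics package: the function name `upgma`, the nested-tuple tree format, and the four-taxon distance matrix are all assumptions made for the example. Each merge is placed at half the distance between the two clusters, which is where the strict clock assumption enters.

```python
def upgma(names, matrix):
    """Cluster taxa by UPGMA.

    names: list of taxon labels; matrix: symmetric list-of-lists of distances.
    Returns the tree as nested tuples plus a list of (merged subtree, height).
    """
    # Active clusters: id -> (subtree, number of taxa it contains).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {(i, j): matrix[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        # Join the two clusters separated by the smallest distance.
        i, j = min(d, key=d.get)
        height = d[(i, j)] / 2  # strict clock: both sides are equally old
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        new_dists = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            new_dists[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_dists)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        merges.append(((tree_i, tree_j), height))
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merges

# Hypothetical pairwise distances for four taxa.
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree, merges = upgma(names, matrix)
print(tree)
for subtree, height in merges:
    print(subtree, height)
```

Only one tree comes out, which is the limitation noted above: there is no score to compare it against alternatives.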
Lecture 3 HG: Molecular Phylogenetics
What is Neighbour Joining, how does it work, and its limitations
- distance based/algorithmic method
- hence requires genetic distance + accurate alignment
- NJ builds a tree by iteratively joining pairs of taxa (neighbours), choosing each pair from distances corrected for how far each taxon is from all the others, so it does not assume a constant rate of evolution
- very similar to UPGMA but without assuming a strict molecular clock
Limitations:
* only produces one tree, so there is no alternative tree to compare it with
* Accuracy can diminish with highly divergent sequences
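A minimal sketch of how NJ chooses which pair to join in one step, assuming the standard criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from taxon i to all others. The function name and the five-taxon matrix are illustrative. In this example the raw closest pair is d and e (distance 3), but the rate-corrected criterion picks a and b, which shows how NJ differs from UPGMA.

```python
from itertools import combinations

def nj_pair_to_join(names, matrix):
    """Return the pair of taxa with the lowest Q value, plus all Q values."""
    n = len(names)
    row_sums = [sum(row) for row in matrix]
    q = {}
    for i, j in combinations(range(n), 2):
        # Subtracting the row sums corrects for taxa that are far from everything.
        q[(names[i], names[j])] = (n - 2) * matrix[i][j] - row_sums[i] - row_sums[j]
    best = min(q, key=q.get)
    return best, q

# Hypothetical pairwise distances for five taxa.
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
best, q = nj_pair_to_join(names, matrix)
print(best, q[best])
print(("d", "e"), q[("d", "e")])
```

A full NJ run would replace the joined pair with a new node, recompute distances to it, and repeat until the tree is resolved; only the pair-selection step is shown here.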
What is Maximum Parsimony, how does it work, and its limitations
- Optimality based method
- doesn’t require genetic distances (so no substitution model needed), only requires alignment
- MP constructs a tree by minimizing the total number of evolutionary changes (like mutations) required to explain the observed data.
- In principle it compares all possible tree topologies and selects the one requiring the fewest changes.
- uses a parsimony score: the minimum number of evolutionary changes/mutations required to explain the observed changes in sequences
Pros:
* fast
* doesn’t require substitution models
* Most useful when applied to morphological character data.
Limitations:
* Can be misled by convergent evolution (independent evolution of similar traits)
* Unreliable for fast-evolving or highly divergent sequences.
* does not specifically account for different evolutionary rates/models
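The parsimony score for a given tree can be computed with Fitch's algorithm, sketched below. The tree format (nested tuples of taxon names), the function names, and the four short sequences are assumptions for illustration. Each candidate topology gets a score, and the one with the lowest score is preferred.

```python
def fitch_score(tree, states):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):  # a leaf: its state is observed, no changes yet
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_score(left, states)
    right_set, right_cost = fitch_score(right, states)
    common = left_set & right_set
    if common:
        # The children can share a state, so no change is needed here.
        return common, left_cost + right_cost
    # No shared state: one change must have happened on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over every column of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {taxon: seq[k] for taxon, seq in alignment.items()})[1]
        for k in range(length)
    )

# Hypothetical aligned sequences, three sites each.
alignment = {"A": "GTA", "B": "GTC", "C": "ACA", "D": "ACC"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree1, alignment))
print(parsimony_score(tree2, alignment))
```

Here `tree1` needs fewer changes than `tree2`, so it is the more parsimonious of the two. Note that no substitution model or distance matrix appears anywhere, only the alignment.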
Lecture 3 HG: Molecular Phylogenetics
What is Maximum Likelihood, how does it work, and its limitations
- optimality + statistical model
- most commonly used
- ML evaluates different tree topologies based on the probability of observing the given data under different evolutionary models.
- evolutionary models = substitution models
- Chooses the tree that maximizes the likelihood of the observed data given a particular model of sequence evolution/substitution models
- the highest probability = the best tree
- the probability is calculated from the tree topology, the branch lengths of the tree (which represent genetic distance as estimated under the substitution/evolutionary model), and the rate parameters of the substitution model
- does not require a strict molecular clock; a relaxed clock can be used
- Tree searching is used to find the topology with the highest likelihood
Pros:
* Statistically robust; uses explicit models of sequence evolution.
* Handles variable rates of evolution across lineages and sites well.
* High accuracy in tree estimation.
* sophisticated
Limitations:
* slow
* Computationally demanding, especially with large datasets.
* The choice of model can greatly influence the results.
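A minimal sketch of the likelihood idea in its simplest case: two sequences joined by a single branch, scored under the Jukes-Cantor (JC69) substitution model. The model choice, function name, and sequences are assumptions for illustration. The code tries many branch lengths and keeps the one that gives the observed data the highest probability, then compares it with the known closed-form JC69 estimate.

```python
import math

def jc69_log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences separated by branch length d
    (expected substitutions per site) under the JC69 model."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability a site ends in the same base
    p_diff = 0.25 - 0.25 * e  # probability it ends in one specific other base
    total = 0.0
    for x, y in zip(seq1, seq2):
        # 0.25 is the equilibrium frequency of the starting base under JC69.
        total += math.log(0.25 * (p_same if x == y else p_diff))
    return total

# Hypothetical aligned sequences: 20 sites, 2 differences.
seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "ACGTACGAACGTACTTACGT"

# Grid search: evaluate the likelihood at many branch lengths, keep the best.
candidates = [i / 1000 for i in range(1, 1001)]
best_d = max(candidates, key=lambda d: jc69_log_likelihood(seq1, seq2, d))

# Closed-form JC69 estimate from the proportion of differing sites.
p = sum(x != y for x, y in zip(seq1, seq2)) / len(seq1)
closed_form = -0.75 * math.log(1 - 4.0 * p / 3.0)

print(best_d, closed_form)
```

A real ML analysis does the same kind of optimisation over every branch length and model parameter for each candidate topology, which is why it is slow on large datasets.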
Lecture 3 HG: Molecular Phylogenetics
what is tree searching in Maximum likelihood?
- A tree search is used to find the tree topology with the highest likelihood.
Tree Searching
1. Exhaustive Search: - Tries every possible tree. Only feasible with small numbers of taxa.
2. Hill Climbing: - Searches through trees by iterative trial and error.
- Start with one tree and score it; try a slightly changed tree and keep it if its likelihood is higher, otherwise try another
- Doesn’t check all possible trees and isn’t guaranteed to find the optimal one.
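To see why an exhaustive search is only feasible for small numbers of taxa, the short sketch below counts the possible unrooted, fully bifurcating trees, using the standard formula (2n - 5)!! for n taxa (the function name is illustrative).

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted binary trees for n_taxa labelled tips."""
    count = 1
    # (2n - 5)!! is the product of the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

The count passes two million at 10 taxa and keeps growing faster than exponentially, so heuristic searches such as hill climbing are used for larger datasets.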
Lecture 3 HG: Molecular Phylogenetics
What is Bayesian Inference, how does it work, and its limitations
- optimality and statistical model
- Similar to ML in using probability models for evolutionary change.
- BI incorporates prior knowledge and calculates the probability of a tree given the data. It provides a statistical framework for estimating the uncertainty in phylogenetic inferences.
- Suitable for complex datasets where incorporating prior knowledge or hypothesis testing is important.
- Ideal when you need to estimate the uncertainty in your phylogenetic inferences.
Pros:
* Incorporates prior knowledge and uncertainty into the analysis.
* Statistically rigorous, using explicit models like ML.
* Provides estimates of the probability of different phylogenetic trees.
Limitations:
* slow
* Even more computationally intensive than ML.
* The choice of priors and models can significantly affect the results.
* Requires careful interpretation
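A minimal sketch of how prior and likelihood combine into a posterior probability for each tree, assuming only three candidate topologies with made-up log-likelihoods and equal priors (all names and numbers are hypothetical). Real Bayesian programs cannot list every tree, so they sample trees, typically with Markov chain Monte Carlo (MCMC), instead of enumerating them like this.

```python
import math

def tree_posteriors(log_likelihoods, priors):
    """Posterior probability of each tree: prior x likelihood, normalised."""
    log_joint = {
        tree: log_likelihoods[tree] + math.log(priors[tree])
        for tree in log_likelihoods
    }
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    biggest = max(log_joint.values())
    weights = {tree: math.exp(v - biggest) for tree, v in log_joint.items()}
    total = sum(weights.values())
    return {tree: w / total for tree, w in weights.items()}

# Hypothetical log-likelihoods for the three unrooted topologies of four taxa.
log_likelihoods = {"((A,B),(C,D))": -1000.0,
                   "((A,C),(B,D))": -1002.0,
                   "((A,D),(B,C))": -1005.0}
priors = {tree: 1 / 3 for tree in log_likelihoods}  # no topology favoured

posteriors = tree_posteriors(log_likelihoods, priors)
for tree, prob in posteriors.items():
    print(tree, round(prob, 4))
```

Unlike ML, which reports only the single best tree, the output is a probability for every candidate tree, which is how BI expresses uncertainty. Changing `priors` changes the posteriors, which is why the choice of priors matters.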