14 | Large scale analyses Flashcards

1
Q

Define population genetics.

A
  • subfield of genetics
  • part of evolutionary biology
  • deals with genetic differences within and among pops
  • examine phenomena eg: adaptation, speciation, population structure

from 1920s on
(Fisher, Haldane, Wright)
* formal models of evolution, statistical methods
* allele and gene frequencies in populations over time
* data: few genotypes of a limited number of individuals

2
Q

Define phylogenomics

A

Ultimate goal: reconstruct the evolutionary history of species through their genomes

Intersection of the fields of evolution and genomics

Analysis that involves genome data and evolutionary reconstructions.

3
Q

What is molecular genetics / genomics about?

What type of data?

A

from 1950s on
(Watson & Crick)?
* makeup, expression, regulation of genes;
genotype-phenotype
* data: gene & genome sequences, phenotypes

4
Q

Define Population Genomics.

Data used?
Concepts and tools?
Questions?

A
  • large-scale application of genomic technologies to study populations of individuals
  • data: multiple genomes from the same (or closely related) species; thousands / millions of SNPs per individual
  • studies genome-wide effects to improve our understanding of microevolution –> learn phylogenetic history and demography of a pop

concepts & tools
- linkage disequilibrium, genetic drift, coalescent,
multivariate statistics, …

questions: population structure & history; detect evolutionary processes along the genome
- gene & genome evolution, recombination events, split times, gene flow, population sizes, demographic events, selection, diversification, relatedness, pop. structure, etc
- contemporary & ancestral populations/species

5
Q

Genome-scale evolutionary analysis

Based on which regions? (two options)

A

In both cases: several gene sequences (can’t be one like in the previous small scale analyses we studied)

based on coding regions:
several (or all) gene sequences from different species –> families of homologs / orthologs
- often one individual per species
- homology/orthology assignment required

based on any/all genomic regions:
independent of gene content –> homologous genomic regions –> alignments and phylogenies
- generally done based on re-sequencing data
- separate homology assignment step not necessary
- same or closely related species

6
Q

Which step in the pipeline sets large scale analyses apart from the small scale analyses we studied previously?

A

Just this step is done differently and makes large scale more difficult:

Assigning sequences (genomic data from multiple species) to families of homologous/orthologous genes

7
Q

What are the three general approaches to Genome-scale inference of homology / orthology?

And what are the possible types of data? (plus examples)

A

Approaches:
- Tree based
- Graph based
- Hybrid

Data:
- Databases: use pre-computed sets of homologs / orthologs (eg TreeFam, OMA)
- Customized data set: compute project-specific homologs / orthologs

8
Q

What is an example of a database which can be used for tree-based genome scale inference of homology/orthology?

A

TreeFam

9
Q

What are the two types of graph-based genome scale inference of homology/orthology?

Name an example database for each type.

A

graph-based
- reciprocal best hit (RBH): COG
- clustering (MCL): OrthoMCL
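
The reciprocal-best-hit idea can be sketched in a few lines of Python. This is a minimal illustration only, not how COG or OrthoMCL are implemented: the function names `best_hits` and `reciprocal_best_hits`, the nested-dictionary score format, and the toy scores are all assumptions made for this example.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit.

    scores: {query_gene: {subject_gene: similarity_score}} (hypothetical format).
    """
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy similarity scores between genes of genome A and genome B (made up).
a_vs_b = {"a1": {"b1": 250.0, "b2": 90.0}, "a2": {"b1": 120.0, "b2": 80.0}}
b_vs_a = {"b1": {"a1": 250.0, "a2": 120.0}, "b2": {"a1": 90.0, "a2": 80.0}}

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

Here a2 also hits b1 best, but b1's best hit is a1, so only the a1/b1 pair is reciprocal. Clustering methods such as MCL work on the whole similarity graph instead of on best hits alone.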

10
Q

What is usually the best approach for genome-scale inference of homology / orthology?

Name a database that can be used for this

A

A hybrid approach combining graph-based and tree-based methods.

For example OMA (Orthologous MAtrix)

11
Q

What is OMA?

A

OMA (Orthologous MAtrix) algorithm & database

  • best strategy for large-scale assignment of orthologs/homologs
  • using a combination of graph- and tree-based methods
  • multi-step pipeline
  • pre-computed (and cross-linked from other dbs)
    • 1:1 orthologs
    • homologs (orthologs & paralogs)
  • stand-alone software
  • companion tools (analysis, visualization)
  • HOGs (hierarchical orthologous groups)

–> large-scale assignment of homology is difficult: there are many strategies, results are often incorrect, and the analysis is time-consuming
–> rather use pre-computed (existing) assignments where possible

12
Q

What are seven challenges of assigning homology / orthology?

A
  • pairwise orthology definition (non-transitive)
  • differential gene loss (or incomplete sampling)
  • multi-domain proteins / mosaics
  • horizontal transfer (xenologs)
  • high rates of sequence divergence
  • poor genome assembly / annotation
  • computational demand
13
Q

What is resequencing?

A

Resequencing is typically performed when a reference genome sequence is available.

Sequencing reads are aligned back to the reference to determine the location in the genome the specific read best matches.

Only works for same or closely related species

14
Q

What is a disadvantage of re-sequencing projects compared to independently (de novo) assembled genomes?

A

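Re-sequencing against a reference genome can lead to reference bias: reads carrying non-reference alleles align less well to the reference, so variation that differs from the reference is under-detected.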

15
Q

What can genome-scale phylogenetic analysis result in when carried out on one gene, entire genomes, or many genes (genomic windows)?

A

one gene –> one tree

entire genomes –> one tree

many genes
- if you concatenate the genes –> one tree
- separate analyses of each gene –> many trees
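
The concatenation route can be illustrated with a small Python sketch that joins per-gene alignments into one supermatrix, from which a single tree would then be inferred. The function name `concatenate`, the dictionary-based alignment format, and the gap-padding of taxa missing from a gene are assumptions for this example, not something prescribed by the course material.

```python
def concatenate(gene_alignments, gap="-"):
    """Join per-gene alignments into one supermatrix.

    gene_alignments: list of {taxon: aligned_sequence} dictionaries
    (hypothetical format). Taxa missing from a gene are padded with gaps.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Append this gene's columns, or gaps if the taxon lacks the gene.
            supermatrix[taxon] += aln.get(taxon, gap * length)
    return supermatrix


# Toy alignments (made up): gene2 is missing from sp2.
gene1 = {"sp1": "ATGC", "sp2": "ATGA", "sp3": "ATGG"}
gene2 = {"sp1": "CC", "sp3": "CT"}

supermatrix = concatenate([gene1, gene2])
print(supermatrix)
```

Analysing `gene1` and `gene2` separately instead would give one tree per gene, which is where incongruence between trees becomes visible.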

16
Q

What is phylogenetic incongruence?

A

gene/locus/window trees
- are different from each other
- (are different from the known/expected species tree)

17
Q

Phylogenetic incongruence - technical explanation?

A
  • insufficient taxon sampling
  • orthology mis-assignment
  • misalignment
  • excessive trimming
  • inappropriate model, …
18
Q

Phylogenetic incongruence - biological explanation? name 5

A

different genome regions have different evol. histories!
- incomplete lineage sorting / deep coalescence
- hybridization or introgression
- horizontal gene transfer (HGT)
- differential duplication and loss
- natural selection

19
Q

Define incomplete lineage sorting / deep coalescence

A

A cause of phylogenetic incongruence - can lead to gene tree ≠ gene tree ≠ species tree

allelic polymorphisms exist across speciation events

alleles may coalesce first with alleles from a more distantly related species (rather than with those of the sister species)

Random sorting of ancestral polymorphisms:
Anything other than perfect segregation of all alleles into all lineages is called “incomplete lineage sorting” – and for a large genome, it is a given that at least some genes will exhibit this effect.

Incomplete lineage sorting (also termed hemiplasy, deep coalescence, retention of ancestral polymorphism, or trans-species polymorphism) describes a phenomenon in population genetics in which ancestral gene copies fail to coalesce (looking backwards in time) into a common ancestral copy until deeper than previous speciation events.

20
Q

Consider an ancestral polymorphism: the common ancestor carries three alleles A, B, C, and two speciation events then produce the following species tree for the corresponding species: ((A,B),C)

What are the three possible gene trees according to when the alleles coalesced?

What is the expected frequency of gene trees discordant with the species tree? (UNLESS …?)

A
  • ((A,B),C) –> concordant
  • ((A,C),B) –> discordant
  • ((B,C),A) –> discordant

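If the lineages fail to coalesce before the deeper speciation event (looking backwards in time), each gene tree is equally likely (1/3 each), so the probability of a discordant tree is 2/3. The two discordant trees are expected at equal frequencies.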

UNLESS there was gene flow
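
For three species without gene flow, the standard neutral coalescent gives the gene tree frequencies as a function of the internal branch length T, measured in coalescent units. This textbook result is not stated on the card and is used here as an assumption: the probability of no coalescence on the internal branch is exp(-T), and in that case the three topologies are equally likely. The 2/3 above is therefore the limiting case T = 0. The function name `gene_tree_probabilities` is made up for this sketch.

```python
import math


def gene_tree_probabilities(t):
    """Gene tree probabilities for species tree ((A,B),C) without gene flow.

    t: length of the internal branch in coalescent units (assumed input).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    p_no_coalescence = math.exp(-t)          # lineages of A and B stay separate
    p_each_discordant = p_no_coalescence / 3  # random sorting among 3 topologies
    p_concordant = 1 - 2 * p_each_discordant
    return {
        "((A,B),C)": p_concordant,
        "((A,C),B)": p_each_discordant,
        "((B,C),A)": p_each_discordant,
    }


for t in (0.0, 0.5, 2.0):
    probs = gene_tree_probabilities(t)
    p_discordant = probs["((A,C),B)"] + probs["((B,C),A)"]
    print(t, round(p_discordant, 3))
```

Without gene flow the two discordant trees stay equally frequent for any T. An excess of one of them is the signal that tests for introgression look for.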

21
Q

Define gene flow

A

Gene flow (aka gene migration)

  • transfer of genetic material from one population to another
  • between two pops of closely related species (or lineages) or between pops of the same species
  • mediated by reproduction and vertical gene transfer from parent to offspring.
  • ancient or recent / rare or ongoing
  • a lot more frequent than initially thought!
21
Q

Potential outcomes of gene flow?

A
  • nothing
  • merge into 1 species
  • invasion of 1 species
  • form hybrid zone
  • form new hybrid species

most important for us: exchange few genes = introgression

22
Q

Define introgression

A

(arises through repeated backcrossing of hybrids)

gene flow between closely related species (lineages)
- ancient or recent / rare or ongoing
- a lot more frequent than initially thought!
- hybrids are rare
- they backcross with parental species
- parental species remain distinct

23
Q

How can introgression be detected? Most important method for us?

A

using genomic data
- trees that are discordant from species tree?
- tests to identify introgressed genomic regions, direction or amount of gene flow
- …

Most important for us:
- excess of shared alleles between hybridizing taxa
- D-statistic / ABBA-BABA test, f-statistic

24
Q

What is the ABBA BABA test / statistic (D-Statistic)?

(research more)

A

ABBA BABA statistics (also called D statistics)

  • simple and powerful test for a deviation from a strictly bifurcating evolutionary history.
  • frequently used to test for introgression using genome-scale SNP data.
  • originally developed to test for genetic exchange between Neanderthals and modern humans

An excess of either ABBA or BABA, resulting in a D-statistic that is significantly different from zero, is indicative of gene flow between two taxa.

(A positive D-statistic (i.e. an excess of ABBA) points to introgression between P2 and P3, whereas a negative D-statistic (i.e. an excess of BABA) points to introgression between P1 and P3.)

25
Q

Explain the D-statistic and the ABBA-BABA test with an example.

(research more)

A

Consider four taxa P1, P2, P3, and O (outgroup) with the following species tree:
(((P1,P2),P3),O)

Each site carries either the ancestral (‘A’) or the derived (‘B’) allele in each taxon. Site patterns are written in the order P1, P2, P3, O.

The informative sites are those where:
- The outgroup species has A
- P3 has B
- P1 and P2 differ (one has A, the other has B)

This gives two patterns, each matching a gene tree that is discordant with the species tree:
- ABBA (P2 and P3 share B) –> (((P2,P3),P1),O)
- BABA (P1 and P3 share B) –> (((P1,P3),P2),O)

D compares the genome-wide counts of the two patterns:
D = (nABBA - nBABA) / (nABBA + nBABA)

Using the frequencies of the ABBA and BABA patterns, we can determine if introgression has taken place between P3 and either P1 or P2

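if D ≈ 0 (not significantly different from zero):
- equal frequencies of ABBA and BABA patterns
- consistent with incomplete lineage sorting only, no introgression

if D is significantly different from zero:
- introgression has likely taken place (D > 0: between P2 and P3; D < 0: between P1 and P3)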
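
As a minimal sketch of the calculation, D can be computed from counts of the two site patterns as (nABBA - nBABA) / (nABBA + nBABA). The function name `d_statistic`, the four-letter pattern strings (alleles in the order P1, P2, P3, O) and the toy counts are assumptions for this example. Real analyses typically work from allele frequencies and assess significance with a block jackknife, which is not shown here.

```python
from collections import Counter


def d_statistic(site_patterns):
    """Compute D from biallelic site patterns.

    site_patterns: iterable of 4-letter strings giving the allele
    ('A' ancestral, 'B' derived) in the order P1, P2, P3, O.
    """
    counts = Counter(site_patterns)
    n_abba = counts["ABBA"]  # P2 and P3 share the derived allele
    n_baba = counts["BABA"]  # P1 and P3 share the derived allele
    if n_abba + n_baba == 0:
        raise ValueError("no ABBA or BABA sites, D is undefined")
    return (n_abba - n_baba) / (n_abba + n_baba)


# Toy data (made up): BBAA sites match the species tree and are ignored.
patterns = ["ABBA"] * 30 + ["BABA"] * 10 + ["BBAA"] * 60

d = d_statistic(patterns)
print(d)  # 0.5
```

The positive value in this toy case reflects an excess of ABBA sites, which would point to gene flow between P2 and P3 if it were statistically significant.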

26
Q

one genome = one phylogeny?

one single genome/species tree??

A

different genome regions have different evolutionary histories

different gene/locus/window trees can differ from each
other and from the species/organismal tree

27
Q

Phylogenetic incongruence:

challenge or opportunity?

A

challenge!
- assumptions about species evolution

opportunity!
- a tool to learn about the evolution of
lineages & their genomes

28
Q

Genome-level evolutionary analyses in R?

A

Many libraries developed eg for:

  • identify species tree nodes affected by gene flow
  • identify admixed genomic regions
  • identify direction of admixture
  • determine relative age of gene flow
  • (graphically) summarize discordance
29
Q

What method did we learn which can be used to evaluate incongruent trees ?

A

D-Statistic / ABBA-BABA

30
Q

EXAM (2019, 2020)

List three biological reasons for which we may get incongruences in gene trees. Explain one of them, and how it is reflected in the tree.

A
  • incomplete lineage sorting / deep coalescence
  • hybridization or introgression
  • horizontal gene transfer (HGT)
  • differential duplication and loss
  • natural selection

still to do - how it is reflected in the tree?

31
Q

EXAM (2020)

Given a multi-FASTA file of homologous sequences, how do you extract orthologs and paralogs?

A

All against all comparisons
- based on score & length criteria
–> homologs (candidate pairs)

Formation of stable pairs
- analysis within and between genomes
- pairwise & multiple sequence comparisons
- ML evolutionary distances
- protein similarity graph, clustering
–> putative orthologs (stable pairs)

Verification of stable pairs
- compare with third genome: check for hidden paralogs, differential loss
- use species tree information
- graph theoretic approaches
–> Orthologs (verified pairs)

32
Q

Is orthology transitive?

A

No

pairwise orthology definition –> non-transitive

33
Q

Explain tree-based vs graph-based approaches for inferring orthology

A

Graph based
- rely on graphs with genes as nodes and evolutionary relationships as edges.
- infer whether edges represent orthology or paralogy
- build clusters of genes on the basis of the graph.

Tree-based
- gene/species tree reconciliation
- annotating all splits of a given gene tree as duplication or speciation,
- given the phylogeny of the relevant species
- reconciled tree –> trivial to derive all pairs of orthologous and paralogous genes.
- gene pairs whose lineages coalesce at a speciation node = orthologs
- gene pairs whose lineages coalesce at a duplication node = paralogs
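
The tree-based logic can be sketched in Python. Note the swap: instead of a full gene/species tree reconciliation, this example uses the simpler species-overlap heuristic (a node counts as a duplication if its two subtrees share a species, otherwise as a speciation). The nested-tuple tree format, the `species_gene` leaf naming, and the function names are assumptions for this illustration.

```python
def species_of(leaf):
    """Extract the species from a leaf label of the form 'species_gene'."""
    return leaf.split("_")[0]


def classify_pairs(tree):
    """Return (leaves, orthologs, paralogs) for a binary gene tree.

    tree: a leaf label (str) or a 2-tuple of subtrees (hypothetical format).
    Pairs are stored as frozensets of two leaf labels.
    """
    if isinstance(tree, str):
        return [tree], set(), set()
    left, right = tree
    l_leaves, l_orth, l_para = classify_pairs(left)
    r_leaves, r_orth, r_para = classify_pairs(right)
    l_species = {species_of(leaf) for leaf in l_leaves}
    r_species = {species_of(leaf) for leaf in r_leaves}
    # Species overlap between the two subtrees marks a duplication node.
    is_duplication = bool(l_species & r_species)
    pairs = {frozenset((a, b)) for a in l_leaves for b in r_leaves}
    orthologs = l_orth | r_orth
    paralogs = l_para | r_para
    if is_duplication:
        paralogs |= pairs
    else:
        orthologs |= pairs
    return l_leaves + r_leaves, orthologs, paralogs


# Toy gene tree: speciation X/Y, then a duplication within species Y.
gene_tree = ("X_1", ("Y_1", "Y_2"))
_, orthologs, paralogs = classify_pairs(gene_tree)
print(sorted(sorted(p) for p in orthologs))
print(sorted(sorted(p) for p in paralogs))
```

In this toy tree X_1 is orthologous to both Y_1 and Y_2, while Y_1 and Y_2 are paralogs of each other, which also illustrates why pairwise orthology is non-transitive.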