14 | Large scale analyses Flashcards
Define population genetics.
- subfield of genetics
- part of evolutionary biology
- deals with genetic differences within and among populations
- examines phenomena such as adaptation, speciation, population structure
from 1920s on
(Fisher, Haldane, Wright)
* formal models of evolution, statistical methods
* allele and gene frequencies in populations over time
* data: few genotypes of a limited number of individuals
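A minimal sketch of the "allele frequencies in populations over time" idea, assuming diploid genotypes written as two-character strings. The function name `allele_frequencies` and both sample lists are hypothetical, made up purely for illustration.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return the frequency of each allele in a list of diploid genotypes.

    Each genotype is a 2-character string such as "AA", "Aa" or "aa".
    """
    # Count every allele copy: each individual contributes two.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical samples of the same population at two time points.
generation_0 = ["AA", "Aa", "Aa", "aa", "AA"]
generation_10 = ["AA", "AA", "Aa", "AA", "AA"]

freq_0 = allele_frequencies(generation_0)
freq_10 = allele_frequencies(generation_10)
print(f"frequency of A: {freq_0['A']:.1f} -> {freq_10['A']:.1f}")
```

Comparing the two dictionaries shows the change in allele frequency between samples. Classical population genetics models ask whether such a change is explained by drift, selection, migration or mutation.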
Define phylogenomics
Ultimate goal: reconstruct the evolutionary history of species through their genomes
Intersection of the fields of evolution and genomics
Analysis that involves genome data and evolutionary reconstructions.
What is molecular genetics / genomics about?
What type of data?
from 1950s on
(Watson & Crick)?
* makeup, expression, regulation of genes;
genotype-phenotype
* data: gene & genome sequences, phenotypes
Define Population Genomics.
Data used?
Concepts and tools?
Questions?
- large-scale application of genomic technologies to study populations of individuals
- data: multiple genomes from the same (or closely related) species; thousands / millions of SNPs per individual
- studies genome-wide effects to improve our understanding of microevolution –> learn phylogenetic history and demography of a population
concepts & tools
- linkage disequilibrium, genetic drift, coalescent,
multivariate statistics, …
questions: population structure & history; detect evolutionary processes along the genome
- gene & genome evolution, recombination events, split times, gene flow, population sizes, demographic events, selection, diversification, relatedness, pop. structure, etc
- contemporary & ancestral populations/species
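Linkage disequilibrium, listed among the concepts above, can be illustrated with a small calculation. This is a sketch under simplifying assumptions: phased two-locus haplotypes written as two-character strings, and the standard definitions D = p_AB - p_A * p_B and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). The function name and example data are hypothetical.

```python
def linkage_disequilibrium(haplotypes, allele_a="A", allele_b="B"):
    """Return (D, r2) for two biallelic loci from phased haplotypes.

    Each haplotype is a 2-character string: locus 1 allele, locus 2 allele.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h[0] == allele_a and h[1] == allele_b for h in haplotypes) / n

    # D measures the departure from independent association of the alleles.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r2 is undefined when either locus is monomorphic.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Hypothetical data: complete association vs. no association.
linked = ["AB", "AB", "ab", "ab"]
unlinked = ["AB", "Ab", "aB", "ab"]

print(linkage_disequilibrium(linked))
print(linkage_disequilibrium(unlinked))
```

In real population genomic data, haplotypes are often not directly observed and must first be phased or estimated, which this sketch does not attempt.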
Genome-scale evolutionary analysis
Based on which regions? (two options)
In both cases: several gene sequences (can’t be one like in the previous small scale analyses we studied)
based on coding regions:
several (or all) gene sequences from different species –> families of homologs / orthologs
- often one individual per species
- homology/orthology assignment required
based on any/all genomic regions:
independent of gene content –> homologous genomic regions –> alignments and phylogenies
- generally done based on re-sequencing data
- separate homology assignment step not necessary
- same or closely related species
Which step in the pipeline sets large scale analyses apart from the small scale analyses we studied previously?
Just this step is done differently and makes large scale more difficult:
Assigning sequences (genomic data from multiple species) to families of homologous/orthologous genes
What are the three general approaches to Genome-scale inference of homology / orthology?
And what are the possible types of data? (plus examples)
Approaches:
- Tree based
- Graph based
- Hybrid
Data:
- Databases: use pre-computed sets of homologs / orthologs (e.g. TreeFam, OMA)
- Customized data set: compute project-specific homologs / orthologs
What is an example of a database which can be used for tree-based genome scale inference of homology/orthology?
TreeFam
What are the two types of graph-based genome scale inference of homology/orthology?
Name an example database for each type.
graph-based
- reciprocal best hit (RBH): COG
- clustering (MCL): OrthoMCL
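The reciprocal best hit idea can be sketched as follows. This is a toy illustration, not the procedure of COG or any other database: it assumes similarity scores (for example from an all-against-all sequence search) are already available as dictionaries keyed by `(query, target)`. All gene names and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return gene pairs (a, b) that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between genes of genome A and genome B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150,
          ("a2", "b2"): 180, ("a3", "b2"): 90}
b_vs_a = {("b1", "a1"): 200, ("b2", "a1"): 150,
          ("b2", "a2"): 180, ("b3", "a3"): 50}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here `a3` finds `b2` as its best hit, but `b2` prefers `a2`, so the pair is not reciprocal and is dropped. This strictness is why RBH tends to miss orthologs when gene duplications are present.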
What is usually the best approach for genome-scale inference of homology / orthology?
Name a database that can be used for this.
A hybrid approach combining graph-based and tree-based methods.
For example OMA (Orthologous MAtrix)
What is OMA?
OMA (Orthologous MAtrix) algorithm & database
- best strategy for large-scale assignment of orthologs/homologs
- using a combination of graph- and tree-based methods
- multi-step pipeline
- pre-computed (and cross-linked from other dbs)
- 1:1 orthologs
- homologs (orthologs & paralogs)
- stand-alone software
- companion tools (analysis, visualization)
- HOGs (Hierarchical Orthologous Groups)
–> large-scale assignment of homology is really difficult: there are a number of strategies, results are often incorrect, and the process is time-consuming
–> usually better to rely on pre-computed (existing) assignments
What are seven challenges of assigning homology / orthology?
- pairwise orthology definition (non-transitive)
- differential gene loss (or incomplete sampling)
- multi-domain proteins / mosaics
- horizontal transfer (xenologs)
- high rates of sequence divergence
- poor genome assembly / annotation
- computational demand
What is resequencing?
Resequencing is typically performed when a reference genome sequence is available.
Sequencing reads are aligned back to the reference to determine where in the genome each read matches best.
Only works for the same or closely related species.
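The read-to-reference step can be illustrated with a deliberately naive sketch: slide the read along the reference and keep the placement with the fewest mismatches. Real read mappers use indexed, gapped alignment and are far more efficient. The function `map_read` and the sequences below are hypothetical.

```python
def map_read(reference, read):
    """Return (position, mismatches) of the best ungapped placement.

    Ties are resolved in favour of the leftmost position.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


# Hypothetical reference and two reads from a re-sequenced individual.
reference = "ACGTTAGCCGATAGGCTA"
exact_read = "TAGCCG"     # identical to the reference
variant_read = "GATTGG"   # differs from the reference at one base

print(map_read(reference, exact_read))
print(map_read(reference, variant_read))
```

A read with a single mismatch at its best position is a candidate variant (for example a SNP) relative to the reference. The more diverged the sample is from the reference, the more mismatches accumulate and the less reliable the placement becomes.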
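What is a disadvantage of re-sequencing projects compared to independently (de novo) assembled genomes?
Re-sequencing against a reference genome can lead to reference bias: reads carrying non-reference alleles align less well, so variation that differs from the reference tends to be under-represented.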
What can genome-scale phylogenetic analysis result in when carried out on one gene, entire genomes, or many genes (genomic windows)?
one gene –> one tree
entire genomes –> one tree
many genes
- if you concatenate the genes –> one tree
- separate analyses of each gene –> many trees
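The concatenation option can be sketched as building a single "supermatrix" from per-gene alignments, which would then be passed to one tree inference. This is a minimal illustration, assuming each gene alignment is a dictionary of equal-length aligned sequences keyed by species; species missing from a gene are padded with gaps. Species names and sequences are hypothetical.

```python
def concatenate_alignments(gene_alignments):
    """Join per-gene alignments into one supermatrix per species.

    gene_alignments maps gene name -> {species: aligned sequence}.
    Species absent from a gene receive gap characters for that gene.
    """
    species = sorted({sp for aln in gene_alignments.values() for sp in aln})
    supermatrix = {sp: "" for sp in species}
    for gene in sorted(gene_alignments):
        alignment = gene_alignments[gene]
        length = len(next(iter(alignment.values())))
        for sp in species:
            supermatrix[sp] += alignment.get(sp, "-" * length)
    return supermatrix


# Hypothetical alignments for two genes; mouse lacks geneB.
genes = {
    "geneA": {"human": "ATG", "chimp": "ATG", "mouse": "ATA"},
    "geneB": {"human": "CCGT", "chimp": "CCGT"},
}

supermatrix = concatenate_alignments(genes)
for sp, seq in supermatrix.items():
    print(sp, seq)
```

For the "many trees" option, each entry of `genes` would instead be analysed on its own, giving one tree per gene (or genomic window) that can then be compared.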