Bioinformatics - Final Exam Content Flashcards
What are the differences between substitution models?
models differ in which parameters they include: the simplest just use the proportion of sites that differ (hamming distance), others correct for unobserved mutations, some treat transitions and transversions differently, some allow unequal base frequencies, and others add a proportion of invariable sites and gamma distributed rate variation among sites
What parameters are included in substitution models?
- transitions vs transversions
- hamming distance
- jukes and cantor distance (correcting for unobserved mutations)
- equal/unequal base frequencies
- proportion of invariable sites
- gamma distributed rate variation among sites
How do you find the best substitution model?
- the best thing to do is test ALL candidate models and find the one that best fits your sequence data; this is done under the maximum likelihood framework, choosing the model with the lowest BIC or AIC value (lower is better for both)
- once the model is chosen, run a bootstrap analysis to assess support for the branches of the tree
What are the steps to finding the best Tree?
- do a tree search under each model
- calculate the maximum likelihood score of the best tree for each model
- compare them using BIC or AIC scores, which are estimators of relative quality of statistical models
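The comparison step can be sketched in a few lines. This is a minimal illustration using the standard definitions AIC = 2k - 2lnL and BIC = k ln(n) - 2lnL; the model names, log-likelihoods, parameter counts, and alignment length below are made-up values for illustration only.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: lower is better."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: lower is better."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical results: model -> (max log-likelihood of best tree, free parameters)
models = {
    "JC": (-2500.0, 1),
    "HKY": (-2450.0, 5),
    "GTR+G+I": (-2440.0, 10),
}
n_sites = 1000  # hypothetical alignment length

scores = {
    name: (aic(lnl, k), bic(lnl, k, n_sites))
    for name, (lnl, k) in models.items()
}

best_aic = min(scores, key=lambda name: scores[name][0])
best_bic = min(scores, key=lambda name: scores[name][1])

for name, (a, b) in scores.items():
    print(f"{name:8s} AIC={a:.1f} BIC={b:.1f}")
print("best by AIC:", best_aic)
print("best by BIC:", best_bic)
```

With these made-up numbers the two criteria pick different models, because BIC penalizes extra parameters more heavily than AIC.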
How do phylogenetic approaches provide insight on evolution?
phylogeny - compare phylogenies to biogeography and major paleoecological events
evolutionary processes - pattern heterogeneity and selection ratios (dN/dS)
How do we use the Disparity Index (I) to estimate pattern heterogeneity?
- a common WRONG assumption is that sequences evolve in homogeneity (same conditions and processes)
- we know that sequences evolve differently based on locations and pressures
- we measure pattern heterogeneity via the disparity index
- the disparity index identifies pairs of sequences that evolved under substantially different evolutionary processes
What is the basis for dN/dS ratio tests?
it is a way to test whether selection is occurring: substitution rate outliers include sequences that affect an organism’s ability to survive and reproduce, so substitution patterns reflect selection, and dN/dS is the best tool we have for detecting it
How do you interpret I (disparity index) statistics?
I = 0 means the sequences evolved under the same processes and pressures
I > 0 means the sequences evolved under different processes and pressures
how do you interpret dN/dS statistics?
dN/dS = 1 : neutral, not undergoing selection
dN/dS > 1 : positive selection, nonsynonymous changes are beneficial and are being favored
dN/dS < 1 : purifying selection, nonsynonymous changes are harmful and are removed, so sites stay conserved
Transition
a change between A and G (purine to purine) or between C and T (pyrimidine to pyrimidine)
- in other words these are substitutions which are more likely to happen because the base stays in the same class (we are not changing from purine to pyrimidine or vice versa)
Transversion
a change between A and C, A and T, G and C, or G and T
- these are substitutions which happen less frequently and are more disruptive because the base changes from purine to pyrimidine or vice versa
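The two definitions above reduce to a small lookup: a substitution is a transition if both bases are purines or both are pyrimidines, and a transversion otherwise. A minimal sketch (the function name is just for illustration):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(base1, base2):
    """Return 'transition', 'transversion', or 'none' for a pair of DNA bases."""
    b1, b2 = base1.upper(), base2.upper()
    for b in (b1, b2):
        if b not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {b!r}")
    if b1 == b2:
        return "none"
    # Same chemical class on both sides means a transition.
    if {b1, b2} <= PURINES or {b1, b2} <= PYRIMIDINES:
        return "transition"
    return "transversion"

print(classify_substitution("A", "G"))
print(classify_substitution("C", "T"))
print(classify_substitution("A", "T"))
print(classify_substitution("G", "C"))
```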
Hamming Distance (Dh)
- the simplest approach to modeling substitutions: count the number of sites that differ and divide by the alignment length
- Dh = n / N
- n is the number sites which are different
- N is the length of the alignment
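As a quick worked example of Dh = n / N, here is a minimal sketch that assumes the two sequences are already aligned and of equal length (the example sequences are made up):

```python
def hamming_distance(seq1, seq2):
    """Proportion of aligned sites that differ: Dh = n / N."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    n = sum(a != b for a, b in zip(seq1, seq2))  # number of differing sites
    return n / len(seq1)                         # divide by alignment length N

# Two of eight sites differ, so Dh = 2 / 8.
print(hamming_distance("ACGTACGT", "ACGTTCGA"))  # 0.25
```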
Jukes and Cantor (1969)
- a model for distance of substitutions which corrects for unobserved mutations
- Djc = -(3/4) ln(1 - (4/3)p)
- p = the proportion of sites which differ between sequences
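The correction is easy to see numerically. A minimal sketch of the formula above; note that it is only defined for p below 0.75, where the logarithm's argument stays positive:

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4 / 3) * p)

# The corrected distance is larger than the raw proportion of differences,
# because some substitutions are hidden by later changes at the same site.
for p in (0.05, 0.25, 0.50):
    print(p, round(jukes_cantor_distance(p), 4))
```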
Distance (phylogenetic tree sense)
essentially a measure of how different the sequences in the alignment are, based on the substitutions that have occurred between them
Proportion of invariable sites (I)
- a parameter that often improves model fit substantially
- (I) is the proportion of static, unchanging sites in the dataset
gamma distribution (G)
- a parameter that often improves model fit substantially
- indicates a gamma distributed rate variation among sites
BIC value
Bayesian information criterion (lowest scored model is best)
AIC value
Akaike information criterion (lowest scored model is best)
pattern heterogeneity
if two sequences evolved under the same processes their nucleotide composition will be similar, however if they evolved under separate pressures their nucleotide composition will reflect that
dN/dS ratio
- a highly important and common approach for testing if selection has occurred
- nonsynonymous subs per site / synonymous subs per site
- = 1 : neutral, not undergoing selection
- > 1 : positive selection, nonsynonymous changes are beneficial and are being favored
- < 1 : purifying selection, nonsynonymous changes are harmful and are removed, so sites stay conserved
disparity index (I)
the observed difference in evolutionary patterns for a pair of sequences based on nucleotide composition
- I = (1/2) * sum over i of (xi - yi)^2 - Nd
- xi = composition of the ith nucleotide in the first sequence
- yi = composition of the ith nucleotide in the second sequence
- Nd = composition distance expected under homogeneity
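A rough sketch of the calculation for one pair of aligned sequences. Two things here are assumptions rather than part of the card: Nd is estimated by the number of sites at which the two sequences differ, and negative values are clamped to 0 so the result matches the I = 0 / I > 0 reading. The example sequences are made up.

```python
from collections import Counter

def disparity_index(seq_x, seq_y, bases="ACGT"):
    """Composition distance minus its expectation under homogeneity."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    counts_x, counts_y = Counter(seq_x), Counter(seq_y)
    # (1/2) * sum over nucleotides of squared differences in counts
    composition_distance = 0.5 * sum(
        (counts_x[b] - counts_y[b]) ** 2 for b in bases
    )
    # Assumption: Nd is estimated by the number of observed differences.
    n_d = sum(a != b for a, b in zip(seq_x, seq_y))
    # Assumption: negative values are reported as 0.
    return max(0.0, composition_distance - n_d)

# Same composition despite four differing sites: no excess disparity.
print(disparity_index("ACGT", "CATG"))          # 0.0
# Strong shift from A to G in the second sequence.
print(disparity_index("AAAAAAAA", "AAAAGGGG"))  # 12.0
```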
values associated w disparity index:
I = 0 -> same evolutionary pressures
I > 0 -> different evolutionary pressures
neutral theory of molecular evolution (Kimura 1968)
- most mutations are neutral or “nearly neutral”
- it is a basic principle that differences in fecundity lead to natural selection and fixation of mutations
- substitution patterns reflect selection
synonymous
- sub where the amino acid will stay the same
- more likely to be neutral
nonsynonymous
- sub where the amino acid will change
- more likely to change phenotype
- positive selection may result from a beneficial change in phenotype
neutrality
- dN/dS ratio = 1, where dN and dS are equal, indicating no selection is happening
positive selection
- when the dN/dS ratio > 1
- a mutation is beneficial, so selection is acting to increase its frequency in the population
purifying selection
- when the dN/dS ratio < 1
- a mutation is detrimental, so selection removes it and keeps the site conserved in the population
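The three cases (neutrality, positive selection, purifying selection) can be sketched as a small calculation. The substitution counts, site counts, and the tolerance around 1 are made-up values for illustration; in practice a statistical test decides whether a ratio differs from 1.

```python
def dn_ds_ratio(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """dN/dS: nonsynonymous subs per site over synonymous subs per site."""
    dn = nonsyn_subs / nonsyn_sites
    ds = syn_subs / syn_sites
    if ds == 0:
        raise ValueError("dS is zero, so the ratio is undefined")
    return dn / ds

def interpret(ratio, tolerance=0.05):
    """Label a ratio; `tolerance` is a hypothetical cutoff around 1."""
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    return "positive selection" if ratio > 1 else "purifying selection"

# Hypothetical gene: 10 nonsynonymous changes over 300 nonsynonymous sites,
# 20 synonymous changes over 100 synonymous sites.
ratio = dn_ds_ratio(10, 300, 20, 100)
print(round(ratio, 3), interpret(ratio))
```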
What is the perspective that molecular genetics uses to examine variation?
molecular evolution/genetics focuses on fixed differences between species
What is the perspective that population genetics uses to examine variation?
population genetics focuses on the differences between populations of one species
- for example, how does a mountain range separating two populations of the same species affect how those populations have evolved
What parameters are estimated in population genetics?
gene pool, allele frequency, genotype frequency
- these population parameters will affect the gene pool in a predicted way
what are the basics of the hardy weinberg equilibrium (HWE)?
- extending Mendel’s laws of inheritance to populations yields HWE
- when gametes containing either of two alleles, A or a, unite at random to form the next generation, the genotype frequencies in offspring (zygotes) are AA : Aa : aa (p^2 : 2pq : q^2)
- genotype frequencies are set by the allele frequencies (p and q) and stay the same from one generation to the next
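A quick worked example of the p^2 : 2pq : q^2 relationship, with a made-up allele frequency of p = 0.7:

```python
def hwe_genotype_frequencies(p):
    """Expected genotype frequencies under HWE for allele frequency p of A."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

expected = hwe_genotype_frequencies(0.7)
for genotype, freq in expected.items():
    print(genotype, round(freq, 2))  # AA 0.49, Aa 0.42, aa 0.09
```

The three frequencies always sum to 1, since (p + q)^2 = 1.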
what are the assumptions of the HWE?
allele frequencies will remain constant over time if these assumptions are met:
- random mating
- infinite population size
- no migration
- no selection
- no mutation
violations to these assumptions have predicted effects on allele and genotype frequencies
how does violating assumptions of HWE affect parameters?
inbreeding - decreases heterozygosity, so genotype frequencies change but allele frequencies do not, can lead to heritable diseases
genetic drift (small pop) - allele frequencies drift randomly until the population converges on one allele type (fixation), but which allele becomes fixed is random
migration - may lead to admixture, combining two or more pops w different allele frequencies into one group
selection - may be recessive, dominant, or additive; the favored allele increases in frequency and may become fixed in the population
mutation - randomly introduces new alleles, changing genotypes
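The drift effect is easy to see in a simulation. The sketch below uses a simple Wright-Fisher style resampling scheme, which is an assumption chosen for illustration (the cards do not name a model); population sizes, generation count, and the seed are arbitrary.

```python
import random

def wright_fisher(p0, pop_size, generations, seed=None):
    """Track the frequency of allele A under pure drift (diploid population)."""
    rng = random.Random(seed)
    n_copies = 2 * pop_size  # two gene copies per diploid individual
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each gene copy in the next generation is A with probability p.
        count_a = sum(rng.random() < p for _ in range(n_copies))
        p = count_a / n_copies
        trajectory.append(p)
    return trajectory

small = wright_fisher(p0=0.5, pop_size=10, generations=200, seed=1)
large = wright_fisher(p0=0.5, pop_size=1000, generations=200, seed=1)
print("small population, final frequency of A:", small[-1])
print("large population, final frequency of A:", large[-1])
```

Small populations tend to hit a frequency of 0 or 1 (fixation) much sooner than large ones, and once an allele is fixed or lost the frequency can no longer change without mutation or migration.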
what can you estimate if you know allele and genotype frequencies?
we can go backwards and guess which assumptions were violated
- inbreeding rates
- population sizes
- effective population size (number of breeders)
- migration/dispersal
- population structure/gene flow
- recent changes in population sizes
- selection coefficients
- genotype-phenotype associations
how do we estimate population structure with fixation index (Fst) values?
- we look at how alleles are distributed among vs within populations
- Fst is an estimate of the genetic divergence between populations
- Fst = AP / (WI + AI + AP)
AP = estimated variance in allele frequencies Among Populations
WI = estimated variance in allele frequencies Within Individuals
AI = estimated variance in allele frequencies Among Individuals
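A minimal worked example of Fst = AP / (WI + AI + AP). The variance components below are made-up numbers; in a real analysis they would come from a variance-partitioning method applied to genotype data.

```python
def fst(ap, ai, wi):
    """Fraction of total allele frequency variance that lies among populations."""
    total = wi + ai + ap
    if total <= 0:
        raise ValueError("total variance must be positive")
    return ap / total

# Hypothetical variance components
value = fst(ap=0.05, ai=0.10, wi=0.85)
print(round(value, 3))  # 0.05: most variation is within, not among, populations
```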
what are microsatellites?
short repeats found within a species and certain populations may have varying numbers of these repeats
- short segment of DNA, usually one to six or more base pairs in length, that is repeated multiple times in succession at a particular genomic location
how are microsatellites genotyped?
- obtain primers flanking the microsatellite
- PCR
- fragment analysis
- see how big the pieces are to determine how many repeats they have
- genotype
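The last two steps (fragment size to repeat number to genotype) can be sketched as simple arithmetic. The flanking length, motif length, and fragment sizes below are hypothetical; real fragment analysis also has to deal with sizing error and stutter peaks, which this sketch ignores.

```python
def repeat_count(fragment_size_bp, flank_bp, motif_bp):
    """Estimate the number of repeats from a PCR fragment size.

    flank_bp is the combined length of primers and non-repeat flanking
    sequence inside the amplicon (a hypothetical, locus-specific value).
    """
    repeat_region = fragment_size_bp - flank_bp
    if motif_bp <= 0 or repeat_region < 0:
        raise ValueError("invalid fragment, flank, or motif length")
    return round(repeat_region / motif_bp)

def genotype(fragment_sizes_bp, flank_bp, motif_bp):
    """Genotype as a sorted tuple of repeat counts, one per allele."""
    return tuple(
        sorted(repeat_count(size, flank_bp, motif_bp) for size in fragment_sizes_bp)
    )

# Hypothetical dinucleotide locus with 110 bp of flanking sequence:
# peaks at 150 bp and 158 bp correspond to 20 and 24 repeats.
print(genotype((158, 150), flank_bp=110, motif_bp=2))  # (20, 24)
```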
populations
group of individuals of one species living in the same geographical area
subpopulations
local populations within which most individuals find their mates
gene pool
all genetic variation within a population
allele
variant at a locus, comes from a mutation
locus
independent location on a chromosome, can be a gene
allele frequency
proportion of any specific allele in a population
genotype frequency
proportion of individuals in a population with a specific genotype
(in a diploid, the genotype is the combination of the two alleles an individual carries, either heterozygous or homozygous)
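Both frequencies can be computed directly from genotype counts. A minimal sketch for one locus with two alleles in a diploid population; the counts are made up:

```python
def frequencies_from_counts(n_AA, n_Aa, n_aa):
    """Return (genotype frequencies, allele frequencies) from genotype counts."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no individuals sampled")
    genotype_freqs = {"AA": n_AA / n, "Aa": n_Aa / n, "aa": n_aa / n}
    # Each AA individual carries two copies of A, each Aa carries one.
    p = (2 * n_AA + n_Aa) / (2 * n)
    allele_freqs = {"A": p, "a": 1 - p}
    return genotype_freqs, allele_freqs

# Hypothetical sample of 100 individuals
genotype_freqs, allele_freqs = frequencies_from_counts(50, 40, 10)
print(genotype_freqs)
print({allele: round(f, 2) for allele, f in allele_freqs.items()})
```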
Hardy Weinberg equilibrium
when gametes containing either of two alleles, A or a, unite at random to form the next generation, the genotype frequencies in offspring (zygotes) are AA : Aa : aa (also p^2 : 2pq : q^2) and p + q = 1
inbreeding
violates the random mating assumption, decreases heterozygosity and usually fitness
genetic drift
migration
- movement of individuals between populations followed by breeding
selection
additive selection
recessive selection
dominant selection
fixation index (Fst)
microsatellites
a short segment of DNA, usually one to six or more base pairs in length, that is repeated multiple times in succession at a particular genomic location. These DNA sequences are typically non-coding
how are phenotypes associated with genotypes?
why are phenotypes associated with genotypes?
how do we model gene-phenotype interactions?
how does linkage disequilibrium lead to haplotype blocks?
how are GWAS studies performed?
how are GWAS studies interpreted?
what are some ways to decrease error in GWAS studies?
genome wide association studies (GWAS)
quantitative traits
genotype-phenotype association
genotype-phenotype models
multiplicative : genotype-phenotype model
additive : genotype-phenotype model
recessive : genotype-phenotype model
common dominant : genotype-phenotype model
polygenic : genotype-phenotype model
linkage map
cM
linkage disequilibrium
haplotype block
coefficient of linkage disequilibrium (D)
TAG SNP
Bonferroni correction
power
odds ratio
multi-stage approach
permutation
false positives
population stratification
admixture
why do we need NGS? what did we hope to learn?
most phenotypes and diseases are complex
Health things to learn
- genetic factors affecting health
- predict, prevent, detect disease
- personalized effective treatment
- monitor disease progression
Wildlife/domestic animals things to learn
- genes that affect traits
- better management and conservation
- improve important traits
what makes up our genome?
- 45% of the genome is repetitive elements
- about 30% of the genome is genes, but only about 2% of the genome is coding exons; there are also noncoding RNAs
- 70% of genome is intergenic (between genes), this includes repetitive elements (simple repeats, transposons, SINES and LINES), conserved noncoding regions, regulatory regions, and structural regions (centromeres and telomeres)
what types of variation are present in genomes?
- deletion
- duplication
- inversion
- translocation
why do we need next gen sequencing (NGS)?
elaborate on the development of NGS