Bioinformatics - Final Exam Content Flashcards
What are the differences between substitution models?
substitution models differ in which parameters they include. The simplest just count the observed differences (hamming distance); others correct for unobserved mutations, treat transitions and transversions differently, or add a proportion of invariable sites and gamma distributed rate variation among sites
What parameters are included in substitution models?
- transitions vs transversions
- hamming distance
- jukes and cantor distance (correcting for unobserved mutations)
- equal/unequal base frequencies
- proportion of invariable sites
- gamma distributed rate variation among sites
How do you find the best substitution model?
- the best thing to do is test ALL models and find the one that best fits your sequence data; this is done under the maximum likelihood framework, choosing the model with the lowest BIC or AIC value (lower is better for both)
- after all of this is determined you also want to include bootstrap analysis
What are the steps to finding the best Tree?
- do a tree search under each model
- calculate the maximum likelihood score of the best tree for each model
- compare them using BIC or AIC scores, which are estimators of relative quality of statistical models
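A minimal sketch of the comparison step, assuming the usual definitions AIC = 2k - 2lnL and BIC = k ln(n) - 2lnL, where k is the number of free parameters and n is the number of sites. The model names, log-likelihoods, and parameter counts below are made-up illustration values, not results from any dataset (real programs also count branch lengths in k).

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_sites) - 2 * log_likelihood

# Hypothetical maximum likelihood scores for the best tree under each model.
models = {
    "JC": {"lnL": -2500.0, "k": 0},
    "K2P": {"lnL": -2450.0, "k": 1},
    "GTR+G+I": {"lnL": -2445.0, "k": 10},
}
n_sites = 1000  # assumed alignment length

scores = {
    name: {"AIC": aic(m["lnL"], m["k"]), "BIC": bic(m["lnL"], m["k"], n_sites)}
    for name, m in models.items()
}

# The preferred model is the one with the LOWEST score.
best_by_aic = min(scores, key=lambda name: scores[name]["AIC"])
best_by_bic = min(scores, key=lambda name: scores[name]["BIC"])
print(best_by_aic, best_by_bic)
```

Both criteria reward a higher likelihood and penalize extra parameters; BIC penalizes parameters more heavily on long alignments.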
How do phylogenetic approaches provide insight on evolution?
phylogeny - compare phylogenies to biogeography and major paleoecological events
evolutionary processes - pattern heterogeneity and selection ratios (dN/dS)
How do we use the Disparity Index (I) to estimate pattern heterogeneity?
- a common WRONG assumption is that sequences evolve in homogeneity (same conditions and processes)
- we know that sequences evolve differently based on locations and pressures
- we measure pattern heterogeneity via the disparity index
- the disparity index identifies pairs of sequences that evolved under substantially different evolutionary processes
What is the basis for dN/dS ratio tests?
it is a means to test whether selection is occurring; substitution rate outliers will include sequences which affect an organism’s ability to survive and reproduce. Substitution patterns reflect selection, and dN/dS is the best tool we have for detecting it
How do you interpret I (disparity index) statistics?
I = 0 means the sequences evolved under the same processes and pressures
I > 0 means the sequences evolved under different processes and pressures
how do you interpret dN/dS statistics?
dN/dS = 1 : neutral not undergoing selection
dN/dS > 1 : positive selection; amino acid changes are beneficial and are favored
dN/dS < 1 : purifying selection; amino acid changes are harmful and are removed, so sites tend to stay fixed
Transition
a change between A and G (purine to purine) or between C and T (pyrimidine to pyrimidine)
- in other words these are substitutions which are more likely to happen because we are not changing from purine to pyrimidine or vice versa
Transversion
a change between A and C, A and T, G and C, or G and T
- these are substitutions which happen less frequently and are more serious because they change a purine to a pyrimidine or vice versa
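A short sketch of how a single substitution could be classified. The function name is hypothetical; the only rule used is whether both bases belong to the same class (purines A and G, pyrimidines C and T).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(base_from, base_to):
    """Return 'transition', 'transversion', or 'none' for a pair of DNA bases."""
    a, b = base_from.upper(), base_to.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if a == b:
        return "none"
    # Same chemical class on both sides means a transition.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"

print(classify_substitution("A", "G"))
print(classify_substitution("A", "C"))
```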
Hamming Distance ( Dh)
- the simplest approach to modeling substitutions: count the number of sites that differ and divide by the alignment length
- Dh = n / N
- n is the number of sites which are different
- N is the length of the alignment
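The formula Dh = n / N can be sketched as below, assuming two already-aligned sequences of equal length. The function name and the example sequences are made up for illustration.

```python
def hamming_distance(seq1, seq2):
    """Proportion of aligned sites that differ (Dh = n / N)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    n = sum(1 for a, b in zip(seq1, seq2) if a != b)  # sites that differ
    N = len(seq1)                                     # alignment length
    return n / N

print(hamming_distance("ACGTACGT", "ACGTTCGA"))  # 2 differences over 8 sites
```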
Jukes and Cantor (1969)
- a model for distance of substitutions which corrects for unobserved mutations
- Djc1969 = -(3/4) ln(1 - (4/3)p)
- p = the proportion of sites which differ between sequences
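A small sketch of the Jukes and Cantor correction, taking p as the proportion of differing sites. The function name is hypothetical. The formula is only defined for p < 0.75, so the sketch raises an error outside that range.

```python
import math

def jukes_cantor_distance(p):
    """Corrected distance: -(3/4) * ln(1 - (4/3) * p)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4 / 3) * p)

# The corrected distance is larger than the raw proportion because it
# accounts for multiple hits at the same site that cannot be observed.
p = 0.25
print(p, jukes_cantor_distance(p))
```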
Distance (phylogenetic tree sense)
essentially it is based on how different sequences in the alignment are taking into account the differences or substitutions which have occurred
Proportion of invariable sites (I)
- a parameter to significantly improve models
- (I) is the extent of static, unchanging sites in the dataset
gamma distribution (G)
- a parameter to significantly improve models
- indicates a gamma distributed rate variation among sites
BIC value
Bayesian information criterion (lowest scored model is best)
AIC value
Akaike information criterion (lowest scored model is best)
pattern heterogeneity
if two sequences evolved under the same processes their nucleotide composition will be similar, however if they evolved under separate pressures their nucleotide composition will reflect that
dN/dS ratio
- a highly important and common approach for testing if selection has occurred
- nonsynonymous subs per site / synonymous subs per site
- = 1 : neutral not undergoing selection
- > 1 : positive selection; amino acid changes are beneficial and are favored
- < 1 : purifying selection; amino acid changes are harmful and are removed, so sites tend to stay fixed
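The interpretation rules can be written as a tiny function. This is only a sketch: the function name and the tolerance used to call a ratio "about 1" are assumptions for illustration, and a real analysis would use a statistical test rather than a fixed cutoff.

```python
def interpret_dn_ds(dn, ds, tolerance=0.05):
    """Return (ratio, label) for nonsynonymous and synonymous rates per site."""
    if ds <= 0:
        raise ValueError("dS must be greater than 0 to form the ratio")
    ratio = dn / ds
    if abs(ratio - 1) <= tolerance:   # hypothetical cutoff for 'about 1'
        label = "neutral"
    elif ratio > 1:
        label = "positive selection"
    else:
        label = "purifying selection"
    return ratio, label

print(interpret_dn_ds(0.1, 0.5))
print(interpret_dn_ds(1.0, 0.5))
```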
disparity index (I)
the observed difference in evolutionary patterns for a pair of sequences based on nucleotide composition
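- I = (1/2) * sum over i of (xi - yi)^2 - Nd
- xi = count of the ith nucleotide in sequence X
- yi = count of the ith nucleotide in sequence Y
- Nd = number of observed differences between the two sequences, which is the composition distance expected under homogeneity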
values associated w disparity index:
I = 0 -> same evolutionary pressures
I > 0 -> different evolutionary pressures
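A rough sketch of the calculation for one pair of aligned sequences. It assumes xi and yi are nucleotide counts, that Nd is the number of sites that differ between the two sequences, and that negative values are reported as 0; the function name and example sequences are made up.

```python
from collections import Counter

def disparity_index(seq_x, seq_y):
    """I = (1/2) * sum((xi - yi)^2) - Nd, with negative values set to 0."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    x, y = Counter(seq_x.upper()), Counter(seq_y.upper())
    # Composition distance: how different the base counts are.
    composition_distance = 0.5 * sum((x[b] - y[b]) ** 2 for b in "ACGT")
    # Nd: number of sites at which the two sequences differ (assumption).
    n_d = sum(1 for a, b in zip(seq_x.upper(), seq_y.upper()) if a != b)
    return max(0.0, composition_distance - n_d)

print(disparity_index("AAAAAAAA", "GGGGAAAA"))  # composition differs a lot
print(disparity_index("ACGT", "CATG"))          # same composition
```

In the second call the sequences differ at every site but have identical base composition, so the index is 0.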
neutral theory of molecular evolution (Kimura 1968)
- most mutations are neutral or “nearly neutral”
- it is a basic principle that differences in fecundity lead to natural selection and fixation of mutations
- substitution patterns reflect selection
synonymous
- sub where the amino acid will stay the same
- more likely to be neutral
nonsynonymous
- sub where the amino acid will change
- more likely to change phenotype
- positive selection may result from a beneficial change in phenotype
neutrality
- dN/dS ratio = 1 where the number of dN and dS are the same, indicates no selection happening
positive selection
- when the dN/dS ratio > 1
- a mutation is beneficial, so selection is occurring to favor that mutation
purifying selection
- when the dN/dS ratio < 1
- a mutation is detrimental so selection is preventing that bad mutation and working to fix a site in a population
What is the perspective that molecular genetics uses to examine variation?
molecular evolution/genetics focuses on fixed differences between species
What is the perspective that population genetics uses to examine variation?
population genetics focuses on the differences between populations of one species
- e.g., how does a mountain range separating two populations of the same species affect how those populations have evolved
What parameters are estimated in population genetics?
gene pool, allele frequency, genotype frequency
- these population parameters will affect the gene pool in a predicted way
what are the basics of the hardy weinberg equilibrium (HWE)?
- extending Mendel’s laws of inheritance to populations yields HWE
- when gametes containing either of two alleles, A or a, unite at random to form the next generation, the genotype frequencies in offspring (zygotes) are AA : Aa : aa (p^2 : 2pq : q^2)
- genotype frequencies are set by the allele frequencies and stay the same from one generation to the next
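A minimal sketch of the HWE expectation for one locus with two alleles, where p is the frequency of A and q = 1 - p is the frequency of a. The function name and the example value p = 0.6 are hypothetical.

```python
def hwe_genotype_frequencies(p):
    """Expected genotype frequencies p^2 : 2pq : q^2 for allele frequency p."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

expected = hwe_genotype_frequencies(0.6)
print(expected)

# The allele frequency recovered from the offspring genotypes equals the
# starting p, which is why the frequencies stay constant across generations.
p_next = expected["AA"] + expected["Aa"] / 2
print(p_next)
```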
what are the assumptions of the HWE?
allele frequencies will remain constant over time if these assumptions are met:
- random mating
- infinite population size
- no migration
- no selection
- no mutation
violations to these assumptions have predicted effects on allele and genotype frequencies
how does violating assumptions of HWE affect parameters?
inbreeding - decreases heterozygosity, so genotype frequencies change but allele frequencies do not; can lead to heritable diseases
genetic drift (small pop) - randomly drift towards one allele, so we converge on one allele type (fixation), but which allele becomes fixed is random
migration - may lead to admixture, combining two or more pops w different allele frequencies into one group
selection - may be recessive, dominant, or additive; the favored allele increases in frequency and can become fixed in the population
mutation - randomly change genotype
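The genetic drift entry above can be illustrated with a simple simulation. This is a sketch under assumed conditions (a Wright-Fisher style model with a constant number of diploid individuals and no selection, migration, or mutation); the function name and parameter values are hypothetical.

```python
import random

def simulate_drift(p0, pop_size, generations, seed=0):
    """Track the frequency of allele A as it drifts in a small population."""
    rng = random.Random(seed)
    n_copies = 2 * pop_size          # diploid: two allele copies per individual
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each copy in the next generation is drawn at random from the
        # current gene pool, so the frequency wanders by chance alone.
        count_A = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count_A / n_copies
        trajectory.append(p)
    return trajectory

traj = simulate_drift(p0=0.5, pop_size=10, generations=50, seed=1)
print(traj[-1])
```

Once the frequency reaches 0 or 1 it stays there (fixation or loss), and running with different seeds shows that which allele wins is random.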
what can you estimate if you know allele and genotype frequencies?
we can go backwards and guess which assumptions were violated
- inbreeding rates
- population sizes
- effective population size (number of breeders)
- migration/dispersal
- population structure/gene flow
- recent changes in population sizes
- selection coefficients
- genotype-phenotype associations
how do we estimate population structure with fixation index (Fst) values?
- we look at how alleles are distributed among vs within populations
- Fst is an estimate of the genetic divergence between populations
- Fst = AP / (WI + AI + AP)
AP = estimated variance in allele frequencies Among Populations
WI = estimated variance in allele frequencies Within Individuals
AI = estimated variance in allele frequencies Among Individuals
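The Fst formula can be sketched as a one-line calculation, assuming the three variance components (AP, AI, WI) have already been estimated by some other analysis. The function name and the numbers in the example are made up.

```python
def fst(ap, ai, wi):
    """Fst = AP / (WI + AI + AP): share of total variance among populations."""
    total = ap + ai + wi
    if total <= 0:
        raise ValueError("total variance must be positive")
    return ap / total

# Hypothetical variance components.
print(fst(ap=0.2, ai=0.1, wi=0.7))  # about 0.2
print(fst(ap=0.0, ai=0.3, wi=0.7))  # 0: no divergence among populations
```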
what are microsatellites?
short repeats found within a species and certain populations may have varying numbers of these repeats
- short segment of DNA, usually one to six or more base pairs in length, that is repeated multiple times in succession at a particular genomic location
how are microsatellites genotyped?
- obtain primer for microsat
- PCR
- fragment analysis
- see how big the pieces are to determine how many repeats they have
- genotype
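A rough sketch of the last two steps (fragment size to repeat number to genotype). It assumes the combined length of the primers and flanking sequence is known, so the remainder of the fragment is the repeat region; all names and numbers are hypothetical.

```python
def estimate_repeat_count(fragment_size_bp, flank_size_bp, motif_length_bp):
    """Estimate how many times the motif repeats in one PCR fragment."""
    if motif_length_bp <= 0:
        raise ValueError("motif length must be positive")
    repeat_region = fragment_size_bp - flank_size_bp
    if repeat_region < 0:
        raise ValueError("fragment is shorter than the flanking sequence")
    return round(repeat_region / motif_length_bp)

def call_genotype(peak_sizes_bp, flank_size_bp, motif_length_bp):
    """Turn one or two fragment peaks into a diploid genotype of repeat counts."""
    alleles = sorted(
        estimate_repeat_count(size, flank_size_bp, motif_length_bp)
        for size in peak_sizes_bp
    )
    if len(alleles) == 1:      # a single peak is read as a homozygote
        alleles = alleles * 2
    return tuple(alleles)

# Example: tetranucleotide repeat with 100 bp of flanking sequence (assumed).
print(call_genotype([180, 188], flank_size_bp=100, motif_length_bp=4))
```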
populations
group of individuals of one species living in the same geographical area
subpopulations
local populations within which most individuals find their mates
gene pool
all genetic variation within a population
allele
variant at a locus, comes from a mutation
locus
independent location on a chromosome, can be a gene
allele frequency
proportion of any specific allele in a population
genotype frequency
proportion of individuals in a population with a specific genotype
(in diploid, the genotype is the combination of two alleles in individual hetero or homo)
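A small sketch showing how both frequencies could be computed from genotype counts at one locus with two alleles in a diploid species. The function name and the counts are made up for illustration.

```python
def frequencies_from_counts(n_AA, n_Aa, n_aa):
    """Return (genotype frequencies, allele frequencies) from genotype counts."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("need at least one individual")
    genotype_freq = {"AA": n_AA / n, "Aa": n_Aa / n, "aa": n_aa / n}
    # Each AA individual carries two copies of A, each Aa carries one.
    p = (2 * n_AA + n_Aa) / (2 * n)
    allele_freq = {"A": p, "a": 1 - p}
    return genotype_freq, allele_freq

genotype_freq, allele_freq = frequencies_from_counts(30, 40, 30)
print(genotype_freq)
print(allele_freq)
```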
Hardy Weinberg equilibrium
when gametes containing either of two alleles, A or a, unite at random to form the next generation, the genotype frequencies in offspring (zygotes) are AA : Aa : aa (also p^2 : 2pq : q^2) and p + q = 1
inbreeding
violates the random mating assumption, decreases heterozygosity and usually fitness
genetic drift
migration
- movement of individuals between populations followed by breeding
selection
additive selection
recessive selection
dominant selection
fixation index (Fst)
microsatellites
a short segment of DNA, usually one to six or more base pairs in length, that is repeated multiple times in succession at a particular genomic location. These DNA sequences are typically non-coding
how are phenotypes associated with genotypes?
why are phenotypes associated with genotypes?
how do we model gene-phenotype interactions?
how does linkage disequilibrium lead to haplotype blocks?
how are GWAS studies performed?
how are GWAS studies interpreted?
what are some ways to decrease error in GWAS studies?
genome wide association studies (GWAS)
quantitative traits
genotype-phenotype association
genotype-phenotype models
multiplicative : genotype-phenotype model
additive : genotype-phenotype model
recessive : genotype-phenotype model
common dominant : genotype-phenotype model
polygenic : genotype-phenotype model
linkage map
cM
linkage disequilibrium
haplotype block
coefficient of linkage disequilibrium (D)
TAG SNP
Bonferroni correction
power
odds ratio
multi-stage approach
permutation
false positives
population stratification
admixture
why do we need NGS? what did we hope to learn?
most phenotypes and diseases are complex
Health things to learn
- genetic factors affecting health
- predict, prevent, detect disease
- personalized effective treatment
- monitor disease progression
Wildlife/domestic animals things to learn
- genes that affect traits
- better management and conservation
- improve important traits
what makes up our genome?
- 45% of the genome is repetitive elements
- 30% of genome from genes, of that only about 2% is coding exons, there are also noncoding RNAs
- 70% of genome is intergenic (between genes), this includes repetitive elements (simple repeats, transposons, SINES and LINES), conserved noncoding regions, regulatory regions, and structural regions (centromeres and telomeres)
what types of variation are present in genomes?
- deletion
- duplication
- inversion
- translocation
why do we need next gen sequencing (NGS)?
elaborate on the development of NGS
how has NGS impacted genomics
what is illumina sequencing technology?
what are the methods of illumina sequencing technology?
how is sequence data presented and formatted in FASTQ files?
how is a De Novo sequencing assembly constructed using NGS?
how do you evaluate how good an assembly is?
how do you deal with repeats when assembling contigs and scaffolds?
how/why do we re-sequence genomes to characterize variation?
repetitive elements
Alu transposable element
L1 transposon
Hemophilia A
indel
SNP
structural variation
insertion : structural variation
deletion : structural variation
translocation : structural variation
inversion : structural variation
alternative splicing
MAPT gene
next generation sequencing (NGS)
illumina
adapter
barcode
flow cell
cluster
bridge amplification
cycle
paired reads
FASTQ
phred score
vector
De Novo assembly
C (coverage)
string graph
consensus
N50
contigs
scaffolds
collapsed contig
repeat region
mate pair reads
assembly programs
velvet
re-sequencing
split mapping
what are the strategies behind genome re-sequencing?
what is the design of low coverage re-sequencing?
what are different types of reduced-representation sequencing?
what are ampliconic libraries?
what are the different types of targeted enrichment libraries?
elaborate on the methods for RadSeq libraries
how does one interpret the results of RadSeq libraries?
how can genomics be used to understand adaptation?
genome re-sequencing
low-coverage sequencing
reduced-representation sequencing
restriction enzyme digestion
Plasmodium falciparum
amylase
targeted enrichment
uniplex
multiplex
RainStorm
hybridization
oligo probes
biotin
streptavidin
miller syndrome
RadSeq
Sbf
ApeKI
GBS
RadTag
sliding window analysis
selective sweep
Bobcat
GPR158
LECT2
LECT
TRPM
what is meant when referring to the dynamic nature of gene expression?
what are the pitfalls of gene expression analysis?
what are some experimental approaches needed to understand gene expression?
how are microarrays designed?
how are microarrays analyzed?
what are the 7 main steps to differential gene expression?
how is RNAseq data analyzed?
elaborate on microarray and RNAseq analysis?
what are the main approaches to data analysis of gene expression?
what is involved in the pre-processing to clean up data?
elaborate a bit on inferential (t-tests and ANOVA) and descriptive statistics (scatter plots, volcano plots)
what are inferential statistics?
what are descriptive statistics?
how do we interpret results for biological significance?
how do we analyze clustering and heatmaps?
how does gene ontology allow for understanding function?
functional analysis
gene expression differences
microarrays
RNAseq
inferential statistics
exploratory statistics
oligos
probes
cDNA
hybridization
fluorescent tags
Rett syndrome
αB-crystallin
clustering
classification
northern blots
western blots
RT-PCR
in situ hybridization
technical replicates
biological replicates
RNAseq pipeline
gene expression omnibus (GEO) databases
metadata
MIAME
annotated reference
FPKM
fragment count
isoforms
preprocessing
systematic bias
normalization
scatter plot
volcano plot
heat map
validation
gene ontology
cellular component
biological process
molecular function
enrichment analysis
pathways