Lecture 5 Flashcards
What is a GWAS?
‘Association testing at many (but not all) markers across the entire genome’. SNP association with disease.
This is likely only to detect INDIRECT ASSOCIATION (Tagging).
- Genotyping performed at roughly 1 million markers (SNPs).
- These genotyped SNPs capture most of the common variation in the genome, through correlation (Linkage disequilibrium) with all (~10m) common SNPs
What is the issue with using GWAS?
The human genome contains roughly 1 SNP per 100-300 bp
The human genome has approximately 3000m bp
So approx. 10m SNPs (assuming 1 SNP / 300bp)
It is often not feasible to genotype and analyse such a large number of SNPs because of several limiting factors
- Available genotyping platforms
- Cost
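The arithmetic behind the ~10m figure can be checked with a short sketch. The genome length and SNP densities are the rounded values quoted above; the function name is only illustrative.

```python
GENOME_LENGTH_BP = 3_000_000_000  # approx. 3000m bp, the rounded figure used above


def expected_snp_count(bp_per_snp: int) -> int:
    """Number of SNPs if there is one SNP every `bp_per_snp` base pairs."""
    return GENOME_LENGTH_BP // bp_per_snp


low_density = expected_snp_count(300)   # 1 SNP / 300 bp
high_density = expected_snp_count(100)  # 1 SNP / 100 bp
print(f"{low_density:,} to {high_density:,} SNPs")
```

At 1 SNP / 300 bp this gives 10 million SNPs, which is the figure the card uses.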
How should the issue of GWAS use be dealt with?
Use of Linkage Disequilibrium (LD)
Instead of genotyping all the 10M SNPs we can genotype tagSNPs in a haplotype block.
A tagging SNP is a representative SNP in a given region of the genome in high LD to all other SNPs in the region.
Genotyping chips with 0.5M-1M SNPs are sufficient for a good GWAS
What should be kept in mind regarding tagging?
Association does not necessarily mean causation
How are SNPs ‘close’ to each other correlated?
What does this mean?
Haplotypes (a group of alleles at nearby loci that were inherited together from a single parent).
If a causal SNP at position 2 is correlated (say r² = 1) with one at position 1, then you will observe an association with the SNP at position 1
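A minimal sketch of how the correlation between two SNPs could be measured, assuming phased haplotypes are available and each allele is coded 0/1. The function name and the toy data are illustrative, not from the lecture.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp1, allele_at_snp2) pairs,
    one pair per chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n   # frequency of allele 1 at SNP 1
    p_b = sum(h[1] for h in haplotypes) / n   # frequency of allele 1 at SNP 2
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b                      # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


# Toy data: the two SNPs always carry the same allele, so they tag each other perfectly.
perfect = [(1, 1)] * 3 + [(0, 0)] * 7
# Toy data: one chromosome breaks the pattern, so tagging is imperfect.
partial = [(1, 1)] * 2 + [(1, 0)] * 1 + [(0, 0)] * 7

print(round(ld_r_squared(perfect), 3))
print(round(ld_r_squared(partial), 3))
```

When r² is 1, testing the SNP at position 1 is equivalent to testing the causal SNP at position 2; lower values mean the tag captures only part of the signal.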
What are the main steps in designing and analyzing a large (GWAS) association study?
Sample collection
- Ethnicity (Try to avoid population stratification)
- Sample size (large SS = more statistical power)
Data generation
- DNA extraction
- Genotyping: Current technology allows typing of ~ 5m markers (approx. £500 per sample)
- IMPUTATION (Guess un-typed loci)
Standard analyses for identifying associated loci.
- Association testing
- Logistic regression
Replication
Quality assurance (QA) and Quality control (QC) (carried out over multiple stages of GWAS)
- QA : Planning experiment to minimise problems with data
- QC : Analysing the data to detect problems
Why should a GWAS study be replicated?
- Our GWAS results are a random sample of allele frequencies in Cases and Controls
- Some results might be specific to the GWAS design - Missed pop structure
- Batch effects
- Also, we only have “Evidence against the null”
A replication will reduce worries about (1) and provide more independent evidence against the Null
A replication study should be performed by a different group using a different sample and genotyping method.
True or false
True
What problems were noted for a GWAS investigating Lupus?
Population Structure
Logistics
Politics
Too few controls
A genetic association study tests whether…
the presence of a specific genetic variant correlates with a trait of interest (e.g. presence/absence of disease)
What does the solid red line on a Manhattan plot represent?
The p-value threshold that must be crossed for a SNP to be declared significant
What is Quality assurance good practice for?
Generating quality data
What does Quality control do?
Why?
Removes bad data so that the dataset conforms to quality metrics
- GWA studies have a massive multiple-testing issue
- Only ‘hits’ with very low p-values considered real
- Even small biases / errors in assumptions can greatly “pollute” the extreme tails of the test statistic distribution
- This can result in many more false positives than true positives in the extreme tail
What is the most significant aspect of GWAS?
Quality assurance
What is important for statistical analysis?
Quality Control
Describe the quality control pipeline
Post-genotyping checks
- Individual QC
- SNP QC
- Choosing QC thresholds
Post-association checks
- Re-examine SNP cluster plots
- Re-examine QC metrics
- Does LD make sense?
Describe the quality assurance pipeline
Pre-genotyping checks
- DNA preparation & quantification
- Test SNPs
- Equal treatment of cases & controls
Genotype calling
- Alternative software
- Re-run after removing bad individuals?
- Run on cases + controls together?
Why is individual missingness an issue in GWAS?
Indicates poor quality DNA.
Informative missingness. If DNA quality correlates with phenotype.
Why is Gender Check an issue in GWAS?
Indicates data recording problem
Why is Relatedness an issue in GWAS?
Independence
Violates association testing assumptions
Why are Population outliers an issue in GWAS?
False positives
Why is inbreeding an issue in GWAS?
Sample contamination
Population effect
How can quality assurance overcome Individual Missingness?
Equal number of cases and controls plated together
How can quality control overcome Individual Missingness?
Plot 1-missingness against % removed
How can quality assurance overcome Gender Check?
Robust data recording
How can quality control overcome Gender Check?
Chrom X/Y data
Inbreeding coeff (F)
Check against HWE expectation
- F=0 for females
- F=1 for males
How can quality assurance overcome Relatedness?
Recruitment
How can quality control overcome Relatedness?
Prune SNPs for LD
Calculate IBS and IBD
- IBS: Proportion of alleles shared, averaged over the genome
- IBD: Proportion of the genome that is the same due to inheritance
- IBD=1: duplicate/twin
- IBD=0.5: 1st degree (sib; parent)
- IBD=0.25: 2nd degree (grand-parent)
- IBD=0.125: 3rd degree (cousin)
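As an illustration, an estimated IBD proportion can be mapped to the relationship classes listed above. The cut-offs below are an assumption made for this sketch (midpoints between the expected values); real pipelines choose their own thresholds.

```python
def classify_relatedness(ibd: float) -> str:
    """Map an estimated IBD proportion to the nearest expected relationship.

    Cut-offs are hypothetical midpoints between the expected values
    1, 0.5, 0.25, 0.125 and 0; they are not taken from the lecture.
    """
    if ibd >= 0.75:
        return "duplicate/twin"
    if ibd >= 0.375:
        return "1st degree"
    if ibd >= 0.1875:
        return "2nd degree"
    if ibd >= 0.09375:
        return "3rd degree"
    return "unrelated"


for value in (1.0, 0.5, 0.25, 0.125, 0.01):
    print(value, classify_relatedness(value))
```

In QC, one individual from each closely related pair would typically be removed so that the remaining samples are independent.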
How can quality assurance overcome Population outliers?
Recruitment
How can quality control overcome Population outliers?
PCA
How can quality assurance overcome Inbreeding?
DNA quality
Recruitment
How can quality control overcome Inbreeding?
Check heterozygosity
Wright’s inbreeding coefficient (F)
+ indicates excess homozygotes
- indicates excess heterozygotes
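One common way to write the coefficient is F = 1 - (observed heterozygotes / expected heterozygotes). The sketch below assumes that form, genotypes coded as 0/1/2 copies of the minor allele, and known minor allele frequencies; the names and toy data are illustrative.

```python
def inbreeding_f(genotypes, mafs):
    """Per-individual inbreeding coefficient F = 1 - observed_het / expected_het.

    `genotypes`: 0/1/2 minor-allele counts for one individual, one per SNP.
    `mafs`: minor allele frequency of each SNP, in the same order.
    """
    if len(genotypes) != len(mafs):
        raise ValueError("genotypes and mafs must have the same length")
    expected_het = sum(2 * p * (1 - p) for p in mafs)  # under HWE
    if expected_het == 0:
        raise ValueError("expected heterozygosity is zero")
    observed_het = sum(1 for g in genotypes if g == 1)
    return 1 - observed_het / expected_het


mafs = [0.5, 0.5, 0.5, 0.5]                 # toy frequencies
print(inbreeding_f([0, 2, 0, 2], mafs))     # no heterozygotes: positive F
print(inbreeding_f([1, 1, 0, 2], mafs))     # matches expectation: F of zero
print(inbreeding_f([1, 1, 1, 1], mafs))     # all heterozygous: negative F
```

The sign convention matches the card: positive F means more homozygotes than expected, negative F means more heterozygotes (which can point to sample contamination).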
Why is SNP Missingness an issue in GWAS?
Informative missingness. If DNA quality correlates with phenotype
Why is Minor Allele Frequency (MAF) an issue in GWAS?
Quality positively correlated with MAF
Genotype calling is more difficult for rare SNPs
Why is Hardy-Weinberg Equilibrium (HWE) an issue in GWAS?
Departure can indicate genotype calling problems.
Can lead to false positives if the problem is not balanced between cases and controls
Why are Mendelian errors an issue in GWAS?
Genotype check
How can quality assurance overcome the issue of SNP missingness?
Equal number of cases and controls plated together
How can quality control overcome the issue of SNP missingness?
Association QQ-plots at different missingness thresholds
Plot 1-missingness against % removed
How can quality control overcome issue of minor allele frequency?
MAF > 10/N
At MAF = 10/N, we would expect about 20 heterozygotes
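The "20 heterozygotes" figure follows from Hardy-Weinberg proportions. A short sketch, assuming N = 2000 individuals purely as an example:

```python
def maf_threshold(n_individuals: int) -> float:
    """The MAF cut-off 10/N from the card."""
    return 10 / n_individuals


def expected_genotype_counts(maf: float, n_individuals: int):
    """Expected (major homozygote, heterozygote, minor homozygote) counts under HWE."""
    q = maf
    p = 1 - q
    return (n_individuals * p * p,
            n_individuals * 2 * p * q,
            n_individuals * q * q)


n = 2000                                   # example sample size, not from the lecture
maf = maf_threshold(n)                     # 0.005
hom_major, het, hom_minor = expected_genotype_counts(maf, n)
print(maf, round(het, 2), round(hom_minor, 2))
```

With N individuals there are 2N alleles, so a MAF of 10/N means about 20 copies of the minor allele, almost all of them carried by heterozygotes.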
How can quality assurance overcome issue of Hardy-Weinberg Equilibrium (HWE)?
Recruitment
How can quality control overcome issue of Hardy-Weinberg Equilibrium (HWE)?
Calculate p-value (null = HWE). FROM CONTROLS
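A minimal sketch of an HWE test on genotype counts from controls. It uses a 1-degree-of-freedom chi-square test as a simple stand-in; this is an assumption for illustration, and exact tests are often preferred in practice, especially for rare SNPs.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium.

    Returns (chi2, p_value). The null hypothesis is that the SNP is in HWE.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        raise ValueError("monomorphic SNP: HWE test is undefined")
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


print(hwe_chi_square(25, 50, 25))   # counts exactly at HWE proportions
print(hwe_chi_square(40, 20, 40))   # strong heterozygote deficit
```

A very small p-value in controls suggests a genotype calling problem at that SNP rather than a disease effect.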
How can quality control overcome issue of mendelian error?
Use one or two sets of trios per plate
Check child’s genotype compatibility with parents for every SNP
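The compatibility check for a trio can be sketched as follows, assuming an autosomal SNP with genotypes coded as 0/1/2 copies of one allele. The function names are illustrative.

```python
# Alleles (0 or 1) that a parent with a given genotype can transmit.
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian_consistent(father: int, mother: int, child: int) -> bool:
    """True if the child's genotype can arise from one allele of each parent."""
    possible = {a + b
                for a in TRANSMITTABLE[father]
                for b in TRANSMITTABLE[mother]}
    return child in possible


def count_mendelian_errors(trios) -> int:
    """Count inconsistent SNPs in a list of (father, mother, child) genotypes."""
    return sum(not is_mendelian_consistent(f, m, c) for f, m, c in trios)


# One trio typed at four SNPs (toy data).
trio_snps = [(0, 0, 0), (0, 2, 1), (1, 1, 2), (0, 0, 1)]
print(count_mendelian_errors(trio_snps))
```

Here only the last SNP is an error: two parents with genotype 0 cannot have a heterozygous child, so that genotype call is suspect.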
SNPS with very low MAF will have one larger cluster and two very small clusters which may lead to uncertainty
True or false
True
For high-quality SNPs, the genotype clusters are clearly separated
True or false
True
Where are cases preferentially drawn from?
A sub-population which also has a higher frequency of one allele at the locus in question
What does difference in genotypes between cases and controls reflect?
Their population origin, not their disease status (“confounding”)
Confounding from population structure can only arise if…
different proportions of cases and controls are drawn from each population
and
populations differ in allele frequency at the locus in question
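Both conditions can be illustrated with a small worked example. All numbers are invented for this sketch, and the allele is assumed to have no effect on disease within either population.

```python
def pooled_allele_freq(prop_from_b: float, freq_a: float, freq_b: float) -> float:
    """Allele frequency in a sample drawn partly from population A, partly from B."""
    return (1 - prop_from_b) * freq_a + prop_from_b * freq_b


FREQ_A, FREQ_B = 0.1, 0.5   # hypothetical allele frequencies in populations A and B

# Both conditions hold: 80% of cases but only 20% of controls come from B.
cases = pooled_allele_freq(0.8, FREQ_A, FREQ_B)
controls = pooled_allele_freq(0.2, FREQ_A, FREQ_B)
print(round(cases, 2), round(controls, 2))      # frequencies differ: spurious signal

# Condition 1 fails: cases and controls are drawn in the same proportions.
print(round(pooled_allele_freq(0.5, FREQ_A, FREQ_B), 2),
      round(pooled_allele_freq(0.5, FREQ_A, FREQ_B), 2))

# Condition 2 fails: the populations have the same allele frequency.
print(round(pooled_allele_freq(0.8, 0.3, 0.3), 2),
      round(pooled_allele_freq(0.2, 0.3, 0.3), 2))
```

Only the first scenario produces a case/control frequency difference, even though the allele has no effect on disease in any of the three.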
Apart from confounding from population structure, what else can lead to false negative results?
When does this occur?
Cryptic population structure
This occurs when the genetic variant does have an effect on disease status, but cases are drawn preferentially from a sub-population which has a higher frequency of the ‘low-risk’ allele
What issues can a cryptic population lead to?
Two “hidden” populations
Mutation at higher frequency in one population
Cases/Controls sampled disproportionately
-> Higher frequency of mutation in cases compared to controls
-> False positive association “hit” for this mutation
What are three ways in which a cryptic population structure can be avoided?
- QA: Match cases and controls on ethnicity, geographic location, birthplace of grandparents, etc
- Correct for bias: Use an analysis method that accounts for population stratification. A key method is Principal components analysis (PCA): Use a set of Ancestry informative markers (AIMs) to determine a ‘weighted score’ for each subject
The weighted score is a linear combination of the SNP genotypes (coded as 0, 1 or 2 copies of the minor allele)
The AIMs are a set of SNPs that are known to differ in frequency between populations.
Due to population differences in frequencies the weighted score is correlated with the population.
Use weighted score as a ‘covariate’ in the logistic regression analysis
- Use a family-based association study design - Not covered here (e.g. TDT)
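The ‘weighted score’ described above is simply a weighted sum of genotype codes. The sketch below uses made-up weights for three AIMs; in a real PCA the weights would be the SNP loadings estimated from the data, usually applied to centred and scaled genotypes.

```python
def ancestry_score(genotypes, weights) -> float:
    """Linear combination of genotypes (0/1/2 minor-allele counts) and SNP weights."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(w * g for g, w in zip(genotypes, weights))


aim_weights = [0.5, -0.3, 0.8]   # hypothetical weights, for illustration only

subject_1 = [2, 0, 1]            # toy genotypes at the three AIMs
subject_2 = [0, 2, 0]
print(round(ancestry_score(subject_1, aim_weights), 2))
print(round(ancestry_score(subject_2, aim_weights), 2))
```

Subjects from different populations tend to get different scores, which is why the score can be used as a covariate in the logistic regression.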
Correlation between SNPs (LD) enables us to make inferences genome-wide without genotyping every SNP in the genome
We type approximately 1 in 10 SNPs, and this is dense enough to capture most of the ‘common’ variation.
True or false
True
Primary SNP-by SNP scan involves what?
- Null hypothesis: “allele frequencies in cases and controls are the same”
Test this hypothesis separately on each SNP
- Genetic model: Additive model (= “allele dosing” / “log-additive” / “multiplicative”) is a good choice. Assume the effect is generally additive, as opposed to dominant or recessive. This is a simple approach but fairly robust (if a SNP acts non-additively then we are still likely to detect it as associated).
- Generally robust to non-additivity
- Tagging tends to conserve additive component only
- Adjust for residual population stratification
Typically by adding principal component (PC) axes as covariates in a logistic regression applied to each SNP
- Apply very strict p-value threshold for significance
Need to account for large number of tests
Why is P value adjustment needed for multiple testing?
Under the null, all p-values are “independent” and uniformly distributed between 0 and 1
With 1 test: Probability of a p-value less than 0.01 is 0.01
With 2 tests: Probability of at least one p-value less than 0.01 = (1-0.99^2) ≈ 0.02
With 100 tests: Probability of at least one p-value less than 0.01 = (1-0.99^100) ≈ 0.63.
Thus multiple testing requires a low p-value threshold; otherwise, even if every SNP is ‘null’, there will be thousands of hits
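The figures above come from 1 - (1 - α)^n for n independent tests. A short sketch that reproduces them; the 1,000,000-SNP example is an illustrative chip size.

```python
def prob_at_least_one_hit(alpha: float, n_tests: int) -> float:
    """P(at least one p-value < alpha) when all n independent tests are null."""
    return 1 - (1 - alpha) ** n_tests


def expected_false_hits(alpha: float, n_tests: int) -> float:
    """Expected number of null tests with p-value < alpha."""
    return alpha * n_tests


for n in (1, 2, 100):
    print(n, round(prob_at_least_one_hit(0.01, n), 4))

# With an illustrative 1,000,000 null SNPs tested at a 0.01 threshold:
print(round(expected_false_hits(0.01, 1_000_000)))
```

At a 0.01 threshold a million null SNPs would be expected to give around 10,000 false hits, which is why the threshold must be far stricter.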
If all p-values are from the NULL, what is expected to appear on a QQ plot?
A straight line with some deviation due to random variability
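A sketch of how the points of a QQ plot could be computed, assuming the common convention of plotting -log10(p) against the expected value for each rank, rank / (n + 1). No plotting library is used; the function only returns the coordinates.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, largest first.

    Under the null, the k-th smallest of n p-values is expected to be
    about k / (n + 1), so the points should fall near the line y = x.
    """
    n = len(p_values)
    observed = sorted((-math.log10(p) for p in p_values), reverse=True)
    expected = [-math.log10(rank / (n + 1)) for rank in range(1, n + 1)]
    return list(zip(expected, observed))


# Idealised null data: p-values sitting exactly on their expected quantiles.
null_p = [i / 100 for i in range(1, 100)]
points = qq_points(null_p)
print(points[0], points[-1])
```

Real null data would scatter around the line rather than sit exactly on it; a systematic lift above the line across most of the range suggests bias such as population stratification.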
How can the P value be adjusted for multiple testing?
On what basis can someone decide which test to choose?
Bonferroni
- Threshold = α / number-of-tests
- The probability of falsely rejecting at least one null is ≤ α
- Conservative
False Discovery rate
- Controls the expected proportion of false positives among the declared hits to be ≤ α
- More powerful
- More coherent interpretation
The choice depends on how costly false discoveries are. If you do not wish to follow up any SNPs that are likely to be false, choose Bonferroni. If you wish to maximise discoveries and can tolerate the occasional error, the false discovery rate may be better.
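The two approaches can be compared on a toy set of p-values. The FDR procedure sketched here is Benjamini-Hochberg, which is one standard way to control the false discovery rate; the lecture does not say which procedure it has in mind, so treat that choice as an assumption.

```python
def bonferroni_hits(p_values, alpha=0.05):
    """Indices of tests with p <= alpha / number_of_tests."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= threshold]


def benjamini_hochberg_hits(p_values, alpha=0.05):
    """Indices declared significant by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank            # largest rank that passes its own threshold
    return sorted(order[:max_rank])


toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bonferroni_hits(toy_p))
print(benjamini_hochberg_hits(toy_p))
```

On these toy values Bonferroni keeps only the smallest p-value, while the FDR procedure also keeps the second, illustrating the "conservative" versus "more powerful" trade-off.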
Linkage disequilibrium also means what?
The genotypes of all common SNPs can be imputed from the GWAS genotypes
What does genotype imputation allow?
Estimation of genotypes at un-typed SNPs.
This is possible due to the correlation in the genome between SNPs, also known as Linkage Disequilibrium.
Because SNPs are correlated, and some SNPs are very highly correlated, once the genotypes at one or more SNPs are known we can guess the genotypes at other SNPs.
How does imputation work?
- Genotype data with missing data at untyped SNPs
- Testing association at typed SNPs may not lead to a clear signal
- Each sample is phased and the haplotypes are modelled as a mosaic of those in haplotype reference panel
- Reference set of haplotypes, for example, HapMap
- The reference haplotypes are used to impute alleles into the samples to create imputed genotypes
- Testing association at imputed SNPs may boost the signal
What are two advantages of Imputation?
This leads to better resolution of the association signal across a region of the genome
Aids meta-analysis, where association data from several studies are combined, by ensuring that the same SNPs are available in each study. E.g. a 6-study meta-analysis in which each study used a different genotyping chip: before imputation there was little overlap in the SNPs typed across all six studies, so the meta-analysis would have covered only a few SNPs. After all six studies had been imputed, it was possible to conduct the meta-analysis on a much larger set of SNPs.
When is an additional QC required?
When imputation is used in a meta-analysis of multiple GWAS
Imputation quality
At what stage is imputation conducted when it is used to facilitate a meta-analysis of multiple GWAS?
Early stages
What is an example of the use of GWAS leading to success?
In Lupus, for example, before 2008 only half a dozen loci were known to have strong evidence of association with disease. By 2015, after several GWAS and a large meta-analysis of Chinese and European GWAS, there were 61 loci. The number is now generally accepted to be more than 80, as additional studies have replicated these signals and found more.
Which studies are much lower resolution, generally investigating large segments of the genome and how these segregate in families that have individuals affected by disease?
Linkage studies
When is GWAS successful?
When studying common variation, which tends to have smaller effect sizes.
When are linkage studies successful?
When there is a large effect size for a generally rare polymorphism in the genome
Currently there is little evidence of common variation implicating large effects, while rare variants with small effects are very difficult for any study to identify.
True or false
True
GWAS have identified many potentially causal loci.
Why didn’t candidate gene studies work?
Scientists were not successfully picking candidates
We were over-optimistic about the effect sizes expected within candidate genes
Underpowered studies, but lots of them
- A “perfect storm” for generating a high rate of false positives
Candidate gene studies focused on what?
One area that was “hypothesised” to be involved in the disease
GWAS is hypothesis free: tests everywhere in the genome for association
True or false
True
The ratio of true hits to false hits in the “publication pool” depends upon what?
The relative ratio of these two events in the “research pool”
With many low powered studies, combining the false associations with the small number of identified real associations, the majority of published associations are false!
GWAS has been very successful in identifying many loci as associated with disease.
However what are three issues?
1: Missing Heritability
2: Not very good for prediction: associated variants have too small an effect size to be useful for predicting disease.
3: Determining causality can be hard as most associated SNPs lie outside of gene coding areas.
What is the explanation of the first issue relating to GWAS?
Missing heritability is due to:
Imperfect tagging of common variants
- Perhaps of non-SNP variation like CNVs
Rare variants of moderate effect
Many common variants of tiny effect
Current heritability estimates are wrong?
Current methods for estimating the heritability explained by GWAS are biased (they underestimate the amount explained)
Why is the third issue of GWAS not so disappointing?
We now know, however, that most GWAS-associated SNPs affect the expression of genes rather than the protein-coding sequence itself. This result is therefore less of a disappointment and more enlightening. Regions of the genome that increase or decrease gene expression do lie outside protein-coding areas, and this seems to be where altered risk for many diseases (including Lupus) resides.
Why is replication of GWAS needed?
To provide reassurance
What question does a GWAS ask repeatedly?
“Does the allele frequency of SNP j differ between cases and controls?” [for j = 1, …, 1,000,000]
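For a single SNP j, that question can be sketched as a 2x2 chi-square test on allele counts. This is a simplified stand-in for the logistic regression used in practice (no covariates), and the counts are invented for illustration.

```python
import math


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """1-df chi-square test comparing allele counts in cases and controls.

    Returns (chi2, p_value); the null is equal allele frequency in both groups.
    """
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the 2x2 table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy SNP: 1000 cases (minor allele frequency 0.30) and 1000 controls (0.25).
chi2, p_value = allelic_chi_square(600, 1400, 500, 1500)
bonferroni_threshold = 0.05 / 1_000_000   # alpha / number of tests
print(round(chi2, 2), p_value, p_value <= bonferroni_threshold)
```

A GWAS repeats this kind of test for every SNP, so a result that looks convincing on its own can still fall short of the Bonferroni threshold for a million tests.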