Lecture 5 Flashcards
What is a GWAS?
‘Association testing at many (but not all) markers across the entire genome’. SNP association with disease.
This is likely only to detect INDIRECT ASSOCIATION (Tagging).
- Genotyping performed at roughly 1 million markers (SNPs).
- These genotyped SNPs capture most of the common variation in the genome, through correlation (Linkage disequilibrium) with all (~10m) common SNPs
What is the issue with using GWAS?
The human genome contains roughly 1 SNP per 100-300 bp
The human genome has approximately 3000m bp
So approx. 10m SNPs (assuming 1 SNP / 300bp)
It is often not possible to genotype and analyse such a large number of SNPs, due to several limiting factors
- Available genotyping platforms
- Cost
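The SNP count estimate in this card is simple arithmetic and can be checked with a short sketch. The constant and function name are illustrative, not from the lecture; the genome length and SNP densities are the ones quoted in the card.

```python
# Approximate human genome length quoted in the card (3000m bp).
GENOME_LENGTH_BP = 3_000_000_000


def estimated_snp_count(bp_per_snp: int) -> int:
    """Estimated number of SNPs if there is one SNP every `bp_per_snp` bases."""
    return GENOME_LENGTH_BP // bp_per_snp


# The two ends of the quoted density range (1 SNP per 100-300 bp).
sparse_estimate = estimated_snp_count(300)
dense_estimate = estimated_snp_count(100)
print(f"Between {sparse_estimate:,} and {dense_estimate:,} SNPs")
```

The lower figure (1 SNP / 300 bp) is the ~10m used in the card; either way it is far more than a 0.5M-1M SNP chip can type directly.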
How should the issue of GWAS use be dealt with?
Use of Linkage Disequilibrium (LD)
Instead of genotyping all the 10M SNPs we can genotype tagSNPs in a haplotype block.
A tagging SNP is a representative SNP in a given region of the genome in high LD to all other SNPs in the region.
Genotyping chips with 0.5M-1M SNPs are sufficient for a good GWAS
What should be kept in mind about tagging?
Association does not necessarily mean causation: an associated tag SNP may only be in LD with the causal variant
How are SNPs ‘close’ to each other correlated?
What does this mean?
Haplotypes (a group of alleles on the same chromosome that are inherited together from a single parent).
If a causal SNP at position 2 is correlated (say r² = 1) with a SNP at position 1, then you will also observe an association with the SNP at position 1
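As a minimal sketch of what perfect correlation between two SNPs means, the snippet below computes the LD measure r² from phased haplotypes. The 0/1 allele coding, the function name and the toy haplotypes are assumptions made for illustration.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp1, allele_at_snp2) pairs,
    one pair per phased chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n            # freq of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n            # freq of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                               # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Toy data: the two SNPs always travel together, so SNP 1 tags SNP 2.
perfect_ld = [(0, 0), (0, 0), (1, 1), (1, 1)]
# Toy data: every allele combination is equally common, so no tagging.
no_ld = [(0, 0), (0, 1), (1, 0), (1, 1)]

print(ld_r_squared(perfect_ld))
print(ld_r_squared(no_ld))
```

With r² close to 1, genotyping only the SNP at position 1 captures almost all the information about the SNP at position 2, which is why the association signal shows up at both.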
What are the main steps in designing and analyzing a large (GWAS) association study?
Sample collection
- Ethnicity (Try to avoid population stratification)
- Sample size (large SS = more statistical power)
Data generation
- DNA extraction
- Genotyping: Current technology allows typing of ~ 5m markers (approx. £500 per sample)
- Imputation (infer genotypes at un-typed loci from LD with typed markers)
Standard analyses for identifying associated loci.
- Association testing
- Logistic regression
Replication
Quality assurance (QA) and Quality control (QC) (carried out over multiple stages of GWAS)
- QA : Planning experiment to minimise problems with data
- QC : Analysing the data to detect problems
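The association-testing step can be illustrated with a basic allelic chi-square test on a 2x2 table of allele counts in cases and controls. This is a simplified stand-in, not the logistic regression named in the card, and the counts are made up for illustration. The p-value uses the fact that the upper tail of a 1-df chi-square distribution equals erfc(sqrt(chi2 / 2)).

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 df) on allele counts.

    Assumes every row and column total is non-zero.
    Returns (chi2 statistic, p-value).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: the alternative allele is more common in cases.
chi2, p_value = allelic_chi_square(60, 40, 40, 60)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

In a real GWAS this test (or a regression with covariates) is repeated at every marker, which is where the multiple-testing burden comes from.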
Why should a GWAS study be replicated?
- Our GWAS results are a random sample of allele frequencies in Cases and Controls
- Some results might be specific to the GWAS design, e.g. missed population structure or batch effects
- Also, we only have “Evidence against the null”
A replication will reduce worries about (1) and provide more independent evidence against the Null
A replication study should be performed by a different group using a different sample and genotyping method.
True or false
True
What problems were noted for a GWAS investigating Lupus?
Population Structure
Logistics
Politics
Too few controls
A genetic association study tests whether…
the presence of a specific genetic variant correlates with a trait of interest (e.g. presence/absence of disease)
What does the solid red line on a Manhattan plot represent?
P-value threshold that must be crossed for a SNP to be declared significant
What is Quality assurance good practice for?
Generating quality data
What does Quality control do?
Why?
Removes bad data so that the dataset conforms to quality metrics
- GWA studies have a massive multiple-testing issue
- Only ‘hits’ with very low p-values considered real
- Even small biases / errors in assumptions can greatly “pollute” the extreme tails of the test statistic distribution
- This can result in many more false positives than true positives in the extreme tail
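The scale of the multiple-testing issue can be sketched with a few lines of arithmetic. The snippet assumes roughly 1 million independent tests (the chip size mentioned earlier in these notes) and uses a Bonferroni correction as one common way to set a genome-wide threshold; both are assumptions for illustration, and the function names are hypothetical.

```python
def expected_null_hits(p_threshold: float, n_tests: int) -> float:
    """Expected number of 'hits' if no SNP is truly associated.

    Under the null, p-values are uniform, so a fraction `p_threshold`
    of tests pass by chance.
    """
    return p_threshold * n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at `alpha`."""
    return alpha / n_tests


N_TESTS = 1_000_000  # assumed number of independent tests

naive_hits = expected_null_hits(0.05, N_TESTS)
strict_threshold = bonferroni_threshold(0.05, N_TESTS)
strict_hits = expected_null_hits(strict_threshold, N_TESTS)

print(f"Expected chance hits at p < 0.05: {naive_hits:,.0f}")
print(f"Bonferroni threshold: {strict_threshold:.1e}")
print(f"Expected chance hits at that threshold: {strict_hits:.2f}")
```

Because only the extreme tail counts, a small systematic bias that shifts a few thousand test statistics can easily outnumber the handful of true signals there, which is the motivation for careful QC.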
What is the most significant aspect of GWAS?
Quality assurance
What is important for statistical analysis?
Quality Control
Describe the quality control pipeline
Post-genotyping checks
- Individual QC
- SNP QC
- Choosing QC thresholds
Post-association checks
- Re-examine SNP cluster plots
- Re-examine QC metrics
- Does LD make sense?
Describe the quality assurance pipeline
Pre-genotyping checks
- DNA preparation & quantification
- Test SNPs
- Equal treatment of cases & controls
Genotype calling
- Alternative software
- Re-run after removing bad individuals?
- Run on cases + controls together?
Why is individual missingness an issue in GWAS?
Indicates poor quality DNA.
Informative missingness, if DNA quality correlates with phenotype.
Why is Gender Check an issue in GWAS?
Indicates data recording problem
Why is Relatedness an issue in GWAS?
Independence
Violates association testing assumptions
Why are Population outliers an issue in GWAS?
False positives
Why is inbreeding an issue in GWAS?
Sample contamination
Population effect
How can quality assurance overcome Individual Missingness?
Equal number of cases and controls plated together
How can quality control overcome Individual Missingness?
Plot 1-missingness against % removed
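A minimal sketch of the numbers behind such a plot: compute each individual's call rate (1 - missingness), then the percentage of individuals that would be removed at a range of cut-offs. Coding missing calls as `None`, the sample names and the cut-offs are assumptions for illustration.

```python
def call_rate(genotypes):
    """1 - missingness: fraction of markers with a genotype call.

    Missing calls are coded as None (an assumption of this sketch).
    """
    return sum(g is not None for g in genotypes) / len(genotypes)


def percent_removed(call_rates, min_call_rate):
    """Percentage of individuals whose call rate falls below the cut-off."""
    removed = sum(rate < min_call_rate for rate in call_rates)
    return 100.0 * removed / len(call_rates)


# Toy genotypes (allele counts 0/1/2) for three hypothetical individuals.
samples = {
    "ind1": [0, 1, 2, None],
    "ind2": [0, 0, 1, 2],
    "ind3": [None, None, 1, 0],
}

rates = [call_rate(g) for g in samples.values()]
for cutoff in (0.5, 0.75, 0.9):
    print(f"cut-off {cutoff}: {percent_removed(rates, cutoff):.1f}% removed")
```

Plotting the cut-off against the percentage removed helps pick a threshold that drops clearly poor samples without discarding a large share of the study.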
How can quality assurance overcome Gender Check?
Robust data recording
How can quality control overcome Gender Check?
Chrom X/Y data
Inbreeding coeff (F)
Check against HWE expectation
- F=0 for females
- F=1 for males
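The check can be sketched as a comparison between the recorded sex and the sex implied by the chromosome X inbreeding coefficient. Real estimates of F are noisy, so the sketch uses cut-offs of 0.2 and 0.8 rather than exactly 0 and 1; these cut-offs, the function names and the sample records are assumptions for illustration.

```python
def infer_sex_from_f(f_x, female_max=0.2, male_min=0.8):
    """Infer sex from the chromosome X inbreeding coefficient F.

    Females (two X copies) are expected near F = 0, males (one X copy,
    so no heterozygous calls) near F = 1. The cut-offs are assumed values.
    """
    if f_x < female_max:
        return "female"
    if f_x > male_min:
        return "male"
    return "ambiguous"


def sex_mismatches(records):
    """Return IDs whose recorded sex disagrees with the genetic estimate.

    `records` is a list of (sample_id, recorded_sex, f_x) tuples.
    """
    return [
        sample_id
        for sample_id, recorded_sex, f_x in records
        if infer_sex_from_f(f_x) != recorded_sex
    ]


# Hypothetical sample sheet.
records = [
    ("ind1", "female", 0.02),
    ("ind2", "male", 0.99),
    ("ind3", "female", 0.97),   # likely recording error or sample swap
    ("ind4", "male", 0.45),     # ambiguous, needs a closer look
]
print(sex_mismatches(records))
```

Flagged individuals point to a data recording problem and are usually checked against the original records before being excluded.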
How can quality assurance overcome Relatedness?
Recruitment
How can quality control overcome Relatedness?
Prune SNPs for LD
Calculate IBS and IBD
- IBS: Proportion of alleles shared, averaged over the genome
- IBD: Proportion of the genome that is the same due to inheritance
- IBD=1: duplicate/twin
- IBD=0.5: 1st degree (sib; parent)
- IBD=0.25: 2nd degree (grand-parent)
- IBD=0.125: 3rd degree (cousin)
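A minimal sketch of the two quantities: IBS is computed directly from a pair of genotype vectors, while the IBD proportion is taken as given (estimating it needs allele frequencies and is left to dedicated software) and simply matched to the nearest expected value from the card. The 0/1/2 genotype coding, the function names and the nearest-value rule are assumptions for illustration.

```python
def ibs_proportion(genotypes_a, genotypes_b):
    """Proportion of alleles shared identical-by-state, averaged over SNPs.

    Genotypes are allele counts (0, 1 or 2); a pair of individuals shares
    2 - |a - b| alleles at each SNP, out of a possible 2.
    """
    shared = sum(2 - abs(a - b) for a, b in zip(genotypes_a, genotypes_b))
    return shared / (2 * len(genotypes_a))


# Expected IBD proportions listed in the card, plus 0 for unrelated pairs.
EXPECTED_IBD = [
    (1.0, "duplicate/twin"),
    (0.5, "1st degree"),
    (0.25, "2nd degree"),
    (0.125, "3rd degree"),
    (0.0, "unrelated"),
]


def classify_relationship(ibd):
    """Label a pair by the expected IBD value closest to the estimate."""
    return min(EXPECTED_IBD, key=lambda entry: abs(entry[0] - ibd))[1]


print(ibs_proportion([0, 1, 2, 2], [0, 1, 2, 0]))
for estimate in (0.98, 0.47, 0.26, 0.12, 0.01):
    print(estimate, classify_relationship(estimate))
```

Pairs labelled as duplicates or close relatives violate the independence assumption of the association test, so one member of each pair is typically removed.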
How can quality assurance overcome Population outliers?
Recruitment
How can quality control overcome Population outliers?
PCA
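As a rough sketch of how PCA exposes a population outlier, the snippet below computes the first principal component of a small genotype matrix using power iteration in plain Python, then picks the individual furthest from the median score. The toy genotypes, the function names, the use of power iteration and the "furthest from the median" rule are all assumptions for illustration; real studies use dedicated tools on LD-pruned genome-wide data.

```python
import math
import statistics


def pc1_scores(genotypes, n_iter=200):
    """First principal component score for each individual.

    `genotypes` is a list of individuals, each a list of allele counts (0/1/2).
    """
    n_ind = len(genotypes)
    n_snps = len(genotypes[0])

    # Centre each SNP (column) on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n_ind for j in range(n_snps)]
    centred = [[row[j] - means[j] for j in range(n_snps)] for row in genotypes]

    # Individual-by-individual similarity matrix.
    sim = [
        [sum(a * b for a, b in zip(centred[i], centred[k])) / n_snps
         for k in range(n_ind)]
        for i in range(n_ind)
    ]

    # Power iteration: repeatedly multiply and renormalise to converge
    # on the leading eigenvector of the similarity matrix.
    vec = [float(i + 1) for i in range(n_ind)]  # deterministic start
    for _ in range(n_iter):
        nxt = [sum(sim[i][k] * vec[k] for k in range(n_ind)) for i in range(n_ind)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no genetic variation at all
            break
        vec = [x / norm for x in nxt]
    return vec


def most_extreme_individual(scores):
    """Index of the individual furthest from the median PC1 score."""
    centre = statistics.median(scores)
    return max(range(len(scores)), key=lambda i: abs(scores[i] - centre))


# Toy data: five similar individuals and one from a different background.
genotypes = [
    [0, 0, 0, 0, 2, 2],
    [0, 0, 1, 0, 2, 2],
    [0, 1, 0, 0, 2, 2],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 2, 2],
    [2, 2, 2, 2, 0, 0],
]
scores = pc1_scores(genotypes)
print(most_extreme_individual(scores))
```

Individuals that sit far from the main cluster on the leading components are candidates for removal, since unrecognised population structure produces false positives.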