Lecture 5 Flashcards

1
Q

What is a GWAS?

A

‘Association testing at many (but not all) markers across the entire genome’. SNP association with disease.

This is likely only to detect INDIRECT ASSOCIATION (Tagging).

  • Genotyping performed at roughly 1 million markers (SNPs).
  • These genotyped SNPs capture most of the common variation in the genome, through correlation (Linkage disequilibrium) with all (~10m) common SNPs
2
Q

What is the issue with using GWAS?

A

The human genome contains roughly 1 SNP per 100-300 bp

The human genome has approximately 3000m bp

So approx. 10m SNPs (assuming 1 SNP / 300bp)

It is often not possible to genotype and analyse such a large number of SNPs due to several limiting factors

  • Available genotyping platforms
  • Cost
3
Q

How should the issue of GWAS use be dealt with?

A

Use of Linkage Disequilibrium (LD)

Instead of genotyping all the 10M SNPs we can genotype tagSNPs in a haplotype block.

A tagging SNP is a representative SNP in a given region of the genome in high LD to all other SNPs in the region.

Genotyping chips with 0.5M-1M SNPs are sufficient for a good GWAS

4
Q

What should be kept in mind regarding tagging?

A

Association does not necessarily mean causation

5
Q

How are SNPs ‘close’ to each other correlated?

What does this mean?

A

Through haplotypes (a group of alleles on the same chromosome that are inherited together from a single parent).

If a causal SNP at position 2 is correlated (say r² = 1) with one at position 1 -> then you will observe an association with the SNP at position 1
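As a rough illustration of the correlation described in this card, the sketch below computes r² between two SNPs from phased haplotypes. The function name `ld_r_squared` and the 0/1 allele coding are assumptions made for this example; they do not come from the lecture.

```python
def ld_r_squared(haplotypes):
    """Return r^2 between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples,
    with alleles coded 0/1 (an illustrative coding, not a standard file format).
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n             # frequency of allele 1 at SNP 1
    p_b = sum(h[1] for h in haplotypes) / n             # frequency of allele 1 at SNP 2
    p_ab = sum(1 for h in haplotypes if h == (1, 1)) / n  # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    d = p_ab - p_a * p_b                                # LD coefficient D
    return d * d / denom


# Two SNPs that always travel together: the tag SNP fully captures the causal SNP.
perfect_tag = [(1, 1)] * 30 + [(0, 0)] * 70
# Two SNPs whose alleles are combined at random: the tag SNP carries no information.
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)] * 25

print(ld_r_squared(perfect_tag), ld_r_squared(no_ld))
```

When r² is close to 1, testing the SNP at position 1 is almost equivalent to testing the causal SNP at position 2, which is why an indirect association appears.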

6
Q

What are the main steps in designing and analyzing a large (GWAS) association study?

A

Sample collection

  • Ethnicity (Try to avoid population stratification)
  • Sample size (large SS = more statistical power)

Data generation

  • DNA extraction
  • Genotyping: Current technology allows typing of ~ 5m markers (approx. £500 per sample)
  • IMPUTATION (Guess un-typed loci)

Standard analyses for identifying associated loci.

  • Association testing
  • Logistic regression

Replication

Quality assurance (QA) and Quality control (QC) (carried out over multiple stages of GWAS)

  • QA : Planning experiment to minimise problems with data
  • QC : Analysing the data to detect problems
7
Q

Why should a GWAS study be replicated?

A
  1. Our GWAS results are a random sample of allele frequencies in Cases and Controls
    - Some results might be specific to the GWAS design
  2. Missed population structure
  3. Batch effects
  4. Also, we only have “Evidence against the null”

A replication will reduce worries about (1) and provide more independent evidence against the Null

8
Q

A replication study should be performed by a different group using a different sample and genotyping method.

True or false

A

True

9
Q

What problems were noted for a GWAS investigating Lupus?

A

Population Structure
Logistics
Politics
Too few controls

10
Q

A genetic association study tests whether…

A

the presence of a specific genetic variant correlates with a trait of interest (e.g. presence/absence of disease)

11
Q

What does the solid red line on a Manhattan plot represent?

A

P-value threshold that must be crossed for a SNP to be declared significant

12
Q

What is Quality assurance good practice for?

A

Generating quality data

13
Q

What does Quality control do?

Why?

A

Removes bad data so that the dataset conforms to quality metrics

  • GWA studies have a massive multiple testing issue
  • Only ‘hits’ with very low p-values considered real
  • Even small biases / errors in assumptions can greatly “pollute” the extreme tails of the test statistic distribution
  • This can result in many more false positives than true positives in the extreme tail
14
Q

What is the most significant aspect of GWAS?

A

Quality assurance

15
Q

What is important for statistical analysis?

A

Quality Control

16
Q

Describe the quality control pipeline

A

Post-genotyping checks

  • Individual QC
  • SNP QC
  • Choosing QC thresholds

Post-association checks

  • Re-examine SNP cluster plots
  • Re-examine QC metrics
  • Does LD make sense?
17
Q

Describe the quality assurance pipeline

A

Pre-genotyping checks

  • DNA preparation & quantification
  • Test SNPs
  • Equal treatment of cases & controls

Genotype calling

  • Alternative software
  • Re-run after removing bad individuals?
  • Run on cases + controls together?
18
Q

Why is individual missingness an issue in GWAS?

A

Indicates poor quality DNA.

Informative missingness: a problem if DNA quality correlates with phenotype.

19
Q

Why is Gender Check an issue in GWAS?

A

Indicates data recording problem

20
Q

Why is Relatedness an issue in GWAS?

A

Independence

Violates association testing assumptions

21
Q

Why are Population outliers an issue in GWAS?

A

False positives

22
Q

Why is inbreeding an issue in GWAS?

A

Sample contamination

Population effect

23
Q

How can quality assurance overcome Individual Missingness?

A

Equal number of cases and controls plated together

24
Q

How can quality control overcome Individual Missingness?

A

Plot 1-missingness against % removed

25
How can quality assurance overcome Gender Check?
Robust data recording
26
How can quality control overcome Gender Check?
  • Chromosome X/Y data
  • Inbreeding coefficient (F)
  • Check against the HWE expectation: F=0 for females, F=1 for males
27
How can quality assurance overcome Relatedness?
Recruitment
28
How can quality control overcome Relatedness?
Prune SNPs for LD, then calculate IBS and IBD
  • IBS: proportion of alleles shared, averaged over the genome
  • IBD: proportion of the genome that is the same due to inheritance
  • IBD=1: duplicate/twin
  • IBD=0.5: 1st degree (sib; parent)
  • IBD=0.25: 2nd degree (grandparent)
  • IBD=0.125: 3rd degree (cousin)
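The expected IBD values in this card can be turned into a simple lookup. The sketch below is only an illustration: the function name `classify_relatedness` is hypothetical, and the cut-offs are taken as midpoints between the expected values, which is an assumption rather than a recommended QC threshold.

```python
def classify_relatedness(ibd_proportion):
    """Map an estimated IBD proportion to the nearest relationship class.

    Cut-offs are midpoints between the expected values 1, 0.5, 0.25 and 0.125
    (an illustrative choice; real studies pick thresholds to suit their data).
    """
    if not 0.0 <= ibd_proportion <= 1.0:
        raise ValueError("IBD proportion must lie between 0 and 1")
    if ibd_proportion >= 0.75:
        return "duplicate/twin"
    if ibd_proportion >= 0.375:
        return "1st degree"
    if ibd_proportion >= 0.1875:
        return "2nd degree"
    if ibd_proportion >= 0.0625:
        return "3rd degree"
    return "unrelated"


# Hypothetical estimates for four pairs of samples.
for estimate in (0.98, 0.52, 0.24, 0.01):
    print(estimate, classify_relatedness(estimate))
```

Estimated IBD is noisy in practice, so values rarely fall exactly on 0.5 or 0.25; binning by nearest expected value is one simple way to handle that.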
29
How can quality assurance overcome Population outliers?
Recruitment
30
How can quality control overcome Population outliers?
PCA
31
How can quality assurance overcome Inbreeding?
  • DNA quality
  • Recruitment
32
How can quality control overcome Inbreeding?
Check heterozygosity using Wright’s inbreeding coefficient (F)
  • F > 0 indicates excess homozygotes
  • F < 0 indicates excess heterozygotes
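A minimal sketch of the heterozygosity check, assuming F is estimated as 1 minus the ratio of observed to expected heterozygous genotypes for one individual. The function name, the 0/1/2 genotype coding, and the use of known allele frequencies are assumptions for illustration.

```python
def inbreeding_coefficient(genotypes, allele_freqs):
    """Estimate F = 1 - observed_het / expected_het for one individual.

    genotypes: minor-allele counts (0, 1 or 2) at each SNP.
    allele_freqs: minor allele frequency at each SNP (assumed known).
    """
    if len(genotypes) != len(allele_freqs):
        raise ValueError("need one allele frequency per genotype")
    observed_het = sum(1 for g in genotypes if g == 1)
    # Under HWE the chance of being heterozygous at a SNP is 2p(1-p).
    expected_het = sum(2 * p * (1 - p) for p in allele_freqs)
    if expected_het == 0:
        raise ValueError("expected heterozygosity is zero")
    return 1 - observed_het / expected_het


freqs = [0.5, 0.5, 0.5, 0.5]
print(inbreeding_coefficient([0, 2, 0, 2], freqs))  # all homozygous: positive F
print(inbreeding_coefficient([1, 1, 1, 1], freqs))  # all heterozygous: negative F
```

A strongly positive F points towards inbreeding, while a strongly negative F (too many heterozygotes) is a typical sign of sample contamination.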
33
Why is SNP Missingness an issue in GWAS?
Informative missingness: a problem if DNA quality correlates with phenotype
34
Why is Minor Allele Frequency (MAF) an issue in GWAS?
  • Quality is positively correlated with MAF
  • Genotype calling is more difficult for rare SNPs
35
Why is Hardy-Weinberg Equilibrium (HWE) an issue in GWAS?
  • Departure can indicate genotype calling problems
  • Can lead to false positives if the problem is not balanced between cases and controls
36
Why are Mendelian errors an issue in GWAS?
They act as a genotype check: Mendelian errors indicate genotyping problems
37
How can quality assurance overcome the issue of SNP missingness?
Equal number of cases and controls plated together
38
How can quality control overcome the issue of SNP missingness?
  • Association QQ-plots at different missingness thresholds
  • Plot 1-missingness against % removed
39
How can quality control overcome issue of minor allele frequency?
  • MAF > 10/N
  • At MAF = 10/N, we would expect 20 heterozygotes
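The arithmetic behind the 10/N rule can be checked with a short sketch. It assumes N is the number of genotyped individuals and that genotypes follow Hardy-Weinberg proportions; the function names and the sample size of 2000 are hypothetical.

```python
def maf_threshold(n_individuals, scale=10):
    """MAF cut-off of scale / N, as in the 10/N rule on this card."""
    return scale / n_individuals


def expected_heterozygotes(maf, n_individuals):
    """Expected heterozygote count under HWE: 2p(1-p) per individual."""
    return 2 * maf * (1 - maf) * n_individuals


n = 2000  # hypothetical sample size
threshold = maf_threshold(n)
print(threshold, expected_heterozygotes(threshold, n))
```

Because the allele is rare at this threshold, almost every copy sits in a heterozygote, so the expected count comes out very close to 20 whatever the value of N.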
40
How can quality assurance overcome issue of Hardy-Weinberg Equilibrium (HWE)?
Recruitment
41
How can quality control overcome issue of Hardy-Weinberg Equilibrium (HWE)?
Calculate a p-value (null = HWE) using the controls only
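One way to obtain such a p-value is a 1-degree-of-freedom chi-square goodness-of-fit test on the control genotype counts. The sketch below makes that assumption; the lecture does not say which test is used, and exact tests are a common alternative when counts are small. The function name is hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of HWE from genotype counts (AA, AB, BB).

    Returns (chi2, p_value). Assumes both alleles are present in the sample.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("SNP is monomorphic; HWE test is undefined")
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical control counts: the first fits HWE, the second has no heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```

Restricting the test to controls matters because a true disease association can itself push the cases away from HWE at the associated SNP.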
42
How can quality control overcome issue of mendelian error?
  • Use one or two sets of trios per plate
  • Check child’s genotype compatibility with parents for every SNP
43
SNPs with very low MAF will have one larger cluster and two very small clusters, which may lead to uncertainty.
True or false
True
44
For a high-quality SNP, clustering is clearly presented.
True or false
True
45
Where are cases preferentially drawn from?
A sub-population which also has a higher frequency of one allele at the locus in question
46
What does difference in genotypes between cases and controls reflect?
Their population origin, not their disease status (“confounding”)
47
Confounding from population structure can only arise if...
different proportions of cases/controls are from each population and the populations differ in allele frequency at the locus in question
48
Apart from confounding from population structure, what else can lead to false negative results? When does this occur?
Cryptic population structure.
This occurs when the genetic variant does have an effect on disease status, but cases are drawn preferentially from a sub-population which has a higher frequency of the ‘low-risk’ allele
49
What issues can cryptic population structure lead to?
  • Two “hidden” populations
  • Mutation at higher frequency in one population
  • Cases/controls sampled disproportionately
    -> Higher frequency of mutation in cases compared to controls
    -> False positive association “hit” for this mutation
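The chain of events in this card can be reproduced with simple arithmetic. In the sketch below every number is made up for illustration: two hidden populations differ in the frequency of a mutation that has no effect on disease, and cases and controls are sampled from them in different proportions.

```python
def pooled_allele_frequency(sampling_mix, allele_freqs):
    """Allele frequency in a sample drawn from several populations.

    sampling_mix: fraction of the sample taken from each population.
    allele_freqs: frequency of the mutation within each population.
    """
    return sum(sampling_mix[pop] * allele_freqs[pop] for pop in allele_freqs)


# Hypothetical frequencies of a mutation with NO effect on disease.
freqs = {"pop_A": 0.10, "pop_B": 0.50}

# Cases are mostly from population B, controls mostly from population A.
case_mix = {"pop_A": 0.2, "pop_B": 0.8}
control_mix = {"pop_A": 0.8, "pop_B": 0.2}

print(pooled_allele_frequency(case_mix, freqs))
print(pooled_allele_frequency(control_mix, freqs))
```

The mutation looks far more common in cases purely because of where the cases came from. If cases and controls are drawn from the two populations in the same proportions, the pooled frequencies are equal and the spurious signal disappears.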
50
What are three ways in which a cryptic population structure can be avoided?
1. QA: Match cases and controls on ethnicity, geographic location, birthplace of grandparents, etc.
2. Correct for bias: Use an analysis method that accounts for population stratification. A key method is principal components analysis (PCA):
  • Use a set of ancestry informative markers (AIMs) to determine a ‘weighted score’ for each subject
  • The weighted score is a linear combination of the SNPs (each coded as 0, 1 or 2 copies of the minor allele)
  • The AIMs are a set of SNPs that are known to differ in frequency between populations
  • Because of these frequency differences, the weighted score is correlated with population
  • Use the weighted score as a ‘covariate’ in the logistic regression analysis
3. Use a family-based association study design (e.g. TDT). Not covered here.
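The ‘weighted score’ in step 2 is just a dot product between a subject's genotypes and a set of per-SNP weights. The sketch below shows only that step; the weights are hypothetical numbers standing in for PCA loadings, and deriving real loadings is outside the scope of this example.

```python
def ancestry_score(genotypes, weights):
    """Weighted score: linear combination of genotypes coded 0/1/2.

    genotypes: minor-allele counts at each ancestry informative marker.
    weights: one weight per marker (hypothetical stand-ins for PCA loadings).
    """
    if len(genotypes) != len(weights):
        raise ValueError("need one weight per marker")
    return sum(w * g for w, g in zip(weights, genotypes))


# Hypothetical weights for four AIMs.
weights = [0.6, -0.4, 0.5, -0.3]

# Two made-up subjects whose genotypes reflect different ancestries.
subject_1 = [2, 0, 2, 0]
subject_2 = [0, 2, 0, 2]

print(ancestry_score(subject_1, weights), ancestry_score(subject_2, weights))
```

Subjects from different populations end up with clearly different scores, which is what lets the score absorb population differences when it is added as a covariate.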
51
Correlation between SNPs (LD) enables us to make inferences genome-wide without genotyping every SNP in the genome. We type approximately 1 in 10 SNPs, and this is dense enough to capture most of the ‘common’ variation.
True or false
True
52
What does the primary SNP-by-SNP scan involve?
1. Null hypothesis: “allele frequencies in cases and controls are the same”
  • Test this hypothesis separately on each SNP
2. Genetic model: the additive model (= “allele dosing” / “log-additive” / “multiplicative”) is a good choice
  • Generally assume the effect is additive, as opposed to dominant or recessive
  • This is a simple but fairly robust approach: if a SNP acts non-additively, we are still likely to detect it as associated
  • Tagging tends to conserve the additive component only
3. Adjust for residual population stratification
  • Typically by adding principal component (PC) axes as covariates in a logistic regression applied to each SNP
4. Apply a very strict p-value threshold for significance
  • Need to account for the large number of tests
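The null hypothesis in step 1 can be tested for a single SNP with a chi-square test on the 2x2 table of allele counts. The sketch below shows that simplest version only: it is not the covariate-adjusted logistic regression of step 3, and the function name and counts are hypothetical.

```python
import math


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """1-df chi-square test that allele frequencies match in cases and controls.

    Arguments are allele counts (two alleles per person). Returns (chi2, p_value).
    """
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    row_col_product = (a + b) * (c + d) * (a + c) * (b + d)
    if row_col_product == 0:
        raise ValueError("table has an empty row or column")
    chi2 = n * (a * d - b * c) ** 2 / row_col_product
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical SNP: minor allele frequency 0.30 in cases and 0.20 in controls,
# with 500 cases and 500 controls (1000 alleles in each group).
print(allelic_chi_square(300, 700, 200, 800))
```

In a GWAS this test (or its regression equivalent) is repeated at every SNP, which is why the strict threshold in step 4 is needed.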
53
Why is P value adjustment needed for multiple testing?
Under the null, all p-values are “independent” and uniformly distributed between 0 and 1.
  • With 1 test: probability of a p-value less than 0.01 is 0.01
  • With 2 tests: probability of at least one p-value less than 0.01 = 1 - 0.99^2 ≈ 0.02
  • With 100 tests: probability of at least one p-value less than 0.01 = 1 - 0.99^100 ≈ 0.63
Thus multiple testing requires a low p-value threshold; otherwise, even if every SNP is ‘null’, there will be 1000s of hits.
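The probabilities in this card follow from one formula, 1 - (1 - α)^m for m independent tests. A minimal sketch, with hypothetical function names:

```python
def prob_at_least_one_hit(alpha, n_tests):
    """Chance that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests


def expected_null_hits(alpha, n_tests):
    """Expected number of null tests with p < alpha."""
    return alpha * n_tests


for m in (1, 2, 100):
    print(m, round(prob_at_least_one_hit(0.01, m), 2))

# With one million null SNPs and a 0.01 threshold, thousands of hits are expected.
print(expected_null_hits(0.01, 1_000_000))
```

The last line is the practical point: at GWAS scale a lenient threshold produces thousands of hits even when nothing is truly associated.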
54
If all p-values are from the NULL, what is expected to appear on a QQ plot?
A straight line with some deviation due to random variability
55
How can the P value be adjusted for multiple testing? On what basis can someone decide which test to choose?
Bonferroni
  • Threshold = α / number of tests
  • The probability of rejecting at least one true null is ≤ α
  • Conservative
False discovery rate (FDR)
  • Sets the expected proportion of false positives among the declared hits to be ≤ α
  • More powerful
  • More coherent interpretation
The choice depends on how costly false discoveries are. If you do not wish to follow up any SNPs that are likely to be false, choose Bonferroni. If you wish to maximise discoveries and can tolerate the occasional error, the false discovery rate may be better.
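The two approaches can be compared on a handful of p-values. The sketch below assumes the Benjamini-Hochberg step-up procedure as the FDR method; the card does not name a specific procedure, and the function names and p-values are made up.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha):
    """Indices of tests declared significant by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under its stepped threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


p_values = [0.001, 0.012, 0.02, 0.5]  # made-up p-values for four SNPs
alpha = 0.05

threshold = bonferroni_threshold(alpha, len(p_values))
bonferroni_hits = [i for i, p in enumerate(p_values) if p <= threshold]

print(bonferroni_hits)
print(benjamini_hochberg(p_values, alpha))
```

On these made-up values the FDR procedure declares one more SNP than Bonferroni, which reflects the trade-off in the card: more discoveries at the price of tolerating some false ones. With α = 0.05 and one million tests, the Bonferroni threshold is 5 × 10^-8.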
56
Linkage disequilibrium also means what?
The genotypes of all common SNPs can be imputed from the GWAS genotypes
57
What does genotype imputation allow?
Estimation of genotypes at un-typed SNPs. This is possible due to the correlation in the genome between SNPs, also known as linkage disequilibrium. Because SNPs are correlated, and some are very highly correlated, knowing the genotype at one or more SNPs lets us guess the genotypes at other SNPs.
58
How does imputation work?
1. Start with genotype data that has missing data at untyped SNPs
2. Testing association at typed SNPs may not lead to a clear signal
3. Each sample is phased and its haplotypes are modelled as a mosaic of those in the haplotype reference panel
4. Reference set of haplotypes, for example HapMap
5. The reference haplotypes are used to impute alleles into the samples to create imputed genotypes
6. Testing association at imputed SNPs may boost the signal
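A deliberately simplified sketch of the idea: it fills untyped positions by finding reference haplotypes that agree with the sample at every typed position and taking the most common allele among them. Real imputation methods model each haplotype as a mosaic of reference haplotypes rather than matching whole haplotypes, so this is only a toy; the function name and data are hypothetical.

```python
from collections import Counter


def impute_haplotype(sample, reference_panel):
    """Fill untyped positions (None) in a phased sample haplotype.

    sample: list of alleles (0/1) with None at untyped SNPs.
    reference_panel: list of fully typed reference haplotypes of equal length.
    Returns a new list; positions stay None if no reference haplotype matches.
    """
    matches = [
        ref for ref in reference_panel
        if all(s is None or s == r for s, r in zip(sample, ref))
    ]
    if not matches:
        return list(sample)
    imputed = []
    for position, allele in enumerate(sample):
        if allele is not None:
            imputed.append(allele)  # typed SNPs are kept as observed
        else:
            counts = Counter(ref[position] for ref in matches)
            imputed.append(counts.most_common(1)[0][0])
    return imputed


# Hypothetical reference panel of four haplotypes over four SNPs.
panel = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 0, 0, 1]]

# Sample typed at SNPs 1 and 3 only.
print(impute_haplotype([0, None, 1, None], panel))
```

The toy works for the same reason real imputation does: LD means the typed alleles narrow down which reference haplotypes the sample could be carrying.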
59
What are two advantages of Imputation?
  • It leads to better resolution of the association signal across a region of the genome
  • It aids meta-analysis, where association data from several studies are combined, by ensuring that the same SNPs are available in each study
E.g. a 6-study meta-analysis in which each study used a different genotyping chip. Before imputation there was little overlap in the SNPs typed across all six studies, so the meta-analysis would have been conducted on only a few SNPs. After all six studies had been imputed, it was possible to conduct the meta-analysis on a much larger set of SNPs.
60
When is an additional QC required?
When imputation is used in a meta-analysis of multiple GWAS
  • Imputation quality must be checked
61
At what stage is imputation conducted when it is used to facilitate a meta-analysis of multiple GWAS?
Early stages
62
What is an example of the use of GWAS leading to success?
In Lupus, for example, before 2008 only half a dozen loci were known to have strong evidence of association with disease. By 2015, after several GWAS and a large meta-analysis between Chinese and European GWAS, there were 61 loci. The number associated is now generally accepted to be more than 80, as additional studies have replicated these signals and found more.
63
What studies are much lower resolution, generally investigating large segments of the genome and how these segregate in families that have individuals affected by disease?
Linkage studies
64
When is GWAS successful?
When studying common variation, which tends to have smaller effect sizes.
65
When are linkage studies successful?
When there is a large effect size for a generally rare polymorphism in the genome
66
Currently there is little evidence of common variation implicating large effects, while rare variants with small effects are very difficult for any study to identify.
True or false
True
67
GWAS have identified many potentially causal loci. Why didn’t candidate gene studies work?
  • Scientists were not successfully picking candidates
  • We were over-optimistic about the effect sizes expected within candidate genes
  • Underpowered studies, but lots of them
A “perfect storm” for generating a high rate of false positives
68
Candidate gene studies focused on what?
One area that was “hypothesised” to be involved in the disease
69
GWAS is hypothesis-free: it tests everywhere in the genome for association.
True or false
True
70
The ratio of true hits to false hits in the “publication pool” depends upon what?
The relative ratio of these two events in the “research pool”.
With many low-powered studies, the false associations are combined with only a small number of identified real associations, so the majority of published associations are false!
71
GWAS has been very successful in identifying many loci as associated with disease. However what are three issues?
1. Missing heritability
2. Not very good for prediction: associated variants have too small an effect size to be useful for predicting disease
3. Determining causality can be hard, as most associated SNPs lie outside gene-coding areas
72
What is the explanation of the first issue relating to GWAS?
Missing heritability may be due to:
  • Imperfect tagging of common variants, perhaps of non-SNP variation like CNVs
  • Rare variants of moderate effect
  • Many common variants of tiny effect
  • Current heritability estimates are wrong?
  • Current methods for estimating the heritability explained by GWAS are biased (they underestimate the amount explained)
73
Why is the third issue of GWAS not so disappointing?
It is now known that most GWAS-associated SNPs affect the expression of genes rather than the protein-coding sequence itself. This result is therefore less of a disappointment and more enlightening. Areas of the genome that increase or decrease gene expression do lie outside protein-coding areas, and this seems to be where altered risk for many diseases (including Lupus) resides.
74
Why is replication of GWAS needed?
To provide reassurance
75
What question does a GWAS ask repeatedly?
“Does the allele frequency of SNP j differ between cases and controls?” [for j = 1 … 1,000,000]