Lecture 5 Flashcards

1
Q

What is a GWAS?

A

‘Association testing at many (but not all) markers across the entire genome’. SNP association with disease.

This is likely only to detect INDIRECT ASSOCIATION (Tagging).

  • Genotyping performed at roughly 1 million markers (SNPs).
  • These genotyped SNPs capture most of the common variation in the genome, through correlation (Linkage disequilibrium) with all (~10m) common SNPs
2
Q

What is the issue with using GWAS?

A

The human genome contains roughly 1 SNP per 100-300 bp

The human genome has approximately 3000m bp

So approx. 10m SNPs (assuming 1 SNP / 300bp)
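
The arithmetic above can be checked with a short sketch. The function name and the constant are illustrative only; the figures are the approximate ones quoted on this card.

```python
# Back-of-envelope count of SNPs, using the approximate figures on this card.
GENOME_BP = 3_000_000_000  # approx. 3000m bp


def estimated_snp_count(bp_per_snp: int, genome_bp: int = GENOME_BP) -> int:
    """Number of SNPs if one SNP occurs every `bp_per_snp` base pairs."""
    return genome_bp // bp_per_snp


low = estimated_snp_count(300)   # 1 SNP / 300 bp
high = estimated_snp_count(100)  # 1 SNP / 100 bp
print(low, high)
```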

It is often not possible to genotype and analyse such a large number of SNPs because of several limiting factors

  • Available genotyping platforms
  • Cost
3
Q

How should the issue of GWAS use be dealt with?

A

Use of Linkage Disequilibrium (LD)

Instead of genotyping all the 10M SNPs we can genotype tagSNPs in a haplotype block.

A tagging SNP is a representative SNP in a given region of the genome in high LD to all other SNPs in the region.

Genotyping chips with 0.5M-1M SNPs are sufficient for a good GWAS

4
Q

What should be kept in mind about tagging?

A

Association does not necessarily mean causation

5
Q

How are SNPs ‘close’ to each other correlated?

What does this mean?

A

Haplotypes (a group of alleles at nearby loci within an organism that were inherited together from a single parent).

If a causal SNP at position 2 is perfectly correlated (say r² = 1) with a SNP at position 1, then you will also observe an association with the SNP at position 1
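
As a rough illustration of the correlation measure mentioned above, the sketch below computes r² between two biallelic SNPs from phased haplotypes. The function name and the toy data are made up for this example, and alleles are assumed to be coded 0/1.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic SNPs.

    haplotypes: list of (a, b) tuples, alleles coded 0/1 at SNP 1 and SNP 2.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Toy data: the two SNPs always carry the same allele (perfect tagging).
perfect = [(0, 0)] * 6 + [(1, 1)] * 4
# Toy data: every allele combination equally common (no correlation).
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(r_squared(perfect), r_squared(independent))
```

With r² close to 1, testing the SNP at position 1 is nearly as informative as testing the causal SNP at position 2 directly.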

6
Q

What are the main steps in designing and analyzing a large (GWAS) association study?

A

Sample collection

  • Ethnicity (Try to avoid population stratification)
  • Sample size (large SS = more statistical power)

Data generation

  • DNA extraction
  • Genotyping: Current technology allows typing of ~ 5m markers (approx. £500 per sample)
  • IMPUTATION (Guess un-typed loci)

Standard analyses for identifying associated loci.

  • Association testing
  • Logistic regression

Replication

Quality assurance (QA) and Quality control (QC) (carried out over multiple stages of GWAS)

  • QA : Planning experiment to minimise problems with data
  • QC : Analysing the data to detect problems
7
Q

Why should a GWAS study be replicated?

A
  1. Our GWAS results are a random sample of allele frequencies in Cases and Controls
    - Some results might be specific to the GWAS design
  2. Missed population structure
  3. Batch effects
  4. Also, we only have “Evidence against the null”

A replication will reduce worries about (1) and provide more independent evidence against the Null

8
Q

A replication study should be performed by a different group using a different sample and genotyping method.

True or false

A

True

9
Q

What problems were noted for a GWAS investigating Lupus?

A

Population Structure
Logistics
Politics
Too few controls

10
Q

A genetic association study tests whether…

A

the presence of a specific genetic variant correlates with a trait of interest (e.g. presence/absence of disease)

11
Q

What does the solid red line on a Manhattan plot represent?

A

The p-value threshold that must be crossed for a SNP to be declared significant

12
Q

What is Quality assurance good practice for?

A

Generating quality data

13
Q

What does Quality control do?

Why?

A

Removes bad data so that the dataset conforms to quality metrics

  • GWA studies have a massive multiple-testing issue
  • Only ‘hits’ with very low p-values considered real
  • Even small biases / errors in assumptions can greatly “pollute” the extreme tails of the test statistic distribution
  • This can result in many more false positives than true positives in the extreme tail
14
Q

What is the most significant aspect of GWAS?

A

Quality assurance

15
Q

What is important for statistical analysis?

A

Quality Control

16
Q

Describe the quality control pipeline

A

Post-genotyping checks

  • Individual QC
  • SNP QC
  • Choosing QC thresholds

Post-association checks

  • Re-examine SNP cluster plots
  • Re-examine QC metrics
  • Does LD make sense?
17
Q

Describe the quality assurance pipeline

A

Pre-genotyping checks

  • DNA preparation & quantification
  • Test SNPs
  • Equal treatment of cases & controls

Genotype calling

  • Alternative software
  • Re-run after removing bad individuals?
  • Run on cases + controls together?
18
Q

Why is individual missingness an issue in GWAS?

A

Indicates poor quality DNA.

Informative missingness: a problem if DNA quality correlates with phenotype.

19
Q

Why is Gender Check an issue in GWAS?

A

Indicates data recording problem

20
Q

Why is Relatedness an issue in GWAS?

A

Independence

Violates association testing assumptions

21
Q

Why are Population outliers an issue in GWAS?

A

False positives

22
Q

Why is inbreeding an issue in GWAS?

A

Sample contamination

Population effect

23
Q

How can quality assurance overcome Individual Missingness?

A

Equal number of cases and controls plated together

24
Q

How can quality control overcome Individual Missingness?

A

Plot 1-missingness against % removed

25
Q

How can quality assurance overcome Gender Check?

A

Robust data recording

26
Q

How can quality control overcome Gender Check?

A
Use chromosome X/Y data
Calculate the inbreeding coefficient (F) on the X chromosome
Compare with the HWE expectation:
F ≈ 0 for females
F ≈ 1 for males
27
Q

How can quality assurance overcome Relatedness?

A

Recruitment

28
Q

How can quality control overcome Relatedness?

A
Prune SNPs for LD
Calculate IBS and IBD
IBS: Proportion of alleles shared, averaged over the genome
IBD: Proportion of the genome that is the same due to inheritance
IBD = 1: duplicate/twin
IBD = 0.5: 1st degree (sibling; parent)
IBD = 0.25: 2nd degree (grandparent)
IBD = 0.125: 3rd degree (cousin)
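
A minimal sketch of how the expected IBD values above might be used to label a pair of samples. Assigning an estimate to the nearest expected value, and the extra "unrelated" category at IBD = 0, are assumptions made for this example rather than a standard rule.

```python
# Expected IBD proportions from the card, plus an assumed "unrelated" class at 0.
EXPECTED_IBD = [
    (1.0, "duplicate/twin"),
    (0.5, "1st degree"),
    (0.25, "2nd degree"),
    (0.125, "3rd degree"),
    (0.0, "unrelated"),
]


def classify_relatedness(ibd: float) -> str:
    """Label a pair by the expected IBD value closest to the estimate."""
    if not 0.0 <= ibd <= 1.0:
        raise ValueError("IBD proportion must lie between 0 and 1")
    return min(EXPECTED_IBD, key=lambda item: abs(item[0] - ibd))[1]


for estimate in (0.98, 0.48, 0.2, 0.02):
    print(estimate, classify_relatedness(estimate))
```

In a QC step, one sample from each pair labelled as closely related would typically be removed so that the remaining samples are independent.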
29
Q

How can quality assurance overcome Population outliers?

A

Recruitment

30
Q

How can quality control overcome Population outliers?

A

PCA

31
Q

How can quality assurance overcome Inbreeding?

A

DNA quality

Recruitment

32
Q

How can quality control overcome Inbreeding?

A

Check heterozygosity
Wright’s inbreeding coefficient (F)
+ indicates excess homozygotes
- indicates excess heterozygotes
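
One common way to estimate F for an individual compares the observed number of homozygous genotypes with the number expected under HWE. The sketch below assumes genotypes coded as 0/1/2 copies of the minor allele and known allele frequencies; the function name and toy data are illustrative.

```python
def inbreeding_f(genotypes, allele_freqs):
    """Estimate F = (observed hom - expected hom) / (n SNPs - expected hom).

    genotypes: 0/1/2 minor-allele counts for one individual, one per SNP.
    allele_freqs: minor allele frequency of each SNP (same order).
    """
    n = len(genotypes)
    observed_hom = sum(1 for g in genotypes if g != 1)
    # Under HWE the chance of a homozygous genotype is 1 - 2p(1 - p).
    expected_hom = sum(1 - 2 * p * (1 - p) for p in allele_freqs)
    return (observed_hom - expected_hom) / (n - expected_hom)


freqs = [0.5, 0.5, 0.5, 0.5]
print(inbreeding_f([0, 2, 0, 2], freqs))  # all homozygous: positive F
print(inbreeding_f([1, 1, 1, 1], freqs))  # all heterozygous: negative F
print(inbreeding_f([0, 1, 2, 1], freqs))  # matches expectation: F of zero
```

A strongly negative F (excess heterozygotes) is one possible sign of sample contamination, while a strongly positive F suggests inbreeding.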

33
Q

Why is SNP Missingness an issue in GWAS?

A

Informative missingness: a problem if DNA quality correlates with phenotype

34
Q

Why is Minor Allele Frequency (MAF) an issue in GWAS?

A

Quality positively correlated with MAF

Genotype calling is more difficult for rare SNPs

35
Q

Why is Hardy-Weinberg Equilibrium (HWE) an issue in GWAS?

A

Departure can indicate genotype calling problems.

Can lead to false positives if the problem is not balanced between cases and controls

36
Q

Why are Mendelian errors an issue in GWAS?

A

Genotype check

37
Q

How can quality assurance overcome the issue of SNP missingness?

A

Equal number of cases and controls plated together

38
Q

How can quality control overcome the issue of SNP missingness?

A

Association QQ-plots at different missingness thresholds

Plot 1-missingness against % removed

39
Q

How can quality control overcome the issue of minor allele frequency?

A

MAF > 10/N

At MAF = 10/N, we would expect about 20 heterozygotes
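
The figure of 20 heterozygotes follows from the HWE genotype proportions. The sketch below works it through for a hypothetical study of N = 5000 individuals; the sample size and function name are assumptions for illustration.

```python
def expected_genotype_counts(n_individuals: int, maf: float) -> dict:
    """Expected genotype counts under HWE for a SNP with minor allele frequency `maf`."""
    p = maf
    return {
        "hom_minor": n_individuals * p * p,
        "het": 2 * n_individuals * p * (1 - p),
        "hom_major": n_individuals * (1 - p) ** 2,
    }


n = 5000                 # hypothetical sample size
threshold_maf = 10 / n   # the MAF > 10/N rule from the card
counts = expected_genotype_counts(n, threshold_maf)
print(counts)            # about 20 heterozygotes, almost no minor homozygotes
```

Because 2N x (10/N) x (1 - 10/N) is just under 20 for any reasonably large N, the rule guarantees roughly 20 carriers of the minor allele regardless of sample size.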

40
Q

How can quality assurance overcome the issue of Hardy-Weinberg Equilibrium (HWE)?

A

Recruitment

41
Q

How can quality control overcome the issue of Hardy-Weinberg Equilibrium (HWE)?

A

Calculate a p-value (null = HWE) using the controls only
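
A minimal sketch of such a test, using a 1-degree-of-freedom chi-square goodness-of-fit statistic on the genotype counts in controls. This is one simple option chosen for illustration; exact tests are often preferred in practice, and the function name and counts are made up.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square test (1 df) of HWE from genotype counts. Returns (chi2, p_value)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("SNP is monomorphic; HWE cannot be tested")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


print(hwe_chi_square(25, 50, 25))  # counts match HWE exactly
print(hwe_chi_square(50, 0, 50))   # no heterozygotes at all: extreme departure
```

SNPs whose control p-value falls below a chosen threshold would be flagged for possible genotype calling problems.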

42
Q

How can quality control overcome the issue of Mendelian errors?

A

Use one or two sets of trios per plate

Check child’s genotype compatibility with parents for every SNP
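
The compatibility check can be sketched as follows for autosomal SNPs, with genotypes coded as 0/1/2 copies of allele B. The coding, the function names and the example trio are assumptions made for illustration.

```python
from itertools import product

# Genotype code -> the two alleles carried (autosomal SNPs only).
ALLELES = {0: ("A", "A"), 1: ("A", "B"), 2: ("B", "B")}


def is_mendelian_consistent(father: int, mother: int, child: int) -> bool:
    """True if the child's genotype can be formed from one allele of each parent."""
    possible = {
        tuple(sorted((f, m)))
        for f, m in product(ALLELES[father], ALLELES[mother])
    }
    return tuple(sorted(ALLELES[child])) in possible


def mendelian_errors(trio_genotypes):
    """Indices of SNPs where (father, mother, child) genotypes are incompatible."""
    return [
        i for i, (f, m, c) in enumerate(trio_genotypes)
        if not is_mendelian_consistent(f, m, c)
    ]


# One trio typed at four SNPs: (father, mother, child) per SNP.
trio = [(0, 0, 0), (0, 0, 1), (1, 1, 2), (0, 2, 0)]
print(mendelian_errors(trio))
```

SNPs that show errors in the trios on many plates are candidates for genotype calling problems.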

43
Q

SNPs with very low MAF will have one large cluster and two very small clusters, which may lead to uncertainty

True or false

A

True

44
Q

For a high-quality SNP, the genotype clusters are clearly separated

True or false

A

True

45
Q

Where are cases preferentially drawn from?

A

A sub-population which also has a higher frequency of one allele at the locus in question

46
Q

What does difference in genotypes between cases and controls reflect?

A

Their population origin, not their disease status (“confounding”)

47
Q

Confounding from population structure can only arise if…

A

different proportion of cases/controls are from each population

and

populations differ in allele frequency at the locus in question

48
Q

Apart from confounding from population structure, what else can lead to false negative results?

When does this occur?

A

Cryptic population structure

This occurs when the genetic variant does have an effect on disease status, but cases are drawn preferentially from a sub-population which has a higher frequency of the ‘low-risk’ allele

49
Q

What issues can a cryptic population lead to?

A

Two “hidden” populations

Mutation at higher frequency in one population

Cases/Controls sampled disproportionately

  • > Higher frequency of mutation in cases compared to controls
  • > False positive association “hit” for this mutation
50
Q

What are three ways in which a cryptic population structure can be avoided?

A
  1. QA: Match cases and controls on ethnicity, geographic location, birthplace of grandparents, etc.
  2. Correct for bias: Use an analysis method that accounts for population stratification. A key method is principal components analysis (PCA): use a set of ancestry informative markers (AIMs) to determine a ‘weighted score’ for each subject.

The weighted score is a linear combination of the SNP genotypes (coded as 0, 1 or 2 copies of the minor allele).

The AIMs are a set of SNPs that are known to differ in frequency between populations.
Because of these population differences in frequency, the weighted score is correlated with the population.

Use the weighted score as a ‘covariate’ in the logistic regression analysis.

  3. Use a family-based association study design (e.g. TDT). Not covered here.
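
The ‘weighted score’ in method 2 can be sketched as a simple linear combination. The weights below are made up for illustration; in a real PCA they would come from the principal component loadings, and genotypes are usually centred and scaled first.

```python
def ancestry_score(genotypes, weights):
    """Weighted score: sum of weight * genotype over the ancestry informative markers.

    genotypes: 0/1/2 copies of the minor allele at each AIM for one subject.
    weights: one weight per AIM (hypothetical values in this example).
    """
    if len(genotypes) != len(weights):
        raise ValueError("need one weight per SNP")
    return sum(w * g for w, g in zip(weights, genotypes))


weights = [0.5, -0.25, 1.0]   # illustrative loadings for three AIMs
subject_a = [2, 0, 1]
subject_b = [0, 2, 0]
print(ancestry_score(subject_a, weights), ancestry_score(subject_b, weights))
```

Subjects from different populations tend to get different scores, which is why the score is useful as a covariate in the regression.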
51
Q

Correlation between SNPs (LD) enables us to make inferences genome-wide without genotyping every SNP in the genome

We type approximately 1 in 10 SNPs, and this is dense enough to capture most of the ‘common’ variation.

True or false

A

True

52
Q

What does the primary SNP-by-SNP scan involve?

A
  1. Null hypothesis: “allele frequencies in cases and controls are the same”
    Test this hypothesis separately on each SNP
  2. Genetic model: the additive model (= “allele dosing” / “log-additive” / “multiplicative”) is a good choice. Effects are generally assumed to be additive, as opposed to dominant or recessive. This is a simple approach but fairly robust (if a SNP acts non-additively then we are still likely to detect it as associated).
  • Generally robust to non-additivity
  • Tagging tends to conserve additive component only
  3. Adjust for residual population stratification
    Typically by adding principal component (PC) axes as covariates in a logistic regression applied to each SNP
  4. Apply very strict p-value threshold for significance
    Need to account for large number of tests
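
As a simplified illustration of step 1, the sketch below tests the null hypothesis at a single SNP with a 1-degree-of-freedom chi-square test on allele counts. It is not the covariate-adjusted logistic regression described in step 3, and the function name and genotype counts are invented for the example.

```python
import math


def allelic_test(case_genotypes, control_genotypes):
    """Chi-square test (1 df) that allele frequencies match in cases and controls.

    Each argument is a tuple of genotype counts (n_AA, n_AB, n_BB).
    Returns (chi2, p_value).
    """
    def allele_counts(genotype_counts):
        n_aa, n_ab, n_bb = genotype_counts
        return 2 * n_aa + n_ab, 2 * n_bb + n_ab   # counts of allele A and allele B

    a, b = allele_counts(case_genotypes)
    c, d = allele_counts(control_genotypes)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("SNP is monomorphic or a group is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2))      # chi-square survival function, 1 df
    return chi2, p_value


print(allelic_test((10, 20, 70), (10, 20, 70)))   # identical frequencies
print(allelic_test((10, 20, 70), (30, 40, 30)))   # allele A rarer in cases
```

In a GWAS this test, or its regression equivalent, is repeated for every SNP, which is why step 4 is needed.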
53
Q

Why is P value adjustment needed for multiple testing?

A

Under the null, all p-values are “independent” and uniformly distributed between 0 and 1

With 1 test: the probability of a p-value less than 0.01 is 0.01

With 2 tests: the probability of at least one p-value less than 0.01 = 1 - 0.99^2 ≈ 0.02

With 100 tests: the probability of at least one p-value less than 0.01 = 1 - 0.99^100 ≈ 0.63

Thus multiple testing requires a low p-value threshold; otherwise, even if every SNP is ‘null’, there will be thousands of hits
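
The numbers above can be reproduced with a short sketch, assuming independent tests as the card does. The function names are illustrative.

```python
def prob_at_least_one_hit(alpha: float, n_tests: int) -> float:
    """Chance that at least one of n independent null p-values falls below alpha."""
    return 1 - (1 - alpha) ** n_tests


def expected_false_hits(alpha: float, n_tests: int) -> float:
    """Expected number of null p-values below alpha."""
    return alpha * n_tests


for n in (1, 2, 100):
    print(n, round(prob_at_least_one_hit(0.01, n), 2))

# With 1,000,000 null SNPs and a threshold of 0.01:
print(expected_false_hits(0.01, 1_000_000))
```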

54
Q

If all p-values are from the NULL, what is expected to appear on a QQ plot?

A

A straight line with some deviation due to random variability
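
A rough sketch of how the points on a QQ plot can be computed. The expected p-value for the i-th smallest of n p-values is taken as i/(n+1), which is one common convention and an assumption here; the function name is illustrative.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p) for a QQ plot."""
    n = len(p_values)
    ordered = sorted(p_values)
    return [
        (-math.log10(rank / (n + 1)), -math.log10(p))
        for rank, p in enumerate(ordered, start=1)
    ]


# p-values lying exactly on the uniform quantiles fall on the line y = x.
null_like = [i / 10 for i in range(1, 10)]
for expected, observed in qq_points(null_like):
    print(round(expected, 3), round(observed, 3))
```

Points that rise above the line across the whole range suggest a systematic bias such as population structure, while a departure only in the extreme tail is what true associations would produce.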

55
Q

How can the P value be adjusted for multiple testing?

On what basis can someone decide which test to choose?

A

Bonferroni

  • Threshold = α / number-of-tests
  • The probability of falsely rejecting at least one null ≤ α
  • Conservative

False Discovery rate

  • Sets the expected proportion of false positives among the declared hits to be ≤ α
  • More powerful
  • More coherent interpretation

The choice depends on how costly false discoveries are. If you do not wish to follow up any SNPs that are likely to be false, choose Bonferroni. If you wish to maximise discoveries and can tolerate the occasional error, the false discovery rate may be better.
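
The two approaches can be compared with a small sketch. The false discovery rate is controlled here with the Benjamini-Hochberg step-up procedure, which is one standard choice and not necessarily the method used in the lecture; the p-values are made up.

```python
def bonferroni_hits(p_values, alpha=0.05):
    """Indices of tests with p <= alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= threshold]


def benjamini_hochberg_hits(p_values, alpha=0.05):
    """Indices of tests declared significant by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its step-up threshold.
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bonferroni_hits(p_values))          # stricter: fewer hits
print(benjamini_hochberg_hits(p_values))  # more powerful: more hits
```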

56
Q

Linkage disequilibrium also means what?

A

The genotypes of all common SNPs can be imputed from the GWAS genotypes

57
Q

What does genotype imputation allow?

A

Estimation of genotypes at un-typed SNPs.

This is possible due to the correlation in the genome between SNPs, also known as Linkage Disequilibrium.

Because SNPs are correlated, and some SNPs are very highly correlated, knowing the genotype at one or more SNPs lets us guess the genotypes at other SNPs.

58
Q

How does imputation work?

A
  1. Start with genotype data that have missing data at untyped SNPs
  2. Testing association at typed SNPs only may not lead to a clear signal
  3. Each sample is phased and its haplotypes are modelled as a mosaic of those in a haplotype reference panel
  4. The reference panel is a set of haplotypes, for example HapMap
  5. The reference haplotypes are used to impute alleles into the samples to create imputed genotypes
  6. Testing association at imputed SNPs may boost the signal
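
The idea behind step 5 can be shown with a deliberately simplified toy. Real imputation software models each haplotype as a mosaic of reference haplotypes; the sketch below swaps in a much cruder technique, copying the missing allele from the reference haplotypes that best match at the typed SNPs. All names and data are hypothetical.

```python
from collections import Counter


def impute_haplotype(sample, reference_panel):
    """Fill untyped SNPs (None) using the closest-matching reference haplotypes.

    sample: list of alleles coded 0/1, with None at untyped SNPs.
    reference_panel: list of fully typed haplotypes of the same length.
    """
    typed = [i for i, allele in enumerate(sample) if allele is not None]

    def mismatches(ref):
        return sum(1 for i in typed if ref[i] != sample[i])

    best = min(mismatches(ref) for ref in reference_panel)
    closest = [ref for ref in reference_panel if mismatches(ref) == best]

    imputed = list(sample)
    for i, allele in enumerate(sample):
        if allele is None:
            # Majority vote among the best-matching reference haplotypes.
            imputed[i] = Counter(ref[i] for ref in closest).most_common(1)[0][0]
    return imputed


panel = [[0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 1, 1]]
print(impute_haplotype([1, 1, None, 1], panel))
```

The toy works only because the SNPs are in LD: the typed alleles identify which reference haplotype the sample resembles, and that haplotype supplies the untyped allele.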
59
Q

What are two advantages of Imputation?

A

It leads to better resolution of the association signal across a region of the genome.

It aids meta-analysis, where association data from several studies are combined, by ensuring that the same SNPs are available in each study. E.g. in a 6-study meta-analysis where each study used a different genotyping chip, before imputation there was little overlap in the SNPs typed across all six studies, so the meta-analysis could be conducted on only a few SNPs. After all six studies had been imputed, it was possible to conduct the meta-analysis on a much larger set of SNPs.

60
Q

When is an additional QC required?

A

When imputation is used in meta-analysis of multiple GWAS’s

Imputation quality

61
Q

At what stage is imputation conducted when it is used to facilitate meta-analysis of multiple GWAS?

A

Early stages

62
Q

What is an example of the use of GWAS leading to success?

A

In Lupus, for example, before 2008 only half a dozen loci were known to have strong evidence of association with disease. By 2015, after several GWAS and a large meta-analysis of Chinese and European GWAS, there were 61 loci. The number associated today is generally accepted to be more than 80, as additional studies have replicated these signals and found more.

63
Q

Which studies have much lower resolution, generally investigating large segments of the genome and how these segregate in families that have individuals affected by disease?

A

Linkage studies

64
Q

When is GWAS successful?

A

When the trait is driven by common variation, which tends to have smaller effect sizes.

65
Q

When are linkage studies successful?

A

When there is a large effect size for a generally rare polymorphism in the genome

66
Q

Currently there is little evidence of common variation implicating large effects, while rare variants with small effects are very difficult for any study to identify.

True or false

A

True

67
Q

GWAS have identified many potentially causal loci.

Why didn’t candidate gene studies work?

A

Scientists were not successfully picking candidates

We were over-optimistic about the effect sizes expected within candidate genes

Underpowered studies, but lots of them
-A “perfect storm” for generating a high rate of false positives

68
Q

Candidate gene studies focused on what?

A

One area that was “hypothesised” to be involved in the disease

69
Q

GWAS is hypothesis free: tests everywhere in the genome for association

True or false

A

True

70
Q

The ratio of true hits to false hits in the “publication pool” depends upon what?

A

The relative ratio of these two events in the “research pool”

With many low-powered studies, the false associations outnumber the small number of real associations identified, so the majority of published associations are false!

71
Q

GWAS has been very successful in identifying many loci as associated with disease.

However what are three issues?

A

1: Missing Heritability
2: Not very good for prediction: associated variants have too small an effect size to be useful for predicting disease.

3: Determining causality can be hard as most associated SNPs lie outside of gene coding areas.

72
Q

What is the explanation of the first issue relating to GWAS?

A

Missing heritability is due to:

Imperfect tagging of common variants
-Perhaps of non-SNP variation like CNVs

Rare variants of moderate effect

Many common variants of tiny effect

Current heritability estimates are wrong?

Current methods for estimating the heritability explained by GWAS are biased (they underestimate the amount explained)

73
Q

Why is the third issue of GWAS not so disappointing?

A

We now know, however, that most GWAS-associated SNPs affect the expression of genes rather than the protein-coding sequence itself. Thus this result is less of a disappointment and more enlightening. Regions of the genome that increase or decrease gene expression do lie outside protein-coding areas, and this seems to be where altered risk for many diseases (including Lupus) resides.

74
Q

Why is replication of GWAS needed?

A

To provide reassurance

75
Q

What question does a GWAS ask repeatedly?

A

“Does the allele frequency of SNP j differ between cases and controls?” [for j = 1, …, 1,000,000]