Lecture 5 Flashcards

Question

How can quality assurance overcome Gender Check?

Answer 1

Robust data recording

Answer 2

```
Chrom X/Y data
Inbreeding coefficient (F)
Check against HWE expectation
F=0 for females
F=1 for males
```
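A minimal sketch of the X-chromosome check, assuming genotypes are coded as 0/1/2 copies of the minor allele and that per-SNP minor allele frequencies are already known. The function names and the 0.2/0.8 cut-offs are illustrative assumptions, not taken from the lecture.

```python
def x_inbreeding_f(genotypes, mafs):
    """Return F = 1 - observed_het / expected_het over X-chromosome SNPs.

    genotypes: 0/1/2 copies of the minor allele, one entry per SNP.
    mafs: minor allele frequency of each SNP (assumed known, 0 < maf < 1).
    """
    observed_het = sum(1 for g in genotypes if g == 1)
    # Under HWE the chance of a heterozygous call at a SNP is 2pq.
    expected_het = sum(2 * p * (1 - p) for p in mafs)
    return 1 - observed_het / expected_het


def call_sex(f, female_max=0.2, male_min=0.8):
    """Map F to a sex call; the cut-offs are illustrative assumptions."""
    if f <= female_max:
        return "female"
    if f >= male_min:
        return "male"
    return "ambiguous"


mafs = [0.3, 0.4, 0.5, 0.2, 0.25]
male_like = [0, 2, 0, 2, 0]    # one X copy: no heterozygous calls
female_like = [1, 0, 1, 1, 0]  # two X copies: heterozygous calls present

print(call_sex(x_inbreeding_f(male_like, mafs)))    # male
print(call_sex(x_inbreeding_f(female_like, mafs)))  # female
```

A call that disagrees with the recorded sex would flag the sample for follow-up.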

Answer 3

Recruitment

Answer 4

```
Prune SNPs for LD
Calculate IBS and IBD
IBS: Proportion of alleles shared, averaged over the genome
IBD: Proportion of the genome that is the same due to inheritance
IBD=1     : duplicate/twin
IBD=0.5   : 1st degree (sib; parent)
IBD=0.25  : 2nd degree (grandparent)
IBD=0.125 : 3rd degree (cousin)
```
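The IBS definition can be sketched as follows, assuming genotypes coded as 0/1/2 copies of one allele at LD-pruned SNPs. The function name is hypothetical. This computes IBS only; estimating IBD additionally needs population allele frequencies and is not attempted here.

```python
def ibs_proportion(g1, g2):
    """Mean proportion of alleles shared identical-by-state across SNPs."""
    if len(g1) != len(g2):
        raise ValueError("both samples must cover the same SNPs")
    # At one SNP, two samples share 2 - |g1 - g2| alleles out of 2.
    shared = [2 - abs(a - b) for a, b in zip(g1, g2)]
    return sum(shared) / (2 * len(shared))


sample_a = [0, 1, 2, 2]
sample_b = [1, 1, 0, 2]
print(ibs_proportion(sample_a, sample_a))  # 1.0
print(ibs_proportion(sample_a, sample_b))  # 0.625
```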

Answer 5

Recruitment

Answer 6

DNA quality | Recruitment

Answer 7

Check heterozygosity using Wright’s inbreeding coefficient (F) | Positive F indicates excess homozygotes | Negative F indicates excess heterozygotes

Answer 8

Informative missingness: if DNA quality is correlated with phenotype

Answer 9

Quality positively correlated with MAF | Genotype calling is more difficult for rare SNPs

Answer 10

Departure can indicate genotype calling problems. | Can lead to false positives if the problem is not balanced between cases and controls

Answer 11

Genotype check

Answer 12

Equal number of cases and controls plated together

Answer 13

Association QQ-plots at different missingness thresholds | Plot 1-missingness against % removed

Answer 14

MAF > 10/N | At MAF = 10/N, we would expect 20 heterozygotes
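The arithmetic behind this threshold can be checked with a short sketch, assuming N is the number of genotyped individuals and genotypes follow Hardy-Weinberg proportions. The sample size of 2000 is a made-up example.

```python
def expected_genotype_counts(n_samples, maf):
    """Expected genotype counts under Hardy-Weinberg proportions."""
    q = maf
    p = 1 - q
    return {
        "hom_major": n_samples * p * p,
        "het": n_samples * 2 * p * q,
        "hom_minor": n_samples * q * q,
    }


n = 2000            # hypothetical sample size
threshold = 10 / n  # MAF cut-off of 10/N
counts = expected_genotype_counts(n, threshold)
minor_allele_copies = 2 * n * threshold  # 20 copies in total

print(minor_allele_copies, counts["het"], counts["hom_minor"])
```

With a rare allele almost every copy sits in a heterozygote, so the expected heterozygote count is close to the 20 minor-allele copies.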

Answer 15

Recruitment

Answer 16

Calculate p-value (null = HWE). FROM CONTROLS
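One way to obtain such a p-value is a 1-degree-of-freedom chi-square goodness-of-fit test on the control genotype counts. The sketch below assumes that choice (an exact test is a common alternative), uses a hypothetical function name, and assumes both alleles are present in the data.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for departure from HWE, from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Control genotype counts only (made-up numbers).
print(hwe_chi2_pvalue(25, 50, 25))  # counts match HWE exactly
print(hwe_chi2_pvalue(50, 0, 50))   # no heterozygotes at all
```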

Answer 17

Use one or two sets of trios per plate | Check child’s genotype compatibility with parents for every SNP
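The compatibility check can be sketched as below, assuming autosomal SNPs with genotypes coded as 0/1/2 copies of one allele. The function names are hypothetical.

```python
def transmissible(genotype):
    """Allele counts (0 or 1) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def mendel_consistent(father, mother, child):
    """True if the child's genotype can arise from the two parents."""
    possible = {a + b for a in transmissible(father) for b in transmissible(mother)}
    return child in possible


def mendel_error_snps(trio_genotypes):
    """Indices of SNPs where (father, mother, child) is incompatible."""
    return [i for i, (f, m, c) in enumerate(trio_genotypes)
            if not mendel_consistent(f, m, c)]


# One (father, mother, child) tuple per SNP; made-up genotypes.
trio = [(0, 2, 1), (1, 1, 2), (0, 0, 1), (2, 2, 2)]
print(mendel_error_snps(trio))  # [2]
```

Many errors across SNPs for one trio would point to a sample mix-up; errors concentrated at one SNP would point to a calling problem at that SNP.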

Answer 18

A sub-population which also has a higher frequency of one allele at the locus in question

Answer 19

Their population origin, not their disease status (“confounding”)

Answer 20

Different proportions of cases and controls are from each population, and the populations differ in allele frequency at the locus in question

Answer 21

Cryptic population structure. This occurs when the genetic variant does have an effect on disease status, but cases are drawn preferentially from a sub-population which has a higher frequency of the ‘low-risk’ allele

Answer 22

Two “hidden” populations
Mutation at higher frequency in one population
Cases/controls sampled disproportionately
-> Higher frequency of mutation in cases compared to controls
-> False positive association “hit” for this mutation
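A small numeric illustration of this chain, using made-up frequencies and sampling fractions. The allele has no effect on disease here, yet its pooled frequency differs between cases and controls purely because of where the samples were drawn from.

```python
def pooled_allele_freq(pop_freqs, sampling_fractions):
    """Allele frequency in a sample drawn from several populations."""
    return sum(pop_freqs[pop] * frac for pop, frac in sampling_fractions.items())


# Hypothetical allele frequencies in the two hidden populations.
pop_freqs = {"A": 0.5, "B": 0.1}

# Cases come mostly from A, controls mostly from B (hypothetical fractions).
case_freq = pooled_allele_freq(pop_freqs, {"A": 0.8, "B": 0.2})
control_freq = pooled_allele_freq(pop_freqs, {"A": 0.2, "B": 0.8})

print(round(case_freq, 2), round(control_freq, 2))  # 0.42 0.18
```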

Answer 23

1. QA: Match cases and controls on ethnicity, geographic location, birthplace of grandparents, etc.
2. Correct for bias: Use an analysis method that accounts for population stratification. A key method is principal components analysis (PCA):
- Use a set of ancestry informative markers (AIMs) to determine a ‘weighted score’ for each subject
- The weighted score is a linear combination of the SNP genotypes (coded as 0, 1, 2 copies of the minor allele)
- The AIMs are a set of SNPs that are known to differ in frequency between populations
- Due to population differences in frequencies, the weighted score is correlated with the population
- Use the weighted score as a ‘covariate’ in the logistic regression analysis
3. Use a family-based association study design (e.g. TDT); not covered here
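The ‘weighted score’ can be sketched as a plain linear combination. The weights below are made up for illustration; in a real PCA they would come from the principal component loadings, and genotypes are usually centred and scaled first.

```python
def ancestry_score(genotypes, weights):
    """Linear combination of AIM genotypes (0/1/2 copies of the minor allele)."""
    if len(genotypes) != len(weights):
        raise ValueError("one weight is needed per AIM")
    return sum(w * g for w, g in zip(weights, genotypes))


aim_weights = [0.5, -0.3, 0.8]  # hypothetical per-SNP weights
subject_1 = [2, 0, 1]
subject_2 = [0, 2, 0]

score_1 = ancestry_score(subject_1, aim_weights)
score_2 = ancestry_score(subject_2, aim_weights)
print(round(score_1, 2), round(score_2, 2))  # 1.8 -0.6
```

Each subject's score would then enter the per-SNP regression as a covariate.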

Answer 24

1. Null hypothesis: “allele frequencies in cases and controls are the same”. Test this hypothesis separately on each SNP.
2. Genetic model: The additive model (= “allele dosing” / “log-additive” / “multiplicative”) is a good choice. Assume the effect is generally additive, as opposed to dominant or recessive. This is a simple approach but fairly robust (if a SNP acts non-additively then we are still likely to detect it as associated).
- Generally robust to non-additivity
- Tagging tends to conserve additive component only
3. Adjust for residual population stratification, typically by adding principal component (PC) axes as covariates in a logistic regression applied to each SNP.
4. Apply a very strict p-value threshold for significance, to account for the large number of tests.
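A per-SNP additive test without covariates can be sketched with the Cochran-Armitage trend test, used here as a simple stand-in for the logistic regression described above. It is computed as N times the squared correlation between genotype dose and case status. The function name is hypothetical, and the sketch assumes both genotype and status vary in the sample.

```python
import math


def additive_trend_test(genotypes, status):
    """Trend test for one SNP: returns (chi-square, p-value) on 1 df.

    genotypes: 0/1/2 copies of the minor allele per subject.
    status: 1 for case, 0 for control.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(status) / n
    cov = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, status))
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    var_y = sum((y - mean_y) ** 2 for y in status)
    chi2 = n * cov * cov / (var_g * var_y)  # N * r^2
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival, 1 df
    return chi2, p_value


status = [1, 1, 1, 0, 0, 0]
null_snp = [0, 1, 2, 0, 1, 2]  # same genotype mix in cases and controls
chi2_null, p_null = additive_trend_test(null_snp, status)
print(chi2_null, p_null)  # 0.0 1.0
```

In a real scan this would be repeated for every SNP, with PC covariates handled by a regression model instead.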

Answer 25

Under the null, all p-values are “independent” and uniformly distributed between 0 and 1.
With 1 test: probability of a p-value less than 0.01 is 0.01
With 2 tests: probability of at least one p-value less than 0.01 = (1-0.99^2) = 0.02
With 100 tests: probability of at least one p-value less than 0.01 = (1-0.99^100) = 0.63
Thus multiple testing requires a low p-value threshold; otherwise, even if every SNP is ‘null’, there will be thousands of hits.
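The numbers above can be reproduced with a short sketch, assuming independent tests. The one-million-SNP figure is an illustrative GWAS size.

```python
def prob_at_least_one_hit(n_tests, alpha=0.01):
    """Chance that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 2, 100):
    print(n_tests, round(prob_at_least_one_hit(n_tests), 4))

# Expected number of null SNPs passing p < 0.01 in a scan of one million SNPs.
expected_false_hits = 1_000_000 * 0.01
print(expected_false_hits)
```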

Answer 26

A straight line with some deviation due to random variability
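If this card describes a QQ plot of association p-values under the null (an assumption, since the question text is not shown), the plotted points can be sketched as below. The function name and the (i - 0.5)/n plotting positions are illustrative choices.

```python
import math


def qq_points(pvalues):
    """Pairs of (expected, observed) -log10 p-values for a QQ plot.

    Expected values use uniform quantiles (i - 0.5) / n; all p-values
    must be greater than 0.
    """
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


# P-values sitting exactly on the uniform quantiles fall on the line y = x.
null_like = [(i - 0.5) / 4 for i in range(1, 5)]
points = qq_points(null_like)
print(points)
```

Real null data would scatter around that line, and points rising above it in the upper tail would suggest association or uncorrected bias.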

Answer 27

Bonferroni
- Threshold = α / number-of-tests
- The probability of falsely rejecting at least one null is ≤ α
- Conservative
False discovery rate
- Sets the expected proportion of false positives among the hits to be ≤ α
- More powerful
- More coherent interpretation
The choice depends on how costly false discoveries are. If you do not wish to follow up any SNPs that are likely false, Bonferroni should be chosen. On the other hand, if you wish to maximise discoveries and can tolerate the occasional error, the false discovery rate may be better.
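The two approaches can be compared on a toy list of p-values. The sketch assumes the Benjamini-Hochberg step-up procedure as the false discovery rate method, which the card does not name; the p-values are made up.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject where p <= alpha / number of tests."""
    threshold = alpha / len(pvalues)
    return [p <= threshold for p in pvalues]


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Find the largest rank whose p-value is under its own threshold.
        if pvalues[idx] <= rank * alpha / m:
            max_rank = rank
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


pvals = [0.001, 0.012, 0.025, 0.039, 0.5]
print(bonferroni_reject(pvals))          # [True, False, False, False, False]
print(benjamini_hochberg_reject(pvals))  # [True, True, True, True, False]
```

On this toy input Bonferroni keeps one hit and the false discovery rate approach keeps four, which illustrates the power difference noted above.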

Answer 28

The genotypes of all common SNPs can be imputed from the GWAS genotypes

Answer 29

Estimation of genotypes at un-typed SNPs. This is possible due to the correlation between SNPs in the genome, also known as linkage disequilibrium. Because SNPs are correlated, and some SNPs are very highly correlated, the genotypes at other SNPs can be guessed once the genotypes at one or more SNPs are known.

Answer 30

1. Genotype data with missing data at untyped SNPs
2. Testing association at typed SNPs may not lead to a clear signal
3. Each sample is phased and the haplotypes are modelled as a mosaic of those in the haplotype reference panel
4. Reference set of haplotypes, for example HapMap
5. The reference haplotypes are used to impute alleles into the samples to create imputed genotypes
6. Testing association at imputed SNPs may boost the signal
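A deliberately crude toy version of step 5, assuming the sample is already phased, alleles are coded 0/1, and untyped SNPs are marked `None`. Real imputation tools model each haplotype as a mosaic of reference haplotypes; this sketch swaps in a simple nearest-match vote, and the function name and panel are made up.

```python
def impute_haplotype(sample, reference_panel):
    """Fill untyped alleles by majority vote among closest reference haplotypes."""
    typed = [i for i, allele in enumerate(sample) if allele is not None]

    def mismatches(ref):
        return sum(ref[i] != sample[i] for i in typed)

    best = min(mismatches(ref) for ref in reference_panel)
    closest = [ref for ref in reference_panel if mismatches(ref) == best]

    imputed = list(sample)
    for i, allele in enumerate(sample):
        if allele is None:
            ones = sum(ref[i] for ref in closest)
            # Majority allele among the closest haplotypes; ties go to 0.
            imputed[i] = 1 if 2 * ones > len(closest) else 0
    return imputed


panel = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
print(impute_haplotype([0, None, 1, None], panel))  # [0, 0, 1, 1]
print(impute_haplotype([1, None, 0, None], panel))  # [1, 1, 0, 0]
```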

Answer 31

This leads to better resolution of the association signal across a region of the genome.
It aids meta-analysis, where association data from several studies are combined, by ensuring the same SNPs are available in each study. E.g. a 6-study meta-analysis in which each study used a different genotyping chip: before imputation there was little overlap across the six studies in the SNPs typed, so the meta-analysis would have been conducted on only a few SNPs. After all six studies had been imputed, it was possible to conduct the meta-analysis on a much larger set of SNPs.

Answer 32

When imputation is used in meta-analysis of multiple GWASs | Imputation quality

Answer 33

Early stages

Answer 34

In Lupus, for example, before 2008 only half a dozen loci were known to have strong evidence of association with disease. By 2015, after several GWAS and a large meta-analysis between Chinese and European GWAS, there were 61 loci. The number associated is now generally accepted to be more than 80, as additional studies have replicated these signals and found more.

Answer 35

Linkage studies

Answer 36

Common variation that tends to have lower effect sizes.

Answer 37

When there is a large effect size for a generally rare polymorphism in the genome

Answer 38

Scientists were not successfully picking candidates | We were over-optimistic about the effect sizes expected within candidate genes | Underpowered studies, but lots of them: a “perfect storm” for generating a high rate of false positives

Answer 39

One area that was “hypothesised” to be involved in the disease

Answer 40

The relative ratio of these two events in the “research pool”. With many low-powered studies, combining the false associations with the small number of identified real associations, the majority of published associations are false!

Answer 41

1. Missing heritability
2. Not very good for prediction: associated variants have too small an effect size to be useful for predicting disease
3. Determining causality can be hard, as most associated SNPs lie outside of gene coding areas

Answer 42

Missing heritability is due to:
- Imperfect tagging of common variants (perhaps of non-SNP variation like CNVs)
- Rare variants of moderate effect
- Many common variants of tiny effect
- Current heritability estimates are wrong?
- Current methodology for estimating the heritability explained by GWAS is biased (it underestimates the amount explained)

Answer 43

We are now aware, however, that most GWAS-associated SNPs affect the expression of genes rather than the protein-coding sequence itself. Thus this result is less of a disappointment and more enlightening. Areas of the genome that increase or decrease gene expression do lie outside of protein-coding areas, and this seems to be where altered risk for many diseases (including Lupus) resides.

Answer 44

To provide reassurance

Answer 45

“Does the allele frequency of SNP j differ between cases and controls?” [for j = 1, …, 1,000,000]

(75 cards)