Disease Gene Discovery (Complex Disorders) Flashcards by Phil T

What are Genetic association studies?

Genetic association studies are used to find candidate genes or genomic regions that contribute to a specific disease.

They work by testing for a correlation between disease status and genetic variation.

This is generally achieved by comparing cohorts of affected and unaffected individuals.

What is the key difference between linkage and association studies?

Linkage analysis: based on a ‘within-family’ design that analyses sibling pairs or large pedigrees.
Association studies: in general, based on a ‘case-control’ design that compares allele frequencies between groups of unrelated cases and unrelated controls.

What types of disease genes are identified in linkage vs association studies?

Linkage: individual genes of major effect.
Association: genes with a weaker effect, so multiple genes of lesser effect can be detected.

What is a ‘complex’ disorder?

Conditions caused by many contributing factors are called complex or multifactorial disorders.
Many common medical problems such as heart disease, diabetes, and obesity do not have a single genetic cause—they are likely associated with the effects of multiple genes of low impact in combination with lifestyle and environmental factors.
Although complex disorders often cluster in families, they do not have a clear-cut pattern of inheritance.

What is the key concept that underpins association studies?

Linkage disequilibrium (LD)

LD is the non-random association of alleles at two or more loci: particular combinations of alleles occur together more (or less) often than expected by chance.

Explain LD using a mathematical example

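Consider two loci:

Locus A has alleles a1 and a2 with frequencies of 0.7 and 0.3
Locus B has alleles b1 and b2 with frequencies of 0.6 and 0.4

If the two loci are independent, the expected frequency of each of the four possible haplotypes is the product of its allele frequencies:

a1b1: 0.7 x 0.6 = 0.42
a1b2: 0.7 x 0.4 = 0.28
a2b1: 0.3 x 0.6 = 0.18
a2b2: 0.3 x 0.4 = 0.12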

If a2b2 was instead found in a population at a frequency well above 0.12, for example 0.25 (it cannot exceed 0.3, the frequency of a2), this is called linkage disequilibrium between a2 and b2.
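
The expected haplotype frequencies in this example can be reproduced with a few lines of Python. This is only a sketch: the dictionary and function names are made up for illustration, and it assumes the two loci are independent (no LD).

```python
from itertools import product

# Allele frequencies taken from the worked example above
freq_a = {"a1": 0.7, "a2": 0.3}
freq_b = {"b1": 0.6, "b2": 0.4}


def expected_haplotypes(locus_a, locus_b):
    """Expected haplotype frequencies if the two loci are independent.

    Each haplotype frequency is the product of its two allele frequencies.
    """
    return {
        allele_a + allele_b: p_a * p_b
        for (allele_a, p_a), (allele_b, p_b) in product(locus_a.items(), locus_b.items())
    }


expected = expected_haplotypes(freq_a, freq_b)
for haplotype, frequency in expected.items():
    print(f"{haplotype}: {frequency:.2f}")
```

An observed haplotype frequency that differs clearly from the corresponding expected value is the signal of LD.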

What factors might cause LD?

Linkage disequilibrium may result from selective forces (natural, reproductive, etc.) or arise by chance.
When a new variant arises on a founder chromosome and not much time has elapsed since the mutational event, the new variant will be in linkage disequilibrium with alleles at nearby loci.
If the new variant is disease causing, then linkage disequilibrium can be a powerful tool for genetic mapping.

How can Recombination affect LD?

Recombination – Over time, recombination between loci gradually reduces LD as alleles that were shared on an ancestral chromosome are separated. It can therefore be harder to find LD in older populations. Areas of the genome with a lower recombination rate can maintain LD for longer.

How can Gene conversion affect LD?

Gene conversion – Regions of the genome with a low recombination rate, or markers that are tightly linked, can still lose LD via gene conversion. Markers on either side of a gene conversion event may still show LD.

How can Selection affect LD?

Selection – If there is a selective advantage to two alleles coexisting, LD between the alleles is more likely to be maintained. A negative selection pressure can remove LD. Selection enables loci on different chromosomes to be in LD if loss of one gives a selective disadvantage when the other is still present. Selective sweeps can also lead to particular allele combinations being more frequent than expected.

How can population structure affect LD?

Population structure – Population subdivision can create LD and also maintain it, owing to the smaller effective population size. Inbreeding and non-random mating are also likely to alter the expected allele distributions.

How can New mutations affect LD?

New mutation – A high new mutation rate at a locus will make it hard to detect LD. Mutations will arise on different ancestral backgrounds, so the phenotypic effect may be the same but the underlying haplotypes will not.

How can genetic drift, gene flow and population history affect LD?

Genetic drift – Random genetic drift can both create and remove LD.

Gene flow – The greater the allele frequency differences between populations, the greater the LD created when the populations mix.

Population history – The older the population, the shorter the segments of LD.

What is the difference between ‘genetic linkage’ and ‘linkage disequilibrium’?

Loci in LD will often be genetically linked (on the same chromosome).
But LD can occur even if loci are on different chromosomes (because for some reason the alleles on different chromosomes have become non-randomly associated).
It is also possible for loci to be linked but not in LD (because recombination between the loci has been unrestricted, so the distribution of alleles in the population is as expected).
Linkage and LD are separate phenomena.

What metric is used to represent LD?

LD is often expressed as D.
If D = 0 there is no association between alleles: haplotypes occur at the frequencies expected from the allele frequencies alone.
If D does not equal zero there is an association between alleles.
D is usually normalised (for example as D') so that alleles in complete LD have a value of 1.

How is D calculated?

D = P_AB - (P_A x P_B)
D is the difference between:
P_AB, the frequency of gametes carrying the pair of alleles A and B at two loci, and
P_A x P_B, the product of the individual allele frequencies.
In this definition AB is a haplotype and P_AB is the haplotype frequency (estimated by phasing genotypes in a population, for example via trio analysis).
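
As a minimal sketch of this calculation, the Python below computes D from a haplotype frequency and the two allele frequencies, together with two commonly used normalised measures, D' and r². The function name `ld_stats` and the example frequencies are illustrative assumptions rather than values from a real data set.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for alleles A and B at two biallelic loci.

    p_ab is the frequency of the AB haplotype; p_a and p_b are the
    frequencies of allele A and allele B. Both must lie strictly
    between 0 and 1, otherwise the normalised measures are undefined.
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("allele frequencies must be strictly between 0 and 1")

    d = p_ab - p_a * p_b

    # D' divides D by the largest magnitude it could take given the
    # allele frequencies, so complete LD gives a value of 1.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max

    # r^2 is the squared correlation between the two alleles.
    r_squared = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared


# Hypothetical example: alleles at frequencies 0.3 and 0.4, observed
# together on the same haplotype at a frequency of 0.25.
d, d_prime, r_squared = ld_stats(p_ab=0.25, p_a=0.3, p_b=0.4)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r_squared:.3f}")
```

With no LD the haplotype frequency equals the product of the allele frequencies, so all three values are zero.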

What are the limitations of using LD to identify disease alleles?

Detecting LD relies on marker allele frequencies deviating from those expected if the marker were independent of the disease allele.
When an allele frequency approaches 1 or 0 it is statistically very difficult to find LD with a marker.
Hence, _association studies will often not be able to detect rare disease-causing variants_, even if the effect on the phenotype is large.

What type of association study can be used to assess complex disorders?

When several genes make small contributions to disease aetiology, linkage within families is no longer useful.
Genome-wide association studies (GWAS) utilise technology that allows the assessment of markers across the entire genome, e.g. SNP arrays.
This method is a “hypothesis free” approach that enables the identification of all locations in the genome that are associated with disease (provided the study has sufficient power).

Why are GWAS referred to as hypothesis free?

Prior to the study you don’t have to know anything about:

Genetics: mode of inheritance (MOI), penetrance, etc.
Pedigree information
Approximate genomic location, as the markers are genome-wide.

What are the main stages of performing a GWAS?

Two groups of participants are recruited to the study: people with the disease (cases) and similar people without it (controls).
Each participant is genotyped on a genome-wide SNP array.
If an allele is significantly more frequent in people with the disease, the SNP is said to be “associated” with the disease.
The associated SNPs are then considered to mark a region of the human genome which influences the risk of disease.

Why are GWAS referred to as phenotype-first studies?

The participants are classified first by their clinical manifestation(s), as opposed to a genotype-first approach.

Why are GWAS referred to as non-candidate-driven studies?

GWA studies investigate the entire genome, as opposed to methods which specifically test one or a few genetic regions.

When a SNP marker is found to be significantly associated with disease in a GWAS, what four explanations exist for the apparent association?

The genetic variant measured in the study is indeed important in disease causation.
An association has been found by chance and there is no link at the level of disease causation.
Confounding bias due to population stratification, caused by cases and controls being selected from genetically different subsets of a population.
The genetic variant measured in the study is not the true disease-causing variant but is instead in LD with the disease allele.

What was the role of the HapMap project in enabling GWAS?

The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome (2005, 2007, 2009).
Its goal was to identify millions of new SNP loci across the genome by ‘resequencing’ dozens of trios from populations across the globe.
This gave array manufacturers the knowledge to build affordable platforms for genotyping these SNPs in GWAS.
It also provided high-resolution information on the common haplotypes in humans.

What is an LD map and where did they come from?

* Data from the HapMap project enabled the production of LD maps.
* An LD map is a diagram displaying the haplotype diversity of a chromosomal segment.
* It can be used to visualise the D' metric between any two SNPs on a given stretch of chromosome.
* Contiguous runs of SNPs with high LD are often referred to as LD blocks, where there is evidence of limited haplotype diversity in the population.

What is a tag SNP and what are they used for?

* A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium; it represents a group of SNPs called a haplotype.
* i.e. a SNP which has very high LD with many other SNPs is known as a tag SNP.
* Tag SNPs can be used as a proxy for the SNPs they are in high LD with.
* If the other SNPs aren't genotyped, then as long as the genotype of the tag SNP is known, one can predict (impute) the genotypes of the other SNPs with high confidence.
* Thus the HapMap project enabled researchers to impute many more genotypes from their data to take forward into GWAS.

How is the P-value in a GWAS calculated?

* P is determined statistically from the number of samples tested and the size of the difference between the case and control data sets. It represents the probability of observing a difference at least this large by chance if there were no true association.
* The fundamental unit for reporting effect sizes is the odds ratio.
* When the allele frequency is higher in the case group than in the control group, the odds ratio is greater than 1, and vice versa.
* A P-value for the significance of the odds ratio is typically calculated using a _simple chi-squared test_.
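
The odds ratio and chi-squared test can be sketched in plain Python for a single SNP. The function name `allelic_test` and the allele counts are hypothetical. The sketch assumes a 2x2 table of allele counts (risk or other allele, in cases or controls) and a Pearson chi-squared test with one degree of freedom and no continuity correction. Real GWAS pipelines usually use dedicated software and often regression-based tests instead.

```python
import math


def allelic_test(case_risk, case_other, ctrl_risk, ctrl_other):
    """Odds ratio, chi-squared statistic and P-value for a 2x2 allele-count table."""
    # Odds of carrying the risk allele in cases divided by the odds in controls
    odds_ratio = (case_risk * ctrl_other) / (case_other * ctrl_risk)

    observed = [[case_risk, case_other], [ctrl_risk, ctrl_other]]
    row_totals = [sum(row) for row in observed]
    col_totals = [case_risk + ctrl_risk, case_other + ctrl_other]
    total = sum(row_totals)

    # Pearson chi-squared: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Hypothetical counts: 1000 cases and 1000 controls, two alleles per person
odds_ratio, chi2, p_value = allelic_test(600, 1400, 500, 1500)
print(f"OR = {odds_ratio:.2f}, chi2 = {chi2:.2f}, P = {p_value:.2e}")
```

In a GWAS this calculation is repeated for every genotyped SNP, which is why the significance threshold has to be adjusted.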

How is the P-value interpreted and adjusted?

* In a typical significance test, P = 0.05 means there is a 5% chance of observing a result at least this extreme by chance alone when there is no real effect.
* Results from a GWAS are only thought to be significant if P is very low, because of the large number of SNPs tested at once.
* For example, if 1 million SNPs are genotyped and the cut-off for significance is P = 0.05, you might expect 50,000 false positive results.
* For loci to be significantly associated with disease, P usually has to be less than _5 x 10^-8_.
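
The arithmetic behind these numbers can be checked with a short sketch. It assumes a Bonferroni-style correction, where the overall 0.05 threshold is divided by roughly one million independent tests. This is the usual rationale given for the 5 x 10^-8 cut-off, but it is stated here as an assumption. The function names are made up for illustration.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if every SNP tested is truly null."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test P-value cut-off that keeps the overall error rate near family_alpha."""
    return family_alpha / n_tests


n_snps = 1_000_000
print(expected_false_positives(n_snps, 0.05))  # about 50,000 at P = 0.05
print(bonferroni_threshold(n_snps))            # about 5e-8
```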

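What is a Manhattan plot and how are they used in GWAS?

A Manhattan plot displays the P-values of all SNPs tested across the genome. Each dot represents a SNP, with the X-axis showing genomic location and the Y-axis showing the strength of association (usually plotted as -log10 of the P-value, so the most strongly associated SNPs form the tallest peaks).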
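
As a rough sketch of how the plotted values are derived, the snippet below converts P-values to the -log10 scale conventionally used on the Y-axis and flags SNPs that pass a genome-wide threshold of 5 x 10^-8. The function name, the tuple layout and the example SNPs are all hypothetical, and the drawing step itself is left out.

```python
import math

GENOME_WIDE_P = 5e-8  # assumed genome-wide significance threshold


def manhattan_points(results):
    """Convert (chromosome, position, p_value) tuples into plot-ready points.

    Returns (chromosome, position, -log10(p_value), is_significant) tuples.
    """
    return [
        (chrom, pos, -math.log10(p), p < GENOME_WIDE_P)
        for chrom, pos, p in results
    ]


# Hypothetical results for three SNPs
example = [("1", 1_000_000, 0.01), ("6", 32_000_000, 1e-10), ("12", 5_500_000, 0.4)]
for point in manhattan_points(example):
    print(point)
```

On this scale the threshold appears as a horizontal line at about 7.3, and associated regions stand out as columns of dots rising above it.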

What is meant by the 'power' of a GWAS?

* The power of a study is its ability to detect true associations.
* Power is often quoted as the percentage of associations that should be detected when risk alleles are above a certain minor allele frequency (MAF) and odds ratio (OR).

What factors can affect the power of a GWAS?

The power is affected by:
* the frequency of the risk allele in the population,
* the relative risk conferred by the disease-associated allele,
* LD between the genotyped marker and the true risk allele,
* sample size,
* genetic heterogeneity of the sample population.
* Most of these variables are not under the control of experimental design.

How can the power of a GWAS be improved?

Most of these variables are not under the control of experimental design, but two are:
1. _Increasing the sample size._
2. _Using carefully matched cases and controls._
Both will improve the power of a study.
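
To illustrate how sample size, allele frequency and effect size interact, here is an approximate power calculation. It is a sketch under stated assumptions, not a replacement for a dedicated power calculator. It uses a normal approximation to a two-sided comparison of allele frequencies between cases and controls. It assumes the causal allele itself is genotyped and that the control allele frequency equals the population frequency. The function name `allelic_power` is hypothetical.

```python
import math
from statistics import NormalDist


def allelic_power(maf_controls, odds_ratio, n_cases, n_controls, alpha=5e-8):
    """Approximate power to detect an allelic association at significance level alpha."""
    # Allele frequency in cases implied by the odds ratio
    odds_controls = maf_controls / (1 - maf_controls)
    odds_cases = odds_ratio * odds_controls
    maf_cases = odds_cases / (1 + odds_cases)

    # Standard error of the frequency difference; each person contributes two alleles
    standard_error = math.sqrt(
        maf_cases * (1 - maf_cases) / (2 * n_cases)
        + maf_controls * (1 - maf_controls) / (2 * n_controls)
    )

    normal = NormalDist()
    z_critical = normal.inv_cdf(1 - alpha / 2)
    z_effect = abs(maf_cases - maf_controls) / standard_error
    # Ignores the negligible chance of a significant result in the wrong direction
    return normal.cdf(z_effect - z_critical)


# Hypothetical scenarios: same odds ratio, different sample sizes and allele frequencies
print(allelic_power(0.30, 1.2, 2_000, 2_000))
print(allelic_power(0.30, 1.2, 20_000, 20_000))
print(allelic_power(0.01, 1.2, 20_000, 20_000))
```

Under these assumptions, power rises sharply with sample size and falls when the risk allele is rare. This is consistent with GWAS being better suited to common variants.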

What is the common-disease common-variant hypothesis and why is this important?

* The common disease-common variant (often abbreviated CD-CV) hypothesis predicts that common disease-causing alleles, or variants, will be found in all human populations which manifest a given disease.
* The fundamental role of GWAS is to detect common disease-causing variation.

What types of disease alleles are missed with GWAS?

* Most of the associations found by GWAS are with common variants that confer only a small increase in disease risk and have only a small predictive value.
* In general, common variants do not explain much of the heritable variation in diseases.
* GWAS will not detect rare variants, whether the associated risk is large or small.

What are the technical criticisms of GWAS?

* A major technical criticism of GWAS is that the massive number of statistical tests performed presents an unprecedented _potential for false-positive results_.
* Lack of well-defined case and control groups
* Insufficient sample size
* No control for multiple testing
* No control for population stratification
* All of these are common problems leading to false-positive results in GWAS.

What are the more fundamental criticisms of GWAS and how might these issues be solved in the future?

* GWA studies have attracted fundamental criticism because of their assumption that _common_ genetic variation plays a large role in explaining the heritable variation of common disease.
* Critics argue that, although it could not have been known prospectively, GWA studies were ultimately not worth the expenditure since they only identify common low-risk alleles.
* Alternative strategies that use whole-genome sequencing (WGS) to detect rare variants may be more effective.