complex diseases Flashcards
monogenic diseases
those where there is a direct relationship between the disease gene and the disease status
Genotype and phenotype closely correlate (high penetrance) Variants CAUSE the disease (1 disease, 1 gene)
The traits presented so far are qualitative
= white eyed or red eyed flies
= cystic fibrosis or no cystic fibrosis
Quantitative traits
Traits with variation showing a
continuous range of phenotypes
e.g. human height, weight, colour, metabolic rate, behaviour
polygenic
Varying phenotypes result from input of many genes
Multifactorial or complex traits
result of a combination of several genes and environmental factors
Complex (polygenic) diseases often show genetic predisposition, but individual genes only marginally affect disease status
Genotype and phenotype poorly correlate (low penetrance)
Variants PREDISPOSE to the disease (1 disease, many genes)
example of multifactorial inheritance
skin colour: additive effect, complex trait (many genes + environment)
single gene vs multifactorial
single gene
- risk remains the same regardless of the number affected
- if a parent carries the disease allele (e.g. autosomal dominant) the risk to each child is 1/2
- if 1 child has the disease, the risk for another child is still 1/2
multifactorial
- recurrence risk increases with each affected child, because this shows the couple are high risk
- if 1 child is affected, the recurrence risk is 1 in 25
- if 2 children are affected, the recurrence risk is now 1 in 12
Multifactorial disorders display familial clustering with no recognised pattern of Mendelian inheritance
- Most common cause of congenital malformations
- Cause of many common acquired diseases
- More prevalent than single gene disorders
- Harder to find the genetic factors / causes
not all polygenic traits show continuous variation
in a large sample the data will follow a normal distribution
instead of an interval variable on the x-axis (groups such as age), we use the number of predisposing alleles in the genotype
above a certain point (the threshold) the frequency of disease is higher, so the trait departs from the normal distribution
3 types of polygenic traits
continuous traits
meristic traits
- phenotype can be recorded by counting integers
threshold traits
- polygenic and often multifactorial
- small number of discrete phenotypic classes
- increasing number of diseases show this pattern
most common multifactorial diseases with a threshold
cleft lip, neural tube defect, congenital heart defect, asthma, diabetes, autism
multi-gene hypothesis
- A quantitative trait has continuous variation that can be quantified (measured)
- Two or more loci scattered in the genome account for the hereditary influence on the trait in an additive way
- Each gene locus is occupied by either an additive allele or a non- additive allele
- The contribution of each additive allele is approximately equal
- Together, the additive alleles contributing to a single quantitative character produce substantial phenotypic variation
calculating number of polygenes
Number of polygenes (n) contributing to quantitative trait is estimated based on ratio of F2 individuals resembling either of two extreme P phenotypes
- 1/4^n = ratio of F2 individuals expressing either extreme phenotype
- For low number of polygenes: (2n + 1) = number of distinct phenotypic categories observed
i.e. 1 gene = 3 classes (1/4, 1/2, 1/4)
2 genes = 5 classes (1/16, 4/16, 6/16, 4/16, 1/16)
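The class ratios above follow from binomial coefficients. A minimal sketch, assuming two alleles per locus, equal additive effects, unlinked loci and an F1 that is heterozygous at every locus (the function name is made up for illustration):

```python
from fractions import Fraction
from math import comb

def f2_class_proportions(n_genes):
    """Expected F2 proportion for each count of additive alleles (0 .. 2n)."""
    total = 4 ** n_genes  # number of equally likely F2 gamete combinations
    return [Fraction(comb(2 * n_genes, k), total) for k in range(2 * n_genes + 1)]

one_gene = f2_class_proportions(1)
two_genes = f2_class_proportions(2)

# The first and last entries are the two extreme (parental) phenotypes: 1/4^n each.
print(len(two_genes), [str(p) for p in two_genes])
```

The list length is 2n + 1 (the number of classes), and fractions are printed in lowest terms, so 4/16 shows as 1/4 and 6/16 as 3/8.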
Heritability (H2)
the proportion of the total phenotypic variance (VP) within a certain population that is due to genetic variance (VG)
H2 = VG/VP
Different in different environments
A mean heritability estimate of 0.65 for human height does not mean that your height is 65% due to your genes, but rather that in the population sampled, on average, 65% of the overall variation in height could be explained by genotypic differences among individuals in that population.
Familial
a trait shared by a family; members may not share the same genotype, e.g. an adopted child speaks the same language as the rest of the family. This is not heritable, because it is not genetic.
Heritable
a trait shared by people with the same genotype
If an environmental change affects all individuals in a population equally
the mean changes but the variance (and so the heritability) stays the same
if the variance changes, the heritability changes
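A small numerical check of this card, as a sketch only: the heights and the 5 cm shift are hypothetical values chosen to make the arithmetic easy, not data from any study.

```python
from statistics import mean, pvariance

# Hypothetical phenotypes (cm) for a small population.
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]

# An environmental change that affects every individual equally (+5 cm).
shifted_cm = [h + 5.0 for h in heights_cm]

mean_before, mean_after = mean(heights_cm), mean(shifted_cm)
var_before, var_after = pvariance(heights_cm), pvariance(shifted_cm)

print(mean_before, mean_after)  # the mean moves by the size of the shift
print(var_before, var_after)    # the spread around the mean is unchanged
```

Because every value moves by the same amount, the differences from the mean are unchanged, so VP, and therefore VG/VP, is unaffected.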
Gene-environment (G x E) interactions
interaction between genes and environment can play an important role in quantitative traits
broad-sense heritability H2
Measures the proportion of the variance in a population within a single
generation that is due to genetic factors
Gives an estimate of 0 to 1
Low heritability = variation is due mainly to environmental effects
High heritability = variation is due mainly to genotypic effects
Ignores genotype-by-environment interactions
Includes genetic values due to dominance and epistasis
additive gene action vs dominant gene action
for additive, the two homozygotes are the extremes and the heterozygote is intermediate
for dominant, the two homozygotes are the extremes and the heterozygote has the same phenotype as the dominant homozygote
Narrow-sense heritability h2
only takes into account the additive genetic variance; with fully additive gene action, all plants or animals with the desired trait are homozygous
with dominant gene action the heterozygote also shows the desired trait, so selective breeding takes longer
h2 = VA/VP, where VA = additive genetic variance and VP = total phenotypic variance
How to quantify and interpret heritability
A common way to assess if a trait is heritable is to look for a correlation between the parents and the offspring.
Narrow-sense heritability (h2) = a measure of how heritable a trait is, using family data
This measurement is used in animal and plant breeding to determine if a population can be changed by selective breeding.
Estimate narrow heritability by comparing the offspring value against the averaged value for the two parents (midparent value).
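The midparent comparison is usually done as a regression: the slope of offspring value on midparent value is taken as the estimate of h2. A minimal sketch with hypothetical family measurements (the numbers and the helper name are illustrative assumptions):

```python
def regression_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    return covariance / variance_x

# Hypothetical (father, mother, offspring) trait values, e.g. height in cm.
families = [(162.0, 158.0, 164.0), (172.0, 168.0, 170.0), (182.0, 178.0, 176.0)]

midparents = [(father + mother) / 2 for father, mother, _ in families]
offspring = [child for _, _, child in families]

h2_estimate = regression_slope(midparents, offspring)
print(round(h2_estimate, 2))
```

A slope near 1 would mean offspring closely track their parents; a slope near 0 would mean parental values say little about offspring values.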
How do we determine if a family has a higher risk of disease?
- Family members share a greater number of identical genetic variants than unrelated individuals
- The degree of family clustering of a disease can be expressed by the relative risk ratio (λR)
- Compares the risk in relatives (R) of an affected proband with the risk in the general population
relative risk ratio = disease prevalence in relatives R of probands / disease prevalence in population
Relative risk ratio interpretation
Higher λR values indicate greater proportion of risk in family compared to the population
Usually it increases with
• Increasing genetic contribution
• Decreasing population prevalence
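A sketch of the relative risk ratio calculation, showing why λR rises as population prevalence falls. The prevalences are hypothetical and the function name is an assumption for illustration.

```python
from fractions import Fraction

def relative_risk_ratio(prevalence_in_relatives, prevalence_in_population):
    """lambda_R = prevalence in relatives of probands / prevalence in population."""
    if prevalence_in_population <= 0:
        raise ValueError("population prevalence must be positive")
    return prevalence_in_relatives / prevalence_in_population

# Hypothetical prevalences, chosen only to illustrate the arithmetic.
sibling_prevalence = Fraction(4, 100)
common_disease = relative_risk_ratio(sibling_prevalence, Fraction(2, 100))
rarer_disease = relative_risk_ratio(sibling_prevalence, Fraction(1, 1000))

print(float(common_disease), float(rarer_disease))
```

With the same prevalence in siblings, the rarer disease gives the larger ratio, which is why λR has to be read alongside the population prevalence.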
Familial clustering: the role of environment
Familial clustering confounded by shared environment
If familial aggregation is detected, it does not necessarily mean that genetics is the only explanation
Twin studies
DZ = dizygotic (fraternal, non-identical) twins, genetically the same as siblings
MZ = monozygotic (identical) twins
if a trait is entirely genetic, it should always be the same in MZ twins
twin studies - concordance and discordance
Concordant twins
Both affected (+ / +) or unaffected ( - / - )
Discordant twins
1 affected, 1 unaffected (+ / -)
concordance ratio (r) = concordance in MZ / concordance in DZ
r > 1: genetics play a role
High concordance does not prove that a trait has a genetic component
Limitations of twin studies: DZ twins can be of different sex, MZ twins may share more environmental factors, and MZ twins can still differ through epigenetic changes over a lifetime, X-chromosome inactivation, post-zygotic somatic mutations, etc.
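The concordance ratio can be sketched as follows. The pair counts are hypothetical, and the sketch assumes pairwise concordance is measured among pairs in which at least one twin is affected (one common convention, not stated in these notes).

```python
from fractions import Fraction

def concordance_rate(concordant_pairs, discordant_pairs):
    """Share of twin pairs (with at least one affected twin) in which both are affected."""
    return Fraction(concordant_pairs, concordant_pairs + discordant_pairs)

# Hypothetical counts of twin pairs.
mz_concordance = concordance_rate(concordant_pairs=30, discordant_pairs=20)
dz_concordance = concordance_rate(concordant_pairs=10, discordant_pairs=40)

concordance_ratio = mz_concordance / dz_concordance
print(float(mz_concordance), float(dz_concordance), float(concordance_ratio))
```

A ratio above 1 points towards a genetic contribution, but shared environment can also raise MZ concordance, so the limitations listed above still apply.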
Adoption studies
Two approaches:
• Find adopted people who suffer from a particular disease known to run in families and ask whether it runs in their biological or adoptive family
• Find affected parents whose children have been adopted away from the family and ask whether being adopted saved the children from the family disease
Main obstacles: lack of information about the biological family, when adoption happened, intrauterine factors, and selective placement
linkage
property of loci
to identify biological mechanism for transmission of a trait
requires family pedigree
use polymorphic markers
association
Association is a property of alleles
To identify an association between an allele and a phenotype
Fine mapping (<1cM)
Case-control or family approach
Usually bi-allelic SNPs
linkage analysis in complex disease
affected sibling pair
When affected siblings share a chromosome region more or less often than expected by chance, then that region is likely involved in causing the disease
limitations of linkage
for a risk ratio of 4 (high) you would still need many affected sibling pairs to do a linkage analysis
for anything less than 4, the number of families needed increases drastically
successful linkage study - Alzheimer's
1991: Linkage analysis identified the proximal long arm of chromosome 19
• Apolipoprotein E (APOE)
• ε2 decreases risk
• ε4 increases risk
• 15-25% of the population carry 1 copy of ε4, 2-4% carry 2 copies
• ε4 drives earlier and more abundant amyloid pathology in the brains of carriers
Most SNPs in a population are
rare
Most SNPs in an individual are
common
Why do most SNPs have a neutral effect on phenotype?
- Functionally important DNA sequences are the minority of our genome.
- Genetic redundancy: nucleotide substitutions that don’t change amino acid, or gene duplication.
- Functionally unimportant amino acid or nucleotide positions within proteins or within functionally important noncoding sequences.
Linkage disequilibrium
Chromosomal segments can exist as a block that is only rarely broken up by recombination.
- because they're so close together they rarely recombine
• Linkage disequilibrium (LD): the nonrandom association of alleles of different loci.
some combinations of alleles are favoured
calculating LD
D = observed frequency of a haplotype (AB, Ab, aB or ab) - the product of the frequencies of its individual alleles
if no LD: D = 0, i.e. frequency of haplotype = frequencies of the individual alleles multiplied together
if D' = 1: complete linkage (no recombination)
D' > 0.33 is the threshold used to determine LD
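A minimal sketch of the LD calculation for two bi-allelic loci. The scaling of D by its maximum possible value to get D', and the r2 measure, are the standard definitions rather than something spelled out in these notes; the frequencies are hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, |D'|, r2) from the AB haplotype frequency and the A and B allele frequencies."""
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # observed minus expected haplotype frequency
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Hypothetical frequencies: A and B always travel together.
complete_ld = ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5)
# Hypothetical frequencies: haplotype frequency equals the product of allele frequencies.
no_ld = ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5)

print(complete_ld, no_ld)
```

In the first case D' is 1 (complete LD); in the second D is 0 because the haplotype occurs exactly as often as chance predicts.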
Haplotype
sets of nearby SNPs on the same chromosome that are inherited as a block.
Haplotype blocks represent ancestral chromosome segments that have been transmitted intact through many generations
- darker the blocks, the stronger the LD
the longer ago the SNPs arose and the more generations they have been transmitted together, the more consistent the haplotype blocks are going to be
Haplotypes are population-specific
populations share similar ancestry but accumulated different mutations early on, giving different haplotypes; the frequency of each haplotype depends on the population
recombination hotspots
concentrated in 1-2kb hotspots
we have ~30,000 hotspots, one every 50-100kb
recombination hotspots lie between haplotype blocks, where LD is low
hotspots are marked by an epigenetic histone methylation mark
tag-SNPs
reduce the number of SNPs required to examine the entire genome for association with a phenotype
if SNPs are in LD, a tag SNP can represent all the SNPs in that block
by genotyping a few tag SNPs we can infer the genotype of the other SNPs around them
determining if genotypes are phased cis and trans
Phasing: the process of inferring haplotypes from genotype data, assigning alleles to maternal or paternal chromosomes
if two alleles are on the same chromosome = cis; if on different homologous chromosomes = trans; unphased = the arrangement is not known
Tag-SNPs: imputing
Using knowledge of linkage disequilibrium to fill in genotypes at loci that were not part of the original experiment.
Tag-SNP imputation in practice
let's say you have 6 SNPs
- let's assume 1 and 2 are linked (i.e. D' = 1)
- 3 and 5 linked
- 6 and 4 linked
we can just use 1, 3, 6 for single SNP tests
- if, let's say, A from 1 and G from 3 always go together, we can infer 6
Association analyses in complex diseases
Looks for co-occurrence (association) of alleles and phenotypes
we use candidate gene studies (individual genes, require biological insight) and GWAS
Candidate gene and association analysis in complex diseases
Looks for co-occurrence (association) of alleles and phenotypes, comparing cases and controls
e.g. we have two alleles T and C
in cases 62% have allele C and 38% have allele T
in controls 49% have C and 51% have T
calculate the association using the odds ratio (a x d)/(b x c)
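The odds ratio for the example above can be worked through as a sketch. The assignment of a, b, c, d to the cells of the 2 x 2 table is an assumption (a = cases with C, b = cases with T, c = controls with C, d = controls with T), since the notes do not define them.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a x d) / (b x c) for a 2 x 2 allele-by-status table."""
    return (a * d) / (b * c)

# Allele frequencies from the example, with C treated as the allele of interest.
cases_c, cases_t = 0.62, 0.38
controls_c, controls_t = 0.49, 0.51

or_allele_c = odds_ratio(cases_c, cases_t, controls_c, controls_t)
print(round(or_allele_c, 2))
```

An odds ratio above 1 means allele C is more common in cases than in controls; here it comes out at roughly 1.7. Whether that is statistically significant depends on the sample size, which the example does not give.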
Case-study: Identification of NARC1/PCSK9
candidate gene study
rare mutations in this gene correlate strongly with a high risk of cardiovascular disease.
used linkage analysis followed by animal studies
the mutant protein binds to the LDL receptor, leading to lysosomal degradation of the receptor
with fewer receptors, less LDL is cleared -> high LDL, which leads to clogging of arteries
trials to lower LDL cholesterol target the gene with siRNA, which leads to degradation of its mRNA
Not all candidate gene studies were successful: Limitations
- Inadequate matching of controls (not accounting for other factors)
- Insufficient correction for multiple testing (Bonferroni)
- Underpowered studies leading to lack of replication
Reasons for an association
- Direct causation
- Epistatic effect
- Population stratification
- Linkage disequilibrium
benefits in identification of susceptibility variants
new biological insights -> clinical advances
- therapeutic targets
- biomarkers
- prevention
candidate gene vs whole genome
candidate gene: few SNPs + hypothesis
whole genome: millions of SNPs and no hypothesis
GWAS
• A hypothesis-free method
• Uses large sample sizes, often cases versus controls
• Identifies regions of the human genome that are associated with a phenotype
• Based on allele frequencies at hundreds of thousands of tag-SNPs
• Association is usually confirmed through replication in independent datasets and/or GWAS meta-analyses
• Requires fine mapping through linkage disequilibrium to identify specific variants
Methods to generate genetic information for GWAS
SNP arrays vs WGS:
- genotype tag-SNPs vs sequence the whole genome
- inexpensive vs expensive
- reliable vs less accurate
GWAS major steps
- data collection
- genotyping (via SNP arrays and NGS)
- quality control (look into different populations)
- imputation (tag SNPs)
- association testing (manhattan plot)
- meta-analysis or replication
GWAS major steps dependent on
It is dependent on a number of important factors, such as:
• (un)relatedness of individuals (if they share DNA there will be an unwanted association)
• genetic architecture (quality control)
• population stratification (quality control)
• genetic model
P-value threshold for GWAS
If we assume P<0.05 is significant:
In 100 comparisons, about 5 associations are expected to be false positives
• Need to use a multiple comparison adjustment (e.g. Bonferroni)
• In GWAS, we do 1 million tests (or more!)
1,000,000 x 0.05 = 50,000 false positives
Estimated that P (for most GWAS) should be < 5 x 10^-8 for common variants with MAF >5% and LD r2=0.8
bonferroni
- 0.05 divided by the number of comparisons made.
i.e. 1 million tests = 0.05/1,000,000 = 5 x 10^-8
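The Bonferroni arithmetic as a short sketch. The function names are made up for illustration, and the expected number of false positives assumes that none of the tested SNPs is truly associated.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

def expected_false_positives(alpha, n_tests):
    """Expected false positives if every test is run at alpha and no SNP is truly associated."""
    return alpha * n_tests

n_tests = 1_000_000
uncorrected_false_positives = expected_false_positives(0.05, n_tests)
gwas_threshold = bonferroni_threshold(0.05, n_tests)

print(uncorrected_false_positives, gwas_threshold)
```

With one million tests the corrected threshold matches the conventional genome-wide significance level of 5 x 10^-8.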
Visualising GWAS results: Manhattan plot
threshold red line (normally 5 x 10^-8)
y-axis: -log10 of the p value
x-axis: chromosome number (genomic position)
each dot represents a SNP, plotted by its p value for association
the higher the dot on the plot (the smaller the p value), the higher the significance
every dot above the threshold line is a SNP on a chromosome associated with the disease of interest
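A sketch of how dot heights on a Manhattan plot are obtained, assuming the usual convention that the vertical axis shows -log10 of the p value. The SNP names and p values are hypothetical placeholders.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # the red line

def manhattan_height(p_value):
    """Height of a SNP's dot: -log10 of its association p value."""
    return -math.log10(p_value)

# Hypothetical association results.
snp_p_values = {
    "rs_hypothetical_1": 1e-3,
    "rs_hypothetical_2": 2e-9,
    "rs_hypothetical_3": 0.4,
}

threshold_height = manhattan_height(GENOME_WIDE_THRESHOLD)
significant_snps = [
    name for name, p in snp_p_values.items()
    if manhattan_height(p) > threshold_height
]

print(round(threshold_height, 2), significant_snps)
```

The transformation flips the scale, so small p values become tall dots and the threshold line sits at about 7.3.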
in the past what chromosomes were not seen on GWAS
sex chromosomes
this is now starting to improve
case- inflammatory bowel disease
monogenic forms: few alleles, each with a large impact
the more complex forms: a greater number of alleles, each with a smaller impact
where do we get the sample size
uk biobank - 500,000
there are many biobanks in Europe, America and Asia, but few in Africa and other regions: a demographic bias problem
case - height
top 697 variants explain 20% of heritability
top 10k variants explain 30% of Vp
case- blood pressure
heritability estimated to 30-70%
Where is the missing heritability?
- due to rare variants with BIG effect
- Due to gene-gene and gene-environment interactions
- Due to epigenetic effects
- no missing heritability; family studies overestimate heritability
- GWAS underestimates heritability because tag-SNPs do not reliably detect the causal variants
- Much heritability due to common variants with very small effects
What's next for complex disease
- Whole-genome sequencing of large cohorts for rare and uncommon variants
- Interpreting the role and risk of SNPs
Genome-wide polygenic risk score
can identify individuals at risk of common complex diseases
Polygenic risk score (PRS)
- Single value estimate of an individual's genetic liability to a phenotype
- Sum of the genome-wide genotypes, weighted by genotype effect size (odds ratio) derived from GWAS summary statistic data
penetrance in complex diseases
GWAS - many variants with small effects - low penetrance
Mendelian - high penetrance - few variants large effect
the missing variants could be those with intermediate penetrance
GWAS identified SNPs associated with X, now what?
we identify SNPs with GWAS associated with disease
estimate SNP based heritability and build candidate predictors
build polygenic risk scores
composite score for personalised risk prediction
example of PRS
they identified 4 alleles at 4 loci with different effects:
locus 1: A = +1.5
locus 2: C = -0.5
locus 3: T = +2.0
locus 4: A = -1.5
individual 1 has genotypes AT, CG, TT, CC
PRS = 1.5 (1 x A) - 0.5 (1 x C) + 4.0 (2 x T) - 0.0 (0 x A) = 5.0
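The same worked example as a sketch in code: the score is the sum over loci of (effect size x number of copies of the effect allele). The data layout and function name are assumptions made for illustration.

```python
# (effect allele, effect size) for each of the 4 loci in the example.
effect_sizes = [("A", 1.5), ("C", -0.5), ("T", 2.0), ("A", -1.5)]

# Genotypes of individual 1 at the same 4 loci, in the same order.
individual_1 = ["AT", "CG", "TT", "CC"]

def polygenic_risk_score(effects, genotypes):
    """Sum of effect sizes weighted by the count (0, 1 or 2) of each effect allele."""
    return sum(
        weight * genotype.count(allele)
        for (allele, weight), genotype in zip(effects, genotypes)
    )

prs_individual_1 = polygenic_risk_score(effect_sizes, individual_1)
print(prs_individual_1)
```

Real scores use many more loci, with weights taken from GWAS summary statistics, but the arithmetic is the same.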
When are PRS beneficial?
• For risk calculation in European populations, where most GWAS data come from = LIMITATION
• Conditions with proven preventative measures
• The risk of disease outweighs the psychological impact of knowing you are at high genetic risk of disease
GWAS downstream analyses: Interpretation
causal variant genotyped = direct association
causal variant in LD with other genotyped variants = indirect association
Moving from association to causation
Variants are merely associated with a trait
We can use further genomic analysis tools to determine:
• Coding vs regulatory variants
• Fine mapping
• Gene expression
Future in vitro, animal studies, and clinical trials