Week 4.7.8: Genetic trait associations Flashcards
Genetic trait associations
Genetic association studies, genome-wide association studies, missing heritability, genetic disease associations, single gene disorders, polygenic disorders
In previous lectures we have been looking at:
What the human genome looks like
How we sequence genomes
How history shapes genomes
How human genomes differ from the genomes of other species
BUT HOW DO GENOMES INFLUENCE OUR PHENOTYPES? WHAT DO GENOMES DO?
How? What? Why?
A major goal of genomics is to identify which parts of the genome are responsible for which traits. We know that the genome has a major influence on traits like height, but how does it do this? How do we get from our DNA to our heart, lungs etc.?
Two lines of evidence in genetic trait associations
- Genetics
- Traits
So we have to look at both of those things: the genes and the traits.

We know that the baby didn’t come from that couple: looking at the traits the adults have, and at how those traits are inherited, it is not likely that the baby is from those parents.
We know a lot of the traits we see in this picture are heritable.
Heritability
The heritability of a trait within a population is the proportion of observable differences in the trait between individuals in that population that is due to genetic differences.
Heritability is about the variability of a trait: how much is due to genes and how much is due to environment. All traits are a mixture of our DNA and our environment, so it is not just your genes that are responsible for how large your stomach is; if you eat lots of doughnuts you are more likely to have a big stomach. Heritability might be one reason why someone has a big gut, but the environment has an effect too.
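As a rough illustration of the definition above, heritability can be sketched as a ratio of variances. This assumes the variance in a trait splits cleanly into a genetic part and an environmental part with no interaction between them, which is a simplification. The function name and the numbers are made up for illustration.

```python
def heritability(genetic_variance: float, environmental_variance: float) -> float:
    """Proportion of the total trait variance that is due to genetic differences.

    Assumption: total variance = genetic + environmental (no G x E term).
    """
    total_variance = genetic_variance + environmental_variance
    if total_variance <= 0:
        raise ValueError("total variance must be positive")
    return genetic_variance / total_variance


# Hypothetical variances for some trait in some population.
h = heritability(genetic_variance=8.0, environmental_variance=2.0)
print(f"heritability = {h:.2f}")
```

Note that the result describes a population, not an individual: the same trait can have a different heritability in a population with a more variable environment.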
How do we untangle genetic and environmental effects on traits?
One way of doing this is with family studies and twin studies. We know that many human traits have a high heritability. If identical twins vary in a trait, we know that the variability is not due to their genes but to their environment, so twin studies let us begin to untangle environmental and genetic influences.
If we cannot use twins then we can use families instead
We could study plants: we can clone them and grow the clones in different environments, which controls for genes because the clones are genetically identical
However, we cannot clone humans, and even if we could, manipulating the clones’ environments would be very unethical
Facebook experiment: tweaking people’s Facebook feeds to see if it affected their mood, which caused a lot of backlash
We just can’t do these experiments
But we can work with twin and family studies to try to untangle genetics and environment
We have known about heritability since long before we could sequence genomes
Sir Francis Galton’s (1889) data showing the relationship between offspring height (928 individuals) as a function of mean parent height (205 sets of parents)

Genetics without DNA
From 1850 to 1950 we did genetics without knowing DNA was the hereditary material
We knew about genes since Mendel – even before we knew about DNA
Genetic maps since 1913, Alfred Sturtevant made the first genetic map (a Drosophila chromosome)
Looking at heritability is something we have been able to do for a long time
Two lines of evidence
1. Patterns of heredity: tracing the inheritance of traits through generations
2. Patterns of DNA variation: DNA sequencing and comparison in multiple individuals
What two types of traits are there?

Two types of traits… Monogenic or polygenic
**What is a monogenic trait?**
Monogenic
A monogenic trait will often show a clear pattern of Mendelian inheritance, like Mendel’s peas: either dominant or recessive, segregating in the F2 generation. Monogenic traits tend to be present/absent in phenotype, and their genetic basis is relatively easy to discover when you can do controlled crosses and generate large families of progeny.
In humans they are a bit harder to work on than in pea plants but still they are fairly easy to work out
However, polygenic traits are not…
They are traits that involve many genes. They do not normally show clear Mendelian inheritance because they involve interactions of many genes (many loci interacting)
They interact with the environment in complex ways, so their genetic basis can be very hard to discover
What approach do we use to study polygenic traits?
Quantitative trait association studies (mapping quantitative trait loci, QTLs)
Commonly studied with Genome Wide Association Studies (GWAS)
GWAS is a way of looking at highly polygenic traits
From the textbook chapter 6 figure 6.9
Shows a monogenic trait and its inheritance in comparison with polygenic traits; we are looking at disorder traits
As we know from Mendelian genetics, simple inheritance patterns are observed on the left, whereas the patterns for polygenic traits are not simple
We know there are many polygenic traits

Three examples of monogenic traits?
Monogenic
· Cystic fibrosis (being monogenic is why we have known about its genetic basis for a long time)
· Sickle cell disease
· Phenylketonuria
Three examples of polygenic traits?
Polygenic
· Type 2 diabetes
· Hypertension
· Rheumatoid arthritis
People are still working on locating loci
Study sampling designs
With humans, we can’t design experiments on genetics as we can with other organisms
We have to make use of what variation and genealogical relationships we can discover existing in human populations
Two issues when we try to do a study:
Which people do I study?
How much of the genome do I study?
The more humans and the more genome studied the more expensive it will be – but obviously you might be able to learn a lot more looking at more people and their whole genome
Case control studies
Compare a large group of people showing a trait with a large group of people not showing it. For example, for type 2 diabetes you recruit as many people who have the disease as people who don’t, then compare the alleles of those with type 2 diabetes against those without. If you find an allele that is much more common in those with type 2 diabetes, you can infer that the allele has something to do with the disease
But you have to take account of:
· genetic background (e.g. everyone is from Manchester, or everyone is from Munich)
· environmental exposure
· same trait but different genetic cause
Works best for discrete traits (Cases/Controls)
Family-based studies
In a family-based study you know the genealogies (who the mother and father were, the granddad, the uncle etc.), so you can use linkage analysis. The environmental exposures of family members will often be similar, which helps control for environmental effects. Family-based studies have been very successful at discovering the basis of many Mendelian traits.
· More powerful methods
· Genetic background and environmental exposures often similar among family members
· Problem of numbers: families are small
· Used to discover basis of many Mendelian traits
· May discover rare mutations unique to a family
Might give great results but it might only be particular to that family
Cohort study
You do not just sample people at one point in time; you study them over a long period
This allows for a better understanding of environment, so it is good for G x E studies
Studies like this are hard to manage and fund in practice
Large population studies
Often used for polygenic quantitative traits that show continuous variation (most polygenic traits do)
Need a lot of sequence data
But it can be hard to get accurate phenotypes, and it’s expensive to get lots of genotypes and phenotypes
Study design:
How much of the genome do I study?
Candidate gene studies
The candidate gene approach to conducting genetic association studies focuses on associations between genetic variation within pre-specified genes of interest and phenotypes or disease states. This is in contrast to genome-wide association studies (GWAS), which scan the entire genome for common genetic variation.
Focus on a particular gene at a particular locus, concentrating on a chosen genomic region. Prior knowledge points to that region, e.g. a previous family-based study, or a study of the gene’s function in mice or another organism. Allows you to see if there is variation in that gene. Relatively cheap because you are looking at one thing, but these studies can be very hard to replicate because you might pick up something unique to the sample studied
If you don’t look at candidate genes, you do a genome-wide study instead
Hypothesis free
Look across the whole genome, little prior knowledge needed
· using SNP markers or WGS
Expensive – lots of data needed and can be hard to replicate
Complex statistics: big possibility of false positives and negatives, because you are doing MANY statistical tests
Two major types of study
1.Linkage analysis (Linkage mapping)
2.Genome wide association studies (Linkage disequilibrium mapping)
These two approaches take elements from the study designs above: which people you study and how much of the genome you study
Linkage analysis (Linkage mapping)
Brings together the two lines of evidence
· heredity patterns of traits
· genome sequence
You need to know the pedigree of every individual
You can start with a genome-wide search, but then sequential studies are needed to gradually narrow down the genomic region containing the trait’s locus. You can also begin with a candidate region of the genome
The main aim of linkage analysis is that you want to identify genetic markers that segregate with a trait of interest
Two lines of evidence
· Patterns of segregation of a trait in families
· Patterns of segregation of genetic markers in the same families
Segregation happens because:
· Chromosomes segregate in meiosis (between the copies from mum and dad)
· Recombination segregates loci within chromosomes
We have known about this for a very long time

Genetic distance

Genetic distance between two loci is measured by the recombination fraction
If 1% of progeny from a cross are recombinant between two loci, the loci are 1 centimorgan (1 cM) apart
i.e. if a trait co-occurs with a marker in 99% of the progeny of a cross, the marker is likely to be 1 cM from the trait locus
Genetic distance and physical distance (in bases) are somewhat different, because recombination is more frequent in some parts of the chromosome; loci in those regions look further apart on a genetic map than they physically are
LOD Scores
Measure of linkage between loci: the log10 of the ratio between the likelihood of the data given the observed linkage and the likelihood under the null hypothesis of no linkage at all
· LOD score above 3 may suggest significant linkage
· LOD score of less than -2 may suggest no linkage
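A minimal sketch of the recombination fraction and the LOD score, assuming the simplest case where every one of the progeny can be scored unambiguously as recombinant or non-recombinant. The function names and counts are hypothetical, and real linkage software has to deal with far more complicated pedigrees.

```python
import math


def recombination_fraction(recombinants: int, total: int) -> float:
    """Fraction of progeny that are recombinant between two loci."""
    return recombinants / total


def lod_score(recombinants: int, total: int, theta: float) -> float:
    """log10 of the likelihood of the data at recombination fraction theta,
    divided by the likelihood under no linkage (theta = 0.5)."""
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    non_recombinants = total - recombinants
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1 - theta))
    log_unlinked = total * math.log10(0.5)
    return log_linked - log_unlinked


# Hypothetical cross: 1 recombinant among 100 progeny.
theta = recombination_fraction(1, 100)  # 0.01, i.e. roughly 1 cM
lod = lod_score(1, 100, theta)
print(f"theta = {theta}, LOD = {lod:.1f}")
```

With so few recombinants the LOD score comes out well above 3. If half the progeny were recombinant (theta = 0.5), the score would be 0, meaning no evidence either way.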

Genome wide association studies (GWAS) (Linkage disequilibrium mapping)
Linkage disequilibrium mapping
Do not need to know pedigrees
Hypothesis free
Look across the whole genome
· using SNP markers or WGS
How often does each locus have a variant that co-occurs with the disease?
Little prior knowledge needed
Expensive – lots of data needed
Can be hard to replicate: different cohorts of samples can give different results
Complex statistics: false positives and negatives
Bus Analogy
3.2 billion seats, something goes wrong when seat number 116572 is occupied by a male
But all we know is which buses have gone wrong, and who was on each bus
How do we associate “male in seat 116572” with the problem?
This is very difficult. We have to look across all the buses, compare the occupants of every seat, and find the seat whose occupant is always the same in the buses with the problem. Mathematically that is a difficult problem: looking for something very small in a very large data set
What helps us is linkage disequilibrium
Our 3.2 billion loci are not randomly assorting in our genomes: we don’t have 3.2 billion chromosomes, we only have 23 pairs of chromosomes
Although recombination happens within those chromosomes, it is not enough for the 3.2 billion loci to segregate randomly in EVERY generation. Because recombination does not happen all that frequently along the chromosome, there is a lot of block structure in the variation within human genomes, so lots of loci are linked when we look at human populations. That is linkage disequilibrium. If loci were in equilibrium, everything would be randomly assorted, as if we had 3.2 billion chromosomes.
As we move further away from a locus on either side, loci are less linked. Conceptually this is similar to linkage mapping (but in linkage mapping we are looking at one family or pedigree, tracing a lineage and its recombination events)
With linkage disequilibrium we are looking at populations, not pedigrees. We simply observe these patterns as a phenomenon, and exploit the fact that linkage disequilibrium occurs: it means that when we are looking for an allele associated with a trait, we can look for a block of linked loci instead

Different human populations have different patterns of linkage disequilibrium, and this partly depends on their history: the longer it is back to a common ancestor, the more time recombination has had to break up blocks, so the less linkage disequilibrium you will find in a population

Here is part of chromosome 7 around a gene studied in the Metabolic Risk Complications of Obesity Genes (MRC-OB) project, a cohort from Northern Europe
Each line is a SNP marker on that chromosome, and the chart shows how strongly the SNPs are associated. RED means the two SNPs are highly associated (they are in linkage disequilibrium with each other), whereas white means there is NO linkage disequilibrium (they are in equilibrium)
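To make "how associated two SNPs are" concrete, here is a minimal sketch of two standard pairwise measures, D and r². Which measure the plot described above actually uses is not stated, so r² is an assumption here, and the haplotype data are invented. Alleles are coded 0/1 at each of the two SNPs.

```python
def ld_stats(haplotypes: list[tuple[int, int]]) -> tuple[float, float]:
    """Return (D, r_squared) for two biallelic SNPs.

    Each haplotype is (allele at SNP 1, allele at SNP 2), coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # 0 when alleles assort independently
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must be variable in the sample")
    return d, d * d / denominator


# Hypothetical samples: one pair of SNPs always inherited together, one independent.
linked = [(1, 1)] * 5 + [(0, 0)] * 5
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_stats(linked)[1], ld_stats(independent)[1])
```

An r² near 1 corresponds to the strongly coloured cells in such a plot, and an r² near 0 to the white ones.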
Imagine diagonal lines going up from each SNP. We can see that the big block is usually inherited as one block: recombination hardly ever happens within it, so if there are two variable SNPs in the block they will always vary in the same way
Within a block you only need to know which allele is present at one of these bases to know what is present at all the other positions in the block, given your knowledge of the variation in the population, because they are all closely linked. A little block where all the variation is inherited together is known as a haplotype block. We can see that along this part of chromosome 7 there are a few haplotype blocks
We can infer the identity of ALL the SNP alleles in a block given knowledge of one of them. Using one known allele to infer the alleles present at other loci is called imputation
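The idea of imputation can be sketched as a lookup against the haplotypes known to exist in the population. This is a toy version under strong assumptions: the reference haplotypes are invented, the block is assumed never to recombine, and real imputation tools use statistical models over large reference panels instead of exact matching.

```python
from typing import Optional


def impute_block(reference_haplotypes: list[str], tag_index: int,
                 tag_allele: str) -> Optional[str]:
    """Infer the whole haplotype block from the allele at one genotyped SNP.

    Returns None when the tag allele is unseen or matches more than one
    reference haplotype, because the block cannot then be inferred uniquely.
    """
    matches = {h for h in reference_haplotypes if h[tag_index] == tag_allele}
    if len(matches) == 1:
        return matches.pop()
    return None


# Hypothetical block of four SNPs with only two haplotypes in the population.
reference = ["ACGT", "GTAC"]
print(impute_block(reference, tag_index=0, tag_allele="A"))
```

Here one genotyped SNP is enough to recover the other three, which is why a GWAS only needs about one marker per haplotype block.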

Collection based on people in Utah with North European heritage.
To some extent the linkage blocks in the MRC-OB cohort are smaller than is often found
The region tends to be broken up into 4 sub-blocks in the Utah population
Even within European populations you can see slight differences, and that is even more the case when we compare with Africa
Bone mass: in Europe you can see two big blocks in linkage disequilibrium, whereas in Africa the blocks are smaller
Higher linkage disequilibrium in Europe than in Africa
Linkage disequilibrium is absolutely crucial in genome wide association studies
It is crucial because we are trying to associate these haplotype blocks with traits; we don’t want to have to associate 3.2 billion alleles individually
We try to genotype at least one SNP per block, then try to associate different sections of the genome with different traits. Each column is a different individual: half are cases (trait) and half are controls (no trait). Looking at 8 blocks, we want to know to what extent the different alleles are present in cases and controls. At number 4 the cases are always blank squares but the controls are filled diamonds
Whereas the one at the bottom is pretty much the same in both groups: one allele is very slightly different, but it is probably not associated with the particular phenotype we are looking at
GWAS looks at thousands of loci scattered across the genome, at least one per haplotype block, and asks whether a particular allele is associated with the trait
This is normally shown in something called a Manhattan plot (because it looks like the New York skyline with lots of skyscrapers). Here we have the whole genome: chromosomes 1 to 22 plus X (female)
· X axis: distance along chromosome
· Y axis: negative log of the p-value estimated for the association between locus and trait
Common genetic variants on 5p14.1 associate with autism spectrum disorders
A lot of maths has gone into the Manhattan plot to get the values. Chromosome 5 shows the most significant association with autism, so we zoom in on chromosome 5
Important to remember that the Y axis is based on a P-value. We have to set a threshold for the P-value, and because we are doing multiple statistical tests we have to be very conservative: the more tests you do, the lower the P-value threshold has to be
GWAS will do thousands of tests, so the threshold is normally very low, e.g. 0.000001
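One common way to make the threshold stricter as the number of tests grows is the Bonferroni correction, sketched below together with the Manhattan plot's y-axis transformation. The notes do not say which correction was used, so Bonferroni is an assumption, and the function names and the 50,000-test figure are illustrative only.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test P-value threshold that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests


def manhattan_height(p_value: float) -> float:
    """Y-axis value on a Manhattan plot: the negative log10 of the P-value."""
    return -math.log10(p_value)


# Hypothetical study testing 50,000 loci at an overall alpha of 0.05.
threshold = bonferroni_threshold(0.05, 50_000)
locus_p = 3e-9  # invented P-value for one locus
print(f"threshold = {threshold:.1e}, line drawn at height {manhattan_height(threshold):.1f}")
print("significant" if locus_p < threshold else "not significant")
```

A smaller P-value gives a taller point, so the significance threshold appears as a horizontal line that the "skyscrapers" must rise above.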
GWAS significance
Null hypothesis
· There is no difference between cases and controls
· There is no relationship between a genetic variable and a quantitative trait
Statistical significance: P-value
Y axis is showing significance, not strength of effect
Threshold must be stringent (a very low P-value) because of multiple hypothesis testing
Loci that only just cross the significance threshold may have a stronger effect than loci that cross it comfortably. The tallest peaks are the most significant, not necessarily the ones with the strongest effect
Strength of effect in GWAS is normally given with an odds ratio
**Effect size: odds ratio**
Odds Ratio is calculated once we know a trait is significantly associated with a locus
Odds
The odds is the ratio of the probability that the event of interest occurs to the probability that it does not.
The odds that a single throw of a die will produce a six are 1 to 5, or 1/5 = 0.2
The probability of a 6 is 1/6 = 0.166666667
The probability of a not 6 is 5/6
(1/6)/(5/6)=0.2
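The die example can be checked with a few lines of Python. This is only a sketch; the function name is made up, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction


def odds(probability: Fraction) -> Fraction:
    """Odds = probability the event occurs / probability it does not."""
    return probability / (1 - probability)


p_six = Fraction(1, 6)            # probability of throwing a six
odds_six = odds(p_six)
print(odds_six, float(odds_six))  # 1/5 0.2
```

Odds and probability are close for rare events but diverge as the event becomes more likely: a probability of 1/2 corresponds to odds of 1.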
Odds ratio (OR)
A ratio of two ratios
OR = (odds of having the trait given you have the trait-associated allele) / (odds of having the trait given you do not have the trait-associated allele)
Odds ratio example
Disease associated locus: biallelic A/T
The odds of getting the disease if you have a T allele are 1 to 3 (1/3)
The odds of getting the disease if you have an A allele are 1 to 9 (1/9)
Odds ratio is (1/3)/(1/9) = 3 (the odds of getting the disease are three times higher with T, relatively speaking)
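The same calculation as a short sketch, first from the two odds directly and then from case/control counts. The counts are invented so that they reproduce odds of 1/3 and 1/9; the function names are hypothetical.

```python
from fractions import Fraction


def odds_ratio(odds_with_allele: Fraction, odds_without_allele: Fraction) -> Fraction:
    """Ratio of two odds."""
    return odds_with_allele / odds_without_allele


def odds_ratio_from_counts(cases_with: int, controls_with: int,
                           cases_without: int, controls_without: int) -> Fraction:
    """Odds ratio from a 2 x 2 table of cases/controls with/without the allele."""
    return odds_ratio(Fraction(cases_with, controls_with),
                      Fraction(cases_without, controls_without))


print(odds_ratio(Fraction(1, 3), Fraction(1, 9)))   # 3
# Invented counts: 30 cases and 90 controls with T; 10 cases and 90 controls with A.
print(odds_ratio_from_counts(30, 90, 10, 90))       # 3
```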
Odds Ratio (OR)
The odds ratio is an indicator of the strength of the relationship between a genetic variant and a trait
OR = 1
It doesn’t make a difference which allele you have
OR > 1
You are more likely to have the trait if you have the allele
OR < 1
You are less likely to have the trait, but statistic is not directly interpretable
“Missing heritability”
Loci identified by GWAS generally only explain a small proportion of the known heritability of a trait.
We have long known that height is heritable, but the loci we find do not account for the heritability we see: each locus by itself explains only a small proportion of the height differences observed in human populations, and together they still fall short.
Disease A is highly heritable, B equally so, C not so much, D hardly at all
The black box is the environmental component and the SNPs explain part of the genetic influence, but the “?” is unknown: it is the “missing heritability”
It is found within almost every genus – many reasons why that can be found

Nature 2010 paper: they looked at 180,000 individuals and found 180 loci influencing adult height, but those 180 loci explained only 10% of the phenotypic variation in height. Even when you take out the environment they explain only 20% of the heritability, so 80% of the heritability is missing
Various things could be responsible for this. It could be due to epistasis (the interaction of genes that are not alleles, in particular the suppression of the effect of one such gene by another)
Could be due to smaller P values

Where is the missing heritability?
Epistatic interactions among loci: if you change just one locus it can have effects on everything else
Small-effect variants that are hard to detect
Rare variants
Gene by environment (GxE) interactions
Heritability was over-estimated in the first place
Common disease-common variant hypothesis
GWAS works best if common diseases are due to common variants
It can’t pick up cases where multiple recent mutations give rise to the same disease phenotype (mutation-selection hypothesis). This is likely to happen because any mutation that gives rise to disease is likely to be selected against, so natural selection should weed it out of populations before it becomes common
Such diseases are caused by rare variants. If most disease is due to rare deleterious mutations, the causes will be hard to find when many different mutations each give rise to the same phenotype
Ethnicity
GWAS results vary among ethnicities
SNP rs7612463 is associated with type 2 diabetes in East Asian populations, but it does not have this association in Caucasian populations. This could be because the populations have different linkage disequilibrium patterns, or because other genes involved in type 2 diabetes also differ, so that the locus near rs7612463 doesn’t have the same effect
Calculating personal risk
You discover you carry an allele with a significant association with a disease and a high odds ratio
What is your risk of getting that disease?
No generally accepted way of calculating this; it is pretty ad hoc
Interpretation of trait associated data
Example from p.122-123 – in EPG (correction posted on QMplus)
Locus rs2230199
Biallelic SNP: C/G
Frequency in European populations:
19% C / 81% G (these are allele frequencies; it doesn’t mean that 19% of people carry C and 81% carry G, because some people are heterozygous)
C allele associated with age-related macular degeneration (ARMD) (we know that from the GWAS)
p < 5 x 10^-29
Odds ratio for C allele is 1.53
Average population incidence of ARMD is 8% (8% of people in Europe have it). Even without the odds ratio we would know that not everyone who carries the C allele has ARMD, because at least 19% of people carry a C allele but only 8% have the disease
We want to know: if you do carry the C allele, how likely are you to develop ARMD?
(Gets a bit dodgy now…)
Assume odds ratios for alleles are multiplicative (i.e. not dominant/recessive)
· Odds ratio for C is 1.53
· Odds ratio for G is 1.0 (we assume G makes no difference: it doesn’t make you more or less likely to get ARMD)
· CC → 1.53 x 1.53 = 2.34
· CG → 1.53 x 1 = 1.53
· GG → 1 x 1 = 1
Assume Hardy-Weinberg equilibrium to calculate population genotype frequencies
· CC → 19% x 19% = 3.6%
· CG → 2 x 19% x 81% = 30.8%
· GG → 81% x 81% = 65.6%
Relative risk of ARMD for whole population
= 2.34 (odds ratio) x 0.036 + 1.53 x 0.308 + 1 x 0.656 = 1.21
Relative risk of ARMD for a CC individual
= 2.34/1.21 = 1.93
If average population incidence of ARMD is 8%
Overall risk of ARMD for a CC individual
= 0.08 x 1.93 = 0.154, about 15%
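The steps of this worked example can be sketched in Python. It makes the same assumptions as the notes (multiplicative odds ratios, Hardy-Weinberg equilibrium, and odds ratios treated as if they were relative risks), so it inherits the same "dodgy" status. The function name is hypothetical. It works from the unrounded inputs, so the last digit can differ slightly from figures rounded by hand at each step.

```python
def genotype_risks(risk_allele_freq: float, allele_odds_ratio: float,
                   incidence: float) -> tuple[dict[str, float], dict[str, float]]:
    """Return (risk per genotype, genotype frequency) for a biallelic C/G SNP.

    Assumptions: C is the risk allele, G has odds ratio 1, effects multiply,
    and genotype frequencies follow Hardy-Weinberg equilibrium.
    """
    p = risk_allele_freq
    q = 1.0 - p
    frequency = {"CC": p * p, "CG": 2 * p * q, "GG": q * q}
    weight = {"CC": allele_odds_ratio ** 2, "CG": allele_odds_ratio, "GG": 1.0}
    # Population-average weight, used to rescale each genotype's weight.
    average = sum(frequency[g] * weight[g] for g in frequency)
    risk = {g: incidence * weight[g] / average for g in frequency}
    return risk, frequency


risk, frequency = genotype_risks(risk_allele_freq=0.19, allele_odds_ratio=1.53,
                                 incidence=0.08)
for genotype in ("CC", "CG", "GG"):
    print(f"{genotype}: frequency {frequency[genotype]:.1%}, risk {risk[genotype]:.1%}")
```

A useful sanity check on this kind of calculation is that the genotype risks, weighted by genotype frequency, average back to the population incidence of 8%.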
Likelihood ratio (LR)

A ratio of two probabilities
Needs accurate measures of population frequencies of genotypes in affected and unaffected samples
These are often not available even when a GWAS has been done
You can often learn more about your probability of getting a disease by looking at the prevalence of the disease in your population than you can learn by looking at your genotype at disease-associated loci
Multilocus risk estimation
For a polygenic trait, we cannot assess our risk just from one locus
We need to combine information from many markers
Need to be sure each locus has been associated with exactly the same trait
Need to be sure each locus’ association was determined rigorously
Need to check loci are not linked in a haplotype block
Need single OR for each locus even if different GWAS studies have given different ORs
Crohn’s disease
a chronic inflammation of the intestines which is usually found in the terminal portion of the small intestine, the ileum
Mapping Crohn’s disease
Segregation analyses suggested monogenic recessive mode of inheritance
Took an initial panel of 25 Caucasian families each containing at least two siblings with Crohn’s
Family members genotyped with 270 markers with known locations spread throughout the genome
Mapping Crohn’s disease
Linkage analysis with parametric LOD score method
LOD: Logarithm of Odds
the likelihood of obtaining the test data if two loci/markers/traits are linked, compared to the likelihood of observing the same data purely by chance