Week 11 (1000 Genomes Project) Flashcards
what is the 1000 Genomes Project?
The 1000 Genomes Project is an international research consortium that was set up in 2007 with the aim of sequencing the genomes of at least 1,000 volunteers from multiple populations worldwide in order to improve our understanding of the genetic contribution to human health and disease.
what was the first model for the 1000 genomes project? why?
humans! human research is more funded so they had the money to do this.
what combination of sequencing approaches did they use to complete the 1000 Genomes Project?
- low coverage whole genome
- exome sequencing
the 1000 Genomes Project validated a haplotype map of ____ _____ single nucleotide polymorphisms
38 million
why do low frequency variants tend to be recent?
frequency is the number of times something shows up in the population, so something new (like a new variant or mutation) has had little time to spread and tends to have a lower frequency
is it possible for mutations to occur over time? if so, how?
yes! mutations can arise when DNA is copied during cell division
what is the equation that you use to determine the frequency of a new mutation in a population?
1/(2N) (N = number of individuals; each individual carries two copies of each chromosome, so there are 2N copies in total)
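worked example (a minimal sketch, not from the lecture): the 1/(2N) card plugged into a few population sizes. The function name and the population sizes are made up for illustration.

```python
def new_mutation_frequency(n_individuals: int) -> float:
    """Frequency of a brand-new mutation in a diploid population: 1 / (2N)."""
    if n_individuals <= 0:
        raise ValueError("population size must be positive")
    # Each individual carries 2 copies, and the new mutation sits on exactly 1 of them.
    return 1 / (2 * n_individuals)


# Hypothetical population sizes, chosen only to show how fast the frequency shrinks.
for n in (10, 1_000, 1_000_000):
    print(f"N = {n}: frequency = {new_mutation_frequency(n)}")
```

The larger the population, the lower the starting frequency of a new mutation, which fits the card about low frequency variants tending to be recent.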
what is the chance of transmission from parent to offspring?
50/50 (to transmit or to not transmit)
in every generation recombination occurs, which gradually breaks down _______ __________
linkage disequilibrium
the 1000 Genomes Project found 3.6 million SNPs per individual. on average, how different is an individual's genome (as a percentage)?
0.1%
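worked example (a rough sketch): where the 0.1% comes from. The genome size of about 3.2 billion base pairs is an assumption added here (the approximate haploid human genome size); it is not stated on the card.

```python
SNPS_PER_INDIVIDUAL = 3_600_000   # from the card above
GENOME_SIZE_BP = 3_200_000_000    # assumption: approximate haploid human genome size

# Fraction of positions where an individual carries a SNP.
fraction_different = SNPS_PER_INDIVIDUAL / GENOME_SIZE_BP
print(f"{fraction_different:.1%}")  # prints 0.1%
```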
what is low coverage?
<5x (coverage is counted in reads per position, written as "x", not as a percentage)
what is high coverage?
>20x
why did the 1000 genomes project use 5x coverage?
it was really expensive to do more than that! (it cost $5 million)
what is the typical amount of coverage that we use today?
30x
we transmit ________ NOT _______ to the next generation
chromosomes; alleles
what amount of coverage did the 1000 Genomes Project use?
low coverage (2-6x)
the 1000 Genomes Project used wide sampling and low coverage, why?
they wanted to characterize common variation; by sequencing at lower coverage they could afford to sample more individuals
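worked example (a minimal sketch under stated assumptions): the wide sampling trade-off with made-up numbers. The sequencing budget, the allele frequency, and both function names are hypothetical. The sketch only asks whether at least one sampled chromosome carries the variant; it ignores how hard the variant is to call at low coverage.

```python
def samples_for_budget(total_coverage_budget: int, coverage_per_sample: int) -> int:
    """How many genomes fit in a fixed sequencing budget (measured in total 'x' of coverage)."""
    return total_coverage_budget // coverage_per_sample


def prob_variant_sampled(allele_frequency: float, n_individuals: int) -> float:
    """Chance that at least one of the 2N sampled chromosomes carries the variant."""
    return 1 - (1 - allele_frequency) ** (2 * n_individuals)


BUDGET = 6000      # hypothetical: enough sequencing for 6000x in total
FREQUENCY = 0.001  # hypothetical variant at 0.1% frequency

for coverage in (5, 30):
    n = samples_for_budget(BUDGET, coverage)
    p = prob_variant_sampled(FREQUENCY, n)
    print(f"{coverage}x: {n} individuals, P(variant is in the sample) = {p:.2f}")
```

With the same budget, 5x buys six times as many individuals as 30x, so many more variants are present in the sample at all.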
how did the 1000 Genomes Project construct an integrated map of variation?
- primary data
- candidate variants and quality metrics
- variant calls and genotype likelihoods
- integrated haplotypes
which would produce more accurate variant calls, low coverage WGS or high coverage exome?
high coverage exome
pro and con of low coverage WGS?
- pro: cost effective, can conduct large scale studies
- con: less accurate variant calls
pro and con to high coverage exome?
- pro: more accurate variant calls
- con: only sequencing 2% of the genome
what is exome sequencing?
sequencing only the exons (the protein coding regions) and nothing else in the genome, so only about 2% of the genome is sequenced
why 0, 1, or 2 copies of a variant for an individual?
we have two copies of each chromosome, so the variant can be on neither, one, or both
why is the evidence for a single genotype typically weak in low coverage regions?
at low coverage (about 5x) each position is covered by only about 5 reads, so there are only about 5 reads available to support any genotype call
the evidence for a single genotype is typically weak in low coverage regions. why is it more difficult for heterozygous genotypes?
a single read carrying the alternate allele could be a sequencing error, or it could mean the site is heterozygous, so your confidence in the call is low
the evidence for a single genotype is typically weak in low coverage regions. how can we address this?
sequence deeper (increase coverage)
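worked example (a minimal sketch, not the project's actual genotype caller): why one odd read at 5x is ambiguous but a 30x site is not. It assumes a simple binomial read model, a flat prior over the three genotypes, and a 1% per-read error rate; all three are assumptions made for illustration, and the function name is hypothetical.

```python
from math import comb


def genotype_posteriors(n_reads: int, n_alt: int, error_rate: float = 0.01) -> dict:
    """Posterior over hom_ref / het / hom_alt given n_alt alternate reads out of n_reads."""
    # Chance that a single read shows the alternate allele under each genotype.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1 - error_rate}
    # Binomial likelihood of the observed read counts under each genotype.
    likelihoods = {
        genotype: comb(n_reads, n_alt) * p ** n_alt * (1 - p) ** (n_reads - n_alt)
        for genotype, p in p_alt.items()
    }
    # Flat prior, so the posterior is just the normalized likelihood.
    total = sum(likelihoods.values())
    return {genotype: value / total for genotype, value in likelihoods.items()}


low = genotype_posteriors(n_reads=5, n_alt=1)     # 1 alternate read out of 5
high = genotype_posteriors(n_reads=30, n_alt=15)  # 15 alternate reads out of 30

for label, posterior in (("5x", low), ("30x", high)):
    print(label, {genotype: round(p, 3) for genotype, p in posterior.items()})
```

Under these assumptions the 5x site leaves real doubt between "error on a homozygous reference site" and "heterozygous", while the 30x site is called heterozygous with near certainty.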
what procedure/ what is it called when you try to determine if a variant is true or not?
variant quality score recalibration (VQSR)
the 1000 Genomes Project identified 38 million variants. how many variants (SNPs) have been discovered today?
1.1 billion
remember that other type of variation we said we were NOT going to talk about?
structural variation
what was another name we gave to “regions of low complexity”?
repetitive sequence
what technology should we use in regions with low complexity? why?
long read sequencers, so we can span across the repeat
when we make a call about DNA at a position, what are the options for the condition?
- true positive
- false positive
- false negative
FDR
false discovery rate
FDR equation
FP / (FP + TP)
(FDR= false discovery rate, FP=false positive, TP = true positive)
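worked example (a minimal sketch): the FDR equation applied to made-up counts. The function name and the numbers are hypothetical, not results from the project.

```python
def false_discovery_rate(false_positives: int, true_positives: int) -> float:
    """FDR = FP / (FP + TP): the share of all positive calls that are wrong."""
    total_calls = false_positives + true_positives
    if total_calls == 0:
        raise ValueError("FDR is undefined when no variants were called")
    return false_positives / total_calls


# Hypothetical call set: 100 variants called, 5 of them turn out to be false.
print(false_discovery_rate(false_positives=5, true_positives=95))  # prints 0.05
```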
de novo
new
accessible genome
the fraction of the reference genome in which short-read data can lead to reliable variant discovery
the 1000 genomes project had challenges identifying large and complex structural variants and shorter indels in regions of low complexity. so what conservative but high quality subsets did they focus on?
- biallelic indels
- large deletions
everyone carries “bad” variants. however, not everyone shows them or they never cause issues. why can this happen?
we have two copies of each chromosome, so if the copy on the other chromosome is functioning it can mask the bad variant