Week 8 (Variants & SNP Chips) Flashcards
_____ _________ based on multiple metrics (need to be determined empirically) for variant filtering
hard filtering
what are the two most important metrics in hard filtering?
- QualByDepth (QD)
- RMSMappingQuality (MQ)
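A minimal sketch of hard filtering on these two metrics. The cutoffs below (QD of at least 2.0, MQ of at least 40.0) and the variant records are assumptions for illustration only; as the card says, real thresholds need to be determined empirically. The function name `passes_hard_filter` is hypothetical.

```python
# Illustrative cutoffs only (assumed); real values are chosen empirically.
QD_MIN = 2.0
MQ_MIN = 40.0


def passes_hard_filter(variant, qd_min=QD_MIN, mq_min=MQ_MIN):
    """Return True if the variant clears both the QD and the MQ cutoff."""
    return variant["QD"] >= qd_min and variant["MQ"] >= mq_min


# Made-up variant annotations for the example.
variants = [
    {"id": "var1", "QD": 12.5, "MQ": 60.0},  # passes both
    {"id": "var2", "QD": 1.3, "MQ": 58.0},   # fails QD
    {"id": "var3", "QD": 20.1, "MQ": 25.0},  # fails MQ
]

kept = [v["id"] for v in variants if passes_hard_filter(v)]
print(kept)
```

A variant is dropped as soon as it fails any one cutoff, which is why hard filtering is simple but blunt.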
VCF
variant call format
MQ (RMSMappingQuality) has a value of 40 associated with it. What does that mean?
it allows us to evaluate how well we think the reads covering that site mapped to the genome
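As the name RMSMappingQuality suggests, MQ is a root mean square of mapping qualities. A small sketch of that calculation, assuming the input is a list of per-read mapping qualities at one site (the function name and the numbers are made up):

```python
import math


def rms_mapping_quality(mapqs):
    """Root mean square of the per-read mapping qualities at one site."""
    if not mapqs:
        raise ValueError("need at least one mapping quality")
    return math.sqrt(sum(q * q for q in mapqs) / len(mapqs))


# If every read maps with quality 40, the site's MQ is 40.
uniform = rms_mapping_quality([40, 40, 40])

# A mix of well-mapped and ambiguously mapped reads pulls MQ down.
mixed = rms_mapping_quality([60, 0])
print(uniform, mixed)
```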
where would you find a lower MQ (mapping quality)?
in repetitive sequences because there are multiple places it could go
sensitivity
identifying true positives
specificity
identifying true negatives
all variant callers produce errors. these errors can be classified as false positives and false negatives. when performing a genomic analysis, or any similar analysis for that matter, one has to balance sensitivity and specificity. what do the terms sensitivity and specificity mean in the context of variant calling?
sensitivity: trying to discover all the real variants
specificity: trying to limit the false positives that creep in when filters get too lenient
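The two terms can be written as ratios of the four possible outcomes (true/false positives and negatives). A short sketch using the standard definitions; the counts are made up for illustration:

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real variants that were called."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-variant sites that were correctly left uncalled."""
    return true_neg / (true_neg + false_pos)


# Made-up counts: 90 real variants found, 10 missed,
# 950 non-variant sites left alone, 50 called by mistake.
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=950, false_pos=50)
print(sens, spec)
```

Loosening filters moves false negatives into true positives (higher sensitivity) but also moves true negatives into false positives (lower specificity).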
if you call a variant where one doesn’t exist this is a false __________
positive
if you fail to identify where a variant exists it is a false __________
negative
what type of error is considered the worst?
Type 1 error (we don’t want to say something is true if it is not)
Variant Quality Score Recalibration
does not actually recalibrate QUAL but creates a new score
what is the purpose of the variant quality score recalibration?
the purpose of this new score is to enable variant filtering in a way that allows analysts to balance sensitivity and specificity
sensitivity in variant quality score recalibration
trying to discover all the real variants
specificity in variant quality score recalibration
trying to limit the false positives that creep in when filters get too lenient
________ _______ ________ ___________ uses machine learning algorithms to learn from each dataset what the annotation profile of good variants vs. bad variants looks like
variant quality score recalibration
VQSR
variant quality score recalibration
key to VQSR is that you need a “________ _____” for training the model
truth set
what is 100% sensitivity?
calling every difference a variant
the recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality ____________
tranches
__________ are essentially slices of variants, ranked by VQSLOD
tranches
high tranche
if you want more variants and are willing to accept false positives
middle tranche
if you want to remove most false positives but are also willing to remove some true variants
low tranche
if you only want highly accurate true variants with few false positives and are willing to miss perhaps many true positives
what are tranches?
slices of the variant quality scores; the threshold you set determines how many true positives you identify and how many false positives you accept
slices of the variant quality scores; the threshold you set determines how many true positives you identify and how many false positives you accept
tranches
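A rough sketch of how a tranche cutoff could be picked: rank the truth-set variants by VQSLOD and find the lowest score that still keeps a target fraction of them. This illustrates the idea only and is not the actual VQSR implementation; the function name and the scores are made up.

```python
import math


def tranche_threshold(truth_vqslods, target_sensitivity):
    """Lowest VQSLOD cutoff that keeps the target fraction of truth variants."""
    ranked = sorted(truth_vqslods, reverse=True)  # best scores first
    n_keep = max(1, math.ceil(target_sensitivity * len(ranked)))
    return ranked[n_keep - 1]


# Made-up VQSLOD scores for ten truth-set variants.
truth_scores = [9.1, 7.4, 5.0, 3.2, 1.1, 0.4, -0.8, -2.5, -4.0, -7.3]

low = tranche_threshold(truth_scores, 0.5)   # strict: keeps the top half
high = tranche_threshold(truth_scores, 0.9)  # lenient: keeps 9 of 10
full = tranche_threshold(truth_scores, 1.0)  # keeps every truth variant
print(low, high, full)
```

A higher target sensitivity pushes the cutoff lower, so more variants pass and more false positives come along with them.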
what is a genotyping model and software that google has released?
DeepVariant
what is the standard genotyping model and software used in humans?
DeepVariant
WGS
whole genome sequencing
what is a whole genome sequence (WGS)?
the sequence library
what is a SNP Chip used for?
to build a (relatively) low cost assay to genotype a large number of individuals
sample size is statistical ________
power
what is deep coverage?
30x
approximately how much does it cost to run a whole genome sequencing on a mammalian genome at 30x coverage?
$1000
what is the difference between WGS and SNP chips related to variants?
- WGS captures “all” variation
- SNP chips have lower number of variants but also a lower cost per sample
_______ is the largest genotype provider in the world
Neogen
what was the purpose of in-silico digest of reference genome with multiple restriction enzymes?
each enzyme cuts every time its recognition sequence is seen; from there you could tally the fragments you would get from each segment, and the repetitive elements (that you are not interested in) could be found because they had the most sections cut out
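A toy sketch of an in-silico digest: scan a sequence for an enzyme's recognition site and cut at every occurrence. EcoRI (site GAATTC, cutting after the first G) is used as the example enzyme; the `digest` function and the short sequence are made up for illustration, and a real digest would run over the whole reference genome with several enzymes.

```python
def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site`.

    `cut_offset` is how many bases into the site the cut falls.
    Returns the list of resulting fragments, in order.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # tail after the last cut
    return fragments


# Toy sequence containing two EcoRI sites (GAATTC).
toy = "AAGAATTCCCGGGAATTCTT"
fragments = digest(toy, site="GAATTC", cut_offset=1)
print(fragments)
print([len(f) for f in fragments])
```

Counting fragments or fragment lengths per region is then a matter of running this over each segment of interest.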
what is Illumina Infinium chemistry?
small beads carry unique barcodes. for each SNP, a 50-mer oligo that flanks it (called a probe) is synthesized. these SNP-specific probes are attached to the beads, a chip is created with a microwell for each bead to sit in, and the beads are then deposited on the chip to produce an array
what are the basics of the beads used in Illumina Infinium chemistry?
the small beads have oligos hanging off of them that correspond to the section of the sequence that you want
for each SNP, synthesize a ____ mer oligo that flanks the SNP (probe)
50
what is a probe?
50 mer oligo that flanks the SNP
_____ base probe
50
Infinium I = ____ probe(s)
2
Infinium II = ____ probe(s)
1
what color bead is G and C in Illumina Infinium chemistry?
green
what color bead is A and T in Illumina Infinium chemistry?
red
G/C and A/T = ____________
Infinium I
in Illumina Infinium chemistry, what happens if your variant is As AND Ts? How do you solve this?
it would all show up as red, so you would need to use two probes
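The rule in these cards can be sketched as a small lookup: if both alleles light up in the same color, one probe cannot tell them apart, so two probes (Infinium I) are needed. This follows the simplified color scheme from the cards above (A/T red, G/C green); the function name is hypothetical.

```python
# Simplified color scheme from the cards: A and T read red, G and C read green.
COLOR = {"A": "red", "T": "red", "G": "green", "C": "green"}


def probes_needed(allele_1, allele_2):
    """Return 2 (Infinium I) if both alleles share a color, else 1 (Infinium II)."""
    same_color = COLOR[allele_1] == COLOR[allele_2]
    return 2 if same_color else 1


at_snp = probes_needed("A", "T")  # both red, indistinguishable by color
gc_snp = probes_needed("G", "C")  # both green, same problem
ag_snp = probes_needed("A", "G")  # red vs green, one probe is enough
print(at_snp, gc_snp, ag_snp)
```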
why do we cluster SNPs?
so we can determine genotype
is this a well clustered SNP or a poorly clustered SNP?
well clustered SNP
are these examples of a well clustered SNP or a poorly clustered SNP?
poorly clustered SNP
what are these clustered SNPs an example of?
improperly clustered SNPs, the automated system just got it wrong, so you should manually fix it
SNP chips are really accurate but things can go wrong. remember this when making decisions based on chip genotypes for any single SNP specifically. Why?
it depends on what error rate you are comfortable with, your willingness to be wrong
call rate per SNP
best indicator of genotype quality
call rate per individual
best indicator of sample DNA quality
2 key metrics for looking for errors:
- call rate per SNP
- call rate per individual
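Both metrics are the fraction of genotypes that were successfully called, computed along different axes of the same table. A minimal sketch, assuming rows are individuals, columns are SNPs, and `None` marks a no-call (the genotype values are made up):

```python
def call_rate_per_individual(genotypes):
    """Fraction of SNPs called for each individual (one value per row)."""
    return [sum(g is not None for g in row) / len(row) for row in genotypes]


def call_rate_per_snp(genotypes):
    """Fraction of individuals called at each SNP (one value per column)."""
    columns = list(zip(*genotypes))
    return [sum(g is not None for g in col) / len(col) for col in columns]


# Rows = individuals, columns = SNPs, None = no call (made-up data).
genotypes = [
    ["AA", "AB", "BB", "AA"],
    ["AA", None, "BB", "AB"],
    [None, None, "AB", None],  # likely a poor DNA sample
]

per_individual = call_rate_per_individual(genotypes)
per_snp = call_rate_per_snp(genotypes)
print(per_individual)
print(per_snp)
```

A low value in the first list points at a bad sample; a low value in the second points at a SNP that genotypes poorly across samples.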
why might an individual have a low call rate?
poor DNA (for example taking from a live cow vs a cow that has been dead for 1000 years)
what does it mean to impute?
taking missing data from data that you have already observed and filling in the gaps
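A toy sketch of the idea: a sample genotyped at low density has gaps, and those gaps are filled from a fully observed reference that best matches the positions the sample does have. This is a deliberately simplified illustration, not how production imputation software works; the function name, sample, and reference panel are all made up.

```python
def impute(sample, reference_panel):
    """Fill None entries in `sample` from the best-matching reference haplotype."""

    def matches(ref):
        # Count agreements at positions where the sample was observed.
        return sum(s == r for s, r in zip(sample, ref) if s is not None)

    best = max(reference_panel, key=matches)
    return [r if s is None else s for s, r in zip(sample, best)]


# Low-density sample: only three of six positions observed.
sample = ["A", None, None, "G", None, "T"]

# Fully observed reference haplotypes (made up).
panel = [
    list("ACCGTT"),
    list("ATCATC"),
    list("GCCGTA"),
]

filled = impute(sample, panel)
print(filled)
```

Observed positions are never overwritten; only the gaps are filled in.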
what do the signs on the right symbolize?
whole genome sequence: it is high density and all the variants have been found
what do the signs on the left symbolize?
SNP Chip: low density