Week 7 (Variant Calling) Flashcards
after aligning reads to the reference genome, we will want to identify ___________
variants
what is a variant?
differences between our sample and the reference
what are the types of variants we are exploring in this class?
- SNPs
- Indels
SNPs
single nucleotide polymorphisms
Indels
insertions/deletions (small <50 bp)
unaligned sequence data file formats
- FASTA
- FASTQ
Aligned sequence data file formats
- SAM
- BAM
- BAI
- CRAM
SAM
sequence alignment map
BAM
binary (compressed) version of SAM
BAI
index for BAM
variant calls (SNPs and Indels)
- VCF
- BCF
VCF
variant call format
BCF
binary (compressed) version of VCF
mandatory fields of SAM
- what [4]
- where [5]
- how good (or bad) [2]
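A minimal sketch of splitting one SAM alignment line into its 11 mandatory columns. The column names come from the SAM specification; the example record is made up, and the function name `parse_sam_line` is hypothetical. MAPQ and QUAL are the two "how good (or bad)" columns.

```python
# The 11 mandatory SAM columns, in file order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Split a tab-separated SAM alignment line into a dict of named fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    # These columns are integers in the SAM format.
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    # Anything after column 11 is an optional tag.
    record["TAGS"] = cols[11:]
    return record

# Made-up record for illustration only.
example = "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["MAPQ"])
```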
aligners output a _____ file that is then compressed to a ______ file
SAM; BAM
what is MAPQ?
MAPQ (mapping quality) is the SAM field that tells you how confidently the read was mapped to the reference. The lower the MAPQ, the more likely it is that the read was placed in the wrong position.
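The SAM specification defines MAPQ on a Phred scale: MAPQ = -10 * log10(probability the mapping position is wrong), rounded to the nearest integer. The sketch below only converts between the two scales; how a given aligner arrives at the probability differs from tool to tool, and the function names are hypothetical.

```python
import math

def mapq_to_error_prob(mapq):
    """Probability that the mapping position is wrong for a given MAPQ."""
    return 10 ** (-mapq / 10)

def error_prob_to_mapq(prob_wrong):
    """Phred-scaled mapping quality for a given probability of a wrong mapping."""
    return round(-10 * math.log10(prob_wrong))

for mapq in (0, 10, 20, 60):
    print(mapq, mapq_to_error_prob(mapq))
```

A MAPQ of 0 corresponds to a probability of 1, which is why reads in repeats (where several placements are equally good) end up with very low MAPQ.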
what parts of the genome often have low MAPQ?
repetitive regions and low-complexity (simple sequence) regions
do we prefer systematic errors or random errors? why?
- we prefer random error
- random errors are scattered by chance, so they can be overcome with enough data. systematic errors are difficult to find and can bias all of the data.
________ errors are preferred. ________ errors cause issues with downstream analysis.
random; systematic
if an error occurs during PCR, what happens to your data?
the error is copied into every read made from that molecule and carried all the way through your analysis
two types of duplicate reads
- PCR
- optical
what is an optical duplicate read?
a single cluster that is read as two separate clusters because of imaging (older flow cells) or ExAmp chemistry (patterned flow cells). more prevalent on patterned flow cells.
how are duplicates identified?
based on the starting positions of the reads: reads that align to the same start position are treated as copies of the same molecule
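A simplified sketch of position-based duplicate marking: reads that share chromosome, start position, and strand are grouped, the read with the highest summed base quality is kept, and the rest are marked as duplicates. The read dictionaries and the `qual_sum` key are hypothetical; production tools such as Picard MarkDuplicates also account for clipping and for both mates of a pair.

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    reads: list of dicts with keys name, chrom, start, strand, qual_sum.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["start"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the best-quality read in each group; flag the others.
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates

# Made-up reads for illustration only.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "strand": "+", "qual_sum": 300},
    {"name": "r2", "chrom": "chr1", "start": 100, "strand": "+", "qual_sum": 350},
    {"name": "r3", "chrom": "chr1", "start": 100, "strand": "-", "qual_sum": 200},
    {"name": "r4", "chrom": "chr1", "start": 250, "strand": "+", "qual_sum": 100},
]
print(mark_duplicates(reads))
```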
what is a downside to using the patterned flow cell?
you end up with far more optical duplicates
BQSR
base quality score recalibration
almost all sequence analyses use the _____ _______
base quality
quality values reported by the sequencer are __________ of the error rate
estimates
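To make "estimate of the error rate" concrete, here is a small sketch that decodes a FASTQ quality string into Phred scores and the error probabilities they imply (Q = -10 * log10(p)). It assumes the Phred+33 ASCII encoding; the function name is hypothetical.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string into (Q, estimated error probability) pairs."""
    decoded = []
    for char in quality_string:
        q = ord(char) - 33          # Phred+33 offset (assumed encoding)
        decoded.append((q, 10 ** (-q / 10)))
    return decoded

for q, p in decode_phred33("I5!"):
    print(q, p)
```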
BQSR uses a set of known variant positions (dbSNP) and considers all other variants from sequence data as _________
errors
how does base quality score recalibration work?
BQSR takes a set of positions in the genome that are known to be variable (e.g. dbSNP). At those positions we expect an individual may truly differ from the reference, so they are skipped. At every other position, a difference between our data and the reference is assumed to be a sequencing error, and the observed error rate is used to adjust the reported base qualities.
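The core idea can be sketched as follows: count mismatches to the reference outside the known variant sites and turn that observed error rate into an empirical Phred quality. This is a toy version; real BQSR also groups bases by covariates such as reported quality, machine cycle, and sequence context. The `+1`/`+2` smoothing is an assumption added here only to avoid taking the log of zero.

```python
import math

def empirical_quality(observations, known_sites):
    """Empirical Phred quality from mismatch counts outside known variant sites.

    observations: iterable of (position, is_mismatch) pairs, one per sequenced base.
    known_sites: set of positions known to be variable (skipped entirely).
    """
    total = 0
    mismatches = 0
    for position, is_mismatch in observations:
        if position in known_sites:
            continue  # a mismatch here may be a real variant, not an error
        total += 1
        mismatches += int(is_mismatch)
    error_rate = (mismatches + 1) / (total + 2)  # smoothed (assumption)
    return -10 * math.log10(error_rate)

# Made-up data: 1000 bases, mismatches at positions 0..10, positions 0 and 1 are known sites.
known = {0, 1}
obs = [(pos, pos <= 10) for pos in range(1000)]
print(empirical_quality(obs, known))
```

If the sequencer reported Q30 for these bases but the empirical value is about Q20, recalibration would lower the reported qualities accordingly.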
when comparing the reported quality with the empirical (observed) quality of the data, we adjust the reported qualities to match the empirical ones. For which sequencer may we not have to recalibrate the data?
element because it is so accurate
when comparing the reference genome to the DNA of the animal that the reference was made from, what would the expectation be?
there would be little to no difference (it may show some heterozygosity)
what are two common variant callers?
- GATK HaplotypeCaller
- DeepVariant (more common)
two approaches to variant calling
- single sample
- joint
_____ calling is always better
joint
why is joint calling better than single sample calling?
It improves variant detection by using data from multiple samples to enhance sensitivity, especially for rare variants. This approach increases accuracy, reduces false positives, and strengthens analysis by leveraging collective sample information, making it particularly useful for large studies.
key benefits to HaplotypeCaller
- call SNPs and indels simultaneously
- local de-novo assembly
- output gVCF for cohort (joint) calling
4 stages for HaplotypeCaller
- define active regions
- local assembly to build haplotypes
- estimate likelihoods of the haplotypes given the data
- assign sample genotypes
HaplotypeCaller: define active regions
regions of the genome selected based on the presence of significant evidence for variation
as an example, if a kmer was 7 long, how many of the first kmer are TATGAAA vs TATGCAA?
- 4 Kmers with TATGAAA
- 4 Kmers with TATGCAA
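Counting k-mers like this is easy to reproduce in code. The reads below are made up (they are not the reads from the lecture example) and are chosen so that four reads start with each 7-mer.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Hypothetical reads: four support each of two haplotypes.
reads = ["TATGAAAC"] * 4 + ["TATGCAAC"] * 4
kmer_counts = count_kmers(reads, 7)
print(kmer_counts["TATGAAA"], kmer_counts["TATGCAA"])
```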
HaplotypeCaller: determine haplotypes by local assembly of the active region
for each region, build a De Bruijn-like graph to reassemble the active region, identify the haplotypes present, and then realign each haplotype against the reference using the Smith-Waterman algorithm
why would you use the Smith-Waterman algorithm to align the active haplotypes?
it is guaranteed to find the optimal local alignment
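A minimal sketch of the Smith-Waterman dynamic-programming recurrence, returning only the best local alignment score. The scoring values (match 2, mismatch -1, linear gap -2) are illustrative assumptions, not the parameters any particular variant caller uses.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Because every cell considers all ways of extending or restarting an alignment, the maximum over the table is the optimal local score under the chosen scoring scheme.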
haplotype
a group of alleles (different versions of a gene) located close together on a single chromosome that are inherited together from one parent, essentially a set of DNA variations that tend to be passed down as a unit due to their proximity on the chromosome
gVCF output file allows _____ calling
cohort
what are the 8 mandatory fields in a VCF file?
- what: ID/REF/ALT/INFO
- where: CHROM/POS
- how good: QUAL/FILTER
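A small sketch of reading the 8 fixed VCF columns from one data line and labelling the variant as a SNP or an indel from the REF and ALT lengths. The example line is made up and the function names are hypothetical.

```python
# The 8 fixed VCF columns, in file order.
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Split a tab-separated VCF data line into a dict of the 8 fixed fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIELDS, cols[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")   # several ALT alleles are comma-separated
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    return record

def variant_type(ref, alt):
    """Classify one REF/ALT pair by allele length."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "indel"
    return "other"

# Made-up record for illustration only.
rec = parse_vcf_line("chr1\t12345\t.\tA\tG\t50.0\tPASS\tDP=30")
print(rec["CHROM"], rec["POS"], variant_type(rec["REF"], rec["ALT"][0]))
```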
hard filtering
based on multiple metrics, used to filter variants
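As a sketch of what hard filtering looks like, each variant's annotations are compared with fixed thresholds and the variant fails if any threshold is crossed. QD, FS, and MQ are GATK annotation names; the cut-offs below are illustrative starting points for SNPs and should be treated as assumptions to tune for a given dataset.

```python
# Each entry maps an annotation name to a test that returns True when the variant FAILS.
SNP_FILTERS = {
    "QD": lambda value: value < 2.0,    # quality by depth too low
    "FS": lambda value: value > 60.0,   # strand bias too high
    "MQ": lambda value: value < 40.0,   # mapping quality too low
}

def hard_filter(info):
    """Return the list of failed filter names, or ["PASS"] if none failed."""
    failed = [name for name, fails in SNP_FILTERS.items()
              if name in info and fails(info[name])]
    return failed or ["PASS"]

print(hard_filter({"QD": 1.5, "FS": 10.0, "MQ": 60.0}))
print(hard_filter({"QD": 20.0, "FS": 1.0, "MQ": 60.0}))
```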
sensitivity
(true positive rate) measures the proportion of positives that are correctly identified (the proportion of those who are affected and who were correctly identified as affected)
specificity
(true negative rate) measures the proportion of negatives that are correctly identified (the proportion that are not affected and identified as not being affected)
what is a type 1 error?
false positive
what is a type 2 error?
false negative
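Sensitivity and specificity follow directly from the four outcome counts, with false positives being the type 1 errors and false negatives the type 2 errors. The counts in the example are made up.

```python
def sensitivity(true_pos, false_neg):
    """True positive rate: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """True negative rate: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)

# Made-up counts: 90 real variants called, 10 missed (type 2 errors),
# 950 non-variant sites correctly left uncalled, 50 called wrongly (type 1 errors).
print(sensitivity(true_pos=90, false_neg=10))
print(specificity(true_neg=950, false_pos=50))
```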
VQSLOD
variant quality score log-ODds
the purpose of this new score is to enable variant filtering in a new way that allows analysts to balance:
- sensitivity
- specificity
variant filtering: sensitivity
trying to discover all the real variants
variant filtering: specificity
trying to limit the false positives that creep in when filters get too lenient
variant quality score recalibration uses machine learning algorithms to learn from each dataset what is the annotation profile of ______ variants vs _______ variants
good; bad
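A toy sketch of a log-odds score: given one model of how an annotation looks for good variants and another for bad variants, the score is the log of the ratio of the two likelihoods. Here each model is a single one-dimensional Gaussian with made-up parameters; the actual recalibration fits its models from training data over several annotations at once, so this only shows the shape of the idea.

```python
import math

def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)

def log_odds(x, good=(20.0, 5.0), bad=(5.0, 5.0)):
    """log( P(x | good model) / P(x | bad model) ); model parameters are hypothetical."""
    return gaussian_log_pdf(x, *good) - gaussian_log_pdf(x, *bad)

def passes(x, threshold=0.0):
    """Raising the threshold favors specificity; lowering it favors sensitivity."""
    return log_odds(x) >= threshold

for value in (5.0, 12.5, 20.0):
    print(value, round(log_odds(value), 3), passes(value))
```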