Week 7 (Variant Calling) Flashcards by Emma Sellers

after aligning reads to the reference genome, we will want to identify ___________

variants

what is a variant?

differences between our sample and the reference

what are the types of variants we are exploring in this class?

SNPs
Indels

SNPs

single nucleotide polymorphisms

Indels

insertions/deletions (small, <50 bp)
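
As a rough illustration of the SNP/indel distinction, the sketch below labels a variant from its reference and alternate alleles. The function name `classify_variant` is hypothetical, and the sketch assumes both alleles are plain base strings (no symbolic alleles).

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a REF/ALT allele pair as 'SNP', 'insertion', 'deletion', or 'other'."""
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"  # one base swapped for another
    if len(alt) > len(ref):
        return "insertion"  # sample has extra bases relative to the reference
    if len(alt) < len(ref):
        return "deletion"  # sample is missing bases relative to the reference
    return "other"  # e.g. an equal-length multi-base substitution


print(classify_variant("A", "G"))
print(classify_variant("A", "ATG"))
print(classify_variant("ATG", "A"))
```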

unaligned sequence data file formats

FASTA
FASTQ

Aligned sequence data file formats

SAM
BAM
BAI
CRAM

SAM

sequence alignment map

BAM

binary (compressed) version of SAM

BAI

index for BAM

variant call (SNPs and Indels) file formats

VCF
BCF

VCF

variant call format

BCF

binary (compressed) version of VCF

mandatory fields of SAM

what [4]
where [5]
how good (or bad) [2]
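
The 4 + 5 + 2 counts above add up to the 11 mandatory columns of a SAM alignment line. A minimal sketch of splitting one line into those columns follows; the column names come from the SAM specification, while `parse_sam_line` and the example read are made up for illustration.

```python
# The 11 mandatory SAM columns, in file order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> dict:
    """Split one tab-separated SAM alignment line into its mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional fields, kept as raw strings
    return record


# Hypothetical read, for illustration only.
example = "read1\t99\tchr1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII"
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["MAPQ"])
```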

aligners output a _____ file that is then compressed to a ______ file

SAM; BAM

what is MAPQ?

in the SAM file, MAPQ (mapping quality) tells you how confidently a read was mapped to the reference. the lower the MAPQ, the less reliable the mapping.
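
To make the scale concrete: the SAM specification defines MAPQ as -10 * log10 of the probability that the mapping position is wrong, rounded to the nearest integer. The helper names below are hypothetical; this is only a sketch of that conversion.

```python
import math


def mapq_to_error_probability(mapq: int) -> float:
    """Probability that the read is mapped to the wrong position."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong: float) -> int:
    """Phred-scale a mapping error probability, rounded to the nearest integer."""
    return round(-10 * math.log10(p_wrong))


for mapq in (0, 10, 20, 60):
    print(mapq, mapq_to_error_probability(mapq))
```

A MAPQ of 0 corresponds to an error probability of 1, which is why reads in repetitive regions, where several placements are equally good, score so poorly.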

what parts of the genome often have low MAPQ?

repetitive regions and simple (low-complexity) regions

do we prefer systematic errors or random errors? why?

we prefer random error
random errors are scattered unpredictably, so they can be overcome (for example, with more coverage). systematic errors recur in the same way, are difficult to find, and can mess up all of the data.

________ errors are preferred. ________ errors cause issues with downstream analysis.

random; systematic

what happens to your data if errors occur during PCR?

the error is carried through every downstream step of your analysis

two types of duplicate reads

PCR
optical

what is an optical duplicate read?

a duplicate generated on the flow cell during sequencing rather than during library-prep PCR. it comes from imaging on older flow cells and from ExAmp chemistry on patterned flow cells, where it is more prevalent

how are duplicates identified?

based on the starting positions of the read
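
A minimal sketch of position-based duplicate marking, assuming each read is a small dictionary with hypothetical keys `name`, `chrom`, `start`, `strand` and `qual_sum`. Real tools use more information (for example both mates of a pair and clipping-adjusted positions), so this is only the core idea: reads that start at the same place on the same strand are grouped, the best one is kept, and the rest are flagged.

```python
from collections import defaultdict


def mark_duplicates(reads: list) -> set:
    """Return the names of reads flagged as duplicates."""
    groups = defaultdict(list)
    for read in reads:
        # Reads sharing chromosome, start position and strand form one group.
        groups[(read["chrom"], read["start"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality; flag the others.
        best = max(group, key=lambda r: r["qual_sum"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


# Hypothetical reads, for illustration only.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "strand": "+", "qual_sum": 300},
    {"name": "r2", "chrom": "chr1", "start": 100, "strand": "+", "qual_sum": 280},
    {"name": "r3", "chrom": "chr1", "start": 100, "strand": "-", "qual_sum": 290},
    {"name": "r4", "chrom": "chr1", "start": 150, "strand": "+", "qual_sum": 310},
]
print(mark_duplicates(reads))
```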

what is a downside to using the patterned flow cell?

you end up with far more optical duplicates

BQSR

base quality score recalibration

almost all sequence analyses use the _____ _______

base quality

quality values reported by the sequencer are __________ of error rate

estimates

BQSR uses a set of known variant positions (dbSNP) and considers all other variants from sequence data as _________

errors

how does base quality score recalibration work?

it uses a set of positions in the genome that are known to be variable (e.g. dbSNP). at those positions we expect an individual may genuinely differ from the reference, so they are skipped. everywhere else, a difference between our data and the reference is assumed to be a sequencing error. the observed error rates are then used to adjust the reported base qualities.
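
The idea can be sketched with a toy recalibration table. The sketch assumes each observation is a tuple of (reported quality, genome position, whether the base mismatches the reference); the function name and the simple +1/+2 smoothing are assumptions of this sketch, and real recalibration tools also condition on further covariates such as read group, machine cycle and sequence context.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_sites):
    """Map each reported quality score to the quality actually observed."""
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, total bases]
    for reported_q, position, is_mismatch in observations:
        if position in known_sites:
            continue  # a difference here may be a real variant, so skip it
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1

    table = {}
    for reported_q, (mismatches, total) in counts.items():
        # Smoothed error rate (an assumption of this sketch) avoids log10(0).
        error_rate = (mismatches + 1) / (total + 2)
        table[reported_q] = -10 * math.log10(error_rate)
    return table


# Toy data: 98 bases reported as Q30, 9 of which mismatch the reference,
# plus one mismatch at a known variant site that is ignored.
observations = [(30, i, i < 9) for i in range(98)] + [(30, 500, True)]
print(empirical_qualities(observations, known_sites={500}))
```

In this toy data the bases reported as Q30 behave like roughly Q10 bases, so their scores would be adjusted downward.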

when we compare the reported quality with the empirical (observed) quality of the data, we adjust the reported scores to fit the empirical distribution. For which sequencer might we not have to recalibrate the data?

Element, because it is so accurate

when comparing the reference genome to the DNA of the animal that the reference was made from, what would the expectation be?

there would be none to very little difference (maybe show some heterozygosity)

what are two common variant callers?

- GATK HaplotypeCaller
- DeepVariant (more common)

two approaches to variant calling

- single sample
- joint

_____ calling is always better

joint

why is joint calling better than single sample calling?

It improves variant detection by using data from multiple samples to enhance sensitivity, especially for rare variants. This approach increases accuracy, reduces false positives, and strengthens analysis by leveraging collective sample information, making it particularly useful for large studies.

key benefits to HaplotypeCaller

- call SNPs and indels simultaneously
- local de-novo assembly
- output gVCF for cohort (joint) calling

4 stages for HaplotypeCaller

1. define active regions
2. local assembly to build haplotypes
3. estimate likelihoods of the haplotypes given the data
4. assign sample genotypes

HaplotypeCaller: define active regions

regions of the genome chosen based on the presence of significant evidence for variation

as an example, if k-mers are 7 bases long, how many of the first k-mers are TATGAAA vs. TATGCAA?

- 4 k-mers with TATGAAA
- 4 k-mers with TATGCAA

HaplotypeCaller: determine haplotypes by local assembly of the active region

for each active region, build a De Bruijn-like graph to reassemble the region and identify the haplotypes present, then realign each haplotype to the reference using the Smith-Waterman algorithm
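
A simplified sketch of the graph-building step: reads are broken into k-mers, and an edge is counted each time one k-mer is followed by the next. The function names and the two short reads are made up for illustration; an actual assembler adds pruning, cycle handling and path scoring on top of this.

```python
from collections import Counter


def kmers(sequence: str, k: int) -> list:
    """All overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(reads: list, k: int) -> Counter:
    """Count edges between consecutive k-mers across all reads."""
    edges = Counter()
    for read in reads:
        words = kmers(read, k)
        for left, right in zip(words, words[1:]):
            edges[(left, right)] += 1
    return edges


# Two hypothetical reads that differ at one base.
graph = build_graph(["ACGTA", "ACGGA"], k=3)
for (left, right), count in graph.items():
    print(left, "->", right, count)
```

The node ACG has two outgoing edges here, one per read. A branch like this is what signals two candidate haplotypes in the region.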

why would you use the Smith-Waterman algorithm to identify the active haplotypes?

it is guaranteed to find the optimal local alignment
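
A compact sketch of Smith-Waterman local alignment follows. The scoring values (match 2, mismatch -1, gap -2) and the linear gap penalty are assumptions chosen for readability, not the parameters of any particular variant caller. The guarantee of optimality comes from the dynamic programming table: every cell holds the best score of any local alignment ending there, and a score is never allowed to drop below zero.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table, remembering where the highest score occurs.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTT", "GGACGGG"))
```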

haplotype

a group of alleles (different versions of a gene) located close together on a single chromosome and inherited together from one parent. because of their proximity on the chromosome, these DNA variations tend to be passed down as a unit

gVCF output file allows _____ calling

cohort

what are the 8 mandatory fields in a VCF file?

- what: ID/REF/ALT/INFO
- where: CHROM/POS
- how good: QUAL/FILTER
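
In the file itself, the eight fixed columns appear in the order CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO. Below is a minimal sketch of reading one data line; `parse_vcf_line` and the example record are hypothetical, and header lines and per-sample columns are not handled.

```python
# The 8 fixed VCF columns, in file order.
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> dict:
    """Split one tab-separated VCF data line into its 8 fixed fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(VCF_FIELDS):
        raise ValueError("a VCF data line needs at least 8 columns")
    record = dict(zip(VCF_FIELDS, columns[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma-separated
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])

    info = {}
    if record["INFO"] != ".":
        for entry in record["INFO"].split(";"):
            key, _, value = entry.partition("=")
            info[key] = value if value else True  # keys without '=' are flags
    record["INFO"] = info
    return record


# Hypothetical record, for illustration only.
example = "chr1\t12345\t.\tA\tG,T\t50.0\tPASS\tDP=30;DB"
print(parse_vcf_line(example))
```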

hard filtering

filtering variants by applying fixed thresholds to multiple annotation metrics
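
A minimal sketch of hard filtering, assuming each variant carries a dictionary of numeric annotations. The annotation names (QD, FS, MQ) and cut-offs are illustrative assumptions for this sketch only; appropriate thresholds depend on the dataset and the variant type.

```python
# Illustrative thresholds only (assumed for this sketch; tune for real data).
# Each entry maps an annotation name to a test that returns True on failure.
FILTERS = {
    "QD": lambda value: value < 2.0,   # low quality relative to depth
    "FS": lambda value: value > 60.0,  # strong strand bias
    "MQ": lambda value: value < 40.0,  # poorly mapped supporting reads
}


def hard_filter(annotations: dict) -> str:
    """Return 'PASS' or a ';'-joined list of the annotations that failed."""
    failed = [name for name, fails in FILTERS.items()
              if name in annotations and fails(annotations[name])]
    return "PASS" if not failed else ";".join(failed)


print(hard_filter({"QD": 10.0, "FS": 5.0, "MQ": 60.0}))
print(hard_filter({"QD": 1.0, "FS": 80.0, "MQ": 60.0}))
```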

sensitivity

(true positive rate) measures the proportion of positives that are correctly identified (the proportion of those who are affected and who were correctly identified as affected)

specificity

(true negative rate) measures the proportion of negatives that are correctly identified (the proportion that are not affected and identified as not being affected)
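
Both rates follow directly from the counts of true/false positives and negatives. The sketch below uses made-up counts; the function names are hypothetical.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """True positive rate: share of real positives that were called."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """True negative rate: share of real negatives that were left uncalled."""
    return true_negatives / (true_negatives + false_positives)


# Made-up counts: 90 real variants found, 10 missed,
# 950 non-variant sites correctly left alone, 50 called by mistake.
print(sensitivity(90, 10))
print(specificity(950, 50))
```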

what is a type 1 error?

false positive

what is a type 2 error?

false negative

VQSLOD

variant quality score log-ODds

the purpose of this new score is to enable variant filtering in a new way that allows analysts to balance:

- sensitivity
- specificity

variant filtering: sensitivity

trying to discover all the real variants

variant filtering: specificity

trying to limit the false positives that creep in when filters get too lenient

variant quality score recalibration uses machine learning algorithms to learn from each dataset the annotation profile of ______ variants vs. _______ variants

good; bad
