Week 7 (Variant Calling) Flashcards

1
Q

after aligning reads to the reference genome, we will want to identify ___________

A

variants

2
Q

what is a variant?

A

differences between our sample and the reference

3
Q

what are the types of variants we are exploring in this class?

A
  • SNPs
  • Indels
4
Q

SNPs

A

single nucleotide polymorphisms

5
Q

Indels

A

insertions/deletions (small <50 bp)

6
Q

unaligned sequence data file formats

A
  • FASTA
  • FASTQ
7
Q

Aligned sequence data file formats

A
  • SAM
  • BAM
  • BAI
  • CRAM
8
Q

SAM

A

sequence alignment map

9
Q

BAM

A

binary (compressed) version of SAM

10
Q

BAI

A

index for BAM

11
Q

variant calls (SNPs and Indels)

A
  • VCF
  • BCF
12
Q

VCF

A

variant call format

13
Q

BCF

A

binary (compressed) version of VCF

14
Q

mandatory fields of SAM

A
  • what [4]
  • where [5]
  • how good (or bad) [2]
15
Q

aligners output a _____ file that is then compressed to a ______ file

A

SAM; BAM

16
Q

what is MAPQ?

A

MAPQ (mapping quality) is the field in the SAM file that tells you how confidently a read was mapped to the reference. the lower the MAPQ, the less confident we are that the read was placed correctly.
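To make the scale concrete: the SAM specification defines MAPQ as a Phred-scaled value, MAPQ = -10 * log10(probability the mapping position is wrong). The sketch below converts in both directions; the function names are hypothetical and chosen only for illustration.

```python
import math


def mapq_to_error_prob(mapq: int) -> float:
    """Probability that the mapping position is wrong, given a MAPQ value."""
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(p_wrong: float) -> int:
    """Phred-scale a mapping error probability and round to the nearest integer."""
    return round(-10 * math.log10(p_wrong))


if __name__ == "__main__":
    for q in (0, 10, 20, 60):
        print(q, mapq_to_error_prob(q))
```

So MAPQ 20 corresponds to a 1-in-100 chance that the read is misplaced, while MAPQ 0 means the placement is essentially a guess, as often happens in repeats.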

17
Q

what parts of the genome often have low MAPQ?

A

repetitive regions and simple (low-complexity) regions

18
Q

do we prefer systematic errors or random errors? why?

A
  • we prefer random error
  • random errors occur by chance and can be overcome by averaging over more reads. systematic errors are difficult to find and can bias all of the data.
19
Q

________ errors are preferred. ________ errors cause issues with downstream analysis.

A

random; systematic

20
Q

if errors occur during PCR, what happens to your data?

A

the error is propagated through every downstream step of your analysis

21
Q

two types of duplicate reads

A
  • PCR
  • optical
22
Q

what is an optical duplicate read?

A

a duplicate read created on the sequencer itself rather than during library prep. more prevalent on patterned flow cells (caused by imaging on older flow cells and by ExAmp chemistry on patterned flow cells)

23
Q

how are duplicates identified?

A

based on the mapped starting position of each read: reads that start at the same position are treated as duplicates
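A minimal sketch of position-based duplicate marking, assuming each read is a dict with hypothetical keys `name`, `chrom`, `pos`, `strand`, and `qual_sum`. Real duplicate-marking tools use more information (mate positions, clipping); this only illustrates the grouping idea.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing chromosome, start position, and strand form one group;
    the read with the highest summed base quality is kept as the
    representative and the rest are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


if __name__ == "__main__":
    example = [
        {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 900},
        {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 950},
        {"name": "r3", "chrom": "chr1", "pos": 250, "strand": "+", "qual_sum": 800},
    ]
    print(mark_duplicates(example))
```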

24
Q

what is a downside to using the patterned flow cell?

A

you end up with far more optical duplicates

25
Q

BQSR

A

base quality score recalibration

26
Q

almost all sequence analyses use the _____ _______

A

base quality

27
Q

quality values reported by sequencer are __________ or error rate

28
Q

BQSR uses a set of known variant positions (dbSNP) and considers all other variants from sequence data as _________

A

errors

29
Q

how does base quality score recalibration work?

A

BQSR uses a set of positions in the genome that are known to be variable. because an individual may genuinely differ from the reference at those positions, they are set aside. at every other position, any difference between our data and the reference is assumed to be a sequencing error, and those mismatches are used to measure how accurate the reported base qualities really are.
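The idea can be sketched as a small table that compares reported quality with empirical quality. This is a simplified illustration only: it bins by reported quality alone and adds a pseudocount, whereas real recalibration tools also model other covariates. The input layout (tuples of position, reported quality, and whether the base matches the reference) is an assumption made for the example.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_sites):
    """Map each reported quality to an empirical Phred quality.

    observations: iterable of (position, reported_q, matches_reference)
    known_sites:  set of positions with known variation (skipped entirely)
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, total]
    for pos, reported_q, matches_ref in observations:
        if pos in known_sites:
            continue  # a mismatch here may be a real variant, not an error
        counts[reported_q][1] += 1
        if not matches_ref:
            counts[reported_q][0] += 1

    table = {}
    for reported_q, (mismatches, total) in counts.items():
        # Pseudocounts keep the rate away from 0 when no mismatches are seen.
        error_rate = (mismatches + 1) / (total + 2)
        table[reported_q] = -10 * math.log10(error_rate)
    return table
```

If bases reported as Q30 turn out to have an empirical quality near Q17, the reported scores were too optimistic and would be adjusted downward.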

30
Q

when comparing the reported quality with the empirical quality (expected quality) of the data, we adjust the reported qualities to fit the empirical distribution. For which sequencer may we not have to recalibrate the data?

A

Element, because it is so accurate

31
Q

when comparing the reference genome to the DNA of the animal that the reference was made from, what would the expectation be?

A

there would be little to no difference (perhaps some heterozygous sites)

32
Q

what are two common variant callers?

A
  • GATK HaplotypeCaller
  • DeepVariant (more common)
33
Q

two approaches to variant calling

A
  • single sample
  • joint
34
Q

_____ calling is always better

A

joint

35
Q

why is joint calling better than single sample calling?

A

It improves variant detection by using data from multiple samples to enhance sensitivity, especially for rare variants. This approach increases accuracy, reduces false positives, and strengthens analysis by leveraging collective sample information, making it particularly useful for large studies.

36
Q

key benefits to HaplotypeCaller

A
  • call SNPs and indels simultaneously
  • local de-novo assembly
  • output gVCF for cohort (joint) calling
37
Q

4 stages for HaplotypeCaller

A
  1. define active regions
  2. local assembly to build haplotypes
  3. estimate likelihoods of the haplotypes given the data
  4. assign sample genotypes
38
Q

HaplotypeCaller: define active regions

A

regions of the genome selected based on the presence of significant evidence for variation

39
Q

as an example, if a kmer was 7 long, how many of the first kmer are TATGAAA vs TATGCAA?

A
  • 4 Kmers with TATGAAA
  • 4 Kmers with TATGCAA
40
Q

HaplotypeCaller: determine haplotypes by local assembly of the active region

A

for each active region, build a De Bruijn-like graph to reassemble the region, identify the haplotypes present, and then realign each haplotype against the reference using the Smith-Waterman algorithm
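A toy sketch of the graph-building step, using the two sequences from the k-mer card above (TATGAAA and TATGCAA). Each k-mer contributes an edge from its prefix to its suffix, and a node with two outgoing edges marks a point where candidate haplotypes diverge. The helper names are hypothetical, and a real assembler does much more (pruning, cycle handling).

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_edges(reads, k):
    """Count edges (k-1 prefix -> k-1 suffix) over every k-mer in the reads."""
    edges = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


if __name__ == "__main__":
    graph = de_bruijn_edges(["TATGAAA", "TATGCAA"], k=4)
    for (src, dst), count in sorted(graph.items()):
        print(src, "->", dst, count)
```

With k = 4, both sequences share the edge TAT -> ATG, and the node ATG then branches to TGA and TGC, which is the A/C difference between the two sequences.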

41
Q

why would you use the Smith-Waterman algorithm to realign the haplotypes?

A

it is guaranteed to find the optimal local alignment
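The guarantee comes from dynamic programming: every cell of the scoring matrix holds the best score of any local alignment ending there, so the maximum cell is the optimum. Below is a minimal score-only sketch; the scoring values and the simple linear gap penalty are illustrative assumptions, not the parameters any particular variant caller uses.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere,
            # which is what makes the alignment local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGTATGAAAGG", "TATGAAA"))
```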

42
Q

haplotype

A

a group of alleles (different versions of a gene) located close together on a single chromosome and inherited together from one parent; essentially, a set of DNA variations that tend to be passed down as a unit because of their proximity on the chromosome

43
Q

key benefits of HaplotypeCaller

A
  • call SNPs and indels simultaneously
  • local de-novo assembly
  • output of gVCF for cohort (joint) calling
44
Q

4 stages of HaplotypeCaller

A
  1. define active regions
  2. local assembly to build haplotypes
  3. estimate likelihoods of the haplotypes given the data
  4. assign sample genotypes
45
Q

gVCF output file allows _____ calling

A

joint (cohort)

46
Q

VCF

A

variant call format

47
Q

what are the 8 mandatory fields in a VCF file?

A
  • what: ID/REF/ALT/INFO
  • where: CHROM/POS
  • how good: QUAL/FILTER
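In the file itself, the eight fixed columns appear in the order CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, separated by tabs. A small parsing sketch follows; the example record is made up for illustration.

```python
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed fields."""
    values = line.rstrip("\n").split("\t")[:8]
    record = dict(zip(VCF_FIELDS, values))
    record["POS"] = int(record["POS"])  # positions are 1-based integers
    return record


if __name__ == "__main__":
    # Hypothetical record, not taken from any real dataset.
    example = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30"
    print(parse_vcf_line(example))
```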
48
Q

hard filtering

A

filtering variants by applying fixed thresholds to multiple annotation metrics
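A sketch of what threshold-based filtering looks like in practice. The annotation names (QD, FS, MQ) and cutoffs below are assumptions chosen for illustration and should not be read as the thresholds used in class.

```python
# Illustrative thresholds: a variant fails if the annotation is on the
# wrong side of its cutoff. These values are assumptions for the example.
THRESHOLDS = {
    "QD": ("<", 2.0),   # quality by depth too low
    "FS": (">", 60.0),  # strand bias too high
    "MQ": ("<", 40.0),  # mapping quality too low
}


def hard_filter(annotations):
    """Return the names of failed filters, or ["PASS"] if none fail."""
    failed = []
    for name, (op, cutoff) in THRESHOLDS.items():
        value = annotations.get(name)
        if value is None:
            continue  # a missing annotation is not evaluated
        if (op == "<" and value < cutoff) or (op == ">" and value > cutoff):
            failed.append(name)
    return failed or ["PASS"]


if __name__ == "__main__":
    print(hard_filter({"QD": 1.5, "FS": 10.0, "MQ": 60.0}))
```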

49
Q

sensitivity

A

(true positive rate) measures the proportion of positives that are correctly identified (the proportion of those who are affected and who were correctly identified as affected)

50
Q

specificity

A

(true negative rate) measures the proportion of negatives that are correctly identified (the proportion that are not affected and identified as not being affected)
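Both rates follow directly from the counts of true and false calls: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch, with made-up counts in the usage example:

```python
def sensitivity(true_pos, false_neg):
    """True positive rate: share of real positives that were called."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negative rate: share of real negatives that were not called."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(sensitivity(true_pos=90, false_neg=10))
    print(specificity(true_neg=95, false_pos=5))
```

For variant calling, a false negative is a real variant that was missed, and a false positive is a called variant that is not real.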

51
Q

what is a type 1 error?

A

false positive

52
Q

what is a type 2 error?

A

false negative

53
Q

VQSLOD

A

variant quality score log-ODds

54
Q

the purpose of the VQSLOD score is to enable variant filtering in a way that allows analysts to balance:

A
  • sensitivity
  • specificity
55
Q

variant filtering: sensitivity

A

trying to discover all the real variants

56
Q

variant filtering: specificity

A

trying to limit the false positives that creep in when filters get too lenient

57
Q

variant quality score recalibration uses machine learning algorithms to learn from each dataset what is the annotation profile of ______ variants vs _______ variants

A

good variants vs bad variants