VCF Flashcards

1
Q

Question 1

What does each field correspond to?

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878

chr5 130076433 . G A 62.74 PASS later GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0

A
2
Q

Question 2

Write a sentence that describes the variant using the CHROM, POS, REF, and ALT fields of the table.

A

The variant is a transition from G to A at position 130,076,433 of chromosome 5.

3
Q

Question 3

VCF

What does the ID field mean?

A

The ID field can be used to specify the accession number of a variant in databases such as dbSNP, and might contain an entry such as rs25458. This field is not populated by default by GATK, and in our case, a period (.) is used to indicate that the field is empty.

4
Q

Question 4

What does the QUAL field mean?

A

The QUAL field is a Phred-scaled quality score for the assertion made in ALT. If the ALT field contains a variant call, then QUAL reflects the estimated probability that the variant call is wrong:

QUAL = −10 log10 p(no variant)

where p(no variant) is the probability that there is no variant at this site. If, on the other hand, the ALT field is “.” (i.e., no variant call), then QUAL reflects the estimated probability that the call of no variant is wrong:

QUAL = −10 log10 p(variant)
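
The Phred relationship above can be checked numerically. The following is a minimal sketch, not part of the original cards; the function names `phred` and `error_probability` are hypothetical helpers chosen for illustration.

```python
import math


def phred(p_error: float) -> float:
    """Convert an error probability into a Phred-scaled score."""
    return -10 * math.log10(p_error)


def error_probability(qual: float) -> float:
    """Convert a Phred-scaled score back into an error probability."""
    return 10 ** (-qual / 10)


# QUAL value taken from the chr5:130076433 record in Question 1.
print(error_probability(62.74))
# A 1-in-1000 chance of a wrong call corresponds to a score of about 30.
print(phred(0.001))
```

Each increase of 10 in QUAL corresponds to a tenfold drop in the estimated probability that the call is wrong.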

5
Q

Question 5

VCF

How do you count the number of positions passing all filters with bcftools?

A

bcftools query -f '%FILTER\n' NA12878_filtered.vcf | grep -c PASS

138083

Thus, ~94.6% of all 146,000 positions represented in the VCF file passed all filters.

On the other hand, 7917 positions did not pass the snpFilter. This item represents the result of the hard filter we applied in Section 12.4. Recall that the hard filter we applied would not let a variant pass if it failed to fulfill any one of the criteria used in the filter.

For instance, our VCF file reports a G>A transition at position 22,646,441 of chromosome 22. This variant is marked with snpFilter in the FILTER field, because the value of the RMS Mapping Quality (MQ) was less than 40:

MQ=39.81
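
The same tally can be sketched in plain Python, which also shows how many records carry each FILTER value rather than only PASS. This is an illustrative sketch: `count_filters` is a hypothetical helper, and the two example records are abbreviated to eight columns with placeholder QUAL and INFO values where the cards do not give them.

```python
from collections import Counter


def count_filters(vcf_lines):
    """Count how often each FILTER value (column 7) occurs in VCF data lines."""
    counts = Counter()
    for line in vcf_lines:
        # Skip meta-information lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1
    return counts


# Abbreviated example records; the QUAL of the chr22 record and the INFO
# of the chr5 record are placeholders, not values from the real file.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr5\t130076433\t.\tG\tA\t62.74\tPASS\t.",
    "chr22\t22646441\t.\tG\tA\t50.00\tsnpFilter\tMQ=39.81",
]

for name, n in count_filters(example_lines).items():
    print(name, n)
```

With a real file, the lines could be read with `open("NA12878_filtered.vcf")` and passed to the same function.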

6
Q

Question 6

Describe each field of this line:

##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">

A

ID specifies the name of the item, which is often a two-letter abbreviation. Number specifies the number of values that the field should contain. In this case, the value “A” indicates that the field contains one value per alternate allele.
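
To see the individual keys of such a header line, it can be split into a dictionary. The sketch below assumes the standard GATK-style AC header line (`##INFO=<ID=AC,Number=A,...>`); `parse_info_header` is a hypothetical helper and does not handle every corner case of the VCF specification, such as escaped quotes.

```python
import re


def parse_info_header(line):
    """Split a ##INFO=<...> meta-information line into its key/value pairs."""
    match = re.match(r"##INFO=<(.*)>$", line.strip())
    if match is None:
        raise ValueError("not an INFO header line")
    # Values are either quoted strings (which may contain commas) or bare tokens.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', match.group(1))
    return {key: value.strip('"') for key, value in pairs}


header = (
    '##INFO=<ID=AC,Number=A,Type=Integer,'
    'Description="Allele count in genotypes, for each ALT allele, '
    'in the same order as listed">'
)
for key, value in parse_info_header(header).items():
    print(f"{key}: {value}")
```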

7
Q

Question 7

The value for chr7:1936796C>T is AC=1

What does AC=1 represent for this variant?

A

This reflects the allele count in the genotype for this variant (0/1, heterozygous). Homozygous variants (1/1) have AC=2.

8
Q

Question 8

INFO field

What does allele frequency (AF) represent in the INFO field?

A

Allele Frequency, for each ALT allele

In VCF files, allele frequency refers to the frequency of the allele among the samples being sequenced, not the frequency of the variant in the population.

9
Q

Question 9

INFO field

What does the AN field represent?

A

The AN field shows the total number of alleles in called genotypes. AF is calculated as AC/AN.

For the heterozygous variants (0/1), AC=1, AN=2, and AF=0.500. For the homozygous variants (1/1), AC=2, AN=2, and AF=1.00.

10
Q

Question 10

INFO field

Consider a VCF file that describes three samples. If a data line indicates a variant that was called homozygous (1/1) in sample 1, heterozygous (0/1) in sample 2, and heterozygous (0/1) in sample 3, then AC=…, AN=…, and the allele frequency is calculated as … or AF=…

A

Consider a VCF file that describes three samples. If a data line indicates a variant that was called homozygous (1/1) in sample 1, heterozygous (0/1) in sample 2, and heterozygous (0/1) in sample 3, then AC=4, AN=6, and the allele frequency is calculated as 4/6 or AF=0.667.

AF=AC/AN
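
The arithmetic for AC, AN, and AF can be reproduced from the genotype strings. This is a minimal sketch for a single ALT allele; `allele_stats` is a hypothetical helper, and treating "." as an uncalled allele that is excluded from AN is an assumption of the sketch.

```python
import re


def allele_stats(genotypes, alt_index=1):
    """Return (AC, AN, AF) for one ALT allele across a list of GT strings."""
    ac = 0
    an = 0
    for gt in genotypes:
        # Alleles are separated by "/" (unphased) or "|" (phased).
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # uncalled allele: not counted in AN
            an += 1
            if int(allele) == alt_index:
                ac += 1
    af = ac / an if an else float("nan")
    return ac, an, af


# Three samples: homozygous, heterozygous, heterozygous.
ac, an, af = allele_stats(["1/1", "0/1", "0/1"])
print(f"AC={ac}, AN={an}, AF={af:.3f}")
```

The same function gives AC=1, AN=2 for a single heterozygous sample and AC=2, AN=2 for a single homozygous sample.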

11
Q

Question 11

INFO

What does the DP field in INFO represent?

A

Approximate read depth (some reads may have been filtered).

12
Q

Question 12

INFO field

What does DP=23 mean?

A

DP=23 means that 23 reads covered this position.

13
Q

Question 13

FORMAT field

The FORMAT field has the following items:

GT:AD:DP:GQ:PL

0/1:11,12:23:99:303,0,290

What are the two most common genotypes encountered?

A

The first entry refers to the genotype. The two most commonly encountered genotypes are simple to understand: 0/1 refers to a heterozygous variant, and 1/1 refers to a homozygous variant.

14
Q

Question 14

VCF FORMAT field

What do the | and / separators in a genotype such as 0/1 represent?

The FORMAT field has the following items:

GT:AD:DP:GQ:PL

0/1:11,12:23:99:303,0,290

A

In general, the genotype is encoded as allele values separated by either / or |. The separator / is used if the variants are unphased, and the separator | is used for phased variants.

15
Q

Question 15

VCF FORMAT field

What do the numbers in a genotype such as 0/1 or 1/2 represent?

The FORMAT field has the following items:

GT:AD:DP:GQ:PL

0/1:11,12:23:99:303,0,290

A

The allele values are 0 for the reference allele (i.e., the allele that is shown in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on.

The GT field (not shown above) has the value 1/2. This variant corresponds to the triallelic polymorphism rs11073131. The reference allele is A, but there are two variants that are known to occur in the population, C and T. In the NA12878 sample, the reference base does not occur, but rather both variants occur in heterozygous form.
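
The mapping from allele indices to bases can be made concrete with a short sketch. `decode_genotype` is a hypothetical helper; the REF and ALT values for the triallelic example are taken from the description above (reference A, alternates C and T, in that assumed order).

```python
import re


def decode_genotype(ref, alt, gt):
    """Translate a GT string such as '1/2' into the bases it refers to."""
    # Index 0 is REF; indices 1, 2, ... follow the order of the ALT column.
    alleles = [ref] + alt.split(",")
    return [
        None if index == "." else alleles[int(index)]
        for index in re.split(r"[/|]", gt)
    ]


# Triallelic site: REF=A, ALT=C,T, genotype 1/2 (neither allele is REF).
print(decode_genotype("A", "C,T", "1/2"))
# Simple heterozygous SNV: REF=C, ALT=T, genotype 0/1.
print(decode_genotype("C", "T", "0/1"))
```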

16
Q

Question 16

FORMAT field

What does the allelic depth (AD) field mean?

A

This field indicates the allelic depths (i.e., read counts) for the REF and ALT alleles in the order listed.

As mentioned already in the discussion of the DP field of INFO, there were 11 REF reads and 12 ALT reads, and thus the AD field is 11,12.
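
The FORMAT column and the sample column are paired item by item, which a few lines of Python can illustrate. This is a sketch under the assumption that no value is missing ("."); `parse_sample` is a hypothetical helper, not part of any VCF library.

```python
def parse_sample(format_field, sample_field):
    """Pair the keys of the FORMAT column with the values of a sample column."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    # Comma-separated integer lists.
    for key in ("AD", "PL"):
        if key in values:
            values[key] = [int(x) for x in values[key].split(",")]
    # Single integers.
    for key in ("DP", "GQ"):
        if key in values:
            values[key] = int(values[key])
    return values


sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:11,12:23:99:303,0,290")
ref_reads, alt_reads = sample["AD"]
print(sample["GT"], ref_reads, alt_reads, sample["DP"])
```

In this example the two allelic depths (11 and 12) add up to the sample read depth of 23.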

17
Q

Question 17

FORMAT field

What does the DP field mean?

A

This field indicates the read depth at the indicated position for the current sample.

18
Q

Question 18

What does the GQ field in a VCF file represent?

A

The GQ field is a Phred-scaled quality value that represents the probability of the genotype call being wrong, conditioned on the site being variant.

GQ = −10 log10 p(genotype call wrong | site is variant)

19
Q

Question 19

VCF file

What does the PL field represent?

A

Normalized, Phred-scaled likelihoods for genotypes

The PL field contains normalized, Phred-scaled likelihoods for each of the three genotypes 0/0, 0/1, and 1/1, without priors. For the heterozygous genotype, for example, the likelihood is

L(alignment data | true genotype is 0/1)

20
Q

Question 20

VCF file

The most likely genotype (given in the GT field) is scaled so that
its probability is P =… (… when Phred-scaled), and the other likelihoods
reflect their Phred-scaled likelihoods relative to this most likely
genotype.

A

The most likely genotype (given in the GT field) is scaled so that
its probability is P = 1.0 (0 when Phred-scaled), and the other likelihoods
reflect their Phred-scaled likelihoods relative to this most likely
genotype.

21
Q

Question 21

VCF file

For chr7:1936796C>T, the values of the PL field are 303,0,290

What is the most likely genotype in this case?

A

The most likely genotype is 0/1, and the other two possible genotypes are substantially less likely (10^−30.3 and 10^−29, respectively).
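
The conversion from PL values back to relative likelihoods is a one-line formula, sketched below. The helper names are hypothetical, and the genotype order 0/0, 0/1, 1/1 assumes a biallelic site as in this card.

```python
def pl_to_relative_likelihoods(pl):
    """Convert Phred-scaled PL values to likelihoods relative to the best genotype."""
    return [10 ** (-value / 10) for value in pl]


def most_likely_genotype(pl, genotypes=("0/0", "0/1", "1/1")):
    """Return the genotype whose PL is smallest (0 after normalization)."""
    return genotypes[pl.index(min(pl))]


pl = [303, 0, 290]
print(most_likely_genotype(pl))
for genotype, likelihood in zip(("0/0", "0/1", "1/1"), pl_to_relative_likelihoods(pl)):
    print(genotype, likelihood)
```

A PL of 0 maps to a relative likelihood of 1.0, while 303 and 290 map to roughly 10^−30.3 and 10^−29.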

22
Q

Question 22

Phasing

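What does | mean?

A

The pipe (|) symbol can be used for each sample to indicate that each of the alleles of the genotype in question derives from the same haplotype as the corresponding allele of the preceding genotype of that sample.

0|1: phased; each allele lies on the same haplotype as the corresponding allele of the preceding genotype

0/1: unphased; no assignment to a preceding haplotype is made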

23
Q

Question 23

Phasing

Describe the first line of this table.

A

The sample thus has the haplotypes TTT and AAC.

The first heterozygous genotype (at position 90) has / and not |, because otherwise it would be a continuation of a preceding haplotype.

24
Q

Question 24

Phasing

Describe the second line of this table

A

The second genotype (at position 100) is 0|1 and indicates that the first allele goes with the first allele in the previous genotype 0/1 (thus, the A at position 100 is on the same haplotype as the A at position 90).

25
Q

Question 25

Phasing

Describe the third line of this table

A

The third genotype (at position 110) indicates that the first allele goes with the first allele in the previous genotype. Note here that the order of the “1” and the “0” is reversed in the third genotype, which indicates that the C goes with the A of the previous genotype.
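
The way phased genotypes chain into haplotypes can be sketched in code. The table itself is not reproduced in these cards, so the three records below are a reconstruction inferred from the answers to Questions 23 to 25 (an assumption, not the original data); `build_haplotypes` is a hypothetical helper.

```python
import re


def build_haplotypes(records):
    """Read two haplotypes off a block of records that ends up fully phased.

    The first record opens the block (its separator is "/"); every later
    record uses "|", so its first allele extends haplotype 1 and its
    second allele extends haplotype 2.
    """
    hap1, hap2 = [], []
    for pos, ref, alt, gt in records:
        alleles = [ref] + alt.split(",")
        first, second = (alleles[int(i)] for i in re.split(r"[/|]", gt))
        hap1.append(first)
        hap2.append(second)
    return "".join(hap1), "".join(hap2)


# Reconstructed (assumed) records: (POS, REF, ALT, GT)
records = [
    (90, "A", "T", "0/1"),
    (100, "A", "T", "0|1"),
    (110, "T", "C", "1|0"),
]
print(build_haplotypes(records))
```

Under these assumed records, the first haplotype carries A, A, and C, and the second carries T, T, and T, which matches the haplotypes named in Question 23.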

26
Q

Question 26

Describe this line:

chr5 112275379 . A G 322.77 PASS (…)

A

A variant at position 112,275,379 of chromosome 5, whereby the reference base is an A and the alternate base is a G.

27
Q

Question 27

VCF

Consider now the following deletion of five nucleotides on chromosome 5:

Ref: AGTATAGTTTAG
Alt: AGTA-----TAG

How do you represent this in the VCF file?

A

The deleted nucleotides are located on chromosome 5 at positions 113,024,753 to 113,024,757. Naively, one might expect the VCF file to list TAGTT as REF and “-” as the ALT sequence, but instead, the VCF format specifies that the base preceding the deletion is shown together with the deleted sequence in REF, and only the preceding base is shown as ALT. The position (POS) is that of the preceding base. For the deletion mentioned above, the correct VCF format is as follows:

chr5 113024752 . ATAGTT A 489.73 PASS (…)
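
The anchoring rule for deletions can be written as a small function. This is a sketch: `deletion_to_vcf` is a hypothetical helper, and the start coordinate of the displayed fragment (113,024,749) is derived here from the stated position of the preceding base, not given in the cards.

```python
def deletion_to_vcf(ref_seq, seq_start, del_start, del_len):
    """Return (POS, REF, ALT) for a deletion, anchored on the preceding base.

    ref_seq   -- reference sequence fragment
    seq_start -- 1-based genomic position of ref_seq[0]
    del_start -- 1-based genomic position of the first deleted base
    del_len   -- number of deleted bases
    """
    anchor_pos = del_start - 1          # base immediately 5' of the deletion
    i = anchor_pos - seq_start          # index of the anchor base in ref_seq
    if i < 0:
        raise ValueError("fragment does not include the preceding base")
    ref = ref_seq[i : i + 1 + del_len]  # anchor base plus deleted bases
    alt = ref_seq[i]                    # anchor base only
    return anchor_pos, ref, alt


# Fragment AGTATAGTTTAG, assumed to start at 113,024,749;
# bases 113,024,753 to 113,024,757 (TAGTT) are deleted.
print(deletion_to_vcf("AGTATAGTTTAG", 113024749, 113024753, 5))
```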

28
Q

Question 28

VCF file

Consider the insertion of a single nucleotide between the T at
position 113,040,194 and the A at position 113,040,195.

Ref: GCATGT-AGCC
Alt: GCATGTAAGCC

A

chr5 113040194 . T TA 53.70 PASS (…)

This is represented by the reference base immediately 5’ to the
insertion being specified in the REF field (and the position of this base
in the POS field), and the same base followed by the inserted base(s)
being shown in the ALT field.
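
One way to check such a record is to apply it to the reference fragment and compare the result with the Alt line. The sketch below uses a hypothetical helper `apply_variant`; the fragment start coordinate (113,040,189) is derived from the stated position of the T and is an assumption of the example.

```python
def apply_variant(ref_seq, seq_start, pos, ref, alt):
    """Apply one VCF record (POS, REF, ALT) to a reference fragment.

    seq_start is the 1-based genomic position of ref_seq[0].
    """
    i = pos - seq_start
    if ref_seq[i : i + len(ref)] != ref:
        raise ValueError("REF does not match the reference fragment")
    # Replace the REF bases with the ALT bases.
    return ref_seq[:i] + alt + ref_seq[i + len(ref):]


# Ungapped reference fragment GCATGTAGCC; the T is at 113,040,194.
print(apply_variant("GCATGTAGCC", 113040189, 113040194, "T", "TA"))
```

The same function handles deletions, since a deletion is simply a record whose REF is longer than its ALT.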

29
Q

Question 29

VCF

What is the problem with dbSNP?

A

10% of the indel polymorphisms in dbSNP were represented by multiple, redundant entries that used different notation to refer to the identical sequence change.

30
Q

Question 30

VCF

A