VCF Flashcards
Question 1
What does each field correspond to?
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr5 130076433 . G A 62.74 PASS later GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0

Question 2
Write a sentence describing the variant using the CHROM, POS, REF, and ALT fields of the table.

The variant is a transition from G to A at position 130,076,433 of chromosome 5.
Question 3
VCF
What does the ID field mean?

The ID field can be used to specify the accession number of a variant in databases such as dbSNP, and might contain an entry such as rs25458. This field is not populated by default by GATK, and in our case, a period (.) is used to indicate that the field is empty.
Question 4
What does the QUAL field mean?

The QUAL field is a Phred-scaled quality score for the assertion made in ALT. If the ALT field contains a variant call, then QUAL reflects the estimated probability that the variant call is wrong:
QUAL = -10 log10 p(no variant)
where p(no variant) is the probability that there is no variant at this site. If, on the other hand, the ALT field is “.” (i.e., no variant call), then QUAL reflects the estimated probability that the call of no variant is wrong:
QUAL = -10 log10 p(variant)
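The Phred formulas above can be checked numerically. The following is a minimal sketch; the function names are illustrative and not part of any VCF tool.

```python
import math


def phred_to_error_prob(qual: float) -> float:
    """Invert QUAL = -10 log10 p to recover the error probability p."""
    return 10 ** (-qual / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred-scale an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# QUAL of the example record in Question 1
p_wrong = phred_to_error_prob(62.74)
print(f"QUAL=62.74 corresponds to p(no variant) = {p_wrong:.2e}")
```

A QUAL of 20 therefore corresponds to a 1 in 100 chance that the call is wrong, and a QUAL of 30 to 1 in 1000.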

Question 5
VCF
How do you count the number of positions passing all filters with bcftools?
bcftools query -f '%FILTER\n' NA12878_filtered.vcf | grep -c PASS
138083
Thus, ~94.6% of all 146,000 positions represented in the VCF file passed all filters.
On the other hand, 7917 positions did not pass the snpFilter. This item represents the result of the hard filter we applied in Section 12.4. Recall that the hard filter we applied would not let a variant pass if it failed to fulfill any one of the criteria used in the filter.
For instance, our VCF file reports a G>A transition at position 22,646,441 of chromosome 22. This variant is marked with snpFilter in the FILTER field, because the value of the RMS Mapping Quality (MQ) was less than 40:
MQ=39.81
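The same tally can be sketched in plain Python by reading the seventh column (FILTER) of each data line. This is an illustration only: the function name is hypothetical, and the two toy records use "." for values that are not given in the text.

```python
from collections import Counter


def count_filters(vcf_lines):
    """Count the values of the FILTER column (7th field) over all data lines."""
    counts = Counter()
    for line in vcf_lines:
        # Skip meta-information lines, the column header, and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1
    return counts


# Toy input (assumption): two records modelled on the examples in the text
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr5\t130076433\t.\tG\tA\t62.74\tPASS\t.",
    "chr22\t22646441\t.\tG\tA\t.\tsnpFilter\tMQ=39.81",
]
print(count_filters(toy_vcf))
```

For a real file, the lines could be supplied by iterating over an opened file handle instead of a list.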
Question 6
Describe each field of this line:
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
ID specifies the name of the item, which is often a two-letter abbreviation. Number specifies the number of values that the field should contain. In this case, the value “A” indicates that the field contains one value per alternate allele.
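To see how the keys of such a header line fit together, here is a small parsing sketch. It assumes the usual `##INFO=<key=value,...>` layout; the function name is hypothetical, and the example line (including `Type=Integer`) is an assumption based on the Description shown above.

```python
import re


def parse_info_header(line):
    """Split a ##INFO=<...> meta-information line into a dict of its keys."""
    match = re.match(r"##INFO=<(.*)>$", line.strip())
    if match is None:
        raise ValueError("not an INFO header line")
    # A value is either a double-quoted string (which may contain commas)
    # or a run of characters up to the next comma.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', match.group(1))
    return {key: value.strip('"') for key, value in pairs}


header = ('##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in '
          'genotypes, for each ALT allele, in the same order as listed">')
info = parse_info_header(header)
print(info["ID"], info["Number"], info["Type"])
```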
Question 7
The value for chr7:1936796C>T is AC=1
What does AC=1 represent for this variant?
This reflects the allele count in the genotype for this variant (0/1, heterozygous). Homozygous variants (1/1) have AC=2.
Question 8
INFO field
What does allele frequency (AF) represent in the INFO field?
Allele Frequency, for each ALT allele
In VCF files, allele frequency refers to the allele balance of the alleles observed in the samples being sequenced, not the frequency of the variants in the population.
Question 9
INFO field
What does the AN field represent?
The AN field shows the total number of alleles in called genotypes. AF is calculated as AC/AN.
For the heterozygous variants (0/1), AC=1, AN=2, and AF=0.500. For the homozygous variants (1/1), AC=2, AN=2, and AF=1.00.
Question 10
INFO field
Consider a VCF file that describes three samples. If a data line indicates a variant that was called homozygous (1/1) in sample 1, heterozygous (0/1) in sample 2, and heterozygous (0/1) in sample 3, then AC=…, AN=…, and the allele frequency is calculated as … or AF=…
Consider a VCF file that describes three samples. If a data line indicates a variant that was called homozygous (1/1) in sample 1, heterozygous (0/1) in sample 2, and heterozygous (0/1) in sample 3, then AC=4, AN=6, and the allele frequency is calculated as 4/6 or AF=0.667.
AF=AC/AN
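The three-sample arithmetic can be reproduced with a short sketch. The function name and the `alt_index` parameter are illustrative assumptions; missing alleles (".") are simply left out of AN.

```python
import re


def allele_stats(genotypes, alt_index=1):
    """Return (AC, AN, AF) for one ALT allele given a list of GT strings."""
    alleles = []
    for gt in genotypes:
        # Alleles are separated by "/" (unphased) or "|" (phased)
        for allele in re.split(r"[/|]", gt):
            if allele != ".":  # uncalled alleles do not count towards AN
                alleles.append(int(allele))
    an = len(alleles)
    ac = sum(1 for allele in alleles if allele == alt_index)
    af = ac / an if an else float("nan")
    return ac, an, af


ac, an, af = allele_stats(["1/1", "0/1", "0/1"])
print(f"AC={ac}, AN={an}, AF={af:.3f}")
```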
Question 11
INFO
What does the DP field in INFO represent?
Approximate read depth (some reads may have been filtered).
Question 12
INFO field
What does DP=23 mean?
DP=23 means that 23 reads covered this position.
Question 13
FORMAT field
The FORMAT field has the following items:
GT:AD:DP:GQ:PL
0/1:11,12:23:99:303,0,290
What are the two most common genotypes encountered?
The first entry refers to the genotype. The two most commonly encountered genotypes are simple to understand: 0/1 refers to a heterozygous variant, and 1/1 refers to a homozygous variant.
Question 14
VCF FORMAT field
What does the | or / in 0/1 represent?
The FORMAT field has the following items:
GT:AD:DP:GQ:PL
0/1:11,12:23:99:303,0,290
In general, the genotype is encoded as allele values separated by either / or |. The separator / is used if the variants are unphased, and the separator | is used for phased variants.
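A GT string can be split on its separator to recover both the allele values and the phasing status. The helper below is a minimal sketch with a hypothetical name; it treats "." as a missing allele.

```python
def parse_gt(gt):
    """Return (allele values, phased flag) for a GT string such as 0/1 or 1|0."""
    phased = "|" in gt
    separator = "|" if phased else "/"
    # "." marks an allele that could not be called
    alleles = [None if a == "." else int(a) for a in gt.split(separator)]
    return alleles, phased


print(parse_gt("0/1"))
print(parse_gt("1|0"))
```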
Question 15
VCF FORMAT field
What do the numbers in 0/1 or 1/2 represent?
The FORMAT field has the following items:
GT:AD:DP:GQ:PL
0/1:11,12:23:99:303,0,290
The allele values are 0 for the reference allele (i.e., the allele that is shown in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on.
In another record (not shown above), the GT field has the value 1/2. This variant corresponds to the triallelic polymorphism rs11073131. The reference allele is A, but there are two variants that are known to occur in the population, C and T. In the NA12878 sample, the reference base does not occur, but rather both variants occur in heterozygous form.
Question 16
FORMAT field
What does the allelic depth (AD) field mean?
This field indicates the allelic depths (i.e., read counts) for the REF and ALT alleles in the order listed.
As mentioned already in the discussion of the DP field of INFO, there were 11 REF reads and 12 ALT reads, and thus the AD field is 11,12.
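Because the FORMAT column names the items and the sample column holds their values in the same order, the two can be paired up directly. The sketch below uses a hypothetical helper name and the example values shown in this card.

```python
def parse_sample(format_field, sample_field):
    """Pair the colon-separated FORMAT keys with the sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values))


sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:11,12:23:99:303,0,290")
# AD lists the read counts for REF first, then each ALT allele in order
ref_reads, alt_reads = (int(count) for count in sample["AD"].split(","))
print(sample["GT"], ref_reads, alt_reads, sample["DP"])
```

Here the two allelic depths (11 and 12) add up to the sample's DP of 23, although in general reads that are uninformative for either allele can make the sum of AD smaller than DP.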
Question 17
FORMAT field
What does the DP field mean?
This field indicates the read depth at the indicated position for the current sample.
Question 18
What does the GQ field in a VCF file represent?
The GQ field represents a Phred-scaled quality
value that represents the probability of the genotype call being
wrong conditioned on the site being variant.
GQ = −10 log10 p(genotype call wrong|site is variant)
Question 19
VCF file
What does the PL field represent?
Normalized, Phred-scaled likelihoods for genotypes.
The PL field contains normalized, Phred-scaled likelihoods for each of the
three genotypes 0/0, 0/1, and 1/1, without priors. For example, the likelihood for genotype 0/1 is
L(alignment data | true genotype is 0/1)
Question 20
VCF file
The most likely genotype (given in the GT field) is scaled so that
its probability is P =… (… when Phred-scaled), and the other likelihoods
reflect their Phred-scaled likelihoods relative to this most likely
genotype.
The most likely genotype (given in the GT field) is scaled so that
its probability is P = 1.0 (0 when Phred-scaled), and the other likelihoods
reflect their Phred-scaled likelihoods relative to this most likely
genotype.
Question 21
VCF file
For chr7:1936796C>T, the values of the PL field are 303,0,290
What is the most likely genotype in this case?
The most likely genotype is 0/1, and the other two possible genotypes
are substantially less likely (10^-30.3 and 10^-29, respectively).
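The conversion from PL values back to relative likelihoods is a direct application of the Phred formula. The sketch below assumes a biallelic site with the genotype order 0/0, 0/1, 1/1; the function names are illustrative.

```python
def pl_to_relative_likelihoods(pl_values):
    """Convert normalized Phred-scaled likelihoods to relative likelihoods."""
    return [10 ** (-pl / 10) for pl in pl_values]


def most_likely_genotype(pl_values, genotypes=("0/0", "0/1", "1/1")):
    """The genotype with the smallest PL (0 after normalization) is the call."""
    best = min(range(len(pl_values)), key=lambda i: pl_values[i])
    return genotypes[best]


pl = [303, 0, 290]
print(most_likely_genotype(pl))
print(pl_to_relative_likelihoods(pl))
```

Applied to the PL values 90,6,0 of the record in Question 1, the same function returns 1/1, in agreement with that record's GT entry.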
Question 22
Phasing
What does | mean?
The pipe (|) symbol can be used for each sample to indicate that the genotype is phased, that is, that each of the alleles of the genotype has been assigned to a specific haplotype.
0|1: phased; alleles on the same side of the | in consecutive phased genotypes lie on the same haplotype
0/1: unphased; it is not known which haplotype carries which allele
Question 23
Phasing
Describe the first line of this table.
The sample thus has the haplotypes TTT and AAC.
The first heterozygous genotype (at position 90) in the haplotype has / and not |, because otherwise, it would be a continuation of preceding haplotypes.
Question 24
Phasing
Describe the second line of this table
The second genotype (at position 100) is 0|1 and indicates that the first allele goes with the first allele in the previous genotype 0/1 (thus, the A at position 100 is on the same haplotype as the A at position 90).
Question 25
Phasing
Describe the third line of this table
The third genotype (at position 110) indicates that the first allele goes with the first allele in the previous genotype. Note here that the order of the “1” and the “0” is reversed in the third genotype, which indicates that the C goes with the A of the previous genotype.
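The way the three genotypes combine into two haplotypes can be sketched in code. The table itself is not reproduced in these cards, so the REF and ALT bases below are assumptions chosen to be consistent with the haplotypes TTT and AAC and with the genotypes 0/1, 0|1, and 1|0 described above; the function name is hypothetical.

```python
def build_haplotypes(records):
    """Assemble two haplotypes from (REF, ALT, GT) records of one phase block.

    Following the convention described in the cards, the first record may
    use "/" to mark the start of the block; later records use "|".
    """
    hap_first, hap_second = [], []
    for ref, alt, gt in records:
        alleles = [ref, alt]
        separator = "|" if "|" in gt else "/"
        first, second = (int(a) for a in gt.split(separator))
        hap_first.append(alleles[first])    # allele left of the separator
        hap_second.append(alleles[second])  # allele right of the separator
    return "".join(hap_first), "".join(hap_second)


# Assumed REF/ALT values for positions 90, 100, and 110
records = [("A", "T", "0/1"), ("A", "T", "0|1"), ("T", "C", "1|0")]
print(build_haplotypes(records))
```

Under these assumptions the left-hand alleles give AAC and the right-hand alleles give TTT, matching the two haplotypes named in Question 23.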
Question 26
Describe this line:
chr5 112275379 . A G 322.77 PASS (…)
A variant at position 112,275,379 of chromosome 5, whereby the reference base is an A and the alternate base is a G.
Question 27
VCF
Consider now the following deletion of five nucleotides on chromosome
5:
Ref: AGTATAGTTTAG
Alt: AGTA-----TAG
How do you represent this in the VCF file?
The deleted nucleotides are located on chromosome 5 at positions
113,024,753 to 113,024,757. Naively, one might expect the VCF file to
list TAGTT as REF and “-” as the ALT sequence, but instead, the VCF
format specifies that the base preceding the deletion is shown together
with the deleted sequence in REF, and only the preceding base is shown
as ALT. The position (POS) is that of the preceding base.
For the deletion mentioned above, the correct VCF format is as follows.
chr5 113024752 . ATAGTT A 489.73 PASS (…)
Question 28
VCF file
Consider the insertion of a single nucleotide between the T at
position 113,040,194 and the A at position 113,040,195.
Ref: GCATGT-AGCC
Alt: GCATGTAAGCC
chr5 113040194 . T TA 53.70 PASS (…)
This is represented by the reference base immediately 5’ to the
insertion being specified in the REF field (and the position of this base
in the POS field), and the same base followed by the inserted base(s)
being shown in the ALT field.
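The anchor-base rule for insertions and deletions can be sketched as a conversion from a gapped alignment to POS, REF, and ALT. This is a minimal illustration that assumes a single contiguous gap that does not start at the first column; the function name is hypothetical, and the start positions are derived from the coordinates given in the two examples.

```python
def indel_to_vcf(ref_aln, alt_aln, start_pos):
    """Convert a gapped Ref/Alt alignment into (POS, REF, ALT).

    start_pos is the 1-based genomic position of the first aligned column.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    gap_columns = [i for i, (r, a) in enumerate(zip(ref_aln, alt_aln))
                   if r == "-" or a == "-"]
    if not gap_columns:
        raise ValueError("no indel found")
    first, last = gap_columns[0], gap_columns[-1]
    if first == 0 or gap_columns != list(range(first, last + 1)):
        raise ValueError("expected one contiguous gap preceded by a base")
    anchor = first - 1  # the base immediately 5' of the indel
    ref = ref_aln[anchor:last + 1].replace("-", "")
    alt = alt_aln[anchor:last + 1].replace("-", "")
    # No gaps precede the anchor, so columns map directly onto positions
    return start_pos + anchor, ref, alt


# Deletion example: the anchor base A sits at position 113,024,752
print(indel_to_vcf("AGTATAGTTTAG", "AGTA-----TAG", 113024749))
# Insertion example: the anchor base T sits at position 113,040,194
print(indel_to_vcf("GCATGT-AGCC", "GCATGTAAGCC", 113040189))
```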
Question 29
VCF
What is the problem with dbSNP?
10% of the indel polymorphisms in dbSNP were represented by multiple, redundant entries that used different notations to refer to the identical sequence change.
Question 30
VCF