VCF Flashcards
CHROM POS ID REF ALT QUAL FILTER INFO
Question 1
What does each field correspond to?
FORMAT NA12878
chr5 130076433 . G A 62.74 PASS later GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0
Question 2
Write a sentence that describes the variant using the CHROM, POS, REF, and ALT fields of the table.
The variant is a transition from G to A at position 130,076,433 of chromosome 5.
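A minimal Python sketch of how the sentence above can be assembled from the four fields. The function name `describe_snv` is hypothetical, and the sketch assumes a single-base REF and a single-base ALT (no indels, no multi-allelic sites).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def describe_snv(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Describe a single-nucleotide variant in plain English."""
    if len(ref) != 1 or len(alt) != 1 or ref == alt:
        raise ValueError("this sketch only handles single-base substitutions")
    # A transition stays within one base class (purine or pyrimidine);
    # a transversion switches between the two classes.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    kind = "transition" if same_class else "transversion"
    number = chrom[3:] if chrom.startswith("chr") else chrom
    return (f"The variant is a {kind} from {ref} to {alt} "
            f"at position {pos:,} of chromosome {number}.")

sentence = describe_snv("chr5", 130076433, "G", "A")
print(sentence)
```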
Question 3
VCF
What does the ID field mean?
The ID field can be used to specify the accession number of a variant in databases such as dbSNP, and might contain an entry such as rs25458. This field is not populated by default by GATK, and in our case, a period (.) is used to indicate that the field is empty.
Question 4
What does the QUAL field mean?
The QUAL field is a Phred-scaled quality score for the assertion made in ALT. If the ALT field contains a variant call, then QUAL reflects the estimated probability that the variant call is wrong:
QUAL = -10 log10 p(no variant)
where p(no variant) is the probability that there is no variant at this site. If, on the other hand, the ALT field is “.” (i.e., no variant call), then QUAL reflects the estimated probability that the call of no variant is wrong:
QUAL = -10 log10 p(variant)
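To make the Phred scale concrete, here is a small Python sketch that converts between QUAL and the underlying error probability by inverting the formula above. The function names are hypothetical; the value 62.74 is the QUAL of the chr5 example line.

```python
import math

def phred_to_error_probability(qual: float) -> float:
    """Invert QUAL = -10 * log10(p) to recover p."""
    return 10 ** (-qual / 10)

def error_probability_to_phred(p: float) -> float:
    """Phred-scale a probability: QUAL = -10 * log10(p)."""
    return -10 * math.log10(p)

# Estimated probability that the chr5 variant call (QUAL=62.74) is wrong.
p_wrong = phred_to_error_probability(62.74)
print(f"p(no variant) = {p_wrong:.2e}")
```

Every 10 points of QUAL correspond to a tenfold drop in the error probability, so QUAL=30 corresponds to roughly a 1 in 1,000 chance that the call is wrong.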
Question 5
VCF
How do you count the number of positions passing all filters with bcftools?
bcftools query -f '%FILTER\n' NA12878_filtered.vcf | grep -c PASS
138083
Thus, ~94.6% of all 146,000 positions represented in the VCF file passed all filters.
On the other hand, 7917 positions did not pass the snpFilter. This item represents the result of the hard filter we applied in Section 12.4. Recall that the hard filter we applied would not let a variant pass if it failed to fulfill any one of the criteria used in the filter.
For instance, our VCF file reports a G>A transition at position 22,646,441 of chromosome 22. This variant is marked with snpFilter in the FILTER field, because the value of the RMS mapping quality (MQ) was less than 40:
MQ=39.81
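The same tally can be sketched in plain Python without bcftools. The three data lines below are invented toy records, not taken from NA12878_filtered.vcf, and `count_filters` is a hypothetical helper; the last lines only redo the percentage arithmetic from the text.

```python
from collections import Counter

# Toy VCF content invented for illustration (tab-separated columns).
TOY_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t100\t.\tG\tA\t50.0\tPASS\tMQ=60.00
chr1\t200\t.\tC\tT\t31.2\tsnpFilter\tMQ=39.81
chr2\t300\t.\tA\tG\t77.5\tPASS\tMQ=59.10
"""

def count_filters(vcf_text: str) -> Counter:
    """Count how often each FILTER value occurs among the data lines."""
    counts = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        counts[fields[6]] += 1  # FILTER is the 7th column
    return counts

counts = count_filters(TOY_VCF)
print(counts["PASS"], counts["snpFilter"])

# Percentage reported in the text: 138083 PASS out of 138083 + 7917 positions.
passed, failed = 138083, 7917
fraction_passed = passed / (passed + failed)
print(f"{fraction_passed:.1%}")
```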
Question 6
Describe each field of this line:
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
ID specifies the name of the item, which is often a two-letter abbreviation. Number specifies the number of values that the field should contain. In this case, the value “A” indicates that the field contains one value per alternate allele. Type gives the data type of the values, and Description is a free-text explanation of the item.
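As an illustration, a header line of this kind can be split into its key/value pairs with a short Python sketch. The header string used here is an assumption: it is the AC definition as GATK typically writes it, with `Type=Integer`. The function name `parse_info_header` is hypothetical.

```python
import re

# Assumed AC header line, written in the usual GATK style.
HEADER = ('##INFO=<ID=AC,Number=A,Type=Integer,'
          'Description="Allele count in genotypes, '
          'for each ALT allele, in the same order as listed">')

def parse_info_header(line: str) -> dict:
    """Split an ##INFO=<...> meta-information line into a dict."""
    match = re.fullmatch(r"##INFO=<(.*)>", line.strip())
    if match is None:
        raise ValueError("not an ##INFO header line")
    # A value is either a double-quoted string (which may contain commas)
    # or a run of characters up to the next comma.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', match.group(1))
    return {key: value.strip('"') for key, value in pairs}

info = parse_info_header(HEADER)
for key, value in info.items():
    print(f"{key}: {value}")
```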
Question 7
The value for chr7:1936796C>T is AC=1
What does AC=1 represent for this variant?
This reflects the allele count in the genotype for this variant (0/1, heterozygous). Homozygous variants (1/1) have AC=2.
Question 8
INFO field
What does allele frequency (AF) represent in the INFO field?
Allele Frequency, for each ALT allele
In VCF files, allele frequency refers to the allele balance of the alleles observed in the samples being sequenced, not the frequency of the variants in the population.
Question 9
INFO field
What does the AN field represent?
The AN field shows the total number of alleles in called genotypes. AF is calculated as AC/AN.
For the heterozygous variants (0/1), AC=1, AN=2, and AF=0.500. For the homozygous variants (1/1), AC=2, AN=2, and AF=1.00.
Question 10
INFO field
Consider a VCF file that describes three samples. If a data line indicates a variant that was called homozygous (1/1) in sample 1, heterozygous (0/1) in sample 2, and heterozygous (0/1) in sample 3, then AC=…, AN=…, and the allele frequency is calculated as … or AF=…
Consider a VCF file that describes three samples. If a data line indicates a variant that was called homozygous (1/1) in sample 1, heterozygous (0/1) in sample 2, and heterozygous (0/1) in sample 3, then AC=4, AN=6, and the allele frequency is calculated as 4/6 or AF=0.667.
AF=AC/AN
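The three-sample arithmetic can be checked with a short Python sketch. `allele_stats` is a hypothetical helper; it assumes a single ALT allele (index 1) and treats "." as a missing allele that does not count toward AN.

```python
import re

def allele_stats(genotypes, alt_index=1):
    """Return (AC, AN, AF) for one ALT allele across a list of GT strings."""
    ac = 0  # number of called alleles equal to the ALT allele
    an = 0  # total number of called alleles
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing call: not counted in AN
            an += 1
            if int(allele) == alt_index:
                ac += 1
    af = ac / an if an else float("nan")
    return ac, an, af

ac, an, af = allele_stats(["1/1", "0/1", "0/1"])
print(f"AC={ac}, AN={an}, AF={af:.3f}")  # AC=4, AN=6, AF=0.667
```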
Question 11
INFO
What does the DP field in INFO represent?
Approximate read depth (some reads may have been filtered).
Question 12
INFO field
What does DP=23 mean?
DP=23 means that 23 reads covered this position.
Question 13
FORMAT field
The FORMAT field has the following items:
GT:AD:DP:GQ:PL
0/1:11,12:23:99:303,0,290
What are the two most common genotypes encountered?
The first entry refers to the genotype. The two most commonly encountered genotypes are simple to understand: 0/1 refers to a heterozygous variant, and 1/1 refers to a homozygous variant.
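The pairing between the FORMAT keys and the sample values can be sketched in Python as follows. `parse_sample` is a hypothetical helper. It assumes that AD and PL are comma-separated integer lists, that DP and GQ are single integers, and that no value is missing (".").

```python
def parse_sample(format_keys: str, sample_values: str) -> dict:
    """Pair colon-separated FORMAT keys with one sample's values."""
    record = dict(zip(format_keys.split(":"), sample_values.split(":")))
    # Assumed types: AD and PL are integer lists, DP and GQ are integers.
    for key in ("AD", "PL"):
        if key in record:
            record[key] = [int(x) for x in record[key].split(",")]
    for key in ("DP", "GQ"):
        if key in record:
            record[key] = int(record[key])
    return record

sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:11,12:23:99:303,0,290")
print(sample["GT"])   # the genotype is the first entry
print(sample)
```

In this example the two AD values (11 and 12) add up to the DP value of 23.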
Question 14
VCF FORMAT field
What do the | and / separators in a genotype such as 0/1 represent?
The FORMAT field has the following items:
GT:AD:DP:GQ:PL
0/1:11,12:23:99:303,0,290
In general, the genotype is encoded as allele values separated by either / or |. The separator / is used if the variants are unphased, and the separator | is used for phased variants.
Question 15
VCF FORMAT field
What do the numbers in 0/1 or 1/2 … represent?
The FORMAT field has the following items:
GT:AD:DP:GQ:PL
0/1:11,12:23:99:303,0,290
The allele values are 0 for the reference allele (i.e., the allele that is shown in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on.
The GT field (not shown above) has the value 1/2. This variant corresponds to the triallelic polymorphism rs11073131. The reference allele is A, but there are two variants that are known to occur in the population, C and T. In the NA12878 sample, the reference base does not occur, but rather both variants occur in heterozygous form.
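The allele indexing described above can be sketched in Python. `decode_genotype` is a hypothetical helper, and the ALT order "C,T" is an assumption based on the order in which the two variants are named in the text.

```python
import re

def decode_genotype(gt: str, ref: str, alts: list) -> tuple:
    """Translate GT allele indices into bases and report phasing."""
    phased = "|" in gt          # "|" marks phased, "/" unphased genotypes
    alleles = [ref] + alts      # index 0 is REF, 1, 2, ... follow ALT order
    bases = [alleles[int(i)] for i in re.split(r"[/|]", gt)]
    return bases, phased

# rs11073131 example: REF is A; ALT is assumed to be listed as C,T.
bases, phased = decode_genotype("1/2", "A", ["C", "T"])
print(bases, phased)
```

With these inputs the reference base A does not appear among the decoded alleles, which matches the description of the NA12878 genotype.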