Postprocessing the alignment Flashcards
Question 1
Which command verifies that a file is in valid SAM/BAM format?
java -jar picard.jar ValidateSamFile \
INPUT=NIST7035_aln.bam \
MODE=SUMMARY
Question 2
What problems can PCR duplicates cause in variant calling?
PCR duplicates can lead to incorrect coverage assessments and erroneous variant calls.
Question 3
What happens to variant calling if a fragment carrying a variant is substantially amplified by PCR?
The variant may be called with the wrong genotype. Variant calling algorithms assume that all reads are independent. If one particular fragment underwent substantial amplification, such that, say, 90% of all reads covering a true heterozygous variant originate from this fragment, then the variant might be called as homozygous instead of heterozygous (because only about 50% of the remaining reads, or a total of 5% of all reads, would have the reference sequence at this position).
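The arithmetic in this answer can be checked with a short Python sketch. The function name and the 0.9 duplicate fraction are illustrative only; the sketch assumes that every duplicate read carries the variant allele and that the remaining independent reads split 50/50 at a heterozygous site.

```python
def observed_ref_fraction(dup_fraction, het_ref_fraction=0.5):
    """Fraction of all reads showing the reference allele when `dup_fraction`
    of the reads are copies of a single variant-carrying fragment."""
    independent = 1.0 - dup_fraction          # reads that are not PCR copies
    return independent * het_ref_fraction     # half of those show the reference


ref = observed_ref_fraction(0.9)
print(f"reference allele fraction: {ref:.2f}")
print(f"variant allele fraction:   {1 - ref:.2f}")
```

With 90% duplicates, only about 5% of reads support the reference allele, which a caller may interpret as a homozygous variant.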
Question 4
Which Picard command sorts the aligned reads?
java -jar picard.jar SortSam \
INPUT=NIST7035_aln.bam \
OUTPUT=NIST7035_sorted.bam \
SORT_ORDER=coordinate
Question 5
Mark duplicates
What is the prerequisite for using the MarkDuplicates command?
The input file must be sorted.
Question 6
Mark duplicates
How does the MarkDuplicates tool work?
The MarkDuplicates tool works by comparing sequences with identical 5’ positions.
Reads with identical 5’ positions and identical sequences are marked as duplicates.
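The rule stated on this card can be illustrated with a toy Python sketch. It follows the card's simplified description (same 5' position, same sequence) and is not Picard's actual implementation; the `mark_duplicates` function, the dictionary keys, and the example reads are all hypothetical, and keeping the first read of each group is an arbitrary choice made here.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing chromosome, strand, 5' position and sequence form one group;
    the first read of each group is kept and the rest are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["strand"], read["pos5"], read["seq"])
        groups[key].append(read["name"])
    duplicates = set()
    for names in groups.values():
        duplicates.update(names[1:])  # everything after the first is a duplicate
    return duplicates


# Made-up reads for illustration.
reads = [
    {"name": "r1", "chrom": "chr1", "strand": "+", "pos5": 100, "seq": "ACGT"},
    {"name": "r2", "chrom": "chr1", "strand": "+", "pos5": 100, "seq": "ACGT"},
    {"name": "r3", "chrom": "chr1", "strand": "+", "pos5": 100, "seq": "ACGA"},
    {"name": "r4", "chrom": "chr1", "strand": "+", "pos5": 105, "seq": "ACGT"},
]
print(sorted(mark_duplicates(reads)))
```

Only r2 is flagged: r3 differs in sequence and r4 in 5' position, so both are treated as independent reads.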
Question 7
What happens to variant calling in this case?
PCR amplification of an individual molecule with the erroneous base leads to overrepresentation of the corresponding variant in the alignment. Removal of the duplicates removes all but one instance of the error. Downstream variant calling algorithms might be led to call a (false positive) variant if duplicates were not removed.
Question 8
What does this command do?
samtools view -c -f 0x400 NIST7035_dedup.bam
This command uses SAMtools to count (-c), without printing them, the alignments that have the duplicate bit 0x400 set in their FLAG field (-f 0x400). In this example, a total of 2,134,576 reads are marked as duplicates.
Question 9
What does this command do?
samtools view -c -F 0x100 NIST7035_dedup.bam
The -c option counts alignments instead of printing them, and -F 0x100 excludes alignments that have the secondary-alignment bit (0x100) set. Without this filter, SAMtools would report the total number of alignments, including those marked as secondary.
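The two SAMtools filters in Questions 8 and 9 can be mimicked with plain bitwise operations. This is a minimal Python sketch, not a replacement for SAMtools: the function names and the FLAG values in the list are made up for illustration, while the bit values 0x400 and 0x100 are the ones used on the cards.

```python
DUPLICATE = 0x400   # FLAG bit: PCR or optical duplicate
SECONDARY = 0x100   # FLAG bit: secondary alignment


def count_required(flags, bits):
    """Like `samtools view -c -f bits`: count FLAGs with all of `bits` set."""
    return sum(1 for flag in flags if flag & bits == bits)


def count_excluding(flags, bits):
    """Like `samtools view -c -F bits`: count FLAGs with none of `bits` set."""
    return sum(1 for flag in flags if flag & bits == 0)


# Hypothetical FLAG values: two duplicates (1123, 1171) and one secondary (355).
flags = [99, 147, 1123, 1171, 355]
print("duplicates:", count_required(flags, DUPLICATE))
print("non-secondary alignments:", count_excluding(flags, SECONDARY))
```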
Question 10
Marking duplicates
What does the TAGGING_POLICY option do?
java -jar picard.jar MarkDuplicates \
INPUT=NIST7035_sorted.bam \
OUTPUT=NIST7035_dedup.bam \
METRICS_FILE=NIST7035.metrics \
TAGGING_POLICY=All
If we run MarkDuplicates with the TAGGING_POLICY=All option, we can get a readout of the number of duplicates judged to be PCR duplication artifacts compared to the number of optical duplicates.
With this option, the records of duplicate reads are annotated with a DT tag whose value is either LB (library/PCR-generated duplicate) or SQ (sequencing-platform artifact, i.e. optical duplicate).
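Once the DT tags are present, the two kinds of duplicates can be tallied from the SAM text. The following Python sketch assumes tab-separated SAM records with optional tags starting at the twelfth column; the `count_dt_tags` function and the example records are invented for illustration and do not come from the NIST7035 data.

```python
from collections import Counter


def count_dt_tags(sam_lines):
    """Count DT:Z values (LB or SQ) across SAM records, skipping header lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        for tag in fields[11:]:  # optional fields follow the 11 mandatory ones
            if tag.startswith("DT:Z:"):
                counts[tag[len("DT:Z:"):]] += 1
    return counts


# Made-up records: one non-duplicate, two LB duplicates, one SQ duplicate.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tDT:Z:LB",
    "r3\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tDT:Z:SQ",
    "r4\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tDT:Z:LB",
]
print(dict(count_dt_tags(sam_lines)))
```

In practice the records would come from `samtools view NIST7035_dedup.bam` rather than a hard-coded list.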
Question 11
Describe this table in terms of optical duplicates (rows with "yes" in the ‘SQ ?’ column).
Four read pairs map to chr1:245,155,319.
The second read was marked as an optical duplicate of the first because the start and end positions are identical, the reads come from the same tile, and the X and Y coordinates within the tile are close to one another.
The other two reads are from different tiles and also have distinct start and end coordinates.
The ‘SQ ?’ column shows whether MarkDuplicates marked the read as an optical duplicate (DT:Z:SQ in the BAM file).
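The criteria described on this card (same tile, nearby X and Y coordinates) can be written as a small check. This Python sketch is only an illustration: it assumes the two reads have already been identified as duplicates by position, it models "near" as each coordinate differing by at most `max_distance`, and the value 100 as well as the tile and coordinate values are assumptions, not data from the table.

```python
def is_optical_duplicate(a, b, max_distance=100):
    """True if two duplicate reads lie on the same tile and close together.

    `a` and `b` are dicts with hypothetical keys `tile`, `x` and `y`.
    """
    return (
        a["tile"] == b["tile"]
        and abs(a["x"] - b["x"]) <= max_distance
        and abs(a["y"] - b["y"]) <= max_distance
    )


# Made-up cluster positions.
first = {"tile": 1101, "x": 5000, "y": 20000}
second = {"tile": 1101, "x": 5040, "y": 20030}   # same tile, nearby
third = {"tile": 2205, "x": 5040, "y": 20030}    # different tile

print(is_optical_duplicate(first, second))
print(is_optical_duplicate(first, third))
```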
Question 12
GATK
What do the following arguments mean: -jar, -T, -R and -I?
java -jar GenomeAnalysisTK.jar \
-T CountReads \
-R example_reference.fasta \
-I example_reads.bam
The -jar argument invokes the GATK engine itself, and the -T argument tells it which tool you want to run.
Some arguments, such as -R for the reference genome and -I for the input file, are passed to the GATK engine and can be used with all tools.
Most tools also take additional arguments that are specific to their function. GATK options also have long forms; for instance, -I is equivalent to --input_file.
Question 13
Realignment
The alignment of reads by BWA-MEM is done read-by-read, and may tend to accumulate erroneous single nucleotide variant (SNV) calls near true insertions or deletions (indels).
Why?
This is mainly due to misalignment, because alignment algorithms penalize mismatches less than gaps.
Question 14
Realignment
What is the purpose of the IndelRealigner module of GATK?
The IndelRealigner module of GATK performs a second pass over the BAM file and corrects some of the errors by performing a local realignment of reads around candidate indels.
Question 15
Local realignment
How is this insertion realigned?
In this example, there is an insertion of a cytosine residue directly 3’ to a stretch of six guanine residues. That is, the insertion changes the sequence AGGGGGGCT to AGGGGGGCCT.
The upper panel shows an IGV screenshot of the alignment prior to the realignment step, and the lower panel shows the same reads following the GATK local realignment step.
In the upper panel, we see that a missense change was identified in three reads, because the initial aligner gave a better score to an alignment with one mismatch than to an alignment with one insertion. The full view would show 25 other reads with the insertion, in which the affected position was located towards the center of the read, so that aligning those reads without the insertion would lead to numerous mismatches in other parts of the alignment. The local realigner in essence used this information to correct the alignment of the three reads, so that all the reads now show the insertion.
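The scoring argument behind this example can be made concrete with a small Python sketch. The penalties are assumptions modeled on commonly cited BWA-MEM defaults (match +1, mismatch -4, gap open -6, gap extension -1 per base); the `alignment_score` function, the 100-base read length, and the mismatch counts are hypothetical values chosen for illustration.

```python
def alignment_score(matches, mismatches, gap_lengths=(),
                    match=1, mismatch=4, gap_open=6, gap_extend=1):
    """Score an alignment under affine gap penalties.

    A gap of length k costs gap_open + k * gap_extend.
    """
    gap_cost = sum(gap_open + gap_extend * k for k in gap_lengths)
    return matches * match - mismatches * mismatch - gap_cost


# Insertion at the very end of a 100-base read:
# a single mismatch is cheaper than opening a gap.
end_as_mismatch = alignment_score(matches=99, mismatches=1)
end_as_insertion = alignment_score(matches=99, mismatches=0, gap_lengths=(1,))

# Insertion in the middle of the read: without a gap, the downstream bases
# are shifted by one and many of them mismatch (35 is a made-up count).
mid_as_mismatches = alignment_score(matches=65, mismatches=35)
mid_as_insertion = alignment_score(matches=99, mismatches=0, gap_lengths=(1,))

print("end of read:", end_as_mismatch, "vs", end_as_insertion)
print("middle of read:", mid_as_mismatches, "vs", mid_as_insertion)
```

Under these assumed penalties the aligner prefers the mismatch when the insertion sits at the end of a read and the gap when it sits in the middle, which is the discrepancy that local realignment resolves.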