Postprocessing the alignment Flashcards

Question 1

Which command verifies that a file conforms to the SAM/BAM format?

Answer

java -jar picard.jar ValidateSamFile \
INPUT=NIST7035_aln.bam \
MODE=SUMMARY

Question 2

What can PCR duplicates lead to in variant calling?

Answer

PCR duplicates can lead to incorrect coverage assessments and erroneous variant calls

Question 3

How can substantial PCR amplification of a single fragment carrying a variant affect variant calling?

Answer

Variant calling algorithms assume that all reads are independent. If one particular fragment underwent substantial amplification, so that, say, 90% of all reads covering a true heterozygous variant originate from this fragment, the variant might be called as homozygous instead of heterozygous: only about 50% of the remaining reads, or a total of 5% of all reads, would have the reference sequence at this position.
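The arithmetic behind this answer can be checked with a short sketch. The total of 1,000 reads is a hypothetical number chosen for illustration; only the 90% figure comes from the card.

```python
# Hypothetical depth: 1,000 reads cover a true heterozygous site.
total_reads = 1000
amplified_fraction = 0.90  # share of reads that are PCR copies of one variant-carrying fragment

amplified_reads = round(total_reads * amplified_fraction)  # all carry the variant allele
independent_reads = total_reads - amplified_reads          # unbiased sample of both alleles

# About half of the independent reads carry the reference allele.
reference_reads = independent_reads // 2
variant_reads = total_reads - reference_reads

reference_fraction = reference_reads / total_reads
variant_fraction = variant_reads / total_reads

print(f"reference allele fraction: {reference_fraction:.2f}")
print(f"variant allele fraction:   {variant_fraction:.2f}")
```

With a reference allele fraction this low, a caller that treats every read as independent evidence is likely to favor a homozygous genotype.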

Question 4

Which Picard command sorts the aligned reads?

Answer

java -jar picard.jar SortSam \
INPUT=NIST7035_aln.bam \
OUTPUT=NIST7035_sorted.bam \
SORT_ORDER=coordinate

Question 5

Mark duplicates

What is the prerequisite for running the MarkDuplicates command?

Answer

The input file must be sorted first (here, by coordinate with SortSam).
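One way to check this prerequisite is to read the SO field of the @HD header line. The sketch below assumes the header text has already been extracted (for example with samtools view -H); the function name and the sample header lines are hypothetical.

```python
def sort_order(header_lines):
    """Return the SO value from the @HD header line, or None if it is absent."""
    for line in header_lines:
        if line.startswith("@HD"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[len("SO:"):]
    return None

# Hypothetical header of a coordinate-sorted file.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
]

order = sort_order(header)
print(order)
```

A value of coordinate indicates that the file was sorted as in the SortSam step; None means the header does not declare a sort order.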

Question 6

Mark duplicates

How does the MarkDuplicates tool work?

Answer

The MarkDuplicates tool works by comparing sequences with identical 5’ positions.

Reads with identical 5’ positions and identical sequences are marked as duplicates.
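The rule described in this answer can be sketched as a grouping step. This is a simplified illustration, not Picard's implementation: the read tuples are hypothetical, the strand is added to the key as an assumption, and keeping the read with the highest summed base quality as the representative is also an assumption.

```python
from collections import defaultdict

# Hypothetical reads: (name, chrom, five_prime_pos, strand, sequence, sum_of_base_qualities)
reads = [
    ("r1", "chr1", 1000, "+", "ACGTAC", 180),
    ("r2", "chr1", 1000, "+", "ACGTAC", 210),
    ("r3", "chr1", 1000, "-", "ACGTAC", 200),
    ("r4", "chr1", 1042, "+", "TTGACA", 190),
]

def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates."""
    groups = defaultdict(list)
    for name, chrom, pos, strand, seq, qual_sum in reads:
        # Reads sharing 5' position, strand and sequence fall into one group.
        groups[(chrom, pos, strand, seq)].append((qual_sum, name))
    duplicates = set()
    for members in groups.values():
        members.sort(reverse=True)       # best-scoring read first
        for _, name in members[1:]:      # everything after the first is a duplicate
            duplicates.add(name)
    return duplicates

duplicates = mark_duplicates(reads)
print(sorted(duplicates))
```

Here r1 and r2 share the same key, so the lower-scoring r1 is flagged; r3 lies on the opposite strand and r4 starts elsewhere, so neither is a duplicate.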

Question 7

What happens to variant calling when a molecule carrying an erroneous base is amplified by PCR?

Answer

PCR amplification of an individual molecule carrying an erroneous base leads to overrepresentation of the corresponding variant in the alignment. Removing the duplicates removes all but one instance of the error. If duplicates were not removed, downstream variant calling algorithms might be led to call a false positive variant.

Question 8

What does this command do?

samtools view -c -f 0x400 NIST7035_dedup.bam

Answer

This command uses SAMtools to count (-c), rather than print, the records whose FLAG field has bit 0x400 set (-f 0x400), which marks PCR or optical duplicates. The output shows that there are a total of 2,134,576 duplicates.

Question 9

What does this command do?

samtools view -c -F 0x100 NIST7035_dedup.bam

Answer

The -F 0x100 option excludes records that have bit 0x100 set, that is, alignments that are not the primary alignment, and -c counts the records that remain. Without this filter, SAMtools counts every record in the file, including those marked as secondary.
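The difference between -f (require bits) and -F (exclude bits) comes down to bitwise tests on the FLAG field. The sketch below mimics that logic on a hypothetical list of FLAG values; the count helper is illustrative and is not part of SAMtools.

```python
DUPLICATE = 0x400   # PCR or optical duplicate
SECONDARY = 0x100   # secondary (not primary) alignment

# Hypothetical FLAG values, one per alignment record.
flags = [0, 16, 1024, 1040, 256, 272, 99, 1171]

def count(flags, require=0, exclude=0):
    """Count records that have all `require` bits set and no `exclude` bit set."""
    return sum(
        1 for f in flags
        if (f & require) == require and (f & exclude) == 0
    )

n_duplicates = count(flags, require=DUPLICATE)      # like: samtools view -c -f 0x400
n_not_secondary = count(flags, exclude=SECONDARY)   # like: samtools view -c -F 0x100

print(n_duplicates, n_not_secondary)
```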

Question 10

Marking duplicates

What does the TAGGING_POLICY option do?

java -jar picard.jar MarkDuplicates \
INPUT=NIST7035_sorted.bam \
OUTPUT=NIST7035_dedup.bam \
METRICS_FILE=NIST7035.metrics \
TAGGING_POLICY=All

Answer

If we run MarkDuplicates with the TAGGING_POLICY=All option, we can get a readout of the number of duplicates judged to be PCR duplication artifacts compared to the number of optical duplicates.

The records of duplicated reads are then annotated with a DT tag whose value is either LB (library/PCR-generated duplicate) or SQ (sequencing-platform artifact duplicate).
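Once the DT tag is present, the two kinds of duplicate can be tallied directly from the records. The sketch below works on hypothetical SAM lines held in a list; in practice they would come from the deduplicated BAM file converted to text.

```python
from collections import Counter

# Hypothetical SAM records (tab-separated).
sam_lines = [
    "r1\t1123\tchr1\t100\t60\t6M\t=\t300\t206\tACGTAC\tIIIIII\tDT:Z:LB",
    "r2\t1123\tchr1\t100\t60\t6M\t=\t300\t206\tACGTAC\tIIIIII\tDT:Z:SQ",
    "r3\t99\tchr1\t100\t60\t6M\t=\t300\t206\tACGTAC\tIIIIII",
    "r4\t1123\tchr1\t500\t60\t6M\t=\t700\t206\tTTGACA\tIIIIII\tDT:Z:LB",
]

def duplicate_type(line):
    """Return the DT tag value (LB or SQ), or None if the record has no DT tag."""
    for field in line.split("\t")[11:]:   # optional fields start at column 12
        if field.startswith("DT:Z:"):
            return field[len("DT:Z:"):]
    return None

counts = Counter(t for t in map(duplicate_type, sam_lines) if t is not None)
print(counts["LB"], counts["SQ"])
```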

Question 11

Describe this table in terms of optical duplicates (rows with "yes" in the ‘SQ ?’ column).

Answer

The table shows four read pairs that map to chr1:245,155,319.

The second read was marked as an optical duplicate of the first because the start and end positions are identical, the reads come from the same tile, and their X and Y coordinates within the tile are close to one another.

The other two reads come from different tiles and also have distinct start and end coordinates.

The ‘SQ ?’ column shows whether MarkDuplicates marked the read as an optical duplicate (DT:Z:SQ in the BAM file).
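The tile and coordinate test described in this answer can be sketched as follows. It assumes Illumina-style read names of the form instrument:run:flowcell:lane:tile:x:y, reads that already share start and end positions, and a distance threshold of 100 pixels; the read names and the threshold are illustrative assumptions.

```python
def parse_location(read_name):
    """Extract lane, tile, x, y from 'instrument:run:flowcell:lane:tile:x:y'."""
    parts = read_name.split(":")
    lane, tile, x, y = (int(p) for p in parts[3:7])
    return lane, tile, x, y

def is_optical_pair(name_a, name_b, max_pixel_distance=100):
    """True if two duplicate reads sit close together on the same tile."""
    lane_a, tile_a, xa, ya = parse_location(name_a)
    lane_b, tile_b, xb, yb = parse_location(name_b)
    same_tile = (lane_a, tile_a) == (lane_b, tile_b)
    close = abs(xa - xb) <= max_pixel_distance and abs(ya - yb) <= max_pixel_distance
    return same_tile and close

# Hypothetical read names.
first = "M1:7:FC1:1:1101:15000:2000"
near = "M1:7:FC1:1:1101:15040:2030"      # same tile, a few pixels away
far_tile = "M1:7:FC1:1:2205:15040:2030"  # different tile

print(is_optical_pair(first, near), is_optical_pair(first, far_tile))
```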

Question 12

GATK

What do the following arguments mean: -jar, -T, -R and -I?

java -jar GenomeAnalysisTK.jar \
-T CountReads \
-R example_reference.fasta \
-I example_reads.bam

Answer

The -jar argument invokes the GATK engine itself.

The -T argument tells it which tool you want to run.

Some arguments, such as -R for the reference genome and -I for the input file, are passed to the GATK engine and can be used with all the tools.

Most tools also take additional arguments that are specific to their function. GATK options also have long forms; for instance, -I is equivalent to --input_file.

Question 13

Realignment

BWA-MEM aligns reads one at a time, and the resulting alignments tend to accumulate erroneous single nucleotide variant (SNV) calls near true insertions or deletions (indels).

Why?

Answer

This is mainly due to misalignment: alignment algorithms penalize mismatches less than gaps.
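A rough illustration of this scoring trade-off, assuming an affine gap model in which a gap of length k costs the opening penalty plus k times the extension penalty. The penalty values and the 100-base read are assumptions chosen for the example.

```python
# Assumed scoring scheme (illustrative values).
MATCH = 1
MISMATCH_PENALTY = 4
GAP_OPEN_PENALTY = 6
GAP_EXTEND_PENALTY = 1

def score(matches, mismatches=0, gap_lengths=()):
    """Alignment score under a simple affine gap model."""
    total = matches * MATCH - mismatches * MISMATCH_PENALTY
    for length in gap_lengths:
        total -= GAP_OPEN_PENALTY + length * GAP_EXTEND_PENALTY
    return total

# Hypothetical 100-base read whose last base is a true 1-base insertion.
# Option 1: ignore the insertion and accept one mismatch at the read end.
with_mismatch = score(matches=99, mismatches=1)
# Option 2: open a 1-base gap for the insertion.
with_gap = score(matches=99, gap_lengths=(1,))

print(with_mismatch, with_gap)
```

Under these assumptions the mismatch alignment scores higher, so a read-by-read aligner reports a spurious SNV next to the true indel.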

Question 14

Realignment

What is the purpose of the IndelRealigner module of GATK?

Answer

The IndelRealigner module of GATK performs a second pass over the BAM file and corrects some of the errors by performing a local realignment of reads around candidate indels.

Question 15

Local realignment

How is this insertion realigned?

In this example, there is an insertion of a cytosine residue directly 3’ to a stretch of six guanine residues. That is, the insertion changes the sequence AGGGGGGCT to AGGGGGGCCT.

Answer

The upper panel shows an IGV screenshot of the alignment before the realignment step, and the lower panel shows the same reads after the GATK local realignment step.

In the upper panel, an apparent missense change was identified in three reads because the initial aligner gave a better score to an alignment with one mismatch than to an alignment with one insertion. The full view would show 25 other reads with the insertion, in which the affected position lies towards the center of the read, so that aligning them without the insertion would produce numerous mismatches in other parts of the alignment. The local realigner in essence used this information to correct the alignment of the three reads, so all the reads now show the insertion.

Question 16

Indel realigner

Each read is aligned separately by read mappers. If an insertion or deletion is located towards the very beginning or end of a read, the aligner may favor alignments with mismatches or soft clips instead of opening a gap in either the read or the reference sequence. How does local realignment address this?

Answer

The local realignment process considers all of the reads that span a given position. By combining the evidence from all of the reads, the realigner may find a high-scoring consensus that supports the presence of an indel event.
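The idea of pooling evidence across reads can be sketched by scoring every read against two candidate haplotypes, the reference and a version containing the indel, and comparing the totals. This is a toy illustration, not the GATK algorithm: the flanking bases and the two reads are hypothetical, and each read is simply placed at its best ungapped offset.

```python
# Hypothetical haplotypes built around the AGGGGGGCT -> AGGGGGGCCT example.
reference = "TTAGGGGGGCTAA"
candidate = "TTAGGGGGGCCTAA"   # reference plus the inserted C

# Hypothetical reads; the second one ends right at the insertion.
reads = ["GGGGCCTAA", "TAGGGGGGCC"]

def best_ungapped_mismatches(read, haplotype):
    """Fewest mismatches over all ungapped placements of the read."""
    return min(
        sum(a != b for a, b in zip(read, haplotype[start:start + len(read)]))
        for start in range(len(haplotype) - len(read) + 1)
    )

def total_mismatches(reads, haplotype):
    """Sum of best-placement mismatches over all reads."""
    return sum(best_ungapped_mismatches(read, haplotype) for read in reads)

ref_total = total_mismatches(reads, reference)
alt_total = total_mismatches(reads, candidate)

print(ref_total, alt_total)
```

Each read shows only one mismatch against the reference, which a read-by-read aligner may accept, but across both reads the haplotype with the insertion explains the data without any mismatch.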

Question 17

Base quality score recalibration

What is the purpose of base quality score recalibration?

Answer

The reported quality scores may be inaccurate as a result of systematic biases. Base Quality Score Recalibration (BQSR) uses empirical data to identify and characterize systematic errors made by the sequencer during the estimation of quality scores.

Question 18

Base quality score recalibration

What are the four covariates used by the BQSR procedure?

Answer

Lane
Originally reported quality
Machine cycle (position in the read)
Sequence context (preceding and subsequent base)

Question 19

What assumption does the BQSR procedure make?

Answer

Any sequence mismatch relative to the reference genome is an error, unless it corresponds to a known variant (from dbSNP, for example).

Question 20

Base quality score recalibration

What is a covariate?

Answer

A covariate (nucleotide context, ...) is so named because it is suspected of covarying with the dependent variable (here, the called base) and of indirectly affecting the relationship between the independent variable (the quality score) and the dependent variable (the called base).

Question 21

BQSR

What does the BQSR procedure do?

Answer

It estimates the influence of each of the covariates (lane, machine cycle, ...) by counting the number of errors and observations.

An important observation is that the great majority of variants in any sequenced human genome will have been previously observed.
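The counting step can be sketched as follows. Observations are binned by the four covariates, mismatches at known variant sites are skipped, and an empirical Phred score is derived from the error rate in each bin. The data are hypothetical, and the +1/+2 smoothing of the error rate is an assumption made for the example rather than the estimator GATK uses.

```python
import math
from collections import defaultdict

def empirical_qualities(observations, known_sites):
    """Map each covariate bin to an empirical Phred-scaled quality."""
    counts = defaultdict(lambda: [0, 0])   # bin -> [errors, observations]
    for lane, reported_q, cycle, context, site, is_mismatch in observations:
        if site in known_sites:
            continue   # mismatches at known variant sites are not treated as errors
        bin_counts = counts[(lane, reported_q, cycle, context)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1
    return {
        key: -10 * math.log10((errors + 1) / (total + 2))   # assumed smoothing
        for key, (errors, total) in counts.items()
    }

# Hypothetical data: 998 bases reported as Q30 in one bin, 9 of them mismatches,
# plus one mismatch at a known variant site that must be ignored.
observations = [(1, 30, 10, "AC", ("chr1", i), i < 9) for i in range(998)]
observations.append((1, 30, 10, "AC", ("chr1", 5000), True))
known_sites = {("chr1", 5000)}

qualities = empirical_qualities(observations, known_sites)
print(round(qualities[(1, 30, 10, "AC")], 1))
```

In this example the bases were reported as Q30, but the observed error rate corresponds to roughly Q20, which is the kind of systematic overestimate that recalibration corrects.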

Question 22

BQSR

If we call variants in a genome and remove the known variants from further consideration, the remaining variants will be highly enriched in … ….

Answer

If we call variants in a genome and remove the known variants from further consideration, the remaining variants will be highly enriched in sequencing errors.
