trimming Flashcards
Question 1
Define the trimming process
Trimming is the removal of low-quality or artefactual sequences, or portions of sequences, prior to downstream analysis, essentially by surgically eliminating only the low-quality regions.
Question 2
Trimmomatic
What is the advantage of trimming?
Trimming has been shown to improve overall data quality and to enable better results in downstream analysis.
Question 3
Trimmomatic
What is the drawback of excessive trimming?
Excessive trimming may reduce the quality of downstream results.
Question 4
What is quality trimming?
A typical approach to quality filtering is to assess the quality of the bases and determine where to truncate the read, retaining the 5’ portion and discarding the lower-quality 3’ portion.
Question 5
Trimmomatic
How does Trimmomatic perform quality trimming?
Trimmomatic applies a sliding window approach that examines the AVERAGE quality of a set of contiguous bases by sliding a window over the read starting at the 5’ end and trimming if the (average) quality falls below a threshold.
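The averaging step described above can be sketched in Python. This is only an illustration of the idea, not Trimmomatic's actual implementation: the function name `first_failing_window` is hypothetical, the qualities are assumed to be already decoded to integer Phred scores, and the read below is made up.

```python
from typing import Optional, Sequence


def first_failing_window(
    quals: Sequence[int], window: int, threshold: float
) -> Optional[int]:
    """Return the 0-based start of the first window (scanning from the 5' end)
    whose mean quality is below `threshold`, or None if no window fails."""
    for start in range(len(quals) - window + 1):
        mean_q = sum(quals[start:start + window]) / window
        if mean_q < threshold:
            return start
    return None


# Hypothetical read: good quality at the 5' end, poorer towards the 3' end.
quals = [30, 30, 30, 30, 20, 25, 19, 7, 23, 15, 7]
print(first_failing_window(quals, window=4, threshold=15))
```

The sketch only locates the first below-threshold window; where exactly the read is cut relative to that window is left to the tool.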
Question 6
Trimmomatic
What is Trimmomatic's simple mode for adapter trimming?
In the simple mode (which is most useful for single-end reads), each read is scanned from the 5’ end to the 3’ end to determine if any of the user-provided adapter sequences are present. If the adapter overlaps with the 5’ end of the read, then the entire read is discarded. Otherwise, the 3’ terminus of the read is discarded starting from the first overlapping nucleotide.
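The decision rule above (discard the whole read if the adapter sits at the 5’ end, otherwise cut from the first overlapping nucleotide) can be sketched as follows. This is a heavily simplified illustration, not Trimmomatic's algorithm: it uses plain mismatch counting instead of the quality-aware alignment score, and the names `clip_simple`, `min_overlap` and `max_mismatches` as well as the example adapter string are assumptions made for this sketch.

```python
from typing import Optional


def clip_simple(
    read: str, adapter: str, min_overlap: int = 8, max_mismatches: int = 0
) -> Optional[str]:
    """Scan the read 5'->3' for the adapter (or an adapter prefix running off
    the 3' end). Return None if the whole read is discarded, the clipped read
    if an adapter is found further downstream, or the read unchanged."""
    for pos in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - pos)
        if overlap < min_overlap:
            continue  # adapter shorter than the required overlap
        segment = read[pos:pos + overlap]
        mismatches = sum(1 for a, b in zip(adapter[:overlap], segment) if a != b)
        if mismatches <= max_mismatches:
            if pos == 0:
                return None      # adapter overlaps the 5' end: discard the read
            return read[:pos]    # keep only the bases before the adapter
    return read                  # no adapter detected


adapter = "AGATCGGAAGAGC"  # example adapter sequence, used for illustration only
print(clip_simple("ACGTACGTACGT" + adapter + "TTTT", adapter))
print(clip_simple(adapter + "ACGT", adapter))
```

A minimum overlap is required here so that a single matching base at the 3’ end does not trigger clipping; that threshold is a design choice of this sketch.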
Question 7
When should trimming be performed?
Quality trimming should be applied especially if the overall quality is poor towards the 3’ end of the reads.
Question 8
What is the palindrome mode of Trimmomatic?
Trimmomatic has a “palindrome mode” that is optimized for the detection of “adapter read-through”. When “read-through” occurs, both reads in a pair will comprise the same sequence (in reverse complementary orientation) followed by contaminating sequence from the “opposite” adapter.
Question 9
Trimmomatic
What does the seed mismatches parameter control?
ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
The parameter seed mismatches controls the maximum number of mismatches allowed between the adapter sequence and a subsequence of the read to still be considered a match.
Question 10
Trimmomatic
When does trimming occur in the single-end case?
ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
The match score calculated by the full alignment (between the adapter and a subsequence of the read) must exceed the simple clip threshold for trimming to be performed.
Question 11
How is the alignment score between the adapter and a subsequence of the read calculated?
ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
The full alignment score is calculated by increasing the alignment score by 0.6 for each matching base and by reducing it by Q/10 for each mismatched base (where Q is the Phred-encoded quality score of the mismatched base). A perfect match of a sequence with a length of n bases thus scores n × 0.6, which is about 7 for a 12-base perfect match and 15 for a 25-base perfect match.
Therefore, values between 7 and 15 are recommended for this parameter.
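The scoring rule can be checked with a small Python sketch. It applies exactly the rule stated above (+0.6 per match, −Q/10 per mismatch) to two equal-length strings; the function name `alignment_score` and the example sequences are hypothetical, and this is not taken from Trimmomatic's source code.

```python
from typing import Sequence


def alignment_score(adapter: str, read_sub: str, quals: Sequence[int]) -> float:
    """Score an ungapped alignment: +0.6 for each matching base,
    -Q/10 for each mismatched base (Q = Phred quality of the read base)."""
    score = 0.0
    for a, r, q in zip(adapter, read_sub, quals):
        if a == r:
            score += 0.6
        else:
            score -= q / 10
    return score


# Perfect 12-base and 25-base matches (quality values are irrelevant here).
print(round(alignment_score("A" * 12, "A" * 12, [30] * 12), 1))
print(round(alignment_score("A" * 25, "A" * 25, [30] * 25), 1))
# One Q30 mismatch in 12 bases: 11 * 0.6 - 30 / 10
print(round(alignment_score("ACGTACGTACGT", "ACGTACGTACGA", [30] * 12), 1))
```

With a simple clip threshold of 10, the 12-base perfect match (about 7.2) would not trigger clipping, while the 25-base perfect match (15) would.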
Question 12
What does each parameter mean ?
java -jar trimmomatic-0.36.jar SE \
-phred64 \
-threads 2 \
-trimlog son.log \
Sons_exome_fastq_file_1.fq \
trimmed_output.fq \
ILLUMINACLIP:./adapters/TruSeq2-SE.fa:2:30:10 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:36 \
TOPHRED33
SE
Run Trimmomatic in single-end mode (one input file, one output file).
-phred64
Quality scores in the FASTQ file were encoded with Phred+64.
-threads 2
The number of threads to be used by Trimmomatic.
-trimlog son.log
Write a log to the indicated file.
Sons_exome_fastq_file_1.fq and trimmed_output.fq
The input and output files are indicated at this point in the command line. These files are required to be in FASTQ format and may be compressed (gz).
ILLUMINACLIP:./adapters/TruSeq2-SE.fa:2:30:10
The location of the file with the Illumina adapters is given, followed by the seed mismatches, palindrome clip threshold, and simple clip threshold. Here we allow up to 2 mismatches to the adapter sequence and require a score of at least 10 for the alignment between any adapter sequence and a read. The value of 30 is the palindrome clip threshold, which is not used in SE mode.
LEADING:3 and TRAILING:3
Specifies the minimum quality required to keep a leading (5’) or trailing (3’) base (here, a minimum Phred score of 3 is indicated).
SLIDINGWINDOW:4:15
Window size of 4, minimum mean quality in the window of 15.
MINLEN:36
Discard all sequences that are smaller than 36 base pairs after the other trimming operations.
TOPHRED33
Convert quality scores to Phred+33 in the output file.
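The effect of the two encodings (-phred64 on input, TOPHRED33 on output) can be illustrated in Python. In Phred+64 a quality Q is stored as the character with code Q + 64, and in Phred+33 as the character with code Q + 33, so converting shifts every character code down by 31. The function name `phred64_to_phred33` is hypothetical; this is a sketch of the encoding arithmetic, not of Trimmomatic's code.

```python
def phred64_to_phred33(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (shift each code by -31)."""
    converted = []
    for ch in qual:
        if ord(ch) < 64:
            raise ValueError(f"{ch!r} is not a valid Phred+64 quality character")
        converted.append(chr(ord(ch) - 31))
    return "".join(converted)


# 'h' encodes Q40 in Phred+64; the same quality is 'I' in Phred+33.
print(phred64_to_phred33("hhh@"))
```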
Question 13
Trimmomatic
Consider the average scores of several windows of quality characters. The Unix command echo -n "5:4(" | od -A n -t d1 prints the character codes for 5:4(, i.e., 53, 58, 52, 40.
Subtracting 33 from each number and taking the average gives an average score of 17.75.
The average score of (80( is 13.
In which case does trimming occur (13 or 17.75)?
SLIDINGWINDOW:4:15
Consider the average scores of several windows of quality characters. The Unix command echo -n "5:4(" | od -A n -t d1 prints the character codes for 5:4(, i.e., 53, 58, 52, 40. Subtracting 33 from each number and taking the average gives an average score of 17.75, which is above the threshold. However, the average score of (80( is only 13, which is below the threshold and triggers trimming of the read, as shown in the illustration.
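The same arithmetic can be reproduced in Python instead of with od. The helper names `phred33_scores` and `mean_quality` are hypothetical, and the sketch assumes Phred+33 encoding, as in the calculation above.

```python
from typing import List


def phred33_scores(qual: str) -> List[int]:
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in qual]


def mean_quality(qual: str) -> float:
    """Mean Phred score of a window of quality characters."""
    scores = phred33_scores(qual)
    return sum(scores) / len(scores)


for window in ("5:4(", "(80("):
    verdict = "trim" if mean_quality(window) < 15 else "keep"
    print(window, phred33_scores(window), mean_quality(window), verdict)
```

With the threshold of 15 from SLIDINGWINDOW:4:15, the first window (mean 17.75) is kept and the second (mean 13) triggers trimming.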
Question 14
In this case, where does the trimming start?
Trimming starts at the 3’ base of the first below-threshold window, which in our example corresponds to the “T” of the ACCT with quality string (80(. Note that the middle portion of the sequence and the corresponding quality string have been omitted for better legibility.