Module 4.2 Sequencing Run QC Flashcards
tiles
regions on a flow cell
Summary Table categories
14
- Number of lanes
- Number of tiles per lane
- Density (K/mm2)
- Cluster PF (%)
- Phase/Prephase (%)
- Reads (M)
- Reads PF (M)
- % >= Q30
- Yield (G)
- Cycles Err Rated
- Aligned (%)
- Error Rate (%)
- Error Rate 35 cycle (%)
- Error Rate 75 cycle (%)
Summary Table
- statistics showing means and standard deviations over all the tiles used in one lane
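A minimal sketch of the per-lane statistic described above, assuming hypothetical per-tile values. The function name and the numbers are illustrative. Using the sample standard deviation is an assumption, since the card does not say which form the instrument software reports.

```python
from statistics import mean, stdev


def lane_summary(tile_values):
    """Return (mean, sample standard deviation) over the tiles of one lane."""
    return mean(tile_values), stdev(tile_values)


# Hypothetical cluster densities (K/mm2) for four tiles in one lane.
densities = [800.0, 820.0, 780.0, 800.0]
avg, sd = lane_summary(densities)
print(f"{avg:.1f} +/- {sd:.1f} K/mm2")
```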
Summary Table
Density
(K/mm2)
- concentration of clusters detected by image analysis
- expressed as thousands of clusters per square millimeter
Summary Table
Cluster PF
(%)
PF = Passing Filter
- reads that pass Chastity filter
- Reads PF / Reads = ANSWER
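The ratio above can be sketched as a one-line calculation. The function name and the example counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def cluster_pf_percent(reads_pf_millions, reads_millions):
    """Cluster PF (%) = Reads PF / Reads, expressed as a percentage."""
    if reads_millions <= 0:
        raise ValueError("reads_millions must be positive")
    return 100.0 * reads_pf_millions / reads_millions


# Hypothetical lane: 200 M clusters detected, 180 M pass the Chastity filter.
print(cluster_pf_percent(180, 200))
```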
Chastity
used in Chastity Filter
- ratio of the brightest base intensity divided by the sum of the brightest and second brightest base intensities
- Illumina internal quality filtering procedure
- Clusters pass filter if no more than one base call has ANSWER value below 0.6 in first 25 cycles.
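A rough sketch of the Chastity calculation and filter rule described above, assuming each cycle is given as four base intensities for one cluster. The function names, parameter names, and intensity values are hypothetical, and this is not Illumina's actual implementation.

```python
def chastity(intensities):
    """Brightest intensity divided by the sum of the two brightest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_filter(cycles, threshold=0.6, n_cycles=25, max_failures=1):
    """True if at most `max_failures` of the first `n_cycles` cycles
    have a Chastity value below `threshold`."""
    failures = sum(1 for c in cycles[:n_cycles] if chastity(c) < threshold)
    return failures <= max_failures


# Hypothetical intensities: one clean cycle and one mixed-signal cycle.
clean = [100, 20, 10, 5]   # chastity = 100 / 120, above 0.6
mixed = [50, 50, 5, 5]     # chastity = 50 / 100 = 0.5, below 0.6

print(passes_filter([mixed] + [clean] * 24))      # one low cycle is tolerated
print(passes_filter([mixed] * 2 + [clean] * 23))  # two low cycles fail
```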
Summary Table
Reads
(M)
- total number of clusters
- measured in millions
Summary Table
Reads PF
(M)
- number of clusters that pass Chastity filter
- measured in millions
Yield
(G)
- number of bases sequenced that pass filter.
- 1 cluster = 1 read sequence
- answer = (number of clusters passing filter) x (number of bases sequenced in each read)
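The yield formula above works out as follows. This is a sketch for a single read per cluster; the function name and the example numbers are hypothetical.

```python
def yield_gigabases(reads_pf_millions, bases_per_read):
    """Yield (G) = clusters passing filter x bases sequenced per read."""
    total_bases = reads_pf_millions * 1e6 * bases_per_read
    return total_bases / 1e9


# Hypothetical run: 200 M clusters passing filter, 150 bases per read.
print(yield_gigabases(200, 150))
```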
Summary Table
% >= Q30
percentage of bases with Quality Score of 30 or higher
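As an illustration of this metric, the sketch below counts the share of base calls at or above Q30 in a list of quality scores. The function name and the scores are made up for the example.

```python
def percent_q30(q_scores, cutoff=30):
    """Percentage of bases with a quality score of `cutoff` or higher."""
    if not q_scores:
        raise ValueError("q_scores must not be empty")
    return 100.0 * sum(q >= cutoff for q in q_scores) / len(q_scores)


# Hypothetical quality scores for five base calls; three are >= 30.
print(percent_q30([38, 35, 30, 29, 12]))
```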
Phred Quality Score
- prediction of probability of an error in base calling
- Q = -10 log10(P)
- P = base calling error probability
- scores generally range from 0 to 40
- base calls are more error prone in regions with extreme GC bias, specific sequence patterns, or homopolymers
- data below Q20 is generally not considered valuable
Phred
Analyzing process
- calculates several parameters related to peak shape and peak resolution at each base position
- uses these parameters to predict error probabilities, based on sequence traces for which the correct sequence was known
- each score (gray bar) corresponds to a colored peak (base)
- No score = N base assignment (peak not good enough for a base call)
Illumina quality scores calculation
steps (3)
- intensity profiles and signal to noise ratios measure base calling reliability
- compute quality predictor values for a new base call and compare them to values in a pre-calibrated Quality (Q) table/model; quality scores are recorded for each cycle in base call (BCL) files
- quality scores converted to FASTQ format
Quality Score
FASTQ format
4 lines with separated fields per sequence read
- @Sequence_Identifier
- ATGC Raw Sequencing Data
- + (sequence ID again or description)
- Quality values for each base (ASCII)
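The four-line layout above can be read with a small parser. This is a minimal sketch that assumes exactly four lines per record with no line wrapping; the function name and the example read are hypothetical.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into its fields."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return {"id": header[1:], "sequence": sequence, "quality": quality}


# Hypothetical record: 7 bases and 7 quality characters.
record = parse_fastq_record(["@read1\n", "GATTACA\n", "+\n", "IIIIII#\n"])
print(record["id"], record["sequence"], record["quality"])
```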
FASTQ file
- text-based format for storing both the nucleotide sequence and its corresponding quality data
- quality scores encoded in compact ASCII character format so it uses only one byte per quality value.
- Quality score = ASCII character code - 33
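The decoding rule above (ASCII code minus 33) can be sketched in a few lines. The function name and the example quality string are hypothetical.

```python
def decode_quality(quality_string, offset=33):
    """Convert FASTQ quality characters to numeric quality scores."""
    return [ord(ch) - offset for ch in quality_string]


# 'I' is ASCII 73, '?' is 63, '5' is 53, '#' is 35.
print(decode_quality("I?5#"))  # [40, 30, 20, 2]
```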
Error probability (P)
- estimated probability of a base call being wrong
- P = 10^(-Q/10)
- Q30 -> P = 10^-3 = 1 in 1000 (0.1%) -> 99.9% base call accuracy
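The two formulas, Q = -10 log10(P) and P = 10^(-Q/10), are inverses of each other. A short sketch with hypothetical function names shows the round trip.

```python
import math


def error_probability(q):
    """P = 10^(-Q/10): probability that a base call is wrong."""
    return 10 ** (-q / 10)


def phred_score(p):
    """Q = -10 * log10(P): Phred quality score for error probability P."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q{q}: P = {p:.4%}, accuracy = {1 - p:.4%}")
```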
Q Score plot features
- x-axis: position in read (bp)
- y-axis: Q score
- box plot with statistical values
- median value: central red line
- inter quartile range (25, 75%): yellow box
- 10% and 90%: lower and upper whiskers
- mean Q score: solid blue line
- Green, orange, and red zones
- median quality score is lower for the first 5-7 bases, then rises, then drops toward the end of the read
Overclustering
- small distance between clusters = signal overlap
- two clusters can be interpreted as a single cluster with mixed fluorescence signals
- generates low quality scores across the entire read
Reduced Q score
some causes
- fluorescent signal decay
- overclustering
- instrument breakdown
Summary Table
Phase/Prephase
(%)
- rate at which molecules in a cluster fall behind (phasing) or jump ahead (prephasing) during a read
- an estimate for first 25 cycles of read
- relies on comparison to Phi X control
Summary Table
Cycles Err Rated
number of cycles that have been used for calculating error rate starting from cycle one (paired end cycles)
Summary Table
Aligned
(%)
percentage of reads that can be aligned to control sample genome (Phi X)
- should match percent of Phi X spiked in during experiment
- if different, possible that quantification of library was off or the sample library couldn’t cluster or sequence well.
- sequencing run might have quality issue reflected by another metric
Summary Table
Error rate
(%)
- calculated error rate, determined by aligning the reads produced in the run from the Phi X control against the known Phi X genome sequence
- 3 different error rates: 35 cycles, 75 cycles, and entire run
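To make the idea concrete, the sketch below counts mismatches between control reads and a reference, optionally limited to the first N cycles (as for the 35-cycle and 75-cycle rates). It assumes the reads are already aligned with known start positions and no insertions or deletions, which is a simplification; the function name, the toy reference, and the reads are all hypothetical.

```python
def error_rate_percent(reference, aligned_reads, max_cycles=None):
    """Percentage of mismatched bases among all compared bases.

    aligned_reads: list of (start_position, read_sequence) pairs.
    max_cycles: if given, only the first `max_cycles` bases of each read count.
    """
    mismatches = 0
    compared = 0
    for start, read in aligned_reads:
        bases = read if max_cycles is None else read[:max_cycles]
        for offset, base in enumerate(bases):
            compared += 1
            if base != reference[start + offset]:
                mismatches += 1
    if compared == 0:
        raise ValueError("no bases to compare")
    return 100.0 * mismatches / compared


# Toy reference and two reads; the second read has one mismatch at its last base.
reference = "ACGTACGTAC"
reads = [(0, "ACGTA"), (2, "GTACC")]
print(error_rate_percent(reference, reads))                # whole read
print(error_rate_percent(reference, reads, max_cycles=4))  # first 4 cycles only
```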
Phi X Control Libraries
- small single stranded DNA genome.
- about 5,400 bp long (5,386 bp)
- 45% GC / 55% AT.
- well-defined genome sequence which enables quick alignment and estimation of error rate
- mean insert size ~335 bp, 500bp with adapters
- added to sample libraries after pooling and denaturation for use as positive control during clustering and sequencing process.
- allows quick calculation of % phasing/prephasing, percent alignment, and estimated error rate
- spiking with Phi X can increase the overall nucleotide diversity of the sample (1% spike-in for diverse library samples)