Module 4.2 Sequencing Run QC Flashcards
tiles
regions on a flow cell
Summary Table categories
14
- Number of lanes
- Number of tiles per lane
- Density (K/mm²)
- Cluster PF (%)
- Phase/Prephase (%)
- Reads (M)
- Reads PF (M)
- % >= Q30
- Yield (G)
- Cycles Err Rated
- Aligned (%)
- Error Rate (%)
- Error Rate 35 cycle (%)
- Error Rate 75 cycle (%)
Summary Table
- statistics showing means and standard deviations over all the tiles used in one lane
Summary Table
Density
(K/mm²)
- concentration of clusters detected by image analysis
- expressed as thousands of clusters per square millimeter
Summary Table
Cluster PF
(%)
PF = Passing Filter
- reads that pass Chastity filter
- Reads PF / Reads = ANSWER
Chastity
used in Chastity Filter
- brightest base intensity divided by the sum of the brightest and second-brightest base intensities
- Illumina internal quality filtering procedure
- Clusters pass filter if no more than one base call has ANSWER value below 0.6 in first 25 cycles.
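A minimal sketch of the chastity rule described on this card, assuming four per-base intensities per cycle. The function names and the intensity values are hypothetical and only illustrate the arithmetic; this is not Illumina's implementation.

```python
def chastity(intensities):
    """Chastity for one cycle: brightest / (brightest + second brightest)."""
    top, second = sorted(intensities, reverse=True)[:2]
    return top / (top + second)


def passes_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """True if at most `max_failures` of the first `window` cycles fall below `threshold`."""
    failures = sum(1 for c in cycles[:window] if chastity(c) < threshold)
    return failures <= max_failures


# Made-up intensities (A, C, G, T) for a single cluster.
clean_cycle = (900.0, 50.0, 30.0, 20.0)   # chastity about 0.95
mixed_cycle = (500.0, 450.0, 30.0, 20.0)  # chastity about 0.53

good_cluster = [clean_cycle] * 24 + [mixed_cycle]     # one low-chastity cycle
bad_cluster = [clean_cycle] * 23 + [mixed_cycle] * 2  # two low-chastity cycles

print(passes_filter(good_cluster))  # True
print(passes_filter(bad_cluster))   # False
```

The sketch assumes the two brightest intensities are not both zero; a real pipeline would need to handle that case.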
Summary Table
Reads
(M)
- total number of clusters
- measured in millions
Summary Table
Reads PF
(M)
- number of clusters that pass Chastity filter
- measured in millions
Yield
(G)
- number of bases sequenced that pass filter.
- 1 cluster = 1 read sequence
- ANSWER = (number of clusters passing filter) x (number of bases sequenced in each read)
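A small worked example of the Cluster PF and Yield arithmetic, assuming read counts are given in millions as in the Summary Table. The function names and the run numbers are hypothetical.

```python
def cluster_pf_percent(reads_m, reads_pf_m):
    """Cluster PF (%) = Reads PF / Reads, both in millions."""
    return 100.0 * reads_pf_m / reads_m


def yield_gb(reads_pf_m, bases_per_read):
    """Yield (G) = clusters passing filter x bases sequenced per read."""
    return reads_pf_m * 1e6 * bases_per_read / 1e9


# Made-up run: 20 M clusters, 18 M passing filter, 150 bases per read.
print(cluster_pf_percent(20.0, 18.0))  # 90.0
print(round(yield_gb(18.0, 150), 2))   # 2.7
```

For a paired-end run, `bases_per_read` here would be the total number of bases sequenced per cluster across both reads; that reading is an assumption of this sketch.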
Summary Table
% >= Q30
percentage of bases with Quality Score of 30 or higher
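As an illustration, % >= Q30 can be computed from a list of per-base quality scores. The function name and the scores below are made up for the example.

```python
def percent_q30(qualities, cutoff=30):
    """Percentage of bases whose quality score is at least `cutoff`."""
    total = len(qualities)
    if total == 0:
        return 0.0
    return 100.0 * sum(1 for q in qualities if q >= cutoff) / total


scores = [38, 36, 31, 30, 29, 22, 35, 12, 40, 33]
print(percent_q30(scores))  # 70.0
```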
Phred Quality Score
- prediction of probability of an error in base calling
- Q = -10 x log10(P)
- P = base calling error probability
- generally range 0-40
- error prone due to extreme GC bias, specific patterns or homopolymers
- data below Q20 is generally not considered usable
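The Phred formula on this card can be checked with a few lines of Python. The function name is hypothetical; the math is the formula as stated above.

```python
import math


def phred_quality(p_error):
    """Phred score: Q = -10 * log10(P), where P is the base-calling error probability."""
    return -10.0 * math.log10(p_error)


for p in (0.1, 0.01, 0.001, 0.0001):
    print(p, round(phred_quality(p)))
# 0.1 -> 10, 0.01 -> 20, 0.001 -> 30, 0.0001 -> 40
```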
Phred
Analyzing process
- calculates several parameters related to peak shape and peak resolution at each base position
- uses these parameters to look up error probabilities in tables built from sequence traces where the correct sequence was known
- score (gray bar) matches each colored peak (base)
- No score = N base assignment (peak not good enough for a base call)
Illumina quality scores calculation
steps (3)
- intensity profiles and signal to noise ratios measure base calling reliability
- compute quality predictor values for a new base call and compare to values in pre-calibrated Quality (Q) table/model. Quality scores recorded for each cycle in base call (BCL) files
- quality scores converted to FASTQ format
Quality Score
FASTQ format
4 lines with separated fields per sequence read
- @Sequence_Identifier
- ATGC Raw Sequencing Data
- + (sequence ID again, or a description)
- Quality values for each base (ASCII)
FASTQ file
- text-based format for storing both the nucleotide sequence and its corresponding quality data
- quality scores encoded in compact ASCII character format so it uses only one byte per quality value.
- Quality score = ASCII character code - 33
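A minimal sketch of reading one four-line FASTQ record and decoding its quality string with the offset of 33 given above. The record contents and function names are hypothetical.

```python
def decode_quality(quality_line, offset=33):
    """Convert an ASCII quality string to a list of integer quality scores."""
    return [ord(ch) - offset for ch in quality_line]


def parse_fastq_record(lines):
    """Parse the four lines of one FASTQ record into (id, sequence, qualities)."""
    identifier, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not identifier.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return identifier[1:], sequence, decode_quality(quality)


# Made-up record: one quality character per base.
record = ["@read_1", "GATTACA", "+", "II?5+!#"]
name, seq, quals = parse_fastq_record(record)
print(name, seq, quals)  # read_1 GATTACA [40, 40, 30, 20, 10, 0, 2]
```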
Error probability (P)
- estimated probability of a base call being wrong
- P = 10^(-Q/10)
- Q30 -> P = 10^-3 = 1 in 1000 (0.1%) -> 99.9% base call accuracy
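Going the other way, from Q to P, also gives a rough expected number of wrong bases in a read by summing the per-base error probabilities. This is a sketch with hypothetical function names; the 100-base read at Q30 is made up for illustration.

```python
def error_probability(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read (sum of per-base P)."""
    return sum(error_probability(q) for q in qualities)


print(f"{error_probability(30):.4f}")              # 0.0010
print(f"{100 * (1 - error_probability(30)):.1f}")  # 99.9 (% accuracy)
print(f"{expected_errors([30] * 100):.2f}")        # 0.10
```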
Q Score plot features
- x-axis: position in read (bp)
- y-axis: Q score
- box plot with statistical values
- median value: central red line
- interquartile range (25%-75%): yellow box
- 10% and 90%: lower and upper whiskers
- mean Q score: solid blue line
- Green, orange, and red zones
- median quality score is lower for the first 5-7 bases, then rises, then drops toward the end of the read
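The box-plot values listed on this card can be sketched per read position as follows. This assumes equal-length reads and simple linear-interpolation percentiles; plotting tools may compute their percentiles differently, and the quality values are made up.

```python
from statistics import mean


def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of an already sorted list (fraction in 0..1)."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def per_position_summary(reads_qualities):
    """One dict of box-plot statistics per read position (column)."""
    summary = []
    for column in zip(*reads_qualities):
        values = sorted(column)
        summary.append({
            "p10": percentile(values, 0.10),     # lower whisker
            "q1": percentile(values, 0.25),      # bottom of box
            "median": percentile(values, 0.50),  # central line
            "q3": percentile(values, 0.75),      # top of box
            "p90": percentile(values, 0.90),     # upper whisker
            "mean": mean(values),                # mean line
        })
    return summary


# Five made-up reads, three positions each.
reads = [[30, 38, 20], [32, 38, 22], [34, 39, 24], [36, 40, 26], [38, 40, 28]]
first = per_position_summary(reads)[0]
print(first["median"], first["q1"], first["q3"], first["mean"])
```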
Overclustering
- small distance between clusters = signal overlap
- two clusters can be interpreted as a single cluster with mixed fluorescence signals
- generates low quality scores across the entire read
Reduced Q score
some causes
- fluorescent signal decay
- overclustering
- instrument breakdown
Summary Table
Phase/Prephase
(%)
- calculates rate at which molecules in cluster fall behind or jump ahead during read
- an estimate for first 25 cycles of read
- relies on comparison to Phi X control
Summary Table
Cycles Err Rated
number of cycles that have been used for calculating error rate starting from cycle one (paired end cycles)
Summary Table
Aligned
(%)
percentage of reads that can be aligned to control sample genome (Phi X)
- should match percent of Phi X spiked in during experiment
- if different, possible that quantification of library was off or the sample library couldn’t cluster or sequence well.
- sequencing run might have quality issue reflected by another metric
Summary Table
Error rate
(%)
- actual calculated error rate which is determined by aligning actual reads produced in run from the control Phi X samples against the known Phi X genome sequence
- 3 different error rates: 35 cycles, 75 cycles, and entire run
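A simplified sketch of the error-rate idea: count mismatches between control reads and the matching stretch of the known reference, optionally limited to the first N cycles. It assumes the reads are already aligned with no insertions or deletions; the function name and the short sequences are hypothetical.

```python
def error_rate_percent(read_reference_pairs, cycles=None):
    """Mismatch percentage over all pairs, optionally using only the first `cycles` bases."""
    mismatches = 0
    compared = 0
    for read, reference in read_reference_pairs:
        n = min(len(read), len(reference))
        if cycles is not None:
            n = min(n, cycles)
        mismatches += sum(1 for r, ref in zip(read[:n], reference[:n]) if r != ref)
        compared += n
    return 100.0 * mismatches / compared if compared else 0.0


# Made-up (read, reference) pairs; the second read has mismatches at positions 8 and 10.
pairs = [
    ("ACGTACGTAC", "ACGTACGTAC"),
    ("ACGTACGAAT", "ACGTACGTAC"),
]
print(error_rate_percent(pairs, cycles=5))  # 0.0
print(error_rate_percent(pairs, cycles=8))  # 6.25
print(error_rate_percent(pairs))            # 10.0
```

In a real run the windows would be 35 cycles, 75 cycles, and the whole read; the toy windows of 5 and 8 only keep the example short.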
Phi X Control Libraries
- small single stranded DNA genome.
- about 5,400 bp long (5,386 bp)
- 45% GC / 55% AT.
- well-defined genome sequence which enables quick alignment and estimation of error rate
- mean insert size ~335 bp, 500bp with adapters
- added to sample libraries after pooling and denaturation for use as positive control during clustering and sequencing process.
- allows the quick calculation for % phasing/prephasing, percent alignment, and estimation of error rate
- spiking with Phi X can increase overall nucleotide diversity of sample (1% for diverse library samples)
Diverse/balanced libraries
- has equal proportions of ATGC nucleotides at each base position
- Illumina optimized around balanced representation of ATGC
- sequencing software uses images from the first 4-7 cycles of Read 1 to identify the location of each cluster
- first 25 cycles used for % passing filter evaluation in base calling and quality score calculations
diverse libraries
per base sequence content data plot
- QC measure that shows the proportion of each base at each position (cycle)
- in well-balanced library percent of base plot shows even and horizontal curves centered around 25% for each base
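A minimal sketch of the per base sequence content calculation, assuming equal-length reads. The function name and the toy reads are hypothetical; the second example shows what a low-diversity library looks like in the same calculation.

```python
from collections import Counter


def per_base_content(reads):
    """Percentage of A, C, G, T at each read position (cycle)."""
    content = []
    for column in zip(*reads):
        counts = Counter(column)
        total = len(column)
        content.append({base: 100.0 * counts[base] / total for base in "ACGT"})
    return content


balanced = ["ACGT", "CGTA", "GTAC", "TACG"]
low_diversity = ["ACGT", "ACGT", "ACGT", "ACGT"]

print(per_base_content(balanced)[0])       # every base at 25.0
print(per_base_content(low_diversity)[0])  # A at 100.0, others at 0.0
```

Any N calls count toward the position total but are not reported, which is a simplification of this sketch.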
low diversity libraries
- single base per cycle
- e.g. same primers used = all clusters have exactly the same sequence at the beginning of the reads
- per base plot shows large intensity spikes at each cycle.
unbalanced libraries
- have one base at much lower percentage than the others
QC metric
Percent of Adapter
- percentage of reads matching adapter sequences, which can be trimmed from the base sequence data
- some library inserts are shorter than read lengths so they’re read through to adapter at 3’ end
QC metric
Duplication Rate
- measures number of reads derived from one single original molecule
- high level could indicate PCR enrichment bias due to nature of sample or library method
- can be truly overrepresented sequences, such as very abundant transcripts in an RNA sequencing library or in amplicon data (expected and normal)
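A rough sketch of a duplication rate, assuming duplicates are identified by exact sequence identity. That is a simplification chosen for the example: tools that work on aligned data typically mark duplicates by mapping position instead. The function name and reads are hypothetical.

```python
from collections import Counter


def duplication_rate_percent(sequences):
    """Percentage of reads that are extra copies of a sequence already seen."""
    total = len(sequences)
    if total == 0:
        return 0.0
    unique = len(Counter(sequences))
    return 100.0 * (total - unique) / total


reads = ["AAA", "AAA", "AAA", "CCC", "GGG", "GGG"]
print(duplication_rate_percent(reads))  # 50.0
```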
Artificial causes
- optical
- clustering
artificial duplication
Optical duplication
- single cluster that has been called as two clusters by software
- NOT on patterned flow cells
Clustering duplication
- library molecule occupies two adjacent wells, due to overflow into neighboring wells
- two clusters of same original molecule
- Unique to patterned flow cells