Module 4.2 Sequencing Run QC Flashcards by Maggie Pitt

tiles

regions on a flow cell

Summary Table categories

Number of lanes
Number of tiles per lane
Density (K/mm2)
Cluster PF (%)
Phase/Prephase (%)
Reads (M)
Reads PF (M)
% >= Q30
Yield (G)
Cycles Err Rated
Aligned (%)
Error Rate (%)
Error Rate 35 cycle (%)
Error Rate 75 cycle (%)

Summary Table

statistics showing means and standard deviations over all the tiles used in one lane

Summary Table

Density
(K/mm2)

concentration of clusters detected by image analysis
expressed as thousands of clusters per square millimeter

Summary Table

Cluster PF
(%)

PF = Passing Filter

percentage of clusters (reads) that pass the Chastity filter
Cluster PF (%) = (Reads PF / Reads) x 100
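
The ratio on this card can be checked with a few lines of Python. This is a minimal sketch; the function name and the lane figures are hypothetical and not taken from any run report.

```python
def cluster_pf_percent(reads_pf: float, reads: float) -> float:
    """Percentage of clusters that pass the Chastity filter."""
    if reads <= 0:
        raise ValueError("reads must be positive")
    return 100.0 * reads_pf / reads


# Hypothetical lane: 250 M raw clusters, 225 M passing filter.
print(cluster_pf_percent(225.0, 250.0))  # 90.0
```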

Chastity

used in Chastity Filter

ratio of the brightest base intensity divided by the sum of the brightest and second brightest base intensities
Illumina internal quality filtering procedure
A cluster passes filter if no more than one base call has a Chastity value below 0.6 in the first 25 cycles.
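
The rule above can be sketched as follows. This is an illustration only, not Illumina's implementation: the function names are hypothetical, and each cycle is assumed to be given as a tuple of four channel intensities.

```python
def chastity(intensities):
    """Brightest intensity divided by the sum of the two brightest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    # Assumption: a cycle with no signal at all counts as failing.
    return top / total if total > 0 else 0.0


def passes_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """True if at most `max_failures` of the first `window` cycles
    have a Chastity value below `threshold`."""
    failures = sum(1 for c in cycles[:window] if chastity(c) < threshold)
    return failures <= max_failures


# Hypothetical intensities: a clean cluster and a mixed (overlapping) one.
clean_cycle = (1000, 100, 50, 20)   # chastity is about 0.91
mixed_cycle = (500, 450, 30, 20)    # chastity is about 0.53
print(passes_filter([clean_cycle] * 25))                  # True
print(passes_filter([mixed_cycle] + [clean_cycle] * 24))  # True (one failure allowed)
print(passes_filter([mixed_cycle] * 25))                  # False
```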

Summary Table

Reads
(M)

total number of clusters
measured in millions

Summary Table

Reads PF
(M)

number of clusters that pass Chastity filter
measured in millions

Yield
(G)

number of bases sequenced that pass filter.
1 cluster = 1 read sequence
Yield = (number of clusters passing filter) x (number of bases sequenced in each read), reported in gigabases (G)
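
As a worked example of the yield formula, here is a minimal sketch. The function name and the run figures are hypothetical, and it assumes one read per cluster, as the card states.

```python
def yield_gigabases(clusters_pf: int, bases_per_read: int) -> float:
    """Bases passing filter, expressed in gigabases (1 G = 1e9 bases)."""
    return clusters_pf * bases_per_read / 1e9


# Hypothetical run: 200 million PF clusters, 150 bases per read.
print(yield_gigabases(200_000_000, 150))  # 30.0
```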

Summary Table

% >= Q30

percentage of bases with Quality Score of 30 or higher
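
A short sketch of how this percentage could be computed from a list of per-base quality scores. The function name and the example scores are made up for illustration.

```python
def percent_q30(quality_scores, threshold=30):
    """Percentage of bases with a quality score at or above `threshold`."""
    scores = list(quality_scores)
    if not scores:
        return 0.0
    return 100.0 * sum(q >= threshold for q in scores) / len(scores)


# Hypothetical scores for five bases: three are Q30 or higher.
print(percent_q30([38, 35, 30, 22, 12]))  # 60.0
```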

Phred Quality Score

prediction of probability of an error in base calling
Q = -10log₁₀P
P = base calling error probability
scores generally range from 0 to 40
low scores occur in error-prone regions: extreme GC bias, specific sequence patterns, or homopolymers
data below Q20 is generally not considered valuable

Phred

Analyzing process

calculates several parameters related to peak shape and peak resolution at each base position
uses these parameters to predict error probabilities, based on tables generated from sequence traces where the correct sequence was known
a score (gray bar) is matched to each colored peak (base)

- No score = N base assignment (peak not good enough for a base call)

Illumina quality scores calculation

steps (3)

intensity profiles and signal to noise ratios measure base calling reliability
compute quality predictor values for a new base call and compare to values in pre-calibrated Quality (Q) table/model. Quality scores recorded for each cycle in base call (BCL) files
quality scores converted to FASTQ format

Quality Score

FASTQ format

4 lines per sequence read, each a separate field

@Sequence_Identifier
ATGC (raw sequence data)
+ (optionally the sequence identifier again or a description)
Quality values for each base (ASCII-encoded)
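
The four-line layout can be read with a small parser. This is a minimal sketch that assumes strictly four lines per record with no wrapped sequences; the function name `read_fastq` and the two example records are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality_string) tuples.

    Assumes exactly four lines per record; stops at end of file
    or at the first blank header line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# Two made-up records; the second repeats its identifier on the '+' line.
example = io.StringIO("@read1\nATGC\n+\nIIII\n@read2\nGGCA\n+read2\n5?DI\n")
records = list(read_fastq(example))
print(records)
```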

FASTQ file

text-based format for storing both the nucleotide sequence and its corresponding quality data
quality scores encoded in compact ASCII character format so it uses only one byte per quality value.
Quality score = ASCII character code - 33 (Phred+33 encoding)
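
The decoding rule is a one-liner in Python. A minimal sketch follows; the function name is hypothetical and the offset of 33 is the one given on the card.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


# 'I' is ASCII 73, '5' is 53, '?' is 63, '#' is 35.
print(decode_quality("I5?#"))  # [40, 20, 30, 2]
```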

Error probability (P)

estimated probability of a base call being wrong
P = 10^(-Q/10)
Q 30 -> P = 10^-3 = 1 in 1000 (0.1%) -> 99.9% base call accuracy
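
The conversion in both directions can be sketched like this. The function names are hypothetical; the formulas are the ones on the Phred Quality Score and Error probability cards.

```python
import math


def error_probability(q: float) -> float:
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_score(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    # Accuracy is the complement of the error probability.
    print(f"Q{q}: P = {p:.4%}, accuracy = {1 - p:.4%}")
```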

Q Score plot features

x-axis: position in read (bp)
y-axis: Q score
box plot with statistical values
median value: central red line
inter quartile range (25, 75%): yellow box
10% and 90%: lower and upper whiskers
mean Q score: solid blue line
Green, orange, and red zones
median quality score is lower for the first 5-7 bases, then rises, then drops toward the end of the read
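
The values behind each box in such a plot could be computed per read position as sketched below. This is an illustration under assumptions: reads are taken to be equal length, the function name is hypothetical, and the inclusive quantile method is one choice among several, so results need not match a given QC tool exactly.

```python
import statistics


def per_position_summary(reads_scores):
    """For each read position, return the values drawn in the box plot.

    `reads_scores` is a list of equal-length lists of Q scores,
    one list per read (at least two reads are needed).
    """
    summary = []
    for column in zip(*reads_scores):
        deciles = statistics.quantiles(column, n=10, method="inclusive")
        quartiles = statistics.quantiles(column, n=4, method="inclusive")
        summary.append({
            "p10": deciles[0],                     # lower whisker
            "q25": quartiles[0],                   # bottom of box
            "median": statistics.median(column),   # central line
            "q75": quartiles[2],                   # top of box
            "p90": deciles[8],                     # upper whisker
            "mean": statistics.fmean(column),      # mean line
        })
    return summary


# Made-up scores for four reads of two bases each.
scores = [[30, 38], [32, 36], [34, 34], [36, 32]]
for position, stats in enumerate(per_position_summary(scores), start=1):
    print(position, stats)
```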

Overclustering

small distance between clusters = signal overlap
two clusters can be interpreted as a single cluster with mixed fluorescence signals,
which generates low quality scores across the entire read.

Reduced Q score

some causes

fluorescent signal decay
overclustering
instrument breakdown

Summary Table

Phase/Prephase
(%)

calculates rate at which molecules in cluster fall behind or jump ahead during read
an estimate for first 25 cycles of read
relies on comparison to Phi X control

Summary Table

Cycles Err Rated

number of cycles that have been used for calculating error rate starting from cycle one (paired end cycles)

Summary Table

Aligned
(%)

percentage of reads that can be aligned to control sample genome (Phi X)
- should match percent of Phi X spiked in during experiment
- if different, possible that quantification of library was off or the sample library couldn’t cluster or sequence well.
- sequencing run might have quality issue reflected by another metric

Summary Table

Error rate
(%)

actual calculated error rate which is determined by aligning actual reads produced in run from the control Phi X samples against the known Phi X genome sequence
3 different error rates: 35 cycles, 75 cycles, and entire run
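
The idea of counting mismatches against a known reference can be sketched as below. This is a simplification, not the instrument's algorithm: it assumes ungapped alignments with known start positions, and the function name, the toy reference, and the reads are all hypothetical.

```python
def error_rate_percent(alignments, reference, max_cycles=None):
    """Percent of aligned bases that differ from the reference.

    `alignments` is a list of (start_position, read_sequence) pairs.
    `max_cycles` limits the comparison to the first N cycles of each read
    (for example 35 or 75); None uses the whole read.
    """
    mismatches = total = 0
    for start, read in alignments:
        bases = read if max_cycles is None else read[:max_cycles]
        for offset, base in enumerate(bases):
            total += 1
            if base != reference[start + offset]:
                mismatches += 1
    return 100.0 * mismatches / total if total else 0.0


# Toy reference and two reads; the first read has one wrong base at its end.
reference = "ACGTACGTAC"
alignments = [(0, "ACGTACGA"), (2, "GTACGTAC")]
print(error_rate_percent(alignments, reference))                # 6.25
print(error_rate_percent(alignments, reference, max_cycles=4))  # 0.0
```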

Phi X Control Libraries

small single-stranded DNA genome
about 5,000 bp long (5,386 bp)
45% GC / 55% AT
well-defined genome sequence which enables quick alignment and estimation of error rate
mean insert size ~335 bp, 500bp with adapters
added to sample libraries after pooling and denaturation for use as positive control during clustering and sequencing process.
allows the quick calculation for % phasing/prephasing, percent alignment, and estimation of error rate
spiking with Phi X can increase overall nucleotide diversity of sample (1% for diverse library samples)

Diverse/balanced libraries

- have equal proportions of A, T, G, and C nucleotides at each base position
- Illumina sequencing is optimized around balanced representation of A, T, G, and C
- sequencing software uses images from the first 4-7 cycles of Read 1 to identify the location of each cluster
- the first 25 cycles are used for % passing filter evaluation in base calling and quality score calculations

Diverse libraries

per base sequence content data plot

- QC measure that shows the representation of each base by cycle
- in a well-balanced library, the percent-of-base plot shows even, horizontal curves centered around 25% for each base
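
The numbers behind a per base sequence content plot could be computed as in this sketch. It assumes equal-length reads containing only A, C, G, and T; the function name and the toy reads are hypothetical.

```python
from collections import Counter


def per_cycle_base_percent(reads):
    """Return, for each cycle, the percentage of A, C, G and T calls."""
    result = []
    for column in zip(*reads):
        counts = Counter(column)
        result.append({b: 100.0 * counts[b] / len(column) for b in "ACGT"})
    return result


# Balanced toy library: every base appears once at every cycle.
balanced = ["ACGT", "CGTA", "GTAC", "TACG"]
print(per_cycle_base_percent(balanced)[0])

# Low-diversity toy library: all reads start with the same sequence.
low_diversity = ["ACGT"] * 4
print(per_cycle_base_percent(low_diversity)[0])
```

In the balanced case every base sits at 25% at each cycle; in the low-diversity case a single base takes 100% at each cycle.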

low diversity libraries

- single base per cycle
- e.g. same primers used = all clusters have the exact same sequence at the beginning of the reads
- per base plot shows large intensity spikes at each cycle

unbalanced libraries

- have one base at much lower percentage than the others

QC metric

Percent of Adapter

- matches adapter sequences so that they can be subtracted from the base sequence data
- some library inserts are shorter than the read length, so they are read through into the adapter at the 3’ end
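
A rough sketch of adapter detection and trimming by exact matching is shown below. Real QC tools tolerate mismatches and partial adapters at the read end; this version does not. The function names are hypothetical, and the adapter string is an assumption used only for the example.

```python
# Assumption: example adapter prefix; substitute the adapter for your kit.
ADAPTER = "AGATCGGAAGAGC"


def find_adapter(read, adapter=ADAPTER, min_match=8):
    """Index where the first `min_match` adapter bases occur, or -1."""
    return read.find(adapter[:min_match])


def percent_with_adapter(reads, adapter=ADAPTER, min_match=8):
    """Percentage of reads that contain the start of the adapter."""
    if not reads:
        return 0.0
    hits = sum(1 for r in reads if find_adapter(r, adapter, min_match) >= 0)
    return 100.0 * hits / len(reads)


def trim_adapter(read, adapter=ADAPTER, min_match=8):
    """Remove the adapter and everything after it (3' end)."""
    index = find_adapter(read, adapter, min_match)
    return read if index < 0 else read[:index]


# Toy reads: the first has a short insert followed by adapter sequence.
reads = ["ACGTACGTAGATCGGAAGAGC", "ACGTACGTACGTACGTACGTA"]
print(percent_with_adapter(reads))  # 50.0
print(trim_adapter(reads[0]))       # ACGTACGT
```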

QC metric

Duplication Rate

- measures number of reads derived from one single original molecule
- a high level could indicate PCR enrichment bias due to the nature of the sample or library method
- can be truly overrepresented sequences, such as very abundant transcripts in an RNA sequencing library or in amplicon data (expected and normal)

Artificial causes
- optical
- clustering
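
One simple way to estimate a duplication rate is to count reads whose sequence has already been seen. This sketch assumes that identical sequences are duplicates, which is a simplification; the function name and the toy reads are hypothetical.

```python
from collections import Counter


def duplication_rate_percent(reads):
    """Percentage of reads that repeat an earlier identical sequence."""
    if not reads:
        return 0.0
    unique = len(Counter(reads))
    return 100.0 * (len(reads) - unique) / len(reads)


# Four toy reads, one of which is an exact copy of another.
print(duplication_rate_percent(["AAAA", "AAAA", "CCCC", "GGGG"]))  # 25.0
```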

Artificial duplication

Optical duplication

- single cluster that has been called as two clusters by software
- does NOT occur on patterned flow cells

Clustering duplication

- library molecule occupies two adjacent wells, due to overflow into neighboring wells
- two clusters of same original molecule
- unique to patterned flow cells

Module 4.2 Sequencing Run QC Flashcards

(32 cards)