Module 4.2 Sequencing Run QC Flashcards

1
Q

tiles

A

regions on a flow cell

2
Q

Summary Table categories

(14)

A
  1. Number of lanes
  2. Number of tiles per lane
  3. Density (K/mm2)
  4. Cluster PF (%)
  5. Phase/Prephase (%)
  6. Reads (M)
  7. Reads PF (M)
  8. % >= Q30
  9. Yield (G)
  10. Cycles Err Rated
  11. Aligned (%)
  12. Error Rate (%)
  13. Error Rate 35 cycle (%)
  14. Error Rate 75 cycle (%)
3
Q

Summary Table

A
  • statistics showing means and standard deviations over all the tiles used in one lane
4
Q

Summary Table

Density
(K/mm2)

A
  • concentration of clusters detected by image analysis
  • expressed as thousands of clusters per square millimeter
5
Q

Summary Table

Cluster PF
(%)

PF = Passing Filter

A
  • percentage of clusters (reads) that pass the Chastity filter
  • (Reads PF / Reads) x 100 = ANSWER
6
Q

Chastity

used in Chastity Filter

A
  • ratio of the brightest base intensity divided by the sum of the brightest and second brightest base intensities
  • Illumina internal quality filtering procedure
  • Clusters pass filter if no more than one base call has ANSWER value below 0.6 in first 25 cycles.
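A minimal sketch of the rule on this card, assuming each cycle is given as a tuple of four positive channel intensities (one per base). The function names and input layout are hypothetical and are not taken from Illumina software.

```python
def chastity(intensities):
    """Chastity for one cycle: brightest / (brightest + second brightest).

    Assumes at least two intensities and that the top two sum to a positive value.
    """
    top, second = sorted(intensities, reverse=True)[:2]
    return top / (top + second)


def passes_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """True if no more than `max_failures` of the first `window` cycles
    have a chastity value below `threshold`."""
    failures = sum(1 for c in cycles[:window] if chastity(c) < threshold)
    return failures <= max_failures


# Illustrative (made-up) intensities: a clean cluster and a mixed-signal cluster.
clean = [(100, 10, 5, 5)] * 25
mixed = [(60, 50, 5, 5)] * 2 + [(100, 10, 5, 5)] * 23
print(passes_filter(clean), passes_filter(mixed))
```

The second cluster fails because two of its first 25 cycles fall below 0.6, one more than the rule allows.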
7
Q

Summary Table

Reads
(M)

A
  • total number of clusters
  • measured in millions
8
Q

Summary Table

Reads PF
(M)

A
  • number of clusters that pass Chastity filter
  • measured in millions
9
Q

Yield
(G)

A
  • number of bases sequenced that pass filter.
  • 1 cluster = 1 read sequence
  • answer = (number of clusters passing filter) x (number of bases sequenced in each read)
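The arithmetic on this card and the Cluster PF card can be sketched as below. The function names and the example numbers are illustrative assumptions, not values from a real run.

```python
def cluster_pf_percent(reads_millions, reads_pf_millions):
    """Cluster PF (%) = Reads PF / Reads, expressed as a percentage."""
    return 100.0 * reads_pf_millions / reads_millions


def yield_gigabases(reads_pf_millions, bases_per_read):
    """Yield (G) = clusters passing filter x bases sequenced in each read."""
    total_bases = reads_pf_millions * 1e6 * bases_per_read
    return total_bases / 1e9


# Hypothetical run: 25 M clusters, 20 M passing filter, 150 bases per read.
print(cluster_pf_percent(25, 20))
print(yield_gigabases(20, 150))
```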
10
Q

Summary Table

% >= Q30

A

percentage of bases with Quality Score of 30 or higher

11
Q

Phred Quality Score

A
  • prediction of probability of an error in base calling
  • Q = -10 x log10(P)
  • P = base calling error probability
  • generally range 0-40
  • error prone due to extreme GC bias, specific patterns or homopolymers
  • data with less than Q 20 not valuable
12
Q

Phred

Analyzing process

A
  1. calculates several parameters related to peak shape and peak resolution at each base position
  2. use parameters to predict error probabilities generated from sequence trace, where correct sequence was known
  3. score (gray bar) matches to each colored peak (base)

-No score = N base assignment (not good enough peak for base call)

13
Q

Illumina quality scores calculation

steps (3)

A
  1. intensity profiles and signal to noise ratios measure base calling reliability
  2. compute quality predictor values for a new base call and compare to values in pre-calibrated Quality (Q) table/model. Quality scores recorded for each cycle in base call (BCL) files
  3. quality scores converted to FASTQ format
14
Q

Quality Score

FASTQ format

A

4 lines with separated fields per sequence read

  1. @Sequence_Identifier
  2. ATGC Raw Sequencing Data
  3. + (optionally followed by the sequence ID again or a description)
  4. Quality values for each base (ASCII)
15
Q

FASTQ file

A
  • text-based format for storing both the nucleotide sequence and its corresponding quality data
  • quality scores encoded in compact ASCII character format so it uses only one byte per quality value.
  • Quality score = ASCII character code - 33
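A short sketch of the decoding rule above, applied to one four-line FASTQ record. The record and the helper names are made up for illustration. The offset of 33 is the one stated on this card.

```python
def parse_fastq_record(record):
    """Split one four-line FASTQ record into (identifier, sequence, quality)."""
    header, sequence, _plus, quality = record.strip().split("\n")
    return header[1:], sequence, quality  # drop the leading '@'


def decode_qualities(quality_line, offset=33):
    """Quality score = ASCII character code - 33, one score per base."""
    return [ord(ch) - offset for ch in quality_line]


def percent_q30(scores):
    """Percentage of bases with a quality score of 30 or higher."""
    return 100.0 * sum(q >= 30 for q in scores) / len(scores)


record = "@read1\nACGT\n+\nI?5!\n"  # illustrative record, not real data
name, seq, qual = parse_fastq_record(record)
scores = decode_qualities(qual)
print(name, seq, scores, percent_q30(scores))
```

Here 'I' decodes to 40, '?' to 30, '5' to 20 and '!' to 0, so half of the bases are at Q30 or above.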
16
Q

Error probability (P)

A
  • estimated probability of a base call being wrong
  • P = 10^(-Q/10)
  • Q 30 -> P = 10^-3 = 1 in 1000 (0.1%) -> 99.9% base call accuracy
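The two directions of the Phred relationship can be checked with a few lines. This is only a sketch of the formulas on the cards, and the function names are illustrative.

```python
import math


def error_probability(q):
    """P = 10^(-Q/10): estimated probability that a base call is wrong."""
    return 10 ** (-q / 10)


def phred_score(p):
    """Q = -10 x log10(P): quality score for an error probability P."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q{q}: P = {p:.4%}, accuracy = {1 - p:.4%}")
```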
17
Q

Q Score plot features

A
  • x-axis: position in read (bp)
  • y-axis: Q score
  • box plot with statistical values
  • median value: central red line
  • interquartile range (25-75%): yellow box
  • 10% and 90%: lower and upper whiskers
  • mean Q score: solid blue line
  • Green, orange, and red zones
  • median quality score is lower for the first 5-7 bases, then rises, then drops toward the end of the read
18
Q

Overclustering

A
  • small distance between clusters = signal overlap
  • two clusters can be interpreted as a single cluster with mixed fluorescence signals
  • generates low quality scores across the entire read
19
Q

Reduced Q score

some causes

A

-fluorescent signal decay
-overclustering
-instrument breakdown

20
Q

Summary Table

Phase/Prephase
(%)

A
  • calculates rate at which molecules in cluster fall behind or jump ahead during read
  • an estimate for first 25 cycles of read
  • relies on comparison to Phi X control
21
Q

Summary Table

Cycles Err Rated

A

number of cycles that have been used for calculating error rate starting from cycle one (paired end cycles)

22
Q

Summary Table

Aligned
(%)

A

percentage of reads that can be aligned to control sample genome (Phi X)
- should match percent of Phi X spiked in during experiment
- if different, possible that quantification of library was off or the sample library couldn’t cluster or sequence well.
- sequencing run might have quality issue reflected by another metric

23
Q

Summary Table

Error rate
(%)

A
  • actual calculated error rate which is determined by aligning actual reads produced in run from the control Phi X samples against the known Phi X genome sequence
  • 3 different error rates: 35 cycles, 75 cycles, and entire run
24
Q

Phi X Control Libraries

A
  • small single stranded DNA genome.
  • about 5000 bp long (5386 bp)
  • 45% GC / 55% AT.
  • well-defined genome sequence which enables quick alignment and estimation of error rate
  • mean insert size ~335 bp, 500bp with adapters
  • added to sample libraries after pooling and denaturation for use as positive control during clustering and sequencing process.
  • allows the quick calculation for % phasing/prephasing, percent alignment, and estimation of error rate
  • spiking with Phi X can increase overall nucleotide diversity of sample (1% for diverse library samples)
25
Q

Diverse/balanced libraries

A
  • has equal proportions of ATGC nucleotides at each base position
  • Illumina optimized around balanced representation of ATGC
  • sequencing software uses images from the first 4-7 cycles of Read 1 to identify the location of each cluster
  • first 25 cycles used for % passing filter evaluation in base calling and quality score calculations
26
Q

diverse libraries

per base sequence content data plot

A

-QC measure that shows representation of each base position by cycle
- in well-balanced library percent of base plot shows even and horizontal curves centered around 25% for each base
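A minimal sketch of how the per base sequence content behind such a plot could be tabulated, assuming the reads are plain strings of A, C, G and T. The function name and the toy reads are hypothetical.

```python
from collections import Counter


def per_base_content(reads):
    """Percentage of A, C, G and T at each cycle (position) across all reads."""
    length = min(len(r) for r in reads)  # only positions present in every read
    content = []
    for pos in range(length):
        counts = Counter(r[pos] for r in reads)
        content.append({b: 100.0 * counts[b] / len(reads) for b in "ACGT"})
    return content


balanced = ["ACGT", "CGTA", "GTAC", "TACG"]  # each base 25% at every cycle
low_diversity = ["ACGT"] * 4                 # a single base at every cycle
print(per_base_content(balanced)[0])
print(per_base_content(low_diversity)[0])
```

The balanced set gives 25% for every base at every cycle, while the low diversity set gives 100% for a single base.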

27
Q

low diversity libraries

A
  • single base per cycle
  • e.g. same primers used = all clusters have the exact same sequence at the beginning of the reads
  • per base plot shows large intensity spikes at each cycle.
28
Q

unbalanced libraries

A
  • have one base at much lower percentage than the others
29
Q

QC metric

Percent of Adapter

A
  • matches adapter sequences that can be subtracted from base sequence data
  • some library inserts are shorter than read lengths so they’re read through to adapter at 3’ end
30
Q

QC metric

Duplication Rate

A
  • measures number of reads derived from one single original molecule
  • high level could indicate PCR enrichment bias due to nature of sample or library method
  • can be truly overrepresented sequences, such as very abundant transcripts in an RNA sequencing library or in amplicon data (expected and normal)

Artificial causes
- optical
- clustering
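As a rough illustration only, a duplication rate can be approximated by counting reads with identical sequences. This is a simplifying assumption: it cannot tell PCR, optical, or clustering duplicates apart from truly overrepresented sequences, and QC tools may define duplicates differently. The function name is hypothetical.

```python
from collections import Counter


def duplication_rate(reads):
    """Percentage of reads that repeat a sequence already seen in the set."""
    counts = Counter(reads)
    duplicates = len(reads) - len(counts)  # reads beyond the first copy of each sequence
    return 100.0 * duplicates / len(reads)


reads = ["AAA", "AAA", "CCC", "GGG"]  # toy reads, one duplicate out of four
print(duplication_rate(reads))
```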

31
Q

artificial duplication

Optical duplication

A
  • single cluster that has been called as two clusters by software
  • NOT on patterned flow cells
32
Q

Clustering duplication

A
  • library molecule occupies two adjacent wells, due to overflow into neighboring wells
  • two clusters of same original molecule
  • Unique to patterned flow cells