Week 6 (QC and Alignment) Flashcards
the ability to resolve a repetitive structure is dependent on the __________ of the molecules in your library
length
Sanger sequencing reads are accurate up to ~_______ bp
1000
short read sequencers
Illumina and Element
short read sequencers use <____ bp and have a _______ error rate
<150 bp; low error rate (<<1%)
long read sequencers
- PacBio
- Oxford Nanopore
long read sequencers produce fewer reads, but the reads are much longer, ____ to _____ kb, with a _______ error rate of ~____%
10s to 100s of kb; higher error rate of ~1%
what are the standard file formats?
- FASTA
- FASTQ
FASTA has ___ parts
2
what are the parts of FASTA
- > sequence name (always has >)
- sequence
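A minimal sketch of reading the two-part FASTA layout described above, assuming the records are already in a string. The function name `parse_fasta` and the example records are made up for illustration and do not come from any particular library.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]          # header line: drop the leading ">"
            records[name] = ""
        elif name is not None:
            records[name] += line    # a sequence may wrap over several lines
    return records


# Hypothetical two-record example
example = ">seq1\nACGT\nACGT\n>seq2\nTTGA\n"
print(parse_fasta(example))  # {'seq1': 'ACGTACGT', 'seq2': 'TTGA'}
```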
FASTQ has ___ parts
4
what are the parts of FASTQ
- @ sequence name (and other info)
- sequence
- (sometimes other info)
- quality value
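The four-part FASTQ record can be read in the same spirit. This is a sketch that assumes each record occupies exactly four lines and that the third line starts with "+"; `parse_fastq` and the example read are hypothetical names and data.

```python
def parse_fastq(text):
    """Parse FASTQ text into a list of (name, sequence, quality) tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    # Walk the file four lines at a time: @name, sequence, +, quality
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], seq, qual))
    return records


# Hypothetical single-read example; the last base is an uncalled "N"
example = "@read1\nACGTN\n+\nIIII!\n"
print(parse_fastq(example))  # [('read1', 'ACGTN', 'IIII!')]
```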
when is FASTA used?
when per base quality is not needed
what does FASTA present?
presents only the sequence itself
the FASTA sequence name always starts with ______ symbol
>
when is FASTQ used?
when per base quality is needed
what does FASTQ present?
presents the sequence and the estimated base quality
FASTQ sequence name always starts with ____ symbol
@
when reading the sequence in FASTQ, what does the letter “N” symbolize?
any of the 4 nucleotides: the sequencer could not determine which base it was, so the more Ns, the worse the quality
the _______ the Q value the more accurate the sequence is
higher
Each 10-point increase in quality score reduces the error probability by a factor of _____
10
Qphred equation
Qphred = -10 × log10(P(error))
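The Phred equation can be checked numerically. The sketch below converts in both directions; the function names are illustrative only.

```python
import math


def phred_from_error(p_error):
    """Phred quality score for a given error probability."""
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Error probability for a given Phred quality score."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    print(q, error_from_phred(q))
```

Each step of 10 in Q divides the error probability by 10, which is why Q20 is 10 times more accurate than Q10.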
At any given position in a sequence, the base present is either A/C/T/G but we cannot _________ observe the base. The base that is produced from a DNA sequencer is an observation based on some biochemical/physical property that has an ________.
directly; error
Q20 is _____ times more accurate than Q10
10
when using fluorescence in Illumina sequencing, we notice a change in color distribution with each cycle. How does this affect our accuracy?
the clarity of the signal intensity decreases as you run more cycles. This occurs because fluors on previously incorporated nucleotides may have failed to cleave
Phred scale quality values of ___ or less (__, ___, ___, ___)
40 or less (30, 20, 10)
in the file, each Phred scale quality value is stored as a ______ character, saving a lot of space
single
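To illustrate the single-character encoding, the sketch below decodes a quality string back into Phred scores. It assumes the common Phred+33 ASCII offset, which the cards do not state; older files may use a different offset, so treat `offset=33` as an assumption.

```python
def decode_qualities(qual_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (the widely used Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in qual_string]


print(decode_qualities("I?5+!"))  # [40, 30, 20, 10, 0]
```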
what is an error that can occur in Element sequencing? does this happen often?
a single-base error; however, this does not happen very often
what is FastQC?
- QC = quality control
- one of the many software tools to evaluate quality
FastQC does not actually do any filtering, it….?
provides summary metrics and visuals
In Base Quality - is this good data or poor data?
good data
In Base Quality - is this good data or poor data?
poor data
in per tile sequence quality, which is the good data and which is the poor data? how do you know?
good data on the left and poor data on the right. In the poor data, the lines shown indicate a problem between cycles 5 and 15 at this location in the flow cell.
in these per sequence quality scores, which one has good data and which has poor data? how do you know?
the good data on the left keeps a high sequence quality almost to the end of the reads. The poor data on the right has sequence quality that begins to decrease earlier in the reads.
in the per base sequence content, which has good data and which has poor data? how do you know?
the good data on the left has a uniform base composition to the end of the reads. In the poor data on the right, the base composition diverges, indicating problems.
error probability: Q10
0.1 (1 in 10)
error probability: Q20
0.01 (1 in 100)
error probability: Q30
0.001 (1 in 1000)
error probability: Q40
0.0001 (1 in 10000)
k-mer
a substring of a fixed length that appears in a biological sequence
4^(k-mer length)
number of possible k-mers of that length (4 possible nucleotides at each position)
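A short sketch of both ideas: listing the k-mers that occur in a sequence, and enumerating every possible k-mer to confirm the 4^k count. The function names and the example sequence are illustrative.

```python
from collections import Counter
from itertools import product


def count_kmers(seq, k):
    """Count every substring of length k that appears in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def all_possible_kmers(k, alphabet="ACGT"):
    """Enumerate every possible k-mer over the nucleotide alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


print(count_kmers("ACGTACG", 3))   # ACG appears twice; CGT, GTA, TAC once
print(len(all_possible_kmers(3)))  # 64, i.e. 4**3
```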
should k-mer lengths be odd or even integers?
use odd integers
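One commonly given reason for choosing odd k, which the card itself does not state, is that an odd-length k-mer can never equal its own reverse complement, because the middle base would have to pair with itself. The sketch below counts such self-matching k-mers for a few small values of k; the helper names are illustrative.

```python
from itertools import product

# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_self_reverse_complement(k):
    """Count k-mers that are identical to their own reverse complement."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return sum(1 for kmer in kmers if kmer == reverse_complement(kmer))


for k in (2, 3, 4, 5):
    print(k, count_self_reverse_complement(k))
```

Even values of k give a nonzero count (for example AT and CG when k = 2), while odd values give zero.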
sequence alignment
a string-matching problem: matching your sequenced reads against a reference genome
two main algorithms for alignment
(of short read sequencing data)
- Smith-Waterman
- Burrows-Wheeler transform
Smith-Waterman
guaranteed to find the optimal local alignment with respect to the scoring matrices used
what is Smith-Waterman best used for?
good for aligning “Sanger” length data, but too slow for high throughput data
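A minimal Smith-Waterman sketch, assuming a simple match/mismatch score and a linear gap penalty instead of a full scoring matrix. The scores (match=2, mismatch=-1, gap=-2) are arbitrary illustrative values. The nested loops fill a table with one cell per pair of positions, which is why the method becomes slow for high throughput data.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # scoring table, first row/column 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTTT", "GGACGTCC"))  # (8, 'ACGT', 'ACGT')
```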
Burrows-Wheeler transform
creates a suffix array of smaller k-mers to index the reference genome, matches the seed of the read to the reference, and extends the seed to a full alignment
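To make the indexing step concrete, here is a naive sketch that builds a suffix array and the Burrows-Wheeler transform of a tiny reference. It sorts whole suffixes, which is far too slow and memory-hungry for a real genome. Production aligners use much more efficient constructions, so this only shows the idea. The "$" terminator and the function names are illustrative choices.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in sorted order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, terminator="$"):
    """Burrows-Wheeler transform: the character preceding each sorted suffix."""
    text = text + terminator          # "$" sorts before A, C, G, T
    sa = suffix_array(text)
    # text[i - 1] wraps around to the terminator when i == 0
    return "".join(text[i - 1] for i in sa)


print(suffix_array("ACAACG$"))  # [6, 2, 0, 3, 1, 4, 5]
print(bwt("ACAACG"))            # GC$AAAC
```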
what are the most used aligners?
BWA (WGS) and STAR (RNA)
which two alignment algorithms are best suited to short read sequencing data?
- Smith-Waterman
- Burrows-Wheeler transform
what aligner is used for long reads?
(in alignment)
minimap2
two main sequence file formats
FASTA and FASTQ
high throughput sequence data needs _____
QC
what software tool provides summary metrics and visualization to evaluate quality?
FastQC
sequence alignment
- Smith-Waterman
- Burrows-Wheeler transform (BWA and STAR)
what does “depth” mean in sequencing?
the number of times a given position has been sequenced
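Depth can be sketched as a per-position count of overlapping reads. The example assumes each aligned read is described by a 0-based start position and a length on the reference; the function name and the three reads are made up for illustration.

```python
def depth_per_position(ref_length, alignments):
    """Count how many aligned reads cover each reference position.

    alignments: list of (start, length) tuples with 0-based starts.
    """
    depth = [0] * ref_length
    for start, length in alignments:
        # Clip reads that run past the end of the reference
        for pos in range(start, min(start + length, ref_length)):
            depth[pos] += 1
    return depth


# Three hypothetical 5 bp reads on a 10 bp reference
print(depth_per_position(10, [(0, 5), (2, 5), (4, 5)]))
# [1, 1, 2, 2, 3, 2, 2, 1, 1, 0]
```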
what is the ability to resolve a repetitive structure dependent on?
the length of the molecules in your library
when analyzing a pileup, you notice that one position shows equal parts A and C. What does this mean? Is this an error in the sequence?
this is most likely not an error. It shows that the chromosome from one parent carries an A and the chromosome from the other parent carries a C, so the reads represent both chromosomes (heterozygous)
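A rough sketch of how a single pileup position might be summarized. The `min_fraction=0.2` cutoff is an arbitrary illustrative threshold, not a value from the course; real variant callers use statistical models that also weigh base quality and depth.

```python
from collections import Counter


def summarize_pileup_column(bases, min_fraction=0.2):
    """Return (alleles, call) for the bases observed at one position.

    min_fraction is a hypothetical cutoff below which a base is
    treated as likely sequencing error rather than a real allele.
    """
    counts = Counter(bases.upper())
    total = sum(counts.values())
    alleles = sorted(b for b, n in counts.items() if n / total >= min_fraction)
    if len(alleles) == 1:
        call = "homozygous"
    elif len(alleles) == 2:
        call = "heterozygous"
    else:
        call = "ambiguous"
    return alleles, call


print(summarize_pileup_column("AACCACCA"))    # (['A', 'C'], 'heterozygous')
print(summarize_pileup_column("AAAAAAAAAC"))  # (['A'], 'homozygous')
```

A roughly 50/50 split supports a heterozygous site, whereas a lone mismatching base among many reads looks more like a sequencing error.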