Week 6 (QC and Alignment) Flashcards

Question 1

Q

the ability to resolve a repetitive structure is dependent on the __________ of the molecules in your library

Question 2

Q

Sanger sequencing is ~_______bp accurate

Question 3

Q

short read sequencers

Answer

A

Illumina and Element

Question 4

Q

short read sequencers use <____ bp and have a _______ error rate

Answer

A

<150 bp; low error rate (≪1%)

Question 5

Q

long read sequencers

Answer

A

PacBio
Oxford Nanopore

Question 6

Q

long read sequencers produce fewer reads, but the reads are much longer, being ____ to _____ kb, with a _______ error rate of ~____%

Answer

A

10s to 100s of kb; higher error rate of ~1%

Question 7

Q

what are the standard file formats?

Answer

A

FASTA
FASTQ

Question 8

Q

FASTA has ___ parts

Question 9

Q

what are the parts of FASTA

Answer

A

> sequence name (always has >)
sequence
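
A minimal sketch of reading the two FASTA parts in Python. The function name `parse_fasta` and the toy records are made up for illustration; the sketch assumes a sequence may be wrapped over several lines under one `>` header.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence name: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Part 1: the header line, always introduced by '>'
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Part 2: the sequence, possibly wrapped over several lines
            records[name] += line
    return records


# Toy input (hypothetical records, not real data)
example = [">seq1 toy record", "ACGT", "TTGA", ">seq2", "GGCC"]
fasta_records = parse_fasta(example)
```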

Question 10

Q

FASTQ has ___ parts

Question 11

Q

what are the parts of FASTQ

Answer

A

@ sequence name (and other info)
sequence
+ (sometimes followed by other info)
quality value
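
A minimal sketch of reading the four FASTQ parts in Python. It assumes the common unwrapped layout of exactly four lines per record, with the third line starting with `+`; `FastqRecord`, `parse_fastq`, and the toy read are illustrative names, not part of any library.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ entries (no line wrapping assumed)."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at record start, got {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator, got {plus!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        yield FastqRecord(header[1:], seq, qual)


# Toy read (hypothetical): one quality character per base
toy = ["@read1 toy", "ACGTN", "+", "IIII#"]
fastq_records = list(parse_fastq(toy))
```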

Question 12

Q

when is FASTA used?

Answer

A

when per base quality is not needed

Question 13

Q

what does FASTA present?

Answer

A

presents only the sequence itself

Question 14

Q

the FASTA sequence name always starts with ______ symbol

Question 15

Q

when is FASTQ used?

Answer

A

when per base quality is needed

Question 16

Q

what does FASTQ present?

Answer

A

presents the sequence and the estimated base quality

Question 17

Q

FASTQ sequence name always starts with ____ symbol

Question 18

Q

when reading the sequence in FASTQ, what does the letter “N” symbolize?

Answer

A

any of the 4 nucleotides: the sequencer could not determine which nucleotide it was, so the more N’s, the worse the quality

Question 19

Q

the _______ the Q value the more accurate the sequence is

Question 20

Q

Quality scores increase by a factor of _____

Question 21

Q

Qphred equation

Answer

A

Qphred = -10 × log10(P(error))
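
The equation can be checked numerically with a short sketch. The function names below are illustrative; the only assumption is the formula on this card and its inverse, P(error) = 10^(-Q/10).

```python
import math


def phred_from_error(p_error: float) -> float:
    """Q = -10 * log10(P(error)), for 0 < P(error) <= 1."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Inverse of the Phred equation: P(error) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# Each step of 10 in Q divides the error probability by 10
phred_table = {q: error_from_phred(q) for q in (10, 20, 30, 40)}
```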

Question 22

Q

At any given position in a sequence, the base present is either A/C/T/G but we cannot _________ observe the base. The base that is produced from a DNA sequencer is an observation based on some biochemical/physical property that has an ________.

Answer

A

directly; error

Question 23

Q

Q20 is _____ times more accurate than Q10

Question 24

Q

when using fluorescence in Illumina, we notice a change in color distribution with each cycle. How does this affect our accuracy?

Answer

A

clear signal intensity decreases as you do more cycles. this occurs because there may have been a failure to cleave previous fluors on nucleotides

Question 25

Q

Phred scale quality values of ___ or less (__, ___, ___, ___)

Answer

A

40 or less (30, 20, 10)

Question 26

Q

in the program, the Phred scale quality values are a ______ character, saving a lot of space
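
A minimal sketch of how one character can hold a quality value. It assumes the Phred+33 encoding (ASCII code minus 33), which is the usual convention in current FASTQ files but is not stated on these cards; the function names are illustrative.

```python
from typing import List


def decode_phred33(quality: str) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (assumes offset 33)."""
    return [ord(ch) - 33 for ch in quality]


def encode_phred33(scores: List[int]) -> str:
    """Convert Phred scores back to one ASCII character per base."""
    return "".join(chr(q + 33) for q in scores)


# '!' is ASCII 33, so it encodes Q0; 'I' is ASCII 73, so it encodes Q40
decoded = decode_phred33("!+5?I")
```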

Question 27

Q

what is an error that can occur in Element sequencing? does this happen often?

Answer

A

single-base error; however, this does not happen very often

Question 28

Q

what is FastQC?

Answer

A

QC = quality control
one of the many software tools to evaluate quality

Question 29

Q

FastQC does not actually do any filtering; instead it...?

Answer

A

provides summary metrics and visuals

Question 30

Q

In Base Quality - is this good data or poor data?

Answer

A

good data

Question 31

Q

In Base Quality - is this good data or poor data?

Answer

A

poor data

Question 32

Q

in per tile sequence quality, which is the good data and which is the poor data? how do you know?

Answer

A

good data on the left and poor data on the right. on the poor data there must have been a problem between cycle 5 and 15 in this location in the flow cell because there are lines shown.

Question 33

Q

in these per sequence quality scores, which one has good data and which has poor data? how do you know?

Answer

A

good data on the left has a high sequence quality almost to the end of the reads. the poor data is on the right with sequence quality that begins to decrease earlier in the reads.

Question 34

Q

in the per base sequence content, which has good data and which has poor data? how do you know?

Answer

A

good data on the left because it has a uniform base composition to the end of the reads. the base composition of the poor data on the right diverges, indicating problems.

Question 35

Q

error probability: Q10

Answer

A

0.1 (1 in 10)

Question 36

Q

error probability: Q20

Answer

A

0.01 (1 in 100)

Question 37

Q

error probability: Q30

Answer

A

0.001 (1 in 1000)

Question 38

Q

error probability: Q40

Answer

A

0.0001 (1 in 10000)

Question 39

Q

k-mer

Answer

A

a substring of a fixed length that appears in a biological sequence

Question 40

Q

4^(kmer length)

Answer

A

number of possible k-mers of that length (4 possible nucleotides at each position)
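
A small sketch of both ideas: listing the k-mers that occur in a sequence, and counting how many distinct k-mers are possible for a given k (4^k, since each position can be A, C, G, or T). The function names and the toy sequence are illustrative.

```python
from typing import List


def kmers(sequence: str, k: int) -> List[str]:
    """Return every substring of length k, in order of appearance."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_possible_kmers(k: int) -> int:
    """Number of distinct DNA k-mers: 4 choices at each of k positions."""
    return 4 ** k


# A sequence of length L contains L - k + 1 k-mers
toy_kmers = kmers("ACGTAC", 3)
```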

Question 41

Q

are K-mers odd or even integers?

Answer

A

use odd integers
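
The card does not give the reason, but a common one is that an odd-length k-mer can never equal its own reverse complement, because the middle base would have to be its own complement. The sketch below checks this by brute force for small k; the function names are illustrative.

```python
from itertools import product

# Translation table for complementing DNA bases
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_self_reverse_complement(k: int) -> int:
    """Count k-mers that are identical to their own reverse complement."""
    total = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            total += 1
    return total
```

With even k such k-mers exist (for example ACGT), so the two strands cannot be told apart at those k-mers; with odd k the count is zero.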

Question 42

Q

sequence alignment

Answer

A

a string matching problem, using a reference genome to match your sequences

Question 43

Q

two main algorithms for alignment

(of short read sequencing data)

Answer

A

Smith-Waterman
Burrows-Wheeler transform

Question 44

Q

Smith-Waterman

Answer

A

guaranteed to find the optimal local alignment with respect to the scoring matrices used
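
A minimal sketch of Smith-Waterman local alignment, to show where the optimality comes from: every cell of the matrix is filled, so no alignment is missed. The scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions chosen for illustration, not values from the cards.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, str, str]:
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Filling the whole matrix costs time proportional to len(a) × len(b), which is why the approach is practical for Sanger-length sequences but too slow for high-throughput data.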

Question 45

Q

what is Smith-Waterman best used for?

Answer

A

good for aligning “Sanger”-length data, but too slow for high-throughput data

Question 46

Q

Burrows-Wheeler transform

Answer

A

creates a suffix array of smaller k-mers to index the reference genome, matches the seed of the read to the reference, and extends the seed to a full alignment
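
A toy sketch of the indexing and seed-matching steps. It builds a suffix array by naive sorting, derives the Burrows-Wheeler transform from it, and finds exact seed hits by binary search. Real aligners such as BWA use a compressed index built on the transform and are far more efficient; the function names here are illustrative, and the seed-extension step is left out.

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes, in sorted order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sa: List[int]) -> str:
    """BWT = the character preceding each sorted suffix ('$' must end text)."""
    return "".join(text[i - 1] for i in sa)


def find_seed(text: str, sa: List[int], seed: str) -> List[int]:
    """Return all positions where seed occurs exactly, via binary search."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(seed)] < seed:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(seed)] == seed:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


# Toy reference (hypothetical) with a '$' end marker
reference = "ACGTACGTTAGC$"
ref_sa = suffix_array(reference)
seed_hits = find_seed(reference, ref_sa, "ACGT")
```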

Question 47

Q

what are the most used aligners?

Answer

A

BWA (WGS) and STAR (RNA)

Question 48

Q

what two algorithms for alignment are best used in short read sequencing data?

Answer

A

Smith-Waterman
Burrows-Wheeler transform

Question 49

Q

what algorithm is used for long read?

(in alignment)

Question 50

Q

two main sequence file formats

Answer

A

FASTA and FASTQ

Question 51

Q

high throughput sequence data needs _____

Question 52

Q

what software tool provides summary and visualization to evaluate quality

Question 53

Q

sequence alignment

Answer

A

Smith-Waterman
Burrows-Wheeler transform (BWA and STAR)

Question 54

Q

what does “depth” mean when sequencing?

Answer

A

the number of times a given position has been sequenced (how many reads cover it)
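
A small sketch of depth as a per-position count. It assumes each aligned read is described by a 0-based start position and a length on the reference; the function name and the toy reads are made up for illustration.

```python
from typing import List, Tuple


def per_base_depth(reference_length: int,
                   reads: List[Tuple[int, int]]) -> List[int]:
    """Count how many reads cover each reference position.

    Each read is (start, length) with a 0-based start.
    """
    depth = [0] * reference_length
    for start, length in reads:
        # Clip the read to the reference boundaries
        for pos in range(max(start, 0), min(start + length, reference_length)):
            depth[pos] += 1
    return depth


# Three toy reads of length 4 on a reference of length 8
toy_depth = per_base_depth(8, [(0, 4), (2, 4), (3, 4)])
mean_depth = sum(toy_depth) / len(toy_depth)
```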

Question 55

Q

what is the ability to resolve a repetitive structure dependent on?

Answer

A

the length of the molecules in your library

Question 56

Q

when analyzing a pileup, you notice that one of the lines has equal parts A and C. What does this mean? Is this an error in the sequence?

Answer

A

this is most likely not an error. it shows that the chromosome from one parent carries an A and the chromosome from the other parent carries a C, so the pileup represents both chromosomes (heterozygous)
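
A rough sketch of that reasoning on one pileup column. It counts the fraction of reads supporting each base and flags the column as possibly heterozygous when the two most common bases each fall between 30% and 70%. Those thresholds and the function names are arbitrary assumptions for illustration, not a real variant-calling rule.

```python
from collections import Counter
from typing import Dict


def allele_fractions(column_bases: str) -> Dict[str, float]:
    """Fraction of reads supporting each base at one position."""
    counts = Counter(b.upper() for b in column_bases if b.upper() in set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}


def looks_heterozygous(column_bases: str, low: float = 0.3,
                       high: float = 0.7) -> bool:
    """True if the two most common bases both fall in [low, high]."""
    top_two = sorted(allele_fractions(column_bases).values(), reverse=True)[:2]
    return len(top_two) == 2 and all(low <= f <= high for f in top_two)
```

A column such as "AACCACCA" (half A, half C) passes the check, while "AAAAAAAC" does not, which is more consistent with a single sequencing error.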