Week 6 (QC and Alignment) Flashcards

1
Q

the ability to resolve a repetitive structure is dependent on the __________ of the molecules in your library

A

length

2
Q

Sanger sequencing reads are accurate up to ~_______ bp

A

1000

3
Q

short read sequencers

A

Illumina and Element

4
Q

short read sequencers produce reads of <____ bp and have a _______ error rate

A

<150 bp; low error rate (<<1%)

5
Q

long read sequencers

A
  • PacBio
  • Oxford Nanopore
6
Q

long read sequencers produce fewer reads, but the reads are much longer, ____ to _____ kb, with a _______ error rate of ~____%

A

10s to 100s of kb; higher error rate of ~1%

7
Q

what are the standard file formats?

A
  • FASTA
  • FASTQ
8
Q

FASTA has ___ parts

A

2

9
Q

what are the parts of FASTA

A
  1. > sequence name (always has >)
  2. sequence
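
The two parts can be read back out of a file with a few lines of code. This is a minimal sketch, not a full parser: the function name `parse_fasta` and the example records are made up for illustration, and it assumes a sequence may be wrapped over several lines.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            # Sequence lines are joined because FASTA sequences may be wrapped.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Made-up example data (not from the course material).
example = ">seq1 made-up example\nACGTAC\nGTTA\n>seq2\nGGCC\n"
records = parse_fasta(example)
for record_name, sequence in records:
    print(record_name, sequence)
```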
10
Q

FASTQ has ___ parts

A

4

11
Q

what are the parts of FASTQ

A
  1. @ sequence name (and other info)
  2. sequence
  3. + (sometimes other info)
  4. quality value
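
The four-part layout of a FASTQ record can be sketched in code. This is a simplified illustration that assumes every record occupies exactly four lines (name, sequence, "+" separator, quality); the function name `parse_fastq` and the example read are hypothetical.

```python
def parse_fastq(text):
    """Return (name, sequence, quality) tuples, assuming 4 lines per record."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("record does not follow the 4-line FASTQ layout")
        if len(sequence) != len(quality):
            raise ValueError("need one quality character per base")
        records.append((header[1:], sequence, quality))
    return records


# Made-up example read (not from the course material).
example = "@read1 made-up example\nACGTN\n+\nIIII#\n"
records = parse_fastq(example)
print(records)
```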
12
Q

when is FASTA used?

A

when per base quality is not needed

13
Q

what does FASTA present?

A

presents only the sequence itself

14
Q

the FASTA sequence name always starts with ______ symbol

A

>

15
Q

when is FASTQ used?

A

when per base quality is needed

16
Q

what does FASTQ present?

A

presents the sequence and the estimated base quality

17
Q

FASTQ sequence name always starts with ____ symbol

A

@

18
Q

when reading the sequence in FASTQ, what does the letter “N” symbolize?

A

any of the 4 nucleotides; the sequencer could not determine which nucleotide it was, so the more Ns, the worse the quality

19
Q

the _______ the Q value the more accurate the sequence is

A

higher

20
Q

each 10-point increase in quality score (e.g. Q10 to Q20) makes the base call more accurate by a factor of _____

A

10

21
Q

Qphred equation

A

Qphred = -10 × log10(P(error))
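
The Phred relationship can be checked numerically in both directions. A minimal sketch; the function names are made up for illustration.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):g}")
```

Each step of 10 in Q divides the error probability by 10, which is why Q20 is ten times more accurate than Q10.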

22
Q

At any given position in a sequence, the base present is either A/C/T/G but we cannot _________ observe the base. The base that is produced from a DNA sequencer is an observation based on some biochemical/physical property that has an ________.

A

directly; error

23
Q

Q20 is _____ times more accurate than Q10

A

10

24
Q

when using fluorescence in Illumina, we notice a change in color distribution with each cycle. How does this affect our accuracy?

A

clear signal intensity decreases as you do more cycles. this occurs because there may have been a failure to cleave previous fluors on nucleotides

25
Q

Phred scale quality values of ___ or less (__, ___, ___, ___)

A

40 or less (30, 20, 10)

26
Q

in the program, each Phred scale quality value is a ______ character, saving a lot of space

A

single
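
How one character can hold one quality value is easy to sketch. The example below assumes the common Phred+33 ASCII encoding (quality = character code minus 33); the offset used by a particular file is an assumption you should confirm, and `decode_quality` is a made-up name.

```python
def decode_quality(quality_string, offset=33):
    """Turn each quality character into its Phred score.

    Assumes Phred+33 encoding unless a different offset is passed in.
    """
    return [ord(ch) - offset for ch in quality_string]


# Made-up quality string: one character per base.
scores = decode_quality("I?5+")
print(scores)
```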

27
Q

what is an error that can occur in Element sequencing? does this happen often?

A

single base error, however this does not happen very often

28
Q

what is FastQC?

A
  • QC = quality control
  • one of the many software tools to evaluate quality
29
Q

FastQC does not actually do any filtering; instead it…?

A

provides summary metrics and visuals

30
Q

In Base Quality - is this good data or poor data?

31
Q

In Base Quality - is this good data or poor data?

32
Q

in per tile sequence quality, which is the good data and which is the poor data? how do you know?

A

good data on the left and poor data on the right. on the poor data there must have been a problem between cycle 5 and 15 in this location in the flow cell because there are lines shown.

33
Q

in these per sequence quality scores, which one has good data and which has poor data? how do you know?

A

good data on the left has a high sequence quality almost to the end of the reads. the poor data is on the right with sequence quality that begins to decrease earlier in the reads.

34
Q

in the per base sequence content, which has good data and which has poor data? how do you know?

A

good data on the left because it has a uniform base composition to the end of the reads. in the poor data on the right, the base composition diverges, indicating problems.

35
Q

error probability: Q10

A

0.1 (1 in 10)

36
Q

error probability: Q20

A

0.01 (1 in 100)

37
Q

error probability: Q30

A

0.001 (1 in 1000)

38
Q

error probability: Q40

A

0.0001 (1 in 10000)

39
Q

k-mer

A

a substring of a fixed length that appears in a biological sequence

40
Q

4^(kmer length)

A

number of possible k-mers of that length (4 possible nucleotides at each position)
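
Both k-mer ideas can be illustrated in a few lines. A minimal sketch with made-up function names and a made-up sequence: one function lists every k-mer in a sequence, the other counts how many distinct k-mers are possible over A/C/G/T.

```python
def kmers(sequence, k):
    """List every substring of length k, sliding one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def possible_kmers(k):
    """Number of distinct k-mers over the 4 nucleotides: 4^k."""
    return 4 ** k


print(kmers("ACGTA", 3))
print(possible_kmers(3))
```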

41
Q

should k-mer lengths be odd or even integers?

A

use odd integers

42
Q

sequence alignment

A

a string matching problem: matching your sequences against a reference genome

43
Q

two main algorithms for alignment

(of short read sequencing data)

A
  • Smith-Waterman
  • Burrows-Wheeler transform
44
Q

Smith-Waterman

A

guaranteed to find the optimal local alignment with respect to the scoring matrices used
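
The dynamic programming behind Smith-Waterman can be sketched as a score-only version. The scoring values here (match +2, mismatch -1, linear gap -2) are illustrative assumptions, not the course's or any tool's defaults, and the function name is made up. It fills the full table, which is why the method is slow for high throughput data.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[i - 1][j] + gap      # gap in b
            left = table[i][j - 1] + gap    # gap in a
            # The floor of 0 lets an alignment restart anywhere (local alignment).
            table[i][j] = max(0, diagonal, up, left)
            best = max(best, table[i][j])
    return best


# Made-up sequences that share the substring "ACGT".
print(smith_waterman_score("GGACGTGG", "TTACGTCC"))
```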

45
Q

what is Smith-Waterman best used for?

A

good for aligning “Sanger” length data; it is too slow for high throughput data

46
Q

Burrows-Wheeler transform

A

creates a suffix array of smaller k-mers, indexes the reference genome, matches the seed of the read to the reference, and extends the seed to a full alignment
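
The transform itself can be shown on a toy string. This is a naive sketch that sorts every rotation of the text, which is only practical for short inputs; real aligners build their index far more efficiently. The function name and the "$" end marker are assumptions for illustration.

```python
def burrows_wheeler_transform(text, terminator="$"):
    """Naive BWT: sort all rotations and take the last column."""
    if terminator in text:
        raise ValueError("text must not contain the terminator character")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


# Made-up toy sequence.
print(burrows_wheeler_transform("GATTACA"))
```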

47
Q

what are the most used aligners?

A

BWA (WGS) and STAR (RNA)

48
Q

what two algorithms for alignment are best used in short read sequencing data?

A
  • Smith-Waterman
  • Burrows-Wheeler transform
49
Q

what algorithm is used for long read?

(in alignment)

50
Q

two main sequence file formats

A

FASTA and FASTQ

51
Q

high throughput sequence data needs _____

52
Q

what software tool provides summary and visualization to evaluate quality?

A

FastQC

53
Q

sequence alignment

A
  • Smith-Waterman
  • Burrows-Wheeler transform (BWA and STAR)
54
Q

what does “depth” mean when sequencing?

A

the number of times a given position has been sequenced
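
Depth can be counted directly from where reads land on the reference. A minimal sketch under simplifying assumptions: each alignment is a hypothetical (start, length) pair with a 0-based start and no gaps, and the function name is made up.

```python
def depth_per_position(reference_length, alignments):
    """Count how many reads cover each reference position."""
    depth = [0] * reference_length
    for start, length in alignments:
        # Clip reads that run past the end of the reference.
        for position in range(start, min(start + length, reference_length)):
            depth[position] += 1
    return depth


# Three made-up reads on a 6-base reference.
depth = depth_per_position(6, [(0, 4), (2, 4), (3, 2)])
print(depth)
```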

55
Q

what is the ability to resolve a repetitive structure dependent on?

A

the length of the molecules in your library

56
Q

when analyzing a pileup, you notice that one of the lines has equal parts A and C. What does this mean? Is this an error in the sequence?

A

this is most likely not an error. it shows that the chromosome from one parent carries an A and the chromosome from the other parent carries a C, so the pileup is representing both chromosomes (heterozygous)
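
The reasoning about one pileup position can be sketched by counting the bases observed there. The 30% threshold and the function name are illustrative assumptions, not a real variant caller's rule: a base seen in only a small fraction of reads is more likely a sequencing error than a second allele.

```python
from collections import Counter


def classify_column(bases, min_fraction=0.3):
    """Label one pileup position from the bases observed in the reads."""
    counts = Counter(bases)
    total = sum(counts.values())
    # Keep only bases seen often enough to be a plausible allele.
    alleles = sorted(b for b, n in counts.items() if n / total >= min_fraction)
    if len(alleles) == 1:
        return "homozygous", alleles
    if len(alleles) == 2:
        return "heterozygous", alleles
    return "unclear", alleles


# Made-up columns: equal parts A and C, then mostly A with a single C.
print(classify_column("AACCACACCA"))
print(classify_column("AAAAAAAAAC"))
```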