Lecture 4 Flashcards

1
Q

Explain how Illumina reads are assembled

A

Look for overlaps between the read sequences (text), join overlapping reads into contigs, and then assemble the contigs
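
To make the overlap idea concrete, here is a minimal sketch of joining two reads when the end of one matches the start of the other. The function names `overlap_length` and `merge_reads` and the `min_overlap` threshold are assumptions for illustration; real assemblers use far more elaborate methods and must tolerate sequencing errors.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3):
    """Join two reads into one longer sequence, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[size:]


# Hypothetical reads: the last 4 bases of the first match the first 4 of the second.
contig = merge_reads("ATGGCGT", "GCGTACC")
print(contig)  # ATGGCGTACC
```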

2
Q

What are contigs?

A

A contiguous (continuous) sequence built from overlapping reads

3
Q

What is a read?

A

A DNA fragment that has been sequenced, represented as text (a string of bases)

4
Q

Why do we still need to assess assembly quality even though the Human Genome Project is done?

A

Many organisms, especially those that are not well-studied model organisms, still have unknown genomes

5
Q

What are the 4 ways we measure how good an assembly is?

A
  1. % of reads assembled
  2. Number of contigs
  3. Length of contigs
  4. N50 → 50% of the assembly is contained in contigs of at least this length
6
Q

How do you find N50?

A
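  1. Sort the contigs by length (biggest to smallest is the usual convention)
  2. Add up all the contig lengths and divide by 2 → this number is NOT N50
  3. Add the contig lengths in sorted order until the running total reaches that half-total → the length of the contig where this happens is N50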
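
The N50 steps can be sketched in a few lines. This is an illustrative sketch only: the function name `n50` and the contig lengths in the example are made up, and it assumes contigs are sorted from biggest to smallest.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    # Step 1: sort from biggest to smallest.
    ordered = sorted(contig_lengths, reverse=True)
    # Step 2: half of the total assembly length (this is NOT the N50 itself).
    half_total = sum(ordered) / 2
    # Step 3: walk down the list until the running total reaches the half-total.
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly: total = 290, half = 145.
# 80 + 70 = 150 reaches 145, so the N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```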
7
Q

Describe the factors that limit assembly quality.

A
  1. Low coverage
  2. Difficult Sequences
  3. Low Accuracy
8
Q

What does low coverage mean?

A

Some sequences are missing, so there are not enough overlaps between reads

9
Q

How to find average coverage?

A

(number of reads × read length) / genome size
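
A short worked example of the coverage formula, as a sketch. The function name `average_coverage` and the numbers (1,000,000 reads of 150 bases on a 5,000,000-base genome) are made up for illustration.

```python
def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage = (number of reads x read length) / genome size."""
    return num_reads * read_length / genome_size


# Hypothetical run: 1,000,000 reads x 150 bases / 5,000,000 bases = 30x coverage.
print(average_coverage(1_000_000, 150, 5_000_000))  # 30.0
```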

10
Q

What determines the number of reads in Illumina sequencing?

A

The number of clusters that the machine generates

11
Q

What determines the read length in Illumina sequencing?

A

The number of pictures taken

12
Q

What coverage is recommended to determine the complete sequence of a genome?

A

30x - 50x coverage is recommended

13
Q

What are difficult sequences?

A

- Repeats → it is difficult to figure out how long the repeat region is
- Heterozygosity → some reads may have different bases at the same position

14
Q

What is low accuracy?

A

Mistakes made during sequencing (by the polymerase) lead to a bad assembly

15
Q

How does high coverage help in checking for sequencing errors?

A

With many reads covering the same position, you can determine whether a base is correct or not
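
One simple way to picture this is a majority vote across the reads that cover a position. The sketch below is a simplification under stated assumptions: the function name `consensus_base` and the example bases are made up, and a real pipeline would also have to account for heterozygosity, where a minority base can be genuine rather than an error.

```python
from collections import Counter


def consensus_base(bases_at_position: str):
    """Return the most common base at a position and the fraction of reads that agree."""
    counts = Counter(bases_at_position)
    base, count = counts.most_common(1)[0]
    return base, count / len(bases_at_position)


# Hypothetical position covered by 10 reads: 9 say A, 1 says G.
# The lone G is most likely a sequencing error.
print(consensus_base("AAAAAAAGAA"))  # ('A', 0.9)
```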

16
Q

How do you measure sequencing error?

A

- Collect the raw intensities for each color and convert them into a quality score
- The higher the quality score, the higher the probability that the called base is correct → the most intense color is taken as the correct base
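
The card does not say how a quality score maps to a probability. As a sketch, assuming the Phred scale that is commonly used for Illumina quality scores (Q = -10 × log10 of the error probability), the relationship looks like this. The function names are made up for illustration.

```python
import math


def error_probability(quality_score: float) -> float:
    """Probability that a base call is wrong, assuming a Phred-scaled score."""
    return 10 ** (-quality_score / 10)


def quality_score(error_prob: float) -> float:
    """Phred-scaled quality score for a given error probability."""
    return -10 * math.log10(error_prob)


# Higher score means lower chance of error (about 1 in 10, 1 in 100, 1 in 1000).
for q in (10, 20, 30):
    p_error = error_probability(q)
    print(f"Q{q}: error probability {p_error:.4f}, probability correct {1 - p_error:.4f}")
```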