Genome assembly Flashcards

1
Q

Why is genome assembly necessary?

A

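No sequencing technology can produce reads long enough to cover a whole genome, so the sequence has to be reconstructed from many shorter reads.

Assembly is the rate-limiting step in genomics.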

2
Q

De Novo vs Re-Sequencing

A

De novo: determination of a full-genome sequence, NO known reference sequence. Needs a lot of sequence coverage and computing power.

Re-sequencing: a reference genome sequence is known. The assembly process is replaced by mapping the raw sequence reads onto the reference genome. Less sequence coverage and computing power needed.

3
Q

what is resequencing good and bad at detecting?

A

Good - SNPs

Bad - limited in the detection of structural rearrangements (insertions, deletions, inversions).

4
Q

How to calculate coverage

A

Coverage = total bases sequenced / bases in the target sequence (i.e. how many times, on average, each base is read).
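A minimal sketch of the coverage calculation. The function name and the example numbers are hypothetical and chosen only for illustration.

```python
def coverage(total_bases_sequenced: int, target_length: int) -> float:
    """Average depth: sequenced bases divided by bases in the target sequence."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return total_bases_sequenced / target_length


# Hypothetical run: 1,000,000 reads of 150 bp over a 5 Mb genome.
reads, read_length, genome_length = 1_000_000, 150, 5_000_000
print(coverage(reads * read_length, genome_length))  # 30.0
```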

5
Q

How do assembly algorithms work and what are 2 difficulties?

A

They search for overlaps between sequence reads.

  • the data volume requires high computing power
  • repeat regions that are longer than the read length cannot be resolved
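The overlap search can be sketched as below. This is a naive suffix/prefix comparison for two reads, not how any particular assembler implements it; `overlap`, `merge` and `min_len` are hypothetical names.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    longest_possible = min(len(a), len(b))
    for k in range(longest_possible, max(min_len, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge(a: str, b: str, min_len: int = 3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None


print(overlap("ATGCGTAC", "GTACGGA"))  # 4
print(merge("ATGCGTAC", "GTACGGA"))    # ATGCGTACGGA
```

Comparing every read against every other read this way scales with the square of the number of reads, which is one reason the data volume demands so much computing power.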
6
Q

Relationship between sequence coverage and probability of detection

A

sigmoidal (S curve)
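One way to see where an S curve can come from, as a sketch only: assume read depth at a position is Poisson-distributed with mean equal to the coverage, and that detection needs at least `min_reads` reads. Both the Poisson model and the threshold are assumptions made here, not something the card states.

```python
import math


def detection_probability(coverage: float, min_reads: int = 4) -> float:
    """P(depth >= min_reads) when depth ~ Poisson(coverage)."""
    below_threshold = sum(
        math.exp(-coverage) * coverage**i / math.factorial(i)
        for i in range(min_reads)
    )
    return 1.0 - below_threshold


# Probability rises slowly at low coverage, steeply in the middle, then plateaus.
for c in (0, 1, 2, 4, 8, 16, 32):
    print(c, round(detection_probability(c), 3))
```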

7
Q

what is a contig?

A

partial assembly of data from overlapping fragments into a contiguous region of sequences. The order of the contigs is NOT known.

8
Q

why are repetitive regions problematic in assembly

A

Regions of repetitive sequence can mean that contigs cannot be joined up.

9
Q

describe paired end sequencing

A
  • create a library from the sample DNA.
  • isolate fragments that are about 800 bp long.
  • sequence 250 bp from each end of the fragments using Illumina.
  • now the sequence of both ends is known, and we know the ends are about 800 bp apart. The sequence in between is unknown.
  • the fragments can be mapped to the genome.
10
Q

when is paired end seq particularly useful?

A

sequencing of fragments that contain short repeat regions, because paired end reads have relatively small inserts.

11
Q

What is the difference between paired-end fragments and mate-pair fragments?

A

mate pair frags have larger inserts (3kb-15kb), paired end has about 800bp.
Mate pair enables coverage of regions with large structural rearrangements.

12
Q

Examples of 2 long-read sequencing techs

A

PacBio and Oxford Nanopore

13
Q

How can long-read sequencing techs be used?

A

Long-read technologies produce very long reads, but with high error rates.
So: make an initial assembly from the long reads, then use additional Illumina short-read sequencing for error correction.
Newer genome assemblers can combine PacBio and Illumina data directly.

14
Q

What is Bionano

A

optical mapping system

15
Q

How does optical mapping work?

A
  • non-destructive restriction map (no fragmenting)
  • a single-stranded nick is introduced into the double-stranded DNA at each occurrence of a seven-base recognition site (a 7 bp nicking enzyme is used). In random sequence such a site occurs on average once every 4^7 = 16,384 bases.
  • the nick is labelled with a fluorescent marker and then ligated, giving visible signals distributed along the genome
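The 4^7 spacing and the idea of a site map can be sketched in silico. Assumptions: all four bases are equally frequent, only one strand is searched, and the site `GCTCTTC` is used purely as an illustrative 7-base motif. The function names are hypothetical.

```python
def expected_spacing(site_length: int) -> int:
    """Average distance between occurrences of one site in random sequence."""
    return 4 ** site_length


def site_positions(sequence: str, site: str) -> list:
    """0-based start positions of every (possibly overlapping) occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


print(expected_spacing(7))  # 16384

toy_sequence = "AAGCTCTTCAAAAGCTCTTCTT"
labels = site_positions(toy_sequence, "GCTCTTC")
print(labels)  # [2, 13]
# The pattern of distances between labels is what gets compared with a reference map.
print([b - a for a, b in zip(labels, labels[1:])])  # [11]
```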
16
Q

How is optical mapping used?

A

By visualizing large-scale segments of the genome, and comparing with a reference, it is possible to detect translocations, repeats and deletions.
In conjunction with sequencing of fragments it is possible to apply the map to assemble a complete genome sequence.

17
Q

Bionano Irys system

A

used to check contig placement.

can detect large structural misassemblies

18
Q

what is a scaffold?

A

ordered stretch of contigs

contains sequence gaps NNNNN
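A scaffold can be pictured as ordered contigs joined by runs of N. The sketch below assumes the contig order and gap sizes are already known (e.g. from one of the scaffolding approaches); `build_scaffold` and the placeholder gap of 100 N for unknown sizes are hypothetical choices.

```python
def build_scaffold(contigs, gap_sizes, unknown_gap=100):
    """Join ordered contigs, filling each gap with N.

    gap_sizes[i] is the estimated gap between contigs[i] and contigs[i + 1];
    None means the size is unknown and unknown_gap Ns are used instead.
    """
    if not contigs:
        return ""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gap_sizes):
        parts.append("N" * (unknown_gap if gap is None else gap))
        parts.append(contig)
    return "".join(parts)


print(build_scaffold(["ACGT", "GGCC", "TTAA"], [5, 2]))  # ACGTNNNNNGGCCNNTTAA
```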

19
Q

What approaches are used to build scaffolds?

A

using approaches such as optical maps
chromosome contact maps
10X genomics

20
Q

which machine does 10X genomics?

A

10X genomics chromium

21
Q

How does 10x genomics work?

A
  • partitions large DNA molecules (50-100 kb) into small droplets. Each droplet is assigned a unique barcode.
  • the barcoded molecules are sequenced using Illumina, and each read can be assigned unambiguously to its large source DNA molecule via the barcode.
  • works with a tiny amount of DNA (1 ng), so good for single-cell analysis.
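The barcode step amounts to grouping short reads by their droplet barcode. A minimal sketch, assuming reads arrive as hypothetical `(barcode, sequence)` pairs; real linked-read file formats and barcode handling are not modelled here.

```python
from collections import defaultdict


def group_by_barcode(reads):
    """Collect read sequences under the barcode of the droplet they came from."""
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


# Made-up barcodes and reads for illustration only.
reads = [
    ("BC01", "ACGTAC"),
    ("BC02", "TTGACA"),
    ("BC01", "GGCATT"),
]
for barcode, sequences in group_by_barcode(reads).items():
    print(barcode, sequences)
```

Reads sharing a barcode are then treated as coming from the same long molecule, which gives long-range information from short reads.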
22
Q

2 approaches to whole genome seq projects

A
  1. Hierarchical approach - whole genome fragmented and cloned into BACs. The order of the fragments is established before sequencing.
  2. Whole-genome shotgun method - large numbers of smaller fragments. Harder assembly.
23
Q

how much DNA can a BAC carry

A

100-200kbp

can store foreign DNA 10x larger than normal plasmids

24
Q

BAC to BAC

A

- DNA is cut into fragments of about 150 kb, which are cloned into BACs.
- Enough clones are made to give good coverage of the genome.
- BAC clones are ordered by fingerprinting, e.g. based on overlapping restriction fragment size patterns.
- Individual BAC clones are sequenced, and the sequence of each BAC clone is assembled independently.
- Since the order of the clones is known, the order of their sequences can be inferred.

25
Q

advantage of BAC to BAC

A
  1. Dramatically reduces the size and complexity of the genome piece that needs to be considered.
  2. The assembly algorithm only has to deal with a fragment of about 150 kb.
  3. The repeat problem is reduced: even if a highly repetitive element occurs across the genome, it is unlikely to appear more than once in the same BAC clone.
26
Q

How does the whole-genome shotgun approach differ from BAC to BAC?

A

Different to BAC to BAC - order of genome fragments is not determined prior to sequencing.
much higher computing power needed
bypass lengthy fingerprinting of thousands of BAC clones

27
Q

Which is used more now? whole genome shotgun approach or BAC to BAC

A

Shotgun now due to novel seq techniques and scaffolding approaches

usually combines different technologies for seq and also for scaffolding and error correction.

28
Q

How much coverage is needed for a whole-genome shotgun approach in order to receive a decent quality?

A

Lander and Waterman:

c = NL/G
  = (number of reads × read length) / genome length

accepted coverage = 100
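A sketch of the formula in use. The uncovered-fraction estimate e^(-c) comes from the Poisson approximation usually associated with the Lander-Waterman model; it is added here as an assumption and, like the formula itself, ignores repeats. Function names and example numbers are hypothetical.

```python
import math


def lw_coverage(n_reads: int, read_length: int, genome_length: int) -> float:
    """c = N * L / G."""
    return n_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Rearranged formula: N = c * G / L, rounded up to whole reads."""
    return math.ceil(target_coverage * genome_length / read_length)


def fraction_uncovered(c: float) -> float:
    """Expected fraction of bases hit by no read under a Poisson model."""
    return math.exp(-c)


# Hypothetical 5 Mb genome sequenced with 250 bp reads to 100x.
n = reads_needed(100, 250, 5_000_000)
print(n)                                 # 2000000
print(lw_coverage(n, 250, 5_000_000))    # 100.0
print(fraction_uncovered(100))
```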

29
Q

What has seriously confounded whole-shotgun genome assemblies

A

presence of repetitive elements

not considered in the formulas for coverage

30
Q

barley genome

Size and how was it sequenced?

A

genome released in 2017
4.8Gb (largest sequenced at high quality)
Hierarchical BAC to BAC approach used
87,075 BAC clones were sequenced, mainly using Illumina paired-end and mate-pair technology. Each BAC clone was assembled individually.
Scaffolding done using optical map and chromosome conformation capture sequencing (Hi-C).

31
Q

How to measure genome quality?

A
  1. length of the total assembly should be close to the estimated genome size
    Barley - 4.79Gb compared to 4.8Gb.
  2. low number of contigs; contig N50 should be high
  3. scaffold N50 should be high
32
Q

what is N50?

A

50% of the genome is formed of contigs or scaffolds of this size or larger.
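A minimal sketch of an N50 calculation. It uses the total assembly length as the denominator; using the estimated genome size instead is a variant this sketch does not cover. The contig lengths are made up.

```python
def n50(lengths):
    """Smallest length L such that pieces of length >= L hold at least half the total."""
    if not lengths:
        raise ValueError("need at least one contig or scaffold length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Total is 300; the two largest pieces (80 + 70) reach half of it.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```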

33
Q

Annotated coding sequence (in assembly annotation statistics)

A

if it is low, makes whole genome approach challenging

34
Q

Benchmarks for reference genomes, decided by the Vertebrate Genomes Project (2018)

A

N50 size contig: >1 Mb

N50 size scaffold: >10 Mb

Sequence error frequency: < 1 in 10,000 bases (phred score = 40)

Structural variants should be confirmed by multiple technologies

90% of the sequence should be assigned to chromosomes.
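The link between the error frequency benchmark and the Phred score follows from the standard definition Q = -10 × log10(P), where P is the error probability. A small sketch with hypothetical function names:

```python
import math


def phred_from_error(error_probability: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(error_probability)


def error_from_phred(q: float) -> float:
    """P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# 1 error in 10,000 bases corresponds to Phred 40.
print(round(phred_from_error(1 / 10_000)))  # 40
print(error_from_phred(40))
```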