Week 10 (Genome Assembly) Flashcards

1
Q

all assembly relies on what simple assumption?

A

that highly similar DNA fragments originate from the same position within a genome

2
Q

all ______ approaches rely on the simple assumption that highly similar DNA fragments originate from the same position within a genome

A

assembly

3
Q

during genome assembly, it is important that highly similar DNA fragments originate from the same position within a genome. How does this apply to repetitive sequences in the genome? How do we resolve this?

A

repetitive sequences pose a challenge for genome assembly because they appear at multiple locations in the genome, so highly similar fragments may not come from the same position. we can resolve this by using longer sequencing reads that span the whole repeat and include the unique sequences flanking it.
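
The effect of read length on repeats can be illustrated with a toy example. This is only a sketch: the genome string, the reads, and the helper `find_all` are made up for illustration, and real aligners tolerate mismatches rather than requiring exact matches.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions

# Hypothetical toy genome: the repeat "ACGACG" occurs twice,
# each copy flanked by different unique sequence.
genome = "TTGA" + "ACGACG" + "CCTAG" + "ACGACG" + "GGAT"

short_read = "ACGACG"                 # lies entirely inside the repeat
long_read = "GA" + "ACGACG" + "CC"    # spans the repeat plus unique flanks

short_hits = find_all(genome, short_read)  # more than one placement: ambiguous
long_hits = find_all(genome, long_read)    # a single placement: resolved

print("short read placements:", short_hits)
print("long read placements:", long_hits)
```

The short read fits both repeat copies equally well, while the longer read is anchored by the unique bases on either side.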

4
Q

what are 3 critical properties of sequencing reads?

A
  1. length
  2. accuracy
  3. evenness
5
Q

compared with long-read sequencing, ______-read genome sequencing is no longer relevant

A

short

6
Q

contig

A

set of sequence reads that overlap to form a contiguous stretch of DNA sequence

7
Q

a _______ is a set of sequence reads that overlap to form a contiguous stretch of DNA sequence

A

contig

8
Q

a _______ number of contigs is better (fewer contigs = bigger contigs)

A

lower

9
Q

N50

A

shortest contig length N such that 50% of the assembled bases are contained in contigs of length N or longer
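
A minimal sketch of the N50 calculation, assuming the usual procedure: sort contigs from longest to shortest and stop once the running total reaches half of the total assembly length. The function name `n50` and the contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total (longest first)
    first reaches at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")

# Hypothetical assembly: 7 contigs, 300 bases in total.
lengths = [80, 70, 50, 40, 30, 20, 10]
result = n50(lengths)
print("N50 =", result)
```

Here 80 + 70 = 150 bases is exactly half of 300, so the second-longest contig sets the N50.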

10
Q

_____ is the shortest contig length N such that 50% of the assembled bases are contained in contigs of length N or longer

A

N50

11
Q

for N50, is higher or lower better?

A

higher

12
Q

L50

A

smallest number of contigs whose lengths sum to at least 50% of the total assembly length (the last of these contigs has length N50)
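
A minimal sketch of the L50 calculation, assuming the usual procedure: count contigs from longest to shortest until their combined length reaches half of the assembly. The function name `l50` and the contig lengths are made up for illustration.

```python
def l50(contig_lengths):
    """Smallest number of contigs (taken longest first) whose lengths
    sum to at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return count
    raise ValueError("contig_lengths must not be empty")

# Two hypothetical assemblies of the same 300-base genome.
fragmented = [80, 70, 50, 40, 30, 20, 10]
contiguous = [200, 60, 40]

print("L50 fragmented:", l50(fragmented))
print("L50 contiguous:", l50(contiguous))
```

The more contiguous assembly needs fewer contigs to cover half of the bases, which is why a lower L50 is better.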

13
Q

_____ is the smallest number of contigs whose lengths sum to at least 50% of the total assembly length

A

L50

14
Q

for L50, is higher or lower better?

A

lower

15
Q

De Bruijn graph

A

assembly method that uses smaller sub-sequences (k-mers) of sequence reads to find overlaps and build a graph
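
A minimal sketch of how such a graph can be built, assuming the common convention that each k-mer becomes an edge from its first k-1 bases to its last k-1 bases. The reads, the value of k, and the function name `de_bruijn` are made up for illustration. Real assemblers also track k-mer counts and handle sequencing errors and reverse complements.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge: prefix of the k-mer -> suffix of the k-mer
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Two hypothetical overlapping reads from the sequence "ACGTCA".
reads = ["ACGTC", "GTCA"]
graph = de_bruijn(reads, k=3)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The overlap between the two reads is never computed directly. It shows up because both reads contribute the same k-mer "GTC", so their paths share an edge in the graph.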

16
Q

__________ ________ is an assembly method that uses smaller sub-sequences (k-mers) of sequence reads to find overlaps and build a graph

A

De Bruijn graph

17
Q

OLC assembly method

A

Overlap Layout Consensus

18
Q

what does each letter mean in OLC assembly method?

A
  • Overlap: find all pairwise overlaps between all reads
  • Layout: use those overlaps to determine how the reads should be put together
  • Consensus: produce a consensus based on the layout and overlap of reads
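
The three steps can be sketched with a toy greedy assembler. This is an illustration under strong assumptions, not how production OLC assemblers work: the reads are error-free, overlaps are exact suffix-prefix matches, and the "consensus" is simply the merged string. The function names and reads are made up.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b`
    (0 if no overlap of at least `min_len` exists)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    reads = list(reads)
    while len(reads) > 1:
        # Overlap: score every ordered pair of distinct reads.
        best_n, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            n = overlap(a, b, min_len)
            if n > best_n:
                best_n, best_a, best_b = n, a, b
        if best_n == 0:
            break  # nothing left to join
        # Layout: place the best-overlapping pair next to each other.
        reads.remove(best_a)
        reads.remove(best_b)
        # Consensus (trivial here): merge, keeping the shared bases once.
        reads.append(best_a + best_b[best_n:])
    return reads

reads = ["ATGCGT", "GCGTAC", "TACGGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

With noisy long reads, the overlap step uses approximate alignment and the consensus step votes across all reads covering each position.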

19
Q

what does the O mean in OLC assembly method?

A

Overlap: find all pairwise overlaps between all reads

20
Q

what does the L mean in OLC assembly method?

A

Layout: use those overlaps to determine how the reads should be put together

21
Q

what does the C mean in OLC assembly method?

A

Consensus: produce a consensus based on the layout and overlap of reads

22
Q

T2T assembly recipe

A
  1. error correction of accurate long reads
  2. assembly graph construction
  3. graph simplification with ultra long reads
  4. phasing and scaffolding
23
Q

______ _______ is used to establish parent of origin

A

k-mer counting

24
Q

what makes full siblings genetically different from each other?

A

crossing over / recombination during meiosis

25
Q

on average there are ___ million differences in DNA from person to person

26
Q

if we have the DNA of the parents, how can we distinguish what chromosomes came from mom and which from dad?

A

check for variants in the individual, compare them to the parents, and establish parent of origin using k-mer counting
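
A minimal sketch of the idea, assuming parent-specific k-mers (k-mers seen in one parent but not the other) are used to label the child's reads. The sequences, the value of k, and the function names are made up for illustration. Real pipelines count k-mers across whole read sets and filter out low-count k-mers that are likely errors.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def assign_parent(read, mom_only, dad_only, k):
    """Label a read by which parent-specific k-mers it carries more of."""
    read_kmers = kmers(read, k)
    mom_hits = len(read_kmers & mom_only)
    dad_hits = len(read_kmers & dad_only)
    if mom_hits > dad_hits:
        return "maternal"
    if dad_hits > mom_hits:
        return "paternal"
    return "unassigned"

k = 4
# Hypothetical parental sequences that differ at one base (G vs C).
mom_seq = "ACGTACGGTCA"
dad_seq = "ACGTACCGTCA"

mom_only = kmers(mom_seq, k) - kmers(dad_seq, k)
dad_only = kmers(dad_seq, k) - kmers(mom_seq, k)

labels = [
    assign_parent("TACGGTC", mom_only, dad_only, k),  # carries mom's variant
    assign_parent("TACCGTC", mom_only, dad_only, k),  # carries dad's variant
    assign_parent("ACGTA", mom_only, dad_only, k),    # only shared sequence
]
print(labels)
```

Reads that cover only sequence shared by both parents stay unassigned, which is why longer reads are easier to phase.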

27
Q

bubble

A

polymorphisms between haplotypes

28
Q

a ______ is used to show polymorphisms between haplotypes

A

bubble

29
Q

tangle

A

region with complicated haplotypes that cannot be resolved

30
Q

a ______ is a region with complicated haplotypes that cannot be resolved

A

tangle

31
Q

which structural features must be present for a chromosome to be considered T2T?

A

centromere and telomeres on both ends

32
Q

BUSCO

A

benchmarking universal single-copy orthologs

33
Q

how is BUSCO made?

A

a way to evaluate how well we assembled the genome. There is a set of 3023 genes expected as single copies in all vertebrates, so this method lets you compare those ~3000 genes against your assembly to see how many were assembled and whether they were assembled correctly

34
Q

What are our options for the types of reads to use for k-mer counting? Which do you think would produce the most accurate estimate of assembly base quality?

A
  • raw reads, processed reads, error-corrected reads, short reads, long reads
  • For the most accurate estimate of assembly base quality, using long, error-corrected reads, ideally paired-end or mate-pair reads, is recommended.
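
Base quality on the Phred scale is QV = -10 * log10(error rate). A k-mer-based estimate can be sketched as below. The estimator shown (assembly-only k-mers converted to a per-base error rate) is an assumption modelled on how Merqury-style tools are commonly described, and the function names and k-mer counts are made up for illustration.

```python
import math

def phred(error_rate):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)

def kmer_qv(asm_only_kmers, total_asm_kmers, k):
    """Assumed estimator: k-mers found in the assembly but absent from the
    reads are treated as errors; a k-mer is correct only if all k of its
    bases are correct."""
    if asm_only_kmers == 0:
        return math.inf  # no evidence of any error
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    return phred(1 - p_base_correct)

# One error per 1000 bases corresponds to roughly QV 30.
print(round(phred(0.001), 2))

# Hypothetical counts: fewer assembly-only k-mers give a higher QV.
print(round(kmer_qv(100, 1_000_000, 21), 1))
print(round(kmer_qv(10, 1_000_000, 21), 1))
```

The estimate is only as good as the read set it is checked against: errors in the reads make correct assembly k-mers look unsupported, or make wrong ones look supported.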
35
Q

what is BUSCO used for?

A

a valuable tool for evaluating the completeness of an assembly

36
Q

it is now common practice to use ______ to estimate the base accuracy of contig sequences, often measured on the Phred scale as a quality value

A

k-mers

37
Q

the k-mer count for heterozygous sequence (30x coverage) is _____ of the k-mer count for homozygous sequence (60x coverage)

A

half

38
Q

what are the three A’s?

A
  • assemblies
  • alignments
  • annotations
39
Q

describe the process that uses three A’s

A

first you make haploid genome ASSEMBLIES, then you ALIGN them to find variation, then you ANNOTATE them (meaning that you find where the genes and regulatory elements are)

40
Q

on average, structural variants occur at fewer sites, but they account for a _______ percentage of the diploid genome