Week 10 (Genome Assembly) Flashcards
all assembly relies on what simple assumption?
that highly similar DNA fragments originate from the same position within a genome
all ______ approaches rely on the simple assumption that highly similar DNA fragments originate from the same position within a genome
assembly
genome assembly assumes that highly similar DNA fragments originate from the same position within a genome. why is this assumption a problem for repetitive sequences in the genome? how do we resolve this?
repetitive sequences pose a challenge for genome assembly because they appear at multiple locations in the genome, so highly similar fragments can come from different positions. we can resolve this by using reads long enough to span the repeat and include the unique sequences flanking it.
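A minimal sketch of why read length matters for repeats. The toy genome, the repeat, and both reads are made up for illustration: a read that lies entirely inside the repeat matches two places, while a longer read that includes unique flanking sequence matches only one.

```python
def find_all(genome, read):
    """Return every start position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions

# toy genome (hypothetical): the same repeat occurs twice, between unique flanks
REPEAT = "ACGTACGT"
genome = "TTGCA" + REPEAT + "GGATC" + REPEAT + "CCTAG"

short_read = REPEAT                  # lies entirely inside the repeat
long_read = "GCA" + REPEAT + "GGA"   # spans the repeat plus unique flanks

short_hits = find_all(genome, short_read)
long_hits = find_all(genome, long_read)

print("short read placements:", short_hits)  # ambiguous: more than one position
print("long read placements:", long_hits)    # unique: a single position
```

Real aligners tolerate mismatches and indels; exact matching is used here only to keep the idea visible.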
what are 3 critical properties of sequencing reads?
- length
- accuracy
- evenness
compared with long-read sequencing, ______-read genome sequencing is no longer as relevant for genome assembly
short
contig
set of sequence reads that overlap to form a contiguous stretch of DNA sequence
a _______ is a set of sequence reads that overlap to form a contiguous stretch of DNA sequence
contig
a _______ number of contigs is better (fewer contigs = bigger contigs)
lower
N50
shortest contig length N such that 50% of the assembled bases are contained in contigs of length N or longer
_____ is the shortest contig length N such that 50% of the assembled bases are contained in contigs of length N or longer
N50
for N50, is higher or lower better?
higher
L50
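smallest number of contigs whose lengths sum to at least 50% of the total assembly length (the number of contigs of length N50 or longer)
_____ is the smallest number of contigs whose lengths sum to at least 50% of the total assembly length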
L50
for L50, is higher or lower better?
lower
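A short sketch of how N50 and L50 can be computed from a list of contig lengths, using the standard definitions (sort longest first, accumulate until half of the total bases is reached). The function name and the example lengths are made up for illustration.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)  # longest contig first
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # 'length' is the shortest contig needed to reach 50% of the bases,
            # 'count' is how many contigs it took to get there
            return length, count
    raise ValueError("no contigs given")

# hypothetical assembly: 7 contigs, 300 bases in total
example = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(example)
print("N50 =", n50, "L50 =", l50)  # N50 = 70 L50 = 2
```

Here the two longest contigs (80 + 70 = 150) already hold half of the 300 bases, so N50 is 70 and L50 is 2. A more contiguous assembly pushes N50 up and L50 down.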
De Bruijn graph
assembly method that uses smaller sub-sequences (k-mers) of sequence reads to find overlaps and build a graph
__________ ________ is an assembly method that uses smaller sub-sequences (k-mers) of sequence reads to find overlaps and build a graph
De Bruijn graph
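A toy sketch of the De Bruijn idea, assuming error-free reads and a genome without repeats or branches. Each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The reads, the value of k, and the function names are illustrative only; real assemblers also handle errors, reverse complements, and branching.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix -> suffix
    return dict(graph)

def walk_unbranched(graph):
    """Spell out the sequence along a simple linear path (toy case only)."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if not starts:
        raise ValueError("no start node (the toy walk needs a linear path)")
    node = starts[0]
    sequence = node
    seen = {node}
    while node in graph and len(graph[node]) == 1:
        node = next(iter(graph[node]))
        if node in seen:  # stop instead of looping forever on a cycle
            break
        seen.add(node)
        sequence += node[-1]
    return sequence

reads = ["ATGGC", "GGCGT"]  # two overlapping, made-up reads
graph = de_bruijn(reads, k=3)
print(graph)
print(walk_unbranched(graph))  # ATGGCGT
```

The two reads share the k-mer GGC, so they meet in the graph without any read-to-read comparison.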
OLC assembly method
Overlap Layout Consensus
what does each letter mean in OLC assembly method?
- Overlap: find all pairwise overlaps between all reads
- Layout: use those overlaps to determine how the reads should be put together
- Consensus: produce a consensus based on the layout and overlap of reads
what does the O mean in OLC assembly method?
Overlap: find all pairwise overlaps between all reads
what does the L mean in OLC assembly method?
Layout: use those overlaps to determine how the reads should be put together
what does the C mean in OLC assembly method?
Consensus: produce a consensus based on the layout and overlap of reads
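The three OLC steps can be sketched on error-free toy reads. This is a simplified greedy version, not how a production assembler does layout. Overlap is an exact suffix-prefix match, layout repeatedly merges the best-overlapping pair, and consensus is trivial because the made-up reads contain no errors.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Overlap step: length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Layout + consensus (toy): keep merging the pair with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):  # all ordered pairs of reads
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # nothing overlaps any more; leave separate contigs
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])  # merged, error-free "consensus"
    return reads

reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]  # hypothetical reads
contigs = greedy_assemble(reads)
print(contigs)  # ['ATGGCGTACCTTA']
```

With real reads the consensus step matters: many noisy reads cover each position, and the most supported base at each column is chosen.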
T2T assembly recipe
- error correction of accurate long reads
- assembly graph construction
- graph simplification with ultra long reads
- phasing and scaffolding
______ _______ is used to establish parent of origin
k-mer counting
what makes full siblings genetically different from each other?
crossing over / recombination during meiosis
on average there are ___ million differences in DNA from person to person
3
if we have the DNA of the parents, how can we distinguish which chromosomes came from mom and which from dad?
check for variants in the individual, compare them to the parents, and establish parent of origin using k-mer counting
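A rough sketch of parent-of-origin assignment by k-mer counting, under strong simplifying assumptions: error-free made-up sequences, a single difference between the parents, and hypothetical function names. K-mers seen in only one parent act as markers, and each read from the child is assigned to the parent whose markers it carries.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across a set of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(reads, k, min_count=2):
    """Keep k-mers seen at least min_count times (rare k-mers are often errors)."""
    return {kmer for kmer, n in count_kmers(reads, k).items() if n >= min_count}

def assign_read(read, mom_only, dad_only, k):
    """Label a child's read by which parent-specific k-mers it contains."""
    read_kmers = set(count_kmers([read], k))
    mom_hits = len(read_kmers & mom_only)
    dad_hits = len(read_kmers & dad_only)
    if mom_hits > dad_hits:
        return "maternal"
    if dad_hits > mom_hits:
        return "paternal"
    return "unassigned"

K = 5
# hypothetical parental reads: the two parents differ at one base (A vs T)
mom_reads = ["ACGTTGCAAGGCT"] * 2
dad_reads = ["ACGTTGCATGGCT"] * 2
mom_kmers = solid_kmers(mom_reads, K)
dad_kmers = solid_kmers(dad_reads, K)
mom_only = mom_kmers - dad_kmers  # k-mers found only in mom
dad_only = dad_kmers - mom_kmers  # k-mers found only in dad

for read in ["TTGCAAGGC", "TTGCATGGC", "ACGTTGC"]:
    print(read, "->", assign_read(read, mom_only, dad_only, K))
```

The third read covers only sequence shared by both parents, so it carries no marker and stays unassigned.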
bubble
polymorphisms between haplotypes
in an assembly graph, a ______ represents polymorphisms between haplotypes
bubble
tangle
region with complicated haplotypes that cannot be resolved
a ______ is a region with complicated haplotypes that cannot be resolved
tangle
which structural features must be present for a chromosome to be considered T2T?
centromere and telomeres on both ends
BUSCO
benchmarking universal single-copy orthologs
how does BUSCO work?
it is a way to evaluate how well we did at assembling the genome. a set of 3023 single-copy genes is expected to be present in all vertebrates, so this method searches your assembly for those ~3000 genes to see how many you assembled and whether they were assembled correctly
What are our options for the types of reads to use for k-mer counting? Which do you think would produce the most accurate estimate of assembly base quality?
- raw reads, processed reads, error-corrected reads, short reads, long reads
- for the most accurate estimate of assembly base quality, highly accurate reads are recommended (accurate short reads or error-corrected long reads), because sequencing errors in the reads introduce false k-mers
what is BUSCO used for?
a valuable tool for evaluating the completeness of an assembly
it is now common practice to use ______ to estimate the base accuracy of contig sequences, often measured on the Phred scale as a quality value (QV)
k-mers
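A small sketch of one common way to turn k-mer counts into a Phred-scaled QV. It assumes the Merqury-style estimate, in which k-mers found in the assembly but never in the reads are treated as assembly errors. The function name and the example numbers are made up.

```python
import math

def kmer_qv(total_assembly_kmers, assembly_only_kmers, k):
    """Estimate a Phred-scaled quality value from k-mer counts.

    total_assembly_kmers: number of k-mers in the assembly
    assembly_only_kmers:  k-mers in the assembly that never appear in the reads
    """
    shared = total_assembly_kmers - assembly_only_kmers
    # probability that a single base is correct, given the fraction of
    # k-mers (each k bases long) that are supported by the reads
    p_base_correct = (shared / total_assembly_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    if error_rate == 0:
        return math.inf  # no unsupported k-mers at all
    return -10 * math.log10(error_rate)

# hypothetical numbers: 1,000,000 assembly k-mers, 210 of them unsupported, k = 21
qv = kmer_qv(1_000_000, 210, 21)
print(round(qv, 1))
```

On the Phred scale, QV 30 means about 1 error per 1,000 bases and QV 50 about 1 error per 100,000 bases. The example above comes out close to QV 50.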
the count of heterozygous k-mers (30x coverage) is _____ the count of homozygous k-mers (60x coverage)
half
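A toy illustration of the half-coverage relationship, assuming an idealized case in which each haplotype is sequenced exactly 30 times with no errors. The haplotype sequences are made up and differ at one base. K-mers shared by both haplotypes (homozygous) are counted from both copies, while k-mers covering the difference (heterozygous) come from only one copy.

```python
from collections import Counter

K = 5
DEPTH_PER_HAPLOTYPE = 30  # idealized: every haplotype sequenced exactly 30 times

hap1 = "ACGTTGCAAGGCT"  # hypothetical haplotypes differing at one base (A vs T)
hap2 = "ACGTTGCATGGCT"
reads = [hap1] * DEPTH_PER_HAPLOTYPE + [hap2] * DEPTH_PER_HAPLOTYPE

counts = Counter()
for read in reads:
    for i in range(len(read) - K + 1):
        counts[read[i:i + K]] += 1

print("homozygous k-mer ACGTT:", counts["ACGTT"])    # seen on both haplotypes: 60
print("heterozygous k-mer TGCAA:", counts["TGCAA"])  # seen on one haplotype: 30
```

In a real k-mer histogram this shows up as two peaks, with the heterozygous peak at about half the coverage of the homozygous peak.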
what are the three A’s?
- assemblies
- alignments
- annotations
describe the process that uses three A’s
first you make haploid genome ASSEMBLIES, then you ALIGN them to find variation, then you ANNOTATE them (meaning that you find where the genes and regulatory elements are)
on average, structural variants occur at a lower number of sites but affect a _______ percentage of the diploid genome
higher