Genome Sequencing Flashcards

Question

Describe hierarchical shotgun sequencing after original BAC cloning (3 steps)

Answer 1

A BAC contains about 300 kb of DNA, which is still too large to sequence directly. The goal is to break each BAC into smaller pieces that are easier to sequence. 1. Shear BACs by sonication (random, overlapping fragments). 2. Clone the fragments into phagemids (1 kb) or plasmids (2-10 kb) and transform into E. coli ("shotgun library"). 3. Sequence library clones and assemble the genome.

Answer 2

1. DNA extraction 2. DNA fragmentation (sonication) 3. Clone into vectors, transform bacteria for replication, purify vectors 4. Sequence library clones and assemble genome

Answer 3

HSS: Easier to assemble the genome sequence, but a physical map has to be built first (labor intensive). WGSS: Bypasses the physical map (mapping where the BACs are and any overlap), but assembly of the genome is more difficult, especially for more complex genomes (like the human genome).

Answer 4

Each time a new sequence was found, it was put into the NCBI database. Venter used this public information to help him assemble the entire genome. This shows how profit-driven Celera was.

Answer 5

The average number of times each nucleotide in the genome is sequenced (nucleotides are resequenced often because reads overlap)

Answer 6

C (coverage) = LN/G. L: sequence read length in bp (number of bases read per sequencing reaction). N: number of reads sequenced (aka number of clones). G: haploid genome length in bp.
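
The coverage formula can be checked with a few lines of Python. This is only a sketch: the function names and the example values (1,000 bp reads, a 4 Mb haploid genome) are chosen for illustration.

```python
def coverage(read_length_bp, num_reads, genome_length_bp):
    """Coverage C = L * N / G."""
    return read_length_bp * num_reads / genome_length_bp


def reads_needed(target_coverage, genome_length_bp, read_length_bp):
    """Same formula solved for the number of reads: N = C * G / L."""
    return target_coverage * genome_length_bp / read_length_bp


# Illustrative values only: 1,000 bp reads and a 4 Mb haploid genome.
print(coverage(1000, 4000, 4_000_000))    # coverage from 4,000 reads
print(reads_needed(1, 4_000_000, 1000))   # reads needed for 1X coverage
```

Doubling the number of reads doubles the coverage, which is why the formula is usually rearranged to N = CG/L when planning a sequencing run.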

Answer 7

Sequencing reads will be randomly distributed in the genome (i.e. the ability to sequence a particular region of the genome does not differ)

Answer 8

1X= 5 Mb 2X = 10 Mb

Answer 9

Paired reads/mate pairs

Answer 10

Used when sequencing inserts in vectors because we already know the sequence of the vector

Answer 11

Aligning sequences and better coverage of the genome - More overlap

Answer 12

N= CG/L= (1)(4x10^6)/1000 = 4000 clones

Answer 13

Poisson distribution

Answer 14

P(y)= (λ^y ⋅ e^-λ)/y! - y= number of events in a given interval (number of times a nucleotide is sequenced) - λ= mean number of events in a given interval (genome coverage) - P= probability that a certain nucleotide will be sequenced a certain number of times
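
A small Python sketch of the Poisson formula above, using only the standard library. The function name `poisson_pmf` and the coverage value of 2 are assumptions made for the example.

```python
import math


def poisson_pmf(y, lam):
    """P(y) = (lam**y * e**-lam) / y! for a mean of lam events."""
    return (lam ** y) * math.exp(-lam) / math.factorial(y)


# Probability that a nucleotide is sequenced 0, 1, 2 or 3 times at 2X coverage.
for y in range(4):
    print(y, round(poisson_pmf(y, 2.0), 4))
```

The probabilities over all values of y sum to 1, so the table shows how a fixed mean coverage is spread across bases that are sequenced more or less often than average.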

Answer 15

A given base is NOT sequenced (it is covered zero times)

Answer 16

P(0)=e^-λ
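
Because P(0) depends only on coverage, it can be used in both directions: to estimate how much of a genome is missed at a given coverage, or how much coverage is needed to miss no more than a chosen fraction. The sketch below assumes the Poisson model holds exactly; the function names are illustrative.

```python
import math


def fraction_unsequenced(coverage):
    """P(0) = e^-lambda, with lambda equal to the genome coverage."""
    return math.exp(-coverage)


def coverage_for_missing_fraction(p0):
    """Invert P(0) = e^-lambda to get lambda = -ln(P(0))."""
    return -math.log(p0)


def expected_unsequenced_bases(genome_length_bp, coverage):
    """Expected number of bases covered zero times."""
    return genome_length_bp * fraction_unsequenced(coverage)


print(round(fraction_unsequenced(1), 3))               # fraction missed at 1X
print(round(coverage_for_missing_fraction(0.01), 2))   # coverage to miss only 1%
```

Each extra unit of coverage shrinks the unsequenced fraction by a factor of e, so the last few percent of a genome are the most expensive to reach.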

Answer 17

10,000-1600=8400 base pairs

Answer 18

DNA fragments with overlapping sequences must be adjacent to one another. Overlapping reads are merged until they are assembled into contigs (contiguous sequences). The contigs are then assembled into a scaffold/supercontig (an ordered set of contigs, usually derived from mate pairs, that approaches the complete sequence of a chromosome).
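
The overlap idea can be illustrated with a toy greedy assembler. This is a simplified sketch, not how production assemblers work: it assumes error-free reads from one strand, uses a hypothetical minimum overlap of 3 bases, and ignores repeats and mate pairs.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs


# Toy reads, made up for the example.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

Reads that share no overlap are left as separate contigs, which mirrors the gaps that scaffolding with mate pairs is meant to bridge.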

Answer 19

>24 mer (nucleotides)

Answer 20

94% - Accounts for any spontaneous mutations, or any mistakes made by the polymerase during sequencing.

Answer 21

Repetitive sequences make it impossible to distinguish reads from two or more distinct places in the genome - More of an issue in eukaryotes - Assembly of reads will only detect one region instead of both regions, resulting in a repeat collapse

Answer 22

True. The coding strand is shown using arrows.

Answer 23

False, usually shows base pairs of only one strand

Answer 24

Take the reverse strand (3' to 5'), reverse it (5' to 3'), then take the complement.
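
In code, the reverse-and-complement operation is a one-liner. The sketch below assumes plain A/C/G/T sequences (no ambiguity codes); the function name is illustrative.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the order of the bases."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCCG"))
```

Applying the function twice returns the original sequence, which is a quick sanity check when converting between strands.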

Answer 25

1. Ab initio (intrinsic) approach 2. Extrinsic (evidence-based) approach

Answer 26

The genomic DNA sequence alone is systematically searched for protein-coding genes (looking at the raw sequence and looking for signatures of genes in the raw data like promoter sequences for example)

Answer 27

The target genome is compared to other genomes to look for similarity to known mRNA and protein sequences in databases (NCBI, EMBL)

Answer 28

1. Presence of an open reading frame (start [ATG] and stop codons [TAA, TAG, TGA]) > 300 bp. 2. Presence of CpG islands (60-70% GC content) associated with the 5' end of transcribed genes (indication of a promoter site). 3. Splicing sites. 4. Sequence contains known protein domains (e.g. if you're looking for the gene coding for a transmembrane protein, look for sequence encoding a transmembrane domain).
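
The first two signatures are simple enough to sketch in Python. This is a minimal illustration, assuming a single strand, no introns and the 300 bp cutoff from the card; real gene finders are far more elaborate. The function names and the toy sequence are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_length_bp=300):
    """Return (start, end) index pairs of ORFs longer than min_length_bp.

    Scans the three reading frames of the given strand only.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3  # include the stop codon
                if end - start > min_length_bp:
                    orfs.append((start, end))
                start = None
    return orfs


def gc_content(seq):
    """Fraction of G and C bases, a rough indicator of CpG-rich regions."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
print(find_orfs(toy, min_length_bp=9))
print(gc_content("GGCCAT"))
```

With the default 300 bp cutoff the toy sequence yields no ORFs, which shows how the length threshold filters out short open frames that arise by chance.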

Answer 29

False; The open reading frame is conserved over multiple species (genome alignment)

Answer 30

Preliminary at best (interactions, localization, regulation are unknown)

Answer 31

1. Library construction: Fragment genomic DNA and amplify by PCR, bypassing vector cloning (PCR makes many copies of each fragment, because many copies are needed for the sequencing reaction). 2. Number of parallel reads (the ability to sequence many DNA fragments simultaneously, rather than one at a time): up to 4 billion compared to 96 for Sanger. 3. Read lengths: generally shorter, 100-300 bp compared to >800 bp for Sanger (this can be an issue when assembling the genome, but the much higher coverage compensates). 4. Amount of genomic template: only a few micrograms are needed for second-generation sequencing.

Answer 32

1. Denaturation of dsDNA (1 minute at 94 degrees C) 2. Annealing of forward and reverse primers (forward on bottom strand and reverse on top strand), 45 seconds at 54 degrees C 3. Extension (2 minutes at 72 degrees C, only dNTPs added)

Answer 33

False. Taq polymerase synthesizes both DNA strands simultaneously.

Answer 34

2^x, where x= # of cycles
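
A quick way to see how fast 2^x grows, as a sketch. It assumes ideal doubling every cycle, which real reactions only approximate; the `starting_molecules` parameter is added for illustration.

```python
def pcr_copies(cycles, starting_molecules=1):
    """Ideal PCR yield: every template is doubled in every cycle."""
    return starting_molecules * 2 ** cycles


for cycles in (10, 20, 30):
    print(cycles, pcr_copies(cycles))
```

Thirty cycles turn a single template into roughly a billion copies under this ideal model.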

Answer 35

1. Fragment DNA and ligate adaptors to the ends (adaptors have known sequences, which allows primers to be designed). 2. Select fragments with two different adaptors (if the same adaptor is on both ends, you'll get primer dimers when adding primers, since the primers for the two ends would be complementary). 3. Certain adaptors carry biotin. Add beads coated with streptavidin, since biotin binds streptavidin. At this point, fragments without biotin on either adaptor (same adaptor on both ends) are selected against. 4. Nick the nonbiotinylated strand to get the sstDNA library (nick = breaking one strand of DNA). Fragments with biotin on both adaptors stay looped to the bead and are also selected against.

Answer 36

1. Add more DNA capture beads than DNA templates. 2. Emulsify beads and PCR reagents in oil (a water-in-oil emulsion), so that each aqueous droplet acts as a microreactor. 3. Clonal amplification occurs inside the microreactors.

Answer 37

1. Put beads in wells of picotiter plate (plate with lots of wells, one bead per well). 2. Add sequencing reaction components including adenosine 5'-phosphosulfate (APS), luciferin, luciferase and primers. Basically everything but dNTPs at this point. 3. Flood dNTPs one at a time over the picotiter plate. 4. If the nucleotide is added to the new DNA strand, pyrophosphate is given off, which results in light emission. This is because pyrophosphate reacts with APS to form ATP (catalyzed by ATP sulfurylase). Luciferase then uses the ATP to oxidize luciferin, which produces light. 5. Take an image of the picotiter plate and repeat with the next dNTP.
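
The flood-one-dNTP-at-a-time logic can be sketched as a toy flowgram simulation. It is an idealized model: it assumes the light signal is exactly proportional to the number of bases incorporated in a flow (so a run of identical bases gives one stronger flash), and the flow order "TACG" and the function names are assumptions made for the example.

```python
from itertools import cycle


def simulate_flowgram(new_strand, flow_order="TACG"):
    """Return (nucleotide, signal) for each flow while the strand is built.

    `new_strand` is the strand being synthesized, written 5' to 3'.
    The signal is the number of bases incorporated during that flow.
    """
    if set(new_strand) - set(flow_order):
        raise ValueError("strand contains bases missing from the flow order")
    flows = []
    position = 0
    for nucleotide in cycle(flow_order):
        if position >= len(new_strand):
            break
        run = 0
        # A homopolymer run is incorporated in a single flow.
        while position < len(new_strand) and new_strand[position] == nucleotide:
            run += 1
            position += 1
        flows.append((nucleotide, run))
    return flows


def decode_flowgram(flows):
    """Rebuild the sequence from the per-flow signals."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = simulate_flowgram("TTAGGC")
print(flows)
print(decode_flowgram(flows))
```

Flows with a signal of zero correspond to images in which that well stays dark, because the flooded nucleotide did not match the next template position.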

Answer 38

1. Fragment DNA and add linkers (adaptors) at the ends 2. Denature and bind one end of the ssDNA fragments to surface of flow cell (glass, each glass slide is coated with a lawn of adaptors) 3. Free end of fragments hybridize to other adaptors on the flow cell surface (bridging reaction, when the DNA fragment randomly bends in the vicinity of a surface adaptor) 4. Add PCR components (e.g. dNTPs, Taq polymerase) and carry out PCR in flow cell - flow cell adaptors now act as primers 5. DNA fragments are amplified generating clusters of multiple copies (millions) of the same molecule

Answer 39

1. Initiate sequencing of clusters by adding primers, DNA polymerase and reversible ddNTPs (reversible = each ddNTP stops the reaction temporarily, and the reaction resumes once the block is removed). - Each type of ddNTP is labeled with a different fluorophore. 2. Add all four ddNTPs at once, allow incorporation in the sequencing reaction and image the flow cell. 3. Remove the fluorophore from each incorporated ddNTP, then add new fluorophore-labeled ddNTPs and continue sequencing. 4. Repeat n times to create a read length of n nucleotides.

Answer 40

- dNTPs can be used in Roche 454 sequencing because the nucleotides are added one at a time, and the system relies on a light signal produced by the reaction to indicate the addition of a nucleotide to the daughter strand. There's no risk of adding multiple different nucleotides back to back in one reaction, because only one type of nucleotide is flooded in each reaction. Overall, we can control the rate of nucleotide incorporation if dNTPs are used since they're flooded one at a time. - Reversible ddNTPs are needed for Illumina Solexa sequencing because all 4 nucleotides are present in each reaction cycle. The ddNTPs ensure that only one nucleotide is incorporated per cycle, so the flow cell can be imaged accurately. Overall, the rate of addition of nucleotides is controlled by the use of ddNTPs in Illumina sequencing.

Answer 41

1. Library preparation similar to Roche 454 (beads and emPCR). 2. Universal primer hybridizes to the P1 adapter sequence at the end of fragments. 3. A set of 16 fluorescently labelled 8-mers (single-stranded oligonucleotides that are 8 nucleotides long) is flooded over the fragments. - The first 2 bases of each 8-mer are fixed (dibase probes, base pair with the template), and the remaining 6 bases are degenerate (no specificity/complementary binding). 4. Allow the probe to bind the template and ligate to the primer (sequencing by ligation). 5. The fluorophore is cleaved off the probe. The cleavage removes a small portion of the ligated probe (the last few bases, including the fluorophore), leaving an exposed 5’ end on the growing strand where the next probe can attach. 6. Following several ligation cycles, the extended (daughter) strand is stripped from the template and the process is repeated with a new primer (offset by one nucleotide).

Answer 42

Because you're sequencing the exact same template over and over again.

Answer 43

SOLiD uses ligase, not polymerase

Answer 44

- No PCR amplification required for third-generation sequencers (sequencing of single DNA molecules) - Read lengths: much longer (10,000 to 100,000 bp) and therefore, less coverage required - Error rate and costs are still much higher than 2nd generation platforms

Answer 45

1. Sequencing reaction carried out in extremely small wells (50 nm) called zero-mode waveguides (ZMWs), allowing for high sensitivity to measure fluorescence. 2. DNA and polymerase are immobilized at the bottom of the ZMWs. 3. Fluorescent dNTPs are added all at the same time and incorporation is measured by intensity and colour of fluorescence.

Answer 46

1. The nanopore is the bacterial α-hemolysin protein embedded in a synthetic membrane on an array chip. 2. The membrane has high electrical resistance, and applying a potential across the membrane causes a current to flow through the aperture of the nanopore. 3. DNA is fed into the nanopore by a DNA helicase and travels through the nanopore one nucleotide at a time. 4. As each type of nucleotide travels through the nanopore, it causes a unique current disruption. 5. The current changes are measured to identify the nucleotide sequence.

Answer 47

- Major breakthrough in 2021 - Strategy is to sequence a circularized DNA molecule instead of a linear DNA molecule by PacBio or Nanopore - Allows for multiple rounds of sequencing of a single DNA molecule generating a long sequencing read with multiple copies (subreads) - Comparison of subread sequences identifies errors and increases fidelity - 99.9% accuracy
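
The error-correcting effect of comparing subreads can be illustrated with a simple majority vote. This is a deliberately simplified sketch: it assumes the subreads are already aligned and of equal length, and the toy sequences are made up. Real consensus callers also handle insertions and deletions.

```python
from collections import Counter


def consensus(subreads):
    """Majority-vote consensus across equal-length, pre-aligned subreads."""
    if len({len(s) for s in subreads}) != 1:
        raise ValueError("subreads must be pre-aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*subreads)
    )


# Five passes over the same circular molecule; two passes carry one error each.
subreads = ["ACGTTAGC", "ACGATAGC", "ACGTTAGC", "ACGTTCGC", "ACGTTAGC"]
print(consensus(subreads))
```

Because random errors rarely hit the same position in several passes, each additional subread makes it less likely that an error survives into the consensus.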

Answer 48

Due to limitations of Sanger and NGS in obtaining sequence of highly repetitive regions and structural polymorphisms (section of a gene occurs in several different forms, such as copy number variation, rearrangements, inversions of regions greater than 1 kb)

Answer 49

Large sequencing gaps remain on short arms of acrocentric chromosomes, as well as the Y chromosome

Answer 50

The Telomere-to-Telomere (T2T) Consortium used high-fidelity PacBio and Nanopore platforms to complete sequencing of the human genome.

Answer 51

200 Mbp of new sequence, 2000 candidate genes including 99 new coding genes

Answer 52

New discoveries in gene regulation, genetic variability and new disease loci

Answer 53

Large number of sequence reads of genomes make it easier to identify SNPs linked to polygenic diseases and interesting traits

Answer 54

Sequence genetic material from environmental samples to determine identity and diversity of microbes (gut, volcanic vents, oil sands)

Answer 55

Sequencing capacity allows for greater coverage of a genome that is present in a low proportion of the total genetic material in a sample (e.g. Neanderthal DNA is <5% of sample; woolly mammoth)

Answer 56

Global identification of low-abundance transcripts (including microRNAs) with higher sensitivity than microarrays

Answer 57

Global identification of binding sites of nucleic-acid binding proteins and chemical modifications (e.g. histone occupancy and acetylation, transcription factors)

Answer 58

Shorter reads (100-300 bp) in 2nd generation sequencing make de novo assembly of eukaryotic genomes difficult (no template to help assemble the reads) - okay for prokaryotes with few repetitive sequences or for resequencing projects

Answer 59

Have to increase coverage (20X-30X) for 2nd generation sequencing due to shorter reads than Sanger sequencing

Answer 60

High-fidelity 3rd generation sequencing reduces the error rate, but the cost is still high (although it is going down)

Answer 61

Few centers with strong infrastructure and support for assembly and analysis

Answer 62

Shortage of highly-trained bioinformaticians for assembly and analysis of genome sequences.

Answer 63

1. Personal genomic information can lay out the health roadmap of the individual (genetic makeup helps identify how we respond to certain therapies) 2. Provide advanced screening for disease (nanopore sequencing can identify bacterial pathogens within 7 hours compared to 2-4 days for culturing) 3. Select safer and more effective medications and dosages 4. Create better vaccines (DNA/RNA vaccines) 5. Lower health care costs

Answer 64

1. Personal Genome Project (Harvard Medical School and other countries including Canada): can volunteer to "share your genome information for the greater good" 2. 1000 Human Genomes Project (completed in 2012): genomes of 1,092 anonymous people from 14 populations around the world were sequenced. 3. The Cancer Genome Atlas (TCGA): started as a three-year pilot in 2006 funded by NCI and NHGRI to focus on the molecular understanding of brain, lung and ovarian cancer 4. UK10K: Sequence 4000 healthy humans and exomes of 6000 people currently living with a genetic disease (obesity, schizophrenia and congenital heart disease) 5. 10K Autism Genome project: sequencing both the child and the parents to determine the polymorphisms between parent and child.

Answer 65

We will gain a better molecular understanding of the disorder and develop better diagnostics and therapeutics

(93 cards)