Introduction Flashcards
Classical Genetics
- Also called “Forward genetics”
- Start with a phenotype caused by natural or induced variation, then try to find the genetic basis for that observed phenotype
- This is the type of genetics Gregor Mendel was doing with his pea plants.
Mapping
- Finding the number and location of genes
- Mapping relies on genetic linkage
- Traits that always show up together are probably on the same chromosome
Sanger Sequencing
- Sanger sequencing requires the DNA you wish to sequence, a primer, some nucleotides, a polymerase, and dideoxynucleotides
- Dideoxynucleotides lack the -OH group on their 3’ carbon, so the next nucleotide cannot be attached to them, which stops elongation of the strand
- Sanger sequencing was originally carried out in four different tubes, with a different dideoxynucleotide added to each tube since there wasn’t a way to tell the dideoxynucleotides apart
- Once a different fluorescent dye was attached to each dideoxynucleotide so they could be told apart, all four dideoxynucleotides could be added to one tube at once.
- It then evolved to capillary electrophoresis, in which the synthesized DNA is run through a capillary so the fragments separate by size, with the smallest moving fastest, and a laser detects the dideoxynucleotide (the last base added) on each fragment
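The chain-termination idea above can be sketched as a small simulation. This is only an illustration under simplifying assumptions: `new_strand` holds the bases of the strand being synthesized, every base has the same chance (`dd_fraction`, a made-up parameter) of being a ddNTP, and sorting by length stands in for electrophoresis.

```python
import random

def sanger_fragments(new_strand, n_molecules=2000, dd_fraction=0.1, seed=0):
    """Simulate chain termination; return the set of (length, last_base) pairs seen."""
    rng = random.Random(seed)
    observed = set()
    for _ in range(n_molecules):
        for position, base in enumerate(new_strand, start=1):
            # With probability dd_fraction a ddNTP is incorporated and the strand stops.
            if rng.random() < dd_fraction:
                observed.add((position, base))
                break
    return observed

def read_sequence(observed):
    """Sort fragments shortest first and read off each labelled terminal base."""
    return "".join(base for _, base in sorted(observed))

fragments = sanger_fragments("ATGCCGTAAGCTTAGGCTAA")
print(read_sequence(fragments))
```

With enough molecules, every length from 1 up to the full strand is represented, so reading the terminal bases in size order recovers the sequence.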
What limits the length of samples in capillary sequencing?
- The ratio of ddNTPs to dNTPs, as well as how well you can separate fragments of different lengths
- This is because the longer the strands get, the smaller the relative size difference between them, so the harder it is to tell whether two strands differ by one base or not.
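A quick calculation makes the resolution problem concrete. This sketch assumes that what matters is the size difference relative to the fragment length; the function name is just illustrative.

```python
def relative_size_difference(length):
    """Fractional difference between a fragment of `length` bases and one a single base longer."""
    return 1 / length

for length in (50, 500, 1000):
    print(length, f"{relative_size_difference(length):.2%}")
# 50 2.00%
# 500 0.20%
# 1000 0.10%
```

A one-base difference is 2% of a 50-base fragment but only 0.1% of a 1000-base fragment, which is why long fragments blur together.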
Multiplexing
- Multiplexing is just doing capillary electrophoresis in 96 or 384 capillaries at once. If capillary electrophoresis is like reading a book, multiplexing is like reading 96 books at once
What sequencing strategies were used during the Human Genome Project?
Primer Walking, which is a top-down strategy:
- Primer walking involves designing a primer for a part of the sequence you already know, sequencing out as far as you can go, designing a new primer based off the end of the read you just got, and repeating
- The problem with primer walking is that you can’t sequence multiple parts of the genome at the same time (multiplex), because you need the sequence of one part of the genome to be able to make a primer for the next part.
- Primer walking was mainly used at the beginning of the Human Genome Project
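The walking loop can be sketched in a few lines. This is a toy model, not how a real read is produced: `read_length` and `primer_length` are made-up values, and each primer is assumed to match exactly one site in the genome (checked in the code).

```python
def primer_walk(genome, first_primer, read_length=12, primer_length=6):
    """Sequence `genome` one read at a time, designing each primer from the previous read."""
    assembled = first_primer
    primer = first_primer
    rounds = 0
    while True:
        # Assumption: every primer anneals at exactly one site.
        if genome.count(primer) != 1:
            raise ValueError(f"primer {primer!r} does not match a unique site")
        start = genome.find(primer) + len(primer)
        read = genome[start:start + read_length]  # one read extending from the primer
        if not read:
            break  # reached the end of the genome
        assembled += read
        rounds += 1
        primer = assembled[-primer_length:]  # design the next primer from the end of this read
    return assembled, rounds

genome = "ATGGCGTACCTGAAGTTCGACTAGGCATTCAGGATCCGTA"
print(primer_walk(genome, "ATGGCG"))
```

Each round needs the read from the round before it, which is exactly why the rounds cannot be run in parallel.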
Shotgun sequencing:
- Shotgun sequencing is basically fragmenting the genome, sequencing the fragments, and then trying to stitch the genome back together again by looking for overlaps between fragments
- The first step is to fragment the genome via sonication and
- Blunt-ended ligation is then carried out on the fragments to attach an oligo we design to the ends of the fragments, so we can sequence using a primer complementary to the oligo
- In order to isolate the fragments from one another, and generate enough of each fragment to sequence it, the fragments go through clone isolation followed by amplification: they are inserted into a vector and taken up by bacteria via transformation
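The "stitching back together" step can be sketched as a greedy overlap merge. This is a minimal illustration, assuming error-free fragments from one strand and a made-up minimum overlap (`min_length`); real assemblers are far more involved.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(fragments, min_length=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left that overlaps
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

print(greedy_assemble(["ATGGCGTACC", "GTACCTGAAG", "TGAAGTTCGA"]))
# ['ATGGCGTACCTGAAGTTCGA']
```

If two groups of fragments never overlap, the function returns more than one contig, which is what a gap in an assembly looks like.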
The third strategy was to do one section of the genome at a time, and was carried out by the public effort (Phase 1) of the HGP
- They would take long fragments, too long to do Sanger sequencing on, and clone them into BACs
- They would then map the fragments to the chromosomes and separate out all the fragments that mapped to a particular chromosome. Each chosen fragment then went through typical shotgun sequencing: it was broken into smaller, random fragments, which were inserted into small vectors and sequenced
- The private sector also contributed to the HGP (they were Phase 2), and their strategy was to just do shotgun sequencing on the whole genome, by fragmenting the whole genome into small random fragments, inserting the fragments into small vectors, and sequencing them
Types of “Vectors”
- BAC (Bacterial Artificial Chromosome): big vectors that are 300 kb or less
- Little vectors:
- plasmids: 10 kb or less
- lambda viruses: 18 kb or less
Next Generation Sequencing
- Short-read sequencing that carries out millions of reads at a time
- Illumina is the main company that does this
Illumina Sequencing
- Has four main steps: fragmentation, isolation and amplifying clones, sequencing by synthesis, and assembly and scaffolding
- These are the same steps as in Sanger sequencing; they’re just done differently
Fragmentation:
- The genome is broken into fragments shorter than in first-generation sequencing, and adapters are attached at the ends via blunt-ended ligation
- adapters are short pieces of DNA (oligonucleotides) that have a defined sequence
- The adapters on the ends are different from one another, and are NOT complementary to one another
- There are four oligos on the lawn: the complements to each strand of the forward adapter (so the forward adapters), and the complements to each strand of the reverse adapters (so the reverse adapters)
Isolating and Amplifying Clones:
- They use flow cells, which have channels running through them with ports on the ends so reagents can flow through
- The genome fragments are then loaded onto the flow cell at a concentration at which each fragment can find its complementary lawn oligo in an isolated location, so it has enough room to multiply without overlapping with any other fragment
- The adapters then hybridize to the complementary lawn DNA
- Amplification is carried out in a process called ‘Bridge amplification’
- Bridge amplification allows you to make clusters of your DNA (both strands)
a. First, the polymerase, nucleotides, etc are added, giving us a complementary fragment to our original fragment
b. We wash away our original fragment, and are left with the complement, which is covalently bound to the cell because it contains a lawn oligo
c. The conditions in the flow cell are altered so this complementary fragment will kind of bend over to find its complementary oligo on the lawn and hybridize, forming a bridge
d. This synthesized fragment can then make a complementary copy of itself, which will be the same as the original fragment
- This process is then repeated over and over again until there is a cluster of the fragment and its complement
- This will give a bunch of different clusters on the lawn, much like bacterial colonies in a Petri dish
Sequencing by Synthesis:
- We then add polymerase, reversible dNTPs, and a primer for only ONE of the adapters so we only sequence one strand of DNA in each cluster and not both complementary strands
- The reversible dNTPs are able to stop elongation from occurring, because there is a molecule attached at the end of them that blocks the polymerase, but this molecule can easily be removed, so we can start the process again when we wish after each nucleotide is added
- At each position, the polymerase adds one nucleotide, we flow away all the other nucleotides, take a picture, remove the terminator molecule and the fluorophore from the reversible dNTP, and repeat until we have the sequence of the fragment
- The raw data we generate is a stack of pictures showing each position
- To reconstruct the sequence, you have to go through the pictures (see clicker question from 10/4 lecture online)
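Going through the stack of pictures can be sketched as follows. The data layout is an assumption made for illustration: `image_stack[cycle][cluster]` holds four intensities in A, C, G, T order, and the brightest channel is taken as the base added in that cycle.

```python
CHANNELS = "ACGT"  # assumed order of the four intensity values

def call_bases(image_stack, cluster):
    """Read one cluster's sequence from a stack of per-cycle images."""
    read = []
    for image in image_stack:  # one image per sequencing cycle
        intensities = image[cluster]
        brightest = max(range(4), key=lambda channel: intensities[channel])
        read.append(CHANNELS[brightest])
    return "".join(read)

# Toy stack: three cycles, two clusters (values are invented).
stack = [
    {"cluster_1": (0.90, 0.04, 0.03, 0.03), "cluster_2": (0.02, 0.05, 0.90, 0.03)},
    {"cluster_1": (0.05, 0.05, 0.10, 0.80), "cluster_2": (0.03, 0.88, 0.05, 0.04)},
    {"cluster_1": (0.02, 0.03, 0.92, 0.03), "cluster_2": (0.85, 0.05, 0.05, 0.05)},
]
print(call_bases(stack, "cluster_1"))  # ATG
print(call_bases(stack, "cluster_2"))  # GCA
```

Each cluster keeps the same location in every picture, so its read is just its color at that spot, cycle after cycle.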
Assembly and Scaffolding:
- We then go through all the short reads to find overlaps by using algorithms to assemble contigs
Draw out the Process of Bridge amplification (refer to lectures 4 and 5 in notes)
Done
Measuring sequence assembly
- One way people measure how “good” an assembly is: look at how many reads could be assembled into contigs
- Another way is to look at the number of contigs that were made, or the length of the longest contig made
- The more commonly used metric for assembly quality, however, is N50: the contig length such that 50% of the assembly is contained in contigs of that length or longer. To find it, add up the lengths of all the contigs, line the contigs up in order of size (largest first), and find the midpoint of the total length; the length of the contig that the midpoint falls in is the N50
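The N50 recipe above translates almost directly into code. This is a minimal sketch that takes a plain list of contig lengths; the function name is illustrative.

```python
def n50(contig_lengths):
    """Length of the contig that contains the midpoint of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    # Line the contigs up from longest to shortest and walk to the midpoint.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")

print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

In the example the total is 290, so the midpoint is 145; the first contig covers 80 and the second brings the running total to 150, so the midpoint falls in the 70-long contig.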
Factors that Limit assembly quality
- Low coverage
- average coverage = (# reads x read length)/ genome size
- Low coverage can lead to areas in which there is no overlap, or entire gaps
- Difficult sequences
- Typically due to repeats and heterozygosity
- When there are a lot of repeating sequences, and there aren’t any reads that cover the entire sequence, it can be difficult if not impossible to know how many repeats there actually are
- Heterozygosity is when we get two different bases at the same position in roughly the same proportion. This is because we have one chromosome from mom and one from dad, and they could have different bases at a given location
- Low accuracy
- Mistakes in sequencing can lead to bad assembly
- Mistakes can come from missing the addition of a nucleotide (which throws the rest of the read off), from the wrong nucleotide getting incorporated, etc
- This can happen during either sequencing or amplification; it is a bigger problem if it occurs during amplification, and more difficult to catch
- To measure sequencing error, a quality score is calculated for each base at each position. A different filter is used for each base, so four pictures are taken at a given position and the intensity in each is measured. This gives us the fraction of the intensity that comes from each color at each spot. The higher the fraction, the more confident we can be that that is the correct base at that position
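Two of the quantities in this section are simple enough to work through in code: the average coverage formula and the per-color intensity fraction. The numbers below are invented for illustration, and the function names are not from any particular tool.

```python
def average_coverage(n_reads, read_length, genome_size):
    """average coverage = (# reads x read length) / genome size"""
    return n_reads * read_length / genome_size

def intensity_fractions(intensities):
    """Fraction of the total signal at one spot that comes from each color."""
    total = sum(intensities.values())
    return {base: value / total for base, value in intensities.items()}

# 1,000,000 reads of 150 bases over a 5,000,000-base genome.
print(average_coverage(1_000_000, 150, 5_000_000))  # 30.0

# One spot in one cycle: most of the signal is in the A channel.
fractions = intensity_fractions({"A": 90, "C": 5, "G": 3, "T": 2})
print(max(fractions, key=fractions.get), fractions["A"])
```

A spot whose signal is split close to evenly between two colors would give a low fraction for the top base, flagging that call as less trustworthy.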
Scaffolding with pairs
- This is a technique to improve assembly quality
- It involves sequencing the first strand of each fragment, washing the synthesized strands away, and then sequencing the complementary strand, so you end up with two overlapping reads of the same fragment.
- Any read pairs that are mismatched can then be thrown away
- Fragments that are longer than the read length can be used too, which is where the actual “scaffolding” comes in
- To do so, we generate reads for both sides of the long fragment
- We can then match up the ends of the long fragment to the ends of other fragments, and as long as we know how long the long fragment is, that can tell us how many base pairs are “missing” in a particular location, which can be helpful to know proper spacing, and maybe even know how many repeats a region has
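The spacing argument is just subtraction, which a short sketch makes explicit. It assumes the long fragment starts inside one contig and ends inside the next, and that its length is known; the parameter names are hypothetical.

```python
def estimate_gap(fragment_length, contig_a_length, start_in_a, end_in_b):
    """Estimate the number of missing bases between contig A and contig B.

    start_in_a: 0-based position in contig A where the long fragment begins
    end_in_b:   number of bases at the start of contig B covered by the fragment
    """
    bases_in_a = contig_a_length - start_in_a
    return fragment_length - bases_in_a - end_in_b

# A 3000-base fragment starts 800 bases before the end of contig A
# and reaches 700 bases into contig B.
print(estimate_gap(3000, contig_a_length=10_000, start_in_a=9_200, end_in_b=700))  # 1500
```

If the gap is known to be filled by a repeat of known unit length, dividing the estimate by that unit length gives a rough repeat count.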
3rd Generation Sequencing
Can also be used to improve assembly
Genome annotation
- The process of identifying genes
- There are 3 general strategies for genome annotation:
1) Inspection, which relies on the fact that genes have distinct sequence features and tries to find them
2) Homology, which involves comparing genomes with other, similar genomes
3) Experimentation, which involves isolating mRNA
- Methods 1 and 2 are bioinformatic approaches, and method 3 is a wet-lab approach
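As an illustration of the inspection strategy, the sketch below scans for one kind of sequence feature: an open reading frame that runs from a start codon (ATG) to a stop codon. Choosing this feature is an assumption made for the example, not something taken from the lectures; it only looks at one strand, and `min_codons` is a made-up threshold.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_codons=3):
    """Return (start, end, orf) for each ATG-to-stop stretch in the three forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, sequence[start:i + 3]))
                start = None  # close it and keep scanning
    return orfs

print(find_orfs("CCATGGCTTGATAAGG"))
# [(2, 11, 'ATGGCTTGA')]
```

Real gene finders look for many more signals than this, but the idea of searching the raw sequence for recognizable features is the same.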