Introduction Flashcards

1
Q

Classical Genetics

A
  • Also called “Forward genetics”
  • Start with a phenotype that arises from natural or induced variation, then try to find the genetic basis for that phenotype
  • This is the type of genetics Gregor Mendel was doing with his pea plants.
2
Q

Mapping

A
  • Finding the number and location of genes
  • Mapping relies on genetic linkage
    • Traits that always show up together are probably on the same chromosome
3
Q

Sanger Sequencing

A
  • Sanger sequencing requires the DNA you wish to sequence, a primer, nucleotides (dNTPs), a polymerase, and dideoxynucleotides (ddNTPs)
  • Dideoxynucleotides lack the -OH group on their 3’ carbon, so no further nucleotide can be attached after one is incorporated, which stops synthesis of that strand
  • Sanger sequencing was originally carried out in four separate tubes, with a different dideoxynucleotide added to each tube, since there wasn’t a way to tell the dideoxynucleotides apart
  • Once a different fluorescent label was attached to each dideoxynucleotide so they could be told apart, all four dideoxynucleotides could be added to a single reaction
  • It then evolved to capillary electrophoresis, in which the synthesized DNA is run through a capillary so that it separates by size, with the shortest fragments travelling fastest, and a laser detects the dideoxynucleotide (the last base added) on each fragment
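The read-out step can be illustrated with a minimal Python sketch. It assumes we already have the set of terminated strands, one for every length, written as the newly synthesized bases only (no primer); the function name `read_sanger` and the toy fragments are made up for illustration.

```python
def read_sanger(terminated_fragments):
    """Reconstruct a sequence from chain-terminated fragments.

    Each fragment ends in the dideoxynucleotide that stopped it, so sorting
    by length (what electrophoresis does) and reading the last base of each
    fragment gives the sequence one position at a time.
    """
    by_size = sorted(terminated_fragments, key=len)  # shortest first
    return "".join(fragment[-1] for fragment in by_size)


# Toy example: fragments produced while synthesizing "GATC"
print(read_sanger(["GA", "G", "GATC", "GAT"]))
```

In a real run the last base is identified by its fluorescent color rather than read from a string, and missing or unresolved lengths leave gaps in the read.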
4
Q

What limits the length of samples in capillary sequencing?

A
  • The ratio of ddNTPs to dNTPs, as well as how well you can separate fragments by length
  • The longer the strands get, the smaller the relative difference between them, so the harder it is to tell whether two strands differ by one base
5
Q

Multiplexing

A
  • Multiplexing is just doing capillary electrophoresis, but in 96 or 384 capillaries at once. If capillary electrophoresis is like reading a book, multiplexing is like reading 96 books at once
6
Q

What sequencing strategies were used during the Human Genome Project?

A

Primer Walking, which is a top-down strategy:

  • Primer walking involves designing a primer for a part of the sequence you already know, sequencing out as far as you can go, designing a new primer based off the end of the read you just got, and repeating
  • The problem with primer walking is that you can’t sequence multiple parts of the genome at the same time (multiplex), because you need the sequence of one part of the genome to be able to make a primer for the next part
  • Primer walking was mainly used only at the beginning of the Human Genome Project

Shotgun sequencing:

  • Shotgun sequencing is basically fragmenting the genome, sequencing the fragments, and then trying to stitch the genome back together again by looking for overlaps between fragments
  • The first step is to fragment the genome via sonication and
  • Blunt-ended ligation is then carried out on the fragments to attach an oligo we design to the ends of the fragments, so we can sequence using a primer complementary to the oligo
  • To isolate the fragments from one another and generate enough copies of each fragment to sequence it, the fragments go through clone isolation followed by amplification: each is inserted into a vector and taken up by bacteria via transformation

The third strategy was to do one section of the genome at a time, and was carried out by the public effort (Phase 1) of the HGP

  • They would take long fragments, too long for Sanger sequencing, and clone them into BACs
  • They would then map the BAC fragments to the chromosomes and pick out the fragments that mapped to a particular chromosome. Each chosen fragment then went through typical shotgun sequencing: it was broken into smaller, random fragments, which were inserted into small vectors and sequenced
  • The private sector also contributed to the HGP (they were Phase 2), and their strategy was to just do shotgun sequencing on the whole genome, by fragmenting the whole genome into small random fragments, inserting the fragments into small vectors, and sequencing them
7
Q

Types of “Vectors”

A
  • BAC (Bacterial Artificial Chromosome): big vectors that are 300 kb or less
  • Little vectors:
    • plasmids: 10 kb or less
    • lambda viruses: 18 kb or less
8
Q

Next Generation Sequencing

A
  • Short-read sequencing that carries out millions of reads at a time
  • Illumina is the main company that does this
9
Q

Illumina Sequencing

A
  • Has four main steps: fragmentation, isolating and amplifying clones, sequencing by synthesis, and assembly and scaffolding
  • These are the same steps as Sanger sequencing; they’re just done differently
    Fragmentation:
  • The genome is broken into fragments shorter than those used in first-generation sequencing, and adapters are attached at the ends via blunt-ended ligation
  • Adapters are short pieces of DNA (oligonucleotides) that have a defined sequence
  • The adapters on the two ends are different from one another, and are NOT complementary to one another
  • There are four oligos on the lawn (the surface the fragments attach to): the complements to each strand of the forward adapter (so the forward adapters), and the complements to each strand of the reverse adapter (so the reverse adapters)
    Isolating and Amplifying Clones:
  • They use flow cells, which have channels running through them with ports at the ends so reagents can flow through
  • The genome fragments are then loaded onto the flow cell at a concentration at which each fragment can find its complementary lawn oligo in an isolated location, so it has enough room to multiply without overlapping with any other fragment
  • The adapters then hybridize to the complementary lawn DNA
  • Amplification is carried out in a process called ‘Bridge amplification’
  • Bridge amplification allows you to make clusters of your DNA (both strands)
    a. First, the polymerase, nucleotides, etc. are added, giving us a complementary fragment to our original fragment
    b. We wash away our original fragment, and are left with the complement, which is covalently bound to the flow cell because it was extended from a lawn oligo
    c. The conditions in the flow cell are altered so this complementary fragment bends over to find its complementary oligo on the lawn and hybridizes, forming a bridge
    d. A complementary copy of this synthesized fragment is then made, which will be the same as the original fragment
  • This process is then repeated over and over again until there is a cluster of the fragment and its complement
  • This will give a bunch of different clusters on the lawn, much like bacterial colonies in a Petri dish
    Sequencing by Synthesis:
  • We then add polymerase, reversible dNTPs, and a primer for only ONE of the adapters so we only sequence one strand of DNA in each cluster and not both complementary strands
  • The reversible terminator dNTPs stop elongation because a group attached to them blocks the polymerase. This group can easily be removed, so after each nucleotide is added we can restart the process when we wish
  • At each position, the polymerase adds a nucleotide, we flow away all the other nucleotides, take a picture, remove the terminator group and the fluorophore from the dNTP that was just added, and repeat until we have the sequence of the fragment
  • The raw data we generate is a stack of pictures showing each position
  • To reconstruct the sequence, you have to go through the pictures (see clicker question from 10/4 lecture online)
    Assembly and Scaffolding:
  • Algorithms then go through all the short reads to find overlaps and assemble the reads into contigs
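The overlap idea can be sketched in Python as a toy greedy assembler. This is only an illustration under simple assumptions (error-free reads, one strand, no repeats), not how any particular assembler works; the names `overlap_length`, `greedy_assemble`, and `min_overlap` are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))
```

Reads that overlap nothing stay as separate contigs, which is how gaps in coverage show up in an assembly.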
10
Q

Draw out the Process of Bridge amplification (refer to lectures 4 and 5 in notes)

A

Done

11
Q

Measuring sequence assembly

A
  • One way people measure how “good” an assembly is is by looking at how many reads could be assembled into contigs
  • Another way people measure this is by looking at the number of contigs that were made, or the length of the longest contig made
  • The more commonly used metric for measuring assembly quality, however, is N50: the contig length such that 50% of the assembled bases are in contigs of that length or longer. To find it, take all the contigs, add their lengths together to get the total length, line them up in order of size from longest to shortest, and find the midpoint of the total length; the length of the contig that the midpoint falls in is the N50
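The N50 procedure can be written as a short Python sketch. The function name `n50` and the contig lengths in the example are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)  # longest contig first
    half_total = sum(lengths) / 2                   # midpoint of the assembly
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length  # the contig that the midpoint falls in
    raise ValueError("no contigs given")


# Total length is 290, so the midpoint is 145; 80 + 70 = 150 passes it
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```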
12
Q

Factors that Limit assembly quality

A
  1. Low coverage
    - average coverage = (# reads × read length) / genome size
    - Low coverage can lead to areas in which there is no overlap, or entire gaps
  2. Difficult sequences
    - Typically due to repeats and heterozygosity
    - When there are a lot of repeating sequences, and there aren’t any reads that cover the entire sequence, it can be difficult if not impossible to know how many repeats there actually are
    - Heterozygosity is when we get two different bases at the same position in roughly the same proportion. This is because we have one chromosome from mom and one from dad, and the two copies can carry different bases at a given position
  3. Low accuracy
    - Mistakes in sequencing can lead to bad assembly
    - Mistakes can arise when the addition of a nucleotide is missed (throwing off the rest of the read), when the wrong nucleotide is incorporated, etc.
    - This can happen during either sequencing or amplification; it is a bigger problem if it occurs during amplification, and more difficult to catch
    - To measure sequencing error, a quality score is calculated for each base at each position. A different filter is used for each base, so four pictures are taken at a given position and the intensity in each is measured. This gives the fraction of the total intensity that comes from each color at each spot. The higher the fraction, the more confident we can be that that is the correct base at that position
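Two of the calculations in this card can be sketched in Python: the average-coverage formula and the intensity-fraction idea behind a base call. The function names and the numbers are hypothetical, and the fraction shown is a simplification of the idea described here, not the scoring scheme any real instrument uses.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Average coverage = (# reads x read length) / genome size."""
    return num_reads * read_length / genome_size


def call_base(intensities):
    """Pick the brightest color at one spot and report its intensity fraction.

    `intensities` maps each base to the intensity measured through its filter.
    """
    total = sum(intensities.values())
    base = max(intensities, key=intensities.get)
    return base, intensities[base] / total


# 2 million reads of 150 bases over a 5 million base genome
print(average_coverage(2_000_000, 150, 5_000_000))       # 60.0
print(call_base({"A": 90, "C": 5, "G": 3, "T": 2}))      # ('A', 0.9)
```

A spot where two colors are about equally bright gives a fraction near 0.5, which is the low-confidence case.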
13
Q

Scaffolding with pairs

A
  • This is a technique to improve assembly quality
  • It involves sequencing each fragment twice: after the first strand is sequenced, the synthesized strand is washed away and the complementary strand is sequenced, so you have two overlapping reads of a fragment
  • Any read pairs that are mismatched can then be thrown away
  • Fragments longer than the read length can be used too, which is where the actual “scaffolding” comes in
  • To do so, we generate reads for both sides of the long fragment
  • We can then match up the ends of the long fragment to the ends of other fragments, and as long as we know how long the long fragment is, that can tell us how many base pairs are “missing” in a particular location, which can be helpful to know proper spacing, and maybe even know how many repeats a region has
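The spacing argument comes down to a subtraction, sketched here in Python. The function and parameter names are hypothetical, and the sketch assumes the long fragment's length is known exactly and that its two end reads land in two different contigs.

```python
def estimate_gap(fragment_length, bases_in_left_contig, bases_in_right_contig):
    """Estimate how many bases are missing between two contigs.

    bases_in_left_contig:  distance from where the fragment starts in the
                           left contig to the end of that contig
    bases_in_right_contig: distance from the start of the right contig to
                           where the fragment ends in it
    """
    return fragment_length - bases_in_left_contig - bases_in_right_contig


# A 3000-base fragment with 1200 bases in one contig and 1000 in the other
print(estimate_gap(3000, 1200, 1000))  # 800 bases unaccounted for
```

In practice fragment lengths are only known approximately, so the gap size is an estimate rather than an exact count.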
14
Q

3rd Generation Sequencing

A

Can also be used to improve assembly

15
Q

Genome annotation

A
  • The process of identifying genes
  • There are 3 general strategies for genome annotation:
    1) Inspection, which relies on the fact that genes have distinct sequence features and tries to find them
    2) Homology, which involves comparing genomes with other, similar genomes
    3) Experimentation, which involves isolating mRNA
  • Methods 1 and 2 are bioinformatic approaches, and method 3 is a wet-lab approach
16
Q

Inspection

A
  • A genome annotation approach that involves bioinformatics
  • In Inspection, you want to find ORFs that might be genes
  • The way you typically approach this is by looking for ‘start’ and ‘stop’ codons
  • To do so, one typically finds all the stop codons in every possible frame, and then looks for start codons upstream of every stop codon in the same frame without another stop codon in the middle; these are your ORFs
  • You then decide whether you think the ORF is a likely gene or not
  • To determine that, it is important to take into consideration that the average “by chance” ORF is 21 codons long, and that the average real protein length is around 200 amino acids (codons) long, so you can narrow it down by ORF size
  • A problem arises with inspection, however, and that problem is introns
  • In eukaryotes, we have introns that are not part of the coding sequence
  • These introns could have stop codons inside of them, making us think an ORF we found is shorter than it actually is, or starts before it actually does
  • One solution to the intron problem is to look at codon bias. Codons that are more favored by the organism are more likely to be in the actual coding sequence, and codons favored less by the organism are more likely to be found in an intron
  • Another way to overcome the intron problem is to look for exon-intron boundaries
  • Introns and exons have semi-distinct features at their boundaries for splice sites; these sequences aren’t exact, but are similar
  • We can look at consensus plots to determine if we think the sequence is a boundary or not
  • A third solution is to look for features associated with transcription (mainly promoters). In that same fashion, we can look for CpG islands, since 40% of genes have these islands upstream of them
    The process undertaken when considering ORFs as potential proteins is the following:
    1) Look at the length of the ORF and see if it looks long enough to be a protein
    2) Match codon preference of the organism
    3) Locate introns by looking for exon/intron boundaries
    4) Look for factors associated with transcription
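The start/stop search and the length filter can be sketched in Python. This is a minimal illustration under stated assumptions: it scans only the three forward frames (the reverse strand is ignored), uses the standard ATG start and TAA/TAG/TGA stops, and knows nothing about introns. The names `find_orfs` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(sequence, min_codons=0):
    """Return (start, end, frame) for each ORF on the forward strand.

    An ORF runs from the first start codon after the previous stop to the
    next in-frame stop codon; `end` is the index just past the stop.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if codon == START_CODON and start is None:
                start = pos  # most upstream start since the last stop
            elif codon in STOP_CODONS:
                if start is not None:
                    codons = (pos - start) // 3  # length without the stop
                    if codons >= min_codons:
                        orfs.append((start, pos + 3, frame))
                start = None
    return orfs


dna = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end])  # 2 ATGAAATTTTAG
```

Raising `min_codons` well above the “by chance” ORF length is the size filter from step 1.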
17
Q

Codon bias

A
  • Within groups of redundant codons (codons that code for the same amino acid), different organisms have different preferred codons that they use in their genes, dependent upon the prevalence of the different tRNAs (anti-codons)
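One way to see codon bias is to count how often each codon in a redundant group is used, sketched here in Python. The function name `synonymous_usage` and the toy sequence are made up for illustration; the six leucine codons come from the standard genetic code.

```python
from collections import Counter

LEUCINE_CODONS = ("CTT", "CTC", "CTA", "CTG", "TTA", "TTG")


def synonymous_usage(coding_sequence, synonymous_codons):
    """Fraction of times each codon in a redundant group is used."""
    seq = coding_sequence.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in synonymous_codons)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in synonymous_codons}
    return {codon: counts[codon] / total for codon in synonymous_codons}


usage = synonymous_usage("CTGCTGCTGTTAATG", LEUCINE_CODONS)
print(usage["CTG"], usage["TTA"])  # 0.75 0.25
```

A candidate ORF whose usage looks like that of the organism's known genes is more likely to be real coding sequence.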
18
Q

Consensus plot

A

Consensus plots show the frequency of each base at each position in a particular sequence; the larger a base is drawn, the more likely it is to be at that position
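The numbers behind such a plot are per-position base frequencies, sketched below in Python. The aligned sites are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def base_frequencies(aligned_sites):
    """Frequency of each base at each position of equal-length sequences."""
    profile = []
    for position in range(len(aligned_sites[0])):
        column = Counter(site[position] for site in aligned_sites)
        profile.append(
            {base: column[base] / len(aligned_sites) for base in "ACGT"}
        )
    return profile


def consensus(profile):
    """Most frequent base at each position."""
    return "".join(max(column, key=column.get) for column in profile)


sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]  # made-up boundary sequences
profile = base_frequencies(sites)
print(profile[2]["A"], profile[2]["G"])  # 0.75 0.25
print(consensus(profile))                # GTAAGT
```

Positions where one base has a frequency near 1 are the ones drawn largest in the plot.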

19
Q

CpG islands

A
  • Regions with many CG pairs (a C followed directly by a G) grouped together
  • Typically found upstream of genes

20
Q

Genome Annotation via Homology

A
  • If you think you’ve identified a potential exon, but aren’t 100% sure, you can search in a database to see if that exon or a similar exon shows up as a protein-coding gene in another organism
  • This uses the idea of homology, in that similarity between organisms is often the result of a common ancestor
  • We call two proteins whose similarity comes from a common ancestor homologs
  • We can “rank” how similar things are by looking at their percent identity, which is the percentage of positions between the homologs with the same base or AA
  • To know if it’s actually a homolog or not, we look at the E-value, which is the number of matches this good that we would expect to find in the database by sheer chance. When the E-value is <1, it is unlikely that the similarities are just by chance, and the smaller the E-value, the less likely
  • Less than 50% of reads are annotated; the rest are considered “ORFans”
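Percent identity can be computed from two aligned sequences with a few lines of Python. This sketch assumes the sequences are already aligned to the same length, with "-" marking gaps, and divides by the alignment length; tools differ in which denominator they use. The function name and the two peptide strings are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned positions that carry the same base or amino acid."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100 * matches / len(aligned_a)


print(percent_identity("MKTAYIAK", "MKSAYLAK"))  # 75.0
```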
21
Q

Genome Annotation via Experiment

A
  • Involves finding genes by looking at what part of the genome is transcribed, and is done via RNA-seq
22
Q

RNA-seq

A
  • RNA-seq is a wet-lab technique used for a few different applications
  • The main use of RNA-seq is to see what genes in the genome are transcribed
  • RNA-seq involves extracting the RNA from an organism, converting it to cDNA and doing shot-gun sequencing on it, then matching the reads up to the genome
  • It can be used in genome annotation, since the places where the RNA matches up with the genome are genes that are transcribed. Something that you have to be careful about, however, is that not all genes are transcribed at the same time, so there could very easily be some genes that weren’t expressed at the time the RNA was extracted
23
Q

Functional Annotation

A
  • Once a gene has been identified, functional annotation can be carried out to see what the gene does
    Experimental Strategies for Functional annotation are:
  • Knockout the gene
  • Over-express the gene
  • Look for expression patterns
  • Determine the structure of the gene product
    Computational Strategies include:
  • Homology
  • Structure prediction
24
Q

Knockout

A
  • A technique that can be employed to help determine the function of a gene
  • Involves knocking out the gene and observing if there is any change in phenotype due to that gene not being expressed anymore
  • The actual knockout can be done via recombination, RNAi, or CRISPR-Cas9
  • This is a gene-by-gene approach, not a genomic approach
25
Q

Over-Express

A
  • Over-expressing a gene is a way to help determine the function of a gene
  • It involves increasing the expression of the gene by placing a strong promoter in front of the gene and observing if there are any changes in phenotype
  • This is a gene-by-gene approach, not a genomic approach
26
Q

Measuring Expression Patterns

A
  • This can be done to help determine the function of a gene
  • One way we can measure expression patterns is to take the gene we’re interested in and attach a reporter gene next to it, which will give some visual indication when it is expressed, such as by producing a colored pigment or a fluorescent protein
    - Whenever our gene of interest is expressed, our reporter gene will be expressed too, and we’ll be able to visualize the expression of our gene of interest
  • Another way to measure gene expression patterns is to measure mRNA abundance over time during different developmental/life stages, indicating when in the organism’s life the gene is expressed the most, and thus when it is the most important
  • These are gene-by-gene approaches, not genomic approaches
27
Q

Structure determination

A
  • Can be used to help determine the function of a gene
  • Involves expressing the gene in bacteria, collecting a lot of its gene product, and using methods such as crystallography to determine its structure
  • This is a gene-by-gene approach, not a genomic approach
28
Q

Functional Annotation via Homology

A
  • Basically just involves looking at what homologs of the gene of interest do in other species
  • This is a gene-by-gene approach, not a genomic approach
29
Q

Structure Prediction

A
  • A computational method to try and determine what a gene does
  • It involves putting the sequence of amino acids into a program that takes thermodynamics and chemical and physical interactions into account to predict what the structure of the gene product may be