Bioinformatics Flashcards
Name three applications for DNA sequencing.
- Confirm cloning of a specific gene
- Detect mutation in a specific gene
- Identify new species from environmental samples
- Sequence your own genome (usually not complete sequencing)
- Chromatin-IP (ChIP): identify binding positions for chromatin associated proteins
- Bisulfite sequencing: identify DNA methylation sites
- RNA expression analysis
And sooo much more!
In the history of DNA sequencing, what were the five most prominent discoveries/developments?
- 1953: discovery of DNA double helix
- 1977: Sanger sequencing developed, the first practical way to start sequencing DNA
- 1985: PCR developed (and later improved with Taq polymerase, which stays stable even at high temperatures –> much more efficient!)
- 2003: Human genome fully sequenced for the first time (took 13 years in total to complete)
- 2005: first HTS (high-throughput sequencing) method developed
There are three kinds of sequencing methods in use today, for different purposes. Which, and what purposes are they used for?
- Low-throughput sequencing (Sanger): still in use for short sequences, e.g. checking inserts.
- High-throughput sequencing: e.g. Illumina, Ion Torrent, PacBio, Oxford Nanopore, for large sequencing targets such as whole genomes.
- Genotyping by SNP arrays: comparing SNPs from sample to key sites of SNP variations by seeing if sample hybridizes to known sequences. Used in commercial ancestry kits. NOTE! not really sequencing.
How does Sanger sequencing work in detail?
Sanger sequencing, aka the chain termination method, involves making many copies of a target DNA region using the same principle as PCR but with some tweaks:
- You use the same “ingredients” as in a PCR, but with a mix of normal bases (dNTPs) and dideoxynucleotides (ddNTPs), which lack the 3’ OH group and therefore block further elongation once added. You also use only one primer (forward or reverse) so that only one strand is sequenced; otherwise you’d get conflicting signals.
- Run the reaction; several cycles basically guarantee that a ddNTP will have been incorporated at every position.
- Outcome: many fragments of different lengths, each ending in a ddNTP marked with a color.
- Run the fragments through capillary gel electrophoresis, which separates them by size, and illuminate them with a laser as they pass the detector, from smallest to largest. The labeled base at the end of each fragment length is detected, so you can base call from each detected signal to get the sequence.
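To make the chain termination idea concrete, here is a minimal simulation sketch. It is a toy model, not a real base caller: the function names, the 10% ddNTP fraction and the example sequence are made up for illustration, and for simplicity the synthesized strand is written directly instead of being derived from a complementary template.

```python
import random

def sanger_fragments(synth_strand, n_molecules=2000, dd_fraction=0.1, seed=0):
    """Simulate chain termination: each copy is extended base by base and
    stops as soon as a labeled ddNTP is incorporated instead of a dNTP."""
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_molecules):
        for length, base in enumerate(synth_strand, start=1):
            if rng.random() < dd_fraction:
                # ddNTP incorporated: keep fragment length and its colored end base
                fragments.append((length, base))
                break
    return fragments

def base_call(fragments):
    """Order fragments by length (as electrophoresis does) and read the
    labeled terminal base at each length."""
    label_at_length = {}
    for length, base in fragments:
        label_at_length[length] = base
    return "".join(label_at_length[k] for k in sorted(label_at_length))

strand = "ATGCGTAC"
print(base_call(sanger_fragments(strand)))
```

With enough molecules, every fragment length is represented, which is why reading the end labels from shortest to longest recovers the sequence.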
What are the three major limitations with Sanger sequencing?
- Sanger sequencing requires a homogeneous sample, otherwise there would be conflicting light signals for each fragment size. Not required in HTS!
- The sample needs to be amplified by PCR or cloning first. This is not needed in HTS, as separation (and sometimes amplification) is integrated into the sequencing method.
- Sanger sequencing is expensive and inefficient for larger-scale projects, such as the sequencing of an entire genome or metagenome.
What is de novo sequencing vs resequencing?
De novo sequencing is when you’re sequencing a genome for the first time, so there’s no reference genome to compare to. From the millions of reads, you need to align overlapping sequences and puzzle them together to get the whole genome sequence. This needs a lot of computing power!
Resequencing is when you align sequences to a reference genome and thereby find their correct positions quickly. This uses less computing power, as it is a lot less complex.
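A rough sketch of the resequencing idea: each read is simply looked up in the reference. This assumes exact matches only and uses made-up sequences; real aligners use indexed references and tolerate mismatches, so treat this purely as an illustration.

```python
def map_reads(reference, reads):
    """Return {read: position} using exact matching; -1 means unmapped."""
    return {read: reference.find(read) for read in reads}

reference = "ACGTTAGCCGATAGGCTA"   # hypothetical reference genome
reads = ["TAGCC", "GATAG", "TTTTT"]
print(map_reads(reference, reads))
# {'TAGCC': 4, 'GATAG': 9, 'TTTTT': -1}
```

Each read is placed independently, which is why this is so much cheaper than comparing all reads against each other as in de novo assembly.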
Explain the terms “read” vs “contig”.
- one read = one sequence you get from HTS
- Contig = longer sequence pieced together from overlapping reads
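How reads become a contig can be sketched with a naive greedy merge. This assumes error-free reads from a single strand and a hypothetical minimum overlap of 3 bases; real assemblers use overlap or de Bruijn graphs and handle errors and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(reads)
    while len(seqs) > 1:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: the remaining sequences stay separate contigs
        merged = seqs[i] + seqs[j][size:]
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]
    return seqs

reads = ["ATGCGT", "CGTACC", "ACCGGA"]
print(greedy_assemble(reads))  # ['ATGCGTACCGGA']
```

The all-against-all comparison in the inner loops also hints at why de novo assembly needs so much computing power.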
The read length is very important when selecting HTS method, why?
It’s easier to align a few long sequences than many short ones, but with current methods you get a lot more short reads than long ones, so the applications of short- and long-read methods differ. A combination of both is usually the surest way to go, but it is expensive.
Long reads are better for de novo sequencing, but short reads are better for, for example, finding point mutations or isoforms of a gene product.
What is meant by a “library” in HTS?
When performing HTS you need to do library prep, and the library is your prepped sample. Library prep often includes fragmenting the DNA and adding what the HTS method needs to your sample, such as adapters/ends.
What are single- vs. paired-end reads?
Paired-end reads are useful in short-read sequencing methods for determining both the order and the orientation of the reads. With single-end reads, each fragment is sequenced from one end only, while with paired-end reads each fragment is sequenced from both ends. Because the two reads of a pair come from the same fragment, their relative orientation and approximate distance are known, which helps with ordering longer sequences and is useful when building the scaffold.
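A small sketch of what a read pair looks like, assuming the usual meaning of paired-end (one read from each end of the same fragment, the second one read off the opposite strand). The fragment and read length are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment, read_len):
    """Read 1 starts at one end of the fragment; read 2 starts at the
    other end and is read from the opposite strand."""
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment[-read_len:])
    return read1, read2

fragment = "ATGCGTACCGGATTAC"
print(paired_end_reads(fragment, 5))  # ('ATGCG', 'GTAAT')
```

The two reads point towards each other, and the unsequenced middle of the fragment has a roughly known length, which is the information used for ordering and orienting sequences.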
What does “coverage” mean in an HTS context?
The coverage number is calculated by taking the total number of sequenced bases divided by the total number of bases in the genome. A high coverage is better, especially in de novo sequencing, but higher coverage also means more work, so a balance there is good.
Note: coverage is just an average, so a coverage of 1 can still mean that some regions are left un-sequenced while others are covered many times. This is important to have in mind when choosing a method. For example, you don’t need full coverage when building a scaffold, but if you’re determining a point mutation, you want a high enough coverage to confidently say that you have a true variation (a majority of reads showing the same variation, which is hard if you only have two reads with conflicting info) rather than a sequencing error.
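The coverage calculation can be sketched in a few lines. The numbers are invented, and the gap estimate assumes reads land uniformly at random (a Poisson model), which is an idealization that these notes do not state.

```python
import math

def mean_coverage(n_reads, read_len, genome_size):
    """Total sequenced bases divided by genome size."""
    return n_reads * read_len / genome_size

def expected_uncovered_fraction(coverage):
    """Fraction of bases expected to get zero reads, assuming reads
    land uniformly at random (Poisson model)."""
    return math.exp(-coverage)

c = mean_coverage(n_reads=20_000, read_len=150, genome_size=3_000_000)
print(c)                                         # 1.0
print(round(expected_uncovered_fraction(c), 3))  # 0.368
```

Under this assumption, an average coverage of 1 would still leave roughly a third of the genome un-sequenced, which illustrates why coverage is only an average.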
What is “metagenomics”?
Metagenomics = sequencing of mixed populations and then separating organisms by bioinformatics (= in silico).
What is metagenomics used for and what are the advantages?
Metagenomics is commonly used in ecology to determine the genomic content of a certain environment from a small sample. The advantage is that no cultivation of organisms is required (cultivation-based experiments are estimated to miss 99% of microorganisms).
Two main kinds:
1) Whole-genome sequencing
2) Targeted sequencing (usually 16S rDNA)
When preparing your library for HTS, there are many things to think about. What does GIGO stand for in this context and what does it mean?
GIGO = Garbage in, garbage out: It’s basically a warning reminding you to prepare your sample using the correct approach for your chosen method, to know your sample well and quality check it.
- Know your sample: make sure that only what you want sequenced is in your sample, enriching for your target and minimizing contaminants. Sometimes it’s not possible to have a pure sample, so if you for example have amoeba DNA as target but you have to feed them bacteria while cultivating, you need to know the sequence of the bacteria to identify any reads of the bacterial DNA in your resulting sequencing data.
- Choice of extraction method: the methods in commercial kits have limitations, so know what you’ll miss or get extra of and keep this in mind when interpreting the sequence data.
- Always quality control your sample. This is an easy way to find out whether you need to enrich it more or whether you have contaminants. Use methods like NanoDrop/BioDrop to measure concentration (be mindful of their limitations) or a gel to see whether the DNA/RNA is intact. Qubit is good too.
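As a sketch of the spectrophotometer part of the quality check: the conversion factor (one A260 unit is about 50 ng/µL of dsDNA at a 1 cm path length) and the A260/A280 ratio of about 1.8 for clean DNA are common rules of thumb, not something stated in these notes, and the function names and the tolerance are made up.

```python
def dsdna_concentration_ng_per_ul(a260, dilution_factor=1.0):
    """Rule of thumb: one A260 unit is about 50 ng/uL dsDNA (1 cm path length)."""
    return a260 * 50.0 * dilution_factor

def purity_ratio(a260, a280):
    """A260/A280 ratio; values well below ~1.8 suggest protein or other contamination."""
    return a260 / a280

def looks_clean(a260, a280, target=1.8, tolerance=0.2):
    """Hypothetical pass/fail check; the tolerance is an arbitrary choice."""
    return abs(purity_ratio(a260, a280) - target) <= tolerance

print(dsdna_concentration_ng_per_ul(0.5))  # 25.0
print(looks_clean(0.9, 0.5))               # True
```

Absorbance cannot tell DNA from RNA or free nucleotides, which is one of the limitations to be mindful of and a reason to cross-check with a gel or Qubit.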
Using HTS methods is very expensive, so sequencing a lot of samples cuts the cost per sample drastically. What approaches can you use if you want to sequence many samples at the same time?
- Barcoding: adding a “barcode” to the ends of the fragments lets you sort the reads into different “folders” (one per sample) and also gives you a starting point.
- Add known sequence to adaptors. Same idea as barcoding.
This removes error sources, saves money and workload!
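Sorting pooled reads back into per-sample “folders” (demultiplexing) can be sketched like this. The barcodes, sample names and reads are invented, and the sketch assumes the barcode sits at the start of each read and is read without errors.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Sort reads into per-sample bins by their leading barcode and trim it off."""
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # no known barcode found at the start of this read
            bins["unassigned"].append(read)
    return dict(bins)

barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTAAACCC", "GGGGTTTTAA"]
print(demultiplex(reads, barcodes))
```

Real demultiplexing tools usually allow a mismatch or two in the barcode, which is why barcodes are designed to differ from each other at several positions.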
Which three HTS methods are the biggest players today?
- Illumina: short-read tech (150-300 bp) based on sequencing by synthesis (SBS). Provides a lot of data with high accuracy.
- PacBio: long-read tech (average 10-25 kb) with lower yield, based on real-time sequencing and detection by fluorescence. Higher accuracy with HiFi read tech.
- Oxford Nanopore: long-read tech (up to several Mb) with lower yield, based on detection of bases by changes in current flow.
What are the four steps needed for library prep before HTS?
1) Target enrichment (optional)
- Depletion of short DNA molecules
- Targeted amplification of specific regions
- polyA-selection / rRNA depletion
- Enrichment of sRNA
2) Input DNA/RNA fragmentation and end-repair
- Size selection
- Not performed when maximum read length is wanted (e.g. Nanopore)
3) Add adapters and barcodes
- PCR or ligation
- Barcoded samples can be pooled after this step
4) PCR amplification of library
- Not always necessary
- PCR bias (which needs to be kept in mind when analyzing results)
Describe in as much detail as possible how the Illumina procedure works.
- Library prep:
- fragment your extracted and purified DNA, usually by physical shearing (e.g. sonication).
- adaptor sequences are added to the fragmented DNA (or RNA). In Illumina the adaptors are added in two steps: first short sequences (primer binding sites) are added to both ends, and then, using sequences complementary to these, ends are added that carry the indexes and the sequences complementary to the oligos on the flow cell.
- Cluster generation (amplification):
- Add the fragments to the flow cell, whose lanes are coated with immobilized oligos complementary to the adaptor sequences (two different oligos, each attached by its 5’ end). The 3’ end of the sample DNA’s adaptor sequence hybridizes with an attached oligo, and DNA polymerase is added, which extends from the attached oligo using the sample as the template strand (as usual in the 5’ to 3’ direction) to give dsDNA.
- Then the template strand (your sample) is washed off so that the complementary copy of your original strand remains, attached at the 5’ end.
- The 3’ end of the complementary copy is now complementary to the other kind of oligo on the flow cell, so the strand can bend over and bind to a nearby attached oligo.
- DNA polymerase generates a complementary strand along this bridge, which is an identical copy of the original strand. The bridge is then denatured, which results in two complementary single strands (one forward = the newly synthesized strand, one reverse = the template strand), both attached at the 5’ end.
- This is repeated to create clusters of the same DNA template on one spot (and this happens for several fragments, so you have different clusters for different fragments) on the flow cell.
- The reverse strands are then washed off, which results in clusters of identical forward strands. Each cluster is sequenced as a unit so that the light signal is strong enough to detect in the next step.
- Sequencing by synthesis (SBS):
- We add the forward sequencing primer, DNA polymerase and fluorescently labeled NTs (A, T, G, C) that block further polymerization, so exactly one NT is incorporated per cycle. The clusters are then excited by a light source, and the fluorescence signal emitted by the incorporated NT is detected.
- The block is then removed and the previous step is repeated, cycle after cycle, until the desired read length has been sequenced.
- The fluorescence data from a whole cluster is translated into a sequence by base calling to read each NT in the fragments.
- The labels from each cluster are used to identify which sequences belong to which cluster, and the sequences are then analyzed by comparison to the reference genome.
- If using paired-end sequencing, the sequencing step is repeated for the reverse strand.
- Analysis:
- forward and reverse reads are labeled as such if using paired-end sequencing; the sequences are aligned and clustered if they share similar sequence.
Pros: millions/billions of reads, fairly low error rate.
Cons: short reads, so not great for de novo sequencing or exon determination.
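The base-calling part of sequencing by synthesis can be sketched as picking the strongest fluorescence channel in each cycle. This assumes one intensity value per base per cycle and uses invented numbers; actual instruments and base-calling software are considerably more involved (signal correction, quality scores, different channel setups).

```python
CHANNELS = "ACGT"

def call_bases(intensities):
    """intensities: one (A, C, G, T) tuple of signal strengths per cycle
    for a single cluster. The strongest channel gives the called base."""
    calls = []
    for cycle in intensities:
        strongest = max(range(4), key=lambda i: cycle[i])
        calls.append(CHANNELS[strongest])
    return "".join(calls)

# hypothetical signal from one cluster over three cycles
cluster_signal = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.0, 0.8, 0.2),
    (0.0, 0.1, 0.1, 0.7),
]
print(call_bases(cluster_signal))  # AGT
```

Because every strand in a cluster is identical, they all incorporate the same NT in a given cycle, which is what makes the signal strong enough for a call like this.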