Bioinformatics Flashcards by Lina Håkansson

Name three applications for DNA sequencing.

Confirm cloning of a specific gene
Detect mutation in a specific gene
Identify new species from environmental samples
Sequence your own genome (usually not complete sequencing)
Chromatin-IP (ChIP): identify binding positions for chromatin associated proteins
Bisulfite sequencing: identify DNA methylation sites
RNA expression analysis

And sooo much more!

In the history of DNA sequencing, what were the five most prominent discoveries/developments?

1953: discovery of DNA double helix
1977: Sanger sequencing developed, first possibility to start sequencing
1985: PCR developed (and further developed with Taq polymerase which held its integrity even in high temperatures –> much more efficient!)
2003: Human genome fully sequenced for the first time (took 13 years in total to complete)
2005: first HTS (high-throughput sequencing) method developed

There are three kinds of sequencing methods in use today, for different purposes. Which, and what purposes are they used for?

Low throughput sequencing (Sanger): still in use for short sequences, like looking at inserts etc.
High throughput sequencing: e.g. Illumina, Ion Torrent, PacBio, Oxford Nanopore; for large-scale sequencing such as whole genomes.
Genotyping by SNP arrays: comparing SNPs from sample to key sites of SNP variations by seeing if sample hybridizes to known sequences. Used in commercial ancestry kits. NOTE! not really sequencing.

How does Sanger sequencing work in detail?

Sanger sequencing, also known as the chain termination method, involves making many copies of a target DNA region using the same principle as PCR but with some tweaks:

You use the same “ingredients” as in a PCR, but with a mix of normal nucleotides (dNTPs) and dideoxynucleotides (ddNTPs), which lack the 3’-OH group and therefore block further elongation once incorporated. You also use only one primer (forward or reverse) so that only one strand is sequenced; otherwise you’d get conflicting signals.
Run the reaction. Several cycles basically guarantee that a ddNTP will have been incorporated at every position.
Outcome: many fragments of different lengths, each ending in a ddNTP labeled with a base-specific color.
Run the fragments through capillary gel electrophoresis, where they pass a laser in order from shortest to longest. The labeled terminal base of each fragment length is detected, and base calling from the successive signals gives the sequence.
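
The chain termination idea can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the input string stands for the newly synthesized strand, every incorporation has the same chance of being a ddNTP, and detection is error free. The function name and parameters are made up for illustration.

```python
import random

def sanger_read(synthesized_strand, n_molecules=5000, dd_fraction=0.05, seed=0):
    """Simulate chain termination and read the sequence back from fragment lengths."""
    rng = random.Random(seed)
    terminal_base_by_length = {}
    for _ in range(n_molecules):
        # Extend one copy base by base; each incorporation has a small
        # chance of being a labeled ddNTP, which stops elongation.
        for length, base in enumerate(synthesized_strand, start=1):
            if rng.random() < dd_fraction:
                terminal_base_by_length[length] = base
                break
    # "Electrophoresis": read the terminal label from the shortest
    # fragment to the longest; "N" marks a length that was never observed.
    return "".join(
        terminal_base_by_length.get(length, "N")
        for length in range(1, len(synthesized_strand) + 1)
    )

print(sanger_read("ATGCGTACGTTAGCCTAGGA"))
```

With enough molecules, every fragment length is represented and the original sequence is recovered. With very few molecules, some lengths are missing and show up as N.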

What are the three major limitations of Sanger sequencing?

Sanger sequencing requires a homogeneous sample; otherwise there would be conflicting light signals for each fragment size. Not required in HTS!
The sample needs to be amplified by PCR or cloning first. This is not needed in HTS, as separation (and sometimes amplification) is integrated in the sequencing method.
Sanger sequencing is expensive and inefficient for larger-scale projects, such as the sequencing of an entire genome or metagenome.

What is de novo sequencing vs resequencing?

De novo sequencing is when you’re sequencing a genome for the first time, so there’s no reference genome to compare to. From the millions of reads, you need to align overlapping sequences and puzzle them together to get the whole genome sequence. This needs a lot of computing power!

Resequencing is when you align sequences to a reference genome and by that find their correct positions quickly. This uses less computing power as it is a lot less complex.

Explain the terms “read” vs “contig”.

Read = one sequence you get from HTS
Contig = longer sequence pieced together from overlapping reads
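
How overlapping reads are pieced together into a contig can be sketched with a greedy merge. This is a toy illustration, not how real assemblers work: it assumes error-free reads from one strand, ignores reads contained inside other reads, and the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: the rest are separate contigs
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTACA", "ACAGGTT"]))
```

Three overlapping reads collapse into a single contig, while reads without any overlap stay separate.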

The read length is very important when selecting an HTS method, why?

It’s easier to align a few long sequences than many short ones, but with the current methods you get a lot more short reads than long ones, so the applications of short- and long-read methods differ. A combination of both is usually the surest way to go, but it is expensive.

Long reads are better for de novo sequencing, while short reads are better for, for example, finding point mutations or measuring expression levels.

What is meant by a “library” in HTS?

When performing HTS you need to do library prep, and the library is your prepped sample. Library prep often includes fragmenting the DNA and adding whatever the HTS method needs to the fragments, such as adapters.

What are single- vs. paired-end reads?

Paired-end reads are useful in short-read sequencing for determining both the order and the orientation of reads. With single-end reads, each fragment is sequenced from one end only, while with paired-end (or mate-pair) reads it is sequenced from both ends. The two reads of a pair have a known orientation and approximate distance to each other, which gives you order and orientation over longer stretches and is useful when building the scaffold.

What does “coverage” mean in an HTS context?

The coverage is calculated as the total number of sequenced bases divided by the total number of bases in the genome. High coverage is better, especially in de novo sequencing, but higher coverage also means more work, so strike a balance.

Note: coverage is just an average, so a coverage of 1 can still mean you have gaps of un-sequenced DNA in some places while other places are covered many times. This is important to keep in mind when choosing a method. For example, you don’t need full coverage when building a scaffold, but if you’re determining a point mutation, you want coverage high enough to confidently say that you have a true variation (a majority of reads showing the same variation, which is hard if you only have two reads with conflicting info) rather than a sequencing error.
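
As a worked example of the coverage calculation, here is a small sketch. The first function is the definition from this card. The second estimates how much of the genome is expected to stay unsequenced, under the added assumption that reads land uniformly at random (a Poisson model); real data is usually less even than that. Function names are illustrative.

```python
import math

def mean_coverage(n_reads, read_length, genome_size):
    """Total sequenced bases divided by the number of bases in the genome."""
    return n_reads * read_length / genome_size

def expected_uncovered_fraction(coverage):
    """Fraction of positions hit by zero reads, assuming randomly placed reads (Poisson)."""
    return math.exp(-coverage)

# One million 150 bp reads on a 5 Mb genome
print(mean_coverage(1_000_000, 150, 5_000_000))
# At a coverage of 1, a large part of the genome is still expected to be missed
print(round(expected_uncovered_fraction(1), 3))
```

Under this model, a coverage of 1 leaves roughly a third of the positions unsequenced, which is why an average alone says little about gaps.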

What is “metagenomics”?

Metagenomics = sequencing of mixed populations and then separating organisms by bioinformatics (= in silico).

What is metagenomics used for and what are the advantages?

Metagenomics is commonly used in ecology to determine which genomes are present in a certain environment from a small sample. The advantage is that no cultivation of organisms is required (cultivation-based experiments are estimated to miss 99% of microorganisms).

Two main kinds:
1) Whole-genome sequencing
2) Targeted sequencing (usually 16S rDNA)

When preparing your library for HTS, there are many things to think about. What does GIGO stand for in this context and what does it mean?

GIGO = Garbage in, garbage out: It’s basically a warning reminding you to prepare your sample using the correct approach for your chosen method, to know your sample well and quality check it.

Know your sample: make sure that only what you want sequenced is in your sample, enriching for your target and minimizing contaminants. Sometimes it’s not possible to have a pure sample. If, for example, you have amoeba DNA as the target but have to feed the amoebae bacteria while cultivating, you need to know the sequence of the bacteria to identify any reads of bacterial DNA in your resulting sequencing data.
Choice of extraction method: methods in commercial kits have limitations. Know what you’ll miss or get extra of, and keep this in mind when interpreting the sequence data.
Always quality control your sample. This is an easy way to know whether you need to enrich it more or whether you have contaminants. Use methods like NanoDrop/BioDrop to check concentration (be mindful of their limitations) or a gel to see whether the DNA/RNA is intact. Qubit is good too.

Using HTS methods is very expensive, so sequencing a lot of samples cuts the cost per sample drastically. What approaches can you use if you want to sequence many samples at the same time?

Barcoding: adding a “barcode” to the ends of the fragments lets you sort the reads into different “folders”, one per sample, and also gives you a starting point.
Add a known sequence to the adaptors. Same idea as barcoding.

This removes error sources, saves money and workload!
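
Sorting pooled reads back into per-sample "folders" (demultiplexing) can be sketched as follows. This is a simplified illustration: it assumes the barcode sits at the very start of each read and must match exactly, whereas real pipelines often read the index separately and tolerate mismatches. The barcodes and sample names are made up.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Sort reads into per-sample lists by their leading barcode and trim the barcode off.

    barcodes maps barcode sequence -> sample name.
    """
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched: keep the read aside instead of guessing
            bins["unassigned"].append(read)
    return dict(bins)

barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
pooled_reads = ["ACGTGGGTTT", "TGCAAACCC", "GGGGAAAA"]
print(demultiplex(pooled_reads, barcodes))
```

Reads without a recognizable barcode end up in an "unassigned" bin, which is also a quick quality check on the run.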

Which three HTS methods are the biggest players today?

Illumina: short read tech (150-300 bp) based on sequencing by synthesis (SBS). Provides a lot of data with high accuracy.
PacBio: long read tech (average 10-25 kb) with lower yield based on real time sequencing and detection by fluorescence. Higher accuracy with HiFi read tech.
Oxford Nanopore: long read tech (up to several Mb) with lower yield based on detection of bases by changes in current flow.

What are the four steps needed for library prep before HTS?

Target enrichment (optional)
- Depletion of short DNA molecules
- Targeted amplification of specific regions
- polyA-selection / rRNA depletion
- Enrichment of sRNA

Input DNA/RNA fragmentation and End-repair
- Size selection
- Not performed when maximum read length is wanted (e.g. Nanopore)

Add adapters and barcodes
- PCR or ligation
- Barcoded samples can be pooled after this step

PCR amplification of library
- Not always necessary
- PCR bias (which needs to be kept in mind when analyzing results)

Describe in as much detail as possible how the Illumina procedure works.

Library prep:
- fragment your extracted and purified DNA, usually by physical shearing (e.g. sonication).
- adaptor sequences are added to the fragmented DNA (or RNA). In Illumina the adaptor is added in two steps: first small pieces (primer sites) are added on both sides, and then, using sequences complementary to these, ends are added that are complementary to the oligos on the flow cell (plus indexes).
Cluster generation (amplification):
- Add the fragments to the flow cell, whose lanes are covered with immobilized complementary adaptor sequences: two different oligos attached by their 5’ ends. The 3’ end of the sample DNA’s adaptor sequence hybridizes with an attached oligo, and DNA polymerase is added, which polymerizes from the attached oligo using the sample as the template strand (as usual in the 5’ to 3’ direction) to get dsDNA.
- Then the template strand (your sample) is washed off so that the complementary copy of your original strand remains, attached at the 5’ end.
- The 3’ end of the complementary copy is now complementary to the other kind of oligo on the flow cell, so the strand can bend over and bind to a nearby attached oligo.
- DNA polymerase generates a complementary strand across this bridge, which is an identical copy of the original strand. The bridge is denatured, which results in two complementary strands (one forward = the newly synthesized strand, and one reverse = the template strand), both attached at the 5’ end.
- This is repeated to create clusters of the same DNA template in one spot on the flow cell (and this happens for many fragments, so you get different clusters for different fragments).
- The reverse strands are then washed off, which results in clusters of identical forward strands. Each cluster is sequenced together to produce a light signal strong enough to detect in the next step.
Sequencing by synthesis (SBS):
- We add the forward sequencing primer, DNA polymerase and fluorescently labeled NTs (A, T, G, C) that block further polymerization, so that only one NT is incorporated per cycle, and detect the fluorescent signal from that incorporation.
- The block is removed and the previous step is repeated, again and again, until the full read length has been sequenced. For each addition of an NT, the clusters are excited by a light source, which makes them emit a fluorescence signal that is detected, and then the block is removed to do it again.
- The fluorescence data from a whole cluster is translated into a sequence by base calling to read each NT in the fragments.
- The labels from each cluster are used to identify which sequences belong to which cluster, and the sequences are then analyzed by comparing them to the reference genome.
- If using paired-end sequencing, the sequencing step (SBS) is repeated for the reverse strand.

Analysis:
- forward and reverse strands are labeled if using paired end sequencing, the sequences are aligned and clustered if they contain similar base pairings.

Pros: millions/billions of reads, fairly low error rate. Cons: short reads, so not great for de novo sequencing or exon determination.

Describe in as much detail as possible how the SMRT PacBio procedure works.

Library prep:
- Isolate and purify extracted DNA/RNA.
- Fragment purified DNA/RNA and denature to get ssDNA
- Add adaptors, but here they’re hairpin shaped (a single adaptor type that fits on both sides of the dsDNA, one of them upside down), which results in circular DNA fragments. The adaptor has a known sequence and serves as the primer binding site.
Sequencing:
- Add fragments to SMRT wells with a fixed DNA polymerase in the bottom. The well contains fluorescent labeled NTs.
- When the DNA polymerase adds a NT the fluorescent signal is detected. The focal point is small and narrow to only get the signal from one NT.
- The circular shape of the fragment makes it possible to read the same sequence many times in a row, which provides very high accuracy for each fragment, as you can align the repeated passes to minimize errors = HiFi reads. You can also settle for a single pass around the circle (e.g. if you have one very long fragment in the sample). The available time and the length of the fragment decide which approach you use.

Pros: long reads with high accuracy, better for de novo sequencing, much faster than Illumina, and can detect DNA methylation etc. (based on how long it takes to incorporate the NT). Cons: less data than Illumina.

Describe in as much detail as possible how the Oxford Nanopore procedure works.

Library prep:
- fragment DNA and denature to ssDNA
- add motor proteins with an adaptor added. No need to add anything to fragments.
Sequencing:
- The sample is loaded onto a flow cell with many pores in a membrane.
- The adaptor is tethered to the pore and the motor protein feeds the ssDNA through the pore
- Changes in the current through the pore are detected. Different current levels indicate different nucleotides. The current data is then used for base calling to determine the sequence.

Note: A new advancement is keeping the complementary strand at the same pore and reading it right after, which gives you double the info to correct for errors.

Pros: accessible, very long reads, can also be used to detect modifications. Cons: higher error rate, but it is getting better.

Which HTS method(s) would you use for de novo sequencing and why?

For de novo sequencing you’d want to use long-read tech, PacBio or Nanopore, because longer reads are easier to puzzle together. For a big genome, Nanopore would probably be best as it generates the longest reads of the two.

Note: The best would be a combination of short- and long-read data, as the long-read data is good for scaffolding but has more errors, while the short-read data is more precise and gives more reads. Together they make it possible to produce an accurate sequence. But that’s expensive!

Which HTS method would you use to evaluate gene expression in different cells?

Short-read tech, Illumina. For this question you’d want as much data as possible, first to be able to determine differences in gene expression, but also to have enough data to say that the differences are true and not false positives.

Which HTS method would you use to find new isoforms of a gene-product?

Long-read tech, so either PacBio or Nanopore. When using long-read tech you can easily see gaps in the same read or alternative lengths of the sequence, which makes it clear if any alternative isoforms are present. With short-read tech you would not get reads longer than exons, so you could easily miss isoforms.

What is bioinformatics?

Bioinformatics is an interdisciplinary field of science (biology, computer science and statistics) that develops methods and software tools for understanding biological data, especially when the data sets are large and complex.

What is bioinformatics used for?

Bioinformatics is used to make predictions from large datasets.

Name three applications for bioinformatics.

- Assemble genome sequences: In prokaryotes, if circular = complete; more complex in eukaryotes. Haploid organisms are easier; diploid ones are more complex, as databases often only have one "representative" chromosome. Genomes are also dynamic and change over time.
- Annotation: Predicting genes and introns. In prokaryotes, look for a start codon (ATG) and a stop codon (TAA, TAG or TGA) = gene in between. When you have found what you think is a gene, you can BLAST it to evaluate function from related organisms with annotated genomes. In eukaryotes it's more complex, as introns exist and have different lengths; often combined with RNA-seq to see what is expressed.
- Differential expression analyses: How are genes regulated in different conditions?
- Phylogeny: Looking at and comparing genomes to investigate how organisms are related to each other. Usually, if genes are similar in sequence and function, they're related. Also used to look at the evolution of gene families. Note: if a gene is present twice in one organism and only once in another (but exactly the same), it's probably because of gene duplication (or horizontal transfer in prokaryotes).
- Molecular interactions: RNA/RNA, RNA/DNA, protein/RNA/DNA.
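
The prokaryotic annotation rule (start codon, stop codon, gene in between) can be sketched as a simple open reading frame search. This is a minimal illustration, not a real gene predictor: it scans only the forward strand, uses ATG as the only start codon and the three standard stop codons TAA, TAG and TGA, and the function name and `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_codons=2):
    """Return (start, end, orf_sequence) for ORFs in the three forward reading frames.

    start is 0-based, end is exclusive and includes the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate gene
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it has enough codons before the stop
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, sequence[start:pos + 3]))
                start = None
    return orfs

print(find_orfs("CCATGAAATTTGGGTGACC"))
```

A candidate found this way would then be checked further, for example by BLAST against annotated genomes as described in the card.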

What is a database?

An organized collection of data and knowledge. Can be both local and public. Used in many bioinformatic tools to handle large amounts of data. E.g. NCBI, WormBase, GEO, dictyBase etc.

Many bioinformatic databases have tons of data, which can make it hard to find what you're looking for. How can you get around this problem?

Look for scientific articles, find one that uses the kind of data you're looking for, and use their data! Nowadays it's basically required to make your data public and to state in your publication how to find it.

What is transcriptomics?

Transcriptomics is the global analysis of RNA levels. Gives a good picture of the abundance/level/expression of different mRNAs (and sometimes non-coding RNAs). Transcriptome = the sum total of all the messenger RNA molecules expressed from the genes of an organism.

Why is it important to differentiate between proteomics and transcriptomics?

- Many differences in gene expression are seen already at the RNA level, but changes in RNA levels do not always correlate with changes in protein levels.
- Transcriptomics is easier/cheaper/less biased than proteomics, as not all proteins are detected with proteomics.
- RNA can base pair with RNA or DNA, so transcriptomics also covers non-coding RNAs (e.g. microRNAs, long ncRNAs).
Remember: when doing RNA-seq, you look at steady-state levels of RNA! You cannot say whether an increase/decrease happens at the transcriptional or post-transcriptional level.

There are three groups of methods for quantification of RNA (transcriptomics). What are these? Provide an example of each.

- PCR-based: Low throughput: RT-qPCR
- Hybridization-based: Low throughput: Northern blot, high throughput: Microarray
- Sequencing-based: High throughput: RNA-seq

How do microarrays work?

Microarrays are a hybridization-based method, which means that only known sequences can be detected. Basically, you have a grid with probes of known sequences. You add your sample and control, tagged with fluorescent dyes, wash off everything that has not hybridized, illuminate, and detect where things have hybridized. Note: Is being/will soon be outcompeted by RNA-seq.

How does RNA-seq work? Why is it advantageous to microarrays?

RNA-seq = transcriptomics based on HTS. For most techniques, RNA is reverse transcribed into cDNA and then sequenced, and this is what we call RNA-seq. Note: When using Oxford Nanopore, there's no need to convert to cDNA first!
Advantages compared to microarrays:
- No previous sequence information is required, so you don't miss stuff due to unknown sequences.
- Strand information is possible, depending on the library preparation protocol (strand-specific sequencing, using cDNA in only one direction = yes; non-strand-specific sequencing, using both cDNA strands = no).

Name three different applications of RNA-seq.

- Gene annotation: e.g. looking into alternative splicing (easiest with long-read methods).
- Identification of RNA editing sites: either by direct RNA sequencing, like Nanopore, or using antibodies against specific mods like methylation.
- Cross-linking immunoprecipitation (CLIP): identify the RNA binding sites of RNA binding proteins.
- Differential expression analysis: more reads on one exon in a sample = more expression; if all exons have the same coverage = equal expression. Remember that raw read counts can't be compared directly between samples, as the samples are prepared independently (always other conditions).

How does Cross-linking immunoprecipitation (CLIP) work?

You add the proteins to the RNA, treat with UV light to create cross-links between protein and RNA, then you use antibodies that bind to the proteins, cut surrounding RNA and "fish out" the RNA and sequence it to get the binding site.

What is the difference between a technical and a biological replicate?

A technical replicate is used to account for technical errors, which is done by repeating the experiment with the same samples. Controls for e.g. pipetting errors. A biological replicate is used to account for biological differences, which is done by repeating the experiment with another sample. Controls for e.g. variation between experiments or organisms.

Why is it so important to always include controls?

To make sure that the result is due to your experiment and not anything else.

When designing experiments, it's important to minimize variation. Give two examples of how to do this.

- Standardize experiments: do everything exactly like you did before.
- Minimize batch effects by treating a mix of controls and treated samples in each batch.

When doing RNA-seq, the RNA quality matters a lot. What two characteristics do you want for a good sample to use in RNA-seq?

- Pure RNA: enrich for what you want! Like removing rRNA, selecting for the polyA tail to only get mature mRNA, or doing size selection to only get sRNA.
- Intact RNA

How do you perform Poly(A)+ selection?

You use a TTTTT oligo (that hybridizes with the AAAAA tail sequence) with a magnetic (or agarose) bead attached. Then you can just hold a magnet to the bottom and pour everything else out! The idea is the same for removing rRNA, but using probes that bind to rRNA instead.

When you have done the RNA-seq and go on to analyze the results, what is important to do?

1. Always visualize your data! It's important for clarity, understanding and interpretation. For example, if you see many very long gaps in the alignment, you should lower the maximum allowed gap to make sure things aren't just aligned by chance.
2. Quantification: Just looking at the data isn't enough, as small differences can be hard to see with the naked eye, and you also want to check whether things differ significantly or not.
3. Normalization: Read counts need to be normalized, as multiple variables play in, especially between samples. E.g. the total number of reads differs, gene lengths differ, and the RNA composition varies biologically.
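
To make the normalization point concrete, here is a small sketch of two widely used simple normalizations, counts per million (CPM, corrects for the total number of reads) and transcripts per million (TPM, also corrects for gene length). These two measures are not named in the card and are brought in here only as an illustration; the gene names and numbers are made up.

```python
def counts_per_million(counts):
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: count / total * 1e6 for gene, count in counts.items()}

def transcripts_per_million(counts, lengths_bp):
    """Correct for gene length first (reads per kilobase), then scale to one million."""
    per_kb = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    scale = sum(per_kb.values()) / 1e6
    return {gene: rate / scale for gene, rate in per_kb.items()}

counts = {"geneA": 100, "geneB": 300}
lengths_bp = {"geneA": 1000, "geneB": 3000}
print(counts_per_million(counts))
print(transcripts_per_million(counts, lengths_bp))
```

In this example geneB has three times the reads of geneA but is also three times as long, so after length correction both genes come out at the same level.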

Name one method that is used for normalization in differential expression (DE) analysis and explain how it is used.

A common method for DE analysis is DESeq2, which normalizes read counts by the median of ratios: for each sample, it takes the median of the ratios between that sample's counts and a pseudo-reference, and divides the read counts by it. Statistical tests are then made to find true shifts in expression and to determine the significance of the difference for each gene. Note: With more tests you are more likely to get a false positive, so DESeq2 also outputs adjusted p-values (false discovery rate, FDR), which can be trusted.
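
The median-of-ratios idea can be sketched in a few lines. This is a plain Python illustration of the principle, not DESeq2's actual code (DESeq2 is an R package). It assumes the pseudo-reference is the per-gene geometric mean across samples and that genes with a zero count in any sample are skipped; function names and data are made up.

```python
import math
from statistics import median

def size_factors(counts):
    """counts maps sample name -> list of raw counts, same gene order in every sample."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Pseudo-reference: log of the geometric mean of each gene across samples.
    # Genes with a zero count in any sample are left out.
    log_reference = {}
    for gene in range(n_genes):
        values = [counts[sample][gene] for sample in samples]
        if all(value > 0 for value in values):
            log_reference[gene] = sum(math.log(v) for v in values) / len(values)
    factors = {}
    for sample in samples:
        log_ratios = [math.log(counts[sample][gene]) - ref
                      for gene, ref in log_reference.items()]
        factors[sample] = math.exp(median(log_ratios))
    return factors

def normalize(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return {sample: [c / factors[sample] for c in values]
            for sample, values in counts.items()}

raw = {"s1": [10, 20, 30, 0], "s2": [20, 40, 60, 5]}
print(size_factors(raw))
print(normalize(raw))
```

Here s2 was simply sequenced twice as deeply as s1, and after normalization the first three genes get the same value in both samples.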

An additional method is often used to verify your results after performing RNA-seq, give one example.

- RT-qPCR
- Northern blot

There are three different levels of transcriptomics, which and what are they used for?

- Most studies are on population level (bulk): Expression levels are a mean of all cells in the sample
- Single cell: Expression levels for each individual cell in a sample
- Spatial transcriptomics: Single cell and location in tissue

Proteomics left!