Flashcards by Olivia Johnsson

Main difference between first and second generation sequencing

First generation (Sanger sequencing) can only sequence one fragment at a time, while second generation can perform parallel sequencing of multiple fragments

Paired-end sequencing

Both ends of the same DNA fragment are sequenced

Main difference between second and third generation sequencing

In third generation sequencing, individual DNA molecules are sequenced. There is no amplification step, unlike in second generation sequencing

Duplicate (sequencing error)

Duplicates are caused by sequencing the same physical DNA fragment multiple times. The reads then all come from the same DNA molecule and do not describe the true diversity in the sample. Duplicates are often caused by biases in the amplification step

Factors contributing to errors in Illumina sequencing

Read position: the probability of error increases with each sequenced bp
Sequence context: T has a higher error rate than the other nucleotides, and GC-rich patterns have a high error rate
Read order: the first read has a lower error rate than the second (in paired-end sequencing)

Site specific error (SSE)

Errors that depend on the sequence of the site where the error has occurred (example: GC-rich regions)

Steps involved in pre-processing of NGS data

Pre-processing is used to “clean” the data.

Identify erroneous reads and base pairs
Clean the data by removing the errors, for example by filtering or trimming
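
As an illustration of trimming and filtering, here is a minimal sketch. It assumes each read comes with a list of Phred quality scores; the function names and the thresholds (`min_q=20`, `min_len=30`) are hypothetical choices for the example, not values from the course.

```python
def trim_3prime(seq, quals, min_q=20):
    """Trim bases from the 3' end while their Phred quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def keep_read(seq, min_len=30):
    """Filter step: keep only reads that are still long enough after trimming."""
    return len(seq) >= min_len


# Hypothetical read whose last two bases have low quality.
trimmed_seq, trimmed_quals = trim_3prime("ACGTAC", [30, 30, 30, 30, 10, 5])
```

Trimming keeps the usable part of a read, while filtering discards reads that are too short to map reliably once the low-quality bases are gone.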

Coverage

The number of times a nucleotide in the reference is “covered” by reads
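
A small sketch of how per-base coverage could be computed. It assumes mapped reads are given as 0-based, half-open `(start, end)` intervals on the reference; this representation and the function name are assumptions made for the example.

```python
def per_base_coverage(ref_len, reads):
    """Return the number of reads covering each reference position.

    reads: iterable of (start, end) intervals, 0-based and half-open.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (ref_len + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    coverage = []
    depth = 0
    for i in range(ref_len):
        depth += diff[i]
        coverage.append(depth)
    return coverage


cov = per_base_coverage(6, [(0, 4), (2, 6), (3, 5)])
# cov is [1, 1, 2, 3, 2, 1]; the mean coverage is sum(cov) / len(cov)
```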

Purpose of a variant caller of SNPs

Variant calling aims to identify SNPs in the sequenced genome compared to the reference, and to distinguish true mutations from sequencing errors. A good caller should have a high sensitivity (find all true mutations) and a high specificity (avoid false positives)
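
Sensitivity and specificity can be made concrete with their standard definitions. The sketch below uses made-up counts of true/false positives and negatives, purely for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of true mutations that the caller finds: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-variant sites correctly left uncalled: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical evaluation: 100 true SNPs and 1000 non-variant sites.
sens = sensitivity(tp=90, fn=10)   # 0.9
spec = specificity(tn=990, fp=10)  # 0.99
```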

GATK

Stands for “Genome Analysis Toolkit” and contains the UnifiedGenotyper, which is an advanced mutation caller

Post-processing (genome sequencing)

After SNP variant calling, extra filtering might be needed. Example: removing called variants that are only found at the ends of reads, or only in one read direction, since these are likely sequencing errors

Global alignment

Two sequences are aligned over their full length. Can use the Needleman-Wunsch algorithm
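
A minimal sketch of the Needleman-Wunsch recurrence, returning only the optimal global alignment score. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption chosen for the example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    # Global alignment: the score is read from the bottom-right cell.
    return score[n][m]


needleman_wunsch_score("ACGT", "ACT")  # 3 matches and 1 gap give a score of 2
```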

Local alignment

Two sequences (often of substantially different lengths) are aligned based on their best matching subsequences. Can use the Smith-Waterman algorithm (a modification of Needleman-Wunsch)
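
A minimal sketch of the Smith-Waterman score, under an assumed scoring scheme (match +2, mismatch -1, gap -1). Compared with Needleman-Wunsch, cell scores are never allowed to drop below zero, and the result is the highest cell anywhere in the matrix rather than the bottom-right cell.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between any substrings of a and b."""
    n, m = len(a), len(b)
    # First row and column stay at 0: a local alignment may start anywhere.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # The 0 lets the alignment restart instead of carrying a negative score.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


smith_waterman_score("GGGACGTGGG", "TTACGTTT")  # shared "ACGT" gives 4 * 2 = 8
```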

Steps in analysis of genome sequencing data

Pre-processing, read mapping, quality refinement, variant calling

Quality refinement (genome sequencing)

Quality refinement is the step that comes after read mapping but before variant calling in genome sequencing. The quality refinement step aims to remove errors in the data and errors introduced in the read mapping.

Three main steps in analysis of RNA seq data

Quantification of the gene expression
Normalization
Identification of differentially expressed genes

Splice-aware mapper

When mapping RNA-seq reads to a genome, the mapper needs to be able to handle splicing. In other words, the mapper should be allowed to make large gaps, corresponding to introns

Multiple matches (RNA-seq)

One read matches two or more different regions. Can be explained by multiple similar regions in the genome, but also by errors.

Semiquantitative

The quantitative data is relative and is therefore influenced by, for example, one gene being substantially more expressed than the others
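
A toy example of this effect, using invented read counts. Only gene3 changes between the two samples, yet the relative expression of gene1 and gene2 drops as well, because every value is a fraction of the sample total.

```python
def relative_expression(counts):
    """Convert raw read counts per gene into fractions of the sample total."""
    total = sum(counts.values())
    return {gene: count / total for gene, count in counts.items()}


# Hypothetical counts: identical for gene1 and gene2, gene3 is 8x higher in sample B.
sample_a = {"gene1": 100, "gene2": 100, "gene3": 100}
sample_b = {"gene1": 100, "gene2": 100, "gene3": 800}

rel_a = relative_expression(sample_a)  # gene1 is 1/3 of the reads
rel_b = relative_expression(sample_b)  # gene1 is 1/10 of the reads
```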

Which are the three statistical approaches to identify DEGs?

Methods based on normal assumptions
Methods based on non-parametric methods
Methods based on count distributions

Family-wise error rate

The FWER is the probability of at least one false positive. For m independent tests, each performed at significance level α, it is equal to 1 - (1 - α)^m
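
A quick numerical check of the formula, assuming independent tests, shows how fast the FWER grows with the number of tests.

```python
def fwer(alpha, m):
    """Probability of at least one false positive among m independent tests."""
    return 1 - (1 - alpha) ** m


for m in (1, 10, 20, 100):
    print(m, round(fwer(0.05, m), 3))
# With alpha = 0.05 and 20 tests the FWER is already about 0.64.
```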

Bonferroni correction

Divide the significance level α by the number of performed tests m. Then use the cut-off α/m instead.

A Bonferroni adjusted p-value can be calculated by multiplying each p-value by m (capped at 1).

Bonferroni corrected p-values always control the FWER
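
A minimal sketch of the adjustment with invented p-values. The cap at 1 is a standard convention, since a p-value cannot exceed 1.

```python
def bonferroni_adjust(pvalues):
    """Multiply each p-value by the number of tests m, capping the result at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
# approximately [0.03, 0.12, 1.0]; only the first stays below alpha = 0.05
```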

False discovery rate

FDR is the expected proportion of false positives among all rejected null hypotheses (significant tests)

Benjamini-Hochberg correction

Order the p-values from smallest to largest, then multiply each p-value by the number of tests m and divide by its rank in the ordering.

Benjamini-Hochberg correction controls the FDR
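
A sketch of the adjustment with invented p-values. One step beyond the card is included: a running minimum taken from the largest p-value downwards, which is the usual way to keep the adjusted p-values in the same order as the raw ones.

```python
def benjamini_hochberg_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


bh = benjamini_hochberg_adjust([0.01, 0.04, 0.03, 0.005])
# approximately [0.02, 0.04, 0.04, 0.02]
```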

Difference between supervised and unsupervised data analysis methods

Supervised methods rely on metadata, while unsupervised methods don't. Unsupervised methods instead focus on the identification of patterns in the data.

Distance measure (agglomerative hierarchical clustering)

Describes the separation between data points

Linkage criterion (agglomerative hierarchical clustering)

Measures the distance between clusters
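
To illustrate how the two concepts fit together, the sketch below uses Euclidean distance as the distance measure and shows three common linkage criteria built on top of it. The choice of Euclidean distance and the example points are assumptions.

```python
import math


def euclidean(p, q):
    """Distance measure: separation between two data points."""
    return math.dist(p, q)


def single_linkage(cluster_a, cluster_b, dist=euclidean):
    """Distance between the two closest points of the clusters."""
    return min(dist(a, b) for a in cluster_a for b in cluster_b)


def complete_linkage(cluster_a, cluster_b, dist=euclidean):
    """Distance between the two most distant points of the clusters."""
    return max(dist(a, b) for a in cluster_a for b in cluster_b)


def average_linkage(cluster_a, cluster_b, dist=euclidean):
    """Mean of all pairwise distances between the clusters."""
    pairs = [dist(a, b) for a in cluster_a for b in cluster_b]
    return sum(pairs) / len(pairs)


a = [(0, 0), (1, 0)]
b = [(3, 0), (5, 0)]
# single: 2.0, complete: 5.0, average: 3.5
```

In agglomerative clustering, the two clusters with the smallest linkage value are merged at each step, so the choice of criterion changes the shape of the resulting tree.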

Metagenomics

Metagenomics is the study of the metagenome, which is the collective genome in a microbial community

OTU

An operational taxonomic unit (OTU) is a putative species formed by clustering sequences from amplicons. Sequences that are sufficiently similar are clustered together and assumed to be from the same type of organism.

Steps in amplicon sequencing

Reads from amplicons -> Pre-processing -> OTU identification -> OTU annotation -> Statistical analysis -> Results

Singleton

Sequences that don’t cluster with any other sequence. These sequences are OTUs but are, in many cases, discarded since they’re only observed once.

Diversity (amplicon sequencing)

The diversity is an estimate of how many different bacteria are present in the sample. This can reflect nutrient availability and other environmental factors

Alpha diversity

The diversity on the local level

Beta diversity

The diversity between habitats

Richness (alpha diversity)

The number of unique OTUs

Evenness (alpha diversity)

Measures how evenly the abundances are distributed among the species
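
The cards do not name a specific index, so the sketch below assumes two commonly used ones: the Shannon index for diversity and Pielou's evenness (Shannon index divided by the log of the richness). The OTU counts are invented.

```python
import math


def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over observed OTUs."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """Shannon index relative to its maximum ln(richness); 1 means perfectly even."""
    s = richness(counts)
    if s < 2:
        return float("nan")  # evenness is undefined for fewer than two OTUs
    return shannon_index(counts) / math.log(s)


even = [10, 10, 10, 10]    # same richness (4), evenness close to 1
skewed = [97, 1, 1, 1]     # same richness (4), evenness close to 0
```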

Rarefaction

The diversity indices are dependent on sequencing depth. In order to make indices between samples comparable they need to be rarefied, i.e., subsampled to the same sequencing depth.
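
A minimal sketch of rarefying one sample, assuming the data is a mapping from OTU to read count. Reads are drawn at random without replacement until the target depth is reached; the function name, the fixed seed, and the counts are choices made for the example.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample OTU counts without replacement down to `depth` reads."""
    # Expand the counts into one entry per read.
    pool = [otu for otu, c in counts.items() for _ in range(c)]
    if depth > len(pool):
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)
    rarefied = {otu: 0 for otu in counts}
    for otu in rng.sample(pool, depth):
        rarefied[otu] += 1
    return rarefied


sample = {"otu1": 50, "otu2": 30, "otu3": 20}
rarefied = rarefy(sample, depth=40)
```

Every sample is typically rarefied to the same depth, often that of the smallest sample, before the diversity indices are compared.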

Direct binning (shotgun metagenomic sequencing)

* Search each metagenomic fragment for the presence of genes
* A vast number of the microbial genes are not present in the databases. The search therefore requires sensitive aligners, and approximate matches are often accepted
* Requires relatively long reads and is generally not possible for short reads
* “Bins” are finally formed by counting the number of reads for each type of gene

Reference guided binning (shotgun metagenomic sequencing)

* Guided binning uses an annotated reference database that contains the genomes of the microorganisms present in the sample
* Each metagenomic fragment is mapped against the reference database
* “Bins” are formed by counting the number of reads matching each type of gene present in the genomes
* Typically done for data with short reads (<500 bp)

Flashcards

(39 cards)