Flashcards

1
Q

Main difference between first and second generation sequencing

A

First generation (Sanger sequencing) can only sequence one fragment at a time, while second generation sequences many fragments in parallel

2
Q

Paired-end sequencing

A

Both ends of the same DNA fragment are sequenced

3
Q

Main difference between second and third generation sequencing

A

In third generation sequencing, individual DNA molecules are sequenced. There is no amplification step, unlike in second generation sequencing

4
Q

Duplicate (sequencing error)

A

Duplicates arise when the same physical DNA fragment is sequenced multiple times. The reads then all come from the same DNA molecule and do not describe the true diversity in the sample. Duplicates are often caused by biases in the amplification step
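
A minimal sketch of how duplicates could be filtered after mapping, assuming each read is a dictionary with hypothetical keys `chrom`, `start` and `strand`. Treating reads with the same key as duplicates is a simplification made for illustration; dedicated tools use more information.

```python
def remove_duplicates(reads):
    """Keep the first read seen for each (chrom, start, strand) key."""
    seen = set()
    unique = []
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


reads = [
    {"chrom": "chr1", "start": 100, "strand": "+"},
    {"chrom": "chr1", "start": 100, "strand": "+"},  # same key as the first read
    {"chrom": "chr1", "start": 250, "strand": "-"},
]
unique_reads = remove_duplicates(reads)
```

Here `unique_reads` holds two reads, since the second input read shares its key with the first.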

5
Q

Factors contributing to errors in Illumina sequencing

A
  • Read position: the probability of error increases with each sequenced base pair, so later positions in a read are less reliable
  • T has a higher error rate than the other nucleotides, and GC-rich patterns have a high error rate
  • First read has lower error rate than second (for paired-end sequencing)
6
Q

Site specific error (SSE)

A

Errors that depend on the sequence of the site where the error has occurred (example: GC-rich regions)

7
Q

Steps involved in pre-processing of NGS data

A

Pre-processing is used to “clean” the data.

  • Identifies erroneous reads and bp
  • Cleans data by removing errors, using for example filtering or trimming
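
The trimming idea can be sketched as follows. This assumes FASTQ-style quality strings in Phred+33 encoding and a cut-off of 20; both the encoding and the threshold are assumptions for illustration, and the function names are hypothetical.

```python
def phred33(quality_string):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(char) - 33 for char in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Remove bases from the 3' end until a base reaches min_quality."""
    scores = phred33(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# 'I' decodes to quality 40 and '#' to quality 2
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII##")
```

In this example the two low-quality bases at the end are removed, leaving `ACGT`. A read that becomes too short after trimming could then be filtered out entirely.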
8
Q

Coverage

A

The number of times a nucleotide in the reference is “covered” by reads
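
As a small illustration, coverage can be counted per reference position from the mapped read intervals. The sketch assumes reads are given as hypothetical `(start, end)` pairs with 0-based starts and exclusive ends.

```python
def per_base_coverage(reference_length, reads):
    """Count how many reads overlap each position of the reference."""
    coverage = [0] * reference_length
    for start, end in reads:
        # Clip the read to the reference boundaries before counting
        for position in range(max(start, 0), min(end, reference_length)):
            coverage[position] += 1
    return coverage


coverage = per_base_coverage(6, [(0, 3), (2, 5)])
mean_coverage = sum(coverage) / len(coverage)
```

Here `coverage` is `[1, 1, 2, 1, 1, 0]`: only position 2 is covered by both reads, and the last position is not covered at all.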

9
Q

Purpose of a variant caller of SNPs

A

Variant calling aims to identify SNPs in the sequenced genome compared to the reference, and then to distinguish between true mutations and sequencing errors. A good caller should have a high sensitivity (find all true mutations) and a high specificity (avoid false positives)
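
Sensitivity and specificity can be computed from the counts of true and false calls. The sketch below uses the standard definitions; the counts in the example are made up for illustration.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of true variants that the caller finds."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of non-variant sites that the caller correctly leaves uncalled."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical evaluation: 90 of 100 true SNPs found, 10 false calls at 1000 non-variant sites
caller_sensitivity = sensitivity(true_positives=90, false_negatives=10)
caller_specificity = specificity(true_negatives=990, false_positives=10)
```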

10
Q

GATK

A

Stands for “Genome Analysis Toolkit” and contains the UnifiedGenotyper, which is an advanced mutation caller

11
Q

Post-processing (genome sequencing)

A

After SNP variant calling, extra filtering might be needed. Example: removing calls that are supported only by bases at the ends of reads, or only by reads in one direction, since these are likely sequencing errors

12
Q

Global alignment

A

Two sequences are aligned over their full length. Can use the Needleman-Wunsch algorithm
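
A minimal sketch of the Needleman-Wunsch dynamic programme, returning only the optimal global score. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # The bottom-right cell covers both sequences over their full length
    return score[-1][-1]


global_score = needleman_wunsch_score("ACGT", "ACT")
```

Here `global_score` is 2: three matches and one gap.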

13
Q

Local alignment

A

Two sequences (often of substantially different lengths) are aligned based on their best matching subsequences. Can use the Smith-Waterman algorithm (a modification of Needleman-Wunsch)
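
A minimal sketch of the Smith-Waterman score computation. Compared with the global case, cell values are never allowed to drop below zero and the result is the best cell anywhere in the matrix. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a new local alignment start at any cell
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


local_score = smith_waterman_score("TTACGTT", "GGACGGG")
```

Here `local_score` is 6, from the shared subsequence `ACG`; the flanking mismatches do not reduce it.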

14
Q

Steps in analysis of genome sequencing data

A

Pre-processing, read mapping, quality refinement, variant calling

15
Q

Quality refinement (genome sequencing)

A

Quality refinement is the step that comes after read mapping but before variant calling in genome sequencing. The quality refinement step aims to remove errors in the data and errors introduced in the read mapping.

16
Q

Three main steps in analysis of RNA seq data

A
  1. Quantification of the gene expression
  2. Normalization
  3. Identification of differentially expressed genes
17
Q

Splice-aware mapper

A

When mapping RNA-seq reads to a genome, the mapper needs to be able to handle splicing. In other words, the mapper should be allowed to make large gaps, corresponding to introns

18
Q

Multiple matches (RNA-seq)

A

One read matches two or more different regions. Can be explained by multiple similar regions in the genome, but also by errors.

19
Q

Semiquantitative

A

The quantitative data is relative and is therefore influenced by, for example, one gene being substantially more expressed than the others
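
A small illustration of this effect, using counts per million (CPM) as one possible relative measure. The choice of CPM, the gene names and the counts are assumptions made for the example.

```python
def counts_per_million(counts):
    """Scale raw read counts so that they sum to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


# Only gene3 differs between the two hypothetical samples
sample_a = counts_per_million({"gene1": 100, "gene2": 100, "gene3": 800})
sample_b = counts_per_million({"gene1": 100, "gene2": 100, "gene3": 1800})
```

The raw count of `gene1` is identical in both samples, yet its relative value drops from 100000 to 50000 CPM because `gene3` takes up a larger share of the reads in the second sample.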

20
Q

Which are the three statistical approaches to identify DEGs?

A
  1. Methods based on normality assumptions
  2. Non-parametric methods
  3. Methods based on count distributions
21
Q

Family-wise error rate

A

The FWER is the probability of at least one false positive. For m independent tests, each performed at significance level α, it is equal to 1 - (1 - α)^m
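
The formula can be evaluated directly to see how quickly the FWER grows with the number of tests. The sketch assumes independent tests at a common level α.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false positive among m independent tests."""
    return 1 - (1 - alpha) ** m


fwer_one_test = family_wise_error_rate(0.05, 1)
fwer_twenty_tests = family_wise_error_rate(0.05, 20)
```

With α = 0.05, a single test gives a FWER of 0.05, while 20 tests already give roughly 0.64.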

22
Q

Bonferroni correction

A

Divide the significance level α by the number of performed tests m. Then use the cut-off α/m instead.

A Bonferroni adjusted p-value can be calculated by multiplying each p-value by m (capped at 1).

Bonferroni corrected p-values always control the FWER
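
A minimal sketch of the adjustment, with made-up p-values:

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
significant = [p <= 0.05 for p in adjusted]
```

With three tests, only the first p-value (0.01, adjusted to about 0.03) stays below 0.05. Comparing adjusted p-values with α is equivalent to comparing the raw p-values with α/m.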

23
Q

False discovery rate

A

FDR is the expected proportion of false positives among all rejected null hypotheses (significant tests)

24
Q

Benjamini-Hochberg correction

A

Order the p-values from smallest to largest, then multiply each p-value by the number of tests and divide by its rank in the ordering.

Benjamini-Hochberg correction controls the FDR
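
A sketch of Benjamini-Hochberg adjusted p-values. In addition to the rule on the card, it applies a running minimum from the largest p-value downwards so that adjusted values never decrease with rank, which is the usual way the adjustment is implemented. The p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) to the smallest (rank 1)
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


bh_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For these four p-values the adjusted values are approximately 0.02, 0.04, 0.04 and 0.02.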

25
Q

Difference between supervised and unsupervised data analysis methods

A

Supervised methods rely on metadata, while unsupervised methods don’t. Unsupervised methods instead focus on the identification of patterns in the data.

26
Q

Distance measure (agglomerative hierarchical clustering)

A

Describes the separation between data points

27
Q

Linkage criterion (agglomerative hierarchical clustering)

A

Measures the distance between clusters
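
The distinction between the distance measure and the linkage criterion can be sketched as below. Euclidean distance and single/complete linkage are assumed here as examples; other choices are possible.

```python
import math


def euclidean(p, q):
    """Distance measure: separation between two data points."""
    return math.dist(p, q)


def single_linkage(cluster_a, cluster_b):
    """Linkage criterion: distance between the two closest members."""
    return min(euclidean(p, q) for p in cluster_a for q in cluster_b)


def complete_linkage(cluster_a, cluster_b):
    """Linkage criterion: distance between the two most distant members."""
    return max(euclidean(p, q) for p in cluster_a for q in cluster_b)


cluster_a = [(0, 0), (0, 1)]
cluster_b = [(0, 3), (0, 5)]
closest = single_linkage(cluster_a, cluster_b)
farthest = complete_linkage(cluster_a, cluster_b)
```

Here `closest` is 2.0 and `farthest` is 5.0. In agglomerative clustering, the two clusters with the smallest linkage value are merged at each step.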

28
Q

Metagenomics

A

Metagenomics is the study of the metagenome, which is the collective genome in a microbial community

29
Q

OTU

A

Operational taxonomic unit (OTU) is a putative species formed by clustering sequences from amplicons. Sequences that are sufficiently similar are clustered together and assumed to be from the same type of organism.
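
A rough sketch of greedy OTU clustering. It assumes equal-length, pre-aligned sequences, position-wise identity and a user-chosen similarity threshold; real OTU pickers are considerably more sophisticated, and the function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_otu_clustering(sequences, threshold):
    """Assign each sequence to the first OTU whose representative is similar enough."""
    otus = []  # each OTU is a list; its first member acts as the representative
    for sequence in sequences:
        for otu in otus:
            if identity(sequence, otu[0]) >= threshold:
                otu.append(sequence)
                break
        else:
            # No existing OTU is similar enough, so start a new one
            otus.append([sequence])
    return otus


otus = greedy_otu_clustering(["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"], threshold=0.9)
singletons = [otu for otu in otus if len(otu) == 1]
```

The first two sequences differ at one of ten positions and end up in the same OTU; the third forms an OTU with a single member.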

30
Q

Steps in amplicon sequencing

A

Reads from amplicons -> Pre-processing -> OTU identification -> OTU annotation -> Statistical analysis -> Results

31
Q

Singleton

A

Sequences that don’t cluster with any other sequence. These sequences are OTUs but are, in many cases, discarded since they’re only observed once.

32
Q

Diversity (amplicon sequencing)

A

The diversity is an estimate of how many different bacteria are present in the sample. This can reflect nutrient availability and other environmental factors

33
Q

Alpha diversity

A

The diversity on the local level

34
Q

Beta diversity

A

The diversity between habitats

35
Q

Richness (alpha diversity)

A

Unique number of OTUs

36
Q

Evenness (alpha diversity)

A

Describes how evenly the abundance is distributed among the species (OTUs)
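
Richness and evenness can be computed from a vector of OTU counts. The sketch uses the Shannon index and Pielou's evenness as one common choice of indices; this choice is an assumption, as the cards do not name a specific index.

```python
import math


def richness(otu_counts):
    """Number of OTUs observed at least once."""
    return sum(1 for n in otu_counts if n > 0)


def shannon_index(otu_counts):
    """Shannon diversity H = -sum(p * ln p) over observed OTUs."""
    total = sum(otu_counts)
    return -sum((n / total) * math.log(n / total) for n in otu_counts if n > 0)


def pielou_evenness(otu_counts):
    """Shannon index divided by its maximum, ln(richness); 1 means perfectly even."""
    observed = richness(otu_counts)
    if observed <= 1:
        return 0.0  # evenness is not meaningful for fewer than two OTUs
    return shannon_index(otu_counts) / math.log(observed)


even_sample = pielou_evenness([10, 10, 10, 10])
skewed_sample = pielou_evenness([97, 1, 1, 1])
```

Both samples have a richness of 4, but the first is perfectly even while the second is dominated by a single OTU and has a much lower evenness.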

37
Q

Rarefaction

A

The diversity indices are dependent on sequencing depth. In order to make indices between samples comparable they need to be rarefied, i.e., subsampled to the same sequencing depth.
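
A minimal sketch of rarefaction by random subsampling without replacement. The OTU names, counts and the fixed seed are assumptions for the example.

```python
import random


def rarefy(otu_counts, depth, seed=0):
    """Subsample reads without replacement down to a fixed sequencing depth."""
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    rarefied = {}
    for otu in rng.sample(reads, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


sample = {"otu1": 500, "otu2": 300, "otu3": 5}
rarefied_sample = rarefy(sample, depth=100)
```

After rarefying, every sample contains the same number of reads, so the diversity indices are computed at the same depth. Rare OTUs may disappear in the subsample.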

38
Q

Direct binning (shotgun metagenomic sequencing)

A
  • Search each metagenomic fragment for the presence of genes
  • A vast number of microbial genes are not present in the databases. The search therefore requires sensitive aligners, and approximate matches are often accepted.
  • Requires relatively long reads and is generally not possible for short reads
  • “Bins” are finally formed by counting the number of reads for each type of gene
39
Q

Reference guided binning (shotgun metagenomic sequencing)

A
  • Guided binning uses an annotated reference database that contains the genomes of the microorganisms present in the sample
  • Each metagenomic fragment is mapped against the reference database
  • “Bins” are formed by counting the number of reads matching each type of gene present in the genomes
  • Typically done for data with short reads (<500bp)