Flashcards
Main difference between first and second generation sequencing
First generation (Sanger sequencing) can only sequence one fragment at a time, while second generation can perform parallel sequencing of multiple fragments
Paired-end sequencing
Both ends of the same DNA fragment are sequenced
Main difference between second and third generation sequencing
In third generation sequencing, individual DNA molecules are sequenced directly. There is no amplification step, unlike in second generation sequencing
Duplicate (sequencing error)
Duplicates are caused by sequencing the same physical DNA fragment multiple times. The reads then all come from the same DNA molecule and don’t describe the true diversity in the sample. Duplicates are often caused by biases in the amplification step
Factors contributing to errors in Illumina sequencing
- Read position: the probability of error increases with each sequenced bp
- Base composition: T has a higher error rate than the other nucleotides, and GC-rich patterns have a high error rate
- Read order: the first read has a lower error rate than the second (for paired-end sequencing)
Site specific error (SSE)
Errors that depend on the sequence of the site where the error has occurred (example: GC-rich regions)
Steps involved in pre-processing of NGS data
Pre-processing is used to “clean” the data.
- Identifies erroneous reads and bp
- Cleans data by removing errors, using for example filtering or trimming
Coverage
The number of times a nucleotide in the reference is “covered” by reads
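The definition of coverage can be sketched in a few lines of Python. This is only an illustration: the function name `per_base_coverage` and the read representation (0-based start, exclusive end) are assumptions, not part of the card.

```python
def per_base_coverage(reference_length, reads):
    """Count how many reads overlap each reference position.

    reads: list of (start, end) tuples, 0-based with exclusive end
    (an assumed convention for this sketch).
    """
    coverage = [0] * reference_length
    for start, end in reads:
        # Clip the read to the reference so out-of-range reads are ignored
        for pos in range(max(start, 0), min(end, reference_length)):
            coverage[pos] += 1
    return coverage


# Three hypothetical mapped reads on a 10 bp reference
reads = [(0, 4), (2, 6), (3, 8)]
coverage = per_base_coverage(10, reads)
print(coverage)  # [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
print(sum(coverage) / len(coverage))  # mean coverage over the reference
```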
Purpose of a variant caller of SNPs
Variant calling aims to identify SNPs in the sequenced genome compared to the reference, and then to distinguish between true mutations and sequencing errors. A good caller should have a high sensitivity (find all true mutations) and a high specificity (avoid false positives)
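As a small illustration, sensitivity and specificity can be computed from the counts of true and false calls. The function names and the counts below are hypothetical and only serve as an example.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of true mutations that the caller finds."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of non-variant sites that the caller correctly leaves uncalled."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical evaluation: 100 true SNPs and 1000 non-variant sites
print(sensitivity(true_positives=90, false_negatives=10))
print(specificity(true_negatives=990, false_positives=10))
```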
GATK
Stands for “Genome Analysis Toolkit” and contains the UnifiedGenotyper, which is an advanced mutation caller
Post-processing (genome sequencing)
After SNP variant calling, extra filtering might be needed. Example: removing candidate variants that are only seen at the ends of reads, or only in one read direction, since these are likely sequencing errors
Global alignment
Two sequences are aligned over their full length. Can use the Needleman-Wunsch algorithm
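A minimal sketch of the Needleman-Wunsch scoring step is given here. The scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration, and the function only returns the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b (dynamic programming)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty sequence costs one gap per base
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diagonal, up, left)
    # Global alignment: the score is read from the bottom-right cell
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # 2 (three matches, one gap)
```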
Local alignment
Two sequences (often of substantially different lengths) are aligned based on their best matching subsequences. Can use the Smith-Waterman algorithm (a modification of Needleman-Wunsch)
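A minimal sketch of Smith-Waterman scoring follows. Compared with a global alignment, cell scores are never allowed to drop below zero and the result is the highest score anywhere in the matrix. The scoring values (match +2, mismatch -1, gap -1) are assumptions for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between any subsequences of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # First row and column stay 0: a local alignment may start anywhere
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # The 0 lets a poor alignment be dropped and restarted
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("TTTACGTTTT", "GGACGGG"))  # 6 (the shared "ACG")
```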
Steps in analysis of genome sequencing data
Pre-processing, read mapping, quality refinement, variant calling
Quality refinement (genome sequencing)
Quality refinement is the step that comes after read mapping but before variant calling in genome sequencing. The quality refinement step aims to remove errors in the data and errors introduced in the read mapping.
Three main steps in analysis of RNA seq data
- Quantification of the gene expression
- Normalization
- Identification of differentially expressed genes
Splice-aware mapper
When mapping RNA-seq reads to a genome, the mapper needs to be able to handle splicing. In other words, the mapper should be allowed to make large gaps, corresponding to introns
Multiple matches (RNA-seq)
One read matches two or more different regions. Can be explained by multiple similar regions in the genome, but also by errors.
Semiquantitative
The quantitative data is relative and is therefore influenced by, for example, one gene being substantially more expressed than the others
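The effect can be shown with a toy example, assuming that what is observed is each gene's share of the total read count. The gene names and counts are made up for illustration.

```python
def relative_abundance(counts):
    """Convert raw counts per gene into fractions of the total."""
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


# Hypothetical samples: only geneC differs in absolute abundance
sample_a = {"geneA": 100, "geneB": 100, "geneC": 100}
sample_b = {"geneA": 100, "geneB": 100, "geneC": 1000}

print(relative_abundance(sample_a)["geneA"])
print(relative_abundance(sample_b)["geneA"])  # smaller, although geneA itself is unchanged
```

The share of geneA drops in the second sample only because geneC takes up a larger part of the total, which is why a normalization step is needed before comparing samples.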
Which are the three statistical approaches to identify DEGs?
- Methods that assume normally distributed data
- Non-parametric methods
- Methods based on count distributions
Family-wise error rate
The FWER is the probability of at least one false positive. For m independent tests, each performed at significance level α, it is equal to 1 - (1 - α)^m
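A short sketch shows how quickly the FWER grows with the number of tests. The formula assumes independent tests, and the function name is chosen for illustration.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false positive among m independent tests."""
    return 1 - (1 - alpha) ** m


for m in (1, 10, 20, 100):
    print(m, round(family_wise_error_rate(0.05, m), 4))
```

With α = 0.05, already 20 tests give a FWER of about 0.64.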
Bonferroni correction
Divide the significance level α by the number of performed tests m, then use α/m as the cut-off.
Equivalently, a Bonferroni-adjusted p-value is obtained by multiplying each p-value by m (capped at 1).
Bonferroni-corrected p-values always control the FWER
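A minimal sketch of the Bonferroni adjustment, assuming the p-values are given as a plain list. Capping at 1 is a common convention so that the adjusted values remain valid probabilities.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests m, capped at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]


p_values = [0.01, 0.04, 0.5]  # hypothetical p-values from three tests
adjusted = bonferroni_adjust(p_values)
alpha = 0.05
# Comparing adjusted p-values with alpha is the same as comparing raw ones with alpha/m
print([p < alpha for p in adjusted])  # [True, False, False]
```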
False discovery rate
FDR is the expected proportion of false positives among all rejected null hypotheses (significant tests)
Benjamini-Hochberg correction
Order the p-values from smallest to largest, then multiply each p-value by the number of tests m and divide by its position (rank) in the ordering.
Benjamini-Hochberg correction controls the FDR
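The procedure can be sketched as follows. Note one step that goes beyond the card: adjusted p-values are usually also forced to be non-decreasing in rank by taking a running minimum from the largest p-value downwards, so that a smaller raw p-value never receives a larger adjusted value. The function name is chosen for illustration.

```python
def benjamini_hochberg_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1)
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from four tests
print(benjamini_hochberg_adjust([0.01, 0.04, 0.03, 0.005]))
```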
Difference between supervised and unsupervised data analysis methods
Supervised methods rely on metadata, while unsupervised methods don’t. Unsupervised methods instead focus on the identification of patterns in the data.
Distance measure (agglomerative hierarchical clustering)
Describes the separation between data points
Linkage criterion (agglomerative hierarchical clustering)
Measures the distance between clusters
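The two concepts can be illustrated together. In this sketch the distance measure is assumed to be Euclidean, and three common linkage criteria (single, complete, average) are shown as examples; the cards themselves do not name specific choices.

```python
import math


def euclidean(p, q):
    """Distance measure: separation between two data points."""
    return math.dist(p, q)


def linkage_distance(cluster_a, cluster_b, criterion="single"):
    """Linkage criterion: distance between two clusters of points."""
    pairwise = [euclidean(p, q) for p in cluster_a for q in cluster_b]
    if criterion == "single":      # closest pair of points
        return min(pairwise)
    if criterion == "complete":    # most distant pair of points
        return max(pairwise)
    if criterion == "average":     # mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage criterion: {criterion}")


# Two hypothetical clusters in two dimensions
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (4, 0)]
for criterion in ("single", "complete", "average"):
    print(criterion, linkage_distance(cluster_a, cluster_b, criterion))
```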
Metagenomics
Metagenomics is the study of the metagenome, which is the collective genome in a microbial community
OTU
Operational taxonomic unit (OTU) is a putative species formed by clustering sequences from amplicons. Sequences that are sufficiently similar are clustered together and assumed to be from the same type of organism.
Steps in amplicon sequencing
Reads from amplicons -> Pre-processing -> OTU identification -> OTU annotation -> Statistical analysis -> Results
Singleton
Sequences that don’t cluster with any other sequence. These sequences are OTUs but are, in many cases, discarded since they’re only observed once.
Diversity (amplicon sequencing)
The diversity is an estimate of how many different bacteria are present in the sample. This can reflect nutrient availability and other environmental factors
Alpha diversity
The diversity on the local level
Beta diversity
The diversity between habitats
Richness (alpha diversity)
The number of unique OTUs
Evenness (alpha diversity)
Describes how evenly the abundances are distributed among the species
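Richness and evenness can be sketched from a list of OTU counts. The evenness index used here (Pielou's evenness, the Shannon index divided by its maximum for the observed richness) is one common choice and an assumption of this example; the card does not name a specific index.

```python
import math


def richness(otu_counts):
    """Number of OTUs observed at least once."""
    return sum(1 for n in otu_counts if n > 0)


def shannon_index(otu_counts):
    """Shannon diversity H = -sum(p * ln p) over observed OTUs."""
    total = sum(otu_counts)
    return -sum((n / total) * math.log(n / total) for n in otu_counts if n > 0)


def pielou_evenness(otu_counts):
    """Shannon index divided by ln(richness); 1 means perfectly even."""
    s = richness(otu_counts)
    if s < 2:
        return float("nan")  # undefined for fewer than two OTUs
    return shannon_index(otu_counts) / math.log(s)


# Two hypothetical samples with the same richness but different evenness
print(richness([10, 10, 10, 10]), pielou_evenness([10, 10, 10, 10]))
print(richness([97, 1, 1, 1]), pielou_evenness([97, 1, 1, 1]))
```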
Rarefaction
The diversity indices are dependent on sequencing depth. In order to make indices between samples comparable they need to be rarefied, i.e., subsampled to the same sequencing depth.
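A minimal sketch of rarefaction, assuming the sample is given as a dictionary of read counts per OTU and that reads are drawn without replacement. The function name and the fixed seed are choices made for this example.

```python
import random


def rarefy(otu_counts, depth, seed=0):
    """Subsample a sample to `depth` reads without replacement."""
    # Expand the counts into one entry per read
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)  # fixed seed for a reproducible example
    rarefied = {otu: 0 for otu in otu_counts}
    for otu in rng.sample(pool, depth):
        rarefied[otu] += 1
    return rarefied


# Hypothetical sample with 100 reads, rarefied to 40 reads
sample = {"otu1": 50, "otu2": 30, "otu3": 20}
print(rarefy(sample, depth=40))
```

In practice all samples would be rarefied to the same depth, often that of the shallowest sample, before the diversity indices are compared.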
Direct binning (shotgun metagenomic sequencing)
- Search each metagenomic fragment for the presence of genes
- A vast number of the microbial genes are not present in the databases. The search therefore requires sensitive aligners, and approximate matches are often accepted.
- Requires relatively long reads and is generally not possible for short reads
- “Bins” are finally formed by counting the number of reads for each type of gene
Reference guided binning (shotgun metagenomic sequencing)
- Guided binning uses an annotated reference database that contains the genomes of the microorganisms present in the sample
- Each metagenomic fragment is mapped against the reference database
- “Bins” are formed by counting the number of reads matching each type of gene present in the genomes
- Typically done for data with short reads (<500 bp)