Gene module Flashcards
Why should RNA-seq reads be mapped with splicing-aware read mappers?
RNA-Seq reads need splicing-aware mappers because RNA comes from spliced transcripts where introns are removed, and some reads span exon-exon junctions. Regular mappers can’t handle these split reads, but splicing-aware tools (e.g., STAR, HISAT2) can align them correctly, ensuring accurate gene expression analysis and detection of splicing events.
What are the RPKM/FPKM and DESeq2/VST techniques?
Normalization techniques for bulk RNA-seq
RPKM/FPKM (Reads/Fragments Per Kilobase of transcript per Million mapped reads):
- Normalizes for gene length and sequencing depth
- RPKM is used for single-end reads, FPKM for paired-end reads
TPM (Transcripts per million):
- Normalizes for gene length first, then sequencing depth
- Values sum to one million in every sample, which makes relative expression levels easier to compare across genes and samples
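The RPKM and TPM definitions above can be sketched for a single sample as follows. This is a minimal illustration, not a replacement for an established quantification tool; the function names and the toy counts and lengths are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """RPKM for one sample: reads per kilobase of transcript per million mapped reads."""
    total_millions = sum(counts) / 1e6  # sequencing depth in millions of reads
    # Divide each count by the transcript length in kb, then by the depth.
    return [c / (length / 1e3) / total_millions
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM for one sample: normalize for length first, then rescale to one million."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


# Hypothetical toy sample: three genes with raw read counts and transcript lengths (bp).
counts = [10, 20, 70]
lengths_bp = [1000, 4000, 2000]
print("RPKM:", rpkm(counts, lengths_bp))
print("TPM: ", tpm(counts, lengths_bp))
```

The TPM values always add up to one million within a sample, whereas the sum of RPKM values differs from sample to sample.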
DESeq2/VST (Variance Stabilizing Transformation):
- DESeq2 normalizes count data and performs differential gene expression analysis using a negative binomial model
- VST is a technique within DESeq2 that stabilizes the variance across the range of mean expression, making the data more suitable for visualization and clustering
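DESeq2's count normalization is based on per-sample size factors computed with the median-of-ratios idea. The sketch below illustrates that idea only; it is not DESeq2's actual implementation, and the function name and toy counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of genes, each a list of raw counts (one value per sample)."""
    n_samples = len(counts[0])
    # Keep only genes with a non-zero count in every sample,
    # so that the geometric mean (computed on the log scale) is defined.
    usable = [gene for gene in counts if all(c > 0 for c in gene)]
    log_geo_means = [sum(math.log(c) for c in gene) / n_samples for gene in usable]
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count in sample j to its geometric mean across samples.
        log_ratios = [math.log(gene[j]) - lgm
                      for gene, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical toy data: sample 2 was sequenced twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(gene, factors)] for gene in counts]
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale before differential expression testing.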
What are the key metrics for QC (in bulk RNA-seq analysis)?
Read Quality: A measure of the accuracy and reliability of sequencing reads, often represented as a Phred score indicating the probability of an error in each base call.
Adapter Content: The presence of adapter sequences (used in library preparation) within the sequencing reads, which can interfere with downstream analysis if not removed.
Sequence Length Distribution: A summary of the lengths of the sequencing reads, used to check for consistency and identify potential trimming or sequencing issues.
GC Content: The proportion of guanine (G) and cytosine (C) bases in the sequences, often analyzed for biases that may affect sequencing coverage or downstream analysis.
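Two of the metrics above reduce to simple formulas: a Phred score Q corresponds to an error probability of 10^(-Q/10), and GC content is the fraction of G and C bases. The sketch below assumes the common Phred+33 FASTQ quality encoding; the function names and example strings are hypothetical.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality_string]


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical read and quality string.
scores = decode_quality("II5!")
print(scores, [phred_to_error_prob(q) for q in scores])
print(gc_content("ATGC"))
```

A Phred score of 30, for example, means a 1 in 1,000 chance that the base call is wrong.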
Behavioral module
What is the purpose and general idea of a Linear Mixed-Effects Model (LME)?
Purpose: Account for fixed and random effects
* Fixed effects: consistent and systematic across all observations (e.g. treatment or condition)
* Random effects: sources of variation that differ between groups (e.g. batch effects, individual variability)
* An LME makes it possible to control for confounding variables (random effects) while estimating the impact of the variables of interest (fixed effects)
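As an illustration, the simplest case (one fixed-effect predictor and a random intercept per group) could be written as below. The symbols are assumptions chosen for this sketch, not notation from the course material.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{ij}$ is observation $i$ in group $j$ (e.g. a batch or an individual), $\beta_1$ is the fixed effect of the treatment $x_{ij}$, $u_j$ is the random intercept of group $j$, and $\varepsilon_{ij}$ is the residual error.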
What is the issue with testing many genes and how can this be mitigated?
- With thousands of genes, a massive number of statistical tests is performed
- Some genes will be detected as differential purely by chance
- Correction methods mitigate the risk of false positives, but increase the likelihood of false negatives (missing truly differentially expressed genes)
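A small worked example shows the scale of the problem. The numbers are hypothetical: assume $m_0 = 20{,}000$ genes are tested, none of them is truly differentially expressed, and each test uses a significance threshold of $\alpha = 0.05$.

```latex
\mathbb{E}[\text{false positives}] = m_0 \cdot \alpha = 20{,}000 \times 0.05 = 1{,}000
```

Without correction, about a thousand genes would be called differential purely by chance.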
Multiple test correction
* Differential expression: many tests are performed
* Need to take this into account, e.g. using Benjamini–Hochberg (BH) multiple testing correction
* BH adjusts each p-value based on its rank and the number of tests
* It controls the False Discovery Rate (FDR): among all genes called significantly differentially expressed, the proportion that in reality come from the null model (i.e. are not differentially expressed)
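The BH adjustment can be sketched in a few lines. This is a minimal illustration with hypothetical p-values; for real analyses, an established statistics package is the safer choice.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1):
    # scale each p-value by m / rank and enforce monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pvalues)
significant = [p_adj <= 0.05 for p_adj in adjusted]
print(adjusted, significant)
```

Genes whose adjusted p-value falls below the chosen FDR threshold (0.05 in this sketch) are called significant.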
Applications of PCA in RNA-seq
- Visualizing relationships between samples
- Detecting outliers (problematic samples)
- Identifying patterns (e.g. influence of treatment)
What are average linkage and complete linkage used for, and what is the difference between them?
Both are methods used in hierarchical clustering to decide which clusters are merged, by measuring the distance between groups of data points.
Complete linkage uses maximal intercluster dissimilarity.
The largest of the pairwise dissimilarities is used.
Average linkage uses mean intercluster dissimilarity.
The average of the pairwise dissimilarities is used
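The difference between the two linkage rules can be seen on a pair of small clusters. This is a minimal sketch using Euclidean distance as the dissimilarity; the function names and the points are hypothetical.

```python
import math
from itertools import product


def pairwise_dissimilarities(cluster_a, cluster_b):
    """All dissimilarities between one point from each cluster (Euclidean distance)."""
    return [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]


def complete_linkage(cluster_a, cluster_b):
    """Maximal intercluster dissimilarity."""
    return max(pairwise_dissimilarities(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean intercluster dissimilarity."""
    d = pairwise_dissimilarities(cluster_a, cluster_b)
    return sum(d) / len(d)


# Hypothetical clusters of 2-D points lying on the x-axis.
cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (6, 0)]
print(complete_linkage(cluster_a, cluster_b))  # pairwise distances are 4, 6, 3, 5 -> 6.0
print(average_linkage(cluster_a, cluster_b))   # mean of 4, 6, 3, 5 -> 4.5
```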
What is Enrichment Analysis for and what are the steps?
Enrichment analysis comprises statistical techniques used to identify whether specific biological categories (e.g., pathways, gene sets, or functional annotations) are overrepresented or “enriched” in a given list of genes, compared to what would be expected by chance.
Steps:
1. Input Gene List: A set of genes of interest (e.g., differentially expressed genes, genes from a specific cluster, or genes with mutations).
2. Reference Background: A larger set of genes representing the entire genome, transcriptome, or experimental dataset.
3. Gene Annotations: Categories or functional terms, often from curated databases such as:
* Gene Ontology (GO) terms (e.g., biological processes, cellular components, molecular functions).
* Pathway databases
* Disease databases
4. Statistical Testing: Compares the overlap between the input gene list and annotated gene sets to assess overrepresentation. Methods include:
* Fisher's Exact Test or Hypergeometric Test: Determines whether the overlap is statistically significant.
5. Multiple Testing Correction
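The statistical testing step can be sketched with a hypergeometric tail probability: the chance of seeing at least the observed overlap if the gene list were drawn at random from the background. The function name, parameter names, and numbers are hypothetical.

```python
from math import comb


def enrichment_pvalue(overlap, list_size, annotated, background):
    """P(X >= overlap) under the hypergeometric distribution.

    overlap    -- genes in the input list that carry the annotation
    list_size  -- size of the input gene list
    annotated  -- genes in the background that carry the annotation
    background -- size of the reference background
    """
    max_overlap = min(list_size, annotated)
    total = comb(background, list_size)  # all ways to draw the gene list
    tail = sum(
        comb(annotated, k) * comb(background - annotated, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 10 background genes, 4 of them in the pathway,
# and 2 of the 3 genes in the input list belong to the pathway.
print(enrichment_pvalue(overlap=2, list_size=3, annotated=4, background=10))
```

In practice one such p-value is computed per gene set, and the resulting p-values are then corrected for multiple testing (step 5).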
What procedure is commonly used to control the FDR?
Benjamini-Hochberg (BH)
What are the benefits of single-cell approaches compared to bulk RNA-seq?
- Bulk RNA-seq measures average gene expression across many cells, which masks cell-to-cell variability
- Single-cell approaches capture heterogeneity
- They profile gene expression at the single-cell level
- They give insights into (rare) cell types and cell states
- They can resolve dynamic processes
What are the applications and the workflow of Single-Cell RNA Sequencing (scRNA-seq) preprocessing?
- Identifying rare cell types
* In bulk RNA-seq these would not be picked up
- Understanding differentiation
* Define “cell trajectories”
- Disease progression
Workflow scRNA-seq
1. Cell dissociation and isolation (e.g., FACS, microfluidics)
2. Cell barcoding and amplification
* Amplification using PCR
* Barcoding is needed to distinguish the individual cells during data analysis: a short nucleotide sequence is added to the mRNA
* All the molecules from a single cell will have the same barcode
3. After barcoding and amplification, the cells are pooled into one sequencing library
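Because all molecules from one cell share a barcode, reads from the pooled library can be assigned back to their cell of origin during analysis. The sketch below shows only that grouping idea; real pipelines do considerably more (for example barcode error correction), and the barcodes, gene names, and function name here are hypothetical.

```python
from collections import Counter, defaultdict


def count_by_cell(reads):
    """Group reads by cell barcode and count reads per gene in each cell.

    reads -- iterable of (cell_barcode, gene) pairs from a pooled library
    """
    cells = defaultdict(Counter)
    for barcode, gene in reads:
        cells[barcode][gene] += 1
    return cells


# Hypothetical pooled reads: each read carries the barcode of its cell of origin.
reads = [
    ("AAAC", "GeneA"),
    ("TTTG", "GeneA"),
    ("AAAC", "GeneB"),
    ("AAAC", "GeneA"),
]
for barcode, gene_counts in count_by_cell(reads).items():
    print(barcode, dict(gene_counts))
```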
Challenges in scRNA-seq
- Single cell vs. single nucleus:
* Some cells are harder to capture during dissociation
* Nuclei are more resistant to force: This makes it easier to isolate nuclei than whole cells in some cases.
* Nuclei reflect transcriptional patterns: Transcription in the nucleus can approximate gene expression but may lack full context.
- Dropouts:
* A phenomenon where a gene is expressed in one cell but not detected in another cell of the same type, due to low expression levels or technical issues.
* Can complicate interpretation.
- Batch Effects:
* Variations caused by technical differences between experiments (e.g., processing on different days or in different labs).
* These differences may overshadow true biological variation, requiring normalization to remove non-biological effects.
What is spatial transcriptomics and what are its applications?
- Maps gene expression to tissue locations, preserving spatial context
(Techniques: Slide-seq, Visium, MERFISH, Stereo-seq)
Applications:
* Reveals spatial organization of tissues, e.g. mapping gene expression to brain anatomy
* Understanding cell-type diversity
* Interactions and cell-cell communication
What does single-cell ATAC-seq do and what insights can be gained from it?
Single-Cell ATAC-seq: Profiling Chromatin Accessibility
Purpose: Profiles chromatin accessibility at single-cell resolution to identify active regulatory regions (e.g., enhancers and promoters).
Insight: Reveals which regions of the genome are open and potentially regulating gene expression in specific cell types.
What does Single-Cell DNA Methylation Sequencing do and what insights can be gained from it?
Purpose: Profiles DNA methylation (an epigenetic modification) at single-cell resolution, using bisulfite sequencing.
Insight: Reveals cell-to-cell variation in methylation, helping to understand stable epigenetic regulation and its role in cell identity and development.