Final Exam Flashcards
What can we do with scRNA-seq data?
- Explore which cell types are present in a tissue
- Identify unknown/rare cell types or states
- Elucidate the changes in gene expression during differentiation processes or across time or states
- Identify genes that are differentially expressed in a particular cell type between conditions (e.g. treatments or disease)
- Explore changes in expression within a cell type while incorporating spatial, regulatory, and/or protein information
- Analyze the cell velocity to uncover processes’ direction and activity
- Identify cell-level mutations and study their expression
- Uncover molecular relationships and regulatory links
What tools can be used to do gene counts normalization and scaling when exploring sample heterogeneity?
Seurat and Cellenics
What tools can be used to perform dimensionality reduction when exploring sample heterogeneity?
PCA, UMAP, t-SNE
What tools can be used to perform cell clustering when exploring sample heterogeneity?
Seurat and Cellenics
What tools can be used to identify known cell types when exploring sample heterogeneity?
SingleR, scType
What tools would you use to identify unknown/rare cell types?
Seurat, Cellenics, SingleR, scType
What is Pseudotime?
a latent (unobserved) dimension which measures the cells’ progress through the transition
what does it mean to estimate pseudotime?
de-confound single cell time series and order the cells by pseudotime
what tools would you use to elucidate changes in gene expression during time or across states?
Slingshot, Monocle, PAGA (Partition-based graph abstraction)
what can you do with pseudo-time inference?
- analyze cell similarity and diversity
- trace differentiation processes
- clonal evolution
- cell state transitions of a specific cell type or between different cell types (from cell of origin to development A or B)
what makes scRNA-seq different than bulk RNA-seq?
cell level precision
what tools would you use for a differential gene expression analysis (DGE)?
Seurat, Cellenics, DESeq2
what do you measure to determine cell velocity?
spliced vs. unspliced transcripts
what tools could you use to measure cell velocity?
scVelo, velocyto
what tools could you use for multimodal analysis?
Seurat
what can you integrate multimodal analysis with?
- spatial data (sequence or image based)
- sc-ATAC-seq data
- cell surface protein and T-cell receptor (TCR)/immunoglobulin clonotyping (IG)
a cell is found to have more spliced transcripts than unspliced transcripts. is the expression increasing or decreasing?
decreasing
in cell velocity, what is the curve called when the unspliced counts are increasing?
induction
in cell velocity, what is the curve called when the unspliced counts are decreasing?
repression
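worked example for the velocity cards above: a minimal sketch, assuming the simple steady-state idea that velocity is the unspliced count minus a degradation ratio times the spliced count. The function names, the `gamma` value, and the counts are hypothetical; this is not the scVelo or velocyto API.

```python
def velocity(unspliced: float, spliced: float, gamma: float = 1.0) -> float:
    """Positive when unspliced exceeds the steady-state expectation gamma * spliced."""
    return unspliced - gamma * spliced


def phase(unspliced: float, spliced: float, gamma: float = 1.0) -> str:
    """Name the phase implied by the sign of the velocity."""
    v = velocity(unspliced, spliced, gamma)
    if v > 0:
        return "induction"      # excess unspliced: expression is increasing
    if v < 0:
        return "repression"     # excess spliced: expression is decreasing
    return "steady state"


print(phase(unspliced=8, spliced=3))
print(phase(unspliced=2, spliced=9))
```

in practice gamma is estimated per gene from the data rather than fixed at 1.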
what tools could you use to study cell level mutations?
scExecute + variant caller, SCReadCounts
T/F averaged expression is equal to within cell molecular relationships
false
An inverse correlation between target and suppressor genes can indicate
potential regulation
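worked example: a minimal sketch of checking for an inverse relationship by computing the Pearson correlation of two genes across individual cells (per-cell values, not averages). The expression values and variable names are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical normalized expression of a suppressor and its target in 5 cells
suppressor = [0.0, 1.0, 2.0, 3.0, 4.0]
target = [4.0, 3.1, 1.9, 1.2, 0.1]

r = pearson(suppressor, target)
print(round(r, 3))  # strongly negative: consistent with potential regulation
```

a negative correlation only suggests regulation; it does not prove it.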
what are the benefits associated with the technological advances in scRNA-seq?
- number of analyzed cells increased
- cost exponentially reduced
- number of published papers increased
- technology evolved using more sophisticated, accurate, high throughput analyses
how do you isolate single cells for scRNA-seq?
- limiting dilution (plate based)
- micromanipulation
- laser capture microdissection (LCM)
- fluorescence-activated cell sorting (FACS)
- Circulating Tumor Cells (CTC)
- microfluidics-based scRNA-seq
- droplet-based scRNA-seq
what are the cons of plate based single cell isolation
low throughput and efficiency (historical significance)
what are the pros and cons of fluorescence-activated cell sorting
- targeted cell isolation
- high-precision sorting
- multiparameter sorting
- cell viability
- limited by marker availability
- throughput and time efficiency lower than other methods
what are the pros and cons of circulating tumor cells as a method of single cell isolation?
- utilization of antibodies to specifically target and capture CTCs from peripheral blood
- rarity of CTCs in the bloodstream
- potential for bias in antibody-based capture
- sensitivity and specificity of the chosen antibodies
- throughput and time-efficiency lower than other methods
what are the pros and cons of microfluidics-based scRNA-seq
- precise manipulation of cells and fluids at a microscopic scale
- ability to integrate multiple steps into a single microfluidic chip, reducing sample loss and technical variability
- lower throughput
- complexity and cost of the microfluidic chips
- low efficiency for small or fragile cells
describe the process of microfluidics-based scRNA-seq
capturing and processing individual cells in microfluidic channels or chambers; the controlled environment benefits the study of specific cell types or low-abundance transcripts
describe the process of droplet-based scRNA-seq
encapsulating individual cells in oil droplets, each containing a unique barcode. designed to process a high number of cells in a single run
what are the pros and cons of droplet-based scRNA-seq?
- Scalability and parallel processing
- Reduced cost and time per cell
- Large scale and high throughput by barcoded beads in droplets, which tag the mRNA of individual cells
- Difficulty in capturing large/irregularly shaped cells
- Potential for capturing multiple cells in a droplet
- Many cells = lower depth of sequencing per cell
what company developed the most widely used commercial droplet-based scRNA-seq platform (Chromium)?
10x Genomics
what are the features of the Drop-seq platform?
- uses droplets for single-cell isolation
- no ERCC spike-ins
- 8 bp UMI
- no full length coverage
- PCR amplification
- not usable for bulk
- paired-end sequencing
what are the features of the SmartSeq2 platform?
- uses FACS for single cell isolation
- ERCC spike-ins
- no UMI
- full length coverage
- PCR amplification
- usable for bulk
- single-end sequencing
what are UMIs and why can they be helpful?
unique molecular identifiers: short nucleotide sequences added to RNA molecules before amplification to tag each original RNA molecule uniquely, which distinguishes true RNA molecules from PCR duplicates. This significantly improves the quantitative accuracy of scRNA-seq
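worked example: a minimal sketch of how UMIs separate true molecules from PCR duplicates. Reads that share the same cell barcode, gene, and UMI are counted once. The barcodes, UMIs, and function name are hypothetical, and the sketch ignores UMI sequencing errors.

```python
from collections import defaultdict

# Hypothetical aligned reads as (cell barcode, gene, UMI)
reads = [
    ("AAAC", "GAPDH", "TTGCA"),
    ("AAAC", "GAPDH", "TTGCA"),  # PCR duplicate of the read above
    ("AAAC", "GAPDH", "CCGTA"),  # second original molecule
    ("AAAC", "ACTB", "TTGCA"),   # same UMI, different gene: separate molecule
    ("GGTT", "GAPDH", "TTGCA"),  # same UMI, different cell: separate molecule
]


def count_molecules(reads):
    """Count distinct UMIs per (cell barcode, gene) pair."""
    umis = defaultdict(set)
    for barcode, gene, umi in reads:
        umis[(barcode, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}


counts = count_molecules(reads)
print(counts[("AAAC", "GAPDH")])  # 3 reads collapse to 2 molecules
```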
what is the aim of full length transcript sequencing and why can it be useful?
to sequence the entire RNA molecule from the 5’ to the 3’ end. provides comprehensive info about transcript isoforms, alternative splicing events, and other post-transcriptional modifications
why are UMIs and full length transcript sequencing incompatible?
full length sequencing requires reading the entire RNA transcript, so if a UMI is added only to one end, it becomes ineffective or gets lost in the process of sequencing the full length transcript
in what type of methods are UMIs particularly useful?
counting gene expression (i.e. counting transcripts)
which platform is the majority of existing scRNA-seq data generated on?
10X genomics
t/f smart-seq2 is a plate based method
false
t/f the Smart-seq2 analytical pipeline of the scRNA-seq data for each cell is analogous to bulk RNA-seq
true
in what kinds of cells do you expect higher than average mitochondrial content?
- myocytes
- brown adipocytes
- neurons
- sperm cells
- oocytes
- hepatocytes
- endocrine cells
what can high mitochondrial gene expression indicate?
cell stress
apoptosis
low RNA integrity
low-quality RNA extraction
technical errors in library prep
what can high ribosomal gene expression indicate?
RNA integrity
cell viability
technical artifacts (such as cell doublets)
batch effects
cell heterogeneity
are mitochondrial or ribosomal genes used more often as a QC metric for scRNA-seq
mitochondrial
t/f the use of UMIs in 10X genomics provides lower quantitation accuracy in ribosomal gene expression detection
true
t/f 10X genomics detects more genes than Smart-seq2
false
t/f 10X genomics identifies more cell clusters/types than Smartseq2
true
t/f 10X genomics has a higher dropout ratio than Smart-seq2
true
what are the pros and cons of 10X Genomics Visium?
- integrates well with existing 10x genomics workflow
- offers a relatively large capture area, which is beneficial for analyzing tissue sections
- provides high quality data with robust technical support
- limited to predefined capture areas, which may not suit all experimental designs
- the cost can be relatively high
- the capture areas are not cell-resolution
what are the different platforms for single cell spatial transcriptomics
- 10X Genomics Visium
- StereoSeq
- Nanostring GeoMx digital spatial profiler
- Slide-seq
- Seq-scope
- MERFISH
what are the pros and cons of StereoSeq?
- high spatial resolution
- comprehensive coverage
- flexibility in targeting (can target a wide variety of RNA species)
- compatibility with standard histological samples
- complexity and cost
- requires robust bioinformatics support
- instrumentation requirements
what are the pros and cons of the NanoString GeoMx digital spatial profiler?
- high-plex analysis, enabling simultaneous assessment of numerous targets
- flexible in terms of target selection (RNA and protein)
- compatible with standard FFPE samples
- lower spatial resolution compared to other platforms
- dependency on predefined probes (limited novel transcript discoveries)
what are the pros and cons of SLIDE-SEQ?
- high spatial resolution
- allows for discovery of novel spatial biomarkers
- technically challenging and requires special equipment
- lower throughput, limits parallel sample processing
what are the pros and cons of SEQ-SCOPE
- exceptionally high spatial resolution
- still in developmental stages, potentially high cost and technical complexity
what are the pros and cons of MERFISH
- extremely high-plex capacity
- high spatial resolution
- requires specialized and expensive equipment
- complex data analysis pipeline
what are the common limitations of single cell spatial transcriptomics?
trade off between spatial resolution and throughput
what are the types of single cell DNA-seq?
- single cell Whole Genome Sequencing (scWGS)
- single cell Copy Number Variation (CNV) profiling
- single cell Whole Exome Sequencing (scWES) and single cell targeted DNA sequencing
what are the challenges associated with scDNA-seq
- higher technical noise compared to scRNA-seq
- need for high sequencing depth to detect rare mutations
- the potential for DNA amplification biases
- cost-efficiency
- currently rare, not a lot of data for reference/comparison
what does scATAC-seq stand for and what is it used for?
single cell Assay for Transposase-Accessible Chromatin using sequencing; surveys the physical structure of the genome by identifying regions of open chromatin
what is the goal of single cell immune profiling and what is measured?
comprehensive characterization of immune cells
- gene expression
- surface proteins
- cytokines
- functional states
what does the 10X genomics single cell immune profiling solution provide?
- 5’ transcriptome gene expression
- T and B cell repertoire
- antigen specificity
what is CITE-seq and what are its uses?
Cellular Indexing of Transcriptomes and Epitopes by sequencing; determines the interaction between different immune cell groups and identifies novel distinct immune cell subsets in health and disease
what is a common limitation across all platforms?
the trade off between spatial resolution and throughput
why is it difficult to pick a superior platform for spatial transcriptomics?
depends on research question, tissue type, and available resources
what are the biggest challenges of sc-DNA seq?
- high technical noise
- high cost
- potential for DNA amplification bias
what can sc-DNA be used to study?
- tumor heterogeneity (in terms of mutations)
- hematology
- gene editing
what other method is ATAC-seq a proxy for?
scRNA-seq
t/f ATAC-seq cannot be used to identify cell types
false
t/f scRNA-seq data is not zero-inflated relative to the sequencing depth
true
what are some confounding factors of scRNA-seq
- large volume of data
- low depth of sequencing per cell
- biological variability across cells/samples
- technical variability across cells/samples
what is the danger of scRNA-seq having a low depth of sequencing per cell?
a zero count can either mean the gene is not expressed or that the transcript was not detected (false negative)
what are uninteresting sources of biological variation in scRNA-seq (unless the study specifically is testing the variation)?
- transcriptional bursting
- varying rates of RNA processing
- continuous or discrete cell identities
- environmental stimuli
- temporal changes
what are sources of technical variation in scRNA-seq?
- cell-specific capture efficiency
- library quality
- amplification bias (drop out)
- batch effects
- dilution factor
what are factors that contribute to batch effects?
- RNA isolation not performed on the same day
- library prep not performed on the same day
- different people performing RNA isolation/library prep across samples
- not using same reagents for all samples
- RNA isolation/library prep not performed at same location
how can you combat batch effects?
- split replicates of different sample groups across batches
- include batch info in experimental metadata
what method could you use to remove doublets from your data?
DoubletDecon
what kinds of quality filtering are performed on scRNA-seq data
- filter out cells based on mitochondrial reads (%)
- filter out cells with too few or too many reads
- filter out cells based on n features (genes) (too few or too many)
- filter out genes based on expression across the cells
- integrate and remove batch effects
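worked example: a minimal sketch of the cell-level filters in the list above (mitochondrial %, total counts, number of detected genes). All thresholds and counts are hypothetical and should be chosen per dataset; the `MT-` prefix assumes human gene symbols.

```python
def qc_metrics(counts, mito_prefix="MT-"):
    """Return (total counts, detected genes, percent mitochondrial) for one cell."""
    total = sum(counts.values())
    n_features = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    pct_mito = 100.0 * mito / total if total else 0.0
    return total, n_features, pct_mito


def passes_qc(counts, min_counts=10, max_counts=1000,
              min_features=3, max_features=500, max_pct_mito=20.0):
    """Keep a cell only if every metric falls inside the (hypothetical) limits."""
    total, n_features, pct_mito = qc_metrics(counts)
    return (min_counts <= total <= max_counts
            and min_features <= n_features <= max_features
            and pct_mito <= max_pct_mito)


# Hypothetical per-cell gene counts
cells = {
    "cell_ok": {"GAPDH": 10, "ACTB": 5, "MT-CO1": 1, "CD3E": 4},
    "cell_stressed": {"GAPDH": 2, "ACTB": 1, "MT-CO1": 6, "MT-ND1": 3},
    "cell_empty": {"GAPDH": 1, "ACTB": 0, "MT-CO1": 0, "CD3E": 0},
}

kept = [name for name, counts in cells.items() if passes_qc(counts)]
print(kept)
```

the upper limits on counts and features are there to catch likely doublets; the lower limits catch empty droplets or damaged cells.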
what is the difference between CITE-seq and immune cell profiling?
CITE-seq integrates scRNA-seq with simultaneous protein-level data, enabling characterization of both transcriptomes and cell surface protein markers from single cells. Immune cell profiling, by contrast, may include 5’-end transcript sequences, offering insights into transcriptional initiation patterns specific to immune cells without direct protein-level measurements
how does paired end sequencing improve mapping? what is it particularly useful for
because we know the approximate distance between the two reads. this is especially helpful with indels
what is splice-aware alignment?
find the genomic coordinates of the sequencing reads considering that RNA undergoes splicing (prioritize mapping in non-intronic regions)
what do you need for a splice-aware alignment?
- data (raw sequencing reads)
- high performance computing platform
- software
- reference genome sequence in FASTA format
- exonic/intronic genome coordinates or gene annotation file
what is a reference genome?
a digital nucleic acid sequence database assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species (it does not accurately represent the set of genes of any single individual organism)
what is a gene annotation file?
a description of where genetic elements (intron, exon, transcript, gene) are located in the genome, in the form of begin and end coordinates
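worked example: a minimal sketch of reading one record from a gene annotation file, assuming the common GTF layout of nine tab-separated columns with 1-based, inclusive start and end coordinates. The record itself (coordinates, IDs, source name) is made up for illustration.

```python
def parse_gtf_line(line):
    """Split one GTF record into its coordinates and attribute dictionary."""
    fields = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    attrs = {}
    for item in attributes.split(";"):
        item = item.strip()
        if item:
            key, value = item.split(" ", 1)
            attrs[key] = value.strip('"')
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attrs,
    }


# Hypothetical exon record
example = 'chr1\texample\texon\t100\t250\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";'
record = parse_gtf_line(example)
exon_length = record["end"] - record["start"] + 1  # coordinates are inclusive
print(record["feature"], record["start"], record["end"], exon_length)
```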
can you align RNA without a gene annotation file?
yes- RNA alignments that do not use gene annotation exist (some are called de novo aligners)
can you align RNA without a reference genome?
yes- RNA alignments can use a target transcriptome as a multi-FASTA file
what software can perform splice-aware alignment?
STAR, HISAT2, BBMap
what software can perform DNA (non-splice-aware) alignment?
BWA, Bowtie2
what are the two steps of STAR alignment?
- seed searching
- clustering, stitching, and scoring
what essential features are involved in scRNA-seq preprocessing compared to bulk-RNA-seq? how are these features achieved?
- cell calling
- removing PCR duplicates
- assigning reads to individual genes and cells
achieved through barcode and UMI sequences
what are the outputs of scRNA-seq alignment for SmartSeq2 and 10X?
SmartSeq2- each cell has its own .bam
10X- 1 combined .bam and barcodes.tsv, features.tsv, matrix.mtx
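worked example: a minimal sketch of how the three 10X matrix files fit together, assuming the usual layout where matrix.mtx is a Matrix Market coordinate file with 1-based indices, rows are features, and columns are barcodes. The file contents are tiny made-up strings so the example is self-contained; real files are usually gzipped and read with an analysis library.

```python
def read_10x_text(barcodes_text, features_text, matrix_text):
    """Return {(barcode, gene name): count} from the text of the three files."""
    barcodes = barcodes_text.split()
    # features.tsv columns: feature id, feature name, feature type
    genes = [line.split("\t")[1] for line in features_text.strip().splitlines()]
    # Lines starting with '%' are the Matrix Market header and comments
    body = [ln for ln in matrix_text.strip().splitlines() if not ln.startswith("%")]
    n_features, n_barcodes, n_entries = (int(x) for x in body[0].split())
    counts = {}
    for line in body[1:]:
        row, col, value = line.split()
        # Row index selects the gene, column index selects the cell barcode
        counts[(barcodes[int(col) - 1], genes[int(row) - 1])] = int(value)
    return counts


barcodes_text = "AAACCTGA-1\nAAACGGGT-1\n"
features_text = "ENSG_A\tGENE_A\tGene Expression\nENSG_B\tGENE_B\tGene Expression\n"
matrix_text = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "2 2 3\n"
    "1 1 5\n"
    "2 1 1\n"
    "1 2 7\n"
)

counts = read_10x_text(barcodes_text, features_text, matrix_text)
print(counts)
```

only non-zero entries are stored, which keeps the mostly-zero count matrix small.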
what is the scRNA-seq workflow?
- Process data (on a server/cloud) and obtain GE per cell values (small size manageable outputs)
- Filter out genes
- Filter cells
- Normalize expression values
- Identify highly variable genes
- Scale data, regress out unwanted variation
- Reduce dimensions
- Determine significant principal components
- Use the PCs to cluster cells with graph-based clustering
- Visualize clusters with non-linear dimensional reduction (tSNE or UMAP)
- Detect and visualize marker genes for the clusters
- Classify the cells by cell type
t/f you should filter out genes that are not expressed in any cells or are expressed in only a few of them
t- removing them makes the data smaller and computations faster
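worked example: a minimal sketch of the gene filter, keeping only genes detected in at least `min_cells` cells. The threshold of 3, the gene names, and the counts are hypothetical.

```python
def filter_genes(matrix, min_cells=3):
    """Keep genes with a non-zero count in at least min_cells cells."""
    kept = {}
    for gene, counts in matrix.items():
        n_expressing = sum(1 for c in counts if c > 0)
        if n_expressing >= min_cells:
            kept[gene] = counts
    return kept


# Hypothetical counts: one list per gene, one entry per cell
matrix = {
    "GENE_A": [3, 0, 2, 5, 1],  # detected in 4 cells: kept
    "GENE_B": [0, 0, 0, 0, 0],  # never detected: removed
    "GENE_C": [0, 4, 0, 0, 0],  # detected in 1 cell: removed
}

filtered = filter_genes(matrix, min_cells=3)
print(list(filtered))
```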
what factors contribute to the noise of single cell gene expression?
- low mRNA content in a cell
- variable mRNA capture
- variable sequencing depth
how would you perform global scale normalization?
- divide gene’s UMI count in a cell by the total number of UMIs in that cell
- multiply the ratio by a scale factor (10,000 by default)
- transform the results by taking natural log
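worked example: a minimal sketch of the three steps above for one cell. It assumes the log step is ln(1 + x) so that zero counts stay defined and map to zero; the function name and counts are hypothetical.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Divide by the cell total, multiply by the scale factor, take ln(1 + x)."""
    total = sum(cell_counts.values())
    return {
        gene: math.log1p(count / total * scale_factor)
        for gene, count in cell_counts.items()
    }


# Hypothetical UMI counts for one cell
cell = {"GENE_A": 1, "GENE_B": 3, "GENE_C": 0}
normalized = log_normalize(cell)

# GENE_A: 1 / 4 * 10000 = 2500, then ln(1 + 2500)
print(normalized)
```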
in what case does global scale normalization not work well and what can be done instead?
highly expressed genes; use SCTransform instead
what are the steps in SCTransform?
- modeling of gene expression data
- normalization and variance stabilization
- feature selection
- mitigation of batch effects
- scalability
does scRNA-seq data have a weak or strong mean-variance relationship?
strong, low expressing genes have higher variance
how would you perform Variance Stabilizing Transformation (VST)?
- compute the mean and variance of each gene using the unnormalized UMI counts
- take log10 of mean and variance
- fit curve to predict the variance of each gene as a function of its mean expression
- standardize count
- for each gene, compute the variance of the standardized values across all cells
- rank the genes based on standardized variance and use the top 2000 for PCA and clustering
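worked example: a simplified sketch of the steps above. As a stated substitution, a straight line fitted by least squares in log10-log10 space stands in for the fitted curve, and no clipping of standardized values is applied. Function names and counts are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def standardized_variance(counts):
    """counts: gene -> raw UMI counts per cell. Returns gene -> standardized variance."""
    stats = {}
    for gene, values in counts.items():
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / (n - 1)
        if var > 0:  # log10 needs positive values
            stats[gene] = (mean, var)
    slope, intercept = fit_line(
        [math.log10(m) for m, _ in stats.values()],
        [math.log10(v) for _, v in stats.values()],
    )
    result = {}
    for gene, (mean, _) in stats.items():
        expected_sd = math.sqrt(10 ** (intercept + slope * math.log10(mean)))
        z = [(v - mean) / expected_sd for v in counts[gene]]
        # z has mean zero, so this is the variance of the standardized values
        result[gene] = sum(x * x for x in z) / (len(z) - 1)
    return result


def top_variable_genes(counts, n_top=2000):
    sv = standardized_variance(counts)
    return sorted(sv, key=sv.get, reverse=True)[:n_top]


counts = {
    "g1": [0, 1, 2, 1, 0, 2],
    "g2": [3, 5, 4, 6, 4, 2],
    "g3": [9, 11, 10, 12, 8, 10],
    "marker": [0, 0, 0, 12, 12, 12],  # on in half the cells, off in the rest
}
print(top_variable_genes(counts, n_top=1))
```

a gene ranks high when its variance is larger than expected for its mean, which is why the on/off gene wins here.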
why do we need to scale data prior to PCA?
gives equal weight in downstream analyses so the highly expressed genes do not dominate
how is scaling expression values prior to dimensional reduction done?
Z score normalization in Seurat’s ScaleData function:
- shifts the expression of each gene so that the mean expression across cells is 0
- scales the expression of each gene so that the variance across cells is 1
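worked example: a minimal sketch of z-score scaling for a single gene across cells, written in plain Python rather than with Seurat. The function name and expression values are hypothetical.

```python
import math


def scale_gene(values):
    """Center a gene to mean 0 and scale it to variance 1 across cells."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        return [0.0] * n  # a constant gene carries no information
    return [(v - mean) / sd for v in values]


# Hypothetical normalized expression of one gene in 5 cells
scaled = scale_gene([1.0, 2.0, 3.0, 4.0, 5.0])
print([round(z, 3) for z in scaled])
```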
how can we remove unwanted sources of variation from expression values prior to dimensional reduction?
Seurat constructs linear models to predict gene expression based on user-defined variables
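worked example: a minimal sketch of regressing out one unwanted variable. A linear model predicts a gene's expression from the covariate, and the residuals are kept as the corrected values. This uses a single covariate in plain Python; the function name and numbers are hypothetical.

```python
def regress_out(expression, covariate):
    """Return residuals of a least squares fit of expression on covariate."""
    n = len(expression)
    mean_x = sum(covariate) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariate, expression)
    ) / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, expression)]


# Hypothetical per-cell covariate (e.g. percent mitochondrial reads) and one gene
pct_mito = [1.0, 2.0, 3.0, 4.0, 5.0]
gene = [2.1, 3.9, 6.2, 7.8, 10.0]

residuals = regress_out(gene, pct_mito)
print([round(r, 3) for r in residuals])
```

after the fit, the residuals no longer vary linearly with the covariate.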
what are the steps in cell cycle phase regression?
- compute cell cycle scores for each cell based on its expression of G2/M and S phase markers
- model each gene’s relationship between expression and the cell cycle score
- regress: 2 options
- remove ALL signals assoc with cell cycle stage
- remove the difference between G2M and S phase scores (preserves signals for non-cycling vs cycling cells; only differences in cell cycle phase amongst the dividing cells are removed. useful when studying differentiation processes)
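worked example: a simplified sketch of the scoring step and of the covariate used in the second option. It assumes a simplified score (mean marker expression minus the cell's mean expression over all genes) and a tiny illustrative subset of marker genes; the expression values are hypothetical.

```python
S_MARKERS = ["PCNA", "MCM2"]      # small illustrative subset of S phase markers
G2M_MARKERS = ["TOP2A", "MKI67"]  # small illustrative subset of G2/M markers


def marker_score(cell_expr, markers):
    """Simplified score: mean marker expression minus the cell's overall mean."""
    background = sum(cell_expr.values()) / len(cell_expr)
    present = [cell_expr[g] for g in markers if g in cell_expr]
    return sum(present) / len(present) - background


# Hypothetical expression per cell
cells = {
    "s_phase": {"PCNA": 5, "MCM2": 4, "TOP2A": 0, "MKI67": 0, "GAPDH": 3, "ACTB": 3},
    "g2m_phase": {"PCNA": 0, "MCM2": 0, "TOP2A": 5, "MKI67": 4, "GAPDH": 3, "ACTB": 3},
    "non_cycling": {"PCNA": 0, "MCM2": 0, "TOP2A": 0, "MKI67": 0, "GAPDH": 3, "ACTB": 3},
}

difference = {}
for name, expr in cells.items():
    s_score = marker_score(expr, S_MARKERS)
    g2m_score = marker_score(expr, G2M_MARKERS)
    # Option 2 regresses on this difference instead of on both scores
    difference[name] = s_score - g2m_score

print(difference)
```

the difference separates S from G2/M cells but is zero for the non-cycling cell, so regressing on it leaves the cycling vs non-cycling signal in place.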