HC 6 - Analysis of Transcriptomics Data - Part 2: Single-cell RNA-seq Flashcards
Lecture 6
Experimental design and data collection in scRNA-seq
Frame the biological question: characterization of unknown samples of e.g. intestinal epithelial cells and their gene signatures
> 3’-droplet-based scRNA-seq
Quality control of scRNA-seq raw data
Phred scores
During read mapping and expression quantification, it is important to deal with barcodes and UMIs. What are those?
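-Barcodes: sequence tags that identify which cell/sample a read comes from
-UMI (unique molecular identifier): a sequence tag unique to each original mRNA molecule, which shows whether a read is unique or a PCR duplicate > count the UMIs per gene for quantification; each distinct UMI represents one original molecule rather than a PCR duplicate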
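A minimal sketch of UMI-based quantification, assuming each mapped read has already been reduced to a (cell barcode, UMI, gene) tuple. The function name, barcodes, UMIs and gene names are made up for illustration; real pipelines typically also correct sequencing errors in barcodes and UMIs, which this sketch ignores.

```python
from collections import defaultdict

def count_umis(reads):
    """Count distinct UMIs per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples,
    one tuple per mapped read (hypothetical input format).
    """
    umis = defaultdict(set)
    for barcode, umi, gene in reads:
        # A set keeps each UMI once, so PCR duplicates collapse.
        umis[(barcode, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}

reads = [
    ("AAAC", "TTGCA", "Lgr5"),
    ("AAAC", "TTGCA", "Lgr5"),  # same cell, gene and UMI: PCR duplicate
    ("AAAC", "GGATC", "Lgr5"),  # different UMI: a second molecule
    ("CCGT", "TTGCA", "Lgr5"),  # same UMI but another cell: counted separately
    ("CCGT", "ACGTA", "Muc2"),
]
counts = count_umis(reads)
print(counts[("AAAC", "Lgr5")])  # 3 reads, but only 2 molecules
```

The resulting dictionary is the sparse form of the expression matrix: one entry per cell/gene pair with at least one detected molecule.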
Fast mapping procedures
-Mapping on a transcriptome downloaded from database
-Unspliced alignment: Bowtie2
After Phred quality control, read mapping and expression quantification, you end up with the expression matrix, which needs its own quality control: removing low-quality cells. On which four QC metrics is the cell QC performed?
-Number of counts per barcode (cell) (count depth) > set a cutoff value for maintaining cell data
-Number of genes per barcode/cell
-Fraction of counts from mitochondrial genes per barcode/cell: a high mitochondrial fraction indicates damaged cells
-Fraction of ERCC spike-ins, if present
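The four metrics above can be sketched with NumPy on a cells x genes count matrix. This is an illustration only: the function name, the thresholds and the gene-name prefixes ("mt-" for mitochondrial genes, "ERCC-" for spike-ins) are assumptions, and suitable cutoffs depend on the dataset.

```python
import numpy as np

def cell_qc(counts, gene_names, min_counts=500, min_genes=200,
            max_mito_frac=0.2, max_ercc_frac=0.2):
    """Return a keep-mask plus the four per-cell QC metrics.

    counts: cells x genes array of raw counts.
    All thresholds are illustrative defaults, not recommendations.
    """
    counts = np.asarray(counts)
    gene_names = np.asarray(gene_names)
    depth = counts.sum(axis=1)             # counts per barcode (count depth)
    n_genes = (counts > 0).sum(axis=1)     # detected genes per barcode
    is_mito = np.char.startswith(np.char.lower(gene_names), "mt-")
    is_ercc = np.char.startswith(gene_names, "ERCC-")
    safe_depth = np.maximum(depth, 1)      # avoid division by zero
    mito_frac = counts[:, is_mito].sum(axis=1) / safe_depth
    ercc_frac = counts[:, is_ercc].sum(axis=1) / safe_depth
    keep = ((depth >= min_counts) & (n_genes >= min_genes)
            & (mito_frac <= max_mito_frac) & (ercc_frac <= max_ercc_frac))
    return keep, depth, n_genes, mito_frac, ercc_frac

genes = ["Lgr5", "Muc2", "mt-Co1", "mt-Nd1"]
counts = [[10, 5, 1, 0],   # healthy-looking cell
          [1, 0, 6, 5],    # mostly mitochondrial counts
          [0, 1, 0, 0]]    # almost empty barcode
keep, depth, n_genes, mito_frac, ercc_frac = cell_qc(
    counts, genes, min_counts=5, min_genes=2, max_mito_frac=0.5)
print(keep)
```

Only the first cell passes: the second fails on its mitochondrial fraction, the third on count depth and number of genes.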
The expression matrix often needs batch correction after cell QC. How?
Using PCA > plotting all cells as dots in PCA plot
-The batch effect is not very large if the PCs that separate the batches describe only a low percentage of the variance
-With a batch effect, the biological variation across groups is confounded with technical variation from processing cells in different batches
-An uncorrected batch effect distorts the results
After the corrections, normalization of the scRNA-seq expression data is needed. What are the problems?
-many zeros
-high variability
-often Funky Genes
Normalization by deconvolution workflow scRNA-seq
-Cluster cells together
-Pool the cells per cluster to increase counts and reduce zeros
-Robust estimate of each pool size factor
-Repeat for multiple pools
-Solve linear system of equations to obtain per-cell size factor
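A simplified sketch of the pooling steps above for the cells of one cluster. Assumptions: pools are sliding windows over the cells, the pool size factor is the median ratio of the pooled profile to the average profile, and the function name and pool sizes are made up. It illustrates the idea and is not the full published procedure.

```python
import numpy as np

def pooled_size_factors(counts, pool_sizes=(2, 3)):
    """Estimate per-cell size factors (scaled to mean 1) via pooling.

    counts: cells x genes array for the cells of a single cluster.
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    reference = counts.mean(axis=0)   # average cell of the cluster
    expressed = reference > 0         # skip genes absent from the reference
    rows, pool_factors = [], []
    for size in pool_sizes:
        for start in range(n_cells):
            # Sliding window of cells, wrapping around at the end.
            members = [(start + k) % n_cells for k in range(size)]
            pooled = counts[members].sum(axis=0)   # pooling reduces zeros
            ratios = pooled[expressed] / reference[expressed]
            pool_factors.append(np.median(ratios)) # robust pool size factor
            row = np.zeros(n_cells)
            row[members] = 1.0   # pool factor = sum of its cells' factors
            rows.append(row)
    # Solve the linear system (least squares) for the per-cell factors.
    factors, *_ = np.linalg.lstsq(np.vstack(rows), np.array(pool_factors),
                                  rcond=None)
    return factors / factors.mean()

# Noise-free toy data: each cell is the same profile scaled by its factor.
true_factors = np.array([0.5, 1.0, 1.5, 2.0, 0.8, 1.2])
profile = np.array([10.0, 20.0, 5.0, 40.0])
toy_counts = np.outer(true_factors, profile)
estimated = pooled_size_factors(toy_counts)
print(np.round(estimated, 3))
```

Dividing each cell's counts by its estimated factor then gives the normalized expression values. Using more than one pool size keeps the linear system solvable for every individual cell.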
Possible exam question: what is the difference in normalization with bulk RNAseq and scRNA seq?
In scRNA-seq, no single general scaling factor is used; instead, clusters of cells are taken and pooled to estimate cell-specific normalization factors
Imputation is performed after normalization. What is the problem?
Too many zeros
What are dropout events and where are they found in the plot of log(RPM)?
A dropout event occurs when a transcript is expressed in a cell but is entirely undetected in that cell's mRNA profile
> 0-value in logRPM plot
Why does a dropout event occur?
Due to the low amounts of mRNA in individual cells and the low sequencing depth typical of scRNA-seq experiments
The frequency of dropout events depends on the scRNA-seq …
protocols
10X Genomics (droplet-based) generally has many dropouts. What is the trade-off?
For the same budget, it measures more cells, but with a lower sequencing depth and more dropouts
What is imputation?
The process of filling in the zeros with expression values, using information from other cells that are ‘similar’
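A toy sketch of the idea, assuming similarity is measured by Euclidean distance between cells and a zero is filled only when enough of the nearest cells express the gene. The function name, `k` and `min_frac` are illustrative choices; this is not a specific published imputation method.

```python
import numpy as np

def knn_impute(expr, k=2, min_frac=0.75):
    """Fill zeros using the k most similar cells.

    expr: cells x genes array of normalized expression.
    A zero is replaced by the mean of the neighbours' non-zero values,
    but only if at least `min_frac` of the neighbours express the gene.
    """
    expr = np.asarray(expr, dtype=float)
    imputed = expr.copy()
    diff = expr[:, None, :] - expr[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # cell-to-cell distances
    np.fill_diagonal(dist, np.inf)            # a cell is not its own neighbour
    for cell in range(expr.shape[0]):
        neighbours = np.argsort(dist[cell])[:k]
        nb = expr[neighbours]
        n_expressing = (nb > 0).sum(axis=0)
        nb_mean = nb.sum(axis=0) / np.maximum(n_expressing, 1)
        fill = (expr[cell] == 0) & (n_expressing / k >= min_frac)
        imputed[cell, fill] = nb_mean[fill]
    return imputed

expr = [[5, 0, 3, 0],   # gene 2 is zero here but expressed in similar cells
        [5, 4, 3, 0],
        [6, 4, 2, 0],
        [0, 0, 0, 9],   # a different cell type
        [0, 1, 0, 8]]
imputed = knn_impute(expr)
print(imputed[0])
```

Only the zero in the first cell is filled (with 4, the value in its two nearest cells). Zeros that the similar cells share, such as the last gene in the first three cells, are left alone because they may be true non-expression.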
What is a statistical artifact?
An interference introduced by the measurement or analysis itself, rather than by the biology, which biases the data
Imputation: zero inflation
Log(count+1) transformation
> creates bias in analysis
> needed because zero’s in data cannot be log transformed
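A small numeric illustration of both points, using base-2 logs and made-up counts: the +1 makes zeros transformable, but it shrinks fold changes at low counts much more than at high counts.

```python
import numpy as np

counts = np.array([0, 1, 2, 100, 200])
log_counts = np.log2(counts + 1)   # log2(0 + 1) = 0, so zeros are allowed

# A true 2-fold difference should give a log2 difference of 1.
low_diff = log_counts[2] - log_counts[1]    # counts 1 vs 2
high_diff = log_counts[4] - log_counts[3]   # counts 100 vs 200
print(round(low_diff, 3), round(high_diff, 3))
```

The doubling from 1 to 2 counts shows up as a log2 difference of about 0.585 instead of 1, while the doubling from 100 to 200 stays close to 1. This is the bias the pseudocount introduces for lowly expressed genes.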
Why shouldn’t all zeros be imputed?
A zero can reflect true biological variation: the gene may really not be expressed in that cell
There are different approaches for imputation. Why is it not ideal to impute all gene expressions? Give 2 reasons.
-Imputing expressions unaffected by dropout would introduce new bias
-Could also eliminate meaningful biological variation
Why is it inappropriate to treat all zero expressions as missing values?
-Some zero expressions may reflect true biological non-expression
-Zero expression can result from gene expression stochasticity (fluctuation)
What do the imputation methods search for?
Comparable cells or genes whose expression values make it possible to fill in the missing values