HC 6 - Analysis of Transcriptomics Data - Part 2: Single-cell RNA-seq Flashcards
Lecture 6
Experimental design and data collection in scRNA-seq
Frame the biological question: characterization of unknown samples of e.g. intestinal epithelial cells and their gene signatures
> 3’-droplet-based scRNA-seq
Quality control of scRNA-seq raw data
Phred scores
During read mapping and expression quantification, it is important to deal with barcodes and UMIs. What are those?
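-Barcodes: short sequences that identify which cell (or sample) a read came from
-UMIs (unique molecular identifiers): short sequences that tag each original mRNA molecule, so it can be decided whether a read is unique or a PCR duplicate > count the unique UMIs per gene for quantification, so that PCR amplification does not inflate the counts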
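A minimal sketch of UMI counting, assuming every mapped read has already been reduced to a (cell barcode, UMI, gene) tuple. The function name `count_umis`, the barcode/UMI sequences and the gene name are made up for illustration; real pipelines do considerably more than this.

```python
from collections import defaultdict

def count_umis(reads):
    """Count unique UMIs per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples, one per
    mapped read. Reads that share barcode, UMI and gene are treated as
    PCR duplicates and counted once.
    """
    umis_seen = defaultdict(set)
    for barcode, umi, gene in reads:
        umis_seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}

# Hypothetical reads: two cells, one gene
reads = [
    ("AAAC", "TTG", "Lgr5"),
    ("AAAC", "TTG", "Lgr5"),  # same cell, UMI and gene: PCR duplicate
    ("AAAC", "CCA", "Lgr5"),  # same cell, different UMI: second molecule
    ("GGGT", "TTG", "Lgr5"),  # different cell
]
counts = count_umis(reads)
print(counts)
```

Cell AAAC has three reads for the gene but only two molecules, which is the value that ends up in the expression matrix.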
Fast mapping procedures
-Mapping on a transcriptome downloaded from database
-Unspliced alignment: Bowtie2
After Phred quality control and read mapping and expression quantification you end up with the expression matrix which needs specific quality control: remove low quality cells. On which four QC metrics is the cell QC performed?
-Number of counts per barcode (cell) (count depth) > set a cutoff value for maintaining cell data
-Number of genes per barcode/cell
-Fraction of counts from mitochondrial genes per barcode/cell: a high mitochondrial fraction indicates that the cell is damaged
-Fraction of ERCC spike-ins, if present
The expression matrix often needs batch correction after cell QC. How?
First check for batch effects using PCA > plot all cells as dots in a PCA plot
-the batch effect is not very large if the PCs that separate the batches describe a low percentage of the variance
-the biological variation across groups is confounded with technical variation from processing cells in different batches
-an uncorrected batch effect distorts the results
After the corrections, normalization of the scRNA-seq expression data is needed. What are the problems?
-many zeros
-high variability
-often Funky Genes
Normalization by deconvolution workflow scRNA-seq
-Cluster cells together
-Pool the cells per cluster to increase counts and reduce zeros
-Robust estimate of each pool size factor
-Repeat for multiple pools
-Solve linear system of equations to obtain per-cell size factor
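A strongly simplified sketch of the pooling idea, using only the standard library. Assumptions that are not in the notes: all cells are treated as one cluster, the pool size factor is a median ratio against the average cell, and exactly as many pools as cells are used so the linear system is square (the real procedure uses many overlapping pools). All function names are made up.

```python
from statistics import median

def pool_size_factor(cells, pool, reference):
    """Robust (median-ratio) size factor of the summed counts of one pool."""
    ratios = []
    for g, ref in enumerate(reference):
        if ref > 0:  # genes that are zero everywhere carry no information
            pooled = sum(cells[j][g] for j in pool)
            ratios.append(pooled / ref)
    return median(ratios)

def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a square system a x = b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x

def deconvolve_size_factors(cells, pools):
    """Each pool factor is modelled as the sum of the factors of its cells."""
    n_cells, n_genes = len(cells), len(cells[0])
    reference = [sum(cell[g] for cell in cells) / n_cells for g in range(n_genes)]
    a = [[1.0 if j in pool else 0.0 for j in range(n_cells)] for pool in pools]
    b = [pool_size_factor(cells, pool, reference) for pool in pools]
    return solve_linear_system(a, b)

# Three toy cells with the same profile at 1x, 2x and 3x depth
cells = [[2, 4, 0, 10], [4, 8, 0, 20], [6, 12, 0, 30]]
pools = [(0, 1), (1, 2), (0, 2)]
size_factors = deconvolve_size_factors(cells, pools)
print(size_factors)
```

The recovered per-cell factors keep the 1:2:3 ratio of the toy cells; dividing each cell's counts by its factor would put them on the same scale.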
Possible exam question: what is the difference in normalization with bulk RNAseq and scRNA seq?
In scRNA-seq, no single general scaling factor is used; instead, clusters of cells are taken and pooled to estimate cell-specific normalization factors
Imputation is performed after normalization. What is the problem?
Too many zeros
What are dropout events and where are they found in the plot of log(RPM)?
A dropout event occurs when a transcript is expressed in the cell but is entirely undetected in its mRNA profile
> 0-value in logRPM plot
Why does a dropout event occur?
Due to low amounts of mRNA in individual cells, and the low sequencing depths typical for scSeq experiments
The frequency of dropout events depends on the scRNA-seq …
protocols
10X genomics (droplet based) has got generally many dropouts. What is the trade-off?
For the same budget, it measures more cells, but with a lower sequencing depth per cell and more dropouts
What is imputation?
The process of filling in the zeros with expression values, using information from other cells that are ‘similar’
What is a statistical artifact?
An inference that results from bias in the collection or manipulation of the data, rather than from a real effect
Imputation: zero inflation
Log(count+1) transformation
> creates bias in analysis
> needed because zeros in the data cannot be log transformed
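A small sketch of the log(count+1) transformation. Base 2 is an assumption, since the notes do not state a base, and `log_transform` is a made-up helper name.

```python
import math

def log_transform(counts, pseudocount=1):
    """Return log2(count + pseudocount) for every count."""
    return [math.log2(c + pseudocount) for c in counts]

counts = [0, 1, 3, 7]
log_counts = log_transform(counts)
print(log_counts)
# The zero count maps to 0.0; math.log2(0) without the pseudocount would raise ValueError.
```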
Why shouldn’t all zeros be imputed?
A zero can reflect real biological variation (the gene is truly not expressed in that cell)
There are different approaches for imputation. Why is it not ideal to impute all gene expressions? Give 2 reasons.
-Imputing expressions unaffected by dropout would introduce new bias
-Could also eliminate meaningful biological variation
Why is it not appropriate to treat all zero expressions as missing values?
-Some zero expressions may reflect true biological non-expression
-Zero expression can result from gene expression stochasticity (fluctuation)
What do the imputation methods search for?
Comparable genes which make it possible to fill in the missing values
scImpute
-For each gene: to determine which expression values are most likely affected by dropouts
-For each cell: to impute the highly likely dropout values by borrowing information from the same genes’ expression in similar cells
Procedures for imputation are based on … from similar cells/genes
borrowing
Goals scRNA-seq data analysis
Identify clusters of cells with similar expression patterns, characterize these clusters with marker genes
Feature selection: what is it?
What are the relevant genes for clustering cells as subpopulations?
What do feature selected genes have to be?
Biologically active
Assumptions feature selection
-Variable genes are biologically active
-Variation in expression for most genes is driven by uninteresting processes like sampling noise
How can the technical component of variation be found, under the assumption that the variation in most genes is noise?
> fit a trend to the variance of all genes as a function of their mean expression; the fitted value of the trend represents an estimate of a gene's uninteresting variation (technical component)
How to find the biological component of variation?
For each gene: the difference between its total variance and its technical component
(the trend is fitted with loess)
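A sketch of this decomposition under a simplifying assumption: the mean-variance trend is a straight line fitted by ordinary least squares, used here only as a stand-in for loess. Gene names and values are invented.

```python
from statistics import mean, variance

def decompose_variance(expression):
    """Split each gene's variance into a technical and a biological part.

    expression: dict gene -> list of expression values across cells.
    Returns dict gene -> (total, technical, biological), where the technical
    part is the fitted value of a straight-line mean-variance trend.
    """
    genes = list(expression)
    means = [mean(expression[g]) for g in genes]
    totals = [variance(expression[g]) for g in genes]
    x_bar, y_bar = mean(means), mean(totals)
    sxx = sum((x - x_bar) ** 2 for x in means)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(means, totals))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    result = {}
    for g, m, total in zip(genes, means, totals):
        technical = intercept + slope * m  # fitted value of the trend
        result[g] = (total, technical, total - technical)
    return result

expression = {
    "geneA": [0, 2, 0, 2],
    "geneB": [1, 3, 5, 3],
    "geneC": [2, 8, 5, 5],
    "geneD": [0, 12, 0, 0],  # expressed in one cell only: marker-like
}
components = decompose_variance(expression)
biological = {g: v[2] for g, v in components.items()}
top_gene = max(biological, key=biological.get)
print(top_gene, biological)
```

Genes with the largest biological component would be the ones kept in feature selection.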
It is assumed that all genes are somewhat variable, but for most genes because of …
Randomness > Poisson
Clustering
Identifying cells which are somewhat similar in gene expression > subpopulation
- Because you have analyzed many different cells individually but do not know which cell is what
Unsupervised clustering : Hierarchical clustering is based on
Samples or genes
Euclidean distance
How dissimilar are genes/samples 1 and 3, for example?
> calculate the difference between the two in every sample (for genes) or in every gene (for samples)
> square those differences
> sum all the squares and take the square root
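The three steps written out as a small sketch; the expression values for gene 1 and gene 3 are made up.

```python
import math

def euclidean_distance(profile_a, profile_b):
    """Difference per sample, squared, summed, then the square root."""
    squared_diffs = [(a - b) ** 2 for a, b in zip(profile_a, profile_b)]
    return math.sqrt(sum(squared_diffs))

# Expression of two genes measured in the same three samples
gene1 = [1.0, 2.0, 3.0]
gene3 = [4.0, 6.0, 3.0]
distance = euclidean_distance(gene1, gene3)
print(distance)  # sqrt(9 + 16 + 0)
```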
Similarity and distance in comparing genes
Correlation: similarity
Distance: 1 - correlation
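A sketch of the correlation-based distance, assuming Pearson correlation is meant; the profiles are invented.

```python
import math

def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equally long profiles."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_distance(a, b):
    """Distance = 1 - correlation: 0 for perfectly correlated, 2 for anti-correlated."""
    return 1.0 - pearson_correlation(a, b)

gene_up = [1.0, 2.0, 3.0]
gene_up_scaled = [2.0, 4.0, 6.0]  # same shape, different magnitude
gene_down = [3.0, 2.0, 1.0]       # opposite pattern
d_same = correlation_distance(gene_up, gene_up_scaled)
d_opposite = correlation_distance(gene_up, gene_down)
print(d_same, d_opposite)
```

Unlike Euclidean distance, this ignores the magnitude of expression and only compares the shape of the profiles.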
Positive and negative correlation lines
Positive (+): rising line /
Negative (-): falling line \
Workflow Hierarchical clustering
- Set up a matrix with all possible distances between all data points
- e.g. 5 genes in one dimension
- Agglomerative method: start with each datapoint in one cluster
- What is the smallest distance
- Attach these points and put the distance on the vertical axis (dendrogram: boxes with connecting lines)
- Recalculate distance matrix (linkage: distance of newly formed cluster to other data points)
- What is the shortest distance
- Attach these points and put on vertical
- Recalculate distance matrix
- Shortest distance, attachment, vertical
- etc.
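The workflow can be sketched for 5 genes in one dimension as below. The values are invented, and the `linkage` argument (Python's built-in `min` or `max` over all pairwise distances between two clusters) is one possible way to recalculate the distances, not necessarily the one used in the lecture.

```python
def agglomerative_clustering(points, linkage=min):
    """Merge the two closest clusters until one cluster is left.

    points: dict name -> value (one dimension).
    linkage: min (nearest points) or max (furthest points) of the
             pairwise distances between two clusters.
    Returns the merges as (cluster_a, cluster_b, height) in order.
    """
    clusters = [(name,) for name in points]  # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(points[a] - points[b])
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))  # d is the height in the tree
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

genes = {"g1": 1.0, "g2": 2.0, "g3": 4.5, "g4": 9.0, "g5": 10.5}
nearest = agglomerative_clustering(genes, linkage=min)
furthest = agglomerative_clustering(genes, linkage=max)
nearest_heights = [h for _, _, h in nearest]
furthest_heights = [h for _, _, h in furthest]
print(nearest_heights)
print(furthest_heights)
```

The first two merges are the same for both choices; the later heights differ because the distance to a newly formed cluster is recalculated differently.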
The height of the branches of the hierarchical clustering is proportional to the…
distance between genes
Distance measures
-Euclidean distance
>Always >= 0
>Zero for identical profiles
>high for profiles of little similarity
-City Block distance
>large effects in single dimension are dampened
-Pearson Correlation
>for centered (equal mean and sd) data
>correlation coefficients from -1 to 1
>clustering on |r| will put anti-correlated and correlated in one cluster
Spearman correlation is more robust against .. than Pearson
outliers
Single linkage vs complete linkage for recalculating distance matrix in HC
-Single linkage: distance between two clusters is that between the nearest points > result: chaining (genes added to clusters one at a time)
-Complete linkage: distance between two clusters is that between the furthest points > result: small compact clusters, not suited for fuzzy (vague) data
-Many other methods like average linkage exist
Hierarchical clustering: isomorphism
-Horizontal order of genes is non-informative
-Each time a node is drawn, a decision is made where to put it
(vertical distances are informative! > distances)
Unsupervised clustering: K-means: why
it is interesting to know which genes have similar expression profiles, these are maybe involved in similar biological processes
K-means clustering groups genes by similarity in expression patterns. Name the algorithm steps
- Choose K initial cluster centers (centroids) at random
- Partition the objects (genes) into K clusters by assigning each object to the closest centroid
- Calculate the centroid of each of the K clusters
- Reassign each object: first calculate the distance from the object to all cluster centers, then choose the closest
- If any object changes cluster, recalculate the centroids
- Repeat until the objects are not moving anymore
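A minimal K-means sketch following these steps, using Euclidean distance. The optional `initial_centroids` argument is an addition for this example so that the run is reproducible; without it the centers are chosen at random as in the steps above. The points are invented.

```python
import math
import random

def kmeans(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Return (cluster index per point, final centroids)."""
    rng = random.Random(seed)
    if initial_centroids is not None:
        centroids = list(initial_centroids)
    else:
        centroids = rng.sample(points, k)  # random initial cluster centers
    assignment = None
    for _ in range(max_iter):
        # Assign every object to the closest centroid
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # nothing moved: stop
            break
        assignment = new_assignment
        # Recalculate the centroid of every cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Each point is one gene: (expression in sample 1, expression in sample 2)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
# Deliberately poor start: both centers inside the same group
assignment, centroids = kmeans(points, k=2, initial_centroids=[(1.0, 1.0), (1.5, 2.0)])
print(assignment)
print(centroids)
```

Even with both starting centers in the low-expression group, the centroids move apart within a few iterations and the two groups of genes are separated.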
In K-means clustering, the dots are:
genes
For K-means clustering, genes are plotted for expression in samples 1 and 2 for example. Per cluster: there is a ….
difference between sample 1 and 2
We want to see structure or clusters in the data set: why might PCA not work well on very complex data sets?
PCA only works well if the first 2 PCs account for most of the variation and clustering in the data; in very complex data sets they often do not
tSNE solution for PCA problem
tSNE takes high dimensional dataset and reduces it to low dimensional graph, that retains a lot of the original information
tSNE is a … method
dimension-reduction method
Steps tSNE visualization
- Determine the similarity between all the points in the original scatter plot > similarity matrix
- Randomly project the data into the low-dimensional space (e.g. onto a line)
- Determine the similarity between all the points on the line > second similarity matrix
- Similarity determination > clusters show up in the similarity matrix (scores)
- Move the points (in iterations) so that the similarity matrix of the random projection becomes similar to the first similarity matrix > adjusted points
- tSNE moves the points a little bit at a time, each time in the direction that makes the random matrix more like the original similarity matrix
A method that works similarly to tSNE is
UMAP
Differences tSNE and UMAP
o Interpretation of the distance between objects or clusters
o tSNE preserves local structure in the data
> when different cells lie close to each other, the distance between them is reliable
> large distances are not accurate in the plot
o UMAP claims to preserve local and most of the global structure in the data
With tSNE you cannot interpret the distance between clusters A and B at different ends of the plot: this means
You cannot infer that these clusters are more dissimilar than A and C if C is closer to A in the plot.
> But: within A, the points closer to each other are more similar objects than those at different ends of cluster A.
With UMAP you can interpret
-Distances between points within clusters
-Distances between points of different clusters, and between the clusters themselves