HC 6 - Analysis of Transcriptomics Data - Part 2: Single-cell RNA-seq Flashcards

Lecture 6

1
Q

Experimental design and data collection in scRNA-seq

A

Frame the biological question: characterization of unknown samples of e.g. intestinal epithelial cells and their gene signatures
> 3’-droplet-based scRNA-seq

2
Q

Quality control of scRNA-seq raw data

A

Phred scores

3
Q

During read mapping and expression quantification, it is important to deal with barcodes and UMIs. What are those?

A

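-Barcodes: short sequences that identify which cell/sample a read came from
-UMI (unique molecular identifier): a short sequence that tags each original mRNA molecule, so you can tell whether a read is unique or a PCR duplicate > count the distinct UMIs per gene for quantification; these represent different original molecules, not copies made during PCR amplification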
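
The counting rule above can be sketched in a few lines. This is only an illustration: the `(cell_barcode, umi, gene)` tuple layout, the function name `count_umis`, and the example barcodes and gene names are assumptions, not the output format of any particular pipeline.

```python
from collections import defaultdict

def count_umis(reads):
    """Count distinct UMIs per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples; this layout
    is a hypothetical stand-in for the output of a real mapping step.
    """
    umis_seen = defaultdict(set)
    for barcode, umi, gene in reads:
        # Reads sharing barcode, gene and UMI are treated as PCR duplicates,
        # so the set keeps each UMI only once.
        umis_seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}

reads = [
    ("CELL_A", "AACG", "Lgr5"),
    ("CELL_A", "AACG", "Lgr5"),  # PCR duplicate of the read above
    ("CELL_A", "TTGC", "Lgr5"),
    ("CELL_B", "AACG", "Lgr5"),  # same UMI, but a different cell
    ("CELL_B", "GGAT", "Muc2"),
]
counts = count_umis(reads)
```

Here `CELL_A` ends up with two molecules of `Lgr5` although three reads were sequenced.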

4
Q

Fast mapping procedures

A

-Mapping on a transcriptome downloaded from database
-Unspliced alignment, e.g. with Bowtie2 (transcript sequences contain no introns, so no spliced aligner is needed)

5
Q

After Phred quality control, read mapping and expression quantification, you end up with the expression matrix, which needs its own quality control: removing low-quality cells. On which four QC metrics is this cell QC based?

A

-Number of counts per barcode (cell) (count depth) > set a cutoff value for maintaining cell data
-Number of genes per barcode/cell
-Fraction of counts from mitochondrial genes per barcode/cell: a high mitochondrial fraction indicates damaged cells (cytoplasmic mRNA has leaked out, mitochondrial mRNA remains)
-Fraction of ERCC spike-ins, if present
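
A minimal sketch of the first three metrics, assuming mitochondrial genes can be recognized by an `mt-` name prefix and using arbitrary cutoffs chosen only for this toy data. The function names, gene names and thresholds are all illustrative assumptions; the ERCC fraction is left out because the toy data has no spike-ins.

```python
def qc_metrics(cell_counts, gene_names):
    """Compute count depth, number of detected genes and mito fraction."""
    total = sum(cell_counts)
    n_genes = sum(1 for c in cell_counts if c > 0)
    # Assumption: mitochondrial genes carry an "mt-" / "MT-" name prefix.
    mito = sum(c for c, g in zip(cell_counts, gene_names)
               if g.upper().startswith("MT-"))
    mito_frac = mito / total if total else 0.0
    return {"n_counts": total, "n_genes": n_genes, "mito_frac": mito_frac}

def passes_qc(m, min_counts=10, min_genes=2, max_mito_frac=0.2):
    """Apply cutoffs; these default values are arbitrary examples."""
    return (m["n_counts"] >= min_counts
            and m["n_genes"] >= min_genes
            and m["mito_frac"] <= max_mito_frac)

gene_names = ["Lgr5", "Muc2", "Krt20", "mt-Co1"]
cells = {
    "healthy": [5, 8, 6, 1],  # enough counts, low mito fraction
    "damaged": [1, 0, 1, 8],  # most counts are mitochondrial
    "empty":   [1, 0, 0, 0],  # hardly any counts at all
}
metrics = {barcode: qc_metrics(c, gene_names) for barcode, c in cells.items()}
kept = [barcode for barcode, m in metrics.items() if passes_qc(m)]
```

In real data the cutoffs are usually chosen by looking at the distribution of each metric rather than fixed in advance.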

6
Q

The expression matrix often needs batch correction after cell QC. How?

A

Using PCA > plotting all cells as dots in a PCA plot
-the batch effect is not very large if the PCs that separate the batches describe only a low percentage of the variance
-with a batch effect, the biological variation across groups is confounded with technical variation from processing cells in different batches
-an uncorrected batch effect distorts the downstream results

7
Q

After the corrections, normalization of the scRNA-seq expression data is needed. What are the problems?

A

-many zeros
-high variability
-often Funky Genes

8
Q

Normalization by deconvolution workflow scRNA-seq

A

-Cluster cells together
-Pool the cells per cluster to increase counts and reduce zeros
-Make a robust estimate of the size factor of each pool
-Repeat for multiple pools
-Solve linear system of equations to obtain per-cell size factor
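
The pooling and deconvolution steps can be sketched on noise-free toy data. Strong simplifications are assumed here: all cells are treated as one cluster, pools are sliding windows of a single size over the cells, and the pool size shares no common factor with the number of cells so that the toy system is solvable. The function names `solve_linear` and `pooled_size_factors` are hypothetical, and this is not the implementation of any specific package.

```python
from statistics import median

def solve_linear(A, b):
    """Solve the square system A x = b by Gaussian elimination."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def pooled_size_factors(counts, pool_size=3):
    """counts[cell][gene] -> one size factor per cell."""
    n_cells, n_genes = len(counts), len(counts[0])
    # Reference pseudo-cell: average count of each gene over all cells.
    reference = [sum(cell[g] for cell in counts) / n_cells
                 for g in range(n_genes)]
    A, b = [], []
    for start in range(n_cells):
        members = [(start + k) % n_cells for k in range(pool_size)]
        pooled = [sum(counts[m][g] for m in members) for g in range(n_genes)]
        # Robust pool size factor: median ratio of pool to reference.
        b.append(median(p / r for p, r in zip(pooled, reference) if r > 0))
        # Each pool factor is the sum of the factors of its member cells.
        A.append([1.0 if c in members else 0.0 for c in range(n_cells)])
    return solve_linear(A, b)

true_factors = [0.5, 1.0, 1.5, 2.0, 1.0]   # made-up per-cell scaling
base_profile = [10.0, 4.0, 20.0, 2.0]      # made-up expression profile
counts = [[f * g for g in base_profile] for f in true_factors]
size_factors = pooled_size_factors(counts)
```

On this toy data the recovered size factors equal the true scaling of each cell divided by the average scaling, because the reference is the average cell.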

9
Q

Possible exam question: what is the difference in normalization between bulk RNA-seq and scRNA-seq?

A

In scRNA-seq, no single general scaling factor is used; instead, clusters of cells are taken and pooled to estimate cell-specific normalization factors

10
Q

Imputation is performed after normalization. What is the problem?

A

Too many zeros

11
Q

What are dropout events and where are they found in the plot of log(RPM)?

A

A dropout event occurs when a transcript is expressed in a cell but is entirely undetected in that cell's mRNA profile
> 0-value in logRPM plot

12
Q

Why does a dropout event occur?

A

Due to the low amounts of mRNA in individual cells and the low sequencing depth typical of scRNA-seq experiments

13
Q

The frequency of dropout events depends on the scRNA-seq …

A

protocols

14
Q

10x Genomics (droplet-based) generally has many dropouts. What is the trade-off?

A

For the same budget, it measures more cells, but with a lower sequencing depth per cell and more dropouts

15
Q

What is imputation?

A

The process of replacing zeros with expression values, using information from other cells that are ‘similar’

16
Q

What is a statistical artifact?

A

An inference that results from bias in the collection or manipulation of the data, rather than from a real effect

17
Q

Imputation: zero inflation

A

Log(count+1) transformation
> needed because zeros in the data cannot be log-transformed (log(0) is undefined)
> but with so many zeros it creates bias in the analysis
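
A small sketch of why the +1 is needed, using the natural logarithm as an assumption (the card does not state the base) and made-up counts.

```python
import math

counts = [0, 1, 9, 99]

# A plain log fails on zero counts: Python raises ValueError for log(0).
try:
    math.log(0)
    zero_is_loggable = True
except ValueError:
    zero_is_loggable = False

# log(count + 1) is defined for every count and maps 0 to 0.
log_counts = [math.log1p(c) for c in counts]
```

Note that the pseudocount compresses low counts more than high ones, which is one way the transformation can bias results in data dominated by zeros.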

18
Q

Why shouldn’t all zeros be imputed?

A

A zero can reflect real biological variation: the gene is genuinely not expressed in that cell

19
Q

There are different approaches for imputation. Why is it not ideal to impute all gene expressions? Give 2 reasons.

A

-Imputing expressions unaffected by dropout would introduce new bias
-Could also eliminate meaningful biological variation

20
Q

Why is it inappropriate to treat all zero expressions as missing values?

A

-Some zero expressions may reflect true biological non-expression
-Zero expression can result from gene expression stochasticity (fluctuation)

21
Q

What do the imputation methods search for?

A

Comparable genes which make it possible to fill in the missing values

22
Q

scImpute

A

-For each gene: determine which expression values are most likely affected by dropouts
-For each cell: impute the highly likely dropout values by borrowing information from the same gene's expression in similar cells
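
The borrowing idea in the second step can be sketched as follows. This is a deliberately simplified illustration and not the scImpute algorithm itself: the set of likely dropout genes is simply passed in instead of being estimated by a statistical model, similar cells are found with a plain Euclidean distance, and the imputed value is an unweighted average. The function name `impute_cell` and the data are hypothetical.

```python
import math

def impute_cell(expr, cell, dropout_genes, k=2):
    """Fill flagged values of `cell` with the mean of its k most similar cells.

    expr: dict mapping cell name -> list of expression values per gene.
    dropout_genes: set of gene indices considered likely dropouts in `cell`.
    """
    # Only genes NOT flagged as dropouts are used to judge similarity.
    reliable = [g for g in range(len(expr[cell])) if g not in dropout_genes]

    def distance_to(other):
        return math.sqrt(sum((expr[cell][g] - expr[other][g]) ** 2
                             for g in reliable))

    neighbours = sorted((c for c in expr if c != cell), key=distance_to)[:k]
    imputed = list(expr[cell])
    for g in dropout_genes:
        imputed[g] = sum(expr[n][g] for n in neighbours) / len(neighbours)
    return imputed

expr = {
    "c1": [5.0, 4.0, 0.0],  # gene index 2 is a suspected dropout
    "c2": [5.0, 4.0, 3.0],
    "c3": [4.0, 5.0, 5.0],
    "c4": [0.0, 1.0, 0.0],  # different cell type: its zero is left alone
}
imputed_c1 = impute_cell(expr, "c1", dropout_genes={2})
```

Only the flagged value in `c1` is replaced; the zero in `c4` is not flagged and therefore stays a zero, in line with the point that not every zero should be imputed.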

23
Q

Procedures for imputation are based on … from similar cells/genes

A

borrowing

24
Q

Goals scRNA-seq data analysis

A

Identify clusters of cells with similar expression patterns, characterize these clusters with marker genes

25
Q

Feature selection: what is it?

A

What are the relevant genes for clustering cells as subpopulations?

26
Q

What do feature selected genes have to be?

A

Biologically active

27
Q

Assumptions feature selection

A

-Variable genes are biologically active
-Variation in expression for most genes is driven by uninteresting processes like sampling noise

28
Q

How can the technical component of variation be found under the assumption that the variation in most genes is noise?

A

Fit a trend to the variance of all genes as a function of their mean expression
> the fitted value of the trend for a gene represents an estimate of its uninteresting variation (technical component)

29
Q

How to find the biological component of variation?

A

For each gene: the difference between its total variance and its technical component
(the technical component is the value of the loess-fitted trend)
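
A minimal sketch of this decomposition on made-up data. As a stated simplification, a straight line through the origin with a median-based slope stands in for the loess trend, and the gene names and values are purely illustrative.

```python
from statistics import mean, median, pvariance

def decompose_variance(expr):
    """Split each gene's variance into a technical and a biological part.

    expr: dict mapping gene name -> list of expression values across cells.
    """
    stats = {g: (mean(v), pvariance(v)) for g, v in expr.items()}
    # Stand-in for the loess trend: variance = slope * mean, where the slope
    # is the median variance/mean ratio, so most genes sit near the trend.
    slope = median(var / m for m, var in stats.values() if m > 0)
    result = {}
    for g, (m, var) in stats.items():
        technical = slope * m  # fitted value of the trend
        result[g] = {"total": var,
                     "technical": technical,
                     "biological": var - technical}
    return result

expr = {
    "house1": [2, 3, 2, 3, 2, 3],
    "house2": [5, 6, 4, 5, 6, 4],
    "house3": [9, 11, 10, 9, 11, 10],
    "marker": [0, 0, 0, 12, 12, 12],  # on in half of the cells, off in the rest
}
components = decompose_variance(expr)
top_gene = max(components, key=lambda g: components[g]["biological"])
```

Genes with a large positive biological component, such as `marker` here, are the ones feature selection would keep for clustering.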

30
Q

It is assumed that all genes are somewhat variable, but for most genes because of …

A

Randomness > Poisson

31
Q

Clustering

A

Identifying cells which are somewhat similar in gene expression > subpopulation
- Because you have analyzed many different cells individually, but do not know which cell belongs to which subpopulation

32
Q

Unsupervised clustering : Hierarchical clustering is based on

A

Samples or genes

33
Q

Euclidean distance

A

How dissimilar are e.g. genes 1 and 3 (or samples 1 and 3)?
> take the difference between the two genes in every sample (or between the two samples for every gene)
> square those differences
> sum all the squares and take the square root
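
The three steps translate directly into code. The function name and the two example profiles are illustrative only.

```python
import math

def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two expression profiles of equal length."""
    # Difference per sample, squared
    squared = [(a - b) ** 2 for a, b in zip(profile_a, profile_b)]
    # Sum of the squares, then the square root
    return math.sqrt(sum(squared))

gene_1 = [1.0, 2.0, 3.0]  # expression of gene 1 in three samples
gene_3 = [4.0, 6.0, 3.0]  # expression of gene 3 in the same samples
d = euclidean_distance(gene_1, gene_3)  # sqrt(9 + 16 + 0) = 5.0
```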

34
Q

Similarity and distance in comparing genes

A

Correlation: similarity
Distance: 1 - correlation
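
A short sketch of turning a Pearson correlation into a distance, with hand-written helper functions and made-up profiles.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_distance(x, y):
    """Distance = 1 - correlation, so it ranges from 0 to 2."""
    return 1.0 - pearson(x, y)

up = [1.0, 2.0, 3.0, 4.0]
also_up = [2.0, 4.0, 6.0, 8.0]  # same shape, different scale
down = [4.0, 3.0, 2.0, 1.0]     # mirror image of `up`

d_same = correlation_distance(up, also_up)
d_opposite = correlation_distance(up, down)
```

Perfectly correlated profiles get distance 0 and perfectly anti-correlated profiles get distance 2, regardless of their absolute expression levels.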

35
Q

Positive and negative correlation lines

A

+ /
- \

36
Q

Workflow Hierarchical clustering

A
  1. Set up a matrix with all possible distances between all data points
  2. e.g. 5 genes in one dimension
  3. Agglomerative method: start with each datapoint in one cluster
  4. What is the smallest distance in the matrix?
  5. Attach these two points and put their distance on the vertical axis of the dendrogram (the plot of boxes with connecting lines)
  6. Recalculate distance matrix (linkage: distance of newly formed cluster to other data points)
  7. What is the shortest distance
  8. Attach these points and put on vertical
  9. Recalculate distance matrix
  10. Shortest distance, attachment, vertical
  11. etc.
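
The workflow can be sketched for genes in one dimension. This is a minimal illustration, not an efficient implementation: instead of updating a stored distance matrix, it recomputes the cluster-to-cluster distances from the raw points in every round, which gives the same merges. The `linkage` argument is `min` for single linkage and `max` for complete linkage; the function name and data are hypothetical.

```python
def agglomerate(points, linkage=min):
    """Agglomerative clustering of 1-D points.

    points: dict mapping name -> value.
    Returns a list of (merged_cluster, height) in merge order.
    """
    # Start with every data point in its own cluster.
    clusters = {(name,): [value] for name, value in points.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                # Cluster distance: nearest (min) or furthest (max) pair.
                d = linkage(abs(a - b)
                            for a in clusters[keys[i]]
                            for b in clusters[keys[j]])
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        d, left, right = best
        # Attach the two closest clusters; d is the height in the dendrogram.
        clusters[left + right] = clusters.pop(left) + clusters.pop(right)
        merges.append((left + right, d))
    return merges

genes = {"g1": 1.0, "g2": 2.0, "g3": 4.0, "g4": 8.0, "g5": 8.5}
single_heights = [h for _, h in agglomerate(genes, linkage=min)]
complete_heights = [h for _, h in agglomerate(genes, linkage=max)]
```

With single linkage the merges happen at heights 0.5, 1.0, 2.0 and 4.0; with complete linkage the last two merges move up to 3.0 and 7.5, because the furthest points of each cluster now set the distance.
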
37
Q

The height of the branches in hierarchical clustering is proportional to the…

A

distance between genes

38
Q

Distance measures

A

-Euclidean distance
>Always ≥ 0
>Zero for identical profiles
>high for profiles of little similarity
-City Block distance
>large effects in single dimension are dampened
-Pearson Correlation
>for centered (equal mean and sd) data
>correlation coefficients from -1 to 1
>clustering on |r| will put anti-correlated and correlated in one cluster

39
Q

Spearman correlation is more robust against .. than Pearson

A

outliers

40
Q

Single linkage vs complete linkage for recalculating distance matrix in HC

A

-Single linkage: distance between two clusters is that between the nearest points > result: chaining (genes added to clusters one at a time)
-Complete linkage: distance between two clusters is that between the furthest points > result: small compact clusters, not suited for fuzzy (vague) data
-Many other methods like average linkage exist

41
Q

Hierarchical clustering: isomorphism

A

-Horizontal order of genes is non-informative
-Each time a node is drawn, a decision is made where to put it
(vertical distances are informative! > distances)

42
Q

Unsupervised clustering: K-means: why

A

It is interesting to know which genes have similar expression profiles, because these may be involved in similar biological processes

43
Q

K-means clustering groups genes by similarity in expression patterns. Name the algorithm steps

A
  1. Choose K initial cluster centers at random
  2. Partition objects (genes) into k clusters by assigning objects to the closest centroid
  3. Calculate the centroid of each of the k clusters.
  4. Assign each object to cluster i, by first calculating the distance from each object to all cluster centers, choose closest.
  5. If object changes clusters, recalculate the centroids
  6. Repeat until no object changes cluster anymore.
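
A minimal sketch of these steps for genes plotted by their expression in two samples. To keep the example reproducible, the initial centers are passed in explicitly; in a real run they would be chosen at random, as step 1 says. The function name and data are made up for illustration.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Plain K-means on a list of coordinate tuples.

    Returns (assignment, centers), where assignment[i] is the cluster index
    of points[i].
    """
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Assign every object to the closest centroid.
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object moved: converged
        assignment = new_assignment
        # Recalculate the centroid of each cluster.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers

# Each point is one gene: (expression in sample 1, expression in sample 2)
genes = [(1.0, 5.0), (2.0, 6.0), (1.0, 6.0),   # higher in sample 2
         (6.0, 1.0), (5.0, 2.0), (6.0, 2.0)]   # higher in sample 1
assignment, centers = kmeans(genes, initial_centers=[(2.0, 6.0), (5.0, 2.0)])
```

The two resulting clusters separate the genes that are higher in sample 2 from those that are higher in sample 1.
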
44
Q

In K-means clustering, the dots are:

A

genes

45
Q

For K-means clustering, genes are plotted for expression in samples 1 and 2 for example. Per cluster: there is a ….

A

difference between sample 1 and 2

46
Q

We want to see structure or clusters in the data set: why might PCA not work well on very complex data sets?

A

PCA only works well if the first 2 PCs account for most of the variation and clustering in the data; in very complex data sets they often do not

47
Q

tSNE solution for PCA problem

A

tSNE takes high dimensional dataset and reduces it to low dimensional graph, that retains a lot of the original information

48
Q

tSNE is a … method

A

dimension-reduction method

49
Q

Steps tSNE visualization

A
  1. Determine the similarity between all the points in scatter plot
  2. Randomly project the data in the low dimensional space
  3. Determine similarity between all the points on the line
  4. Similarity determination > clustering on similarity matrix (scores)
  5. Move the points (in iterations) so that the similarity matrix of the random projection becomes similar to the first similarity matrix > adjusted points
  6. tSNE moves the points a little bit at a time, each time in the direction that makes the matrix of the random projection more like the original similarity matrix.
50
Q

A method that works similarly to tSNE is

A

UMAP

51
Q

Differences tSNE and UMAP

A

o Interpretation of the distance between objects or clusters
o tSNE preserves local structure in the data
> when different cells lie close to each other, the distance between them is meaningful
> large distances are not accurate in the plot
o UMAP claims to preserve local and most of the global structure in the data

52
Q

With tSNE you cannot interpret the distance between clusters A and B at different ends of the plot: this means

A

You cannot infer that A and B are more dissimilar than A and C, even if C lies closer to A than B does in the plot.
> But: within A, the points closer to each other are more similar objects than those at different ends of cluster A.

53
Q

With UMAP you can interpret

A

-Distances between points within clusters
-Distances between points in different clusters, and between the clusters themselves