single cell transcriptomics Flashcards
1
Q
heterogeneity in cell populations
A
- cell types
- somatic mutations
- cell cycle stage
- epigenetic modifications
- stochastic gene expression
2
Q
limitations of bulk assays
A
- assuming homogeneous relationships can lead you to the wrong conclusion
- rare cell types can become lost
- can’t see real time changes
- need to order by differentiation progress, not time
3
Q
process of SCT
A
- isolate cells
- lyse
- reverse transcribe and amplify cDNA
- qPCR or RNAseq
- up to 10,000s of genes in 10,000s of cells
4
Q
methods of single cell isolation
A
- low throughput:
- manual/automated micropipetting
- cytoplasmic aspiration
- high throughput:
- FACS
- microfluidics
5
Q
qPCR
A
- quantitative/real-time PCR
- gene specific PCR primers
- include housekeeping genes (GAPDH)
- fluorescent dye to detect PCR product
- measure Ct value for each gene
- threshold cycle number
- normalise data
6
Q
qPCR normalisation
A
- higher Ct means less cDNA
- arbitrary maximum Ct value
- calculate ΔCt for each gene
- max - gene
- higher Δ means more cDNA
- normalise with hk genes
- assume hk expression constant
- calculate gene ΔCt - hk ΔCt
- product doubles each cycle, so Ct is on a log2 scale: subtract, don't divide
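A minimal sketch of the ΔCt steps above. The Ct values, gene names other than GAPDH, and the maximum Ct of 40 are made up for illustration; converting to a fold difference with 2^x assumes the product doubles perfectly every cycle.

```python
MAX_CT = 40.0  # arbitrary maximum Ct (assumed cut-off, not from the card)

# Hypothetical Ct values for one cell
ct = {"GAPDH": 18.0, "geneA": 24.0, "geneB": 30.0}


def delta_ct(ct_value, max_ct=MAX_CT):
    """Delta Ct = max Ct - gene Ct, so a higher value means more cDNA."""
    return max_ct - ct_value


def normalised_expression(ct_values, hk_gene="GAPDH"):
    """Subtract the housekeeping delta Ct from each gene's delta Ct."""
    hk = delta_ct(ct_values[hk_gene])
    return {g: delta_ct(v) - hk for g, v in ct_values.items() if g != hk_gene}


norm = normalised_expression(ct)
# geneA: (40 - 24) - (40 - 18) = -6, geneB: (40 - 30) - (40 - 18) = -12

# Values are on a log2 scale; 2 ** value gives expression relative to GAPDH
fold = {g: 2 ** v for g, v in norm.items()}
```

The maximum Ct cancels in the subtraction, so the normalised value does not depend on which arbitrary maximum is chosen.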
7
Q
RNAseq
A
- sequence cDNA library
- map reads to reference
- count read number for each gene
- need quality control
- can have coverage bias (5’/3’) in some protocols
8
Q
technical dropouts
A
- zero counts
- common
- when some mRNA not captured during reverse transcription
- capture efficiency:
- % of mRNA molecules in cell lysate detected
- often 10-20%
- more frequent in low expression genes
- varies between cells
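A rough sketch of why dropouts hit low expression genes hardest. It assumes each mRNA molecule is captured independently with the same probability, which is a simplification and not something the card states.

```python
def dropout_probability(n_molecules, capture_efficiency):
    """Chance that none of n_molecules is captured (zero count).

    Assumes independent capture of each molecule with equal probability.
    """
    return (1.0 - capture_efficiency) ** n_molecules


# 10% capture efficiency, genes with 1 to 100 mRNA copies in the cell
for n in (1, 5, 20, 100):
    print(n, dropout_probability(n, 0.10))
```

Under this assumption a gene with 1 copy drops out 90% of the time, while a gene with 100 copies almost never does.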
9
Q
RNAseq normalisation
A
- convert raw read counts into expression levels per cell
- correct for cell to cell variation
- in capture, amplification, sequencing efficiency
- method depends on protocol used
- spike in or UMI
10
Q
extrinsic spike-ins
A
- add RNA of known sequence and quantity to lysate
- internal control
- equal quantity in each lysate
- normalise counts by number of reads mapped to spike-in RNA
- assumes same capture, amplification and sequencing efficiencies
- be cautious:
- no 5’ cap or polyA tail
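A minimal sketch of spike-in normalisation. The cell names, gene names and read counts are hypothetical; scaling by the mean spike-in count is one possible choice of size factor, not the only one.

```python
# Hypothetical data: reads mapped to spike-in RNA and to genes, per cell
cells = {
    "cell1": {"spike": 1000, "genes": {"geneA": 200, "geneB": 50}},
    "cell2": {"spike": 500, "genes": {"geneA": 100, "geneB": 50}},
}


def spike_in_normalise(cells):
    """Divide gene counts by a per-cell size factor from spike-in reads."""
    mean_spike = sum(c["spike"] for c in cells.values()) / len(cells)
    normalised = {}
    for name, cell in cells.items():
        # Equal spike-in RNA was added to every lysate, so more spike-in
        # reads means this cell was captured/amplified/sequenced more
        size_factor = cell["spike"] / mean_spike
        normalised[name] = {g: n / size_factor for g, n in cell["genes"].items()}
    return normalised


norm = spike_in_normalise(cells)
```

Here geneA has twice the raw reads in cell1, but cell1 also has twice the spike-in reads, so after normalisation geneA is equal in both cells.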
11
Q
UMIs
A
- unique molecular identifiers
- barcode on each cDNA molecule
- 6-10 nt added before amplification
- track how much of amplified DNA comes from each original molecule
- count number of unique UMIs associated with each gene
- assume library sequenced to saturation
- corrects for variation in amplification efficiency but not other sources
- e.g. reverse transcription
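A minimal sketch of UMI counting. The reads below are hypothetical (gene, UMI) pairs; real pipelines also have to handle sequencing errors within the UMI, which is ignored here.

```python
from collections import defaultdict

# Hypothetical mapped reads as (gene, UMI) pairs after amplification
reads = [
    ("geneA", "ACGTAC"), ("geneA", "ACGTAC"), ("geneA", "ACGTAC"),
    ("geneA", "TTGCAA"),
    ("geneB", "GGATCC"), ("geneB", "GGATCC"),
]


def umi_counts(reads):
    """Count distinct UMIs per gene; PCR duplicates collapse to one."""
    umis = defaultdict(set)
    for gene, umi in reads:
        umis[gene].add(umi)
    return {gene: len(seen) for gene, seen in umis.items()}


counts = umi_counts(reads)
```

geneA has 4 reads but only 2 distinct UMIs, so it is counted as 2 original molecules.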
12
Q
normalisation without spike in or UMI
A
- same as used by bulk RNAseq data
- assume hk gene expression or total mRNA content the same
- normalise read counts by hk expression/total mRNA
- can also combine techniques
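A minimal sketch of normalising by total read count, assuming total mRNA content is the same in every cell. Counts per million is used as the scale here as one common convention; the gene names and counts are hypothetical.

```python
def counts_per_million(counts):
    """Scale one cell's read counts so they sum to one million."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


# Same relative expression, but cell2 was sequenced 10x deeper
cell1 = {"geneA": 30, "geneB": 10, "geneC": 60}
cell2 = {"geneA": 300, "geneB": 100, "geneC": 600}

cpm1 = counts_per_million(cell1)
cpm2 = counts_per_million(cell2)
```

After scaling, both cells give identical values, so the difference in sequencing depth is removed.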
13
Q
SC data analysis techniques
A
- clustering
- dimensionality reduction
- differential expression
- pseudotemporal ordering
- network inference
14
Q
single cell clustering
A
- cluster by transcriptomic profile to:
- analyse sub-population structure
- identify cell sub-types/rare cell types
- cluster by cell expression states to:
- identify co-varying genes
15
Q
SC clustering methods
A
- partitional
- produces disjoint groups
- k-means
- hierarchical clustering
- divisive or agglomerative
- hierarchical tree
- can provide more information
16
Q
k-means clustering
A
- algorithms putting data points into k clusters
- assign each point to the cluster with the nearest mean (centroid)
- have to choose k
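A minimal sketch of k-means (Lloyd's algorithm) in plain Python. The 2D points are made up, and taking the first k points as starting centroids is a naive choice used only to keep the example deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=20):
    """Alternate between assigning points and recomputing cluster means."""
    centroids = [points[i] for i in range(k)]  # naive initialisation
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Hypothetical cells described by 2 expression values each
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
labels, centroids = kmeans(points, k=2)
```

The result depends on k and on the starting centroids, which is why k has to be chosen and why runs are often repeated with different starts.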
17
Q
bi-clustering
A
- allows clustering by genes and cells simultaneously
- find genes that behave similarly within a cell cluster
- can give better resolution
- only some subsets are informative
18
Q
dimensionality reduction
A
- transform to lower dimensional space e.g. 3D to 2D
- easier to visualise data and detect patterns
- often before clustering
- distance can behave non-intuitively in high dimensions
- many algorithms with different assumptions
- PCA
19
Q
PCA
A
- linear transformation of the data into uncorrelated principal components
- PCs:
- orthogonal
- ordered by contribution to variance in data
- weighted sum of original dimensions
- PC1:
- vector through dataset giving largest amount of variation
- PC2 gives second largest
- shows which genes contribute most to heterogeneity
- can combine with clustering to analyse distinct populations
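A minimal sketch of PCA via singular value decomposition, assuming NumPy is installed. The expression matrix is made up, with rows as cells and columns as genes.

```python
import numpy as np


def pca(X, n_components=2):
    """Return cell scores, gene loadings and explained variance fractions."""
    Xc = X - X.mean(axis=0)  # centre each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)  # fraction of variance per PC
    components = Vt[:n_components]       # rows are orthogonal PCs
    scores = Xc @ components.T           # cells in PC space
    return scores, components, explained[:n_components]


# Hypothetical data: 5 cells x 3 genes
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.1, 0.1],
    [3.0, 5.9, 0.0],
    [4.0, 8.0, 0.1],
    [5.0, 10.1, 0.0],
])
scores, components, explained = pca(X)

# Gene with the largest absolute weight on PC1
top_gene = int(np.argmax(np.abs(components[0])))
```

Each row of `components` holds the weights of the original genes in that PC, so large absolute weights on PC1 point to the genes contributing most to heterogeneity.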
20
Q
differential expression
A
- aim to detect difference in gene expression levels or distribution between 2 cell populations
- statistical tests
- T-test, Mann-Whitney
- specialised methods needed for noise/dropouts
- more noise in SC than bulk
- multiple testing corrections
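A minimal sketch of one multiple testing correction. Benjamini-Hochberg is used here as an assumption, since the card does not name a method; the p-values are made up and would normally come from one test per gene.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for 4 genes
p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
significant = [i for i, value in enumerate(q) if value < 0.05]
```

With thousands of genes tested at once, some small p-values appear by chance, so the adjusted values are compared to the threshold instead of the raw ones.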
21
Q
GO enrichment
A
- gene ontology
- describes gene functions and relationships between them
- molecular function, cellular component, biological process
- identify terms over/underrepresented in a given set of genes
- select input list of genes of interest
- DE genes, bi-clustering genes
22
Q
GO output
A
- calculate probability of seeing observed sample frequency by chance given the background frequency for each term
- sample frequency = no of genes annotated to that term in the input
- background frequency = no of genes annotated to a term in the background set
- background set = all genes in the genome
- identify which terms appear more frequently than expected
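A minimal sketch of the probability calculation for one term. A one-sided hypergeometric test is assumed here as a common choice; the card does not say which test is used, and the gene numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_background_term, n_sample, n_sample_term):
    """P(at least n_sample_term annotated genes in the sample by chance)."""
    upper = min(n_background_term, n_sample)
    total = comb(n_background, n_sample)
    favourable = sum(
        comb(n_background_term, k)
        * comb(n_background - n_background_term, n_sample - k)
        for k in range(n_sample_term, upper + 1)
    )
    return favourable / total


# Hypothetical: 20,000 background genes, 200 annotated to the term;
# input list of 100 genes, 10 of them annotated to the term
p = enrichment_p_value(20000, 200, 100, 10)
```

By chance about 1 of the 100 input genes would carry the term, so seeing 10 gives a very small p-value and the term counts as overrepresented.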
23
Q
pseudotemporal ordering
A
- aim to infer gene expression dynamics from snapshot data
- true temporal data unavailable
- measurement destroys cells by lysis
- cells ordered by progress through a biological process
- differentiation, response to stimuli
- sampling time may not correlate well with stages
- asynchrony
- assumes cells follow same response or differentiation path
- monocle algorithm
24
Q
gene regulatory network inference
A
- aim to construct a network graph where nodes represent genes and edges indicate regulatory interactions
- assumes that a strong statistical relationship between 2 gene expression profiles indicates a potential functional relationship
- correlation vs mutual information
- linked groups of genes indicate that they undergo coordinated changes in expression
- some correlations may be due to confounding factors e.g. cell cycle
- may need to choose subpopulations
25
Q
correlation vs mutual information
A
- correlation
- most common measure for strength of a statistical relationship
- mutual information
- alternative that can identify non-linear relationships
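A minimal sketch comparing the two measures on a made-up non-linear relationship (y = x squared). The mutual information function works on discrete values; continuous expression data would first need binning, which is not shown.

```python
from collections import Counter
from math import log2, sqrt


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mutual_information(x, y):
    """Mutual information in bits between two discrete variables."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


# Hypothetical expression levels: gene y depends on gene x, but not linearly
x = [-2, -1, 0, 1, 2] * 4
y = [v * v for v in x]

r = pearson(x, y)
mi = mutual_information(x, y)
```

The correlation is zero even though y is fully determined by x, while the mutual information is clearly above zero.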