13 | DW-3 | tSNE, UMAP Flashcards
(QUIZ 6)
t-SNE was invented by ______ in ______
Laurens van der Maaten and Geoffrey Hinton, 2008
(QUIZ 6)
t-SNE means ________________________ .
t-distributed stochastic neighbour embedding
(QUIZ 6)
PCA is a ______ dimensionality reduction technique, whereas t-SNE is a ______ technique.
PCA: linear
t-SNE: non-linear
https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/
(QUIZ 6)
PCA is focused on the ______ structure of the data, whereas t-SNE is focused on the ______ structure.
PCA: global
t-SNE: local
https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/
(QUIZ 6)
PCA is a ______ algorithm whereas t-SNE is ______.
This means that, for a given data set, the results of ______ are always the same, whereas those of ______ may differ from run to run.
PCA: deterministic
t-SNE: non-deterministic
https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/
(QUIZ 6)
We cannot preserve variance in ______; instead, we preserve distances, controlled through hyperparameters.
In ______, we decide how much variance to preserve using eigenvalues.
t-SNE
PCA
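To make the PCA half of this card concrete, here is a minimal sketch of choosing the number of components from the eigenvalues of the covariance matrix. The toy data, the feature scales, and the 95 % threshold are illustrative assumptions, not part of the original cards.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumed): 200 samples, 3 features with very different spreads.
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

# Centre the data and take the eigenvalues of the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]        # sorted in descending order

# Each eigenvalue is the variance captured by one principal component.
explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches 95 % (assumed threshold).
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(explained_ratio, n_keep)
```

t-SNE has no counterpart to `explained_ratio`: its output axes carry no variance interpretation.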
What does UMAP stand for?
Uniform Manifold Approximation and Projection
What type of machine learning technique is UMAP?
A nonlinear dimensionality reduction technique
What is UMAP based on?
Topological and geometric principles of manifold learning
What kind of data is UMAP commonly used for?
High-dimensional data, such as transcriptomic, image, and clustering data
What is the main goal of UMAP?
To reduce the dimensionality of high-dimensional data while preserving local and global structure
How does UMAP compare to PCA in terms of aim?
Unlike PCA, which focuses on variance maximization, UMAP seeks to preserve the data’s intrinsic topological structure in a lower-dimensional space
What type of visualization does UMAP aim to provide?
A meaningful, interpretable 2D or 3D projection of complex datasets
What is the theoretical foundation of UMAP?
It is based on concepts from Riemannian geometry and algebraic topology
What does UMAP assume about high-dimensional data?
That it lies on a low-dimensional manifold embedded in a higher-dimensional space
What is the first step in UMAP?
Constructing a weighted graph representation of the data’s local manifold structure
What does UMAP optimize to generate the final embedding?
A low-dimensional graph layout that approximates the original high-dimensional structure
What are the main steps in UMAP’s algorithm?
Construct a k-nearest neighbors (kNN) graph
Apply a fuzzy simplicial set representation
Optimize a low-dimensional embedding that preserves the fuzzy topology
What role does the k-nearest neighbors (kNN) graph play in UMAP?
It captures the local structure of the data
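A brute-force sketch of the kNN step, assuming Euclidean distance and a tiny hand-made data set. The helper name `knn_indices` is hypothetical; real UMAP implementations typically use an approximate nearest-neighbour search rather than the full distance matrix shown here.

```python
import numpy as np

def knn_indices(X, k):
    """Return, for each row of X, the indices of its k nearest other rows."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)             # a point is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]

# Two well-separated groups of three points each (assumed toy data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
neighbours = knn_indices(X, k=2)
print(neighbours)
```

Every point's neighbours lie inside its own group, which is the sense in which the graph encodes local structure.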
How does UMAP embed the data in a lower-dimensional space?
By optimizing a cross-entropy loss between the high-dimensional and low-dimensional representations
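The cross-entropy in question compares edge membership strengths in the high-dimensional graph (P) with those in the low-dimensional layout (Q). The sketch below evaluates it for a handful of edges; the function name and the example values are assumptions chosen for illustration.

```python
import numpy as np

def fuzzy_cross_entropy(P, Q, eps=1e-12):
    """Cross-entropy between two sets of edge membership strengths in (0, 1)."""
    P = np.clip(P, eps, 1 - eps)
    Q = np.clip(Q, eps, 1 - eps)
    attract = P * np.log(P / Q)                      # penalises separating true neighbours
    repel = (1 - P) * np.log((1 - P) / (1 - Q))      # penalises joining non-neighbours
    return float(np.sum(attract + repel))

P = np.array([0.9, 0.8, 0.1, 0.05])          # high-dimensional memberships (assumed)
Q_good = np.array([0.85, 0.75, 0.15, 0.05])  # layout that roughly agrees with P
Q_bad = np.array([0.1, 0.2, 0.9, 0.9])       # layout that contradicts P

print(fuzzy_cross_entropy(P, Q_good), fuzzy_cross_entropy(P, Q_bad))
```

The loss is zero when the two sets of memberships agree and grows as the layout contradicts the graph.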
What are the key hyperparameters in UMAP?
n_neighbors
min_dist
metric
spread
UMAP
What does n_neighbors control?
The balance between local and global structure preservation
UMAP
What does min_dist affect?
The compactness of clusters in the lower-dimensional space
How does the choice of metric influence UMAP?
It determines the distance function used to define neighborhood relationships
How does UMAP compare to t-SNE in terms of computational efficiency?
UMAP is generally faster and scales better to large datasets
UMAP
Which method preserves more of the global structure?
UMAP tends to preserve more of the global structure, while t-SNE focuses on local relationships
Which parameters control neighborhood size in UMAP and t-SNE?
UMAP uses n_neighbors and t-SNE uses perplexity; both set the effective size of each point's neighborhood
What is a major difference in the cost function optimization between UMAP and t-SNE?
t-SNE minimizes the Kullback-Leibler divergence between probability distributions, whereas UMAP minimizes the cross-entropy between fuzzy topological representations
What is one major weakness of UMAP?
It can sometimes distort global relationships in favor of preserving local structures
How does UMAP handle density variation across clusters?
It may struggle with variable-density clusters, sometimes overcompressing sparse regions
Why is UMAP not always deterministic?
Because of its reliance on stochastic processes in initialization and optimization
What challenge does UMAP face with very high-dimensional sparse data?
It may require careful tuning of hyperparameters to avoid misleading embeddings
Why is UMAP commonly used in transcriptomic analysis?
It effectively visualizes complex gene expression patterns in a low-dimensional space
What type of transcriptomic data is UMAP often applied to?
Single-cell RNA sequencing (scRNA-seq) data
How does UMAP help in scRNA-seq analysis?
It reveals cell clusters and differentiation pathways in an intuitive way
What preprocessing steps are typically required before applying UMAP to transcriptomic data?
Normalization
Feature selection
Distance metric selection
Which of the following dimensionality reduction methods use density-aware distances?
- PCA
- ICA
- MDS
- t-SNE
- UMAP
- t-SNE (via perplexity parameter)
- UMAP (via k-nearest neighbour graph and fuzzy simplicial complex)
What does t-SNE stand for?
t-Distributed Stochastic Neighbor Embedding
What is t-SNE used for?
A nonlinear dimensionality reduction technique for high-dimensional data visualization
In what field is t-SNE commonly used?
Single-cell RNA sequencing (scRNA-seq) and other biological data analyses
What problem does t-SNE aim to solve?
The challenge of representing high-dimensional data in 2D or 3D while keeping similar points close together
Embeds high-dimensional data by preserving local similarities (probabilistic)
What are the main steps in t-SNE?
- Compute pairwise similarities in high-dimensional space
- Define a low-dimensional probability distribution
- Optimize the embedding using gradient descent
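The three steps above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not a faithful t-SNE: it assumes one fixed Gaussian bandwidth `sigma` for all points (real t-SNE tunes a bandwidth per point from the perplexity), plain gradient descent without momentum or early exaggeration, and made-up toy data. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X):
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_affinities(X, sigma=1.0):
    # Step 1: Gaussian similarities, normalised per point, then symmetrised.
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)
    return (P + P.T) / (2.0 * len(X))

def low_dim_affinities(Y):
    # Step 2: Student-t (one degree of freedom) similarities in the map.
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl(P, Q, eps=1e-12):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

def tiny_tsne(X, n_iter=300, lr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(len(X), 2))   # small random start
    P = high_dim_affinities(X)
    history = []
    for _ in range(n_iter):
        Q, W = low_dim_affinities(Y)
        history.append(kl(P, Q))
        # Step 3: gradient of KL(P || Q) with respect to each map point.
        PQ = (P - Q) * W
        grad = 4.0 * (PQ.sum(axis=1)[:, None] * Y - PQ @ Y)
        Y -= lr * grad
    return Y, history

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(5, 4)),   # cluster A (assumed toy data)
               rng.normal(5.0, 0.5, size=(5, 4))])  # cluster B
Y, history = tiny_tsne(X)
print(history[0], history[-1])
```

The KL divergence recorded in `history` should fall as the two clusters move apart in the 2D map.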
How does t-SNE measure similarity in high-dimensional space?
Using a Gaussian distribution around each point
How does t-SNE define similarity in low-dimensional space?
Using a Student’s t-distribution with one degree of freedom
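For reference, the two similarity definitions can be written out as follows. The notation is an assumption made here: x_i are the high-dimensional points, y_i their low-dimensional images, sigma_i the per-point Gaussian bandwidth, and N the number of points.

```latex
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```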
What does the perplexity parameter control in t-SNE?
The balance between local and global structure
How does changing perplexity affect t-SNE results?
- Low perplexity favors local structure
- High perplexity includes more global relationships
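Perplexity is 2 raised to the Shannon entropy (in bits) of a point's neighbour distribution, so it can be read as an effective number of neighbours. The sketch below shows how a wider Gaussian raises it; the squared distances and bandwidths are assumed values, and the helper names are hypothetical.

```python
import numpy as np

def conditional_probs(sq_dists, sigma):
    """Gaussian similarities from one point to the others, normalised to sum to 1."""
    p = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return p / p.sum()

def perplexity(p):
    p = p[p > 0]
    entropy_bits = -np.sum(p * np.log2(p))
    return float(2.0 ** entropy_bits)

# Squared distances from one point to six others (assumed): three near, three far.
sq_dists = np.array([0.1, 0.2, 0.5, 4.0, 9.0, 16.0])

narrow = perplexity(conditional_probs(sq_dists, sigma=0.3))  # only near points count
wide = perplexity(conditional_probs(sq_dists, sigma=3.0))    # far points count too
print(narrow, wide)
```

t-SNE works in the opposite direction: the user fixes the perplexity, and the algorithm searches for the bandwidth of each point that produces it.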
t-SNE: Cost Function
What function does t-SNE minimize?
Kullback-Leibler (KL) divergence between high- and low-dimensional probability distributions
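A small sketch of the KL divergence itself, using made-up three-element distributions. It shows the two properties that matter for this card: the divergence is zero only when the distributions match, and it is not symmetric in its arguments.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]   # assumed "high-dimensional" distribution
Q = [0.4, 0.4, 0.2]   # assumed "low-dimensional" distribution

print(kl_divergence(P, P), kl_divergence(P, Q), kl_divergence(Q, P))
```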
t-SNE: Cost Function
Why does t-SNE use a Student’s t-distribution in low-dimensional space?
To avoid overcrowding and better separate clusters
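The effect of the heavy tails can be seen by comparing the two unnormalised kernels at a few distances. The Gaussian bandwidth of 1 and the chosen distances are assumptions made for illustration.

```python
import math

def gaussian_kernel(d, sigma=1.0):
    return math.exp(-d ** 2 / (2.0 * sigma ** 2))

def student_t_kernel(d):
    # Student's t with one degree of freedom, up to a constant factor.
    return 1.0 / (1.0 + d ** 2)

for d in (0.5, 1.0, 3.0, 6.0):
    print(f"d={d}: gaussian={gaussian_kernel(d):.2e}, student_t={student_t_kernel(d):.2e}")
```

At large distances the Student's t kernel decays far more slowly than the Gaussian, so moderately dissimilar points can sit much farther apart in the map without a large penalty, which leaves room for clusters to separate.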
What are the advantages of t-SNE?
- Excellent for visualizing complex datasets
- Captures non-linear relationships
- Good for revealing cluster structure in high-dimensional data
What are the main weaknesses of t-SNE?
- Computationally expensive
- Non-deterministic (results vary across runs)
- Poor at preserving global structure
How does t-SNE handle large datasets?
It struggles with large datasets due to high computational cost
How does t-SNE compare to UMAP in terms of speed?
UMAP is generally faster than t-SNE
Which method preserves global structure better?
UMAP preserves more global structure than t-SNE
Why might someone choose UMAP over t-SNE?
UMAP is faster, scales better to large datasets, and tends to maintain more of the overall data structure
______ prioritises local structure and largely ignores global distances
t-SNE
______ captures both local and some global structure
UMAP
How are tSNE and UMAP similar?
- non-linear
- density-aware distances
- random (but can be made deterministic with fixed seed)
- NO variable importance assessment
Which dimensionality reduction technique(s) include(s) variable importance assessment?
PCA, ICA
Which dimensionality reduction technique(s) are or can be made deterministic?
is deterministic:
- PCA
- classical MDS
can be made deterministic:
- ICA - depends on implementation
- tSNE, UMAP: can be made deterministic with fixed seeds
not deterministic:
- metric/non-metric MDS
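Why a fixed seed helps: the randomness in methods such as t-SNE and UMAP enters through steps like the random starting layout, and seeding the generator makes that step repeatable. The sketch below illustrates only this idea with a hypothetical `random_init` helper; in practice, scikit-learn's `TSNE` and umap-learn's `UMAP` take the seed through a `random_state` parameter.

```python
import numpy as np

def random_init(n_points, n_components=2, seed=None):
    """Random starting layout for an embedding (illustrative helper)."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=1e-4, size=(n_points, n_components))

a = random_init(5, seed=42)
b = random_init(5, seed=42)   # same seed: identical starting layout
c = random_init(5, seed=7)    # different seed: different starting layout

print(np.array_equal(a, b), np.array_equal(a, c))
```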
Which dimensionality reduction technique(s) capture global structure?
- PCA
- MDS can be tuned for local or global (classical MDS - global)
- UMAP: captures some global structure (but focuses on local)
Which dimensionality reduction technique(s) ignore(s) global distances?
tSNE
re local/global distances:
Which dimensionality reduction technique(s) focus(es) on statistically independent directions, i.e. is/are not distance-based?
ICA