13 | DW-3 | tSNE, UMAP Flashcards by Stevie Davies

(QUIZ 6)
t-SNE was invented by ______ in ______

Laurens van der Maaten and Geoffrey Hinton, 2008

(QUIZ 6)
t-SNE means ________________________ .

t-distributed stochastic neighbour embedding

(QUIZ 6)
PCA is a ______ dimensionality reduction technique, whereas t-SNE is a ______ technique.

PCA: linear
t-SNE: non-linear
https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/

(QUIZ 6)
PCA is focused on the ______ structure of the data, whereas t-SNE is focused on the ______ structure.

PCA: global
t-SNE: local
https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/

(QUIZ 6)
PCA is a ______ algorithm whereas t-SNE is ______.
The above means that the results from a data set for ______ are always the same whereas for ______ they might differ in each analysis.

PCA: deterministic
t-SNE: non-deterministic
https://www.geeksforgeeks.org/difference-between-pca-vs-t-sne/

(QUIZ 6)
We cannot preserve variance in ______; instead, we preserve distances, tuned through hyperparameters.
In ______ we decide how much variance to preserve using eigenvalues.

t-SNE
PCA
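
As a rough illustration of choosing how much variance to keep in PCA, the sketch below picks the smallest number of components whose eigenvalues reach a target share of the total. The eigenvalues and the helper name `components_for_variance` are hypothetical, made up for this example.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose eigenvalues
    account for at least `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= target:
            return k
    return len(ordered)


# Hypothetical eigenvalues of a covariance matrix (not from real data).
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
k = components_for_variance(eigenvalues, target=0.85)
print(k)  # the first 3 eigenvalues sum to 7 of 8, i.e. 87.5% of the variance
```

t-SNE has no equivalent knob: its embedding is shaped by hyperparameters such as perplexity rather than by a variance budget.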

What does UMAP stand for?

Uniform Manifold Approximation and Projection

What type of machine learning technique is UMAP?

A nonlinear dimensionality reduction technique

What is UMAP based on?

Topological and geometric principles of manifold learning

What kind of data is UMAP commonly used for?

High-dimensional data, such as transcriptomic, image, and clustering data

What is the main goal of UMAP?

To reduce the dimensionality of high-dimensional data while preserving local and global structure

How does UMAP compare to PCA in terms of aim?

Unlike PCA, which focuses on variance maximization, UMAP seeks to preserve the data’s intrinsic topological structure in a lower-dimensional space

What type of visualization does UMAP aim to provide?

A meaningful, interpretable 2D or 3D projection of complex datasets

What is the theoretical foundation of UMAP?

It is based on the concept of Riemannian geometry and algebraic topology

What does UMAP assume about high-dimensional data?

That it lies on a low-dimensional manifold embedded in a higher-dimensional space

What is the first step in UMAP?

Constructing a weighted graph representation of the data’s local manifold structure

What does UMAP optimize to generate the final embedding?

A low-dimensional graph layout that approximates the original high-dimensional structure

What are the main steps in UMAP’s algorithm?

Construct a k-nearest neighbors (kNN) graph
Apply a fuzzy simplicial set representation
Optimize a low-dimensional embedding that preserves the fuzzy topology
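
The second step can be sketched roughly as follows, assuming the smooth-kNN weighting described in the UMAP paper: each point's neighbour distances become membership weights, and the two directed weights of an edge are merged with a fuzzy union. This is a simplified illustration, not the umap-learn code; the function names and the example distances are hypothetical.

```python
import math


def smooth_knn_weights(distances, n_iter=64):
    """Turn one point's kNN distances into fuzzy membership weights.

    rho is the distance to the nearest neighbour; sigma is found by
    bisection so that the weights sum to log2(k) (simplified here).
    """
    k = len(distances)
    rho = min(distances)
    target = math.log2(k)

    def total(sigma):
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in distances)

    lo, hi = 1e-6, 1e6
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if total(mid) > target:
            hi = mid  # weights sum too high, so sigma must shrink
        else:
            lo = mid
    sigma = (lo + hi) / 2.0
    return [math.exp(-max(0.0, d - rho) / sigma) for d in distances]


def fuzzy_union(w_ij, w_ji):
    """Symmetrise two directed weights: a + b - a*b."""
    return w_ij + w_ji - w_ij * w_ji


# Hypothetical distances from one point to its 4 nearest neighbours.
distances = [0.5, 1.0, 1.5, 2.0]
weights = smooth_knn_weights(distances)
print([round(w, 3) for w in weights])
```

The nearest neighbour always gets weight 1, which is one way the method adapts to local density.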

What role does the k-nearest neighbors (kNN) graph play in UMAP?

It captures the local structure of the data
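
A minimal brute-force sketch of a kNN graph, to show what "local structure" means here. Real implementations typically use approximate nearest-neighbour search; the helper `knn_graph` and the toy points are hypothetical.

```python
import math


def knn_graph(points, k):
    """Map each point index to its k nearest (neighbour_index, distance) pairs."""
    graph = {}
    for i, p in enumerate(points):
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        graph[i] = [(j, d) for d, j in dists[:k]]
    return graph


# Two toy groups: three points near the origin, two far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
graph = knn_graph(points, k=2)
for i, neighbours in graph.items():
    print(i, [j for j, _ in neighbours])
```

Each point is linked only to its closest neighbours, so the graph encodes who is near whom and says little about how far apart distant groups are.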

How does UMAP embed the data in a lower-dimensional space?

By optimizing a cross-entropy loss between the high-dimensional and low-dimensional representations
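
A sketch of that loss, assuming the fuzzy-set cross-entropy form from the UMAP paper, where each edge has a membership strength in the high-dimensional graph and another in the low-dimensional one. The edge weights below are invented for illustration.

```python
import math


def fuzzy_cross_entropy(high, low, eps=1e-12):
    """Cross-entropy between edge memberships.

    `high` and `low` map an edge (i, j) to its membership strength in the
    high- and low-dimensional fuzzy graphs respectively.
    """
    total = 0.0
    for edge, v in high.items():
        # Clamp to (0, 1) so the logarithms stay finite.
        v = min(max(v, eps), 1.0 - eps)
        w = min(max(low.get(edge, 0.0), eps), 1.0 - eps)
        total += v * math.log(v / w) + (1.0 - v) * math.log((1.0 - v) / (1.0 - w))
    return total


# Hypothetical memberships for two edges.
high = {(0, 1): 0.9, (0, 2): 0.1}
close_layout = {(0, 1): 0.8, (0, 2): 0.2}
poor_layout = {(0, 1): 0.1, (0, 2): 0.9}
print(fuzzy_cross_entropy(high, close_layout))
print(fuzzy_cross_entropy(high, poor_layout))
```

The first term pulls strongly connected points together; the second pushes weakly connected points apart.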

What are the key hyperparameters in UMAP?

n_neighbors
min_dist
metric
spread

UMAP
What does n_neighbors control?

The balance between local and global structure preservation

UMAP
What does min_dist affect?

The compactness of clusters in the lower-dimensional space
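
One way to picture this, assuming the piecewise target curve that UMAP's low-dimensional similarity function is fitted to: similarity stays at 1 up to min_dist and then decays at a rate set by spread. The function name `target_similarity` is hypothetical.

```python
import math


def target_similarity(distance, min_dist=0.1, spread=1.0):
    """Flat at 1 up to min_dist, then exponential decay scaled by spread."""
    if distance <= min_dist:
        return 1.0
    return math.exp(-(distance - min_dist) / spread)


for min_dist in (0.1, 0.8):
    row = [round(target_similarity(d, min_dist=min_dist), 3) for d in (0.05, 0.5, 1.0)]
    print(f"min_dist={min_dist}: {row}")
```

With a small min_dist, points must sit very close together to count as fully similar, which tends to give tight clumps; a larger value lets them spread out.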

How does the choice of metric influence UMAP?

It determines the distance function used to define neighborhood relationships
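
A small sketch of why the metric matters: the same query point can have a different nearest neighbour under Euclidean and cosine distance. The points and helper names are made up for the example.

```python
import math


def euclidean(a, b):
    return math.dist(a, b)


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, candidates, metric):
    """Name of the candidate closest to `query` under `metric`."""
    return min(candidates, key=lambda name: metric(query, candidates[name]))


query = (1.0, 1.0)
candidates = {"same_direction": (10.0, 10.0), "close_by": (2.0, 0.0)}
print(nearest(query, candidates, euclidean))
print(nearest(query, candidates, cosine_distance))
```

Because the neighbourhood graph is built from these distances, a different metric can yield a different graph and therefore a different embedding.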

How does UMAP compare to t-SNE in terms of computational efficiency?

UMAP is generally faster and scales better to large datasets

UMAP
Which method preserves more of the global structure, UMAP or t-SNE?

UMAP tends to preserve more of the global structure, while t-SNE focuses on local relationships

How do UMAP and t-SNE handle different perplexity-like parameters?

UMAP uses n_neighbors, while t-SNE uses perplexity, but they serve similar functions
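
The link between the two can be sketched through the definition of perplexity as 2 raised to the Shannon entropy of a point's neighbour distribution, which behaves like an effective neighbour count. The distributions below are invented for illustration.

```python
import math


def perplexity(probabilities):
    """Perplexity 2**H of a discrete distribution (H in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0.0)
    return 2.0 ** entropy


uniform_over_5 = [0.2] * 5                   # five equally weighted neighbours
peaked = [0.9, 0.05, 0.03, 0.01, 0.01]       # one dominant neighbour
print(perplexity(uniform_over_5))
print(perplexity(peaked))
```

A uniform distribution over k neighbours has perplexity k, which is why perplexity is often read as a soft version of n_neighbors.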

What is a major difference in the cost function optimization between UMAP and t-SNE?

t-SNE minimizes the Kullback-Leibler divergence between neighbour probability distributions, whereas UMAP minimizes a cross-entropy between fuzzy topological (graph) representations

What is one major weakness of UMAP?

It can sometimes distort global relationships in favor of preserving local structures

How does UMAP handle density variation across clusters?

It may struggle with variable-density clusters, sometimes overcompressing sparse regions

Why is UMAP not always deterministic?

Because of its reliance on stochastic processes in initialization and optimization
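
A minimal sketch of how fixing a seed removes this run-to-run variation, using a hypothetical random starting layout. This is not the initializer of any particular library.

```python
import random


def random_init(n_points, n_components=2, seed=None):
    """Draw a small random starting layout; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1e-4) for _ in range(n_components)] for _ in range(n_points)]


a = random_init(5, seed=42)
b = random_init(5, seed=42)
c = random_init(5, seed=7)
print(a == b)  # same seed, same starting layout
print(a == c)  # different seed, different starting layout
```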

What challenge does UMAP face with very high-dimensional sparse data?

It may require careful tuning of hyperparameters to avoid misleading embeddings

Why is UMAP commonly used in transcriptomic analysis?

It effectively visualizes complex gene expression patterns in a low-dimensional space

What type of transcriptomic data is UMAP often applied to?

Single-cell RNA sequencing (scRNA-seq) data

How does UMAP help in scRNA-seq analysis?

It reveals cell clusters and differentiation pathways in an intuitive way

What preprocessing steps are typically required before applying UMAP to transcriptomic data?

Normalization
Feature selection
Distance metric selection

Which of the following dimensionality reduction methods use density-aware distances?
- PCA
- ICA
- MDS
- t-SNE
- UMAP

- t-SNE (via the perplexity parameter)
- UMAP (via the k-nearest-neighbour graph and fuzzy simplicial complex)

What does t-SNE stand for?

t-Distributed Stochastic Neighbor Embedding

What is t-SNE used for?

A nonlinear dimensionality reduction technique for high-dimensional data visualization

In what field is t-SNE commonly used?

Single-cell RNA sequencing (scRNA-seq) and other biological data analyses

What problem does t-SNE aim to solve?

The challenge of representing high-dimensional data in 2D or 3D while keeping similar points close together.
It embeds high-dimensional data by preserving local similarity (probabilistic).

What are the main steps in t-SNE?

* Compute pairwise similarities in high-dimensional space
* Define a low-dimensional probability distribution
* Optimize the embedding using gradient descent

How does t-SNE measure similarity in high-dimensional space?

Using a Gaussian distribution around each point
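
A sketch of those Gaussian similarities for a single point. In t-SNE the bandwidth sigma is set per point to match the chosen perplexity; here it is fixed by hand as a simplifying assumption, and the points are made up.

```python
import math


def conditional_similarities(points, i, sigma):
    """p_{j|i}: Gaussian similarity of every other point j to point i,
    normalised so the values sum to 1."""
    weights = {
        j: math.exp(-math.dist(points[i], q) ** 2 / (2.0 * sigma ** 2))
        for j, q in enumerate(points)
        if j != i
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
p = conditional_similarities(points, i=0, sigma=1.0)
for j, value in p.items():
    print(j, f"{value:.4f}")
```

Nearby points receive almost all of the probability mass, while distant points receive next to none.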

How does t-SNE define similarity in low-dimensional space?

Using a Student’s t-distribution with one degree of freedom
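
A sketch of the low-dimensional similarities, assuming the usual kernel (1 + d^2)^-1 normalised over all pairs. The three embedding coordinates are invented.

```python
import math


def low_dim_similarities(embedding):
    """q_ij from a Student-t kernel with one degree of freedom,
    normalised over all ordered pairs i != j."""
    n = len(embedding)
    kernel = {
        (i, j): 1.0 / (1.0 + math.dist(embedding[i], embedding[j]) ** 2)
        for i in range(n)
        for j in range(n)
        if i != j
    }
    total = sum(kernel.values())
    return {pair: value / total for pair, value in kernel.items()}


embedding = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
q = low_dim_similarities(embedding)
for pair in ((0, 1), (1, 2), (0, 2)):
    print(pair, f"{q[pair]:.4f}")
```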

What does the perplexity parameter control in t-SNE?

The balance between local and global structure

How does changing perplexity affect t-SNE results?

* Low perplexity favors local structure
* High perplexity includes more global relationships

t-SNE: Cost Function
What function does t-SNE minimize?

Kullback-Leibler (KL) divergence between high- and low-dimensional probability distributions
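
A small sketch of that cost for pairwise distributions stored as dictionaries. The probabilities are hypothetical and each set sums to 1.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum of p_ij * log(p_ij / q_ij) over pairs with p_ij > 0."""
    return sum(p_ij * math.log(p_ij / q[pair]) for pair, p_ij in p.items() if p_ij > 0.0)


p = {(0, 1): 0.7, (0, 2): 0.2, (1, 2): 0.1}          # high-dimensional similarities
q_good = {(0, 1): 0.6, (0, 2): 0.25, (1, 2): 0.15}   # embedding that roughly matches
q_bad = {(0, 1): 0.1, (0, 2): 0.2, (1, 2): 0.7}      # embedding that separates close points
print(kl_divergence(p, q_good))
print(kl_divergence(p, q_bad))
```

The penalty is largest when a pair with high p is given a low q, that is, when true neighbours end up far apart. This asymmetry is one reason t-SNE favours local structure.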

t-SNE: Cost Function
Why does t-SNE use a Student’s t-distribution in low-dimensional space?

To avoid overcrowding and better separate clusters
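
The effect of the heavier tail can be seen by comparing the two unnormalised kernels at a few distances. This is a simple illustration with unit bandwidth, chosen for convenience.

```python
import math


def gaussian_kernel(d):
    return math.exp(-d * d / 2.0)


def student_t_kernel(d):
    return 1.0 / (1.0 + d * d)


for d in (0.5, 2.0, 5.0):
    print(f"d={d}: gaussian={gaussian_kernel(d):.2e}, student_t={student_t_kernel(d):.2e}")
```

At large distances the Student-t kernel is orders of magnitude larger than the Gaussian, so moderately dissimilar points can be placed far apart in the map without a large penalty, which leaves room for clusters to separate.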

What are the advantages of t-SNE?

* Excellent for visualizing complex datasets
* Captures non-linear relationships
* Good for revealing cluster structure in high-dimensional data

What are the main weaknesses of t-SNE?

* Computationally expensive
* Non-deterministic (results vary across runs)
* Poor at preserving global structure

How does t-SNE handle large datasets?

It struggles with large datasets due to high computational cost

How does t-SNE compare to UMAP in terms of speed?

UMAP is generally faster than t-SNE

Which method preserves global structure better?

UMAP preserves more global structure than t-SNE

Why might someone choose UMAP over t-SNE?

UMAP is generally faster and better at maintaining overall data structure (both methods are stochastic, but reproducible with a fixed seed)

______ prioritises local structure and ignores global distances

t-SNE

______ captures both local and some global structure

UMAP

How are t-SNE and UMAP similar?

- non-linear
- density-aware distances
- random (but can be made deterministic with a fixed seed)
- no variable importance assessment

Which dimensionality reduction technique(s) include(s) variable importance assessment?

PCA, ICA

Which dimensionality reduction technique(s) are or can be made deterministic?

Deterministic:
- PCA
- classical MDS
Can be made deterministic:
- ICA (depends on implementation)
- t-SNE, UMAP (with fixed seeds)
Not deterministic:
- metric/non-metric MDS

Which dimensionality reduction technique(s) capture global structure?

- PCA
- MDS: can be tuned for local or global structure (classical MDS is global)
- UMAP: captures some global structure (but focuses on local)

Which dimensionality reduction technique(s) ignore(s) global distances?

t-SNE

Regarding local/global distances: which dimensionality reduction technique(s) focus(es) on statistically independent directions, i.e. is/are not distance-based?

ICA