Lecture 5 Flashcards

High-Dimensional Visualization

1
Q

What are the main goals of high-dimensional data visualization?

A

Discover patterns in data.
Simplify complex datasets for easier interpretation.
Reduce dimensionality while preserving meaningful information.

2
Q

What is a heatmap, and why is it useful?

A

A heatmap is a color-coded image representation of a data matrix.
It visualizes patterns in data matrices, suitable for datasets with up to 1,000 rows/columns.

Example: Visualizing a subset of the mtcars dataset.
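The example above can be sketched with base R's built-in `heatmap()` function on the built-in mtcars dataset; the rows and columns chosen here are just an illustrative selection, not from the lecture.

```r
# Heatmap of a subset of mtcars: 15 cars, 4 numeric variables.
mat <- as.matrix(mtcars[1:15, c("mpg", "disp", "hp", "wt")])

# scale = "column" standardizes each column so variables with
# large units (e.g. disp) do not dominate the color scale.
heatmap(mat, scale = "column")
```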

3
Q

What is the function of pheatmap in R?

A

An R package for creating heatmaps.
Supports features like clustering rows and columns and per-row/column scaling.

Example usage:
pheatmap(mat, cluster_rows = TRUE, cluster_cols = TRUE, scale = "column")

4
Q

What is the difference between centering and scaling variables?

A

Centering: Subtract the mean from each value.
Scaling (Z-score normalization): Center the data and divide by the standard deviation.
Purpose: Brings variables to a common scale, improving visualization and numerical stability.
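Both operations correspond to R's `scale()` function; a minimal sketch on a toy vector:

```r
x <- c(2, 4, 6, 8)

centered <- x - mean(x)  # centering: subtract the mean
z <- centered / sd(x)    # scaling: divide by the standard deviation

# scale(x) performs both steps in one call (it returns a matrix,
# hence as.numeric() for the comparison):
all.equal(as.numeric(scale(x)), z)  # TRUE
```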

5
Q

What is clustering, and what are its main types?

A

Clustering groups observations or variables by similarity.
Two main types:
K-Means Clustering (predefined number of clusters).
Hierarchical Clustering (produces nested clusters visualized as a dendrogram).

Example: Grouping cars in the mtcars dataset by performance and weight.

6
Q

How does K-means clustering work?

A

Choose K initial centroids.
Assign each observation to its closest centroid.
Update centroids by averaging observations in each cluster.
Repeat until the centroids stabilize.

Example in R:
k <- 2                                # number of clusters
X <- scale(mat)                       # center and scale the variables first
kmeans(X, centers = k, nstart = 20)   # 20 random starts to avoid poor local optima

7
Q

What are the limitations of K-means clustering?

A

Sensitive to initialization (different results with different centroids).
Assumes clusters are isotropic, have similar variance, and are of similar size.
Requires the number of clusters K to be predefined.

8
Q

What is hierarchical clustering, and how does it differ from K-means?

A

Hierarchical clustering: Produces a dendrogram showing nested clusters.
No need to predefine the number of clusters.
Time complexity is at least quadratic in the number of observations (the pairwise distance matrix alone is O(n²)), compared to K-means’ roughly linear cost per iteration.

R Implementation:
d <- dist(X) # Compute pairwise distances
hc <- hclust(d, method = "complete") # Perform clustering
plot(hc) # Plot dendrogram

9
Q

What is the Rand Index used for?

A

Measures similarity between two clustering results (partitions).
A pair of elements “agrees” when both partitions place the pair in the same cluster, or both place it in different clusters.
Values range from 0 (no agreement) to 1 (identical partitions).

Example Calculation:
For two partitions of the elements {A, B, C, D}, there are 6 pairs; if 3 of them agree:

R = (3 agreements) / (6 total pairs) = 0.5
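The pair-counting definition can be sketched directly in R; the function name `rand_index` is ours for illustration, not from a standard package:

```r
# Rand index: fraction of element pairs on which two partitions agree.
# a and b are cluster-label vectors of equal length.
rand_index <- function(a, b) {
  n <- length(a)
  agree <- 0
  pairs <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      same_a <- a[i] == a[j]   # pair together in partition a?
      same_b <- b[i] == b[j]   # pair together in partition b?
      if (same_a == same_b) agree <- agree + 1
      pairs <- pairs + 1
    }
  }
  agree / pairs
}

rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))  # 2 agreements out of 6 pairs
```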

10
Q

What is Principal Component Analysis (PCA)?

A

A dimensionality reduction technique.
Projects high-dimensional data onto a lower-dimensional subspace.
Retains the maximum variance in fewer dimensions.

11
Q

How is the first principal component (PC1) defined?

A

PC1 is the direction that maximizes the variance of the projected data.
It is found by solving:

max_w Σᵢ (wᵀxᵢ)² subject to ||w|| = 1

where the xᵢ are the (centered) data points and w is the projection direction.

12
Q

What are the key properties of PCA?

A

Principal components are orthogonal (uncorrelated).
The first component (PC1) captures the most variance.
PCA subspaces are nested: the best 1-D subspace (spanned by PC1) lies inside the best 2-D subspace (spanned by PC1 and PC2), and so on.

13
Q

How is PCA performed in R?

A

pca_res <- prcomp(mat, center=TRUE, scale.=TRUE)
summary(pca_res) # View explained variance
biplot(pca_res) # Visualize projection

14
Q

What is the “scree plot” in PCA?

A

A plot showing the variance explained by each principal component.
Helps decide how many components to retain (look for the “elbow”).
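A scree plot can be drawn with base R's `screeplot()` on `prcomp` output; this sketch uses mtcars as a stand-in dataset:

```r
# PCA on the built-in mtcars data, then a scree plot of the
# variance explained by each component.
pca_res <- prcomp(mtcars, center = TRUE, scale. = TRUE)
screeplot(pca_res, type = "lines", main = "Scree plot")

# Proportion of variance explained per component (sums to 1):
pve <- pca_res$sdev^2 / sum(pca_res$sdev^2)
pve
```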

15
Q

What are some limitations of PCA, and what are nonlinear alternatives?

A

Limitations: PCA assumes linear relationships. It may fail when data lies on a curved surface (e.g., parabola).
Alternatives:
Kernel PCA
t-SNE
UMAP

16
Q

What are the key takeaways from clustering and PCA?

A

Clustering and PCA are unsupervised learning techniques.
Both are exploratory tools to derive hypotheses.
Results are subjective and require independent validation.