Lecture 5 Flashcards
High-Dimensional Visualization
What are the main goals of high-dimensional data visualization?
Discover patterns in data.
Simplify complex datasets for easier interpretation.
Reduce dimensionality while preserving meaningful information.
What is a heatmap, and why is it useful?
A heatmap is a color-coded image representation of a data matrix.
It visualizes patterns in a data matrix and remains readable for up to roughly 1,000 rows/columns.
Example: Visualizing a subset of the mtcars dataset.
What is the function of pheatmap in R?
A library for creating heatmaps.
Supports features like clustering rows and columns.
Example usage:
pheatmap(mat, cluster_rows=TRUE, cluster_cols=TRUE, scale="column")
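A runnable sketch tying this to the mtcars example above; the column subset is an illustrative assumption, not from the lecture:
library(pheatmap)                 # provides pheatmap()
mat <- as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt")])  # numeric subset
pheatmap(mat,
         cluster_rows = TRUE,     # reorder cars by similarity
         cluster_cols = TRUE,     # reorder variables by similarity
         scale = "column")        # z-score each column before coloring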
What is the difference between centering and scaling variables?
Centering: Subtract the mean from each value.
Scaling (Z-score normalization): Center the data and divide by the standard deviation.
Purpose: Brings variables to a common scale, improving visualization and numerical stability (see the sketch below).
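A quick sketch using base R's scale(); the hp column of mtcars is an arbitrary example:
x <- mtcars$hp                                      # raw horsepower values
centered <- scale(x, center = TRUE, scale = FALSE)  # centering only
zscored  <- scale(x, center = TRUE, scale = TRUE)   # centering + scaling (z-scores)
c(mean(zscored), sd(zscored))                       # approximately 0 and 1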
What is clustering, and what are its main types?
Clustering groups observations or variables by similarity.
Two main types:
K-Means Clustering (predefined number of clusters).
Hierarchical Clustering (produces nested clusters visualized as a dendrogram).
Example: Grouping cars in the mtcars dataset by performance and weight.
How does K-means clustering work?
Choose K initial centroids.
Assign each observation to its closest centroid.
Update centroids by averaging observations in each cluster.
Repeat until the centroids stabilize.
Example in R:
k <- 2                                     # number of clusters
X <- scale(mat)                            # center and scale the variables first
km <- kmeans(X, centers = k, nstart = 20)  # 20 random restarts, keep the best
km$cluster                                 # cluster assignment per observation
What are the limitations of K-means clustering?
Sensitive to initialization: different starting centroids can yield different results (see the demo below).
Assumes clusters are isotropic, have similar variance, and are of similar size.
Requires the number of clusters K to be predefined.
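A small demo of the initialization issue; the seeds and K = 3 are arbitrary assumptions:
X <- scale(as.matrix(mtcars))
set.seed(1); km1 <- kmeans(X, centers = 3, nstart = 1)  # single random start
set.seed(2); km2 <- kmeans(X, centers = 3, nstart = 1)  # different random start
c(km1$tot.withinss, km2$tot.withinss)  # within-cluster SS can differ between runs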
What is hierarchical clustering, and how does it differ from K-means?
Hierarchical clustering: Produces a dendrogram showing nested clusters.
No need to predefine the number of clusters.
Time complexity is at least quadratic in the number of observations (all pairwise distances are needed), compared to K-means’ roughly linear cost per iteration.
R Implementation:
d <- dist(X) # Compute pairwise distances
hc <- hclust(d, method = "complete") # Perform clustering
plot(hc) # Plot dendrogram
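Because the dendrogram contains every granularity, the number of clusters can be chosen afterwards with cutree(); k = 2 here is an assumed choice:
clusters <- cutree(hc, k = 2)  # cut the tree into 2 flat clusters
table(clusters)                # cluster sizes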
What is the Rand Index used for?
Measures similarity between two clustering results.
Values range from 0 (no agreement) to 1 (identical clusterings).
Example Calculation:
For two partitions of the elements {A, B, C, D}, there are 6 pairs; a pair agrees if it is grouped together in both partitions or separated in both. If 3 of the 6 pairs agree:
R = (3 agreements) / (6 total pairs) = 0.5
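A minimal R sketch (rand_index is a hypothetical helper, not a built-in function); the two label vectors encode example partitions of four elements:
rand_index <- function(a, b) {
  pairs <- combn(length(a), 2)              # all pairs of element indices
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]  # together in partition a?
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]  # together in partition b?
  mean(same_a == same_b)                    # fraction of agreeing pairs
}
rand_index(c(1, 1, 2, 2), c(1, 2, 2, 2))    # 3 of 6 pairs agree -> 0.5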
What is Principal Component Analysis (PCA)?
A dimensionality reduction technique.
Projects high-dimensional data onto a lower-dimensional subspace.
Retains the maximum variance in fewer dimensions.
How is the first principal component (PC1) defined?
PC1 is the direction that maximizes the variance of the projected data.
It is found by solving:
max_w Σᵢ (wᵀxᵢ)²  subject to  ||w|| = 1   (for centered data xᵢ)
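A quick numerical check of this definition, with mtcars as assumed example data: the projection onto PC1 has larger variance than a projection onto a random unit direction.
X <- scale(as.matrix(mtcars))                 # centered, scaled data
w <- prcomp(X)$rotation[, 1]                  # PC1 direction, ||w|| = 1
u <- rnorm(ncol(X)); u <- u / sqrt(sum(u^2))  # random unit direction
c(var(X %*% w), var(X %*% u))                 # PC1's projected variance is larger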
What are the key properties of PCA?
Principal components are orthogonal, so their scores are uncorrelated (checked in the sketch below).
The first component (PC1) captures the most variance; each subsequent component captures the most remaining variance.
The best-fit subspaces are nested: span(PC1) ⊆ span(PC1, PC2) ⊆ ….
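A sketch checking the orthogonality property on mtcars (assumed example data):
pca <- prcomp(scale(as.matrix(mtcars)))
round(cor(pca$x), 2)  # ≈ identity matrix: distinct PC scores are uncorrelated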
How is PCA performed in R?
pca_res <- prcomp(mat, center=TRUE, scale.=TRUE) # Center and scale, then compute PCs
summary(pca_res) # View explained variance
biplot(pca_res) # Visualize projection
What is the “scree plot” in PCA?
A plot showing the variance explained by each principal component.
Helps decide how many components to retain (look for the “elbow”).
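A scree-plot sketch, assuming pca_res from the prcomp() call above:
var_explained <- pca_res$sdev^2 / sum(pca_res$sdev^2)  # variance share per PC
plot(var_explained, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")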
What are some limitations of PCA, and what are nonlinear alternatives?
Limitations: PCA assumes linear relationships. It may fail when data lies on a curved surface (e.g., parabola).
Alternatives:
Kernel PCA
t-SNE (sketched below)
UMAP
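A minimal t-SNE sketch; it assumes the CRAN package Rtsne is installed, and the perplexity value is an arbitrary choice for a dataset of 32 cars:
library(Rtsne)
X <- scale(as.matrix(mtcars))
set.seed(1)                       # t-SNE is stochastic
tsne <- Rtsne(X, perplexity = 5)  # perplexity must satisfy 3*perplexity < n - 1
plot(tsne$Y, xlab = "t-SNE 1", ylab = "t-SNE 2")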
What are the key takeaways from clustering and PCA?
Clustering and PCA are unsupervised learning techniques.
Both are exploratory tools to derive hypotheses.
Results are subjective and require independent validation.