Lecture 5 Flashcards
High-Dimensional Visualization
What are the main goals of high-dimensional data visualization?
Discover patterns in data.
Simplify complex datasets for easier interpretation.
Reduce dimensionality while preserving meaningful information.
What is a heatmap, and why is it useful?
A heatmap is a color-coded image representation of a data matrix.
It visualizes patterns in a data matrix and remains readable for up to roughly 1,000 rows/columns.
Example: Visualizing a subset of the mtcars dataset.
What is the function of pheatmap in R?
A library for creating heatmaps.
Supports features like clustering rows and columns.
Example usage:
pheatmap(mat, cluster_rows=TRUE, cluster_cols=TRUE, scale="column")
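A runnable sketch tying this to the mtcars example above; the column subset is an illustrative assumption, not from the lecture:
library(pheatmap)                 # provides pheatmap()
mat <- as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt")])  # numeric subset
pheatmap(mat,
         cluster_rows = TRUE,     # reorder cars by similarity
         cluster_cols = TRUE,     # reorder variables by similarity
         scale = "column")        # z-score each column before coloring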
What is the difference between centering and scaling variables?
Centering: Subtract the mean from each value.
Scaling (Z-score normalization): Center the data and divide by the standard deviation.
Purpose: Brings variables to a common scale, improving visualization and numerical stability (see the sketch below).
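A quick sketch using base R's scale(); the hp column of mtcars is an arbitrary example:
x <- mtcars$hp                                      # raw horsepower values
centered <- scale(x, center = TRUE, scale = FALSE)  # centering only
zscored  <- scale(x, center = TRUE, scale = TRUE)   # centering + scaling (z-scores)
c(mean(zscored), sd(zscored))                       # approximately 0 and 1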
What is clustering, and what are its main types?
Clustering groups observations or variables by similarity.
Two main types:
K-Means Clustering (predefined number of clusters).
Hierarchical Clustering (produces nested clusters visualized as a dendrogram).
Example: Grouping cars in the mtcars dataset by performance and weight.
How does K-means clustering work?
Choose K initial centroids.
Assign each observation to its closest centroid.
Update centroids by averaging observations in each cluster.
Repeat until the centroids stabilize.
Example in R:
k <- 2                                     # number of clusters
X <- scale(mat)                            # center and scale the variables first
km <- kmeans(X, centers = k, nstart = 20)  # 20 random restarts, keep the best
km$cluster                                 # cluster assignment per observation
What are the limitations of K-means clustering?
Sensitive to initialization: different starting centroids can yield different results (see the demo below).
Assumes clusters are isotropic, have similar variance, and are of similar size.
Requires the number of clusters K to be predefined.
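A small demo of the initialization issue; the seeds and K = 3 are arbitrary assumptions:
X <- scale(as.matrix(mtcars))
set.seed(1); km1 <- kmeans(X, centers = 3, nstart = 1)  # single random start
set.seed(2); km2 <- kmeans(X, centers = 3, nstart = 1)  # different random start
c(km1$tot.withinss, km2$tot.withinss)  # within-cluster SS can differ between runs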
What is hierarchical clustering, and how does it differ from K-means?
Hierarchical clustering: Produces a dendrogram showing nested clusters.
No need to predefine the number of clusters.
Time complexity is at least quadratic in the number of observations (all pairwise distances are needed), compared to K-means’ roughly linear cost per iteration.
R Implementation:
d <- dist(X) # Compute pairwise distances
hc <- hclust(d, method = "complete") # Perform clustering
plot(hc) # Plot dendrogram
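Because the dendrogram contains every granularity, the number of clusters can be chosen afterwards with cutree(); k = 2 here is an assumed choice:
clusters <- cutree(hc, k = 2)  # cut the tree into 2 flat clusters
table(clusters)                # cluster sizes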
What is the Rand Index used for?
Measures similarity between two clustering results.
Values range from 0 (no agreement) to 1 (identical clusterings).
Example Calculation:
For two partitions of the elements {A, B, C, D}, there are 6 pairs; a pair agrees if it is grouped together in both partitions or separated in both. If 3 of the 6 pairs agree:
R = (3 agreements) / (6 total pairs) = 0.5
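A minimal R sketch (rand_index is a hypothetical helper, not a built-in function); the two label vectors encode example partitions of four elements:
rand_index <- function(a, b) {
  pairs <- combn(length(a), 2)              # all pairs of element indices
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]  # together in partition a?
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]  # together in partition b?
  mean(same_a == same_b)                    # fraction of agreeing pairs
}
rand_index(c(1, 1, 2, 2), c(1, 2, 2, 2))    # 3 of 6 pairs agree -> 0.5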
What is Principal Component Analysis (PCA)?
A dimensionality reduction technique.
Projects high-dimensional data onto a lower-dimensional subspace.
Retains the maximum variance in fewer dimensions.
How is the first principal component (PC1) defined?
PC1 is the direction that maximizes the variance of the projected data.
It is found by solving:
max_w Σᵢ (wᵀxᵢ)²  subject to  ||w|| = 1   (for centered data xᵢ)
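A quick numerical check of this definition, with mtcars as assumed example data: the projection onto PC1 has larger variance than a projection onto a random unit direction.
X <- scale(as.matrix(mtcars))                 # centered, scaled data
w <- prcomp(X)$rotation[, 1]                  # PC1 direction, ||w|| = 1
u <- rnorm(ncol(X)); u <- u / sqrt(sum(u^2))  # random unit direction
c(var(X %*% w), var(X %*% u))                 # PC1's projected variance is larger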
What are the key properties of PCA?
Principal components are orthogonal, so their scores are uncorrelated (checked in the sketch below).
The first component (PC1) captures the most variance; each subsequent component captures the most remaining variance.
The best-fit subspaces are nested: span(PC1) ⊆ span(PC1, PC2) ⊆ ….
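A sketch checking the orthogonality property on mtcars (assumed example data):
pca <- prcomp(scale(as.matrix(mtcars)))
round(cor(pca$x), 2)  # ≈ identity matrix: distinct PC scores are uncorrelated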
How is PCA performed in R?
pca_res <- prcomp(mat, center=TRUE, scale.=TRUE) # Center and scale, then compute PCs
summary(pca_res) # View explained variance
biplot(pca_res) # Visualize projection
What is the “scree plot” in PCA?
A plot showing the variance explained by each principal component.
Helps decide how many components to retain (look for the “elbow”).
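A scree-plot sketch, assuming pca_res from the prcomp() call above:
var_explained <- pca_res$sdev^2 / sum(pca_res$sdev^2)  # variance share per PC
plot(var_explained, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")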
What are some limitations of PCA, and what are nonlinear alternatives?
Limitations: PCA assumes linear relationships. It may fail when data lies on a curved surface (e.g., parabola).
Alternatives:
Kernel PCA
t-SNE (sketched below)
UMAP
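A minimal t-SNE sketch; it assumes the CRAN package Rtsne is installed, and the perplexity value is an arbitrary choice for a dataset of 32 cars:
library(Rtsne)
X <- scale(as.matrix(mtcars))
set.seed(1)                       # t-SNE is stochastic
tsne <- Rtsne(X, perplexity = 5)  # perplexity must satisfy 3*perplexity < n - 1
plot(tsne$Y, xlab = "t-SNE 1", ylab = "t-SNE 2")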
What are the key takeaways from clustering and PCA?
Clustering and PCA are unsupervised learning techniques.
Both are exploratory tools to derive hypotheses.
Results are subjective and require independent validation.