Section 3: Unsupervised Learning Flashcards
what is supervised learning?
target is present
goal: to make inference or predictions for the target
what is unsupervised learning?
target is absent (or ignored if present)
goal: to extract relationships between variables
two reasons why unsupervised learning is often more challenging than supervised learning
1) (objectives) objectives in unsupervised learning are fuzzier and more subjective; there is no simple goal such as prediction
2) (hard to assess results) methods for assessing model quality based on the target variable, e.g., CV, are generally not applicable
what is Principal Components Analysis (PCA)?
idea: a) to transform a set of numeric variables into a smaller set of representative variables (PCs) -> reduce dimension of data
b) especially useful for highly correlated data -> a few PCs are enough to capture most information
properties of PCs
1) linear combinations of the original features
2) generated to capture as much information in the data (w.r.t. variance) as possible
3) mutually uncorrelated (different PCs capture different aspects of data)
4) PC scores are linear combinations of the (centered, possibly scaled) feature values, with the PC loadings as weights
5) amount of variance explained decreases with PC order, i.e., PC1 explains the most variance and subsequent PCs explain less and less
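A minimal R sketch of these properties using prcomp(); the iris measurements are used purely as illustrative numeric data:
  pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
  pca$rotation          # loadings: each PC is a linear combination of the original features
  head(pca$x)           # scores: score matrix = (centered, scaled data) %*% loading matrix
  round(cor(pca$x), 4)  # scores of different PCs are mutually uncorrelated
  pca$sdev^2            # variances explained decrease from PC1 onward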
two applications of PCA
1) EDA: plot the scores of the 1st PC vs the scores of the 2nd PC to gain a 2D view of the data in a scatterplot
2) feature generation: replace the original variables with PCs to reduce overfitting and improve prediction performance
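A sketch of both applications (reusing a prcomp fit as above; the data and object names are illustrative):
  pca <- prcomp(iris[, 1:4], scale. = TRUE)
  # 1) EDA: 2D view of the data using the first two PCs' scores
  plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
  # 2) feature generation: replace the original variables with the leading PCs
  dat_pc <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2])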
interpreting signs and magnitudes of PC loadings
what do the PCs represent, e.g., proxy, average, or contrast of which variables? which variables are more correlated with one another?
interpreting proportions of variance explained (PVEs)
PVE_m = (variance explained by the mth PC) / (total variance)
are the first few PVEs large enough (related to the strong correlations between variables)? If so, the PCs are useful
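In R, the PVEs can be computed from the PC standard deviations returned by prcomp(), or read off summary(pca) (a sketch, with illustrative data):
  pca <- prcomp(iris[, 1:4], scale. = TRUE)
  pve <- pca$sdev^2 / sum(pca$sdev^2)  # PVE_m = variance of mth PC / total variance
  pve                                  # individual PVEs, decreasing with PC order
  cumsum(pve)                          # cumulative PVE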
interpreting biplots
visualization of PCA output by displaying both the scores and loading vectors of the first two PCs.
- PC loadings on top and right axes -> deduce the meaning of PCs
- PC scores on bottom and left axes -> deduce characteristics of observations (based on meaning of PCs)
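A biplot can be drawn directly from a prcomp object (a sketch; the cex setting is just for readability):
  pca <- prcomp(iris[, 1:4], scale. = TRUE)
  # scores on the bottom/left axes, loading vectors (arrows) on the top/right axes
  biplot(pca, scale = 0, cex = 0.6)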
how to choose the number of PCs
trade-off: as the number of PCs increases, the cumulative PVE increases, but so do the dimension and model complexity
scree plot: eyeball the plot and locate the “elbow”
CV: if a target variable y exists, treat the number of PCs as a hyperparameter to be tuned
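A sketch of a scree plot, reusing the PVEs computed as above; the elbow is located by eye:
  pca <- prcomp(iris[, 1:4], scale. = TRUE)
  pve <- pca$sdev^2 / sum(pca$sdev^2)
  plot(pve, type = "b", xlab = "Principal component", ylab = "PVE")  # look for the elbow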
Drawbacks of PCA
1) loss of interpretability (Reason: PCs as composite variables can be hard to interpret)
2) not good for non-linearly related variables (PCs rely on linear transformations of variables)
3) PCA does dimension reduction, but not feature selection (PCs are constructed from all original features)
4) Target variable is ignored (PCA is unsupervised)
Cluster Analysis
- to partition observations into a set of non-overlapping subgroups (“clusters”) and uncover hidden patterns
- observations within each cluster should be rather similar to one another
- observations in different clusters should be rather different (well separated)
two feature generation methods based on clustering
1) cluster groups - as a new factor variable
2) cluster means - as a new numeric variable
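A sketch of both feature-generation ideas with kmeans(); the data and variable names are made up for illustration:
  set.seed(1)
  dat <- iris[, c("Sepal.Length", "Sepal.Width")]              # illustrative numeric features
  km  <- kmeans(scale(dat), centers = 3, nstart = 20)
  dat$group   <- factor(km$cluster)                            # 1) cluster group as a new factor variable
  dat$grp_avg <- ave(dat$Sepal.Length, dat$group, FUN = mean)  # 2) cluster mean as a new numeric variable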
K-means clustering process
For a fixed K (a positive integer), choose K clusters C1, …, CK to minimize the total within-cluster SS
Step 1 - initialization; given K, randomly select K points in the feature space as initial cluster centers
Step 2 - iteration; repeat the following steps until the cluster assignments no longer change: a) assign each obs. to the cluster with the closest center, b) recalculate the K cluster centers (hence “K-means”)
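A bare-bones R sketch of these two steps (base R's kmeans() implements this with refinements; the name km_sketch is made up, the initial centers are taken as K randomly chosen observations, and edge cases such as empty clusters are ignored):
  km_sketch <- function(X, K, max_iter = 100) {
    X <- as.matrix(X)
    centers <- X[sample(nrow(X), K), , drop = FALSE]   # Step 1: K random initial centers
    memb_old <- rep(0, nrow(X))
    for (i in seq_len(max_iter)) {                     # Step 2: iterate until assignments stop changing
      d <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K]            # distances from each obs. to each center
      memb_new <- max.col(-d)                          # a) assign each obs. to its closest center
      if (all(memb_new == memb_old)) break
      centers <- apply(X, 2, function(v) tapply(v, memb_new, mean))   # b) recalculate the K cluster means
      memb_old <- memb_new
    }
    list(cluster = memb_new, centers = centers)
  }
  # e.g., km_sketch(scale(iris[, 1:4]), K = 3)$cluster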
What should you set nstart to, and why?
set nstart to a large integer, e.g., >= 20
the algorithm produces a local optimum which depends on the randomly selected initial cluster centers
run the algorithm multiple times to improve the chance of finding a better local optimum
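A sketch of why nstart matters: kmeans() keeps the run with the lowest total within-cluster SS, so more random starts tend to find a better local optimum (dataset is illustrative):
  X <- scale(iris[, 1:4])
  set.seed(1)
  kmeans(X, centers = 5, nstart = 1)$tot.withinss    # single random start: may land in a poor local optimum
  kmeans(X, centers = 5, nstart = 20)$tot.withinss   # best of 20 starts: typically a lower (better) value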
Selecting the value of K by elbow method
choose the “elbow” beyond which the proportion of variation explained is marginal
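An elbow-plot sketch: compute the proportion of variation explained (between-cluster SS / total SS) for a range of K and eyeball the elbow (data and range of K are illustrative):
  X <- scale(iris[, 1:4])
  prop_explained <- sapply(1:10, function(k) {
    km <- kmeans(X, centers = k, nstart = 20)
    km$betweenss / km$totss               # proportion of variation explained for this K
  })
  plot(1:10, prop_explained, type = "b", xlab = "K", ylab = "Proportion explained")  # look for the elbow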
what is hierarchical clustering?
Algorithm:
- start with the individual observations, each treated as a separate cluster
- successively fuse the closest pair of clusters, one at a time
- stop when all clusters are fused into a single cluster containing all observations
output: a “hierarchy” of clusters which can be visualized by a dendrogram
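A sketch with hclust(): the pairwise dissimilarities come from dist(), and the default linkage is complete (data is illustrative):
  X  <- scale(iris[, 1:4])
  hc <- hclust(dist(X))   # method = "complete" by default
  plot(hc)                # dendrogram: fusion heights on the vertical axis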
What are the 4 linkages for hierarchical clustering?
1) Complete (default) - maximal pairwise distance
2) Single - minimal pairwise distance
3) Average - Average of all pairwise distances
4) Centroid - distance between the two cluster centroids
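The linkage is selected through the method argument of hclust() (a sketch; the cluster-size comparison at the end previews the next cards):
  d <- dist(scale(iris[, 1:4]))
  hc_complete <- hclust(d, method = "complete")
  hc_single   <- hclust(d, method = "single")
  hc_average  <- hclust(d, method = "average")
  hc_centroid <- hclust(d, method = "centroid")
  table(cutree(hc_complete, k = 4))   # complete/average: more balanced cluster sizes
  table(cutree(hc_single, k = 4))     # single: typically one big cluster plus trailing small ones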
complete and average linkages tend to produce _______ clusters
more balanced
both linkages are commonly used
Single linkage tends to produce _________, __________ clusters.
extended, trailing
single observations fused one-at-a-time
centroid linkage may lead to ____________.
inversion
some later fusions occur at a lower height than an earlier fusion
What is a dendrogram?
an upside-down tree showing the sequence of fusions and the inter-cluster dissimilarity (“height”) when each fusion occurs on the vertical axis
Similarities between clusters
clusters joined towards the bottom of the dendrogram are rather similar to one another, while those fused towards the top are rather far apart
Considerations when choosing the no. of clusters
Try to cut the dendrogram at a height such that:
- the resulting clusters have similar no. of obs. (balanced)
- the gap between the cut height and the nearby fusion heights is large enough -> obs. in different clusters have materially different characteristics
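In R the dendrogram is cut with cutree(), either at a chosen number of clusters or at a chosen height (a sketch; the values of k and h are illustrative):
  hc <- hclust(dist(scale(iris[, 1:4])))
  clusters_k <- cutree(hc, k = 3)   # cut to obtain 3 clusters
  clusters_h <- cutree(hc, h = 4)   # or cut at height 4
  table(clusters_k)                 # check whether the clusters are reasonably balanced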
K-means vs Hierarchical: randomization, no. of clusters specified, nested clusters?
K-means
- randomization is needed
- no. of clusters is specified in advance (= K)
- clusters are NOT nested
Hierarchical
- randomization is NOT needed
- no. of clusters is NOT specified in advance
- clusters are nested (hierarchy)
Why does scaling matter for both PCA and clustering?
Without scaling: variables with a large order of magnitude will dominate variance and distance calculations -> this has a disproportionate effect on PC loadings and cluster groups
With scaling: all variables are on the same scale and share the same degree of importance
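In R, scaling is a one-liner: scale() before dist()/kmeans(), or scale. = TRUE inside prcomp() (a sketch with illustrative data):
  X <- iris[, 1:4]
  pca_scaled <- prcomp(X, scale. = TRUE)                      # PCA on standardized variables
  km_scaled  <- kmeans(scale(X), centers = 3, nstart = 20)    # clustering on standardized variables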
Alternative distance measure
correlation-based measure
Motivation: focuses on the shape (pattern) of an observation's feature values rather than their exact magnitudes
limitation: only makes sense when p>= 3, for otherwise the correlation between two observations always equals +/-1
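A sketch of a correlation-based dissimilarity between observations: correlate the rows (not the columns), then convert 1 - correlation into a distance object for hclust() (data is illustrative, with p = 4 features so the measure is meaningful):
  X <- scale(iris[, 1:4])
  d_cor  <- as.dist(1 - cor(t(X)))   # cor(t(X)) correlates observations across their features
  hc_cor <- hclust(d_cor)            # hierarchical clustering with the correlation-based distance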
Clustering and curse of dimensionality
- visualization of the results of cluster analysis becomes problematic in high dimensions (p>=3)
- as the number of dimensions increases, our intuition breaks down and it becomes harder to differentiate between observations that are close and those that are far apart