Section 3: Unsupervised Learning Flashcards
what is supervised learning?
target is present
goal: to make inferences or predictions for the target
what is unsupervised learning?
target is absent (or ignored if present)
goal: to extract relationships between variables
two reasons why unsupervised learning is often more challenging than supervised learning
1) (objectives) objectives in unsupervised learning are fuzzier and more subjective; there is no simple goal like prediction
2) (hard to assess results) methods for assessing model quality based on the target variable, e.g., CV, are generally not applicable
what is Principal Components Analysis (PCA)?
idea: a) to transform a set of numeric variables into a smaller set of representative variables (PCs) -> reduce dimension of data
b) especially useful for highly correlated data -> a few PCs are enough to capture most information
properties of PCs
1) linear combinations of the original features
2) generated to capture as much information in the data (w.r.t. variance) as possible
3) mutually uncorrelated (different PCs capture different aspects of data)
4) PC scores are computed from the PC loadings: the score of obs. i on the mth PC is the loading-weighted sum of the (centered) feature values, z_im = φ_1m x_i1 + … + φ_pm x_ip
5) amount of variance explained decreases with PC order, i.e., PC1 explains the most variance and subsequent PCs explain less and less
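These properties can be checked directly in R; a minimal sketch using prcomp() (the USArrests dataset and the scale. = TRUE standardization are illustrative choices, not part of the cards):

```r
# Illustrative PCA fit on a built-in numeric dataset
pca <- prcomp(USArrests, scale. = TRUE)

pca$rotation          # loadings: each PC is a linear combination of the original features
head(pca$x)           # PC scores for the first few observations
round(cor(pca$x), 3)  # off-diagonals ~ 0: PCs are mutually uncorrelated
pca$sdev^2            # variance explained, decreasing from PC1 onward
```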
two applications of PCA
1) EDA: plot the scores of the 1st PC vs the scores of the 2nd PC to gain a 2D view of the data in a scatterplot
2) feature generation: replace the original variables with the first few PCs to reduce overfitting and improve prediction performance (see the sketch below)
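A brief sketch of both applications, continuing the prcomp() example above (variable names are illustrative):

```r
pca <- prcomp(USArrests, scale. = TRUE)

# 1) EDA: 2D view of the data via the scores of the first two PCs
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")

# 2) feature generation: keep the first two PCs as new predictors
new_features <- as.data.frame(pca$x[, 1:2])
```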
interpreting signs and magnitudes of PC loadings
what do the PCs represent, e.g., proxy, average, or contrast of which variables? which variables are more correlated with one another?
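In R, the loadings sit in the rotation matrix of a prcomp() fit; a hedged reading guide (interpretations are dataset-specific):

```r
pca$rotation  # rows = original variables, columns = PCs
# same-sign loadings on a PC -> the PC is roughly an average/proxy of those variables;
# mixed signs -> the PC is a contrast between the two groups of variables;
# larger |loading| -> the variable contributes more to that PC
```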
interpreting proportions of variance explained (PVEs)
PVE_m = (variance explained by mth PC) / (total variance)
are the first few PVEs large enough (they will be when the variables are strongly correlated)? If so, the PCs are useful
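With a prcomp() fit, the PVEs follow directly from the stated formula, since pca$sdev^2 gives the variance explained by each PC:

```r
pve <- pca$sdev^2 / sum(pca$sdev^2)  # PVE_m = variance of mth PC / total variance
round(pve, 3)
round(cumsum(pve), 3)                # cumulative PVE
```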
interpreting biplots
visualization of PCA output by displaying both the scores and loading vectors of the first two PCs.
- PC loadings on top and right axes -> deduce the meaning of PCs
- PC scores on bottom and left axes -> deduce characteristics of observations (based on meaning of PCs)
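Assuming the prcomp() fit from above, R's built-in biplot() produces this display (scale = 0 keeps the arrows proportional to the loadings):

```r
biplot(pca, scale = 0)  # points = PC scores (bottom/left axes), arrows = loadings (top/right axes)
```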
how to choose the number of PCs
trade-off: as the number of PCs increases, the cumulative PVE increases, but so do the dimension of the data and the model complexity
scree plot: eyeball the plot and locate the “elbow”
CV: treat the number of PCs as a hyperparameter to be tuned, if a target variable exists
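A scree plot can be drawn from the PVEs computed above; a minimal sketch:

```r
pve <- pca$sdev^2 / sum(pca$sdev^2)
plot(pve, type = "b", xlab = "Principal Component", ylab = "PVE")  # look for the elbow
```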
Drawbacks of PCA
1) loss of interpretability (reason: PCs, as composite variables, can be hard to interpret)
2) not good for non-linearly related variables (PCs rely on linear transformations of variables)
3) PCA does dimension reduction, but not feature selection (PCs are constructed from all original features)
4) Target variable is ignored (PCA is unsupervised)
Cluster Analysis
- to partition observations into a set of non-overlapping subgroups (“clusters”) and uncover hidden patterns
- observations within each cluster should be rather similar to one another
- observations in different clusters should be rather different (well separated)
two feature generation methods based on clustering
1) cluster groups - as a new factor variable
2) cluster means - as a new numeric variable
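A minimal sketch of both methods, assuming a K-means fit on the (standardized) USArrests data; the choice of K = 3 and the Assault variable are purely illustrative:

```r
km <- kmeans(scale(USArrests), centers = 3, nstart = 20)

# 1) cluster groups as a new factor variable
USArrests$cluster <- factor(km$cluster)

# 2) cluster means of an existing numeric variable as a new numeric variable
USArrests$assault_clus_mean <- ave(USArrests$Assault, USArrests$cluster)
```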
K-means clustering process
For a fixed K (a positive integer), choose K clusters C1, …, CK to minimize the total within-cluster SS
Step 1 - initialization; given K, randomly select K points in the feature space as initial cluster centers
Step 2 - iteration; repeat the following steps until the cluster assignments no longer change: a) assign each obs. to the cluster with the closest center b) recalculate the K cluster centers (hence “K-means”)
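For intuition, a from-scratch sketch of the two steps (illustrative only; it does not handle empty clusters, and in practice the built-in kmeans() should be used):

```r
X <- scale(USArrests)  # standardized numeric data
K <- 3
set.seed(1)
centers <- X[sample(nrow(X), K), ]  # Step 1: K random observations as initial centers
assign_old <- rep(0, nrow(X))
repeat {
  # Step 2a: assign each observation to the cluster with the closest center
  d <- sapply(1:K, function(k) colSums((t(X) - centers[k, ])^2))
  assign_new <- max.col(-d)  # index of the smallest squared distance per row
  if (identical(assign_new, assign_old)) break  # assignments unchanged: stop
  assign_old <- assign_new
  # Step 2b: recalculate each cluster center as the mean of its members
  centers <- t(sapply(1:K, function(k) colMeans(X[assign_new == k, , drop = FALSE])))
}
```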
What should you set nstart to?
set nstart to a large integer, e.g., >= 20
the algorithm produces a local optimum which depends on the randomly selected initial cluster centers
run the algorithm multiple times to improve the chance of finding a better local optimum
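In R's kmeans(), the nstart argument does exactly this: it reruns the algorithm nstart times and keeps the run with the lowest total within-cluster SS:

```r
km <- kmeans(scale(USArrests), centers = 3, nstart = 20)  # 20 random starts; best run kept
km$tot.withinss                                           # objective value of the best run
```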