Section 3: Unsupervised Learning Flashcards

1
Q

what is supervised learning?

A

target is present

goal: to make inference or predictions for the target

2
Q

what is unsupervised learning?

A

target is absent (or ignored if present)

goal: to extract relationships between variables

3
Q

two reasons why unsupervised learning is often more challenging than supervised learning

A

1) (objectives) objectives in unsupervised learning are fuzzier and more subjective; there is no simple goal such as prediction
2) (hard to assess results) methods for assessing model quality based on the target variable, e.g., CV, are generally not applicable

4
Q

what is Principal Components Analysis (PCA)?

A

idea: a) to transform a set of numeric variables into a smaller set of representative variables (PCs) -> reduce dimension of data
b) especially useful for highly correlated data -> a few PCs are enough to capture most information
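
A minimal sketch of this in R, assuming a numeric data frame named X (a placeholder):

    # run PCA on standardized variables (centering and scaling are usually advisable)
    pca <- prcomp(X, center = TRUE, scale. = TRUE)
    pca$rotation   # PC loadings (one column per PC)
    pca$x          # PC scores (one row per observation)
    summary(pca)   # standard deviation and proportion of variance explained per PC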

5
Q

properties of PCs

A

1) linear combinations of the original features
2) generated to capture as much information in the data (w.r.t. variance) as possible
3) mutually uncorrelated (different PCs capture different aspects of data)
4) PC scores are obtained by applying the PC loadings as weights to the (centered and scaled) feature values
5) amount of variance explained decreases with PC order, i.e., PC1 explains the most variance and subsequent PCs explain less and less

6
Q

two applications of PCA

A

1) EDA: plot the scores of the 1st PC vs the scores of the 2nd PC to gain a 2D view of the data in a scatterplot
2) feature generation: replace the original variables with PCs to reduce overfitting and improve prediction performance
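
A brief sketch of both applications, assuming pca is the prcomp() fit from the earlier sketch:

    # 1) EDA: 2D view of the data using the first two PC scores
    plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")

    # 2) feature generation: keep the first few PCs as new features
    new_features <- as.data.frame(pca$x[, 1:2])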

7
Q

interpreting signs and magnitudes of PC loadings

A

what do the PCs represent, e.g., proxy, average, or contrast of which variables? which variables are more correlated with one another?

8
Q

interpreting proportions of variance explained (PVEs)

A

PVE_m = (variance explained by the mth PC) / (total variance)

Are the first few PVEs large enough (a sign of strong correlations between the variables)? If so, the first few PCs capture most of the information and are useful.
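
The PVEs can be computed directly from a prcomp() fit (assuming pca from the earlier sketch):

    pve <- pca$sdev^2 / sum(pca$sdev^2)   # variance of each PC / total variance
    cumsum(pve)                           # cumulative PVE; matches the output of summary(pca)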

9
Q

interpreting biplots

A

a visualization of PCA output displaying both the scores and the loading vectors of the first two PCs

  • PC loadings on top and right axes -> deduce the meaning of PCs
  • PC scores on bottom and left axes -> deduce characteristics of observations (based on meaning of PCs)
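
In R, a biplot can be produced from a prcomp() fit (assuming pca from the earlier sketch):

    biplot(pca, scale = 0)   # scores as points, loadings as arrows; scale = 0 keeps natural scales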
10
Q

how to choose the number of PCs

A

trade-off: as the number of PCs retained increases, the cumulative PVE increases, but so do the dimension of the data and the model complexity

scree plot: eyeball the plot and locate the “elbow”

CV: treat the number of PCs as a hyperparameter to be tuned (applicable if a target variable exists)
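
A scree-plot sketch based on the PVEs of a prcomp() fit (assuming pca as before):

    pve <- pca$sdev^2 / sum(pca$sdev^2)
    plot(pve, type = "b", xlab = "Principal component", ylab = "PVE")                    # look for the elbow
    plot(cumsum(pve), type = "b", xlab = "Principal component", ylab = "Cumulative PVE")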

11
Q

Drawbacks of PCA

A

1) loss of interpretability (Reason: PCs, as composite variables, can be hard to interpret)
2) not effective for non-linearly related variables (PCs rely on linear transformations of the original variables)
3) PCA does dimension reduction, but not feature selection (PCs are constructed from all original features)
4) target variable is ignored (PCA is unsupervised)

12
Q

Cluster Analysis

A
  • to partition observations into a set of non-overlapping subgroups (“clusters”) and uncover hidden patterns
  • observations within each cluster should be rather similar to one another
  • observations in different clusters should be rather different (well separated)
13
Q

two feature generation methods based on clustering

A

1) cluster groups - as a new factor variable
2) cluster means - as a new numeric variable
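
A sketch of both features, assuming km is a kmeans() fit (see the next card) on the numeric columns of a data frame named data; some_var is a hypothetical variable name:

    data$cluster_group <- factor(km$cluster)                  # 1) cluster label as a factor
    data$cluster_mean  <- km$centers[km$cluster, "some_var"]  # 2) cluster mean of a chosen variable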

14
Q

K-means clustering process

A

For a fixed K (a positive integer), choose K clusters C1, …, CK to minimize the total within-cluster SS

Step 1 - initialization: given K, randomly select K points in the feature space as the initial cluster centers
Step 2 - iteration: repeat the following steps until the cluster assignments no longer change:
a) assign each obs. to the cluster with the closest center
b) recalculate the K cluster centers (hence “K-means”)
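
A minimal kmeans() sketch, assuming a standardized numeric matrix named X_scaled:

    set.seed(1)                                       # initial centers are chosen at random
    km <- kmeans(X_scaled, centers = 3, nstart = 20)  # K = 3 here
    km$cluster                                        # cluster assignment of each observation
    km$centers                                        # the K cluster centers
    km$tot.withinss                                   # total within-cluster SS (the minimized objective)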

15
Q

What should you set nstart to, and why?

A

set nstart to a large integer, e.g., >= 20

the algorithm produces a local optimum which depends on the randomly selected initial cluster centers

run the algorithm multiple times to improve the chance of finding a better local optimum
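
A quick illustration of the effect of nstart (same placeholder X_scaled); each call keeps the best of its random initializations:

    set.seed(1)
    kmeans(X_scaled, centers = 3, nstart = 1)$tot.withinss    # single start: may be a poor local optimum
    kmeans(X_scaled, centers = 3, nstart = 30)$tot.withinss   # many starts: no worse, often better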

16
Q

Selecting the value of K by elbow method

A

choose the “elbow” beyond which the proportion of variation explained is marginal
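
An elbow-plot sketch using the total within-cluster SS over a range of K (placeholder X_scaled):

    set.seed(1)
    tot_wss <- sapply(1:10, function(k) kmeans(X_scaled, centers = k, nstart = 20)$tot.withinss)
    plot(1:10, tot_wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")   # pick K at the elbow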

17
Q

what is hierarchical clustering?

A

Algorithm:
- start with the individual observations, each treated as a separate cluster
- successively fuse the closest pair of clusters, one at a time
- stop when all clusters are fused into a single cluster containing all observations

output: a “hierarchy” of clusters which can be visualized by a dendrogram
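
A minimal hierarchical clustering sketch (placeholder X_scaled):

    d  <- dist(X_scaled)                  # pairwise Euclidean distances
    hc <- hclust(d, method = "complete")  # complete linkage (the default)
    plot(hc)                              # dendrogram: fusion heights on the vertical axis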

18
Q

What are the 4 linkages for hierarchical clustering?

A

1) Complete (default) - maximal pairwise distance
2) Single - minimal pairwise distance
3) Average - Average of all pairwise distances
4) Centroid - distance between the two cluster centroids
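
The four linkages correspond to the method argument of hclust() (assuming d from the previous sketch):

    hclust(d, method = "complete")   # maximal pairwise distance (default)
    hclust(d, method = "single")     # minimal pairwise distance
    hclust(d, method = "average")    # average of all pairwise distances
    hclust(d, method = "centroid")   # distance between cluster centroids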

19
Q

complete and average linkage tend to produce _______ clusters

A

more balanced

both linkages are commonly used

20
Q

Single linkage tends to produce _________, __________ clusters.

A

extended, trailing

single observations fused one-at-a-time

21
Q

centroid linkage may lead to ____________.

A

inversion

some later fusions occur at a lower height than an earlier fusion

22
Q

What is a dendrogram?

A

an upside-down tree showing the sequence of fusions; the vertical axis gives the inter-cluster dissimilarity (“height”) at which each fusion occurs

23
Q

Similarities between clusters

A

clusters joined towards the bottom of the dendrogram are rather similar to one another, while those fused towards the top are rather far apart

24
Q

Considerations when choosing the no. of clusters

A

Try to cut the dendrogram at a height such that:
- the resulting clusters have a similar no. of obs. (balanced)
- the gap between the cut height and the next fusion height is large enough -> obs. in different clusters have materially different characteristics
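
Cutting the dendrogram in R with cutree() (assuming hc from the earlier hclust() sketch):

    clusters <- cutree(hc, k = 4)    # cut to obtain a chosen number of clusters...
    # clusters <- cutree(hc, h = 5)  # ...or cut at a chosen height
    table(clusters)                  # check whether the cluster sizes are reasonably balanced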

25
Q

K-means vs Hierarchical: randomization, no. of clusters specified, nested clusters?

A

K-means
- randomization is needed (random initial centers)
- no. of clusters is specified in advance (= K)
- clusters are NOT nested

Hierarchical
- randomization is NOT needed
- no. of clusters is NOT specified in advance (chosen by cutting the dendrogram)
- clusters are nested (hierarchy)

26
Q

Why does scaling matter for both PCA and clustering?

A

Without scaling: variables with a larger order of magnitude dominate the variance and distance calculations -> disproportionate effect on PC loadings and cluster assignments

with scaling: all variables are on the same scale and share the same degree of importance
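
In R, scale() standardizes the variables before PCA or clustering (placeholder data frame X):

    X_scaled <- scale(X)                               # center to mean 0, scale to sd 1
    pca <- prcomp(X_scaled)                            # same as prcomp(X, center = TRUE, scale. = TRUE)
    km  <- kmeans(X_scaled, centers = 3, nstart = 20)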

27
Q

Alternative distance measure

A

correlation-based measure

Motivation: focuses on shapes of feature values rather than their exact magnitudes

limitation: only makes sense when p >= 3; otherwise the correlation between two observations always equals +/-1
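
A correlation-based distance can be built for hierarchical clustering (placeholder numeric matrix X; cor() works column-wise, so transpose to compare observations):

    d_cor <- as.dist(1 - cor(t(X)))            # observations with similar "shapes" of feature values are close
    hc    <- hclust(d_cor, method = "average")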

28
Q

Clustering and curse of dimensionality

A
  • visualization of the results of cluster analysis becomes problematic in high dimensions (p>=3)
  • as the number of dimensions increases, our intuition breaks down and it becomes harder to differentiate between observations that are close and those that are far apart