Section 3: Unsupervised Learning Flashcards

1
Q

what is supervised learning?

A

target is present

goal: to make inference or predictions for the target

2
Q

what is unsupervised learning?

A

target is absent (or ignored if present)

goal: to extract relationships between variables

3
Q

two reasons why unsupervised learning is often more challenging than supervised learning

A

1) (objectives) objectives in unsupervised learning are fuzzier and more subjective; there is no simple goal such as prediction
2) (hard to assess results) methods for assessing model quality based on the target variable, e.g., CV, are generally not applicable

4
Q

what is Principal Components Analysis (PCA)?

A

idea: a) to transform a set of numeric variables into a smaller set of representative variables (PCs) -> reduce dimension of data
b) especially useful for highly correlated data -> a few PCs are enough to capture most information
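
A minimal sketch of this in R, assuming a numeric data frame named X (a placeholder):

    # run PCA on standardized variables (centering and scaling are usually advisable)
    pca <- prcomp(X, center = TRUE, scale. = TRUE)
    pca$rotation   # PC loadings (one column per PC)
    pca$x          # PC scores (one row per observation)
    summary(pca)   # standard deviation and proportion of variance explained per PC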

5
Q

properties of PCs

A

1) linear combinations of the original features
2) generated to capture as much information in the data (w.r.t. variance) as possible
3) mutually uncorrelated (different PCs capture different aspects of data)
4) PC scores are obtained by applying the PC loadings as weights to the (centered and scaled) feature values
5) amount of variance explained decreases with PC order, i.e., PC1 explains the most variance and subsequent PCs explain less and less

6
Q

two applications of PCA

A

1) EDA: plot the scores of the 1st PC vs the scores of the 2nd PC to gain a 2D view of the data in a scatterplot
2) feature generation: replace the original variables with PCs to reduce overfitting and improve prediction performance
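
A brief sketch of both applications, assuming pca is the prcomp() fit from the earlier sketch:

    # 1) EDA: 2D view of the data using the first two PC scores
    plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")

    # 2) feature generation: keep the first few PCs as new features
    new_features <- as.data.frame(pca$x[, 1:2])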

7
Q

interpreting signs and magnitudes of PC loadings

A

what do the PCs represent, e.g., proxy, average, or contrast of which variables? which variables are more correlated with one another?

8
Q

interpreting proportions of variance explained (PVEs)

A

PVE_m = (variance explained by the mth PC) / (total variance)

Are the first few PVEs large enough (a sign of strong correlations between the variables)? If so, the first few PCs capture most of the information and are useful.
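
The PVEs can be computed directly from a prcomp() fit (assuming pca from the earlier sketch):

    pve <- pca$sdev^2 / sum(pca$sdev^2)   # variance of each PC / total variance
    cumsum(pve)                           # cumulative PVE; matches the output of summary(pca)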

9
Q

interpreting biplots

A

a visualization of PCA output displaying both the scores and the loading vectors of the first two PCs

  • PC loadings on top and right axes -> deduce the meaning of PCs
  • PC scores on bottom and left axes -> deduce characteristics of observations (based on meaning of PCs)
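
In R, a biplot can be produced from a prcomp() fit (assuming pca from the earlier sketch):

    biplot(pca, scale = 0)   # scores as points, loadings as arrows; scale = 0 keeps natural scales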
10
Q

how to choose the number of PCs

A

trade-off: as the number of PCs retained increases, the cumulative PVE increases, but so do the dimension of the data and the model complexity

scree plot: eyeball the plot and locate the “elbow”

CV: treat the number of PCs as a hyperparameter to be tuned (applicable if a target variable exists)
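
A scree-plot sketch based on the PVEs of a prcomp() fit (assuming pca as before):

    pve <- pca$sdev^2 / sum(pca$sdev^2)
    plot(pve, type = "b", xlab = "Principal component", ylab = "PVE")                    # look for the elbow
    plot(cumsum(pve), type = "b", xlab = "Principal component", ylab = "Cumulative PVE")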

11
Q

Drawbacks of PCA

A

1) loss of interpretability (Reason: PCs, as composite variables, can be hard to interpret)
2) not effective for non-linearly related variables (PCs rely on linear transformations of the original variables)
3) PCA does dimension reduction, but not feature selection (PCs are constructed from all original features)
4) target variable is ignored (PCA is unsupervised)

12
Q

Cluster Analysis

A
  • to partition observations into a set of non-overlapping subgroups (“clusters”) and uncover hidden patterns
  • observations within each cluster should be rather similar to one another
  • observations in different clusters should be rather different (well separated)
13
Q

two feature generation methods based on clustering

A

1) cluster groups - as a new factor variable
2) cluster means - as a new numeric variable
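
A sketch of both features, assuming km is a kmeans() fit (see the next card) on the numeric columns of a data frame named data; some_var is a hypothetical variable name:

    data$cluster_group <- factor(km$cluster)                  # 1) cluster label as a factor
    data$cluster_mean  <- km$centers[km$cluster, "some_var"]  # 2) cluster mean of a chosen variable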

14
Q

K-means clustering process

A

For a fixed K (a positive integer), choose K clusters C1, …, CK to minimize the total within-cluster SS

Step 1 - initialization: given K, randomly select K points in the feature space as the initial cluster centers
Step 2 - iteration: repeat the following steps until the cluster assignments no longer change:
a) assign each obs. to the cluster with the closest center
b) recalculate the K cluster centers (hence “K-means”)
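
A minimal kmeans() sketch, assuming a standardized numeric matrix named X_scaled:

    set.seed(1)                                       # initial centers are chosen at random
    km <- kmeans(X_scaled, centers = 3, nstart = 20)  # K = 3 here
    km$cluster                                        # cluster assignment of each observation
    km$centers                                        # the K cluster centers
    km$tot.withinss                                   # total within-cluster SS (the minimized objective)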

15
Q

What should you set nstart to, and why?

A

set nstart to a large integer, e.g., >= 20

the algorithm produces a local optimum which depends on the randomly selected initial cluster centers

run the algorithm multiple times to improve the chance of finding a better local optimum
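
A quick illustration of the effect of nstart (same placeholder X_scaled); each call keeps the best of its random initializations:

    set.seed(1)
    kmeans(X_scaled, centers = 3, nstart = 1)$tot.withinss    # single start: may be a poor local optimum
    kmeans(X_scaled, centers = 3, nstart = 30)$tot.withinss   # many starts: no worse, often better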

16
Q

Selecting the value of K by elbow method

A

choose the “elbow” beyond which the proportion of variation explained is marginal
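
An elbow-plot sketch using the total within-cluster SS over a range of K (placeholder X_scaled):

    set.seed(1)
    tot_wss <- sapply(1:10, function(k) kmeans(X_scaled, centers = k, nstart = 20)$tot.withinss)
    plot(1:10, tot_wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")   # pick K at the elbow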

17
Q

what is hierarchical clustering?

A

Algorithm:
- start with the individual observations, each treated as a separate cluster
- successively fuse the closest pair of clusters, one at a time
- stop when all clusters are fused into a single cluster containing all observations

output: a “hierarchy” of clusters which can be visualized by a dendrogram
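
A minimal hierarchical clustering sketch (placeholder X_scaled):

    d  <- dist(X_scaled)                  # pairwise Euclidean distances
    hc <- hclust(d, method = "complete")  # complete linkage (the default)
    plot(hc)                              # dendrogram: fusion heights on the vertical axis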

18
Q

What are the 4 linkages for hierarchical clustering?

A

1) Complete (default) - maximal pairwise distance
2) Single - minimal pairwise distance
3) Average - Average of all pairwise distances
4) Centroid - distance between the two cluster centroids
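
The four linkages correspond to the method argument of hclust() (assuming d from the previous sketch):

    hclust(d, method = "complete")   # maximal pairwise distance (default)
    hclust(d, method = "single")     # minimal pairwise distance
    hclust(d, method = "average")    # average of all pairwise distances
    hclust(d, method = "centroid")   # distance between cluster centroids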

19
Q

complete and average linkage tend to produce _______ clusters

A

more balanced

both linkages are commonly used

20
Q

Single linkage tends to produce _________, __________ clusters.

A

extended, trailing

single observations fused one-at-a-time

21
Q

centroid linkage may lead to ____________.

A

inversion

some later fusions occur at a lower height than an earlier fusion

22
Q

What is a dendrogram?

A

an upside-down tree showing the sequence of fusions; the vertical axis gives the inter-cluster dissimilarity (“height”) at which each fusion occurs

23
Q

Similarities between clusters

A

clusters joined towards the bottom of the dendrogram are rather similar to one another, while those fused towards the top are rather far apart

24
Q

Considerations when choosing the no. of clusters

A

Try to cut the dendrogram at a height such that:
- the resulting clusters have a similar no. of obs. (balanced)
- the gap between the cut height and the next fusion height is large enough -> obs. in different clusters have materially different characteristics
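
Cutting the dendrogram in R with cutree() (assuming hc from the earlier hclust() sketch):

    clusters <- cutree(hc, k = 4)    # cut to obtain a chosen number of clusters...
    # clusters <- cutree(hc, h = 5)  # ...or cut at a chosen height
    table(clusters)                  # check whether the cluster sizes are reasonably balanced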

25
Q

K-means vs Hierarchical: randomization, no. of clusters specified, nested clusters?

A

K-means
- randomization is needed (random initial centers)
- no. of clusters is specified in advance (= K)
- clusters are NOT nested

Hierarchical
- randomization is NOT needed
- no. of clusters is NOT specified in advance (chosen by cutting the dendrogram)
- clusters are nested (hierarchy)

26
Q

Why does scaling matter for both PCA and clustering?

A

Without scaling: variables with a larger order of magnitude dominate the variance and distance calculations -> disproportionate effect on PC loadings and cluster assignments

with scaling: all variables are on the same scale and share the same degree of importance
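
In R, scale() standardizes the variables before PCA or clustering (placeholder data frame X):

    X_scaled <- scale(X)                               # center to mean 0, scale to sd 1
    pca <- prcomp(X_scaled)                            # same as prcomp(X, center = TRUE, scale. = TRUE)
    km  <- kmeans(X_scaled, centers = 3, nstart = 20)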

27
Q

Alternative distance measure

A

correlation-based measure

Motivation: focuses on shapes of feature values rather than their exact magnitudes

limitation: only makes sense when p >= 3; otherwise the correlation between two observations always equals +/-1
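
A correlation-based distance can be built for hierarchical clustering (placeholder numeric matrix X; cor() works column-wise, so transpose to compare observations):

    d_cor <- as.dist(1 - cor(t(X)))            # observations with similar "shapes" of feature values are close
    hc    <- hclust(d_cor, method = "average")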

28
Q

Clustering and curse of dimensionality

A
  • visualization of the results of cluster analysis becomes problematic in high dimensions (p>=3)
  • as the number of dimensions increases, our intuition breaks down and it becomes harder to differentiate between observations that are close and those that are far apart