SRM Chapter 6 - Unsupervised Learning Flashcards

6.1 Principal Components Analysis 6.2 Cluster Analysis

1
Q

PCA (Principal Components Analysis)

A
  • Reduces complexity by transforming the variables into a smaller number of principal components that highlight the most important features of the data (i.e. explain a sufficient amount of the variability).
  • Often applied before fitting supervised models
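
A minimal sketch of this workflow in Python (assuming scikit-learn; the dataset and variable names are illustrative):

```python
# Illustrative PCA sketch: reduce 5 features to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # hypothetical data: 100 observations, 5 features

X_scaled = StandardScaler().fit_transform(X)   # center (and scale) the features first
pca = PCA(n_components=2)                      # keep a smaller number of components
Z = pca.fit_transform(X_scaled)                # Z holds the principal component scores

# Z can now replace the original features in a supervised model.
print(Z.shape)                                 # (100, 2)
print(pca.explained_variance_ratio_)           # share of variance each PC explains
```
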
2
Q

k-means Clustering

A
  • Divides the data into a predetermined number of clusters (K) such that the variance within each cluster is minimized
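
A minimal sketch (assuming scikit-learn; the data and the choice K = 3 are illustrative):

```python
# Illustrative k-means sketch: partition the data into K = 3 clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # hypothetical 2-feature dataset

km = KMeans(n_clusters=3, n_init=10, random_state=0)   # K is chosen in advance
labels = km.fit_predict(X)                    # cluster assignment for each observation

print(km.cluster_centers_)                    # the K centroids
print(km.inertia_)                            # total within-cluster sum of squares
```
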
3
Q

Hierarchical Clustering

A
  • Don’t have to specify the number of clusters upfront
  • Dendrogram -> tree-based representation that allows for flexible cluster analysis (cut the tree at any height to obtain a clustering)
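
A minimal sketch (assuming scipy and matplotlib; the data and linkage choice are illustrative):

```python
# Illustrative hierarchical (agglomerative) clustering sketch with a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                  # hypothetical dataset

Z = linkage(X, method="complete")             # build the full merge tree; no K needed yet
dendrogram(Z)                                 # plot the tree; it can be cut at any height
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters afterwards
```
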
4
Q

What is a principal component? Its features?

A
  • Each principal component is a linear combination of ALL features in the dataset
  • Features in the dataset are assumed to have a mean of 0 (centered)
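
A short numpy sketch of this idea (illustrative data; the loading vector is obtained here via the covariance-matrix eigendecomposition described in a later card):

```python
# Each PC score is a single linear combination of ALL centered features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # hypothetical dataset
X_centered = X - X.mean(axis=0)                # PCA assumes mean-0 (centered) features

cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
phi1 = eigvecs[:, -1]                          # loading vector of the first PC

z1 = X_centered @ phi1                         # first PC scores: one loading per feature
```
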
5
Q

Loadings

A

The weights (multipliers) applied to each feature in the linear combination that forms a principal component; the loadings of a PC together make up its loading vector.

6
Q

First principal component

A
  • Explains the largest portion of variance in a dataset
  • i.e. PCA adds components in decreasing order of variance explained until a sufficient amount of the variability is explained (the goal is to use the smallest number of components for which this is true)
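
A short sketch of this "add components until enough variance is explained" idea (assuming scikit-learn; the 90% threshold is illustrative):

```python
# Choose the smallest number of PCs whose cumulative explained variance is "sufficient".
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                  # hypothetical dataset

pca = PCA().fit(X)                             # fit all components
cum_var = np.cumsum(pca.explained_variance_ratio_)

k = int(np.argmax(cum_var >= 0.90)) + 1        # smallest k reaching 90% explained variance
print(cum_var, k)
```
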
7
Q

How are values for the first principal component loadings determined?

A

By maximizing the sample variance of the first principal component

  • Note: a NORMALIZED linear combination of the features (sum of squared loadings equal to 1) is used to prevent the variance from being inflated arbitrarily (which would happen if the loadings could be made as large as we like)
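
In symbols (the usual textbook formulation, with centered features x_ij and first-PC loadings φ_j1):

$$
\max_{\phi_{11},\dots,\phi_{p1}} \; \frac{1}{n}\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{p}\phi_{j1}x_{ij}\Bigr)^{2}
\quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^{2}=1
$$
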
8
Q

Second principal component

A

Linear combination of features that maximizes the remaining variability in the dataset (not captured by the 1st principal component)

9
Q

What is the dot product of the loading vectors for PC1 and PC2? Why?

A

0, because they are orthogonal
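
A quick numerical check (assuming scikit-learn; the data are illustrative):

```python
# The loading vectors of PC1 and PC2 are orthogonal, so their dot product is ~0.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                  # hypothetical dataset

pca = PCA(n_components=2).fit(X)
phi1, phi2 = pca.components_[0], pca.components_[1]   # PC1 and PC2 loading vectors

print(np.dot(phi1, phi2))                      # ~0, up to floating-point error
```
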

10
Q

How can we solve for loading vectors?

A

Eigendecomposition (not tested) of the covariance matrix
- This produces eigenvalues (the variances of each PC) and eigenvectors (the loading vectors).
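
A minimal numpy sketch of this decomposition (illustrative data):

```python
# Eigendecomposition of the covariance matrix:
# eigenvalues = PC variances, eigenvectors = loading vectors.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # hypothetical dataset
Xc = X - X.mean(axis=0)

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # returned in ascending order

order = np.argsort(eigvals)[::-1]              # sort PCs by variance, largest first
pc_variances = eigvals[order]                  # variance explained by each PC
loading_vectors = eigvecs[:, order]            # one loading vector per column
```
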

11
Q

What is the max number of distinct principal components that can be created?

A

For a dataset with n observations and p features, the max number of PCs is

min(n-1,p)
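
A quick check of this count (assuming scikit-learn; n = 4 and p = 10 are illustrative):

```python
# With n = 4 observations and p = 10 features, only min(n - 1, p) = 3 PCs have
# non-zero variance (centering the data removes one dimension).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))

pca = PCA().fit(X)
print(np.sum(pca.explained_variance_ > 1e-10))   # 3 distinct PCs
```
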

12
Q

Distinct principal component

A

A PC is distinct if its variance is non-zero (meaning that adding it still captures some of the remaining variance in the dataset).

13
Q

Biplot

A
  • Plots PC1 and PC2 against each other (bottom x-axis and left y-axis, respectively)
  • Can only visualize two PCs at a time
  • Look at the x and y components of each predictor’s loading vector: if the x component is larger, PC1 places more weight on that predictor; if the y component is larger, PC2 places more weight on that predictor
  • Overall weight is reflected in the magnitude of each vector
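
A rough sketch of a biplot (assuming matplotlib and scikit-learn; feature names are illustrative, and a full biplot would rescale the loading arrows onto their own axes):

```python
# Rough biplot: PC1/PC2 scores as points, loading vectors as arrows from the origin.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                    # hypothetical dataset
feature_names = ["x1", "x2", "x3", "x4"]         # hypothetical feature names

Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(Xs)
scores = pca.transform(Xs)

plt.scatter(scores[:, 0], scores[:, 1], s=10)    # observations in PC1-PC2 space
for j, name in enumerate(feature_names):
    # Arrow for feature j: x-component = PC1 loading, y-component = PC2 loading.
    plt.arrow(0, 0, pca.components_[0, j], pca.components_[1, j], head_width=0.02)
    plt.annotate(name, (pca.components_[0, j], pca.components_[1, j]))
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()
```
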
14
Q

Why is scaling necessary in PCA?

A
  • Because if the predictors are on different scales, the PCs will place more weight on some predictors than on others (this is visible in the biplot of all the predictors)
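
A small sketch of the effect (assuming scikit-learn; the inflated scale of the first feature is illustrative):

```python
# Without scaling, a feature measured on a larger scale dominates the loadings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 0] *= 100                                   # feature 0 on a much larger scale

print(PCA(n_components=1).fit(X).components_)    # PC1 dominated by feature 0
print(PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_)   # balanced
```
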
15
Q

Does PCR perform feature selection?

A

NO. PCR does not perform feature selection. All variables are used in producing the PCs

16
Q

The first principal component: (2)

A
  1. Is the line in p-dimensional space that is closest to the observations
  2. Is the direction that explains the most variance
17
Q

How does PCA reduce dimensionality?

A

By using linear transformations: the data are projected onto a smaller number of linear combinations (the principal components) of the original variables.

18
Q

What happens if the # of PCs = the # of original variables?

A

The data approximation is exact: with all components retained, no information is lost and 100% of the variability is explained (see the check below).
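
A quick check (assuming scikit-learn; the data are illustrative):

```python
# With as many PCs as original variables, the reconstruction is exact.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=5).fit(X)                 # number of PCs = number of variables
X_back = pca.inverse_transform(pca.transform(X)) # scores mapped back to feature space

print(np.allclose(X, X_back))                    # True: exact approximation
print(pca.explained_variance_ratio_.sum())       # 1.0: 100% of variability explained
```
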

19
Q

T/F PCA most useful for data with strong NON-linear relationships

A

FALSE. PCA is a linear technique, so it is most suitable for data with strong linear relationships.

20
Q

The sum of the scores of each PC must be:

A

0 (NOT 1)

Because the data are centered around a mean of 0, the positive and negative deviations cancel out.
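
A quick check (assuming scikit-learn; the data are illustrative):

```python
# Because PCA centers the data, the scores of each PC sum to (numerically) zero.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

scores = PCA().fit_transform(X)                  # PCA centers the features internally
print(scores.sum(axis=0))                        # ~[0, 0, 0, 0], up to floating-point error
```
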

21
Q

PC loading vectors
PC scores

A

Loading vectors: the DIRECTIONS in feature space along which the data vary the most

Scores: the PROJECTIONS of the observations onto those directions

22
Q

K-means clustering: at each iteration how does the number of clusters change?

A

The number of clusters stays the same or decreases (never increases): a cluster can end up empty, but no new clusters are created.