Module 8 - Unsupervised Learning Flashcards

1
Q

PCA: what is its goal? How does it achieve it?

A
1) Produce a low-dimensional representation of a dataset that explains a good fraction of the variance
= make new variable(s) ("PCs") from a linear combination of old ones, which will replace them

2) Pare the dataset down to a few important variables that summarize most of the information in the data, by finding a sequence of linear combinations of the variables that:

  • have maximal variance
  • are mutually uncorrelated (orthogonal)
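The idea can be sketched for two variables in pure Python (a toy illustration with made-up data, not the course's R workflow): the first PC is the direction of maximal variance, and the second is orthogonal to it.

```python
import math

# Toy 2-variable dataset (hypothetical values)
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Sample covariance matrix [[a, b], [b, c]]
a = sum((xi - mx) ** 2 for xi in x) / (n - 1)
c = sum((yi - my) ** 2 for yi in y) / (n - 1)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Largest eigenvalue of the 2x2 symmetric matrix (closed form);
# its unit eigenvector is the first principal component direction
lam1 = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
v1 = (b, lam1 - a)
norm = math.hypot(*v1)
v1 = (v1[0] / norm, v1[1] / norm)

# PC2 is orthogonal (perpendicular) to PC1
v2 = (-v1[1], v1[0])

def proj_var(v):
    """Sample variance of the data projected onto direction v."""
    scores = [(xi - mx) * v[0] + (yi - my) * v[1] for xi, yi in zip(x, y)]
    m = sum(scores) / n
    return sum((s - m) ** 2 for s in scores) / (n - 1)

print(proj_var(v1), proj_var(v2))  # PC1 captures the most variance
```

The projected variance along PC1 equals the largest eigenvalue of the covariance matrix, which is exactly the "maximal variance" property.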
2
Q

PCA: how to quantify the strength of the PCs?

A

Use proportion of variance explained (PVE) of each one
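As a quick sketch (with made-up PC variances), the PVE of each PC is its variance divided by the total variance across all PCs:

```python
# Hypothetical variances (eigenvalues) of four PCs, in decreasing order
pc_variances = [4.2, 2.1, 0.9, 0.3]

total = sum(pc_variances)
pve = [v / total for v in pc_variances]
cumulative = [sum(pve[: i + 1]) for i in range(len(pve))]

print(pve)         # proportion of variance explained by each PC
print(cumulative)  # cumulative PVE, used to decide how many PCs to keep
```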

3
Q

Clustering, goal?

A

1) Partition the data into distinct groups so that the observations within each group are similar to each other

2) Look for homogeneous subgroups among the observations
- Each group represents "similar" observations

4
Q

K-Means, how to choose K?

A

Use the elbow method

- Once the additional variance explained by adding another cluster becomes marginal, stop adding clusters; that bend in the plot is the "elbow".
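A minimal sketch of the elbow rule, using made-up total within-cluster variation values for K = 1..6 (the 10% threshold is an arbitrary illustrative choice):

```python
# Hypothetical total within-cluster variation for K = 1..6 clusters
wcv = {1: 100.0, 2: 45.0, 3: 20.0, 4: 19.0, 5: 18.5, 6: 18.2}

def elbow_k(wcv, threshold=0.10):
    """Pick the smallest K after which adding a cluster reduces
    WCV by less than `threshold` (as a fraction of the current WCV)."""
    ks = sorted(wcv)
    for k in ks[:-1]:
        drop = (wcv[k] - wcv[k + 1]) / wcv[k]
        if drop < threshold:
            return k
    return ks[-1]

print(elbow_k(wcv))  # the elbow sits where the big drops stop
```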

5
Q

2 ways to approach hierarchical clustering

A

1) Agglomerative clustering
- Start with each observation as its own cluster
- Merge the closest clusters at each stage
- Stop when only 1 cluster is left

2) Divisive clustering
- Consider all observations as a single cluster
- Progressively split into subclusters recursively
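The agglomerative approach can be sketched in pure Python with complete linkage on a toy 1-D example (real work would use R's hclust(); the data here are hypothetical):

```python
# Toy 1-D observations (hypothetical)
points = [1.0, 1.2, 5.0, 5.1, 9.0]

# Start with each observation as its own cluster
clusters = [[p] for p in points]
merge_history = []

def complete_linkage(c1, c2):
    # Complete linkage: distance between the two farthest members
    return max(abs(a - b) for a in c1 for b in c2)

# Repeatedly merge the two closest clusters until one remains
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: complete_linkage(clusters[ij[0]], clusters[ij[1]]),
    )
    merge_history.append((clusters[i], clusters[j]))
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merge_history)  # the sequence of merges forms the dendrogram
```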

6
Q

Goal of hierarchical clustering?

A
  • Produces a hierarchical representation of the data

- Use this method to better understand the data when we expect there to be hierarchical structure

7
Q

True or False

-The K-means clustering algorithm is less sensitive to the presence of outliers than the hierarchical clustering algorithm

A

FALSE

Both algorithms force every observation into a cluster, so both can be heavily distorted by the presence of outliers

8
Q

True or False

The K-means clustering algorithm requires random assignments, while the hierarchical clustering algorithm does not

A

TRUE

9
Q

True or False

-PCA provides the low-dimensional linear surfaces that are closest to the observations

A

TRUE

10
Q

What does the nstart parameter do in the K-means algorithm?

A
  • Controls the number of different random initial cluster centres to try; the run with the lowest total within-cluster variation is kept
  • This improves the chances of finding a better local optimum: each single run is greedy (the within-cluster variation must decrease at each step), so it can get stuck in a poor local optimum
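What nstart buys can be sketched with a bare-bones 1-D k-means in pure Python: run from several random initializations and keep the run with the smallest total within-cluster variation (a toy sketch, not R's kmeans() implementation; data and parameters are hypothetical):

```python
import random

def kmeans_1d(data, k, rng, iters=50):
    """One greedy k-means run: total WCV can only decrease each step."""
    centers = rng.sample(data, k)
    for _ in range(iters):
        # Assign each point to its nearest centre
        groups = [[] for _ in range(k)]
        for x in data:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        # Recompute centres as group means (keep old centre if group is empty)
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    wcv = sum((x - centers[i]) ** 2 for i, g in enumerate(groups) for x in g)
    return centers, wcv

def kmeans_nstart(data, k, nstart=20, seed=0):
    rng = random.Random(seed)
    # Keep the run with the lowest total within-cluster variation
    return min((kmeans_1d(data, k, rng) for _ in range(nstart)), key=lambda r: r[1])

data = [1.0, 1.1, 1.2, 5.0, 5.1, 9.0, 9.2]
centers, wcv = kmeans_nstart(data, k=3)
print(sorted(centers), wcv)
```

A single unlucky start can converge to a clearly worse clustering; with many starts the best run is kept.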
11
Q

True or false

-PCA can only be applied to numeric data

A

TRUE

-Categorical variables have to be converted beforehand

12
Q

True or false

-the more variance explained by a Principal component, the lower that PC is ranked

A

FALSE

-the more variance explained by a Principal component, the HIGHER that PC is ranked

13
Q

Maximum number of PCs?

A

Maximum number of PCs = min(n − 1, p), where n is the number of observations (data points) and p is the number of variables

14
Q

How does K-means quantify the quality of the produced clusters?

A
  • We want the clustering to explain as much variance as possible
  • Quantified by the ratio between_SS / total_SS: the higher, the better
  • Equivalently, a good clustering is one in which the within-cluster variation (WCV) is as SMALL as possible
  • WCV = a measure of the amount by which the observations within a cluster differ from each other
  • This is quantified via the DISTANCE between the cluster's CENTROID (centre) and each observation within the cluster
  • K-means partitions the observations into K clusters such that the total WCV, summed over all K clusters, is as SMALL as possible
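Given a clustering, these quantities can be computed directly; a toy 1-D sketch with hypothetical points already assigned to two clusters:

```python
# Hypothetical 1-D observations already assigned to clusters
clusters = [[1.0, 1.2, 0.8], [5.0, 5.4, 4.9]]

all_points = [x for c in clusters for x in c]
grand_mean = sum(all_points) / len(all_points)

wcv = 0.0         # within-cluster variation (squared distances to centroids)
between_ss = 0.0  # between-cluster sum of squares
for c in clusters:
    centroid = sum(c) / len(c)
    wcv += sum((x - centroid) ** 2 for x in c)
    between_ss += len(c) * (centroid - grand_mean) ** 2

total_ss = sum((x - grand_mean) ** 2 for x in all_points)

# total_SS decomposes into between_SS + within_SS,
# so a high between_SS / total_SS ratio means a small WCV
print(between_ss / total_ss)
```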
15
Q

PCA with categorical variables: what is the 1st step when using the prcomp() function in this case?

A
  • prcomp() requires numerical data
  • First thing to do is binarize any categorical variables
  • Use dummyVars from library(caret)
  • Set fullRank = TRUE
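The binarization step can be sketched in pure Python (the course uses caret's dummyVars; fullRank = TRUE corresponds to dropping one reference level per variable, as below; the column data are hypothetical):

```python
def full_rank_dummies(values):
    """One-hot encode a categorical column, dropping the first level
    (full-rank coding) to avoid perfect collinearity."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level becomes the reference level
    return kept, [[1 if v == lvl else 0 for lvl in kept] for v in values]

weather = ["dry", "wet", "dry", "clear", "wet"]  # hypothetical column
cols, encoded = full_rank_dummies(weather)
print(cols)     # dummy column names
print(encoded)  # numeric rows, ready for PCA
```

A row of all zeros encodes the reference level ("clear" here).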
16
Q

What to do after doing the PCA? How to test ?

A
  • Fit the model with the new PCA variable and evaluate it on the training/test set
  • Then check whether the new variable increases the accuracy, reduces the error, improves the AIC, etc., to decide whether it should be added to the model
17
Q

By default, what kind of clustering function does hclust() use?

A

Agglomerative clustering with complete linkage

18
Q

Why is it good to standardize the variables prior to doing PCA?

A
  • You want to standardize the initial variables so that each one of them contributes equally to the analysis
  • If there are large differences between the ranges of the initial variables, the variables with larger ranges will dominate those with smaller ranges.
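Standardizing means giving each variable mean 0 and standard deviation 1 (in R this corresponds to prcomp(..., scale. = TRUE)); a sketch with hypothetical variables on very different scales:

```python
import math

def standardize(column):
    """Rescale a variable to mean 0 and (sample) standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]

# A small-range and a large-range variable (hypothetical)
age = [25, 32, 40, 51, 63]
income = [30000, 45000, 52000, 80000, 120000]

z_age, z_income = standardize(age), standardize(income)
print(z_age)
print(z_income)  # both now contribute on a comparable scale
```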
19
Q

How to interpret the PCA loadings?

1) Size?
2) Sign?

A

1) Size/Magnitude of coefficient
- The larger the absolute value of the coefficient, the more important the corresponding variable is in calculating that specific component

  • Variables with LOW influence on the PC get values close to 0
  • Variables with MORE influence get numbers further from 0

2) Sign
- Variables with the same sign = positively related with each other
- Variables with opposing signs = inversely related to each other

20
Q

Kmeans - Pros?

A
  1. Simple, easy to implement
  2. Suitable for large datasets

21
Q

Kmeans - Cons?

A
  1. Need to set K at the beginning of the algorithm
  2. Greedy algorithm
  3. Will have different results with different runs of the algorithm
22
Q

PCA - Interpret these loadings

Dry: -0.51
wet: 0.50
Clear: -0.50
rain: 0.4

A

The greater the absolute value of the loading, the greater that variable's effect

Applying these weights creates a variable that is strongly positive for rain/wet conditions, and strongly negative for dry/clear conditions
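A PC score is just these loadings applied to the (standardized) variable values; a sketch using the loadings above with two made-up standardized rows:

```python
# Loadings from the card above (Dry, wet, Clear, rain)
loadings = [-0.51, 0.50, -0.50, 0.40]

# Hypothetical standardized rows: a rainy/wet day and a dry/clear day
rainy_day = [-1.0, 1.0, -1.0, 1.0]
dry_day = [1.0, -1.0, 1.0, -1.0]

def pc_score(row, loadings):
    # Linear combination of the variables, weighted by the loadings
    return sum(x * w for x, w in zip(row, loadings))

print(pc_score(rainy_day, loadings))  # strongly positive
print(pc_score(dry_day, loadings))    # strongly negative
```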

23
Q

PCA: true or false

when adding the PC to the model, you need to remove the underlying variables from the dataset to avoid a rank-deficient fit.

A

TRUE

-The PC is a linear combination of the underlying variables, so keeping both makes the model matrix perfectly collinear

24
Q

True or false: in PCA, for categorical variables to work, they need to be converted to numeric

A

TRUE

-See card 11: categorical variables must be converted (e.g. binarized) before PCA