Module 8 - Unsupervised Learning Flashcards
PCA, goal? how does it do that?
1) Produce a low-dimensional representation of a dataset that explains a good fraction of the variance = make new variable(s) ("PCs") as linear combinations of the old ones, which will replace them
2) Pare the dataset down to a few important variables that summarize most of the information in the data, by finding a sequence of linear combinations of the variables that:
- have maximal variance
- are mutually uncorrelated (orthogonal/perpendicular)
PCA: how to quantify the strength of the PCs?
Use proportion of variance explained (PVE) of each one
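A minimal sketch of computing PVE (shown in Python/NumPy with made-up toy data; in R, summary(prcomp(x)) reports the same quantities):

```python
import numpy as np

# Hypothetical toy data: 6 observations, 2 strongly correlated variables
X = np.array([[2.0, 1.9], [0.5, 0.7], [1.5, 1.6],
              [3.0, 2.8], [1.0, 1.1], [2.5, 2.4]])

Xc = X - X.mean(axis=0)                  # center each variable
cov = np.cov(Xc, rowvar=False)           # sample covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]  # PC variances, largest first

pve = eigvals / eigvals.sum()            # proportion of variance explained per PC
```

Because the two variables are nearly collinear here, PC1's PVE is close to 1.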
Clustering, goal?
1) Partition the data into distinct groups so that the observations within each group are similar to each other
2) Looking for homogeneous subgroups among the observations
- Each group represents “similar” observations
K-Means, how to choose K?
Use the elbow method
- Stop adding clusters once the additional variance explained by each new cluster becomes marginal (the "elbow" of the plot)
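The elbow method can be sketched as follows (Python/NumPy with hypothetical two-blob data; in R you would run kmeans() for each K and plot tot.withinss):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iter=50):
    """Minimal Lloyd's algorithm; returns total within-cluster sum of squares."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest centroid
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned observations
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return ((X - centers[labels]) ** 2).sum()

# Two well-separated blobs, so the "elbow" should appear at K = 2
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
wcss = {k: kmeans(X, k) for k in (1, 2, 3)}
# The drop from K=1 to K=2 is large; from K=2 to K=3 it is marginal -> stop at 2
```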
2 ways to approach hierarchical clustering
1) Agglomerative clustering
- Consider each observation as its own cluster
- Gradually group them with nearby clusters at each stage
- Stop when only 1 cluster is left
2) Divisive clustering
- Consider all observations as a single cluster
- Progressively split into subclusters recursively
Goal of hierarchical clustering?
- Produces a hierarchical representation of the data
- Use this method to better understand the data when we expect hierarchical structure
True or False
- The K-means clustering algorithm is less sensitive to the presence of outliers than the hierarchical clustering algorithm
FALSE
Both algorithms force each observation into a cluster, so both may be heavily distorted by the presence of outliers
True or False
The K-Means clustering algorithm requires random assignments, while the hierarchical clustering algorithm does not
TRUE
True or False
- PCA provides the low-dimensional linear surfaces that are closest to the observations
TRUE
What does the nstart parameter do in the K-Means algorithm?
- Controls the # of different random initial cluster centres to be used
- Running the algorithm from several starts improves the chances of finding a good local optimum: each run is greedy, since the total within-cluster variation MUST DECREASE at each step, so a single run can get stuck in a poor local minimum
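A sketch of what multiple starts buy you (pure Python, hypothetical 1-D data): run K-means from several random initializations and keep the run with the lowest total within-cluster SS, which is effectively what nstart > 1 does:

```python
import random

random.seed(1)
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]  # hypothetical 1-D data, two clear groups

def one_kmeans_run(data, k):
    """One K-means run from a random start; returns (wcss, centers)."""
    centers = random.sample(data, k)
    for _ in range(25):
        clusters = [[] for _ in range(k)]
        for x in data:                      # assignment step
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # update step: move each centre to its cluster mean
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    wcss = sum(min((x - c) ** 2 for c in centers) for x in data)
    return wcss, centers

# nstart-style behaviour: run several times, keep the best local optimum
best_wcss, best_centers = min(one_kmeans_run(data, 2) for _ in range(10))
```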
True or false
-PCA can only be applied on numeric data
TRUE
-Categorical variables have to be converted beforehand
True or false
-the more variance explained by a Principal component, the lower that PC is ranked
FALSE
-the more variance explained by a Principal component, the HIGHER that PC is ranked
Maximum number of PCs?
Maximum number of PCs = min(n − 1, p), where n = number of observations and p = number of variables
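A quick NumPy check of this limit with made-up data (n = 3 observations, p = 5 variables, so at most n − 1 = 2 PCs can carry variance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))        # n = 3 observations, p = 5 variables
Xc = X - X.mean(axis=0)            # center each variable

# Squared singular values of the centered matrix are proportional to the
# PC variances; only min(n - 1, p) of them can be nonzero
s = np.linalg.svd(Xc, compute_uv=False)
n_informative_pcs = int((s > 1e-10).sum())
```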
How does k-Means quantify the quality of the produced clusters?
- Want the clusters to explain as much of the variance as possible
- Using the ratio between_SS / Total_SS
- Good clustering = clustering for which the WCV is as SMALL as possible
- WCV = measure of the amount by which the observations within a cluster differ from each other
- Quantified via the (squared) DISTANCE between the cluster's CENTROID (centre) and each observation within the cluster
- Partition the observations into K clusters such that the total WCV, summed over all K clusters, is as SMALL as possible
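A minimal sketch of these quantities (Python/NumPy, with hypothetical data and labels; kmeans() in R reports the same between_SS / total_SS ratio):

```python
import numpy as np

# Hypothetical clustered data and an assignment produced by K-means
X = np.array([[1.0, 1.1], [0.9, 0.8], [1.2, 1.0],
              [5.0, 5.2], [4.8, 5.1], [5.1, 4.9]])
labels = np.array([0, 0, 0, 1, 1, 1])

total_ss = ((X - X.mean(axis=0)) ** 2).sum()           # Total_SS
within_ss = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
                for j in (0, 1))                       # WCV summed over clusters
between_ss = total_ss - within_ss                      # between_SS

ratio = between_ss / total_ss  # close to 1 for tight, well-separated clusters
```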
PCA, with categorical variables? what is the 1st step when using prcomp function in this case?
- prcomp() requires numerical data
- First thing to do is binarize any categorical variables
- Use dummyVars from library(caret)
- Set fullRank = TRUE
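A sketch of the binarization step itself (plain Python standing in for caret::dummyVars with fullRank = TRUE, on hypothetical data):

```python
# Full-rank ("drop one level") one-hot encoding of a categorical column,
# mimicking what dummyVars(..., fullRank = TRUE) produces in R
colour = ["red", "green", "blue", "green", "red"]   # hypothetical variable

levels = sorted(set(colour))            # ['blue', 'green', 'red']
baseline, kept = levels[0], levels[1:]  # drop one level to avoid collinearity

encoded = [[1 if value == lvl else 0 for lvl in kept] for value in colour]
# Each row now has len(levels) - 1 numeric indicator columns
```

The baseline level is represented by all-zero indicators, which is what keeps the design matrix full rank.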
What to do after doing the PCA? How to test ?
- Test the PCA variable on the training/test set
- Then check whether the new variable increases the accuracy / reduces the error / improves the AIC, etc., to decide whether it should be added to the model
By default, what kind of clustering function does hclust() use?
Agglomerative clustering with complete linkage
Why is it good to standardize the variables prior to doing PCA?
- You want to standardize the initial variables so that each one of them contributes equally to the analysis
- If there are large differences between the ranges of initial variables, those variables with larger ranges will dominate over those with smaller ranges.
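A sketch of the effect (Python/NumPy, hypothetical data): without scaling, the large-range variable dominates the first PC's loadings; after scaling, the contributions even out:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical data: two related variables on very different scales
age = rng.normal(40, 10, n)                    # range ~ tens
income = 1000 * age + rng.normal(0, 5000, n)   # range ~ tens of thousands
X = np.column_stack([income, age])

def pc1_loadings(M):
    """Loadings of the first PC (eigenvector of the largest eigenvalue)."""
    Mc = M - M.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(Mc, rowvar=False))
    return vecs[:, -1]

raw = np.abs(pc1_loadings(X))                  # income dominates PC1
std = np.abs(pc1_loadings(X / X.std(axis=0)))  # contributions even out
```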
How to interpret the PCA loadings?
1) Size?
2) Sign?
1) Size/Magnitude of coefficient
- The larger the absolute value of the coefficient, the more important the corresponding variable is in calculating that specific component
- Variables with LOW influence on the PC get values close to 0
- Variables with MORE influence get numbers further from 0
2) Sign
- Variables with the same sign = positively related with each other
- Variables with opposing signs = inversely related to each other
Kmeans - Pros?
- Simple, easy to implement
- Suitable for large datasets
Kmeans - Cons?
- Need to set K at the beginning of the algorithm
- Greedy algorithm
- Will have different results with different runs of the algorithm
PCA - Interpret these loadings
Dry: -0.51
Wet: 0.50
Clear: -0.50
Rain: 0.40
The greater the absolute value of the loading, the greater the effect
Applying these weights creates a variable that is strongly positive for rain/wet conditions, and strongly negative for dry/clear conditions
PCA: true or false
when adding the PC to the model, you need to remove the underlying variables from the dataset to avoid a rank deficient fit.
TRUE