Module 8 - Unsupervised Learning Flashcards
PCA, goal? how does it do that?
1) Produce a low-dimensional representation of a dataset that explains a good fraction of the variance = make new variable(s) ("PCs") as linear combinations of the old ones, which will replace them
2) Pare the dataset down to a few important variables that summarize most of the information in the data, by finding a sequence of linear combinations of the variables that:
- have maximal variance
- are mutually uncorrelated (orthogonal/perpendicular)
PCA: how to quantify the strength of the PCs?
Use proportion of variance explained (PVE) of each one
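A minimal sketch of computing PVE (shown in Python/NumPy with made-up toy data; in R, summary(prcomp(x)) reports the same quantities):

```python
import numpy as np

# Hypothetical toy data: 6 observations, 2 strongly correlated variables
X = np.array([[2.0, 1.9], [0.5, 0.7], [1.5, 1.6],
              [3.0, 2.8], [1.0, 1.1], [2.5, 2.4]])

Xc = X - X.mean(axis=0)                  # center each variable
cov = np.cov(Xc, rowvar=False)           # sample covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]  # PC variances, largest first

pve = eigvals / eigvals.sum()            # proportion of variance explained per PC
```

Because the two variables are nearly collinear here, PC1's PVE is close to 1.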
Clustering, goal?
1) Partition the data into distinct groups so that the observations within each group are similar to each other
2) Looking for homogeneous subgroups among the observations
- Each group represents “similar” observations
K-Means, how to choose K?
Use the elbow method
- Stop adding clusters once the additional variance explained by each new cluster becomes marginal (the "elbow" of the plot)
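The elbow method can be sketched as follows (Python/NumPy with hypothetical two-blob data; in R you would run kmeans() for each K and plot tot.withinss):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, n_iter=50):
    """Minimal Lloyd's algorithm; returns total within-cluster sum of squares."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest centroid
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned observations
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return ((X - centers[labels]) ** 2).sum()

# Two well-separated blobs, so the "elbow" should appear at K = 2
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
wcss = {k: kmeans(X, k) for k in (1, 2, 3)}
# The drop from K=1 to K=2 is large; from K=2 to K=3 it is marginal -> stop at 2
```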
2 ways to approach hierarchical clustering
1) Agglomerative clustering
- Consider each observation as its own cluster
- Gradually group them with nearby clusters at each stage
- Stop when only 1 cluster is left
2) Divisive clustering
- Consider all observations as a single cluster
- Progressively split into subclusters recursively
Goal of hierarchical clustering?
- Produces a hierarchical representation of the data
- Use this method to better understand the data when we expect hierarchical structure
True or False
- The K-means clustering algorithm is less sensitive to the presence of outliers than the hierarchical clustering algorithm
FALSE
Both algorithms force each observation into a cluster, so both may be heavily distorted by the presence of outliers
True or False
The K-Means clustering algorithm requires random assignments, while the hierarchical clustering algorithm does not
TRUE
True or False
- PCA provides the low-dimensional linear surfaces that are closest to the observations
TRUE
What does the nstart parameter do in the K-Means algorithm?
- Controls the # of different random initial cluster centres to be used
- Running the algorithm from several starts improves the chances of finding a good local optimum: each run is greedy, since the total within-cluster variation MUST DECREASE at each step, so a single run can get stuck in a poor local minimum
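A sketch of what multiple starts buy you (pure Python, hypothetical 1-D data): run K-means from several random initializations and keep the run with the lowest total within-cluster SS, which is effectively what nstart > 1 does:

```python
import random

random.seed(1)
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]  # hypothetical 1-D data, two clear groups

def one_kmeans_run(data, k):
    """One K-means run from a random start; returns (wcss, centers)."""
    centers = random.sample(data, k)
    for _ in range(25):
        clusters = [[] for _ in range(k)]
        for x in data:                      # assignment step
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # update step: move each centre to its cluster mean
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    wcss = sum(min((x - c) ** 2 for c in centers) for x in data)
    return wcss, centers

# nstart-style behaviour: run several times, keep the best local optimum
best_wcss, best_centers = min(one_kmeans_run(data, 2) for _ in range(10))
```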
True or false
-PCA can only be applied on numeric data
TRUE
-Categorical variables have to be converted beforehand
True or false
-the more variance explained by a Principal component, the lower that PC is ranked
FALSE
-the more variance explained by a Principal component, the HIGHER that PC is ranked
Maximum number of PCs?
Maximum number of PCs = min(n − 1, p), where n = number of observations and p = number of variables
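A quick NumPy check of this limit with made-up data (n = 3 observations, p = 5 variables, so at most n − 1 = 2 PCs can carry variance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))        # n = 3 observations, p = 5 variables
Xc = X - X.mean(axis=0)            # center each variable

# Squared singular values of the centered matrix are proportional to the
# PC variances; only min(n - 1, p) of them can be nonzero
s = np.linalg.svd(Xc, compute_uv=False)
n_informative_pcs = int((s > 1e-10).sum())
```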
How does k-Means quantify the quality of the produced clusters?
- Want the clusters to explain as much of the variance as possible
- Using the ratio between_SS / Total_SS
- Good clustering = clustering for which the WCV is as SMALL as possible
- WCV = measure of the amount by which the observations within a cluster differ from each other
- Quantified via the (squared) DISTANCE between the cluster's CENTROID (centre) and each observation within the cluster
- Partition the observations into K clusters such that the total WCV, summed over all K clusters, is as SMALL as possible
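A minimal sketch of these quantities (Python/NumPy, with hypothetical data and labels; kmeans() in R reports the same between_SS / total_SS ratio):

```python
import numpy as np

# Hypothetical clustered data and an assignment produced by K-means
X = np.array([[1.0, 1.1], [0.9, 0.8], [1.2, 1.0],
              [5.0, 5.2], [4.8, 5.1], [5.1, 4.9]])
labels = np.array([0, 0, 0, 1, 1, 1])

total_ss = ((X - X.mean(axis=0)) ** 2).sum()           # Total_SS
within_ss = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
                for j in (0, 1))                       # WCV summed over clusters
between_ss = total_ss - within_ss                      # between_SS

ratio = between_ss / total_ss  # close to 1 for tight, well-separated clusters
```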
PCA, with categorical variables? what is the 1st step when using prcomp function in this case?
- prcomp() requires numerical data
- First thing to do is binarize any categorical variables
- Use dummyVars from library(caret)
- Set fullRank = TRUE
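A sketch of the binarization step itself (plain Python standing in for caret::dummyVars with fullRank = TRUE, on hypothetical data):

```python
# Full-rank ("drop one level") one-hot encoding of a categorical column,
# mimicking what dummyVars(..., fullRank = TRUE) produces in R
colour = ["red", "green", "blue", "green", "red"]   # hypothetical variable

levels = sorted(set(colour))            # ['blue', 'green', 'red']
baseline, kept = levels[0], levels[1:]  # drop one level to avoid collinearity

encoded = [[1 if value == lvl else 0 for lvl in kept] for value in colour]
# Each row now has len(levels) - 1 numeric indicator columns
```

The baseline level is represented by all-zero indicators, which is what keeps the design matrix full rank.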
What to do after doing the PCA? How to test ?
- Test the PCA variable on the training/test set
- Then check whether the new variable increases the accuracy / reduces the error / improves the AIC, etc., to decide whether it should be added to the model
By default, what kind of clustering function does hclust() use?
Agglomerative clustering with complete linkage
Why is it good to standardize the variables prior to doing PCA?
- You want to standardize the initial variables so that each one of them contributes equally to the analysis
- If there are large differences between the ranges of initial variables, those variables with larger ranges will dominate over those with smaller ranges.
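A sketch of the effect (Python/NumPy, hypothetical data): without scaling, the large-range variable dominates the first PC's loadings; after scaling, the contributions even out:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical data: two related variables on very different scales
age = rng.normal(40, 10, n)                    # range ~ tens
income = 1000 * age + rng.normal(0, 5000, n)   # range ~ tens of thousands
X = np.column_stack([income, age])

def pc1_loadings(M):
    """Loadings of the first PC (eigenvector of the largest eigenvalue)."""
    Mc = M - M.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(Mc, rowvar=False))
    return vecs[:, -1]

raw = np.abs(pc1_loadings(X))                  # income dominates PC1
std = np.abs(pc1_loadings(X / X.std(axis=0)))  # contributions even out
```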
How to interpret the PCA loadings?
1) Size?
2) Sign?
1) Size/Magnitude of coefficient
- The larger the absolute value of the coefficient, the more important the corresponding variable is in calculating that specific component
- Variables with LOW influence on the PC get values close to 0
- Variables with MORE influence get numbers further from 0
2) Sign
- Variables with the same sign = positively related with each other
- Variables with opposing signs = inversely related to each other
Kmeans - Pros?
- Simple, easy to implement
- Suitable for large datasets
Kmeans - Cons?
- Need to set K at the beginning of the algorithm
- Greedy algorithm
- Will have different results with different runs of the algorithm
PCA - Interpret these loadings
Dry: -0.51
Wet: 0.50
Clear: -0.50
Rain: 0.40
The greater the absolute value of the loading, the greater the effect
Applying these weights creates a variable that is strongly positive for rain/wet conditions, and strongly negative for dry/clear conditions
PCA: true or false
when adding the PC to the model, you need to remove the underlying variables from the dataset to avoid a rank deficient fit.
TRUE