6. Unsupervised Learning Flashcards
What is PCA?
A statistical tool that finds a low-dimensional representation of a dataset that retains as much of the dataset's variation as possible.
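Can PCA be used in both a supervised and an unsupervised setting?
Yes. PCA itself is unsupervised, but its components can be used as predictors in a supervised setting, as in principal components regression (PCR).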
Formula for the mth PC
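$Z_m = \phi_{1m} X_1 + \phi_{2m} X_2 + \dots + \phi_{pm} X_p$, where $\phi_{jm}$ is the loading of variable $X_j$ on the mth PC.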
The principal component loadings are constrained to what?
$\sum_{j=1}^{p} \phi_{jm}^2 = 1$ (each loading vector has unit length).
Mth PC score formula
$z_{im} = \sum_{j=1}^{p} \phi_{jm} x_{ij}$, the score of observation $i$ on the mth PC.
For PCA, do the variables need to be scaled or centred?
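Centred, so that each variable has mean zero. Scaling is usually also recommended unless the variables are measured in the same units.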
What is the maximum number of PCs?
min(n-1, p)
What is the formula for PVE of the mth PC?
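$PVE_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}$, assuming the variables have been centred.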
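The PCA cards above (loadings, scores, PVE) can be tied together in a small worked example. This is a minimal sketch using only the Python standard library. The dataset `X` is made up for illustration, and power iteration is one assumed way of finding the first loading vector, not necessarily how any particular package does it.

```python
import math

def center(X):
    """Subtract each column's mean so every variable has mean zero."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def first_loading(Xc, iters=500):
    """Power iteration on X^T X; returns a unit-length loading vector."""
    n, p = len(Xc), len(Xc[0])
    # S = X^T X is proportional to the sample covariance matrix of centred data.
    S = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) for b in range(p)]
         for a in range(p)]
    phi = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        v = [sum(S[a][b] * phi[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        phi = [x / norm for x in v]  # enforce the constraint sum(phi_j^2) = 1
    return phi

def scores(Xc, phi):
    """Score z_i = sum_j phi_j * x_ij for each observation."""
    return [sum(phi[j] * row[j] for j in range(len(phi))) for row in Xc]

def pve(Xc, z):
    """Proportion of variance explained: sum_i z_i^2 / sum_j sum_i x_ij^2."""
    total = sum(x * x for row in Xc for x in row)
    return sum(s * s for s in z) / total

# Hypothetical data: 6 observations, 2 positively correlated variables.
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]]
Xc = center(X)
phi1 = first_loading(Xc)
z1 = scores(Xc, phi1)
pve1 = pve(Xc, z1)
print("loadings:", phi1)
print("PVE of PC1:", pve1)
```

With n = 6 and p = 2, at most min(n-1, p) = 2 components exist, so the first PC's PVE and the second PC's PVE sum to 1.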
What is the model equation for PCR? What happens when k=p?
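$y_i = \theta_0 + \theta_1 z_{i1} + \dots + \theta_k z_{ik} + \epsilon_i$. When $k = p$, PCR gives the same fit as least squares on all $p$ original predictors.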
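To make the k = p question concrete, here is a hedged sketch of PCR on two predictors. Everything in it is an assumption for illustration: the data are invented, the predictors are only centred (they are taken to be on the same scale already), and the loadings come from the closed-form principal-axis angle of a 2x2 matrix. Because the PC scores are orthogonal, each coefficient can be estimated by a separate simple regression.

```python
import math

def center(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def cross_products(x1, x2):
    """Centred predictors plus the entries of X^T X = [[a, b], [b, c]]."""
    c1, c2 = center(x1), center(x2)
    a = sum(u * u for u in c1)
    b = sum(u * v for u, v in zip(c1, c2))
    c = sum(v * v for v in c2)
    return c1, c2, a, b, c

def pcr_fit(x1, x2, y, k):
    """Fitted values from regressing y on the first k PCs (k = 1 or 2)."""
    c1, c2, a, b, c = cross_products(x1, x2)
    # Direction of the largest-eigenvalue eigenvector of a symmetric 2x2 matrix.
    angle = 0.5 * math.atan2(2 * b, a - c)
    loadings = [(math.cos(angle), math.sin(angle)),
                (-math.sin(angle), math.cos(angle))]
    ybar = sum(y) / len(y)
    fitted = [ybar] * len(y)
    for phi in loadings[:k]:
        z = [phi[0] * u + phi[1] * v for u, v in zip(c1, c2)]
        theta = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)
        fitted = [f + theta * zi for f, zi in zip(fitted, z)]
    return fitted

def ols_fit(x1, x2, y):
    """Least squares on both original predictors via the 2x2 normal equations."""
    c1, c2, a, b, c = cross_products(x1, x2)
    r1 = sum(u * yi for u, yi in zip(c1, y))
    r2 = sum(v * yi for v, yi in zip(c2, y))
    det = a * c - b * b
    beta1 = (c * r1 - b * r2) / det
    beta2 = (a * r2 - b * r1) / det
    ybar = sum(y) / len(y)
    return [ybar + beta1 * u + beta2 * v for u, v in zip(c1, c2)]

def rss(y, fitted):
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Hypothetical data: two nearly collinear predictors and a response.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.3]
y = [2.0, 4.1, 6.2, 7.9, 10.1, 12.2]

fit_k1 = pcr_fit(x1, x2, y, k=1)
fit_k2 = pcr_fit(x1, x2, y, k=2)
fit_ols = ols_fit(x1, x2, y)
print("RSS with k=1:", rss(y, fit_k1))
print("RSS with k=2:", rss(y, fit_k2))
print("RSS with OLS:", rss(y, fit_ols))
```

With k = 2 = p the PCR fit should agree with the least squares fit up to rounding, while k = 1 gives a more constrained model with a training RSS that is at least as large.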
What are the two methods of calculating within cluster variation?
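$W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$ (pairwise squared distances) and $W(C_k) = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2$ (squared distances to the cluster centroid).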
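The two ways of computing within-cluster variation give the same number, which a quick numeric check can confirm. This sketch assumes the usual definitions: the pairwise version sums squared Euclidean distances over all ordered pairs in the cluster and divides by the cluster size, and the centroid version is twice the sum of squared distances to the centroid. The three-point cluster is made up for illustration.

```python
def pairwise_variation(cluster):
    """(1/|C|) * sum over ordered pairs (i, i') of squared Euclidean distance."""
    n = len(cluster)
    total = sum(
        (a[j] - b[j]) ** 2
        for a in cluster
        for b in cluster
        for j in range(len(a))
    )
    return total / n

def centroid_variation(cluster):
    """2 * sum over observations of squared distance to the cluster centroid."""
    n, p = len(cluster), len(cluster[0])
    centroid = [sum(x[j] for x in cluster) / n for j in range(p)]
    return 2 * sum((x[j] - centroid[j]) ** 2 for x in cluster for j in range(p))

# Hypothetical cluster of three observations on two variables.
cluster = [[1.0, 2.0], [2.0, 4.0], [4.0, 3.0]]
w_pairs = pairwise_variation(cluster)
w_centroid = centroid_variation(cluster)
print(w_pairs, w_centroid)
```

The centroid form is the one k means exploits: moving each observation to its nearest centroid can never increase this quantity.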
What is the algorithm for k means clustering?
- Randomly assign each observation to one of k clusters, where k is chosen in advance. These are the initial cluster assignments.
- Calculate the centroid of each cluster
- For each observation, identify the closest centroid and reassign to that cluster
- Repeat steps 2 and 3 until the cluster assignments stop changing.
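The four steps above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the toy data, and the rule for handling an empty cluster (re-seeding it with a random observation) are all assumptions that the steps themselves do not specify.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance between an observation and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kmeans(X, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Step 1: randomly assign each observation to one of k clusters.
    labels = [rng.randrange(k) for _ in range(n)]
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid (variable-wise mean) of each cluster.
        centroids = []
        for c in range(k):
            members = [X[i] for i in range(n) if labels[i] == c]
            if not members:
                # Assumption: an empty cluster is re-seeded with a random observation.
                members = [X[rng.randrange(n)]]
            centroids.append(
                [sum(m[j] for m in members) / len(members) for j in range(p)]
            )
        # Step 3: reassign each observation to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(x, centroids[c])) for x in X
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical data: two well-separated groups of three points.
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Because the result depends on the random starting assignment, a common practice is to run the algorithm from several seeds and keep the run with the smallest total within-cluster variation.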
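What are 3 drawbacks of k means clustering?
- Initial cluster assignments affect the final assignments.
- k must be chosen in advance, and there is no single agreed way to choose it.
- Not robust: outliers and small changes to the data can change the clusters.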
Does k means need to have its variables standardized?
Not necessarily. Whether to standardize depends heavily on the problem at hand.
Are k means and hierarchical clustering robust?
No
Is k means clustering greedy?
Yes
Centroid linkage is subject to _____, and single linkage has a dendrogram that is _____.
Inversions; skewed
True or false: when performing PCR, it is recommended to standardize the predictors prior to generating the principal components.
True. This prevents high-variance variables from monopolizing the principal components.
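A small numeric illustration of how a high-variance variable can dominate the first principal component. This is a sketch under stated assumptions: the `income` and `age` values are invented, and the first loading vector is obtained from the closed-form principal-axis angle for two variables.

```python
import math
import statistics

def first_loading(x1, x2):
    """First PC loading vector for two variables (closed form for a 2x2 matrix)."""
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    c1 = [u - m1 for u in x1]
    c2 = [v - m2 for v in x2]
    a = sum(u * u for u in c1)
    b = sum(u * v for u, v in zip(c1, c2))
    c = sum(v * v for v in c2)
    angle = 0.5 * math.atan2(2 * b, a - c)
    return (math.cos(angle), math.sin(angle))

def standardize(v):
    """Centre to mean zero and scale to unit sample standard deviation."""
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(x - m) / s for x in v]

# Hypothetical data: income in dollars has far larger variance than age in years.
income = [30000.0, 45000.0, 52000.0, 61000.0, 75000.0]
age = [25.0, 32.0, 47.0, 38.0, 51.0]

raw = first_loading(income, age)
scaled = first_loading(standardize(income), standardize(age))
print("unscaled loadings:", raw)
print("standardized loadings:", scaled)
```

On the raw data nearly all of the loading weight falls on `income`, simply because of its units. After standardizing, the two variables contribute equally to the first component.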
Can PCR reduce overfitting?
Yes. Instead of using all of the original variables, PCR uses only the first k PCs to predict the response, which reduces overfitting.