Unsupervised Learning Flashcards
PCA vs Principal Components Regression (PCR)
PCR just refers to performing a regression on the principal components obtained by transforming the feature set. PCA itself is the UNSUPERVISED analysis: it is run on a data set by itself to understand the structure of that data.
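A minimal PCR sketch with scikit-learn (the data is synthetic and n_components=3 is an arbitrary placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # toy feature matrix
y = X[:, 0] + rng.normal(size=100)  # toy response

# PCR: transform the features into principal components, then regress on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
```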
Pre processing steps to PCA
you have to center and scale your variables prior to PCA; otherwise the variables with the largest variance dominate the components.
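For example (a sketch using scikit-learn's StandardScaler; the toy matrix is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_scaled = StandardScaler().fit_transform(X)  # each column now has mean 0, std 1
# Without this step, the second column's huge variance would dominate the PCs.
```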
Things you can do with PCA
you can look at the loading vectors output by PCA to understand which underlying variables are most important for each principal component. If multiple variables have similarly large loadings within a principal component, they are highly correlated with each other. Conversely, variables that load mainly on different principal components are generally uncorrelated. It is kind of like clustering your variables. See pg. 377 of Intro to Statistical Learning.
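A sketch of inspecting loadings, assuming scikit-learn and its built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X = StandardScaler().fit_transform(data.data)
pca = PCA().fit(X)

# Each row of components_ is a loading vector; large-magnitude entries mark
# the variables that drive that principal component.
for name, weight in zip(data.feature_names, pca.components_[0]):
    print(f"{name}: {weight:.2f}")
```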
How to know how many principal components you need in PCA?
Use the Proportion of Variance Explained (PVE) by each principal component; you can also compute the cumulative PVE for the first M components. Plot the cumulative PVE against the number of principal components and look for an “elbow”.
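A sketch of a cumulative PVE (“scree”) plot, assuming scikit-learn and matplotlib:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

pve = pca.explained_variance_ratio_  # PVE of each component
plt.plot(np.arange(1, len(pve) + 1), np.cumsum(pve), marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative PVE")
plt.show()  # look for an elbow in this curve
```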
How to interpret hierarchical clustering
You are presented with a diagram called a dendrogram. The height at which branches fuse together is what matters: the higher up the vertical axis two branches fuse, the less similar they are, even though they do fuse. Observations that fuse together toward the bottom are more similar to each other. Position on the vertical axis is of the utmost importance. Observations that only fuse together high on the vertical axis may not be similar to anything at all!
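A sketch of producing a dendrogram with scipy (the random data is purely illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))       # toy observations

Z = linkage(X, method="complete")  # complete linkage, Euclidean distance
dendrogram(Z)
plt.ylabel("Fusion height (dissimilarity)")
plt.show()  # low fusions = similar observations; high fusions = dissimilar
```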
Parameters of Hierarchical Clustering
- Distance measure (Euclidean, correlation-based, Manhattan)
- Linkage - use “complete” linkage, or as a second choice “average” linkage
- What height to “cut” the dendrogram, analogous to choosing the number of clusters (see the sketch below)
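A sketch showing all three knobs with scipy (the random data and the cut height t=4.0 are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

d = pdist(X, metric="euclidean")   # 1. distance measure
Z = linkage(d, method="complete")  # 2. linkage
labels = fcluster(Z, t=4.0, criterion="distance")  # 3. cut at height 4.0
# Alternatively, ask for a fixed number of clusters:
# labels = fcluster(Z, t=3, criterion="maxclust")
```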
Things you always want to do in clustering
- Center and scale all variables
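A sketch of why, assuming scikit-learn; the exaggerated scale on the last column is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * [1, 1, 1, 1, 100]  # one variable on a huge scale

# Unscaled, the last column would dominate every distance computation.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_scaled)
```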
When would you want to use correlation based distance
when you want observations whose feature profiles are highly correlated to be treated as similar. This gives a different result than plain Euclidean distance, because two observations can have strongly correlated profiles even though their observed values are far apart. ## WARNING: you must have at least 3 features for this; with only 2 features the correlation between any pair of observations is always ±1, so it won’t work.
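A sketch contrasting the two distances with scipy (random data, purely illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))  # 10 observations, 5 features (>= 3, as required)

D_euc = squareform(pdist(X, metric="euclidean"))
D_cor = squareform(pdist(X, metric="correlation"))  # 1 - correlation between rows
# Two observations with the same profile "shape" but different magnitudes are
# close under correlation distance yet can be far apart under Euclidean distance.
```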
What to do in k-means or hierarchical clustering when there are lots of dimensions
you really have to perform PCA first and cluster on the first few principal component score vectors; in high dimensions, distances between observations become less meaningful, so you won’t get good results clustering the raw features.
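A sketch of the PCA-then-cluster workflow, assuming scikit-learn (the data shape and n_components=10 are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))  # far more features than is comfortable

X_scaled = StandardScaler().fit_transform(X)
scores = PCA(n_components=10).fit_transform(X_scaled)  # first 10 score vectors
labels = KMeans(n_clusters=3, n_init=10).fit_predict(scores)
```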