SRM Chapter 6 - Unsupervised Learning Flashcards
6.1 Principal Components Analysis 6.2 Cluster Analysis
PCA (Principal Components Analysis)
- Reduces complexity by transforming the original variables into a smaller number of principal components that highlight the most important features of the data (i.e., explain a sufficient amount of the variability)
- Often applied as a preprocessing step before fitting supervised models
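A minimal sketch of PCA as a preprocessing step before a supervised model (scikit-learn; `X`, `y` are toy data, not from the cards):

```python
# Sketch: PCA as preprocessing before a supervised model (principal components regression).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 observations, 5 features (toy data)
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

# Standardize, keep 2 principal components, then regress on the component scores.
model = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
model.fit(X, y)
print(model.score(X, y))               # R^2 on the training data
```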
k-means Clustering
- Divides the data into a predetermined number of clusters (k)
- such that the total within-cluster variation (sum of squared distances to the cluster centers) is minimized
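A minimal k-means sketch with scikit-learn (toy data; k = 2 is an assumed choice):

```python
# Sketch: k-means with a predetermined number of clusters chosen by the user.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=0, size=(50, 2)),
               rng.normal(loc=5, size=(50, 2))])   # two toy groups

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])    # cluster assignment of the first 5 observations
print(km.inertia_)       # total within-cluster sum of squares (what k-means minimizes)
```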
Hierarchical Clustering
- Does not require specifying the number of clusters upfront
- Produces a dendrogram -> a tree-based representation that allows the number of clusters to be chosen afterwards (flexible cluster analysis)
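A minimal hierarchical-clustering sketch with SciPy (toy data; complete linkage and the 3-cluster cut are assumed choices):

```python
# Sketch: agglomerative clustering; the dendrogram lets you pick the number of
# clusters after the fact by cutting the tree at a chosen height.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))

Z = linkage(X, method="complete")                 # complete-linkage merge history
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)
# dendrogram(Z)  # plot the tree with matplotlib if desired
```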
What is a principal component? Its features?
- Each principal component is a linear combination of ALL features in the dataset
- Features in the dataset are assumed to have a mean of 0 (centered)
Loadings
The weights (coefficients) applied to each feature in the linear combination that forms a principal component; think of them as multipliers for each predictor
First principal component
- Explains the largest portion of variance in a dataset
- i.e. PCA adds components in decreasing order of variance explained until a sufficient amount of variability is captured (the goal is to use the smallest number of components for which this is true)
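A sketch of choosing the number of components from the cumulative proportion of variance explained (the 90% threshold is an illustrative assumption):

```python
# Sketch: components are ordered by variance explained; keep the smallest number
# that reaches the chosen threshold.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cum_var >= 0.90) + 1   # smallest number explaining >= 90%
print(cum_var, n_components)
```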
How are values for the first principal component loadings determined?
By maximizing the sample variance of the first principal component
- Note: a NORMALIZED linear combination of the features (squared loadings summing to 1) is used; otherwise the variance could be inflated arbitrarily by making the loadings as large as possible
Second principal component
Linear combination of features that maximizes the remaining variability in the dataset (not captured by the 1st principal component), subject to its loading vector being orthogonal to that of the 1st principal component
What is the dot product of the loading vectors for PC1 and PC2? Why?
0, because they are orthogonal
How can we solve for loading vectors?
Eigen decomposition (not tested) of the covariance matrix
- This produces eigenvalues (the variances of each PC) and eigenvectors (the loading vectors)
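A sketch of the eigendecomposition route with NumPy (toy data); it also verifies that the loading vectors are orthogonal and unit-length:

```python
# Sketch: loading vectors via eigendecomposition of the covariance matrix.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)                      # center the features

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: symmetric matrix, ascending order
order = np.argsort(eigvals)[::-1]            # sort PCs by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                               # variances of PC1, PC2, PC3
print(eigvecs[:, 0] @ eigvecs[:, 1])         # loading vectors are orthogonal: ~0
print(np.linalg.norm(eigvecs[:, 0]))         # and normalized to unit length: 1
```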
What is the max number of distinct principal components that can be created?
For a dataset with n observations and p features, the max number of PCs is
min(n-1,p)
Distinct principal component
A PC is distinct if its variance is non-zero (i.e., adding this new component still captures some of the remaining variance in the dataset)
Biplot
- Plots the PC1 scores against the PC2 scores (bottom x-axis and left y-axis respectively), with each predictor's loading vector overlaid (top and right axes)
- Can only visualize two PCs at a time
- Look at the x and y components of each predictor's loading vector: if the x component is larger, then PC1 places more weight on this predictor; if the y component is larger, then PC2 places more weight on this predictor
- Weight is determined by the magnitude of each vector's components
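A rough biplot sketch with matplotlib/scikit-learn (toy data; the predictor names are placeholders):

```python
# Sketch of a biplot: PC1/PC2 scores as points, loading vectors for each predictor as arrows.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
names = ["x1", "x2", "x3", "x4"]             # placeholder predictor names

Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(Z)
scores = pca.transform(Z)

plt.scatter(scores[:, 0], scores[:, 1], s=10)
for j, name in enumerate(names):
    # loading of predictor j on PC1 (x component) and PC2 (y component)
    plt.arrow(0, 0, pca.components_[0, j], pca.components_[1, j], color="red")
    plt.annotate(name, (pca.components_[0, j], pca.components_[1, j]))
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()
```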
Why is scaling necessary in PCA?
- Because if the predictors are on different scales, the PCs will place more weight on some predictors than on others (you can see this in the biplot of all the predictors); standardizing each predictor to unit variance prevents this
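A small sketch showing how an unscaled, large-variance predictor dominates PC1 (toy data):

```python
# Sketch: without standardization, the predictor on the largest scale dominates PC1.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
X[:, 0] *= 100                               # first predictor on a much larger scale

print(PCA(n_components=1).fit(X).components_)   # loading ~[1, 0, 0]: PC1 is mostly x1
print(PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_)
```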
Does PCR perform feature selection?
NO. PCR does not perform feature selection. All variables are used in producing the PCs
The first principal component: (2)
- Is the line in p-dimensional space that is closest to the observations
- Is the direction that explains the most variance
How does PCA reduce dimensionality?
By applying a linear transformation: the p original features are projected onto a smaller number of principal components (linear combinations) that retain most of the variance
What happens if the # of PCs = the # of original variables?
The data approximation is exact (with as many PCs as original variables, all variables are used in some way and 100% of the variability is explained)
T/F PCA most useful for data with strong NON-linear relationships
FALSE. PCA is a linear technique, so it is most suitable for data with linear relationships
The sum of the scores of each PC must be:
0 (NOT 1)
Because the data are centered (each feature has mean 0), so positive and negative deviations cancel out
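A quick check that each PC's scores sum to (numerically) zero (toy data):

```python
# Sketch: because the data are centered, each PC's scores sum to ~0.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 4))

scores = PCA().fit_transform(X)              # PCA centers the data internally
print(scores.sum(axis=0))                    # ~[0, 0, 0, 0] up to floating-point error
```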
PC loading vectors
PC scores
Loading vectors: DIRECTIONS in space along which the data vary the most
Scores: PROJECTIONS along the directions
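A sketch showing that the scores are the projections of the centered data onto the loading vectors (toy data):

```python
# Sketch: scores = centered data projected onto the loading directions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 3))

pca = PCA().fit(X)
Xc = X - X.mean(axis=0)
scores_manual = Xc @ pca.components_.T                 # project onto the loading vectors
print(np.allclose(scores_manual, pca.transform(X)))   # True
```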
K-means clustering: at each iteration how does the number of clusters change?
Either the same number or fewer clusters (a cluster can lose all of its points during reassignment); the number never increases
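A sketch of a single Lloyd (k-means) iteration illustrating why the number of non-empty clusters can only stay the same or shrink (toy data; `kmeans_step` is a hypothetical helper, not a library function):

```python
# Sketch of one k-means (Lloyd) iteration: reassignment can leave a centroid with
# no points, so the number of non-empty clusters stays the same or decreases.
import numpy as np

def kmeans_step(X, centers):
    # assign each point to its nearest centroid
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # recompute each non-empty cluster's centroid as the mean of its points
    new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(len(centers))])
    return labels, new_centers

rng = np.random.default_rng(9)
X = rng.normal(size=(30, 2))
centers = X[rng.choice(len(X), size=3, replace=False)]
labels, centers = kmeans_step(X, centers)
print(len(np.unique(labels)))   # <= 3: never more clusters than we started with
```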