8. Unsupervised Learning Flashcards
Purpose of PCA
( the principal components are ordered by the amount of variance they explain )
-> finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated
- Data visualization / pre-processing before supervised techniques are applied.
- Dimensionality reduction
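A minimal sketch of PCA for visualization / pre-processing, assuming scikit-learn; the toy data and parameter choices are illustrative, not from the flashcards:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: 100 observations, 5 variables, with some induced correlation
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)

# Standardize, then fit PCA: the components are ordered by explained variance
# and are mutually uncorrelated
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)        # low-dimensional representation for plotting

print(pca.explained_variance_ratio_)     # decreasing sequence
```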
Purpose of clustering
Discovering unknown sub-groups (homogeneous clusters) in data
Unsupervised Learning methods
PCA
Clustering
( data exploration and low-complexity data description )
SVM vs. Logistic Regression for (almost) separable classes
SVM
SVM vs. Logistic Regression for non-separable classes
SIMILAR
SVM and Logistic regression (with ridge penalty)
SVM vs. Logistic Regression for estimating probabilities
Logistic regression
SVM vs. Logistic Regression for fast and interpretable model
Logistic regression
SVM vs. Logistic Regression for non-linear boundaries
kernel SVMs
( kernel logistic regression is computationally expensive )
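A small illustrative comparison of this last point, assuming scikit-learn; the moons dataset and parameters are arbitrary choices:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data with a non-linear class boundary
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

logreg = LogisticRegression().fit(X_tr, y_tr)          # linear boundary
svm_rbf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)     # kernel -> non-linear boundary

print("logistic regression:", logreg.score(X_te, y_te))
print("kernel SVM (RBF)   :", svm_rbf.score(X_te, y_te))
```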
K-means clustering partition requirements
- Each observation belongs to at least one cluster
- Each observation belongs to only one cluster
( no overlap )
Clustering potential issues
- Standardize observations first?
- How many clusters?
- What type of linkage / dissimilarity measure?
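A short sketch of the first issue (standardization), assuming scikit-learn; the toy data and number of clusters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two variables on very different scales
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

# Without standardization the large-scale variable dominates the distances
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# After standardization both variables contribute comparably
X_std = StandardScaler().fit_transform(X)
labels_std = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
```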
Within-Cluster-Variation formula
W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2
( usually the cumulative value \sum_{k=1}^{K} W(C_k) is reported; this is the quantity K-means tries to minimize )
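A minimal NumPy implementation of the formula above (the function name is an illustrative assumption):

```python
import numpy as np

def within_cluster_variation(X, labels):
    """Cumulative W(C_k): for each cluster, the average of all pairwise
    squared Euclidean distances, summed over the K clusters."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        diffs = Xk[:, None, :] - Xk[None, :, :]    # all pairwise differences within C_k
        total += (diffs ** 2).sum() / len(Xk)      # (1 / |C_k|) * sum of squared distances
    return total
```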
K-means clustering procedure
- Initial clustering: randomly assign a number from 1 to K to each of the observations
- Compute cluster centroids
- Assign observations to the closest centroid (using Euclidean distance)
- Repeat the last two steps until the cluster assignments stop changing
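A minimal NumPy sketch of the procedure above (illustrative only; empty clusters and multiple random restarts are not handled):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))             # 1) random initial assignment
    for _ in range(n_iter):
        # 2) compute cluster centroids
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 3) assign each observation to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):            # stop once assignments are stable
            break
        labels = new_labels
    return labels
```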
Hierarchical clustering procedure
- Treat each observation as its own cluster
- Measure pairwise dissimilarities (e.g. Euclidean distance)
- Fuse the most similar clusters, and re-compute
( benefit: the resulting dendrogram shows clusterings for every number of clusters, so K does not have to be fixed in advance )
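A short sketch of this procedure using SciPy's agglomerative clustering (the toy data and linkage choice are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Start from singleton clusters and repeatedly fuse the most similar pair
Z = linkage(X, method="complete", metric="euclidean")

# Cut the resulting dendrogram at any desired number of clusters, e.g. 2
labels = fcluster(Z, t=2, criterion="maxclust")
```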
Linkage / Dissimilarity measures
- Complete: maximal intercluster dissimilarity
- Single: minimal intercluster dissimilarity
- Average: mean intercluster dissimilarity
- Centroid: dissimilarity between the cluster centroids
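For reference, the four measures correspond directly to SciPy's method argument (sketch; X is any observations-by-variables matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(3).normal(size=(30, 2))   # illustrative data

Z_complete = linkage(X, method="complete")   # maximal intercluster dissimilarity
Z_single   = linkage(X, method="single")     # minimal intercluster dissimilarity
Z_average  = linkage(X, method="average")    # mean intercluster dissimilarity
Z_centroid = linkage(X, method="centroid")   # dissimilarity between cluster centroids
```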
Procedure of Hyperplane / SVM
Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes == Maximal Margin Classifier
If not possible:
- loosen the "separate" requirement (slack variables)
- enlarge the feature space so that separation becomes possible (e.g. feature expansion with transformed variables -> non-linear boundaries)
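A brief illustration of the slack and feature-expansion ideas, assuming scikit-learn; data and parameters are arbitrary:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Nearly separable toy data
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

# Soft-margin linear SVM: C controls the slack
# (smaller C -> wider margin, more margin violations tolerated)
svm_linear = SVC(kernel="linear", C=1.0).fit(X, y)

# Enlarging the feature space via a kernel -> non-linear boundary
# in the original variable space
svm_poly = SVC(kernel="poly", degree=2, C=1.0).fit(X, y)
```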