8. Unsupervised Learning Flashcards
Purpose of PCA
( the principal components are ordered by the amount of variance they explain )
-> finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated
- Data visualization / pre-processing before supervised techniques are applied.
- Dimensionality reduction
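A minimal sketch of PCA for visualization / pre-processing, assuming scikit-learn; the toy data and parameter choices are illustrative, not from the flashcards:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: 100 observations, 5 variables, with some induced correlation
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)

# Standardize, then fit PCA: the components are ordered by explained variance
# and are mutually uncorrelated
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)        # low-dimensional representation for plotting

print(pca.explained_variance_ratio_)     # decreasing sequence
```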
Purpose of clustering
Discovering unknown sub-groups (homogeneous clusters) in data
Unsupervised Learning methods
PCA
Clustering
( data exploration and low-complexity data description )
SVM vs. Logistic Regression for (almost) separable classes
SVM
SVM vs. Logistic Regression for non-separable classes
SIMILAR
SVM and Logistic regression (with ridge penalty)
SVM vs. Logistic Regression for estimating probabilities
Logistic regression
SVM vs. Logistic Regression for fast and interpretable model
Logistic regression
SVM vs. Logistic Regression for non-linear boundaries
kernel SVMs
( kernel logistic regression is computationally expensive )
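A small illustrative comparison of this last point, assuming scikit-learn; the moons dataset and parameters are arbitrary choices:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data with a non-linear class boundary
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

logreg = LogisticRegression().fit(X_tr, y_tr)          # linear boundary
svm_rbf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)     # kernel -> non-linear boundary

print("logistic regression:", logreg.score(X_te, y_te))
print("kernel SVM (RBF)   :", svm_rbf.score(X_te, y_te))
```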
K-means clustering partition requirements
- Each observation belongs to at least one cluster
- Each observation belongs to only one cluster
( no overlap )
Clustering potential issues
- Standardize observations first?
- How many clusters?
- What type of linkage / dissimilarity measure?
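A short sketch of the first issue (standardization), assuming scikit-learn; the toy data and number of clusters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two variables on very different scales
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

# Without standardization the large-scale variable dominates the distances
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# After standardization both variables contribute comparably
X_std = StandardScaler().fit_transform(X)
labels_std = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
```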
Within-Cluster-Variation formula
W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2
( usually the cumulative value \sum_{k=1}^{K} W(C_k) is reported; this is the quantity K-means tries to minimize )
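A minimal NumPy implementation of the formula above (the function name is an illustrative assumption):

```python
import numpy as np

def within_cluster_variation(X, labels):
    """Cumulative W(C_k): for each cluster, the average of all pairwise
    squared Euclidean distances, summed over the K clusters."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        diffs = Xk[:, None, :] - Xk[None, :, :]    # all pairwise differences within C_k
        total += (diffs ** 2).sum() / len(Xk)      # (1 / |C_k|) * sum of squared distances
    return total
```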
K-means clustering procedure
- Initial clustering: randomly assign a number from 1 to K to each of the observations
- Compute cluster centroids
- Assign observations to the closest centroid (using Euclidean distance)
- Repeat the last two steps until the cluster assignments stop changing
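A minimal NumPy sketch of the procedure above (illustrative only; empty clusters and multiple random restarts are not handled):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))             # 1) random initial assignment
    for _ in range(n_iter):
        # 2) compute cluster centroids
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 3) assign each observation to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):            # stop once assignments are stable
            break
        labels = new_labels
    return labels
```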
Hierarchical clustering procedure
- Treat each observation as its own cluster
- Measure pairwise dissimilarities (e.g. Euclidean distance)
- Fuse the most similar clusters, and re-compute
( benefit: the resulting dendrogram shows clusterings for every number of clusters, so K does not have to be fixed in advance )
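A short sketch of this procedure using SciPy's agglomerative clustering (the toy data and linkage choice are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Start from singleton clusters and repeatedly fuse the most similar pair
Z = linkage(X, method="complete", metric="euclidean")

# Cut the resulting dendrogram at any desired number of clusters, e.g. 2
labels = fcluster(Z, t=2, criterion="maxclust")
```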
Linkage / Dissimilarity measures
- Complete: maximal intercluster dissimilarity
- Single: minimal intercluster dissimilarity
- Average: mean intercluster dissimilarity
- Centroid: dissimilarity between the cluster centroids
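For reference, the four measures correspond directly to SciPy's method argument (sketch; X is any observations-by-variables matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(3).normal(size=(30, 2))   # illustrative data

Z_complete = linkage(X, method="complete")   # maximal intercluster dissimilarity
Z_single   = linkage(X, method="single")     # minimal intercluster dissimilarity
Z_average  = linkage(X, method="average")    # mean intercluster dissimilarity
Z_centroid = linkage(X, method="centroid")   # dissimilarity between cluster centroids
```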
Procedure of Hyperplane / SVM
Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes == Maximal Margin Classifier
If not possible:
- loosen the "separate" requirement (slack variables)
- enlarge the feature space so that separation becomes possible (e.g. feature expansion with transformed variables -> non-linear boundaries)
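A brief illustration of the slack and feature-expansion ideas, assuming scikit-learn; data and parameters are arbitrary:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Nearly separable toy data
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

# Soft-margin linear SVM: C controls the slack
# (smaller C -> wider margin, more margin violations tolerated)
svm_linear = SVC(kernel="linear", C=1.0).fit(X, y)

# Enlarging the feature space via a kernel -> non-linear boundary
# in the original variable space
svm_poly = SVC(kernel="poly", degree=2, C=1.0).fit(X, y)
```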