Week 7 Flashcards

1
Q

Real uses of unsupervised learning

A

Customer segmentation (single parents, young party-goers)
Identifying fraud (bank transactions, GPS logs, bots on social media)
Identifying new animal species
Creating the classes needed for a classification algorithm

2
Q

How does K-means work?

A

Alternates between assigning each point to the nearest of K centroids and moving each centroid to the mean of its assigned points, repeating until the assignments stop changing. K is given by the user.
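A minimal scikit-learn sketch of this (the toy dataset and n_clusters=3 are just illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points drawn around 3 centres
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K is chosen by the user via n_clusters; fit() alternates between
# assigning points to the nearest centroid and recomputing each centroid
# as the mean of its assigned points
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # final centroid positions
print(km.labels_[:10])       # cluster index assigned to the first 10 points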

3
Q

How does DBSCAN work?

A

It finds core regions of high density and expands clusters from them.
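A hedged scikit-learn sketch (eps and min_samples below are illustrative values, not recommendations):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy non-convex data that centroid-based methods struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# A point with at least min_samples neighbours within eps is a core point;
# clusters grow outward from core points, unreachable points get label -1 (noise)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

print(set(db.labels_))   # cluster labels found; -1 marks noise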

4
Q

What is Hierarchical clustering?

A

Clustering that produces a hierarchy of clusters; it can be agglomerative (bottom-up, repeatedly merging the closest clusters) or divisive (top-down, repeatedly splitting clusters).

A divisive run looks something like this (a scikit-learn sketch follows the list):
1. Split all points into clusters A and B
2. Split cluster A into clusters A1 and A2
3. Split cluster B into clusters B1 and B2
4. Split cluster A1 into …
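
scikit-learn only ships the agglomerative (bottom-up) variant; a minimal sketch, with the cluster count chosen arbitrarily:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=4, random_state=0)

# Bottom-up: every point starts as its own cluster, and the two closest
# clusters (by Ward linkage here) are merged repeatedly until 4 remain
agg = AgglomerativeClustering(n_clusters=4, linkage="ward").fit(X)
print(agg.labels_[:10])   # cluster assignment of the first 10 points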

5
Q

Hard vs Soft clustering

A

Hard: each object belongs to exactly one cluster, similar to how a perceptron performs classification.

Soft: objects are assigned to multiple clusters with corresponding probabilities, similar to how logistic regression performs classification.

6
Q

What does DBSCAN do best compared to the other clustering methods?

A

Identifying ring-shaped (non-convex) clusters, which centroid-based methods like K-means cannot separate.

7
Q

What are the key ingredients for clustering?

A

What data is represented, and HOW it is represented
The similarity/distance metric
(e.g. L1 or L2 norm, Jaccard similarity)

8
Q

What is Jaccard Similarity

A

|A ∩ B| / |A ∪ B|
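
As a sanity check, a tiny Python version of the formula (the example sets are arbitrary):

def jaccard_similarity(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|, taken to be 0 when both sets are empty
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))   # 2 shared / 4 total = 0.5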

9
Q

What is Jaccard distance

A

1 - Jaccard Similarity

10
Q

What do we do in Dimensionality Reduction

A

Remove noise from the data
Focus on the features (or combinations of features) that are actually important
Less number-crunching = more efficient

11
Q

What are the two types of Dimensionality Reduction

A

Feature selection + extraction

12
Q

3 types of feature selection

A

Filter methods
Wrapper methods
Embedded methods

13
Q

Filter method examples

A

Information gain
Correlation with target
Pairwise correlation
Variance threshold

14
Q

Wrapper method examples

A

Recursive feature elimination
Sequential feature selection
Permutation importance

15
Q

Embedded method examples

A

L1 (Lasso) regularization
Decision trees
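
A sketch of L1-based selection with scikit-learn (alpha and the synthetic dataset are purely illustrative):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# The L1 penalty drives uninformative coefficients to exactly zero,
# so feature selection falls out of fitting the model itself (embedded)
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)

print(selector.get_support())        # boolean mask of the kept features
X_reduced = selector.transform(X)    # data restricted to the selected columns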

16
Q

What is Variance thresholding

A

Filter method: low-variance features contain less information
Calculate variance of each feature, drop features with variance below threshold.
FEATURES MUST BE ON THE SAME SCALE!
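
A minimal scikit-learn sketch (the threshold value is made up; features are min-max scaled first so the variances are comparable):

import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] = 1.0                                   # a constant feature carries no information

X_scaled = MinMaxScaler().fit_transform(X)      # put every feature on the same scale
vt = VarianceThreshold(threshold=0.01).fit(X_scaled)

print(vt.get_support())                         # False for the constant column
X_reduced = vt.transform(X_scaled)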

17
Q

What is Forward Search

A

Wrapper method: create n models with one feature each and select the best; then create n-1 models, each adding one more feature to the selected set, and select the best; proceed until you have m features.
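
scikit-learn's SequentialFeatureSelector with direction='forward' implements this greedy search; a sketch (the estimator and m=2 are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from zero features; at each round add the single feature whose
# addition gives the best cross-validated score, stopping at 2 features
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
).fit(X, y)

print(sfs.get_support())   # mask of the 2 selected features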

18
Q

What is recursive feature elimination

A

Wrapper method: create n models with n-1 features each (each dropping a different feature) and select the best; then create n-1 models with n-2 features each, again removing one feature at a time, and select the best; proceed until m features have been removed.
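
scikit-learn's RFE is a close relative: rather than scoring every candidate subset, it fits once, drops the feature with the smallest coefficient/importance, and repeats. A sketch with illustrative choices:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Fit, eliminate the least important feature, refit, ... until 5 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

print(rfe.support_)    # mask of kept features
print(rfe.ranking_)    # 1 = kept; larger numbers were eliminated earlier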

19
Q

What is a Decision Tree

A

Embedded method: splits the data into a tree. Splits can be chosen by Gini impurity (how mixed the classes in a node are), information gain, or variance reduction.
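
A sketch of how a fitted tree doubles as an embedded feature selector in scikit-learn (the dataset is just an example):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each split is chosen to maximise the decrease in Gini impurity (the default);
# criterion="entropy" would use information gain instead
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Features never used in a split get importance 0, so the importances
# themselves can be used to select features
print(tree.feature_importances_)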

20
Q

What is feature extraction

A

Extract useful combinations of features from the data; the methods can be linear (e.g. PCA) or nonlinear (e.g. t-SNE, UMAP).

21
Q

What is PCA

A

Linear method of feature extraction
Find an orthogonal coordinate transformation (a rotation, or a rotation plus a reflection) such that every new coordinate is maximally informative

22
Q

How does PCA work

A

Rotate the points so that each new coordinate is orthogonal to the others and maximally informative. Imagine rotating a cube of points until x and y describe most of the data and z only a little; you can then keep just the x-y slice.
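
A hedged scikit-learn sketch (keeping 2 components is an arbitrary choice; the data is standardised first because PCA is scale-sensitive):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)     # put features on a common scale

# Rotate onto orthogonal axes ordered by explained variance and keep the top 2
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(pca.explained_variance_ratio_)      # fraction of variance kept by PC1 and PC2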

23
Q

What are the new variables from PCA called

A

Principal components; they are linear combinations of the original coordinates.

The y_i are uncorrelated and are ordered by the fraction of the total variance each retains: PC1 (y_1) is the most informative, then PC2, and so on.

24
Q

When does PCA work best vs worst

A

BEST: high correlation between features
WORST: all variables are equally important and uncorrelated. PCA is uninformative.

25
Q

PCA input and output

A

Input: high-dimensional data
Output: low-dimensional data

26
Q

What is t-SNE

A

Non-linear dimensionality reduction:
Take the distribution D of pairwise distances between the N points in the dataset.
Scatter N points randomly in 2 or 3 dimensions.
Move those N points around until the distribution of distances between them resembles D.
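
A minimal scikit-learn sketch (perplexity is the key hyperparameter, and 30 here is arbitrary):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

# Embed into 2-D so that points that were close in 64-D stay close in 2-D
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_2d.shape)   # (1797, 2); ready for a scatter plot coloured by y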

27
Q

Why t-SNE

A

Very useful for visualising high-D data

28
Q

What is UMAP

A

Similar to t-SNE, but slightly different at every step
Runs faster and uses less memory
No problem embedding into >3 dimensions
Can preserve both local and global structure.
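
A sketch using the third-party umap-learn package (assumed installed via pip install umap-learn; the hyperparameters are illustrative):

import umap                            # provided by the umap-learn package
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# n_neighbors trades off local vs global structure; min_dist controls crowding
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
X_2d = reducer.fit_transform(X)

print(X_2d.shape)   # (1797, 2)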

29
Q

Problems with t-SNE and UMAP

A

Both depend a lot on their hyperparameters

Cluster sizes and distances between clusters mean nothing

The x and y axes are basically impossible to interpret.