Week 7 Flashcards
Real uses of unsupervised learning
Customer segmentation (single parents, young party-goers)
Identifying fraud (bank transactions, GPS logs, bots on social media)
Identifying new animal species
Creating the classes needed for a classification algorithm
How does K-means work?
Assigns each point to the nearest of K centroids, then recomputes each centroid as the mean of its assigned points, repeating until the assignments stop changing; K is given by the user.
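A minimal sketch with scikit-learn (the toy data is made up for illustration):
```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data: two loose blobs.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# K (n_clusters) is given by the user; fitting alternates between
# assigning points to the nearest centroid and recomputing centroids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index of each point
print(km.cluster_centers_)  # final centroid positions
```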
How does DBSCAN work?
It finds core points in regions of high density (points with enough neighbours within a small radius) and expands clusters outward from them; points in no dense region are labelled noise.
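A minimal sketch with scikit-learn, using its built-in make_moons toy data (the eps and min_samples values here are illustrative choices):
```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: dense, non-convex shapes.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the neighbourhood radius; min_samples is the density
# threshold for a point to count as a core point.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster labels; -1 marks noise points
```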
What is Hierarchical clustering?
Can be agglomerative (bottom-up merging) or divisive (top-down splitting), as long as you produce a hierarchy of clusters; see the sketch after the steps below.
Something like:
1. Split all points into clusters A and B
2. Split cluster A into clusters A1 and A2
3. Split cluster B into clusters B1 and B2
4. Split cluster A1 into …
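The steps above are the divisive variant; common libraries implement the agglomerative one. A minimal sketch with SciPy, on made-up points:
```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Made-up 2-D points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.1], [4.2, 3.9], [9.0, 9.1]])

# Agglomerative: start with every point as its own cluster and
# repeatedly merge the two closest clusters (Ward linkage here).
Z = linkage(X, method="ward")
# Each row of Z records the two clusters merged, the distance at
# which they merged, and the size of the new cluster.
print(Z)
```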
Hard vs Soft clustering
Hard: each object belongs to exactly one cluster, similar to how a perceptron performs classification
Soft: each object is assigned to multiple clusters with corresponding probabilities, similar to how logistic regression performs classification.
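A minimal sketch contrasting the two with scikit-learn (K-means as the hard clusterer, a Gaussian mixture as the soft one; the 1-D data is made up):
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Made-up 1-D data: two groups plus one ambiguous point at 2.5.
X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [4.9], [2.5]])

# Hard: K-means puts every point in exactly one cluster.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

# Soft: a Gaussian mixture returns a probability per cluster
# for every point instead of a single label.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict_proba(X).round(2))
```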
What did DBSCAN work best at compared to others?
Identifying rings (non-convex clusters), which centroid-based methods like K-means cannot separate.
What is a key ingredient for clustering?
What data is represented and HOW
The similarity metric/distance metric
(L1 or L2 norm, Jaccard Similarity)
What is Jaccard Similarity
|A ∩ B| / |A ∪ B| (the size of the intersection of A and B divided by the size of their union)
What is Jaccard distance
1 - Jaccard Similarity
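Both translate directly into Python (the example sets are made up):
```python
def jaccard_similarity(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|; treated as 0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

A = {"red", "green", "blue"}
B = {"green", "blue", "yellow"}
sim = jaccard_similarity(A, B)   # 2 / 4 = 0.5
dist = 1 - sim                   # Jaccard distance = 0.5
print(sim, dist)
```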
What do we do in Dimensionality Reduction
Remove noise from the data
Focus on the features (or combinations of features) that are actually important
Less number-crunching = more efficient
What are the two types of Dimensionality Reduction
Feature selection + extraction
3 types of feature selection
Filter methods
Wrapper methods
Embedded methods
Filter method examples
Information gain
Correlation with target
Pairwise correlation
Variance threshold
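A sketch of the correlation-with-target filter on synthetic data (the 0.3 threshold is an arbitrary choice):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# The target depends only on feature 0.
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Keep features whose |correlation| with the target clears a threshold.
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = np.abs(corrs) > 0.3
print(corrs.round(2), keep)
```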
Wrapper method examples
Recursive feature elimination
Sequential feature selection
Permutation importance
Embedded method examples
L1 (Lasso) regularization
Decision tree
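A minimal Lasso sketch on synthetic data (the alpha value is an arbitrary choice):
```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only features 0 and 1 actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# The L1 penalty drives irrelevant coefficients toward exactly zero,
# so feature selection falls out of training the model itself.
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(2))
```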
What is Variance thresholding
Filter method: low-variance features contain less information
Calculate variance of each feature, drop features with variance below threshold.
FEATURES MUST BE ON THE SAME SCALE!
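A minimal sketch with scikit-learn's VarianceThreshold (the data and threshold are made up):
```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The middle feature is nearly constant; all features need to be
# on the same scale for a single threshold to be meaningful.
X = np.array([[1.0, 0.01, 3.0],
              [2.0, 0.02, 1.0],
              [3.0, 0.01, 2.0],
              [4.0, 0.02, 4.0]])

sel = VarianceThreshold(threshold=0.05)
print(sel.fit_transform(X))  # the low-variance column is dropped
```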
What is Forward Search
Wrapper method: train n models with one feature each and select the best one. Then train n−1 models, each adding one more feature, and select the best one. Proceed until you have m features.
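scikit-learn implements this as SequentialFeatureSelector with direction="forward"; a sketch on its built-in diabetes data (m = 3 here):
```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# direction="forward": start with no features and add the single
# best feature at each step until m features are selected.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3,
                                direction="forward").fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```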
What is recursive feature elimination
Wrapper method: train n models with n−1 features each (each leaving out a different feature) and select the best one. Then train n−1 models, removing one more feature each time, and select the best one. Proceed until m features have been removed.
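scikit-learn's RFE is a cheaper variant of the above: at each step it refits the model once and drops the feature the model ranks least important, rather than trying every possible removal. A sketch on the built-in diabetes data:
```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Repeatedly fit the model and drop the feature it ranks least
# important, until only n_features_to_select remain.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print(rfe.support_)   # mask of the surviving features
print(rfe.ranking_)   # 1 = kept; higher = eliminated earlier
```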
What is a Decision Tree
Embedded method: splits the data into a tree. Splits can be chosen by Gini impurity (a measure of how mixed a node is), information gain, or variance reduction.
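A minimal sketch: the fitted tree's importances double as a feature ranking (iris data, default Gini criterion):
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Splits are chosen to reduce Gini impurity; the features the tree
# actually splits on get the highest importance scores.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(tree.feature_importances_.round(2))
```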
What is feature extraction
Linear and nonlinear methods that extract useful combinations of features from the data.
What is PCA
Linear method of feature extraction
Find an orthogonal coordinate transformation (a rotation, or a rotation plus a reflection) such that every new coordinate is maximally informative.
How does PCA work
Rotate the points so that each new coordinate is maximally informative and orthogonal to the others. Imagine rotating a cube: we take a slice where x and y describe most of the data (and z only a little).
What are the new variables from PCA called
Principal components; they are linear combinations of the original coordinates.
The y_i are uncorrelated and are ordered by the fraction of the total variance each retains: PC1 (y_1) is most informative, then PC2, etc.
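A minimal sketch with scikit-learn (standardising first, since PCA is scale-sensitive; iris data for illustration):
```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=2)
Y = pca.fit_transform(X)  # the principal components y_1, y_2

# Fraction of the total variance each PC retains, in decreasing order.
print(pca.explained_variance_ratio_.round(2))
```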
When does PCA work best vs worst
BEST: high correlation between features
WORST: all variables are equally important and uncorrelated. PCA is uninformative.
PCA input and output
Input high dimensional data
Output low dimensional data
What is t-SNE
Non-linear method of dimensionality reduction
Take the distribution D of pairwise distances between the N points in the dataset.
Scatter N points randomly in 2 or 3 dimensions.
Move those N points around until the distribution of distances between them resembles D.
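A minimal sketch with scikit-learn (perplexity is the main knob; 30 is just a common default):
```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)  # 64-dimensional inputs

# Embed into 2-D; perplexity roughly sets the neighbourhood size
# used when matching the two distance distributions.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (1797, 2)
```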
Why t-SNE
Very useful for visualising high-D data
What is UMAP
Similar to t-SNE, but slightly different at every step
Runs faster and uses less memory
No problem embedding into >3 dimensions
Can preserve both local and global structure.
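A minimal sketch, assuming the third-party umap-learn package:
```python
import umap  # third-party package: pip install umap-learn
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# n_neighbors trades off local vs global structure; n_components
# is not limited to 2 or 3.
emb = umap.UMAP(n_neighbors=15, n_components=2,
                random_state=0).fit_transform(X)
print(emb.shape)
```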
Problems with t-SNE and UMAP
Both depend a lot on their hyperparameters
Cluster sizes and distances between clusters mean nothing
The x and y axes are basically impossible to interpret.