Week 7 Flashcards
Real uses of unsupervised learning
Customer segmentation (single parents, young party-goers)
Identifying fraud (bank transactions, GPS logs, bots on social media)
Identifying new animal species
Creating the classes needed for a classification algorithm
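How does K-means work?
Assigns each point to the nearest of K centroids, then moves each centroid to the mean of its assigned points, repeating until the assignments stop changing. K is given by the user.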
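The assign-and-update loop behind K-means can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name kmeans, the init parameter, and the toy points are all assumptions made for the example, and the fallback initialisation (K random data points) is only one common choice.

```python
import random

def kmeans(points, k, init=None, iters=100, seed=0):
    """Cluster equal-length tuples into k groups with the assign/update loop."""
    rng = random.Random(seed)
    # Assumption: start from the given centroids, else from k random data points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid (squared L2).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Toy data: two well-separated blobs, with hand-picked starting centroids.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2, init=[(0.0, 0.0), (10.0, 10.0)])
```

With these starting centroids the first three points end up in cluster 0 and the last three in cluster 1. A different initialisation can give different labels, which is why K-means is usually run several times.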
How does DBSCAN work?
It finds core regions of high density and expands clusters from them.
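A rough sketch of that expand-from-core-points idea, using only the standard library. The parameter names eps (neighbourhood radius) and min_pts (neighbours needed to count as a core point) follow common usage but are assumptions here, as are the ring helper and the toy data.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbours(i):
        # Brute-force search; a point counts as its own neighbour.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point too, so keep expanding
    return labels

def ring(radius, n):
    """n evenly spaced points on a circle (hypothetical helper for the demo)."""
    return [(radius * math.cos(2 * math.pi * t / n),
             radius * math.sin(2 * math.pi * t / n)) for t in range(n)]

# Two concentric rings plus one far-away outlier.
points = ring(1.0, 8) + ring(5.0, 40) + [(20.0, 20.0)]
labels = dbscan(points, eps=1.0, min_pts=3)
```

On this data the inner ring becomes cluster 0, the outer ring cluster 1, and the outlier is labelled -1 (noise). No number of clusters is passed in; it falls out of eps and min_pts.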
What is Hierarchical clustering?
Can be agglomerative (bottom-up: keep merging clusters) or divisive (top-down: keep splitting clusters), as long as you produce a hierarchy of clusters
A divisive example looks something like:
1. Split all points into clusters A and B
2. Split cluster A into clusters A1 and A2
3. Split cluster B into clusters B1 and B2
4. Split cluster A1 into …
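The repeated splitting above can be sketched as a small recursive function. The splitting rule used here (cut sorted 1-D values at the widest gap) is purely an illustrative assumption; real divisive methods use other criteria, and the function names are hypothetical.

```python
def split_largest_gap(values):
    """Split 1-D values into two groups at the widest gap (illustrative rule)."""
    values = sorted(values)
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return values[:cut], values[cut:]

def divisive(values, depth):
    """Return a nested hierarchy: a leaf is a list, a split is a (left, right) pair."""
    if depth == 0 or len(values) < 2:
        return sorted(values)
    left, right = split_largest_gap(values)
    return divisive(left, depth - 1), divisive(right, depth - 1)

# Step 1 splits everything into A and B; step 2 splits A and B again.
tree = divisive([1, 2, 10, 11, 30, 32, 50, 51], depth=2)
# tree == (([1, 2], [10, 11]), ([30, 32], [50, 51]))
```

Cutting the nested result at different depths gives coarser or finer clusterings, which is the point of keeping the whole hierarchy.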
Hard vs Soft clustering
Hard: each object belongs to exactly one cluster, similar to how a perceptron performs classification.
Soft: objects are assigned to multiple clusters, with corresponding probabilities, similar to how a logistic regression performs classification.
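To make the contrast concrete, here is a small sketch that assigns one point to centroids both ways. The soft weighting (a softmax over negative squared distances) is an assumption chosen for simplicity, not the only way to get membership probabilities, and the function names are hypothetical.

```python
import math

def hard_assign(point, centroids):
    """Index of the single nearest centroid."""
    dists = [math.dist(point, c) for c in centroids]
    return dists.index(min(dists))

def soft_assign(point, centroids):
    """Membership probabilities: softmax over negative squared distances (assumed)."""
    scores = [-math.dist(point, c) ** 2 for c in centroids]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

centroids = [(0.0, 0.0), (4.0, 0.0)]
hard = hard_assign((1.0, 0.0), centroids)      # one cluster index
soft = soft_assign((1.0, 0.0), centroids)      # one probability per cluster
halfway = soft_assign((2.0, 0.0), centroids)   # equidistant point
```

The hard assignment returns a single index, while the soft assignment returns probabilities that sum to 1; a point halfway between the two centroids gets 0.5 for each.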
What was DBSCAN best at compared to the other methods?
Identifying ring-shaped clusters.
What are the key ingredients for clustering?
What data is represented and HOW
The similarity metric/distance metric
(L1 or L2 norm, Jaccard Similarity)
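As a quick illustration of the two norm-based distances, here is a minimal sketch (function names are hypothetical):

```python
import math

def l1_distance(a, b):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    """L2 (Euclidean) distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d1 = l1_distance((0, 0), (3, 4))  # 3 + 4 = 7
d2 = l2_distance((0, 0), (3, 4))  # sqrt(9 + 16) = 5.0
```

The same pair of points is 7 apart under L1 and 5 apart under L2, so the choice of metric changes which points count as "close".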
What is Jaccard Similarity?
|A ∩ B| / |A ∪ B|
What is Jaccard distance?
1 - Jaccard Similarity
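Both Jaccard quantities are short to compute with Python sets. In this sketch the function names are hypothetical, and treating two empty sets as identical is an assumption made to avoid dividing by zero.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """One minus the Jaccard similarity."""
    return 1 - jaccard_similarity(a, b)

sim = jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})   # 2 shared / 4 total = 0.5
dist = jaccard_distance({"a", "b", "c"}, {"b", "c", "d"})    # 1 - 0.5 = 0.5
```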
What do we do in Dimensionality Reduction?
Remove noise from the data
Focus on the features (or combinations of features) that are actually important
Less number-crunching = more efficient
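What are the two types of Dimensionality Reduction?
Feature selection (keep a subset of the original features) and feature extraction (build new features from combinations of the original ones)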
3 types of feature selection
Filter methods
Wrapper methods
Embedded methods
Filter method examples
Information gain
Correlation with target
Pairwise correlation
Variance threshold
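Two of these filters, variance threshold and correlation with the target, can be sketched without any model at all, which is what makes them filter methods. The column names, toy values, and the 0.5 cut-off below are all assumptions for illustration.

```python
import math
import statistics

def variance_filter(columns, threshold):
    """Names of columns whose population variance exceeds the threshold."""
    return [name for name, vals in columns.items()
            if statistics.pvariance(vals) > threshold]

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

columns = {
    "constant": [1, 1, 1, 1, 1],    # zero variance, carries no information
    "height":   [1, 2, 3, 4, 5],    # tracks the target exactly
    "noise":    [1, -1, 0, -1, 1],  # uncorrelated with the target
}
target = [2, 4, 6, 8, 10]

# Step 1: drop zero-variance columns (this also protects pearson from constants).
kept = variance_filter(columns, threshold=0.0)
# Step 2: keep columns whose absolute correlation with the target is high enough.
corr = {name: pearson(columns[name], target) for name in kept}
selected = [name for name in kept if abs(corr[name]) >= 0.5]
```

Here the variance filter removes "constant", and the correlation filter then keeps only "height".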
Wrapper method examples
Recursive feature elimination
Sequential feature selection
Permutation importance
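Permutation importance is the easiest of these to sketch: shuffle one feature column at a time and see how much the model's score drops. The "model" below is a hypothetical hand-written rule standing in for a trained classifier, and the data, function names, and number of repeats are assumptions.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows the model labels correctly."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only this one column replaced.
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical stand-in model: it only ever looks at feature 0.
predict = lambda row: 1 if row[0] > 0.5 else 0

rows = [(0.0, 5.0), (0.1, 1.0), (0.2, 4.0), (0.3, 2.0),
        (0.7, 3.0), (0.8, 5.0), (0.9, 1.0), (1.0, 2.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
importances = permutation_importance(predict, rows, labels)
```

Shuffling feature 1 never changes a prediction, so its importance is exactly 0, while shuffling feature 0 breaks the rule and lowers accuracy.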
Embedded method examples
L1 (Lasso) regularization
Decision tree
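To show why L1 regularization counts as an embedded method, here is a small coordinate-descent sketch of Lasso in which an unhelpful feature's weight is driven to exactly zero during training. It assumes centred features, no intercept, and the objective (1/2n)·||y − Xw||² + alpha·||w||₁; the function names, data, and alpha value are all illustrative.

```python
def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha, clipping to exactly 0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso(X, y, alpha, iters=100):
    """Coordinate descent for Lasso; assumes centred, non-constant columns."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Toy data: y is roughly 3 * feature0 + 0.1 * feature1.
X = [(-2, 1), (-1, -1), (0, 0), (1, -1), (2, 1)]
y = [-5.9, -3.1, 0.0, 2.9, 6.1]
w = lasso(X, y, alpha=0.5)
```

The weight on feature 1 comes out as exactly 0.0, so the feature is effectively dropped, while the weight on feature 0 is shrunk from 3 to about 2.75. Selection happens as a side effect of fitting the model rather than as a separate step.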