Lecture 9: Unsupervised machine learning Flashcards
Unsupervised ML
What: Works without a target outcome. It's more concerned with identifying associations within the data
Clustering:
Ability to process a set of previously unknown data points and group them based on similarities
Cluster analysis
Divides data into clusters
Similar items are placed in the same cluster
* Intra-cluster differences are minimized
* Inter-cluster differences are maximized
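One standard way to make "intra-cluster differences are minimized" precise (and the cost function that K-means, below, tries to minimize) is the within-cluster sum of squares, in LaTeX notation:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k is the k-th cluster and \mu_k its centroid; a lower J means tighter clusters.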
What can you do with clusters?
1) Inspecting the results themselves can yield insights
* Customer segmentation
2) They can serve as input for predictive models, so the clusters become target
variables for classification, i.e. you can use the distance to the closest
cluster to classify a new observation (see the sketch after this list)
3) Anomaly detection
4) Recommender systems
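Uses 2 and 3 can both be read off a fitted K-means model (the algorithm itself is described in the next card). A minimal scikit-learn sketch on made-up toy data; the 0.99 quantile threshold is an arbitrary illustrative choice:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Use 2: classify a new observation by its closest cluster
print(km.predict([[0.0, 0.5]]))

# Use 3: anomaly detection via distance to the closest centroid
d = km.transform(X).min(axis=1)   # transform() = distances to all centroids
threshold = np.quantile(d, 0.99)  # arbitrary cut-off for illustration
print(np.where(d > threshold)[0]) # indices of the most isolated points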
K-MEANS
0: Initialize K centroids (e.g. pick K random data points)
1: Assign each point to its closest centroid
2: Recompute the centroids as the mean of their assigned points
Repeat steps 1-2 until the assignments stop changing
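A minimal NumPy sketch of this loop (function and variable names are my own, and empty clusters are not handled):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 0: initialize centroids as k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids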
IN PRACTICE
Because K-means depends on random initialization, you might, depending on the
data at hand, end up with a suboptimal solution.
To avoid that: rerun the clustering (a rule of thumb says 50-100 times, but of
course it also depends on the data and the time available) and choose the
solution with the lowest cost function.
TL;DR: just because you got your algorithm to converge doesn't mean you
found the best way of clustering your data.
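scikit-learn's KMeans does this rerunning for you via the n_init parameter; a sketch on toy data:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data

# n_init reruns the clustering with different random initializations
# and keeps the run with the lowest cost function
km = KMeans(n_clusters=3, n_init=50, random_state=0).fit(X)
print(km.inertia_)  # cost (within-cluster sum of squares) of the best run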
HOW DO YOU DECIDE HOW MANY CLUSTERS?
ELBOW-METHOD: run the clustering for a range of K, plot the cost function
against K, and pick the K at the "elbow", where adding more clusters stops
buying a large drop in cost.
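A sketch of the elbow plot with scikit-learn and matplotlib (toy data again):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data

# Record the cost function for K = 1..10
ks = range(1, 11)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
         for k in ks]

plt.plot(ks, costs, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("cost (within-cluster sum of squares)")
plt.show()  # pick the K where the curve bends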
SIMILARITY
Euclidean distance
Jaccard similarity
How to choose a similarity measure?
1) Domain knowledge, i.e. for text we use cosine similarity
2) Type of variable, e.g. Jaccard for nominal variables
3) If in doubt, use Euclidean...
4) ...but always check!
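Small sketches of the three measures using scipy (note that scipy returns distances/dissimilarities, so similarity = 1 - distance for Jaccard and cosine):

import numpy as np
from scipy.spatial.distance import euclidean, jaccard, cosine

a = np.array([1.0, 0.0, 2.0])
b = np.array([2.0, 1.0, 0.0])
print(euclidean(a, b))   # Euclidean distance between numeric vectors
print(1 - cosine(a, b))  # cosine similarity, the usual choice for text vectors

# Jaccard works on boolean/nominal indicator vectors
u = np.array([1, 0, 1, 1], dtype=bool)
v = np.array([1, 1, 0, 1], dtype=bool)
print(1 - jaccard(u, v))  # Jaccard similarity = |u AND v| / |u OR v|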
HIERARCHICAL CLUSTERING
See slides
Linking clusters: Four linking methods
- Complete method
- Single method
- Average method
- Centroid method
Complete method:
o Pairwise similarity between all observations in cluster 1 and cluster 2
o Uses the largest of the similarities
o Tends to produce more balanced trees
o Sensitive to noise
Single method:
o Pairwise similarity between all observations in cluster 1 and cluster 2
o Uses the smallest of the similarities
o Tends to produce more unbalanced trees
Average method:
o Pairwise similarity between all observations in cluster 1 and cluster 2
o Uses the average of the similarities
o Tends to produce more balanced trees
o Often considered the best choice
Centroid method:
o Finds the centroid of cluster 1 and cluster 2
o Uses the similarity between the two centroids
o Can cause inversion problems (i.e. similarity need not decrease as you move
up the tree) and violates the fundamental assumption that small clusters are
more coherent than large clusters
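The four methods map directly onto the method argument of scipy's linkage function; a sketch on toy data (note that the centroid method requires the default Euclidean metric):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data

for method in ["complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)                    # build the tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(method, labels)

# Inspect the tree shape (balanced vs unbalanced) via the dendrogram
dendrogram(linkage(X, method="average"))
plt.show()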