L9: Unsupervised machine learning Flashcards
Unsupervised ML
What: Works without a target outcome. It's more concerned with identifying associations within the data.
Clustering:
Ability to process a set of previously unknown data points and create groups of them based on similarities
Cluster analysis
Divides data into clusters
Similar items are placed in the same cluster
* Intra-cluster differences are minimized
* Inter-cluster differences are maximized
What can you do with clusters?
1) looking into the results themselves can yield insights
* Customer segmentation
2) it can be input for predictive models, so cluster labels become target variables
for classification, i.e. you can use the distance to the closest cluster to classify a
new observation (see the sketch after this list)
3) anomaly detection
4) recommender systems
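As an illustration of 2): a minimal Python sketch (assuming scikit-learn; X and the new observation are made-up data) of assigning a new observation to its closest k-means centroid:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                  # stand-in training data, purely illustrative
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

new_obs = np.array([[0.5, 0.5]])            # hypothetical new observation
label = kmeans.predict(new_obs)             # index of the closest centroid
print(label)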
K MEANS
1: Assign each point to its closest centroid
2: Recompute the centroids as the mean of their assigned points
3: Repeat 1-2 until the assignments stop changing (the centroids start at random initial positions)
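A minimal from-scratch sketch of those steps in Python (numpy only; X and k are made up, and empty clusters are not handled):

import numpy as np

X = np.random.rand(200, 2)                                  # stand-in data
k = 3
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=k, replace=False)]    # random initialization

for _ in range(100):
    # Step 1: assign each point to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: recompute the centroids as the mean of their assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):               # stop when nothing moves
        break
    centroids = new_centroids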
IN PRACTICE
Because K means depends on random initialization, you might end up with a
suboptimal solution depending on the data at hand.
To avoid that: rerun the clustering (a rule of thumb says 50-100 times, but of course it
also depends on the data and the time available) and choose the solution with the lowest cost
function.
TL;DR: just because your algorithm converged doesn't mean you
found the best way of clustering your data.
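In scikit-learn this re-running is built in: the n_init parameter runs k-means that many times from different random initializations and keeps the solution with the lowest cost (inertia). A small sketch with made-up data:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                  # stand-in data
km = KMeans(n_clusters=3, n_init=50, random_state=0).fit(X)
print(km.inertia_)                          # cost of the best of the 50 runs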
HOW DO YOU DECIDE HOW MANY CLUSTERS??
ELBOW METHOD: run the clustering for a range of k values, plot the within-cluster
sum of squares against k, and pick the k at the "elbow" where adding more clusters
stops paying off.
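A sketch of the elbow method with scikit-learn (made-up data; look for the k where the curve bends):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                  # stand-in data
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()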
SIMILARITY
Euclidean distance
Jaccard similarity
How to choose a similarity measure?
1) domain knowledge, i.e. for text we use cosine similarity
2) type of variable, e.g. Jaccard for nominal variables
3) if in doubt, use Euclidean
4) but always check!
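For reference, the measures above in plain numpy (toy vectors, just to show the formulas):

import numpy as np

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])

euclidean = np.linalg.norm(a - b)                            # straight-line distance
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))     # common for text vectors

# Jaccard similarity for binary/nominal indicators
x = np.array([1, 0, 1, 1], dtype=bool)
y = np.array([1, 1, 0, 1], dtype=bool)
jaccard = (x & y).sum() / (x | y).sum()                      # |intersection| / |union|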
HIERARCHICAL CLUSTERING
See slide: builds a hierarchy of clusters (a dendrogram) by repeatedly merging the two most similar clusters.
Linking clusters: Four linking methods
- Complete method
- Single method
- Average method
- Centroid method
Complete method:
o Pairwise similarity between all observations in cluster 1 and cluster 2
o Uses largest of similarities
o Tends to produce more balanced trees
o Sensitive to noise
Single method:
o Pairwise similarity between all observations in cluster 1 and cluster 2
o Uses smallest of similarities
o Tends to produce more unbalanced trees
Average method:
o Pairwise similarity between all observations in cluster 1 and cluster 2
o Uses average of similarities
o Tends to produce more balanced trees
o Often considered the best choice
Centroid method:
o Finds the centroid of cluster 1 and cluster 2
o Uses similarity between the two centroids
o Can impose inversion problems (i.e. similarity need not decrease) and violates the
fundamental assumption that small clusters are more coherent than large clusters
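A minimal sketch of hierarchical clustering with scipy (made-up data; the linkage type from the list above is the method argument):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.rand(30, 2)                       # stand-in data
d = pdist(X, metric="euclidean")                # pairwise distances (choose your measure)

Z = linkage(d, method="average")                # or "complete", "single", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust") # cut the tree into 3 clusters

dendrogram(Z)                                   # visualize the tree
plt.show()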
CONSIDER SCALING!
o Data on different scales can cause undesirable results in clustering methods
o Solution: scale the data so that all features have the same mean and standard
deviation
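E.g. z-score scaling, either by hand or with scikit-learn's StandardScaler (toy numbers for illustration):

import numpy as np

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])                         # features on very different scales

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)       # each column: mean 0, std 1

# equivalently:
# from sklearn.preprocessing import StandardScaler
# X_scaled = StandardScaler().fit_transform(X)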
BUT HOW DO WE VALIDATE CLUSTERING?
Clustering:
o Within-cluster sum of squares (scree plot)
o Silhouette width
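A sketch of the silhouette check with scikit-learn (made-up data; values near 1 = well separated, near 0 = overlapping, negative = probably in the wrong cluster):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)                                              # stand-in data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))                                      # average silhouette width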
Anomaly detection
Manufacturing example:
You have data on two features for engine performance: vibration and heat.
You know that these engines haven't failed.
Can you figure out if a new engine might?
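One common way to tackle this (a hypothetical sketch, not necessarily the course's exact method): fit a Gaussian to the features of the engines known not to have failed and flag new engines whose density is very low. All numbers below are made up:

import numpy as np

normal = np.random.normal(loc=[5.0, 80.0], scale=[1.0, 5.0], size=(500, 2))  # vibration, heat

mu = normal.mean(axis=0)
sigma2 = normal.var(axis=0)

def density(x):
    # product of independent per-feature Gaussian densities
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))

new_engine = np.array([9.0, 110.0])
epsilon = 1e-4                                   # threshold, e.g. tuned on a validation set
print("possible anomaly" if density(new_engine) < epsilon else "looks normal")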
ANOMALY DETECTION VS SUPERVISED LEARNING
Anomaly detection:
o Small number of positive examples (fraud instances) (0 to 20)
o Large number of negative examples (people who don't commit fraud)
o Many different types of anomalies: future anomalies might look nothing like any of the
anomalous examples we've seen so far
o Example: FRAUD
Supervised learning:
o Large number of positive and negative examples
o Enough positive examples for an algorithm to get a sense of what positives are like;
future positive examples are likely to be similar to the ones in the training set
o Example: SPAM
RECOMMENDER SYSTEMS
Popularity:
- Recommend the most popular or trending item(s) to everyone
Content-based:
- Items are similar if their attributes are similar
- Often hand-engineered (domain-specific) attributes
Collaborative filtering:
- Recommends items chosen by similar users
- Domain-free
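A toy sketch of user-based collaborative filtering (entirely made-up ratings; rows = users, columns = items, 0 = not rated):

import numpy as np

R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0                                                  # recommend for user 0
sims = np.array([cosine(R[target], R[u]) for u in range(len(R))])
sims[target] = 0                                            # ignore self-similarity

scores = sims @ R / sims.sum()                              # similarity-weighted average rating
scores[R[target] > 0] = -np.inf                             # don't re-recommend rated items
print("recommend item", scores.argmax())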
SUM UP: CLUSTER ANALYSIS PIPELINE
1) data clean-up:
* standardize/scale your variables
* pay attention to outliers! (clustering forces all observations into a cluster)
2) choose type of clustering technique:
* for hierarchical clustering: choose a similarity measure and a linkage type
* for k-means: choose k
3) cluster
4) make sense of the clusters
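Putting the pipeline together in one short sketch (made-up data, k-means with an assumed k = 3):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1) clean up: scale variables so neither feature dominates the distances
X = np.column_stack([np.random.rand(200), np.random.rand(200) * 1000])
X_scaled = StandardScaler().fit_transform(X)

# 2) + 3) choose a technique and cluster
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# 4) make sense of the clusters, e.g. per-cluster feature means and sizes
for c in range(3):
    print(c, X[labels == c].mean(axis=0), (labels == c).sum())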