Lecture 9: Unsupervised machine learning Flashcards
Unsupervised ML
What: Works without a target outcome. It's more concerned with identifying associations within the data
Clustering:
Ability to process a set of previously unknown data points and group them based on similarities
Cluster analysis
Divides data into clusters
Similar items are placed in the same cluster
* Intra-cluster differences are minimized
* Inter-cluster differences are maximized
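One standard way to make "intra-cluster differences are minimized" precise (and the cost function that K-means, below, tries to minimize) is the within-cluster sum of squares, in LaTeX notation:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k is the k-th cluster and \mu_k its centroid; a lower J means tighter clusters.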
What can you do with clusters?
1) Inspecting the results themselves can yield insights
* Customer segmentation
2) They can serve as input for predictive models, so the clusters become target
variables for classification, i.e. you can use the distance to the closest
cluster to classify a new observation (see the sketch after this list)
3) Anomaly detection
4) Recommender systems
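Uses 2 and 3 can both be read off a fitted K-means model (the algorithm itself is described in the next card). A minimal scikit-learn sketch on made-up toy data; the 0.99 quantile threshold is an arbitrary illustrative choice:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Use 2: classify a new observation by its closest cluster
print(km.predict([[0.0, 0.5]]))

# Use 3: anomaly detection via distance to the closest centroid
d = km.transform(X).min(axis=1)   # transform() = distances to all centroids
threshold = np.quantile(d, 0.99)  # arbitrary cut-off for illustration
print(np.where(d > threshold)[0]) # indices of the most isolated points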
K-MEANS
0: Initialize K centroids (e.g. pick K random data points)
1: Assign each point to its closest centroid
2: Recompute the centroids as the mean of their assigned points
Repeat steps 1-2 until the assignments stop changing
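A minimal NumPy sketch of this loop (function and variable names are my own, and empty clusters are not handled):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 0: initialize centroids as k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids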
IN PRACTICE
Because K-means depends on random initialization, you might, depending on the
data at hand, end up with a suboptimal solution.
To avoid that: rerun the clustering (a rule of thumb says 50-100 times, but of
course it also depends on the data and the time available) and choose the
solution with the lowest cost function.
TL;DR: just because you got your algorithm to converge doesn't mean you
found the best way of clustering your data.
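scikit-learn's KMeans does this rerunning for you via the n_init parameter; a sketch on toy data:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data

# n_init reruns the clustering with different random initializations
# and keeps the run with the lowest cost function
km = KMeans(n_clusters=3, n_init=50, random_state=0).fit(X)
print(km.inertia_)  # cost (within-cluster sum of squares) of the best run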
HOW DO YOU DECIDE HOW MANY CLUSTERS?
ELBOW-METHOD: run the clustering for a range of K, plot the cost function
against K, and pick the K at the "elbow", where adding more clusters stops
buying a large drop in cost.
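A sketch of the elbow plot with scikit-learn and matplotlib (toy data again):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data

# Record the cost function for K = 1..10
ks = range(1, 11)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
         for k in ks]

plt.plot(ks, costs, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("cost (within-cluster sum of squares)")
plt.show()  # pick the K where the curve bends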
SIMILARITY
Euclidean distance
Jaccard similarity
How to choose a similarity measure?
1) Domain knowledge, i.e. for text we use cosine similarity
2) Type of variable, e.g. Jaccard for nominal variables
3) If in doubt, use Euclidean...
4) ...but always check!
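Small sketches of the three measures using scipy (note that scipy returns distances/dissimilarities, so similarity = 1 - distance for Jaccard and cosine):

import numpy as np
from scipy.spatial.distance import euclidean, jaccard, cosine

a = np.array([1.0, 0.0, 2.0])
b = np.array([2.0, 1.0, 0.0])
print(euclidean(a, b))   # Euclidean distance between numeric vectors
print(1 - cosine(a, b))  # cosine similarity, the usual choice for text vectors

# Jaccard works on boolean/nominal indicator vectors
u = np.array([1, 0, 1, 1], dtype=bool)
v = np.array([1, 1, 0, 1], dtype=bool)
print(1 - jaccard(u, v))  # Jaccard similarity = |u AND v| / |u OR v|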
HIERARCHICAL CLUSTERING
See slides
Linking clusters: Four linking methods
- Complete method
- Single method
- Average method
- Centroid method
Complete method:
o Pairwise similarity between all observations in cluster 1 and cluster 2
o Uses the largest of the similarities
o Tends to produce more balanced trees
o Sensitive to noise
Single method:
o Pairwise similarity between all observations in cluster 1 and cluster 2
o Uses the smallest of the similarities
o Tends to produce more unbalanced trees
Average method:
o Pairwise similarity between all observations in cluster 1 and cluster 2
o Uses the average of the similarities
o Tends to produce more balanced trees
o Often considered the best choice
Centroid method:
o Finds the centroid of cluster 1 and cluster 2
o Uses the similarity between the two centroids
o Can cause inversion problems (i.e. similarity need not decrease as you move
up the tree) and violates the fundamental assumption that small clusters are
more coherent than large clusters
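The four methods map directly onto the method argument of scipy's linkage function; a sketch on toy data (note that the centroid method requires the default Euclidean metric):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data

for method in ["complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)                    # build the tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(method, labels)

# Inspect the tree shape (balanced vs unbalanced) via the dendrogram
dendrogram(linkage(X, method="average"))
plt.show()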