Clustering Flashcards
What topics are under dissimilarities?
Continuous dissims.
Binary dissims.
3 Rules
Dissim. matrices
What are two types of continuous dissims?
Euclidean
Manhattan
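A minimal sketch of the two continuous dissimilarities, assuming each observation is a tuple of numbers; the function names are illustrative only.

```python
import math

def euclidean(x, y):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # City-block distance: sum of the absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```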
What are two types of binary dissims?
Simple
Jaccard
What is the Simple binary dissimilarity equation?
d(x,y)=(b+c)/(a+b+c+d), where a = both 1, b and c = the two kinds of mismatch, d = both 0
What is the Jaccard binary dissimilarity equation?
d(x,y)=(b+c)/(a+b+c)
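A small sketch of both binary dissimilarities, assuming the usual 2x2 counts: a = both 1, b = 1 in x only, c = 1 in y only, d = both 0. A dissimilarity is taken here as the proportion of mismatches; the helper names are made up for illustration.

```python
def counts(x, y):
    # Tally the four possible (x_i, y_i) combinations for 0/1 vectors.
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)
    return a, b, c, d

def simple_dissim(x, y):
    a, b, c, d = counts(x, y)
    return (b + c) / (a + b + c + d)

def jaccard_dissim(x, y):
    # Joint absences (d) are ignored; undefined if a + b + c == 0.
    a, b, c, _ = counts(x, y)
    return (b + c) / (a + b + c)

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(simple_dissim(x, y))   # 0.4
print(jaccard_dissim(x, y))  # 0.5
```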
What are the three rules for dissimilarities?
d(x,y)>=0, if x=y then d=0
d(x,y)=d(y,x)
d(x,y)<=d(x,z)+d(z,y)
What are the key features of dissimilarity matrices?
0’s along the diagonal
Usually factors are standardized
Usually factors are scaled
Symmetric
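A rough sketch of building a dissimilarity matrix, assuming Euclidean distance and z-score standardization (subtract the column mean, divide by the column standard deviation). The data values are invented for illustration.

```python
import math
import statistics

def standardize(rows):
    # Put every column on the same scale so no variable dominates.
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def dissimilarity_matrix(rows):
    # Entry [i][j] is the distance between observations i and j.
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]

data = [[1.0, 200.0], [2.0, 180.0], [9.0, 950.0]]  # made-up observations
D = dissimilarity_matrix(standardize(data))

for row in D:
    print([round(v, 2) for v in row])
```

The printed matrix should show zeros on the diagonal and the same values above and below it.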
What topics are important to hierarchical clustering?
5 steps
Linkage
Chaining
Number of groups
What are the five steps in hierarchical clustering?
- Each obs. into its own group
- Pair nearest obs.
- One less group now!
- Pair next nearest obs.
- Repeat until all obs. are in a single group
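The steps above can be sketched as follows. This is an illustration only: it assumes Euclidean distance and nearest-neighbour (single) linkage between groups, and it stops at a requested number of groups instead of merging all the way down to one. The function names are hypothetical.

```python
import math

def nearest_distance(A, B):
    # Distance between two groups = distance between their closest members.
    return min(math.dist(x, y) for x in A for y in B)

def agglomerate(points, n_groups):
    groups = [[p] for p in points]  # each obs. starts in its own group
    while len(groups) > n_groups:
        # Find the pair of groups that are nearest to each other.
        i, j = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: nearest_distance(groups[ij[0]], groups[ij[1]]),
        )
        groups[i] = groups[i] + groups[j]  # merge them: one less group now
        del groups[j]
    return groups

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerate(pts, 3))
# [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)], [(10.0, 0.0)]]
```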
What types of linkage are there?
Single
Complete
Average
What is the formula for single linkage?
d(A,B) = min{ d(x,y) : x ∈ A, y ∈ B }
What is the formula for complete linkage?
d(A,B) = max{ d(x,y) : x ∈ A, y ∈ B }
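A short sketch of the three linkages on two small groups, assuming Euclidean distance. Average linkage has no formula card above; it is taken here as the mean of all pairwise distances between the groups.

```python
import math
from itertools import product

def pairwise(A, B):
    # Every distance d(x, y) with x in A and y in B.
    return [math.dist(x, y) for x, y in product(A, B)]

def min_linkage(A, B):       # nearest pair
    return min(pairwise(A, B))

def complete_linkage(A, B):  # farthest pair
    return max(pairwise(A, B))

def average_linkage(A, B):   # mean over all pairs
    d = pairwise(A, B)
    return sum(d) / len(d)

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
print(min_linkage(A, B))       # 3.0
print(complete_linkage(A, B))  # 6.0
print(average_linkage(A, B))   # 4.5
```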
What is important to know about the number of groups in hierarchical clustering?
Cut the dendrogram with a horizontal line at a chosen height on the y-axis; each branch below the line forms one group.
What is chaining?
Observations are appended to the same group one at a time, producing one long, straggly cluster. This gives a poor model.
What topics are under partitioning methods?
6 steps
K-means
Starting position
Number of groups
What are the 6 steps to partitioning methods?
- Data is partitioned into groups
- Centroids of groups are calculated
- Each obs. distance from centroid is calculated
- Move obs to new group if necessary
- If no moves are made, stop, local min found
- Otherwise, repeat from step 2
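The six steps can be sketched roughly as below. This assumes Euclidean distance and a starting partition in which every group 0..k-1 has at least one member; the function names and data are invented for illustration.

```python
import math

def centroid(group):
    # Mean of each coordinate across the group's members.
    n = len(group)
    return tuple(sum(coord) / n for coord in zip(*group))

def partition_cluster(points, start_labels, max_iter=100):
    labels = list(start_labels)  # step 1: data already partitioned
    k = max(labels) + 1
    centroids = [None] * k
    for _ in range(max_iter):
        for g in range(k):  # step 2: centroid of each group
            members = [p for p, l in zip(points, labels) if l == g]
            if members:  # an emptied group keeps its previous centroid
                centroids[g] = centroid(members)
        # steps 3-4: measure distances and move each obs. to the nearest centroid
        new_labels = [
            min(range(k), key=lambda g: math.dist(p, centroids[g])) for p in points
        ]
        if new_labels == labels:  # step 5: no moves, so stop (local min)
            break
        labels = new_labels  # step 6: otherwise go round again
    return labels, centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = partition_cluster(pts, [0, 1, 0, 1, 0, 1])
print(labels)  # [0, 0, 0, 1, 1, 1]
```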
What is important to know about k-means?
It is iterative
It is computationally efficient (+)
It results in local minima (-)
What is important to remember about starting positions?
Random point
Random area
Or result from hierarchical cluster
How is the number of clusters in the data determined?
Trialling range of clusters (e.g., 1-10)
Plotting sum of squares
Looking for elbow in graph
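A rough sketch of the elbow procedure on one-dimensional data. The tiny k-means here is a stand-in with a deterministic starting rule (centres spread through the sorted values), which is an assumption made for reproducibility, and the data are invented. The sum of squares is printed instead of plotted.

```python
def kmeans_1d(values, k, max_iter=100):
    ordered = sorted(values)
    n = len(ordered)
    # Assumed starting rule: the middle value of each of k equal chunks.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda g: abs(v - centres[g])) for v in values]
        if new_labels == labels:
            break
        labels = new_labels
        for g in range(k):
            members = [v for v, l in zip(values, labels) if l == g]
            if members:
                centres[g] = sum(members) / len(members)
    return labels, centres

def within_ss(values, labels, centres):
    # Total squared distance of each obs. from its own group centre.
    return sum((v - centres[l]) ** 2 for v, l in zip(values, labels))

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
wss = {}
for k in range(1, 5):  # trial a range of cluster counts
    labels, centres = kmeans_1d(data, k)
    wss[k] = within_ss(data, labels, centres)
    print(k, round(wss[k], 2))
```

With these made-up values the sum of squares drops sharply up to k = 3 and barely changes afterwards, which is the elbow.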
What three topics are under cluster validation?
5 steps
Rand index
Issues with clustering
What are the issues with clustering?
Sensitive to the start point
Only works for continuous data
Assumes clusters are roughly spherical
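What things are important to note about the Rand index?
How to calculate: rand(S1,S2)=A/(A+D), where A = pairs of obs. on which S1 and S2 agree, D = pairs on which they disagree
The adjusted Rand index corrects for agreement expected by chance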
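A minimal sketch of the Rand index, reading A as the number of pairs of observations on which the two clusterings agree (together in both, or apart in both) and D as the number on which they disagree. That reading is an assumption; the adjusted version is not shown.

```python
from itertools import combinations

def rand_index(s1, s2):
    agree = disagree = 0
    for i, j in combinations(range(len(s1)), 2):
        together_1 = s1[i] == s1[j]
        together_2 = s2[i] == s2[j]
        if together_1 == together_2:
            agree += 1     # both clusterings treat the pair the same way
        else:
            disagree += 1
    return agree / (agree + disagree)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (label names do not matter)
print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.5
```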
What are the 5 steps in cluster validation?
- Split data into train/test
- Run clustering on training data
- Assign test data to the clusters found on the training data: S1
- Cluster the test data on its own: S2
- Cross tabulate S1 and S2
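The final step can be sketched as a simple count of label pairs. The S1 and S2 labels below are invented; in practice they would come from the earlier steps.

```python
from collections import Counter

def cross_tab(s1, s2):
    # Count how often each (S1 label, S2 label) pair occurs.
    return Counter(zip(s1, s2))

s1 = [0, 0, 0, 1, 1, 1]              # test obs. assigned to training clusters
s2 = ["a", "a", "b", "b", "b", "b"]  # test obs. clustered on their own
table = cross_tab(s1, s2)

for r in sorted(set(s1)):
    print(r, [table[(r, c)] for c in sorted(set(s2))])
# 0 [2, 1]
# 1 [0, 3]
```

A table with most counts concentrated in one cell per row suggests the two groupings broadly agree.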