Clustering FINAL Flashcards
Cluster analysis divides data into
groups (clusters) that are meaningful, useful, or both
Objects in a cluster
share similar characteristics
What fields is clustering used in?
A variety of fields, e.g., health and medicine, business, and computer networking
How is clustering used in health and medicine?
Cluster patients according to symptoms presented upon evaluation
How is clustering used in business?
Cluster stores according to sales/customer satisfaction/…
How is clustering used in computer networking?
Cluster traffic according to application type
What does it mean that clusters are meaningful?
Clusters should capture the natural structure of the data
What does it mean that clusters are useful?
Using the cluster prototype, clusters can be used as a starting point for data summarization, compression (vector quantization), classification (nearest neighbor)
Clustering groups data objects based on
information found in the data that describes the objects and their relationships
What type of learning is clustering
Unsupervised learning
Clustering goal
Objects within a cluster should be similar to one another, but different from objects in other clusters
Is the notion of a cluster well defined?
No
Does clustering have to know exactly what it is sorting?
No; it can sort objects like pennies, nickels, and dimes into groups without knowing how much each is worth
Partitional clustering
Divide objects into non-overlapping clusters such that each object belongs to exactly one cluster
Hierarchical clustering
Clusters can have subclusters
Exclusive Clustering
1:1 relationship between object and cluster
Why do we need a hyperparameter for clustering?
It tells us how many clusters we expect from the dataset
Overlapping clustering
1:n relationship between object and cluster; an object can belong to > 1 cluster
Fuzzy clustering
n:n relationship, all objects belong to all clusters with a certain probability (or membership weight)
In fuzzy clustering, each object’s probability of belonging to all clusters should sum to
1.0
Complete clustering
Assign every point to at least one cluster
Partial clustering
Some objects may not be assigned to any cluster
What might some objects not assigned to a cluster represent?
Noise or outlier
Well-separated clusters
Each point is closer to all of the points in its cluster than to any point in another cluster
Center-based clusters
Each point is closer to the center of its cluster than to the center of any other cluster
Density-based clusters
A cluster is a region of high data-point density separated from other clusters by regions of low density. Useful for discovering clusters of arbitrary shape and for handling noise
Conceptual clusters
Points in a cluster share some general property that derives from the entire set of points
K-means clustering definition
A prototype-based partitional clustering technique that finds patterns in the data to create k clusters
Hierarchical clustering
Build a hierarchy of clusters starting from singleton clusters
DBSCAN
Density-based clustering
K-means clustering process
Takes n observations and partitions them into k clusters
In k means clustering, relationship between k and n
k << n (k is much smaller than n)
What type of clustering scheme is K-means
Prototype-based
How to use supervised techniques from unsupervised learning
Derive cluster labels via unsupervised learning, then create a labeled data set from them
Meaning of prototype based clustering scheme
It has a notion of a centroid, called a prototype in the literature, and we want every point to be closest to its own cluster's centroid
How difficult is k means clustering?
Computationally difficult (NP-hard)
Prerequisites of K-means algorithm
Attributes must be numeric
Attributes must be standardized (scaled)
K-means algorithm
Select K points as initial centroids
Form K clusters by assigning each point to its closest centroid
Recompute the centroid of each cluster
Repeat until centroids do not change
How to compute distances for k-means clustering
Euclidean distance
Manhattan Distance
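A minimal NumPy sketch of the loop above, using Euclidean distance for the assignment step (the function name, defaults, and random-point initialization are illustrative, not prescribed by these cards):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch; X is an (n_points, n_attrs) array of scaled numeric data."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points as initial centroids (random selection here)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # guard against the empty-cluster problem
                new_centroids[j] = members.mean(axis=0)
        # Step 4: repeat until centroids do not change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```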
When do we stop the K-means algorithm?
When an iteration leaves the SSE unchanged from the previous one, the algorithm has converged and the SSE is (locally) minimized
Does K-means always converge?
Yes, it will find a minimum (perhaps not the global one) because the SSE never increases between iterations
How to choose initial centroids?
Random selection
Ideally, one centroid falls within each natural cluster
Random selection of initial centroid problem
May lead to sub optimal clustering
Best way to initially choose centroids
Spread the centroids out, ideally one per dense region of the space (initializations such as K-means++ pick widely separated points)
Bisecting K-means goal
Form best (optimal) clusters
Bisecting K-means idea
To obtain K clusters, split the set of all points into two clusters, select one of the clusters to split, and so on until you have k clusters
Which cluster to split?
The largest one or the one with largest error (SSE)
Bisecting K-means algorithm
Initialize the list of clusters to contain the cluster consisting of all points
Remove a cluster from the list of clusters
Bisect the selected cluster using basic K-means (several trial bisections)
Keep the trial whose two clusters have the lowest total SSE
Add these two clusters to the list of clusters
Repeat until the list of clusters contains K clusters
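A sketch of this loop, assuming scikit-learn is available for the basic 2-means step (the helper names `sse` and `bisecting_kmeans` are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is available

def sse(points):
    # Within-cluster SSE: squared distances to the cluster centroid
    return ((points - points.mean(axis=0)) ** 2).sum()

def bisecting_kmeans(X, k):
    clusters = [X]  # the list of clusters starts with one cluster of all points
    while len(clusters) < k:
        # Remove the cluster with the largest SSE (could also pick the largest one)
        target = clusters.pop(int(np.argmax([sse(c) for c in clusters])))
        # Bisect it with basic 2-means; n_init restarts keep the lowest-SSE split
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(target)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters
```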
How to choose the k in K-means?
Pick the k where the total within-cluster sum of squares stops decreasing sharply; simply minimizing it would always favor more clusters
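A common heuristic is to plot the total within-cluster SS against candidate k values and look for the point of diminishing returns. A sketch with scikit-learn, using synthetic stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical scaled data; in practice use your own standardized matrix
X = np.random.default_rng(0).normal(size=(300, 2))

# Total within-cluster SS (sklearn calls it inertia_) for candidate values of k;
# pick the k where the curve stops dropping sharply (the "elbow")
for k in range(1, 11):
    print(k, KMeans(n_clusters=k, n_init=10).fit(X).inertia_)
```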
Clustering should exhibit
Low intra-cluster SS
High inter-cluster SS
Low intra-cluster SS
Cohesion
High inter-cluster SS
Separation
Within cluster SS
Sum of squared distances from each point in the cluster to the cluster centroid
Total SS
Sum of squared distances from each point in the dataset to the global mean of all points
Between SS
Total SS - total within SS
Variance of clustering
Ratio of between SS/Total SS (between 0.0 and 1.0)
What does a high ratio of between SS/Total SS mean?
More of the variance is explained by the clusters
The higher the ratio
the better, because the clustering explains that much of the variance
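A sketch of computing this ratio with scikit-learn (the synthetic data is a stand-in; sklearn's `inertia_` is the total within-cluster SS):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # hypothetical scaled data
km = KMeans(n_clusters=3, n_init=10).fit(X)

total_ss = ((X - X.mean(axis=0)) ** 2).sum()  # SS to the global mean
within_ss = km.inertia_                        # total within-cluster SS
between_ss = total_ss - within_ss              # Between SS = Total SS - Within SS
print(between_ss / total_ss)                   # variance explained, in [0.0, 1.0]
```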
How does K-means handle outliers
Not gracefully; it will try to include them in a cluster
What can help K-means
Outlier detection and removal prior to clustering
In K-means, how many points are assigned to a cluster?
Every one
K-means only creates what type of clusters?
Globular clusters (round)
K-means may result in an
empty cluster
How to handle empty cluster problem
Choose the point that is farthest from any current centroid as the replacement centroid; this essentially assigns an outlier to the empty cluster
Another option is to choose a replacement centroid at random from the cluster that has the highest SSE and split it
K-means has problems when clusters have differing
sizes or densities, or are non-globular in shape
K-means complexity
Space: O((m + K)n)
Time: O(I * K * m * n), where m = number of points, n = number of attributes, I = number of iterations
What might outliers lead to in k means
Higher SSE and non-representative cluster centroids
What are hierarchical clustering inspired by?
The area of taxonomy, where hierarchical structures are common and elements under the same hierarchy automatically constitute a cluster
What is not required of hierarchical clustering unlike k means
Choosing k a priori. You look at the hierarchy it creates and then choose a k according to your understanding of the problem
Is not having to choose k a priori an advantage or a disadvantage?
Both
Two approaches to hierarchical clustering
Agglomerative (bottom up)
Divisive (top down)
Agglomerative
Each point starts off as an individual cluster; at each step, merge the closest pair of clusters (needs a cluster proximity metric)
Divisive
All points start in one cluster; at each step, split a cluster until only singleton clusters of individual points remain (must decide which cluster to split and how)
How is hierarchical clustering depicted
As a tree-like structure called a dendrogram
Dendrogram y axis and x axis
The y axis is labeled with the proximity between clusters; the x axis holds the labels of the data points
How to interpret dendrogram
Look for the closest pair of clusters (in terms of the chosen proximity) and join them; continue until one cluster is left
How to pick clusters in dendrogram
Draw a horizontal line that intersects k vertical lines, k being the number of clusters you want
Hierarchical clustering algorithm
Compute proximity matrix if necessary
Merge the closest two clusters
Update the proximity matrix to reflect the proximity between the new cluster and original clusters
Repeat until only one cluster remains
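A sketch of agglomerative clustering with SciPy, covering the merge step, the dendrogram, and the horizontal-line cut described above (the synthetic data and parameter choices are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(0).normal(size=(30, 2))  # stand-in scaled data

# Agglomerative clustering; method='single' is MIN, 'complete' is MAX,
# and 'average' is group average
Z = linkage(X, method="average", metric="euclidean")

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree:
# proximity on the y axis, data point labels on the x axis.
# Cutting the tree into k flat clusters is the horizontal-line step:
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```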
3 ways the merge step can be done in the hierarchical clustering algorithm
MIN (single link), MAX (complete link), Group average
MIN (single link)
Defines the distance between two clusters as the shortest distance between any pair of points in the two clusters
MAX (complete Link)
Measures the distance between two clusters as the greatest distance between any pair of points in the two clusters
Group Average
Defines the distance between two clusters as the average of all pairwise distances between points in the two clusters
Which merge criterion is sensitive to outliers, and which is not?
Sensitive: MIN (single link). Not sensitive: MAX (complete link)
Hierarchical clustering complexity
Space: O(m^2)
Time: O(m^3) naively; O(m^2 log m) using a heap
Does scaling matter in hierarchical clustering
Yes, if attributes have widely differing ranges
What dissimilarity measure should be used in HC?
Euclidean
Manhattan
What linkage to be used in HC?
Same as before: MIN, MAX, or group average, depending on whether your data is susceptible to outliers
Unlike k-means, merging decisions are
final; an observation cannot be reassigned to a different cluster later
Even though clustering in general is considered an unsupervised learning method,
it can be used in supervised learning mode
A cluster indicates a _ and each object in a certain cluster _
class label; belongs to that class
Error measures for supervised clustering are
the same as classification
What is DBSCAN?
A density-based clustering algorithm parametrized by a neighborhood radius (Eps) and a minimum number of neighboring points (MinPts)
e-Neighbourhood
Objects within a radius e from a source object
Density (DBSCAN)
If the e-neighborhood of a source point (object) contains at least MinPts other points (objects), then the source point is in a “high-density” area
DBSCAN divides all points into
Core point
Border point
Noise point
Core point
A point that has at least MinPts points within its Eps-neighborhood
Border point
A point that is not a core point, but is in the neighborhood of a core point
Noise point
Points that are neither core nor border points
Points idea
Pick a point and draw a circle of radius Eps around it
If at least MinPts points fall within the circle, the picked point is a core point
A point that is not core but falls within some core point's circle is a border point
Any remaining point is a noise point
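A brute-force NumPy sketch of this labeling; note it assumes a point counts itself in its own Eps-neighborhood, a convention that varies between presentations:

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Brute-force sketch labeling each point core, border, or noise."""
    # Pairwise Euclidean distances; here a point counts itself as a neighbor
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    in_eps = d <= eps
    core = in_eps.sum(axis=1) >= min_pts
    # Border: not core, but inside the Eps-neighborhood of some core point
    border = ~core & (in_eps & core[None, :]).any(axis=1)
    noise = ~core & ~border
    return core, border, noise
```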
Density-based clustering: DBSCAN algorithm steps
Label all points as core, border, or noise points
Eliminate all noise points
Put an edge between all core points that are within Eps of each other
Make each group of connected core points into a separate cluster
Assign each border point to one of the clusters of its associated core points
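In practice the whole algorithm is available off the shelf; a sketch with scikit-learn's DBSCAN on stand-in data (parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN  # assumes scikit-learn is available

X = np.random.default_rng(0).normal(size=(200, 2))  # hypothetical scaled data

# eps is the neighborhood radius, min_samples plays the role of MinPts
labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)

# Cluster ids are 0, 1, 2, ...; noise points are labeled -1
print(sorted(set(labels)))
```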
How to pick epsilon
Graph the distance from each point in the data set to its 5 nearest neighbors (more generally, its k-th nearest neighbor), computed across all points. Sort the distances and look at the curve; choose the epsilon value where the curve starts to rise sharply
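A sketch of computing the sorted k-distance curve with scikit-learn (k = 4 is an illustrative choice, often tied to MinPts):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(200, 2))  # hypothetical scaled data

k = 4  # number of neighbors to consider, often set to MinPts
# +1 because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)

# Sorted distance of every point to its k-th nearest neighbor;
# plot this curve and choose epsilon where it starts to rise sharply
k_dist = np.sort(dists[:, -1])
print(k_dist[:5], k_dist[-5:])
```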
dbscan complexity
O(m) space
O(m^2) time worst case
O(m log m) in low-dimensional spaces
What happens if the neighborhood radius (Eps) is too small?
Then a sparse cluster may be erroneously labeled as noise
What happens if the neighborhood radius (Eps) is too large?
Then dense clusters may be merged together, and small clusters may be swallowed by larger ones
How many MinPts to use for 2 dimensions?
4
For greater than 2 dimensions, how many MinPts
2*dimensions
How many MinPts for large and noisy datasets
A larger value
What to do if clusters are too large
Decrease epsilon
Too much noise
Increase epsilon
How to choose epsilon concise
Calculate the average distance of every point to its k nearest neighbors (k-dist); k-dist will be large for noise points. Sort the values and choose epsilon where the curve starts rising
Why validate clustering?
Avoid finding random patterns in data
Compare two sets of clusters
Compare two clustering algorithms
Types of validations
Internal
External
Relative
Internal validation
Used to measure the goodness of a clustering structure without respect to external information (SSE for example)
External Validation
Match cluster result with external results (class labels)
Relative Validation
Used to compare two different clustering algorithms or clusters
Two measures of internal validation
Cohesion
Separation
Cohesion
Measures how close objects in the same cluster are (measured by within-cluster SSE)
Separation
Measures how well separated the clusters are
Silhouette value
Measures the degree of confidence in the clustering assignment of a particular observation
Silhouette formula
Si = (Bi-Ai)/max(Bi,Ai)
Silhouette width
[-1, 1]; a large Si means the observation is well clustered, a small Si means it lies between two clusters, and a negative Si means it was probably placed in the wrong cluster
What is Ai?
the average distance from the ith point to all the other points in the same cluster as the ith point
What is Bi?
The average distance from the ith point to all the points in the nearest other cluster (i.e., the minimum such average over all other clusters)
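A sketch of computing per-observation silhouette values and the average silhouette width with scikit-learn (the synthetic data and k = 3 are stand-ins):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.default_rng(0).normal(size=(300, 2))  # hypothetical scaled data
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# Per-observation silhouette values Si = (Bi - Ai) / max(Ai, Bi)
s = silhouette_samples(X, labels)
print(s.min(), s.mean(), s.max())

# Average silhouette width over all observations
print(silhouette_score(X, labels))
```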