Clustering FINAL Flashcards
Cluster analysis divides data into
groups (clusters) that are meaningful, useful, or both
Objects in a cluster
share similar characteristics
What fields is clustering used in?
A variety of fields, including health and medicine, business, and computer networking
How is clustering used in health and medicine?
Cluster patients according to symptoms presented upon evaluation
How is clustering used in business?
Cluster stores according to sales/customer satisfaction/…
How is clustering used in computer networking?
Cluster traffic according to application type
What does it mean that clusters are meaningful?
Clusters should capture the natural structure of the data
What does it mean that clusters are useful?
Using the cluster prototype, clusters can be used as a starting point for data summarization, compression (vector quantization), classification (nearest neighbor)
Clustering groups data objects based on
information found in the data that describes the objects and their relationships
What type of learning is clustering
Unsupervised learning
Clustering goal
Objects within a cluster should be similar to one another, but different from objects in other clusters
Is the notion of a cluster well defined?
No
Does clustering have to know exactly what it is sorting?
No; it can sort objects such as pennies, nickels, and dimes without knowing what each is worth
Partitional clustering
Divide objects into non-overlapping clusters such that each object belongs to exactly one cluster
Hierarchical clustering
Clusters can have subclusters
Exclusive Clustering
1:1 relationship between object and cluster
Why does clustering need a hyperparameter?
It tells us how many clusters we are expecting from the dataset
Overlapping clustering
1:n relationship between object and cluster; an object can belong to > 1 cluster
Fuzzy clustering
n:n relationship, all objects belong to all clusters with a certain probability (or membership weight)
In fuzzy clustering, each object’s probability of belonging to all clusters should sum to
1.0
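A tiny Python sketch of this constraint; the membership weights below are made up purely for illustration:

```python
# Hypothetical membership weights of three objects across two clusters.
# In fuzzy clustering, each row (one object's weights) must sum to 1.0.
memberships = [
    [0.9, 0.1],   # object 0: mostly cluster A
    [0.4, 0.6],   # object 1: split between the two clusters
    [0.2, 0.8],   # object 2: mostly cluster B
]

for i, weights in enumerate(memberships):
    assert abs(sum(weights) - 1.0) < 1e-9, f"object {i} weights do not sum to 1"
print("All membership rows sum to 1.0")
```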
Complete clustering
Assign every point to at least one cluster
Partial clustering
Some objects may not be assigned to any cluster
What might some objects not assigned to a cluster represent?
Noise or outliers
Well-separated clusters
Each point is closer to all of the points in its cluster than to any point in another cluster
Center-based clusters
Each point is closer to the center of its cluster than to the center of any other cluster
Density-based clusters
Clusters are regions of high data-point density separated by regions of low density; this view is useful for discovering clusters of arbitrary shape and handles noise effectively
Conceptual clusters
Points in a cluster share some general property that derives from the entire set of points
K-means clustering definition
A prototype-based partitional clustering technique that finds patterns in the data to create k clusters
Hierarchical clustering
Build a hierarchy of clusters starting from singleton clusters
DBSCAN
Density-based clustering
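A rough sketch of density-based clustering with scikit-learn's DBSCAN; the data, eps, and min_samples values are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points that should come out as noise.
rng = np.random.default_rng(0)
blob1 = rng.normal(loc=(0, 0), scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2, high=7, size=(5, 2))
X = np.vstack([blob1, blob2, noise])

# eps and min_samples are illustrative; they must be tuned per dataset.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("cluster labels found:", np.unique(labels))  # -1 marks noise points
```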
K-means clustering process
Takes n observations and partitions them into k clusters
In K-means clustering, what is the relationship between k and n?
k ≪ n (k is much smaller than n)
What type of clustering scheme is K-means
Prototype-based
How to use supervised techniques from unsupervised learning
Derive labels from unsupervised learning (the cluster assignments), then use them to create a labeled dataset for supervised learning
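A minimal sketch of that workflow, assuming scikit-learn and made-up 2-D data: the cluster assignments become class labels for a supervised model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Made-up unlabeled data: two loose groups in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.5, (40, 2)),
               rng.normal((4, 4), 0.5, (40, 2))])

# Step 1: derive labels with unsupervised clustering.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: use (X, labels) as a labeled dataset for a supervised model.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(clf.predict([[0.2, -0.1], [3.8, 4.2]]))  # predicts the derived cluster labels
```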
Meaning of prototype based clustering scheme
It has a notion of a centroid, called a prototype in the literature, and each point should be closer to its own cluster's centroid than to any other cluster's centroid
How difficult is k means clustering?
Computationally difficult (NP-hard)
Prerequisites of K-means algorithm
Attributes must be numeric
Attributes must be standardized (scaled)
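A minimal sketch of the standardization prerequisite (z-scoring each numeric column), assuming NumPy and made-up values:

```python
import numpy as np

# Made-up numeric attributes on very different scales (e.g., age vs. income).
X = np.array([[25, 40_000.0],
              [32, 85_000.0],
              [47, 120_000.0]])

# Standardize: zero mean and unit standard deviation per column,
# so no single attribute dominates the distance computation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0))
```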
K-means algorithm
Select K points as initial centroids
Form K clusters by assigning each point to its closest centroid
Recompute the centroid of each cluster
Repeat until the centroids do not change
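A bare-bones NumPy sketch of these steps, using Euclidean distance and random initial centroids; it is illustrative, not an optimized implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select K data points as the initial centroids (random choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignments = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Step 3: recompute the centroid of each cluster.
        # (Empty clusters are not handled here; this is only a sketch.)
        new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignments
```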
How to compute distances for k-means clustering
Euclidean distance
Manhattan Distance
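Both distance measures written out with NumPy on two made-up points:

```python
import numpy as np

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))  # straight-line distance: 5.0
manhattan = np.sum(np.abs(p - q))          # sum of absolute differences: 7.0
print(euclidean, manhattan)
```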
When do we stop the K-means algorithm?
When an iteration leaves the SSE unchanged from the previous iteration, the algorithm has converged and the SSE has been (locally) minimized
Does K-means always converge?
Yes, it will find a minimum (perhaps not the global one) because the SSE always decreases
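A small helper for the total within-cluster SSE used in this stopping check, assuming the centroids and assignments produced by the kmeans sketch above:

```python
import numpy as np

def total_sse(X, centroids, assignments):
    # Sum of squared Euclidean distances from each point to its assigned centroid.
    diffs = X - centroids[assignments]
    return float(np.sum(diffs ** 2))

# When this value stops decreasing between iterations, K-means has converged
# (to a local, not necessarily global, minimum).
```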
How to choose initial centroids?
Random selection
Ideally, one initial centroid falls within each natural cluster
Problem with random selection of initial centroids
May lead to suboptimal clustering
Best way to initially choose centroids
Spread the centroids out so that, ideally, one falls in each natural cluster; placing them all close together in one dense region tends to produce poor clusters
Bisecting K-means goal
Form best (optimal) clusters
Bisecting K-means idea
To obtain K clusters, split the set of all points into two clusters, select one of the clusters to split, and so on until you have k clusters
Which cluster to split?
The largest one or the one with largest error (SSE)
Bisecting K-means algorithm
Initialize the list of clusters to contain the cluster consisting of all points
Remove a cluster from the list of clusters
Bisect the selected cluster using basic K-means
Select the two clusters from the bisection with the lowest total SSE
Add these two clusters to the list of clusters
Repeat until the list of clusters contains K clusters
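A rough sketch of this procedure, reusing the hypothetical kmeans and total_sse helpers sketched earlier; the details are assumptions, not the canonical implementation:

```python
import numpy as np

def bisecting_kmeans(X, k, n_trials=5):
    clusters = [X]  # start with one cluster containing all points
    while len(clusters) < k:
        # Pick the cluster with the largest SSE (around its own mean) to split.
        idx = max(range(len(clusters)),
                  key=lambda i: np.sum((clusters[i] - clusters[i].mean(axis=0)) ** 2))
        to_split = clusters.pop(idx)
        best_pair, best_sse = None, np.inf
        for trial in range(n_trials):
            # Bisect with basic 2-means; keep the trial with the lowest total SSE.
            centroids, assign = kmeans(to_split, k=2, seed=trial)
            sse = total_sse(to_split, centroids, assign)
            if sse < best_sse:
                best_sse = sse
                best_pair = [to_split[assign == 0], to_split[assign == 1]]
        clusters.extend(best_pair)
    return clusters
```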
How to choose the k in K-means?
We want the total within-cluster sum of squares to be small; choose the k beyond which increasing k no longer reduces it substantially
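One common practical approach (a sketch using scikit-learn, whose inertia_ attribute is exactly this within-cluster sum of squares) is to compute it for several values of k and look for the "elbow" where the improvement flattens:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data with three natural groups.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the total within-cluster sum of squares for this k.
    print(k, round(km.inertia_, 1))
# The "elbow" where the decrease flattens out suggests a reasonable k (here, 3).
```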