Chapter 10 Flashcards
Cluster, potential class
a collection of data objects
similar to one another within the same group
dissimilar to the objects in other groups
cluster analysis, clustering, data segmentation…
finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters.
unsupervised learning
no predefined classes, i.e. learning by observations vs. learning by examples (supervised)
a stand-alone tool to get insight
preprocessing step for other algorithms
Examples of clustering
biology: taxonomy of the animal kingdom (kingdom, class, order); economic science: market research
summarization
preprocessing for regression, PCA, classification, and association analysis
compression
image processing: vector quantization
finding k-nearest neighbors
localizing search to one or a small number of clusters
outlier detection
outliers are often viewed as those far away from any cluster
KNN
simplest model: k=1 uses the single closest value, k=2 the two closest entries
distance calculation: Euclidean distance, Manhattan distance
challenge: high-dimensional data (3D, 4D, and beyond)
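A minimal sketch of the two distance measures and a k=1 nearest-neighbor lookup (pure NumPy; the toy arrays and names are made up for illustration):

import numpy as np

def euclidean(a, b):
    # square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute differences
    return np.sum(np.abs(a - b))

# toy data: each row is one object, columns are attributes
data = np.array([[1.0, 2.0], [2.0, 1.5], [8.0, 9.0]])
query = np.array([1.5, 1.8])

# k=1: pick the single closest stored object under Euclidean distance
dists = [euclidean(query, row) for row in data]
nearest = int(np.argmin(dists))
print("nearest object:", data[nearest], "at distance", dists[nearest])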
scalability
clustering all the data instead of only samples, which can lead to biased results
ability to deal with different types of attributes
numerical, binary, categorical, ordinal, linked, and mixture of these
constraint-based clustering
user may give inputs on constraints
use domain knowledge to determine input parameters
partitioning criteria
single level vs hierarchical partitioning
separation of clusters
exclusive (one customer belongs to one region) vs. non-exclusive (one document may belong to more than one class)
similarity measure
distance-based vs. connectivity-based
clustering space
full space vs. subspaces
good partitioning
objects in the same cluster are close or related to each other, whereas objects in different clusters are far apart or very different.
typical methods
k-means, k-medoids; these work well for finding spherical-shaped clusters in small to medium size databases
hierarchical approach
create a hierarchical decomposition of data objects
agglomerative (bottom-up) approach
starts with each object forming a separate group, then successively merges the groups until all of them merge into one or a termination condition holds
divisive (top-down) approach
starts with all objects in the same cluster, then successively splits it into smaller clusters until each object is in its own cluster or a termination condition holds
density-based approach
distance-based clustering methods can find spherical-shaped clusters but encounter difficulty in discovering clusters of arbitrary shapes
based on the notion of density
continue to grow a cluster as long as the density in the neighborhood exceeds some threshold
centroid
the center of a cluster
k-means
each cluster is represented by the center of the cluster; the centroid of a cluster is the mean value of the points within the cluster
iteratively improves the within-cluster variation
iterative relocation
the process of iteratively reassigning objects to clusters to improve the partitioning
k-means algorithm (four steps)
1. partition objects into k nonempty subsets
2. compute seed points as the centroids of the clusters of the current partitioning
3. assign each object to the cluster with the nearest seed point
4. repeat steps 2-3, iteratively improving the within-cluster variation, until the assignments no longer change
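A minimal NumPy sketch of these four steps (random initial partition, centroid update, nearest-seed reassignment, repeat until stable); the data and k below are made up for illustration:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: partition objects into k nonempty subsets (random initial assignment)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # step 2: compute seed points as the centroids of the current partition
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(len(X))] for j in range(k)])
        # step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # step 4: repeat 2-3 until the assignment no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [0.5, 1.2], [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels, centroids = kmeans(X, k=2)
print(labels)      # cluster id per object
print(centroids)   # one centroid (mean) per cluster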
k-means strengths
efficient
k-means weaknesses
applicable only to objects in a continuous n-dimensional space
need to specify k in advance
sensitive to noisy data and outliers
variations of k-means differ in
selection of the initial k means
dissimilarity calculations
strategies to calculate cluster means
k-modes
replacing means of clusters with modes
using a frequency-based method to update modes of clusters
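A minimal sketch of the two ideas above for categorical data: matching dissimilarity (count of mismatched attributes) in place of a numeric distance, and a frequency-based mode update in place of a mean; the toy records are made up:

from collections import Counter

def matching_dissimilarity(a, b):
    # number of attributes on which two categorical records differ
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    # frequency-based update: most frequent value per attribute
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

cluster = [("red", "small", "round"), ("red", "large", "round"), ("blue", "small", "round")]
print(cluster_mode(cluster))                           # ('red', 'small', 'round')
print(matching_dissimilarity(cluster[0], cluster[2]))  # 1 (they differ only in color)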
k-medoids
instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used, which are the most centrally located objects in a cluster
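A minimal sketch of picking a medoid, i.e. the actual object in the cluster with the smallest total distance to the other objects (NumPy; the toy cluster is made up):

import numpy as np

def medoid(points):
    # pairwise Euclidean distances within the cluster
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # the medoid is the existing object minimizing the total distance to all others
    return points[d.sum(axis=1).argmin()]

cluster = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0]])
print(medoid(cluster))   # [1.2 0.9], the most centrally located object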
hierarchical methods
grouping data objects into a hierarchy or tree of clusters
hierarchy is useful for data summarization and visualization
agglomerative
organizes objects into a hierarchy using a bottom-up strategy
starts with individual objects as clusters, which are iteratively merged to form larger and larger clusters; the single remaining cluster becomes the hierarchy root
the merging step: find the two clusters that are closest to each other and combine them to form one cluster
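A minimal sketch of this bottom-up loop with a single-linkage merging step on toy 1-D points (everything here is made up for illustration):

def agglomerative(points, target_clusters=1):
    # start with each object forming its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # merging step: find the two closest clusters (single linkage) and combine them
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerative([1.0, 1.2, 5.0, 5.3, 9.0], target_clusters=2))
# with target_clusters=1 the single remaining cluster would be the hierarchy root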
divisive
employs a top-down strategy
lets all the given objects form one cluster, then iteratively splits it into smaller subclusters and recursively partitions those clusters into smaller ones until each cluster at the lowest level contains only one object
multi-phase clustering
integrates hierarchical clustering with other clustering methods
AGNES (agglomerative nesting)
introduced in Kaufmann and Rousseeuw (1990)
implemented in statistical packages
merge nodes that have the least dissimilarity
eventually all nodes belong to the same cluster
DIANA (divisive analysis)
implemented in statistical analysis packages
inverse order of AGNES
eventually each node forms a cluster on its own.
density-based clustering methods
partitioning and hierarchical methods are designed to find spherical-shaped clusters
main features
discover clusters of arbitrary shape
handle noise
one scan
need density parameters as termination condition
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
the density of an object o can be measured by the number of objects close to o
it finds core objects, i.e. objects that have dense neighborhoods, and connects core objects and their neighborhoods to form dense regions as clusters
density-reachable
a point p is density-reachable from a point q if there is a chain of points p1, ..., pn with p1 = q and pn = p such that each point in the chain is directly density-reachable from the previous one
density-connected
a point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o
DBSCAN algorithm
arbitrarily select a point p
retrieve all points density-reachable from p
if p is a core point, a cluster is formed
if p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
continue until all points have been processed
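A minimal sketch of these steps, with an eps-neighborhood lookup, a core-point test, and cluster growth over density-reachable points; eps, min_pts, and the toy data are made up:

import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = [None] * n                     # None = unvisited, -1 = noise, >=0 = cluster id
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0].tolist() for i in range(n)]
    cluster_id = 0
    for p in range(n):
        if labels[p] is not None:
            continue
        if len(neighbors[p]) < min_pts:
            labels[p] = -1                  # not a core point: noise for now (may become a border point)
            continue
        labels[p] = cluster_id              # p is a core point: a cluster is formed
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id      # previously noise, now a border point of this cluster
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            if len(neighbors[q]) >= min_pts:
                queue.extend(neighbors[q])  # q is also core: keep growing the dense region
        cluster_id += 1
    return labels

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3], [5.0, 5.0], [5.1, 5.2], [5.2, 4.9], [9.0, 9.0]])
print(dbscan(X, eps=0.6, min_pts=3))   # two clusters plus one noise point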
high intraclass similarity
cohesive within clusters
low interclass similarity
distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method
its implementation
its ability to discover some or all of the hidden patterns
dissimilarity/similarity metric
similarity is expressed in terms of a distance function, typically a metric
quality of clustering
there is usually a separate quality function that measures the goodness of a cluster
it is hard to define "similar enough" or "good enough"; the answer is highly subjective
extrinsic
supervised, i.e. the ground truth is available
compare a clustering against the ground truth using certain clustering quality measure
intrinsic
unsupervised, i.e. the ground truth is unavailable
evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are
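As a sketch (assuming scikit-learn is available), one measure of each kind: the adjusted Rand index as an extrinsic measure that compares against the ground truth, and the silhouette coefficient as an intrinsic measure built only from separation and compactness; the toy data and labelings are made up:

import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [8.0, 8.0], [8.1, 8.2], [7.9, 8.1]])
ground_truth = [0, 0, 0, 1, 1, 1]   # extrinsic evaluation assumes this is known
predicted    = [1, 1, 1, 0, 0, 0]   # e.g. k-means output; cluster ids themselves may differ

print("adjusted Rand index:", adjusted_rand_score(ground_truth, predicted))  # 1.0: same grouping
print("silhouette:", silhouette_score(X, predicted))  # near 1: compact, well-separated clusters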