Cluster analysis Flashcards
Cluster analysis is a range of methods that determine if there are ___________________ of data
different groups or clusters
Cluster analysis assumes _____________________
distinct grouping
Cluster analysis is _____________. We are trying to see if there are any hidden groups or clusters, but don’t know how to ________ groups. The data does not come with a ________ label
‘unsupervised’; define; class
In multivariate data sets, we can use ___________ to define similarity
distance
If two things are similar, they are probably the same sort of thing
If the other two things look very different, they are probably a different sort of thing
For nearest neighbour clustering, define the distance between clusters as the ______________ between any of the objects
shortest distance
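A minimal sketch of the nearest neighbour rule above, assuming Euclidean distance and points stored as coordinate tuples. The function name `nearest_neighbour_distance` and the sample points are illustrative, not taken from the cards.

```python
import math

def nearest_neighbour_distance(cluster_a, cluster_b):
    """Shortest Euclidean distance between any object in cluster_a and any object in cluster_b."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Two small illustrative clusters on a line
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (9.0, 0.0)]

# The closest pair is (1, 0) and (4, 0)
print(nearest_neighbour_distance(a, b))  # 3.0
```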
Centroid clustering uses ____ between objects to define clusters. This time, consider the ‘object of interest’ to be the ________ rather than the individual points that make up the cluster
distance, cluster
centroid clustering: replace _____________ with __________ of the cluster to which they belong
individual; ‘centroid’
Centroid clustering: use ______________ to define clusters, replace the ___________ with a ________________
The new object is _________________ (mean, median…)
This average is only meaningful if we use __________
use distance^2 to define clusters
replace individual objects with a new ‘combined object’
the new object is between the original objects’ positions
this average is only meaningful when using distance^2
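A small sketch of the centroid idea, assuming the mean is used as the average and squared Euclidean distance as the measure. The helper names (`centroid`, `squared_distance`, `centroid_distance_sq`) and the sample points are hypothetical.

```python
def centroid(cluster):
    """Mean position of the objects in a cluster (the 'combined object')."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def centroid_distance_sq(cluster_a, cluster_b):
    """Squared distance between the centroids of two clusters."""
    return squared_distance(centroid(cluster_a), centroid(cluster_b))

a = [(0.0, 0.0), (2.0, 0.0)]   # centroid (1, 0)
b = [(5.0, 0.0), (7.0, 4.0)]   # centroid (6, 2)

print(centroid(a), centroid(b))
print(centroid_distance_sq(a, b))  # 5^2 + 2^2 = 29.0
```

Note that each centroid lies between the original objects' positions, as the card states.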
Generally, would expect nearest neighbour to have ______ distances than centroid clustering, especially as the ____________ increases
shorter, size
what does single linkage cluster analysis measure
what is the distance from an unclassified object to another object
e.g., nearest neighbour, centroid
nearest neighbour is sensitive to __________
outliers
the inclusion of an object which is far away greatly increases the ‘capture power’
centroid methods do not account for _________________
the spread of within-cluster objects
what do average linkage methods measure
what is the average distance from an unclassified object to the other objects in a cluster
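As an illustration of the average linkage measure, the sketch below computes the mean Euclidean distance from one unclassified object to every object in a cluster. The function name and sample data are assumptions made for the example.

```python
import math

def average_distance_to_cluster(point, cluster):
    """Mean Euclidean distance from an unclassified point to each object in a cluster."""
    return sum(math.dist(point, member) for member in cluster) / len(cluster)

cluster = [(0.0, 0.0), (0.0, 6.0)]
point = (4.0, 3.0)

# Both cluster members are exactly 5 units from the point (3-4-5 triangles)
print(average_distance_to_cluster(point, cluster))  # 5.0
```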
within group linkage method
create clusters with the smallest average linkage distance in them
between group linkage
create clusters with the smallest average distance of newly formed links
which hierarchical clustering method to use?
nearest neighbour is the purist’s approach, but it is sensitive to ‘outliers’: a cluster that manages to ‘grab’ a relatively distant object becomes a dominant cluster
centroid clustering is difficult to interpret, but is less sensitive to outliers
average linkage methods are even less sensitive to outliers; PASW uses between-group average as its default
none can be easily evaluated probabilistically
K-means clustering specify _________________________.
Each cluster is defined by ________________
Often the preferred way of determining _______________
k, the number of clusters
centroid
exactly k clusters
what is the method of k-means clustering
select k points randomly; repeat (assign all points to the nearest centroid, re-compute the centroids) until there is no change in the centroids
the initial centroids influence the result
use hierarchical clustering (cut off at the relevant level) to determine the initial positions
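The k-means steps can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation. It assumes Euclidean distance and mean centroids. The `initial_centroids` parameter is a hypothetical hook for supplying starting positions (for example, ones taken from a hierarchical clustering) instead of a random pick.

```python
import math
import random

def mean_point(points):
    """Mean position of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, initial_centroids=None, seed=0, max_iter=100):
    """Return (centroids, clusters) after iterating until the centroids stop changing."""
    if initial_centroids is None:
        # Select k of the data points at random as the starting centroids
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Re-compute the centroids; an empty cluster keeps its old centroid
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # no change in centroids
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, 2, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

Because the initial centroids influence the result, running the sketch with different `seed` values (and no `initial_centroids`) is a quick way to check how stable a solution is.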
How can you determine k
should have an idea beforehand
no fail-safe way to derive it
can use hierarchical cluster analysis
can plot the average within-cluster distance against k
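One common reading of the plotting suggestion is to compute the average distance from each point to its cluster centroid for several values of k, then look for the k beyond which the value stops dropping sharply. The sketch below shows only the measure itself, for two hand-made partitions of the same data. That interpretation, the function names and the data are all assumptions.

```python
import math

def mean_point(points):
    """Mean position of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def average_within_cluster_distance(clusters):
    """Average Euclidean distance from each point to the centroid of its own cluster."""
    total, count = 0.0, 0
    for cluster in clusters:
        centre = mean_point(cluster)
        for p in cluster:
            total += math.dist(p, centre)
            count += 1
    return total / count

left = [(0.0, 0.0), (0.0, 2.0)]
right = [(10.0, 0.0), (10.0, 2.0)]

one_cluster = average_within_cluster_distance([left + right])   # k = 1
two_clusters = average_within_cluster_distance([left, right])   # k = 2

# The value falls sharply from k = 1 to k = 2, suggesting two groups
print(one_cluster, two_clusters)
```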
Issues with clustering
- k-means clustering might end up dividing the large clusters and combining the two small ones
- nearest neighbour clustering is very sensitive to outliers and might end up putting outliers in separate groups; the same applies to centroid clustering
- if the original clusters are not spherical, e.g., elongated or crescent-shaped blocks, this might cause problems: points at the tips of the crescent might end up in different groups, since clustering only cares about distances
- with the same data, different clustering methods might end up dividing the groups differently
Cluster analysis is a very good way of describing patterns in complex, multivariate data
works well with globular clusters of _________ and __________
same size and density