Cluster analysis Flashcards

Question 1

Q

Cluster analysis is a range of methods that determine if there are ___________________ of data

Answer

A

different groups or clusters

Question 2

Q

Cluster analysis assumes _____________________

Answer

A

distinct grouping

Question 3

Q

Cluster analysis is _____________. We are trying to see whether there are any hidden groups or clusters, but we don’t know in advance how to ________ the groups. The data do not come with a ________ label

Answer

A

‘unsupervised’; define; class

Question 4

Q

In multivariate data sets, we can use ___________ to define similarity

Answer

A

distance
If two things are similar, they are probably the same sort of thing
If the other two things look very different, they are probably a different sort of thing
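
A minimal sketch of this idea, assuming Euclidean distance as the distance measure (the card does not say which one is meant). The three objects and their values are made up for illustration.

```python
import math

# Hypothetical objects, each measured on the same three variables.
a = (1.0, 2.0, 3.0)
b = (1.5, 2.0, 3.0)
c = (9.0, 8.0, 7.0)

def euclidean(p, q):
    """Straight-line distance between two objects with the same variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

d_ab = euclidean(a, b)  # small distance: a and b look similar
d_ac = euclidean(a, c)  # large distance: a and c look different
```

A small distance is read as "probably the same sort of thing", a large one as "probably a different sort of thing".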

Question 5

Q

For nearest neighbour clustering, define the distance between two clusters as the ______________ between any pair of objects, one from each cluster

Answer

A

shortest distance
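
The definition above can be sketched as a small function. The function name and the two example clusters are hypothetical, and Euclidean distance is an assumption.

```python
import math

def nearest_neighbour_distance(cluster_a, cluster_b):
    """Shortest distance between any object in cluster_a and any in cluster_b."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical clusters lying along a line.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(4.0, 0.0), (9.0, 0.0)]

# The closest pair is (1, 0) and (4, 0), so the cluster distance is 3.
gap = nearest_neighbour_distance(left, right)
```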

Question 6

Q

Centroid clustering uses ____ between objects to define clusters. This time, the ‘object of interest’ is the ________ rather than the individual points that make up the cluster

Answer

A

distance, cluster

Question 7

Q

Centroid clustering: replace the _____________ objects with the __________ of the cluster to which they belong

Answer

A

individual; ‘centroid’

Question 8

Q

Centroid clustering: use ______________ to define clusters, and replace the ___________ with a ________________
The new object is _________________ (mean, median…)
This average is only meaningful if we use __________

Answer

A

use distance^2 to define clusters
replace individual objects with a new ‘combined object’
the new object is between the original objects’ positions
this average is only meaningful when using distance^2
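
One way to read the last line: the mean position is the point that minimises the sum of squared distances to the cluster's members, which is why the mean goes naturally with distance^2. The sketch below checks this on a made-up cluster; the helper names are hypothetical.

```python
def centroid(points):
    """Mean position of a cluster, variable by variable."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def sum_squared_distance(points, centre):
    """Total squared Euclidean distance from each point to centre."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, centre)) for p in points)

# Hypothetical cluster of three objects.
cluster = [(0.0, 0.0), (2.0, 0.0), (4.0, 6.0)]

centre = centroid(cluster)                                # the 'combined object'
at_centroid = sum_squared_distance(cluster, centre)       # total at the mean
nudged = sum_squared_distance(cluster, (2.5, 2.0))        # total slightly off the mean
```

Moving the combined object away from the mean in any direction increases the total squared distance.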

Question 9

Q

Generally, we would expect nearest neighbour clustering to give ______ distances than centroid clustering, especially as the ____________ increases

Answer

A

shorter, size

Question 10

Q

What does single linkage cluster analysis measure?

Answer

A

The distance from an unclassified object to another object
e.g., nearest neighbour, centroid

Question 11

Q

Nearest neighbour clustering is sensitive to __________

Answer

A

outliers

The inclusion of an object that is far away greatly increases the cluster’s ‘capture power’
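
A small illustration of ‘capture power’, under the assumption that it refers to how close a cluster appears to objects it has not yet absorbed. All points and helper names are made up; distances are Euclidean.

```python
import math

def centroid(points):
    """Mean position of a cluster, variable by variable."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def nearest_neighbour_distance(obj, cluster):
    """Distance from obj to the closest member of the cluster."""
    return min(math.dist(obj, member) for member in cluster)

def centroid_distance(obj, cluster):
    """Distance from obj to the cluster's mean position."""
    return math.dist(obj, centroid(cluster))

cluster = [(0.0, 0.0), (1.0, 0.0)]
outlier = (6.0, 0.0)   # a relatively distant object the cluster 'grabs'
target = (10.0, 0.0)   # an object that is still unclassified

nn_before = nearest_neighbour_distance(target, cluster)
nn_after = nearest_neighbour_distance(target, cluster + [outlier])
cd_before = centroid_distance(target, cluster)
cd_after = centroid_distance(target, cluster + [outlier])
```

In this example the nearest neighbour distance to the target drops from 9 to 4 once the outlier joins, while the centroid distance moves far less, from 9.5 to roughly 7.67.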

Question 12

Q

centroid methods do not account for _________________

Answer

A

the spread of within-cluster objects

Question 13

Q

What do average linkage methods measure?

Answer

A

The average distance from an unclassified object to the other objects in a cluster
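
As a sketch, this measure is the mean of the distances from the object to every member of the cluster. The function name and data are hypothetical, and Euclidean distance is assumed.

```python
import math

def average_distance_to_cluster(obj, cluster):
    """Mean distance from obj to every member of the cluster."""
    return sum(math.dist(obj, member) for member in cluster) / len(cluster)

# Hypothetical cluster and unclassified object.
cluster = [(0.0, 0.0), (0.0, 4.0)]
candidate = (3.0, 0.0)

# Distances to the two members are 3 and 5, so the average is 4.
avg = average_distance_to_cluster(candidate, cluster)
```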

Question 14

Q

What does the within-group linkage method do?

Answer

A

Create clusters with the smallest average linkage distance within them

Question 15

Q

What does the between-group linkage method do?

Answer

A

Create clusters with the smallest average distance of newly formed links

Question 16

Q

Which hierarchical clustering method should you use?

Answer

A

Nearest neighbour is the purist’s approach, but it is sensitive to ‘outliers’: a cluster that manages to ‘grab’ a relatively distant object becomes a dominant cluster

Centroid clustering is difficult to interpret, but is less sensitive to outliers

Average linkage methods are even less sensitive to outliers; PASW uses between-group average as its default

None can be easily evaluated probabilistically

Question 17

Q

In k-means clustering, you specify _________________________.
Each cluster is defined by its ________________
Often the preferred way of determining _______________

Answer

A

k, the number of clusters
centroid
exactly k clusters

Question 18

Q

What is the method of k-means clustering?

Answer

A

select k points randomly
repeat
  assign all points to nearest centroid
  re-compute the centroids
until no change in centroids

The initial centroids influence the result
Use hierarchical clustering, cut off at the relevant level, to determine the initial positions
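
The steps above can be sketched in plain Python. This is an illustrative version, not a reference implementation: the function names, the `seed` and `max_iter` parameters, the Euclidean distance, and the handling of empty clusters are all assumptions added here. It uses random starting points rather than the hierarchical initialisation mentioned above.

```python
import math
import random

def centroid(points):
    """Mean position of a cluster, variable by variable."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # select k points randomly as the initial centroids
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):  # safety cap (assumption, not in the card)
        # assign all points to the nearest centroid
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # re-compute the centroids; an empty cluster keeps its old centroid
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            new_centroids.append(centroid(members) if members else centroids[j])
        # until no change in centroids
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated blobs of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = k_means(points, k=2)
```

Changing `seed` changes the starting points; on less clearly separated data than this, that can change the final clusters, which is the sensitivity the card warns about.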

Question 19

Q

How can you determine k?

Answer

A

You should have an idea beforehand
There is no fail-safe way to derive it
You can use hierarchical cluster analysis
You can plot the average within-cluster distance against k
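
A sketch of the plotting idea, assuming the quantity of interest is the average distance from each point to the centroid of its own cluster, computed for several values of k. The data and the three candidate labelings are made up by hand so the example stays short; in practice the labels would come from a clustering run for each k.

```python
import math

def centroid(points):
    """Mean position of a cluster, variable by variable."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def average_within_cluster_distance(points, labels):
    """Mean distance from each point to the centroid of its own cluster."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    centres = {label: centroid(members) for label, members in clusters.items()}
    total = sum(math.dist(p, centres[label]) for p, label in zip(points, labels))
    return total / len(points)

# Hypothetical data: two tight pairs of points, far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Hand-written cluster labels for k = 1, 2 and 3 (assumed, for illustration).
labelings = {
    1: [0, 0, 0, 0],
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
}
scores = {k: average_within_cluster_distance(points, labels)
          for k, labels in labelings.items()}
```

Here the score falls sharply from k = 1 to k = 2 and only slightly from k = 2 to k = 3, which would point to k = 2. This is a heuristic for reading the plot, not a fail-safe rule.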

Question 20

Q

Issues with clustering

Answer

A

k-means might end up dividing the large clusters and combining the two small ones
Nearest neighbour clustering is very sensitive to outliers and might end up putting outliers in separate groups; the same applies to centroid clustering
If the original clusters are not spherical, e.g., elongated or curved (crescent-shaped) blocks, this might cause problems
Points at the tips of a crescent might end up in different groups, since clustering only cares about distances
With the same data, different clustering methods might divide the groups differently

Question 21

Q

Cluster analysis is a very good way of describing patterns in complex, multivariate data
It works well with globular clusters of _________ and __________

Answer

A

same size and density

(21 cards)