Cluster analysis Flashcards

1
Q

Cluster analysis is a range of methods that determine if there are ___________________ of data

A

different groups or clusters

2
Q

Cluster analysis assumes _____________________

A

distinct grouping

3
Q

Cluster analysis is _____________. We are trying to see if there are any hidden groups or clusters, but we don’t know how to ________ the groups. The data do not come with a ________ label

A

‘unsupervised’; define; class

4
Q

In multivariate data sets, we can use ___________ to define similarity

A

distance
If two things are similar, they are probably the same sort of thing.
If two things look very different, they are probably different sorts of thing.
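
As a minimal sketch of distance as a similarity measure, the snippet below computes the Euclidean distance between objects measured on two variables. The object names and values are made up for illustration, and Euclidean distance is only one possible choice of distance.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two objects measured on the same variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical objects, each described by two variables.
objects = {"A": (1.0, 2.0), "B": (1.5, 2.5), "C": (8.0, 9.0)}

d_ab = euclidean(objects["A"], objects["B"])  # small: A and B look similar
d_ac = euclidean(objects["A"], objects["C"])  # large: A and C look different
```

A small distance suggests the two objects are the same sort of thing; a large distance suggests they are not.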

5
Q

For nearest neighbour clustering, define the distance between two clusters as the ______________ between any pair of objects, one from each cluster

A

shortest distance
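
A minimal sketch of the nearest neighbour rule, assuming Euclidean distance and two made-up clusters; the function name is hypothetical.

```python
import math
from itertools import product


def nearest_neighbour_distance(cluster_a, cluster_b):
    """Shortest distance between any object in cluster_a and any object in cluster_b."""
    return min(math.dist(p, q) for p, q in product(cluster_a, cluster_b))


# Hypothetical clusters on a line; the closest pair is (1, 0) and (4, 0).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (9.0, 0.0)]
gap = nearest_neighbour_distance(cluster_a, cluster_b)
```

Only the single closest pair matters here, so the other objects in each cluster have no effect on the result.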

6
Q

Centroid clustering uses ____ between objects to define clusters. This time, the ‘object of interest’ is the ________ rather than the individual points that make up the cluster

A

distance, cluster

7
Q

Centroid clustering: replace the _____________ objects with the __________ of the cluster to which they belong

A

individual; ‘centroid’

8
Q

Centroid clustering: use ______________ to define clusters, replace the ___________ with a ________________
The new object is _________________ (mean, median…)
This average is only meaningful if we use __________

A

use distance^2 to define clusters
replace individual objects with a new ‘combined object’
the new object is between the original objects’ positions
this average is only meaningful when using distance^2
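
The idea of a ‘combined object’ can be sketched as follows, assuming the mean is used as the average and squared Euclidean distance as distance^2. The clusters and function names are invented for illustration.

```python
def centroid(cluster):
    """Mean position of a cluster, computed variable by variable."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def squared_distance(a, b):
    """Squared Euclidean distance (distance^2) between two positions."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical clusters of two objects each.
cluster_a = [(0.0, 0.0), (2.0, 0.0)]
cluster_b = [(5.0, 4.0), (7.0, 4.0)]

c_a = centroid(cluster_a)          # lies between the two objects in cluster_a
c_b = centroid(cluster_b)          # lies between the two objects in cluster_b
d2 = squared_distance(c_a, c_b)    # distance^2 between the two combined objects
```

Each cluster is now represented by one position between its members, and clusters are compared through those positions only.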

9
Q

Generally, we would expect nearest neighbour clustering to give ______ distances than centroid clustering, especially as the ____________ increases

A

shorter, size

10
Q

What does single linkage cluster analysis measure?

A

The distance from an unclassified object to one other object
e.g., nearest neighbour, centroid

11
Q

Nearest neighbour clustering is sensitive to __________

A

outliers

The inclusion of an object which is far away greatly increases the ‘capture power’ of the cluster

12
Q

centroid methods do not account for _________________

A

the spread of within-cluster objects

13
Q

What do average linkage methods measure?

A

The average distance from an unclassified object to the other objects in a cluster
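
A minimal sketch of this measure, assuming Euclidean distance; the cluster, the unclassified object and the function name are all hypothetical.

```python
import math


def average_distance_to_cluster(obj, cluster):
    """Average distance from one object to every object in a cluster."""
    return sum(math.dist(obj, member) for member in cluster) / len(cluster)


# Hypothetical cluster and unclassified object.
cluster = [(0.0, 0.0), (0.0, 6.0)]
unclassified = (4.0, 3.0)

avg = average_distance_to_cluster(unclassified, cluster)
```

Every member of the cluster contributes to the result, so a single unusual member has less influence than it would under the nearest neighbour rule.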

14
Q

What does the within-group linkage method do?

A

Creates clusters with the smallest average distance between the objects within them

15
Q

What does the between-group linkage method do?

A

Creates clusters with the smallest average distance across the newly formed links (pairs with one object in each of the merged clusters)

16
Q

Which hierarchical clustering method should you use?

A

Nearest neighbour is the purist’s approach, but it is sensitive to outliers: a cluster that manages to ‘grab’ a relatively distant object becomes a dominant cluster.

Centroid clustering is difficult to interpret, but is less sensitive to outliers.

Average linkage methods are even less sensitive to outliers; PASW uses the between-group average as its default.

None can be easily evaluated probabilistically.
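
The three options can be compared with a small agglomerative sketch in which the linkage rule is a plug-in function. This is an illustrative toy under stated assumptions (Euclidean distance, squared distance for centroids, made-up data, hypothetical function names), not the procedure of any particular statistics package.

```python
import math
from itertools import combinations, product


def single(a, b):
    """Nearest neighbour: shortest distance between any pair across clusters."""
    return min(math.dist(p, q) for p, q in product(a, b))


def average(a, b):
    """Between-group average: mean distance over all pairs across clusters."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_sq(a, b):
    """Centroid: squared distance between the two cluster means."""
    ca = [sum(v) / len(a) for v in zip(*a)]
    cb = [sum(v) / len(b) for v in zip(*b)]
    return sum((x - y) ** 2 for x, y in zip(ca, cb))


def agglomerate(points, linkage, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Hypothetical, well-separated data: all three rules find the same two groups.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
results = {f.__name__: agglomerate(points, f, 2) for f in (single, average, centroid_sq)}
```

On clean data like this the methods agree; the differences described above tend to show up once outliers or uneven clusters are present.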

17
Q

In k-means clustering, you specify _________________________.
Each cluster is defined by its ________________
Often the preferred way of determining _______________

A

k, the number of clusters
centroid
exactly k clusters

18
Q

What is the method of k-means clustering?

A
select k points randomly as the initial centroids
repeat
  assign all points to the nearest centroid
  re-compute the centroids
until no change in centroids

The initial centroids influence the result.
Hierarchical clustering, cut off at the relevant level, can be used to determine the initial positions.
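
The steps above can be sketched in plain Python. The `seed` and `max_iter` parameters, the rule for empty clusters, and the sample data are assumptions added for illustration; they are not part of the method as described on the card.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Basic k-means: returns the final centroids and the groups of points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # select k points randomly
    groups = []
    for _ in range(max_iter):
        # Assign all points to the nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Re-compute the centroids (assumption: keep the old one if a group is empty).
        new_centroids = [
            tuple(sum(v) / len(g) for v in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new_centroids == centroids:  # until no change in centroids
            break
        centroids = new_centroids
    return centroids, groups


# Hypothetical data with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, groups = k_means(points, k=2)
```

Changing `seed` changes the starting centroids. On well-separated data like this the result should be the same, but on harder data different starts can give different clusters, which is why a sensible initial position matters.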

19
Q

How can you determine k?

A

You should have an idea beforehand
There is no fail-safe way to derive it
You can use hierarchical cluster analysis
You can plot the average within-cluster distance against k
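
One reading of the plotting suggestion is to track the average within-cluster distance as k grows and look for the point where it stops dropping sharply. The sketch below assumes that reading, uses distance to the cluster mean, and works on hand-made partitions of invented data; in practice the partitions would come from a clustering run.

```python
import math


def average_within_cluster_distance(clusters):
    """Mean distance from each object to the centroid of its own cluster."""
    total, count = 0.0, 0
    for cluster in clusters:
        centre = tuple(sum(v) / len(cluster) for v in zip(*cluster))
        total += sum(math.dist(p, centre) for p in cluster)
        count += len(cluster)
    return total / count


# Hypothetical data: two obvious groups of three objects each.
left = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
right = [(10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]

# Hand-made partitions for k = 1, 2, 3 (assumed, not produced by an algorithm).
partitions = {
    1: [left + right],
    2: [left, right],
    3: [left, right[:1], right[1:]],
}
curve = {k: average_within_cluster_distance(c) for k, c in partitions.items()}
```

Here the value falls steeply from k = 1 to k = 2 and only slightly from k = 2 to k = 3, which would point towards k = 2 for this data.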

20
Q

Issues with clustering

A
  1. k-means might end up dividing a large cluster and combining two small ones
  2. nearest neighbour clustering is very sensitive to outliers and might end up putting outliers in separate groups; the same applies to centroid clustering
  3. if the original clusters are not spherical (e.g., elongated or crescent-shaped), this might cause problems
    points at the tips of a crescent might end up in different groups, since clustering only cares about distances
  4. with the same data, different clustering methods might divide the groups differently
21
Q

Cluster analysis is a very good way of describing patterns in complex, multivariate data.
It works well with globular clusters of _________ and __________

A

same size and density