Chapter 10 Flashcards

1
Q

Cluster, potential class

A

a collection of data objects
similar to one another within the same group
dissimilar to the objects in other groups

2
Q

cluster analysis, clustering, data segmentation…

A

finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.

3
Q

unsupervised learning

A

no predefined classes, i.e. learning by observations vs. learning by examples (supervised)
a stand-alone tool to get insight into data
a preprocessing step for other algorithms

4
Q

Examples of clustering

A
biology: taxonomy of the animal kingdom (class, order)
economic science: market research

5
Q

summarization

A

preprocessing for regression, PCA, classification, and association analysis

6
Q

compression

A

image processing: vector quantization

7
Q

finding k-nearest neighbors

A

localizing search to one or a small number of clusters

8
Q

outlier detection

A

outliers are often viewed as those far away from any cluster

9
Q

KNN

A

simplest model: with k=1 use the closest value, with k=2 the two closest entries
distance calculation: Euclidean distance, Manhattan distance (see the sketch below)
challenge: high-dimensional data (3-D, 4-D, and beyond)
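A minimal sketch of the two distance measures and a brute-force k-nearest-neighbor lookup, assuming plain NumPy arrays; the helper names and the sample points are illustrative, not from the source.

```python
import numpy as np

def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def k_nearest(query, points, k=1, dist=euclidean):
    # brute force: compute the distance to every point and keep the k closest
    distances = np.array([dist(query, p) for p in points])
    return points[np.argsort(distances)[:k]]

points = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0]])
print(k_nearest(np.array([1.5, 1.5]), points, k=2))  # the two closest entries
```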

10
Q

scalability

A

clustering all the data instead of only samples, since clustering on a sample can lead to biased results

11
Q

ability to deal with different types of attributes

A

numerical, binary, categorical, ordinal, linked, and mixture of these

12
Q

constraint-based clustering

A

user may give inputs on constraints

use domain knowledge to determine input parameters

13
Q

partitioning criteria

A

single-level vs. hierarchical partitioning

14
Q

separation of clusters

A

exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)

15
Q

similarity measure

A

distance-based vs. connectivity-based

16
Q

clustering space

A

full space vs. subspaces

17
Q

good partitioning

A

objects in the same cluster are close or related to each other, whereas objects in different clusters are far apart or very different.

18
Q

typical methods

A

k-means and k-medoids work well for finding spherical-shaped clusters in small to medium-sized databases

19
Q

hierarchical approach

A

create a hierarchical decomposition of data objects

20
Q

agglomerative (bottom-up) approach

A

starts with each object forming a separate group, then successively merges groups until all objects are in one group or a termination condition holds

21
Q

divisive (top-down) approach

A

starts with all objects in the same cluster, then successively splits into smaller clusters until each object is in its own cluster or a termination condition holds

22
Q

density-based approach

A

distance-based clustering methods can find spherical-shaped clusters but encounter difficulty in discovering clusters of arbitrary shapes
based on the notion of density
continue to grow a cluster as long as the density in the neighborhood exceeds some threshold

23
Q

centroid

A

the center of a cluster

24
Q

k-means

A

each cluster is represented by the center of the cluster; the centroid of a cluster is the mean value of the points within the cluster
iteratively improves the within-cluster variation

25
Q

iterative relocation

A

the process of iteratively reassigning objects to clusters to improve the partitioning

26
Q

k-means algorithm (four steps)

A

1. partition the objects into k nonempty subsets
2. compute seed points as the centroids of the clusters of the current partitioning
3. assign each object to the cluster with the nearest seed point
4. repeat steps 2-3, iteratively improving the within-cluster variation (see the sketch below)
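A minimal k-means sketch that follows these four steps, assuming small 2-D NumPy data; the function names, the random seeding, and the iteration cap are illustrative assumptions, not from the source.

```python
import numpy as np

def assign(points, centroids):
    # step 3: assign each object to the cluster with the nearest seed point
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: form k nonempty subsets by seeding with k distinct objects
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = assign(points, centroids)
    for _ in range(iters):
        # step 2: compute seed points as the centroids of the current partitioning
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # step 4: reassign and repeat until the assignment stops changing
        new_labels = assign(points, centroids)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

data = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
print(kmeans(data, k=2))
```
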
27
Q

k-means strength

A

efficient

28
Q

k-means weaknesses

A

applicable only to objects in a continuous n-dimensional space
need to specify k in advance
sensitive to noisy data and outliers

29
Q

variations of k-means differ in

A

selection of the initial k means
dissimilarity calculations
strategies to calculate cluster means

30
Q

k-modes

A

replacing the means of clusters with modes

using a frequency-based method to update the modes of clusters (see the sketch below)
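A minimal sketch of the frequency-based mode update, assuming categorical records stored as tuples; the records and attribute values are illustrative, not from the source.

```python
from collections import Counter

# hypothetical categorical records already assigned to one cluster
cluster = [
    ("red", "small", "cotton"),
    ("red", "large", "wool"),
    ("blue", "small", "cotton"),
]

# frequency-based update: the mode takes the most frequent value of each attribute
mode = tuple(
    Counter(record[i] for record in cluster).most_common(1)[0][0]
    for i in range(len(cluster[0]))
)
print(mode)  # ('red', 'small', 'cotton')
```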

31
Q

k-medoids

A

instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used; a medoid is the most centrally located object in a cluster (see the sketch below)
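A minimal sketch of picking a medoid, assuming Manhattan distance over a small NumPy array; choosing the object with the smallest total distance to the other members is one common way to find the most centrally located object.

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [9.0, 9.0]])

# pairwise Manhattan distances between all objects in the cluster
pairwise = np.abs(cluster[:, None, :] - cluster[None, :, :]).sum(axis=2)

# the medoid is the actual object whose total distance to the others is smallest
medoid = cluster[pairwise.sum(axis=1).argmin()]
print(medoid)  # an existing data point, unlike a k-means centroid
```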

32
Q

hierarchical methods

A

grouping data objects into a hierarchy or tree of clusters

hierarchy is useful for data summarization and visualization

33
Q

agglomerative

A

organizes objects into a hierarchy using a bottom-up strategy
start with individual objects as clusters, which are iteratively merged to form larger and larger clusters; the single remaining cluster becomes the hierarchy's root
the merging step: find the two clusters that are closest and combine them into one cluster (see the sketch below)
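A minimal sketch of the merging step under single-linkage (closest-pair) distance, assuming small 2-D NumPy data; the helper name and the sample points are illustrative, not from the source.

```python
import numpy as np

def closest_pair(clusters):
    # single linkage: cluster distance = smallest distance between any two members
    best = (None, None, np.inf)
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(np.linalg.norm(p - q) for p in clusters[i] for q in clusters[j])
            if d < best[2]:
                best = (i, j, d)
    return best

# start with each individual object as its own cluster
clusters = [[np.array([1.0, 1.0])], [np.array([1.2, 1.1])], [np.array([8.0, 8.0])]]

# merge the two closest clusters until a single cluster (the root) remains
while len(clusters) > 1:
    i, j, d = closest_pair(clusters)
    clusters[i].extend(clusters.pop(j))
    print(f"merged at distance {d:.2f}; {len(clusters)} cluster(s) left")
```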

34
Q

divisive

A

employs a top-down strategy
let all the given objects form one cluster, then iteratively split it into smaller subclusters and recursively partition those into smaller ones until each cluster at the lowest level contains only one object

35
Q

multiphase clustering

A

integrate hierarchical clustering with other clustering methods

36
Q

AGNES (agglomerative nesting)

A

introduced in Kaufmann and Rousseeuw
implemented in statistical packages (see the sketch below)
merge nodes that have the least dissimilarity
eventually all nodes belong to the same cluster
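Since AGNES-style agglomerative clustering ships with standard statistical packages, here is a minimal usage sketch with SciPy's hierarchical-clustering routines; the choice of SciPy, the single-linkage method, and the toy data are assumptions, not from the source.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1.0, 1.0], [1.2, 1.1], [8.0, 8.0], [8.1, 7.9]])

# agglomerative merging: repeatedly join the least dissimilar nodes
Z = linkage(data, method="single")

# cut the resulting hierarchy into two flat clusters
print(fcluster(Z, t=2, criterion="maxclust"))
```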

37
Q

DIANA (divisive analysis)

A

implemented in statistical analysis packages
inverse order of AGNES
eventually each node forms a cluster on its own.

38
Q

density-based clustering methods

A

partitioning and hierarchical methods are designed to find spherical-shaped clusters

39
Q

main features

A

discover clusters of arbitrary shape
handle noise
one scan
need density parameters as a termination condition

40
Q

DBSCAN: density-based spatial clustering of applications with noise

A

the density of an object o can be measured by the number of objects close to o
it finds core objects, i.e. objects that have dense neighborhoods, and connects core objects and their neighborhoods to form dense regions as clusters

41
Q

density reachable

A

a point p is density-reachable from a point q if there is a chain of points from q to p in which each point is directly density-reachable from the previous one

42
Q

density-connected

A

a point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o

43
Q

DBSCAN algorithm

A

arbitrarily select a point p
retrieve all points density-reachable from p
if p is a core point, a cluster is formed
if p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
continue until all points have been processed (see the sketch below)
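A minimal usage sketch with scikit-learn's DBSCAN implementation rather than a from-scratch version; the eps and min_samples values and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense regions plus one isolated point that should be labeled as noise
data = np.array([
    [1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
    [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
    [4.0, 15.0],
])

# eps = neighborhood radius, min_samples = density threshold for a core point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(data)
print(labels)  # noise points receive the label -1
```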

44
Q

high intraclass similarity

A

cohesive within clusters

45
Q

low interclass similarity

A

distinctive between clusters

46
Q

The quality of a clustering method depends on

A

the similarity measure used by the method
its implementation
its ability to discover some or all of the hidden patterns

47
Q

dissimilarity/similarity metric

A

similarity is expressed in terms of a distance function, typically a metric

48
Q

quality of clustering

A

there is usually a separate quality function that measures the goodness of a cluster
it is hard to define "similar enough" or "good enough"; the answer is highly subjective

49
Q

extrinsic

A

supervised, i.e. the ground truth is available

compare a clustering against the ground truth using a clustering quality measure (see the sketch below)
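A minimal extrinsic-evaluation sketch using scikit-learn's adjusted Rand index as the quality measure; the choice of measure and the toy labels are illustrative assumptions, not from the source.

```python
from sklearn.metrics import adjusted_rand_score

ground_truth = [0, 0, 0, 1, 1, 1]   # supervised labels
clustering   = [1, 1, 1, 0, 0, 2]   # labels produced by some clustering method

# 1.0 means perfect agreement with the ground truth, ~0.0 means random labeling
print(adjusted_rand_score(ground_truth, clustering))
```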

50
Q

intrinsic (unsupervised)

A

i.e. the ground truth is unavailable
evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are (see the sketch below)
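A minimal intrinsic-evaluation sketch using the silhouette coefficient, which scores separation and compactness without any ground truth; the use of scikit-learn and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = np.array([[1.0, 1.0], [1.2, 0.9], [1.1, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# cluster without using any ground-truth labels
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# a silhouette close to 1.0 indicates compact, well-separated clusters
print(silhouette_score(data, labels))
```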