Chapter 10 Flashcards

1
Q

Cluster, potential class

A

a collection of data objects
similar to one another within the same group
dissimilar to the objects in other groups

2
Q

cluster analysis, clustering, data segmentation…

A

finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.

3
Q

unsupervised learning

A

no predefined classes, i.e. learning by observations vs. learning by examples (supervised)
a stand-alone tool to get insight into data
a preprocessing step for other algorithms

4
Q

Examples of clustering

A

biology: animal kingdom (kingdom, class, order)
economic science: market research

5
Q

summarization

A

preprocessing for regression, PCA, classification, and association analysis

6
Q

compression

A

image processing: vector quantization

7
Q

finding k-nearest neighbors

A

localizing search to one or a small number of clusters

8
Q

outlier detection

A

outliers are often viewed as those far away from any cluster

9
Q

KNN

A

simplest model: k=1 takes the closest value, k=2 the two closest entries
distance calculation: Euclidean distance, Manhattan distance
challenge: high-dimensional data (3D, 4D, …)
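As a minimal sketch of this card's idea (the function and variable names are mine, not from the deck), a k-NN classifier with both distance measures might look like:

```python
from math import sqrt

def euclidean(a, b):
    # straight-line distance between two points
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute coordinate differences ("city block" distance)
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_classify(query, data, labels, k=1, dist=euclidean):
    # rank training points by distance to the query, keep the k closest,
    # and return the majority label among them
    ranked = sorted(range(len(data)), key=lambda i: dist(query, data[i]))
    top = [labels[i] for i in ranked[:k]]
    return max(set(top), key=top.count)
```

With k=1 this simply returns the label of the single closest point; in high-dimensional data distances tend to concentrate and the ranking becomes less informative, which is the challenge the card alludes to.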

10
Q

scalability

A

clustering all the data instead of only samples, which can lead to biased results

11
Q

ability to deal with different types of attributes

A

numerical, binary, categorical, ordinal, linked, and mixture of these

12
Q

constraint based clustering

A

user may give inputs on constraints

use domain knowledge to determine input parameters

13
Q

partitioning criteria

A

single level vs hierarchical partitioning

14
Q

separation of clusters

A

exclusive (one customer belongs to one region) vs. non-exclusive (one document may belong to more than one class)

15
Q

similarity measure

A

distance based vs. connectivity based

16
Q

clustering space

A

full space vs. subspaces

17
Q

good partitioning

A

objects in the same cluster are close or related to each other, whereas objects in different clusters are far apart or very different.

18
Q

typical methods

A

k-means, k-medoids; work well for finding spherical-shaped clusters in small to medium size databases

19
Q

hierarchical approach

A

create a hierarchical decomposition of data objects

20
Q

agglomerative bottom up approach

A

starts with each object forming a separate group; successively merges groups until all merge into one cluster or a termination condition holds

21
Q

divisive top down approach

A

starts with all objects in the same cluster; successively splits into smaller clusters until each object is in its own cluster or a termination condition holds

22
Q

density based approach

A

distance based clustering methods can find spherical-shaped clusters but encounter difficulty in discovering clusters of arbitrary shapes
based on the notion of density
continue to grow a cluster as long as the density in the neighborhood exceeds some threshold

23
Q

centroid

A

the center of a cluster

24
Q

k-means

A

each cluster is represented by the center of the cluster; the centroid of a cluster is the mean value of the points within the cluster
iteratively improves the within-cluster variation

25
Q

iterative relocation

A

the process of iteratively reassigning objects to clusters to improve the partitioning

26
Q

k-means algorithm, four steps

A

1. partition objects into k nonempty subsets
2. compute seed points as the centroids of the clusters of the current partitioning
3. assign each object to the cluster with the nearest seed point
4. iterate (go back to step 2) to improve the within-cluster variation

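The four steps above can be sketched in Python (a minimal version assuming random initialization and squared Euclidean distance; all names are mine, not from the deck):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # step 1: pick k initial seed points at random (one common, simple choice)
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # step 3: assign each object to the cluster with the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # step 2 (next round): recompute each centroid as the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:   # step 4: stop when the partitioning no longer changes
            break
        centroids = new
    return centroids, clusters
```

Each pass reassigns objects and recomputes the means, so the within-cluster variation can only decrease or stay the same, which is why the iteration terminates.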
27
Q

k-means strength

A

efficient

28
Q

k-means weakness

A

applicable only to objects in a continuous n-dimensional space
need to specify k in advance
sensitive to noisy data and outliers

29
Q

variations of k-means differ in

A

selection of the initial k means
dissimilarity calculations
strategies to calculate cluster means

30
Q

k-modes

A

replacing means of clusters with modes
using a frequency-based method to update modes of clusters

31
Q

k-medoids

A

instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used: the medoid is the most centrally located object in a cluster

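A minimal sketch of the k-medoids idea (naive initialization, `math.dist` for Euclidean distance; names are mine, not from the deck):

```python
from math import dist

def k_medoids(points, k, iters=20):
    # the reference point of each cluster is an actual object (the medoid):
    # the most centrally located object in the cluster
    medoids = list(points[:k])   # naive initialization: the first k objects
    for _ in range(iters):
        # assign every object to its nearest medoid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, medoids[c]))
            clusters[j].append(p)
        # make each cluster's new medoid the object with the smallest
        # total distance to the other objects in that cluster
        new = [min(cl, key=lambda m: sum(dist(m, q) for q in cl)) for cl in clusters]
        if new == medoids:
            break
        medoids = new
    return medoids, clusters
```

Because the medoid is a real data object rather than a mean, a single extreme outlier shifts it far less than it would shift a centroid.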
32
Q

hierarchical methods

A

grouping data objects into a hierarchy or tree of clusters
the hierarchy is useful for data summarization and visualization

33
Q

agglomerative

A

organizes objects into a hierarchy using a bottom-up strategy: start with individual objects as clusters, iteratively merge them to form larger and larger clusters; the single remaining cluster becomes the hierarchy root
the merging step: find the two clusters that are closest and combine them to form one cluster

34
Q

divisive

A

employs a top-down strategy: let all the given objects form one cluster, iteratively split it into smaller subclusters, and recursively partition those clusters into smaller ones until each cluster at the lowest level contains only one object

35
Q

multiphase clustering

A

integrate hierarchical clustering with other clustering methods

36
Q

AGNES (agglomerative nesting)

A

introduced in Kaufmann and Rousseeuw
implemented in statistical packages
merge nodes that have the least dissimilarity; eventually all nodes belong to the same cluster

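The agglomerative merging step can be sketched like this (a single-link variant, where cluster dissimilarity is the smallest member-to-member distance; names and the stopping rule "stop at target_k clusters" are mine, not from the deck):

```python
from math import dist

def agnes(points, target_k):
    # start with each object forming its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # find the pair of clusters with the least dissimilarity
        # (single link: smallest distance between any two members) ...
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # ... and merge that pair into a single cluster
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Running the loop all the way down to one cluster reproduces the full AGNES hierarchy: every merge is one level of the tree.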
37
Q

DIANA (divisive analysis)

A

implemented in statistical analysis packages
inverse order of AGNES: eventually each node forms a cluster on its own

38
Q

density based clustering methods

A

partitioning and hierarchical methods are designed to find spherical-shaped clusters

39
Q

main features

A

discover clusters of arbitrary shape
handle noise
one scan
need density parameters as termination condition

40
Q

DBSCAN: density based spatial clustering of applications with noise

A

the density of an object o can be measured by the number of objects close to o
it finds core objects, i.e. objects that have dense neighborhoods, and connects core objects and their neighborhoods to form dense regions as clusters

41
Q

density reachable

A

a point p is density reachable from a point q if there is a chain of points from q to p, each directly density reachable from the previous one

42
Q

density connected

A

a point p is density connected to a point q if there is a point o such that both p and q are density reachable from o

43
Q

DBSCAN algorithm

A

arbitrarily select a point p
retrieve all points density reachable from p
if p is a core point, a cluster is formed
if p is a border point, no points are density reachable from p, and DBSCAN visits the next point of the database
continue until all points have been processed

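The steps above can be sketched in Python (a minimal version; the label encoding and all names are mine, not from the deck):

```python
from math import dist

def dbscan(points, eps, min_pts):
    # labels: None = unvisited, -1 = noise, 0, 1, 2, ... = cluster id
    labels = {i: None for i in range(len(points))}

    def neighbors(i):
        # the eps-neighborhood of point i (includes i itself)
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cid = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # not a core point: tentatively noise
            continue
        cid += 1                      # i is a core point: a new cluster is formed
        labels[i] = cid
        queue = [j for j in nbrs if j != i]
        while queue:                  # grow the cluster through density-reachable points
            j = queue.pop()
            if labels[j] == -1:       # border point previously marked as noise
                labels[j] = cid
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:    # j is also a core point: expand its neighborhood
                queue.extend(jn)
    return labels
```

Points that end up labeled -1 are exactly those far from any dense region, which matches the earlier card on outlier detection.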
44
Q

high intra-class similarity

A

cohesive within clusters

45
Q

low inter-class similarity

A

distinctive between clusters

46
Q

the quality of a clustering method depends on

A

the similarity measure used by the method
its implementation
its ability to discover some or all of the hidden patterns

47
Q

dissimilarity/similarity metric

A

similarity is expressed in terms of a distance function, typically a metric

48
Q

quality of clustering

A

there is usually a separate quality function that measures the goodness of a cluster
it is hard to define "similar enough" or "good enough"; it is highly subjective

49
Q

extrinsic

A

supervised, i.e. the ground truth is available
compare a clustering against the ground truth using a certain clustering quality measure

50
Q

intrinsic

A

unsupervised, i.e. the ground truth is unavailable
evaluate the goodness of a clustering by considering how well the clusters are separated and how compact the clusters are