Data Mining Flashcards

1
Q

What are the seven main types of clustering algorithms?

A

Pattern-Based
Projected
Partitioning/Representative
Density
Hierarchical
Bi-Clustering
Correlation

2
Q

Which algorithms are Pattern-Based?

A

p-Cluster
MaPle
EDSC

3
Q

Which algorithms are Projection-based?

A

PROCLUS: PROjected CLUStering
MDS (Multi-Dimensional Scaling)
Isomap
t-SNE

4
Q

Which algorithms are partitioning/representative?

A

k-Means
k-Medoids
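A minimal sketch (not from the lecture) of the representative idea behind k-Means: alternate between assigning each point to its nearest center and moving each center to the mean of its assigned points. Pure Python on 2-D points.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-Means on 2-D points (lists [x, y])."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k initial centers from the data
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # update step: move each center to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = [sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl)]
    return centers, clusters

pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, clusters = kmeans(pts, k=2)
```

k-Medoids works the same way except each representative must be an actual data point (the medoid) rather than a mean, which makes it less sensitive to outliers.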

5
Q

Which algorithms are Density based?

A

CLIQUE
DBSCAN
OPTICS
OP-Cluster

6
Q

Which algorithms are hierarchical?

A

DiSH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
CURE (Clustering Using Representatives)

7
Q

Which algorithms use bi-clustering?

A

delta-bicluster

8
Q

What is the main goal in clustering?

A

To find meaningful groups (structure) in the data

9
Q

What are four strategies to deal with high-dimensional data?

A

1) dimensionality reduction (PCA, MDS, SNE)
2) regularization (L1, L2)
3) ensemble methods
4) projected clustering (MDS, SNE)

10
Q

What function describes the probability that a random value will be <= a given value?

A

Cumulative Distribution Function (CDF)
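A small illustration (my addition): for a normal distribution, the CDF P(X <= x) can be computed from the error function in the standard library.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# The CDF at the mean is 0.5: half the probability mass lies below it.
print(round(normal_cdf(0.0), 3))   # 0.5
print(round(normal_cdf(1.96), 3))  # ~0.975
```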

11
Q

What types of kernels can be used to estimate density?

A

Discrete
Gaussian
Multivariate

12
Q

What process splits data into cells where all the points are closest to the seed point?

A

Voronoi parcelling (Voronoi tessellation)
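The cell-assignment step can be sketched in a few lines (my example, not from the lecture): each point simply belongs to the cell of its nearest seed.

```python
def voronoi_assign(points, seeds):
    """Assign each 2-D point to the index of its nearest seed (its Voronoi cell)."""
    cells = {i: [] for i in range(len(seeds))}
    for p in points:
        d = [(p[0] - s[0]) ** 2 + (p[1] - s[1]) ** 2 for s in seeds]
        cells[d.index(min(d))].append(p)
    return cells

seeds = [(0, 0), (10, 10)]
cells = voronoi_assign([(1, 1), (2, 0), (9, 9)], seeds)
```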

13
Q

What process compares the local density of a point to the local density of its k-nearest-neighbors?

A

LOF (local outlier factor)

14
Q

What are masking and swamping?

A

Masking is when an outlier goes undetected because other outliers distort the model (e.g. it gets absorbed into a cluster). Swamping is when outliers distort the model so that inliers appear as outliers

15
Q

What is the silhouette score?

A

A measure of how similar a data point is to its own cluster compared to the nearest other cluster; ranges from -1 to 1
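For a single point the score is s = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster. A tiny hand computation (my example) on 1-D data:

```python
def silhouette(point, own, other):
    """Silhouette of one point: a = mean distance to its own cluster,
    b = mean distance to the nearest other cluster; s = (b - a) / max(a, b)."""
    a = sum(abs(point - q) for q in own if q != point) / (len(own) - 1)
    b = sum(abs(point - q) for q in other) / len(other)
    return (b - a) / max(a, b)

# point 1.0 sits tightly in its own cluster, far from the other -> s near 1
s = silhouette(1.0, own=[1.0, 1.5, 2.0], other=[10.0, 11.0])
```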

16
Q

What is the silhouette score used for?

A

To evaluate the performance of an algorithm and/or to decide on the number of clusters to set as a parameter

17
Q

What types of norms are there?

A

Euclidean
Manhattan
Max norm
Weighted Euclidean
Quadratic
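The first three norms above are one-liners (my sketch); the weighted and quadratic variants generalize the Euclidean case with per-dimension weights or a full matrix.

```python
import math

def euclidean(x, y):
    """L2 norm of the difference vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """L1 norm: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def max_norm(x, y):
    """L-infinity norm: largest single coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (3, 4)
# euclidean -> 5.0, manhattan -> 7, max norm -> 4
```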

18
Q

What is an outlier?

A

An observation that:
arouses suspicion that it was generated by a different mechanism
appears to deviate markedly from the rest of the sample
is inconsistent with the remainder of the dataset

19
Q

Why do outliers occur?

A

measurement/transmission errors, data input/processing errors, attacks/fraud

20
Q

What’s the difference between a local and a global outlier?

A

Local outlier: instance that is very different from the instances around it
Global outlier: very different from entire dataset

21
Q

What are Arthur’s main challenges in dealing with High Dimensional Data?

A

1) “concentration effect”: curse of dimensionality
2) discrimination vs. ranking of values
3) combinatorial issues and subspace selection
4) Hubness

22
Q

What is hubness?

A

Phenomenon where some instances in a dataset (hubs) occur as the nearest neighbors of many other instances, more than expected by chance

23
Q

What is the definition of “concentration of distances”/curse of dimensionality?

A

The ratio of the variance of the length of any point vector to its expected length converges to zero with increasing data dimensionality, so the relative contrast between near and far points vanishes

24
Q

What is a shrinking hypersphere?

A

A method used in density-based clustering (e.g. DBSCAN) to find clusters of similar data in a high-dimensional space

25
Q

What else can be useful when absolute distance is not?

A

Distance rankings

26
Q

What is the central limit theorem?

A

A result in probability stating that the sum (or mean) of a large number of independent random variables is approximately normally distributed (relevant to the curse of dimensionality, since distances are sums over many dimensions)

27
Q

What are 8 problems with High Dimensional Data?

A

1) curse of dimensionality
2) noise
3) circle of needing to know the neighbors to choose the right subspace and needing to know the right subspace to find correct neighbors
4) bias of scores (toward high dim subspaces)
5) scores appearing identical
6) exponential subspaces
7) data-snooping bias
8) hubness

28
Q

What are two methods of subspace traversal?

A

top-down and bottom-up

29
Q

What does projected clustering do?

A

Partitions the data into disjoint clusters, each associated with its own subspace

30
Q

What does subspace clustering do?

A

Finds all clusters in all subspaces (possibly with overlap)

31
Q

How does Apriori work?

A

1) scan the DB once for each candidate itemset length
2) count occurrences of the candidate itemsets
3) eliminate infrequent candidates; only frequent itemsets are extended into longer combinations for the next round
“PRUNING” => THINK ALPHABET COMBOS
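The steps above can be sketched as a compact pure-Python Apriori (my sketch, not the lecture's code):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent itemsets via Apriori: extend only frequent itemsets,
    pruning candidates with an infrequent subset (downward closure)."""
    items = sorted({i for t in transactions for i in t})
    frequent, current = {}, [frozenset([i]) for i in items]
    while current:
        # one DB pass per candidate length: count each candidate's support
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # generate next-length candidates from surviving itemsets
        keys = sorted(survivors, key=sorted)
        current = []
        for a, b in combinations(keys, 2):
            cand = a | b
            if len(cand) == len(a) + 1 and cand not in current:
                # prune: every (k-1)-subset of a candidate must be frequent
                if all(frozenset(s) in survivors
                       for s in combinations(cand, len(cand) - 1)):
                    current.append(cand)
    return frequent

db = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
freq = apriori(db, min_support=2)  # {a}, {b}, {c}, {d}, {a,b}
```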

32
Q

What is the downward closure property?

A

AKA “monotonicity”: all subsets of a frequent itemset must also be frequent (equivalently, no superset of an infrequent itemset can be frequent). Basis for Apriori pruning.

33
Q

Are subspace and projected clustering usually bottom-up or top-down?

A

Subspace usually bottom-up and projected usually top-down

34
Q

What is bottom-up clustering?

A

Starting with each point as its own cluster and successively merging pairs of clusters

35
Q

What is top-down clustering?

A

Starts with all points as a single cluster and recursively divides them into smaller clusters

36
Q

What is bi-clustering?

A

Simultaneously clustering rows and columns

37
Q

How does PCA work?

A

1) take a local selection of points, then build the covariance matrix to derive its eigensystem
2) the strong eigenvectors (largest eigenvalues) define the hyperplane the cluster lies on
3) the weak eigenvectors (smallest eigenvalues) span the directions of minimum variance

38
Q

What are some non-axis parallel approaches to clustering?

A

Pattern-based, bi/co-clustering, correlation clustering
Density-based clustering usually

39
Q

Why use a non-axis parallel approach over an axis-parallel one?

A

It can handle irregularly shaped data that do not follow a ball or ellipsoid space

40
Q

What do pattern-based clustering algorithms find?

A

Pairwise linear dependencies (simple positive correlations)

41
Q

What characterizes axis-parallel algorithms?

A

They break down the feature space into hyperrectangles that form the clusters

42
Q

What does correlation clustering find?

A

Linear dependencies (more general correlations)

43
Q

What are the three main problems for clustering ensembles?

A

1) the number of clusters can vary among different solutions
2) cluster labels are symbolic (not classes)
3) how to achieve diversity is harder to see

44
Q

What is used to combine the results of ensemble clustering ?

A

A consensus function

45
Q

What are the names of some correlation clustering methods?

A

4C: Computing Clusters of Correlation Connected Objects
Cluster model: Deriving Quantitative Models For Correlation Clusters
COPAC
CASH (Hough-Transform based)
ERiC: Explaining Relationships Among Correlation Clusters

46
Q

Which algorithm requires two parameters and what are they?

A

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
eps and MinPts
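A minimal sketch of DBSCAN (my addition, 1-D for brevity): eps defines the neighborhood radius, MinPts the density threshold for a core point; clusters grow by expanding from core points, and everything unreachable stays noise.

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN on 1-D points; returns labels (cluster id, -1 = noise)."""
    labels = [None] * len(points)
    cid = 0
    neighbors = lambda i: [j for j in range(len(points))
                           if abs(points[i] - points[j]) <= eps]
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:         # not a core point (for now): noise
            labels[i] = -1
            continue
        labels[i] = cid                  # start a new cluster, expand from seeds
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid          # border point previously marked noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            nj = neighbors(j)
            if len(nj) >= min_pts:       # j is also a core point: keep expanding
                queue.extend(nj)
        cid += 1
    return labels

labels = dbscan([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0], eps=0.5, min_pts=3)
```

Here the neighborhood count includes the point itself; some formulations count only the other points, which shifts the effective MinPts by one.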

47
Q

Which type of clustering usually adopts the bottom-up strategy and which one usually adopts the top-down strategy?
What do they rely on?

A

Bottom-up: subspace clustering
Top-down: projected clustering
Rely on the assumption that subspaces are axis-parallel

48
Q

What are the two flavors of correlation based on?

A

PCA and Hough Transform

49
Q

What is a fundamental efficiency technique to finding the localized region near a point as an outlier detection mechanism?

A

Approximate neighborhoods

50
Q

What are three ways to implement approximate neighborhoods?

A

Random projection, locality sensitive hashing, space-filling curves

51
Q

What are five major efficiency techniques used in top-N outlier detection?

A

Approximate neighborhoods
RBRP (Recursive Binning and Re-Projection)
PINN (Projection-Indexed Nearest Neighbors)
Space-filling curves
(Very Good) Neighborhood Approximations

52
Q

What is the goal of subspace outlier detection?

A

To find outliers in relevant subspaces that are not outliers in the full-dimensional space

53
Q

What is data snooping bias?

A

When the selection of a model or set of parameters is influenced by the data it is being run against (which can lead to an overfit model that doesn’t generalize well)

54
Q

What are three methods of subspace outlier detection?

A

OutRank
SOD
Correlation Outlier

55
Q

What is another term for similarity?

A

Distance measure

56
Q

What is the main challenge with similarity?

A

Defining a suitable distance

57
Q

What are some common evaluation methods?

A

F1, precision, recall, silhouette score

58
Q

What is conditional entropy?

A

A measure of the amount of uncertainty in one variable given knowledge of another random variable
H(X|Y) where X and Y are the two variables being considered
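H(X|Y) can be computed directly from a joint probability table (my example): it is zero when Y fully determines X, and equals H(X) when they are independent.

```python
import math

def conditional_entropy(joint):
    """H(X|Y) from a joint probability table {(x, y): p}."""
    py = {}
    for (x, y), p in joint.items():      # marginal p(y)
        py[y] = py.get(y, 0.0) + p
    h = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            h -= p * math.log2(p / py[y])  # -sum p(x,y) log2 p(x|y)
    return h

# X fully determined by Y -> H(X|Y) = 0
h0 = conditional_entropy({(0, 0): 0.5, (1, 1): 0.5})
# X independent of Y and uniform -> H(X|Y) = H(X) = 1 bit
h1 = conditional_entropy({(0, 0): 0.25, (1, 0): 0.25,
                          (0, 1): 0.25, (1, 1): 0.25})
```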

59
Q

What is pair counting?

A

A technique for comparing two clusterings: count the pairs of points that are grouped together in both, separated in both, or treated differently by the two solutions. Basis of pair-counting measures such as the Rand index.
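A small pair-counting sketch (my addition, in the style of the Rand index): because cluster labels are only symbolic, agreement is measured over pairs of points rather than over the labels themselves.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Pair-counting agreement between two clusterings of the same points."""
    agree = 0
    pairs = list(combinations(range(len(labels_a)), 2))
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:   # pair treated the same way by both clusterings
            agree += 1
    return agree / len(pairs)

# identical grouping under different symbolic labels -> perfect agreement
r = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```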

60
Q

What are two solutions to evaluating results?

A

Mapping sets of objects and pair counting

61
Q

What does pairwise-stable mean?

A

When different subsets of the data are used, the same inputs are consistently grouped together. A measure of the robustness of an algorithm

62
Q

What is variance?

A

The degree of dispersion of the data points within a cluster (how far data points are from each other and the centroid)

63
Q

What is the sample count in the Apriori algorithm called?

A

Support

64
Q

What does ORCLUS use for dimensionality reduction?

A

SVD (Singular Value Decomposition)

65
Q

What are four consensus methods mentioned in lecture?

A

HGPA (Hypergraph Partitioning)
CSPA (Cluster-based similarity partitioning)
maximum likelihood
voting

66
Q

What is SUBCLU’s three step process?

A

1) All 1-dimensional subspaces are clustered (with DBSCAN). Clusters in higher-dimensional subspaces must be subsets of these, so clustered k-dimensional subspaces produce the (k+1)-dimensional candidate subspaces
2) After pruning, DBSCAN is run on each candidate subspace to see if it contains clusters
3) If it does, the subspace is used to generate the next round of candidates

67
Q

What is single-linkage?

A

merging based on minimum distance between any two observations

68
Q

What is complete linkage?

A

merging based on maximum distance between any two observations
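The two linkage criteria from this card and the previous one differ only in which pairwise distance they keep (my sketch, on 1-D clusters):

```python
def single_linkage(c1, c2):
    """Single linkage: minimum pairwise distance between two 1-D clusters."""
    return min(abs(a - b) for a in c1 for b in c2)

def complete_linkage(c1, c2):
    """Complete linkage: maximum pairwise distance between two 1-D clusters."""
    return max(abs(a - b) for a in c1 for b in c2)

c1, c2 = [0.0, 1.0], [3.0, 8.0]
# single: |1 - 3| = 2, complete: |0 - 8| = 8
```

Single linkage tends to produce elongated, chained clusters; complete linkage favors compact ones.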

69
Q

What are the three phases of PROCLUS?

A

1) Initialization of cluster medoids
2) Iterative phase (refining medoids from Phase 1)
3) Refinement (reassigning subspaces to medoids)

70
Q

How does SVD (singular value decomposition) work?

A

1) construct a covariance matrix with the data
2) find the eigenvectors and eigenvalues of that matrix
3) choose the strongest (largest) eigenvalues to retain the most useful information
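The eigendecomposition route above can be worked through by hand in 2-D, where a symmetric 2x2 matrix has a closed-form eigensystem (my sketch, not the lecture's code):

```python
import math

def top_eigen_2x2(data):
    """Covariance matrix of 2-D data, then its largest eigenvalue/-vector
    (closed form for a symmetric 2x2 matrix)."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    cxx = sum((p[0] - mx) ** 2 for p in data) / n
    cyy = sum((p[1] - my) ** 2 for p in data) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in data) / n
    # eigenvalues of [[cxx, cxy], [cxy, cyy]] from trace and determinant
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # eigenvector for lam (unnormalized); handle the diagonal case separately
    vec = (cxy, lam - cxx) if cxy else ((1.0, 0.0) if cxx >= cyy else (0.0, 1.0))
    return lam, vec

# points along the line y = x: the strongest direction is the diagonal
lam, vec = top_eigen_2x2([(0, 0), (1, 1), (2, 2), (3, 3)])
```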

71
Q

What are the three major steps in ORCLUS?

A

1) Assign
walk the points and assign each to the seed with the lowest distance
calculate new seeds and repeat
2) Find vectors
construct the covariance matrix, find its eigenvectors and eigenvalues, pick the lowest
3) Merge
use SVD on each pair of clusters
use Energy as a metric for how well they might combine
merge the best-fitting pairs and recalculate

72
Q

What is the quality measure of clusters called in ORCLUS?

A

Energy

73
Q

What are the inputs to ORCLUS?

A

k, number of clusters
k0, number of seeds to start with
l, number of dimensions to project clusters onto

74
Q

What does pairwise stable mean?

A

Between-cluster distances dominate within-cluster distances