Data Mining Flashcards

Question

What else can be useful when absolute distance is not?

Answer 1

Distance rankings

Answer 2

A result in probability: the sum or mean of a large number of independent samples is approximately normally distributed (relevant to the curse of dimensionality)

Answer 3

1) curse of dimensionality 2) noise 3) circle of needing to know the neighbors to choose the right subspace and needing to know the right subspace to find correct neighbors 4) bias of scores (toward high dim subspaces) 5) scores appearing identical 6) exponential subspaces 7) data-snooping bias 8) hubness

Answer 4

top-down and bottom-up

Answer 5

Partitions the data into disjoint clusters

Answer 6

Finds all clusters in all subspaces (possibly with overlap)

Answer 7

1) scan the DB once for each length of a transaction pattern 2) count occurrences of candidates 3) eliminate candidates that are not frequent so they are not extended into longer combos in the next round ("PRUNING") => THINK ALPHABET COMBOS
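
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation; the function name `apriori`, the `min_support` count threshold, and the toy transactions are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Round 1 candidates: every single item that occurs in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One database scan per pattern length: count every candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: a candidate survives only if all of its (k-1)-item
        # subsets were frequent in the previous round.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
result = apriori(transactions, min_support=3)
```

With these toy transactions, every single item and every pair is frequent, while {a, b, c} is counted in the third round and dropped because it appears in only two transactions.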

Answer 8

AKA "monotonicity": if an itemset is frequent, all of its subsets must also be frequent. Basis for Apriori.

Answer 9

Subspace usually bottom-up and projected usually top-down

Answer 10

Starts with each point as its own cluster and successively merges pairs of clusters

Answer 11

Starts with all points as a single cluster and divides them into smaller clusters

Answer 12

Simultaneously clustering rows and columns

Answer 13

1) take a local selection of points, then build the covariance matrix to derive the eigensystem 2) define the hyperplane using the strong eigenvectors 3) the sum of the smallest eigenvalues defines the minimum

Answer 14

Pattern-based, bi/co-clustering, correlation clustering. Density-based clustering usually.

Answer 15

It can handle irregularly shaped data that do not follow a ball or ellipsoid shape

Answer 16

Pairwise linear dependencies (simple positive correlations)

Answer 17

They break down the feature space into hyperrectangles that form the clusters

Answer 18

Linear dependencies (more general correlations)

Answer 19

1) the number of clusters can vary among different solutions 2) cluster labels are symbolic (not classes) 3) how to achieve diversity is harder to see

Answer 20

A consensus function

Answer 21

4C: Computing Clusters of Correlation Connected Objects; Cluster model: Deriving Quantitative Models for Correlation Clusters; COPAC; CASH (Hough-transform based); ERiC: Exploring Relationships Among Correlation Clusters

Answer 22

DBSCAN: Density-Based Spatial Clustering of Applications with Noise. Parameters: eps and MinPts
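
To make the roles of eps and MinPts concrete, here is a minimal brute-force sketch of DBSCAN. The names `region_query` and `dbscan` and the toy points are assumptions for this example, and the neighborhood search is a plain linear scan with no index structure.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Here the two 2x2 squares form two clusters and the isolated point at (50, 50) is labeled noise, because its eps-neighborhood contains only itself.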

Answer 23

Bottom-up: subspace clustering. Top-down: projected clustering. Both rely on the assumption that subspaces are axis-parallel

Answer 24

PCA and Hough Transform

Answer 25

Approximate neighborhoods

Answer 26

Random projection, locality sensitive hashing, space-filling curves
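
As an illustration of the first technique only, the sketch below projects points onto a lower-dimensional space with a random Gaussian matrix. The function name `random_projection`, the 1/sqrt(target_dim) scaling, and the fixed seed are assumptions chosen for the example; locality sensitive hashing and space-filling curves are not shown.

```python
import math
import random


def random_projection(rows, target_dim, seed=0):
    """Project each row onto target_dim random Gaussian directions."""
    rng = random.Random(seed)
    d = len(rows[0])
    # Scaling by 1/sqrt(target_dim) keeps squared lengths roughly
    # unchanged in expectation.
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
        for _ in range(d)
    ]
    return [
        [sum(row[a] * matrix[a][b] for a in range(d)) for b in range(target_dim)]
        for row in rows
    ]


data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(50)] for _ in range(10)]
projected = random_projection(rows, target_dim=5)
```

Distances computed on `projected` only approximate the original distances, which is why such projections are typically used to find candidate neighbors rather than exact ones.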

Answer 27

Approximate neighborhoods; RBRP (Recursive Binning and Re-Projection); PINN (Projection-Indexed Nearest Neighbors); space-filling curves; (very good) neighborhood approximations

Answer 28

To find outliers in relevant subspaces that are not outliers in the full-dimensional space

Answer 29

When the selection of a model or set of parameters is influenced by the data it is being run against (which can lead to an overfit model that doesn't generalize well)

Answer 30

OutRank, SOD, Correlation Outlier

Answer 31

Distance measure

Answer 32

Defining a suitable distance

Answer 33

F1, precision, recall, silhouette score
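
The first three measures can be computed directly from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels with 1 as the positive class; the function name and the sample labels are made up for illustration, and silhouette score is not covered.

```python
def precision_recall_f1(true_labels, predicted_labels, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this example there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.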

Answer 34

A measure of the amount of uncertainty in one variable given knowledge of another random variable: H(X|Y), where X and Y are the two variables being considered
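
For discrete samples, H(X|Y) can be estimated from joint and marginal counts using H(X|Y) = -sum over (x, y) of p(x, y) * log2(p(x, y) / p(y)). The sketch below assumes paired observations given as two equal-length lists; the function name is illustrative.

```python
import math
from collections import Counter


def conditional_entropy(xs, ys):
    """Estimate H(X|Y) in bits from paired discrete observations."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    marginal_y = Counter(ys)
    return -sum(
        (count / n) * math.log2(count / marginal_y[y])
        for (_x, y), count in joint.items()
    )


# X is independent of Y: knowing Y leaves one full bit of uncertainty.
h_independent = conditional_entropy([0, 1, 0, 1], [0, 0, 1, 1])
# X is fully determined by Y: no uncertainty remains.
h_determined = conditional_entropy([0, 0, 1, 1], [0, 0, 1, 1])
```

The two calls illustrate the extremes: one bit when Y says nothing about X, and zero when Y determines X completely.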

Answer 35

A technique used to measure the correlation between two variables or sets of data. Involves counting the number of data points that have a specific relationship or fall within a certain range. Can be used to compute correlation.
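
One common use of pair counting is comparing two clusterings of the same objects: every pair of objects is classified by whether the two clusterings put it in the same cluster or in different clusters. The sketch below assumes that reading and computes the Rand index from the counts; the function names and labels are illustrative.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count object pairs by agreement between two clusterings."""
    same_same = same_diff = diff_same = diff_diff = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            same_same += 1
        elif same_a:
            same_diff += 1
        elif same_b:
            diff_same += 1
        else:
            diff_diff += 1
    return same_same, same_diff, diff_same, diff_diff


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    ss, sd, ds, dd = pair_counts(labels_a, labels_b)
    return (ss + dd) / (ss + sd + ds + dd)


counts = pair_counts([0, 0, 1, 1], [0, 0, 1, 2])
score = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

Only the pair (2, 3) is treated differently by the two clusterings here, so five of the six pairs agree.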

Answer 36

Mapping sets of objects and pair counting

Answer 37

When different subsets of the data are used, the same inputs are consistently grouped together. A measure of the robustness of an algorithm

Answer 38

The degree of dispersion of the data points within a cluster (how far data points are from each other and the centroid)

Answer 39

SVD (Singular Value Decomposition)

Answer 40

HGPA (Hypergraph Partitioning), CSPA (Cluster-based Similarity Partitioning), maximum likelihood voting

Answer 41

1) All 1-dimensional subspaces are clustered. All clusters in higher-dim subspaces will be subsets of these clusters, producing k+1 dimensional candidate subspaces 2) After pruning noise, DBSCAN is run on the candidate subspaces to see if they contain clusters 3) If a subspace does, it is used in the next combination of subspaces

Answer 42

merging based on minimum distance between any two observations

Answer 43

merging based on maximum distance between any two observations
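
The minimum-distance and maximum-distance merge criteria differ only in how the distance between two clusters is aggregated from pairwise point distances. A minimal sketch, with illustrative function names and toy clusters:

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Cluster distance = smallest distance between any two observations."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def complete_linkage(cluster_a, cluster_b):
    """Cluster distance = largest distance between any two observations."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)


cluster_a = [(0, 0), (1, 0)]
cluster_b = [(3, 0), (5, 0)]
d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
```

For these clusters the minimum-distance criterion gives 2.0 (from (1, 0) to (3, 0)) and the maximum-distance criterion gives 5.0 (from (0, 0) to (5, 0)). At each step, the pair of clusters with the smallest such distance is merged.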

Answer 44

1) Initialization of cluster medoids 2) Iterative phase (refining medoids from Phase 1) 3) Refinement (reassigning subspaces to medoids)

Answer 45

1) construct a covariance matrix from the data 2) find the eigenvectors and eigenvalues of that matrix 3) keep the eigenvectors with the strongest (largest) eigenvalues to retain the most useful information
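
The steps above can be sketched without any numerical library. Note the substitution: instead of a full eigendecomposition, this example uses power iteration, which recovers only the single strongest eigenvector. The function names and the toy data are assumptions, and the sketch presumes the covariance matrix is nonzero.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the rows (one observation per row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def strongest_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for unit-length v.
    mv = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(v[a] * mv[a] for a in range(d))
    return eigenvalue, v


data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
cov = covariance_matrix(data)
eigenvalue, direction = strongest_eigenpair(cov)
explained = eigenvalue / (cov[0][0] + cov[1][1])  # share of total variance
```

Because the toy points lie almost on a line, the strongest direction is close to the diagonal and accounts for nearly all of the variance, so projecting onto it loses very little information.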

Answer 46

1) Assign: walk the points and assign each to the seed with the lowest distance; calculate new seeds and return 2) Find vectors: construct the covariance matrix, find eigenvectors and eigenvalues, pick the lowest 3) Merge: use SVD on each pair of clusters; use energy as a metric for how well they might combine; merge the best-fitting ones and recalculate

Answer 47

k, number of clusters; k0, number of seeds to start with; l, number of dimensions to project clusters onto

Answer 48

Between-cluster distance dominates the within-cluster distance

(74 cards)