Data Mining Flashcards
What are the seven main types of clustering algorithms?
Pattern-Based
Projected
Partitioning/Representative
Density
Hierarchical
Bi-Clustering
Correlation
Which algorithms are Pattern-Based
p-Cluster
MaPle
EDSC
Which algorithms are Projection-based?
PROCLUS: PROjected CLUStering
MD5
Isomap
t-SNE
Which algorithms are paritioning/representative?
kMeans
kMediod
Which algorithms are Density based?
CLIQUE
DBSCAN
OPTICS
OP-Cluster
Which algorithms are hierarchical?
DiSH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
CURE (Clustering Using Representatives)
Which algorithms use bi-clustering?
delta-bicluster
What is the main goal in clustering?
To find meaningful features
What are four strategies to deal with high-dimensional data?
1) dimensionality reduction (PCA, MD5, SNE)
2) regularization (L1, L2)
3) ensemble methods
4) projected clustering (MD5, SNE)
What function describes the probability that a random value will be <= a given value?
Cumulative Density Distribution function
What types of kernels can be used to estimate density?
Discrete
Gaussian
Multivariate
What process splits data into cells where all the points are closest to the seed point?
Voronoi parcelling
What process compares the local density of a point to the local density of its k-nearest-neighbors?
LOF (local outlier factor)
What are masking and swamping?
Masking is when an outlier gets included in the cluster. Swamping is when the model is changed so the inliers appear as outliers
What is the silhouette score?
A measure of how well a data point is classified relative to other points in the cluster and ranges from -1 to 1
What is the silhouette score used for?
To evaluate the performance of an algorithm and/or to decide on the number of clusters to set as a parameter
What types of norms are there?
Euclidean
Manhattan
Max norm
Weighted Euclidean
Quadratic
What is an outlier?
Arouses suspicion that it was generated by a different mechanism
Appears to deviate markedly from the sample
Is inconsistent with the dataset
Why do outliers occur?
measurement/transmission errors, data input/processing errors, attacks/fraud
What’s the difference between a local and a global outlier?
Local outlier: instance that is very different from the instances around it
Global outlier: very different from entire dataset
What are Arthur’s main challenges in dealing with High Dimensional Data?
1) “concentration effect”: curse of dim
2) discrimination vs. ranking of values
3) combinatorial issues and subspace selection
4) Hubness
What is hubness?
Phenomenon where some instances in a dataset (hubs) occur as the nearest neighbors of many other instances, more than expected by chance
What is the definition of “concentration of distances”/curse of dimensionality?
The ratio of the variance of length of any point vector converges to zero with increasing data dimensionality
What is a shrinking hypersphere?
A method used in density-based clustering (i.e. DBSCAN) to find clusters of similar data in a high-dimensional space
What else can be useful when absolute distance is not?
Distance rankings
What is the central limit theorem?
A result in probability where if you have a large number of samples, the distribution will be approximately normal (relevant to curse of dimensionality)
What are 8 problems with High Dimensional Data?
1) curse of dimensionality
2) noise
3) circle of needing to know the neighbors to choose the right subspace and needing to know the right subspace to find correct neighbors
4) bias of scores (toward high dim subspaces)
5) scores appearing identical
6) exponential subspaces
7) data-snooping bias
8) hubness
What are two methods of subspace traversal?
top-down and bottom-up
What does projected clustering do?
Partitions the data into disjoint clusters
What does subspace clustering do?
Finds all clusters in all subspaces (possibly with overlap)
How does Apriori work?
1) search DB one for each length of a transaction pattern
2) count occurences of candidates
3) eliminate candidates for next round that are not frequent in longer combos
“PRUNING” => THINK ALPHABET COMBOS
What is the downward closure property?
AKA “monotonicity” states that if a given input has a certain property, all subsets of a frequent itemset must also be frequent. Basis for Apriori.
Are subspace and projected clustering usually bottom-up or top-down?
Subspace usually bottom-up and projected usually top-down
What is bottom-up clustering?
Starting with each point as its own cluster and successively merges pairs of clusters
What is top-down clustering?
Starts with the all points as a single cluster and divides them into smaller clusters
What is bi-clustering?
Simultaneously clustering rows and columns
How does PCA work?
1) take local selection of points, then build covariance matrix to derive eigensystem
2) defines hyperplane using strong eigenvectors
3) sum of the smallest eigenvectors defines the minimum
What are some non-axis parallel approaches to clustering?
Pattern-based, bi/co-clustering, correlation clustering
Density-based clustering usually
Why use a non-axis parallel approach over an axis-parallel one?
It can handle irregularly shaped data that do not follow a ball or ellipsoid space
What do pattern-based clustering algorithms find?
Pairwise linear dependencies (simple positive correlations)
What characterizes axis-parallel algorithms?
They break down the feature space into hyperrectangles that form the clusters
What does correlation clustering find?
Linear dependencies (more general correlations)
What are the three main problems for clustering ensembles?
1) the number of clusters can vary among different solutions
2) cluster labels are symbolic (not classes)
3) how to achieve diversity is harder to see
What is used to combine the results of ensemble clustering ?
A consensus function
What are the names of some consensus clustering methods?
4C: Computing Clusters of Correlation Connected Objects
Cluster model: Deriving Quantitative Models For Correlation Clusters
COPAC
CASH (Hough-Transform based)
ERiC: Explaining Relationships Among Correlation Clusters
Which algorithm requires two parameters and what are they?
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
eps and MinPts
Which type of clustering usually adopts the bottom-up strategy and which one usually adopts the top-down strategy?
What do they rely on?
Bottom-up: subspace clustering
Top-down: projected clustering
Rely on the assumption that subspaces are axis-paralllel
What are the two flavors of correlation based on?
PCA and Hough Transform
What is a fundamental efficiency technique to finding the localized region near a point as an outlier detection mechanism?
Approximate neighborhoods
What are three ways to implement approximate neighborhoods?
Random projection, locality sensitive hashing, space-filling curves
What are five major efficiency techniques used in top-N outlier detection?
Approximate neighborhoods
RBRP (Recursive Binning and Re-Projection)
PINN (Projection-Indexed Nearest Neighbors)
Space-filling curves
(Very Good) Neighborhood Approximations
What is the goal of subspace outlier detection?
To find outliers in relevant subspaces that are not outliers in the full-dimensional space
What is data snooping bias?
When the selection of a model or set of parameters is influenced by the data it is being run against (which can lead to an overfit model that doesn’t generalize well
What are three methods of subspace outlier detection?
OutRank
SOD
Correlation Outlier
What is another term for similarity
Distance measure
What is the main challenge with similarity?
Defining a suitable distance
What are some common evaluation methods?
F1, precision, recall, silhouette score
What is conditional entropy?
A measure of the amount of uncertainty in one variable given knowledge of another random variable
H(X|Y) where X and Y are the two variables being considered
What is pair counting?
A technique used to measure the correlation between two variables or sets of data. Involves counting the number of data points that have a specific relationship or fall within a certain range. Can be used to compute correlation.
What are two solutions to evaluating results?
Mapping sets of objects and pair counting
What does pairwise-stable mean?
When the different subsets of the data are used, the same inputs are consistently grouped together. Measure of robustness of an algorithm
What is variance?
The degree of dispersion of the data points within a cluster (how far data points are from each other and the centroid)
What is the sample count in the Apriori algorithm called?
Support
What does ORCLUS use for dimensionality reduction?
SVD (Singular Value Decomposition)
What are four consensus methods mentioned in lecture?
HGPA (Hypergraph Partitioning)
CSPA (Cluster-based similarity partitioning)
maximum likelihood
voting
What is SUBCLU’s three step process?
1) All 1-dimensional subspaces are clustered. All clusters in higher-dim subspaces will be subsets of these clusters producing k+1 dimensional candidate subspaces
2) After pruning noise, DBSCAN is run on the subspaces to see if they contain clusters.
3) If it does, it is used in the next combination of subspaces
What is single-linkage?
merging based on minimum distance between any two observations
What is complete linkage?
merging based on maximum distance between any two observations
What are the three phases of PROCLUS?
1) Initialization of cluster mediods
2) Iterative phase (refining mediods from Phase 1)
3) Refinement (reassigning subspaces to mediods)
How does SVD (single value decomposition) work?
1) construct a covariance matrix with the data
2) find the eigenvectors and eigenvalues of that matrix
3) choose the strongest (largest) eigenvalues to retain the most useful information
What are the three major steps in ORCLUS?
1) Assign
walk points and assign to seed with lowest distance
calculate new seeds and return
2) Find vectors
construct covariance matrix, find eigen-vectors and - values, pick lowest
3) Merge
use SVD on each pair of clusters
use Energy as a metric for how well they might combine
merge best fitting ones and recalculate
What is the quality measure of clusters called in ORCLUS?
Energy
What are the inputs to ORCLUS?
k, number of clusters
k0, number of seeds to start with
l, number of dimensions to project clusters onto
What does pairwise stable mean?
Between cluster distance dominates the within cluster distance