Data Mining Flashcards
What are the seven main types of clustering algorithms?
Pattern-Based
Projected
Partitioning/Representative
Density
Hierarchical
Bi-Clustering
Correlation
Which algorithms are Pattern-Based?
p-Cluster
MaPle
EDSC
Which algorithms are Projection-based?
PROCLUS: PROjected CLUStering
MDS (multidimensional scaling)
Isomap
t-SNE
Which algorithms are partitioning/representative?
kMeans
kMedoid
Which algorithms are Density based?
CLIQUE
DBSCAN
OPTICS
OP-Cluster
Which algorithms are hierarchical?
DiSH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
CURE (Clustering Using Representatives)
Which algorithms use bi-clustering?
delta-bicluster
What is the main goal in clustering?
To find meaningful groups (clusters) of similar instances in the data
What are four strategies to deal with high-dimensional data?
1) dimensionality reduction (PCA, MDS, SNE)
2) regularization (L1, L2)
3) ensemble methods
4) projected clustering (MDS, SNE)
What function describes the probability that a random value will be <= a given value?
Cumulative distribution function (CDF)
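As a small illustration, the function can be estimated from data as the fraction of samples at or below a value. This is a minimal sketch; the name `empirical_cdf` and the sample data are hypothetical, chosen only for the example.

```python
from bisect import bisect_right

def empirical_cdf(samples, x):
    """Fraction of samples that are <= x, an estimate of P(X <= x)."""
    ordered = sorted(samples)
    # bisect_right counts how many sorted values are <= x.
    return bisect_right(ordered, x) / len(ordered)

data = [2, 4, 4, 5, 7, 9]
print(empirical_cdf(data, 4))  # 3 of the 6 values are <= 4
```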
What types of kernels can be used to estimate density?
Discrete
Gaussian
Multivariate
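A one-dimensional Gaussian kernel density estimate can be sketched as follows. The function names and the `bandwidth` parameter are assumptions made for illustration; choosing a good bandwidth is a separate problem that this sketch does not address.

```python
import math

def gaussian_kernel(u):
    """Standard normal density evaluated at u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(samples, x, bandwidth):
    """Kernel density estimate at x: the average of kernels centered on each sample."""
    total = sum(gaussian_kernel((x - s) / bandwidth) for s in samples)
    return total / (len(samples) * bandwidth)

samples = [0.8, 1.0, 1.2, 5.0]
print(kde(samples, 1.0, bandwidth=0.5))  # near three samples: higher density
print(kde(samples, 3.0, bandwidth=0.5))  # far from all samples: lower density
```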
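What process splits the data space into cells such that every point in a cell is closer to the seed point of that cell than to any other seed?
Voronoi tessellation (also called Voronoi partitioning or parcelling)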
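For a finite set of points, the cell membership can be sketched by assigning each point to its nearest seed. This assumes Euclidean distance; the helper names `voronoi_cell` and `voronoi_partition` are hypothetical.

```python
import math

def voronoi_cell(point, seeds):
    """Index of the seed closest to point (Euclidean distance)."""
    return min(range(len(seeds)), key=lambda i: math.dist(point, seeds[i]))

def voronoi_partition(points, seeds):
    """Group points by the seed whose cell they fall into."""
    cells = {i: [] for i in range(len(seeds))}
    for p in points:
        cells[voronoi_cell(p, seeds)].append(p)
    return cells

seeds = [(0, 0), (10, 0)]
points = [(1, 1), (9, 1), (4, 0)]
print(voronoi_partition(points, seeds))
```

The same nearest-seed assignment is the step kMeans repeats with its current centroids as seeds.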
What process compares the local density of a point to the local density of its k-nearest-neighbors?
LOF (local outlier factor)
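The density comparison behind LOF can be sketched in plain Python. This is a simplified version under stated assumptions: Euclidean distance, exactly k neighbors per point (ties at the k-th distance are cut off), and no duplicate points. The function names are illustrative, not a library API.

```python
import math

def k_nearest(points, i, k):
    """Return [(distance, index), ...] for the k nearest neighbors of points[i]."""
    others = [(math.dist(points[i], q), j) for j, q in enumerate(points) if j != i]
    return sorted(others)[:k]

def lof_scores(points, k):
    """Local outlier factor of every point; values well above 1 suggest outliers."""
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    k_distance = [nbrs[-1][0] for nbrs in neighbors]  # distance to the k-th neighbor

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

grid = [(x, y) for x in range(3) for y in range(3)]  # a dense 3x3 block
scores = lof_scores(grid + [(10, 10)], k=3)          # plus one distant point
print(scores)  # the last score is by far the largest
```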
What are masking and swamping?
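Masking: outliers distort the model (for example by shifting the mean or inflating the variance) so that other outliers go undetected and are treated as part of a cluster. Swamping: outliers distort the model so that inliers appear to be outliers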
What is the silhouette score?
A measure of how well a data point fits its own cluster compared with the nearest other cluster. It ranges from -1 (likely in the wrong cluster) to 1 (well matched to its own cluster)
What is the silhouette score used for?
To evaluate the performance of an algorithm and/or to decide on the number of clusters to set as a parameter
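The score for a single point is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the lowest mean distance to the points of any other cluster. The sketch below assumes Euclidean distance, at least two clusters, and no fully duplicated data; the function names are hypothetical.

```python
import math

def silhouette(points, labels, i):
    """Silhouette value of points[i] given one cluster label per point."""
    distances = {}  # cluster label -> distances from points[i] to that cluster
    for j, (p, label) in enumerate(zip(points, labels)):
        if j != i:
            distances.setdefault(label, []).append(math.dist(points[i], p))
    own = labels[i]
    if own not in distances:
        return 0.0  # singleton cluster: score is 0 by convention
    a = sum(distances[own]) / len(distances[own])
    b = min(sum(d) / len(d) for label, d in distances.items() if label != own)
    return (b - a) / max(a, b)

def mean_silhouette(points, labels):
    """Average silhouette over all points, usable to compare cluster counts."""
    return sum(silhouette(points, labels, i) for i in range(len(points))) / len(points)

points = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(mean_silhouette(points, [0, 0, 1, 1]))  # sensible grouping: close to 1
print(mean_silhouette(points, [0, 1, 1, 0]))  # poor grouping: negative
```

To pick the number of clusters, one common approach is to run the clustering for several values and keep the one with the highest mean silhouette.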
What types of norms are there?
Euclidean
Manhattan
Max norm
Weighted Euclidean
Quadratic
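The norms listed above can be sketched as plain functions on a vector. The function names are illustrative; the weighted and quadratic forms assume non-negative weights and a positive-definite matrix respectively, which the code does not check.

```python
import math

def euclidean(x):
    return math.sqrt(sum(v * v for v in x))

def manhattan(x):
    return sum(abs(v) for v in x)

def max_norm(x):
    return max(abs(v) for v in x)

def weighted_euclidean(x, w):
    """Euclidean norm with a non-negative weight per dimension."""
    return math.sqrt(sum(wi * v * v for wi, v in zip(w, x)))

def quadratic(x, m):
    """sqrt(x^T M x) for a positive-definite matrix M given as nested lists."""
    n = len(x)
    return math.sqrt(sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n)))

x = [3, -4]
print(euclidean(x), manhattan(x), max_norm(x))
print(weighted_euclidean(x, [1, 1]), quadratic(x, [[1, 0], [0, 1]]))
```

With unit weights or the identity matrix, the weighted and quadratic norms reduce to the Euclidean norm.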
What is an outlier?
Arouses suspicion that it was generated by a different mechanism
Appears to deviate markedly from the sample
Is inconsistent with the dataset
Why do outliers occur?
measurement/transmission errors, data input/processing errors, attacks/fraud
What’s the difference between a local and a global outlier?
Local outlier: instance that is very different from the instances around it
Global outlier: very different from entire dataset
What are Arthur’s main challenges in dealing with High Dimensional Data?
1) “concentration effect”: curse of dimensionality
2) discrimination vs. ranking of values
3) combinatorial issues and subspace selection
4) Hubness
What is hubness?
Phenomenon where some instances in a dataset (hubs) occur as the nearest neighbors of many other instances, more than expected by chance
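Hubness is usually measured with the k-occurrence count: how often each instance appears in the k-nearest-neighbor lists of the other instances. The sketch below assumes Euclidean distance and breaks distance ties by index; `k_occurrences` is a hypothetical name.

```python
import math

def k_occurrences(points, k):
    """For each point, count how many other points list it among their k nearest."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _, j in ranked[:k]:
            counts[j] += 1
    return counts

# A center point surrounded by four points: the center is everyone's nearest neighbor.
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
print(k_occurrences(points, k=1))
```

The counts always sum to n * k, so the mean count is k; a hub is an instance whose count is far above that.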
What is the definition of “concentration of distances”/curse of dimensionality?
The ratio of the variance of the length of any point vector to the length of the mean point vector converges to zero with increasing data dimensionality
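The effect can be observed in a small simulation that measures the variance of point-vector lengths after dividing each length by the mean length. This is a sketch under assumptions not in the flashcard: coordinates drawn independently and uniformly from [0, 1), 200 points, and a fixed seed. The function name is hypothetical.

```python
import math
import random
import statistics

def normalized_length_variance(dim, n_points=200, seed=0):
    """Variance of ||x|| / mean(||x||) over random points in the unit cube."""
    rng = random.Random(seed)
    lengths = [
        math.sqrt(sum(rng.random() ** 2 for _ in range(dim)))
        for _ in range(n_points)
    ]
    mean_length = statistics.fmean(lengths)
    return statistics.pvariance([length / mean_length for length in lengths])

for dim in (2, 20, 500):
    print(dim, normalized_length_variance(dim))
```

The printed value shrinks as the dimensionality grows, which is why absolute distances become less informative and distance rankings can be more useful.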
What is a shrinking hypersphere?
A method used in density-based clustering (e.g. DBSCAN) to find clusters of similar data in a high-dimensional space
What else can be useful when absolute distance is not?
Distance rankings
What is the central limit theorem?
A result in probability: the sum (or mean) of a large number of independent, identically distributed samples with finite variance is approximately normally distributed, whatever the original distribution (relevant to the curse of dimensionality)
What are 8 problems with High Dimensional Data?
1) curse of dimensionality
2) noise
3) circular dependency: you need to know the neighbors to choose the right subspace, and you need to know the right subspace to find the correct neighbors
4) bias of scores (toward high-dimensional subspaces)
5) scores appearing identical
6) exponential subspaces
7) data-snooping bias
8) hubness
What are two methods of subspace traversal?
top-down and bottom-up
What does projected clustering do?
Partitions the data into disjoint clusters, each associated with its own subset of dimensions (subspace)