B05 k-Means Clustering Flashcards
k-means clustering is a ________, ________ and __________ clustering approach that assigns all n items in a dataset to one of k clusters, such that the differences within a cluster are minimized while the differences between clusters are maximized.
partitional, exclusive, complete
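As an illustration of the definition above, here is a minimal sketch of the usual iterative procedure (often called Lloyd's algorithm). It assumes Euclidean distance and points stored as tuples; the name `k_means`, the `seed` parameter, and the sample points are illustrative and not part of the card.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100, seed=0):
    """Assign every point to exactly one of k clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Two well-separated groups of made-up 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, k=2)
```

Every point receives exactly one label (exclusive) and no point is left unassigned (complete), which matches the terms in the answer.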
Good clustering will produce clusters with:
- ______ intra-class similarity.
- ______ inter-class similarity.
High
Low
Other distance measures used in clustering include:
Minkowski distance, Pearson correlation distance, Spearman correlation distance and Kendall correlation distance.
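For the first measure in the answer, a short sketch may help: Minkowski distance of order p reduces to Manhattan distance at p = 1 and Euclidean distance at p = 2. The function name and the example points are illustrative assumptions.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)  # sqrt(3**2 + 4**2)
```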
Challenges with k-Means Clustering
k-Means is very sensitive to the initial randomly chosen cluster centers (this is known as the ________).
random initialization trap
The _______ initialization approach mitigates the effects of the random initialization trap.
k-means++
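A rough sketch of the seeding idea, under the assumption of squared Euclidean distance: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center already chosen. The name `k_means_pp_init` and the sample points are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means_pp_init(points, k, seed=0):
    """Pick k initial centers with k-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen
        # center, so far-away points are more likely to be picked next.
        # Already chosen points get weight 0 and cannot be picked again.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Made-up points; this sketch assumes at least k distinct points exist.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (9.0, 9.0), (9.2, 8.8), (8.9, 9.1)]
centers = k_means_pp_init(points, k=2)
```

The returned centers would then replace the purely random starting centers in the main k-means loop.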
Methods for choosing the right k include:
- Elbow method
- Information criterion approach
- Silhouette method
- Jump method
- Gap statistic
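As one example from the list above, the silhouette method scores a clustering by comparing, for each point, its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), using (b - a) / max(a, b). The sketch below assumes Euclidean distance, at least two clusters, and no duplicate points; the function names and groupings are made up for illustration.

```python
import math

def mean_distance(p, others):
    """Average Euclidean distance from p to a non-empty list of points."""
    return sum(math.dist(p, q) for q in others) / len(others)

def silhouette_score(clusters):
    """Mean silhouette coefficient over all points in a list of clusters."""
    scores = []
    for i, members in enumerate(clusters):
        for idx, p in enumerate(members):
            rest = members[:idx] + members[idx + 1:]
            if not rest:
                scores.append(0.0)  # common convention for one-point clusters
                continue
            a = mean_distance(p, rest)  # cohesion within own cluster
            b = min(                    # separation from nearest other cluster
                mean_distance(p, other)
                for j, other in enumerate(clusters) if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two hand-made groupings of the same six 1-D points.
good = [[(1.0,), (2.0,), (3.0,)], [(11.0,), (12.0,), (13.0,)]]
bad = [[(1.0,), (2.0,), (11.0,)], [(3.0,), (12.0,), (13.0,)]]
good_score = silhouette_score(good)
bad_score = silhouette_score(bad)
```

Scores closer to 1 suggest tighter, better separated clusters, so the value of k with the highest mean score is a reasonable candidate.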
WCSS stands for ________ and is associated with the _____ for choosing k.
Within Cluster Sum of Squares
Elbow Method
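To make the link between WCSS and the elbow method concrete, here is a small sketch that sums the squared distances from each point to its own cluster centroid. The groupings for k = 1, 2, 3 are assigned by hand purely for illustration; in practice they would come from running k-means at each k.

```python
def wcss(clusters):
    """Within Cluster Sum of Squares for a list of clusters (lists of points)."""
    total = 0.0
    for members in clusters:
        # Centroid: per-dimension mean of the cluster's points.
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members
        )
    return total

# Hypothetical groupings of the same six 1-D points.
k1 = [[(1.0,), (2.0,), (3.0,), (11.0,), (12.0,), (13.0,)]]
k2 = [[(1.0,), (2.0,), (3.0,)], [(11.0,), (12.0,), (13.0,)]]
k3 = [[(1.0,), (2.0,)], [(3.0,)], [(11.0,), (12.0,), (13.0,)]]
scores = {1: wcss(k1), 2: wcss(k2), 3: wcss(k3)}
# scores is {1: 154.0, 2: 4.0, 3: 2.5}
```

WCSS falls sharply from k = 1 to k = 2 and only slightly afterwards; that bend in the curve is the "elbow" the method looks for.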
Strengths of k-Means Clustering?
- Uses simple non-statistical principles.
- Very flexible and malleable algorithm.
- Wide set of real-world applications.
Weaknesses of k-Means Clustering?
- Simplistic algorithm.
- Relies on chance.
- Sometimes requires some domain knowledge in determining the ideal number of clusters.
- Not ideal for non-spherical clusters.
- Works with numeric data only.
An individual independent example of the concept represented by the dataset. It is described by a set of attributes or features.
An instance (row)
Property or characteristic of an instance. These can either be discrete or continuous.
Feature
The attribute or feature that is described by the other features within an instance.
Class
The ___________ of a dataset represents the number of features in the dataset.
dimensionality
Data _______ and _______ describe the degree to which data exists for each feature of all observations.
sparsity
density
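A small sketch of how these quantities could be measured on a toy table, assuming each row is an instance, each column a feature, and `None` marks a missing value. The data and variable names are hypothetical.

```python
# Hypothetical toy dataset: rows are instances, columns are features.
rows = [
    [5.1, None, 1.4],
    [4.9, 3.0, None],
    [None, None, 1.3],
    [5.0, 3.4, 1.5],
]

dimensionality = len(rows[0])  # number of features per instance
total_cells = len(rows) * dimensionality
filled_cells = sum(value is not None for row in rows for value in row)

density = filled_cells / total_cells  # share of cells that hold data
sparsity = 1 - density                # share of cells that are missing
```

Here 8 of the 12 cells are filled, so density is about 0.67 and sparsity about 0.33.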
__________ describes the grain or level of detail in the data.
Resolution