Clustering Flashcards
supervised learning
- a training set with actual outcomes (labels) is available
- e.g. classification, linear regression
unsupervised learning
- grouping data into categories based on some measure of inherent similarity, to understand patterns without specifying a purpose or target
- e.g. clustering
goal of clustering
- points in the same cluster have small distances from one another
- points in different clusters are at large distances from one another
- homogeneous WITHIN the cluster
- heterogeneous ACROSS clusters
main types of clustering
- hierarchical (bottom-up: start with single points and successively combine clusters)
- partitioning methods (e.g. K-Means Clustering)
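A minimal sketch contrasting the two families on toy data, assuming scikit-learn is available; the sample data and parameter choices are illustrative, not from the source.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# hierarchical (bottom-up): every point starts as its own cluster,
# and the closest pair of clusters is merged until K remain
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# partitioning: K-Means assigns points to K centroids and refines them
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```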
k means algorithm
- a value of K is decided and K initial seeds are assigned
- each observation is allocated to the closest seed to get K clusters
- compute centroid of each cluster as the new seed
- reassign each observation to the new clusters, based on distance from new seeds
- iterate steps 2-4 until a stopping criterion is met / the solution is stable
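A NumPy-only sketch of the steps above; the function and variable names are illustrative, and it assumes no cluster ever goes empty (a real implementation must handle that case).

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: choose K initial seeds (here: K distinct random observations)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 2: allocate each observation to its closest seed
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute the centroid of each cluster as the new seed
        # (assumes no cluster goes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the solution is stable (seeds no longer move)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# usage: labels, centers = k_means(X, k=3)
```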
clustering: using SSE as a stopping criterion
- iterate until there is a very small change, or no change, in SSE
- what if it doesn't converge?
- if SSE continues to fluctuate, the maximum number of iterations may be exceeded, at which point iteration stops
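A short sketch of this stopping rule in practice, using scikit-learn's KMeans; note that sklearn's tol parameter is applied to centroid movement between iterations rather than to SSE directly, while inertia_ exposes the final SSE and n_iter_ shows whether the run converged or hit max_iter. The data here is made up.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))

km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10).fit(X)
print(km.inertia_)  # final SSE (WSS)
print(km.n_iter_)   # iterations used; == max_iter suggests no convergence
```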
TSS vs WSS
total sum of squares TSS: total variance in the data without clustering (squared distances of all points from the overall mean)
total within-cluster sum of squares WSS (inertia)
- SSE of each point with respect to its cluster centroid
- sum of squared errors SSE = WSS
between-cluster sum of squares BSS
- the difference BSS = TSS - WSS
what does (between cluster sum of squares)/TSS represent
- the proportion of total variance in the data set that is explained by the clustering model
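A small sketch computing TSS, WSS, BSS, and the explained-variance ratio for a fitted K-Means model; the synthetic data and K are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

tss = ((X - X.mean(axis=0)) ** 2).sum()  # total variance, no clustering
wss = km.inertia_                        # SSE to the cluster centroids
bss = tss - wss                          # between-cluster sum of squares
print(bss / tss)  # proportion of variance explained by the clustering
```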
k-means clustering challenges
- a combinatorial optimisation problem, computationally very difficult
- not guaranteed to converge to the optimal solution; different starting seeds may produce different solutions
- sensitive to outliers and to the random initial seeds
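A quick sketch of seed sensitivity: single-initialisation runs (n_init=1) started from different random seeds can land in different local optima with different final SSE; the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i * 4, 1, (25, 2)) for i in range(4)])

for seed in range(5):
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 2))  # inertia may differ across seeds
```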
data preparation for K-Means Clustering
- selection: decide the clustering variables
- transformation: check for outliers/skewed data (see the sketch after this list)
- standardisation: normalise variables to a common scale
- weightage: differentiate variables by relative importance
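A minimal sketch of the transformation check: flag outliers with the common 1.5 x IQR rule and compress a right-skewed tail with a log transform; the data and thresholds are illustrative, not from the source.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 120.0])  # one obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)  # 1.5*IQR fences

x_logged = np.log1p(x)  # log transform compresses a right-skewed tail
print(outliers, x_logged.round(2))
```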
what is attribute selection
- why is it necessary?
- choose the right attributes to include in the model
- attributes should be relevant, should not have too many missing values, and should have variation
- reduces over-fitting: less redundant data means fewer opportunities to make decisions based on noise
- reduces training time: less data, faster training
- improves accuracy: less misleading data
why standardise: K means algorithm
- adjust for differences in scale (magnitude) across variables
- variables with bigger values dominate the choice of clusters due to Euclidean distance
how standardise
- normalisation (min-max method): (X - min) / (max - min), giving values in [0, 1]
- Z-score: (X - mean) / s.d.
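Both standardisation methods on a toy column, alongside the scikit-learn scalers for comparison; note that StandardScaler uses the population s.d. (ddof=0), so it differs slightly from a sample-s.d. Z-score.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [10.0]])

minmax = (x - x.min()) / (x.max() - x.min())  # (X - min) / range -> [0, 1]
zscore = (x - x.mean()) / x.std(ddof=1)       # (X - mean) / sample s.d.

print(MinMaxScaler().fit_transform(x).ravel(), minmax.ravel())
print(StandardScaler().fit_transform(x).ravel(), zscore.ravel())
```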
why weighting: K Means algorithm
- not all variables are equally important in an analysis
- assign weights to variables according to their relative importance
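A minimal sketch of weighting, assuming weights are applied after standardisation: multiplying a column by a weight scales its influence on the Euclidean distance, and hence on the clusters; the weights here are illustrative, not from the source.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

weights = np.array([2.0, 1.0, 0.5])  # relative importance per variable
Xw = StandardScaler().fit_transform(X) * weights

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xw)
```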