clustering Flashcards

1
Q

supervised learning

A
  • a training set of actual outcomes (labels) is available
  • e.g. classification, linear regression
2
Q

unsupervised learning

A
  • grouping data into categories based on some measure of inherent similarity, to understand patterns without specifying a purpose or target
  • e.g. clustering
3
Q

goal of clustering

A
  • points in the same cluster have small distances from one another
  • points in different clusters are at large distances from one another
  • homogeneous WITHIN each cluster
  • heterogeneous ACROSS clusters
4
Q

main types of clustering

A
  • hierarchical (bottom-up: start from single points, then combine)
  • partitioning methods (e.g. K-Means clustering)
5
Q

k means algorithm

A
  1. the value of K is decided and K seeds are assigned
  2. each observation is allocated to its closest seed, giving K clusters
  3. the centroid of each cluster is computed as the new seed
  4. each observation is reassigned to the new clusters, based on distance from the new seeds
  5. iterate steps 2-4 until a stopping criterion is met / the solution is stable
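The steps above can be sketched in plain NumPy; the function name, the random-observation seeding, and the tolerance are illustrative choices, not part of the card:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch following steps 1-5 above."""
    rng = np.random.default_rng(seed)
    # 1. K is given; pick K seeds (here: K random observations)
    seeds = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. allocate each observation to its closest seed
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. compute the centroid of each cluster as the new seed
        new_seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 5. stop when the seeds no longer move (stable solution)
        if np.allclose(new_seeds, seeds, atol=tol):
            break
        # 4. the next pass through the loop reassigns to the new seeds
        seeds = new_seeds
    return seeds, labels
```

Note this sketch assumes no cluster ever becomes empty; production implementations handle that case explicitly.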
6
Q

clustering: using SSE as a stopping criterion
- what if it doesn't work?

A

iterate until there is very little or no change in SSE
- if there is no convergence (SSE continues to fluctuate), the maximum number of iterations may be exceeded and iteration stops

7
Q

TSS vs WSS

A

total sum of squares (TSS): total variance in the data without clustering

total within-cluster sum of squares (WSS, also called inertia):
- SSE with respect to each cluster centroid
- sum of squared errors (SSE) = WSS

8
Q

between cluster Sum of squares

A
  • the difference TSS - WSS
9
Q

what does (between cluster sum of squares)/TSS represent

A
  • proportion of the total variance in the data set that is explained by the clustering model
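The three quantities and the explained-variance ratio can be computed directly. A NumPy sketch (the function name is an illustrative choice), assuming cluster labels are already available:

```python
import numpy as np

def variance_decomposition(X, labels):
    """Return (TSS, WSS, BSS, BSS/TSS) for a clustering of X.

    TSS: squared distances to the overall mean (no clustering).
    WSS: squared distances to each point's own cluster centroid (= SSE).
    BSS: TSS - WSS, the variance the clustering explains.
    """
    tss = ((X - X.mean(axis=0)) ** 2).sum()
    wss = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
              for j in np.unique(labels))
    bss = tss - wss
    return tss, wss, bss, bss / tss
```

A ratio near 1 means the clusters capture most of the variance; near 0 means the clustering explains little.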
10
Q

k-means clustering challenges

A
  • a combinatorial optimisation problem, computationally very difficult
  • not guaranteed to converge to the optimal solution; different starting seeds may produce different solutions
  • sensitive to outliers and to the random initial seeds
11
Q

data preparation for K-Means Clustering

A
  1. selection: decide the clustering variables
  2. transformation: check for outliers/skewed data
  3. standardisation: normalise
  4. weightage: weight variables by relative importance
12
Q

what is attribute selection
- why is it necessary?

A
  • choosing the right attributes to include in the model
  • attributes should be relevant, have few missing values, and show variation
  • reduces over-fitting: less redundant data means less opportunity to make decisions based on noise
  • reduces training time: less data, faster training
  • improves accuracy: less misleading data
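The criteria on the card (numeric, few missing values, some variation) can be checked mechanically. A pandas sketch; the function name and the thresholds are illustrative assumptions, not standard values:

```python
import pandas as pd

def select_attributes(df, max_missing=0.3, min_std=1e-8):
    """Keep numeric columns that are not too sparse and actually vary.
    max_missing and min_std are illustrative thresholds."""
    keep = []
    for col in df.select_dtypes(include="number").columns:
        if df[col].isna().mean() > max_missing:
            continue  # too many missing values
        if df[col].std(skipna=True) <= min_std:
            continue  # no variation -> useless for clustering
        keep.append(col)
    return keep
```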
13
Q

why standardise: K means algorithm

A
  • adjust for differences in scale (magnitude and units)
  • variables with bigger values dominate the choice of clusters because of the Euclidean distance
14
Q

how standardise

A
  1. normalisation (min-max method): (X - min)/(max - min), giving values in [0, 1]
  2. Z-score: (X - mean)/s.d.
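Both formulas applied column-wise in NumPy (function names are illustrative):

```python
import numpy as np

def min_max(X):
    # (X - min) / (max - min), per column -> values in [0, 1]
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def z_score(X):
    # (X - mean) / s.d., per column -> mean 0, s.d. 1
    return (X - X.mean(axis=0)) / X.std(axis=0)
```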
15
Q

why weighting: K Means algorithm

A
  • not all variables are equally important in an analysis
  • assign weights to variables according to their relative importance
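One common way to apply such weights is to scale each (already standardised) column before clustering; this one-liner sketch is an illustrative assumption about how the weighting is implemented:

```python
import numpy as np

def apply_weights(X_std, weights):
    """Scale each standardised column by its relative importance.
    Since k-means uses Euclidean distance, multiplying a column by w
    multiplies its contribution to the squared distance by w**2."""
    return X_std * np.asarray(weights)
```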
16
Q

selecting k

A
  • plot k (x-axis) against average distance/SSE (y-axis)
  • choose the "elbow": the point beyond which increasing k brings only small reductions
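A sketch of computing that curve with scikit-learn's `KMeans` (assuming scikit-learn is installed; the function name and the choice of `n_init=10` are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is available

def elbow_curve(X, k_max=6):
    """SSE (KMeans.inertia_) for k = 1..k_max; plot k against these
    values and look for the elbow where extra clusters stop helping."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, k_max + 1)]
```

For data with three tight clusters, the curve drops steeply up to k = 3 and flattens afterwards, so the elbow suggests k = 3.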