Clustering Flashcards
supervised learning
- a training set with actual outcomes (labels) is available
- e.g. classification, linear regression
unsupervised learning
- grouping data into categories based on some measure of inherent similarity, to understand patterns without specifying a purpose or target
- e.g. clustering
goal of clustering
- points in the same cluster have small distances from one another
- points in different clusters are at large distances from one another
- homogeneous WITHIN the cluster
- heterogeneous ACROSS clusters
main types of clustering
- hierarchical (bottom-up: start with single points and successively combine clusters)
- partitioning methods (e.g. K-Means Clustering)
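A minimal sketch contrasting the two families on toy data, assuming scikit-learn is available; the sample data and parameter choices are illustrative, not from the source.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# hierarchical (bottom-up): every point starts as its own cluster,
# and the closest pair of clusters is merged until K remain
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# partitioning: K-Means assigns points to K centroids and refines them
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```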
k means algorithm
- a value of K is decided and K initial seeds are assigned
- each observation is allocated to the closest seed to get K clusters
- compute centroid of each cluster as the new seed
- reassign each observation to the new clusters, based on distance from new seeds
- iterate steps 2-4 until a stopping criterion is met / the solution is stable
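A NumPy-only sketch of the steps above; the function and variable names are illustrative, and it assumes no cluster ever goes empty (a real implementation must handle that case).

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: choose K initial seeds (here: K distinct random observations)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 2: allocate each observation to its closest seed
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute the centroid of each cluster as the new seed
        # (assumes no cluster goes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the solution is stable (seeds no longer move)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# usage: labels, centers = k_means(X, k=3)
```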
clustering: using SSE as a stopping criterion
- iterate until there is a very small change, or no change, in SSE
- what if it doesn't converge?
- if SSE continues to fluctuate, the maximum number of iterations may be exceeded, at which point iteration stops
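A short sketch of this stopping rule in practice, using scikit-learn's KMeans; note that sklearn's tol parameter is applied to centroid movement between iterations rather than to SSE directly, while inertia_ exposes the final SSE and n_iter_ shows whether the run converged or hit max_iter. The data here is made up.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))

km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10).fit(X)
print(km.inertia_)  # final SSE (WSS)
print(km.n_iter_)   # iterations used; == max_iter suggests no convergence
```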
TSS vs WSS
total sum of squares TSS: total variance in the data without clustering (squared distances of all points from the overall mean)
total within-cluster sum of squares WSS (inertia)
- SSE of each point with respect to its cluster centroid
- sum of squared errors SSE = WSS
between-cluster sum of squares BSS
- the difference BSS = TSS - WSS
what does (between cluster sum of squares)/TSS represent
- the proportion of total variance in the data set that is explained by the clustering model
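A small sketch computing TSS, WSS, BSS, and the explained-variance ratio for a fitted K-Means model; the synthetic data and K are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

tss = ((X - X.mean(axis=0)) ** 2).sum()  # total variance, no clustering
wss = km.inertia_                        # SSE to the cluster centroids
bss = tss - wss                          # between-cluster sum of squares
print(bss / tss)  # proportion of variance explained by the clustering
```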
k-means clustering challenges
- a combinatorial optimisation problem, computationally very difficult
- not guaranteed to converge to the optimal solution; different starting seeds may produce different solutions
- sensitive to outliers and to the random initial seeds
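A quick sketch of seed sensitivity: single-initialisation runs (n_init=1) started from different random seeds can land in different local optima with different final SSE; the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i * 4, 1, (25, 2)) for i in range(4)])

for seed in range(5):
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 2))  # inertia may differ across seeds
```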
data preparation for K-Means Clustering
- selection: decide the clustering variables
- transformation: check for outliers/skewed data (see the sketch after this list)
- standardisation: normalise variables to a common scale
- weightage: differentiate variables by relative importance
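A minimal sketch of the transformation check: flag outliers with the common 1.5 x IQR rule and compress a right-skewed tail with a log transform; the data and thresholds are illustrative, not from the source.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 120.0])  # one obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)  # 1.5*IQR fences

x_logged = np.log1p(x)  # log transform compresses a right-skewed tail
print(outliers, x_logged.round(2))
```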
what is attribute selection
- why is it necessary?
- choose the right attributes to include in the model
- attributes should be relevant, should not have too many missing values, and should have variation
- reduces over-fitting: less redundant data means fewer opportunities to make decisions based on noise
- reduces training time: less data, faster training
- improves accuracy: less misleading data
why standardise: K means algorithm
- adjust for differences in scale (magnitude) across variables
- variables with bigger values dominate the choice of clusters due to Euclidean distance
how standardise
- normalisation (min-max method): (X - min) / (max - min), giving values in [0, 1]
- Z-score: (X - mean) / s.d.
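Both standardisation methods on a toy column, alongside the scikit-learn scalers for comparison; note that StandardScaler uses the population s.d. (ddof=0), so it differs slightly from a sample-s.d. Z-score.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [10.0]])

minmax = (x - x.min()) / (x.max() - x.min())  # (X - min) / range -> [0, 1]
zscore = (x - x.mean()) / x.std(ddof=1)       # (X - mean) / sample s.d.

print(MinMaxScaler().fit_transform(x).ravel(), minmax.ravel())
print(StandardScaler().fit_transform(x).ravel(), zscore.ravel())
```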
why weighting: K Means algorithm
- not all variables are equally important in an analysis
- assign weights to variables according to their relative importance
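A minimal sketch of weighting, assuming weights are applied after standardisation: multiplying a column by a weight scales its influence on the Euclidean distance, and hence on the clusters; the weights here are illustrative, not from the source.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

weights = np.array([2.0, 1.0, 0.5])  # relative importance per variable
Xw = StandardScaler().fit_transform(X) * weights

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xw)
```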