Week 5: Clustering 2 Flashcards
What are the 2 ways we use to pick the k-value?
1) Elbow method
2) SSE (sum of squared errors): the within-cluster error that the elbow method plots against k
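The two answers above can be sketched together: compute the SSE for several values of k and look for the "elbow" where the drop in SSE levels off. This is a minimal plain-Python sketch, not the course's implementation. The dataset is made up, and `kmeans`, `sse`, and `best_sse` are illustrative helper names.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k data points as starting centroids
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(point, centroids[j]))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))

def best_sse(points, k, restarts=5):
    """Lowest SSE over a few random starts, since k-means can hit local optima."""
    return min(sse(points, *kmeans(points, k, seed=s)) for s in range(restarts))

# Made-up data: three loose groups of three points each.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.2),
    (1.0, 9.0), (1.5, 8.5), (0.8, 9.5),
]

sse_by_k = {k: best_sse(points, k) for k in range(1, 6)}
for k, value in sse_by_k.items():
    print(k, round(value, 2))
```

Plotting `sse_by_k` against k and picking the point where the curve bends is the elbow method. SSE generally keeps falling as k grows, so the bend matters, not the minimum.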
What are the 4 steps in data preparation for k-means clustering?
1) Select the variables of the dataset you want to use
2) Transform variables if needed
3) Standardize variables if needed
4) Weight important variables
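The four steps above can be sketched end to end in plain Python. Everything here is illustrative: the records, the column names, the choice of a log transform for `income`, and the weights are all made-up assumptions. Weighting by multiplying a standardized column is one common approach, not necessarily the one the course uses.

```python
import math
import statistics

# Hypothetical customer records; column names and values are made up.
rows = [
    {"id": 1, "income": 20000, "age": 25, "visits": 3},
    {"id": 2, "income": 35000, "age": 41, "visits": 7},
    {"id": 3, "income": 50000, "age": 33, "visits": 2},
    {"id": 4, "income": 400000, "age": 58, "visits": 5},
]

# 1) Select the variables to cluster on.
selected = ["income", "age"]
columns = {name: [float(row[name]) for row in rows] for name in selected}

# 2) Transform skewed variables (income has one very large value).
columns["income"] = [math.log(v) for v in columns["income"]]

# 3) Standardize each variable to mean 0 and standard deviation 1.
def zscores(values):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD, an assumption
    return [(v - mean) / sd for v in values]

columns = {name: zscores(values) for name, values in columns.items()}

# 4) Weight important variables (hypothetical weights).
weights = {"income": 2.0, "age": 1.0}
columns = {
    name: [weights[name] * v for v in values] for name, values in columns.items()
}

# One tuple per record, ready to pass to a k-means routine.
prepared = list(zip(*(columns[name] for name in selected)))
print(prepared)
```

Multiplying a standardized column by a weight w scales its contribution to the squared Euclidean distance by w squared, so a weight of 2 makes that variable count four times as much.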
When and why do we perform variable transformations?
When the data are skewed. K-means groups points by their distance to each cluster's mean, so a few extreme values in a skewed variable can pull the centroids and give poor clusters
What are the 2 types of data transformations we can use, and what’s the difference?
1) Log transformation
2) Square root transformation
Both compress high values more than low values, which spreads the low values out relative to the high ones, but log is more aggressive than square root (SR)
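A small sketch of the difference, using made-up values that span several orders of magnitude. The helper `top_to_bottom_gap_ratio` is an illustrative name. It compares the gap between the two largest values with the gap between the two smallest.

```python
import math

# Made-up, strongly right-skewed values (all positive, as log requires).
values = [1, 10, 100, 1000, 10000]

sqrt_values = [math.sqrt(v) for v in values]
log_values = [math.log10(v) for v in values]

def top_to_bottom_gap_ratio(xs):
    """Gap between the two largest values divided by the gap between the two smallest."""
    return (xs[-1] - xs[-2]) / (xs[1] - xs[0])

for name, xs in [("raw", values), ("sqrt", sqrt_values), ("log10", log_values)]:
    print(name, [round(x, 2) for x in xs], round(top_to_bottom_gap_ratio(xs), 2))
```

The ratio shrinks under the square root and shrinks much further under the log, which is what "more aggressive" means here. The log is undefined at zero, so data containing zeros would need a variant such as `math.log1p`.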
Why do we standardize variables?
To prevent a scenario where variables with larger scales dominate variables with smaller scales
What is the most common form of standardization and its formula?
Z-score: z = (X - Mean) / SD, where SD is the standard deviation
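A minimal sketch of the z-score formula and of why it matters for distance-based clustering. The customer values are made up, and the population standard deviation (`statistics.pstdev`) is an assumption. The sample version (`statistics.stdev`) is also widely used.

```python
import math
import statistics

# Made-up customers: income in dollars (large scale), age in years (small scale).
incomes = [30000.0, 32000.0, 33000.0, 90000.0]
ages = [25.0, 60.0, 26.0, 40.0]

def zscores(values):
    """Apply z = (x - mean) / SD to every value."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD, an assumption
    return [(x - mean) / sd for x in values]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

raw = list(zip(incomes, ages))
scaled = list(zip(zscores(incomes), zscores(ages)))

# Customer 0 vs customer 1: $2,000 income gap, 35-year age gap.
# Customer 0 vs customer 2: $3,000 income gap, 1-year age gap.
raw_d01, raw_d02 = euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2])
scaled_d01, scaled_d02 = euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2])

print("raw:   ", round(raw_d01, 1), round(raw_d02, 1))
print("scaled:", round(scaled_d01, 3), round(scaled_d02, 3))
```

On the raw scale, customer 0 looks closer to customer 1 because the dollar amounts swamp the age gap. After standardization, customer 0 is closer to customer 2, whose income and age are both similar.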