K-Means Clustering Flashcards
Is K-Means Clustering supervised or unsupervised?
K-Means Clustering is an example of unsupervised machine learning.
Explain supervised versus unsupervised learning
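In supervised learning, the model is trained on labeled examples, so the desired output is known in advance. In unsupervised learning, there are no labels: the data is analyzed without knowing a specific output you're looking for.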
Name some examples of clustering
Market segmentation, product analysis, etc.
What is a cluster?
A cluster is a collection of objects that are similar to one another.
How do we determine similarity in clustering?
We need a notion of distance between data points: the smaller the distance, the more similar the points.
What is the objective of clustering?
The objective of clustering is to group similar data points together. Some examples are segmenting customers into similar groups, or automatically organizing files or emails into folders.
How does clustering simplify data?
Clustering simplifies data by reducing many data points into a few clusters
What are some examples of distance used in clustering?
Examples of common distance measures in clustering are Manhattan distance, Euclidean distance, and Chebyshev distance.
What is the formula for Euclidean distance?
Square root of [ (X1 - X2) squared + (Y1 - Y2) squared + (Z1 - Z2) squared + … ], with one squared difference for each of the m columns.
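A minimal Python sketch of the Euclidean formula, assuming each point is a tuple with one number per column. The function name `euclidean_distance` is only illustrative.

```python
import math

def euclidean_distance(a, b):
    # Sum the squared difference in every column, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

distance = euclidean_distance((1, 2, 3), (4, 6, 3))
```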
What is the formula for Manhattan distance?
Absolute value of X1 minus X2 plus absolute value of Y1 minus Y2 plus absolute value of Z1 minus Z2, etc.
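The same idea as a short Python sketch, again assuming points are tuples with one number per column. The name `manhattan_distance` is only illustrative.

```python
def manhattan_distance(a, b):
    # Add up the absolute difference in every column.
    return sum(abs(x - y) for x, y in zip(a, b))

distance = manhattan_distance((1, 2, 3), (4, 6, 3))
```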
Why is it called Manhattan distance?
In Manhattan, you cannot travel between two points in a straight line. You must follow the street grid, so the distance is the sum of the horizontal and vertical legs.
How do you calculate Chebyshev or chessboard distance?
Take the maximum of (absolute value of X1 - X2, absolute value of Y1 - Y2, absolute value of Z1 - Z2, etc.)
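As a short Python sketch, assuming points are tuples with one number per column (the name `chebyshev_distance` is only illustrative):

```python
def chebyshev_distance(a, b):
    # Keep only the largest absolute difference across all columns.
    return max(abs(x - y) for x, y in zip(a, b))

distance = chebyshev_distance((1, 2, 3), (4, 6, 3))
```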
What is the Minkowski distance?
A general formula with a parameter P that selects the distance measure. It is calculated as the sum of (absolute value of Xi - Yi, raised to the power of P) over all columns, with the whole sum then raised to the power of 1/P.
Manhattan distance uses P equals one and Euclidean distance uses P equals two. Chebyshev (chessboard) distance is the limit as P approaches infinity.
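A small Python sketch of the Minkowski formula, assuming points are tuples of numbers and P is at least one. The name `minkowski_distance` is only illustrative. It can be used to check how the choice of P relates to the other measures.

```python
def minkowski_distance(a, b, p):
    # Sum |Xi - Yi| ** p over all columns, then raise the sum to 1/p.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (1, 2, 3), (4, 6, 3)
as_manhattan = minkowski_distance(a, b, 1)    # same as Manhattan distance
as_euclidean = minkowski_distance(a, b, 2)    # same as Euclidean distance
near_chebyshev = minkowski_distance(a, b, 50) # large P approaches the largest difference
```

As P grows, the largest column difference dominates the sum, which is why the result drifts toward the Chebyshev distance.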
What measure of distance does K-Means clustering use?
Euclidean distance
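To show where this distance is used, here is a sketch of the assignment step of K-Means: each point goes to the centroid it is closest to. The name `assign_to_nearest` and the sample data are made up for illustration. `math.dist` requires Python 3.8 or later.

```python
import math

def assign_to_nearest(points, centroids):
    # For each point, return the index of the closest centroid,
    # measured with Euclidean distance (math.dist).
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

labels = assign_to_nearest([(1, 1), (9, 8), (2, 0)], [(0, 0), (10, 10)])
```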
Name some types of clustering
Connectivity-based clustering
Centroid-based clustering
What is connectivity-based clustering?
It is based on the idea that objects close to each other are more related than objects far apart.
What is the formula for the number of connections between N points?
(N x (N-1)) / 2
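A quick Python check of this formula, comparing it with a direct count of every unordered pair of points. The name `connection_count` is only illustrative.

```python
from itertools import combinations

def connection_count(n):
    # Each of the n points connects to the other n - 1 points,
    # and every connection is shared by two points.
    return n * (n - 1) // 2

# Direct count of all pairs among 5 points, for comparison.
pairs_among_five = len(list(combinations(range(5), 2)))
```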
What is the point of attempting to choose the optimal K and what is a commonly used method for doing so?
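You are attempting to strike a good balance between compression (few clusters) and accuracy (clusters that fit the data closely). The elbow method is commonly used: plot the within-cluster sum of squared distances against K and choose the K where the curve bends and further increases in K give little improvement.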
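A rough sketch of how the numbers behind an elbow plot could be produced. Everything here is an assumption made for illustration: the toy data, the helper names, and the simple farthest-point initialization, which is chosen only to keep the example deterministic and is not how library implementations usually start.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    # Column-wise mean of a list of points.
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))

def farthest_point_init(points, k):
    # Start from the first point, then repeatedly add the point that is
    # farthest from all centroids chosen so far (illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids

def kmeans_inertia(points, k, iterations=20):
    # Run a fixed number of assign/update rounds, then return the
    # within-cluster sum of squared distances.
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Toy data with three visibly separate groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
inertia_by_k = {k: kmeans_inertia(points, k) for k in range(1, 6)}
```

On data like this, the values in `inertia_by_k` should fall sharply up to K = 3 and only slightly afterwards, and that bend is the elbow.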
Using K-Means, how do you calculate the centroid of a cluster?
The centroid of a cluster is calculated by finding the mean vector of all data points in that cluster. For example, add up the X values and divide by the number of data points to find the centroid's X value, and add up the Y values and divide by the number of data points to find its Y value. Use the values as they are, keeping their signs, not their absolute values.
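A minimal Python sketch of the centroid calculation, assuming the cluster is a non-empty list of tuples. The name `centroid` and the sample points are only illustrative. The example includes negative values to show that the signed values are averaged.

```python
def centroid(points):
    # Average each column separately to get the mean vector.
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))

center = centroid([(-2, 4), (2, 0), (3, -1)])
```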