K-Means Clustering Flashcards

Question 1

Q

Is K-Means Clustering supervised or unsupervised?

Answer

A

K-Means Clustering is an example of unsupervised machine learning.

Question 2

Q

Explain supervised versus unsupervised learning

Answer

A

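In supervised learning, the data is labeled with a known output, and the model learns to predict that output. In unsupervised learning, there is no labeled output; the data is analyzed without knowing a specific output you’re looking for.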

Question 3

Q

Name some examples of clustering

Answer

A

Market segmentation, product analysis, etc.

Question 4

Q

What is a cluster?

Answer

A

A cluster is a collection of objects that are similar to each other.

Question 5

Q

How do we determine similarity in clustering?

Answer

A

We need a notion of distance: the smaller the distance between two data points, the more similar they are.

Question 6

Q

What is the objective of clustering?

Answer

A

The objective of clustering is to group similar data points together. Some examples are segmenting customers into similar groups, or automatically organizing files or emails into folders.

Question 7

Q

How does clustering simplify data?

Answer

A

Clustering simplifies data by reducing many data points into a few clusters.

Question 8

Q

What are some examples of distance used in clustering?

Answer

A

Examples of common distance measures in clustering are Manhattan distance, Euclidean distance, and Chebyshev distance.

Question 9

Q

What is the formula for Euclidean distance?

Answer

A

Square root of [ (X1 - X2) squared + (Y1 - Y2) squared + (Z1 - Z2) squared + … ]

One squared term is added for each of the m columns.
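A minimal Python sketch of this formula, assuming each data point is a tuple of numbers with one entry per column. The function name `euclidean_distance` is illustrative and not from the cards.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared differences, one term per column."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of columns")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```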

Question 10

Q

What is the formula for Manhattan distance?

Answer

A

Absolute value of X1 minus X2 plus absolute value of Y1 minus Y2 plus absolute value of Z1 minus Z2, etc.
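The same idea as a short Python sketch, assuming points are tuples with one entry per column. The name `manhattan_distance` is illustrative.

```python
def manhattan_distance(p, q):
    """Sum of the absolute differences, one term per column."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan_distance((0, 0), (3, 4)))  # 7
```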

Question 11

Q

Why is it called Manhattan distance?

Answer

A

In Manhattan, you cannot connect two points directly. You must walk in a grid.

Question 12

Q

How do you calculate Chebyshev or chessboard distance?

Answer

A

Take the max value of (absolute value of X1 - X2, absolute value of Y1 - Y2, absolute value of Z1 - Z2, … etc).
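As a short Python sketch, assuming points are tuples with one entry per column. The name `chebyshev_distance` is illustrative.

```python
def chebyshev_distance(p, q):
    """Largest absolute difference across all columns."""
    return max(abs(a - b) for a, b in zip(p, q))

print(chebyshev_distance((0, 0), (3, 4)))  # 4
```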

Question 13

Q

What is the Minkowski distance?

Answer

A

A general distance formula with a parameter P that selects the distance measure. It is calculated as the sum of (the absolute value of Xi - Yi raised to the power of P) over all columns, with that sum then raised to the power of 1/P.

Manhattan distance uses P equals one and Euclidean distance uses P equals two. Chessboard (Chebyshev) distance is the limit as P goes to infinity.
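A small Python sketch of the Minkowski formula, assuming points are tuples of numbers. The name `minkowski_distance` and its `power` argument (the P value) are illustrative. A large P is used only to suggest how the result approaches the largest single-column difference.

```python
def minkowski_distance(p, q, power):
    """Sum of |Xi - Yi| ** power over all columns, raised to 1 / power."""
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1 / power)

a, b = (0, 0), (3, 4)
print(minkowski_distance(a, b, 1))   # 7.0, the Manhattan distance
print(minkowski_distance(a, b, 2))   # 5.0, the Euclidean distance
print(minkowski_distance(a, b, 50))  # close to 4, the largest column difference
```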

Question 14

Q

What measure of distance does K-Means clustering use?

Answer

A

Euclidean distance
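One place this distance shows up is when a point is assigned to its closest centroid. The sketch below assumes points and centroids are tuples of numbers; `nearest_centroid` is an illustrative name, and `math.dist` (Python 3.8 or later) computes the Euclidean distance.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

centroids = [(0, 0), (10, 10)]
print(nearest_centroid((1, 2), centroids))  # 0
print(nearest_centroid((9, 8), centroids))  # 1
```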

Question 15

Q

Name some types of clustering

Answer

A

Connectivity-based clustering
Centroid-based clustering

Question 16

Q

What is connectivity-based clustering?

Answer

A

It is based on the idea that related objects are closer to each other than unrelated objects are.

Question 17

Q

What is the formula for the number of connections between N points?

Answer

A

(N x (N-1)) / 2
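A quick check of this formula in Python, assuming a "connection" means one unordered pair of distinct points. The name `connection_count` is illustrative.

```python
from itertools import combinations

def connection_count(n):
    """Number of unordered pairs among n points: (N x (N-1)) / 2."""
    return n * (n - 1) // 2

points = ["A", "B", "C", "D"]
pairs = list(combinations(points, 2))  # every unordered pair of points
print(len(pairs), connection_count(len(points)))  # 6 6
```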

Question 18

Q

What is the point of attempting to choose the optimal K and what is a commonly used method for doing so?

Answer

A

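You are attempting to strike a good balance between compression (fewer clusters) and accuracy (points close to their cluster's centroid). The elbow method is commonly used: plot the within-cluster error against K and pick the K where further increases stop giving much improvement.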
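A rough sketch of the elbow method in plain Python. It assumes the error being tracked is the within-cluster sum of squared Euclidean distances, and it uses a deterministic farthest-point start purely to keep the example reproducible; that start, the function name `kmeans_wcss`, and the sample points are illustrative choices and not part of the cards.

```python
import math

def kmeans_wcss(points, k, iterations=20):
    """Run a small K-Means and return the within-cluster sum of squares."""
    # Illustrative start: first point, then repeatedly the point farthest
    # from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its points (unchanged if empty).
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
for k in range(1, 6):
    # Look for the K after which the error stops dropping sharply.
    print(k, round(kmeans_wcss(points, k), 2))
```

On this toy data the error should fall steeply up to K = 3 and only slightly afterwards, which is the "elbow" the method looks for.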

Question 19

Q

Using K-Means, how do you calculate the centroid of a cluster?

Answer

A

The centroid of a cluster is calculated by finding the mean vector of all data points in that cluster. For example, add up the X values of all data points and divide by the number of data points to find the X value of the centroid, then add up the Y values and divide by the number of data points to find the Y value. The values are used as they are (not their absolute values), so negative coordinates count as negative.
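A minimal sketch of the mean-vector calculation, assuming the cluster is a list of tuples with one entry per column. The name `centroid` is illustrative. The mean is taken over the values as they are, so a negative coordinate pulls the centroid toward it.

```python
def centroid(points):
    """Mean of each column across all points in the cluster."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

print(centroid([(-2, 4), (0, 0), (2, 2)]))  # (0.0, 2.0)
```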

(19 cards)