Class Two Flashcards
What is unsupervised machine learning?
Unsupervised machine learning is a type of machine learning where the algorithm learns patterns and structures in the data without being provided with explicit labels or target variables.
What is K-Means clustering?
K-Means clustering is an unsupervised machine learning algorithm used for partitioning data into K clusters based on similarity. It aims to minimize the sum of squared distances between data points and their cluster centroids.
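A minimal sketch of this, assuming scikit-learn is installed (the two-blob toy data below is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated 2-D blobs.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# n_init=10 reruns the algorithm with different centroid seeds
# and keeps the run with the lowest inertia.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

km.labels_    # cluster index assigned to each point
km.inertia_   # sum of squared distances to the nearest centroid
```

`inertia_` is exactly the quantity K-Means tries to minimize: the sum of squared distances between each point and its cluster centroid.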
What are the advantages of K-Means clustering?
Advantages of K-Means clustering include its simplicity, scalability to large datasets, and effectiveness in identifying well-separated spherical clusters.
- Advantages:
- Easy to implement
- Adapts easily to new examples
- Few hyperparameters
- Disadvantages:
- Does not scale well
- Curse of dimensionality
- Prone to overfitting
When should you use K-Means clustering?
K-Means clustering is suitable when the data is continuous and there is a need to partition it into distinct groups based on similarity or proximity. It is useful for:
- Data preprocessing: datasets frequently have missing values, and KNN can estimate those values (missing-data imputation).
- Recommendation engines: using clickstream data from websites, KNN can automatically recommend additional content to users.
- Finance: applying KNN to credit data can help banks assess the risk of a loan.
- Healthcare: predicting the risk of heart attacks and prostate cancer.
- Pattern recognition: KNN has also helped identify patterns in text and in digit classification.
What are the limitations of K-Means clustering?
Limitations of K-Means clustering include sensitivity to the initial placement of cluster centroids, the requirement to specify the number of clusters in advance, and the assumption of spherical clusters.
From slides:
- It is necessary to run the algorithm several times (with different centroid initializations) to avoid sub-optimal solutions.
- You need to specify the number of clusters in advance.
- It does not behave well when the clusters have:
  - Varying sizes
  - Different densities
  - Non-spherical shapes
- You must scale the input features before running K-Means; scaling improves performance.
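The scaling point above can be sketched with a preprocessing pipeline (a minimal illustration assuming scikit-learn; the age/income values are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Income (dollars) dwarfs age (years), so unscaled Euclidean
# distance would be dominated by income; standardize first.
X = np.array([[25.0, 30_000.0], [27.0, 32_000.0],
              [55.0, 90_000.0], [53.0, 95_000.0]])  # [age, income]

pipe = make_pipeline(StandardScaler(),
                     KMeans(n_clusters=2, n_init=10, random_state=0))
labels = pipe.fit_predict(X)
```

Wrapping the scaler and the clusterer in one pipeline guarantees the same scaling is applied to any new data passed through `pipe`.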
What is DBSCAN clustering?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised machine learning algorithm that groups data points into clusters based on density. It can find clusters of arbitrary shapes and handle outliers.
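A minimal sketch, assuming scikit-learn (the toy data, `eps`, and `min_samples` values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one far-away outlier.
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],
              [20.0, 20.0]])  # outlier

# eps: neighborhood radius; min_samples: points needed (including
# the point itself) for a point to count as a core point.
db = DBSCAN(eps=0.5, min_samples=2).fit(X)

db.labels_  # outliers receive the special label -1
```

Note that the number of clusters is never passed in; DBSCAN derives it from the density structure of the data.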
What are the advantages of DBSCAN clustering?
Advantages of DBSCAN clustering include its ability to discover clusters of various shapes, its robustness to noise and outliers, and the ability to determine the number of clusters automatically.
When should you use DBSCAN clustering?
DBSCAN clustering is suitable when the data has varying density, there are irregularly shaped clusters, and when noise or outliers need to be identified.
What are the limitations of DBSCAN clustering?
Limitations of DBSCAN clustering include sensitivity to the choice of its distance parameters (eps and min_samples), difficulty in handling data where cluster densities vary widely (a single eps cannot fit them all), and the possibility of chaining distinct groups into one cluster through density-connected bridge points.
What is hierarchical clustering?
Hierarchical clustering is an unsupervised machine learning algorithm that creates a hierarchy of clusters. It iteratively merges or divides clusters based on their similarity, forming a tree-like structure called a dendrogram.
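The merge-then-cut idea can be sketched with SciPy (a minimal illustration; the data and the choice of Ward linkage are assumptions for the example):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0]])

# linkage() records the full merge history -- the data behind a
# dendrogram plot (scipy.cluster.hierarchy.dendrogram).
Z = linkage(X, method="ward")

# Cut the tree so that at most 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the same tree at a different height (a different `t`) yields a different granularity without re-running the clustering, which is the practical payoff of the hierarchy.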
What are the advantages of hierarchical clustering?
Advantages of hierarchical clustering include its ability to reveal the hierarchical structure of the data, its flexibility in handling different similarity measures, and the visualization provided by dendrograms.
When should you use hierarchical clustering?
Hierarchical clustering is suitable when the data has a hierarchical structure, and the goal is to explore relationships and similarities at different levels of granularity.
What are the limitations of hierarchical clustering?
Limitations of hierarchical clustering include its computational complexity for large datasets, sensitivity to the choice of distance or similarity measures, and difficulty in handling noise and outliers.
How do you determine the optimal number of clusters in K-Means clustering?
The optimal number of clusters in K-Means clustering can be determined using techniques such as the elbow method, silhouette analysis, or visual inspection of cluster quality.
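The elbow method can be sketched by plotting (or tabulating) inertia against k (assuming scikit-learn; the three synthetic blobs below are made up so the elbow lands at k = 3):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three tight blobs: inertia should drop sharply up to k = 3
# and only slowly after that -- the "elbow".
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in centers])

inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
```

Inertia always decreases as k grows, so the absolute value is not informative on its own; the elbow is the k after which further clusters stop paying for themselves.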
What is the silhouette coefficient used for in clustering?
The silhouette coefficient is a measure of how well each data point fits into its assigned cluster in terms of both cohesion and separation. It ranges from -1 to 1, where higher values indicate better clustering quality.
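A minimal sketch of computing the mean silhouette coefficient for a clustering, assuming scikit-learn (toy data for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two compact, well-separated groups.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.0],
              [8.0, 8.0], [8.1, 8.1], [7.9, 8.0]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all points: in [-1, 1], higher is better.
score = silhouette_score(X, labels)
```

Because the two groups here are tight and far apart, the score comes out close to 1; comparing this score across several candidate values of k is another common way to choose the number of clusters.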