Class Two Flashcards
What is unsupervised machine learning?
Unsupervised machine learning is a type of machine learning where the algorithm learns patterns and structures in the data without being provided with explicit labels or target variables.
What is K-Means clustering?
K-Means clustering is an unsupervised machine learning algorithm used for partitioning data into K clusters based on similarity. It aims to minimize the sum of squared distances between data points and their cluster centroids.
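A minimal sketch of this, assuming scikit-learn is installed (the two-blob toy data below is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated 2-D blobs.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# n_init=10 reruns the algorithm with different centroid seeds
# and keeps the run with the lowest inertia.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

km.labels_    # cluster index assigned to each point
km.inertia_   # sum of squared distances to the nearest centroid
```

`inertia_` is exactly the quantity K-Means tries to minimize: the sum of squared distances between each point and its cluster centroid.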
What are the advantages of K-Means clustering?
Advantages of K-Means clustering include its simplicity, scalability to large datasets, and effectiveness in identifying well-separated spherical clusters.
- Advantages:
- Easy to implement
- Adapts easily to new examples
- Few hyperparameters
- Disadvantages:
- Does not scale well
- Curse of dimensionality
- Prone to overfitting
When should you use K-Means clustering?
K-Means clustering is suitable when the data is continuous and there is a need to partition it into distinct groups based on similarity or proximity. It is useful for:
- Data preprocessing: datasets frequently have missing values, and KNN can estimate those values (missing-data imputation).
- Recommendation engines: using clickstream data from websites, KNN can automatically recommend additional content to users.
- Finance: applying KNN to credit data can help banks assess the risk of a loan.
- Healthcare: predicting the risk of heart attacks and prostate cancer.
- Pattern recognition: KNN has also helped identify patterns in text and in digit classification.
What are the limitations of K-Means clustering?
Limitations of K-Means clustering include sensitivity to the initial placement of cluster centroids, the requirement to specify the number of clusters in advance, and the assumption of spherical clusters.
From slides:
- It is necessary to run the algorithm several times (with different centroid initializations) to avoid sub-optimal solutions.
- You need to specify the number of clusters in advance.
- It does not behave well when the clusters have:
  - Varying sizes
  - Different densities
  - Non-spherical shapes
- You must scale the input features before running K-Means; scaling improves performance.
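The scaling point above can be sketched with a preprocessing pipeline (a minimal illustration assuming scikit-learn; the age/income values are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Income (dollars) dwarfs age (years), so unscaled Euclidean
# distance would be dominated by income; standardize first.
X = np.array([[25.0, 30_000.0], [27.0, 32_000.0],
              [55.0, 90_000.0], [53.0, 95_000.0]])  # [age, income]

pipe = make_pipeline(StandardScaler(),
                     KMeans(n_clusters=2, n_init=10, random_state=0))
labels = pipe.fit_predict(X)
```

Wrapping the scaler and the clusterer in one pipeline guarantees the same scaling is applied to any new data passed through `pipe`.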
What is DBSCAN clustering?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised machine learning algorithm that groups data points into clusters based on density. It can find clusters of arbitrary shapes and handle outliers.
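A minimal sketch, assuming scikit-learn (the toy data, `eps`, and `min_samples` values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one far-away outlier.
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],
              [20.0, 20.0]])  # outlier

# eps: neighborhood radius; min_samples: points needed (including
# the point itself) for a point to count as a core point.
db = DBSCAN(eps=0.5, min_samples=2).fit(X)

db.labels_  # outliers receive the special label -1
```

Note that the number of clusters is never passed in; DBSCAN derives it from the density structure of the data.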
What are the advantages of DBSCAN clustering?
Advantages of DBSCAN clustering include its ability to discover clusters of various shapes, its robustness to noise and outliers, and the ability to determine the number of clusters automatically.
When should you use DBSCAN clustering?
DBSCAN clustering is suitable when the data has varying density, there are irregularly shaped clusters, and when noise or outliers need to be identified.
What are the limitations of DBSCAN clustering?
Limitations of DBSCAN clustering include sensitivity to the choice of its distance parameters (eps and min_samples), difficulty in handling data where cluster densities vary widely (a single eps cannot fit them all), and the possibility of chaining distinct groups into one cluster through density-connected bridge points.
What is hierarchical clustering?
Hierarchical clustering is an unsupervised machine learning algorithm that creates a hierarchy of clusters. It iteratively merges or divides clusters based on their similarity, forming a tree-like structure called a dendrogram.
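The merge-then-cut idea can be sketched with SciPy (a minimal illustration; the data and the choice of Ward linkage are assumptions for the example):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0]])

# linkage() records the full merge history -- the data behind a
# dendrogram plot (scipy.cluster.hierarchy.dendrogram).
Z = linkage(X, method="ward")

# Cut the tree so that at most 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Cutting the same tree at a different height (a different `t`) yields a different granularity without re-running the clustering, which is the practical payoff of the hierarchy.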
What are the advantages of hierarchical clustering?
Advantages of hierarchical clustering include its ability to reveal the hierarchical structure of the data, its flexibility in handling different similarity measures, and the visualization provided by dendrograms.
When should you use hierarchical clustering?
Hierarchical clustering is suitable when the data has a hierarchical structure, and the goal is to explore relationships and similarities at different levels of granularity.
What are the limitations of hierarchical clustering?
Limitations of hierarchical clustering include its computational complexity for large datasets, sensitivity to the choice of distance or similarity measures, and difficulty in handling noise and outliers.
How do you determine the optimal number of clusters in K-Means clustering?
The optimal number of clusters in K-Means clustering can be determined using techniques such as the elbow method, silhouette analysis, or visual inspection of cluster quality.
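The elbow method can be sketched by plotting (or tabulating) inertia against k (assuming scikit-learn; the three synthetic blobs below are made up so the elbow lands at k = 3):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three tight blobs: inertia should drop sharply up to k = 3
# and only slowly after that -- the "elbow".
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in centers])

inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
```

Inertia always decreases as k grows, so the absolute value is not informative on its own; the elbow is the k after which further clusters stop paying for themselves.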
What is the silhouette coefficient used for in clustering?
The silhouette coefficient is a measure of how well each data point fits into its assigned cluster in terms of both cohesion and separation. It ranges from -1 to 1, where higher values indicate better clustering quality.
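A minimal sketch of computing the mean silhouette coefficient for a clustering, assuming scikit-learn (toy data for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two compact, well-separated groups.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.0],
              [8.0, 8.0], [8.1, 8.1], [7.9, 8.0]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all points: in [-1, 1], higher is better.
score = silhouette_score(X, labels)
```

Because the two groups here are tight and far apart, the score comes out close to 1; comparing this score across several candidate values of k is another common way to choose the number of clusters.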