Unsupervised Learning Flashcards
How would you define clustering? Can you name a few clustering algorithms (7)?
Clustering is the unsupervised task of grouping similar instances together. The notion of similarity depends on the task at hand. Popular clustering algorithms include:
1) K-means
2) DBSCAN
3) agglomerative clustering
4) BIRCH
5) mean-shift
6) affinity propagation
7) spectral clustering.
What are some of the main applications of clustering algorithms (8)?
1) data analysis
2) customer segmentation
3) recommender systems
4) search engines
5) image segmentation
6) semi-supervised learning
7) anomaly detection
8) novelty detection
Describe two techniques to select the right number of clusters when using K-means.
1) The elbow rule: plot the inertia as a function of the number of clusters, and find the point in the curve where the inertia stops dropping fast (the elbow).
2) The silhouette: plot the silhouette score as a function of the number of clusters. There will often be a peak, and the optimal number of clusters is generally nearby. Note the silhouette score is the mean silhouette coefficient over all instances.
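The two techniques above can be sketched with scikit-learn. This is a minimal example assuming a toy dataset from `make_blobs`; in practice you would plot `inertias` and `silhouettes` and inspect the elbow and the peak.

```python
# Sketch: choosing k via inertia (elbow rule) and silhouette score.
# Toy blob data is assumed here; substitute your own X.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = range(2, 9)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # for the elbow plot
    silhouettes.append(silhouette_score(X, km.labels_))   # for the silhouette plot

# The k with the highest mean silhouette coefficient is a good candidate
best_k = ks[silhouettes.index(max(silhouettes))]
```

Plotting `inertias` against `ks` shows the elbow; plotting `silhouettes` shows the peak.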
What is label propagation? Why would you implement it, and how?
Label propagation is a technique that consists of copying some (or all) of the labels from the labeled instances to similar unlabeled instances.
Since labeling a dataset is costly and time-consuming, label propagation might come in handy.
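One common implementation is cluster-based: cluster the data with K-means, have a human label only the instance closest to each centroid, then propagate those labels to the rest of each cluster. A minimal sketch, using the digits dataset as an assumed example:

```python
# Sketch of cluster-based label propagation (digits dataset assumed).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X, y = load_digits(return_X_y=True)
k = 50
kmeans = KMeans(n_clusters=k, n_init=3, random_state=42)
X_dist = kmeans.fit_transform(X)        # distance of each instance to each centroid

# Label only the k instances closest to each centroid
# (here we cheat and read them from y; in practice a human labels them)
representative_idx = np.argmin(X_dist, axis=0)
y_representative = y[representative_idx]

# Propagate each representative's label to every instance in its cluster
y_propagated = y_representative[kmeans.labels_]
```

With only `k` manual labels, every instance now has a (noisy but useful) label for semi-supervised training.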
Can you name two clustering algorithms that can scale to large datasets, and two that look for regions of high density?
Scale to large datasets: K-means and BIRCH.
Look for regions of high density: mean-shift and DBSCAN.
Can you think of a use case where active learning would be useful? How would you implement it?
Active learning is useful whenever you have plenty of unlabeled instances but labeling is costly. In this case (which is very common), rather than randomly selecting instances to label, it is often preferable to perform active learning, where human experts interact with the learning algorithm.
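One common active-learning strategy is uncertainty sampling: train on the small labeled pool, then send the instances the model is least confident about to a human expert. A minimal sketch under assumed toy data:

```python
# Sketch of uncertainty sampling (one active-learning strategy).
# Dataset and labeled/unlabeled split are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=42)
labeled = np.zeros(len(X), dtype=bool)
labeled[:50] = True                      # pretend only 50 instances are labeled

clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
proba = clf.predict_proba(X[~labeled])
uncertainty = 1 - proba.max(axis=1)      # low top-class probability = uncertain

# The 10 most uncertain unlabeled instances: ask a human to label these next
query_idx = np.argsort(uncertainty)[-10:]
```

After the expert labels the queried instances, they are added to the labeled pool and the loop repeats.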
What is the difference between anomaly detection and novelty detection?
In anomaly detection, the algorithm is trained on a dataset that may contain outliers, and the goal is typically to identify these outliers.
In novelty detection, the algorithm is trained on a dataset that is presumed to be “clean”, and the objective is to detect novelties strictly among new instances.
What is a Gaussian mixture? What tasks can you use it for?
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.
In other words, the assumption is that the data is grouped into a finite number of clusters, each with an ellipsoidal shape.
GMMs are useful for density estimation, clustering, and anomaly detection.
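All three tasks can be sketched with scikit-learn's `GaussianMixture`; the blob dataset and 2% anomaly threshold below are illustrative assumptions:

```python
# Sketch: one GaussianMixture used for clustering, density estimation,
# and anomaly detection (toy blob data assumed).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
gm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

labels = gm.predict(X)               # clustering: hard assignments
log_dens = gm.score_samples(X)       # density estimation: log-likelihood per instance
threshold = np.percentile(log_dens, 2)
anomalies = X[log_dens < threshold]  # anomaly detection: lowest-density ~2%
```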
Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?
Plot either the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) as a function of the number of clusters, and pick the number that minimizes it.
Or use a Bayesian Gaussian mixture model, which automatically selects the number of clusters.
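The BIC approach can be sketched as follows (toy blob data assumed; AIC works the same way via `gm.aic`):

```python
# Sketch: picking the number of GMM components by minimizing the BIC.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

bics = []
for k in range(1, 8):
    gm = GaussianMixture(n_components=k, n_init=3, random_state=42).fit(X)
    bics.append(gm.bic(X))

best_k = 1 + bics.index(min(bics))   # number of components with the lowest BIC
```

For the automatic alternative, `sklearn.mixture.BayesianGaussianMixture` zeroes out the weights of unnecessary components when `n_components` is set higher than needed.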
What is the main idea of K-means clustering? How does it work?
1) The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
2) You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.
What is the difference between hard and soft clustering?
When using hard clustering, each instance is assigned to a single cluster.
When using soft clustering, each instance is instead given a score per cluster. The score can be, for example, the distance between the instance and the centroid.
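With scikit-learn's `KMeans`, the two views correspond to `predict` (hard) and `transform` (soft, distance-based scores). A minimal sketch on assumed toy data:

```python
# Sketch contrasting hard and soft assignments with KMeans (toy blobs assumed).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

hard = kmeans.predict(X)     # shape (200,): one cluster index per instance
soft = kmeans.transform(X)   # shape (200, 3): distance to each centroid
```

The hard assignment is simply the cluster whose centroid is closest, i.e. `soft.argmin(axis=1)` matches `hard`.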
The K-means algorithm is not guaranteed to converge to the global minimum. How can we increase our chance of getting a good solution?
Centroid initialization: use the K-means++ algorithm, which has a smarter initialization step that tends to select centroids that are distant from one another. This makes it much less likely to converge to a suboptimal solution.
Note: the KMeans class uses this initialization method by default.
How can we accelerate the K-means algorithm (2)?
1) Elkan's algorithm: Charles Elkan used the triangle inequality (i.e., a straight line is always the shortest distance between two points) and kept track of lower and upper bounds on the distances between instances and centroids, avoiding many unnecessary distance computations. Note: the KMeans class uses this method by default.
2) Mini-batch K-means: instead of using the full dataset at each iteration, the algorithm uses mini-batches, moving the centroids just slightly at each iteration. This speeds up the algorithm by a factor of three or four. It is available as `from sklearn.cluster import MiniBatchKMeans`.
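Mini-batch K-means is used like regular K-means; this sketch assumes a larger toy dataset where the speedup matters:

```python
# Sketch: MiniBatchKMeans trades slightly worse inertia for much faster
# training on large datasets (toy data and batch size are assumptions).
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=42)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3,
                      random_state=42).fit(X)
labels = mbk.predict(X)
```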
The K-means algorithm aims to choose centroids that minimize the inertia. What is the formula for inertia, and what assumptions does it make?
The formula:
inertia = Σᵢ minⱼ ‖xᵢ − μⱼ‖², where μⱼ is the centroid of cluster j.
Inertia is a measure of how internally coherent clusters are.
Inertia makes the assumption that clusters are convex and isotropic (distances unchanged by translations and rotations), which is not always the case. It responds poorly to elongated clusters or manifolds with irregular shapes.
In k-means clustering, what is the silhouette score? How is it computed? What is it used for?
1) Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
2) The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.
3) By plotting the silhouette score against the number of clusters, we can make an accurate guess as to how many clusters are needed.
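The computation above maps directly onto scikit-learn: `silhouette_score` gives the mean coefficient and `silhouette_samples` gives the per-instance (b - a) / max(a, b) values. A minimal sketch on assumed toy data:

```python
# Sketch: silhouette score and per-sample coefficients for one value of k
# (toy blob data assumed).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

score = silhouette_score(X, kmeans.labels_)         # mean coefficient, in [-1, 1]
per_sample = silhouette_samples(X, kmeans.labels_)  # one coefficient per instance
```

Repeating this for several values of k and plotting `score` against k gives the peak mentioned above.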