Unsupervised Learning Flashcards
How would you define clustering? Can you name a few clustering algorithms (7)?
Clustering is the unsupervised task of grouping similar instances together. The notion of similarity depends on the task at hand. Popular clustering algorithms include:
1) K-means
2) DBSCAN
3) agglomerative clustering
4) BIRCH
5) mean-shift
6) affinity propagation
7) spectral clustering.
What are some of the main applications of clustering algorithms (8)?
1) data analysis
2) customer segmentation
3) recommender systems
4) search engines
5) image segmentation
6) semi-supervised learning
7) anomaly detection
8) novelty detection
Describe two techniques to select the right number of clusters when using K-means.
1) The elbow rule: plot the inertia as a function of the number of clusters, and find the point in the curve where the inertia stops dropping fast (the elbow).
2) The silhouette: plot the silhouette score as a function of the number of clusters. There will often be a peak, and the optimal number of clusters is generally nearby. Note the silhouette score is the mean silhouette coefficient over all instances.
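The two techniques above can be sketched with scikit-learn. This is a minimal example assuming a toy dataset from `make_blobs`; in practice you would plot `inertias` and `silhouettes` and inspect the elbow and the peak.

```python
# Sketch: choosing k via inertia (elbow rule) and silhouette score.
# Toy blob data is assumed here; substitute your own X.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = range(2, 9)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # for the elbow plot
    silhouettes.append(silhouette_score(X, km.labels_))   # for the silhouette plot

# The k with the highest mean silhouette coefficient is a good candidate
best_k = ks[silhouettes.index(max(silhouettes))]
```

Plotting `inertias` against `ks` shows the elbow; plotting `silhouettes` shows the peak.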
What is label propagation? Why would you implement it, and how?
Label propagation is a technique that consists of copying some (or all) of the labels from the labeled instances to similar unlabeled instances.
Since labeling a dataset is costly and time-consuming, label propagation might come in handy.
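One common implementation is cluster-based: cluster the data with K-means, have a human label only the instance closest to each centroid, then propagate those labels to the rest of each cluster. A minimal sketch, using the digits dataset as an assumed example:

```python
# Sketch of cluster-based label propagation (digits dataset assumed).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X, y = load_digits(return_X_y=True)
k = 50
kmeans = KMeans(n_clusters=k, n_init=3, random_state=42)
X_dist = kmeans.fit_transform(X)        # distance of each instance to each centroid

# Label only the k instances closest to each centroid
# (here we cheat and read them from y; in practice a human labels them)
representative_idx = np.argmin(X_dist, axis=0)
y_representative = y[representative_idx]

# Propagate each representative's label to every instance in its cluster
y_propagated = y_representative[kmeans.labels_]
```

With only `k` manual labels, every instance now has a (noisy but useful) label for semi-supervised training.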
Can you name two clustering algorithms that can scale to large datasets, and two that look for regions of high density?
Scale to large datasets: K-means and BIRCH.
Look for regions of high density: mean-shift and DBSCAN.
Can you think of a use case where active learning would be useful? How would you implement it?
Active learning is useful whenever you have plenty of unlabeled instances but labeling is costly. In this case (which is very common), rather than randomly selecting instances to label, it is often preferable to perform active learning, where human experts interact with the learning algorithm.
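One common active-learning strategy is uncertainty sampling: train on the small labeled pool, then send the instances the model is least confident about to a human expert. A minimal sketch under assumed toy data:

```python
# Sketch of uncertainty sampling (one active-learning strategy).
# Dataset and labeled/unlabeled split are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=42)
labeled = np.zeros(len(X), dtype=bool)
labeled[:50] = True                      # pretend only 50 instances are labeled

clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
proba = clf.predict_proba(X[~labeled])
uncertainty = 1 - proba.max(axis=1)      # low top-class probability = uncertain

# The 10 most uncertain unlabeled instances: ask a human to label these next
query_idx = np.argsort(uncertainty)[-10:]
```

After the expert labels the queried instances, they are added to the labeled pool and the loop repeats.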
What is the difference between anomaly detection and novelty detection?
In anomaly detection, the algorithm is trained on a dataset that may contain outliers, and the goal is typically to identify these outliers.
In novelty detection, the algorithm is trained on a dataset that is presumed to be “clean”, and the objective is to detect novelties strictly among new instances.
What is a Gaussian mixture? What tasks can you use it for?
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.
In other words, the assumption is that the data is grouped into a finite number of clusters, each with an ellipsoidal shape.
GMMs are useful for density estimation, clustering, and anomaly detection.
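All three tasks can be sketched with scikit-learn's `GaussianMixture`; the blob dataset and 2% anomaly threshold below are illustrative assumptions:

```python
# Sketch: one GaussianMixture used for clustering, density estimation,
# and anomaly detection (toy blob data assumed).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
gm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

labels = gm.predict(X)               # clustering: hard assignments
log_dens = gm.score_samples(X)       # density estimation: log-likelihood per instance
threshold = np.percentile(log_dens, 2)
anomalies = X[log_dens < threshold]  # anomaly detection: lowest-density ~2%
```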
Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?
Plot either the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) as a function of the number of clusters, and pick the number that minimizes it.
Or use a Bayesian Gaussian mixture model, which automatically selects the number of clusters.
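The BIC approach can be sketched as follows (toy blob data assumed; AIC works the same way via `gm.aic`):

```python
# Sketch: picking the number of GMM components by minimizing the BIC.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

bics = []
for k in range(1, 8):
    gm = GaussianMixture(n_components=k, n_init=3, random_state=42).fit(X)
    bics.append(gm.bic(X))

best_k = 1 + bics.index(min(bics))   # number of components with the lowest BIC
```

For the automatic alternative, `sklearn.mixture.BayesianGaussianMixture` zeroes out the weights of unnecessary components when `n_components` is set higher than needed.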
What is the main idea of K-means clustering? How does it work?
1) The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
2) You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.
What is the difference between hard and soft clustering?
When using hard clustering, each instance is assigned to a single cluster.
When using soft clustering, each instance is instead given a score per cluster. The score can be, for example, the distance between the instance and the centroid.
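With scikit-learn's `KMeans`, the two views correspond to `predict` (hard) and `transform` (soft, distance-based scores). A minimal sketch on assumed toy data:

```python
# Sketch contrasting hard and soft assignments with KMeans (toy blobs assumed).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

hard = kmeans.predict(X)     # shape (200,): one cluster index per instance
soft = kmeans.transform(X)   # shape (200, 3): distance to each centroid
```

The hard assignment is simply the cluster whose centroid is closest, i.e. `soft.argmin(axis=1)` matches `hard`.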
The K-means algorithm is not guaranteed to converge to the global minimum. How can we increase our chance of getting a good solution?
Centroid initialization: use the K-means++ algorithm, which has a smarter initialization step that tends to select centroids that are distant from one another. This makes it much less likely to converge to a suboptimal solution.
Note: the KMeans class uses this initialization method by default.
How can we accelerate the K-means algorithm (2)?
1) Elkan's algorithm: Charles Elkan used the triangle inequality (i.e., a straight line is always the shortest distance between two points) and kept track of lower and upper bounds on the distances between instances and centroids, avoiding many unnecessary distance computations. Note: the KMeans class uses this method by default.
2) Mini-batch K-means: instead of using the full dataset at each iteration, the algorithm uses mini-batches, moving the centroids just slightly at each iteration. This speeds up the algorithm by a factor of three or four. It is available as `from sklearn.cluster import MiniBatchKMeans`.
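Mini-batch K-means is used like regular K-means; this sketch assumes a larger toy dataset where the speedup matters:

```python
# Sketch: MiniBatchKMeans trades slightly worse inertia for much faster
# training on large datasets (toy data and batch size are assumptions).
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=42)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3,
                      random_state=42).fit(X)
labels = mbk.predict(X)
```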
The K-means algorithm aims to choose centroids that minimize the inertia. What is the formula for inertia, and what assumptions does it make?
The formula:
inertia = Σᵢ minⱼ ‖xᵢ − μⱼ‖², where μⱼ is the centroid of cluster j.
Inertia is a measure of how internally coherent clusters are.
Inertia makes the assumption that clusters are convex and isotropic (distances unchanged by translations and rotations), which is not always the case. It responds poorly to elongated clusters or manifolds with irregular shapes.
In k-means clustering, what is the silhouette score? How is it computed? What is it used for?
1) Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
2) The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.
3) By plotting the silhouette score against the number of clusters, we can make an accurate guess as to how many clusters are needed.
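The computation above maps directly onto scikit-learn: `silhouette_score` gives the mean coefficient and `silhouette_samples` gives the per-instance (b - a) / max(a, b) values. A minimal sketch on assumed toy data:

```python
# Sketch: silhouette score and per-sample coefficients for one value of k
# (toy blob data assumed).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

score = silhouette_score(X, kmeans.labels_)         # mean coefficient, in [-1, 1]
per_sample = silhouette_samples(X, kmeans.labels_)  # one coefficient per instance
```

Repeating this for several values of k and plotting `score` against k gives the peak mentioned above.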