Quick Fire Flashcards

1
Q

Batch Size

A

Batch size controls the accuracy of the estimate of the error gradient when training neural networks.
The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.

Batch Gradient Descent. Batch Size = Size of Training Set

Stochastic Gradient Descent. Batch Size = 1

Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

The error gradient is a statistical estimate. The more training examples used in the estimate, the more accurate this estimate will be and the more likely that the weights of the network will be adjusted in a way that will improve the performance of the model. The improved estimate of the error gradient comes at the cost of having to use the model to make many more predictions before the estimate can be calculated, and in turn, the weights updated.

Alternately, using fewer examples results in a less accurate estimate of the error gradient that is highly dependent on the specific training examples used.
This results in a noisy estimate that, in turn, results in noisy updates to the model weights, e.g. many updates with perhaps quite different estimates of the error gradient. Nevertheless, these noisy updates can result in faster learning and sometimes a more robust model.

Smaller batch sizes are used for two main reasons:

Smaller batch sizes are noisy, offering a regularizing effect and lower generalization error.

Smaller batch sizes make it easier to fit one batch worth of training data in memory (i.e. when using a GPU).
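A minimal sketch of how batch size determines the update schedule, using a hypothetical NumPy training loop for linear regression (the data, learning rate, and function shape are illustrative assumptions, not from the card):

import numpy as np

# toy data: t = 1000 samples, 5 features (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
batch_size = 32          # 1 -> stochastic GD, 1000 -> batch GD, otherwise mini-batch
learning_rate = 0.01
epochs = 10

for epoch in range(epochs):
    order = rng.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient estimated on this batch only
        w -= learning_rate * grad                    # one weight update per batch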

2
Q

Epoch

A

A hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.

3
Q

Trade off between batch size and learning rate

A

There are three regimes of batch size: batch (b = size of training set), stochastic (b = 1), and mini-batch (1 < b < size of training set).

Let batch size = b, number of epochs = e, and size of training data = t.
Weight updates are made at the end of every batch's error calculation, so there are e*(t/b) weight updates in total. If b is 1, then for the same set of other parameters there are t*e updates. If b is t, then there are only e weight updates. For any mini-batch size the number of updates falls in between.

In an example run with full-batch training, the model converges at an accuracy of about 0.81 and the test set does slightly better; the loss curve isn't too noisy. If we set b = 1 with the same learning rate, model performance is poor and does not converge, and the loss curve shows a zigzag pattern.

This indicates that making the learning rate smaller would help with training. Doing so shows model performance hitting 0.81 in a quarter of the number of epochs we had originally, which makes sense given the quick calculations above, although the curve still looks noisy and has not converged as fully.

We can do something similar by holding the learning rate constant instead and searching for a mini-batch size b that makes the model learn quickly and well.

More noisy updates to the model require a smaller learning rate, whereas less noisy more accurate estimates of the error gradient may be applied to the model more liberally. We can summarize this as follows:

Batch Gradient Descent: Use a relatively larger learning rate and more training epochs.

Stochastic Gradient Descent: Use a relatively smaller learning rate and fewer training epochs.

Batch size is a slider on the learning process.

Small values give a learning process that converges quickly at the cost of noise in the training process.

Large values give a learning process that converges slowly with accurate estimates of the error gradient.
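A quick illustration of the update-count arithmetic above, using assumed values for t, e, and b:

import math

t, e = 1000, 10                       # training set size and epochs (assumed values)
for b in (1, 32, t):                  # SGD, mini-batch, and batch gradient descent
    updates = e * math.ceil(t / b)    # one weight update per batch, per epoch
    print(f"batch size {b:4}: {updates} weight updates")
# batch size    1: 10000 weight updates
# batch size   32: 320 weight updates
# batch size 1000: 10 weight updates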

4
Q

Batch Gradient Descent

A

Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.
One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.

Upsides

Fewer updates to the model means this variant of gradient descent is more computationally efficient than stochastic gradient descent.

The decreased update frequency results in a more stable error gradient and may result in a more stable convergence on some problems.

The separation of the calculation of prediction errors and the model update lends the algorithm to parallel processing based implementations.

Downsides

The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.

The updates at the end of the training epoch require the additional complexity of accumulating prediction errors across all training examples.

Commonly, batch gradient descent is implemented in such a way that it requires the entire training dataset in memory and available to the algorithm.

Model updates, and in turn training speed, may become very slow for large datasets.

5
Q

Stochastic Gradient Descent

A

Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.
The update of the model for each training example means that stochastic gradient descent is often called an online machine learning algorithm.

Upsides

The frequent updates immediately give an insight into the performance of the model and the rate of improvement.

This variant of gradient descent may be the simplest to understand and implement, especially for beginners.

The increased model update frequency can result in faster learning on some problems.

The noisy update process can allow the model to avoid local minima (e.g. premature convergence).

Downsides

Updating the model so frequently is more computationally expensive than other configurations of gradient descent, taking significantly longer to train models on large datasets.

The frequent updates can result in a noisy gradient signal, which may cause the model parameters and in turn the model error to jump around (have a higher variance over training epochs).

The noisy learning process down the error gradient can also make it hard for the algorithm to settle on an error minimum for the model.

6
Q

Mini batch Gradient Descent

A

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.
Implementations may choose to sum the gradient over the mini-batch which further reduces the variance of the gradient.
Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.

Upsides

The model update frequency is higher than batch gradient descent which allows for a more robust convergence, avoiding local minima.

The batched updates provide a computationally more efficient process than stochastic gradient descent.

The batching allows both the efficiency of not having all training data in memory and the efficiency of batched algorithm implementations.

Downsides

Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.

Error information must be accumulated across mini-batches of training examples like batch gradient descent.

7
Q

Gradient Descent

A

Gradient descent is an optimization algorithm often used for finding the weights or coefficients of machine learning algorithms, such as artificial neural networks and logistic regression.
It works by having the model make predictions on training data and using the error on the predictions to update the model in such a way as to reduce the error.
The goal of the algorithm is to find model parameters (e.g. coefficients or weights) that minimize the error of the model on the training dataset. It does this by making changes to the model that move it along a gradient or slope of errors down toward a minimum error value. This gives the algorithm its name of “gradient descent.”
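A minimal sketch of one gradient descent loop, assuming a mean-squared-error objective for linear regression (the names, data, and learning rate are illustrative):

import numpy as np

def gradient_descent(X, y, learning_rate=0.1, steps=500):
    """Fit weights w to minimize mean squared error ||Xw - y||^2 / n."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        error = X @ w - y                     # predictions minus targets
        grad = 2 * X.T @ error / len(y)       # slope of the error surface
        w -= learning_rate * grad             # step downhill along the gradient
    return w

# usage with toy data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5])
print(gradient_descent(X, y))                 # approximately [2.0, -1.0, 0.5]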

8
Q

K means clustering

A

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster.

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS), i.e. the variance.

This is the same as maximizing the between-cluster sum of squares because total variance is constant

1. k initial "means" (e.g. k = 3) are randomly generated within the data domain.
2. k clusters are created by associating every observation with the nearest mean. The partitions represent the Voronoi diagram generated by the means.
3. The centroid of each of the k clusters becomes the new mean.
4. Repeat steps 2 and 3 until convergence or the maximum number of iterations has been reached.

The algorithm does not guarantee convergence to the global optimum; the result may depend on the initial clusters.

Look at formulas here…
https://en.m.wikipedia.org/wiki/K-means_clustering#:~:text=k%2Dmeans%20clustering%20is%20a,a%20prototype%20of%20the%20cluster.
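A minimal NumPy sketch of the Lloyd iteration described above (random initialization within the data domain; function and variable names are illustrative):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. random initial means within the data domain
    centers = rng.uniform(X.min(axis=0), X.max(axis=0), size=(k, X.shape[1]))
    for _ in range(iters):
        # 2. assign each observation to the nearest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. the centroid of each cluster becomes the new mean
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4. stop when the means no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels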

9
Q

K means initialization

A

The Forgy method randomly chooses k observations from the dataset and uses these as the initial means. The Forgy method tends to spread the initial means out.

The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points. Random Partition tends to place all of the initial means close to the center of the data set.

Kmeans++

The first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.
The exact algorithm is as follows:

Choose one center uniformly at random among the data points.

For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.

Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².

Repeat steps 2 and 3 until k centers have been chosen.

Now that the initial centers have been chosen, proceed using standard k-means clustering.
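A minimal NumPy sketch of the k-means++ seeding steps above (illustrative code, not a specific library's implementation):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: first center uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # step 2: D(x) = distance to the nearest already-chosen center
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                   axis=2) ** 2, axis=1)
        # step 3: pick a new center with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)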

10
Q

Drawbacks kmeans

A

- Choosing k manually. Mitigate with the elbow method, an information criterion, or the silhouette score.

- Being dependent on initial values.
For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding). For a full discussion of k-means seeding see "A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm" by M. Emre Celebi, Hassan A. Kingravi, Patricio A. Vela.

- Clustering data of varying sizes and density.
k-means has trouble clustering data where clusters are of varying sizes and density; it assumes a roughly spherical distribution. To cluster such data you need to generalize k-means, e.g. by allowing different cluster widths, or different widths per dimension, which gives elliptical instead of spherical clusters and improves the result.
http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Slides/clustering.pdf

k-means assumes that clusters are convex shaped.

-Clustering outliers.
Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.

-Scaling with number of dimensions.
As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data, or by using "spectral clustering" to modify the clustering algorithm as explained below.

https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages#:~:text=Disadvantages%20of%20k%2Dmeans&text=As%20increases%2C%20you%20need%20advanced,Means%20Clustering%20Algorithm%20by%20M.

11
Q

Spectral clustering

A

Spectral clustering avoids the curse of dimensionality by adding a pre-clustering step to your algorithm:

Reduce the dimensionality of feature data by using PCA.

Project all data points into the lower-dimensional subspace.

Cluster the data in this subspace by using your chosen algorithm.
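A minimal sketch of this pre-clustering recipe with scikit-learn, assuming k-means as the chosen algorithm and an arbitrary number of PCA components (both are assumptions, not values from the card):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 50))            # toy high-dimensional data

X_low = PCA(n_components=5).fit_transform(X)                   # steps 1-2: reduce and project
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X_low)    # step 3: cluster in the subspace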

12
Q

Mini batch k means

A

Its main idea is to use small random batches of examples of a fixed size so they can be stored in memory. Each iteration, a new random sample from the dataset is obtained and used to update the clusters, and this is repeated until convergence. Each mini-batch updates the clusters using a convex combination of the values of the prototypes and the examples, applying a learning rate that decreases with the number of iterations. This learning rate is the inverse of the number of examples assigned to a cluster during the process. As the number of iterations increases, the effect of new examples is reduced, so convergence can be detected when no changes in the clusters occur in several consecutive iterations.

Algorithm 1 Mini-Batch K-Means algorithm
Given: k, mini-batch size b, iterations t, data set X
Initialize each c ∈ C with an x picked randomly from X
v ← 0
for i ← 1 to t do
  M ← b examples picked randomly from X
  for x ∈ M do
    d[x] ← f(C, x) … cache the center in C nearest to x
  end
  for x ∈ M do
    c ← d[x] … retrieve the cached nearest center for x
    v[c] ← v[c] + 1 … update the per-center count
    η ← 1/v[c] … learning rate is the inverse of the center's count, placing more importance on early samples
    c ← (1 − η)c + ηx … convex update of the cluster center toward x
  end
end

https://www.google.com/url?sa=t&source=web&rct=j&url=https://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf&ved=2ahUKEwi91Zy298zvAhWSFlkFHQZ8BGIQFjANegQIDBAC&usg=AOvVaw3bEHzPtTrItiyfmfv-VCWm&cshid=1616733583017
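For reference, scikit-learn ships this algorithm as MiniBatchKMeans; a minimal usage sketch (parameter values are illustrative):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.default_rng(0).normal(size=(10000, 8))

mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, max_iter=100, n_init=3)
labels = mbk.fit_predict(X)          # centers updated from random mini-batches
print(mbk.cluster_centers_.shape)    # (5, 8)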

13
Q

Dbscan

A

Density based clustering

The algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure), and a set of non-core samples that are close to a core sample (but are not themselves core samples).

Let ε be a parameter specifying the radius of a neighborhood with respect to some point. MinPts is the minimum number of points that must be in that radius for point p to be considered a core point (the minimum number of points required to form a dense region).

A point p is a core point if at least minPts points are within distance ε of it (including p).

A point q is directly reachable from p if point q is within distance ε from core point p. Points are only said to be directly reachable from core points.

A point q is reachable from p if there is a path p1, …, pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi. Note that this implies that the initial point and all points on the path must be core points, with the possible exception of q.

All points not reachable from any other point are outliers or noise points.

Now if p is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it. Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its "edge", since they cannot be used to reach more points. See diagram here https://en.m.wikipedia.org/wiki/DBSCAN

Only core points can reach non-core points. Referring to the diagram: if a point lies within radius ε of a yellow (non-core) point but does not fall within the radius of any core point, it is not reachable by the definition above and is therefore an outlier too. So a non-core point may be reachable, but nothing can be reached from it.

DBSCAN algorithm can be abstracted into the following steps:[4]

Find the points in the ε (eps) neighborhood of every point, and identify the core points with more than minPts neighbors.

Find the connected components of core points on the neighbor graph, ignoring all non-core points.

Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor, otherwise assign it to noise.

A naive implementation of this requires storing the neighborhoods in step 1, thus requiring substantial memory. The original DBSCAN algorithm does not require this by performing these steps for one point at a time.
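A minimal scikit-learn usage sketch (the eps and min_samples values are illustrative and should be estimated from the data):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(300, 2))

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                        # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True  # True for core samples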

14
Q

Dbscan advantages and disadvantages

A

Advantages
DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.

DBSCAN can find arbitrarily-shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.

DBSCAN has a notion of noise, and is robust to outliers.

DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)

DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R* tree.

The parameters minPts and ε can be set by a domain expert, if the data is well understood

Disadvantages
DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed. For most data sets and domains, this situation does not arise often and has little impact on the clustering result:[4] both on core points and noise points, DBSCAN is deterministic. DBSCAN*[8] is a variation that treats border points as noise, and this way achieves a fully deterministic result as well as a more consistent statistical interpretation of density-connected components.

The quality of DBSCAN depends on the distance measure used in the function regionQuery(P, ε). The most common distance metric used is Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "curse of dimensionality", making it difficult to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on Euclidean distance.

DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters.[9]

If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.

15
Q

Parameter estimation for DBSCAN

A

MinPts: as a rule of thumb, minPts ≥ D + 1, where D is the dimensionality of the data; larger values (e.g. minPts = 2·D) are often better for noisy data.

ε: plot the distance to the k-th nearest neighbor (k = minPts) of every point, sorted in descending order (a k-distance graph), and choose ε at the "elbow" of this plot.

The distance function should also be chosen to suit the data, since it effectively defines the density.

16
Q

Parameter estimation for kmeans

A

Elbow method

Silhouette score or other metrics such as the Davies-Bouldin index…

17
Q

Affinity propagation

A

A clustering algorithm based on the concept of "message passing" between data points.

Similar to k-medoids, affinity propagation finds "exemplars," members of the input set that are representative of clusters.

Let x1 through xn be a set of data points, with no assumptions made about their internal structure, and let s be a function that quantifies the similarity between any two points, such that s(i, j) > s(i, k) iff xi is more similar to xj than to xk. For this example, the negative squared distance of two data points was used.

The diagonal of s, i.e. s(i, i), is particularly important, as it represents the instance preference, meaning how likely a particular instance is to become an exemplar.

The algorithm proceeds by alternating between two message-passing steps, which update two matrices:[1]

The "responsibility" matrix R has values r(i, k) that quantify how well-suited xk is to serve as the exemplar for xi, relative to other candidate exemplars for xi.

The "availability" matrix A contains values a(i, k) that represent how "appropriate" it would be for xi to pick xk as its exemplar, taking into account other points' preference for xk as an exemplar.

Both matrices are initialized to all zeroes, and can be viewed as log-probability tables. Iterations are performed until either the cluster boundaries remain unchanged over a number of iterations, or some predetermined number (of iterations) is reached. The exemplars are extracted from the final matrices as those whose 'responsibility + availability' for themselves is positive.

The main drawback of affinity propagation is its complexity. The algorithm has a time complexity of the order O(N²T), where N is the number of samples and T is the number of iterations until convergence. Further, the memory complexity is of the order O(N²) if a dense similarity matrix is used.

The damping parameter is to avoid oscillations during the iterative update

Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, ie of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities.

it does not enforce equal-size clusters, and it can choose automatically the number of clusters from the data
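A minimal scikit-learn usage sketch (the damping value is illustrative; preference defaults to the median similarity as described above):

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.random.default_rng(0).normal(size=(200, 2))

ap = AffinityPropagation(damping=0.9, preference=None, random_state=0).fit(X)
exemplars = ap.cluster_centers_indices_   # indices of the chosen exemplars
labels = ap.labels_
print(len(exemplars), "clusters found")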

18
Q

Mean shift clustering

A

Mean shift clustering aims to discover blobs in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region using the bandwidth parameter. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.

Mean shift is a procedure for locating the maxima of a density function given discrete data sampled from that function.

This is an iterative method that starts with an initial estimate x. Let a kernel function K(xi − x) be given; this function determines the weight of nearby points for re-estimation of the mean. Often a Gaussian kernel is used, K(xi − x) = exp(−c‖xi − x‖²). The mean is then estimated around x using its neighborhood N(x), the set of points for which K(xi − x) ≠ 0:

m(x) = Σ over xi ∈ N(x) of K(xi − x) xi / Σ over xi ∈ N(x) of K(xi − x)

The algorithm then sets x ← m(x) and repeats the estimate until m(x) converges or the maximum number of iterations is reached.

Procedure
To explain mean shift, consider a set of points in two-dimensional space. We begin with a circular sliding window centered at a point C (randomly selected) and having radius r as the kernel. Mean shift is a hill-climbing algorithm that involves shifting this kernel iteratively to a higher density region on each step until convergence.

At every iteration, the sliding window is shifted towards regions of higher density by shifting the center point to the mean of the points within the window (hence the name). The density within the sliding window is proportional to the number of points inside it. Naturally, by shifting to the mean of the points in the window it will gradually move towards areas of higher point density.

We continue shifting the sliding window according to the mean until there is no direction in which a shift can accommodate more points inside the kernel; that is, we keep moving the circle until we are no longer increasing the density (i.e. the number of points in the window).

This process of steps 1 to 3 is done with many sliding windows until all points lie within a window. When multiple sliding windows overlap the window containing the most points is preserved. The data points are then clustered according to the sliding window in which they reside.

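A minimal scikit-learn usage sketch; estimate_bandwidth is one common way to pick the single bandwidth parameter (the quantile value is illustrative):

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.random.default_rng(0).normal(size=(400, 2))

bandwidth = estimate_bandwidth(X, quantile=0.2)      # data-driven guess for the window size
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
print(ms.cluster_centers_, ms.labels_[:10])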

19
Q

Mean shift clustering advantages and disadvantages

A

Strengths
Mean shift is an application-independent tool suitable for real data analysis.

Does not assume any predefined shape on data clusters.

It is capable of handling arbitrary feature spaces.

The procedure relies on choice of a single parameter: bandwidth.

The bandwidth/window size ‘h’ has a physical meaning

In contrast to K-means clustering, there is no need to select the number of clusters as mean-shift automatically discovers this. That’s a massive advantage. The fact that the cluster centers converge towards the points of maximum density is also quite desirable as it is quite intuitive to understand and fits well in a naturally data-driven sense. The drawback is that the selection of the window size/radius “r” can be non-trivial

Weaknesses
The selection of a window size is not trivial.

Inappropriate window size can cause modes to be merged, or generate additional “shallow” modes.

Often requires using adaptive window size

Outliers are still given some cluster label anyway.

20
Q

Hierarchical clustering

A

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

Agglomerative is a bottom up approach. Divisive is top down.

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

Linkage
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.

Maximum or complete-linkage clustering: max { d(a, b) : a ∈ A, b ∈ B }, used as the distance between clusters A and B when deciding whether to merge/split.
Minimum or single-linkage clustering
Unweighted average linkage clustering (or UPGMA)
Weighted average linkage clustering (or WPGMA)
Centroid linkage clustering
Minimum energy clustering

Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row, j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. On each iteration, we combine two clusters into one. The two clusters to be combined are selected as those with the smallest linkage value, i.e. according to our selected distance metric and linkage these two clusters have the smallest distance between each other and therefore are the most similar and should be combined.

Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.

Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.

Average linkage minimizes the average of the distances between all observations of pairs of clusters.

Single linkage minimizes the distance between the closest observations of pairs of clusters.

Agglomerative clustering has a "rich get richer" behavior that leads to uneven cluster sizes. In this regard, single linkage is the worst strategy, and Ward gives the most regular sizes. However, the affinity (or distance used in clustering) cannot be varied with Ward, so for non-Euclidean metrics average linkage is a good alternative. Single linkage, while not robust to noisy data, can be computed very efficiently and can therefore be useful to provide hierarchical clustering of larger datasets. Single linkage can also perform well on non-globular data.

Connectivity constraints
Connectivity constraints can be added to this algorithm (only adjacent clusters can be merged together), through a connectivity matrix that defines for each sample the neighboring samples following a given structure of the data. For instance, in the classic swiss-roll example, the connectivity constraints forbid the merging of points that are not adjacent on the swiss roll, and thus avoid forming clusters that extend across overlapping folds of the roll. This a priori information can also improve speed. Connectivity constraints with single, complete or average linkage can enhance the 'rich getting richer' aspect of agglomerative clustering.

Notes on distance metric
L1 distance is often good for sparse features, or sparse noise: i.e. many of the features are zero, as in text mining using occurrences of rare words.

Cosine distance is interesting because it is invariant to global scalings of the signal.

The guidelines for choosing a metric is to use one that maximizes the distance between samples in different classes, and minimizes that within each class
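A minimal sketch of agglomerative clustering plus a dendrogram, using scikit-learn and SciPy (the linkage choices and number of clusters are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(50, 4))

# flat clustering with Ward linkage (Ward requires the Euclidean metric)
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# full hierarchy, visualized as a dendrogram
Z = linkage(X, method="average")   # other options: "single", "complete", "ward"
dendrogram(Z)
plt.show()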

21
Q

Hierarchical clustering advantages and disadvantages

A

Don’t need to specify number of clusters

We can use the dendrogram to visually identify the clusters based on our choice of linkage and metric.

A particularly good use case of hierarchical clustering methods is when the underlying data has a hierarchical structure and you want to recover the hierarchy.

Computation time is O(n³), much worse than k-means.

22
Q

Optics clustering

A

Ordering points to identify the clustering structure
The basic idea is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density.

OPTICS requires two parameters: ε, which describes the maximum distance (radius) to consider, and MinPts, describing the number of points required to form a cluster. A point p is a core point if at least MinPts points are found within its ε-neighborhood Nε(p) (including point p itself).

OPTICS also considers points that are part of a more densely packed cluster, so each point is assigned a core distance that describes the distance to the MinPts-th closest point: core-dist(p) is undefined if p is not a core point, and otherwise equals the distance from p to its MinPts-th nearest neighbor. The reachability distance of a point o from p is then max(core-dist(p), dist(p, o)), undefined if p is not a core point. See the reachability-distance and core-distance definitions on Wikipedia.

OPTICS is a generalization of DBSCAN that relaxes the eps requirement from a single value to a value range. The OPTICS algorithm builds a reachability graph, which assigns each sample both a reachability_ distance and a spot within the cluster ordering_ attribute; these two attributes are assigned when the model is fitted, and are used to determine cluster membership.

The HDBSCAN implementation is multithreaded, and has better algorithmic runtime complexity than OPTICS, at the cost of worse memory scaling. For extremely large datasets that exhaust system memory using HDBSCAN, OPTICS will maintain O(n) (as opposed to O(n²)) memory scaling.

Unlike DBSCAN, OPTICS keeps the cluster hierarchy for a variable neighborhood radius, and is better suited for usage on large datasets than the current sklearn implementation of DBSCAN.
Clusters are then extracted using a DBSCAN-like method (cluster_method='dbscan') or an automatic technique proposed in the original OPTICS paper.

The default cluster extraction with OPTICS looks at the steep slopes within the graph to find clusters, and the user can define what counts as a steep slope using the parameter xi. There are also other possibilities for analysis on the graph itself, such as generating hierarchical representations of the data through reachability-plot dendrograms; the hierarchy of clusters detected by the algorithm can be accessed through the cluster_hierarchy_ attribute.
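A minimal scikit-learn usage sketch (the min_samples and xi values are illustrative):

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.default_rng(0).normal(size=(300, 2))

opt = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.05).fit(X)
labels = opt.labels_                               # -1 marks noise, as in DBSCAN
reachability = opt.reachability_[opt.ordering_]    # values for a reachability plot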

23
Q

Clustering performance evaluation- Rand Index

A

Rand index

  • A function that measures the similarity of two assignments, ignoring permutations.
  • The two assignments are labels_true and labels_pred for each sample.
  • The Rand index does not ensure a value close to 0.0 for a random labelling. The adjusted Rand index corrects for chance and will give such a baseline.
  • Furthermore, both rand_score and adjusted_rand_score are symmetric: swapping the arguments does not change the scores. They can thus be used as consensus measures.
  • Perfect labeling is scored 1.0.
  • Poorly agreeing labels (e.g. independent labelings) have lower scores, and for the adjusted Rand index the score will be negative or close to zero. However, for the unadjusted Rand index the score, while lower, will not necessarily be close to zero.

Adjusted RI = (RI − E[RI]) / (max(RI) − E[RI])

Let there be two partitions (clusterings) X and Y of a set S of n elements, and define:

a, the number of pairs of elements in S that are in the same subset in X and in the same subset in Y

b, the number of pairs of elements in S that are in different subsets in X and in different subsets in Y

c, the number of pairs of elements in S that are in the same subset in X and in different subsets in Y

d, the number of pairs of elements in S that are in different subsets in X and in the same subset in Y

The Rand index, R, is:[1][2]
RI = (a + b) / (a + b + c + d) = (a + b) / C(n, 2)

The adjusted Rand index is normalized to take chance into account.

See example:

https://davetang.org/muse/2017/09/21/the-rand-index/
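A minimal usage sketch of the scikit-learn functions named above (the label values are illustrative):

from sklearn.metrics import rand_score, adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print(rand_score(labels_true, labels_pred))            # unadjusted RI, in [0, 1]
print(adjusted_rand_score(labels_true, labels_pred))   # ARI, ~0 for random labelings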

24
Q

Clustering performance evaluation- mutual information score

A

Basically it is the expected information gain. The expected value of the information gain is the mutual information I(X; A) of X and A – i.e. the reduction in the entropy of X achieved by learning the state of the random variable A.

The mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable.

Mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other.

MI(X; Y) = D_KL( P(X, Y) ‖ P(X) ⊗ P(Y) )

For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X is a deterministic function of Y and Y is a deterministic function of X, then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa. As a result, in this case the mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X). Moreover, this mutual information is the same as the entropy of X and as the entropy of Y.

Adjusted MI (AMI) is additionally normalized against chance.

Advantages

Random (uniform) label assignments have an AMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw mutual information or the V-measure, for instance).

Upper bound of 1: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, an AMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).

Drawbacks

Contrary to inertia, MI-based measures require knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However MI-based measures can also be useful in purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.

NMI and MI are not adjusted against chance.

Mutual information and also the normalized variant are not adjusted for chance and will tend to increase as the number of different labels (clusters) increases, regardless of the actual amount of "mutual information" between the label assignments.
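A minimal usage sketch of the scikit-learn scores discussed here (the label values are illustrative):

from sklearn.metrics import (mutual_info_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

print(mutual_info_score(labels_true, labels_pred))             # raw MI, in nats
print(normalized_mutual_info_score(labels_true, labels_pred))  # NMI, in [0, 1], not chance-adjusted
print(adjusted_mutual_info_score(labels_true, labels_pred))    # AMI, ~0 for random labelings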

25
Q

Entropy

A

The entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent in the variable's possible outcomes.

  • H(X) = −Σ over all x in X of P(x) log P(x)

The basic idea of information theory is that the “informational value” of a communicated message depends on the degree to which the content of the message is surprising. If an event is very probable, it is no surprise (and generally uninteresting) when that event happens as expected; hence transmission of such a message carries very little new information. However, if an event is unlikely to occur, it is much more informative to learn that the event happened or will happen

Cool properties
I(p)ismonotonically decreasinginp: an increase in the probability of an event decreases the information from an observed event, and vice versa.

I(p) ≥ 0: information is anon-negativequantity.

I(1) = 0: events that always occur do not communicate information.

I(p1,p2) = I(p1) + I(p2): the information learned fromindependent eventsis the sum of the information learned from each event.

Conditional Entropy
The conditional entropy quantifies the amount of information needed to describe the outcome of a random variable Y given that the value of another random variable X is known.

H(Y | X) = −Σ over all values of x and y of p(x, y) log( p(x, y) / p(x) )
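A small sketch computing entropy from a discrete distribution, with scipy.stats.entropy as a cross-check (the probabilities are illustrative):

import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.25, 0.125, 0.125])        # a discrete distribution over 4 outcomes

H = -np.sum(p * np.log2(p))                    # H(X) = -sum p(x) log2 p(x), in bits
print(H)                                       # 1.75 bits
print(entropy(p, base=2))                      # same value via SciPy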

26
Q

Marginal probability

A

Given a known joint distribution of two discrete random variables, say X and Y, the marginal distribution of either variable – X for example – is the probability distribution of X when the values of Y are not taken into consideration. This can be calculated by summing the joint probability distribution over all values of Y. Naturally, the converse is also true: the marginal distribution can be obtained for Y by summing over the separate values of X.

p(x_i) = Σ over j = 1 to J of p(x_i, y_j), with x_i fixed
p(y_j) = Σ over k = 1 to K of p(x_k, y_j), with y_j fixed

27
Q

Joint probability

A

The joint probability mass function of two discrete random variables X and Y:
p(x, y) = P(X=x and Y=y), the probability that X is x and Y is y at the same time

Using conditional distributions:
p(x, y) = P(X=x | Y=y) × P(Y=y), the probability that X is x given Y is y, times the marginal distribution of Y

or

p(x, y) = P(Y=y | X=x) × P(X=x), the probability that Y is y given X is x, times the marginal distribution of X

For independent variables
p(x,y)=P(X=x)P(Y=y)

This means that acquiring any information about the value of one or more of the random variables leads to a conditional distribution of any other variable that is identical to its unconditional (marginal) distribution; thus no variable provides any information about any other variable.

28
Q

Difference betweenmarginal and conditional probability

A

The marginal probability is the probability of a single event occurring, independent of other events. A conditional probability, on the other hand, is the probability that an event occurs given that another specific event has already occurred. This means that the calculation for one variable is dependent on another variable.

P(Y=y | X=x) = p(x, y) / P(X=x)

Example, you had a table like this: https://en.m.wikipedia.org/wiki/Marginal_distribution

Suppose there is data from a classroom of 200 students on the amount of time studied (X) and the percent correct (Y).[4] Assuming that X and Y are discrete random variables, the joint distribution of X and Y can be described by listing all the possible values of p(xi, yj).

The conditional distribution can be used to determine the probability that a student scored 20 or below given that they studied for 60 minutes or more,

while the marginal distribution can be used to determine the proportion of students that scored 20 or below, regardless of study time.

See these images:
https://courses.cs.cornell.edu/cs2800/wiki/index.php/Conditional_probability
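A tiny worked example with an assumed 2×2 joint probability table (the values are made up for illustration):

import numpy as np

# joint = P(X=x, Y=y); rows are values of X, columns are values of Y
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

marginal_X = joint.sum(axis=1)           # P(X=x), summing over Y -> [0.4, 0.6]
marginal_Y = joint.sum(axis=0)           # P(Y=y), summing over X -> [0.3, 0.7]

# conditional P(Y=y | X=x) = P(X=x, Y=y) / P(X=x), computed row-wise
cond_Y_given_X = joint / marginal_X[:, None]
print(cond_Y_given_X)                    # each row sums to 1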

29
Q

KL Divergence

A

Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution.

D_KL( P ‖ Q ) = Σ over all x of P(x) log( P(x) / Q(x) )

In the context of machine learning, D_KL(P ‖ Q) is often called the information gain achieved if P would be used instead of Q which is currently used. By analogy with information theory, it is called the relative entropy of P with respect to Q. In the context of coding theory, D_KL(P ‖ Q) can be constructed by measuring the expected number of extra bits required to code samples from P using a code optimized for Q rather than the code optimized for P.
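A small sketch computing the KL divergence of two assumed discrete distributions, cross-checked with scipy.stats.entropy (which returns the KL divergence when given two distributions):

import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.3, 0.2])     # "true" distribution P
q = np.array([0.4, 0.4, 0.2])     # reference distribution Q

kl = np.sum(p * np.log(p / q))    # D_KL(P || Q), in nats
print(kl)
print(entropy(p, q))              # same value via SciPy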

30
Q

Information gain and gini impurity

A

Synonymous with KL divergence or mutual information, depending on context.

The amount of information gained about a random variable or signal from observing another random variable. However, in the context of decision trees, the term is sometimes used synonymously with mutual information, which is the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one.

In general terms, the expected information gain is the change in information entropy H from a prior state to a state that takes some information as given:

IG(T, a) = H(T) − H(T | a)
H(T | a) is the conditional entropy of T given attribute a.
The expected information gain is the mutual information, meaning that on average, the reduction in the entropy of T is the mutual information.

Drawbacks - favors high-cardinality features. Example: information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has a high mutual information, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before (overfitting).

https://en.m.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. The Gini impurity can be computed by summing the probability p_i of an item with label i being chosen times the probability Σ over k ≠ i of p_k = 1 − p_i of a mistake in categorizing that item: Gini = Σ_i p_i (1 − p_i) = 1 − Σ_i p_i².

How Gini impurity is used to decide splits in a decision tree: the goal is to reduce impurity (a measure of uncertainty, like entropy) at every split, i.e. you want to choose the feature and split with the highest information gain (the largest impurity decrease).

https://www.displayr.com/how-is-splitting-decided-for-decision-trees/
https://en.m.wikipedia.org/wiki/Information_gain_in_decision_trees (see example)
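A small sketch of Gini impurity and the impurity decrease for a candidate split, written from the formulas above (the label arrays are illustrative):

import numpy as np

def gini(labels):
    """Gini impurity = 1 - sum_i p_i^2 over the label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Weighted reduction in Gini impurity achieved by a split."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]                           # a perfect split on some feature
print(gini(parent), impurity_decrease(parent, left, right))    # 0.5, 0.5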

31
Q

Random forests and ensemble learning

A

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean/average prediction (regression) of the individual trees.

Decision tree
Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".

Ensemble learning
Bagging
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, …, xn with responses Y = y1, …, yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:
For b = 1, …, B:

Sample, with replacement, n training examples from X, Y; call these Xb, Yb.

Train a classification or regression tree fb on Xb, Yb.

After training, predictions for unseen samples x′ can be made by averaging the predictions from all the individual regression trees on x′:
f̂(x′) = (1/B) Σ from b = 1 to B of fb(x′)
or by taking the majority vote in the case of classification trees.
This bootstrapping procedure leads to better model performance because it decreases thevarianceof the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.

RF also does "feature bagging", which selects, at each candidate split in the learning process, a random subset of the features. The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated.
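A minimal scikit-learn usage sketch (the hyperparameter values and toy dataset are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,       # B, the number of bagged trees
    max_features="sqrt",    # feature bagging: random subset of features per split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))       # training accuracy (use a held-out set in practice)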

32
Q

Clustering performance evaluation- v measure

A
Given the knowledge of the ground truth class assignments of the samples, it is possible to define some intuitive metric using conditional entropy analysis.
In particular Rosenberg and Hirschberg (2007) define the following two desirable objectives for any cluster assignment:

homogeneity: each cluster contains only members of a single class.
completeness: all members of a given class are assigned to the same cluster.

Homogeneity and completeness scores are formally given by:

h=1−H(C|K)/H(C)

c=1−H(K|C)/H(K)

where H(C|K) is the conditional entropy of the classes given the cluster assignments and is given by:

H(C|K) = −Σ from c = 1 to |C|, Σ from k = 1 to |K| of (n_ck / n) × log( n_ck / n_k )

and H(C) is the entropy of the classes and is given by:

H(C) = −Σ from c = 1 to |C| of (n_c / n) × log( n_c / n )

with n the total number of samples, n_c and n_k the number of samples respectively belonging to class c and cluster k, and finally n_ck the number of samples from class c assigned to cluster k.
The conditional entropy of clusters given class, H(K|C), and the entropy of clusters, H(K), are defined in a symmetric manner.

The V-measure is the harmonic mean of homogeneity and completeness, and it is computed very similarly to NMI.

v_measure_score is symmetric: it can be used to evaluate the agreement of two independent assignments on the same dataset.
This is not the case for completeness_score and homogeneity_score: both are bound by the relationship:

homogeneity_score(a, b) == completeness_score(b, a)

Advantages

Bounded scores: 0.0 is as bad as it can be, 1.0 is a perfect score.

Intuitive interpretation: clustering with a bad V-measure can be qualitatively analyzed in terms of homogeneity and completeness to better feel what 'kind' of mistakes is done by the assignment.

No assumption is made on the cluster structure: it can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with results of spectral clustering algorithms, which can find clusters with "folded" shapes.

Drawbacks

The previously introduced metrics are not normalized with regard to random labeling: this means that depending on the number of samples, clusters and ground truth classes, a completely random labeling will not always yield the same values for homogeneity, completeness and hence V-measure. In particular, random labeling won't yield zero scores, especially when the number of clusters is large.
This problem can safely be ignored when the number of samples is more than a thousand and the number of clusters is less than 10. For smaller sample sizes or larger numbers of clusters it is safer to use an adjusted index such as the Adjusted Rand Index (ARI).
Also requires knowledge of ground truth classes
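A minimal usage sketch of the scikit-learn functions named above (the label values are illustrative):

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 1, 1, 2]

h = homogeneity_score(labels_true, labels_pred)    # does each cluster hold a single class?
c = completeness_score(labels_true, labels_pred)   # does each class sit in a single cluster?
v = v_measure_score(labels_true, labels_pred)      # harmonic mean of h and c
print(h, c, v)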

33
Q

Cluster evaluation measure -silhouette score

A

Used when the ground truth is not known; the evaluation is performed using the model itself.

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring cluster

A higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:

a: the mean distance between a sample and all other points in the same class.
b: the mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient s for a single sample is then given as:

s=(b−a)/max(a,b)

Advantages

The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.

The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

Drawbacks

The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.
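A minimal usage sketch, scoring a k-means clustering with scikit-learn (the data and k are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(300, 4))
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print(silhouette_score(X, labels))   # mean coefficient over samples, in [-1, 1]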

34
Q

Cluster evaluation metric - Calinski-Harabasz index

A

If the ground truth labels are not known, the Calinski-Harabasz index (sklearn.metrics.calinski_harabasz_score) - also known as the Variance Ratio Criterion - can be used to evaluate the model, where a higher Calinski-Harabasz score relates to a model with better defined clusters.
The index is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters (where dispersion is defined as the sum of distances squared):

Advantages

The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.

The score is fast to compute.

Drawbacks

The Calinski-Harabasz index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.

Mathematical formulation

For a set of data E of size n_E which has been clustered into k clusters, the Calinski-Harabasz score s is defined as the ratio of the between-clusters dispersion mean and the within-cluster dispersion:

s = ( tr(B_k) / tr(W_k) ) × ( (n_E − k) / (k − 1) )

where tr(B_k) is the trace of the between-group dispersion matrix and tr(W_k) is the trace of the within-cluster dispersion matrix, defined by:

W_k = Σ from q = 1 to k, Σ over x ∈ C_q of (x − c_q)(x − c_q)^T

B_k = Σ from q = 1 to k of n_q (c_q − c_E)(c_q − c_E)^T

with C_q the set of points in cluster q, c_q the center of cluster q, c_E the center of E, and n_q the number of points in cluster q.
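A minimal usage sketch with the scikit-learn function named above (the data and k are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.default_rng(0).normal(size=(500, 4))
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)

print(calinski_harabasz_score(X, labels))   # higher means better-defined clusters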

35
Q

Other Cluster evaluation

A

Contingency matrix. It reports the cardinality of each class with respect to each cluster.
            CLUSTER 1   CLUSTER 2
CLASS A         2           0
CLASS B         4           3

This says that cluster 1 has 2 samples in class A and 4 in class B.
The contingency matrix provides sufficient statistics for all clustering metrics where the samples are independent and identically distributed and one doesn’t need to account for some instances not being clustered.

Comparison-pair matrix…similar to Rand index but in matrix form.
pair confusion matrix (sklearn.metrics.cluster.pair_confusion_matrix) is a 2x2 similarity matrix

C=[C00 C01
C10 C11]

between two clusterings computed by considering all pairs of samples and counting pairs that are assigned into the same or into different clusters under the true and predicted clusterings.
It has the following entries:

C00: number of pairs with both clusterings having the samples not clustered together
C10: number of pairs with the true label clustering having the samples clustered together but the other clustering not having the samples clustered together
C01: number of pairs with the true label clustering not having the samples clustered together but the other clustering having the samples clustered together
C11: number of pairs with both clusterings having the samples clustered together

Considering a pair of samples that is clustered together a positive pair, then as in binary classification the count of true negatives is C00, false negatives is C10, true positives is C11, and false positives is C01.
Perfectly matching labelings have all non-zero entries on the diagonal regardless of actual label values
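A minimal usage sketch of both matrices with scikit-learn (the label values are illustrative):

from sklearn.metrics.cluster import contingency_matrix, pair_confusion_matrix

labels_true = [0, 0, 1, 1, 1]
labels_pred = [0, 0, 0, 1, 1]

print(contingency_matrix(labels_true, labels_pred))     # classes x clusters counts
print(pair_confusion_matrix(labels_true, labels_pred))  # 2x2 matrix [[C00, C01], [C10, C11]]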

36
Q

Cluster evaluation metric - Davies bouldin index

A

If the ground truth labels are not known, the Davies-Bouldin index (sklearn.metrics.davies_bouldin_score) can be used to evaluate the model, where a lower Davies-Bouldin index relates to a model with better separation between the clusters.
This index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters themselves.
Zero is the lowest possible score. Values closer to zero indicate a better partition.

The index is defined as the average similarity between each cluster Ci, for i = 1, …, k, and its most similar cluster Cj. In the context of this index, similarity is defined as a measure Rij that trades off:

si, the average distance between each point of cluster i and the centroid of that cluster – also known as the cluster diameter. Other norms can be used as well, as long as the distance metric matches the metric used in the clustering scheme itself for the results to be meaningful. https://en.m.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index

dij, the distance between cluster centroids i and j.

A simple choice to construct Rij so that it is nonnegative and symmetric is:

Rij = (si + sj) / dij

Then the Davies-Bouldin index is defined as:

DB = (1/k) Σ from i = 1 to k of ( max Rij over j ≠ i )

Due to the way it is defined, as a function of the ratio of the within-cluster scatter to the between-cluster separation, a lower value means that the clustering is better. It is the average similarity between each cluster and its most similar one, averaged over all the clusters, where the similarity Rij is defined as above. This affirms the idea that no cluster should be similar to another, and hence the best clustering scheme essentially minimizes the Davies–Bouldin index. It can also be used to help choose k in k-means.

Advantages
The computation of Davies-Bouldin is simpler than that of Silhouette scores.

The index is computed using only quantities and features inherent to the dataset.

Drawbacks

The Davies-Bouldin index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained from DBSCAN.

The usage of centroid distance limits the distance metric to Euclidean space.

Another drawback is that a good value reported by this method does not imply the best information retrieval.
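A minimal sketch of using the Davies-Bouldin index to compare choices of k (lower is better; the data and the range of k are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.default_rng(0).normal(size=(500, 4))

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))   # pick the k with the lowest score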

37
Q

Tf-idf

A

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

Term frequency- inverse document frequency

TF-IDF (term frequency-inverse document frequency) was invented for document search and information retrieval. It works by increasing proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don’t mean much to that document in particular.

TF-IDF for a word in a document is calculated by multiplying two different metrics:

The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of instances a word appears in a document. Then, there are ways to adjust the frequency: by the length of a document, or by the raw frequency of the most frequent word in a document.

The inverse document frequency of the word across a set of documents. This means how common or rare a word is in the entire document set. The closer it is to 0, the more common the word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm.

So, if the word is very common and appears in many documents, this number will approach 0; if the word is rare, it will be large.

Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.
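A minimal sketch of the two factors under the simplest definitions (raw term count and log of inverse document frequency); the toy documents are illustrative, and real libraries such as scikit-learn's TfidfVectorizer use slightly smoothed variants:

import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)          # raw count, normalized by document length

def idf(term, docs):
    containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / containing)        # rare words get a larger value

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("cat", docs[0], docs))   # distinctive word -> higher score
print(tfidf("the", docs[0], docs))   # appears in most docs -> lower score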

38
Q

Gower distance

A

Can be used to compute a "distance" for data containing nominal (categorical) variables, or a mix of numeric and categorical variables.

39
Q

Gaussian mixture models

A

One of the major drawbacks of K-Means is its naive use of the mean value for the cluster center. To see why this isn't the best way of doing things, consider two circular clusters with different radii centered at the same mean: K-Means can't handle this because the mean values of the clusters are very close together. K-Means also fails in cases where the clusters are not circular/convex, again as a result of using the mean as the cluster center.

Gaussian Mixture Models (GMMs) give us more flexibility than K-Means. With GMMs we assume that the data points are Gaussian distributed; this is a less restrictive assumption than saying they are circular by using the mean. That way, we have two parameters to describe the shape of the clusters: the mean and the standard deviation! Taking an example in two dimensions, this means that the clusters can take any kind of elliptical shape (since we have a standard deviation in both the x and y directions). Thus, each Gaussian distribution is assigned to a single cluster.

To find the parameters of the Gaussian for each cluster (e.g the mean and standard deviation), we will use an optimization algorithm called Expectation–Maximization (EM)

We begin by selecting the number of clusters (like K-Means does) and randomly initializing the Gaussian distribution parameters for each cluster. One can try to provide a good guesstimate for the initial parameters by taking a quick look at the data, though this isn't strictly necessary: the Gaussians start out as very poor estimates but are quickly optimized.

Given these Gaussian distributions for each cluster, compute the probability that each data point belongs to a particular cluster. The closer a point is to the Gaussian’s center, the more likely it belongs to that cluster. This should make intuitive sense since with a Gaussian distribution we are assuming that most of the data lies closer to the center of the cluster.

Based on these probabilities, we compute a new set of parameters for the Gaussian distributions such that we maximize the probabilities of data points within the clusters. We compute these new parameters using a weighted sum of the data point positions, where the weights are the probabilities of the data point belonging in that particular cluster.

Steps 2 and 3 are repeated iteratively until convergence, where the distributions don’t change much from iteration to iteration.

Ads
Firstly GMMs are a lot moreflexiblein terms ofcluster covariancethan K-Means; due to the standard deviation parameter, the clusters can take on any ellipse shape, rather than being restricted to circles. K-Means is actually a special case of GMM in which each cluster’s covariance along all dimensions approaches 0. 
Secondly, since GMMs use probabilities, they can have multiple clusters per data point. So if a data point is in the middle of two overlapping clusters, we can simply define its class by saying it belongs X-percent to class 1 and Y-percent to class 2. I.e GMMs supportmixedmembership.

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
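A minimal scikit-learn usage sketch fitting a GMM with EM and reading out the soft (mixed) memberships described above (parameter values are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
hard_labels = gmm.predict(X)          # most likely component per point
soft_labels = gmm.predict_proba(X)    # mixed membership: probability per component
print(gmm.means_, gmm.covariances_.shape)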

40
Q

CNN

A

Also known as shift invariant or space invariant artificial neural networks (SIANN), based on the shared-weight architecture of the convolution kernels that shift over input features and provide translation-equivariant responses known as feature maps.