Quick Fire Flashcards
Batch Size
Batch size controls the accuracy of the estimate of the error gradient when training neural networks.
The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.
Batch Gradient Descent. Batch Size = Size of Training Set
Stochastic Gradient Descent. Batch Size = 1
Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
The error gradient is a statistical estimate. The more training examples used in the estimate, the more accurate this estimate will be and the more likely that the weights of the network will be adjusted in a way that will improve the performance of the model. The improved estimate of the error gradient comes at the cost of having to use the model to make many more predictions before the estimate can be calculated, and in turn, the weights updated.
Alternately, using fewer examples results in a less accurate estimate of the error gradient that is highly dependent on the specific training examples used.
This results in a noisy estimate that, in turn, results in noisy updates to the model weights, e.g. many updates with perhaps quite different estimates of the error gradient. Nevertheless, these noisy updates can result in faster learning and sometimes a more robust model.
Smaller batch sizes are used for two main reasons:
Smaller batch sizes are noisy, offering a regularizing effect and lower generalization error.
Smaller batch sizes make it easier to fit one batch worth of training data in memory (i.e. when using a GPU).
Epoch
Hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.
Trade-off between batch size and learning rate
There are three regimes of batch size: batch size = 1, 1 < batch size < full training set, and batch size = full training set.
Let’s say batch size = b
Let’s say epoch = e
And size of Training data = t
Weight updates are made at the end of every batch's error calculation, so a full training run makes e·(t/b) weight updates in total.
If b is 1, for the same set of other parameters, there are t·e updates. If b is t, there are only e weight updates. Any mini-batch size falls in between.
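A quick check of that arithmetic with made-up numbers (t, e, and the batch sizes are hypothetical):

# Hypothetical numbers: t training examples, e epochs, batch size b.
t, e = 1000, 10
for b in (1, 32, t):                 # stochastic, mini-batch, batch
    updates = e * (t // b)           # one update per batch, t // b batches per epoch
    print(f"batch size {b:>4}: {updates} weight updates")
# batch size    1: 10000 weight updates
# batch size   32: 310 weight updates
# batch size 1000: 10 weight updates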
With batch-style training, the model converges at 0.81 and the test set does slightly better; the loss curve isn't too noisy. If we set b = 1 with the same learning rate, model performance is poor and does not converge, and the accuracy curve oscillates in a zigzag pattern.
This indicates that making the learning rate smaller would help with training. Doing so, the model hits 0.81 in roughly a quarter of the number of epochs we needed originally, which makes sense based on the quick calculation above, although the curve still looks noisy and hasn't converged as tightly.
We can do something similar where we instead hold the learning rate constant and search for a mini-batch size b that makes the model learn quickly and well.
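A minimal sketch of that kind of experiment, assuming Keras and a small classification network (X_train, y_train, the layer sizes, and the learning-rate values are placeholders, not the original setup):

# Sweep batch size and learning rate; watch how quickly and how smoothly the model learns.
from tensorflow import keras

def run(batch_size, learning_rate, epochs=200):
    model = keras.Sequential([
        keras.layers.Dense(50, activation="relu", input_shape=(2,)),
        keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # history.history["accuracy"] shows how noisy the learning curve is
    return model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)

run(batch_size=len(X_train), learning_rate=0.01)   # batch GD: smooth but slow
run(batch_size=1,            learning_rate=0.001)  # b = 1: needs a smaller learning rate
run(batch_size=32,           learning_rate=0.01)   # mini-batch: usually the sweet spot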
More noisy updates to the model require a smaller learning rate, whereas less noisy more accurate estimates of the error gradient may be applied to the model more liberally. We can summarize this as follows:
Batch Gradient Descent: Use a relatively larger learning rate and more training epochs.
Stochastic Gradient Descent: Use a relatively smaller learning rate and fewer training epochs.
Batch size is a slider on the learning process.
Small values give a learning process that converges quickly at the cost of noise in the training process.
Large values give a learning process that converges slowly with accurate estimates of the error gradient.
Batch Gradient Descent
Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.
One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.
Upsides
Fewer updates to the model means this variant of gradient descent is more computationally efficient than stochastic gradient descent.
The decreased update frequency results in a more stable error gradient and may result in a more stable convergence on some problems.
The separation of the calculation of prediction errors and the model update lends the algorithm to parallel processing based implementations.
Downsides
The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.
The updates at the end of the training epoch require the additional complexity of accumulating prediction errors across all training examples.
Commonly, batch gradient descent is implemented in such a way that it requires the entire training dataset in memory and available to the algorithm.
Model updates, and in turn training speed, may become very slow for large datasets.
Stochastic Gradient Descent
Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.
The update of the model for each training example means that stochastic gradient descent is often called an online machine learning algorithm.
Upsides
The frequent updates immediately give an insight into the performance of the model and the rate of improvement.
This variant of gradient descent may be the simplest to understand and implement, especially for beginners.
The increased model update frequency can result in faster learning on some problems.
The noisy update process can allow the model to avoid local minima (e.g. premature convergence).
Downsides
Updating the model so frequently is more computationally expensive than other configurations of gradient descent, taking significantly longer to train models on large datasets.
The frequent updates can result in a noisy gradient signal, which may cause the model parameters and in turn the model error to jump around (have a higher variance over training epochs).
The noisy learning process down the error gradient can also make it hard for the algorithm to settle on an error minimum for the model.
Mini-Batch Gradient Descent
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.
Implementations may choose to sum the gradient over the mini-batch which further reduces the variance of the gradient.
Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.
Upsides
The model update frequency is higher than batch gradient descent which allows for a more robust convergence, avoiding local minima.
The batched updates provide a computationally more efficient process than stochastic gradient descent.
The batching allows both the efficiency of not having all training data in memory and the simplicity of common algorithm implementations.
Downsides
Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.
Error information must be accumulated across mini-batches of training examples like batch gradient descent.
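To make the three variants above concrete, here is a minimal NumPy sketch in which the only difference between them is the batch size fed to each update; linear regression with squared error is used as a stand-in model, and all names are assumptions:

import numpy as np

def gradient_descent(X, y, batch_size, lr=0.01, epochs=100):
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            error = X[batch] @ w - y[batch]
            grad = X[batch].T @ error / len(batch)   # gradient estimated on this batch only
            w -= lr * grad                           # one weight update per batch
    return w

# batch_size=len(X) -> batch GD, batch_size=1 -> SGD, anything in between -> mini-batch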
Gradient Descent
Gradient descent is an optimization algorithm often used for finding the weights or coefficients of machine learning algorithms, such as artificial neural networks and logistic regression.
It works by having the model make predictions on training data and using the error on the predictions to update the model in such a way as to reduce the error.
The goal of the algorithm is to find model parameters (e.g. coefficients or weights) that minimize the error of the model on the training dataset. It does this by making changes to the model that move it along a gradient or slope of errors down toward a minimum error value. This gives the algorithm its name of “gradient descent.”
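In symbols (w = model parameters, η = learning rate, E(w) = error on the training data), each gradient descent step is:

w_{t+1} = w_t - \eta \, \nabla_w E(w_t)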
K-means clustering
Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), serving as a prototype of the cluster.
Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS) (i.e. variance).
This is the same as maximizing the between-cluster sum of squares, because the total variance is constant.
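Written out (with μ_i the mean of the points in S_i), the objective is:

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 \;=\; \arg\min_{S} \sum_{i=1}^{k} |S_i| \, \mathrm{Var}(S_i)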
1. k initial "means" (in this case k = 3) are randomly generated within the data domain (shown in color).
2. k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3. The centroid of each of the k clusters becomes the new mean.
4. Repeat steps 2 and 3 until convergence or the maximum number of iterations has been reached.
The algorithm does not guarantee convergence to the global optimum. The result may depend on the initial clusters.
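A minimal NumPy sketch of steps 1-4 above (Lloyd's algorithm; the function name, seed, and iteration cap are assumptions):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. k initial means drawn randomly within the data domain
    centers = rng.uniform(X.min(axis=0), X.max(axis=0), size=(k, X.shape[1]))
    for _ in range(max_iter):
        # 2. associate every observation with the nearest mean
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # 3. the centroid of each cluster becomes the new mean (keep the old mean if a cluster empties)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4. repeat 2 and 3 until convergence or max iterations
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels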
Look at formulas here…
https://en.m.wikipedia.org/wiki/K-means_clustering#:~:text=k%2Dmeans%20clustering%20is%20a,a%20prototype%20of%20the%20cluster.
K means initialization
The Forgy method randomly chooses k observations from the dataset and uses these as the initial means. The Forgy method tends to spread the initial means out.
The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points. Random Partition tends to place all of the initial means close to the center of the data set.
K-means++
The first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.
The exact algorithm is as follows:
Choose one center uniformly at random among the data points.
For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
Repeat steps 2 and 3 until k centers have been chosen.
Now that the initial centers have been chosen, proceed using standard k-means clustering.
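A NumPy sketch of the seeding steps above (function name and seed handling are assumptions):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. first center chosen uniformly at random among the data points
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # 2. D(x)^2: squared distance from each point to its nearest already-chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)) ** 2).sum(axis=2), axis=1)
        # 3. choose the next center with probability proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    # 4./5. repeat until k centers are chosen, then run standard k-means with them
    return np.array(centers)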
Drawbacks of k-means
-Choosing k manually. Elbow method, information criterion, silhouette score.
-Being dependent on initial values.
For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding). For a full discussion of k-means seeding, see A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm by M. Emre Celebi, Hassan A. Kingravi, Patricio A. Vela.
-Clustering data of varying sizes and density.
k-means has trouble clustering data where clusters are of varying sizes and density. It assumes a spherical distribution. To cluster such data, you need to generalize k-means as described in the Advantages section of the Google guide linked below.
(Figure notes from that guide.) Center plot: allow different cluster widths, resulting in more intuitive clusters of different sizes.
Right plot: besides different cluster widths, allow different widths per dimension, resulting in elliptical instead of spherical clusters, improving the result.
http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Slides/clustering.pdf
k-means assumes that clusters are convex shaped.
-Clustering outliers.
Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
-Scaling with number of dimensions.
As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data, or by using "spectral clustering" to modify the clustering algorithm as explained below.
https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages#:~:text=Disadvantages%20of%20k%2Dmeans&text=As%20increases%2C%20you%20need%20advanced,Means%20Clustering%20Algorithm%20by%20M.
Spectral clustering
Spectral clustering avoids the curse of dimensionality by adding a pre-clustering step to your algorithm:
Reduce the dimensionality of feature data by using PCA.
Project all data points into the lower-dimensional subspace.
Cluster the data in this subspace by using your chosen algorithm.
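A sketch of those three steps with scikit-learn, with k-means standing in for "your chosen algorithm" (the n_components and k values are placeholders):

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def reduced_kmeans(X, n_components=10, k=5):
    X_low = PCA(n_components=n_components).fit_transform(X)    # steps 1-2: reduce and project
    return KMeans(n_clusters=k, n_init=10).fit_predict(X_low)  # step 3: cluster in the subspace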
Mini-batch k-means
Its main idea is to use small random batches of examples of a fixed size so they can be stored in memory. Each iteration, a new random sample from the dataset is obtained and used to update the clusters, and this is repeated until convergence. Each mini-batch updates the clusters using a convex combination of the values of the prototypes and the examples, applying a learning rate that decreases with the number of iterations. This learning rate is the inverse of the number of examples assigned to a cluster during the process. As the number of iterations increases, the effect of new examples is reduced, so convergence can be detected when no changes in the clusters occur in several consecutive iterations.
Algorithm 1 Mini Batch K-Means algorithm
Given: k, mini-batch size b, iterations t, data set X
Initialize each c ∈ C with an x picked randomly from X
v ← 0
for i ← 1 to t do
M ← b examples picked randomly from X
for x ∈ M do
d[x] ← f(C,x) … cache the centroid nearest to x (computed from the distances between x and all centroids)
end
for x ∈ M do
c ← d[x] … retrieve the cached nearest centroid for x
v[c] ← v[c] + 1 …. update cluster count
η ← 1/v[c] … learning rate is the inverse of the number of examples assigned to c so far, placing more importance on earlier samples
c ← (1-η)c+ηx… convex update of cluster center
end
end
https://www.google.com/url?sa=t&source=web&rct=j&url=https://upcommons.upc.edu/bitstream/handle/2117/23414/R13-8.pdf&ved=2ahUKEwi91Zy298zvAhWSFlkFHQZ8BGIQFjANegQIDBAC&usg=AOvVaw3bEHzPtTrItiyfmfv-VCWm&cshid=1616733583017
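A NumPy sketch of Algorithm 1 (function and variable names are my own, not from the paper):

import numpy as np

def mini_batch_kmeans(X, k, b, t, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # init each c with a random x from X
    v = np.zeros(k)                                                 # per-center counts, v <- 0
    for _ in range(t):
        M = X[rng.choice(len(X), size=b)]                           # mini-batch of b examples
        nearest = np.argmin(((M[:, None, :] - C) ** 2).sum(axis=2), axis=1)  # cache nearest center per x
        for x, c in zip(M, nearest):
            v[c] += 1                                               # update per-center count
            eta = 1.0 / v[c]                                        # per-center learning rate
            C[c] = (1 - eta) * C[c] + eta * x                       # convex update of the center
    return C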
DBSCAN
Density based clustering
The algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means, which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure), and a set of non-core samples that are close to a core sample (but are not themselves core samples).
Let ε be a parameter specifying the radius of a neighborhood with respect to some point. minPts is the minimum number of points that must be within that radius for a point p to be considered a core point (the minimum number of points required to form a dense region).
A point p is a core point if at least minPts points are within distance ε of it (including p).
A point q is directly reachable from p if point q is within distance ε from core point p. Points are only said to be directly reachable from core points.
A point q is reachable from p if there is a path p1, …, pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi. Note that this implies that the initial point and all points on the path must be core points, with the possible exception of q.
All points not reachable from any other point are outliers or noise points.
Now if p is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it. Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its "edge", since they cannot be used to reach more points. See the diagram here: https://en.m.wikipedia.org/wiki/DBSCAN
Only core points can reach non-core points. Looking at the diagram: if a point were within the radius ε of a yellow (non-core) point but did not fall within the radius of any core point (i.e. not reachable by the definition above), it would be an outlier too. So a non-core point may be reachable, but nothing can be reached from it.
The DBSCAN algorithm can be abstracted into the following steps:[4]
Find the points in the ε (eps) neighborhood of every point, and identify the core points with more than minPts neighbors.
Find the connected components of core points on the neighbor graph, ignoring all non-core points.
Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor, otherwise assign it to noise.
A naive implementation of this requires storing the neighborhoods in step 1, thus requiring substantial memory. The original DBSCAN algorithm does not require this by performing these steps for one point at a time.
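A compact sketch of those three steps: the naive O(n²) version that stores all neighborhoods, as noted above (function and variable names are assumptions):

import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # step 1: eps-neighborhoods
    core = np.array([len(nb) >= min_pts for nb in neighbors])       # step 1: core points
    labels = np.full(n, -1)                                         # -1 = unvisited / noise
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        queue = deque([i])
        while queue:                                                # step 2: connected component of core points
            p = queue.popleft()
            if not core[p]:
                continue                                            # border points do not expand the cluster
            for q in neighbors[p]:
                if labels[q] == -1:                                 # step 3: attach neighbors to this cluster
                    labels[q] = cluster
                    queue.append(q)
        cluster += 1
    return labels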
DBSCAN advantages and disadvantages
Advantages
DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.
DBSCAN can find arbitrarily-shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
DBSCAN has a notion of noise, and is robust to outliers.
DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)
DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R*-tree.
The parameters minPts and ε can be set by a domain expert if the data is well understood.
Disadvantages
DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed. For most data sets and domains, this situation does not arise often and has little impact on the clustering result:[4] both on core points and noise points, DBSCAN is deterministic. DBSCAN*[8] is a variation that treats border points as noise, and this way achieves a fully deterministic result as well as a more consistent statistical interpretation of density-connected components.
The quality of DBSCAN depends on the distance measure used in the function regionQuery(P, ε). The most common distance metric used is Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "curse of dimensionality", making it difficult to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on Euclidean distance.
DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters.[9]
If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.
Parameter estimation for DBSCAN
Common heuristics: set minPts ≥ D + 1, where D is the number of dimensions of the data (a rule of thumb is minPts = 2·D, larger for noisy data), then choose ε from the elbow of the sorted k-distance graph, i.e. each point's distance to its (minPts − 1)-th nearest neighbor, sorted in ascending order. A sketch of that computation follows.
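A sketch of the k-distance computation, assuming scikit-learn (the min_pts value is a placeholder):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, min_pts=4):
    dist, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
    # each point is its own nearest neighbor, so the last column is the (min_pts - 1)-th true neighbor
    return np.sort(dist[:, -1])   # plot this sorted curve and read eps off the elbow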
Parameter estimation for k-means
Elbow method
Silhouette score or other metrics such as the Davies-Bouldin index…
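A sketch of scanning k with those criteria, assuming scikit-learn (the k range is a placeholder):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

def score_k(X, k_values=range(2, 11)):
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        results[k] = {
            "inertia": km.inertia_,                                 # for the elbow method
            "silhouette": silhouette_score(X, km.labels_),          # higher is better
            "davies_bouldin": davies_bouldin_score(X, km.labels_),  # lower is better
        }
    return results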