Machine Learning Flashcards
adversarial training
AT: “training a model in a worst-case scenario, with inputs chosen by an adversary”
Adversarial training is often used to enforce constraints on random variables
GAN
GAN is a generative model that learns the probability distribution (or data distribution) of the training examples it is given.
From this distribution, we can then create sample outputs. GANs have seen their largest progress with image training examples, but this idea of modeling data distributions is one that can be applied with other forms of input
=> the key mathematical tool GANs give you is the ability to “estimate a ratio”
==> GANs are generative models that use supervised learning to approximate an intractable cost function by estimating ratios
discriminator network
t
convolutional neural network
a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery
deconvolutional neural network
t
backpropagation
In principle, all backpropagation does is (stochastic) gradient descent -> this converges to a local minimum, which is often surprisingly good
Backpropagation is a method used in artificial neural networks to calculate a gradient that is needed in the calculation of the weights to be used in the network. It is commonly used to train deep neural networks, a term used to explain neural networks with more than one hidden layer
inpainting
In the digital world, inpainting (also known as image interpolation or video interpolation) refers to the application of sophisticated algorithms to replace lost or corrupted parts of the image data (mainly small regions or to remove small defects)
damage and repair strategy
t
pretext task
t
autoencoder
is an artificial neural network used for unsupervised learning of efficient codings (ie feature learning)
The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction
low-level statistics
e.g. unusual local texture patterns
variability
t
epoch
An epoch is one complete presentation of the data set to be learned to a learning machine. Learning machines like feedforward neural nets that use iterative algorithms often need many epochs during their learning phase.
adversarial examples
inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines
-> in computer vision: usually an image formed by making small perturbations to an example image from a dataset
tensor
multidimensional array
GAN
The dueling-neural-network approach has vastly improved learning from unlabeled data.
spatial resolution
Spatial resolution is a term that refers to the number of pixels utilized in construction of a digital image. Images having higher spatial resolution are composed with a greater number of pixels than those of lower spatial resolution.
saccade
is a quick, simultaneous movement of both eyes between two or more phases of fixation in the same direction
no free lunch theorem
“averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points”
in other words, the most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data-generating distributions.
But if we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions
==> the no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task
hypothesis space
e.g. linear regression has a hypothesis space consisting of the set of linear functions of its input
regularization
we can regularize a model that learns a function f(x; θ) by adding a penalty called a regularizer to the cost function
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization
loss function
when we minimize the objective function, we may also call it the cost function or loss function
objective function
the function we want to minimize or maximize, also called criterion
=> when we are minimizing the objective function, we may also call it the cost function or loss function
squared L^2 norm
can be calculated simply as x^Tx
-> is more convenient to work with mathematically and computationally than the L^2 norm itself
sigma function
sigmoid function (logistic curve)
- > if x is very negative, the output is close to 0
- > if x is very positive, the output is close to 1
- > increases steadily around x = 0
eigenvalue
an eigenvector of a square matrix A is a nonzero vector v s.t. multiplication by A alters only the scale of v:
Av = cv
where c is a scalar known as the eigenvalue corresponding to this eigenvector
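To make the definition concrete, here is a minimal NumPy check (my own illustration, not from any particular source) that Av = cv holds for the eigenpairs returned by np.linalg.eig:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# np.linalg.eig returns the eigenvalues and the corresponding eigenvectors (as columns)
eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    c = eigenvalues[i]
    # A @ v should equal c * v (up to floating-point error)
    assert np.allclose(A @ v, c * v)

print(eigenvalues)  # [2. 3.]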
eigendecomposition
tbd
Orthogonal matrix
is a square matrix with real entries whose columns and rows are orthogonal unit vectors (i.e., orthonormal vectors):
Q^TQ = QQ^T = I
where I is identity matrix.
This leads to the equivalent characterization: a matrix Q is orthogonal if its transpose is equal to its inverse:
Q^T = Q^-1
singular matrix
a square matrix with linearly dependent columns
unit vector
A Unit Vector has a magnitude of 1
sigma function
sigmoid function (logistic curve)
ReLU
-> rectified linear unit
-> the rectifier is an activation function defined as the positive part of its argument:
f(x) = max(0,x)
-> a unit employing the rectifier is also called a rectified linear unit (ReLU)
-> easier to train than sigmoid
ReLU(a) = max(0,a)
rectified (German: gleichgerichtet)
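As a quick illustration (a minimal sketch; the function names are my own), both ReLU and the sigmoid are one-liners in NumPy:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))      # [0. 0. 0. 1. 5.]
print(sigmoid(x))   # ≈ [0.007 0.269 0.5 0.731 0.993]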
loss function is non-convex?
the optimization is prone to falling into local minima
neural networks are mostly used with non-linear activation functions (e.g. sigmoid), hence the optimization becomes non-convex
Gram matrix
the Gram matrix (Gramian matrix or Gramian) of a set of vectors v_1,…v_n in an inner product space is the Hermitian matrix of inner products
An important application is to compute linear independence:
a set of vectors is linearly independent if and only if the Gram determinant (the determinant of the Gram matrix) is non-zero
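A small NumPy sketch (my own example) of the linear-independence test via the Gram determinant:

import numpy as np

# Rows of V are the vectors v_1, ..., v_n (real case, standard inner product)
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 3.0]])   # v_3 = v_1 + v_2, so the set is linearly dependent

G = V @ V.T                       # Gram matrix of pairwise inner products <v_i, v_j>

print(np.linalg.det(G))           # ~0: the Gram determinant vanishes => linear dependence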
artifact
any error in the perception or representation of any information, introduced by the involved equipment or technique(s)
dilated convolution
vanilla convolutions struggle to integrate global context: the effective receptive field of units can only grow linearly with layers. This is very limiting, especially for high-resolution input images. Dilated convolutions to the rescue: even though the number of parameters grows only linearly with layers, the effective receptive field of units grows exponentially with layer depth
They can be very useful in some settings in conjunction with 0-dilated filters because they allow you to merge spatial information across the inputs much more aggressively with fewer layers
http://www.inference.vc/dilated-convolutions-and-kronecker-factorisation/
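A tiny calculation (my own sketch; the helper name is made up) showing how the effective extent of a dilated filter, d*(f-1)+1, grows with the dilation rate d:

def dilated_filter_extent(filter_size, dilation):
    # Effective spatial extent of a dilated filter: d*(f-1) + 1
    return dilation * (filter_size - 1) + 1

# Stacking 3x3 filters with dilations 1, 2, 4, ... grows the receptive field
# exponentially with depth while the parameter count grows only linearly.
for d in [1, 2, 4, 8]:
    print(d, dilated_filter_extent(3, d))   # 3, 5, 9, 17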
second derivative
tells us how well we can expect a gradient descent step to perform
-> it can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point
Taylor series
a Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point
A function can be approximated by using a finite number of terms of its Taylor series.
Taylor’s theorem gives quantitative estimates on the error introduced by the use of such an approximation
Applications of eigenvalues
At a critical point, where ∇_x f(x) = 0, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point.
-> When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum.
Applications of eigendecomposition
Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions
univariate/multidimensional second derivative test
allows us to determine, at a critical point, whether the critical point is a local maximum, local minimum, or saddle point
Newton’s method
problem: poor condition number, choosing a good step size is difficult -> step size too small, too little significant progress made
- > This issue can be resolved by using information from the Hessian matrix to guide the search
- > Newton’s method is based on using a second-order Taylor series expansion to approximate f(x) near some point x^(0)
- > When f is a positive definite quadratic function, Newton’s method consists of applying equation 4.12 once to jump to the minimum of the function directly
- > When f is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton’s method consists of applying equation 4.12 multiple times.
- > Iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point
first-order / second-order optimization algorithms
Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that also use the Hessian matrix, such as Newton’s method, are called second-order optimization algorithms
convex function
a function for which the Hessian is positive semidefinite everywhere
- Such functions are well-behaved because they lack saddle points and all of their local minima are necessarily global minima.
However, most problems in deep learning are difficult to express in terms of convex optimization
positive semidefinite
A matrix whose eigenvalues are all positive or zero-valued
positive definite
A matrix whose eigenvalues are all positive
line search
The line search approach first finds a descent direction along which the objective function f will be reduced and then computes a step size that determines how far x should move along that direction.
The descent direction can be computed by various methods, such as gradient descent, Newton’s method. The step size can be determined either exactly or inexactly.
determinant of a square matrix
det(A):
- is a function mapping matrices to real scalars
- The determinant is equal to the product of all the eigenvalues of the matrix.
- The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space.
- If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume.
- If the determinant is 1, then the transformation preserves volume
frequentist probability
related directly to the rates at which events occur
Bayesian probability
related to qualitative levels of certainty (also: degree of belief)
Image synthesis
is the process of creating new images from some form of image description
Regression vs Classification
Regression: the output variable takes continuous values.
Regression involves estimating or predicting a response.
Classification: the output variable takes class labels. Classification is identifying group membership.
image patch
A patch is a small (generally rectangular) piece of an image. For example, an 8x8 patch is a square patch containing 64 pixels of a larger image (of size say, 256x256 pixels). Due to the smaller size, some image processing algorithms such as denoising/super resolution etc. are easier to operate on patches rather than on the entire image itself. These algorithms split an image into several smaller sized patches (of size say, 8x8), operate individually on each of these patches, and finally tile all these patches at their respective locations.
VGG
a deep convolutional network from the Visual Geometry Group; its pretrained features are a de facto standard (e.g. for perceptual losses) in image generation tasks
classification network
neural networks are used for the purpose of
- clustering through unsupervised learning,
- classification through supervised learning, or
- regression
That is, they help group unlabeled data, categorize labeled data or predict continuous values.
Deep embeddings
answer one simple question: How similar are two images? Learning these embeddings is the bedrock of verification, zero-shot learning, and visual search
feature space
Feature space refers to the n-dimensions where your variables live (not including a target variable, if it is present). The term is used often in ML literature because a task in ML is feature extraction, hence we view all variables as features
channel dimension
Images are usually represented as Height x Width x #Channels where #Channels is 3 for RGB images and 1 for grayscale images. Sometimes you see Width x Height x #Channels, but the third dimension is the “channels.”
Inception Score
a recently proposed and widely used evaluation metric for generative models
Compressed sensing
is a signal processing technique for efficiently acquiring and reconstructing a signal, by finding solutions to underdetermined linear systems
probability mass vs density function
a probability distribution over continuous random variables is described by a probability density function (PDF)
A probability distribution over discrete variables may be described using a probability mass function (PMF)
expectation
expectation or expected value of some function f(x) with respect to a probability distribution P (x) is the average or mean value that f takes on when x is drawn from P
variance
variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution
When the variance is low, the values of f (x) cluster near their expected value.
The square root of the variance is known as the standard deviation
standard deviation
The square root of the variance is known as the standard deviation
covariance
covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables
covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, (i.e., the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, (i.e., the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
generative model
tbd
Voxel occupancy
is one approach for reconstructing the 3-dimensional shape of an object from multiple views
A voxel represents a value on a regular grid in three-dimensional space. As with pixels in a bitmap, voxels themselves do not typically have their position (their coordinates) explicitly encoded along with their values. Instead, rendering systems infer the position of a voxel based upon its position relative to other voxels
cross-entropy loss
A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.
This is also called a negative log-likelihood function. Or logistic loss or cross-entropy loss.
multi-layer perceptron
tbd
multi-layer RNN
Generally, current RNN models can only be stacked to 2 or 3 layers. Beyond 3 layers, performance may drop, generally because of the vanishing gradient problem in RNNs.
perplexity
tbd
softmax function/classifier
It takes a vector of arbitrary real-valued scores (in z) and squashes it to a vector of values between zero and one that sum to one
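A minimal NumPy implementation (my own sketch) of that squashing, including the usual max-subtraction for numerical stability:

import numpy as np

def softmax(z):
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the result because softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())   # values in (0, 1) that sum to 1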
dropout
technique to help mitigate the problem of overfitting: randomly drop out some nodes/neurons in hidden layers during training so that they don’t participate in producing the output and later help the model to better generalize
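A hedged sketch of the common "inverted dropout" variant (the function name and keep_prob parameter are my own choices, not from the card above):

import numpy as np

def dropout(activations, keep_prob=0.8, training=True):
    # Inverted dropout: randomly zero units during training and rescale by 1/keep_prob
    # so the expected activation stays the same; at test time, do nothing.
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < keep_prob)
    return activations * mask / keep_prob

a = np.ones((4, 5))
print(dropout(a, keep_prob=0.8))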
Mask R-CNN - Extending Faster R-CNN for Pixel Level Segmentation
Much like Fast R-CNN and Faster R-CNN, Mask R-CNN’s underlying intuition is straightforward. Given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel-level segmentation?
Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object.
object detection
Object detection is the task of finding the different objects in an image and classifying them
R-CNN (1st version)
The goal of R-CNN is to take in an image, and correctly identify where the main objects (via a bounding box) in the image are
Inputs: Image
Outputs: Bounding boxes + labels for each object in the image
R-CNN creates these bounding boxes, or region proposals, using a process called Selective Search. At a high level, Selective Search looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, color, or intensity to identify objects. Once the proposals are created, R-CNN warps the region to a standard square size and passes it through to a modified version of AlexNet. On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that simply classifies whether this is an object, and if so what object.
R-CNN in short (1st version)
- Generate a set of proposals for bounding boxes.
- Run the images in the bounding boxes through a pre-trained AlexNet and finally an SVM to see what object the image in the box is.
- Run the box through a linear regression model to output tighter coordinates for the box once the object has been classified.
image instance segmentation
The goal of image instance segmentation is to identify, at a pixel level, what the different objects in a scene are.
embedding
Embedding == representation
- simply means projecting an input into another more convenient representation space
multimodal
(of a frequency curve or distribution) having several modes or maxima
mode = the value in a set of data values that appears most often; it is the value x at which the probability mass function takes its maximum value, i.e. the value that is most likely to be sampled
L1 Loss Function
used to minimize the error which is the sum of all the absolute differences between the true value and the predicted value.
activation function
purpose: introduce non-linearity into the network
in turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables
non-linear means that the output cannot be reproduced from a linear combination of the inputs
another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because summing these layers would give you just another linear function
bias
- bias increases the flexibility of the model
- bias determines if a neuron is activated
So our neuron only “activates” (has a non-zero output value) when w^Tx + b > 0, which is equivalent to w^Tx > −b. So the bias term for a neuron acts as an activation threshold in our setup (ReLU nonlinearities). Since we adaptively learn these bias terms via backpropagation, we may interpret this as allowing our neurons to learn when to activate and when not to activate.
batchnorm
- normalize the output from the activation function
- occurs on a per batch basis
- can speed up learning
- means that esp. later layers are not shifted as much by earlier layers
- has a slight regularization effect
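A minimal NumPy sketch of the training-time forward pass only (my own illustration; running statistics, the learnable-parameter updates and the backward pass are omitted):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, features). Normalize each feature over the batch dimension,
    # then scale and shift with the learnable parameters gamma and beta.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 8) * 5 + 3          # batch of 32 samples, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 mean, ~1 std per feature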
internal representation
- Features or, more in general, an internal representation or a hierarchy of concepts should be learned automatically
- The internal representation should separate all factors of variation (i.e., concepts that summarize important variation of the data)
-> Deep Learning Introduces hierarchical representations (from simple to complex, from low-level features to high-level features)
Deep Learning
Introduces hierarchical representations
top-5 score
…
probability density vs mass function
probability density function (if x is continuous) or a probability mass function (if x is discrete)
Classification vs regression accuracy
tbd
average likelihood
tbd
Reinforcement
data is dynamically gathered based on previous experience
Unsupervised
data is composed of just x; here we typically aim for p(x) or a method to sample p(x)
predictor function vs loss function
….
Bayes risk, empirical risk
…
MLE
Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.
E.g. assuming a Gaussian distribution: Maximum likelihood estimation is a method that will find the values of μ and σ that result in the curve that best fits the data.
“This is the normal distribution that has been “fit” to the data by using the maximum likelihood estimations for the mean and the standard deviation”
https://www.youtube.com/watch?v=XepXtl9YKwc
=> “this is how we fit a distribution to data”
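A small NumPy example (my own sketch) of fitting a Gaussian by maximum likelihood, i.e. using the sample mean and the (biased) sample standard deviation:

import numpy as np

# Draw samples from a "true" Gaussian, then recover mu and sigma by maximum likelihood.
np.random.seed(0)
data = np.random.normal(loc=5.0, scale=2.0, size=10_000)

mu_hat = data.mean()                 # MLE of the mean
sigma_hat = data.std()               # MLE of the std (divides by N, not N-1)

print(mu_hat, sigma_hat)             # close to 5.0 and 2.0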
features
- intermediate representation
-
SVM
Aim is to find a separation between two classes with the largest gap (margin) possible
Unsupervised learning
is associated with density estimation, learning to draw samples from a distribution, learning to denoise data from some distribution, finding a manifold that the data lies near to, or clustering the data into groups of related examples.
(batch) gradient descent vs. mini-batch gradient descent
BGD allows you to take one gradient descent step per epoch (= a single pass through the training set), while mini-batch GD allows you to take t gradient descent steps, where t * m = size of training set and m is the size of a mini-batch, e.g. t = 5000, m = 1000, #trainingSet = 5M
when you have a large training set, mini-batch GD runs much faster
size of mini-batch (in GD)
let m size of training set (X,Y):
if mini-batch size = m -> batch gradient descent, (X^{1},Y^{1}) = (X,Y); takes too long per iteration
if mini-batch size = 1 -> stochastic gradient descent (every example is its own mini-batch); lose speedup from vectorization (very inefficient)
in practice: somewhere in between 1 and m (neither too small nor too large): gives the fastest learning because
- > you get a lot of vectorization
- > make progress without processing entire training set
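A minimal sketch (names and sizes are my own choices) of splitting a shuffled training set into mini-batches, one gradient descent step per batch:

import numpy as np

def minibatches(X, Y, batch_size=1000, seed=0):
    # Shuffle the training set, then yield mini-batches of the requested size.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]

X = np.random.randn(5000, 10)
Y = np.random.randn(5000, 1)
for X_batch, Y_batch in minibatches(X, Y, batch_size=1000):
    pass  # one gradient descent step per mini-batch would go here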
manifold learning
In ML manifold learning aims at finding a low-dimensional embedding to represent data
key constraints in feedforward NN
the I/O (compositional) dependencies
Softmax Unit
An extension of the logistic sigmoid to multiple variables
- Used as the output of a multi-class classifier
sigmoid unit
Used to predict binary variables or to predict the probability of binary variables
- Softmax is an extension of the logistic sigmoid: with 2 variables, z_1 = 0 and z_2 = z, it reduces to the sigmoid
affine transformation
is a linear mapping method that preserves points, straight lines, and planes. Sets of parallel lines remain parallel after an affine transformation.
Universal Approximation
(theorem)
A feedforward network with a linear output layer and enough (but at least one) hidden nonlinear layers (e.g., the logistic sigmoid unit) can approximate up to any desired precision any (Borel measurable) function between two finite-dimensional spaces
however, we are not guaranteed that the learning algorithm will be able to build that representation
no free lunch theorem
…
depth
A general rule is that depth helps generalization
- it is better to have many simple layers than few highly complex ones
- Another interpretation is that depth allows a more gradual abstraction
A deep network might give a useful representation, where concepts are gradually more and more abstract
capacity
…
gradient descent assumptions
Gradient descent will reach a local minimum under some constraints on both the cost function and the learning rate:
1) One such constraint is Lipschitz continuity of the gradient of the cost function:
Lipschitz continuity bounds how quickly the gradient can change across the whole domain (by the Lipschitz constant L)
2) small enough learning rate
diagnosing GD
case 1: Lipschitzianity
The cost function does not satisfy the Lipschitz condition for any L
-> Solution: Smooth the cost function until an L exists
case 2: learning rate
not small enough
-> Solution: Make it smaller until gradient descent starts to work
regularization
Regularization aims at reducing the generalization error of an algorithm
regularization aims at reducing variance
In deep learning, the best option is often a large model with good regularization.
inverse problem
An inverse problem in science is the process of calculating from a set of observations the causal factors that produced them.
It is called an inverse problem because it starts with the results and then calculates the causes. This is the inverse of a forward problem, which starts with the causes and then calculates the results.
underfitting / high bias
the model does not fit the training data (-> high error rate on training data)
high bias: the model has a strong preconception about the data and, contrary to what the examples show, it stubbornly keeps this preconception, so that it fits the training data poorly
overfitting / high variance
high variance: intuition:
overfitting: if we have too many features, the learned model may fit the training set very well but fail to generalize to new examples (e.g. predicting prices for new examples)
addressing overfitting
1) reduce number of features
2) use regularization
- > keep all features, but reduce magnitude/values of weight/bias parameters
chromatic aberration
misalignment between the color channels
Semi-Supervised Learning
uses unlabeled samples from p(x) and labeled samples from p(x,y) to build p(y|x) or directly predict y from x
feature representation
…
compute the receptive field (CNN)
see separate sheet
iteration
1 execution of gradient descent (could be 1 forward pass or n forward passes, depending on the mini-batch size)
feature map
The feature map is the output of one filter applied to the previous layer. A given filter is drawn across the entire previous layer, moved one pixel at a time. Each position results in an activation of the neuron and the output is collected in the feature map. You can see that if the receptive field is moved one pixel from activation to activation, then the field will overlap with the previous activation by (field width - 1) input values.
For instance, in a 32 × 32 image, dragging the 5 × 5 receptive field across the input image data with a stride width of 1 will result in a feature map of 28 × 28 (32-5+1 × 32-5+1) output values or 784 distinct activations per image.
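The 32 → 28 arithmetic above generalizes to the standard output-size formula; a tiny helper (my own sketch) to compute it:

def conv_output_size(input_size, filter_size, stride=1, padding=0):
    # Standard formula for the spatial size of a convolutional feature map.
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(32, 5))   # 28, matching the 32x32 image / 5x5 filter example above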
feature vs activation map
Feature map and activation map mean exactly the same thing. It is called an activation map because it is a mapping that corresponds to the activation of different parts of the image, and also a feature map because it is also a mapping of where a certain kind of feature is found in the image. A high activation means a certain feature was found.
Undercomplete Autoencoders
A way to obtain a useful representation h is to constrain it to have a smaller dimension than x
In this case the AE is called undercomplete
We force the AE to focus on the most important attributes of the training data
autoencoders
- choosing the representation size and the capacity of the encoder/decoder depends on the complexity of the data distribution
latent variables
variables we do not directly observe
Regularized Autoencoders
…
Gradient clipping
Gradient clipping will ‘clip’ the gradients or cap them to a Threshold value to prevent the gradients from getting too large.
With clipping, the gradient is prevented from overshooting and the cost function follows a smoother trajectory than it otherwise would.
https://hackernoon.com/gradient-clipping-57f04f0adae
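A minimal NumPy sketch of clipping by norm (my own illustration; deep learning frameworks provide their own utilities for this):

import numpy as np

def clip_by_norm(grad, threshold=5.0):
    # If the gradient's L2 norm exceeds the threshold, rescale it so its norm
    # equals the threshold; otherwise leave it unchanged.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])             # norm 50
print(clip_by_norm(g, threshold=5.0))  # [3. 4.]  (norm 5, same direction)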
label smoothing
helps as regularizer
softmax outputs are never hard 1 or 0
can help the model converge
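A minimal sketch of one common formulation (my own example; eps and the uniform spread over the K classes are assumptions of this particular variant):

import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # Replace the hard 0/1 targets with eps/K and 1 - eps + eps/K.
    K = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / K

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y, eps=0.1))   # [0.025 0.025 0.925 0.025], still sums to 1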
Batch normalization
stabilizes learning by normalizing the input to each unit to have zero mean and unit variance.
This helps deal with training problems that arise due to poor initialization and helps gradient flow in deeper models
conditional distribution
t
the log (likelihood) trick
In machine learning, we generally assume the independence of different samples. Therefore, we often have to deal with the product of a (large) number of distributions. When our goal is to optimize functions of such products, it is often easier if we first work with the logarithm of such functions. As the logarithmic function is a strictly increasing function, it will not distort where the maximum is located.
The log makes a product of terms (the likelihood) into a sum (the log-likelihood), with which we can then work on one term (i.e., one training sample) at a time, because they are summed together rather than multiplied together.
=> the log trick makes the term more manageable
interpolation
interpolation is a method of constructing new data points within the range of a discrete set of known data points
approximate inference
Approximate inference methods make it possible to learn realistic models from big data by trading off computation time for accuracy, when exact learning and inference are computationally intractable
Bayes theorem is intractable. So how can we approximately solve Bayes theorem for complex cases, so that we can scale up Bayesian learning to the types of interesting, high-dimensional datasets that we want to deal with today in ML. There has been a lot of really excellent work on improving these approximations.
We can roughly divide approximate inference schemes into two categories: deterministic and stochastic.
Stochastic:
based on the idea of Monte-Carlo sampling i.e., we can approximate any expectation w.r.t. a distribution as a mean of samples from it
Deterministic:
the typical approach is to approximate the nasty posterior with a nice, simple, tractable distribution. We can parameterise the approximation with some variational parameters, and then minimise a probabilistic divergence (e.g., the Kullback–Leibler divergence) w.r.t. the variational parameters. We then use the trained approximate distribution instead of the true, intractable one
See https://www.quora.com/What-is-approximate-inference
partition function
see 16.2.3 The Partition Function p. 568
The normalizing constant Z is known as the partition function, a term borrowed from statistical physics.
Since Z is an integral or sum over all possible joint assignments of the state x, it is often intractable to compute.
marginal distribution
the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables. This contrasts with a conditional distribution, which gives the probabilities contingent upon the values of the other variables.
COGAN
coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images
domain adaption
Domain Adaptation from source to target distribution
E.g. we applied the proposed framework to the problem of adapting a classifier trained using labeled samples in one domain (source domain) to classify samples in a new domain (target domain), where labeled samples in the new domain are unavailable during training
VAE vs AE
AE: autoencoders learn a “compressed representation” of input automatically by first compressing the input (encoder) and decompressing it back (decoder) to match the original input. The learning is aided by using distance function that quantifies the information loss that occurs from the lossy compression. So learning in an autoencoder is a form of unsupervised learning (or self-supervised as some refer to it) - there is no labeled data.
VAE: Instead of just learning a function representing the data like AE (a compressed representation), variational autoencoders learn the parameters of a probability distribution representing the data. Since it learns to model the data, we can sample from the distribution and generate new input data samples. So it is a generative model like, for instance, GANs.
VAE
VAEs optimize a variational bound
https://www.quora.com/Whats-the-difference-between-a-Variational-Autoencoder-VAE-and-an-Autoencoder
VAE: a variational encoder tends to produce codings that look as if they were sampled from a Gaussian distribution (parameterized by a mean and a variance). The advantage of this approach is that after training you can simply sample from the distribution, then decode to generate new data.
ill-posed
tbd
Laplacian distribution
The Laplace distribution, also called the double exponential distribution, is the distribution of differences between two independent variates (a variate is a generalization of the concept of a random variable) with identical exponential distributions
variational bound
??
image analogy
An image analogy is a method of creating an image filter automatically from training data. In an image analogy process, the transformation between two images A and A’ is “learned”. Later, given a different image B, its “analogy” image B’ can be generated based on the learned transformation.
https://mrl.nyu.edu/publications/image-analogies/analogies-72dpi.pdf
Tensorflow padding SAME vs VALID
When stride is 1, think of the following distinction:
"SAME": output size is the same as input size. This requires the filter window to slip outside input map, hence the need to pad. SAME: Apply padding to input (if needed) so that input image gets fully covered by filter and stride you specified. For stride 1, this will ensure that output image size is same as input. In SAME (i.e. auto-pad mode), Tensorflow will try to spread padding evenly on both left and right.
"VALID": Filter window stays at valid position inside input map, so output size shrinks by filter_size - 1. No padding occurs. VALID: Don't apply any padding, i.e., assume that all dimensions are valid so that input image fully gets covered by filter and stride you specified. In VALID (i.e. no padding mode), Tensorflow will drop right and/or bottom cells if your filter and stride doesn't full cover input image.
https://stackoverflow.com/questions/37674306/what-is-the-difference-between-same-and-valid-padding-in-tf-nn-max-pool-of-t
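The resulting spatial sizes can be computed directly; a small sketch (my own helper, matching the formulas usually quoted for these two modes):

import math

def output_size(n, f, s, padding):
    # Spatial output size of a conv/pool layer for the two TensorFlow padding modes
    # (n = input size, f = filter size, s = stride).
    if padding == "SAME":
        return math.ceil(n / s)
    elif padding == "VALID":
        return math.ceil((n - f + 1) / s)

print(output_size(13, 6, 5, "SAME"))   # 3
print(output_size(13, 6, 5, "VALID"))  # 2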
disentangled representation
A disentangled representation is simply a concatenation of coordinates along each underlying factor of variation
logit
The logit function is the inverse of the sigmoidal “logistic” function or logistic transform used in mathematics. Often, sigmoid function refers to the special case of the logistic function
In Math, Logit is a function that maps probabilities ([0, 1]) to R ((-inf, inf))
logits
The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function.
binary cross entropy
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.
see https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#cross-entropy
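A minimal NumPy implementation (my own sketch) showing how the loss grows as predictions diverge from the labels:

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true in {0, 1}, y_pred in (0, 1). Clipping avoids log(0).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
print(binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.7])))  # small loss
print(binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.2, 0.3])))  # much larger loss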
weak labeling
we only know what attribute has changed between two images, although we do not know by how much
InfoGAN
learns a subset of factors of variation by reproducing parts of the input vector with the discriminator
instance normalization (IN vs BN)
The main difference between BN and IN is that the latter just computes the mean and standard deviation across the spatial domain of the input and not along the batch dimension
posterior distribution p ( θ | X)
is the probability of the parameter θ given the evidence X: p( θ | X)
It contrasts with the likelihood function, which is the probability of the evidence given the parameters: p( X | θ)
the posterior probability of a random event is the conditional probability that is assigned after the relevant evidence or background is taken into account
Similarly, the posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey
“Posterior”, in this context, means after taking into account the relevant evidence related to the particular case being examined
prior distribution p(x)
the prior of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account
likelihood function p( X | θ)
the likelihood function, which is the probability of the evidence X given the parameters θ: p( X | θ)
It contrasts with the posterior probability which is the probability of the parameter θ given the evidence X: p( θ | X)
pre-training / pretraining a NN
cf. transfer learning
manifold
a manifold is a connected region. Mathematically, it is a set of points associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space.
In mathematics, a manifold is a topological space that locally resembles Euclidean space near each point.
in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of dimensions, embedded in a higher-dimensional space. Each dimension corresponds to a local direction of variation.
manifold learning algorithm
Manifold learning algorithms assume that most of R^n consists of invalid inputs and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another.
Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting
probability concentration idea
see manifold learning.
the key assumption is that probability mass is highly concentrated along a (low-dimensional) manifold where the data lies
manifold hypothesis
In ML, the assumption is that the data lies along a low-dimensional manifold (the manifold hypothesis), which is at least approximately correct by argument of two observations:
1) that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. This suggests that the images encountered in AI applications occupy a negligible proportion of the volume of the image space.
2) we can imagine a manifold i.e. neighborhoods of interconnected examples (traversable by applying transformations) at least informally. In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, etc. It remains likely that there are multiple manifolds involved in most applications. For example, the manifold of images of human faces may not be connected to the manifold of images of cat faces.
feature learning
an attempt to recover a parsimonious set of latent random variables that describe a distribution over the observed data
ill-conditioning
ill-conditioning of the Hessian matrix H
a very general problem in most numerical optimization, convex or otherwise.
Ill-conditioning can manifest by causing SGD to get “stuck” in the sense that even very small steps increase the cost function.
receptive field (CNN)
(equivalently this is the filter size)
The receptive field is defined as the region in the input space that a particular CNN’s feature is looking at (i.e. be affected by)
For computational/feasibility reasons, each neuron is connected to only a local region in the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently this is the filter size).
The extent of the connectivity along the depth axis is always equal to the depth of the input volume.
It is important to emphasize this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.
Jensen–Shannon divergence
the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions
It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and it is always a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen-Shannon distance
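A small NumPy sketch (my own example, for discrete distributions with no zero entries) of JSD as the average KL divergence to the mixture M = (P+Q)/2:

import numpy as np

def kl(p, q):
    # Kullback–Leibler divergence KL(p || q) for discrete distributions (natural log).
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    # Jensen–Shannon divergence: symmetric and always finite.
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.4, 0.6])
q = np.array([0.6, 0.4])
print(jsd(p, q), jsd(q, p))   # equal, by symmetry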
mode collapse
most severe form of non-convergence in GAN training
-> G mostly produces samples for one mode only (e.g. one dog, all with beach)
reason why it happens in the GAN game: in practice we don’t play the minimax game (with maximization for D in the inner loop), which would guarantee convergence to the correct distribution; instead we do SGD for both players (G and D) simultaneously
-> simultaneous SGD can sometimes behave like min-max and a little bit like max-min, and a lot of the time it behaves more like max-min
min-max and max-min do different things
D in inner loop: convergence to correct distribution
G in inner loop: place all mass on most likely point
- > ways to reduce this problem:
1) use mini-batch features: if a single sample is too close to other samples in the mini-batch, it can be rejected as having collapsed to a single mode
2) Unrolled GANs: backprop through k updates of the discriminator to prevent mode collapse (this to make sure we’re actually doing min-max rather than max-min)
the maximum likelihood learning rule
tbd
Curse of dimensionality
true dimensionality often much lower than the possible dimensionality
why is it a problem to have high-dimensionality?
- ML methods are statistical by nature
=> count observations in various regions of some space
- as dimensionality grows, fewer observations per region -> many regions without observations -> the observations become sparser and sparser
=> search space grows very quickly with more dimensions but the number of examples you have stays the same -> there is less and less redundancy for your ML algorithm to sink its teeth into -> it will perform worse and worse
=> the problem is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
=> Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
e.g. true dimensionality on digits: possible variations of the pen-stroke
high-dimensional space
tbd
ML & curse of dimensionality
In machine learning problems that involve learning a “state-of-nature” from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values
cross-covariance
the cross-covariance is a function that gives the covariance of one process with the other at pairs of time points.
If X and Y are independent, then their covariance is zero. The converse, however, is not generally true. if two variables are uncorrelated, that does not in general imply that they are independent. A nonlinear relationship can exist that still would result in a covariance value of zero.
Lipschitz constant
tbd
vector norm
a vector norm is a function that assigns a strictly positive length or size to each vector in a vector space
matrix norm
a matrix norm is a vector norm in a vector space whose elements (vectors) are matrices
spectral norm
The spectral norm of a matrix A is the largest singular value of A.
The induced matrix norm of the L2-norm for vectors is the spectral norm.
The spectral norm is the maximum singular value of a matrix. Intuitively, think of it as the maximum ‘scale’ by which the matrix can ‘stretch’ a vector.
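A quick NumPy check (my own illustration) that the largest singular value and the induced 2-norm coincide:

import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# The spectral norm is the largest singular value ...
largest_singular_value = np.linalg.svd(A, compute_uv=False)[0]

# ... which is exactly what the induced 2-norm computes.
print(largest_singular_value, np.linalg.norm(A, ord=2))   # both ≈ 6.708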
residual
The residual of a data point is the difference between the actual value and the predicted value
maximum likelihood
the goal of maximum likelihood is to find the optimal way to fit a distribution to the data
linear regression
a linear model that uses one independent variable to predict a (dependent) variable
-> uses “least squares” to fit the line
multiple regression
like linear regression but uses multiple independent variables
logistic regression
predicts whether something is True or False, instead of predicting something continuous like size
logistic regression fits an “S”-shaped “logistic function” whose curve goes from 0 to 1
logistic regression is usually used for classification
note well: LR can use continuous and discrete measurements to provide probabilities (and classify new samples)
=> uses “maximum likelihood” to fit the line (vs linear regression that uses “least squares” to fit the line)
rank of a tensor
the rank refers to the number of dimensions present within the tensor
TF: The rank of a tensor is the number of indices required to uniquely select each element of the tensor. Rank is also known as “order”, “degree”, or “ndims.”
The rank of a tf.Tensor object is its number of dimensions.
residual layer
This motivated the ResNet authors to use skip connections and so-called deep residual layers to allow their network to learn deviations from the identity mapping, hence the term residual, referring here to the difference from the identity.
stationary vs non-stationary data
In contrast to the non-stationary process that has a variable variance and a mean that does not remain near, or returns to a long-run mean over time, the stationary process reverts around a constant long-term mean and has a constant variance independent of time.
Most statistical forecasting methods are based on the assumption that the time series are approximately stationary.
Unfortunately, most price series are not stationary. They are either drifting upward or downward.
https://www.quora.com/What-is-Stationary-series-and-non-Stationary-series
Features should be stationary!
https://youtu.be/UQmKh84OZls?t=2513
translation invariance (CNN)
put simply: CNNs can detect the same object in an image even if it’s moved around, resized, rotated etc.
there’s an innate prior in CNNs – the assumption that an image processing system should be translationally invariant – which is enforced through an architectural design choice (weight sharing)
Invariant to translation means that a translation of input features does not change the outputs at all. So if your pattern 0,3,2,0,0 on the input results in 0,1,0 in the output, then the pattern 0,0,3,2,0 would also lead to 0,1,0
https://stats.stackexchange.com/questions/208936/what-is-translation-invariance-in-computer-vision-and-convolutional-neural-netwo
mean average precision (mAP)
a metric to measure the accuracy of object detectors like Faster R-CNN, SSD, etc
https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173
attention
TBD
accuracy vs precision
accuracy: involves to how close you come to the actual result
(accuracy without precision: arrows cluster around correct result i.e. the apple but without certainty of a bull’s eye for any given shot)
precision: how consistently you can get that result using the same method
(precision without accuracy: arrows consistently hit center of head but not the apple)
- > while we ultimately strive for accuracy,
- > precision reflects our certainty of reliably achieving accuracy
Bayesian networks (belief networks)
- to compute uncertainties by using the concept of probability
- used to model uncertainties by using directed acyclic graphs
- used in predictive modeling, in descriptive analysis
- e.g. Monty Hall problem
downsampling
downscale an image
cross validation
cross-validation
k-CV, 5-CV
see
https://towardsdatascience.com/k-fold-cross-validation-explained-in-plain-english-659e33c0bc0
https://stats.stackexchange.com/questions/266225/step-by-step-explanation-of-k-fold-cross-validation-with-grid-search-to-optimise
https://scikit-learn.org/stable/modules/cross_validation.html
nested cross validation
nested CV
https://mlfromscratch.com/nested-cross-validation-python-code/#/
Co-adaptation
In neural networks, co-adaptation refers to when different hidden units in a neural network have highly correlated behavior.
It is better for computational efficiency and the model’s ability to learn a general representation if hidden units can detect features independently of each other.
A few different regularization techniques aim at reducing co-adaptation – dropout being a notable one.
In neural networks, co-adaptation means that some neurons are highly dependent on others. If those independent neurons receive “bad” inputs, then the dependent neurons can be affected as well, and ultimately it can significantly alter the model performance, which is what might happen with overfitting.
mixed precision training
- Mixed precision is the combined use of different numerical precisions in a computational method.
- Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network.
- significant training speedups are experienced by switching to mixed precision – up to 3x overall speedup on the most arithmetically intense model architectures
Half precision (also known as FP16) data compared to higher precision FP32 vs FP64 reduces memory usage of the neural network, allowing training and deployment of larger networks, and FP16 data transfers take less time than FP32 or FP64 transfers.
Single precision (also known as 32-bit) is a common floating point format (float in C-derived programming languages), and 64-bit, known as double precision (double).
Mixed precision consists in using the full precision
(i.e., float32) for some key specific layers (e.g., loss layer) while reducing most of the other layers to half precision (i.e., float16). The training process therefore requires less memory due to faster data transfer operations while at the same time math-intensive and memory-limited operations are sped up. These benefits are ensured at no accuracy expense compared to a full precision training.
atrous convolution (aka dilated convolution)
Atrous (“with holes”) convolution is an alternative to the downsampling layer. It increases the receptive field while maintaining the spatial dimension of the feature maps.
Sensitivity / Recall
- True Positive Rate
- Recall
= TP / (TP + FN)
Specificity
- True Negative Rate
= TN / (TN + FP)
Precision
= TP / (TP + FP)
Accuracy
= (TP + TN) / (TP + TN + FP + FN)
Accuracy can be a misleading metric for imbalanced data sets. Consider a sample with 95 negative and 5 positive values. Classifying all values as negative in this case gives 0.95 accuracy score.
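A direct translation of the four formulas into a small helper (my own sketch, with made-up confusion counts):

def classification_metrics(tp, tn, fp, fn):
    # Direct translation of the formulas above.
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, precision, accuracy

# Example confusion counts: 40 true positives, 45 true negatives, 10 false positives, 5 false negatives
print(classification_metrics(tp=40, tn=45, fp=10, fn=5))
# (0.889, 0.818, 0.8, 0.85)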
translation equivariance (CNN)
Equivariant to translation means that a translation of input features results in an equivalent translation of outputs. So if your pattern 0,3,2,0,0 on the input results in 0,1,0,0 in the output, then the pattern 0,0,3,2,0 might lead to 0,0,1,0