Machine Learning Flashcards
adversarial training
AT: “training a model in a worst-case scenario, with inputs chosen by an adversary”
Adversarial training is often used to enforce constraints on random variables
GAN
GAN is a generative model that learns the probability distribution (or data distribution) of the training examples it is given.
From this distribution, we can then create sample outputs. GANs have seen their largest progress with image training examples, but this idea of modeling data distributions can be applied to other forms of input as well.
=> the key mathematical tool GANs give you is the ability to “estimate a ratio”
==> GANs are generative models that use supervised learning to approximate an intractable cost function by estimating ratios
discriminator network
the network in a GAN that is trained to distinguish real training examples from samples produced by the generator; the generator is trained to fool it
convolutional neural network
a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery
deconvolutional neural network
a network built from transposed convolution ("deconvolution") layers that upsample feature maps, e.g. to map a low-dimensional code or feature map back to image resolution (used in generators, decoders, and segmentation networks)
backpropagation
In principle, all backpropagation does is supply the gradients for (stochastic) gradient descent -> this converges to a local minimum, which is often surprisingly good in practice
Backpropagation is a method used in artificial neural networks to calculate the gradient that is needed to update the weights of the network. It is commonly used to train deep neural networks, a term for neural networks with more than one hidden layer
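A minimal numpy sketch (made-up toy data, one hidden layer) of backpropagation feeding gradient descent:

import numpy as np

# hypothetical toy data: 4 samples, 3 input features, 1 target each
X = np.random.randn(4, 3)
y = np.random.randn(4, 1)
W1, W2 = np.random.randn(3, 5), np.random.randn(5, 1)  # one hidden layer of 5 units

for step in range(100):
    # forward pass
    h = np.maximum(0, X @ W1)          # ReLU hidden layer
    y_hat = h @ W2                     # linear output
    loss = np.mean((y_hat - y) ** 2)   # squared error

    # backward pass: chain rule, layer by layer
    d_y_hat = 2 * (y_hat - y) / len(X)
    d_W2 = h.T @ d_y_hat
    d_h = d_y_hat @ W2.T
    d_W1 = X.T @ (d_h * (h > 0))       # gradient flows only through active ReLUs

    # (stochastic) gradient descent step
    W1 -= 0.01 * d_W1
    W2 -= 0.01 * d_W2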
inpainting
In the digital world, inpainting (also known as image interpolation or video interpolation) refers to the application of sophisticated algorithms to replace lost or corrupted parts of the image data (mainly small regions or to remove small defects)
damage and repair strategy
t
pretext task
an auxiliary task (e.g. predicting a removed image region, as in inpainting) that is solved not for its own sake but so that the network learns representations that transfer to the actual downstream task
autoencoder
is an artificial neural network used for unsupervised learning of efficient codings (i.e. feature learning)
The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction
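A rough numpy sketch (random, untrained weights, just to show the encode/decode structure):

import numpy as np

# hypothetical: compress 64-dimensional inputs to an 8-dimensional code
x = np.random.randn(64)
W_enc = np.random.randn(8, 64)   # encoder weights (would be learned)
W_dec = np.random.randn(64, 8)   # decoder weights (would be learned)

code = np.tanh(W_enc @ x)                 # low-dimensional representation (encoding)
x_reconstructed = W_dec @ code            # attempt to reproduce the input
reconstruction_error = np.mean((x - x_reconstructed) ** 2)  # training minimizes this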
low-level statistics
e.g. unusual local texture patterns
variability
t
epoch
An epoch is one complete pass of the data set to be learned through the learning machine. Learning machines like feedforward neural nets that use iterative algorithms often need many epochs during their learning phase.
adversarial examples
inputs to machine learning models that an attacker has
intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines
-> in computer vision: usually an image formed by making small perturbations to an example image from a dataset
tensor
multidimensional array
GAN
The dueling-neural-network approach has vastly improved learning from unlabeled data.
spatial resolution
Spatial resolution is a term that refers to the number of pixels utilized in construction of a digital image. Images having higher spatial resolution are composed with a greater number of pixels than those of lower spatial resolution.
saccade
is a quick, simultaneous movement of both eyes between two or more phases of fixation in the same direction
no free lunch theorem
“averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points”
in other words, the most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data-generating distributions.
But if we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions
==> the no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task
hypothesis space
the set of functions the learning algorithm is allowed to select as its solution; e.g. linear regression has a hypothesis space consisting of the set of linear functions of its input
regularization
we can regularize a model that learns a function f(x; θ) by adding a penalty called a regularizer to the cost function
Regularization is any modification we make to a
learning algorithm that is intended to reduce its generalization error but not its training error. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization
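A small sketch of an L2-regularized cost for a hypothetical linear model f(x; θ) = xθ:

import numpy as np

def cost(theta, X, y, lam=0.1):
    mse = np.mean((X @ theta - y) ** 2)      # training error term
    regularizer = lam * np.sum(theta ** 2)   # penalty discouraging large weights
    return mse + regularizer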
loss function
when we minimize the objective function, we may also call it the cost function or loss function
objective function
the function we want to minimize or maximize, also called criterion
=> when we are minimizing the objective function, we may also call it the cost function or loss function
squared L^2 norm
can be calculated simply as x^Tx
-> is more convenient to work with mathematically and computationally than the L^2 norm itself
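Quick numpy check:

import numpy as np

x = np.array([3.0, 4.0])
squared_l2 = x @ x            # x^T x = 25.0
l2 = np.sqrt(squared_l2)      # ordinary L^2 norm = 5.0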
sigma function
sigmoid function (logistic curve): σ(x) = 1 / (1 + e^(-x))
-> if x is very negative, the output is close to 0
-> if x is very positive, the output is close to 1
-> increases steadily around x = 0
eigenvalue
an eigenvector of a square matrix A is a nonzero vector v s.t. multiplication by A alters only the scale of v:
Av = cv
where c is a scalar known as the eigenvalue corresponding to this eigenvector
eigendecomposition
the decomposition of a square matrix A into its eigenvectors and eigenvalues: A = V diag(λ) V^-1, where the columns of V are the eigenvectors of A and the entries of λ are the corresponding eigenvalues (for a real symmetric A, V can be chosen orthogonal)
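A small numpy illustration (made-up symmetric matrix):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                  # symmetric, so eigenvectors are orthogonal
eigenvalues, V = np.linalg.eig(A)           # columns of V are eigenvectors
A_rebuilt = V @ np.diag(eigenvalues) @ np.linalg.inv(V)  # A = V diag(lambda) V^-1
# each column v of V satisfies A v = lambda v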
Orthogonal matrix
is a square matrix with real entries whose columns and rows are orthogonal unit vectors (i.e., orthonormal vectors):
Q^TQ = QQ^T = I
where I is identity matrix.
This leads to the equivalent characterization: a matrix Q is orthogonal if its transpose is equal to its inverse:
Q^T = Q^-1
singular matrix
a square matrix with linearly dependent columns
unit vector
a vector with magnitude (norm) equal to 1
ReLU
-> rectified linear unit
-> the rectifier is an activation function defined as the positive part of its argument:
f(x) = max(0,x)
-> a unit employing the rectifier is also called a rectified linear unit (ReLU)
-> easier to train than sigmoid
-> rectified (German: gleichgerichtet)
loss function is non-convex?
the optimization is prone to falling into local minima
neural networks are mostly used with non-linear activation functions (e.g. sigmoid), hence the optimization problem becomes non-convex
Gram matrix
the Gram matrix (Gramian matrix or Gramian) of a set of vectors v_1,…v_n in an inner product space is the Hermitian matrix of inner products
An important application is to compute linear independence:
a set of vectors is linearly independent if and only if the Gram determinant (the determinant of the Gram matrix) is non-zero
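A small numpy sketch (random vectors for illustration):

import numpy as np

# rows of V are the vectors v_1 ... v_n (here: 3 vectors in R^4, made up)
V = np.random.randn(3, 4)
G = V @ V.T                      # G[i, j] = <v_i, v_j>
independent = not np.isclose(np.linalg.det(G), 0)  # non-zero Gram determinant <=> linear independence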
artifact
any error in the perception or representation of any information, introduced by the involved equipment or technique(s)
dilated convolution
vanilla convolutions struggle to integrate global context: the effective receptive field of units can only grow linearly with layers. This is very limiting, especially for high-resolution input images. Dilated convolutions to the rescue! Even though the number of parameters grows only linearly with layers, the effective receptive field of units grows exponentially with layer depth
They can be very useful in some settings in conjunction with 0-dilated filters, because they allow you to merge spatial information across the inputs much more aggressively with fewer layers
http://www.inference.vc/dilated-convolutions-and-kronecker-factorisation/
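A back-of-the-envelope sketch (assuming stride 1, 3x3 kernels, and dilation doubling per layer) of the receptive-field growth:

# each stride-1 layer adds dilation * (kernel_size - 1) to the receptive field
def receptive_field(num_layers, kernel_size=3):
    rf = 1
    for layer in range(num_layers):
        dilation = 2 ** layer
        rf += dilation * (kernel_size - 1)
    return rf

# with doubling dilation: 3, 7, 15, 31, ... (exponential in depth)
# with dilation fixed at 1: 3, 5, 7, 9, ...  (linear in depth)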
second derivative
tells us how well we can expect a gradient descent step to perform
-> it can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point
Taylor series
a Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point
A function can be approximated by using a finite number of terms of its Taylor series.
Taylor’s theorem gives quantitative estimates on the error introduced by the use of such an approximation
Applications of eigenvalues
At a critical point, where ∇_xf(x) = 0, we can examine the eigenvalues of the Hessian to determine whether
the critical point is a local maximum, local minimum, or saddle point.
-> When the Hessian is positive definite (all its eigenvalues are positive), the point is a local
minimum.
Applications of eigendecomposition
Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions
univariate/multidimensional second derivative test
allows us to determine, at a critical point, whether the critical point is a local maximum, local minimum, or saddle point
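A small numpy sketch of the multidimensional test, using the saddle f(x, y) = x^2 - y^2 as a made-up example:

import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])            # Hessian of f at the critical point (0, 0)
eigs = np.linalg.eigvalsh(H)
if np.all(eigs > 0):
    kind = "local minimum"             # positive definite Hessian
elif np.all(eigs < 0):
    kind = "local maximum"
elif np.any(eigs > 0) and np.any(eigs < 0):
    kind = "saddle point"              # here: eigenvalues 2 and -2
else:
    kind = "inconclusive"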
Newton’s method
problem: a poor condition number makes choosing a good step size difficult -> if the step size is too small, too little significant progress is made
-> this issue can be resolved by using information from the Hessian matrix to guide the search
-> Newton's method is based on using a second-order Taylor series expansion to approximate f(x) near some point x^(0)
-> when f is a positive definite quadratic function, Newton's method consists of applying equation 4.12 once to jump to the minimum of the function directly
-> when f is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton's method consists of applying equation 4.12 multiple times
-> iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point
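A minimal numpy sketch (made-up positive definite quadratic) of a single Newton step:

import numpy as np

# one Newton step: x <- x - H^-1 grad f(x); for a quadratic it jumps straight to the minimum
def newton_step(x, grad, hessian):
    return x - np.linalg.solve(hessian(x), grad(x))

# example: f(x) = 0.5 x^T A x - b^T x with positive definite A (made-up numbers)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hessian = lambda x: A
x = newton_step(np.zeros(2), grad, hessian)   # equals the exact minimizer A^-1 b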
first-order / second-order optimization algorithms
Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that also use the Hessian matrix, such as Newton’s method, are called second-order optimization algorithms
convex function
a function for which the Hessian is positive semidefinite everywhere
- Such functions are well-behaved because they lack saddle points and all of their local minima are necessarily global minima.
However, most problems in deep learning are difficult to express in terms of convex optimization
positive semidefinite
A matrix whose eigenvalues are all positive or zero-valued
positive definite
A matrix whose eigenvalues are all positive
line search
The line search approach first finds a descent direction along which the objective function f will be reduced and then computes a step size that determines how far x should move along that direction.
The descent direction can be computed by various methods, such as gradient descent or Newton's method. The step size can be determined either exactly or inexactly.
determinant of a square matrix
det(A):
- is a function mapping matrices to real scalars
- The determinant is equal to the product of all the eigenvalues of the matrix.
- The absolute value of the determinant can be thought
of as a measure of how much multiplication by the matrix expands or contracts space.
- If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume.
- If the determinant is 1, then the transformation preserves volume
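Quick numpy check (made-up diagonal matrix):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
print(np.linalg.det(A))                 # 6.0: multiplication by A scales areas by 6
print(np.prod(np.linalg.eigvals(A)))    # same value: the product of the eigenvalues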
frequentist probability
related directly to the rates at which events occur
Bayesian probability
related to qualitative levels of certainty (also: degree of belief)
Image synthesis
is the process of creating new images from some form of image description
Regression vs Classification
Regression: the output variable takes continuous values.
Regression involves estimating or predicting a response.
Classification: the output variable takes class labels. Classification is identifying group membership.
image patch
A patch is a small (generally rectangular) piece of an image. For example, an 8x8 patch is a square patch containing 64 pixels of a larger image (of size, say, 256x256 pixels). Due to the smaller size, some image processing algorithms such as denoising or super resolution are easier to operate on patches rather than on the entire image itself. These algorithms split an image into several smaller patches (of size, say, 8x8), operate individually on each of these patches, and finally tile all these patches at their respective locations.
VGG
a deep convolutional network architecture for image classification; its pretrained features are the de-facto standard for perceptual and style losses in image generation tasks
classification network
neural networks are used for the purpose of
- clustering through unsupervised learning,
- classification through supervised learning, or
- regression
That is, they help group unlabeled data, categorize labeled data, or predict continuous values.
Deep embeddings
answer one simple question: How similar are two images? Learning these embeddings is the bedrock of verification, zero-shot learning, and visual search
feature space
Feature space refers to the n-dimensions where your variables live (not including a target variable, if it is present). The term is used often in ML literature because a task in ML is feature extraction, hence we view all variables as features
channel dimension
Images are usually represented as Height x Width x #Channels where #Channels is 3 for RGB images and 1 for grayscale images. Sometimes you see Width x Height x #Channels, but the third dimension is the “channels.”
Inception Score
a recently proposed and widely used evaluation metric for generative models; it scores generated samples with a pretrained Inception classifier, rewarding images that are classified confidently while the set of samples as a whole covers many classes
Compressed sensing
is a signal processing technique for efficiently acquiring and reconstructing a signal, by finding solutions to underdetermined linear systems
probability mass vs density function
a probability distribution over continuous random variables is described using a probability density function (PDF)
A probability distribution over discrete variables may be described using a probability mass function (PMF)
expectation
expectation or expected value of some function f(x) with respect to a probability distribution P (x) is the average or mean value that f takes on when x is drawn from P
variance
variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution
When the variance is low, the values of f (x) cluster near their expected value.
The square root of the variance is known as the standard deviation
standard deviation
The square root of the variance is known as the standard deviation
covariance
covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables
covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, (i.e., the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, (i.e., the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
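A small numpy sketch of the empirical versions of these quantities (made-up samples):

import numpy as np

x = np.random.randn(1000)
y = 2 * x + np.random.randn(1000)      # y tends to be large when x is large

mean_x = np.mean(x)                    # expectation E[x]
var_x = np.mean((x - mean_x) ** 2)     # variance; np.sqrt(var_x) is the standard deviation
cov_xy = np.mean((x - mean_x) * (y - np.mean(y)))   # positive here: x and y move together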
generative model
a model that learns the probability distribution of the training data and can be used to generate new samples from it; GANs (see above) are an example
Voxel occupancy
is one approach for reconstructing the 3-dimensional shape of an object from multiple views
A voxel represents a value on a regular grid in three-dimensional space. As with pixels in a bitmap, voxels themselves do not typically have their position (their coordinates) explicitly encoded along with their values. Instead, rendering systems infer the position of a voxel based upon its position relative to other voxels
cross-entropy loss
A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.
It is also called the negative log-likelihood, logistic loss, or cross-entropy loss.
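A minimal numpy sketch (hypothetical 3-class example):

import numpy as np

target = np.array([0.0, 1.0, 0.0])        # one-hot true class
predicted = np.array([0.1, 0.7, 0.2])     # predicted class probabilities
loss = -np.sum(target * np.log(predicted))   # = -log(0.7), negative log-likelihood of the true class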
multi-layer perceptron
a feed-forward artificial neural network with an input layer, one or more hidden layers, and an output layer; each layer is fully connected to the next and the units apply a non-linear activation function
multi-layer RNN
Generally, current RNN models can only be stacked to 2 or 3 layers; beyond 3 layers the performance may drop. This is generally because of the vanishing gradient problem in RNNs.
perplexity
a measure of how well a probability model predicts a sample: the exponentiation of the cross-entropy, so lower perplexity means the model is less "surprised" by the data
softmax function/classifier
It takes a vector of arbitrary real-valued scores (in z) and squashes it to a vector of values between zero and one that sum to one
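A minimal numpy sketch of such a squashing function:

import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability (does not change the result)
    e = np.exp(z)
    return e / np.sum(e)         # entries lie in (0, 1) and sum to 1

softmax(np.array([2.0, 1.0, -1.0]))   # roughly [0.705, 0.259, 0.035]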
dropout
technique to help mitigate the problem of overfitting: randomly drop out some nodes/neurons in hidden layers during training so that they don’t participate in producing the output and later help the model to better generalize
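A small numpy sketch of (inverted) dropout during training, assuming a keep probability of 0.8:

import numpy as np

h = np.random.randn(100)                   # hidden-layer activations
keep_prob = 0.8
mask = (np.random.rand(100) < keep_prob)   # randomly drop ~20% of the neurons
h_train = h * mask / keep_prob             # rescale so the expected activation is unchanged
# at test time, no neurons are dropped and h is used as-is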
Mask R-CNN - Extending Faster R-CNN for Pixel Level Segmentation
Much like Fast R-CNN and Faster R-CNN, Mask R-CNN's underlying intuition is straightforward. Given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel-level segmentation?
Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object.
object detection
Object detection is the task of finding the different objects in an image and classifying them
R-CNN (1st version)
The goal of R-CNN is to take in an image, and correctly identify where the main objects (via a bounding box) in the image are
Inputs: Image
Outputs: Bounding boxes + labels for each object in the image
R-CNN creates these bounding boxes, or region proposals, using a process called Selective Search. At a high level, Selective Search looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, color, or intensity to identify objects. Once the proposals are created, R-CNN warps the region to a standard square size and passes it through to a modified version of AlexNet. On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that simply classifies whether this is an object, and if so what object.
R-CNN in short (1st version)
- Generate a set of proposals for bounding boxes.
- Run the images in the bounding boxes through a pre-trained AlexNet and finally an SVM to see what object the image in the box is.
- Run the box through a linear regression model to output tighter coordinates for the box once the object has been classified.
image instance segmentation
The goal of image instance segmentation is to identify, at a pixel level, what the different objects in a scene are.
embedding
Embedding == representation
- simply means projecting an input into another more convenient representation space