Machine Learning Flashcards

1
Q

adversarial training

A

AT: “training a model in a worst-case scenario, with inputs chosen by an adversary”

Adversarial training is often used to enforce constraints on random variables

2
Q

GAN

A

GAN is a generative model that learns the probability distribution (or data distribution) of the training examples it is given.
From this distribution, we can then create sample outputs. GANs have seen their largest progress with image training examples, but this idea of modeling data distributions is one that can be applied with other forms of input

=> the key mathematical tool GANs give you is the ability to “estimate a ratio”

==> GANs are generative models that use supervised learning to approximate an intractable cost function by estimating ratios

3
Q

discriminator network

A

t

4
Q

convolutional neural network

A

a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery

5
Q

deconvolutional neural network

A

t

6
Q

backpropagation

A

In principle, backpropagation only computes the gradients that (stochastic) gradient descent then uses -> training converges to a local minimum, which is often surprisingly good

Backpropagation is a method used in artificial neural networks to calculate the gradient needed to update the weights of the network. It is commonly used to train deep neural networks, a term used to describe neural networks with more than one hidden layer
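
A minimal sketch of the idea, assuming a hypothetical one-hidden-unit network y_hat = w2 * sigmoid(w1 * x) with squared-error loss (the network, names, and numbers are illustrative and not from the card). The chain rule is applied from the loss backwards to get one gradient per weight, and a single gradient descent step then uses those gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    """Forward pass: returns the hidden activation and the prediction."""
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    return h, y_hat

def loss(w1, w2, x, y):
    """Squared-error loss 0.5 * (y_hat - y)^2."""
    _, y_hat = forward(w1, w2, x)
    return 0.5 * (y_hat - y) ** 2

def backprop(w1, w2, x, y):
    """Gradients of the loss w.r.t. w1 and w2 via the chain rule."""
    h, y_hat = forward(w1, w2, x)
    d_y_hat = y_hat - y              # dL/dy_hat
    d_w2 = d_y_hat * h               # dL/dw2
    d_h = d_y_hat * w2               # dL/dh
    d_w1 = d_h * h * (1.0 - h) * x   # sigmoid'(z) = h * (1 - h)
    return d_w1, d_w2

# One gradient descent step using the backpropagated gradients
# (all values below are assumed for illustration).
w1, w2, x, y, lr = 0.5, -0.3, 1.5, 1.0, 0.1
g1, g2 = backprop(w1, w2, x, y)
w1_new, w2_new = w1 - lr * g1, w2 - lr * g2
```

Comparing `g1` and `g2` against finite differences of `loss` is a quick way to check such a hand-written backward pass.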

7
Q

inpainting

A

In the digital world, inpainting (also known as image interpolation or video interpolation) refers to the application of sophisticated algorithms to replace lost or corrupted parts of the image data (mainly small regions or to remove small defects)

8
Q

damage and repair strategy

A

t

9
Q

pretext task

A

t

10
Q

autoencoder

A

is an artificial neural network used for unsupervised learning of efficient codings (i.e., feature learning)

The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction

11
Q

low-level statistics

A

e.g. unusual local texture patterns

12
Q

variability

A

t

13
Q

epoch

A

An epoch is one complete presentation of the data set to be learned to a learning machine. Learning machines like feedforward neural nets that use iterative algorithms often need many epochs during their learning phase.

14
Q

adversarial examples

A

inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines

-> in computer vision: usually an image formed by making small perturbations to an example image from a dataset

15
Q

tensor

A

multidimensional array

16
Q

GAN

A

The dueling-neural-network approach has vastly improved learning from unlabeled data.

17
Q

spatial resolution

A

Spatial resolution is a term that refers to the number of pixels utilized in construction of a digital image. Images having higher spatial resolution are composed with a greater number of pixels than those of lower spatial resolution.

18
Q

saccade

A

is a quick, simultaneous movement of both eyes between two or more phases of fixation in the same direction

19
Q

no free lunch theorem

A

“averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points”
In other words, the most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data-generating distributions.
But if we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions

==> the no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task

20
Q

hypothesis space

A

e.g. linear regression has a hypothesis space consisting of the set of linear functions of its input

21
Q

regularization

A

we can regularize a model that learns a function f(x; θ) by adding a penalty called a regularizer to the cost function

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization
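
A minimal sketch of "adding a penalty to the cost function", assuming a linear model, a mean-squared-error cost, and an L2 (weight decay) regularizer with a hypothetical strength `lam`. None of these specific choices come from the card; they are one common instance.

```python
def mse(weights, xs, ys):
    """Mean squared error of a linear model f(x; w) = sum_i w_i * x_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(xs)

def l2_penalty(weights):
    """Regularizer: squared L2 norm of the weights."""
    return sum(w * w for w in weights)

def regularized_cost(weights, xs, ys, lam):
    """Training cost plus lam times the regularizer."""
    return mse(weights, xs, ys) + lam * l2_penalty(weights)

# Toy data (assumed) that the weights [1, 2] fit exactly.
xs = [(1.0, 2.0), (2.0, 0.5), (3.0, 1.0)]
ys = [5.0, 3.0, 5.0]
weights = [1.0, 2.0]
plain = mse(weights, xs, ys)                            # training error only
penalized = regularized_cost(weights, xs, ys, lam=0.1)  # adds 0.1 * ||w||^2
```

Here the training error is zero, yet the regularized cost is not: the penalty prefers smaller weights even when the fit is already perfect.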

22
Q

loss function

A

when we minimize the objective function, we may also call it the cost function or loss function

23
Q

objective function

A

the function we want to minimize or maximize, also called criterion

=> when we are minimizing the objective function, we may also call it the cost function or loss function

24
Q

squared L^2 norm

A

can be calculated simply as x^T x

-> is more convenient to work with mathematically and computationally than the L^2 norm itself
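
A small sketch of the difference (the vector is an assumed example): the squared norm is just a dot product of the vector with itself, while the norm itself needs an extra square root.

```python
import math

def squared_l2_norm(x):
    """x^T x: the dot product of x with itself."""
    return sum(xi * xi for xi in x)

x = [3.0, 4.0]
squared = squared_l2_norm(x)   # 3^2 + 4^2
norm = math.sqrt(squared)      # the L^2 norm needs the square root on top
```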

25
Q

sigma function

A

sigmoid function (logistic curve)

  • > if x is very negative, the output is close to 0
  • > if x is very positive, the output is close to 1
  • > increases steadily around x = 0
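
The three properties above can be checked with a short sketch of the logistic sigmoid, 1 / (1 + exp(-x)). The two-branch form is an assumption added here to avoid overflow for large |x|; it computes the same function.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Very negative, zero, and very positive inputs.
values = {x: sigmoid(x) for x in (-20.0, 0.0, 20.0)}
```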
26
Q

eigenvalue

A

an eigenvector of a square matrix A is a nonzero vector v s.t. multiplication by A alters only the scale of v:

Av = cv

where c is a scalar known as the eigenvalue corresponding to this eigenvector
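
A small sketch that checks Av = cv for an assumed 2x2 example matrix; the matrix and vectors are illustrative only.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[2.0, 1.0],
     [1.0, 2.0]]
v = [1.0, 1.0]

Av = matvec(A, v)
c = Av[0] / v[0]   # candidate eigenvalue, read off the first component
# v is an eigenvector if every component of Av equals c times that of v.
is_eigenvector = all(abs(av - c * x) < 1e-12 for av, x in zip(Av, v))
```

For this matrix, `[1.0, -1.0]` is a second eigenvector: multiplying by A leaves it unchanged, so its eigenvalue is 1.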

27
Q

eigendecomposition

A

tbd

28
Q

Orthogonal matrix

A

is a square matrix with real entries whose columns and rows are orthogonal unit vectors (i.e., orthonormal vectors):

Q^TQ = QQ^T = I

where I is the identity matrix.

This leads to the equivalent characterization: a matrix Q is orthogonal if its transpose is equal to its inverse:
Q^T = Q^-1

29
Q

singular matrix

A

a square matrix with linearly dependent columns

30
Q

unit vector

A

A unit vector has a magnitude (norm) of 1

31
Q

sigma function

A

sigmoid function (logistic curve)

32
Q

ReLU

A

-> rectified linear unit
-> the rectifier is an activation function defined as the positive part of its argument:
f(x) = max(0,x)
-> a unit employing the rectifier is also called a rectified linear unit (ReLU)

-> easier to train than sigmoid

ReLU(a) = max(0,a)

rectified: as in an electrical rectifier (German: gleichgerichtet)

33
Q

loss function is non-convex?

A

the optimization is prone to falling into local minima

neural networks are mostly used with non-linear activation functions (e.g. sigmoid), hence the optimization becomes non-convex

34
Q

Gram matrix

A

the Gram matrix (Gramian matrix or Gramian) of a set of vectors v_1,…v_n in an inner product space is the Hermitian matrix of inner products

An important application is testing for linear independence:
a set of vectors is linearly independent if and only if the Gram determinant (the determinant of the Gram matrix) is non-zero
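
A sketch of the Gram determinant test for two real vectors (the vectors are assumed examples; for real vectors the Gram matrix is simply symmetric).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_matrix(vectors):
    """Matrix of all pairwise inner products G[i][j] = <v_i, v_j>."""
    return [[dot(u, v) for v in vectors] for u in vectors]

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

independent = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
dependent = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]   # second = 2 * first

det_independent = det2(gram_matrix(independent))  # non-zero
det_dependent = det2(gram_matrix(dependent))      # zero
```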

35
Q

artifact

A

any error in the perception or representation of any information, introduced by the involved equipment or technique(s)

36
Q

dilated convolution

A

Vanilla convolutions struggle to integrate global context: the effective receptive field of units can only grow linearly with the number of layers. This is very limiting, especially for high-resolution input images. Dilated convolutions to the rescue! Even though the number of parameters grows only linearly with the number of layers, the effective receptive field of units grows exponentially with layer depth (when the dilation rate is increased exponentially from layer to layer)

They can be very useful in conjunction with non-dilated (0-dilated) filters, because they allow you to merge spatial information across the inputs much more aggressively with fewer layers

http://www.inference.vc/dilated-convolutions-and-kronecker-factorisation/
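
A sketch of the linear vs. exponential growth, assuming stride-1 layers with 3-wide kernels and dilation rates that double per layer (1, 2, 4, ...). The convention here is that dilation 1 means an ordinary convolution, which may differ from the "0-dilated" naming used above.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field along one axis of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

layers = 6
plain = receptive_field(3, [1] * layers)                       # grows by 2 per layer
dilated = receptive_field(3, [2 ** i for i in range(layers)])  # roughly doubles per layer
```

Both stacks have the same number of parameters (six 3-wide kernels), but the dilated stack sees 127 inputs per axis instead of 13.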

37
Q

second derivative

A

tells us how well we can expect a gradient descent step to perform

-> it can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point

38
Q

Taylor series

A

a Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point

A function can be approximated by using a finite number of terms of its Taylor series.

Taylor’s theorem gives quantitative estimates on the error introduced by the use of such an approximation

39
Q

Applications of eigenvalues

A

At a critical point, where ∇_x f(x) = 0, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point.
-> When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum.

40
Q

Applications of eigendecomposition

A

Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions

41
Q

univariate/multidimensional second derivative test

A

at a critical point, it allows us to determine whether the point is a local maximum, a local minimum, or a saddle point

42
Q

Newton’s method

A

problem (gradient descent): with a poorly conditioned Hessian, choosing a good step size is difficult -> the step size has to be so small that little significant progress is made

  • > This issue can be resolved by using information from the Hessian matrix to guide the search
  • > Newton’s method is based on using a second-order Taylor series expansion to approximate f(x) near some point x^(0)
  • > When f is a positive definite quadratic function, Newton’s method consists of applying equation 4.12 once to jump to the minimum of the function directly
  • > When f is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton’s method consists of applying equation 4.12 multiple times
  • > Iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point
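
A one-dimensional sketch of both cases. "Equation 4.12" is not reproduced on the card; the code assumes it is the standard Newton update x - f'(x) / f''(x) (in several dimensions: x - H^{-1} ∇f). The two test functions are assumed examples.

```python
import math

def newton_step(x, grad, hess):
    """One Newton update in 1-D: x - f'(x) / f''(x)."""
    return x - grad(x) / hess(x)

# Case 1: positive definite quadratic f(x) = 3 * (x - 2)^2 + 1, minimum at x = 2.
grad_q = lambda x: 6.0 * (x - 2.0)
hess_q = lambda x: 6.0
x_after_one_step = newton_step(10.0, grad_q, hess_q)   # jumps straight to the minimum

# Case 2: f(x) = exp(x) - 2x is not quadratic but has f'' > 0 everywhere,
# so the update is applied repeatedly. The minimum is at x = ln 2.
grad_e = lambda x: math.exp(x) - 2.0
hess_e = lambda x: math.exp(x)
x_iter = 0.0
for _ in range(6):
    x_iter = newton_step(x_iter, grad_e, hess_e)
```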
43
Q

first-order / second-order optimization algorithms

A

Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that also use the Hessian matrix, such as Newton’s method, are called second-order optimization algorithms

44
Q

convex function

A

a function for which the Hessian is positive semidefinite everywhere
- Such functions are well-behaved because they lack saddle points and all of their local minima are necessarily global minima.
However, most problems in deep learning are difficult to express in terms of convex optimization

45
Q

positive semidefinite

A

A matrix whose eigenvalues are all positive or zero-valued

46
Q

positive definite

A

A matrix whose eigenvalues are all positive

47
Q

line search

A

The line search approach first finds a descent direction along which the objective function f will be reduced and then computes a step size that determines how far x should move along that direction.
The descent direction can be computed by various methods, such as gradient descent or Newton’s method. The step size can be determined either exactly or inexactly.
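
A sketch of an inexact line search, assuming the common backtracking scheme with the Armijo sufficient-decrease condition. The card names no specific scheme, and the parameters `shrink` and `c` and the 1-D test function are illustrative choices.

```python
def backtracking_line_search(f, x, direction, slope, step=1.0, shrink=0.5, c=1e-4):
    """Shrink the step until f decreases enough (Armijo condition).

    slope is the directional derivative of f at x along direction;
    it must be negative, i.e. direction must be a descent direction.
    """
    while f(x + step * direction) > f(x) + c * step * slope:
        step *= shrink
    return step

f = lambda x: (x - 3.0) ** 2
grad = lambda x: 2.0 * (x - 3.0)

x = 0.0
direction = -grad(x)          # steepest descent direction
slope = grad(x) * direction   # directional derivative, negative here
step = backtracking_line_search(f, x, direction, slope)
x_new = x + step * direction
```

The full step of 1.0 overshoots to x = 6 without reducing f, so the search halves it once and accepts 0.5.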

48
Q

determinant of a square matrix

A

det(A):
- is a function mapping matrices to real scalars
- The determinant is equal to the product of all the eigenvalues of the matrix.
- The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space.
- If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume.
- If the determinant is 1, then the transformation preserves volume

49
Q

frequentist probability

A

related directly to the rates at which events occur

50
Q

Bayesian probability

A

related to qualitative levels of certainty (also: degree of belief)

51
Q

Image synthesis

A

is the process of creating new images from some form of image description

52
Q

Regression vs Classification

A

Regression: the output variable takes continuous values.
Regression involves estimating or predicting a response.

Classification: the output variable takes class labels.
Classification is identifying group membership.
53
Q

image patch

A

A patch is a small (generally rectangular) piece of an image. For example, an 8x8 patch is a square patch containing 64 pixels of a larger image (of size, say, 256x256 pixels). Due to the smaller size, some image processing algorithms, such as denoising or super-resolution, are easier to run on patches than on the entire image. These algorithms split an image into several smaller patches (of size, say, 8x8), operate individually on each of these patches, and finally tile all these patches at their respective locations.
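
The split / process / tile pattern can be sketched as follows, assuming non-overlapping patches and image dimensions that are exact multiples of the patch size (the function names and the 16x16 toy image are hypothetical).

```python
def split_into_patches(image, size):
    """Split a 2-D list (H x W) into non-overlapping size x size patches, row-major."""
    h, w = len(image), len(image[0])
    return [[row[c:c + size] for row in image[r:r + size]]
            for r in range(0, h, size)
            for c in range(0, w, size)]

def tile_patches(patches, height, width, size):
    """Put the patches back at their original locations."""
    image = [[None] * width for _ in range(height)]
    per_row = width // size
    for idx, patch in enumerate(patches):
        r0, c0 = (idx // per_row) * size, (idx % per_row) * size
        for dr, row in enumerate(patch):
            for dc, value in enumerate(row):
                image[r0 + dr][c0 + dc] = value
    return image

# Toy 16x16 "image" whose pixel values encode their own position.
image = [[r * 16 + c for c in range(16)] for r in range(16)]
patches = split_into_patches(image, 8)       # four 8x8 patches
restored = tile_patches(patches, 16, 16, 8)  # per-patch processing would go in between
```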

54
Q

VGG

A

de-facto standard for image generation tasks

55
Q

classification network

A

neural networks are used for the purpose of
- clustering through unsupervised learning,
- classification through supervised learning, or
- regression

That is, they help group unlabeled data, categorize labeled data or predict continuous values.

56
Q

Deep embeddings

A

answer one simple question: How similar are two images? Learning these embeddings is the bedrock of verification, zero-shot learning, and visual search

57
Q

feature space

A

Feature space refers to the n-dimensions where your variables live (not including a target variable, if it is present). The term is used often in ML literature because a task in ML is feature extraction, hence we view all variables as features

58
Q

channel dimension

A

Images are usually represented as Height x Width x #Channels where #Channels is 3 for RGB images and 1 for grayscale images. Sometimes you see Width x Height x #Channels, but the third dimension is the “channels.”

59
Q

Inception Score

A

a recently proposed and widely used evaluation metric for generative models

60
Q

Compressed sensing

A

is a signal processing technique for efficiently acquiring and reconstructing a signal, by finding solutions to underdetermined linear systems

61
Q

probability mass vs density function

A

A probability distribution over continuous random variables is described using a probability density function (PDF)

A probability distribution over discrete variables may be described using a probability mass function (PMF)

62
Q

expectation

A

the expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P

63
Q

variance

A

variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution
When the variance is low, the values of f(x) cluster near their expected value.
The square root of the variance is known as the standard deviation
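
A sketch for a discrete distribution, assuming a fair six-sided die and f(x) = x as the example. It uses Var(f(x)) = E[(f(x) - E[f(x)])^2] and takes the standard deviation as its square root.

```python
import math

def expectation(f, pmf):
    """E[f(x)] for a discrete distribution given as {value: probability}."""
    return sum(p * f(x) for x, p in pmf.items())

def variance(f, pmf):
    """Var(f(x)) = E[(f(x) - E[f(x)])^2]."""
    mean = expectation(f, pmf)
    return expectation(lambda x: (f(x) - mean) ** 2, pmf)

die = {face: 1.0 / 6.0 for face in range(1, 7)}   # fair six-sided die
identity = lambda x: x

mean = expectation(identity, die)   # 3.5
var = variance(identity, die)       # 35/12
std = math.sqrt(var)
```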

64
Q

standard deviation

A

The square root of the variance is known as the standard deviation

65
Q

covariance

A

covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables

covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, (i.e., the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, (i.e., the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.

66
Q

generative model

A

tbd

67
Q

Voxel occupancy

A

is one approach for reconstructing the 3-dimensional shape of an object from multiple views

A voxel represents a value on a regular grid in three-dimensional space. As with pixels in a bitmap, voxels themselves do not typically have their position (their coordinates) explicitly encoded along with their values. Instead, rendering systems infer the position of a voxel based upon its position relative to other voxels

68
Q

cross-entropy loss

A

A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.

Cross-entropy loss is also called negative log-likelihood or, in the binary case, logistic loss (log loss).
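
A sketch of the multi-class case, assuming the usual definition H(p, q) = -sum_i p_i * log(q_i) with a one-hot target p; the example probabilities are made up.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

target = [0.0, 1.0, 0.0]         # one-hot label: the true class is index 1
confident = [0.05, 0.90, 0.05]   # predicted distribution, mostly correct
unsure = [0.30, 0.40, 0.30]      # predicted distribution, hedging

loss_confident = cross_entropy(target, confident)   # -log(0.9)
loss_unsure = cross_entropy(target, unsure)         # -log(0.4)
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the true class, which is why it coincides with the negative log-likelihood.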

69
Q

multi-layer perceptron

A

tbd

70
Q

multi-layer RNN

A

Generally, current RNN models can only be stacked to 2 or 3 layers. Beyond 3 layers, performance may drop. This is generally due to the vanishing gradient problem in RNNs.

71
Q

perplexity

A

tbd

72
Q

softmax function/classifier

A

It takes a vector of arbitrary real-valued scores (in z) and squashes it to a vector of values between zero and one that sum to one
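
A sketch of the function, softmax(z)_i = exp(z_i) / sum_j exp(z_j). Subtracting the maximum score first is an added numerical-stability step (an assumption here, not part of the card) that does not change the result.

```python
import math

def softmax(z):
    """Map real-valued scores to positive values that sum to one."""
    m = max(z)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])
```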

73
Q

dropout

A

technique to help mitigate the problem of overfitting: randomly drop out some nodes/neurons in hidden layers during training so that they don’t participate in producing the output and later help the model to better generalize
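
A sketch assuming the common "inverted dropout" variant, in which the surviving activations are rescaled by 1 / keep_prob during training so that nothing needs to change at test time. The rescaling and the parameter names are assumptions, not part of the card.

```python
import random

def dropout(activations, drop_prob, training, rng):
    """Inverted dropout: zero units with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return list(activations)   # at test time every unit participates
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)   # seeded only to make the sketch reproducible
hidden = [1.0] * 1000
train_out = dropout(hidden, drop_prob=0.5, training=True, rng=rng)
test_out = dropout(hidden, drop_prob=0.5, training=False, rng=rng)
```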

74
Q

Mask R-CNN - Extending Faster R-CNN for Pixel Level Segmentation

A

Much like Fast R-CNN and Faster R-CNN, Mask R-CNN’s underlying intuition is straightforward. Given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel level segmentation?
Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object.

75
Q

object detection

A

Object detection is the task of finding the different objects in an image and classifying them

76
Q

R-CNN (1st version)

A

The goal of R-CNN is to take in an image, and correctly identify where the main objects (via a bounding box) in the image are
Inputs: Image
Outputs: Bounding boxes + labels for each object in the image

R-CNN creates these bounding boxes, or region proposals, using a process called Selective Search. At a high level, Selective Search looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, color, or intensity to identify objects. Once the proposals are created, R-CNN warps the region to a standard square size and passes it through to a modified version of AlexNet. On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that simply classifies whether this is an object, and if so what object.

77
Q

R-CNN in short (1st version)

A
  1. Generate a set of proposals for bounding boxes.
  2. Run the images in the bounding boxes through a pre-trained AlexNet and finally an SVM to see what object the image in the box is.
  3. Run the box through a linear regression model to output tighter coordinates for the box once the object has been classified.
78
Q

image instance segmentation

A

The goal of image instance segmentation is to identify, at a pixel level, what the different objects in a scene are.

79
Q

embedding

A

Embedding == representation

  • simply means projecting an input into another more convenient representation space
80
Q

multimodal

A

(of a frequency curve or distribution) having several modes or maxima

mode = the value in a set of data values that appears most often. It is the value x at which the probability mass function takes its maximum value, i.e. the value that is most likely to be sampled

81
Q

L1 Loss Function

A

used to minimize the error which is the sum of all the absolute differences between the true value and the predicted value.
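
A sketch of the sum of absolute differences (the example values are made up).

```python
def l1_loss(y_true, y_pred):
    """Sum of absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))

error = l1_loss([3.0, -0.5, 2.0], [2.5, 0.0, 2.0])   # 0.5 + 0.5 + 0.0
```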

82
Q

activation function

A

purpose: introduce non-linearity into the network

in turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables

non-linear means that the output cannot be reproduced from a linear combination of the inputs

another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because summing these layers would give you just another linear function
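
The last point can be sketched with two assumed 2x2 weight matrices and no biases: stacking two linear layers gives exactly the same output as one layer with weights W2 W1, while putting a ReLU in between breaks that equivalence.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product A B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def relu(v):
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, 2.0], [0.5, -1.0]]   # first layer (assumed values)
W2 = [[2.0, 0.0], [1.0, 3.0]]    # second layer (assumed values)
x = [3.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))      # no activation in between
one_layer = matvec(matmul(W2, W1), x)       # a single layer with W = W2 W1
with_relu = matvec(W2, relu(matvec(W1, x))) # non-linearity in between
```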

83
Q

bias

A
  • bias increases the flexibility of the model
  • bias determines if a neuron is activated

So our neuron only “activates” (has a non-zero output value) when w^T x + b > 0, which is equivalent to w^T x > −b. So the bias term for a neuron will act as an activation threshold in our setup (ReLU nonlinearities). Since we adaptively learn these bias terms via backpropagation, we may interpret this as allowing our neurons to learn when to activate and when not to activate.

84
Q

batchnorm

A
  • normalize the output from the activation function
  • occurs on a per batch basis
  • can speed up learning
  • means that esp. later layers are not shifted as much by earlier layers
  • has a slight regularization effect
85
Q

internal representation

A
  • Features or, more generally, an internal representation or a hierarchy of concepts should be learned automatically
  • The internal representation should separate all factors of variation (i.e., concepts that summarize important variation of the data)

-> Deep Learning introduces hierarchical representations (from simple to complex, from low-level features to high-level features)

86
Q

Deep Learning

A

Introduces hierarchical representations

87
Q

top-5 score

A

88
Q

probability density vs mass function

A

probability density function (if x is continuous) or a probability mass function (if x is discrete)

89
Q

Classification vs regression accuracy

A

tbd

90
Q

average likelihood

A

tbd

91
Q

Reinforcement

A

data is dynamically gathered based on previous experience

92
Q

Unsupervised

A

data is composed of just x; here we typically aim for p(x) or a method to sample p(x)

93
Q

predictor function vs loss function

A

….

94
Q

Bayes risk, empirical risk

A

95
Q

MLE

A

Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.

E.g. assuming a Gaussian distribution: Maximum likelihood estimation is a method that will find the values of μ and σ that result in the curve that best fits the data.

“This is the normal distribution that has been “fit” to the data by using the maximum likelihood estimations for the mean and the standard deviation”
https://www.youtube.com/watch?v=XepXtl9YKwc
=> “this is how we fit a distribution to data”
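
A sketch of the Gaussian case, using the closed-form maximum likelihood estimates: the sample mean, and the standard deviation computed with a divisor of n rather than n - 1. The data values are made up.

```python
import math

def gaussian_mle(samples):
    """Closed-form maximum likelihood estimates of mu and sigma."""
    n = len(samples)
    mu = sum(samples) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in samples) / n)  # divides by n, not n - 1
    return mu, sigma

def log_likelihood(samples, mu, sigma):
    """Log of the probability density of the samples under N(mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2)
               for x in samples)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma_hat = gaussian_mle(data)
best = log_likelihood(data, mu_hat, sigma_hat)
```

Any other choice of mu or sigma gives a lower `log_likelihood` on the same data, which is what "best fits" means here.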

96
Q

features

A
  • intermediate representation

-

97
Q

SVM

A

Aim is to find a separation between two classes with the largest gap (margin) possible

98
Q

Unsupervised learning

A

is associated with density estimation, learning to draw samples from a distribution, learning to denoise data from some distribution, finding a manifold that the data lies near to, or clustering the data into groups of related examples.

99
Q

(batch) gradient descent vs. mini-batch gradient descent

A
BGD allows you to take one gradient descent step per epoch (= a single pass through the training set), while mini-batch GD allows you to take t gradient descent steps per epoch, where t * m = size of training set and m is the size of a mini-batch, e.g.
t = 5000
m = 1000
#trainingSet = 5M

when you have a large training set, mini-batch GD runs much faster
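
The arithmetic above, plus a sketch of how a training set is cut into mini-batches. The helper `minibatches` and the 10-element toy list are hypothetical.

```python
def minibatches(data, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

training_set_size = 5_000_000
batch_size = 1_000                                # m
steps_per_epoch = training_set_size // batch_size # t: updates per epoch with mini-batches
steps_per_epoch_batch_gd = 1                      # batch GD: one update per epoch

toy = list(range(10))
batches = list(minibatches(toy, 4))   # one gradient step would be taken per batch
```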

100
Q

size of mini-batch (in GD)

A

let m be the size of the training set (X,Y):
if mini-batch size = m -> batch gradient descent, (X^{1},Y^{1}) = (X,Y); takes too long per iteration
if mini-batch size = 1 -> stochastic gradient descent (every example is its own mini-batch); loses the speedup from vectorization (very inefficient)

in practice: somewhere in between 1 and m (neither too small nor too large) gives the fastest learning because

  • > you get a lot of vectorization
  • > you make progress without processing the entire training set
101
Q

manifold learning

A

In ML manifold learning aims at finding a low-dimensional embedding to represent data

102
Q

key constraints in feedforward NN

A

the I/O (compositional) dependencies

103
Q

Softmax Unit

A

An extension of the logistic sigmoid to multiple variables

104
Q

sigmoid unit

A

Used to predict binary variables or to predict the probability of binary variables

  • Softmax, its extension to multiple variables, is used as the output of a multi-class classifier
  • The logistic sigmoid is the special case of softmax where we have 2 variables and z_1 = 0, z_2 = z
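
The relation between the two can be checked numerically: with two scores z_1 = 0 and z_2 = z, the second softmax output is exp(z) / (1 + exp(z)), which is the logistic sigmoid of z. The value of z below is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in zs]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.7
p_softmax = softmax([0.0, z])[1]   # probability of the second class
p_sigmoid = sigmoid(z)
```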
105
Q

affine transformation

A

is a linear mapping followed by a translation (x -> Ax + b) that preserves points, straight lines, and planes. Sets of parallel lines remain parallel after an affine transformation.

106
Q

Universal Approximation

A

(theorem)
A feedforward network with a linear output layer and at least one hidden layer with a nonlinear "squashing" activation (e.g., the logistic sigmoid), given enough hidden units, can approximate any (Borel measurable) function between two finite-dimensional spaces to any desired precision

however, we are not guaranteed that the learning algorithm will be able to build that representation

107
Q

no free lunch theorem

A

108
Q

depth

A

A general rule is that depth helps generalization

  • it is better to have many simple layers than few highly complex ones
  • Another interpretation is that depth allows a more gradual abstraction

A deep network might give a useful representation, where concepts are gradually more and more abstract

109
Q

capacity

A

110
Q

gradient descent assumptions

A

Gradient descent will reach a local minimum under some constraints on both the cost function and the learning rate:

1) One such constraint is Lipschitz continuity of the gradient of the cost function:
Lipschitz continuity of the gradient bounds how quickly the gradient can change anywhere in the domain: ||∇f(x) - ∇f(y)|| <= L ||x - y|| for some constant L

2) small enough learning rate
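
A sketch of how the two conditions interact, assuming the quadratic f(x) = (L / 2) * x^2, whose gradient L * x is Lipschitz with constant L. For this particular function each update multiplies x by (1 - lr * L), so gradient descent converges exactly when lr < 2 / L. The learning rates are illustrative.

```python
def gradient_descent(grad, x0, lr, steps):
    """Plain gradient descent: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

L = 4.0
grad = lambda x: L * x   # gradient of f(x) = (L / 2) * x^2, minimum at x = 0

small_lr = gradient_descent(grad, x0=1.0, lr=0.1, steps=50)   # 0.1 < 2 / L: converges
large_lr = gradient_descent(grad, x0=1.0, lr=0.6, steps=50)   # 0.6 > 2 / L: diverges
```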

111
Q

diagnosing GD

A

case 1: Lipschitzianity
The cost function does not satisfy the Lipschitz condition for any L
-> Solution: Smooth the cost function until an L exists

case 2: learning rate
not small enough
-> Solution: Make it smaller until gradient descent starts to work

112
Q

regularization

A

Regularization aims at reducing the generalization error of an algorithm

regularization aims at reducing variance

In deep learning, what often works best is a large model with good regularization.

113
Q

inverse problem

A

An inverse problem in science is the process of calculating from a set of observations the causal factors that produced them.

It is called an inverse problem because it starts with the results and then calculates the causes. This is the inverse of a forward problem, which starts with the causes and then calculates the results.

114
Q

underfitting / high bias

A

the model does not fit the training data (-> high error rate on training data)

high bias: the model has a strong preconception about the data and stubbornly keeps it despite contrary examples, so it fits the training data poorly

115
Q

overfitting / high variance

A

high variance: intuition:

overfitting: if we have too many features, the learned model may fit the training set very well but fail to generalize to new examples (e.g. predict prices on new examples)

116
Q

addressing overfitting

A

1) reduce number of features

2) use regularization
-> keep all features, but reduce the magnitude of the weight/bias parameters

117
Q

chromatic aberration

A

misalignment between the color channels

118
Q

Semi-Supervised Learning

A

uses unlabeled samples from p(x) and labeled samples from p(x,y) to build p(y|x) or directly predict y from x

119
Q

feature representation

A

120
Q

compute the receptive field (CNN)

A

see separate sheet

121
Q

iteration

A

one gradient descent update, i.e. one forward and backward pass over one mini-batch (a single example if the mini-batch size is 1, n examples if it is n)

122
Q

feature map

A

The feature map is the output of one filter applied to the previous layer. A given filter is drawn across the entire previous layer, moved one pixel at a time. Each position results in an activation of the neuron and the output is collected in the feature map. You can see that if the receptive field is moved one pixel from activation to activation, then the field will overlap with the previous activation by (field width - 1) input values.

For instance, dragging a 5 × 5 receptive field across a 32 × 32 input image with a stride of 1 results in a 28 × 28 feature map ((32 − 5 + 1) × (32 − 5 + 1)), i.e. 784 activations per image.
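
The arithmetic above follows the usual output-size formula for a convolution. A small sketch, where the function name and the optional padding argument are illustrative additions and not part of the card:

```python
def feature_map_size(input_size, filter_size, stride=1, padding=0):
    """Side length of the feature map for a square input and square filter."""
    return (input_size + 2 * padding - filter_size) // stride + 1


side = feature_map_size(32, 5, stride=1)
print(side, side * side)  # side length and number of activations
```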

123
Q

feature vs activation map

A

Feature map and activation map mean exactly the same thing. It is called an activation map because it is a mapping that corresponds to the activation of different parts of the image, and also a feature map because it is also a mapping of where a certain kind of feature is found in the image. A high activation means a certain feature was found.

124
Q

Undercomplete Autoencoders

A

A way to obtain a useful representation h is to constrain it to have a smaller dimension than x

In this case the AE is called undercomplete

We force the AE to focus on the most important attributes of the training data

125
Q

autoencoders

A
  • choosing the representation size and the capacity of the encoder/decoder depends on the complexity of the data distribution
126
Q

latent variables

A

variables we do not directly observe

127
Q

Regularized Autoencoders

A

128
Q

Gradient clipping

A

Gradient clipping ‘clips’ the gradients, i.e. caps them at a threshold value, to prevent them from getting too large.

With clipping, the gradient step is kept from overshooting, so the cost function follows the dotted path (in the figure of the linked article) rather than its original trajectory.
https://hackernoon.com/gradient-clipping-57f04f0adae
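
A minimal sketch of one common variant, clipping by the L2 norm: the whole gradient vector is rescaled when its norm exceeds the threshold. The function name and the choice of norm-based clipping (rather than clipping each component separately) are assumptions for illustration.

```python
import math


def clip_by_norm(gradient, threshold):
    """Rescale the gradient so that its L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)  # already small enough, leave unchanged
    scale = threshold / norm
    return [g * scale for g in gradient]


clipped = clip_by_norm([3.0, 4.0], threshold=1.0)  # original norm is 5
print(clipped)
```

Clipping by norm keeps the direction of the gradient and only shortens the step.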

129
Q

label smoothing

A

helps as regularizer
target labels are softened, so the softmax is never pushed towards a hard 1 or 0
can help the model converge
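
A sketch of one common formulation, which is an assumption here because the card gives no formula: each one-hot target is mixed with a uniform distribution over the K classes, using a small smoothing factor epsilon.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over its classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
print(smoothed)  # the true class gets slightly less than 1, the others slightly more than 0
```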

130
Q

Batch normalization

A

stabilizes learning by normalizing the input to each unit to have zero mean and unit variance.

This helps deal with training problems that arise due to poor initialization and helps gradient flow in deeper models

131
Q

conditional distribution

A

t

132
Q

the log (likelihood) trick

A

In machine learning, we generally assume the independence of different samples. Therefore,
we often have to deal with the product of a (large) number of distributions. When our goal
is to optimize functions of such products, it is often easier if we first work with the logarithm
of such functions. As the logarithmic function is a strictly increasing function, it will not
distort where the maximum is located.

the log makes a product of terms (the likelihood) into a sum (the log-likelihood) with which we can then work
on one term (i.e., one training sample) at a time, because they are summed together rather
than multiplied together.
=> the log trick makes the term more manageable
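
A practical side effect, sketched below with made-up per-sample likelihoods (the value 0.01 and the sample count are assumptions): the raw product underflows in floating point, while the sum of logs stays well behaved.

```python
import math

probs = [0.01] * 400  # hypothetical per-sample likelihoods

product = 1.0
for p in probs:
    product *= p  # the true value 1e-800 is far below the float64 range

log_likelihood = sum(math.log(p) for p in probs)

print(product)         # underflows to 0.0
print(log_likelihood)  # a finite negative number
```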

133
Q

interpolation

A

interpolation is a method of constructing new data points within the range of a discrete set of known data points

134
Q

approximate inference

A

Approximate inference methods make it possible to learn realistic models from big data by trading off computation time for accuracy, when exact learning and inference are computationally intractable

Exact Bayesian inference (evaluating the posterior via Bayes' theorem) is often intractable. So how can we solve it approximately for complex cases, so that Bayesian learning scales to the interesting, high-dimensional datasets we want to deal with today in ML? There has been a lot of excellent work on improving these approximations.
We can roughly divide approximate inference schemes into two categories: deterministic and stochastic.

Stochastic:
based on the idea of Monte-Carlo sampling i.e., we can approximate any expectation w.r.t. a distribution as a mean of samples from it
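
A minimal sketch of the Monte-Carlo idea. The target (the expectation of x**2 under a standard normal, whose exact value is 1), the sample count, and the seed are assumptions chosen so the estimate can be checked.

```python
import random


def monte_carlo_expectation(f, sampler, num_samples):
    """Approximate E[f(x)] by the mean of f over samples drawn by `sampler`."""
    return sum(f(sampler()) for _ in range(num_samples)) / num_samples


rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = monte_carlo_expectation(
    lambda x: x * x,              # f(x) = x**2
    lambda: rng.gauss(0.0, 1.0),  # samples from a standard normal
    100_000,
)
print(estimate)  # close to the exact value 1.0
```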

Deterministic:
the typical approach is to approximate the nasty posterior with a nice, simple, tractable distribution. We can parameterise the approximation with some variational parameters, and then minimise a probabilistic divergence (e.g., the Kullback-Leibler divergence) w.r.t. the variational parameters. We then use the trained approximate distribution instead of the true, intractable one

See https://www.quora.com/What-is-approximate-inference

135
Q

partition function

A

see 16.2.3 The Partition Function p. 568

The normalizing constant Z is known as the partition function, a term borrowed from statistical physics.
Since Z is an integral or sum over all possible joint assignments of the state x it is often intractable to compute.

136
Q

marginal distribution

A

the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables. This contrasts with a conditional distribution, which gives the probabilities contingent upon the values of the other variables.

137
Q

COGAN

A

coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images

138
Q

domain adaptation

A

Domain Adaptation from source to target distribution

E.g. the proposed framework was applied to the problem of adapting a classifier trained using labeled samples in one domain (source domain) to classify samples in a new domain (target domain), where labeled samples in the new domain are unavailable during training

139
Q

VAE vs AE

A

AE: autoencoders learn a “compressed representation” of the input automatically by first compressing the input (encoder) and then decompressing it (decoder) to match the original input. The learning is guided by a distance function that quantifies the information lost in the lossy compression. So learning in an autoencoder is a form of unsupervised learning (or self-supervised learning, as some refer to it): there is no labeled data.

VAE: Instead of just learning a function representing the data like AE (a compressed representation), variational autoencoders learn the parameters of a probability distribution representing the data. Since it learns to model the data, we can sample from the distribution and generate new input data samples. So it is a generative model like, for instance, GANs.

140
Q

VAE

A

VAEs optimize a variational bound

https://www.quora.com/Whats-the-difference-between-a-Variational-Autoencoder-VAE-and-an-Autoencoder

VAE: the variational encoder tends to produce codings that look as though they were sampled from a Gaussian distribution (parameterized by a mean and a variance). The advantage of this approach is that after training you can simply sample from the distribution and decode the sample to generate new data.

141
Q

ill-posed

A

tbd

142
Q

Laplacian distribution

A

The Laplace distribution, also called the double exponential distribution, is the distribution of differences between two independent variates (a variate is a generalization of the concept of a random variable) with identical exponential distributions

143
Q

variational bound

A

??

144
Q

image analogy

A

An image analogy is a method of creating an image filter automatically from training data. In an image analogy process, the transformation between two images A and A’ is “learned”. Later, given a different image B, its “analogy” image B’ can be generated based on the learned transformation.

https://mrl.nyu.edu/publications/image-analogies/analogies-72dpi.pdf

145
Q

Tensorflow padding SAME vs VALID

A

When stride is 1, think of the following distinction:

"SAME": output size is the same as input size. This requires the filter window to slide outside the input map, hence the need to pad.
SAME: Apply padding to the input (if needed) so that the input image gets fully covered by the filter and stride you specified. For stride 1, this ensures that the output image size is the same as the input size.
In SAME (i.e. auto-pad mode), Tensorflow will try to spread the padding evenly on both left and right.
"VALID": The filter window stays at valid positions inside the input map, so the output size shrinks by filter_size - 1. No padding occurs.
VALID: Don't apply any padding, i.e., assume that all dimensions are valid so that the input image gets fully covered by the filter and stride you specified.
In VALID (i.e. no padding mode), Tensorflow will drop the right and/or bottom cells if your filter and stride don't fully cover the input image.
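
The resulting output sizes along one spatial dimension can be sketched as follows. The helper name is hypothetical; the two formulas are the ones TensorFlow documents for its SAME and VALID modes.

```python
import math


def conv_output_size(input_size, filter_size, stride, padding):
    """Output length along one dimension for TensorFlow-style padding modes."""
    if padding == "SAME":
        return math.ceil(input_size / stride)
    if padding == "VALID":
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


print(conv_output_size(28, 5, 1, "SAME"))   # same as the input size
print(conv_output_size(28, 5, 1, "VALID"))  # shrinks by filter_size - 1
print(conv_output_size(13, 6, 5, "SAME"))   # padding added so every input cell is covered
print(conv_output_size(13, 6, 5, "VALID"))  # rightmost cells are dropped
```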

https://stackoverflow.com/questions/37674306/what-is-the-difference-between-same-and-valid-padding-in-tf-nn-max-pool-of-t

147
Q

disentangled representation

A

A disentangled representation is simply a concatenation of coordinates along each underlying factor of variation

148
Q

logit

A

The logit function is the inverse of the sigmoidal “logistic” function or logistic transform used in mathematics. Often, sigmoid function refers to the special case of the logistic function

In math, the logit is a function that maps probabilities in (0, 1) to R, i.e. (-inf, inf): logit(p) = log(p / (1 - p))
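
A short sketch of the inverse relationship between the logit and the logistic (sigmoid) function; the function names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic function: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Log-odds: maps a probability in (0, 1) back to a real number."""
    return math.log(p / (1.0 - p))


print(logit(0.5))           # even odds correspond to a logit of 0
print(logit(sigmoid(2.0)))  # recovers the input up to rounding
```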

149
Q

logits

A

The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function.

150
Q

binary cross entropy

A

Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.

see https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#cross-entropy
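
A minimal sketch of the binary case, averaged over samples. The clamping constant eps is an assumption added to avoid log(0); it is not part of the definition.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


good = binary_cross_entropy([1, 0], [0.9, 0.1])  # confident and correct
bad = binary_cross_entropy([1, 0], [0.1, 0.9])   # confident and wrong
print(good, bad)  # the loss grows as predictions diverge from the labels
```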

151
Q

weak labeling

A

we only know what attribute has changed between two images, although we do not know by how much

152
Q

InfoGAN

A

learns a subset of factors of variation by reproducing parts of the input vector with the discriminator

153
Q

instance normalization (IN vs BN)

A

The main difference between BN and IN is that the latter just computes the mean and standard deviation across the spatial domain of the input and not along the batch dimension

154
Q

posterior distribution p ( θ | X)

A

is the probability of the parameter θ given the evidence X: p( θ | X)

It contrasts with the likelihood function, which is the probability of the evidence given the parameters: p( X | θ)

the posterior probability of a random event is the conditional probability that is assigned after the relevant evidence or background is taken into account

Similarly, the posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey

“Posterior”, in this context, means after taking into account the relevant evidence related to the particular case being examined

155
Q

prior distribution p(θ)

A

the prior of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account

156
Q

likelihood function p( X | θ)

A

the likelihood function, which is the probability of the evidence X given the parameters θ: p( X | θ)

It contrasts with the posterior probability which is the probability of the parameter θ given the evidence X: p( θ | X)

157
Q

pre-training / pretraining a NN

A

cf. transfer learning

158
Q

manifold

A

a manifold is a connected region. Mathematically, it is a set of points associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space.

In mathematics, a manifold is a topological space that locally resembles Euclidean space near each point.

in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of dimensions, embedded in a higher-dimensional space. Each dimension corresponds to a local direction of variation.

159
Q

manifold learning algorithm

A

Manifold learning algorithms assume that most of R^n consists of invalid inputs and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another.

Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting

160
Q

probability concentration idea

A

see manifold learning.
the key assumption is that probability mass is highly concentrated along a (low-dimensional) manifold where the data lies

161
Q

manifold hypothesis

A

In ML, the assumption is that the data lies along a low-dimensional manifold (the manifold hypothesis), which is at least approximately correct by argument of two observations:
1) that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. This suggests that the images encountered in AI applications occupy a negligible proportion of the volume of the image space.

2) we can imagine a manifold, i.e. neighborhoods of interconnected examples (traversable by applying transformations), at least informally. In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, etc. It remains likely that there are multiple manifolds involved in most applications. For example, the manifold of images of human faces may not be connected to the manifold of images of cat faces.

162
Q

feature learning

A

an attempt to recover a parsimonious set of latent random variables that describe a distribution over the observed data

163
Q

ill-conditioning

A

ill-conditioning of the Hessian matrix H

a very general problem in most numerical optimization, convex or otherwise.

Ill-conditioning can manifest by causing SGD to get
“stuck” in the sense that even very small steps increase the cost function.

164
Q

receptive field (CNN)

A

(equivalently this is the filter size)

The receptive field is defined as the region in the input space that a particular CNN feature is looking at (i.e. is affected by)

For computational/feasibility reasons, each neuron is connected to only a local region in the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently this is the filter size).
The extent of the connectivity along the depth axis is always equal to the depth of the input volume.

It is important to emphasize this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.

165
Q

Jensen–Shannon divergence

A

the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions

It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and it is always a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen-Shannon distance
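
A sketch for discrete distributions that illustrates the two properties mentioned above. The natural logarithm is an assumption here; with it, the JS divergence is bounded by ln 2. The example distributions are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Average KL divergence of p and q to their mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Disjoint supports: KL(p || q) would be infinite, JS stays finite at ln 2.
print(js_divergence([1.0, 0.0], [0.0, 1.0]))

# Symmetry: swapping the arguments gives the same value.
a, b = [0.7, 0.3], [0.2, 0.8]
print(js_divergence(a, b), js_divergence(b, a))
```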

166
Q

mode collapse

A

most severe form of non-convergence in GAN training
-> G mostly produces samples for one mode only (e.g. one dog, all with beach)

reason why it happens in the GAN game: in practice we don’t play the minimax game (with the max over D in the inner loop), which would guarantee convergence to the correct distribution; instead we do SGD for both players (G and D) simultaneously
-> simultaneous SGD sometimes behaves like min-max and sometimes like max-min, and often it behaves more like max-min

min-max and max-min do different things

D in inner loop: convergence to correct distribution
G in inner loop: place all mass on most likely point

-> ways to reduce this problem:
    1) use mini-batch features: if a single sample is too close to other samples in the mini-batch, it can be rejected as having collapsed to a single mode
    2) Unrolled GANs: backprop through k updates of the discriminator to prevent mode collapse (this to make sure we’re actually doing min-max rather than max-min)
167
Q

the maximum likelihood learning rule

A

tbd

168
Q

Curse of dimensionality

A

true dimensionality often much lower than the possible dimensionality
why is it a problem to have high-dimensionality?
- ML methods are statistical by nature
=> count observations in various regions of some space
- as dimensionality grows, fewer observations per region -> many regions without observations -> the observations become sparser and sparser
=> search space grows very quickly with more dimensions but the number of examples you have stays the same -> there is less and less redundancy for your ML algorithm to sink its teeth into -> it will perform worse and worse

=> the problem is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
=> Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.

e.g. true dimensionality on digits: possible variations of the pen-stroke

169
Q

high-dimensional space

A

tbd

170
Q

ML & curse of dimensionality

A

In machine learning problems that involve learning a “state-of-nature” from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values
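
A back-of-the-envelope sketch of this effect. The numbers (one million samples, 10 possible values per feature) are assumptions for illustration: the number of value combinations grows exponentially with the number of features, so the average number of samples per combination collapses.

```python
def samples_per_cell(num_samples, values_per_feature, num_features):
    """Average number of samples per combination of feature values."""
    cells = values_per_feature ** num_features
    return num_samples / cells


for d in (1, 2, 3, 6, 9):
    print(d, samples_per_cell(1_000_000, 10, d))
```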

171
Q

cross-covariance

A

the cross-covariance is a function that gives the covariance of one process with the other at pairs of time points.

If X and Y are independent, then their covariance is zero. The converse, however, is not generally true: if two variables are uncorrelated, that does not in general imply that they are independent. A nonlinear relationship can exist that still results in a covariance of zero.
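
A small example of such a nonlinear relationship. The specific choice of X uniform on {-1, 0, 1} and Y = X**2 is an assumption for illustration: Y is completely determined by X, yet the covariance is zero.

```python
xs = [-1.0, 0.0, 1.0]     # X takes each value with probability 1/3
ys = [x * x for x in xs]  # Y = X**2 depends entirely on X

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

print(covariance)  # zero, although X and Y are clearly dependent
```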

172
Q

Lipschitz constant

A

tbd

173
Q

vector norm

A

a vector norm is a function that assigns a non-negative length or size to each vector in a vector space (strictly positive for every non-zero vector)

174
Q

matrix norm

A

a matrix norm is a vector norm in a vector space whose elements (vectors) are matrices

175
Q

spectral norm

A

The spectral norm of a matrix A is the largest singular value of A.
The induced matrix norm of the L2-norm for vectors is the spectral norm.

The spectral norm is the maximum singular value of a matrix. Intuitively, think of it as the maximum ‘scale’ by which the matrix can ‘stretch’ a vector.
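
One way to estimate the largest singular value without a linear algebra library is power iteration on A^T A. This is a sketch under assumptions: the matrix is non-zero, and the all-ones start vector is not orthogonal to the top singular direction. The function name and iteration count are illustrative.

```python
import math


def spectral_norm(matrix, iterations=200):
    """Estimate the largest singular value by power iteration on A^T A."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * cols
    for _ in range(iterations):
        # u = A v
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = A^T u, then renormalize to get the next unit vector v
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # ||A v|| for the converged unit vector v is the maximum stretch factor
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))


print(spectral_norm([[3.0, 0.0], [0.0, 4.0]]))  # singular values are 3 and 4
```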

176
Q

residual

A

The residual of a data point is the difference between the actual value and the predicted value

177
Q

maximum likelihood

A

the goal of maximum likelihood is to find the parameters under which a distribution best fits the observed data

178
Q

linear regression

A

a linear model that uses one independent variable to predict a (dependent) variable

-> uses “least squares” to fit the line
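
A minimal sketch of the closed-form least-squares fit for one independent variable. The data points are made up and lie exactly on y = 2x + 1, so the fit can be checked by eye.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```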

179
Q

multiple regression

A

like linear regression but uses multiple independent variables

180
Q

logistic regression

A

predicts whether something is True or False, instead of predicting something continuous like size

logistic regression fits an “S” shaped “logistic function” curve that goes from 0 to 1

logistic regression is usually used for classification

note well: LR can use continuous and discrete measurements to provide probabilities (and classify new samples)

=> uses “maximum likelihood” to fit the line (vs linear regression that uses “least squares” to fit the line)

181
Q

rank of a tensor

A

the rank refers to the number of dimensions present within the tensor

TF: The rank of a tensor is the number of indices required to uniquely select each element of the tensor. Rank is also known as “order”, “degree”, or “ndims.”
The rank of a tf.Tensor object is its number of dimensions.

182
Q

residual layer

A

So this motivated them to use skip connections and use so-called deep residual layers to allow their network to learn deviations from the identity layer, hence the term residual, residual here referring to difference from the identity.

183
Q

stationary vs non-stationary data

A

A stationary process reverts around a constant long-term mean and has a constant variance independent of time. In contrast, a non-stationary process has a variable variance and a mean that does not remain near, or return to, a long-run mean over time.

Most statistical forecasting methods are based on the assumption that the time series are approximately stationary.

Unfortunately, most price series are not stationary. They are either drifting upward or downward.
https://www.quora.com/What-is-Stationary-series-and-non-Stationary-series

Features should be stationary!
https://youtu.be/UQmKh84OZls?t=2513

184
Q

translation invariance (CNN)

A

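put simply: CNNs can detect the same object in an image even if it’s moved around (translated). Robustness to resizing or rotation is not built into the architecture and usually has to be learned, e.g. via data augmentation.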

there’s an innate prior in CNNs – the assumption that an image processing system should be translationally invariant – which is enforced through an architectural design choice (weight sharing)

Invariant to translation means that a translation of input features does not change the outputs at all. So if your pattern 0,3,2,0,0 on the input results in 0,1,0 in the output, then the pattern 0,0,3,2,0 would also lead to 0,1,0

https://stats.stackexchange.com/questions/208936/what-is-translation-invariance-in-computer-vision-and-convolutional-neural-netwo

185
Q

mean average precision (mAP)

A

a metric to measure the accuracy of object detectors like Faster R-CNN, SSD, etc

https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173

186
Q

attention

A

TBD

187
Q

accuracy vs precision

A

accuracy: how close you come to the actual result
(accuracy without precision: arrows cluster around correct result i.e. the apple but without certainty of a bull’s eye for any given shot)

precision: how consistently you can get that result using the same method
(precision without accuracy: arrows consistently hit center of head but not the apple)

-> while we ultimately strive for accuracy,
-> precision reflects our certainty of reliably achieving accuracy
188
Q

Bayesian networks (belief networks)

A
  • to compute uncertainties by using the concept of probability
  • used to model uncertainties by using directed acyclic graphs
  • used in predictive modeling, in descriptive analysis
  • e.g. Monty Hall problem
189
Q

downsampling

A

downscale an image

190
Q

cross validation
cross-validation
k-CV, 5-CV

A

see
https://towardsdatascience.com/k-fold-cross-validation-explained-in-plain-english-659e33c0bc0

https://stats.stackexchange.com/questions/266225/step-by-step-explanation-of-k-fold-cross-validation-with-grid-search-to-optimise
https://scikit-learn.org/stable/modules/cross_validation.html

191
Q

nested cross validation

nested CV

A

https://mlfromscratch.com/nested-cross-validation-python-code/#/

192
Q

Co-adaptation

A

In neural networks, co-adaptation refers to when different hidden units in a neural network have highly correlated behavior.

It is better for computational efficiency and the model’s ability to learn a general representation if hidden units can detect features independently of each other.

A few different regularization techniques aim at reducing co-adaptation, dropout being a notable one.

In neural networks, co-adaptation means that some neurons are highly dependent on others. If those independent neurons receive “bad” inputs, then the dependent neurons can be affected as well, and ultimately it can significantly alter the model performance, which is what might happen with overfitting.

193
Q

mixed precision training

A
  • Mixed precision is the combined use of different numerical precisions in a computational method.
  • Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network.
  • significant training speedups are experienced by switching to mixed precision – up to 3x overall speedup on the most arithmetically intense model architectures

Compared to higher-precision FP32 or FP64, half-precision (FP16) data reduces the memory usage of the neural network, allowing training and deployment of larger networks, and FP16 data transfers take less time than FP32 or FP64 transfers.

Single precision (also known as FP32 or 32-bit) is a common floating point format (float in C-derived programming languages); the 64-bit format is known as double precision (double).

Mixed precision consists in using full precision (i.e., float32) for some key layers (e.g., the loss layer) while reducing most of the other layers to half precision (i.e., float16). Training therefore requires less memory, data transfers are faster, and math-intensive and memory-limited operations are sped up. These benefits typically come at little or no accuracy cost compared to full-precision training.

194
Q

atrous convolution (aka dilated convolution)

A

Atrous (“with holes”) convolution is an alternative to the downsampling layer. It increases the receptive field while maintaining the spatial dimensions of the feature maps.

195
Q

Sensitivity / Recall

A
  • True Positive Rate
  • Recall

= TP / (TP + FN)

196
Q

Specificity

A
  • True Negative Rate

= TN / (TN + FP)

197
Q

Precision

A

= TP / (TP + FP)

198
Q

Accuracy

A

= (TP + TN) / (TP + TN + FP + FN)

Accuracy can be a misleading metric for imbalanced data sets. Consider a sample with 95 negative and 5 positive values. Classifying all values as negative in this case gives 0.95 accuracy score.
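
The example above can be checked with a small helper that computes the four metrics from confusion-matrix counts. The function name is illustrative, and returning 0.0 when a denominator is zero is a convention assumed here (precision is strictly undefined when nothing is predicted positive).

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, specificity and precision from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# 95 negatives, 5 positives, everything classified as negative.
metrics = classification_metrics(tp=0, tn=95, fp=0, fn=5)
print(metrics)  # high accuracy, but recall is 0: no positive is ever found
```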

199
Q

translation equivariance (CNN)

A

Equivariant to translation means that a translation of input features results in an equivalent translation of outputs. So if your pattern 0,3,2,0,0 on the input results in 0,1,0,0 in the output, then the pattern 0,0,3,2,0 might lead to 0,0,1,0
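
A sketch of this behavior with a 1D cross-correlation, as used in CNN layers. The kernel [3, 2] is an assumption chosen to match the pattern in the card, so the raw responses differ from the card's simplified 0/1 outputs; the point is that the response moves together with the input.

```python
def correlate_valid(signal, kernel):
    """1D cross-correlation without padding."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [3, 2]  # hypothetical filter that responds most strongly to the pattern 3,2
original = correlate_valid([0, 3, 2, 0, 0], kernel)
shifted = correlate_valid([0, 0, 3, 2, 0], kernel)

print(original)  # peak response at index 1
print(shifted)   # same responses, moved one position to the right
```

Taking the maximum over all positions (global max pooling) gives the same value for both inputs, which is how an equivariant layer can be turned into a translation-invariant output.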