Machine Learning Flashcards
adversarial training
AT: “training a model in a worst-case scenario, with inputs chosen by an adversary”
Adversarial training is often used to enforce constraints on random variables
GAN
GAN is a generative model that learns the probability distribution (or data distribution) of the training examples it is given.
From this distribution, we can then create sample outputs. GANs have seen their largest progress with image training examples, but this idea of modeling data distributions is one that can be applied with other forms of input
=> the key mathematical tool GANs give you is the ability to “estimate a ratio”
==> GANs are generative models that use supervised learning to approximate an intractable cost function by estimating ratios
discriminator network
t
convolutional neural network
a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery
deconvolutional neural network
t
backpropagation
In principle, all backpropagation does is (stochastic) gradient descent -> this converges to a local minimum, which is often surprisingly good
Backpropagation is a method used in artificial neural networks to calculate a gradient that is needed in the calculation of the weights to be used in the network. It is commonly used to train deep neural networks, a term used to explain neural networks with more than one hidden layer
inpainting
In the digital world, inpainting (also known as image interpolation or video interpolation) refers to the application of sophisticated algorithms to replace lost or corrupted parts of the image data (mainly small regions or to remove small defects)
damage and repair strategy
t
pretext task
t
autoencoder
is an artificial neural network used for unsupervised learning of efficient codings (ie feature learning)
The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction
low-level statistics
e.g. unusual local texture patterns
variability
t
epoch
An epoch is one complete presentation of the data set to be learned to a learning machine. Learning machines like feedforward neural nets that use iterative algorithms often need many epochs during their learning phase.
adversarial examples
inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines
-> in computer vision: usually an image formed by making small perturbations to an example image from a dataset
tensor
multidimensional array
GAN
The dueling-neural-network approach has vastly improved learning from unlabeled data.
spatial resolution
Spatial resolution is a term that refers to the number of pixels utilized in construction of a digital image. Images having higher spatial resolution are composed with a greater number of pixels than those of lower spatial resolution.
saccade
is a quick, simultaneous movement of both eyes between two or more phases of fixation in the same direction
no free lunch theorem
“averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points”
in other words, the most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data-generating distributions.
But if we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions
==> the no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task
hypothesis space
e.g. linear regression has a hypothesis space consisting of the set of linear functions of its input
regularization
we can regularize a model that learns a function f(x; θ) by adding a penalty called a regularizer to the cost function
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization
loss function
when we minimize the objective function, we may also call it the cost function or loss function
objective function
the function we want to minimize or maximize, also called criterion
=> when we are minimizing the objective function, we may also call it the cost function or loss function
squared L^2 norm
can be calculated simply as x^Tx
-> is more convenient to work with mathematically and computationally than the L^2 norm itself
sigma function
sigmoid function (logistic curve)
- > if x is very negative, the output is close to 0
- > if x is very positive, the output is close to 1
- > increases steadily around x = 0
eigenvalue
an eigenvector of a square matrix A is a nonzero vector v s.t. multiplication by A alters only the scale of v:
Av = cv
where c is a scalar known as the eigenvalue corresponding to this eigenvector
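To make the definition concrete, here is a minimal NumPy check (my own illustration, not from any particular source) that Av = cv holds for the eigenpairs returned by np.linalg.eig:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# np.linalg.eig returns the eigenvalues and the corresponding eigenvectors (as columns)
eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    c = eigenvalues[i]
    # A @ v should equal c * v (up to floating-point error)
    assert np.allclose(A @ v, c * v)

print(eigenvalues)  # [2. 3.]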
eigendecomposition
tbd
Orthogonal matrix
is a square matrix with real entries whose columns and rows are orthogonal unit vectors (i.e., orthonormal vectors):
Q^TQ = QQ^T = I
where I is identity matrix.
This leads to the equivalent characterization: a matrix Q is orthogonal if its transpose is equal to its inverse:
Q^T = Q^-1
singular matrix
a square matrix with linearly dependent columns
unit vector
A Unit Vector has a magnitude of 1
sigma function
sigmoid function (logistic curve)
ReLU
-> rectified linear unit
-> the rectifier is an activation function defined as the positive part of its argument:
f(x) = max(0,x)
-> a unit employing the rectifier is also called a rectified linear unit (ReLU)
-> easier to train than sigmoid
ReLU(a) = max(0,a)
rectified (German: gleichgerichtet)
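As a quick illustration (a minimal sketch; the function names are my own), both ReLU and the sigmoid are one-liners in NumPy:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))      # [0. 0. 0. 1. 5.]
print(sigmoid(x))   # ≈ [0.007 0.269 0.5 0.731 0.993]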
loss function is non-convex?
the optimization is prone to falling into local minima
neural networks are mostly used with non-linear activation functions (e.g. sigmoid), hence the optimization becomes non-convex
Gram matrix
the Gram matrix (Gramian matrix or Gramian) of a set of vectors v_1,…v_n in an inner product space is the Hermitian matrix of inner products
An important application is to compute linear independence:
a set of vectors is linearly independent if and only if the Gram determinant (the determinant of the Gram matrix) is non-zero
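A small NumPy sketch (my own example) of the linear-independence test via the Gram determinant:

import numpy as np

# Rows of V are the vectors v_1, ..., v_n (real case, standard inner product)
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 3.0]])   # v_3 = v_1 + v_2, so the set is linearly dependent

G = V @ V.T                       # Gram matrix of pairwise inner products <v_i, v_j>

print(np.linalg.det(G))           # ~0: the Gram determinant vanishes => linear dependence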
artifact
any error in the perception or representation of any information, introduced by the involved equipment or technique(s)
dilated convolution
vanilla convolutions struggle to integrate global context: the effective receptive field of units can only grow linearly with layers. This is very limiting, especially for high-resolution input images. Dilated convolutions to the rescue: even though the number of parameters grows only linearly with layers, the effective receptive field of units grows exponentially with layer depth
They can be very useful in some settings in conjunction with 0-dilated filters because they allow you to merge spatial information across the inputs much more aggressively with fewer layers
http://www.inference.vc/dilated-convolutions-and-kronecker-factorisation/
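A tiny calculation (my own sketch; the helper name is made up) showing how the effective extent of a dilated filter, d*(f-1)+1, grows with the dilation rate d:

def dilated_filter_extent(filter_size, dilation):
    # Effective spatial extent of a dilated filter: d*(f-1) + 1
    return dilation * (filter_size - 1) + 1

# Stacking 3x3 filters with dilations 1, 2, 4, ... grows the receptive field
# exponentially with depth while the parameter count grows only linearly.
for d in [1, 2, 4, 8]:
    print(d, dilated_filter_extent(3, d))   # 3, 5, 9, 17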
second derivative
tells us how well we can expect a gradient descent step to perform
-> it can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point
Taylor series
a Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point
A function can be approximated by using a finite number of terms of its Taylor series.
Taylor’s theorem gives quantitative estimates on the error introduced by the use of such an approximation
Applications of eigenvalues
At a critical point, where ∇_x f(x) = 0, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point.
-> When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum.
Applications of eigendecomposition
Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions
univariate/multidimensional second derivative test
allows us to determine, at a critical point, whether the critical point is a local maximum, local minimum, or saddle point
Newton’s method
problem: poor condition number, choosing a good step size is difficult -> step size too small, too little significant progress made
- > This issue can be resolved by using information from the Hessian matrix to guide the search
- > Newton’s method is based on using a second-order Taylor series expansion to approximate f(x) near some point x^(0)
- > When f is a positive definite quadratic function, Newton’s method consists of applying equation 4.12 once to jump to the minimum of the function directly
- > When f is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton’s method consists of applying equation 4.12 multiple times.
- > Iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point
first-order / second-order optimization algorithms
Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that also use the Hessian matrix, such as Newton’s method, are called second-order optimization algorithms
convex function
a function for which the Hessian is positive semidefinite everywhere
- Such functions are well-behaved because they lack saddle points and all of their local minima are necessarily global minima.
However, most problems in deep learning are difficult to express in terms of convex optimization
positive semidefinite
A matrix whose eigenvalues are all positive or zero-valued
positive definite
A matrix whose eigenvalues are all positive
line search
The line search approach first finds a descent direction along which the objective function f will be reduced and then computes a step size that determines how far x should move along that direction.
The descent direction can be computed by various methods, such as gradient descent, Newton’s method. The step size can be determined either exactly or inexactly.
determinant of a square matrix
det(A):
- is a function mapping matrices to real scalars
- The determinant is equal to the product of all the eigenvalues of the matrix.
- The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space.
- If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume.
- If the determinant is 1, then the transformation preserves volume
frequentist probability
related directly to the rates at which events occur
Bayesian probability
related to qualitative levels of certainty (also: degree of belief)
Image synthesis
is the process of creating new images from some form of image description
Regression vs Classification
Regression: the output variable takes continuous values.
Regression involves estimating or predicting a response.
Classification: the output variable takes class labels. Classification is identifying group membership.
image patch
A patch is a small (generally rectangular) piece of an image. For example, an 8x8 patch is a square patch containing 64 pixels of a larger image (of size say, 256x256 pixels). Due to the smaller size, some image processing algorithms such as denoising/super resolution etc. are easier to operate on patches rather than on the entire image itself. These algorithms split an image into several smaller sized patches (of size say, 8x8), operate individually on each of these patches, and finally tile all these patches at their respective locations.
VGG
a deep convolutional network from the Visual Geometry Group; its pretrained features are a de facto standard (e.g. for perceptual losses) in image generation tasks
classification network
neural networks are used for the purpose of
- clustering through unsupervised learning,
- classification through supervised learning, or
- regression
That is, they help group unlabeled data, categorize labeled data or predict continuous values.
Deep embeddings
answer one simple question: How similar are two images? Learning these embeddings is the bedrock of verification, zero-shot learning, and visual search
feature space
Feature space refers to the n-dimensions where your variables live (not including a target variable, if it is present). The term is used often in ML literature because a task in ML is feature extraction, hence we view all variables as features
channel dimension
Images are usually represented as Height x Width x #Channels where #Channels is 3 for RGB images and 1 for grayscale images. Sometimes you see Width x Height x #Channels, but the third dimension is the “channels.”
Inception Score
a recently proposed and widely used evaluation metric for generative models
Compressed sensing
is a signal processing technique for efficiently acquiring and reconstructing a signal, by finding solutions to underdetermined linear systems
probability mass vs density function
a probability distribution over continuous random variables is described by a probability density function (PDF)
A probability distribution over discrete variables may be described using a probability mass function (PMF)
expectation
expectation or expected value of some function f(x) with respect to a probability distribution P (x) is the average or mean value that f takes on when x is drawn from P
variance
variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution
When the variance is low, the values of f (x) cluster near their expected value.
The square root of the variance is known as the standard deviation
standard deviation
The square root of the variance is known as the standard deviation
covariance
covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables
covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, (i.e., the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, (i.e., the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
generative model
tbd
Voxel occupancy
is one approach for reconstructing the 3-dimensional shape of an object from multiple views
A voxel represents a value on a regular grid in three-dimensional space. As with pixels in a bitmap, voxels themselves do not typically have their position (their coordinates) explicitly encoded along with their values. Instead, rendering systems infer the position of a voxel based upon its position relative to other voxels
cross-entropy loss
A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.
This is also called a negative log-likelihood function. Or logistic loss or cross-entropy loss.
multi-layer perceptron
tbd
multi-layer RNN
Generally, current RNN models can only be stacked to 2 or 3 layers. Beyond 3 layers, performance may drop, generally because of the vanishing gradient problem in RNNs.
perplexity
tbd
softmax function/classifier
It takes a vector of arbitrary real-valued scores (in z) and squashes it to a vector of values between zero and one that sum to one
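A minimal NumPy implementation (my own sketch) of that squashing, including the usual max-subtraction for numerical stability:

import numpy as np

def softmax(z):
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the result because softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())   # values in (0, 1) that sum to 1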
dropout
technique to help mitigate the problem of overfitting: randomly drop out some nodes/neurons in hidden layers during training so that they don’t participate in producing the output and later help the model to better generalize
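A hedged sketch of the common "inverted dropout" variant (the function name and keep_prob parameter are my own choices, not from the card above):

import numpy as np

def dropout(activations, keep_prob=0.8, training=True):
    # Inverted dropout: randomly zero units during training and rescale by 1/keep_prob
    # so the expected activation stays the same; at test time, do nothing.
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < keep_prob)
    return activations * mask / keep_prob

a = np.ones((4, 5))
print(dropout(a, keep_prob=0.8))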
Mask R-CNN - Extending Faster R-CNN for Pixel Level Segmentation
Much like Fast R-CNN and Faster R-CNN, Mask R-CNN’s underlying intuition is straightforward. Given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel-level segmentation?
Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object.
object detection
Object detection is the task of finding the different objects in an image and classifying them
R-CNN (1st version)
The goal of R-CNN is to take in an image, and correctly identify where the main objects (via a bounding box) in the image are
Inputs: Image
Outputs: Bounding boxes + labels for each object in the image
R-CNN creates these bounding boxes, or region proposals, using a process called Selective Search. At a high level, Selective Search looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, color, or intensity to identify objects. Once the proposals are created, R-CNN warps the region to a standard square size and passes it through to a modified version of AlexNet. On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that simply classifies whether this is an object, and if so what object.
R-CNN in short (1st version)
- Generate a set of proposals for bounding boxes.
- Run the images in the bounding boxes through a pre-trained AlexNet and finally an SVM to see what object the image in the box is.
- Run the box through a linear regression model to output tighter coordinates for the box once the object has been classified.
image instance segmentation
The goal of image instance segmentation is to identify, at a pixel level, what the different objects in a scene are.
embedding
Embedding == representation
- simply means projecting an input into another more convenient representation space
multimodal
(of a frequency curve or distribution) having several modes or maxima
mode = the value in a set of data values that appears most often; it is the value x at which the probability mass function takes its maximum value, i.e. the value that is most likely to be sampled
L1 Loss Function
used to minimize the error which is the sum of all the absolute differences between the true value and the predicted value.
activation function
purpose: introduce non-linearity into the network
in turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables
non-linear means that the output cannot be reproduced from a linear combination of the inputs
another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because summing these layers would give you just another linear function
bias
- bias increases the flexibility of the model
- bias determines if a neuron is activated
So our neuron only “activates” (has a non-zero output value) when w^Tx + b > 0, which is equivalent to w^Tx > −b. So the bias term for a neuron acts as an activation threshold in our setup (ReLU nonlinearities). Since we adaptively learn these bias terms via backpropagation, we may interpret this as allowing our neurons to learn when to activate and when not to activate.
batchnorm
- normalize the output from the activation function
- occurs on a per batch basis
- can speed up learning
- means that esp. later layers are not shifted as much by earlier layers
- has a slight regularization effect
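A minimal NumPy sketch of the training-time forward pass only (my own illustration; running statistics, the learnable-parameter updates and the backward pass are omitted):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, features). Normalize each feature over the batch dimension,
    # then scale and shift with the learnable parameters gamma and beta.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 8) * 5 + 3          # batch of 32 samples, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 mean, ~1 std per feature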
internal representation
- Features or, more in general, an internal representation or a hierarchy of concepts should be learned automatically
- The internal representation should separate all factors of variation (i.e., concepts that summarize important variation of the data)
-> Deep Learning Introduces hierarchical representations (from simple to complex, from low-level features to high-level features)
Deep Learning
Introduces hierarchical representations
top-5 score
…
probability density vs mass function
probability density function (if x is continuous) or a probability mass function (if x is discrete)
Classification vs regression accuracy
tbd
average likelihood
tbd
Reinforcement
data is dynamically gathered based on previous experience
Unsupervised
data is composed of just x; here we typically aim for p(x) or a method to sample p(x)
predictor function vs loss function
….
Bayes risk, empirical risk
…
MLE
Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.
E.g. assuming a Gaussian distribution: Maximum likelihood estimation is a method that will find the values of μ and σ that result in the curve that best fits the data.
“This is the normal distribution that has been “fit” to the data by using the maximum likelihood estimations for the mean and the standard deviation”
https://www.youtube.com/watch?v=XepXtl9YKwc
=> “this is how we fit a distribution to data”
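A small NumPy example (my own sketch) of fitting a Gaussian by maximum likelihood, i.e. using the sample mean and the (biased) sample standard deviation:

import numpy as np

# Draw samples from a "true" Gaussian, then recover mu and sigma by maximum likelihood.
np.random.seed(0)
data = np.random.normal(loc=5.0, scale=2.0, size=10_000)

mu_hat = data.mean()                 # MLE of the mean
sigma_hat = data.std()               # MLE of the std (divides by N, not N-1)

print(mu_hat, sigma_hat)             # close to 5.0 and 2.0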
features
- intermediate representation
-
SVM
Aim is to find a separation between two classes with the largest gap (margin) possible
Unsupervised learning
is associated with density estimation, learning to draw samples from a distribution, learning to denoise data from some distribution, finding a manifold that the data lies near to, or clustering the data into groups of related examples.
(batch) gradient descent vs. mini-batch gradient descent
BGD allows you to take one gradient descent step per epoch (= a single pass through the training set), while mini-batch GD allows you to take t gradient descent steps, where t * m = size of training set and m is the size of a mini-batch, e.g. t = 5000, m = 1000, #trainingSet = 5M
when you have a large training set, mini-batch GD runs much faster
size of mini-batch (in GD)
let m size of training set (X,Y):
if mini-batch size = m -> batch gradient descent, (X^{1},Y^{1}) = (X,Y); takes too long per iteration
if mini-batch size = 1 -> stochastic gradient descent (every example is its own mini-batch); lose speedup from vectorization (very inefficient)
in practice: somewhere in between 1 and m (neither too small nor too large): gives the fastest learning because
- > you get a lot of vectorization
- > make progress without processing entire training set
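A minimal sketch (names and sizes are my own choices) of splitting a shuffled training set into mini-batches, one gradient descent step per batch:

import numpy as np

def minibatches(X, Y, batch_size=1000, seed=0):
    # Shuffle the training set, then yield mini-batches of the requested size.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], Y[batch]

X = np.random.randn(5000, 10)
Y = np.random.randn(5000, 1)
for X_batch, Y_batch in minibatches(X, Y, batch_size=1000):
    pass  # one gradient descent step per mini-batch would go here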
manifold learning
In ML manifold learning aims at finding a low-dimensional embedding to represent data
key constraints in feedforward NN
the I/O (compositional) dependencies
Softmax Unit
An extension of the logistic sigmoid to multiple variables
- Used as the output of a multi-class classifier
sigmoid unit
Used to predict binary variables or to predict the probability of binary variables
- Softmax is an extension of the logistic sigmoid: with 2 variables, z_1 = 0 and z_2 = z, it reduces to the sigmoid
affine transformation
is a linear mapping method that preserves points, straight lines, and planes. Sets of parallel lines remain parallel after an affine transformation.
Universal Approximation
(theorem)
A feedforward network with a linear output layer and enough (but at least one) hidden nonlinear layers (e.g., the logistic sigmoid unit) can approximate up to any desired precision any (Borel measurable) function between two finite-dimensional spaces
however, we are not guaranteed that the learning algorithm will be able to build that representation
no free lunch theorem
…
depth
A general rule is that depth helps generalization
- it is better to have many simple layers than few highly complex ones
- Another interpretation is that depth allows a more gradual abstraction
A deep network might give a useful representation, where concepts are gradually more and more abstract
capacity
…
gradient descent assumptions
Gradient descent will reach a local minimum under some constraints on both the cost function and the learning rate:
1) One such constraint is Lipschitz continuity of the gradient of the cost function:
Lipschitz continuity bounds how quickly the gradient can change across the whole domain (by the Lipschitz constant L)
2) small enough learning rate
diagnosing GD
case 1: Lipschitzianity
The cost function does not satisfy the Lipschitz condition for any L
-> Solution: Smooth the cost function until an L exists
case 2: learning rate
not small enough
-> Solution: Make it smaller until gradient descent starts to work
regularization
Regularization aims at reducing the generalization error of an algorithm
regularization aims at reducing variance
In deep learning, the best option is often a large model with good regularization.
inverse problem
An inverse problem in science is the process of calculating from a set of observations the causal factors that produced them.
It is called an inverse problem because it starts with the results and then calculates the causes. This is the inverse of a forward problem, which starts with the causes and then calculates the results.
underfitting / high bias
the model does not fit the training data (-> high error rate on training data)
high bias: the model has a strong preconception about the data and, contrary to what the examples show, it stubbornly keeps this preconception, so that it fits the training data poorly
overfitting / high variance
high variance: intuition:
overfitting: if we have too many features, the learned model may fit the training set very well but fail to generalize to new examples (e.g. predicting prices for new examples)
addressing overfitting
1) reduce number of features
2) use regularization
- > keep all features, but reduce magnitude/values of weight/bias parameters
chromatic aberration
misalignment between the color channels
Semi-Supervised Learning
uses unlabeled samples from p(x) and labeled samples from p(x,y) to build p(y|x) or directly predict y from x
feature representation
…
compute the receptive field (CNN)
see separate sheet
iteration
1 execution of gradient descent (could be 1 forward pass or n forward passes, depending on the mini-batch size)
feature map
The feature map is the output of one filter applied to the previous layer. A given filter is drawn across the entire previous layer, moved one pixel at a time. Each position results in an activation of the neuron and the output is collected in the feature map. You can see that if the receptive field is moved one pixel from activation to activation, then the field will overlap with the previous activation by (field width - 1) input values.
For instance, in a 32 × 32 image, dragging the 5 × 5 receptive field across the input image data with a stride width of 1 will result in a feature map of 28 × 28 (32-5+1 × 32-5+1) output values or 784 distinct activations per image.
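The 32 → 28 arithmetic above generalizes to the standard output-size formula; a tiny helper (my own sketch) to compute it:

def conv_output_size(input_size, filter_size, stride=1, padding=0):
    # Standard formula for the spatial size of a convolutional feature map.
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(32, 5))   # 28, matching the 32x32 image / 5x5 filter example above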
feature vs activation map
Feature map and activation map mean exactly the same thing. It is called an activation map because it is a mapping that corresponds to the activation of different parts of the image, and also a feature map because it is also a mapping of where a certain kind of feature is found in the image. A high activation means a certain feature was found.
Undercomplete Autoencoders
A way to obtain a useful representation h is to constrain it to have a smaller dimension than x
In this case the AE is called undercomplete
We force the AE to focus on the most important attributes of the training data
autoencoders
- choosing the representation size and the capacity of the encoder/decoder depends on the complexity of the data distribution
latent variables
variables we do not directly observe
Regularized Autoencoders
…
Gradient clipping
Gradient clipping will ‘clip’ the gradients or cap them to a Threshold value to prevent the gradients from getting too large.
With clipping, the gradient is prevented from overshooting and the cost function follows a smoother trajectory than it otherwise would.
https://hackernoon.com/gradient-clipping-57f04f0adae
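A minimal NumPy sketch of clipping by norm (my own illustration; deep learning frameworks provide their own utilities for this):

import numpy as np

def clip_by_norm(grad, threshold=5.0):
    # If the gradient's L2 norm exceeds the threshold, rescale it so its norm
    # equals the threshold; otherwise leave it unchanged.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])             # norm 50
print(clip_by_norm(g, threshold=5.0))  # [3. 4.]  (norm 5, same direction)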
label smoothing
helps as regularizer
softmax outputs are never hard 1 or 0
can help the model converge
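A minimal sketch of one common formulation (my own example; eps and the uniform spread over the K classes are assumptions of this particular variant):

import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # Replace the hard 0/1 targets with eps/K and 1 - eps + eps/K.
    K = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / K

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y, eps=0.1))   # [0.025 0.025 0.925 0.025], still sums to 1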
Batch normalization
stabilizes learning by normalizing the input to each unit to have zero mean and unit variance.
This helps deal with training problems that arise due to poor initialization and helps gradient flow in deeper models
conditional distribution
t
the log (likelihood) trick
In machine learning, we generally assume the independence of different samples. Therefore, we often have to deal with the product of a (large) number of distributions. When our goal is to optimize functions of such products, it is often easier if we first work with the logarithm of such functions. As the logarithmic function is a strictly increasing function, it will not distort where the maximum is located.
The log makes a product of terms (the likelihood) into a sum (the log-likelihood), with which we can then work on one term (i.e., one training sample) at a time, because they are summed together rather than multiplied together.
=> the log trick makes the term more manageable
interpolation
interpolation is a method of constructing new data points within the range of a discrete set of known data points
approximate inference
Approximate inference methods make it possible to learn realistic models from big data by trading off computation time for accuracy, when exact learning and inference are computationally intractable
Bayes theorem is intractable. So how can we approximately solve Bayes theorem for complex cases, so that we can scale up Bayesian learning to the types of interesting, high-dimensional datasets that we want to deal with today in ML. There has been a lot of really excellent work on improving these approximations.
We can roughly divide approximate inference schemes into two categories: deterministic and stochastic.
Stochastic:
based on the idea of Monte-Carlo sampling i.e., we can approximate any expectation w.r.t. a distribution as a mean of samples from it
Deterministic:
the typical approach is to approximate the nasty posterior with a nice, simple, tractable distribution. We can parameterise the approximation with some variational parameters, and then minimise a probabilistic divergence (e.g., the Kullback–Leibler divergence) w.r.t. the variational parameters. We then use the trained approximate distribution instead of the true, intractable one
See https://www.quora.com/What-is-approximate-inference
partition function
see 16.2.3 The Partition Function p. 568
The normalizing constant Z is known as the partition function, a term borrowed from statistical physics.
Since Z is an integral or sum over all possible joint assignments of the state x, it is often intractable to compute.
marginal distribution
the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables. This contrasts with a conditional distribution, which gives the probabilities contingent upon the values of the other variables.
COGAN
coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images
domain adaption
Domain Adaptation from source to target distribution
E.g. we applied the proposed framework to the problem of adapting a classifier trained using labeled samples in one domain (source domain) to classify samples in a new domain (target domain), where labeled samples in the new domain are unavailable during training
VAE vs AE
AE: autoencoders learn a “compressed representation” of input automatically by first compressing the input (encoder) and decompressing it back (decoder) to match the original input. The learning is aided by using distance function that quantifies the information loss that occurs from the lossy compression. So learning in an autoencoder is a form of unsupervised learning (or self-supervised as some refer to it) - there is no labeled data.
VAE: Instead of just learning a function representing the data like AE (a compressed representation), variational autoencoders learn the parameters of a probability distribution representing the data. Since it learns to model the data, we can sample from the distribution and generate new input data samples. So it is a generative model like, for instance, GANs.
VAE
VAEs optimize a variational bound
https://www.quora.com/Whats-the-difference-between-a-Variational-Autoencoder-VAE-and-an-Autoencoder
VAE: a variational encoder tends to produce codings that look as if they were sampled from a Gaussian distribution (parameterized by a mean and a variance). The advantage of this approach is that after training you can simply sample from the distribution, then decode to generate new data.
ill-posed
tbd
Laplacian distribution
The Laplace distribution, also called the double exponential distribution, is the distribution of differences between two independent variates (a variate is a generalization of the concept of a random variable) with identical exponential distributions
variational bound
??
image analogy
An image analogy is a method of creating an image filter automatically from training data. In an image analogy process, the transformation between two images A and A’ is “learned”. Later, given a different image B, its “analogy” image B’ can be generated based on the learned transformation.
https://mrl.nyu.edu/publications/image-analogies/analogies-72dpi.pdf
Tensorflow padding SAME vs VALID
When stride is 1, think of the following distinction:
"SAME": output size is the same as input size. This requires the filter window to slip outside input map, hence the need to pad. SAME: Apply padding to input (if needed) so that input image gets fully covered by filter and stride you specified. For stride 1, this will ensure that output image size is same as input. In SAME (i.e. auto-pad mode), Tensorflow will try to spread padding evenly on both left and right.
"VALID": Filter window stays at valid position inside input map, so output size shrinks by filter_size - 1. No padding occurs. VALID: Don't apply any padding, i.e., assume that all dimensions are valid so that input image fully gets covered by filter and stride you specified. In VALID (i.e. no padding mode), Tensorflow will drop right and/or bottom cells if your filter and stride doesn't full cover input image.
https://stackoverflow.com/questions/37674306/what-is-the-difference-between-same-and-valid-padding-in-tf-nn-max-pool-of-t
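The resulting spatial sizes can be computed directly; a small sketch (my own helper, matching the formulas usually quoted for these two modes):

import math

def output_size(n, f, s, padding):
    # Spatial output size of a conv/pool layer for the two TensorFlow padding modes
    # (n = input size, f = filter size, s = stride).
    if padding == "SAME":
        return math.ceil(n / s)
    elif padding == "VALID":
        return math.ceil((n - f + 1) / s)

print(output_size(13, 6, 5, "SAME"))   # 3
print(output_size(13, 6, 5, "VALID"))  # 2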
disentangled representation
A disentangled representation is simply a concatenation of coordinates along each underlying factor of variation
logit
The logit function is the inverse of the sigmoidal “logistic” function or logistic transform used in mathematics. Often, sigmoid function refers to the special case of the logistic function
In Math, Logit is a function that maps probabilities ([0, 1]) to R ((-inf, inf))
logits
The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function.
binary cross entropy
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.
see https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#cross-entropy
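A minimal NumPy implementation (my own sketch) showing how the loss grows as predictions diverge from the labels:

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true in {0, 1}, y_pred in (0, 1). Clipping avoids log(0).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
print(binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.7])))  # small loss
print(binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.2, 0.3])))  # much larger loss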
weak labeling
we only know what attribute has changed between two images, although we do not know by how much
InfoGAN
learns a subset of factors of variation by reproducing parts of the input vector with the discriminator
instance normalization (IN vs BN)
The main difference between BN and IN is that the latter just computes the mean and standard deviation across the spatial domain of the input and not along the batch dimension
posterior distribution p ( θ | X)
is the probability of the parameter θ given the evidence X: p( θ | X)
It contrasts with the likelihood function, which is the probability of the evidence given the parameters: p( X | θ)
the posterior probability of a random event is the conditional probability that is assigned after the relevant evidence or background is taken into account
Similarly, the posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey
“Posterior”, in this context, means after taking into account the relevant evidence related to the particular case being examined
prior distribution p(x)
the prior of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account
likelihood function p( X | θ)
the likelihood function, which is the probability of the evidence X given the parameters θ: p( X | θ)
It contrasts with the posterior probability which is the probability of the parameter θ given the evidence X: p( θ | X)
pre-training / pretraining a NN
cf. transfer learning
manifold
a manifold is a connected region. Mathematically, it is a set of points associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space.
In mathematics, a manifold is a topological space that locally resembles Euclidean space near each point.
in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of dimensions, embedded in a higher-dimensional space. Each dimension corresponds to a local direction of variation.
manifold learning algorithm
Manifold learning algorithms assume that most of R^n consists of invalid inputs and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another.
Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting
probability concentration idea
see manifold learning.
the key assumption is that probability mass is highly concentrated along a (low-dimensional) manifold where the data lies
manifold hypothesis
In ML, the assumption is that the data lies along a low-dimensional manifold (the manifold hypothesis), which is at least approximately correct by argument of two observations:
1) that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. This suggests that the images encountered in AI applications occupy a negligible proportion of the volume of the image space.
2) we can imagine a manifold i.e. neighborhoods of interconnected examples (traversable by applying transformations) at least informally. In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, etc. It remains likely that there are multiple manifolds involved in most applications. For example, the manifold of images of human faces may not be connected to the manifold of images of cat faces.
feature learning
an attempt to recover a parsimonious set of latent random variables that describe a distribution over the observed data
ill-conditioning
ill-conditioning of the Hessian matrix H
a very general problem in most numerical optimization, convex or otherwise.
Ill-conditioning can manifest by causing SGD to get “stuck” in the sense that even very small steps increase the cost function.
receptive field (CNN)
(equivalently this is the filter size)
The receptive field is defined as the region in the input space that a particular CNN’s feature is looking at (i.e. be affected by)
For computational/feasibility reasons, each neuron is connected to only a local region in the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently this is the filter size).
The extent of the connectivity along the depth axis is always equal to the depth of the input volume.
It is important to emphasize this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.
Jensen–Shannon divergence
the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions
It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and it is always a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen-Shannon distance
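A small NumPy sketch (my own example, for discrete distributions with no zero entries) of JSD as the average KL divergence to the mixture M = (P+Q)/2:

import numpy as np

def kl(p, q):
    # Kullback–Leibler divergence KL(p || q) for discrete distributions (natural log).
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    # Jensen–Shannon divergence: symmetric and always finite.
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.4, 0.6])
q = np.array([0.6, 0.4])
print(jsd(p, q), jsd(q, p))   # equal, by symmetry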
mode collapse
most severe form of non-convergence in GAN training
-> G mostly produces samples for one mode only (e.g. one dog, all with beach)
reason why it happens in the GAN game: in practice we don’t play the minimax game (with maximization for D in the inner loop), which would guarantee convergence to the correct distribution; instead we do SGD for both players (G and D) simultaneously
-> simultaneous SGD can sometimes behave like min-max and a little bit like max-min, and a lot of the time it behaves more like max-min
min-max and max-min do different things
D in inner loop: convergence to correct distribution
G in inner loop: place all mass on most likely point
- > ways to reduce this problem:
1) use mini-batch features: if a single sample is too close to other samples in the mini-batch, it can be rejected as having collapsed to a single mode
2) Unrolled GANs: backprop through k updates of the discriminator to prevent mode collapse (this to make sure we’re actually doing min-max rather than max-min)
the maximum likelihood learning rule
tbd
Curse of dimensionality
true dimensionality often much lower than the possible dimensionality
why is it a problem to have high-dimensionality?
- ML methods are statistical by nature
=> count observations in various regions of some space
- as dimensionality grows, fewer observations per region -> many regions without observations -> the observations become sparser and sparser
=> search space grows very quickly with more dimensions but the number of examples you have stays the same -> there is less and less redundancy for your ML algorithm to sink its teeth into -> it will perform worse and worse
=> the problem is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
=> Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
e.g. true dimensionality on digits: possible variations of the pen-stroke
high-dimensional space
tbd
ML & curse of dimensionality
In machine learning problems that involve learning a “state-of-nature” from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values
cross-covariance
the cross-covariance is a function that gives the covariance of one process with the other at pairs of time points.
If X and Y are independent, then their covariance is zero. The converse, however, is not generally true. if two variables are uncorrelated, that does not in general imply that they are independent. A nonlinear relationship can exist that still would result in a covariance value of zero.
Lipschitz constant
tbd
vector norm
a vector norm is a function that assigns a strictly positive length or size to each vector in a vector space
matrix norm
a matrix norm is a vector norm in a vector space whose elements (vectors) are matrices
spectral norm
The spectral norm of a matrix A is the largest singular value of A.
The induced matrix norm of the L2-norm for vectors is the spectral norm.
The spectral norm is the maximum singular value of a matrix. Intuitively, think of it as the maximum ‘scale’ by which the matrix can ‘stretch’ a vector.
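A quick NumPy check (my own illustration) that the largest singular value and the induced 2-norm coincide:

import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# The spectral norm is the largest singular value ...
largest_singular_value = np.linalg.svd(A, compute_uv=False)[0]

# ... which is exactly what the induced 2-norm computes.
print(largest_singular_value, np.linalg.norm(A, ord=2))   # both ≈ 6.708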
residual
The residual of a data point is the difference between the actual value and the predicted value
maximum likelihood
the goal of maximum likelihood is to find the optimal way to fit a distribution to the data
linear regression
a linear model that uses one independent variable to predict a (dependent) variable
-> uses “least squares” to fit the line
multiple regression
like linear regression but uses multiple independent variables
logistic regression
predicts whether something is True or False, instead of predicting something continuous like size
logistic regression fits an “S”-shaped “logistic function” whose curve goes from 0 to 1
logistic regression is usually used for classification
note well: LR can use continuous and discrete measurements to provide probabilities (and classify new samples)
=> uses “maximum likelihood” to fit the line (vs linear regression that uses “least squares” to fit the line)
rank of a tensor
the rank refers to the number of dimensions present within the tensor
TF: The rank of a tensor is the number of indices required to uniquely select each element of the tensor. Rank is also known as “order”, “degree”, or “ndims.”
The rank of a tf.Tensor object is its number of dimensions.
residual layer
This motivated the ResNet authors to use skip connections and so-called deep residual layers to allow their network to learn deviations from the identity mapping, hence the term residual, referring here to the difference from the identity.
stationary vs non-stationary data
In contrast to the non-stationary process that has a variable variance and a mean that does not remain near, or returns to a long-run mean over time, the stationary process reverts around a constant long-term mean and has a constant variance independent of time.
Most statistical forecasting methods are based on the assumption that the time series are approximately stationary.
Unfortunately, most price series are not stationary. They are either drifting upward or downward.
https://www.quora.com/What-is-Stationary-series-and-non-Stationary-series
Features should be stationary!
https://youtu.be/UQmKh84OZls?t=2513
translation invariance (CNN)
put simply: CNNs can detect the same object in an image even if it’s moved around, resized, rotated etc.
there’s an innate prior in CNNs – the assumption that an image processing system should be translationally invariant – which is enforced through an architectural design choice (weight sharing)
Invariant to translation means that a translation of input features does not change the outputs at all. So if your pattern 0,3,2,0,0 on the input results in 0,1,0 in the output, then the pattern 0,0,3,2,0 would also lead to 0,1,0
https://stats.stackexchange.com/questions/208936/what-is-translation-invariance-in-computer-vision-and-convolutional-neural-netwo
mean average precision (mAP)
a metric to measure the accuracy of object detectors like Faster R-CNN, SSD, etc
https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173
attention
TBD
accuracy vs precision
accuracy: involves to how close you come to the actual result
(accuracy without precision: arrows cluster around correct result i.e. the apple but without certainty of a bull’s eye for any given shot)
precision: how consistently you can get that result using the same method
(precision without accuracy: arrows consistently hit center of head but not the apple)
- > while we ultimately strive for accuracy,
- > precision reflects our certainty of reliably achieving accuracy
Bayesian networks (belief networks)
- to compute uncertainties by using the concept of probability
- used to model uncertainties by using directed acyclic graphs
- used in predictive modeling, in descriptive analysis
- e.g. Monty Hall problem
downsampling
downscale an image
cross validation
cross-validation
k-CV, 5-CV
see
https://towardsdatascience.com/k-fold-cross-validation-explained-in-plain-english-659e33c0bc0
https://stats.stackexchange.com/questions/266225/step-by-step-explanation-of-k-fold-cross-validation-with-grid-search-to-optimise
https://scikit-learn.org/stable/modules/cross_validation.html
nested cross validation
nested CV
https://mlfromscratch.com/nested-cross-validation-python-code/#/
Co-adaptation
In neural networks, co-adaptation refers to when different hidden units in a neural network have highly correlated behavior.
It is better for computational efficiency and the model’s ability to learn a general representation if hidden units can detect features independently of each other.
A few different regularization techniques aim at reducing co-adaptation – dropout being a notable one.
In neural networks, co-adaptation means that some neurons are highly dependent on others. If those independent neurons receive “bad” inputs, then the dependent neurons can be affected as well, and ultimately it can significantly alter the model performance, which is what might happen with overfitting.
mixed precision training
- Mixed precision is the combined use of different numerical precisions in a computational method.
- Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network.
- significant training speedups are experienced by switching to mixed precision – up to 3x overall speedup on the most arithmetically intense model architectures
Half precision (also known as FP16) data compared to higher precision FP32 vs FP64 reduces memory usage of the neural network, allowing training and deployment of larger networks, and FP16 data transfers take less time than FP32 or FP64 transfers.
Single precision (also known as 32-bit) is a common floating point format (float in C-derived programming languages), and 64-bit, known as double precision (double).
Mixed precision consists in using the full precision
(i.e., float32) for some key specific layers (e.g., loss layer) while reducing most of the other layers to half precision (i.e., float16). The training process therefore requires less memory due to faster data transfer operations while at the same time math-intensive and memory-limited operations are sped up. These benefits are ensured at no accuracy expense compared to a full precision training.
atrous convolution (aka dilated convolution)
Atrous (“with holes”) convolution is an alternative to the downsampling layer. It increases the receptive field while maintaining the spatial dimension of the feature maps.
Sensitivity / Recall
- True Positive Rate
- Recall
= TP / (TP + FN)
Specificity
- True Negative Rate
= TN / (TN + FP)
Precision
= TP / (TP + FP)
Accuracy
= (TP + TN) / (TP + TN + FP + FN)
Accuracy can be a misleading metric for imbalanced data sets. Consider a sample with 95 negative and 5 positive values. Classifying all values as negative in this case gives 0.95 accuracy score.
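A direct translation of the four formulas into a small helper (my own sketch, with made-up confusion counts):

def classification_metrics(tp, tn, fp, fn):
    # Direct translation of the formulas above.
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, precision, accuracy

# Example confusion counts: 40 true positives, 45 true negatives, 10 false positives, 5 false negatives
print(classification_metrics(tp=40, tn=45, fp=10, fn=5))
# (0.889, 0.818, 0.8, 0.85)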
translation equivariance (CNN)
Equivariant to translation means that a translation of input features results in an equivalent translation of outputs. So if your pattern 0,3,2,0,0 on the input results in 0,1,0,0 in the output, then the pattern 0,0,3,2,0 might lead to 0,0,1,0