Neural Networks Flashcards
sigmoid
What does the activation function look like?
What’s the equation?
- Equation: σ(x) = 1 / (1 + e^(-x))
- S-shaped curve; output ranges from 0 to 1

hyperbolic tangent
What does the activation function look like?
What’s the equation?
- Equation: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
- S-shaped curve; output ranges from -1 to 1

What’s the logistic function?
Same as sigmoid
What are some characteristics of tanh?
- Zero-centered output; its steeper gradient helps mitigate (but does not eliminate) the vanishing gradients problem
Compare sigmoid and tanh
Tanh typically converges faster, since its output is zero-centered
What’s a problem with ReLU? How is it addressed?
Derivative for x < 0 is 0, so units stuck in the negative region stop updating (“dying ReLU”)
One solution: Instead use an ELU (exponential linear unit)
Describe ReLU (2) and ELU (2)
- ReLU: f(x) = max(0, x); cheap to compute, but has zero gradient for x < 0
- ELU: f(x) = x for x > 0, α(e^x − 1) otherwise; keeps a nonzero gradient for negative inputs
- In general, ELU outperforms ReLU
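A minimal NumPy sketch of the four activations above (the function names and the ELU default α = 1.0 are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    # Squashes input to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes input to (-1, 1); zero-centered
    return np.tanh(x)

def relu(x):
    # max(0, x); derivative is 0 for x < 0 (the "dying ReLU" issue)
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # x for x > 0; smooth negative saturation alpha*(e^x - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```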

What’s one thing you could do to convert a NN classifier to a regressor?
“Chop off” the sigmoid activation function and just keep the output linear layer, which ranges from -infinity to +infinity
What’s an objective function you could use for regression?
Quadratic Loss:
the same objective as Linear Regression, i.e. mean squared error
What’s an objective function you could use for classification?
Cross-Entropy:
- the same objective as Logistic Regression, i.e. negative log likelihood
- This requires probabilities, so we add an additional “softmax” layer at the end of our network
- “any time you’re using classification, this is a good choice” - Matt
What does a softmax do? (1)
Takes scores and transforms them into a probability distribution
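A small sketch of softmax plus cross-entropy; subtracting the max before exponentiating is a standard numerical-stability trick, not something from these notes:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max doesn't change the result but avoids overflow
    exp = np.exp(scores - np.max(scores))
    return exp / np.sum(exp)   # a valid probability distribution

def cross_entropy(probs, true_index):
    # Negative log likelihood of the true class
    return -np.log(probs[true_index])

probs = softmax(np.array([2.0, 1.0, 0.1]))  # approx. [0.66, 0.24, 0.10]
loss = cross_entropy(probs, true_index=0)
```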
For which functions does there exist a one-hidden layer neural network that achieves zero error?
Any function
What is the Universal Approximation Theorem? (2)
- a neural network with 1 hidden layer can approximate any continuous function for inputs within a specific range (though it might require a ridiculous number of hidden units)
- If the function jumps around or has large gaps, we won’t be able to approximate it.
What do we know about the objective function for a NN? (1)
It’s nonconvex, so you might end up converging on a local minimum (or saddle point) rather than the global minimum
Which way are you stepping in SGD?
Opposite the gradient (the negative gradient direction), since we’re minimizing the objective
What’s the relationship between Backpropagation and reverse mode automatic differentiation?
Backpropagation is a special case of a more general algorithm called reverse mode automatic differentiation
What’s a benefit of reverse mode automatic differentiation?
Can compute the gradient of any differentiable function efficiently
When can we compute the gradients for an arbitrary neural network? (question from lecture)
Whenever every component of the network is a differentiable function, so the chain rule applies
When can we make the gradient computation for an arbitrary NN efficient? (lecture question)
When we reuse intermediate quantities across the forward and backward computations, as reverse mode automatic differentiation (backpropagation) does
What are the ways of computing gradients? (4)
- Finite Difference Method
- Symbolic Differentiation
- Automatic Differentiation - Reverse Mode
- Automatic Differentiation - Forward Mode
Describe automatic differentiation - reverse mode (a pro, con, and requirement)
- Note: Called Backpropagation when applied to Neural Nets
- Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
- Con: Slow for high dimensional outputs (e.g. vector-valued functions)
- Required: Algorithm for computing f(x)
Describe automatic differentiation - forward mode (a pro, con, and requirement)
- Note: Easy to implement. Uses dual numbers.
- Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
- Con: Slow for high dimensional inputs (e.g. vector-valued x)
- Required: Algorithm for computing f(x)
Describe the finite difference method (a pro, con, and requirement)
When is it appropriate to use?
- Pro: Great for testing implementations of backpropagation
- Con: Slow for high dimensional inputs / outputs
- Con: In practice, suffers from issues of floating point precision
- Required: Ability to call the function f(x) on any input x
- Only appropriate to use on small examples with an appropriately chosen epsilon
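A sketch of gradient checking with centered finite differences, the usual way this method is used to test a backpropagation implementation (the epsilon value is illustrative):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    # Centered difference: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)
    grad = np.zeros_like(x)
    for i in range(x.size):
        x[i] += eps
        f_plus = f(x)
        x[i] -= 2 * eps
        f_minus = f(x)
        x[i] += eps  # restore the original value
        grad[i] = (f_plus - f_minus) / (2 * eps)
    return grad

# Sanity check against a known gradient: d/dx sum(x^2) = 2x
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(numerical_grad(lambda v: np.sum(v ** 2), x), 2 * x)
```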
Describe symbolic differentiation (2 notes, a pro, con, and requirement)
- Note: The method you learned in high school
- Note: Used by Mathematica / Wolfram Alpha / Maple
- Pro: Yields easily interpretable derivatives
- Con: Leads to exponential computation time if not carefully implemented
- Required: Mathematical expression that defines f(x)
Key thing about backprop
we’re storing intermediate quantities
Why is backpropagation efficient?
- Reuse in the forward computation
- Reuse in the backward computation
Which gradients do we compute in backprop, using the quantities stored from the forward pass?
- The gradients of the objective function with respect to:
- each parameter
- the bias terms
What’s one important thing to remember about back propagation?
All gradients are computed before updating any parameter values
Say the main steps of the pseudocode for SGD with backpropagation for a NN
- Initialize the parameters to small random values
- Sample a training example (x, y*)
- Forward propagate to compute the intermediate quantities and the loss
- Backpropagate to compute the gradient of the objective with respect to every parameter
- Update each parameter by stepping opposite its gradient, scaled by the learning rate
- Repeat until converged
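A sketch of one such SGD step for a one-hidden-layer network with sigmoid hidden units and quadratic loss, using the α/β notation from the later cards (biases omitted for brevity; this is an illustrative implementation, not the lecture’s exact pseudocode):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(x, y_star, alpha, beta, lr=0.1):
    # --- Forward pass: store intermediate quantities ---
    a = alpha @ x                              # hidden pre-activations
    z = sigmoid(a)                             # hidden layer values
    y_hat = beta @ z                           # linear output layer
    loss = 0.5 * np.sum((y_hat - y_star) ** 2)

    # --- Backward pass: compute ALL gradients before updating ---
    g_y = y_hat - y_star                       # dJ/d(y_hat)
    g_beta = np.outer(g_y, z)                  # dJ/d(beta)
    g_z = beta.T @ g_y                         # dJ/dz
    g_a = g_z * z * (1 - z)                    # dJ/da, since sigmoid' = s(1-s)
    g_alpha = np.outer(g_a, x)                 # dJ/d(alpha)

    # --- SGD update: step opposite each gradient ---
    alpha -= lr * g_alpha
    beta -= lr * g_beta
    return loss
```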
What is y*?
The true label
What is y^?
The predicted label
- What is α(1)?
- What is its shape?
- The parameter matrix of the first layer
- Two dimensional
When doing matrix multiplication, what’s a useful thing to remember?
The entry at row i, column j of the resulting matrix is the dot product of row i of the 1st matrix with column j of the 2nd matrix
What’s a fully connected layer?
layers where all the inputs from one layer are connected to every activation unit of the next layer
What’s the difference between SGD and backprop?
- Back propagation is computing the gradients
- SGD is updating the parameters
What’s the derivative of sigmoid, s?
s(1-s)
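A one-line derivation of this identity (standard calculus, included for reference):

```latex
s(x) = \frac{1}{1 + e^{-x}}, \quad
s'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}
      = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
      = s(x)\bigl(1 - s(x)\bigr)
```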
Define topological order
of a directed graph: a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering.
Say the general pseudocode for backpropagation
- Forward pass: visit the nodes of the computation graph in topological order, computing and storing each intermediate value
- Set the gradient of the output node to 1
- Backward pass: visit the nodes in reverse topological order, using the chain rule to accumulate each node’s gradient into its parents’ gradients
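A schematic Python sketch of that pseudocode; the Node interface (parents, compute, local_grads) is hypothetical, just to show the control flow:

```python
def forward(nodes):
    # nodes are given in topological order, so every parent is ready
    for u in nodes:
        u.value = u.compute(*(p.value for p in u.parents))

def backward(nodes):
    for u in nodes:
        u.grad = 0.0
    nodes[-1].grad = 1.0                     # d(output)/d(output) = 1
    for u in reversed(nodes):                # reverse topological order
        # chain rule: pass this node's gradient to each parent,
        # scaled by the local partial derivative along that edge
        for parent, local in zip(u.parents, u.local_grads()):
            parent.grad += u.grad * local
```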
What do we know about the computation graph for a NN?
Typically a directed acyclic graph
What are some uses of a convolution matrix?
used in image processing for tasks such as edge detection, blurring, sharpening, etc.
What is the identity convolution?
Just has a 1 in the middle and zeros for every other part of the kernel, so it doesn’t change the image at all (except make it smaller if you don’t do any padding)
What’s blurring convolution?
The kernel averages the neighborhood, typically with higher values in the middle (e.g. a Gaussian).
You get a blurred version of the image

What’s the basic idea of convolution?
Slide a kernel across the image; at each position, take the inner product (elementwise multiply and sum) of the kernel with the image patch beneath it, producing one entry of the output
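A direct (unvectorized) NumPy sketch of this idea; note CNNs use this cross-correlation form, without flipping the kernel:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    # Each output entry is the inner product of the kernel with the
    # image patch under it (no padding, so the output shrinks).
    H, W = image.shape
    K = kernel.shape[0]                      # assume a square kernel
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+K, j*stride:j*stride+K]
            out[i, j] = np.sum(patch * kernel)
    return out

# Identity kernel from the card above: 1 in the middle, zeros elsewhere
# -> unchanged image (apart from shrinking, since there is no padding)
identity = np.zeros((3, 3)); identity[1, 1] = 1.0
```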
What is a stride?
Number of pixels by which you slide the kernel at each step
What’s the difference between CNNs and the basic feed forward NNs we’ve been talking about?
The CNNs have a convolution layer and max pooling layer
What’s the key idea of CNNs?
Treat the convolution matrix values as parameters and learn them
What is a convolution matrix?
The matrix of weights (the kernel) whose values get multiplied elementwise with each image patch when taking the inner product
What is downsampling?
The weights of the convolution are fixed and uniform (i.e. they’re all the same), so it computes a local average at reduced resolution

What is max pooling?
- A form of downsampling
- Instead of having weights in the convolution matrix, take the max value of those in the window
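A minimal sketch of max pooling in the same style as the convolution example above (the window/stride defaults are illustrative):

```python
import numpy as np

def max_pool(image, window=2, stride=2):
    # Take the max over each window instead of a weighted sum
    H, W = image.shape
    out_h = (H - window) // stride + 1
    out_w = (W - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = image[i*stride:i*stride+window,
                              j*stride:j*stride+window].max()
    return out
```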

What do CNNs use to train?
Back propagation
Do we minimize or maximize the objective function? What are the implications?
- The convention in this class is minimize the objective function.
- So if you need to maximize an objective function, you need to put a minus sign in front of the objective function and minimize it
If you have a subdifferentiable function e.g. relu what do you do to perform back prop?
Take any slope from the set of tangent-line slopes at that point (a subgradient); for ReLU at 0, any value in [0, 1] works, and 0 is a common choice
If you want a fully connected layer following a 3D tensor, what do you do?
Stretch the 3D tensor out into a long vector, then matrix-multiply it by a weight matrix; the result is a linear layer
What is an n-Gram language model?
A model that assumes each word depends only on the previous n−1 words:
p(w_t | w_1, …, w_{t−1}) ≈ p(w_t | w_{t−n+1}, …, w_{t−1})
What’s a key idea about RNNs?
Hidden layers are nonlinear functions of the word embeddings of every word that came before it and the word at the current time step.
What is the chain rule of probability?
p(w_1, w_2, …, w_T) = p(w_1) · p(w_2 | w_1) · p(w_3 | w_1, w_2) ⋯ p(w_T | w_1, …, w_{T−1})
how does an RNN work?
Builds up a fixed length vector representation of all the previous words

What’s the key idea of a sequence to sequence model?
- An encoder RNN reads the input sequence and compresses it into a vector representation
- A decoder RNN then generates the output sequence from that vector, one token at a time

What’s a fundamental assumption of ML? What are its implications?
- Assumption: Training data and test data come from the same distribution
- This allows us to make statements about theoretical guarantees and bounds on performance for hypotheses we learn
What do we know about true error?
It’s always unknown
What is c*
True function that we’re trying to learn.
It’s the function that labeled the training data
What is R^?
R^(h) is the empirical risk of hypothesis h, i.e. its training error.
The hypothesis that minimizes R^ is the empirical risk minimizer: it has the lowest training error
Does the function with the lowest expected error equal c*, the function we’re trying to model?
Not necessarily: c* may not be in our hypothesis space, so the achievable decision boundaries could be shaped in ways that make zero error impossible

What’s the key idea of PAC learning?
With enough training samples, the learned hypothesis will, with high probability (“probably”), have low true error (“approximately correct”)
What does PAC do?
What does it stand for?
- PAC stands for Probably Approximately Correct
- It bounds the number of training samples needed so that, with probability at least 1 − δ, the learned hypothesis has true error at most ε
What about PAC requires us to write the PAC criterion as a probabilistic statement?
Our hypothesis h is learned from a random sample of the data, so any guarantee can only hold with high probability over that sample

What is sample complexity? What does it depend on?
- Minimum number of training samples we need in order to ensure that the PAC criterion is met
- Depends on epsilon, delta, and the size (or complexity) of the hypothesis space H
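For a finite hypothesis space in the realizable case, the standard bound takes this form (included for reference; note the dependence on |H| as well as ε and δ):

```latex
N \;\ge\; \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)
\;\;\Longrightarrow\;\;
\Pr\bigl[R(h) \le \epsilon\bigr] \ge 1 - \delta
\;\text{ for every consistent } h \in \mathcal{H}
```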
Define consistent
A hypothesis is consistent with the training data if it achieves zero training error: R^(h) = 0
What does realizable mean?
c* is in our hypothesis space H
What does agnostic mean?
c* may or may not be in our hypothesis space H
What does it mean for some h to be consistent with a particular training sample?
In the case of classification, It correctly classifies it
Describe forward propagation using one sentence
Forward Propagation is the process of calculating the value of your loss function, given data, weights, and activation functions
Why do we include a bias term in the input and in the hidden-layer?
Analogous to y intercept in 2d plots. It allows us to offset the fitted function from the origin.
Similar to how an intercept term in linear regression allows it to better fit data, the bias term helps the neural network better fit its data as well.
Why do we need to use nonlinear activation functions in our neural net?
The composition of two linear functions is itself a linear function. We want to learn more interesting patterns than what can be expressed in linear functions, and the multiplication of the inputs by the weights is a linear function. Thus, in order to make a neural network nonlinear, we need to use nonlinear activation functions.
A neural network with only linear activation functions would be no different than a linear regression. (Try forward propagating with only linear functions on the given example)
Which of the gradients calculated in Back propagation directly update the weights? Do not include intermediate value(s) used to calculate these gradient(s).
The gradients with respect to α and β are used in updating. The rest are intermediate values used to calculate these two gradients
What are two advantages of CNNs?
- Filters that slide over the input via convolution let us use fewer parameters while still processing the entire input through multiple layers, which allows us to train networks with much less data.
- Since a square kernel operates on multiple rows at each time step, the 2-dimensional nature of the image is taken into account. When you flatten the image into a vector, since convolution has already been applied, the information in the 2d structure is preserved (at least partially)
What is translation invariance? What does it apply to?
- The property that the same filter detects a feature regardless of where in the image the feature occurs
- This is important when we want to detect a feature regardless of where it occurs in the input data
- Applies to convolutional layers
What do kernels usually look like?
Often square with odd side lengths
What is a kernel? What’s a filter?
- ____ refers to a 2-tensor, or matrix, of weights.
- ____ refers to the set of kernels being used.
- If only one is used, then the terms filter and kernel are interchangeable.
What is stride?
- If you have a stride of s, you skip s-1 positions at each time step
- What is padding?
- Why is it used?
- Adding fake values (according to one of a variety of rules) around the borders of the image, usually applied equally to all sides
- Helps ensure each part of the kernel is applied to each part of the image
What affects the output size of a given layer? (4)
Input size, filter size, stride, and padding
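These four combine into the standard output-size formula (for one spatial dimension; W = input size, K = filter size, P = padding, S = stride):

```latex
\text{output size} \;=\; \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1
```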
What is the tradeoff of different stride values?
Smaller stride -> more information, but requires more computation and produces more output data
What do we know about stride values? (1)
Usually have the same stride value in all dimensions
How many positions can the kernel fit when S=1 and p=0
- horizontally?
- Vertically?
- Horizontally: W − K + 1 (W = image width, K = kernel size)
- Vertically: H − K + 1 (H = image height)
What’s the scope of kernels and filters in this class
We’re only dealing with situations in which our filter consists of one kernel applied to one channel at a time.
In the case of our NN, what is
- α
- a
- z
- β
- y hat
- x subzero
- z subzero
- α - matrix of weights from inputs to the hidden layer
- a - input data times the weights
- z - output of the activation function applied on a
- β - matrix of weights from the hidden layer to the output layer
- y hat - output layer
- x subzero - bias at the input
- z subzero - bias at the hidden layer
What’s the correspondence between alpha and z?
Each row in the alpha matrix corresponds to one unit in the hidden layer z.
What is x hat?
It’s our x matrix with a bias term added in
What’s the derivative of a matrix multiplication with respect to one of the input matrices?
The transpose of the other input matrix (the one you’re not differentiating with respect to): if C = AB and J is a scalar, then ∂J/∂A = (∂J/∂C)·Bᵀ and ∂J/∂B = Aᵀ·(∂J/∂C)

How do we know that our gradient matrix is the right shape during back propagation?
Shape of the gradient matrix = shape of the matrix that you’re taking the gradient with respect to (when differentiating a scalar)
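A quick NumPy check of both rules above; the shapes are arbitrary examples:

```python
import numpy as np

A = np.random.randn(4, 3)
B = np.random.randn(3, 2)
G = np.random.randn(4, 2)   # dJ/dC for C = A @ B, J a scalar

dA = G @ B.T                # transpose of the *other* matrix
dB = A.T @ G
assert dA.shape == A.shape and dB.shape == B.shape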
What’s the intuition behind the dimensions of a weight matrix? One rule for row and one for columns
- # of rows = # of neurons in the previous layer
- # of columns = # of neurons in the next layer
- (this assumes the convention where the input is a row vector multiplied on the left, z = xW; with column vectors, z = Wx, the two rules are swapped)
How are neural networks able to approximate nonlinear functions?
To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation
In simple terms, what do activation functions do?
Compute the hidden layer values
What’s one way to describe the limitation of linear functions?
Linear models can’t capture the interaction between any two input variables
Is a Relu linear?
No, it’s nonlinear. However, it remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods
Why is ReLU popular?
- Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods
- They also preserve many of the properties that make linear models generalize well
What do we know about optimization for a neural network? (1)
the nonlinearity of a neural network causes most interesting loss functions to become non-convex
- For feedforward NNs, What values should weights be intialized to?
- What about biases?
- For feedforward neural networks, it is important to initialize all weights to small random values.
- The biases may be initialized to zero or to small positive values
How do we know the number of features by looking at a neural network diagram?
It’s the number of input nodes
Describe how adagrad works (2)
- Adagrad implicitly changes the step size based on the shape of the function inferred from the gradients.
- Each parameter has its own learning rate that improves performance on problems with sparse gradients.
What does adagrad do? (1)
Learning rate decreases slowly over time
What’s the reason for using adagrad?
We want to use a large step size (aka learning rate) where possible, but smaller LR where we are in danger of overshooting the optima.
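A minimal sketch of the AdaGrad update (the learning rate and epsilon values are illustrative):

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    # Accumulate the squared gradient for each parameter...
    cache += grad ** 2
    # ...so the effective step size shrinks where gradients have been
    # large, and stays comparatively large for sparse/rare parameters
    w -= lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```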
How many neurons are there in the input layer?
The number of input variables
How many neurons are there in the output layer?
The number of outputs associated with each input
How do you calculate the number of neurons in a hidden layer? (5)
- Based on the data, draw an expected decision boundary to separate the classes.
- Express the decision boundary as a set of lines. Note that the combination of such lines must yield to the decision boundary.
- The number of selected lines represents the number of hidden neurons in the first hidden layer.
- To connect the lines created by the previous layer, a new hidden layer is added. Note that a new hidden layer is added each time you need to create connections among the lines in the previous hidden layer.
- The number of hidden neurons in each new hidden layer equals the number of connections to be made.
How do we know if hidden layers are required in a neural network?
hidden layers are required if and only if the data must be separated non-linearly.
What is a component that ANNs are built from?
A single layer perceptron
- What’s the perceptron equation?
- What do we know about single layer perceptrons? (1)
- y = w_1*x_1 + w_2*x_2 + ⋯ + w_i*x_i + b
- It’s a linear classifier
- How do we represent neural network decision boundaries?
- What’s the intuition behind this?
- By using multiple lines.
- An ANN is a multilayer perceptron. Each perceptron adds a line; each perceptron/line corresponds to one neuron in the hidden layer
How do you figure out the decision boundary for a neural network?
- Draw the ideal decision boundary curve
- Each change in direction of the ideal DB curve needs to be represented by a line intersection; add lines accordingly
- The output layer does the merging of the two lines
For regularization, what is:
- Alpha
- Omega(theta)
- lambda
- Alpha: a hyperparameter in [0, ∞) that weights the relative contribution of Omega(theta)
- Omega(theta): the parameter norm penalty term
- Lambda: another common name for the regularization weight (many texts use lambda where Goodfellow uses alpha)
What’s interesting about how regularization is applied? Why does this happen?
for neural networks, we typically choose to use a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. The biases typically require less data to fit accurately than the weights. Each weight specifies how two variables interact. Fitting the weight well requires observing both variables in a variety of conditions. Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized. Also, regularizing the bias parameters can introduce a significant amount of underfitting.
How does weight decay relate between layers of the neural networks?
Because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers
What kind of penalty do we use for the regularization of neural networks?
it is sometimes desirable to use a separate penalty with a different α coefficient for each layer of the network
What is ridge regression?
synonymous with L2 regularization
What is Tikhonov regularization?
Synonymous with L2 regularization
What is weight decay?
The name for the L2 parameter norm penalty term itself, not for L2 regularization in general
What’s the difference between L2 regularization and weight decay?
Weight decay refers to the L2 penalty term Omega(theta) itself; L2 regularization refers to adding that (weighted) penalty to the objective function
What does L2 regularization do? (2)
- drives the weights closer to the origin by adding a regularization term to the objective function
- Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact.
What’s the penalty term Omega(theta) of L2 regularization?
Omega(theta) = (1/2)‖ω‖₂², i.e. half the squared L2 norm of the weights
For L1 regularization, what is Omega(theta)?
The L1 norm of ω
What is the L1 norm?
The sum of absolute values of the individual elements of the vector
- Compare how the regularization contribution to the gradient compares between L1 and L2 regularization.
- What’s the implication?
- For L1, the regularization contribution to the gradient doesn’t scale linearly with each ω_i; instead it is a constant factor with a sign equal to sign(ω_i).
- One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of J(X, y; ω) as we did for L2 regularization
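In symbols, using the penalty terms defined in the nearby cards (Ω_L2 = ½‖ω‖₂², Ω_L1 = ‖ω‖₁):

```latex
\nabla_{\omega}\,\Omega_{L2}(\omega) = \omega
\qquad\text{vs.}\qquad
\nabla_{\omega}\,\Omega_{L1}(\omega) = \operatorname{sign}(\omega)
```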
What do L1 and L2 regularization have in common?
- Generally, they shift the values of ω toward zero.
- Technically, it doesn’t have to be zero in either case, but that’s usually how it’s implemented
For regularization, what is ω?
The vector of weights being penalized (the weights of the affine transformations; biases are typically left unregularized)
Compare the solutions of L1 and L2 regularization. (1)
- L1 results in a solution that is more sparse
In what case does L1 regularization cause parameters to become sparse (i.e. 0)?
What about L2?
- In the case of a large enough α
- Never
What is L2 regularization equal to? (not a synonym)
MAP Bayesian inference with a Gaussian prior on the weights
At a high level, what is the goal of regularization? (1)
- reduce the test error (i.e. increase its ability to generalize), possibly at the expense of increased training error
In SGD, does the direction that you’re stepping relative to the gradient change based on whether you’re minimizing or maximizing the objective function?
Yes: when minimizing, step opposite the gradient (gradient descent); when maximizing, step in the direction of the gradient (gradient ascent)
In forward propagation, what happens at each layer?
Given the input data x, we multiply it by the given weights, α, then apply the corresponding activation function to it and finally pass the result to the next layer