Neural Networks Flashcards

1
Q

sigmoid

What does the activation function look like?

What’s the equation?

A

S-shaped curve; output from 0-1

Equation: sigmoid(x) = 1 / (1 + e^(-x))

2
Q

hyperbolic tangent

What does the activation function look like?

What’s the equation?

A

S-shaped curve like sigmoid, but output from -1 to 1 (not 0-1)

Equation: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
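
A minimal Python sketch of both activations (illustrative, not from the lecture):

```python
import math

def sigmoid(x):
    # Logistic sigmoid: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Hyperbolic tangent: squashes any real input into (-1, 1)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(sigmoid(0.0))  # 0.5
print(tanh(0.0))     # 0.0
```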

3
Q

What’s the logistic function?

A

Same as sigmoid

4
Q

What are some characteristics of tanh?

A
  • Helps to avoid vanishing gradients problem
5
Q

Compare sigmoid and tanh

A

Tanh converges faster

6
Q

What’s a problem with ReLU? How is it addressed?

A

Derivative at x < 0 is 0, so those units get no gradient and can "die" (stop updating)

One solution: Instead use an ELU (exponential linear unit)

7
Q

Describe ReLU (2) and ELU (2)

A

  • ReLU: outputs max(0, x); its derivative is 0 for x < 0 and 1 for x > 0.
  • ELU: outputs x for x > 0 and a(e^x - 1) for x <= 0 (for some constant a > 0); it keeps a nonzero gradient for negative inputs.
  • In general, ELU outperforms ReLU.

8
Q

What’s one thing you could do to convert a NN classifier to a regressor?

A

“chop off” the sigmoid activation function and just keep the output linear layer, which ranges from -infinity to +infinity

9
Q

What’s an objective function you could use for regression?

A

Quadratic Loss:

the same objective as Linear Regression, i.e. mean squared error
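
In symbols (standard mean squared error, written with this deck's y* and y hat notation):

```latex
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}^{(i)} - y^{*(i)} \right)^{2}
```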

10
Q

What's an objective function you could use for classification?

A

Cross-Entropy:

  • the same objective as Logistic Regression, i.e. negative log likelihood
  • This requires probabilities, so we add an additional “softmax” layer at the end of our network
  • “any time you’re using classification, this is a good choice” - Matt
11
Q

What does a softmax do? (1)

A

Takes scores and transforms them into a probability distribution
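
A minimal numpy sketch tying the last two cards together (function names are illustrative):

```python
import numpy as np

def softmax(scores):
    # Shift by the max score for numerical stability, then normalize
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

def cross_entropy(probs, true_label):
    # Negative log likelihood of the true class
    return -np.log(probs[true_label])

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)         # a probability distribution summing to 1
print(cross_entropy(probs, 0))  # loss when class 0 is the true label
```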

12
Q

For which functions does there exist a one-hidden layer neural network that achieves zero error?

A

Any function

13
Q

What is the Universal Approximation Theorem? (2)

A
  • a neural network with 1 hidden layer can approximate any continuous function for inputs within a specific range. (might require a ridiculous number of hidden units, though)
  • If the function jumps around or has large gaps, we won’t be able to approximate it.
14
Q

What do we know about the objective function for a NN? (1)

A

It’s nonconvex, so you might end up converging to a local minimum (or saddle point) rather than the global minimum

15
Q

Which way are you stepping in SGD?

A

Opposite the gradient (since we are minimizing the objective)

16
Q

What’s the relationship between Backpropagation and reverse mode automatic differentiation?

A

Backpropagation is a special case of a more general algorithm called reverse mode automatic differentiation

17
Q

What’s a benefit of reverse mode automatic differentiation?

A

Can compute the gradient of any differentiable function efficiently

18
Q
When can we compute the gradients for an arbitrary neural network?
A

(question from lecture)

19
Q

When can we make the gradient computation for an arbitrary NN efficient?

A

(lecture question)

20
Q

What are the ways of computing gradients? (4)

A
  • Finite Difference Method
  • Symbolic Differentiation
  • Automatic Differentiation - Reverse Mode
  • Automatic Differentiation - Forward Mode
21
Q

Describe automatic differentiation - reverse mode (a pro, con, and requirement)

A
  • Note: Called Backpropagation when applied to Neural Nets
  • Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
  • Con: Slow for high dimensional outputs (e.g. vector-valued functions)
  • Required: Algorithm for computing f(x)
22
Q

Describe automatic differentiation - forward mode (a pro, con, and requirement)

A
  • Note: Easy to implement. Uses dual numbers.
  • Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
  • Con: Slow for high dimensional inputs (e.g. vector-valued x)
  • Required: Algorithm for computing f(x)
23
Q

Describe the finite difference method (a pro, con, and requirement)

When is it appropriate to use?

A
  • Pro: Great for testing implementations of backpropagation
  • Con: Slow for high dimensional inputs / outputs
  • Con: In practice, suffers from issues of floating point precision
  • Required: Ability to call the function f(x) on any input x
  • Only appropriate to use on small examples with an appropriately chosen epsilon (see the sketch below)
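
A hedged sketch of a centered finite-difference gradient check (numpy; the epsilon value and names are illustrative):

```python
import numpy as np

def finite_difference_grad(f, x, eps=1e-5):
    # Approximate df/dx_j with the centered difference
    # (f(x + eps*e_j) - f(x - eps*e_j)) / (2 * eps), one dimension at a time
    grad = np.zeros_like(x)
    for j in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[j] += eps
        x_minus[j] -= eps
        grad[j] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

f = lambda x: (x ** 2).sum()          # toy function with known gradient 2x
x = np.array([1.0, -2.0, 3.0])
print(finite_difference_grad(f, x))   # approximately [2, -4, 6]
```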
24
Q

Describe symbolic differentiation (2 notes, a pro, con, and requirement)

A
  • Note: The method you learned in high school
  • Note: Used by Mathematica / Wolfram Alpha / Maple
  • Pro: Yields easily interpretable derivatives
  • Con: Leads to exponential computation time if not carefully implemented
  • Required: Mathematical expression that defines f(x)
25
Q

Key thing about backprop

A

we’re storing intermediate quantities from the forward pass so they can be reused in the backward pass

26
Q

Why is backpropagation efficient?

A
  • Reuse in the forward computation
  • Reuse in the backward computation
27
Q

What are the gradients that we store from the forward pass in backprop?

A
  • The gradients of the objective function with respect to:
    • each parameter
    • The bias
28
Q

What’s one important thing to remember about back propagation?

A

All gradients are computed before updating any parameter values

29
Q

Say the main steps of the pseudocode for SGD with backpropagation for a NN

A

1. Initialize the parameters
2. Sample a training example
3. Forward pass: compute the prediction and the loss
4. Backward pass (backpropagation): compute the gradient of the loss with respect to every parameter
5. Update each parameter by stepping opposite its gradient, scaled by the learning rate
6. Repeat until convergence (a sketch follows below)
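
A minimal numpy sketch of this loop for a one-hidden-layer network with sigmoid activation and squared error; all shapes, data, and names are illustrative assumptions, not the lecture's exact pseudocode:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
alpha = rng.normal(scale=0.1, size=(4, 3))   # input -> hidden weights
beta = rng.normal(scale=0.1, size=(3, 1))    # hidden -> output weights
lr = 0.1

for step in range(100):
    x = rng.normal(size=4)        # a (fake) sampled training example
    y_star = rng.normal(size=1)   # its (fake) true label

    # Forward pass: compute prediction and loss
    z = sigmoid(alpha.T @ x)                 # hidden layer values
    y_hat = beta.T @ z                       # linear output
    loss = 0.5 * ((y_hat - y_star) ** 2).sum()

    # Backward pass: all gradients computed before any update
    g_y = y_hat - y_star                     # dL/dy_hat
    g_beta = np.outer(z, g_y)                # dL/dbeta
    g_z = beta @ g_y                         # dL/dz
    g_a = g_z * z * (1 - z)                  # dL/da, using sigmoid' = s(1-s)
    g_alpha = np.outer(x, g_a)               # dL/dalpha

    # SGD update: step opposite the gradient
    alpha -= lr * g_alpha
    beta -= lr * g_beta
```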
30
Q

What is y*?

A

The true label

31
Q

What is y^?

A

The predicted label

32
Q
  • What is α(1)?
  • What is its shape?
A
  • The parameter (weight) matrix of the first layer
  • Two dimensional
33
Q

When doing matrix multiplication, what’s a useful thing to remember?

A

Entry (i, j) of the resulting matrix is the dot product of row i of the 1st matrix with column j of the 2nd matrix

34
Q

What’s a fully connected layer?

A

layers where all the inputs from one layer are connected to every activation unit of the next layer

35
Q

What’s the difference between SGD and backprop?

A
  • Back propagation is computing the gradients
  • SGD is updating the parameters
36
Q

What’s the derivative of sigmoid, s?

A

s(1-s)
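
Worked out (with s = sigmoid(x)):

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
\qquad
\sigma'(x) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
           = \sigma(x)\bigl(1 - \sigma(x)\bigr) = s(1 - s)
```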

37
Q

Define topological order

A

For a directed graph: a linear ordering of its vertices such that for every directed edge (u, v) from vertex u to vertex v, u comes before v in the ordering.

38
Q

Say the general pseudocode for backpropagation

A

1. Forward pass: visit the nodes of the computation graph in topological order, computing each intermediate value and storing it
2. Backward pass: visit the nodes in reverse topological order, applying the chain rule to the stored values to compute the gradient of the objective with respect to each node
39
Q

What do we know about the computation graph for a NN?

A

Typically a directed acyclic graph

40
Q

What are some uses of a convolution matrix?

A

used in image processing for tasks such as edge detection, blurring, sharpening, etc.

41
Q

What is the identity convolution?

A

Just has a 1 in the middle and zeros everywhere else in the kernel, so it doesn’t change the image at all (except making it smaller if you don’t use any padding)

42
Q

What’s a blurring convolution?

A

Has higher values in the middle of the kernel.

Produces a blurred version of the image

43
Q

What’s the basic idea of convolution?

A

Slide a small matrix of weights (the kernel) across the image; at each position, take the inner product of the kernel with the image patch beneath it to produce one output value. A sketch follows below.
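
A minimal numpy sketch of this sliding inner product with stride 1 and no padding (CNN-style, i.e. without kernel flipping; names are illustrative):

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over every valid position (stride 1, no padding)
    H, W = image.shape
    K = kernel.shape[0]                      # assume a square K x K kernel
    out = np.zeros((H - K + 1, W - K + 1))   # count of valid positions
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + K, j:j + K]
            out[i, j] = np.sum(patch * kernel)   # inner product
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
identity_kernel = np.array([[0, 0, 0],
                            [0, 1, 0],
                            [0, 0, 0]], dtype=float)
print(convolve2d(image, identity_kernel))  # the central 2x2 of the image
```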
44
Q

What is a stride?

A

The number of pixels by which you slide the kernel at each step

45
Q

What’s the difference between CNNs and the basic feed forward NNs we’ve been talking about?

A

CNNs have convolution layers and max-pooling layers

46
Q

What’s the key idea of CNNs?

A

Treat the convolution matrix values as parameters and learn them

47
Q

What is a convolution matrix?

A

The matrix of weights that gets multiplied with each image patch in the inner product

48
Q

What is downsampling?

A

Weights of convolution are fixed to a uniform distribution (i.e. they’re all the same)

49
Q

What is max pooling?

A
  • A form of downsampling.
  • Instead of weighting the values in the window by a convolution matrix, take the max value in the window
50
Q

What do CNNs use to train?

A

Back propagation

51
Q

Do we minimize or maximize the objective function? What are the implications?

A
  • The convention in this class is minimize the objective function.
  • So if you need to maximize an objective function, you need to put a minus sign in front of the objective function and minimize it
52
Q

If you have a subdifferentiable function e.g. relu what do you do to perform back prop?

A

Take any slope from the set of valid tangent-line slopes (the subdifferential). For ReLU at x = 0, any value in [0, 1] works; implementations usually just pick 0.

53
Q

If you want a fully connected layer following a 3D tensor, what do you do?

A

Stretch the 3D tensor out into a long vector, then matrix-multiply it by a weight matrix; the result is a linear layer (sketched below)
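
A minimal numpy sketch of that flattening step (all shapes are illustrative):

```python
import numpy as np

tensor = np.random.rand(4, 4, 3)     # e.g. a 4x4 feature map with 3 channels
flat = tensor.reshape(-1)            # stretch into a vector of length 48

weights = np.random.rand(48, 10)     # weight matrix: 48 inputs -> 10 units
layer_out = flat @ weights           # fully connected (linear) layer output
print(layer_out.shape)               # (10,)
```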

54
Q

What is an n-Gram language model?

A

A model that approximates the probability of each word given only the previous n-1 words, i.e. P(w_t | w_(t-n+1), ..., w_(t-1))
55
Q

What’s a key idea about RNNs?

A

Hidden layers are nonlinear functions of the word embeddings of every word that came before it and the word at the current time step.

56
Q

What is the chain rule of probability?

A

P(x_1, ..., x_T) = P(x_1) * P(x_2 | x_1) * P(x_3 | x_1, x_2) * ... * P(x_T | x_1, ..., x_(T-1))
57
Q

how does an RNN work?

A

Builds up a fixed length vector representation of all the previous words

58
Q

What’s the key idea of a sequence to sequence model?

A

An encoder network reads the entire input sequence into a vector representation, and a decoder network generates the output sequence from that representation
60
Q

What’s a fundamental assumption of ML? What are its implications?

A
  • Assumption: Training data and test data come from the same distribution
  • This allows us to make statements about theoretical guarantees and bounds on performance for hypotheses we learn
61
Q

What do we know about true error?

A

It’s always unknown

62
Q

What is c*?

A

True function that we’re trying to learn.

It labeled the training data

63
Q

What is R^?

A

R^ is the empirical risk: a hypothesis's training error.

The empirical risk minimizer is the hypothesis with the lowest training error

64
Q

Does the function with the lowest expected error equal c*, the function we’re trying to model?

A

Not necessarily: the decision boundary of c* could be shaped in ways that no hypothesis in our class can reproduce, so even the hypothesis with the lowest expected error may not achieve zero error.

65
Q

What’s the key idea of PAC learning?

A

Given enough training samples, the hypothesis we learn is Probably (with probability at least 1 - delta) Approximately Correct (its true error is within epsilon)
66
Q

What does PAC do?

What does it stand for?

A

  • Stands for Probably Approximately Correct
  • Relates the number of training samples to the confidence (delta) and the error tolerance (epsilon) we can guarantee for a learned hypothesis
67
Q

What about PAC requires us to write the PAC criterion as a probabilistic statement?

A

The hypothesis is learned from a random sample of our data, so any guarantee can only hold with high probability over the draw of that sample.

68
Q

What is sample complexity? What does it depend on?

A
  • Minimum number of training samples we need in order to ensure that the PAC criterion is met
  • Depends on epsilon and delta
69
Q

Define consistent

A

A hypothesis is consistent with the training data if it achieves zero training error, i.e. R hat of h = 0

70
Q

What does realizable mean?

A

c* is in our hypothesis space H

71
Q

What does agnostic mean?

A

c* may or may not be in our hypothesis space H

72
Q

What does it mean for some h to be consistent with a particular training sample?

A

In the case of classification, h correctly classifies that sample

73
Q

Describe forward propagation using one sentence

A

Forward Propagation is the process of calculating the value of your loss function, given data, weights, and activation functions

74
Q

Why do we include a bias term in the input and in the hidden-layer?

A

Analogous to y intercept in 2d plots. It allows us to offset the fitted function from the origin.

Similar to how an intercept term in linear regression allows it to better fit data, the bias term helps the neural network better fit its data as well.

75
Q

Why do we need to use nonlinear activation functions in our neural net?

A

The composition of two linear functions is itself a linear function. We want to learn more interesting patterns than what can be expressed in linear functions, and the multiplication of the inputs by the weights is a linear function. Thus, in order to make a neural network nonlinear, we need to use nonlinear activation functions.

A neural network with only linear activation functions would be no different than a linear regression. (Try forward propagating with only linear functions on the given example)

76
Q

Which of the gradients calculated in Back propagation directly update the weights? Do not include intermediate value(s) used to calculate these gradient(s).

A

The gradients with respect to α and β are used in updating. The rest are intermediate values used to calculate these two gradients

77
Q

What are two advantages of CNNs?

A
  1. Allows us to train networks with much less data.
    • Sliding filters over the input via convolution lets us use fewer parameters while still processing the entire input through multiple layers.
  2. Since a square kernel operates on multiple rows at each time step, the 2-dimensional nature of the image is taken into account. By the time the image is flattened into a vector, convolution has already been applied, so the information in the 2D structure is (at least partially) preserved.
78
Q

What is translation invariance? What does it apply to?

A
  • Translation invariance means a feature is detected the same way regardless of where in the image it occurs, so we can use the same filter everywhere.
  • This is important when we want to detect a feature regardless of where it occurs in the input data.
  • Applies to convolutional layers
79
Q

What do kernels usually look like?

A

Often square with odd side lengths

80
Q

What is a kernel? What’s a filter?

A
  • A kernel refers to a 2-tensor, or matrix, of weights.
  • A filter refers to the set of kernels being used.
  • If only one kernel is used, then the terms filter and kernel are interchangeable.
81
Q

What is stride?

A
  • If you have a stride of s, you skip s-1 positions at each time step
82
Q
  • What is padding?
  • Why is it used?
A
  • Adding fake values (according to one of a variety of rules) around the borders of the image; usually applied equally to all sides of the image
  • Helps ensure each part of the kernel is applied to each part of the image
83
Q

What affects the output size of a given layer? (4)

A

Input size, filter size, stride, and padding
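
These combine into the standard output-size formula; assuming a square input of side H, kernel of side K, padding P, and stride S:

```latex
\text{output side} = \left\lfloor \frac{H - K + 2P}{S} \right\rfloor + 1
```

For example, H = 28, K = 5, P = 0, S = 1 gives (28 - 5 + 0)/1 + 1 = 24, consistent with the H - K + 1 count when there is no padding and the stride is 1.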

84
Q

What is the tradeoff of different stride values?

A

Smaller stride -> more information, but requires more computation and produces more output data

85
Q

What do we know about stride values? (1)

A

Usually have the same stride value in all dimensions

86
Q

How many positions can the kernel fit when S=1 and p=0

  • horizontally?
  • Vertically?
A
  • Horizontally: W - K + 1 (where W is the image width)
  • Vertically: H - K + 1 (where H is the image height)
87
Q

What’s the scope of kernels and filters in this class

A

We’re only dealing with situations in which our filter consists of one kernel applied to one channel at a time.

88
Q

In the case of our NN, what is

  • α
  • a
  • z
  • β
  • y hat
  • x sub zero
  • z sub zero
A
  • α - matrix of weights from inputs to the hidden layer
  • a - input data times the weights
  • z - output of the activation function applied on a
  • β - matrix of weights from the hidden layer to the output layer
  • y hat - output layer
  • x sub zero - bias at the input
  • z sub zero - bias at the hidden layer
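
Putting these symbols together, one common way to write the forward pass (a sketch: b names the pre-softmax scores, which the deck doesn't name, and the softmax output is the classification case from earlier cards):

```latex
a = \alpha \hat{x} \qquad
z = \sigma(a) \qquad
b = \beta \hat{z} \qquad
\hat{y} = \mathrm{softmax}(b)
```

where x hat and z hat are x and z with the bias entries x sub zero and z sub zero included.
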
89
Q

What’s the correspondence between alpha and z?

A

Each row in the alpha matrix corresponds to one unit in the hidden layer z.

90
Q

What is x hat?

A

It’s our x matrix with a bias term added in

91
Q

What’s the derivative of a matrix multiplication with respect to one of the input matrices?

A

The transpose of the other input matrix (the one you're not differentiating with respect to), multiplied on the appropriate side of the upstream gradient
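
Concretely, for a scalar loss L and a product C = AB, the standard identities are:

```latex
\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C}\, B^{\top}
\qquad
\frac{\partial L}{\partial B} = A^{\top}\, \frac{\partial L}{\partial C}
```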

92
Q

How do we know that our gradient matrix is the right shape during back propagation?

A

Shape of the gradient matrix = shape of the matrix that you’re taking the gradient with respect to (when the quantity being differentiated is a scalar)

93
Q

What’s the intuition behind the dimensions of a weight matrix? One rule for row and one for columns

A
  • # of rows = # of neurons in the previous layer
  • # of columns = # of neurons in the next layer
94
Q

How are neural networks able to approximate nonlinear functions?

A

To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation

95
Q

In simple terms, what do activation functions do?

A

Compute the hidden layer values

96
Q

What’s one way to describe the limitation of linear functions?

A

Linear models can’t understand the interaction between any two input variables.

97
Q

Is a Relu linear?

A

No, it’s nonlinear. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods

98
Q

Why is ReLU popular?

A
  • Rectified linear units are nearly linear, so they preserve many of the properties that make linear models easy to optimize with gradient-based methods.
  • They also preserve many of the properties that make linear models generalize well
99
Q

What do we know about optimization for a neural network? (1)

A

the nonlinearity of a neural network causes most interesting loss functions to become non-convex

100
Q
  • For feedforward NNs, what values should weights be initialized to?
  • What about biases?
A
  • For feedforward neural networks, it is important to initialize all weights to small random values.
  • The biases may be initialized to zero or to small positive values
101
Q

How do we know the number of features by looking at a neural network diagram?

A

It’s the number of input nodes

102
Q

Describe how adagrad works (2)

A
  • Adagrad implicitly changes the step size based on the shape of the function inferred from the gradients.
  • Each parameter has its own learning rate that improves performance on problems with sparse gradients.
103
Q

What does adagrad do? (1)

A

The per-parameter learning rate decreases over time (it is divided by the accumulated squared gradients)

104
Q

What’s the reason for using adagrad?

A

We want to use a large step size (aka learning rate) where possible, but smaller LR where we are in danger of overshooting the optima.
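
A hedged numpy sketch of the AdaGrad update (the eps term and all names are illustrative):

```python
import numpy as np

def adagrad_update(w, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate the squared gradients seen so far for each parameter
    accum += grad ** 2
    # Per-parameter step: stays large where gradients have been sparse/small,
    # shrinks where accumulated gradients are large (danger of overshooting)
    w -= lr * grad / (np.sqrt(accum) + eps)
    return w, accum

w = np.array([1.0, -2.0])
accum = np.zeros_like(w)
for _ in range(5):
    grad = 2 * w                       # gradient of the toy objective sum(w^2)
    w, accum = adagrad_update(w, grad, accum)
print(w)
```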

105
Q

How many neurons are there in the input layer?

A

The number of input variables

106
Q

How many neurons are there in the output layer?

A

The number of outputs associated with each input

107
Q

How do you calculate the number of neurons in a hidden layer? (5)

A
  1. Based on the data, draw an expected decision boundary to separate the classes.
  2. Express the decision boundary as a set of lines. Note that the combination of such lines must yield to the decision boundary.
  3. The number of selected lines represents the number of hidden neurons in the first hidden layer.
  4. To connect the lines created by the previous layer, a new hidden layer is added. Note that a new hidden layer is added each time you need to create connections among the lines in the previous hidden layer.
  5. The number of hidden neurons in each new hidden layer equals the number of connections to be made.
108
Q

How do we know if hidden layers are required in a neural network?

A

hidden layers are required if and only if the data must be separated non-linearly.

109
Q

What is a component that ANNs are built from?

A

A single layer perceptron

110
Q
  • What’s the perceptron equation?
  • What do we know about single-layer perceptrons? (1)
A
  • y = w_1*x_1 + w_2*x_2 + ⋯ + w_i*x_i + b
  • It’s a linear classifier
111
Q
  • How do we represent neural network decision boundaries?
  • What’s the intuition behind this?
A
  • By using multiple lines.
  • An ANN is a multilayer perceptron. Each perceptron adds a line; each perceptron/line corresponds to one neuron in the hidden layer.
112
Q

How do you figure out the decision boundary for a neural network?

A
  • Draw the ideal decision boundary curve
  • Each change in direction of the ideal curve needs a line intersection to represent it; add lines accordingly
  • The output layer does the merging of the lines
113
Q

For regularization, what is:

  • Alpha
  • Omega(theta)
  • lambda
A
  • Alpha: a hyperparameter from 0 to infinity that weights the relative contribution of Omega(theta)
  • Omega(theta): the parameter norm penalty term
  • Lambda: (fill this in)
114
Q

What’s interesting about how regularization is applied? Why does this happen?

A

for neural networks, we typically choose to use a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. The biases typically require less data to fit accurately than the weights. Each weight specifies how two variables interact. Fitting the weight well requires observing both variables in a variety of conditions. Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized. Also, regularizing the bias parameters can introduce a significant amount of underfitting.

115
Q

How does weight decay relate between layers of the neural networks?

A

Because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers

116
Q

What kind of penalty do we use for the regularization of neural networks?

A

it is sometimes desirable to use a separate penalty with a different α coefficient for each layer of the network

117
Q

What is ridge regression?

A

synonymous with L2 regularization

118
Q

What is Tikhonov regularization?

A

Synonymous with L2 regularization

119
Q

What is weight decay?

A

The name for the L2 parameter norm penalty specifically, not for L2 regularization in general

120
Q

What’s the difference between L2 regularization and weight decay?

A

Weight decay is the L2 parameter norm penalty term itself; L2 regularization is the technique of adding that penalty to the objective function
121
Q

What does L2 regularization do? (2)

A
  • drives the weights closer to the origin by adding a regularization term to the objective function
  • Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact.
122
Q

What’s the penalty term Omega(theta) of L2 regularization?

A

Half the squared L2 norm of the weights: Omega(theta) = (1/2) ||ω||_2^2
123
Q

For L1 regularization, what is Omega(theta)?

A

The L1 norm of ω

124
Q

What is the L1 norm?

A

The sum of absolute values of the individual elements of the vector

125
Q
  • Compare how the regularization contribution to the gradient compares between L1 and L2 regularization.
  • What’s the implication?
A
  • For L1, the regularization contribution to the gradient doesn’t scale linearly with each ω_i; instead it is a constant factor with a sign equal to sign(ω_i).
  • One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of J(X, y; ω), as we did for L2 regularization
126
Q

What do L1 and L2 regularization have in common?

A
  • Generally, they shift the values of ω toward zero.
  • Technically, the point they shift toward doesn’t have to be zero in either case, but that’s how it’s usually implemented
127
Q

For regularization, what is ω?

A

The weight vector: the parameters penalized by the norm penalty (biases are typically excluded)
128
Q

Compare the solutions of L1 and L2 regularization. (1)

A
  • L1 results in a solution that is more sparse
129
Q

In what case does L1 regularization cause parameters to become sparse (i.e. 0)?

What about L2?

A
  • In the case of a large enough α
  • Never
130
Q

What is L2 regularization equal to? (not a synonym)

A

MAP Bayesian inference with a Gaussian prior on the weights

131
Q

At a high level, what is the goal of regularization? (1)

A
  • reduce the test error (i.e. increase its ability to generalize), possibly at the expense of increased training error
132
Q

In SGD, does the direction that you’re stepping relative to the gradient change based on whether you’re minimizing or maximizing the objective function?

A

Yes: step opposite the gradient to minimize, along it to maximize. Since our convention is to always minimize (negating the objective if necessary), we always step opposite the gradient
133
Q

In forward propagation, what happens at each layer?

A

Given the layer's input, we multiply it by that layer's weights (e.g. α), apply the corresponding activation function, and pass the result on to the next layer
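
A minimal numpy sketch of one such layer step (shapes and names are illustrative):

```python
import numpy as np

def layer_forward(x, weights, activation):
    # One layer of forward propagation: weighted sum, then activation
    return activation(weights.T @ x)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, 2.0])                 # input data
alpha = np.array([[0.1, -0.2, 0.3],
                  [0.4, 0.5, -0.6]])     # 2 inputs -> 3 hidden units
z = layer_forward(x, alpha, sigmoid)     # hidden layer values, passed onward
print(z)
```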