12 Neural Networks Flashcards

1
Q

In (batched) gradient descent the model is updated based on …
* Taking the … allows the gradient to be calculated based on the …, rather than being influenced by the specific characteristics of any one instance.

  • Mini-batch gradient descent or SGD is used to speed up the process.

The risk function 𝑹 is defined as …

Backpropagation is a method used in artificial neural networks to calculate … that is then needed to update the weights and biases of the network.

A

the average of the gradients of the loss functions calculated over all the training examples.

average loss
overall performance of the model on the entire dataset

the expectation of the loss functions over a dataset.

the gradient of the risk function w.r.t. the network's parameters
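
A sketch of how these pieces fit together (θ for the parameters, η for the learning rate and N for the number of training examples are my notation, not from the card): the risk is the expectation of the loss, approximated by the empirical average over the training set, and one gradient-descent step moves against its gradient,

  R(\theta) = \mathbb{E}_{(x,y)}[\, L(f_\theta(x), y) \,] \approx \frac{1}{N} \sum_{i=1}^{N} L(f_\theta(x_i), y_i), \qquad \theta \leftarrow \theta - \eta \, \nabla_\theta R(\theta).

Backpropagation is the chain-rule procedure that computes \nabla_\theta R layer by layer.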

2
Q

Local Minima of Risk Functions
The risk function of a neural network is in general neither convex nor concave. This means that the …

The probability of … and decreases quickly with network size.
For large networks, most local minima are …

  • If …, just like logistic regression.
  • Neural networks with … also yield a convex problem. A neural network with … is simply a linear regression model.
A

Hessian is neither positive semidefinite nor negative semidefinite.

finding a “bad” (high-value) local minimum is non-zero for small-sized neural networks

equivalent and yield similar performance on a test set

there are no hidden layers, the logistic neural network is convex

linear activation functions and square loss; a linear activation function
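
A compact way to write the two convex special cases above (w, b and the explicit losses are my notation):

  f(x) = \sigma(w^\top x + b) with the log loss: exactly logistic regression (convex)
  f(x) = w^\top x + b with the square loss (y - f(x))^2: exactly linear regression (convex)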

3
Q

Linear Separability
Two classes of points are linearly separable iff there exists a line such that …

If 𝑓(𝑥) is linear, the …

The established approach is to use the non-linear, differentiable sigmoidal “logistic” function 𝑓(𝑥), seen earlier, which can draw complex boundaries.

Output layer:
Corresponding to modeling needs, i.e. 1 node for binary classification, 𝑘 − 1 nodes for multiclass classification, etc. The activation of the output layer for each observation becomes …

A

all the points of one class fall on one side of the line, and all the points of the other class fall on the other side of the line.

NN can only draw straight decision boundaries (even if there are many layers of units).

the input to the loss function 𝐿 that triggers backpropagation.

Reminder: write out the derivative of the risk function with respect to the weights on a piece of paper -> really important to review before the exam.
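
For reference, the sigmoidal “logistic” activation referred to above is

  \sigma(z) = \frac{1}{1 + e^{-z}},

a differentiable, non-linear function that squashes the pre-activation z into (0, 1); stacking such units is what lets the network draw non-linear decision boundaries.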

4
Q

Speeding Up Gradient Descent
An epoch in (standard) gradient descent is …

Remember that the loss function is a large sum, as it contains … The gradient then needs to be computed for …

Instead, compute the loss and the gradient on a small random sample (the batch size, ~1-1000 samples) in mini-batch GD, or even on single instances using Stochastic Gradient Descent.
Result: …

A

one pass through the training set, with an adjustment via backpropagation to the network weights.

every single element of the entire data set with 𝑁 samples
each iteration of gradient descent, i.e. many times.

fast, but noisy gradient
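
Written out (B denotes a random mini-batch, my notation), the speed-up comes from replacing the full-data gradient by the noisy but unbiased estimate

  \nabla_\theta R(\theta) \approx \frac{1}{|B|} \sum_{i \in B} \nabla_\theta L(f_\theta(x_i), y_i),

which only requires |B| forward/backward passes per update (|B| = 1 for SGD) instead of N.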

5
Q

Reminder: Mini-Batch Gradient Descent
In mini-batch gradient descent, the model makes a … and produces a prediction for each one.

Then …

A smaller mini-batch size can …
Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.

Mini-batch GD is typically more efficient than standard GD and it can be … The batch size is an …

Momentum uses an exponential averaging of gradients to make …

A regularized classifier might not take all features into account. -> …

A

forward pass through all instances in a mini-batch (e.g. a few hundred instances)

the average error over all instances in the mini-batch is backpropagated through the network, which updates the model’s parameters.

provide noisier gradients, but is faster to compute.

parallelized; an important hyperparameter. Mini-batch GD is often preferred in practice.

sudden changes in direction less likely. It can help in SGD.

We can use L2 or L1 regularization by adding a regularization term to the loss function.
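
A minimal NumPy sketch of the mini-batch loop described above, with momentum and an optional L2 term folded into the gradient; the function name minibatch_sgd, the grad_fn interface and all default values are illustrative assumptions, not from the lecture:

  import numpy as np

  def minibatch_sgd(X, y, grad_fn, w, lr=0.01, batch_size=32,
                    epochs=10, momentum=0.9, l2=0.0):
      # grad_fn(X_batch, y_batch, w) is assumed to return the average
      # gradient of the loss over the mini-batch at parameters w.
      velocity = np.zeros_like(w)
      n = len(X)
      for _ in range(epochs):                          # one epoch = one pass over the data
          order = np.random.permutation(n)             # shuffle, then slice into mini-batches
          for start in range(0, n, batch_size):
              batch = order[start:start + batch_size]
              g = grad_fn(X[batch], y[batch], w)       # noisy estimate of the full gradient
              g = g + l2 * w                           # optional L2 (weight decay) term
              velocity = momentum * velocity - lr * g  # exponential averaging of gradients
              w = w + velocity                         # momentum smooths sudden direction changes
      return w

  # Illustrative use with a least-squares gradient:
  #   grad_fn = lambda Xb, yb, w: Xb.T @ (Xb @ w - yb) / len(yb)
  #   w = minibatch_sgd(X, y, grad_fn, w=np.zeros(X.shape[1]), batch_size=500)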

6
Q

Hyperparameters
* Number of …
* Number of nodes in each hidden layer
* …
* Learning rate and momentum
* …
* Batch size in mini-batch GD
* Other optimization algorithms than SGD such as RMSProp, Adam, etc.
* …
the influence of …

A

hidden layers

Activation function
− Sigmoid suffers from vanishing gradients; ReLU is now widespread

Iterations and desired error level

Regularization (aka. weight decay) in each layer (as in ridge regression) to limit

irrelevant connections with low weight on the network’s predictions
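
As a brief sketch (λ and the per-layer weight matrices W^{(l)} are my notation), weight decay adds the sum of squared weights of each layer to the loss,

  L_{reg}(W) = L(W) + \lambda \sum_l \lVert W^{(l)} \rVert_F^2,

so that, exactly as in ridge regression, a larger λ shrinks weights towards zero and limits the influence of irrelevant connections on the predictions.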

7
Q

Setting the Number of Nodes in the Hidden Layer
The number of nodes in the hidden layer affects …
If too few hidden nodes: …
Few but not too few nodes: …
Too many hidden nodes (breadth): …

What are the hidden layers doing?! … This leads to neural networks being treated as black boxes.

A

generalization and convergence.

convergence may fail.

possibly slow convergence but good generalization

Rapid convergence, but “overfitting” happens.

Feature extraction
The non-linearities in the feature extraction can make interpretation of the hidden layers very difficult.

8
Q

A feed-forward neural network with a single hidden layer and a continuous non-linear activation function can, given enough hidden units, approximate any continuous function with arbitrary precision (universal approximation theorem).*

A
  1. Neural network
  2. One hidden layer
  3. Non-linear activation function
  4. Approximates any continuous function
9
Q

We consider an artificial neural network with ReLU activations:

  1. Determine whether the following statements are true or false.
    • L(W) is not differentiable everywhere in its domain.
    • f_W(ax) = a·f_W(x) for all a ∈ ℝ.
    • σ(x) is a convex function.
    • L(W) might be non-convex for some dataset {(x_i, y_i)}_{i=1}^n.
A

True. ReLU is not differentiable at 0, so L(W) need not be differentiable everywhere.

False. The identity fails for negative a, e.g. a = −1 (ReLU is only positively homogeneous).

True. σ(x) = max(0, x) is convex.

True. For example, for a data point with x = −1 the loss can be non-convex in W.
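
Worked justifications for the homogeneity and convexity items, using a single ReLU unit f(x) = max(0, x) as a simplified stand-in of my own choosing: with a = −1 and x = 1 we get f(ax) = f(−1) = 0 while a·f(x) = −1, so f_W(ax) = a·f_W(x) fails for negative a (ReLU networks are only positively homogeneous); and σ(x) = max(0, x) is the pointwise maximum of the two linear functions 0 and x, hence convex.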

10
Q

d)* Explain why it is essential for neural networks to incorporate activation functions. Do not write more than two sentences.

A

Non-linear activation functions introduce the non-linearities needed to approximate non-linear functions. Without them, a stack of layers would collapse into a single linear (affine) map of the input.
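
A one-line sketch of why (the weight matrices W_1, W_2 and biases b_1, b_2 are my notation): without a non-linearity, two stacked layers collapse into a single affine map,

  W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2),

so no matter how many layers are stacked, the network could only represent linear (affine) functions of its input.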
