Exam Flashcards
softmax vs sigmoid
softmax is used when the output is categorical (one of several mutually exclusive classes); sigmoid when the output is a single continuous value in (0, 1), e.g. a probability
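A minimal NumPy sketch contrasting the two (function names and the example logits are my own):

    import numpy as np

    def sigmoid(z):
        # squashes each value independently into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        # turns a vector of scores into a probability distribution over classes
        e = np.exp(z - np.max(z))   # subtract the max for numerical stability
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.1])
    print(sigmoid(logits))   # independent values in (0, 1), do not sum to 1
    print(softmax(logits))   # one probability per class, sums to 1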
Cover’s theorem
in a high-dimensional space, if the number of points is relatively small compared to the dimensionality and you paint the points randomly in two colors, the data set will almost always be linearly separable
- if the number of points n in a d-dimensional space is smaller than roughly 2d, they are almost always linearly separable: n/(d+1) < 2 => linearly separable
- if n is bigger than roughly 2d, they are almost always NOT linearly separable: n/(d+1) > 2 => not linearly separable
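A quick empirical check of this threshold, assuming scikit-learn is available; a linear SVM with a very large C is used as an approximate separability test (function and variable names are illustrative):

    import numpy as np
    from sklearn.svm import LinearSVC   # assumption: scikit-learn is installed

    def fraction_separable(n, d, trials=200):
        # how often are n randomly labeled points in d dimensions linearly separable?
        hits = 0
        for _ in range(trials):
            X = np.random.randn(n, d)
            y = np.random.choice([-1, 1], size=n)
            clf = LinearSVC(C=1e6, max_iter=20000).fit(X, y)   # large C ~ hard margin
            hits += clf.score(X, y) == 1.0                     # perfectly separated?
        return hits / trials

    d = 20
    print(fraction_separable(n=int(1.5 * (d + 1)), d=d))   # n/(d+1) < 2: close to 1
    print(fraction_separable(n=int(3.0 * (d + 1)), d=d))   # n/(d+1) > 2: close to 0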
recommended setup for regression problems
linear activation function with SSE loss function
Gradient descent with momentum
combine current weight update with previous update
p_new = p_old - a * gradient + b * last_direction
or equivalently:
x_{k+1} = x_k - α·g(x_k) + β·d(x_k), where g is the gradient and d(x_k) the previous update direction
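A NumPy sketch of the two update rules side by side (variable names are my own); starting from direction = 0, repeated calls accumulate momentum in directions where the gradient keeps pointing the same way:

    import numpy as np

    def gd_step(p, grad, lr):
        # plain gradient descent: p_new = p_old - a * gradient
        return p - lr * grad

    def momentum_step(p, grad, direction, lr, beta):
        # combine the current gradient step with the previous update direction d(x_k)
        direction = beta * direction - lr * grad
        return p + direction, direction   # p_new = p_old - a*gradient + b*last_direction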
Gradient descent
p_new = p_old - a*gradient
Biggest advantage of LSTM networks over Vanilla Recurrent Networks
Ability to learn which remote and recent information is relevant for the given task, and to use this information to generate the output
batch normalization
The process of finding an optimal transformation of each batch, layer after layer, that is optimized during the training process.
When training a network with batches of data the network “gets confused” by the fact that statistical properties of batches vary from batch to batch
Idea 1: normalize each batch => subtract the mean and divide by the std deviation
Idea 2: assume that it is beneficial to scale and to shift each batch by a certain gamma and beta, to minimize network loss (error) on the whole training set
Idea 3: Finding optimal gamma and beta can be achieved with SGD (gradient descent)
Batch Normalization allows higher learning rates, reducing the number of epochs; consequently, training is much faster.
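A NumPy sketch of the batch-norm transformation of one layer's activations at training time (the running averages used at test time are omitted):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x: (batch_size, n_features) activations of one layer for one batch
        mu = x.mean(axis=0)                      # Idea 1: per-feature batch mean...
        var = x.var(axis=0)                      # ...and variance
        x_hat = (x - mu) / np.sqrt(var + eps)    # normalize the batch
        return gamma * x_hat + beta              # Idea 2: learnable scale and shift
                                                 # (Idea 3: gamma and beta are trained with SGD)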
NetTalk
The task is to learn to pronounce English text from examples (text-to-speech)
- Training data: list of <phrase, phonetic representation>
- Input: 7 consecutive characters from written text presented in a moving window that scans text
- Output: phoneme code giving the pronunciation of the letter at the center of the input window
- Network topology: 7x29 binary inputs (26 chars + punctuation marks), 80 hidden units and 26 output units (phoneme code). Sigmoid units in hidden and output layer
recommended setup for solving binary classification problems with MLP
sigmoid and cross-entropy
recommended setup for solving multiclass classification problems with MLP
softmax and cross-entropy
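A hedged Keras sketch of the three recommended output-layer / loss combinations (regression, binary, multiclass); the hidden layer, input size and class count are placeholders:

    from tensorflow import keras   # assumption: TensorFlow/Keras is available

    def output_setup(task, n_classes=10):
        if task == "regression":     # linear output + SSE/MSE loss
            return keras.layers.Dense(1, activation="linear"), "mse"
        if task == "binary":         # sigmoid output + cross-entropy
            return keras.layers.Dense(1, activation="sigmoid"), "binary_crossentropy"
        if task == "multiclass":     # softmax output + cross-entropy
            return keras.layers.Dense(n_classes, activation="softmax"), "categorical_crossentropy"

    out_layer, loss = output_setup("multiclass")
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(64, activation="relu"),
        out_layer,
    ])
    model.compile(optimizer="sgd", loss=loss)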
LeNet 5: layer C1
Convolutional layer with 6 feature maps of size 28x28
- Each unit of C1 has a 5x5 receptive field in the input layer
- Shared weights (5x5+1)x6=156 parameters to learn
- Connections: 28x28x(5x5+1)x6 = 122,304
- If it were fully connected, there would be (32x32+1)x(28x28)x6 = 4,821,600 parameters
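The counts above can be reproduced with a few lines of arithmetic:

    filt = 5 * 5          # receptive field size
    maps = 6              # feature maps in C1
    out_units = 28 * 28   # units per feature map

    shared_params = (filt + 1) * maps                     # (5x5+1)x6 = 156
    connections = out_units * (filt + 1) * maps           # 28x28x(5x5+1)x6 = 122,304
    fully_connected = (32 * 32 + 1) * out_units * maps    # (32x32+1)x(28x28)x6 = 4,821,600
    print(shared_params, connections, fully_connected)    # 156 122304 4821600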
LeNet 5: layer S2
Subsampling layer with 6 feature maps of size 14x14; 2x2 non-overlapping receptive fields in C1
- 6x2=12 trainable parameters.
- Connections: 14x14x(2x2+1)x6=5880
LeNet 5: total
The whole network has:
– 1256 nodes
– 64,660 connections
– 9,760 trainable parameters (and not millions!)
– trained with the Backpropagation algorithm
ALVINN
Neural network that drives a car
30x32 inputs, 4 hidden units, 30 outputs => 30x32x4 + 4x30 tunable parameters
network type most suitable for removing noise from images
autoencoder
Linear Separability for multi-class problems
There exist c linear discriminant functions y_1(x), …, y_c(x) such that each x is assigned to class C_k if and only if y_k(x) > y_j(x) for all j ≠ k
when do functions not necessarily discriminate sets?
Check whether the functions are monotonic; if they are not, they do not necessarily discriminate the sets
number of weights between input layer and first convolutional layer C
the input of each node in C comes through the convolutional filter, so: size of the convolutional filter × number of nodes in C
learning to play the Atari Breakout game
train convolutional network to play Breakout
- the network takes as input 4 consecutive frames (preprocessed to 4x84x84 pixels) + “reward”;
- 4 frames are needed to contain info about ball direction, speed, acceleration, etc.
- output consists of 18 nodes that correspond to all possible positions of the joystick
What network architecture was used to generate the “word to vector” mapping?
multi-layer perceptron
What network architecture was used by the AlphaGo program?
ResNet (residual network)
Key idea: it is easier to learn the modification of the original input than the modified input itself => add identity shortcuts between 2 or more layers
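A minimal sketch of such an identity shortcut (a residual block) in Keras; the filter count is a placeholder and the input is assumed to already have that many channels:

    from tensorflow import keras   # assumption: TensorFlow/Keras is available

    def residual_block(x, filters=64):
        shortcut = x   # identity shortcut: the original input is passed through unchanged
        y = keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        y = keras.layers.Conv2D(filters, 3, padding="same")(y)   # learns only the "modification"
        y = keras.layers.Add()([y, shortcut])                    # add the original input back
        return keras.layers.Activation("relu")(y)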
What network architecture was used to generate the Google DeepDream video(s)?
convolutional network
What is the difference between gradient descent and backpropagation?
Gradient descent is a general technique for finding (local) minima of a function which involves calculating gradients (or partial derivatives) of the function, while backpropagation is a very efficient method for calculating gradients of “well-structured” functions such as multi-layered networks.
AlphaGo Zero
- trained solely by self-play generated data (no human knowledge!)
- uses a SINGLE Convolutional ResNet with two “heads” that model policy and value estimates:
policy = probability distribution over all possible next moves
value = probability of winning from the current position
- extensive use of Monte Carlo Tree Search to get better estimates
- a tournament to select the best network to generate fresh training data
DQN introduces a ‘replay buffer’ to store observations obtained during training. What is this buffer used for?
To break the correlation between consecutive training examples: mini-batches are sampled at random from the buffer
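A minimal sketch of such a buffer (not the exact DQN implementation): transitions are stored as they come in, and training batches are drawn at random so that consecutive, highly correlated frames do not end up in the same update.

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # oldest observations are dropped automatically

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # random sampling breaks the correlation between consecutive transitions
            return random.sample(self.buffer, batch_size)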
GANs
- Generator: generate fake samples, tries to fool the Discriminator
- Discriminator: tries to distinguish between real and fake samples
Formulated as a minimax game where:
- The Discriminator tries to maximize its reward
- The Generator tries to minimize the Discriminator’s reward (or maximize its loss)
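In formula form, this is the standard GAN minimax objective (G = Generator, D = Discriminator, z = noise input):

    \min_G \max_D V(D, G) =
        \mathbb{E}_{x \sim p_{\text{data}}}\,[\log D(x)]
      + \mathbb{E}_{z \sim p_z}\,[\log(1 - D(G(z)))]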
Mode Collapse GANs
Outputs of the Generator gradually become less diverse: the Generator produces good samples, but very few of them, thus the Discriminator can’t tag them as fake.
dropout
at every training step, every neuron has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step
L1 regularization
adds the sum of the absolute values of the weights (coefficients) as a penalty term to the loss function to avoid overfitting
Monte Carlo dropout
stack the predictions of, say, 100 forward passes over the test set while dropout is still active (so all predictions will be different), then average them
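A hedged sketch, assuming a Keras model with Dropout layers: dropout is kept active at prediction time by calling the model with training=True, and the stochastic predictions are averaged (the spread can serve as an uncertainty estimate):

    import numpy as np

    def mc_dropout_predict(model, X_test, n_samples=100):
        # model: a Keras model containing Dropout layers
        # training=True keeps those Dropout layers active, so every forward pass differs
        preds = np.stack([model(X_test, training=True).numpy() for _ in range(n_samples)])
        return preds.mean(axis=0), preds.std(axis=0)   # averaged prediction + spread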
why is adding layers to a neural network harmful?
- the increased number of weights quickly leads to data overfitting (lack of generalization)
- a huge number of (bad) local minima traps the gradient descent algorithm
- vanishing or exploding gradients (the update rule involves products of many numbers) cause additional problems
why would we need many layers?
- in theory, one hidden layer is sufficient to model any function with arbitrary accuracy, but the number of required nodes and weights grows exponentially fast
- the deeper the network, the fewer nodes are required to model “complicated” functions
- consecutive layers “learn” features of the training patterns, from the simplest (lower layers) to more complicated (top layers)
perceptron learning algorithm
- initialize w randomly
- while there are misclassified examples:
- select misclassified example (x,d)
- w_new = w_old + theta * x * (desired - out)
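A NumPy sketch of this loop (bias handling, learning rate and the 0/1 targets are my own choices):

    import numpy as np

    def perceptron_train(X, d, lr=1.0, max_epochs=100):
        # X: (n, features) inputs with a bias column appended; d: desired outputs in {0, 1}
        w = np.random.randn(X.shape[1])                # initialize w randomly
        for _ in range(max_epochs):
            errors = 0
            for x_i, d_i in zip(X, d):
                out = 1 if w @ x_i > 0 else 0          # threshold unit
                if out != d_i:                         # misclassified example (x, d)
                    w = w + lr * x_i * (d_i - out)     # w_new = w_old + theta * x * (desired - out)
                    errors += 1
            if errors == 0:                            # no misclassified examples left
                break
        return w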
neuron learning rule
w = w + theta * x * (desired - out) * out'(x)
support vector machine idea
the decision boundary should be as far away from the data of both classes as possible, maximize the margin
support vectors are the training points that are nearest to the separating hyperplane
3 weight update strategies
full batch mode: weights are updated after all the inputs are processed
(mini) batch mode: weights are updated after a small random sample of inputs is processed (Stochastic Gradient Descent)
on-line mode: weights are updated after processing single inputs
types of decision regions
- network with a single node (a line)
- one-hidden layer network that realizes the convex region: each node realizes one line bounding this region
- two-hidden layer network that realizes the union of 3 convex regions: each box represents a one-hidden layer network realizing one region
see also slide 6 of week 3
gradient descent method
walk in the direction yielding the maximum decrease of the network error E, which is the opposite of the gradient of E
3 phases of backpropagation
- computing the output of the network with corresponding error
- computing the contribution of each weight to the error
- adjusting the weights accordingly
forward pass of backpropagation
The network is activated on one example, the error of each neuron of the output layer is computed and the activations of all hidden nodes are computed.
backward pass of backpropagation
The network error is used for updating the weights. Starting at the output layer, the error is propagated backwards through the network, layer by layer with help of the generalized delta rule. Finally, all weights are updated.
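A compact NumPy sketch of one forward and backward pass for a one-hidden-layer network with sigmoid units and SSE loss (shapes and names are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, target, W1, W2, lr=0.1):
        # forward pass: activate the network on one example
        h = sigmoid(W1 @ x)                            # hidden activations
        y = sigmoid(W2 @ h)                            # output activations
        # backward pass: generalized delta rule, starting at the output layer
        delta_out = (y - target) * y * (1 - y)         # output-layer error
        delta_hid = (W2.T @ delta_out) * h * (1 - h)   # error propagated to the hidden layer
        # finally, all weights are updated (gradient descent)
        W2 -= lr * np.outer(delta_out, h)
        W1 -= lr * np.outer(delta_hid, x)
        return W1, W2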
advantages SGD
- additional randomness helps to avoid local minima
- huge savings of CPU time
- easy to execute on GPU cards
early stopping
stop training as soon as the error on the validation set increases
exploding/vanishing gradients
the deeper you go, the more multiplications you have => products of small numbers are very small, products of big numbers very big
how to fix exploding/vanishing gradients?
- Instead of sigmoid or TanH, use alternative activation functions of which derivatives do not vanish, or only very slowly (ReLU, LReLU, ELU, SELU)
- avoid the growing variance of the outputs the deeper you go that occurs with traditional initialization => use alternative initialization strategies based on fan_in and fan_out (Glorot, He, LeCun)
- batch normalization
fan_in and fan_out
fan_in = number of connections to the given layer
fan_out = number of connections from the given layer
fan_avg = (fan_in + fan_out) / 2
Stochastic Gradient Descent (SGD)
evaluate gradients and update the weights with every training example or in mini batches
normal GD takes fewer steps, but each step takes much longer to compute
advantages:
+ Fewer redundant gradient computations, i.e., faster
+ Parallelizable, optional asynchronous updates
+ High-variance updates can hop out of local minima
+ Can encourage convergence by annealing the learning rate
Gradient descent Nesterov momentum
like gradient descent with momentum, but the gradient is evaluated at the look-ahead position x_k + β·d(x_k) instead of at x_k:
x_{k+1} = x_k - α·g(x_k + β·d(x_k)) + β·d(x_k)
gradient clipping
clip the gradients during backpropagation so that they never exceed some threshold, to avoid exploding gradients (clip to value between certain interval)
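A small sketch of both flavors: manual clipping by value, and the equivalent one-liner on a Keras optimizer (assuming TensorFlow/Keras is available):

    import numpy as np
    from tensorflow import keras

    def clip_by_value(grad, threshold=1.0):
        # force every gradient component into the interval [-threshold, threshold]
        return np.clip(grad, -threshold, threshold)

    # in Keras the same idea is a constructor argument of the optimizer
    optimizer = keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)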
random surfer model as Markov Process
page importance = frequency with which the surfer visits the page
transition matrix used for iterative calculation of the page probability distribution
Markov Process converges if the graph is strongly connected and there are no dead ends
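A NumPy sketch of the iterative calculation with a column-stochastic transition matrix M (no damping/teleportation, so it relies on the graph being strongly connected and free of dead ends):

    import numpy as np

    def pagerank(M, n_iter=100):
        # M[i, j] = probability that the surfer moves from page j to page i (columns sum to 1)
        n = M.shape[0]
        r = np.full(n, 1.0 / n)   # start from a uniform distribution over pages
        for _ in range(n_iter):
            r = M @ r             # one step of the random surfer (Markov process)
        return r                  # page importance = long-run visit frequency

    # tiny 3-page example
    M = np.array([[0.0, 0.5, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
    print(pagerank(M))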
digit recognition problem
what accuracy can be achieved if we randomly permute all pixels
the same accuracy as on the original data if we work with a single-layer perceptron or a multi-layer perceptron, because if we also permute the weights from the input to the hidden nodes, the network behaves in the same way as the original trained network
key idea behind convolutional networks
a filter (feature detector) returns high values when the corresponding patch is similar to the filter matrix
how do we know what the filters should look like? => instead of hand-crafting, specify each filter with (very few) parameters and find values of these parameters by backpropagation
SVM margin gamma
the distance of the closest example from the decision line or hyperplane
gamma_i = (w * x_i + b) * y_i
we want to maximize the margin for each data point (optimization problem)
SVM what if the data is not separable
introduce a penalty: if point x_i is on the wrong side of the margin then get penalty ksi_i, which is the distance of x_i to the closest point on the right side of the line
minimize |w|^2 plus the number of training mistakes (slack penalty C) times ksi
SVM problem when optimizing with w
scaling w also scales the margin, so “maximizing the margin” could be achieved trivially by making w arbitrarily large
solution: work with normalized w
=> gamma = (w/|w| · x + b) · y
with the margin of the closest points fixed to 1: max gamma = max 1/|w| = min |w| = min 1/2·|w|^2
SVM hinge loss
if a point is too close to the separating line (inside the margin) or on the wrong side, we incur a penalty proportional to how far it falls short of the margin
SVM how do we estimate w?
minimize f(w,b)
f(w,b) = 1/2·|w|^2 + C · Σ_i max(0, 1 - y_i · (w · x_i + b))
compute gradients with respect to w_j
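A NumPy sketch of one (sub)gradient step on f(w, b), summing the hinge losses over all training points (names and the learning rate are illustrative):

    import numpy as np

    def svm_sgd_step(w, b, X, y, C=1.0, lr=0.01):
        # X: (n, d) inputs, y: labels in {-1, +1}
        margins = y * (X @ w + b)
        viol = margins < 1                    # points inside the margin or misclassified
        # gradient of 1/2|w|^2 is w; the hinge term contributes -y*x for each violating point
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        return w - lr * grad_w, b - lr * grad_b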
tensor
an array, can be of any dimension (single number, tuple, image, stack of images, etc.)
CNN feature map
the result of applying a convolutional layer to the data
padding settings
artificially increasing the size of the input to preserve the original input size in the feature map
“same”: add zeros when needed
“valid”: accept the loss of some input => no padding and ignore parts of the input that don’t fit because of the stride
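A small sketch of how the two settings affect the output size of a convolution (the formulas below match what e.g. Keras uses):

    import math

    def conv_output_size(input_size, kernel_size, stride, padding):
        if padding == "same":    # zeros are added so that (for stride 1) the size is preserved
            return math.ceil(input_size / stride)
        if padding == "valid":   # no padding; input that does not fit is ignored
            return math.floor((input_size - kernel_size) / stride) + 1

    print(conv_output_size(32, 5, 1, "same"))    # 32
    print(conv_output_size(32, 5, 1, "valid"))   # 28, as in LeNet-5 layer C1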
AlexNet
first to stack convolutional layers directly on top of one another without pooling layers in between
regularization techniques:
- dropout
- data augmentation: increase the size of the training set by generating many realistic variants of each training instance (e.g. shift, rotate, resize)
- local response normalization
local response normalization LRN
the most strongly activated neurons inhibit other neurons located at the same position in neighboring feature maps, encouraging different feature maps to specialize, which improves generalization