Exam Flashcards
softmax vs sigmoid
softmax is used when the output is categorical (one of several mutually exclusive classes); sigmoid when the output is a single continuous value in (0, 1), e.g. a probability
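A minimal NumPy sketch contrasting the two (function names and the example logits are my own):

    import numpy as np

    def sigmoid(z):
        # squashes each value independently into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        # turns a vector of scores into a probability distribution over classes
        e = np.exp(z - np.max(z))   # subtract the max for numerical stability
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.1])
    print(sigmoid(logits))   # independent values in (0, 1), do not sum to 1
    print(softmax(logits))   # one probability per class, sums to 1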
Cover’s theorem
in a high-dimensional space, if the number of points is relatively small compared to the dimensionality and you paint the points randomly in two colors, the data set will almost always be linearly separable
- if the number of points n in a d-dimensional space is smaller than roughly 2d, they are almost always linearly separable: n/(d+1) < 2 => linearly separable
- if n is bigger than roughly 2d, they are almost always NOT linearly separable: n/(d+1) > 2 => not linearly separable
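A quick empirical check of this threshold, assuming scikit-learn is available; a linear SVM with a very large C is used as an approximate separability test (function and variable names are illustrative):

    import numpy as np
    from sklearn.svm import LinearSVC   # assumption: scikit-learn is installed

    def fraction_separable(n, d, trials=200):
        # how often are n randomly labeled points in d dimensions linearly separable?
        hits = 0
        for _ in range(trials):
            X = np.random.randn(n, d)
            y = np.random.choice([-1, 1], size=n)
            clf = LinearSVC(C=1e6, max_iter=20000).fit(X, y)   # large C ~ hard margin
            hits += clf.score(X, y) == 1.0                     # perfectly separated?
        return hits / trials

    d = 20
    print(fraction_separable(n=int(1.5 * (d + 1)), d=d))   # n/(d+1) < 2: close to 1
    print(fraction_separable(n=int(3.0 * (d + 1)), d=d))   # n/(d+1) > 2: close to 0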
recommended setup for regression problems
linear activation function with SSE loss function
Gradient descent with momentum
combine current weight update with previous update
p_new = p_old - a * gradient + b * last_direction
or equivalently:
x_{k+1} = x_k - α·g(x_k) + β·d(x_k), where g is the gradient and d(x_k) the previous update direction
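A NumPy sketch of the two update rules side by side (variable names are my own); starting from direction = 0, repeated calls accumulate momentum in directions where the gradient keeps pointing the same way:

    import numpy as np

    def gd_step(p, grad, lr):
        # plain gradient descent: p_new = p_old - a * gradient
        return p - lr * grad

    def momentum_step(p, grad, direction, lr, beta):
        # combine the current gradient step with the previous update direction d(x_k)
        direction = beta * direction - lr * grad
        return p + direction, direction   # p_new = p_old - a*gradient + b*last_direction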
Gradient descent
p_new = p_old - a*gradient
Biggest advantage of LSTM networks over Vanilla Recurrent Networks
Ability to learn which remote and recent information is relevant for the given task, and to use this information to generate the output
batch normalization
The process of finding an optimal transformation of each batch, layer after layer, that is optimized during the training process.
When training a network with batches of data the network “gets confused” by the fact that statistical properties of batches vary from batch to batch
Idea 1: normalize each batch => subtract the mean and divide by the std deviation
Idea 2: assume that it is beneficial to scale and to shift each batch by a certain gamma and beta, to minimize network loss (error) on the whole training set
Idea 3: Finding optimal gamma and beta can be achieved with SGD (gradient descent)
Batch Normalization allows higher learning rates, reducing the number of epochs; consequently, training is much faster.
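A NumPy sketch of the batch-norm transformation of one layer's activations at training time (the running averages used at test time are omitted):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x: (batch_size, n_features) activations of one layer for one batch
        mu = x.mean(axis=0)                      # Idea 1: per-feature batch mean...
        var = x.var(axis=0)                      # ...and variance
        x_hat = (x - mu) / np.sqrt(var + eps)    # normalize the batch
        return gamma * x_hat + beta              # Idea 2: learnable scale and shift
                                                 # (Idea 3: gamma and beta are trained with SGD)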
NetTalk
The task is to learn to pronounce English text from examples (text-to-speech)
- Training data: list of <phrase, phonetic representation>
- Input: 7 consecutive characters from written text presented in a moving window that scans text
- Output: phoneme code giving the pronunciation of the letter at the center of the input window
- Network topology: 7x29 binary inputs (26 chars + punctuation marks), 80 hidden units and 26 output units (phoneme code). Sigmoid units in hidden and output layer
recommended setup for solving binary classification problems with MLP
sigmoid and cross-entropy
recommended setup for solving multiclass classification problems with MLP
softmax and cross-entropy
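A hedged Keras sketch of the three recommended output-layer / loss combinations (regression, binary, multiclass); the hidden layer, input size and class count are placeholders:

    from tensorflow import keras   # assumption: TensorFlow/Keras is available

    def output_setup(task, n_classes=10):
        if task == "regression":     # linear output + SSE/MSE loss
            return keras.layers.Dense(1, activation="linear"), "mse"
        if task == "binary":         # sigmoid output + cross-entropy
            return keras.layers.Dense(1, activation="sigmoid"), "binary_crossentropy"
        if task == "multiclass":     # softmax output + cross-entropy
            return keras.layers.Dense(n_classes, activation="softmax"), "categorical_crossentropy"

    out_layer, loss = output_setup("multiclass")
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(64, activation="relu"),
        out_layer,
    ])
    model.compile(optimizer="sgd", loss=loss)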
LeNet 5: layer C1
Convolutional layer with 6 feature maps of size 28x28
- Each unit of C1 has a 5x5 receptive field in the input layer
- Shared weights (5x5+1)x6=156 parameters to learn
- Connections: 28x28x(5x5+1)x6 = 122,304
- If it were fully connected, there would be (32x32+1)x(28x28)x6 = 4,821,600 parameters
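The counts above can be reproduced with a few lines of arithmetic:

    filt = 5 * 5          # receptive field size
    maps = 6              # feature maps in C1
    out_units = 28 * 28   # units per feature map

    shared_params = (filt + 1) * maps                     # (5x5+1)x6 = 156
    connections = out_units * (filt + 1) * maps           # 28x28x(5x5+1)x6 = 122,304
    fully_connected = (32 * 32 + 1) * out_units * maps    # (32x32+1)x(28x28)x6 = 4,821,600
    print(shared_params, connections, fully_connected)    # 156 122304 4821600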
LeNet 5: layer S2
Subsampling layer with 6 feature maps of size 14x14; 2x2 non-overlapping receptive fields in C1
- 6x2=12 trainable parameters.
- Connections: 14x14x(2x2+1)x6=5880
LeNet 5: total
The whole network has:
– 1256 nodes
– 64,660 connections
– 9,760 trainable parameters (and not millions!)
– trained with the Backpropagation algorithm
ALVINN
Neural network that drives a car
30x32 inputs, 4 hidden units, 30 outputs => 30x32x4 + 4x30 tunable parameters
network type most suitable for removing noise from images
autoencoder
Linear Separability for multi-class problems
There exist c linear discriminant functions y_1(x), …, y_c(x) such that each x is assigned to class C_k if and only if y_k(x) > y_j(x) for all j ≠ k
when do functions not necessarily discriminate sets?
Check whether the functions are monotonic; if they are not, they do not necessarily discriminate the sets
number of weights between input layer and first convolutional layer C
the input of each node in C comes through the convolutional filter, so: size of the convolutional filter × number of nodes in C
learning to play the Atari Breakout game
train convolutional network to play Breakout
- the network takes as input 4 consecutive frames (preprocessed to 4x84x84 pixels) + “reward”;
- 4 frames are needed to contain info about ball direction, speed, acceleration, etc.
- output consists of 18 nodes that correspond to all possible positions of the joystick
What network architecture was used to generate the “word to vector” mapping?
multi-layer perceptron
What network architecture was used by the AlphaGo program?
ResNet (residual network)
Key idea: it is easier to learn the modification of the original input than the modified input itself => add identity shortcuts between 2 or more layers
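A minimal sketch of such an identity shortcut (a residual block) in Keras; the filter count is a placeholder and the input is assumed to already have that many channels:

    from tensorflow import keras   # assumption: TensorFlow/Keras is available

    def residual_block(x, filters=64):
        shortcut = x   # identity shortcut: the original input is passed through unchanged
        y = keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        y = keras.layers.Conv2D(filters, 3, padding="same")(y)   # learns only the "modification"
        y = keras.layers.Add()([y, shortcut])                    # add the original input back
        return keras.layers.Activation("relu")(y)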
What network architecture was used to generate the Google DeepDream video(s)?
convolutional network
What is the difference between gradient descent and backpropagation?
Gradient descent is a general technique for finding (local) minima of a function which involves calculating gradients (or partial derivatives) of the function, while backpropagation is a very efficient method for calculating gradients of “well-structured” functions such as multi-layered networks.
AlphaGo Zero
- trained solely by self-play generated data (no human knowledge!)
- uses a SINGLE Convolutional ResNet with two “heads” that model policy and value estimates:
policy = probability distribution over all possible next moves
value = probability of winning from the current position
- extensive use of Monte Carlo Tree Search to get better estimates
- a tournament to select the best network to generate fresh training data
DQN introduces a ‘replay buffer’ to store observations obtained during training. What is this buffer used for?
To break the correlation between consecutive training examples: mini-batches are sampled at random from the buffer
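A minimal sketch of such a buffer (not the exact DQN implementation): transitions are stored as they come in, and training batches are drawn at random so that consecutive, highly correlated frames do not end up in the same update.

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # oldest observations are dropped automatically

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # random sampling breaks the correlation between consecutive transitions
            return random.sample(self.buffer, batch_size)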
GANs
- Generator: generate fake samples, tries to fool the Discriminator
- Discriminator: tries to distinguish between real and fake samples
Formulated as a minimax game where:
- The Discriminator tries to maximize its reward
- The Generator tries to minimize the Discriminator’s reward (or maximize its loss)
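In formula form, this is the standard GAN minimax objective (G = Generator, D = Discriminator, z = noise input):

    \min_G \max_D V(D, G) =
        \mathbb{E}_{x \sim p_{\text{data}}}\,[\log D(x)]
      + \mathbb{E}_{z \sim p_z}\,[\log(1 - D(G(z)))]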
Mode Collapse GANs
Outputs of the Generator gradually become less diverse: the Generator produces good samples, but very few of them, thus the Discriminator can’t tag them as fake.
dropout
at every training step, every neuron has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step
L1 regularization
adds the sum of the absolute values of the weights (coefficients) as a penalty term to the loss function to avoid overfitting
Monte Carlo dropout
stack the predictions of, say, 100 forward passes over the test set while dropout is still active (so all predictions will be different), then average them
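A hedged sketch, assuming a Keras model with Dropout layers: dropout is kept active at prediction time by calling the model with training=True, and the stochastic predictions are averaged (the spread can serve as an uncertainty estimate):

    import numpy as np

    def mc_dropout_predict(model, X_test, n_samples=100):
        # model: a Keras model containing Dropout layers
        # training=True keeps those Dropout layers active, so every forward pass differs
        preds = np.stack([model(X_test, training=True).numpy() for _ in range(n_samples)])
        return preds.mean(axis=0), preds.std(axis=0)   # averaged prediction + spread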
why is adding layers to a neural network harmful?
- the increased number of weights quickly leads to data overfitting (lack of generalization)
- a huge number of (bad) local minima traps the gradient descent algorithm
- vanishing or exploding gradients (the update rule involves products of many numbers) cause additional problems
why would we need many layers?
- in theory, one hidden layer is sufficient to model any function with arbitrary accuracy, but the number of required nodes and weights grows exponentially fast
- the deeper the network, the fewer nodes are required to model “complicated” functions
- consecutive layers “learn” features of the training patterns, from the simplest (lower layers) to more complicated (top layers)
perceptron learning algorithm
- initialize w randomly
- while there are misclassified examples:
- select misclassified example (x,d)
- w_new = w_old + theta * x * (desired - out)
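A NumPy sketch of this loop (bias handling, learning rate and the 0/1 targets are my own choices):

    import numpy as np

    def perceptron_train(X, d, lr=1.0, max_epochs=100):
        # X: (n, features) inputs with a bias column appended; d: desired outputs in {0, 1}
        w = np.random.randn(X.shape[1])                # initialize w randomly
        for _ in range(max_epochs):
            errors = 0
            for x_i, d_i in zip(X, d):
                out = 1 if w @ x_i > 0 else 0          # threshold unit
                if out != d_i:                         # misclassified example (x, d)
                    w = w + lr * x_i * (d_i - out)     # w_new = w_old + theta * x * (desired - out)
                    errors += 1
            if errors == 0:                            # no misclassified examples left
                break
        return w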
neuron learning rule
w = w + theta * x * (desired - out) * out'(x)
support vector machine idea
the decision boundary should be as far away from the data of both classes as possible, maximize the margin
support vectors are the training points that are nearest to the separating hyperplane
3 weight update strategies
full batch mode: weights are updated after all the inputs are processed
(mini) batch mode: weights are updated after a small random sample of inputs is processed (Stochastic Gradient Descent)
on-line mode: weights are updated after processing single inputs
types of decision regions
- network with a single node (a line)
- one-hidden layer network that realizes the convex region: each node realizes one line bounding this region
- two-hidden layer network that realizes the union of 3 convex regions: each box represents a one-hidden layer network realizing one region
see also slide 6 of week 3
gradient descent method
walk in the direction yielding the maximum decrease of the network error E, which is the opposite of the gradient of E
3 phases of backpropagation
- computing the output of the network with corresponding error
- computing the contribution of each weight to the error
- adjusting the weights accordingly
forward pass of backpropagation
The network is activated on one example, the error of each neuron of the output layer is computed and the activations of all hidden nodes are computed.
backward pass of backpropagation
The network error is used for updating the weights. Starting at the output layer, the error is propagated backwards through the network, layer by layer with help of the generalized delta rule. Finally, all weights are updated.
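A compact NumPy sketch of one forward and backward pass for a one-hidden-layer network with sigmoid units and SSE loss (shapes and names are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, target, W1, W2, lr=0.1):
        # forward pass: activate the network on one example
        h = sigmoid(W1 @ x)                            # hidden activations
        y = sigmoid(W2 @ h)                            # output activations
        # backward pass: generalized delta rule, starting at the output layer
        delta_out = (y - target) * y * (1 - y)         # output-layer error
        delta_hid = (W2.T @ delta_out) * h * (1 - h)   # error propagated to the hidden layer
        # finally, all weights are updated (gradient descent)
        W2 -= lr * np.outer(delta_out, h)
        W1 -= lr * np.outer(delta_hid, x)
        return W1, W2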
advantages SGD
- additional randomness helps to avoid local minima
- huge savings of CPU time
- easy to execute on GPU cards
early stopping
stop training as soon as the error on the validation set increases
exploding/vanishing gradients
the deeper you go, the more multiplications you have => products of small numbers are very small, products of big numbers very big
how to fix exploding/vanishing gradients?
- Instead of sigmoid or TanH, use alternative activation functions of which derivatives do not vanish, or only very slowly (ReLU, LReLU, ELU, SELU)
- avoid the growing variance of the outputs the deeper you go that occurs with traditional initialization => use alternative initialization strategies based on fan_in and fan_out (Glorot, He, LeCun)
- batch normalization
fan_in and fan_out
fan_in = number of connections to the given layer
fan_out = number of connections from the given layer
fan_avg = (fan_in + fan_out) / 2
Stochastic Gradient Descent (SGD)
evaluate gradients and update the weights with every training example or in mini batches
normal GD takes fewer steps, but each step takes much longer to compute
advantages:
+ Fewer redundant gradient computations, i.e., faster
+ Parallelizable, optional asynchronous updates
+ High-variance updates can hop out of local minima
+ Can encourage convergence by annealing the learning rate
Gradient descent Nesterov momentum
like gradient descent with momentum, but the gradient is evaluated at the look-ahead position x_k + β·d(x_k) instead of at x_k:
x_{k+1} = x_k - α·g(x_k + β·d(x_k)) + β·d(x_k)
gradient clipping
clip the gradients during backpropagation so that they never exceed some threshold, to avoid exploding gradients (clip to value between certain interval)
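A small sketch of both flavors: manual clipping by value, and the equivalent one-liner on a Keras optimizer (assuming TensorFlow/Keras is available):

    import numpy as np
    from tensorflow import keras

    def clip_by_value(grad, threshold=1.0):
        # force every gradient component into the interval [-threshold, threshold]
        return np.clip(grad, -threshold, threshold)

    # in Keras the same idea is a constructor argument of the optimizer
    optimizer = keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)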
random surfer model as Markov Process
page importance = frequency with which the surfer visits the page
transition matrix used for iterative calculation of the page probability distribution
Markov Process converges if the graph is strongly connected and there are no dead ends
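A NumPy sketch of the iterative calculation with a column-stochastic transition matrix M (no damping/teleportation, so it relies on the graph being strongly connected and free of dead ends):

    import numpy as np

    def pagerank(M, n_iter=100):
        # M[i, j] = probability that the surfer moves from page j to page i (columns sum to 1)
        n = M.shape[0]
        r = np.full(n, 1.0 / n)   # start from a uniform distribution over pages
        for _ in range(n_iter):
            r = M @ r             # one step of the random surfer (Markov process)
        return r                  # page importance = long-run visit frequency

    # tiny 3-page example
    M = np.array([[0.0, 0.5, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
    print(pagerank(M))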
digit recognition problem
what accuracy can be achieved if we randomly permute all pixels
the same accuracy as on the original data if we work with a single-layer perceptron or a multi-layer perceptron, because if we also permute the weights from the input to the hidden nodes, the network behaves in the same way as the original trained network
key idea behind convolutional networks
a filter (feature detector) returns high values when the corresponding patch is similar to the filter matrix
how do we know what the filters should look like? => instead of hand-crafting, specify each filter with (very few) parameters and find values of these parameters by backpropagation
SVM margin gamma
the distance of the closest example from the decision line or hyperplane
gamma_i = (w * x_i + b) * y_i
we want to maximize the margin for each data point (optimization problem)
SVM what if the data is not separable
introduce a penalty: if point x_i is on the wrong side of the margin then get penalty ksi_i, which is the distance of x_i to the closest point on the right side of the line
minimize |w|^2 plus the number of training mistakes (slack penalty C) times ksi
SVM problem when optimizing with w
scaling w also scales the margin, so “maximizing the margin” could be achieved trivially by making w arbitrarily large
solution: work with normalized w
=> gamma = (w/|w| · x + b) · y
with the margin of the closest points fixed to 1: max gamma = max 1/|w| = min |w| = min 1/2·|w|^2
SVM hinge loss
if a point is too close to the separating line (inside the margin) or on the wrong side, we incur a penalty proportional to how far it falls short of the margin
SVM how do we estimate w?
minimize f(w,b)
f(w,b) = 1/2·|w|^2 + C · Σ_i max(0, 1 - y_i · (w · x_i + b))
compute gradients with respect to w_j
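A NumPy sketch of one (sub)gradient step on f(w, b), summing the hinge losses over all training points (names and the learning rate are illustrative):

    import numpy as np

    def svm_sgd_step(w, b, X, y, C=1.0, lr=0.01):
        # X: (n, d) inputs, y: labels in {-1, +1}
        margins = y * (X @ w + b)
        viol = margins < 1                    # points inside the margin or misclassified
        # gradient of 1/2|w|^2 is w; the hinge term contributes -y*x for each violating point
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        return w - lr * grad_w, b - lr * grad_b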
tensor
an array, can be of any dimension (single number, tuple, image, stack of images, etc.)
CNN feature map
the result of applying a convolutional layer to the data
padding settings
artificially increasing the size of the input to preserve the original input size in the feature map
“same”: add zeros when needed
“valid”: accept the loss of some input => no padding and ignore parts of the input that don’t fit because of the stride
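A small sketch of how the two settings affect the output size of a convolution (the formulas below match what e.g. Keras uses):

    import math

    def conv_output_size(input_size, kernel_size, stride, padding):
        if padding == "same":    # zeros are added so that (for stride 1) the size is preserved
            return math.ceil(input_size / stride)
        if padding == "valid":   # no padding; input that does not fit is ignored
            return math.floor((input_size - kernel_size) / stride) + 1

    print(conv_output_size(32, 5, 1, "same"))    # 32
    print(conv_output_size(32, 5, 1, "valid"))   # 28, as in LeNet-5 layer C1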
AlexNet
first to stack convolutional layers directly on top of one another without pooling layers in between
regularization techniques:
- dropout
- data augmentation: increase the size of the training set by generating many realistic variants of each training instance (e.g. shift, rotate, resize)
- local response normalization
local response normalization LRN
the most strongly activated neurons inhibit other neurons located at the same position in neighboring feature maps, encouraging different feature maps to specialize, which improves generalization