Exam Flashcards
softmax vs sigmoid
softmax is used when the output is categorical, sigmoid when the output is continuous
Cover’s theorem
in a high-dimensional space, if the number of points is relatively small compared to the dimensionality and you color the points randomly with two colors, the data set will almost always be linearly separable
- if the number of points n in a d-dimensional space is smaller than about 2(d+1), they are almost always linearly separable:
if n/(d+1) < 2, then linearly separable
- if the number of points is bigger than about 2(d+1), they are almost always NOT linearly separable:
if n/(d+1) > 2, then not linearly separable
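A quick numeric check of this rule of thumb (the point counts and dimension below are arbitrary examples):

```python
def probably_separable(n: int, d: int) -> bool:
    """Cover's rule of thumb: n randomly colored points in d dimensions
    are almost always linearly separable when n / (d + 1) < 2."""
    return n / (d + 1) < 2

print(probably_separable(100, 60))  # 100/61 ~ 1.6 < 2 -> almost always separable
print(probably_separable(300, 60))  # 300/61 ~ 4.9 > 2 -> almost always NOT separable
```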
recommended setup for regression problems
linear activation function with SSE loss function
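A minimal PyTorch sketch of this setup (layer sizes are arbitrary placeholders); note that nn.MSELoss computes the mean of the squared errors, i.e. SSE divided by the number of samples:

```python
import torch
import torch.nn as nn

# MLP for regression: linear (identity) activation on the output layer
model = nn.Sequential(
    nn.Linear(10, 32),   # 10 input features (placeholder size)
    nn.ReLU(),
    nn.Linear(32, 1),    # single continuous output, no activation
)
loss_fn = nn.MSELoss()   # mean squared error = SSE / N

x = torch.randn(8, 10)   # dummy batch
y = torch.randn(8, 1)
loss = loss_fn(model(x), y)
loss.backward()
```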
Gradient descent with momentum
combine current weight update with previous update
p_new = p_old - a*gradient + b*last_direction
in other words:
x_(k+1) = x_k - αg(x_k) + βd(x_k)
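A minimal sketch of this update rule on the toy function f(x) = x^2 (the values of α and β are chosen arbitrarily):

```python
import numpy as np

def grad(x):                  # gradient of f(x) = x^2
    return 2 * x

x = np.array([5.0])
alpha, beta = 0.1, 0.9        # learning rate and momentum coefficient
d = np.zeros_like(x)          # previous update direction

for _ in range(200):
    d = -alpha * grad(x) + beta * d   # combine current gradient with previous update
    x = x + d                         # x_{k+1} = x_k - alpha*g(x_k) + beta*d_k

print(x)                      # converges towards the minimum at 0
```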
Gradient descent
p_new = p_old - a*gradient
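The same toy function with plain gradient descent, i.e. without the momentum term (α arbitrary):

```python
def grad(x):                  # gradient of f(x) = x^2
    return 2 * x

x, alpha = 5.0, 0.1
for _ in range(100):
    x = x - alpha * grad(x)   # p_new = p_old - a * gradient

print(x)                      # converges towards the minimum at 0
```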
Biggest advantage of LSTM networks over Vanilla Recurrent Networks
Ability of learning which remote and recent information is relevant for the given task and using this information to generate output
batch normalization
The process of finding an optimal transformation of each batch, layer after layer, that is optimized during the training process.
When training a network with batches of data the network “gets confused” by the fact that statistical properties of batches vary from batch to batch
Idea 1: normalize each batch => subtract the mean and divide by the std deviation
Idea 2: assume that it is beneficial to scale and to shift each batch by a certain gamma and beta, to minimize network loss (error) on the whole training set
Idea 3: Finding optimal gamma and beta can be achieved with SGD (gradient descent)
Batch Normalization allows higher learning rates and reduces the number of epochs needed; consequently, training is much faster than without it
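A minimal NumPy sketch of the batch-normalization forward pass for one layer; gamma and beta would normally be learned with SGD (Idea 3), here they are only initialized:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch_size, num_features) activations of one layer for one batch."""
    mean = x.mean(axis=0)                # Idea 1: per-feature batch mean
    std = x.std(axis=0)                  # ... and standard deviation
    x_hat = (x - mean) / (std + eps)     # normalize the batch
    return gamma * x_hat + beta          # Idea 2: learnable scale and shift

x = np.random.randn(32, 4) * 3.0 + 7.0   # a batch with "inconvenient" statistics
gamma, beta = np.ones(4), np.zeros(4)    # to be optimized during training
out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0), out.std(axis=0)) # ~0 and ~1 per feature
```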
NetTalk
The task is to learn to pronounce English text from examples (text-to-speech)
- Training data: list of <phrase, phonetic representation>
- Input: 7 consecutive characters from written text presented in a moving window that scans text
- Output: phoneme code giving the pronunciation of the letter at the center of the input window
- Network topology: 7x29 binary inputs (26 chars + punctuation marks), 80 hidden units and 26 output units (phoneme code). Sigmoid units in hidden and output layer
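A hedged PyTorch sketch of this topology (training details of the original NetTalk are not reproduced here):

```python
import torch.nn as nn

# NetTalk-style MLP: 7 character positions x 29 binary inputs each
nettalk = nn.Sequential(
    nn.Linear(7 * 29, 80),   # 203 binary inputs -> 80 hidden units
    nn.Sigmoid(),
    nn.Linear(80, 26),       # 26 output units encoding the phoneme
    nn.Sigmoid(),
)
```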
recommended setup for solving binary classification problems with MLP
sigmoid and cross-entropy
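A small NumPy illustration of sigmoid + binary cross-entropy for a single example (values arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 1.2                                  # network output before the activation
p = sigmoid(z)                           # predicted probability of the positive class
t = 1.0                                  # true label (0 or 1)
bce = -(t * np.log(p) + (1 - t) * np.log(1 - p))   # binary cross-entropy loss
print(p, bce)
```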
recommended setup for solving multiclass classification problems with MLP
softmax and cross-entropy
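And the multiclass counterpart, softmax + cross-entropy (values arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0])           # network outputs ("logits") for 3 classes
p = softmax(z)                           # probability distribution over the classes
t = 0                                    # index of the true class
ce = -np.log(p[t])                       # cross-entropy loss
print(p, ce)
```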
LeNet 5: layer C1
Convolutional layer with 6 feature maps of size 28x28
- Each unit of C1 has a 5x5 receptive field in the input layer
- Shared weights (5x5+1)x6=156 parameters to learn
- Connections: 28x28x(5x5+1)x6=122304
- If it were fully connected, there would be:
(32x32+1)x(28x28)x6 = 4,821,600 parameters
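The parameter and connection counts above can be checked directly:

```python
# C1: 6 feature maps, 5x5 receptive field + bias, 28x28 output positions
params_c1 = (5 * 5 + 1) * 6                      # 156 (weights are shared)
connections_c1 = 28 * 28 * (5 * 5 + 1) * 6       # 122_304
fully_connected = (32 * 32 + 1) * (28 * 28) * 6  # 4_821_600 without weight sharing
print(params_c1, connections_c1, fully_connected)
```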
LeNet 5: layer S2
Subsampling layer with 6 feature maps of size 14x14
- Each unit has a 2x2 non-overlapping receptive field in C1
- 6x2=12 trainable parameters.
- Connections: 14x14x(2x2+1)x6=5880
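The same bookkeeping for S2:

```python
# S2: 6 maps, 2x2 receptive field, one trainable coefficient + bias per map
params_s2 = 6 * 2                            # 12
connections_s2 = 14 * 14 * (2 * 2 + 1) * 6   # 5_880
print(params_s2, connections_s2)
```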
LeNet 5: total
The whole network has:
– 1256 nodes
– 64,660 connections
– 9,760 trainable parameters (and not millions!)
– trained with the Backpropagation algorithm
ALVINN
Neural network that drives a car
30x32 inputs, 4 hidden units, 30 outputs => 30x32x4 + 4x30 = 3,960 tunable parameters
network type most suitable for removing noise from images
autoencoder
Linear Separability for multi-class problems
There exist c linear discriminant functions
y_1(x), …, y_c(x) such that each x is assigned to class C_k if and only if y_k(x) > y_j(x) for all j ≠ k
when do functions not necessarily discriminate sets?
Check whether the functions are monotonic; if they are not, they do not necessarily discriminate the sets
number of weights between input layer and first convolutional layer C
each unit in C receives input only through the convolutional filter, so: (convolutional filter size) x (number of units in C)
learning to play the Atari Breakout game
train convolutional network to play Breakout
- the network takes as input 4 consecutive frames (preprocessed to 4x84x84 pixels) + “reward”;
- 4 frames are needed to contain info about ball direction, speed, acceleration, etc.
- output consists of 18 nodes that correspond to all possible positions of the joystick
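A hedged PyTorch sketch of a network with this input/output shape; the convolutional trunk below is a plausible placeholder, not the exact published DQN architecture:

```python
import torch
import torch.nn as nn

dqn = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4),  # 4 stacked 84x84 frames as input channels
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256),
    nn.ReLU(),
    nn.Linear(256, 18),                         # one output per possible joystick position
)

frames = torch.randn(1, 4, 84, 84)              # dummy preprocessed input
print(dqn(frames).shape)                        # torch.Size([1, 18])
```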
What network architecture was used to generate the “word to vector” mapping?
multi-layer perceptron
What network architecture was used by the AlphaGo program?
ResNet (residual network)
Key idea: it is easier to learn “the modification of the original image” than the modified image itself => add identity shortcuts between 2 or more layers
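A minimal PyTorch sketch of such an identity shortcut (the channel count is an arbitrary placeholder):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv layers whose output is added to the (unchanged) input."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # identity shortcut: learn the modification of x

block = ResidualBlock()
print(block(torch.randn(1, 64, 8, 8)).shape)   # torch.Size([1, 64, 8, 8])
```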
What network architecture was used to generate the Google DeepDream video(s)?
convolutional network
What is the difference between gradient descent and backpropagation?
Gradient descent is a general technique for finding (local) minima of a function which involves calculating gradients (or partial derivatives) of the function, while backpropagation is a very efficient method for calculating gradients of “well-structured” functions such as multi-layered networks.
AlphaGo Zero
- trained solely by self-play generated data (no human knowledge!)
- use a SINGLE Convolutional ResNet with two “heads” that model policy and value estimates:
policy = probability distribution over all possible next moves
value = probability of winning from the current position
- extensive use of Monte Carlo Tree Search to get better estimates
- a tournament to select the best network to generate fresh training data
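A hedged sketch of the “single network, two heads” idea; layer sizes and the board encoding are placeholders, not the actual AlphaGo Zero dimensions:

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, channels=32, board=19):
        super().__init__()
        self.trunk = nn.Sequential(          # shared convolutional trunk (ResNet in AlphaGo Zero)
            nn.Conv2d(3, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(    # distribution over all possible next moves (+ pass)
            nn.Flatten(), nn.Linear(channels * board * board, board * board + 1),
        )
        self.value_head = nn.Sequential(     # probability of winning from the current position
            nn.Flatten(), nn.Linear(channels * board * board, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)

net = PolicyValueNet()
policy_logits, value = net(torch.randn(1, 3, 19, 19))
print(policy_logits.shape, value.shape)      # torch.Size([1, 362]) torch.Size([1, 1])
```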
DQN introduces a ‘replay buffer’ to store observations obtained during training. What is this buffer used for?
To avoid correlation between training examples
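A minimal sketch of such a replay buffer: observations are stored as they arrive and training minibatches are drawn at random, which breaks the temporal correlation between consecutive examples:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest observations are discarded

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # random sampling decorrelates the training examples
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):                           # dummy experience
    buf.add(t, 0, 0.0, t + 1, False)
minibatch = buf.sample(32)
```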