Course session 8 (Multilayer perceptrons) Flashcards

1
Q

What are the three key elements of feed-forward neural networks?

A

The model
* A non-linear mapping function f_θ(x), defined by the network architecture.
Objective/loss function
* Squared error for regression: \sum_{n=1}^N [y_n - f_θ(x_n)]^2
* Cross entropy for classification: \sum_{n=1}^N [-y_n \log f_n - (1 - y_n) \log(1 - f_n)], where f_n = f_θ(x_n)
Optimization algorithm
* Error backpropagation
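
A minimal NumPy sketch of the two loss functions (here y holds the targets y_n and f the model outputs f_θ(x_n); the function names are illustrative):

import numpy as np

def squared_error(y, f):
    """Regression loss: sum_n [y_n - f_n]^2."""
    return np.sum((y - f) ** 2)

def cross_entropy(y, f, eps=1e-12):
    """Binary classification loss: sum_n [-y_n log f_n - (1 - y_n) log(1 - f_n)]."""
    f = np.clip(f, eps, 1 - eps)  # guard against log(0)
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))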

2
Q

Write out the equation for feed-forward neural networks and explain it.

A

y_k(x, W) = f[\sum_{j=1}^M w_{kj}^{(2)} h(\sum_{i=1}^D w_{ji}^{(1)} x_i + w_{j0}^{(1)}) + w_{k0}^{(2)}]
          = f[\sum_{j=0}^M w_{kj}^{(2)} h(\sum_{i=0}^D w_{ji}^{(1)} x_i)]

where the second, compact form absorbs the bias terms by introducing a constant input x_0 = 1 (and treating the j = 0 hidden unit as clamped to 1). The differentiable, nonlinear activation functions h(.) and f(.) are often chosen to be logistic sigmoid functions; for h(.), ReLU is also widely used.
The NN model is simply a nonlinear function from a set of input variables {x_i} to a set of output variables {y_k}, controlled by adjustable parameters W.
The model comprises two stages of processing, each of which resembles the perceptron model; hence it is also known as the multilayer perceptron (MLP).
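
A minimal NumPy forward pass matching this equation, with sigmoid for both h(.) and f(.) (the shape conventions are my own assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """x: (D,) input; W1: (M, D), b1: (M,) first-stage weights/biases;
    W2: (K, M), b2: (K,) second-stage weights/biases."""
    z = sigmoid(W1 @ x + b1)     # hidden activations: h(sum_i w_ji^(1) x_i + w_j0^(1))
    return sigmoid(W2 @ z + b2)  # outputs: f(sum_j w_kj^(2) z_j + w_k0^(2))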

3
Q

What is a fully connected neural network?

A

A network in which every neuron in one layer is connected to every neuron in both the previous and the next layer.

4
Q

What is a hidden layer in a neural network?

A

A layer whose units are neither set directly by the input nor read directly as output, i.e. they correspond to neither the x_i nor the y_i.

5
Q

What is an epoch?

A

One epoch is one complete pass of the algorithm over the entire training set, i.e. every training example has been processed once.
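
Sketched as a loop (a minimal NumPy version; update_fn stands in for whatever per-batch learning step the algorithm uses):

import numpy as np

def run_epochs(X, y, update_fn, batch_size=32, num_epochs=10):
    """Each outer iteration is one epoch: every sample in X is processed once."""
    n = len(X)
    for epoch in range(num_epochs):
        order = np.random.permutation(n)       # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            update_fn(X[idx], y[idx])          # one mini-batch update
        # epoch complete: all n samples have been seen exactly once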

6
Q

What is the update rule for a perceptron?

A

Update = LearningFactor * (DesiredOutput - ActualOutput) * Input
i.e. Δw_j = η (r^t - y^t) x_j^t, with learning factor η, desired output r^t, actual output y^t, and input x_j^t.
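
A sketch of this rule for a single perceptron with a step activation (the bias is folded in as a constant input x_0 = 1; names are illustrative):

import numpy as np

def perceptron_train(X, r, eta=0.1, epochs=10):
    """X: (N, D) inputs with a bias column prepended, r: (N,) desired outputs in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, r_t in zip(X, r):
            y_t = 1.0 if w @ x_t > 0 else 0.0  # actual output of the perceptron
            w += eta * (r_t - y_t) * x_t       # LearningFactor * (Desired - Actual) * Input
    return w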

7
Q

How are multilayer perceptrons for regression defined?

A

y_i = v_i^T z = \sum_{h=1}^H v_{ih} z_h + v_{i0}
where the z_h are adaptive basis functions (the hidden units), defined as
z_h = σ(w_h^T x),
with σ the logistic sigmoid.

8
Q

What is the likelihood function + error function for multilayer perceptrons for two class classification?

A

Bernoulli distribution.
The error function is, as usual, the negative logarithm of the likelihood, which gives the cross-entropy error function; after differentiation, the resulting update equations turn out to have the same form as those for regression.
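
Spelled out (a short sketch in the notation of card 9, with r^t the binary target and y^t the sigmoid output):

L(W, v | X) = \prod_t (y^t)^{r^t} (1 - y^t)^{1 - r^t}
E(W, v | X) = -\log L = -\sum_t [r^t \log y^t + (1 - r^t) \log(1 - y^t)]

Since y^t = σ(a^t), the chain rule gives dE/da^t = y^t - r^t, the same factor that appears in the squared-error gradient with a linear output, which is why the update equations coincide.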

9
Q

What is the error function and gradients for multiclass logistic regression multilayer perceptrons?

A

E(W, v | X) = - \sum_t \sum_i r_i^t \log y_i^t
Δv_{ih} = η \sum_t (r_i^t - y_i^t) z_h^t
Δw_{hj} = η \sum_t [\sum_i (r_i^t - y_i^t) v_{ih}] z_h^t (1 - z_h^t) x_j^t
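
A minimal NumPy sketch of these update rules as one batch gradient step (softmax outputs y_i^t, sigmoid hidden units z_h^t; the array shapes and bias-column convention are my own assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))  # shifted for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def train_step(X, R, W, V, eta):
    """One batch update. X: (N, D+1) inputs with bias column x_0 = 1,
    R: (N, K) one-hot targets, W: (H, D+1) hidden weights, V: (K, H+1) output weights."""
    Z = sigmoid(X @ W.T)                        # hidden units z_h^t, shape (N, H)
    Z1 = np.hstack([np.ones((len(X), 1)), Z])   # prepend z_0 = 1 for the output bias
    Y = softmax(Z1 @ V.T)                       # outputs y_i^t, shape (N, K)
    dV = eta * (R - Y).T @ Z1                   # Δv_{ih} = η Σ_t (r_i^t - y_i^t) z_h^t
    back = (R - Y) @ V[:, 1:]                   # Σ_i (r_i^t - y_i^t) v_{ih}, shape (N, H)
    dW = eta * (back * Z * (1 - Z)).T @ X       # Δw_{hj} per the formula above
    return W + dW, V + dV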

10
Q

What is overfitting in the context of multilayer perceptrons?

A

Multilayer perceptrons with a large number of layers, neurons, or connections have high capacity. This allows them to learn complex functions, but also makes them more susceptible to overfitting.
Plotting the training and validation loss/accuracy as a function of the number of hidden units (or of training epochs) can reveal overfitting: if the training error continues to decrease while the validation error starts to increase, that is a strong indicator of overfitting.
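
One common way to act on that signal is early stopping; a minimal sketch (train_epoch and val_loss are hypothetical callables supplied by the caller):

def fit_with_early_stopping(train_epoch, val_loss, max_epochs=100, patience=5):
    """train_epoch() runs one pass over the training data; val_loss() returns
    the current validation loss. Stops once validation stops improving."""
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        loss = val_loss()
        if loss < best:
            best, wait = loss, 0   # validation still improving: keep going
        else:
            wait += 1              # training error may still be falling...
            if wait >= patience:
                break              # ...but validation rose: likely overfitting
    return best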

11
Q

What are the central problems when designing a neural network?

A

Finding a suitable architecture.
Finding the corresponding weights of the network: backpropagation remains the most effective brute-force training algorithm to date.

The general approach involves:
* Trial and error.
* Fixed architecture during the learning process.
* Parameters are trained by gradient-based algorithms, which are liable to get stuck in local minima.

12
Q

Compare support vector machines and neural networks

A

Support vector machines:

Purpose: Primarily classification and regression
Model Complexity: Simpler structure with fewer hyperparameters
Handling Non-linear Data: Uses kernel functions (e.g., RBF, polynomial)
Scalability: Not scalable (computationally expensive for large datasets)
Overfitting: Less prone to overfitting (due to margin maximization)
Interpretability: Easier to interpret (support vectors define decision boundary)
Training Time: Generally faster for smaller datasets
Input Data Requirements: Typically requires feature scaling
Multi-class Support: Requires one-vs-rest or one-vs-one strategies
Performance on Large Data: May struggle due to computational limitations

Neural network:

Purpose: Classification, regression, and unsupervised tasks
Model Complexity: Flexible and highly customizable with many hyperparameters
Handling Non-linear Data: Learns non-linear decision boundaries through layers and activations
Scalability: Scales well with large datasets using techniques like mini-batch training
Overfitting: More prone to overfitting (requires proper regularization)
Interpretability: Harder to interpret (considered a "black-box" model)
Training Time: Slower for small datasets (but efficient with large datasets and GPUs)
Input Data Requirements: Can handle raw input data (but benefits from normalization)
Multi-class Support: Natively supports multi-class classification
Performance on Large Data: Well-suited for large datasets, especially deep networks
13
Q

Design an MLP neural network for MNIST digit classification

A

Input Layer:
The MNIST dataset contains 28x28 pixel images, which means each image is represented as a 28x28 matrix of pixel values. We flatten this matrix into a vector of size 784 (28 * 28 = 784).

Hidden Layers:
A typical MLP for MNIST has one or two hidden layers with ReLU activation functions.
A good starting point could be:
First hidden layer: 512 neurons (a size commonly used in reference implementations).

Output Layer:
The output layer will have 10 neurons, each corresponding to a digit (0-9), with a softmax activation function to produce probability scores for each digit.

Activation Functions:
ReLU (Rectified Linear Unit) is commonly used for hidden layers because it helps the network to learn non-linear relationships.
Softmax is used in the output layer to generate probability values that sum up to 1, representing the likelihood of each class (digit).

Loss Function:
Since this is a classification problem, we use categorical cross-entropy as the loss function.
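
A minimal PyTorch sketch of this design (one hidden layer of 512 units; note that nn.CrossEntropyLoss applies softmax internally, so the model itself outputs raw logits):

import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),             # 28x28 image -> 784-dimensional vector
    nn.Linear(784, 512),      # first hidden layer: 512 neurons
    nn.ReLU(),                # non-linear activation for the hidden layer
    nn.Linear(512, 10),       # output layer: 10 logits, one per digit 0-9
)
loss_fn = nn.CrossEntropyLoss()  # categorical cross-entropy (softmax built in)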
