lecture 5 - neural networks Flashcards

1
Q

What is the curse of dimensionality in the context of linear models?

A

As the number of features (or dimensions) increases, the amount of data needed to properly train a linear model grows exponentially, making data sparse in high-dimensional spaces.

2
Q

Why do linear models struggle in high-dimensional spaces?

A

They depend on predefined basis functions to represent the data, but in high-dimensional spaces, the number of basis functions explodes, making the model computationally expensive and prone to overfitting.

3
Q

What are the main solutions to the curse of dimensionality for linear models?

A
  1. Support Vector Machines (SVMs)
  2. Neural Networks (NNs)
4
Q

How do Support Vector Machines (SVMs) help address the curse of dimensionality?

A

SVMs define basis functions centered on relevant training points (support vectors), simplifying the model by using only relevant points to construct the decision boundary.

5
Q

How do Neural Networks (NNs) help address the curse of dimensionality?

A

NNs use basis functions (hidden layers) that are learned and adapted during training, allowing the model to capture complex relationships flexibly.

6
Q

How are the basis functions in Neural Networks structured and optimized?

A

The structure of basis functions (e.g., number of neurons and activation functions) is fixed, but their weights are adaptive and optimized during training to discover useful patterns in the data.

7
Q

What is the goal of a neural network?

A

The goal of a neural network is to build a more powerful representation of the data by applying multiple functional transformations in sequence.

8
Q

What is the first step in constructing a neural network?

A

Start with a linear model that combines input features using weights and biases to produce a linear output.

9
Q

How are new features (hidden units) created in a neural network?

A

New features are created by constructing linear combinations of the input features, followed by applying an activation function to introduce nonlinearity.
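
A minimal NumPy sketch of this step (the layer sizes, the tanh choice, and the random data are assumptions for illustration, not values from the lecture):

```python
import numpy as np

def hidden_layer(x, W, b, activation=np.tanh):
    """New features: a linear combination of the inputs followed by a nonlinearity."""
    z = W @ x + b          # linear combination of the input features
    return activation(z)   # activation function introduces the nonlinearity

# illustrative example: 3 input features -> 4 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=3)        # one input vector
W = rng.normal(size=(4, 3))   # weights: one row per hidden unit
b = np.zeros(4)               # biases
h = hidden_layer(x, W, b)     # 4 new (learned) features
```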

10
Q

Why are nonlinear activation functions applied in neural networks?

A

Nonlinear activation functions (e.g., ReLU or sigmoid) are applied to introduce nonlinearity, enabling the network to model complex relationships.

11
Q

How are additional layers added in a neural network?

A

After the first layer, the outputs of the hidden units are combined and passed through additional layers, each applying new weights and activation functions.

12
Q

How is the final output of a neural network computed?

A

The final output is computed by applying another activation function (e.g., sigmoid or softmax) to the output of the last layer.

13
Q

What is the purpose of using multiple layers in a neural network?

A

Multiple layers enable the network to learn hierarchical representations, with each layer capturing increasingly complex patterns in the data.

14
Q

Which activation function is typically used for regression in the output layer of a neural network?

A

The identity function is used for regression tasks, as it outputs continuous real-valued numbers.

15
Q

Which activation function should be used for binary classification in the output layer?

A

The sigmoid function is used for binary classification, as it maps outputs to probabilities between 0 and 1.

16
Q

Which activation function is suitable for multi-class classification in the output layer?

A

The softmax function is used for multi-class classification, as it converts logits into probabilities across multiple classes.

17
Q

What happens when a linear function is used as the activation function in hidden layers?

A

It results in a linear model, which cannot handle complex relationships in the data, limiting the network’s capability.

18
Q

What is the role of the sigmoid function in hidden layers?

A

The sigmoid function smoothly squashes inputs to a range between 0 and 1, introducing nonlinearity.

19
Q

Why is the tanh function commonly used in neural networks?

A

The tanh function squashes inputs between -1 and 1, is centered around zero, and was often used in early neural networks for better gradient flow than the sigmoid function.

20
Q

What are the advantages of using ReLU as an activation function?

A

ReLU outputs zero for negative inputs and the input itself for positive inputs. It is computationally efficient and helps avoid the vanishing gradient problem.

21
Q

What is the difference between ReLU and Leaky ReLU?

A

Leaky ReLU allows small negative values for negative inputs, preventing dead neurons and improving gradient flow.

22
Q

output layer activation functions

A
  1. identity (regression)
  2. sigmoid (binary classification)
  3. softmax (multi-class classification)
23
Q

hidden layer activation functions

A
  1. linear function (limited)

nonlinear functions

  1. sigmoid
  2. tanh
  3. relu
  4. leaky relu
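
The activation functions on these two cards can be written in a few lines of NumPy; this is only an illustrative sketch (the 0.01 slope for leaky ReLU is a common but assumed default):

```python
import numpy as np

def identity(z):                 # regression output
    return z

def sigmoid(z):                  # squashes to (0, 1); binary classification output
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                     # squashes to (-1, 1), zero-centred
    return np.tanh(z)

def relu(z):                     # 0 for negative inputs, identity for positive
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):   # small slope for negative inputs
    return np.where(z > 0, z, alpha * z)

def softmax(z):                  # converts logits to class probabilities
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()
```
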
24
Q

What does it mean when we say that neural networks are universal function approximators?

A

It means that, with enough hidden units and layers, a neural network can approximate any (continuous) function to arbitrary accuracy.

25
Q

What types of functions can neural networks approximate?

A
  1. quadratic functions
  2. sine waves
  3. absolute value functions
  4. step functions.
26
Q

What kind of shape does a neural network approximate for a quadratic function?

A

A neural network approximates a parabolic shape for a quadratic function.

27
Q

How does a neural network approximate a sine wave function?

A

It approximates the sine wave by learning the oscillatory pattern of the function.

28
Q

What is the behavior of a neural network when approximating an absolute value function?

A

The network approximates a V-shaped function when approximating an absolute value function.

29
Q

How does a neural network handle step functions?

A

Neural networks approximate step functions by learning discrete jumps between values, simulating thresholding behavior.

30
Q

Why might a simple linear decision boundary not work well for some datasets?

A

A simple linear decision boundary may not work well when the data is not linearly separable, requiring a more complex decision-making model like a neural network.

31
Q

What simple functions do neural networks use as building blocks?

A

Neural networks use simple functions like AND, OR, and other logical gates as building blocks to model complex decision boundaries.

32
Q

How does a neural network solve the XOR problem?

A

By combining an OR gate and an inverted AND (NAND) gate in the hidden layer, and then applying an AND gate to their outputs, a neural network can perfectly separate the points into two classes, solving the XOR problem.
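
A hedged sketch of this construction with threshold units (the specific weights and thresholds are one possible choice, assumed for illustration):

```python
def step(z):                       # threshold activation: 1 if z >= 0, else 0
    return int(z >= 0)

def xor_net(x1, x2):
    # hidden layer: an OR unit and an inverted-AND (NAND) unit
    or_unit = step(x1 + x2 - 0.5)         # fires if x1 OR x2
    nand_unit = step(-x1 - x2 + 1.5)      # fires unless both x1 AND x2
    # output layer: AND of the two hidden units
    return step(or_unit + nand_unit - 1.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # reproduces XOR: 0, 1, 1, 0
```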

33
Q

What is the significance of solving the XOR problem in neural networks?

A

Solving the XOR problem demonstrates that neural networks can model complex, non-linear relationships, unlike simple linear models.

34
Q

What was the contribution of McCulloch and Pitts (1943) to neural networks?

A

They developed the first mathematical model of biological neurons, showing that Boolean operations could be implemented by neuron-like nodes, which became the origin of automaton theory.

35
Q

What is Hebb’s rule (1949) in the context of neural networks?

A

Hebb’s rule states that the connection strength between two neurons increases when both neurons are activated simultaneously.

36
Q

What was Rosenblatt’s contribution to neural networks in 1958?

A

Rosenblatt introduced the perceptron, a network of threshold nodes for pattern classification, and formulated the perceptron convergence theorem.

37
Q

How did neural networks experience renewed interest in the 1980s and 1990s?

A

New techniques like backpropagation, physics-inspired models (Hopfield net, Boltzmann machines), and unsupervised learning led to impressive applications, such as character recognition and speech recognition.

38
Q

What modern techniques have driven the “revolution” in neural networks since the 1990s?

A

Modern techniques include

  1. support vector machines
  2. kernel methods
  3. ensemble methods (bagging, boosting, stacking)
  4. deep learning
  5. transparency methods like LIME and SHAP.
39
Q

What are weight-space symmetries in neural networks?

A

Weight-space symmetries are distinct weight configurations that produce exactly the same network mapping; they arise from the nature of the activation functions and the architecture of the network.

40
Q

How does symmetry in the activation function create redundancy in the weight space?

A

Symmetric activation functions like tanh produce the same output when the input sign is flipped, allowing multiple weight configurations (e.g., positive and negative signs) to result in the same error during training.

41
Q

What is the impact of symmetry in hidden units on a neural network?

A

Hidden units can be reshuffled without changing the output of the network, creating permutational symmetries.

42
Q

Why do weight-space symmetries pose a problem during optimization?

A

Weight-space symmetries create a complex optimization landscape with many equivalent local minima, making it harder to find the global minimum during training.

43
Q

What is the total number of weight-space symmetries for a network with M hidden units?

A
  • M! · 2^M
  • M! = number of ways the M hidden units can be reshuffled
  • 2^M = accounts for flipping the sign of the weights for each hidden unit
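
For example (an assumed illustrative size), a network with M = 3 hidden units already has 3! · 2^3 = 48 equivalent weight configurations:

```python
from math import factorial

M = 3                          # assumed number of hidden units
print(factorial(M) * 2 ** M)   # 48 equivalent weight configurations
```
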
44
Q

What are the three main cases for choosing weights to minimize the error function?

A
  1. Regression: Linear outputs and sum-of-squares error (SSE).
  2. Binary classification: Logistic sigmoid and cross-entropy error.
  3. Multiclass classification: Softmax outputs and multiclass cross-entropy error.
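
A minimal NumPy sketch of these three error functions (the clipping epsilon is an assumed numerical safeguard, not part of the lecture):

```python
import numpy as np

def sse(y, t):                                   # regression: sum-of-squares error
    return 0.5 * np.sum((y - t) ** 2)

def binary_cross_entropy(y, t, eps=1e-12):       # sigmoid outputs, 0/1 targets
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def multiclass_cross_entropy(y, t, eps=1e-12):   # softmax outputs, one-hot targets
    return -np.sum(t * np.log(np.clip(y, eps, None)))
```
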
45
Q

What is the output type and error function used for regression?

A
Linear outputs and sum-of-squares error (SSE). The error function is derived as follows:

  1. Model the output using a Gaussian distribution.
  2. Define the likelihood function.
  3. Take the negative logarithm of the likelihood to get the error function.
  4. Apply (stochastic) gradient descent to minimize the error function.
46
Q

How can error functions based on maximum likelihood (ML) be written?

A

Since the data points are independent, the total error E(w) can be expressed as the sum of the individual errors E_n(w) for each data point.

47
Q

What is stochastic gradient descent (SGD)?

A

SGD is a version of gradient descent that updates weights using only one data point at a time, rather than the entire dataset.

48
Q

How is the gradient of the error function with respect to weights calculated in a simple linear model?

A
  • The gradient is computed by multiplying the difference between the predicted output and the target output by the corresponding input feature.
  • dE_n/dw_{ji} = (y_{nj} - t_{nj}) x_{ni}
  • dE_n/dw_{ji} = δ_j * a_i
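
Combining this gradient with SGD from the previous card, one update step for a linear model might look like this sketch (the learning rate and random data are assumptions):

```python
import numpy as np

def sgd_step(W, x_n, t_n, lr=0.1):
    """One SGD update on a single data point for a linear model y = W @ x."""
    y_n = W @ x_n                 # forward pass
    delta = y_n - t_n             # delta_j = y_nj - t_nj
    grad = np.outer(delta, x_n)   # dE_n/dw_ji = delta_j * x_ni
    return W - lr * grad          # gradient descent step

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))                        # 2 outputs, 3 input features
x_n, t_n = rng.normal(size=3), rng.normal(size=2)  # one training point
W = sgd_step(W, x_n, t_n)
```
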
49
Q

How do the concepts for training a simple linear model generalize to multiple layers in a neural network?

A

The error gradient is propagated backward through the layers, with each layer computing the gradient with respect to its weights using the chain rule.

50
Q

derivative of tanh function

A

d/dz tanh(z) = 1 - tanh^2(z)
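
A quick numerical check of this identity (the evaluation point and step size are arbitrary assumptions):

```python
import numpy as np

z = 0.7
analytic = 1 - np.tanh(z) ** 2                            # 1 - tanh^2(z)
numeric = (np.tanh(z + 1e-6) - np.tanh(z - 1e-6)) / 2e-6  # central difference
print(abs(analytic - numeric) < 1e-6)                     # True
```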

51
Q

What is the goal of the forward pass in backpropagation?

A

The goal is to compute activations (outputs) for each layer by applying an input vector and propagating it forward through the network.

52
Q

How is the error calculated during backpropagation?

A
  • The error is calculated by evaluating the difference between the predicted output (y_k) and the target output (t_k) for each output unit, resulting in an error term (δ_k).
  • δ_k = (y_k - t_k)
53
Q

What is the purpose of backpropagating the error?

A

The purpose is to propagate the error backward through the network using the errors from the next layer and the weights connecting them to update the network’s weights accordingly.

54
Q

How is the error term for a hidden unit (δ_j) obtained during backpropagation?

A
  • The product of the derivative of the activation function and the sum of the weighted errors from the next layer.
  • δ_j = h’(z_j) * SUM(w_kj * δ_k)
55
Q

How is the derivative with respect to the first set of weights calculated during backpropagation?

A

The derivative is calculated by multiplying the error term (δ_j) by the activation from the previous layer (a_i): dE_n/dw_{ji} = δ_j * a_i.
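
The forward pass, output error, backpropagated error, and weight derivatives described on the last few cards fit into one short NumPy sketch (the layer sizes, tanh hidden activation, linear outputs, and random data are illustrative assumptions):

```python
import numpy as np

def backprop(x, t, W1, W2):
    """One forward + backward pass for a single-hidden-layer network."""
    # forward pass: compute activations layer by layer
    z = W1 @ x                        # pre-activations of the hidden units
    h = np.tanh(z)                    # hidden activations a_j
    y = W2 @ h                        # linear outputs y_k

    # output error: delta_k = y_k - t_k
    delta_k = y - t
    # backpropagate: delta_j = h'(z_j) * sum_k w_kj * delta_k
    delta_j = (1 - np.tanh(z) ** 2) * (W2.T @ delta_k)

    # derivatives: error term times activation of the previous layer
    grad_W2 = np.outer(delta_k, h)    # dE_n/dw_kj = delta_k * a_j
    grad_W1 = np.outer(delta_j, x)    # dE_n/dw_ji = delta_j * a_i (here a_i = x_i)
    return grad_W1, grad_W2

# illustrative shapes: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
x, t = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
grad_W1, grad_W2 = backprop(x, t, W1, W2)
```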

56
Q

What does the Jacobian matrix represent in neural networks?

A
  • The Jacobian matrix represents the partial derivatives of each output with respect to each input,
  • This helps in sensitivity analysis of the network’s outputs to its inputs.
57
Q

What does the Hessian matrix represent in neural networks?

A
  • The Hessian matrix represents the second-order partial derivatives of the error with respect to the weights, indicating the curvature of the error surface.
58
Q

What are some applications of the Hessian matrix in neural networks?

A
  1. Non-linear optimization techniques.
  2. Re-training feed-forward networks.
  3. Pruning by removing less significant weights.
  4. Laplace approximation for Bayesian networks.
59
Q

What happens when the number of hidden units (M) increases in a neural network?

A

Increasing the number of hidden units increases the flexibility of the model, reducing bias but increasing variance, leading to a higher risk of overfitting.

60
Q

How many weights are there in a neural network with one hidden layer?

A

The total number of weights is (D+1)M+(M+1)K

  • D is the number of input features.
  • M is the number of hidden units.
  • K is the number of output nodes.
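
For example (sizes assumed purely for illustration), with D = 4 inputs, M = 5 hidden units and K = 3 outputs:

```python
D, M, K = 4, 5, 3                       # assumed network sizes
n_weights = (D + 1) * M + (M + 1) * K   # (D+1)M + (M+1)K; the +1 terms are biases
print(n_weights)                        # 43
```
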
61
Q

Why does increasing the number of layers and weights in a neural network lead to overfitting?

A

Increasing the number of layers and weights rapidly increases the number of parameters, making the model more prone to overfitting on the training data.

62
Q

Why is regularization necessary in neural networks?

A

Regularization is necessary because overfitting can occur early due to the large number of parameters, especially when the model has many hidden units or layers.

63
Q

How can the number of hidden nodes be evaluated using validation and random restarts?

A

Train networks with different numbers of hidden units, each from several random weight initializations, plot their performance, and select the configuration with the smallest generalization error on the validation set.

64
Q

Why is it important to use multiple random starts when evaluating a neural network?

A

Multiple random starts ensure that the performance of the network is not dependent on a specific random initialization, leading to more reliable results.

65
Q

What does low variation in the outcomes indicate when evaluating hidden units?

A

Low variation indicates high bias and low variance, meaning the model may underfit the data.

66
Q

What does high variation in the outcomes indicate when evaluating hidden units?

A

High variation indicates low bias and high variance, meaning the model may overfit the data.

67
Q

Why is it important for reproducible results to use a specific random seed?

A

Using a specific random seed ensures that the algorithm’s performance can be reproduced consistently, reducing variability due to random initialization.

68
Q

What is weight decay regularization in neural networks?

A
  • Weight decay regularization modifies the error function by adding a penalty term for large weights
  • Higher λ implies stronger penalization of large weights.
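
A minimal sketch of weight decay added to an error function and its gradient (the quadratic penalty (λ/2)·||w||^2 is the standard form; the function names are assumptions):

```python
import numpy as np

def regularized_error(E_data, w, lam):
    """E~(w) = E(w) + (lambda / 2) * ||w||^2 : larger lambda penalizes large weights more."""
    return E_data + 0.5 * lam * np.sum(w ** 2)

def regularized_gradient(grad_data, w, lam):
    """The penalty adds lambda * w to the data gradient, 'decaying' the weights each step."""
    return grad_data + lam * w
```
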
69
Q

What are the main shortcomings of weight decay regularization?

A
  • It is not invariant to scaling and translations.
  • When input data is scaled (e.g., multiplied by a constant) or shifted (e.g., adding a bias), the impact of weight decay changes unpredictably.
  • Simple weight decay applies the same regularization factor to all weights, which doesn’t account for the different effects that transformations have on input and output, leading to biased results.
70
Q

What is the new regularization term introduced for neural networks?

A
  • The new regularization term involves regularizing different weight groups separately
  • W_1 (e.g., input weights) and W_2 (e.g., output weights) are regularized independently with coefficients λ_1 and λ_2 to account for differences in their roles across layers.
71
Q

Why is the new regularization prior considered improper?

A
  • Simple weight decay corresponds to a Gaussian prior over the weights.
  • The grouped regularizer corresponds to a prior that cannot be normalized (an improper prior), which leads to difficulties in:
  1. Selecting the regularization coefficients α_1 and α_2.
  2. Comparing different models effectively.
  3. Requiring separate priors for the bias weights.
72
Q

What are alternatives to regularization in neural networks?

A
  1. Early stopping
  2. Dropout
73
Q

How does early stopping work?

A

While training, the training error decreases continuously, but the validation error decreases initially and then increases when overfitting begins. Early stopping halts training at the point where validation error is minimized, ensuring good generalization.
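
A sketch of such a training loop (the patience rule and the train_epoch / validation_error / get_weights helpers are assumed placeholders, not lecture specifics):

```python
def train_with_early_stopping(model, train_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_weights, stalled = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch(model)                  # one pass over the training data
        err = validation_error(model)       # error on the held-out validation set
        if err < best_err:
            best_err, best_weights, stalled = err, model.get_weights(), 0
        else:
            stalled += 1
            if stalled >= patience:         # validation error keeps increasing
                break
    model.set_weights(best_weights)         # restore the minimum-validation-error point
    return model
```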

74
Q

How does dropout work?

A

Dropout randomly deactivates neurons during training to prevent overfitting by reducing co-adaptation between neurons.
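
A sketch of (inverted) dropout applied to a layer's activations during training (the 0.5 drop probability is an assumed example value):

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=None):
    """Randomly deactivate units in training; scale so the expected activation is unchanged."""
    if not training:
        return h                              # no dropout at test time
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop      # keep each unit with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)          # inverted-dropout scaling
```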

75
Q

What are activation functions in neural networks, and where are they used?

A

Activation functions introduce non-linearity into the model, allowing it to learn complex patterns; they are applied to the hidden units and to the output layer.

76
Q

What is the sigmoid activation function and its output range?

A
  • The sigmoid function squashes the input z into a range between 0 and 1, making it useful for binary classification problems.
  • s-shaped graph
77
Q

What are the pros and cons of using the sigmoid function?

A

pros

  • easy to understand and implement

cons

  • suffers from the vanishing gradient problem for large and small z
  • outputs are not zero-centered, which can lead to slower convergence during training
78
Q

What problem arises when scaling data centered around 0 with the sigmoid function?

A

Since the sigmoid function outputs values between 0 and 1, any negative values become positive after applying the function, losing the centeredness of the data.

79
Q

How does the hyperbolic tangent function differ from the sigmoid function?

A

The hyperbolic tangent function has an output range of [−1,1], making it zero-centered and better for optimization compared to the sigmoid.

80
Q

What are the pros and cons of the hyperbolic tangent function?

A

pros

  • outputs a wider range ([-1,1])
  • zero-centred, aiding in gradient-based optimization

cons

  • Suffers from the vanishing gradient problem for large or small inputs.
81
Q

What is an advantage of using the hyperbolic tangent function over the sigmoid function?

A

It better retains the centeredness and scale of the data, improving optimization during training.

82
Q

What is the drawback of the hyperbolic tangent function?

A

Despite its advantages, it involves a costly computation, requiring four exponentials.

83
Q

What is the ReLU activation function, and how does it behave?

A
  • The ReLU function outputs z if z>0 and 0 otherwise.
  • It introduces sparsity in the network by setting many neurons to zero.
84
Q

What are the pros and cons of using ReLU?

A

pros

  • computationally efficient
  • helps mitigate the vanishing gradient problem

cons

  • can suffer from the dying ReLU problem, where neurons output 0 for all inputs and stop updating during training

85
Q

How does the leaky ReLU function differ from the standard ReLU?

A

Unlike standard ReLU, leaky ReLU allows a small slope (e.g., 0.01) for z<0 instead of outputting zero, addressing the dying ReLU problem.

86
Q

What are the pros and cons of leaky ReLU?

A
  • pro: Addresses the dying ReLU problem by allowing small gradients for negative inputs.
  • con: Slightly more computationally expensive than standard ReLU.
87
Q

What issue remains with leaky ReLU?

A

It is still non-differentiable at z=0, though this rarely causes issues in practice.

88
Q

Why is initializing weights to zero a problem in neural networks?

A
  • If all weights are initialized to zero (or the same value), every neuron in the same layer will compute the same output during forward propagation, leading to identical gradients during backpropagation.
  • This symmetry prevents the network from learning useful representations, reducing it to an ineffective structure.
89
Q

What is Xavier initialization used for?

A

Xavier initialization is used for networks with sigmoid or tanh activations. It ensures that activations are neither too small nor too large, avoiding the vanishing or exploding gradient problem.

90
Q

What is He initialization, and when is it used?

A

He initialization is preferred for ReLU activations. It ensures that activations are scaled appropriately for ReLU’s non-linear nature, reducing issues like vanishing gradients.
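
A sketch of both schemes in NumPy (these use the commonly cited variance formulas; treat the exact scaling as an assumption rather than the lecture's definition):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility (see the card on random seeds)

def xavier_init(n_in, n_out):
    """Xavier/Glorot: variance 2 / (n_in + n_out); suited to sigmoid/tanh layers."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_init(n_in, n_out):
    """He: variance 2 / n_in; suited to ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
```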

91
Q

What is invariance in the context of neural networks?

A

Invariance refers to the property where a model produces consistent outputs even when the input undergoes certain transformations, such as rotation, scaling, or translation.

92
Q

Why is invariance important in classification problems?

A

Invariance ensures that models can classify inputs correctly regardless of transformations.

Examples:

  1. Handwriting recognition: Digits should have the same classification despite changes in position or size.
  2. Speech recognition: Invariant to non-linear warping that preserves temporal ordering (e.g., speed or accent variations).
93
Q

What are the challenges in handling invariances using large datasets?

A
  1. It requires a large sample set where all possible transformations are present, which is impractical because the dataset size grows exponentially.
  2. Translating, rotating, and scaling every image drastically increases the number of training examples, making it computationally inefficient.
94
Q

What is an alternative to relying on larger datasets for invariance?

A

Instead of relying on large datasets, models can be designed to inherently exhibit the required invariances through appropriate design.

95
Q

four approaches to handling invariance

A
  1. augment the training set
  2. regularization-based invariance
  3. preprocessing to extract invariant features
  4. architectural design
96
Q

explain augmenting the training set to handle invariance

A

Transform the training data by applying variations (e.g., rotating, scaling, translating) to create examples with the desired invariances.

  • Example: Rotating an image of a digit “5” slightly should still result in the digit being classified as “5”.
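
A minimal sketch of augmenting an image dataset with small rotations and shifts (the angle, shift amount, and image sizes are assumed example values):

```python
import numpy as np
from scipy.ndimage import rotate

def augment(images, angle=10, shift=2):
    """Return the original images plus slightly rotated and translated copies."""
    rotated = np.stack([rotate(img, angle, reshape=False) for img in images])
    shifted = np.roll(images, shift=shift, axis=2)   # translate a few pixels to the right
    return np.concatenate([images, rotated, shifted], axis=0)

# illustrative usage: 10 grayscale 28x28 images -> 30 after augmentation
images = np.zeros((10, 28, 28))
print(augment(images).shape)   # (30, 28, 28)
```
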
97
Q

How does regularization-based invariance work?

A

Regularization terms are added to the loss function to penalize changes in the model’s output when inputs are transformed.

  • Example: If a small rotation in input causes a drastic change in output, the penalty ensures the model learns to reduce sensitivity to such changes.
98
Q

How can preprocessing be used to handle invariance?

A

Preprocessing can extract invariant features before feeding the data into the model.

  • Example: For images, preprocessing may include converting to grayscale, normalizing intensity, or using handcrafted features (e.g., edge detection).
99
Q

How does architectural design help achieve invariance?

A

Neural networks can be designed with built-in invariance properties.

  1. Convolutional Neural Networks (CNNs): Convolutional layers detect patterns regardless of position, making them inherently invariant to translation.
  2. Pooling layers: Enhance translation invariance by summarizing spatial information.
100
Q

hyperbolic tangent equation

A

tanh(z) = (e^z - e^{-z}) / (e^z + e^{-z})