lecture 5 - neural networks Flashcards

1
Q

What is the curse of dimensionality in the context of linear models?

A

As the number of features (or dimensions) increases, the amount of data needed to properly train a linear model grows exponentially, making data sparse in high-dimensional spaces.

2
Q

Why do linear models struggle in high-dimensional spaces?

A

They depend on predefined basis functions to represent the data, but in high-dimensional spaces, the number of basis functions explodes, making the model computationally expensive and prone to overfitting.

3
Q

What are the main solutions to the curse of dimensionality for linear models?

A
  1. Support Vector Machines (SVMs)
  2. Neural Networks (NNs)
4
Q

How do Support Vector Machines (SVMs) help address the curse of dimensionality?

A

SVMs define basis functions centered on relevant training points (support vectors), simplifying the model by using only relevant points to construct the decision boundary.

5
Q

How do Neural Networks (NNs) help address the curse of dimensionality?

A

NNs use basis functions (hidden layers) that are learned and adapted during training, allowing the model to capture complex relationships flexibly.

6
Q

How are the basis functions in Neural Networks structured and optimized?

A

The structure of basis functions (e.g., number of neurons and activation functions) is fixed, but their weights are adaptive and optimized during training to discover useful patterns in the data.

7
Q

What is the goal of a neural network?

A

The goal of a neural network is to build a more powerful representation of the data by applying multiple functional transformations in sequence.

8
Q

What is the first step in constructing a neural network?

A

Start with a linear model that combines input features using weights and biases to produce a linear output.

9
Q

How are new features (hidden units) created in a neural network?

A

New features are created by constructing linear combinations of the input features, followed by applying an activation function to introduce nonlinearity.
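
A minimal NumPy sketch of this step (the layer sizes, the tanh choice, and the random data are assumptions for illustration, not values from the lecture):

```python
import numpy as np

def hidden_layer(x, W, b, activation=np.tanh):
    """New features: a linear combination of the inputs followed by a nonlinearity."""
    z = W @ x + b          # linear combination of the input features
    return activation(z)   # activation function introduces the nonlinearity

# illustrative example: 3 input features -> 4 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=3)        # one input vector
W = rng.normal(size=(4, 3))   # weights: one row per hidden unit
b = np.zeros(4)               # biases
h = hidden_layer(x, W, b)     # 4 new (learned) features
```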

10
Q

Why are nonlinear activation functions applied in neural networks?

A

Nonlinear activation functions (e.g., ReLU or sigmoid) are applied to introduce nonlinearity, enabling the network to model complex relationships.

11
Q

How are additional layers added in a neural network?

A

After the first layer, the outputs of the hidden units are combined and passed through additional layers, each applying new weights and activation functions.

12
Q

How is the final output of a neural network computed?

A

The final output is computed by applying another activation function (e.g., sigmoid or softmax) to the output of the last layer.

13
Q

What is the purpose of using multiple layers in a neural network?

A

Multiple layers enable the network to learn hierarchical representations, with each layer capturing increasingly complex patterns in the data.

14
Q

Which activation function is typically used for regression in the output layer of a neural network?

A

The identity function is used for regression tasks, as it outputs continuous real-valued numbers.

15
Q

Which activation function should be used for binary classification in the output layer?

A

The sigmoid function is used for binary classification, as it maps outputs to probabilities between 0 and 1.

16
Q

Which activation function is suitable for multi-class classification in the output layer?

A

The softmax function is used for multi-class classification, as it converts logits into probabilities across multiple classes.

17
Q

What happens when a linear function is used as the activation function in hidden layers?

A

It results in a linear model, which cannot handle complex relationships in the data, limiting the network’s capability.

18
Q

What is the role of the sigmoid function in hidden layers?

A

The sigmoid function smoothly squashes inputs to a range between 0 and 1, introducing nonlinearity.

19
Q

Why is the tanh function commonly used in neural networks?

A

The tanh function squashes inputs between -1 and 1, is centered around zero, and was often used in early neural networks for better gradient flow than the sigmoid function.

20
Q

What are the advantages of using ReLU as an activation function?

A

ReLU outputs zero for negative inputs and the input itself for positive inputs. It is computationally efficient and helps avoid the vanishing gradient problem.

21
Q

What is the difference between ReLU and Leaky ReLU?

A

Leaky ReLU allows small negative values for negative inputs, preventing dead neurons and improving gradient flow.

22
Q

output layer activation functions

A
  1. identity (regression)
  2. sigmoid (binary classification)
  3. softmax (multi-class classification)
23
Q

hidden layer activation functions

A
  1. linear function (limited)

nonlinear functions

  1. sigmoid
  2. tanh
  3. relu
  4. leaky relu
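
The activation functions on these two cards can be written in a few lines of NumPy; this is only an illustrative sketch (the 0.01 slope for leaky ReLU is a common but assumed default):

```python
import numpy as np

def identity(z):                 # regression output
    return z

def sigmoid(z):                  # squashes to (0, 1); binary classification output
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                     # squashes to (-1, 1), zero-centred
    return np.tanh(z)

def relu(z):                     # 0 for negative inputs, identity for positive
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):   # small slope for negative inputs
    return np.where(z > 0, z, alpha * z)

def softmax(z):                  # converts logits to class probabilities
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()
```
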
24
Q

What does it mean when we say that neural networks are universal function approximators?

A

It means that, with enough hidden units and layers, a neural network can approximate any (continuous) function to arbitrary accuracy.

25
Q

What types of functions can neural networks approximate?

A
  1. quadratic functions
  2. sine waves
  3. absolute value functions
  4. step functions.
26
Q

What kind of shape does a neural network approximate for a quadratic function?

A

A neural network approximates a parabolic shape for a quadratic function.

27
Q

How does a neural network approximate a sine wave function?

A

It approximates the sine wave by learning the oscillatory pattern of the function.

28
Q

What is the behavior of a neural network when approximating an absolute value function?

A

The network approximates a V-shaped function when approximating an absolute value function.

29
Q

How does a neural network handle step functions?

A

Neural networks approximate step functions by learning discrete jumps between values, simulating thresholding behavior.

30
Q

Why might a simple linear decision boundary not work well for some datasets?

A

A simple linear decision boundary may not work well when the data is not linearly separable, requiring a more complex decision-making model like a neural network.

31
Q

What simple functions do neural networks use as building blocks?

A

Neural networks use simple functions like AND, OR, and other logical gates as building blocks to model complex decision boundaries.

32
Q

How does a neural network solve the XOR problem?

A

By combining an OR gate and an inverted AND (NAND) gate in the hidden layer, and then applying an AND gate to their outputs, a neural network can perfectly separate the points into two classes, solving the XOR problem.
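
A hedged sketch of this construction with threshold units (the specific weights and thresholds are one possible choice, assumed for illustration):

```python
def step(z):                       # threshold activation: 1 if z >= 0, else 0
    return int(z >= 0)

def xor_net(x1, x2):
    # hidden layer: an OR unit and an inverted-AND (NAND) unit
    or_unit = step(x1 + x2 - 0.5)         # fires if x1 OR x2
    nand_unit = step(-x1 - x2 + 1.5)      # fires unless both x1 AND x2
    # output layer: AND of the two hidden units
    return step(or_unit + nand_unit - 1.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # reproduces XOR: 0, 1, 1, 0
```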

33
Q

What is the significance of solving the XOR problem in neural networks?

A

Solving the XOR problem demonstrates that neural networks can model complex, non-linear relationships, unlike simple linear models.

34
Q

What was the contribution of McCulloch and Pitts (1943) to neural networks?

A

They developed the first mathematical model of biological neurons, showing that Boolean operations could be implemented by neuron-like nodes, which became the origin of automaton theory.

35
Q

What is Hebb’s rule (1949) in the context of neural networks?

A

Hebb’s rule states that the connection strength between two neurons increases when both neurons are activated simultaneously.

36
Q

What was Rosenblatt’s contribution to neural networks in 1958?

A

Rosenblatt introduced the perceptron, a network of threshold nodes for pattern classification, and formulated the perceptron convergence theorem.

37
Q

How did neural networks experience renewed interest in the 1980s and 1990s?

A

New techniques like backpropagation, physics-inspired models (Hopfield net, Boltzmann machines), and unsupervised learning led to impressive applications, such as character recognition and speech recognition.

38
Q

What modern techniques have driven the “revolution” in neural networks since the 1990s?

A

Modern techniques include

  1. support vector machines
  2. kernel methods
  3. ensemble methods (bagging, boosting, stacking)
  4. deep learning
  5. transparency methods like LIME and SHAP.
39
Q

What are weight-space symmetries in neural networks?

A

Weight-space symmetries are distinct weight configurations that produce exactly the same network mapping; they arise from the nature of the activation functions and the architecture of the network.

40
Q

How does symmetry in the activation function create redundancy in the weight space?

A

Symmetric activation functions like tanh produce the same output when the input sign is flipped, allowing multiple weight configurations (e.g., positive and negative signs) to result in the same error during training.

41
Q

What is the impact of symmetry in hidden units on a neural network?

A

Hidden units can be reshuffled without changing the output of the network, creating permutational symmetries.

42
Q

Why do weight-space symmetries pose a problem during optimization?

A

Weight-space symmetries create a complex optimization landscape with many equivalent local minima, making it harder to find the global minimum during training.

43
Q

What is the total number of weight-space symmetries for a network with M hidden units?

A
  • M! · 2^M
  • M! = number of ways the M hidden units can be reshuffled
  • 2^M = accounts for flipping the sign of the weights for each hidden unit
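
For example (an assumed illustrative size), a network with M = 3 hidden units already has 3! · 2^3 = 48 equivalent weight configurations:

```python
from math import factorial

M = 3                          # assumed number of hidden units
print(factorial(M) * 2 ** M)   # 48 equivalent weight configurations
```
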
44
Q

What are the three main cases for choosing weights to minimize the error function?

A
  1. Regression: Linear outputs and sum-of-squares error (SSE).
  2. Binary classification: Logistic sigmoid and cross-entropy error.
  3. Multiclass classification: Softmax outputs and multiclass cross-entropy error.
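
A minimal NumPy sketch of these three error functions (the clipping epsilon is an assumed numerical safeguard, not part of the lecture):

```python
import numpy as np

def sse(y, t):                                   # regression: sum-of-squares error
    return 0.5 * np.sum((y - t) ** 2)

def binary_cross_entropy(y, t, eps=1e-12):       # sigmoid outputs, 0/1 targets
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def multiclass_cross_entropy(y, t, eps=1e-12):   # softmax outputs, one-hot targets
    return -np.sum(t * np.log(np.clip(y, eps, None)))
```
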
45
Q

What is the output type and error function used for regression?

A
Linear outputs and sum-of-squares error (SSE). The error function is derived as follows:

  1. Model the output using a Gaussian distribution.
  2. Define the likelihood function.
  3. Take the negative logarithm of the likelihood to get the error function.
  4. Apply (stochastic) gradient descent to minimize the error function.
46
Q

How can error functions based on maximum likelihood (ML) be written?

A

Since the data points are independent, the total error E(w) can be expressed as the sum of the individual errors E_n(w) for each data point.

47
Q

What is stochastic gradient descent (SGD)?

A

SGD is a version of gradient descent that updates weights using only one data point at a time, rather than the entire dataset.

48
Q

How is the gradient of the error function with respect to weights calculated in a simple linear model?

A
  • The gradient is computed by multiplying the difference between the predicted output and the target output by the corresponding input feature.
  • dE_n/dw_{ji} = (y_{nj} - t_{nj}) x_{ni}
  • dE_n/dw_{ji} = δ_j * a_i
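
Combining this gradient with SGD from the previous card, one update step for a linear model might look like this sketch (the learning rate and random data are assumptions):

```python
import numpy as np

def sgd_step(W, x_n, t_n, lr=0.1):
    """One SGD update on a single data point for a linear model y = W @ x."""
    y_n = W @ x_n                 # forward pass
    delta = y_n - t_n             # delta_j = y_nj - t_nj
    grad = np.outer(delta, x_n)   # dE_n/dw_ji = delta_j * x_ni
    return W - lr * grad          # gradient descent step

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))                        # 2 outputs, 3 input features
x_n, t_n = rng.normal(size=3), rng.normal(size=2)  # one training point
W = sgd_step(W, x_n, t_n)
```
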
49
Q

How do the concepts for training a simple linear model generalize to multiple layers in a neural network?

A

The error gradient is propagated backward through the layers, with each layer computing the gradient with respect to its weights using the chain rule.

50
Q

derivative of tanh function

A

d/dz tanh(z) = 1 - tanh^2(z)
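
A quick numerical check of this identity (the evaluation point and step size are arbitrary assumptions):

```python
import numpy as np

z = 0.7
analytic = 1 - np.tanh(z) ** 2                            # 1 - tanh^2(z)
numeric = (np.tanh(z + 1e-6) - np.tanh(z - 1e-6)) / 2e-6  # central difference
print(abs(analytic - numeric) < 1e-6)                     # True
```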

51
Q

What is the goal of the forward pass in backpropagation?

A

The goal is to compute activations (outputs) for each layer by applying an input vector and propagating it forward through the network.

52
Q

How is the error calculated during backpropagation?

A
  • The error is calculated by evaluating the difference between the predicted output (y_k) and the target output (t_k) for each output unit, resulting in an error term (δ_k).
  • δ_k = (y_k - t_k)
53
Q

What is the purpose of backpropagating the error?

A

The purpose is to propagate the error backward through the network using the errors from the next layer and the weights connecting them to update the network’s weights accordingly.

54
Q

How is the error term for a hidden unit (δ_j) obtained during backpropagation?

A
  • The product of the derivative of the activation function and the sum of the weighted errors from the next layer.
  • δ_j = h’(z_j) * SUM(w_kj * δ_k)
55
Q

How is the derivative with respect to the first set of weights calculated during backpropagation?

A

The derivative is calculated by multiplying the error term (δ_j) by the activation from the previous layer (a_i): dE_n/dw_{ji} = δ_j * a_i.
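
The forward pass, output error, backpropagated error, and weight derivatives described on the last few cards fit into one short NumPy sketch (the layer sizes, tanh hidden activation, linear outputs, and random data are illustrative assumptions):

```python
import numpy as np

def backprop(x, t, W1, W2):
    """One forward + backward pass for a single-hidden-layer network."""
    # forward pass: compute activations layer by layer
    z = W1 @ x                        # pre-activations of the hidden units
    h = np.tanh(z)                    # hidden activations a_j
    y = W2 @ h                        # linear outputs y_k

    # output error: delta_k = y_k - t_k
    delta_k = y - t
    # backpropagate: delta_j = h'(z_j) * sum_k w_kj * delta_k
    delta_j = (1 - np.tanh(z) ** 2) * (W2.T @ delta_k)

    # derivatives: error term times activation of the previous layer
    grad_W2 = np.outer(delta_k, h)    # dE_n/dw_kj = delta_k * a_j
    grad_W1 = np.outer(delta_j, x)    # dE_n/dw_ji = delta_j * a_i (here a_i = x_i)
    return grad_W1, grad_W2

# illustrative shapes: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
x, t = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
grad_W1, grad_W2 = backprop(x, t, W1, W2)
```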

56
Q

What does the Jacobian matrix represent in neural networks?

A
  • The Jacobian matrix represents the partial derivatives of each output with respect to each input,
  • This helps in sensitivity analysis of the network’s outputs to its inputs.
57
Q

What does the Hessian matrix represent in neural networks?

A
  • The Hessian matrix represents the second-order partial derivatives of the error with respect to the weights, indicating the curvature of the error surface.
58
Q

What are some applications of the Hessian matrix in neural networks?

A
  1. Non-linear optimization techniques.
  2. Re-training feed-forward networks.
  3. Pruning by removing less significant weights.
  4. Laplace approximation for Bayesian networks.
59
Q

What happens when the number of hidden units (M) increases in a neural network?

A

Increasing the number of hidden units increases the flexibility of the model, reducing bias but increasing variance, leading to a higher risk of overfitting.

60
Q

How many weights are there in a neural network with one hidden layer?

A

The total number of weights is (D+1)M+(M+1)K

  • D is the number of input features.
  • M is the number of hidden units.
  • K is the number of output nodes.
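
For example (sizes assumed purely for illustration), with D = 4 inputs, M = 5 hidden units and K = 3 outputs:

```python
D, M, K = 4, 5, 3                       # assumed network sizes
n_weights = (D + 1) * M + (M + 1) * K   # (D+1)M + (M+1)K; the +1 terms are biases
print(n_weights)                        # 43
```
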
61
Q

Why does increasing the number of layers and weights in a neural network lead to overfitting?

A

Increasing the number of layers and weights rapidly increases the number of parameters, making the model more prone to overfitting on the training data.

62
Q

Why is regularization necessary in neural networks?

A

Regularization is necessary because overfitting can occur early due to the large number of parameters, especially when the model has many hidden units or layers.

63
Q

How can the number of hidden nodes be evaluated using validation and random restarts?

A

Train networks with different numbers of hidden units, each from several random weight initializations, plot their performance, and select the configuration with the smallest generalization error on the validation set.

64
Q

Why is it important to use multiple random starts when evaluating a neural network?

A

Multiple random starts ensure that the performance of the network is not dependent on a specific random initialization, leading to more reliable results.

65
Q

What does low variation in the outcomes indicate when evaluating hidden units?

A

Low variation indicates high bias and low variance, meaning the model may underfit the data.

66
Q

What does high variation in the outcomes indicate when evaluating hidden units?

A

High variation indicates low bias and high variance, meaning the model may overfit the data.

67
Q

Why is it important for reproducible results to use a specific random seed?

A

Using a specific random seed ensures that the algorithm’s performance can be reproduced consistently, reducing variability due to random initialization.

68
Q

What is weight decay regularization in neural networks?

A
  • Weight decay regularization modifies the error function by adding a penalty term for large weights
  • Higher λ implies stronger penalization of large weights.
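
A minimal sketch of weight decay added to an error function and its gradient (the quadratic penalty (λ/2)·||w||^2 is the standard form; the function names are assumptions):

```python
import numpy as np

def regularized_error(E_data, w, lam):
    """E~(w) = E(w) + (lambda / 2) * ||w||^2 : larger lambda penalizes large weights more."""
    return E_data + 0.5 * lam * np.sum(w ** 2)

def regularized_gradient(grad_data, w, lam):
    """The penalty adds lambda * w to the data gradient, 'decaying' the weights each step."""
    return grad_data + lam * w
```
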
69
Q

What are the main shortcomings of weight decay regularization?

A
  • It is not invariant to scaling and translations.
  • When input data is scaled (e.g., multiplied by a constant) or shifted (e.g., adding a bias), the impact of weight decay changes unpredictably.
  • Simple weight decay applies the same regularization factor to all weights, which doesn’t account for the different effects that transformations have on input and output, leading to biased results.
70
Q

What is the new regularization term introduced for neural networks?

A
  • The new regularization term involves regularizing different weight groups separately
  • W_1 (e.g., input weights) and W_2 (e.g., output weights) are regularized independently with coefficients λ_1 and λ_2 to account for differences in their roles across layers.
71
Q

Why is the new regularization prior considered improper?

A
  • Simple weight decay corresponds to a Gaussian prior over the weights.
  • The grouped regularizer corresponds to a prior that cannot be normalized (an improper prior), which leads to difficulties in:
  1. Selecting the regularization coefficients α_1 and α_2.
  2. Comparing different models effectively.
  3. Requiring separate priors for the bias weights.
72
Q

What are alternatives to regularization in neural networks?

A
  1. Early stopping
  2. Dropout
73
Q

How does early stopping work?

A

While training, the training error decreases continuously, but the validation error decreases initially and then increases when overfitting begins. Early stopping halts training at the point where validation error is minimized, ensuring good generalization.
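
A sketch of such a training loop (the patience rule and the train_epoch / validation_error / get_weights helpers are assumed placeholders, not lecture specifics):

```python
def train_with_early_stopping(model, train_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_weights, stalled = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch(model)                  # one pass over the training data
        err = validation_error(model)       # error on the held-out validation set
        if err < best_err:
            best_err, best_weights, stalled = err, model.get_weights(), 0
        else:
            stalled += 1
            if stalled >= patience:         # validation error keeps increasing
                break
    model.set_weights(best_weights)         # restore the minimum-validation-error point
    return model
```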

74
Q

How does dropout work?

A

Dropout randomly deactivates neurons during training to prevent overfitting by reducing co-adaptation between neurons.
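
A sketch of (inverted) dropout applied to a layer's activations during training (the 0.5 drop probability is an assumed example value):

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=None):
    """Randomly deactivate units in training; scale so the expected activation is unchanged."""
    if not training:
        return h                              # no dropout at test time
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop      # keep each unit with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)          # inverted-dropout scaling
```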

75
Q

What are activation functions in neural networks, and where are they used?

A

Activation functions introduce non-linearity into the model, allowing it to learn complex patterns; they are applied to the hidden units and to the output layer.

76
Q

What is the sigmoid activation function and its output range?

A
  • The sigmoid function squashes the input z into a range between 0 and 1, making it useful for binary classification problems.
  • s-shaped graph
77
Q

What are the pros and cons of using the sigmoid function?

A

pros

  • easy to understand and implement

cons

  • suffers from the vanishing gradient problem for large and small z
  • outputs are not zero-centered, which can lead to slower convergence during training
78
Q

What problem arises when scaling data centered around 0 with the sigmoid function?

A

Since the sigmoid function outputs values between 0 and 1, any negative values become positive after applying the function, losing the centeredness of the data.

79
Q

How does the hyperbolic tangent function differ from the sigmoid function?

A

The hyperbolic tangent function has an output range of [−1,1], making it zero-centered and better for optimization compared to the sigmoid.

80
Q

What are the pros and cons of the hyperbolic tangent function?

A

pros

  • outputs a wider range ([-1,1])
  • zero-centred, aiding in gradient-based optimization

cons

  • Suffers from the vanishing gradient problem for large or small inputs.
81
Q

What is an advantage of using the hyperbolic tangent function over the sigmoid function?

A

It better retains the centeredness and scale of the data, improving optimization during training.

82
Q

What is the drawback of the hyperbolic tangent function?

A

Despite its advantages, it involves a costly computation, requiring four exponentials.

83
Q

What is the ReLU activation function, and how does it behave?

A
  • The ReLU function outputs z if z>0 and 0 otherwise.
  • It introduces sparsity in the network by setting many neurons to zero.
84
Q

What are the pros and cons of using ReLU?

A

pros

  • computationally efficient
  • helps mitigate the vanishing gradient problem

cons

  • can suffer from the dying ReLU problem, where neurons output 0 for all inputs and stop updating during training

85
Q

How does the leaky ReLU function differ from the standard ReLU?

A

Unlike standard ReLU, leaky ReLU allows a small slope (e.g., 0.01) for z<0 instead of outputting zero, addressing the dying ReLU problem.

86
Q

What are the pros and cons of leaky ReLU?

A
  • pro: Addresses the dying ReLU problem by allowing small gradients for negative inputs.
  • con: Slightly more computationally expensive than standard ReLU.
87
Q

What issue remains with leaky ReLU?

A

It is still non-differentiable at z=0, though this rarely causes issues in practice.

88
Q

Why is initializing weights to zero a problem in neural networks?

A
  • If all weights are initialized to zero (or the same value), every neuron in the same layer will compute the same output during forward propagation, leading to identical gradients during backpropagation.
  • This symmetry prevents the network from learning useful representations, reducing it to an ineffective structure.
89
Q

What is Xavier initialization used for?

A

Xavier initialization is used for networks with sigmoid or tanh activations. It ensures that activations are neither too small nor too large, avoiding the vanishing or exploding gradient problem.

90
Q

What is He initialization, and when is it used?

A

He initialization is preferred for ReLU activations. It ensures that activations are scaled appropriately for ReLU’s non-linear nature, reducing issues like vanishing gradients.
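
A sketch of both schemes in NumPy (these use the commonly cited variance formulas; treat the exact scaling as an assumption rather than the lecture's definition):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility (see the card on random seeds)

def xavier_init(n_in, n_out):
    """Xavier/Glorot: variance 2 / (n_in + n_out); suited to sigmoid/tanh layers."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_init(n_in, n_out):
    """He: variance 2 / n_in; suited to ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
```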

91
Q

What is invariance in the context of neural networks?

A

Invariance refers to the property where a model produces consistent outputs even when the input undergoes certain transformations, such as rotation, scaling, or translation.

92
Q

Why is invariance important in classification problems?

A

Invariance ensures that models can classify inputs correctly regardless of transformations.

Examples:

  1. Handwriting recognition: Digits should have the same classification despite changes in position or size.
  2. Speech recognition: Invariant to non-linear warping that preserves temporal ordering (e.g., speed or accent variations).
93
Q

What are the challenges in handling invariances using large datasets?

A
  1. It requires a large sample set where all possible transformations are present, which is impractical because the dataset size grows exponentially.
  2. Translating, rotating, and scaling every image drastically increases the number of training examples, making it computationally inefficient.
94
Q

What is an alternative to relying on larger datasets for invariance?

A

Instead of relying on large datasets, models can be designed to inherently exhibit the required invariances through appropriate design.

95
Q

four approaches to handling invariance

A
  1. augment the training set
  2. regularization-based invariance
  3. preprocessing to extract invariant features
  4. architectural design
96
Q

explain augmenting the training set to handle invariance

A

Transform the training data by applying variations (e.g., rotating, scaling, translating) to create examples with the desired invariances.

  • Example: Rotating an image of a digit “5” slightly should still result in the digit being classified as “5”.
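
A minimal sketch of augmenting an image dataset with small rotations and shifts (the angle, shift amount, and image sizes are assumed example values):

```python
import numpy as np
from scipy.ndimage import rotate

def augment(images, angle=10, shift=2):
    """Return the original images plus slightly rotated and translated copies."""
    rotated = np.stack([rotate(img, angle, reshape=False) for img in images])
    shifted = np.roll(images, shift=shift, axis=2)   # translate a few pixels to the right
    return np.concatenate([images, rotated, shifted], axis=0)

# illustrative usage: 10 grayscale 28x28 images -> 30 after augmentation
images = np.zeros((10, 28, 28))
print(augment(images).shape)   # (30, 28, 28)
```
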
97
Q

How does regularization-based invariance work?

A

Regularization terms are added to the loss function to penalize changes in the model’s output when inputs are transformed.

  • Example: If a small rotation in input causes a drastic change in output, the penalty ensures the model learns to reduce sensitivity to such changes.
98
Q

How can preprocessing be used to handle invariance?

A

Preprocessing can extract invariant features before feeding the data into the model.

  • Example: For images, preprocessing may include converting to grayscale, normalizing intensity, or using handcrafted features (e.g., edge detection).
99
Q

How does architectural design help achieve invariance?

A

Neural networks can be designed with built-in invariance properties.

  1. Convolutional Neural Networks (CNNs): Convolutional layers detect patterns regardless of position, making them inherently invariant to translation.
  2. Pooling layers: Enhance translation invariance by summarizing spatial information.
100
Q

hyperbolic tangent equation

A

tanh(z) = (e^z - e^{-z}) / (e^z + e^{-z})