Connectionist Prep Flashcards

1
Q

What is learning by gradient descent? Explain the general idea behind it, and the role the error E has in it.

A
  • algorithm which aims to minimise the error (E) of the NN by adjusting the model’s parameters
  • involves computing gradients of E with respect to these parameters
  • the parameters are adjusted in the opposite direction of the gradient to minimise E
  • iteratively reduces E, ideally towards the global minimum, although it may settle in a local one
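
A minimal sketch of the update rule, assuming a one-parameter toy error E(w) = (w - 3)^2. The function name, learning rate and step count are illustrative choices, not part of the card.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient of E, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)  # move opposite to dE/dw
    return w

# Toy error E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3); the minimum is at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```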
2
Q

Briefly describe what the backpropagation algorithm is, and in which way it relates to gradient descent.

A
  • algorithm used to train artificial NN by minimising the error between the predicted output and the target values
  • Two main phases: Forward Pass, Backward Pass (backpropagation)
  • FP: Input data is fed into the NN, layer-by-layer computations yield the predicted output
  • Backward pass: works backwards through the layers, applying the chain rule to compute the partial derivatives of the loss function with respect to each parameter (weights and biases)
  • Relation to gradient descent: backpropagation supplies the gradients; gradient descent uses them to adjust the parameters at each layer and so reduce the error
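
The two phases can be sketched on a tiny network. This assumes one sigmoid hidden unit, one linear output and the error E = 0.5 * (y - t)^2; all names are hypothetical and chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    h = sigmoid(w1 * x + b1)    # hidden activation
    y = w2 * h + b2             # linear output
    return h, y

def backward(x, t, w1, b1, w2, b2):
    """Gradients of E = 0.5 * (y - t)^2 obtained with the chain rule."""
    h, y = forward(x, w1, b1, w2, b2)
    dy = y - t                  # dE/dy
    dw2, db2 = dy * h, dy       # output layer
    dz = dy * w2 * h * (1 - h)  # error passed back through the sigmoid
    dw1, db1 = dz * x, dz       # hidden layer
    return dw1, db1, dw2, db2

def sgd_step(params, grads, lr=0.1):
    """Gradient descent consumes the gradients that backward() produced."""
    return tuple(p - lr * g for p, g in zip(params, grads))
```

Calling `sgd_step(params, backward(x, t, *params))` performs one training step on a single example.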
3
Q

What are the common problems of gradient descent that may limit its effectiveness?

A
  • local minima
  • slow convergence
  • sensitivity to learning rate
  • dependence on initial weight selection
4
Q

Explain the role of activation functions in NN

A

They play a crucial role by introducing non-linearities to the model, which are essential for enabling NN to learn complex patterns in the data

5
Q

What is the purpose of the cost function in a NN?

A

Also known as the loss function, it quantifies the inconsistency between predicted values and the corresponding correct values

6
Q

Explain the role of bias terms in a NN

A
  • bias terms add a level of flexibility and adaptability to the model.
  • they “shift” the activation function, providing every neuron with a trainable constant value, in addition to the inputs
7
Q

What is a perceptron?

A

an artificial neuron which takes in many input signals and produces a single binary output signal (0 or 1)
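
A minimal sketch of such a unit, assuming a step threshold at zero. The weights below are hand-picked (not learned) so that the unit computes logical AND.

```python
def perceptron(inputs, weights, bias):
    """Output 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

# Evaluate the unit on all four binary input pairs.
and_outputs = [perceptron([a, b], [1.0, 1.0], -1.5) for a in (0, 1) for b in (0, 1)]
```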

8
Q

Explain the differences between Batch Gradient Descent and Stochastic Gradient Descent

A

In BGD, the model parameters are updated once per pass, based on the average gradient over the entire training dataset. In SGD, an update is made after each training example (or, in the mini-batch variant, after each small batch).
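
The difference in update frequency can be sketched as follows, assuming a one-weight linear model y = w * x with squared error. The data, learning rate and function names are made up for illustration.

```python
import random

def batch_gd(data, w=0.0, lr=0.1, epochs=50):
    for _ in range(epochs):
        grad = sum(2 * (w * x - t) * x for x, t in data) / len(data)
        w -= lr * grad                      # one update per pass over the data
    return w

def sgd(data, w=0.0, lr=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, t in data:
            w -= lr * 2 * (w * x - t) * x   # one update per example
    return w

# Noise-free targets generated with w = 2.
data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
```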

9
Q

Which gradient descent variant is preferred for large datasets, and why?

A

Stochastic GD is preferred over Batch GD.

  • Although BGD usually converges to a more accurate minimum, it is extremely expensive computationally, because every update requires a pass over the whole dataset
  • SGD converges faster and requires less memory. However, updates can be noisy, and it may converge to a local minimum rather than the global minimum
10
Q

Define generalisation

A

The ability of a trained model to perform well on unseen data

11
Q

How can you measure the generalisation ability of an MLP?

A
  • cross validation
  • hold-out strategy (train/test sets)
  • consider choice of evaluation measure
12
Q

How can you decide on an optimal number of hidden units?

A
  • apply domain knowledge to estimate a range
  • test the model across that range to fine-tune the selection
  • this may be infeasible for complex models
13
Q

Explain the difference between two common activation functions of your choice

A

Sigmoid vs tanh
1. Output Range:
- Sigmoid: (0, 1): used for binary classification
- tanh: (-1, 1): suitable for zero-centred data
2. Symmetry:
- Sigmoid is not zero-centred: its outputs are always positive (it is symmetric about the point (0, 0.5))
- tanh is symmetric around the origin (0, 0)
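
A short numerical check of the two ranges, using only the standard library. It also illustrates the identity tanh(x) = 2 * sigmoid(2x) - 1, which is not stated on the card but shows that tanh is a rescaled, shifted sigmoid.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

xs = [-4.0, -1.0, 0.0, 1.0, 4.0]
sig = [sigmoid(x) for x in xs]       # all values lie in (0, 1); sigmoid(0) = 0.5
tanh = [math.tanh(x) for x in xs]    # values lie in (-1, 1); tanh(0) = 0
rescaled = [2.0 * sigmoid(2.0 * x) - 1.0 for x in xs]
```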

14
Q

What are the problems with squared error as the loss function? Give two alternatives.

A

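Squared error causes problems with saturating output units:

  • if the desired output is 1 and the actual output of a logistic unit is very close to 0, there is almost no gradient to correct the error
  • alternatives: cross-entropy (usually paired with a softmax output layer) and relative entropy (KL divergence); softmax itself is an output function, not a loss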
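
The vanishing gradient can be seen numerically. This sketch assumes a single logistic output unit with pre-activation z and compares dE/dz under squared error and under cross-entropy; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_squared_error(z, t):
    y = sigmoid(z)
    return (y - t) * y * (1.0 - y)   # dE/dz for E = 0.5 * (y - t)^2

def grad_cross_entropy(z, t):
    return sigmoid(z) - t            # dE/dz for E = -t*log(y) - (1-t)*log(1-y)

z, t = -10.0, 1.0                    # output very close to 0, target 1
g_se = grad_squared_error(z, t)      # tiny: the unit is saturated
g_ce = grad_cross_entropy(z, t)      # close to -1: a strong corrective signal
```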
15
Q

Define what a Deep Neural Network is

A
  • consists of multiple layers that transform the input in a hierarchical fashion
  • they typically are feed-forward NN with multiple hidden layers, allowing modelling of complex non-linear relationships
16
Q

Formal definition of overfitting in practice

A

during learning the error on the training examples decreases all along, but the error in generalisation reaches a minimum and then starts growing again.

17
Q

Training data contains information about the regularities in the mapping from input to output.
But it also contains noise. Explain how.

A
  • the target values may be unreliable
  • there will be accidental regularities just because of the particular training cases that were chosen
18
Q

When we fit a model, it cannot tell which regularities are real and which are caused by sampling error. Which regularities does it fit, and what is the worst-case scenario?

A
  • Both
  • worst case: If the model is very flexible it can model the sampling error really well
19
Q

What does a model having the “right capacity” entail?

A
  • enough to model the true regularities
  • not enough to also model the spurious regularities
20
Q

How to prevent overfitting in NN

A
  • limiting number of weights
  • weight decay
  • early stopping
  • combining diverse networks
21
Q

Standard ways to limit the capacity of a neural net

A
  • Limit the number of hidden units.
  • Limit the size of the weights.
  • Stop the learning before it has time to overfit.
22
Q

How to limit the size of the model by using fewer hidden units in practice

A

trial and error

23
Q

What is weight decay?

A
  • method for limiting the size of a model
  • involves adding an extra term to the cost function that penalises the squared weights, i.e. keeps weights small unless they have large error derivatives
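
A minimal sketch of the resulting update, assuming the penalised cost C = E + (lam / 2) * sum(w^2), so that each weight's gradient gains an extra lam * w term. The coefficient name `lam` and its value are assumptions.

```python
def weight_decay_step(weights, error_grads, lr=0.1, lam=0.01):
    """One gradient step on C = E + (lam / 2) * sum(w^2)."""
    return [w - lr * (g + lam * w) for w, g in zip(weights, error_grads)]

# With a zero error gradient, every weight simply shrinks towards zero.
shrunk = weight_decay_step([1.0, -2.0], [0.0, 0.0])
```

A weight only stays large if its error derivative is large enough to outweigh the shrinkage term.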
24
Q

What does weight decay prevent, and what does it improve and how?

A
  • It prevents the network from using weights that it does not need.
  • It tends to keep the network in the linear region, where its capacity is lower.
  • This helps to stop it from fitting the sampling error. It makes a smoother model in which the output changes more slowly as the input changes.
  • It can often improve generalisation a lot.
25
Q

What is the idea behind early stopping for preventing overfitting?

A
  • expensive to train a big model with lots of data
  • cheaper to stop adjusting weights once generalisation starts getting worse
  • the capacity is limited because the weights have not had time to grow big
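
One common way to implement this is a patience rule on the validation error. The sketch below assumes the validation errors per epoch are already available as a list; the `patience` parameter and the example curve are made up.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the epoch whose weights should be kept."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

val_curve = [0.90, 0.60, 0.45, 0.40, 0.43, 0.47, 0.55]
best = early_stopping_epoch(val_curve)
```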
26
Q

What hold-out strategy is recommended for model selection of an MLP?

A

Strategies which include a validation set

27
Q

What is gradient descent?

A
  • optimisation technique used to find optimal parameters of a model by iteratively updating them in the direction of the steepest descent of the loss function.
  • aims to minimise the error of the model
28
Q

What is the recommended method for the three-way hold-out strategy when training a NN?

A
  • training data: used for learning the parameters of the model
  • validation data: used for deciding what type of model and what amount of regularisation works best (fine tuning)
  • test data: used to get a final, unbiased estimate of how well the network works
29
Q

Why NN ensembles?

A

For squared error, the error of the averaged prediction of a group of predictors is never larger than the average error of the individual predictors, and it is strictly smaller unless the predictors give identical outputs
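
A small numerical illustration, assuming squared error and three made-up predictions for a single target. It compares the average of the individual errors with the error of the averaged prediction.

```python
def squared_error(pred, target):
    return (pred - target) ** 2

target = 1.0
predictions = [0.4, 1.3, 1.6]   # three hypothetical predictors
avg_individual_error = sum(squared_error(p, target) for p in predictions) / len(predictions)
ensemble_prediction = sum(predictions) / len(predictions)
ensemble_error = squared_error(ensemble_prediction, target)
```

Here the individual errors average to about 0.27, while the averaged prediction (about 1.1) has an error of about 0.01.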

30
Q

Briefly explain the steps of k-Fold Cross Validation

A
  • Divide the data into k disjoint subsets - “folds”
  • For each of k experiments, use k-1 folds for training and the remaining fold for testing.
  • Repeat for all k folds, average the accuracy/error rates.
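
The steps above can be sketched with index lists. The interleaved fold assignment and the `evaluate` callback are assumptions made for illustration; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k disjoint folds over range(n)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, test

def cross_validate(n, k, evaluate):
    """Average the score returned by evaluate(train, test) over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / k
```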
31
Q

How can network ensembling be achieved with just one training run?

A

Use Dropout method

  • during training, at each step knock out some randomly chosen units (and with them their connections)
  • when predicting, use all connections. You will need to introduce a normalising constant for this to work
  • equivalent to having a very large ensemble of networks
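
A minimal sketch of the two modes, assuming the common formulation in which whole units are dropped with probability 1 - keep_prob and the normalising constant at prediction time is keep_prob itself. Function names and values are illustrative.

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Training: zero each unit independently with probability 1 - keep_prob."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def dropout_predict(activations, keep_prob):
    """Prediction: keep every unit, scaled by the normalising constant keep_prob."""
    return [a * keep_prob for a in activations]

rng = random.Random(0)
acts = [1.0, 2.0, 3.0, 4.0]
samples = [dropout_train(acts, 0.5, rng) for _ in range(20000)]
# Averaging many random training-time masks approximates the scaled prediction.
mean_train = [sum(s[i] for s in samples) / len(samples) for i in range(len(acts))]
```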
32
Q

Precautions for applying dropout in practice

A
  • it doesn’t always work: preconditions required
  • Must begin with an oversized net capacity to avoid underfitting
33
Q

What is online learning?

A

Weight updates occur for each example during Gradient Descent

34
Q

Discuss the value of the gradient at different slopes of the error surface. What does gradient descent do at these values, and are we satisfied with this?

A
  • the gradient is large where the error is steep, small where the error is flat
  • Sometimes we would like to run where it’s flat and slow down where it gets too steep. GD does precisely the opposite
35
Q

Briefly discuss some of the fixes for the issues of Gradient Descent

A

Use an adaptive learning rate:

  • increase the rate slowly if it’s not diverging
  • decrease the rate quickly when it starts diverging

Use Momentum: instead of using the gradient to change the position of the weight directly, use it to change the velocity with which the weight moves

Use fixed step: GD decides where to go, but always at same pace

Normalise the gradient based on some combination of previous gradients
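
As one example of these fixes, momentum can be sketched as follows. The toy error E(w) = (w - 3)^2, the coefficient `beta` and the other hyperparameters are assumptions.

```python
def momentum_descent(grad, w0, lr=0.1, beta=0.9, steps=500):
    """The gradient changes the velocity; the velocity changes the weight."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)   # accumulate a decaying sum of past gradients
        w = w + v
    return w

# dE/dw = 2 * (w - 3) for the toy error, so the minimum is at w = 3.
w_mom = momentum_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```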

36
Q

Explain what RL is, and what we want to learn from it

A

Learning from interaction with an environment to achieve some long-term goal that is related to the state of the environment

  • we want to learn how to act to accomplish goals
  • given an environment that contains rewards, we want to learn a policy for acting
37
Q

Define a simple RL setup and goal

A

Setup: We have an agent which is interacting with an environment which it can affect through actions. The agent may be able to sense the environment partially or fully.
Goal: the agent tries to maximise the long term reward conveyed using a reward signal

38
Q

Explain the differences between Supervised and Reinforcement Learning

A
  • In SL, there’s an external “supervisor”, which has knowledge of the environment and who shares it with the agent to complete the task
  • Both strategies use mappings between inputs and outputs, but in RL there is a reward function which acts as a feedback to the agent
  • Supervised learning relies on labelled training data
39
Q

Explain the differences between Unsupervised and Reinforcement Learning

A
  • In UL, there is no feedback from the environment
  • In UL, task is to find the underlying patterns rather than the mapping from input to output
40
Q

Characteristics of RL

A
  • no supervisor, only a reward signal
  • feedback is delayed, not instantaneous
  • time really matters (the data is sequential, not i.i.d.)
  • the agent’s actions affect the subsequent data it receives
41
Q

Why is Deep Learning hard?

A
  • when networks get deep the gradient vanishes
  • when a network is untrained, the deeper down a hidden unit is, the more subtle its effect on the outputs
  • this means it doesn’t do much to the error if you change it
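
A rough illustration of why the gradient shrinks with depth, assuming sigmoid units and ignoring the weights: the sigmoid derivative never exceeds 0.25, so a path through `depth` layers scales the gradient by at most 0.25 ** depth.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # maximal at z = 0, where it equals 0.25

def max_gradient_factor(depth):
    """Upper bound on the product of sigmoid derivatives along `depth` layers."""
    return sigmoid_derivative(0.0) ** depth

factors = {d: max_gradient_factor(d) for d in (1, 5, 10)}
```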
42
Q

Briefly discuss the two main categories for Deep Learning solutions

A

Pre-training:

  • stack deep networks layer-by-layer
  • make sure each layer represents the previous layer meaningfully before adding another layer

Use Artificial Targets: the real problem is that inner layers don’t get gradient, so every now and then use hard targets:

  • generate some random targets for the layer
  • evaluate them all
  • use the best one
43
Q

Advantages and disadvantages of pre-training by auto-association

A

Adv: you can use unlabelled data
Disadvantage (potentially disastrous):

  • you aren’t considering at all the property you want to predict
  • you compress regardless of the property. If it’s lossy, the loss can be in the wrong place.
44
Q

Advantages and disadvantages of pre-training without auto-association

A

Advantage:

  • you compress based on the property you are trying to predict. If it’s lossy, the loss is probably in the right place

Disadvantage:

  • can’t use unlabelled training data
  • shorter training
45
Q

What is clustering?

A
  • unsupervised learning: grouping un-labelled data
  • find underlying patterns in the data
  • large choice of distance functions
  • partitioning and hierarchical methods
46
Q

Possible Clustering implementations for connectionist models

A

k-means can be achieved by using backpropagation in a non-linear self-associating network:

  • there is one hidden layer with each node representing a cluster centre
  • the hidden layer is hardmax, therefore only one of the neurons will be activated from the input
  • the winning neuron’s weights are adjusted when it is activated (this recomputes the cluster centre)
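
The winner-take-all update can be sketched as online k-means. This is a simplified stand-in for the self-associating network described above, not a full backpropagation implementation; the data, starting centres and learning rate are made up.

```python
def winner(x, centres):
    """Hardmax: index of the centre closest to x (the only active unit)."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centres]
    return dists.index(min(dists))

def online_kmeans(data, centres, lr=0.1, epochs=50):
    centres = [list(c) for c in centres]
    for _ in range(epochs):
        for x in data:
            k = winner(x, centres)
            # Only the winning unit's weights move, towards the input.
            centres[k] = [c + lr * (xi - c) for xi, c in zip(x, centres[k])]
    return centres

data = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
centres = online_kmeans(data, centres=[(1.0, 1.0), (4.0, 4.0)])
```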
47
Q

What are the strengths and weaknesses of Clustering compared to Principal Component Analysis (in connectionist models)?

A

PCA: linear self-associating networks
Clustering: non-linear (hardmax) self-associating networks

  • PCA builds global features (strength) while clustering builds local features (weakness)
  • PCA only considers linear combinations of the inputs (weakness)
  • clustering builds much stronger features (strength)
48
Q

State what the difference is between Feedforward and Feedback networks

A

Information Flow:

  • In FFN, information flows in one direction
  • FBNs have recurrent connections, which allow them to maintain and propagate information over time
  • Consequently, FBNs can model sequences and time-dependent data
49
Q

Which network architecture (FFN, FBN) do you think is easier to deal with? Justify your choice.

A
  • FFNs are easier to train and are more stable because there are no feedback loops
  • FBNs can be more challenging to train due to vanishing gradients
50
Q

Describe Hopfield Networks and Boltzmann Machines

A
  • both are types of Recurrent NN (RNN)
  • HNs consist of binary threshold units with symmetric connections
  • BMs use binary stochastic units and incorporate a probabilistic aspect in the update rule
51
Q

Discuss the learning process in Hopfield Networks and Boltzmann machines

A

Hopfield Networks:

  • learning involves adjusting the weights to store certain memories or patterns, essentially capturing second-order interactions

Boltzmann machines

  • learn to generate configurations according to a probability distribution
  • involves adjusting the weights based on the correlation differences in the training and generated data
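
The Hopfield half of this card can be sketched with a Hebbian storage rule. The sketch assumes unit states in {-1, +1} (a common convention for binary threshold units) and asynchronous updates; function names and the stored pattern are made up.

```python
def train_hopfield(patterns):
    """Hebbian storage: w_ij = sum over patterns of s_i * s_j, no self-connections."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns) for j in range(n)]
            for i in range(n)]

def recall(weights, state, sweeps=5):
    """Asynchronous threshold updates, one unit at a time."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            field = sum(w * s for w, s in zip(weights[i], state))
            state[i] = 1 if field >= 0 else -1
    return state

stored = [1, -1, 1, -1, 1, -1]
W = train_hopfield([stored])
noisy = [1, -1, 1, -1, 1, 1]        # last unit flipped
recovered = recall(W, noisy)
```

The weight matrix is symmetric and depends only on pairwise products, which is the sense in which the stored memories capture second-order interactions.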
52
Q

Do Hopfield Networks and Boltzmann machines tackle similar problems?

A
  • HNs are deterministic, while BMs are probabilistic
  • BMs can include hidden units, which lets them represent higher-order interactions, whereas HNs capture only second-order ones
53
Q

Discuss similarities and differences between MLPs and Boltzmann Machines

A
  • both can have hidden layers
  • feedforward vs recurrent
  • the role of hidden units is somewhat similar in both models: learning complex patterns/structures
  • the manner in which hidden units operate is different: deterministic in MLPs, probabilistic in BMs