Connectionist Prep Flashcards
What is learning by gradient descent? Explain the general idea behind it, and the role the error E has in it.
- algorithm which aims to minimise the error (E) of the NN by adjusting the model’s parameters
- involves computing gradients of E with respect to these parameters
- the parameters are adjusted in the opposite direction of the gradient to minimise E
- iteratively reduces E, in pursuit of a global minimum
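A minimal sketch of one full gradient-descent loop (the quadratic error surface and learning rate here are illustrative, not from the cards):

```python
import numpy as np

def E(w):
    """Illustrative quadratic error: minimised at w = [3, 3]."""
    return np.sum((w - 3.0) ** 2)

def grad_E(w):
    """Gradient of E with respect to the parameters."""
    return 2.0 * (w - 3.0)

w = np.zeros(2)          # initial parameters
lr = 0.1                 # learning rate
for _ in range(100):
    w -= lr * grad_E(w)  # step opposite to the gradient to reduce E

print(w, E(w))           # w approaches [3, 3], E approaches 0
```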
Briefly describe what the backpropagation algorithm is, and in which way it relates to gradient descent.
- algorithm used to train artificial NN by minimising the error between the predicted output and the target values
- Two main phases: Forward Pass, Backward Pass (backpropagation)
- FP: Input data is fed into the NN, layer-by-layer computations yield the predicted output
- Backpropagation: works backwards through the layers, computing the error gradients that gradient descent then uses to adjust the model’s parameters at each layer
- involves calculus to calculate the partial derivatives of the loss function with respect to each parameter (weights and bias)
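A minimal sketch of the two phases for a one-hidden-layer network with squared error (the layer sizes, random data, and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # 4 examples, 3 inputs
t = rng.normal(size=(4, 1))                      # target values
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)    # hidden layer (5 units)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)    # output layer

lr = 0.1
for _ in range(1000):
    # Forward pass: layer-by-layer computation of the prediction
    h = sigmoid(x @ W1 + b1)
    y = h @ W2 + b2                              # linear output unit
    # Backward pass: chain rule from the error back through each layer
    dy = (y - t) / len(x)                        # dE/dy for mean squared error
    dW2, db2 = h.T @ dy, dy.sum(axis=0)
    dh = dy @ W2.T
    dz = dh * h * (1.0 - h)                      # sigmoid' = h * (1 - h)
    dW1, db1 = x.T @ dz, dz.sum(axis=0)
    # Gradient descent: adjust every parameter against its gradient
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g
```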
What are the common problems of gradient descent, that may limit its effectiveness?
- local minima
- slow convergence
- sensitivity to learning rate
- dependence on initial weight selection
Explain the role of activation functions in NN
They play a crucial role by introducing non-linearities to the model, which are essential for enabling NN to learn complex patterns in the data
What is the purpose of the cost function in a NN
Also known as the loss function, it quantifies the inconsistency between predicted values and the corresponding correct values
Explain the role of bias terms in a NN
- bias terms add a level of flexibility and adaptability to the model.
- they “shift” the activation function, providing every neuron with a trainable constant value, in addition to the inputs
What is a perceptron
an artificial neuron which takes in many input signals and produces a single binary output signal (0 or 1)
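A sketch of a perceptron with hand-picked weights (the AND task and the values of w and b are illustrative); note the bias b acting as the trainable threshold shift described in the previous card:

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the inputs plus bias, thresholded to 0 or 1."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hand-picked weights computing logical AND; the bias b shifts the threshold
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))   # fires only for (1, 1)
```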
Explain the differences between Batch Gradient Descent and Stochastic Gradient Descent
in BGD, the model parameters are updated in one go, based on the average gradient of the entire training dataset. In SGD, updates occur for each training example or mini-batch.
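A sketch contrasting the two update schedules on a toy linear model (the model, data, and learning rate are illustrative):

```python
import numpy as np

def grad(w, x, t):
    """Squared-error gradient for an illustrative linear model y = w * x."""
    return 2.0 * (w * x - t) * x

X = np.array([1.0, 2.0, 3.0])
T = np.array([2.0, 4.0, 6.0])          # toy data for y = 2x
lr, w_batch, w_sgd = 0.05, 0.0, 0.0

# BGD: a single update from the average gradient over the whole dataset
w_batch -= lr * np.mean([grad(w_batch, x, t) for x, t in zip(X, T)])

# SGD: one update per training example (a mini-batch would group a few)
for x, t in zip(X, T):
    w_sgd -= lr * grad(w_sgd, x, t)
```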
Which type of gradient descent is preferred for large datasets, and why?
Stochastic GD is preferred over Batch GD.
- Although BGD usually converges to a more accurate minimum, it is extremely computationally expensive
- SGD converges faster and requires less memory. However, updates can be noisy, and it may converge to a local minimum rather than the global minimum
Define generalisation
The ability of a trained model to perform well on unseen data
How can you measure the generalisation ability of an MLP
- cross validation
- hold-out strategy (train/test sets)
- consider choice of evaluation measure
How can you decide on an optimal number of hidden units?
- apply domain knowledge to estimate a range
- test the model across that range to fine-tune the selection
- this may be infeasible for complex models
Explain the difference between two common activation functions of your choice
Sigmoid vs tanh
1. Output Range:
- Sigmoid: (0,1): used for binary classification
- tanh: (-1, 1): suitable for zero-centred data
2. Symmetry:
- Sigmoid is asymmetric, biased towards positive values
- tanh is symmetric around the origin (0, 0)
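A quick numerical check of both properties (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))    # [0.119 0.5   0.881] -> range (0, 1), biased positive
print(np.tanh(z))    # [-0.964 0.    0.964] -> range (-1, 1), symmetric at 0
```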
What are the problems with squared error as the loss function, give two alternatives
Squared error has some tricky problems:
- with a sigmoid output, if the desired output is 1 and the actual output is very close to 0, there is almost no gradient, so learning stalls
- alternatives: a softmax output with cross-entropy cost, or relative entropy
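A worked check of the saturation problem, comparing squared error against cross-entropy (standing in here for the relative-entropy alternative) at a saturated sigmoid output; the logit value is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, t = -6.0, 1.0        # actual output y is near 0, desired output is 1
y = sigmoid(z)          # ~0.0025

# Squared error E = (y - t)^2 / 2: dE/dz = (y - t) * y * (1 - y)
print((y - t) * y * (1.0 - y))   # ~ -0.0025 -> almost no gradient
# Cross-entropy E = -t*ln(y) - (1-t)*ln(1-y): dE/dz = y - t
print(y - t)                     # ~ -0.9975 -> a healthy gradient
```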
Define what a Deep Neural Network is
- consists of multiple layers that transform the input in a hierarchical fashion
- they typically are feed-forward NN with multiple hidden layers, allowing modelling of complex non-linear relationships
Formal definition of overfitting in practice
During learning, the error on the training examples keeps decreasing, but the generalisation error reaches a minimum and then starts growing again.
Training data contains information about the regularities in the mapping from input to output.
But it also contains noise, explain how.
- the target values may be unreliable
- there will be accidental regularities just because of the particular training cases that were chosen
When we fit a model, it cannot tell which regularities are real and which are caused by sampling error. Which regularity does it fit, what is the worst case scenario?
- Both
- worst case: If the model is very flexible it can model the sampling error really well
What does a model having the “right capacity” entail
- enough to model the true regularities
- not enough to also model the spurious regularities
How to prevent overfitting in NN
- limiting number of weights
- weight decay
- early stopping
- combining diverse networks
Standard ways to limit the capacity of a neural net
- Limit the number of hidden units.
- Limit the size of the weights.
- Stop the learning before it has time to overfit.
How to limit the size of the model by using fewer hidden units in practice
trial and error
What is weight-decay
- method for limiting the size of a model
- involves adding an extra term to the cost function that penalises the squared weights, i.e. keeps weights small unless they have large error derivatives
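A sketch of the extra term, assuming an L2 penalty with an illustrative coefficient lam:

```python
import numpy as np

lam = 0.01  # weight-decay coefficient (illustrative)

def cost(E, w):
    """Penalised cost: original error plus half the squared weights."""
    return E + 0.5 * lam * np.sum(w ** 2)

def grad_cost(grad_E, w):
    """Each weight feels an extra pull of lam * w back towards zero."""
    return grad_E + lam * w

w = np.array([0.5, -2.0])
print(grad_cost(np.zeros(2), w))  # with zero error gradient, weights just decay
```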
What does weight decay prevent, and what does it improve and how?
- It prevents the network from using weights that it does not need.
- It tends to keep the network in the linear region, where its capacity is lower.
- This helps to stop it from fitting the sampling error. It makes a smoother model in which the output changes more slowly as the input changes.
- It can often improve generalisation a lot.
What is the idea behind early stopping for preventing overfitting
- expensive to train a big model with lots of data
- cheaper to stop adjusting weights once generalisation starts getting worse
- the capacity is limited because the weights have not had time to grow big
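A sketch of the idea on a hypothetical 1-D regression problem (the data, patience, and learning rate are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x_tr, x_va = rng.normal(size=50), rng.normal(size=50)
y_tr = 2.0 * x_tr + rng.normal(scale=0.5, size=50)   # noisy training targets
y_va = 2.0 * x_va                                    # clean validation targets

w, lr = 0.0, 0.01
best_err, best_w, bad, patience = float("inf"), w, 0, 5
for epoch in range(500):
    w -= lr * np.mean(2.0 * (w * x_tr - y_tr) * x_tr)  # one training epoch
    err = np.mean((w * x_va - y_va) ** 2)              # validation error
    if err < best_err:
        best_err, best_w, bad = err, w, 0              # new best snapshot
    else:
        bad += 1
        if bad >= patience:      # generalisation stopped improving
            break                # stop before the weights grow big
w = best_w                       # keep the weights from the best epoch
```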
What hold-out strategy is recommended for model selection of an MLP
Strategies which include a validation set
What is gradient descent
- optimisation technique used to find optimal parameters of a model by iteratively updating them in the direction of the steepest descent of the loss function.
- aims to minimise the error of the model
What is the recommended method for the three-way hold-out strategy when training a NN
- training data: used for learning the parameters of the model
- validation data: used for deciding what type of model and what amount of regularisation works best (fine tuning)
- test data: used to get a final, unbiased estimate of how well the network works
Why NN ensembles?
For squared error, the error of the group’s averaged prediction is never larger than the average error of the individual predictors, and it is strictly smaller unless the predictors are identical
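A small numerical illustration for squared error (the target and predictions are made up):

```python
import numpy as np

target = 1.0
preds = np.array([0.4, 0.9, 1.6])   # three individual predictors

avg_of_errors = np.mean((preds - target) ** 2)   # ~0.243
error_of_avg = (np.mean(preds) - target) ** 2    # ~0.001
print(avg_of_errors, error_of_avg)  # the averaged predictor is never worse
```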
Briefly explain the steps of k-Fold Cross Validation
- Divide the data into k disjoint subsets - “folds”
- For each of k experiments, use k-1 folds for training and the remaining fold for testing.
- Repeat for all k folds, average the accuracy/error rates.
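A sketch of the fold bookkeeping (the data size, k, and the train/evaluate steps are placeholders):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = k_fold_indices(n=100, k=5)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on train_idx, evaluate on test_idx, then average the k scores
```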
How to achieve Network Ensembling with just one training
Use Dropout method
- during training, at each step knock out some randomly chosen connections
- when predicting, use all connections. You will need to introduce a normalising constant for this to work
- equivalent to having a very large ensemble of networks
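A sketch of a unit-level "inverted" dropout layer, where the normalising constant mentioned above is folded into training so that prediction can use all connections unchanged (p_keep is illustrative):

```python
import numpy as np

def dropout(h, p_keep=0.5, training=True):
    """Inverted dropout on a layer's activations h."""
    if training:
        mask = np.random.rand(*h.shape) < p_keep  # knock out random units
        return h * mask / p_keep                  # rescale the survivors
    return h                                      # prediction: use everything
```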
Precautions for applying dropout in practice
- it doesn’t always work: some preconditions are required
- you must begin with an oversized network, since knocking out connections reduces capacity and can otherwise cause underfitting
What is online learning
Weight updates occur for each example during Gradient Descent
Discuss the magnitude of the gradient at different slopes of the error surface. What does gradient descent do at these values, and are we satisfied with this?
- the gradient is large where the error surface is steep and small where it is flat
- sometimes we would like to move fast where it’s flat and slow down when it gets too steep; GD does precisely the contrary
Briefly discuss some of the fixes for the issues of Gradient Descent
Use an adaptive learning rate:
- increase the rate slowly if it’s not diverging
- decrease the rate quickly when it starts diverging
Use Momentum: instead of using the gradient to change the position of the weight directly, use it to change the velocity of the weight change
Use a fixed step size: the gradient decides the direction to go, but the step length always stays the same
Normalise the gradient based on some combination of previous gradients
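A sketch of the momentum fix on an illustrative quadratic error (the learning rate and momentum coefficient are made up):

```python
def grad(w):
    return 2.0 * (w - 3.0)      # illustrative quadratic error gradient

w, v, lr, mu = 0.0, 0.0, 0.1, 0.9
for _ in range(100):
    v = mu * v - lr * grad(w)   # the gradient changes the velocity...
    w += v                      # ...and the velocity changes the weight
print(w)                        # converges towards the minimiser 3.0
```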
Explain what RL is, and what we want to learn from it
Learning from interaction with an environment to achieve some long-term goal that is related to the state of the environment
- we want to learn how to act to accomplish goals
- given an environment that contains rewards, we want to learn a policy for acting
Define a simple RL Setup And Goal
Setup: an agent interacts with an environment, which it can affect through its actions. The agent may be able to sense the environment partially or fully.
Goal: the agent tries to maximise the long-term reward, conveyed through a reward signal
Explain the differences between Supervised and Reinforcement Learning
- In SL, there’s an external “supervisor” with knowledge of the environment, which it shares with the agent to complete the task
- Both strategies use mappings between inputs and outputs, but in RL there is a reward function which acts as a feedback to the agent
- Supervised learning relies on labelled training data
Explain the differences between Unsupervised and Reinforcement Learning
- In UL, there is no feedback from the environment
- In UL, the task is to find the underlying patterns rather than a mapping from input to output
Characteristics of RL
- no supervisor, only a reward signal
- feedback is delayed, not instantaneous
- time really matters
- the agent’s actions affect the subsequent data it receives
Why is Deep Learning hard
- when networks get deep the gradient vanishes
- in an untrained network, the deeper down a hidden unit is, the more subtle its effect on the outputs
- this means changing it does little to the error, so it receives almost no gradient
Briefly discuss the two main categories for Deep Learning solutions
Pre-training:
- build the deep network by stacking one layer at a time
- make sure each layer represents the previous layer meaningfully before adding another layer
Use Artificial Targets: the real problem is that the inner layers get no useful gradient, so every now and then give them hard targets:
- generate some random targets for the layer
- evaluate them all
- use the best one
Advantages and disadvantages of pre-training by auto-association
Adv: you can use unlabelled data
Disadvantage (potentially disastrous):
- the property you want to predict is not considered at all
- you compress regardless of the property; if the compression is lossy, the loss can be in the wrong place
Advantages and disadvantages of pre-training without auto-association
Advantage:
- you compress based on the property you are trying to predict. If it’s lossy, the loss is probably in the right place
Disadvantage:
- can’t use unlabelled training data
- consequently, there is less data to train on
What is clustering
- unsupervised learning: grouping un-labelled data
- find underlying patterns in the data
- large choice of distance functions
- partitioning and hierarchical methods
Possible Clustering implementations for connectionist models
k-means can be achieved by using backpropagation in a non-linear self-associating network:
- there is one hidden layer with each node representing a cluster centre
- the hidden layer is hardmax, so only one of the neurons is activated for each input
- the winning neuron’s weights are adjusted when it is activated (recomputing the cluster centre)
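A competitive-learning sketch of the same idea, with the hardmax winner's "weights" moved toward each input (the data, number of centres, and learning rate are illustrative; the backpropagation machinery of the network formulation is abstracted away):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # unlabelled data
centres = X[rng.choice(len(X), 3, replace=False)].copy()  # 3 "hidden units"
lr = 0.05

for x in X:
    # hardmax: only the nearest unit is activated by this input
    winner = np.argmin(np.linalg.norm(centres - x, axis=1))
    # adjust only the winner's weights, pulling the centre toward the input
    centres[winner] += lr * (x - centres[winner])
```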
What are the strengths and weaknesses of Clustering compared to Principal Component Analysis (in connectionist models)?
PCA: linear self-associating networks
Clustering: non-linear (hardmax) self-associating networks
- PCA builds global features (strength) while clustering builds local features (weakness)
- PCA only considers linear combinations of the inputs (weakness)
- clustering builds much stronger features (strength)
State what the difference is between Feedforward and Feedback networks
Information Flow:
- In FFN, information flows in one direction
- FBNs have recurrent connections, which allow them to maintain and propagate information over time
- Consequently, FBNs can model sequences and time-dependent data
Which network architecture (FFN, FBN) do you think are easier to deal with? Justify your choice.
- FFNs are easier to train and are more stable because there are no feedback loops
- FBNs can be more challenging to train due to vanishing gradients
Describe Hopfield Networks and Boltzmann Machines
- both are types of Recurrent NN (RNN)
- HNs consist of binary threshold units with symmetric connections
- BMs use binary stochastic units and incorporate a probabilistic aspect in the update rule
Discuss the learning process in Hopfield Networks and Boltzmann machines
Hopfield Networks:
- learning involves adjusting the weights to store certain memories or patterns, essentially capturing second-order interactions
Boltzmann machines
- learn to generate configurations according to a probability distribution
- involves adjusting the weights based on the correlation differences in the training and generated data
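A minimal sketch of Hebbian storage and threshold recall in a Hopfield network (the two ±1 patterns and the corrupted probe are made up):

```python
import numpy as np

patterns = np.array([[1, -1, 1, -1, 1],
                     [1, 1, -1, -1, 1]])      # +/-1 patterns to store
n = patterns.shape[1]

# Hebbian storage: symmetric weights capture second-order interactions
W = (patterns.T @ patterns) / n
np.fill_diagonal(W, 0)

# Recall: start from pattern 0 with one flipped unit, update by thresholding
s = np.array([1, 1, 1, -1, 1])
for _ in range(5):
    for i in range(n):
        s[i] = 1 if W[i] @ s > 0 else -1      # binary threshold update
print(s)                                      # settles back to [1,-1,1,-1,1]
```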
Do Hopfield Networks and Boltzmann machines tackle similar problems?
- HNs are deterministic, while BMs are probabilistic
- Consequently, BMs can represent higher-order interactions
Discuss similarities and differences between MLPs and Boltzmann Machines
- both can have hidden layers
- feedforward vs recurrent
- the role of the hidden units is somewhat similar in both models: learning complex patterns/structures
- the manner in which the hidden units operate differs: deterministic in MLPs, probabilistic in BMs