Fully Connected Networks Flashcards

1
Q

What is the purpose of the synapse in human memory?

A

The synapse is a ‘gap’ between neurons. When neurons want to communicate over this gap, they send chemicals called neurotransmitters into the synapse.

When communication across a synapse is repeatedly successful, that connection strengthens, allowing faster and more reliable recall.

2
Q

What is the purpose of the input layer in deep learning?

A

The input layer takes a set of inputs and passes them through several weighted edges into a ‘hidden’ layer. In the case of an image, for example, each neuron in the input layer corresponds to one pixel of that image.

3
Q

What is the purpose of the hidden layer in deep learning?

A

The hidden layers in a DL network are a black box - neuron layers hidden from the user, sitting between the input and output.

Weighted edges carry data from one neuron to the next, and each neuron transforms its summed weighted input non-linearly using an activation function.

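As a sketch of how these layers fit together, here is a minimal fully connected network in Keras (the 784-feature input, i.e. a flattened 28x28 image, and the layer sizes are illustrative assumptions, not from the cards):

```python
# A minimal fully connected network sketch in Keras.
# The 784-feature input and the layer sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(784,)),             # input layer: one value per pixel
    layers.Dense(64, activation="relu"),   # hidden layer: weighted edges + non-linearity
    layers.Dense(64, activation="relu"),   # second hidden layer
    layers.Dense(1, activation="sigmoid"), # output layer: probability of 'true'
])
```
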
4
Q

What is the purpose of the output layer in deep learning?

A

The output layer defines the different classes of output. In the case of a Boolean output, we would have a single sigmoid neuron representing the probability of ‘true’ (or, equivalently, two neurons, one each for ‘true’ and ‘false’).

5
Q

What is the purpose of a loss function in deep learning?

A

A loss function measures how badly the network predicted a labelled output: it compares the predicted values to the actual values and produces a ‘badness’ score.

6
Q

What is the purpose of the optimizer in deep learning?

A

The optimizer takes the loss value produced by our loss function and adjusts the weights between our neurons to allow our network to ‘learn’.

7
Q

What is the purpose of activation functions in deep learning?

A

They introduce non-linearity into the network, transforming the summed weighted input of a node into an output that can be passed on to the next layer.

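For illustration, here are the common activation functions in plain NumPy, each transforming an example summed weighted input z (a sketch, not tied to any particular framework):

```python
import numpy as np

def sigmoid(z):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # zero for negative inputs, identity for positive ones
    return np.maximum(0.0, z)

def tanh(z):
    # squashes into (-1, 1); steeper than sigmoid around zero
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])   # example summed weighted inputs
print(sigmoid(z), relu(z), tanh(z))
```
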
8
Q

What is backpropagation?

A

Backpropagation is a technique that allows a network to learn by mathematically propagating the gradient of the loss backwards through the network, layer by layer, and adjusting the weight values accordingly.

9
Q

How is gradient descent used to optimise our network?

A

Gradient descent is an algorithm used to find a local minimum of a differentiable function.

If we can find weight values that minimise our loss - ideally driving it close to zero - we have a well-fitted model. Note that a local minimum is not necessarily a point of zero loss, so gradient descent does not guarantee a perfect model.

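A toy NumPy sketch of gradient descent, fitting a single weight w for f(x) = wx by repeatedly stepping against the gradient of the loss (the data and learning rate are made up for illustration):

```python
import numpy as np

# Toy data generated from y = 3x; gradient descent should recover w close to 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0      # initial weight
lr = 0.01    # learning rate (step size)

for step in range(200):
    pred = w * x
    # gradient of the MSE loss mean((wx - y)^2) with respect to w
    grad = 2.0 * np.mean((pred - y) * x)
    w -= lr * grad   # step downhill, against the gradient

print(w)  # close to 3.0
```
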
10
Q

What would happen if we didn’t use activation functions?

A

Our network would reduce to a linear regression: a stack of purely linear layers composes into one linear transformation, so however many layers we added, the relationships in our data would all be modelled by a straight line.

11
Q

What are the steps to performing backpropagation on a neural network?

A

Firstly, draw a batch of training samples and corresponding targets. Then, run the network on those values to obtain a set of predictions.

Compute the loss of the network (the mismatch between the predictions and the targets), and update the weights of the network in such a way that we slightly reduce the loss on that batch.

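Those four steps map directly onto a manual training loop. A sketch using TensorFlow’s GradientTape, assuming `model` is an uncompiled Keras model and (x_batch, y_batch) is one batch of samples with its targets:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def train_step(model, x_batch, y_batch):
    # 1. the batch of samples and targets is drawn by the caller
    with tf.GradientTape() as tape:
        predictions = model(x_batch)           # 2. run the network on the batch
        loss = loss_fn(y_batch, predictions)   # 3. compute the mismatch
    grads = tape.gradient(loss, model.trainable_weights)
    # 4. update the weights to slightly reduce the loss on this batch
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return loss
```
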
12
Q

What happens if we leave our step size too high?

A

Think of it like a wheel rolling down a hill - if it goes too fast, it will jump straight past the minimum, perhaps settling at a less optimal point. If it goes too slowly, it may take impractically long to get there. We need a step size that lets us reach the minimum without jumping over it.

13
Q

What are hyperparameters?

A

Hyperparameters are global settings that stay fixed while the network trains, but can themselves be tuned to help the network reach an optimised state. Step size (or learning rate) is a hyperparameter.

14
Q

What is an epoch?

A

An epoch is a hyperparameter - a single pass of the full set of training samples through the network. The more epochs we set, the more times the network trains on the training set.

15
Q

What is batch size?

A

Batch size is the number of training samples passed through the network at any one time. If we have 100 samples and our batch size is 10, each epoch will send through 10 batches.

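The card’s numbers as a quick sketch (the Keras `fit` call is an illustrative assumption):

```python
n_samples = 100
batch_size = 10
batches_per_epoch = n_samples // batch_size   # 10 weight updates per epoch
print(batches_per_epoch)

# In Keras this would typically look like (model, x_train, y_train assumed):
# model.fit(x_train, y_train, batch_size=10, epochs=5)
```
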
16
Q

What is learning rate?

A

Learning rate is step size - a constant we multiply the gradient by to control how far each gradient descent step moves. However, too high a learning rate can cause the step to jump too far, missing crucial minima.

17
Q

What is the downside to having a small learning rate?

A

A small learning rate may cause the network to settle into a poor local minimum and be unable to escape it, and it also makes training slower.

18
Q

What happens if we have a small batch size?

A

A small batch size allows the network to update its weights more frequently and often train faster, but the noisier updates may cause the loss to oscillate.

19
Q

What may happen with too high an epoch value?

A

Too many epochs and we risk overfitting our network on the training data, meaning it will be unable to adapt to new data that isn’t from that training set.

20
Q

What is the purpose of a dropout layer?

A

A dropout layer is a type of hidden layer that randomly ‘drops out’ neurons during training - temporarily zeroing their outputs - with the aim of making things intentionally harder for the network, forcing it to generalise rather than rely on any individual neuron.

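In Keras, for example, dropout is added as its own layer; the 50% rate below is an illustrative choice:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),   # randomly zeroes 50% of the previous layer's outputs during training
    layers.Dense(1, activation="sigmoid"),
])
```
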
21
Q

What is the difference between Stochastic Gradient Descent (SGD) and Mini-Batch SGD?

A

Mini-Batch SGD computes each update from a small batch of samples at once, instead of from a single x and y as in plain SGD, producing one aggregated loss value per batch.

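The difference in a NumPy sketch, reusing the toy model f(x) = wx (data and learning rate are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x
w, lr = 0.0, 0.01

# Plain SGD: one sample per update.
i = 0
grad = 2.0 * (w * x[i] - y[i]) * x[i]
w -= lr * grad

# Mini-batch SGD: several samples per update, one averaged gradient.
batch = np.array([0, 1, 2])   # indices of a mini-batch
grad = 2.0 * np.mean((w * x[batch] - y[batch]) * x[batch])
w -= lr * grad
```
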
22
Q

What is the Sigmoid function best for?

A

Sigmoid is best for binary (Boolean) classification, used at the end of a network for models that should predict either True or False.

23
Q

What is the ReLU function best for?

A

ReLU is commonly used throughout a network’s hidden layers, due to its speed compared to Sigmoid as well as its near immunity to the vanishing gradient problem.

24
Q

What is the vanishing gradient problem?

A

During backpropagation, the gradient reaching each weight is a product of derivatives from all the layers above it.

As the network gets deeper, these repeated multiplications by small derivatives shrink the gradient exponentially - it becomes so small in the early layers that the network effectively cannot learn anymore.

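A quick numeric illustration: the derivative of Sigmoid never exceeds 0.25, so a gradient passed back through many Sigmoid layers shrinks roughly geometrically (a best-case sketch that ignores the weights themselves):

```python
max_sigmoid_grad = 0.25   # maximum of sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
for depth in (1, 5, 10, 20):
    # upper bound on the gradient factor after `depth` sigmoid layers
    print(depth, max_sigmoid_grad ** depth)
# 20 layers: at most ~9.1e-13 - far too small to drive learning.
```
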
25
Q

What is the purpose of Leaky ReLU?

A

Leaky ReLU is simply ReLU but with a small non-zero slope for negative inputs (e.g. 10% of the input, rather than outputting zero). This keeps the gradient from dying at negative values - an attempt to avoid the vanishing gradient problem.

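A NumPy sketch of Leaky ReLU with a 0.1 slope for negative inputs (0.1 matches the card’s 10% figure; 0.01 is also common):

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    # identity for positive inputs, a small non-zero slope for negative ones,
    # so the gradient never collapses to exactly zero
    return np.where(z > 0, z, alpha * z)

print(leaky_relu(np.array([-2.0, 0.0, 2.0])))   # [-0.2, 0.0, 2.0]
```
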
26
Q

What is the difference between Sigmoid and Tanh activation functions?

A

Tanh is a steeper version of Sigmoid with a faster rate of change, and its output is zero-centred, ranging over (-1, 1) rather than (0, 1).

27
Q

What is a linear regression?

A

A linear regression is a form of regression where we model the output as a linear function of the input.

For example, f(x) = wx.

28
Q

What is a Bayesian linear regression?

A

Bayesian linear regression models the mean of one variable as a linear combination of other variables, with the goal of obtaining the posterior probability distribution of the regression coefficients rather than single point estimates.

29
Q

What is a k-fold cross validation?

A

K-fold cross validation partitions our data into K ‘folds’, so every part of the dataset is used for training and serves as the test set exactly once. This makes efficient use of the data and gives a more reliable estimate of how well the model generalises.

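A sketch using scikit-learn’s KFold (the toy data and the commented-out model call are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features each
y = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # each of the 5 folds serves as the test set exactly once
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit and evaluate a model here, e.g. model.fit(X_train, y_train)
```
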
30
Q

What are some examples of loss functions?

A

Mean Squared Error, which measures the average squared distance between the actual and predicted values.

Cross-Entropy Loss, or Log Loss, which measures how far the predicted probabilities are from the actual labels.

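Both losses as a NumPy sketch (the toy predicted and actual values are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    # average squared distance between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # log loss: heavily penalises confident wrong predictions
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(mse(y_true, y_pred), binary_cross_entropy(y_true, y_pred))
```
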