Lecture 5 Flashcards
What is supervised learning in machine learning?
Given a set of inputs and outputs, we build a model to predict the output for a new input.
What are we interested in with unsupervised learning?
Only providing inputs to find patterns in the data.
There is not a known correct answer.
An unsupervised algorithm also outputs a model. Examples of unsupervised tasks:
- Clustering - grouping together data points that are similar to each other
- Detecting anomalies - finding data points that do not appear to be from the same probability distribution eg fraud detection
- Association - finding relationships between variables
What is a key use of unsupervised learning?
To infer probability distributions
What do we assume about the data in unsupervised learning?
What do we do for parametric methods?
We assume this data was drawn from a probability density p(x).
For parametric methods, we choose a functional form for the distribution, eg a Gaussian p(x) ∝ exp(−½ (x−μ)ᵀ Σ⁻¹ (x−μ)).
We then want to find the parameters, eg by maximum likelihood: finding the parameter values under which the observed data are most probable.
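As a sketch of the idea, here is a maximum-likelihood fit of a Gaussian in numpy; the data below is randomly generated purely for illustration:

```python
import numpy as np

# Illustrative data: 500 points in 2 dimensions, assumed to be drawn from a Gaussian
X = np.random.default_rng(0).normal(loc=[1.0, -2.0], scale=1.5, size=(500, 2))

# Maximum-likelihood estimates of the Gaussian parameters
mu = X.mean(axis=0)                         # sample mean
Sigma = np.cov(X, rowvar=False, bias=True)  # ML covariance estimate (divides by N)

print("mean:", mu)
print("covariance:\n", Sigma)
```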
What are the disadvantages of parametric methods?
Not very flexible.
Requires us to make some assumptions about the data.
What parametric method might you choose if you don’t know much about the data?
A Gaussian (normal) distribution, motivated by the central limit theorem.
- The central limit theorem suggests that large amounts of data often behave normally
- However, the Gaussian is a very limited functional form
- There are only a small number of parameters (mean and covariance), which reduces flexibility
In comparison to parametric methods, what does unsupervised learning give us access to?
Unsupervised learning gives us access to non-parametric methods that do not make assumptions.
NB: non-parametric may be misleading as there are still parameters
Describe an example of a non-parametric method used in unsupervised learning.
K-nearest neighbours (KNN).
We have N points and a parameter K set in advance. To find p(x) we sit at a point x and draw (hyper)spheres around this point until we find K points.
If V is the volume of the final hypersphere, p(x) ~= K / NV
Rather than imposing a rigid functional form, the probability estimate is based on the data itself. The estimated probability distribution is highest where the data points are most dense.
This is an estimate of the probability distribution that is data driven.
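A minimal sketch of this K-nearest-neighbour density estimate in one dimension (the data and the choice of K are illustrative):

```python
import numpy as np

def knn_density(x, data, K=10):
    """Estimate p(x) ~ K / (N V), where V is the size of the smallest
    interval (1D 'hypersphere') around x containing K data points."""
    N = len(data)
    r = np.sort(np.abs(data - x))[K - 1]  # distance to the K-th nearest point
    V = 2 * r                             # length of the interval [x - r, x + r]
    return K / (N * V)

data = np.random.default_rng(1).normal(size=1000)
print(knn_density(0.0, data))  # high density near the centre of the data
print(knn_density(3.0, data))  # low density in the tail
```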
What functions does a supervised method often rely on?
Basis functions - we represent data by taking functions of the data.
We pick a set of functions to represent the data eg a + bx + cx^2 for linear regression.
The data we put into the neural network might be a function, not a set of numbers eg a histogram p(x,y). What is the problem with this?
There is a lot of redundant information. If we used the values of a histogram on a grid, we would have lots of information we don’t need (many of the numbers are zero).
We want to design better features to go into the model - ie representing this as a set of numbers.
What could we use as features instead of using the values of histogram on a grid?
The projections of our histogram on some basis functions, rather than looking at everything.
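As a sketch, here is a 1D histogram represented by its projections onto a handful of basis functions; the Legendre polynomials and the histogram values are illustrative choices, not from the lecture:

```python
import numpy as np

# Stand-in for histogram values p(x) on a fine grid over [-1, 1]
x = np.linspace(-1, 1, 200)
hist = np.exp(-8 * (x - 0.3) ** 2)
dx = x[1] - x[0]

# Project onto the first few Legendre polynomials (one possible basis choice)
coeffs = []
for n in range(5):
    basis = np.polynomial.legendre.Legendre.basis(n)(x)
    coeffs.append(np.sum(hist * basis) * dx)  # approximate integral of hist * basis

print(coeffs)  # 5 numbers now stand in for the 200 grid values
```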
How do we pick the set of basis functions?
- Sometimes we can choose them by physical intuition
- But often we don’t know which best represent the data
Discussion of K-means clustering, with a predetermined K. We randomly assign each data point to one of the K sets, calculate the mean of each set, and then re-assign each data point to the set whose mean it is closest to. We repeat until there are no changes.
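A minimal numpy sketch of the procedure just described (random initial assignment, then alternating mean/re-assignment steps); the data is illustrative:

```python
import numpy as np

def kmeans(X, K, rng=np.random.default_rng(0)):
    # Randomly assign each point to one of K sets
    labels = rng.integers(K, size=len(X))
    while True:
        # Mean of each set (fall back to a random point if a set is empty)
        means = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else X[rng.integers(len(X))]
                          for k in range(K)])
        # Re-assign each point to its nearest mean
        new_labels = np.argmin(((X[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)
        if np.array_equal(new_labels, labels):  # stop when nothing changes
            return means, labels
        labels = new_labels

X = np.random.default_rng(1).normal(size=(300, 2))
X[:150] += 4.0                                  # two rough clusters
means, labels = kmeans(X, K=2)
print(means)
```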
What are two examples of how deep learning can be used in an unsupervised way?
- Autoencoders
- Generative adversarial networks (GANs)
Briefly, what do autoencoders do?
These can encode data by projecting it onto a lower-dimensional representation. This projection is learned by training a network whose outputs are as close as possible to its inputs.
The code is a non-linear combination of the data, of lower dimensionality: information is projected through non-linearities down to fewer dimensions.
Briefly, what are GANs?
Generative adversarial networks - these are used to train generative networks that can produce data points from an appropriate distribution. Two networks compete with each other, one to generate possible examples and one to detect false examples. (They compete in a zero sum game).
What is the difference between GANs and CNNs?
GANs are generative models that can generate new examples from a given training set, while convolutional neural networks (CNN) are primarily used for classification and recognition tasks.
What is the need for dimensionality reduction?
Often, we use more features to represent our data points than we need to (redundancies and dependencies of features). Dimensionality reduction aims to eliminate this redundancy, going from a high dimensional feature space to a lower-dimensional representation.
Briefly, discuss auto encoders and dimensionality reduction.
Autoencoders use deep learning to very efficiently reduce the dimensionality of a problem. (In a non-linear way, making it powerful).
What is the network architecture of an auto encoder?
The network architecture is deep, but becomes narrow in the middle. Effectively it is like a deep neural network with a bottleneck region, where there are fewer nodes than inputs (and outputs).
The central layer is the code - this is the bottleneck region, here the number of features is smaller than the input.
There are two steps in forward propagation: encoding and decoding.
What are the two steps in forward propagation in an auto encoder?
- Encoding
- Decoding
Briefly describe the encoding part of forward propagation in an auto encoder.
Features propagate through the network until you reach the code region, where the number of nodes is smaller than the number of inputs.
When we apply the encoding step, we are compressing the data, however we are losing information - this is lossy compression.
Briefly describe the decoding part of forward propagation in an auto encoder.
After the input data has been encoded into a lower-dimensional representation, it is decoded back into the original input space.
What is the target output of an auto encoder?
We want to get back something very similar to what we put in.
It will not be exactly the same, as we have information loss due to the lower-dimensional space. Recall that a single-layer neural network acts as a universal approximator, but only if we have lots of nodes.
Why are the inputs and outputs of an auto encoder not exactly the same?
Information loss due to lower-dimensional space.
Why can we describe the encoding step as lossy compression?
When we apply the encoding step, we are compressing the data, however we are losing information - this is lossy compression.
What are autoencoders trained to recognise?
What is the implication of this?
Autoencoders are trained to recognise the key features in data.
Because the input is projected onto these features, it is generally quite resistant to noise. We can train autoencoders specifically for removing noise from the input - working out what the input should have been without noise.
How can we train autoencoders specifically for removing noise from the input?
We add some noise to the input before it goes into the auto encoder. The network learns how to remove noise.
We add the amount of noise that we expect the autoencoder to encounter when it is used.
The output is a denoised version of the input, but not identical to the original.
[See diagram]
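A minimal sketch of this denoising setup in Keras; the layer sizes, noise level, and random training data are illustrative assumptions, not values from the lecture:

```python
import numpy as np
from tensorflow import keras

# Illustrative "clean" training data: 1000 samples with 784 features each
x_clean = np.random.rand(1000, 784).astype("float32")
x_noisy = x_clean + 0.1 * np.random.randn(1000, 784).astype("float32")  # add the expected noise

autoencoder = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(64, activation="relu"),     # encoder
    keras.layers.Dense(16, activation="relu"),     # code (bottleneck)
    keras.layers.Dense(64, activation="relu"),     # decoder
    keras.layers.Dense(784, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# Noisy inputs, clean targets: the network learns to remove the noise
autoencoder.fit(x_noisy, x_clean, epochs=5, batch_size=32)
```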
How do PCA and autoencoder reconstructions compare for functions with increasing powers of x?
PCA reconstruction (features projected onto one principal component) and an autoencoder (features projected onto one learned feature) are both able to reconstruct a linear function well - both give a straight line.
However, PCA with a single feature is only capable of reproducing the linear relationship, whereas an autoencoder with a single feature can also capture the non-linear relationships.
What other method involves extraction of important features?
How does the auto encoder compare?
Principal component analysis (PCA) does a similar thing (it is linear, rather than network-based).
An autoencoder can be thought of as a non-linear version of PCA. The non-linear combination of features is produced in the coding region.
This non-linearity means that autoencoders may perform better when features are related non-linearly. In turn, autoencoders might require fewer features than PCA to give the same level of error. Eg PCA - 10 features, AE - 1 or 2 features.
Training an autoencoder involves a neural network, so we may incur problems (eg overfitting) that we may not have with PCA.
PCA involves only one line of code (eg diagonalising a matrix), whereas an autoencoder involves setting up a neural network, weights, etc.
PCA is easier to interpret; the autoencoder is harder to understand or interpret, as you have less intuition about the learned features.
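To illustrate the "one line of code" point: PCA essentially comes down to diagonalising the data covariance matrix. A minimal numpy sketch with illustrative data:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(500, 10))  # illustrative data, 10 features
Xc = X - X.mean(axis=0)                              # centre the data

# The key step: diagonalise the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Keep the two components with the largest eigenvalues and project onto them
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
X_reduced = Xc @ top2                                # shape (500, 2)
```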
What is a use of autoencoders, similar to PCAs?
Representation learning.
Sometimes (eg when we know something about the symmetry of a problem) we can choose the right features from our data. Quite often, the problem is too complex to do so - but we still might feel we are using more data than we need to.
Autoencoders project the input (nonlinear) into fewer dimensions, finding a smaller number of features that still represent the data very well.
There are similar advantages to PCA, but with added non-linearity.
How do we train an auto encoder?
We train it as we do a standard (deep) network.
In the dataset D, each input-output pair (x1, t1) can be written as (x1, x1), since the target output is the same as the input.
What is the error function for autoencoders?
Reconstruction error.
When writing the error function, eg for regression, what is one subtlety compared with previous error functions?
The use of the vector norm | | instead of ( ), because the inputs and outputs are now vectors.
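As an example, a typical sum-of-squares reconstruction error for an autoencoder with output y(x; w) would be (the exact form used in the lecture may differ):

$$ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \lVert \mathbf{y}(\mathbf{x}_n; \mathbf{w}) - \mathbf{x}_n \rVert^2 $$

where the vector norm replaces the simple squared difference used for scalar outputs.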
How do we optimise the errors for autoencoders?
We have the error function (reconstruction error). We can find the derivatives of the error with respect to the weights using back-propagation, and optimise the errors as usual.
What is the “bare minimum” basic architecture for an autoencoder?
The basic autoencoder architecture includes a bottleneck.
An autoencoder can already be produced with a single hidden layer in the encoder and decoder sections (three layers total).
Deeper networks are generally preferred. There are several advantages to using deep networks, including the amount of training data needed. For models involving images, convolutional layers are also likely to be needed in order to “pre-process” the data.
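A minimal Keras sketch of such a bottleneck architecture, with illustrative sizes (a single dense encoding layer down to the code, and a single decoding layer back up):

```python
from tensorflow import keras

n_inputs, n_code = 100, 10  # illustrative sizes; the code is smaller than the input

autoencoder = keras.Sequential([
    keras.layers.Input(shape=(n_inputs,)),
    keras.layers.Dense(n_code, activation="relu"),      # encoder -> code (bottleneck)
    keras.layers.Dense(n_inputs, activation="linear"),  # decoder -> reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction error
autoencoder.summary()
```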
What are other possible architectures for an auto encoder?
- Sparse autoencoders
The code layer no longer needs to have fewer nodes than inputs (no bottleneck). Instead, the error function has an extra term that penalises activations - this means that not every node is activated. Some information will get to the code layer and no further, because large activations are penalised.
- Contractive autoencoders
Here the error function also has an extra term - this time, to penalise large derivatives of activations with respect to the data ie situations where a small change in input gives a large change in output.
- Variational autoencoders.
These are similar to a standard autoencoder, but the code is now a probability distribution, rather than a set of values. We can draw from this probability distribution and send the result through the decoder, to give an output that isn’t in the training set, but should be from the same probability distribution. After training, we decouple the encoding and decoding regions.
What applications are there for autoencoders?
- Removing noise from images
- Compression of inputs
- Detection of anomalies - data points that are not likely to be from the same probability distribution as the training set. Putting these points through the autoencoder, they will be poorly reconstructed. Eg for hand-drawn letters: if we pass in a smiley face :) that was not seen in the training set, the autoencoder will not know how to deal with it and you will get back nonsense.
- Inference of missing values - similar to noise removal, we can train the network by removing features at random from the input data (to simulate missing data) and then the final network should be able to reconstruct the original point.
- Generative models - for these, we could consider variational autoencoders.
How are anomalies detected using auto encoder?
Putting these points through the autoencoder, they will be poorly reconstructed.
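A sketch of this check, assuming a Keras-style trained autoencoder with a .predict method; the threshold value here is an arbitrary illustrative choice that would normally be set from the training data:

```python
import numpy as np

def reconstruction_error(autoencoder, x):
    """Mean squared error between a data point and its reconstruction."""
    x = np.atleast_2d(x)
    x_rec = autoencoder.predict(x, verbose=0)
    return float(np.mean((x - x_rec) ** 2))

def is_anomaly(autoencoder, x, threshold=0.05):
    """Flag points that the autoencoder reconstructs poorly."""
    return reconstruction_error(autoencoder, x) > threshold
```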
What is a generative model used for?
Used to produce examples that are similar to its training set. Often re-cast in a probabilistic way - we model the probability distribution that produced the training data and we generate new points from that distribution.
We want to consider complex problems, such as training a model and showing it a variety of pictures, then having it generate new, unseen convincing pictures of these types.
What is a discriminative model?
A discriminative model takes in an input x and predicts an output z (classification or regression). More properly, in statistical terms it models P(z|x). That is, given an input x, what is the probability of output z? The final output of the model is then the most likely from this distribution.
What is a generative model?
A generative model is a model for the joint distribution P(x,z), from which we draw inputs and outputs.
P(x,z) = P(z|x) P(x)
What is a generative adversarial network (GAN)?
A generative adversarial network (GAN) is a deep learning architecture. It trains two neural networks to compete against each other to generate more authentic new data from a given training dataset.
Made up of two networks:
- Generative model
- Discriminative model
The two models play a zero-sum game. The generative model aims to fool the discriminative model, and the discriminative model aims not to be fooled (they have opposite goals).
What does the generative model of a GAN do?
Produces examples of whatever we want to generate eg generating images.
What does the discriminative model of a GAN do?
Decides whether or not the generative model’s examples are “real” examples from the correct distribution. ie works to correctly classify the images produced, deciding if it is real or fake.
What kind of data does the discriminative model take?
The discriminative model takes in a data point, x, which may be from the generative model (t=0) or a “real” example (t=1).
It then classifies this data point as either real (z=1) or fake (z=0).
Discuss the error function of a GAN?
There are two error functions, one for each model; they are combined into a single error function, so there are two terms.
Whenever the discriminator makes a prediction, the errors are updated according to whether the generator was able to “fool” it.
Eg if the generator produces an image and the discriminator correctly labels it “fake”, the generator error increases (it failed to fool the discriminator). If the discriminator labels it “real”, the discriminator error increases (it was fooled).
Eg if a real image is shown and the discriminator labels it “real”, the discriminator error does not increase. However, if it labels it “fake”, the discriminator error increases.
The weights are updated using gradients obtained from the discriminator loss function (and back-propagation).
How does updating the error influence the model?
When we start, what comes out of the generator will be garbage and the discriminator identifies this. But the generator will get better over time and the discriminator gets worse (it will make mistakes).
The discriminator is just a model for classifying inputs (C1 = “real” and C2 =”fake”) and can have any architecture.
Which model, the generative or discriminative, is harder to train?
Why?
A generative model is more difficult to train than a discriminative one - the two tasks are quite asymmetric.
Consider the two tasks:
- Model 1 decides whether or not a picture contains a human face - it is easier to recognise a face than to construct one yourself
- Model 2 has to put together a convincing human face - this is a much harder job
Describe the generative network, with the aid of the diagram.
Starts off with a random input - it doesn’t matter what goes into the generative model, it matters what goes out. This is transformed into a sample data point (an image), which goes into the discriminator. This classifies the data point, and the generator error is updated.
We then want to use the generator error to update the weights. This error depends on the weights of both generator and discriminator, but we want to only update the generator now.
Back-propagate through the discriminator to the generator, and update the weights only of the generator.
Discuss training the GAN.
We don’t want both networks to change at the same time. The generator should adapt to fool the discriminator as it currently is.
The discriminator should adapt to avoid being fooled by the current generator. We alternate between training each of the two models, with the other one kept constant. Each time we train, we send a number of examples either from the true distribution or the generative model through the discriminative model, and accumulate both errors.
We then only update one of the models.
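A minimal sketch of this alternating scheme in TensorFlow/Keras. The architectures, sizes, and optimisers are illustrative, and the generator update uses the common “non-saturating” loss (it maximises the probability of its fakes being labelled real) rather than the raw minimax form:

```python
import tensorflow as tf
from tensorflow import keras

latent_dim, data_dim = 16, 64   # illustrative sizes
generator = keras.Sequential([keras.layers.Input(shape=(latent_dim,)),
                              keras.layers.Dense(32, activation="relu"),
                              keras.layers.Dense(data_dim)])
discriminator = keras.Sequential([keras.layers.Input(shape=(data_dim,)),
                                  keras.layers.Dense(32, activation="relu"),
                                  keras.layers.Dense(1, activation="sigmoid")])

bce = keras.losses.BinaryCrossentropy()
g_opt = keras.optimizers.Adam(1e-4)
d_opt = keras.optimizers.Adam(1e-4)

def train_step(real_batch):
    batch = tf.shape(real_batch)[0]

    # 1) Train the discriminator with the generator held fixed
    with tf.GradientTape() as tape:
        fake = generator(tf.random.normal((batch, latent_dim)))
        d_loss = (bce(tf.ones((batch, 1)), discriminator(real_batch))
                  + bce(tf.zeros((batch, 1)), discriminator(fake)))
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # 2) Train the generator with the discriminator held fixed:
    #    it wants its fakes to be labelled "real"
    with tf.GradientTape() as tape:
        fake = generator(tf.random.normal((batch, latent_dim)))
        g_loss = bce(tf.ones((batch, 1)), discriminator(fake))
    g_grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss

real_batch = tf.random.normal((32, data_dim))  # stand-in for a batch of real data
print(train_step(real_batch))
```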
We alternate training the two networks - when do we stop?
When the generator is as good as it can get, the discriminator is no longer able to make predictions about whether a data point is real. That means it will make a completely random guess. ie the discriminator will get things right 50% of the time. At this point, we can stop the training and decouple the two networks.
What error (loss) function do we use for the GAN?
Classification - real or fake, therefore based on cross entropy error.
The discriminator wants to maximise the probability that it labels a real or fake image correctly. The generator wants to maximise the probability that the discriminator labels a fake image as real.
We use a minimax function which has two sums [see flashcard]
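For reference, the standard GAN objective (presumably what the flashcard shows) can be written with two sums, one over real examples x_n and one over generated samples G(z_m):

$$ V(D, G) = \frac{1}{N}\sum_{n=1}^{N} \log D(x_n) + \frac{1}{M}\sum_{m=1}^{M} \log\bigl(1 - D(G(z_m))\bigr) $$

The discriminator tries to maximise V while the generator tries to minimise it, giving the overall problem $\min_G \max_D V(D, G)$.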
Discuss the minimax function.
Minimax function [see flashcard] - note that if the function is written with the opposite sign, the maximising and minimising roles below are swapped.
The discriminative network wants to maximise this.
The generative network wants to minimise this.
The two are competing against each other - they have opposite goals for this error function.
Discuss a potential problem with GANs.
Mode collapse
The generator may learn to generate only one type of output (eg one particular digit). Eg if we build a generative model to generate pictures of animals (from a repository of real images), it might get very good at producing one kind of animal and stop improving: its other outputs are not convincing and get caught by the discriminator, so the generator stays in a local minimum.
We need to be careful about where we stop training. When the generator is “good enough”, the responses of the discriminator are random. The generator will then start to train on this random feedback.
How are conditional GANs trained?
Using labelled data, and a label is supplied to the generator, to give a particular type of output.
What are other examples of generative models?
- Large language models (LLM) - the underlying ideas are familiar, but there are some new things too. In particular, reinforcement learning.
What is reinforcement learning?
Reinforcement Learning (RL) is a branch of machine learning focused on making decisions to maximise cumulative rewards in a given situation.
It is distinct from supervised / unsupervised learning. Unlike supervised learning, which relies on a training dataset with predefined answers, RL involves learning through experience. In RL, an agent learns to achieve a goal in an uncertain, potentially complex environment by performing actions and receiving feedback through rewards or penalties. ie the model takes actions and is “rewarded” or “punished” for them. There is no right or wrong answer, the model learns for itself how to perform a task.