Lecture 5 Flashcards
What is supervised learning in machine learning?
Given a set of inputs and outputs, we build a model to predict the output for a new input.
What are we interested in with unsupervised learning?
Only providing inputs to find patterns in the data.
There is not a known correct answer.
An unsupervised algorithm also outputs a model. Examples of unsupervised tasks:
- Clustering - grouping together data points that are similar to each other
- Detecting anomalies - finding data points that do not appear to be from the same probability distribution eg fraud detection
- Association - finding relationships between variables
What is a key use of unsupervised learning?
To infer probability distributions
What do we assume about the data in unsupervised learning?
What do we do for parametric methods?
We assume this data was drawn from a probability density p(x).
For parametric methods, we choose a functional form for the distribution, eg a Gaussian p(x) ∝ exp(−½ (x−μ)ᵀ Σ⁻¹ (x−μ)).
We then want to find the parameters, eg by maximum likelihood: finding the parameter values under which the observed data are most probable.
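As a sketch of the idea, here is a maximum-likelihood fit of a Gaussian in numpy; the data below is randomly generated purely for illustration:

```python
import numpy as np

# Illustrative data: 500 points in 2 dimensions, assumed to be drawn from a Gaussian
X = np.random.default_rng(0).normal(loc=[1.0, -2.0], scale=1.5, size=(500, 2))

# Maximum-likelihood estimates of the Gaussian parameters
mu = X.mean(axis=0)                         # sample mean
Sigma = np.cov(X, rowvar=False, bias=True)  # ML covariance estimate (divides by N)

print("mean:", mu)
print("covariance:\n", Sigma)
```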
What are the disadvantages of parametric methods?
Not very flexible.
Requires us to make some assumptions about the data.
What parametric method might you choose if you don’t know much about the data?
A Gaussian (normal) distribution, motivated by the central limit theorem.
- The central limit theorem suggests that large amounts of data often behave normally
- However, the Gaussian is a very limited functional form
- There are only a small number of parameters (mean and covariance), which reduces flexibility
In comparison to parametric methods, what does unsupervised learning give us access to?
Unsupervised learning gives us access to non-parametric methods that do not make assumptions.
NB: non-parametric may be misleading as there are still parameters
Describe an example of a non-parametric method used in unsupervised learning.
K-nearest neighbours (KNN).
We have N points and a parameter K set in advance. To find p(x) we sit at a point x and draw (hyper)spheres around this point until we find K points.
If V is the volume of the final hypersphere, p(x) ~= K / NV
Rather than imposing a rigid functional form, the probability estimate is based on the data itself. The estimated probability distribution is highest where the data points are most dense.
This is an estimate of the probability distribution that is data driven.
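A minimal sketch of this K-nearest-neighbour density estimate in one dimension (the data and the choice of K are illustrative):

```python
import numpy as np

def knn_density(x, data, K=10):
    """Estimate p(x) ~ K / (N V), where V is the size of the smallest
    interval (1D 'hypersphere') around x containing K data points."""
    N = len(data)
    r = np.sort(np.abs(data - x))[K - 1]  # distance to the K-th nearest point
    V = 2 * r                             # length of the interval [x - r, x + r]
    return K / (N * V)

data = np.random.default_rng(1).normal(size=1000)
print(knn_density(0.0, data))  # high density near the centre of the data
print(knn_density(3.0, data))  # low density in the tail
```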
What functions does a supervised method often rely on?
Basis functions - we represent data by taking functions of the data.
We pick a set of functions to represent the data eg a + bx + cx^2 for linear regression.
The data we put into the neural network might be a function, not a set of numbers eg a histogram p(x,y). What is the problem with this?
There is a lot of redundant information. If we used the values of a histogram on a grid, we would have lots of information we don’t need (many of the numbers are zero).
We want to design better features to go into the model - ie representing this as a set of numbers.
What could we use as features instead of using the values of histogram on a grid?
The projections of our histogram on some basis functions, rather than looking at everything.
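As a sketch, here is a 1D histogram represented by its projections onto a handful of basis functions; the Legendre polynomials and the histogram values are illustrative choices, not from the lecture:

```python
import numpy as np

# Stand-in for histogram values p(x) on a fine grid over [-1, 1]
x = np.linspace(-1, 1, 200)
hist = np.exp(-8 * (x - 0.3) ** 2)
dx = x[1] - x[0]

# Project onto the first few Legendre polynomials (one possible basis choice)
coeffs = []
for n in range(5):
    basis = np.polynomial.legendre.Legendre.basis(n)(x)
    coeffs.append(np.sum(hist * basis) * dx)  # approximate integral of hist * basis

print(coeffs)  # 5 numbers now stand in for the 200 grid values
```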
How do we pick the set of basis functions?
- Sometimes we can choose them by physical intuition
- But often we don’t know which best represent the data
Discussion of K-means clustering, with a predetermined K. We randomly assign each data point to one of the K sets, calculate the mean of each set, and then re-assign each data point to the set whose mean it is closest to. We repeat until there are no changes.
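A minimal numpy sketch of the procedure just described (random initial assignment, then alternating mean/re-assignment steps); the data is illustrative:

```python
import numpy as np

def kmeans(X, K, rng=np.random.default_rng(0)):
    # Randomly assign each point to one of K sets
    labels = rng.integers(K, size=len(X))
    while True:
        # Mean of each set (fall back to a random point if a set is empty)
        means = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else X[rng.integers(len(X))]
                          for k in range(K)])
        # Re-assign each point to its nearest mean
        new_labels = np.argmin(((X[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)
        if np.array_equal(new_labels, labels):  # stop when nothing changes
            return means, labels
        labels = new_labels

X = np.random.default_rng(1).normal(size=(300, 2))
X[:150] += 4.0                                  # two rough clusters
means, labels = kmeans(X, K=2)
print(means)
```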
What are two examples of how deep learning can be used in an unsupervised way?
- Autoencoders
- Generative adversarial networks (GANs)
Briefly, what do autoencoders do?
These can encode data by projecting it onto a lower-dimensional representation. This projection is learned by training a network whose outputs are as close as possible to its inputs.
The code is a non-linear combination of the data, of lower dimensionality: information is projected through non-linearities down to fewer dimensions.
Briefly, what are GANs?
Generative adversarial networks - these are used to train generative networks that can produce data points from an appropriate distribution. Two networks compete with each other, one to generate possible examples and one to detect false examples. (They compete in a zero sum game).
What is the difference between GANs and CNNs?
GANs are generative models that can generate new examples from a given training set, while convolutional neural networks (CNN) are primarily used for classification and recognition tasks.
What is the need for dimensionality reduction?
Often, we use more features to represent our data points than we need to (redundancies and dependencies of features). Dimensionality reduction aims to eliminate this redundancy, going from a high dimensional feature space to a lower-dimensional representation.
Briefly, discuss auto encoders and dimensionality reduction.
Autoencoders use deep learning to very efficiently reduce the dimensionality of a problem. (In a non-linear way, making it powerful).
What is the network architecture of an auto encoder?
The network architecture is deep, but becomes narrow in the middle. Effectively it is like a deep neural network with a bottleneck region, where there are fewer nodes than inputs (and outputs).
The central layer is the code - this is the bottleneck region, here the number of features is smaller than the input.
There are two steps in forward propagation: encoding and decoding.
What are the two steps in forward propagation in an auto encoder?
- Encoding
- Decoding
Briefly describe the encoding part of forward propagation in an auto encoder.
Features propagate through the network until you reach the code region, where the number of nodes is smaller than the number of inputs.
When we apply the encoding step, we are compressing the data, however we are losing information - this is lossy compression.
Briefly describe the decoding part of forward propagation in an auto encoder.
After the input data has been encoded into a lower-dimensional representation, it is decoded back into the original input space.
What is the target output of an auto encoder?
We want to get back something very similar to what we put in.
It will not be exactly the same, as we have information loss due to the lower-dimensional space. Recall that a single-layer neural network acts as a universal approximator, but only if we have lots of nodes.
Why are the inputs and outputs of an auto encoder not exactly the same?
Information loss due to lower-dimensional space.
Why can we describe the encoding step as lossy compression?
When we apply the encoding step, we are compressing the data, however we are losing information - this is lossy compression.
What are autoencoders trained to recognise?
What is the implication of this?
Autoencoders are trained to recognise the key features in data.
Because the input is projected onto these features, it is generally quite resistant to noise. We can train autoencoders specifically for removing noise from the input - working out what the input should have been without noise.
How can we train autoencoders specifically for removing noise from the input?
We add some noise to the input before it goes into the auto encoder. The network learns how to remove noise.
We add the amount of noise that we expect the autoencoder to encounter when it is used.
The output is a denoised version of the input, but not identical to the original.
[See diagram]
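A minimal sketch of this denoising setup in Keras; the layer sizes, noise level, and random training data are illustrative assumptions, not values from the lecture:

```python
import numpy as np
from tensorflow import keras

# Illustrative "clean" training data: 1000 samples with 784 features each
x_clean = np.random.rand(1000, 784).astype("float32")
x_noisy = x_clean + 0.1 * np.random.randn(1000, 784).astype("float32")  # add the expected noise

autoencoder = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(64, activation="relu"),     # encoder
    keras.layers.Dense(16, activation="relu"),     # code (bottleneck)
    keras.layers.Dense(64, activation="relu"),     # decoder
    keras.layers.Dense(784, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# Noisy inputs, clean targets: the network learns to remove the noise
autoencoder.fit(x_noisy, x_clean, epochs=5, batch_size=32)
```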
How do PCA and autoencoder reconstructions compare for functions with increasing powers of x?
PCA reconstruction (features projected onto one principal component) and an autoencoder (features projected onto one learned feature) are both able to reconstruct a linear function well - both give a straight line.
However, PCA with a single feature is only capable of reproducing the linear relationship, whereas an autoencoder with a single feature can also capture the non-linear relationships.
What other method involves extraction of important features?
How does the auto encoder compare?
Principal component analysis (PCA) does a similar thing (it is linear, rather than network-based).
An autoencoder can be thought of as a non-linear version of PCA. The non-linear combination of features is produced in the coding region.
This non-linearity means that autoencoders may perform better when features are related non-linearly. In turn, autoencoders might require fewer features than PCA to give the same level of error. Eg PCA - 10 features, AE - 1 or 2 features.
Training an autoencoder involves a neural network, so we may incur problems (eg overfitting) that we may not have with PCA.
PCA involves only one line of code (eg diagonalising a matrix), whereas an autoencoder involves setting up a neural network, weights, etc.
PCA is easier to interpret; the autoencoder is harder to understand or interpret, as you have less intuition about the learned features.
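To illustrate the "one line of code" point: PCA essentially comes down to diagonalising the data covariance matrix. A minimal numpy sketch with illustrative data:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(500, 10))  # illustrative data, 10 features
Xc = X - X.mean(axis=0)                              # centre the data

# The key step: diagonalise the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Keep the two components with the largest eigenvalues and project onto them
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
X_reduced = Xc @ top2                                # shape (500, 2)
```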
What is a use of autoencoders, similar to PCAs?
Representation learning.
Sometimes (eg when we know something about the symmetry of a problem) we can choose the right features from our data. Quite often, the problem is too complex to do so - but we still might feel we are using more data than we need to.
Autoencoders project the input (nonlinear) into fewer dimensions, finding a smaller number of features that still represent the data very well.
There are similar advantages to PCA, but with added non-linearity.
How do we train an auto encoder?
We train it as we do a standard (deep) network.
In the dataset D, each input-output pair (x1, t1) can be written as (x1, x1), since the target output is the same as the input.
What is the error function for autoencoders?
Reconstruction error.
When writing the error function, eg for regression, what is one subtlety compared with previous error functions?
The use of the vector norm | | instead of ( ), because the inputs and outputs are now vectors.
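As an example, a typical sum-of-squares reconstruction error for an autoencoder with output y(x; w) would be (the exact form used in the lecture may differ):

$$ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \lVert \mathbf{y}(\mathbf{x}_n; \mathbf{w}) - \mathbf{x}_n \rVert^2 $$

where the vector norm replaces the simple squared difference used for scalar outputs.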
How do we optimise the errors for autoencoders?
We have the error function (reconstruction error). We can find the derivatives of the error with respect to the weights using back-propagation, and optimise the errors as usual.
What is the “bare minimum” basic architecture for an autoencoder?
The basic autoencoder architecture includes a bottleneck.
An autoencoder can already be produced with a single hidden layer in the encoder and decoder sections (three layers total).
Deeper networks are generally preferred. There are several advantages to using deep networks, including the amount of training data needed. For models involving images, convolutional layers are also likely to be needed in order to “pre-process” the data.
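A minimal Keras sketch of such a bottleneck architecture, with illustrative sizes (a single dense encoding layer down to the code, and a single decoding layer back up):

```python
from tensorflow import keras

n_inputs, n_code = 100, 10  # illustrative sizes; the code is smaller than the input

autoencoder = keras.Sequential([
    keras.layers.Input(shape=(n_inputs,)),
    keras.layers.Dense(n_code, activation="relu"),      # encoder -> code (bottleneck)
    keras.layers.Dense(n_inputs, activation="linear"),  # decoder -> reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction error
autoencoder.summary()
```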
What are other possible architectures for an auto encoder?
- Sparse autoencoders
The code layer no longer needs to have fewer nodes than inputs (no bottleneck). Instead, the error function has an extra term that penalises activations - this means that not every node is activated. Some information will get to the code layer and no further, because large activations are penalised.
- Contractive autoencoders
Here the error function also has an extra term - this time, to penalise large derivatives of activations with respect to the data ie situations where a small change in input gives a large change in output.
- Variational autoencoders.
These are similar to a standard autoencoder, but the code is now a probability distribution, rather than a set of values. We can draw from this probability distribution and send the result through the decoder, to give an output that isn’t in the training set, but should be from the same probability distribution. After training, we decouple the encoding and decoding regions.
What applications are there for autoencoders?
- Removing noise from images
- Compression of inputs
- Detection of anomalies - data points that are not likely to be from the same probability distribution as the training set. Putting these points through the autoencoder, they will be poorly reconstructed. Eg for hand-drawn letters: if we pass in a smiley face :) that was not seen in the training set, the autoencoder will not know how to deal with it and you will get back nonsense.
- Inference of missing values - similar to noise removal, we can train the network by removing features at random from the input data (to simulate missing data) and then the final network should be able to reconstruct the original point.
- Generative models - for these, we could consider variational autoencoders.
How are anomalies detected using auto encoder?
Putting these points through the autoencoder, they will be poorly reconstructed.
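A sketch of this check, assuming a Keras-style trained autoencoder with a .predict method; the threshold value here is an arbitrary illustrative choice that would normally be set from the training data:

```python
import numpy as np

def reconstruction_error(autoencoder, x):
    """Mean squared error between a data point and its reconstruction."""
    x = np.atleast_2d(x)
    x_rec = autoencoder.predict(x, verbose=0)
    return float(np.mean((x - x_rec) ** 2))

def is_anomaly(autoencoder, x, threshold=0.05):
    """Flag points that the autoencoder reconstructs poorly."""
    return reconstruction_error(autoencoder, x) > threshold
```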
What is a generative model used for?
Used to produce examples that are similar to its training set. Often re-cast in a probabilistic way - we model the probability distribution that produced the training data and we generate new points from that distribution.
We want to consider complex problems, such as training a model and showing it a variety of pictures, then having it generate new, unseen convincing pictures of these types.
What is a discriminative model?
A discriminative model takes in an input x and predicts an output z (classification or regression). More properly, in statistical terms it models P(z|x). That is, given an input x, what is the probability of output z? The final output of the model is then the most likely from this distribution.
What is a generative model?
A generative model is a model for the joint distribution P(x,z), from which we draw inputs and outputs.
P(x,z) = P(z|x) P(x)
What is a generative adversarial network (GAN)?
A generative adversarial network (GAN) is a deep learning architecture. It trains two neural networks to compete against each other to generate more authentic new data from a given training dataset.
Made up of two networks:
- Generative model
- Discriminative model
The two models play a zero-sum game. The generative model aims to fool the discriminative model, and the discriminative model aims not to be fooled (they have opposite goals).
What does the generative model of a GAN do?
Produces examples of whatever we want to generate eg generating images.
What does the discriminative model of a GAN do?
Decides whether or not the generative model’s examples are “real” examples from the correct distribution. ie works to correctly classify the images produced, deciding if it is real or fake.
What kind of data does the discriminative model take?
The discriminative model takes in a data point, x, which may be from the generative model (t=0) or a “real” example (t=1).
It then classifies this data point as either real (z=1) or fake (z=0).
Discuss the error function of a GAN?
There are two error functions, one for each model; they are combined into a single error function, so there are two terms.
Whenever the discriminator makes a prediction, the errors are updated according to whether the generator was able to “fool” it.
Eg if the generator produces an image and the discriminator correctly labels it “fake”, the generator error increases (it failed to fool the discriminator). If the discriminator labels it “real”, the discriminator error increases (it was fooled).
Eg if a real image is shown and the discriminator labels it “real”, the discriminator error does not increase. However, if it labels it “fake”, the discriminator error increases.
The weights are updated using gradients obtained from the discriminator loss function (and back-propagation).
How does updating the error influence the model?
When we start, what comes out of the generator will be garbage and the discriminator identifies this. But the generator will get better over time and the discriminator gets worse (it will make mistakes).
The discriminator is just a model for classifying inputs (C1 = “real” and C2 =”fake”) and can have any architecture.
Which model, the generative or discriminative, is harder to train?
Why?
A generative model is more difficult to train than a discriminative one - the two tasks are quite asymmetric.
Consider the two tasks:
- Model 1 decides whether or not a picture contains a human face - it is easier to recognise a face than to construct one yourself
- Model 2 has to put together a convincing human face - this is a much harder job
Describe the generative network, with the aid of the diagram.
Starts off with a random input - it doesn’t matter what goes into the generative model, it matters what goes out. This is transformed into a sample data point (an image), which goes into the discriminator. This classifies the data point, and the generator error is updated.
We then want to use the generator error to update the weights. This error depends on the weights of both generator and discriminator, but we want to only update the generator now.
Back-propagate through the discriminator to the generator, and update the weights only of the generator.
Discuss training the GAN.
We don’t want both networks to change at the same time. The generator should adapt to fool the discriminator as it currently is.
The discriminator should adapt to avoid being fooled by the current generator. We alternate between training each of the two models, with the other one kept constant. Each time we train, we send a number of examples either from the true distribution or the generative model through the discriminative model, and accumulate both errors.
We then only update one of the models.
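A minimal sketch of this alternating scheme in TensorFlow/Keras. The architectures, sizes, and optimisers are illustrative, and the generator update uses the common “non-saturating” loss (it maximises the probability of its fakes being labelled real) rather than the raw minimax form:

```python
import tensorflow as tf
from tensorflow import keras

latent_dim, data_dim = 16, 64   # illustrative sizes
generator = keras.Sequential([keras.layers.Input(shape=(latent_dim,)),
                              keras.layers.Dense(32, activation="relu"),
                              keras.layers.Dense(data_dim)])
discriminator = keras.Sequential([keras.layers.Input(shape=(data_dim,)),
                                  keras.layers.Dense(32, activation="relu"),
                                  keras.layers.Dense(1, activation="sigmoid")])

bce = keras.losses.BinaryCrossentropy()
g_opt = keras.optimizers.Adam(1e-4)
d_opt = keras.optimizers.Adam(1e-4)

def train_step(real_batch):
    batch = tf.shape(real_batch)[0]

    # 1) Train the discriminator with the generator held fixed
    with tf.GradientTape() as tape:
        fake = generator(tf.random.normal((batch, latent_dim)))
        d_loss = (bce(tf.ones((batch, 1)), discriminator(real_batch))
                  + bce(tf.zeros((batch, 1)), discriminator(fake)))
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # 2) Train the generator with the discriminator held fixed:
    #    it wants its fakes to be labelled "real"
    with tf.GradientTape() as tape:
        fake = generator(tf.random.normal((batch, latent_dim)))
        g_loss = bce(tf.ones((batch, 1)), discriminator(fake))
    g_grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss

real_batch = tf.random.normal((32, data_dim))  # stand-in for a batch of real data
print(train_step(real_batch))
```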
We alternate training the two networks - when do we stop?
When the generator is as good as it can get, the discriminator is no longer able to make predictions about whether a data point is real. That means it will make a completely random guess. ie the discriminator will get things right 50% of the time. At this point, we can stop the training and decouple the two networks.
What error (loss) function do we use for the GAN?
Classification - real or fake, therefore based on cross entropy error.
The discriminator wants to maximise the probability that it labels a real or fake image correctly. The generator wants to maximise the probability that the discriminator labels a fake image as real.
We use a minimax function which has two sums [see flashcard]
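For reference, the standard GAN objective (presumably what the flashcard shows) can be written with two sums, one over real examples x_n and one over generated samples G(z_m):

$$ V(D, G) = \frac{1}{N}\sum_{n=1}^{N} \log D(x_n) + \frac{1}{M}\sum_{m=1}^{M} \log\bigl(1 - D(G(z_m))\bigr) $$

The discriminator tries to maximise V while the generator tries to minimise it, giving the overall problem $\min_G \max_D V(D, G)$.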
Discuss the minimax function.
Minimax function [see flashcard] - note that if the function is written with the opposite sign, the maximising and minimising roles below are swapped.
The discriminative network wants to maximise this.
The generative network wants to minimise this.
The two are competing against each other - they have opposite goals for this error function.
Discuss a potential problem with GANs.
Mode collapse
The generator may learn to generate only one type of output (eg one particular digit). Eg if we build a generative model to generate pictures of animals (from a repository of real images), it might get very good at producing one kind of animal and stop improving: its other outputs are not convincing and get caught by the discriminator, so the generator stays in a local minimum.
We need to be careful about where we stop training. When the generator is “good enough”, the responses of the discriminator are random. The generator will then start to train on this random feedback.
How are conditional GANs trained?
Using labelled data, and a label is supplied to the generator, to give a particular type of output.
What are other examples of generative models?
- Large language models (LLM) - the underlying ideas are familiar, but there are some new things too. In particular, reinforcement learning.
What is reinforcement learning?
Reinforcement Learning (RL) is a branch of machine learning focused on making decisions to maximise cumulative rewards in a given situation.
It is distinct from supervised / unsupervised learning. Unlike supervised learning, which relies on a training dataset with predefined answers, RL involves learning through experience. In RL, an agent learns to achieve a goal in an uncertain, potentially complex environment by performing actions and receiving feedback through rewards or penalties. ie the model takes actions and is “rewarded” or “punished” for them. There is no right or wrong answer, the model learns for itself how to perform a task.