Neural Networks Assignment Flashcards

1
Q

What is a neural network?

A

It’s a technique for building a computer program that learns from data. It is based very loosely on how we think the human brain works. First, a collection of software “neurons” is created and connected together, allowing them to send messages to each other. Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure.

2
Q

In this assignment we first built up a ‘shallow’ neural network. How did we do this?

A

Changed the number of hidden layers from 2 to 0. Under DATA, selected the dataset with the two clearly separated clusters. Set the activation to “Linear”.

3
Q

What purpose does this simple neural network serve?

A

This network will try to use two variables, x1 and x2 (the input, or independent, variables), to assign the different observations (dots in the rightmost graph) to their correct class (the dependent variable). In other words, it will try to classify orange dots as orange (class 0) and blue dots as blue (class 1).

In a classic practical example, the input variables could be weight and height and the classes could represent gender which is being estimated.

4
Q

What does it mean to say that this network is untrained?

A

It is configured with random weights (sometimes called “parameters”). These weights will be multiplied and summed with the input variables (and rescaled using the activation function, which we’ll discuss in detail later, so you can ignore this for now). This multiplication-and-summing process of the variables and weights is an integral component of neural networks.

As an analogy, try to think of training as baking a cake, in which the input variables represent the ingredients and the weights represent the quantities of the ingredients.
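
The multiplication-and-summing step can be sketched in a few lines. This is only an illustration: the function name, the input values, and the rule "positive score means blue" are assumptions made for the example, not details taken from the website.

```python
import random

def linear_output(inputs, weights, bias=0.0):
    """Multiply each input by its weight, sum the products, and add the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

random.seed(0)  # fixed seed so the "random" starting weights are reproducible
weights = [random.uniform(-1.0, 1.0) for _ in range(2)]  # one weight per input

observation = [0.5, -1.2]  # hypothetical values for x1 and x2
score = linear_output(observation, weights)
predicted_class = 1 if score > 0 else 0  # assumed rule: positive score -> blue (class 1)
print(weights, score, predicted_class)
```

Because the weights are random, the predicted class is essentially a guess at this point, which is what "untrained" means.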

5
Q

How many weights does this particular network have?

A

2: one weight for each input variable (x1 and x2), not counting the bias.

6
Q

Describe what the rightmost plot represents (orange and blue dots against an orange and blue background) and what name is given to it.

A

Because the network starts off with random weights, its initial performance will most likely be quite bad. You can see this in the rightmost plot, which visualises the “decision surface”: it colours the space according to the class the network would predict for an observation at that location. So, dots in blue regions will be classified as blue and dots in orange regions will be classified as orange. The saturation of the colour represents the confidence of the decision (more saturation indicates a higher confidence).

7
Q

What is meant by loss in these networks?

A

How badly the network is currently classifying the observations is summarised as its loss (or sometimes “cost”). Loss can be computed in different ways using different loss functions, but in general a higher loss means a worse-performing network.
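
As one possible illustration, the sketch below uses mean squared error as the loss function. That choice is an assumption made for the example (the card does not say which loss function is used), and the prediction values are made up.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

targets = [0.0, 0.0, 1.0, 1.0]           # true classes (orange = 0, blue = 1)
good_predictions = [0.1, 0.2, 0.9, 0.8]  # close to the targets
bad_predictions = [0.9, 0.8, 0.1, 0.2]   # mostly wrong

good_loss = mean_squared_error(good_predictions, targets)
bad_loss = mean_squared_error(bad_predictions, targets)
print(good_loss, bad_loss)  # the worse predictions give the higher loss
```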

8
Q

The current loss of the (untrained) network is displayed underneath OUTPUT (upper right corner on the website). This is reported separately for the “train” and “test” set. Why is there this distinction?

A

The train and test set are often drawn from the same dataset by splitting the full dataset into two partitions (e.g., 50% train, 50% test). The train set is subsequently used to, well, train the network and the test set is used to evaluate how well the network is doing.
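
A minimal sketch of such a split, assuming the observations are shuffled before being cut in two. The function name, the seed, and the stand-in data are hypothetical.

```python
import random

def train_test_split(observations, train_fraction=0.5, seed=0):
    """Shuffle a copy of the data and cut it into a train and a test partition."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))  # stand-in for ten observations
train_set, test_set = train_test_split(data, train_fraction=0.5)
print(train_set, test_set)
```

Each observation ends up in exactly one partition, so the test loss is computed on data the network never saw during training.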

9
Q

What does training refer to in neural networks?

A

In the context of neural networks, training refers to the process of (iteratively) adjusting the network’s weights based on the loss such that, over time, the loss becomes as small as possible.
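
The loop below is a rough sketch of this idea, assuming plain gradient descent on a mean squared error loss for a network without hidden layers. The toy data, learning rate, and function names are invented for the example; the website may use a different loss and update rule.

```python
def predict(weights, inputs):
    """Multiplication-and-summing with a linear activation."""
    return sum(w * x for w, x in zip(weights, inputs))

def loss(weights, data):
    """Mean squared error over (inputs, target) pairs."""
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def train(data, weights, learning_rate=0.1, epochs=100):
    """Repeatedly nudge each weight against the gradient of the loss."""
    history = [loss(weights, data)]
    for _ in range(epochs):
        gradients = [0.0] * len(weights)
        for x, y in data:
            error = predict(weights, x) - y
            for i, x_i in enumerate(x):
                gradients[i] += 2 * error * x_i / len(data)
        weights = [w - learning_rate * g for w, g in zip(weights, gradients)]
        history.append(loss(weights, data))
    return weights, history

# Hypothetical toy data: two inputs per observation, target +1 or -1.
data = [([1.0, 1.0], 1.0), ([2.0, 1.5], 1.0),
        ([-1.0, -1.0], -1.0), ([-2.0, -1.5], -1.0)]
weights, history = train(data, weights=[0.0, 0.0])
print(history[0], history[-1])  # loss before training vs. after 100 epochs
```

The `history` list plays the role of the loss curve on the website: it starts high and shrinks as the weights are adjusted.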

10
Q

Name and describe two settings or options that influence the training procedure, other than the learning rate and the amount of regularisation.

A

There are a couple of options, or settings, that influence the training procedure. One is the specific loss function (sometimes called “objective function”) used; this cannot be changed on this website. Another important one is “batch size”, which represents the number of observations that will be passed through the network before computing the loss and updating the weights based on this loss. This might sound weird to you, as many of the more traditional statistical models you are familiar with just use all data available. However, many (deep) neural networks are trained on massive datasets, which may include millions of observations. Passing such amounts of data to a network at once will crash any computer! Therefore, networks are often iteratively trained on batches of observations with a particular batch size.

11
Q

Suppose that our dataset has 500 examples and we set the batch size to 25. After 200 epochs, how many times has our network updated its weights?

A

500 / 25 = 20 weight updates per epoch

20 x 200 = 4,000 weight updates in total
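
The same arithmetic as a small helper, for checking other combinations. The function name is hypothetical, and rounding up (so that a final, smaller batch still counts as an update) is an assumption that does not matter here because 500 divides evenly by 25.

```python
import math

def weight_updates(n_examples, batch_size, epochs):
    """Count weight updates: one update per batch, every epoch."""
    batches_per_epoch = math.ceil(n_examples / batch_size)  # a final partial batch still counts
    return batches_per_epoch * epochs

print(weight_updates(500, 25, 200))  # 20 batches per epoch * 200 epochs = 4000
```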

12
Q

What are epochs?

A

An epoch is one full pass of the dataset (i.e., all observations) through the network; the number of epochs counts how many such passes have been completed.

13
Q

What is meant by convergence?

A

You can see that the network’s weights change relatively little after 100 epochs or so. This shows up in the graph next to the loss values (with loss on the y-axis and epochs on the x-axis), where the loss curve flattens out. When the weights (and therefore the loss) stop changing appreciably over time, the network is said to have converged.

14
Q

When we changed the dataset to the one with the orange ring with the blue cluster inside, rather than the dataset with the two clearly separated clusters, how did this make the problem more challenging for the neural network?

A

So far, we have dealt with a relatively easy problem: classifying observations drawn from two distinct, relatively noiseless, clusters. Importantly, this represented a “linearly separable” problem: accurate classification could be achieved by drawing a straight (i.e., non-curved) line. This type of problem can often be solved by relatively simple models, including models from traditional statistics (such as logistic regression and linear discriminant analysis). This type of problem is not where neural networks shine. The new dataset is a nonlinear problem (i.e., its observations cannot be accurately predicted by drawing a straight line).

15
Q

Why was our network horrible at solving this?

A

Our network is completely linear and thus cannot solve a nonlinear problem! Any network that involves only the multiplication-and-summing of values and weights (in combination with a linear activation function) can only solve linearly separable problems.
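
A small experiment makes this concrete. Take three made-up observations from the ring dataset: a blue one at the centre and two orange ones on opposite sides of the ring. For a linear score w1·x1 + w2·x2 + b, the scores of the two opposite points always add up to 2b, so if the centre scores positive (blue), at least one of the ring points does too. The points, the weight ranges, and the rule "positive score means blue" are assumptions made for this sketch.

```python
import random

def score(weights, bias, point):
    return sum(w * x for w, x in zip(weights, point)) + bias

def classify(weights, bias, point):
    return 1 if score(weights, bias, point) > 0 else 0  # 1 = blue, 0 = orange

# Three hypothetical observations: blue at the centre, orange on opposite sides of the ring.
points = [(0.0, 0.0), (3.0, 0.0), (-3.0, 0.0)]
labels = [1, 0, 0]

rng = random.Random(0)
always_an_error = True
for _ in range(10_000):
    weights = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
    bias = rng.uniform(-5, 5)
    predictions = [classify(weights, bias, p) for p in points]
    if predictions == labels:
        always_an_error = False
print(always_an_error)  # True: no linear setting gets all three points right
```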

16
Q

How can you make this network non-linear? (2)

A

There are different ways to do this. One popular way is to add hidden layers. Hidden layers are like an intermediary step between the input and the output. The nodes (or units/neurons/variables) of the hidden layers are the result of a separate multiplication + sum + activation step of the previous layer (in this case, the input layer). The input layer is “wired” to the hidden layer: each input unit has a connection with each hidden unit. (This is the reason why people sometimes call these types of networks “fully-connected networks”.) Also, note that the hidden layer is now directly connected to the output.

Hidden layers alone are not enough, however: with a linear activation, this network still only involves linear operations (multiplication-and-summing), and any network that only involves linear operations won’t ever be able to solve a nonlinear problem. The activation function can be thought of as a function that rescales the data: it receives the result of the multiplication-and-summing operation (a single value) and outputs a rescaled version of that value. The way the data is rescaled depends on the activation function, but what most of them have in common is that they are nonlinear functions (except for, as you would’ve guessed, the linear activation function). As such, we can use a nonlinear activation function (such as tanh) to “inject” nonlinearity into our network!
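
One way to see the difference is to test additivity: a purely linear network (without biases) always satisfies f(a + b) = f(a) + f(b), while a network with a tanh hidden layer does not. The sketch below checks this for one hidden layer; the weights and inputs are made up, and biases are left out to keep the check simple.

```python
import math

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output, no biases.
HIDDEN_WEIGHTS = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]]
OUTPUT_WEIGHTS = [[1.0, -0.5, 2.0]]

def layer(weights, inputs, activation):
    """One multiplication + sum + activation step; one row of weights per unit."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def network(inputs, activation):
    hidden = layer(HIDDEN_WEIGHTS, inputs, activation)
    return layer(OUTPUT_WEIGHTS, hidden, lambda z: z)[0]  # linear output unit

a, b = [0.8, -0.3], [-0.2, 0.6]
a_plus_b = [x + y for x, y in zip(a, b)]

gaps = {}
for name, activation in [("linear", lambda z: z), ("tanh", math.tanh)]:
    separate = network(a, activation) + network(b, activation)
    gaps[name] = abs(network(a_plus_b, activation) - separate)
    print(name, gaps[name])
```

The gap for the linear activation is zero up to floating-point rounding, whereas the tanh network gives a clearly nonzero gap: the hidden layer only adds expressive power once the activation is nonlinear.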

17
Q

Suppose we have a network with 3 input variables and two hidden layers with each five hidden units; how many weights (excluding biases) does this network have?

A

3 input variables have connections to each hidden layer node = 3 x 5 = 15

Each hidden layer node in the first layer has a connection to each in the second layer = 5 x 5 = 25

Each node in the second layer has a connection to the output = 5 x 1 = 5

15 + 25 + 5 = 45
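
The count generalises to any fully-connected network: multiply the sizes of each pair of adjacent layers and add the results. The helper below is a sketch with a hypothetical name; it takes the layer sizes from input to output and ignores biases.

```python
def count_weights(layer_sizes):
    """Number of connections in a fully-connected network, excluding biases."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_weights([3, 5, 5, 1]))  # 3*5 + 5*5 + 5*1 = 45
print(count_weights([2, 1]))        # two inputs wired straight to the output: 2
```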

18
Q

After about 100 epochs, you should see that the network accurately classifies most observations. How does it do this?

A

By combining the (slightly) nonlinear units, or “representations”, from the hidden layer to create a nonlinear decision surface. The final decision surface is the result of the combination of the hidden layer decision surfaces of each neuron!

19
Q

So far, we have used a rather “idealised”, mostly noiseless dataset. Real data is often not so neat. We can simulate this by setting the noise level to 50 (drag the slider all the way to the right). Also, let’s pretend that we actually have less data by reducing the train set; do this by setting the “Ratio of training to test data” to 30%.

What is observed?

A

You’ll still notice the orange circle + blue cluster inside it, but it’s noisier (the orange and blue dots are more intermixed). It will also take longer to reach a reasonable decision surface than before, because the data is noisier. You will also see that the losses on the train and test sets seem to diverge over time: the loss on the train set is decreasing, as expected, while the loss on the test set is increasing…

20
Q

What is happening to cause the loss of the train and test set data to diverge?

A

Overfitting: the divergence between the model’s accuracy on the train set and its accuracy on the test set. It happens more often in scenarios with relatively little data, a lot of noise, and/or when using very flexible and powerful models (e.g., neural networks with lots of layers and units per layer).

21
Q

There are different ways to counter overfitting. Name four.

A

You could try to get better measurements (less noise in your data), collect more data, or use a less flexible/powerful model. Another often-used technique is regularisation.

22
Q

What does regularisation do?

A

This technique tries to balance the learning process by imposing constraints on the weights (usually by limiting how large they can become; e.g., “L1” or “L2” regularisation) or even randomly “turning off” hidden units from the network during training (“dropout” regularisation). Often, when the amount of regularisation is chosen appropriately (which is sometimes a matter of trying out different settings), this leads to a better predictive generalisation of the model.
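
A rough sketch of how an L1 or L2 penalty enters the loss, assuming the penalty is simply added to the data loss after scaling by the regularisation rate. The function names and numbers are made up, and the exact scaling (for example an extra factor of 1/2 for L2) varies between libraries.

```python
def l1_penalty(weights):
    """Sum of absolute weight values."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared weight values."""
    return sum(w ** 2 for w in weights)

def regularised_loss(data_loss, weights, rate, penalty=l2_penalty):
    """Data loss plus a weight penalty scaled by the regularisation rate."""
    return data_loss + rate * penalty(weights)

small_weights = [0.1, -0.2, 0.3]
large_weights = [4.0, -5.0, 6.0]
small_total = regularised_loss(0.25, small_weights, rate=0.03)
large_total = regularised_loss(0.25, large_weights, rate=0.03)
print(small_total, large_total)  # same data loss, but large weights are penalised more
```

Because large weights raise the total loss, training is pushed towards smaller weights, which limits how flexibly the network can bend its decision surface around noise.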

23
Q

If you set the regularisation to “L2” and the “Regularisation rate” to 0.03 and retrain the network what is observed? What is this called?

A

You should see that the difference between the train and test set loss becomes much smaller (or even inverts, a phenomenon called “underfitting”, which suggests that the regularisation rate may be too high).

24
Q

As you’ve seen, training neural networks can be quite finicky. Finding a set of hyperparameters that work may involve a lot of manual tuning (i.e., trying out different settings and seeing what works). One issue with this practice of hyper-parameter tuning is that it may lead to overfitting. Explain why this happens, even with a separate train and test set, and what you could do to prevent this.

A

Because we keep changing the hyperparameters to get a better fit, we might start fitting to noise. If we repeatedly check performance on the test set and then adjust the settings to improve it, we start fitting to the noise in the test data as well, and the test set is no longer independent of the training process.

Even with “manual” hyperparameter tuning, the network may pick up on noise rather than signal. One solution would be to add yet another dataset to function as a truly independent test set.
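
One common way to set this up is a three-way split, in which a separate “validation” set is used for tuning and the test set is touched only once at the end. The sketch below assumes a 60/20/20 split; the fractions, the function name, and the stand-in data are illustrative choices, not part of the assignment.

```python
import random

def three_way_split(observations, train_fraction=0.6, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the data and cut it into train, validation, and test partitions."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]                               # used to fit the weights
    validation = shuffled[n_train:n_train + n_validation]    # used to compare hyperparameters
    test = shuffled[n_train + n_validation:]                 # used once, for the final evaluation
    return train, validation, test

train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))  # 60 20 20
```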