Neural Networks Assignment Flashcards
What is a neural network?
It’s a technique for building a computer program that learns from data. It is based very loosely on how we think the human brain works. First, a collection of software “neurons” is created and connected, allowing the neurons to send messages to each other. Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure.
In this assignment we first built up a ‘shallow’ neural network. How did we do this?
Changed the number of hidden layers from 2 to 0. Under DATA, selected the dataset with the two clearly separated clusters. Set the activation to “Linear”.
What purpose does this simple neural network serve?
This network will try to use two variables, x1 and x2 (the input, or independent variables), to classify the different observations (dots in the rightmost graph) as their correct class (the dependent variable). In other words, it will try to classify orange dots as orange (class 0) and blue dots as blue (class 1).
In a classic practical example, the input variables could be weight and height, and the classes could represent a category to be estimated, such as gender.
What does it mean to say that this network is untrained?
It is configured with random weights (sometimes called “parameters”). Each input variable is multiplied by its weight and the results are summed (and rescaled using the activation function, which we’ll discuss in detail later, so you can ignore it for now). This multiplication-and-summing of variables and weights is an integral component of neural networks.
As an analogy, try to think of training as baking a cake, in which the input variables represent the ingredients and the weights represent the quantities of the ingredients.
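As a minimal sketch of that multiplication-and-summing step (assuming NumPy; the input values are made up, and the real network on the website also learns a bias term, omitted here):

```python
import numpy as np

# A minimal sketch of the multiplication-and-summing step for one observation,
# assuming two inputs with made-up values and randomly initialised weights.
rng = np.random.default_rng(seed=1)

x = np.array([1.5, -0.3])   # input variables x1 and x2 (hypothetical values)
w = rng.normal(size=2)      # random, untrained weights (one per input)

output = x @ w              # multiply each input by its weight, then sum
print(output)               # a linear activation leaves this value unchanged
```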
How many weights does this particular network have?
2 (one weight for each input variable, x1 and x2)
Describe what the rightmost plot represents (orange and blue dots against an orange and blue background) and what name is given to it.
Because the network starts off with random weights, its initial performance will most likely be quite bad. You can see this in the rightmost plot, which visualises the “decision surface”: it colours the space according to the class the network would predict for an observation at that location. So, dots in blue regions will be classified as blue and dots in orange regions will be classified as orange. The saturation of the colour represents the confidence of the decision (more saturation indicates higher confidence).
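For intuition, here is a rough sketch (assuming NumPy and a made-up linear classifier) of how such a surface is produced: every point in the space is assigned the class the model would predict there, and the magnitude of the underlying score plays the role of colour saturation:

```python
import numpy as np

# Sketch of a decision surface for a hypothetical linear classifier that
# predicts "blue" when w.x + b > 0 and "orange" otherwise. The magnitude of
# w.x + b would correspond to the colour saturation (confidence).
w, b = np.array([0.8, -0.5]), 0.1

grid = np.linspace(-6, 6, 7)
for x2 in reversed(grid):   # print the top row of the plot first
    row = ["blue" if np.dot(w, [x1, x2]) + b > 0 else "orange" for x1 in grid]
    print(" ".join(f"{c:>6}" for c in row))
```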
What is meant by loss in these networks?
How badly the network is currently classifying the observation is summarised as its loss (or sometimes “cost”). Loss can be computed in different ways using different loss functions, but in general a higher loss means a worse performing network.
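For example, here is a sketch of one common loss function, the mean squared error (the website’s actual loss function may differ; the predictions are made up):

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    # Mean squared error: one of several possible loss functions.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([0.0, 1.0, 1.0, 0.0])   # true classes (orange=0, blue=1)
good = squared_error_loss(y_true, np.array([0.1, 0.9, 0.8, 0.2]))
bad = squared_error_loss(y_true, np.array([0.9, 0.2, 0.1, 0.8]))
print(good, bad)   # the worse predictions yield the higher loss
```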
The current loss of the (untrained) network is displayed underneath OUTPUT (upper right corner on the website). This is reported separately for the “train” and “test” set. Why is there this distinction?
The train and test set are often drawn from the same dataset by splitting the full dataset into two partitions (e.g., 50% train, 50% test). The train set is subsequently used to, well, train the network, while the test set is used to evaluate how well the network performs on data it has never seen. Keeping these sets separate reveals whether the network generalises or has merely memorised (overfit) the training data.
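A minimal sketch of such a 50/50 split, assuming NumPy and a made-up dataset:

```python
import numpy as np

# Sketch of a 50/50 train/test split on a hypothetical dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                  # 500 observations, 2 variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy labels

idx = rng.permutation(len(X))                  # shuffle before splitting
train_idx, test_idx = idx[:250], idx[250:]
X_train, y_train = X[train_idx], y[train_idx]  # used to fit the weights
X_test, y_test = X[test_idx], y[test_idx]      # used only for evaluation
```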
What does training refer to in neural networks?
In the context of neural networks, training refers to the process of (iteratively) adjusting the network’s weights based on the loss such that, over time, the loss becomes as small as possible.
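To make this concrete, here is a minimal sketch of iteratively adjusting weights via gradient descent on a toy linear model with a squared-error loss (an illustration of the general idea, not the website’s exact procedure; all data are made up):

```python
import numpy as np

# Minimal gradient-descent sketch: repeatedly nudge the weights in the
# direction that reduces a squared-error loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)   # toy labels

w = rng.normal(size=2)                       # random starting weights
lr = 0.1                                     # learning rate (step size)
for epoch in range(100):
    y_pred = X @ w                           # forward pass (linear model)
    grad = 2 * X.T @ (y_pred - y) / len(y)   # gradient of the loss w.r.t. w
    w -= lr * grad                           # adjust weights to lower the loss
```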
Name and describe two settings or options which influence the training procedure outside of the learning rate and amount of regularisation
There are a couple of options, or settings, that influence the training procedure. One is the specific loss function (sometimes called “objective function”) used; this cannot be changed on this website. Another important one is “batch size”: the number of observations that are passed through the network before the loss is computed and the weights are updated based on this loss. This might sound weird to you, as many of the more traditional statistical models you are familiar with simply use all available data at once. However, many (deep) neural networks are trained on massive datasets, which may include millions of observations; passing that much data through a network at once would exhaust the memory of most computers. Therefore, networks are often iteratively trained on batches of observations with a particular batch size, as sketched below.
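A sketch of what iterating over batches looks like, assuming NumPy and a made-up dataset of 500 observations:

```python
import numpy as np

# Sketch: feed the network one batch at a time instead of the full dataset.
X = np.random.default_rng(0).normal(size=(500, 2))   # hypothetical dataset
batch_size = 25

for start in range(0, len(X), batch_size):
    batch = X[start:start + batch_size]   # 25 observations at a time
    # the forward pass, loss computation, and weight update would go here,
    # once per batch rather than once for the entire dataset
```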
Suppose that our dataset has 500 examples and we set the batch size to 25. After 200 epochs, how many times has our network updated its weights?
500 / 25 = 20 weight updates per epoch (one update per batch)
20 updates × 200 epochs = 4,000 weight updates in total
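The same arithmetic as a snippet:

```python
n_examples, batch_size, n_epochs = 500, 25, 200

updates_per_epoch = n_examples // batch_size   # 500 / 25 = 20 batches
total_updates = updates_per_epoch * n_epochs   # 20 updates x 200 epochs
print(total_updates)                           # 4000
```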
What are epochs?
Epochs are the number of times the full dataset (i.e., all observations) has passed through the network.
What is meant by convergence?
You can see that the network’s weights change relatively little after 100 epochs or so. This is nicely visualised as a graph next to the loss values (with loss on the y-axis and epochs on the x-axis). When a network’s weights (and thus its loss) effectively stop changing over time, the network is said to have converged.
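A sketch of a simple convergence check, using a simulated (made-up) loss curve rather than a real network:

```python
import numpy as np

# Sketch: declare convergence once the loss stops changing appreciably
# between epochs. The decaying loss curve here is simulated.
losses = 0.5 * np.exp(-0.05 * np.arange(300)) + 0.01

for epoch in range(1, len(losses)):
    if abs(losses[epoch] - losses[epoch - 1]) < 1e-5:
        print(f"converged around epoch {epoch}")
        break
```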
When we changed the dataset from the one with the two clearly separated clusters to the one with the orange ring and the blue cluster inside it, how did this make the problem more challenging for the neural network?
So far, we have dealt with a relatively easy problem: classifying observations drawn from two distinct, relatively noiseless clusters. Importantly, this represented a “linearly separable” problem: accurate classification could be achieved by drawing a straight (i.e., non-curved) line between the classes. This type of problem can often be solved by relatively simple models, including models from traditional statistics (such as logistic regression and linear discriminant analysis), and is not where neural networks shine. The new dataset is a nonlinear problem: its observations cannot be accurately separated by a straight line.
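For intuition, here is a sketch (assuming NumPy; all locations and scales are made up) of generating data resembling the two datasets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable: two well-separated clusters (a straight line works).
cluster_a = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(100, 2))
cluster_b = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(100, 2))

# Not linearly separable: a cluster inside a ring; no straight line can put
# the whole ring on one side and the inner cluster on the other.
theta = rng.uniform(0, 2 * np.pi, size=100)
ring = np.c_[5 * np.cos(theta), 5 * np.sin(theta)] + rng.normal(scale=0.3, size=(100, 2))
inner = rng.normal(scale=1.0, size=(100, 2))
```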
Why was our network horrible at solving this?
Our network is completely linear and thus cannot solve a nonlinear problem! Any network that involves only the multiplication-and-summing of values and weights (in combination with a linear activation function) can only solve linearly separable problems, as the sketch below demonstrates.
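A quick sketch of why (assuming NumPy; the layer sizes are arbitrary): stacking purely linear layers is mathematically equivalent to a single linear layer, so extra layers add no expressive power.

```python
import numpy as np

# Sketch: a "deep" network made only of multiplications and sums (linear
# activations) collapses into a single linear layer, so it can still only
# draw straight decision boundaries.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))          # weights of a hypothetical hidden layer
W2 = rng.normal(size=(4, 1))          # weights of the output layer

x = rng.normal(size=(1, 2))           # one observation with inputs x1, x2
two_layers = x @ W1 @ W2
one_layer = x @ (W1 @ W2)             # an equivalent single weight matrix
print(np.allclose(two_layers, one_layer))   # True: the extra layer added nothing
```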