Intro ANN Flashcards
Why is it generally preferable to use a logistic regression classifier rather than a classical perceptron (i.e., a single layer of threshold logic units trained using the perceptron training algorithm)?
A classical perceptron will converge only if the dataset is linearly separable, and it won’t be able to estimate class probabilities. In contrast, a logistic regression classifier will converge to a good solution even if the dataset is not linearly separable, and it will output class probabilities.
How can you tweak a perceptron to make it equivalent to a logistic regression classifier?
If you change the perceptron activation function to the logistic activation function, and if you train it using gradient descent, then it becomes equivalent to a logistic regression classifier.
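For illustration, a minimal NumPy sketch of this idea (the function and variable names such as train_logistic_unit, X, y, and lr are illustrative, not from the card):

```python
# One artificial neuron with a logistic activation, trained by gradient
# descent on the log loss — this behaves like logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_unit(X, y, lr=0.1, n_epochs=1000):
    """X: (m, n) inputs, y: (m,) 0/1 labels; returns learned weights and bias."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_epochs):
        p = sigmoid(X @ w + b)        # predicted probabilities
        grad_w = X.T @ (p - y) / m    # gradient of the log loss w.r.t. w
        grad_b = np.mean(p - y)       # gradient w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```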
Why was the logistic activation function a key ingredient in training the first MLP?
Its derivative is always nonzero, so gradient descent can always roll down the slope. When the activation function is a step function, gradient descent cannot move, as there is no slope at all.
Name four popular activation functions.
1) Step function,
2) logistic function
3) hyperbolic tangent function
4) Rectified linear unit function.
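A quick sketch of these four functions in NumPy (purely illustrative):

```python
import numpy as np

def step(z):      # Heaviside step: 0 for z < 0, 1 for z >= 0
    return np.where(z >= 0, 1.0, 0.0)

def logistic(z):  # sigmoid, squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):      # hyperbolic tangent, squashes values into (-1, 1)
    return np.tanh(z)

def relu(z):      # rectified linear unit: max(0, z)
    return np.maximum(0.0, z)
```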
Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use ReLU activation.
a) What is the shape of the input matrix X?
b) What are the shapes of the hidden layer's weight matrix Wh and its bias vector Bh?
c) What are the shapes of the output layer's weight matrix Wo and its bias vector Bo?
d) What is the shape of the network's output matrix Y?
e) Write the equation that computes the network's output matrix Y as a function of X, Wh, Bh, Wo, and Bo.
a) The shape of the input matrix X is m × 10, where m represents the training batch size.
b) The shape of the hidden layer's weight matrix Wh is 10 × 50, and the length of its bias vector Bh is 50.
c) The shape of the output layer's weight matrix Wo is 50 × 3, and the length of its bias vector Bo is 3.
d) The shape of the network's output matrix Y is m × 3.
e) Y = ReLU(ReLU(X Wh + Bh) Wo + Bo)
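A small shape check for part e), assuming NumPy and an arbitrary batch size m = 32 (the random values are just placeholders):

```python
import numpy as np

m = 32
X  = np.random.rand(m, 10)   # input matrix, m x 10
Wh = np.random.rand(10, 50)  # hidden-layer weights
Bh = np.zeros(50)            # hidden-layer biases
Wo = np.random.rand(50, 3)   # output-layer weights
Bo = np.zeros(3)             # output-layer biases

relu = lambda z: np.maximum(0.0, z)
Y = relu(relu(X @ Wh + Bh) @ Wo + Bo)
print(Y.shape)               # (32, 3), i.e. m x 3
```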
a) In an MLP, how many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer?
b) If instead you want to tackle MNIST, how many neurons do you need in the output layer, and which activation function should you use?
c) What about for getting your network to predict housing prices?
a) You just need one neuron in the output layer. You would typically use the logistic activation function in the output layer when estimating a probability.
b) For the MNIST dataset, you need 10 neurons in the output layer, and you must replace the logistic function with the softmax activation function, which can handle multiple classes.
c) You need 1 output neuron, using no activation function at all in the output layer.
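For illustration, the three output-layer setups sketched with tf.keras (assuming that library; the variable names are illustrative):

```python
from tensorflow import keras

spam_output    = keras.layers.Dense(1, activation="sigmoid")   # a) one spam probability
mnist_output   = keras.layers.Dense(10, activation="softmax")  # b) one probability per digit class
housing_output = keras.layers.Dense(1)                         # c) no activation, raw price
```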
What is backpropagation and how does it work?
Backpropagation is a technique used to train artificial neural networks. It first computes the gradients of the cost function with regard to every model parameter (all the weights and biases), then it performs a Gradient Descent step using these gradients. This backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that minimize the cost function.
In other words, this algorithm can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step.
What is the difference between backpropagation and reverse-mode autodiff?
Backpropagation refers to the whole process of training an artificial neural network using multiple backpropagation steps. In contrast, reverse-mode autodiff is just a technique to compute gradients efficiently, and it happens to be used by backpropagation.
To compute the gradients, backpropagation uses reverse-mode autodiff. Reverse-mode autodiff performs a forward pass through a computation graph, computing every node’s value for the current training batch, and then it performs a reverse pass, computing all the gradients at once.
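A tiny hand-written example of a reverse-mode pass for f(w, b) = (w·x + b − y)², with the forward pass followed by a single reverse sweep (the numeric values are illustrative):

```python
x, y = 2.0, 1.0       # one training example
w, b = 0.5, 0.1       # current parameters

# Forward pass: compute and store every node's value
z = w * x + b         # node 1
e = z - y             # node 2 (error)
f = e ** 2            # node 3 (loss)

# Reverse pass: propagate d(f)/d(node) back through the graph in one sweep
df_de = 2 * e         # d f / d e
df_dz = df_de * 1.0   # d e / d z = 1
df_dw = df_dz * x     # d z / d w = x  -> gradient for w
df_db = df_dz * 1.0   # d z / d b = 1  -> gradient for b
```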
Can you list all the hyperparameters you can tweak in a basic MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?
1) Number of hidden layers,
2) number of neurons in each hidden layer
3) The activation function used in each hidden layer and in the output layer. (ReLU activation function is a good default for the hidden layers)
4) The learning rate
If the MLP overfits the training data, you can try reducing the number of hidden layers and reducing the number of neurons per hidden layer.
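A hedged sketch of where these hyperparameters appear when building an MLP, assuming tf.keras (the function build_mlp and its default values are illustrative):

```python
from tensorflow import keras

def build_mlp(n_inputs, n_outputs, n_hidden=2, n_neurons=50,
              activation="relu", learning_rate=1e-3):
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_inputs,)))
    for _ in range(n_hidden):                       # number of hidden layers / neurons
        model.add(keras.layers.Dense(n_neurons, activation=activation))
    model.add(keras.layers.Dense(n_outputs, activation="softmax"))
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=keras.optimizers.SGD(learning_rate=learning_rate))
    return model
```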
What is a perceptron and how does it work ?
A perceptron is an algorithm for supervised learning of binary classifiers. It enables artificial neurons to learn, and it processes the elements of the training set one at a time. It is based on a threshold logic unit.
The perceptron algorithm learns the weights for the input signals in order to draw a linear decision boundary. The weighted inputs are summed and then passed through the activation function. This enables you to distinguish between the two linearly separable classes +1 and -1 (or 0 and 1).
What is the perceptron function?
f(x) = {1 if w·x + b > 0,
0 otherwise}
w: vector of real-valued weights
b: the bias (an element that shifts the decision boundary away from the origin, without any dependence on the input values)
x: vector of input x values.
w·x = Σ wᵢxᵢ for i = 1 … m (the dot product of the weights and the inputs)
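A minimal sketch of the decision function and the perceptron learning rule in NumPy (names like perceptron_train and eta are illustrative):

```python
import numpy as np

def predict(w, b, x):
    """Threshold logic unit: 1 if w·x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

def perceptron_train(X, y, eta=1.0, n_epochs=10):
    """y holds 0/1 labels; weights are updated one example at a time."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            error = yi - predict(w, b, xi)   # 0 if correct, +/-1 if wrong
            w += eta * error * xi            # perceptron learning rule
            b += eta * error
    return w, b
```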
What is an MLP? What are its three components? In the perceptron, there are two symbols, what do they represent?
It is a multilayer perceptron. The three components are:
1) The Input layer takes in the features.
2) The Hidden Layers, which are neuron nodes stacked in between inputs and outputs, allowing neural networks to learn more complicated features.
3) The output layer outputs the results or predictions.
The sum represents the input function and the sigmoid represents the activation function. Note that the activation function takes in w·x (the dot product) and decides whether the output is 1 or -1.
What is the ReLU function?
The rectified linear unit function (ReLU) is an activation function. The ReLU function is continuous but unfortunately not differentiable at z = 0. (The slope changes abruptly, which can make gradient descent bounce around). Its derivative is 0 for z < 0. It is given by:
ReLU(z) = max(0,z)
z is generally the dot product: z = w·x
The rectified linear activation function is a piecewise linear function that will output the input directly if is positive, otherwise, it will output zero. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
In a neural network, what is the activation function?
In a neural network, the activation function is responsible for transforming the summed weighted input from the node into the activation of the node or output for that input.
What is the softplus function? What is its derivative?
It is generally used as an activation function. Mathematically it is:
f(x) = ln(1 + e^x)
f'(x) = e^x / (1 + e^x) = 1 / (1 + e^(-x))
Note the derivative is the logistic function.
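A quick numerical check, assuming NumPy, that the softplus derivative matches the logistic function (the test point and step size are illustrative):

```python
import numpy as np

softplus = lambda x: np.log(1.0 + np.exp(x))
logistic = lambda x: 1.0 / (1.0 + np.exp(-x))

x, h = 0.5, 1e-6
# Central-difference estimate of d/dx softplus(x)
numeric_derivative = (softplus(x + h) - softplus(x - h)) / (2 * h)
print(np.isclose(numeric_derivative, logistic(x)))   # True
```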