ANNs and Backprop Flashcards
What are ANNs?
A method of programming computers in which the program learns automatically from training examples
What tasks are ANNs particularly good at?
Pattern recognition and other tasks that are conventionally difficult to program
What is the architecture of ANNs based on?
Loosely based on a biological brain
How do ANNs process information?
Using interconnected neurons
What type of reasoning do ANNs use?
Inductive reasoning (data to rules)
What is the memory type of ANNs?
Distributed and short-term
What is a key advantage of ANNs?
Fault tolerant due to redundancy
Name three applications of ANNs in classification.
- Consumer behavior
- Medical diagnosis
- Fruit grading
What are two areas where ANNs are used for recognition/identification?
- Speech
- Vision
How are ANNs used in forecasting/prediction?
- Weather
- Stocks
- Crop yield
- Trends
What are the capabilities of ANNs?
Turing powerful, capable of approximating any function or mapping between vector spaces
What tasks do ANNs struggle with?
Symbolic manipulation and memory intensive tasks
Why are ANNs beneficial?
Avoids explicit system modelling by learning complex behaviors directly from data
How many neurons does a human brain have?
86 billion neurons
Fill in the blank: ANNs are best suited for _______.
classification and function approximation
True or False: ANNs can learn and adapt to changing conditions.
True
What are some applications of NLP?
Text categorization, part-of-speech tagging
NLP stands for Natural Language Processing.
What are examples of predictive analysis applications?
Stock market trends, weather prediction
Predictive analysis involves using data to forecast future outcomes.
What security applications are mentioned?
Motion detection, fingerprints
These applications enhance security systems.
In what business areas are predictive analytics widely used?
Data warehousing, uncovering patterns and trends
Major consulting firms utilize these techniques.
What is crucial for the success of Artificial Neural Networks (ANNs)?
Training data
The quality and quantity of training data directly affect ANN performance.
What is an artificial neuron?
A simplified model of a biological neuron
It serves as the foundational model for computational models in AI and neural networks.
What are the inputs of an artificial neuron denoted as?
I1, I2… In
These inputs are real numbers that the neuron processes.
What determines the significance of each input in an artificial neuron?
Weights
Each input has an associated weight that influences the neuron’s output.
What does the summation unit of an artificial neuron compute?
The weighted sum (logit) of the inputs
The formula is Σ wi · Ii + b, where b is an optional bias term.
What is the role of the activation function in an artificial neuron?
Transforms the logit into the neuron’s output
The function f defines the behavior of the neuron.
What is a popular activation function mentioned?
Sigmoid
It is smooth and bounded between 0 and 1.
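A minimal Python sketch of a single artificial neuron, combining the summation unit and the sigmoid activation described above (inputs, weights and bias are illustrative values only):

```python
import math

def sigmoid(logit):
    """Smooth activation, bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum (logit) of the inputs plus an optional bias,
    transformed by the activation function f."""
    logit = sum(w * i for w, i in zip(weights, inputs)) + bias
    return sigmoid(logit)

# Three inputs I1..I3 with illustrative weights and bias
print(neuron_output([0.5, 1.0, -1.0], [0.8, -0.2, 0.4], bias=0.1))
```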
What are linear layers in deep learning?
Layers with no activation function
The output of each neuron in these layers is the logit.
Why must the activation function in neural networks be differentiable?
Required by algorithms that optimize the weights
Differentiability is necessary for gradient-based optimization methods.
What are two activation functions that are replacing sigmoid?
tanh, ReLU
These functions offer advantages in performance and convergence.
What is the linear step function also known as?
Heaviside function
It maps input values to 0 or 1 based on a threshold t.
Describe the typical structure of an ANN.
Input layer, 1+ hidden layers, output layer
Data flows from inputs X to outputs Y, with neurons connected by weights.
How is pattern recognition implemented in neural networks?
Using a feed forward neural network
The network associates target outputs with input patterns during training.
What must be available for effective pattern recognition in neural networks?
Good labelled training data
Quality training data is essential for accurate pattern association.
What are weights in the context of neural networks?
Model parameters or just parameters
They are adjusted during training to optimize performance.
What is a Perceptron?
A simple two-layer feed forward neural network with an input layer and an output layer.
Uses a linear step function with t=0.5.
What types of functions can a Perceptron compute?
Boolean AND and OR functions.
This requires finding a set of weights that produces the correct output for every combination of binary inputs (see the sketch below).
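A minimal Python sketch, assuming the linear step function with t = 0.5 mentioned earlier; the weights are hand-picked for illustration and are not unique:

```python
def step(logit, t=0.5):
    """Linear step (Heaviside) activation with threshold t."""
    return 1 if logit >= t else 0

def perceptron(inputs, weights, t=0.5):
    """Single output neuron: weighted sum of the binary inputs, then step."""
    return step(sum(w * i for w, i in zip(weights, inputs)), t)

and_weights = [0.3, 0.3]  # both inputs must be 1 to reach the 0.5 threshold
or_weights = [0.6, 0.6]   # either input alone reaches the threshold

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", perceptron([a, b], and_weights),
              "OR:", perceptron([a, b], or_weights))
```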
How do input neurons function in a Perceptron?
Input neurons act as identity functions with a weight of 1.
Always output the input value, whether 0 or 1.
What is a limitation of the Perceptron?
It can only compute functions that are geometrically linearly separable.
This means inputs can be separated by a straight line in input space.
Is XOR linearly separable?
No, XOR is not linearly separable.
True and false inputs cannot be separated by a straight line.
What is required to compute XOR?
A hidden layer of neurons with a nonlinear activation function (i.e., a multi-layer network).
XOR cannot be computed by a standard single-layer Perceptron.
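A small sketch of how a hidden layer makes XOR computable; the weights are hand-picked for illustration (XOR = OR AND NOT AND), not the result of training:

```python
def step(logit, t=0.5):
    return 1 if logit >= t else 0

def xor(a, b):
    """Two hidden neurons (OR and AND) feed one output neuron that
    fires when OR is true but AND is not."""
    h_or = step(0.6 * a + 0.6 * b)
    h_and = step(0.3 * a + 0.3 * b)
    return step(0.6 * h_or - 0.6 * h_and)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))
```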
What did Minsky and Papert’s book highlight about the Perceptron?
It led researchers to focus on symbolic AI instead of neural networks.
Their analysis suggested limitations of single-layer perceptrons.
What is a Multi Layer Perceptron (MLP)?
A type of Perceptron with at least three layers: input, hidden, and output.
It is Turing powerful if it has a nonlinear activation function.
What is a key feature of MLPs according to Minsky and Papert?
An MLP can theoretically compute any computable function.
Requires at least one hidden layer and a nonlinear activation function.
What was one main issue in developing MLPs?
Finding a consistent set of weights for training examples.
Also involves determining the number of layers and neurons.
What impact did Minsky and Papert’s analysis have on neural networks?
It led to a perception that all neural network architectures were flawed.
Resulted in reduced funding and interest in neural networks.
What is the relationship between MLPs and ANNs?
MLP is one type of artificial neural network (ANN).
It is one of the simplest and most popular neural networks.
Classification vs Regression
These are the two main categories for supervised learning algorithms. The biggest difference is that while regression tries to predict a continuous quantity, classification predicts discrete class labels.
Ex of Regression
Predicting tomorrow’s price of a certain stock from historical data.
Ex. of Classification
Distinguishing dog images from cat images.
MLP Training vs Testing
In supervised learning we need labelled datasets, usually divided randomly into training and testing examples. In Python ML packages, fit() fits the training data to the model. The testing data is then classified by the trained model to determine the classification or prediction performance, and various metrics are used to evaluate these algorithms.
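A minimal sketch of this workflow, assuming scikit-learn (one common Python ML package) and its built-in iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Randomly divide the labelled dataset into training and testing examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# fit() fits the training data to the model
model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)

# The trained model classifies the testing data; a metric evaluates performance
print(accuracy_score(y_test, model.predict(X_test)))
```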
Training a Multi-Layer Perceptron
The task of finding a set of weights that will allow the MLP to classify the training examples correctly. If we have 50 labelled images of cats and dogs, we use feature selection to represent them as vectors. The MLP is then trained to output 0,1 for cat and 1,0 for dog. The weights are initially randomized between -1 and 1; it is infeasible to find a working set of weights by inspection.
The goal of any learning algorithm
is to find a function that best maps inputs to their correct output. MLP training is an optimization task of finding the right weights to compute an arbitrary mapping of inputs to outputs. This can be done using the error backpropagation (backprop) algorithm.
MLP Topology
Number of layers and neurons in each layer
Connections between the neurons and their direction
Activation function used for the neurons
error surface
The error surface of a neural network is rarely very smooth and well-behaved. Error surfaces tend to be very convoluted with numerous local minima.
One hot encoding
If you want an NN to classify unseen items into 3 classes, you assign '1' to a specific output neuron for each class, with all other output neurons assigned '0'. For n classes you need n output neurons; for example, '100', '010' and '001' can be assigned to the 3 classes. This encoding is often used to label the training examples and is the most popular output representation (target labelling), sometimes called a 'distributed representation'.
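A small sketch of one hot encoding for 3 classes (the class names are illustrative):

```python
def one_hot(class_index, n_classes):
    """Assign '1' to one output neuron and '0' to all the others."""
    return [1 if i == class_index else 0 for i in range(n_classes)]

labels = ["cat", "dog", "bird"]             # 3 classes -> 3 output neurons
for idx, name in enumerate(labels):
    print(name, one_hot(idx, len(labels)))  # [1,0,0], [0,1,0], [0,0,1]
```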
Multi-Layer Perceptron - Pre-Training
Randomize all the training examples
Initialize all the weights with random values between -1 and 1.
Set the learning rate η hyperparameter (usually 0.2). Determines how much of the neuron error is used to modify the weights.
Set the error threshold μ hyperparameter (usually 0.2). Determines how much error 'leeway' is given to the output layer neurons.
Another hyperparameter is the maximum number of epochs (see the sketch below).
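A sketch of these pre-training steps for a hypothetical 2-3-2 network; the example data and layer sizes are illustrative only:

```python
import random

random.seed(0)

# Illustrative labelled examples: (inputs, one-hot target)
training_examples = [([0, 0], [1, 0]), ([0, 1], [0, 1]),
                     ([1, 0], [0, 1]), ([1, 1], [1, 0])]

# 1. Randomize the order of the training examples
random.shuffle(training_examples)

# 2. Initialize all weights with random values between -1 and 1
n_inputs, n_hidden, n_outputs = 2, 3, 2
hidden_weights = [[random.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
output_weights = [[random.uniform(-1, 1) for _ in range(n_hidden)]
                  for _ in range(n_outputs)]

# 3. Hyperparameters
learning_rate = 0.2    # η: how much of the neuron error modifies the weights
error_threshold = 0.2  # μ: error 'leeway' given to the output layer neurons
max_epochs = 1000      # maximum number of epochs
```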
Notes about MLP Training
The error backprop algorithm modifies the weights whenever training encounters a 'bad fact'. The required number of epochs varies: sometimes the network is trained for a fixed number of epochs, and sometimes it trains until it converges (see the sketch after these notes).
The network converges when it performs one epoch with only good facts, meaning that for each training example the error of each output neuron was less than the error threshold.
In that case backprop was never called to modify the weights.
You don’t need to reshuffle the order of the training examples for each epoch.
When the network converges the weights must be stored. These weights correctly classify each training example.
Weights are used to classify unseen examples to determine if the network can generalize. If you do not save the weights you will have to retrain if you switch off the computer.
Common to store weights after each epoch (checkpointing).
To generalize in ML means learning from a fixed number of training examples in order to be able to classify correctly any unseen examples.
The set of weights obtained by convergence is not unique. If training is performed again, the weights will probably be different after convergence.
Weights are called parameters; the other settings are hyperparameters (learning rate, error threshold, activation function, number of epochs, …).
Determine optimal hyperparameters through hyperparameter optimization.
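A from-scratch sketch of this training loop on XOR (a tiny 2-3-1 network with sigmoid neurons). The weight update shown is the standard gradient rule for sigmoid units; the hyperparameters are illustrative, and with a different random seed it may need more epochs to converge:

```python
import math, random

random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny 2-3-1 MLP: weights and biases randomized between -1 and 1
n_in, n_hid, n_out = 2, 3, 1
w_hid = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
b_hid = [random.uniform(-1, 1) for _ in range(n_hid)]
w_out = [[random.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_out)]
b_out = [random.uniform(-1, 1) for _ in range(n_out)]

examples = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]  # XOR
eta, error_threshold, max_epochs = 0.2, 0.2, 50000

def forward(x):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hid, b_hid)]
    output = [sigmoid(sum(w * h for w, h in zip(ws, hidden)) + b)
              for ws, b in zip(w_out, b_out)]
    return hidden, output

converged = False
for epoch in range(max_epochs):
    converged = True
    for x, target in examples:
        hidden, output = forward(x)
        if all(abs(t - o) < error_threshold for t, o in zip(target, output)):
            continue                  # good fact: weights left alone
        converged = False             # bad fact: backprop modifies the weights
        delta_out = [(t - o) * o * (1 - o) for t, o in zip(target, output)]
        delta_hid = [h * (1 - h) * sum(d * w_out[k][j] for k, d in enumerate(delta_out))
                     for j, h in enumerate(hidden)]
        for k in range(n_out):
            for j in range(n_hid):
                w_out[k][j] += eta * delta_out[k] * hidden[j]
            b_out[k] += eta * delta_out[k]
        for j in range(n_hid):
            for i in range(n_in):
                w_hid[j][i] += eta * delta_hid[j] * x[i]
            b_hid[j] += eta * delta_hid[j]
    # checkpointing would store the weights here after each epoch
    if converged:                     # one epoch with only good facts
        break

print("converged:", converged, "after", epoch + 1, "epochs")
```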
ReLu vs Sigmoid
Sigmoid has some disadvantages: it is computationally expensive, and input values below about -4 or above about 4 are mapped to roughly 0 or 1 respectively, losing magnitude information (e.g. 5 and 500 are both mapped to roughly 1).
ReLU and its derivative are much faster to compute
Addresses the vanishing gradient problem to some extent.
Networks using ReLU in practice tend to show better convergence performance than sigmoid networks.
However, ReLU tends to blow up activations, since there is no mechanism to constrain the output of the neuron.
Dying ReLU problem: if too many inputs go below 0, most of the units in the network simply output zero, 'die' and prevent learning. This can be mitigated with leaky ReLU (see the sketch below).
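A small sketch comparing the three activation functions discussed here (the leaky slope of 0.01 is a common but arbitrary choice):

```python
import math

def sigmoid(x):
    """Computationally heavier; saturates, so e.g. 5 and 500 both map to ~1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Cheap to compute; unbounded above, so activations can blow up."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Keeps a small slope for negative inputs, mitigating dying ReLU."""
    return x if x > 0 else alpha * x

for x in (-6, -1, 0, 1, 5, 500):
    print(x, round(sigmoid(x), 4), relu(x), leaky_relu(x))
```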
Neuron Bias
Bias works like the intercept added in a linear equation. It is an additional parameter in an ANN that is used to adjust the output along with the weighted sum of the inputs to the neuron. Thus the bias is a constant that helps the model best fit the given data. Bias allows you to shift the activation function left or right by adding a constant (the bias) to the input.
Output Formula
Output = f(sum(weight*input)+bias)
The bias term performs several important functions
Translation of Activation Function: The bias allows the activation function to be shifted to the left or right, which helps the model make better approximations of the target function. Without a bias term, the neuron is constrained to pass through the origin, limiting its expressive capability.
Increased Flexibility: By adjusting bias and weights during the learning process, the model becomes more flexible. This added degree of freedom allows it to fit the training data better and generalize to new data.
Complexity and Non-Linearity: When used in conjunction with non-linear activation functions, the bias term helps introduce non-linearity to the model. This is important for tackling complex problems that cannot be solved adequately with linear methods.
Breaking symmetry: In the initialization phase, if neurons in the same layer have the same weights and biases, they’ll produce the same output, effectively making them identical. Biases help break this symmetry, allowing neurons to learn different features during training.
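A small sketch of Output = f(sum(weight*input) + bias), showing how different bias values shift the sigmoid's response for the same weighted sum (the values are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """Output = f(sum(weight * input) + bias)"""
    return sigmoid(sum(w * i for w, i in zip(weights, inputs)) + bias)

x, w = [1.0], [2.0]
# Same weighted sum, different biases: the activation curve is shifted
for b in (-2.0, 0.0, 2.0):
    print(b, round(neuron(x, w, b), 4))
```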
Dropout
Dropout is a regularization technique commonly used in MLPs and other NNs. The idea is to prevent overfitting by randomly setting a subset of neuron outputs to zero at each training step. Overfitting is a critical issue in ML where a model performs exceedingly well on the training data but poorly on unseen or validation data.
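A minimal sketch of (inverted) dropout applied to one layer's outputs; p_drop and the example activations are illustrative:

```python
import random

def dropout(activations, p_drop=0.5, training=True):
    """Randomly zero a fraction p_drop of neuron outputs during training,
    scaling the survivors so the expected value is unchanged.
    At test time the outputs are left untouched."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
layer_output = [0.8, 0.1, 0.5, 0.9, 0.3]
print(dropout(layer_output))                  # training: some outputs zeroed
print(dropout(layer_output, training=False))  # inference: unchanged
```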