Neural networks and tidymodels Flashcards
week six
Neural Networks
A neural network comprises an ordered sequence of layers of neurons.
Each node connects to all nodes in the next layer.
The input layer (predictor nodes) feeds to a hidden layer (derived features), and then to an output node (target variable).
A network can have multiple hidden layers (deep learning), but we'll be sticking with one for simplicity!
Rule of thumb: try ⌈p/2⌉ hidden nodes.
⌈x⌉ is the smallest integer at least as large as x; for example, ⌈7/2⌉ = 4.
p is the effective number of predictors, where a factor with K levels contributes K − 1 predictors (through coding to indicator variables).
If there are 3 numerical predictors and one factor on 4 levels, then
p = 3 + (4 − 1) = 6.
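As a quick check of how the effective p is counted, here is a minimal R sketch; the data frame mydata and its columns are invented for illustration.

# Invented example: 3 numerical predictors and one factor on 4 levels
mydata <- data.frame(
  x1 = rnorm(12), x2 = rnorm(12), x3 = rnorm(12),
  f  = factor(rep(c("a", "b", "c", "d"), times = 3))
)

# model.matrix() codes the 4-level factor as 3 indicator variables,
# so the effective number of predictors is p = 3 + (4 - 1) = 6
X <- model.matrix(~ x1 + x2 + x3 + f, data = mydata)
p <- ncol(X) - 1   # drop the intercept column
ceiling(p / 2)     # rule-of-thumb number of hidden nodes: 3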
Activation Function
The activation function is like the translator in a conversation between the input data and the derived features: it is a special function that transforms a node's input into its output. In a neural network, the activation function takes the weighted sum of the inputs plus a bias, applies some mathematical operation, and produces the node's output.
Sigmoid Activation Function
The sigmoid activation function is like a squasher—it takes any input value and squashes it down to a value between 0 and 1.
A common choice of ϕ is the sigmoid function:
ϕ(v) = 1 / (1 + e^(−v))
Note that a large input v > 4 returns almost one; a small input v < −4 returns almost zero.
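A quick R check of this squashing behaviour; this small helper function is just for illustration.

sigmoid <- function(v) 1 / (1 + exp(-v))
sigmoid(c(-10, -4, 0, 4, 10))
# roughly 0.00005, 0.018, 0.5, 0.982, 0.99995
# large v (> 4) is almost one; small v (< -4) is almost zero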
Fitting A Neural Network
A neural network is defined by a large number of parameters: these are often referred to as the weights of the neural net.
The M(p + 1) + M + 1 weights
α01, α11, …, αpM, β0, …, βM
(where M is the number of hidden nodes and p the effective number of predictors) must be estimated by fitting the model to training data.
We aim to select the weights that minimize the residual sum of squares for predictions from the neural network.
Basic idea of back propagation:
- We generate random initial starting weights.
- We ‘feed-forward’ the training data through the layers to give a prediction ŷ.
- We compute the residual sum of squares RSS. This is our loss function that we want to minimise.
- RSS is a complicated function of the weights, but its construction means we can compute the partial derivatives with respect to each weight quite easily (i.e. the direction and amount each weight needs to move to decrease RSS).
- We use batch gradient descent to move all the weights a step in the right direction. The step size is known as the decay; it's basically there to ensure we don't overshoot the minimum RSS.
- We take the step, update the weights, then loop back to step 2 (see the toy sketch after this list).
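To make the loop concrete, here is a toy gradient-descent sketch in R. It is not nnet()'s actual back propagation implementation: the data, the step size eta and the single-weight model y_hat = w * x are invented purely to show the feed-forward / loss / gradient / update cycle.

set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50, sd = 0.3)

w   <- runif(1)   # step 1: random initial weight
eta <- 0.005      # step size for each update

for (i in 1:200) {
  y_hat <- w * x                      # step 2: feed-forward prediction
  rss   <- sum((y - y_hat)^2)         # step 3: loss we want to minimise
  grad  <- -2 * sum((y - y_hat) * x)  # step 4: dRSS/dw
  w     <- w - eta * grad             # step 5: step downhill, then loop
}
w   # ends up close to the true slope of 2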
Implementing Neural Networks in R
The R function for fitting a neural network is nnet(), from the nnet package.
nnet(y ~ x1 + x2 + x3, data=mydata, size=2, linout=TRUE)
The size argument specifies the number of nodes in the hidden layer. It has no default value.
The argument linout is logical, and indicates whether the relationship between the prediction (output) and the derived features is linear or not.
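A slightly fuller, hedged sketch of fitting and predicting with nnet(); the data frame mydata and its columns are invented for illustration.

library(nnet)

set.seed(42)
mydata <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
mydata$y <- with(mydata, x1 - 2 * x2 + 0.5 * x3 + rnorm(100, sd = 0.2))

# size = 2 hidden nodes; linout = TRUE because y is a numeric target
fit <- nnet(y ~ x1 + x2 + x3, data = mydata, size = 2,
            linout = TRUE, trace = FALSE)

head(predict(fit, newdata = mydata))   # fitted values on the training data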
gradient descent
Adjust Weights: We use a method called “gradient descent” to adjust each weight a little bit in the right direction to reduce the error. It’s like tweaking the knobs on a machine to make it work better.
How to count weights: weights per hidden node, weights for the output node, and total weights to be determined.
This first model has 8 hidden nodes. How many weights?
p + 1 = 17 weights per hidden node (the number of αs);
8 + 1 = 9 weights for the output node (the number of βs);
17 × 8 + 9 = 145 weights to be determined.
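A one-line check of the count using the M(p + 1) + M + 1 formula from above (p = 16 is implied by p + 1 = 17):

p <- 16   # effective number of predictors in this example
M <- 8    # hidden nodes
M * (p + 1) + M + 1   # 8 * 17 + 9 = 145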
maxit
nnet() uses a small number of iterations (100) by default. Increasing this can be required if the decay/learning rate is low. We set this by specifying maxit.
decay
decay controls the weight updating rate (learning rate) in the back propagation algorithm.
Setting decay to a value such as 0.01 or 0.001 can be useful to speed up or slow down learning.
size
The size parameter determines the number of neurons (nodes) in the hidden layer of the neural network. Tuning this parameter can affect the model’s flexibility and performance.
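Putting the three tuning arguments together, a hedged sketch (re-creating the invented mydata from the earlier nnet() example):

library(nnet)

set.seed(42)
mydata <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
mydata$y <- with(mydata, x1 - 2 * x2 + 0.5 * x3 + rnorm(100, sd = 0.2))

# size  = number of hidden nodes
# decay = controls the weight updating (0.01 and 0.001 are common choices)
# maxit = maximum number of iterations (the default is 100)
fit2 <- nnet(y ~ x1 + x2 + x3, data = mydata,
             size = 3, decay = 0.01, maxit = 500,
             linout = TRUE, trace = FALSE)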
Training neural nets is a complex problem
Fitting methods are less stable than for linear models and regression trees or forests.
Numerical issues can arise when predictors are on (very) different scales, particularly if large in magnitude.
Pre-scaling predictors can be advisable.
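One simple way to pre-scale the numeric predictors before fitting; the data frame here is invented, with x2 deliberately on a much larger scale than the others.

library(nnet)

set.seed(7)
mydata <- data.frame(x1 = rnorm(100), x2 = 1000 * rnorm(100), x3 = rnorm(100))
mydata$y <- with(mydata, x1 - 0.002 * x2 + 0.5 * x3 + rnorm(100, sd = 0.2))

# Centre and scale the predictors, then fit as usual
scaled <- mydata
scaled[c("x1", "x2", "x3")] <- scale(scaled[c("x1", "x2", "x3")])

fit_scaled <- nnet(y ~ x1 + x2 + x3, data = scaled, size = 3,
                   linout = TRUE, trace = FALSE)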
Complexity of Neural Network
Complexity depends on number of hidden nodes.
Bias-variance trade-off applies in theory: increasing number of hidden nodes will reduce bias but increase variance.
In practice the picture is confused by the difficulties in model fitting.
Tidymodels
Step 1: Splitting the data (rsample)
Step 2: Prepare the data (recipes)
Step 3: Modelling (parsnip)
Step 4: Prediction
Step 5: Model selection (yardstick)
Step 6: Predicting the unknown
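As an end-to-end outline of the six steps (hedged: the data, column names and hyper-parameter values are invented, and mlp() with the nnet engine stands in for the neural network model):

library(tidymodels)

set.seed(123)
mydata <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
mydata$y <- with(mydata, x1 - 2 * x2 + 0.5 * x3 + rnorm(200, sd = 0.2))

# Step 1: splitting the data (rsample)
split <- initial_split(mydata, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Step 2: prepare the data (recipes), e.g. centre and scale the predictors
rec <- recipe(y ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

# Step 3: modelling (parsnip) - a single-hidden-layer net via the nnet engine
spec <- mlp(hidden_units = 3, penalty = 0.01, epochs = 500) %>%
  set_engine("nnet") %>%
  set_mode("regression")

fit_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec) %>%
  fit(data = train)

# Step 4: prediction on the held-out test set
preds <- predict(fit_wf, new_data = test) %>% bind_cols(test)

# Step 5: model selection / assessment (yardstick)
preds %>% rmse(truth = y, estimate = .pred)

# Step 6: predicting the unknown - apply the fitted workflow to new data
# with predict(fit_wf, new_data = newdata)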