Neural Networks and Tidymodels Flashcards

Week Six

1
Q

Neural Networks

A

A neural network comprises an ordered sequence of layers of neurons.
Each node connects to all nodes in the next layer.
The input layer (predictor nodes) feeds into a hidden layer (derived features), and then to an output node (the target variable).
It can have multiple hidden layers (deep learning), but we'll be sticking with one for simplicity!

2
Q

Rule of thumb: try ⌈p/2⌉ hidden nodes.

A

⌈p/2⌉ is the smallest integer at least as large as p/2; for example, ⌈7/2⌉ = 4.

p is the effective number of predictors, where a factor on K levels contributes K − 1 predictors (through coding to indicator variables).

If there are 3 numerical predictors and one factor on 4 levels, then
p = 3 + (4 − 1) = 6,
so the rule of thumb suggests ⌈6/2⌉ = 3 hidden nodes.
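A quick check of this arithmetic in R (a minimal sketch; the numbers come from the example above):

p <- 3 + (4 - 1)  # 3 numeric predictors, plus a 4-level factor coded as 3 indicator variables
ceiling(p / 2)    # rule-of-thumb number of hidden nodes: 3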

3
Q

Activation Function

A

The activation function is like the translator in a conversation between the input data and the derived features: a function that transforms the inputs into the derived features. In a neural network, the activation function takes the weighted sum of the inputs plus a bias term, applies a mathematical operation, and produces the node's output.

4
Q

Sigmoid Activation Function

A

The sigmoid activation function is like a squasher: it takes any input value and squashes it down to a value between 0 and 1.

A common choice of ϕ is the sigmoid function:
ϕ(v) = 1 / (1 + e^(−v))
Note that a large input v > 4 returns almost one; a small input v < −4 returns almost zero.
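A minimal sketch of the sigmoid in R, confirming the squashing behaviour at v = ±4:

sigmoid <- function(v) 1 / (1 + exp(-v))
sigmoid(4)   # 0.982..., almost one
sigmoid(-4)  # 0.018..., almost zero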

5
Q

Fitting A Neural Network

A

A neural network is defined by a large number of parameters, often referred to as the weights of the neural net. With p predictors and M hidden nodes there are M(p+1) + M + 1 weights,
α01, α11, …, αpM, β0, …, βM,
since each hidden node has p + 1 weights (one per predictor plus a bias) and the output node has M + 1. These weights must be estimated by fitting the model to training data.
We aim to select the weights that minimize the residual sum of squares for predictions from the neural network.

6
Q

Basic idea of back propagation:

A
  1. We generate random initial starting weights.
  2. We 'feed forward' the training data through the layers to give a prediction ŷ.
  3. We compute the residual sum of squares, RSS. This is our loss function that we want to minimise.
  4. RSS is a complicated function of the weights, but its construction means we can compute the partial derivatives with respect to each weight quite easily (i.e. the direction and amount each weight needs to move to decrease RSS).
  5. We use batch gradient descent to move all the weights a step in the right direction. The step size is known as the decay; it's basically there to ensure we don't overshoot the minimum RSS.
  6. We take the step, update the weights, then loop back to step 2. (A toy sketch of this loop follows the list.)
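A toy illustration of this loop in R (a minimal sketch of batch gradient descent on a linear model, not nnet's actual internals; all names are illustrative):

set.seed(1)
X <- cbind(1, rnorm(20))           # toy design matrix with an intercept column
y <- X %*% c(2, 3) + rnorm(20)     # toy responses with true weights (2, 3)
w <- rnorm(2)                      # step 1: random initial weights
step_size <- 0.01
for (i in 1:500) {
  resid <- y - X %*% w             # steps 2-3: feed forward, residuals (RSS = sum(resid^2))
  grad <- -2 * t(X) %*% resid      # step 4: partial derivatives of RSS w.r.t. each weight
  w <- w - step_size * grad        # step 5: step in the direction that decreases RSS
}                                  # step 6: loop back and repeat
w                                  # ends up close to the true weights (2, 3)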
7
Q

Implementing Neural Networks in R

A

The R function for fitting a neural network is nnet(), from the nnet package:

nnet(y ~ x1 + x2 + x3, data=mydata, size=2, linout=TRUE)

The size argument specifies the number of nodes in the hidden layer. It has no default value.

The argument linout is logical, and indicates whether the relationship between the prediction (output) and the derived features is linear (TRUE) or sigmoidal (FALSE, the default).
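A slightly fuller sketch (assuming a data frame mydata with a numeric response y and predictors x1, x2, x3, as in the call above):

library(nnet)
set.seed(1)  # fitting starts from random weights, so set a seed for reproducibility
fit <- nnet(y ~ x1 + x2 + x3, data = mydata, size = 2, linout = TRUE)
summary(fit)                    # lists the estimated weights
predict(fit, newdata = mydata)  # predictions (here, for the training data)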

8
Q

gradient descent

A

Adjust Weights: We use a method called “gradient descent” to adjust each weight a little bit in the right direction to reduce the error. It’s like tweaking the knobs on a machine to make it work better.

9
Q

How to count the weights per hidden node, the weights for the output node, and the total weights to be determined.

A

This first model has p = 16 predictors and 8 hidden nodes. How many weights?
p + 1 = 17 weights per hidden node (the number of α's);
8 + 1 = 9 weights for the output node (the number of β's);
17 × 8 + 9 = 145 weights to be determined.
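The same count via the general formula M(p + 1) + M + 1, checked in R:

M <- 8; p <- 16
M * (p + 1) + M + 1  # 145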

10
Q

maxit

A

nnet() uses a small number of iterations (100) by default. Increasing this can be required if the decay/learning rate is low. We set this by specifying maxit.

11
Q

decay

A

decay controls the weight-updating rate (learning rate) in the back-propagation algorithm.
Setting decay to a value such as 0.01 or 0.001 can be useful to speed up or slow down learning.
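A hedged sketch combining this card's decay with the previous card's maxit (the formula and data frame are illustrative):

nnet(y ~ x1 + x2 + x3, data = mydata, size = 2, linout = TRUE,
     decay = 0.01,  # weight-updating rate, per this card
     maxit = 1000)  # extra iterations, useful when decay is small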

12
Q

size

A

The size parameter determines the number of neurons (nodes) in the hidden layer of the neural network. Tuning this parameter can affect the model’s flexibility and performance.

13
Q

Training neural nets is a complex problem

A

Fitting methods are less stable than for linear models and regression trees or forests.

Numerical issues can arise when predictors are on (very) different scales, particularly if large in magnitude.

Pre-scaling predictors can be advisable.
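One way to pre-scale in R (a minimal sketch; mydata and the response name y are illustrative, and standardising every numeric predictor is one choice among several):

# standardise the numeric predictors, excluding the response y
num_preds <- setdiff(names(mydata)[sapply(mydata, is.numeric)], "y")
mydata[num_preds] <- scale(mydata[num_preds])  # centre to mean 0, scale to sd 1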

14
Q

Complexity of Neural Network

A

Complexity depends on the number of hidden nodes.
The bias-variance trade-off applies in theory: increasing the number of hidden nodes will reduce bias but increase variance.
In practice the picture is confused by the difficulties in model fitting.

15
Q

Tidy models

A

Step 1: Splitting the data (rsample)
Step 2: Preparing the data (recipes)
Step 3: Modelling (parsnip)
Step 4: Prediction
Step 5: Model selection (yardstick)
Step 6: Predicting the unknown
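A minimal sketch of the six steps with the tidymodels packages (the data frame mydata, response y, and all tuning values are illustrative):

library(tidymodels)
set.seed(1)
split <- initial_split(mydata)                    # Step 1: split the data (rsample)
rec <- recipe(y ~ ., data = training(split)) %>%
  step_normalize(all_numeric_predictors())        # Step 2: prepare the data (recipes)
spec <- mlp(hidden_units = 4) %>%                 # Step 3: modelling (parsnip)
  set_engine("nnet") %>%
  set_mode("regression")
wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec) %>%
  fit(data = training(split))
preds <- predict(wf_fit, new_data = testing(split))  # Step 4: prediction
bind_cols(preds, testing(split)) %>%
  rmse(truth = y, estimate = .pred)               # Step 5: model selection (yardstick)
# Step 6: predicting the unknown, e.g. predict(wf_fit, new_data = new_unlabelled_data)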
