Neural Networks Flashcards

1
Q

How to separate Test, Train and Validation Data

A

Separate base data set into 6 components

  1. Train Features
  2. Train Targets
  3. Validation Features
  4. Validation Targets
  5. Test Features
  6. Test Targets

Pic just shows training and test

2
Q

What is logistic regression?

A

Outputs a probability that a given input belongs to a certain class

3
Q

Explain the softmax function, why use e?

A

Turns scores (logits - the raw, real-valued outputs of the neural net) into probabilities

e turns negatives into positives: exponentiating maps every score to a positive number before normalizing

4
Q

Explain why sigmoids are no longer used as an activation function

A
  • Derivatives of the sigmoid max out at 0.25 (see the check below), so during backprop, errors going back into the network are shrunk by 75%-100% at every layer
    • For models with many layers, weight updates in the layers near the input become tiny, and training takes a long time
  • Not zero-centered (more of an inconvenience) - the sigmoid feeds only positive values to the next layer; in turn, the gradients of the weights on such an input x will be all positive or all negative together, depending on the gradient of the whole expression f. This leads to undesirable zig-zagging of gradient updates for the weights
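
A quick check of the 0.25 bound: the sigmoid's derivative can be written in terms of the sigmoid itself, sigma'(x) = sigma(x) * (1 - sigma(x)), and s * (1 - s) on [0, 1] peaks at s = 0.5, giving a maximum of exactly 0.25.
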
5
Q

Explain why Relu is used as an activation function

A
  • If the input is positive, the derivative is 1, so there isn’t the vanishing effect you see on backpropagated errors from sigmoids.
  • Leads to faster training
6
Q

Drawbacks of Relu

A
  • A large learning rate, coupled with a large gradient, may lead to a correspondingly large negative adjustment of the weights and biases (our step down)
  • That adjustment may lead to a negative input into the ReLU calculation. Negative = 0 for ReLU. During backprop, the derivative of zero is zero, so the chain rule leads to a zero update. This creates a “dead” neuron, which wastes computation and reduces learning
  • It’s very hard, if not impossible, to apply a large positive adjustment to counter the earlier one, since we are moving down toward a local minimum. Not sure if this would even matter mathematically, but it may help if taken in a batch view
7
Q

Training Loss

A

Average cross-entropy loss

S = Softmax, D = Cross-Entropy, L = Loss
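
A likely reconstruction of the pictured formula, assuming N training examples x_i with one-hot labels y_i:

Loss = (1/N) * sum_i D(S(W * x_i + b), y_i)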

8
Q

When performing numerical optimization, what are good things to do with variables

A
  1. Mean of zero
  2. Equal Variance

This minimizes the search the optimizer has to perform

9
Q

For images, how do you prepare your data for optimization

A

Take each channel, subtract 128 and divide by 128. This doesn’t change the information in the data, it just makes it easier for numerical optimization

10
Q

discuss gaussian distribution and sigma

A
  • Mean zero and standard deviation sigma.
  • Sigma determines the order of magnitude of the outputs at the initial point of optimization
  • Because softmax sits on top of these outputs, their order of magnitude also determines the peakiness of the initial probability distribution
  • Large sigma = large outputs = peaky, opinionated distribution
  • Small sigma = small outputs = flat, uncertain distribution
11
Q

For optimization, what are basic ideas on how to initialize weights and biases

A

Start from a Gaussian distribution with a small sigma. A small sigma makes the initial probability distribution uncertain (near uniform), which is a safer starting point for optimization.

12
Q

What are the 3 sets of data used to measure performance and how are they used

A
  • training - optimize loss
  • validation - measure performance of training
  • test - never use until final measurements
13
Q

Explain Stochastic Gradient Descent

A
  • Average loss for very small random fraction of training data
  • This average loss is typically a bad estimate at first and may actually increase error
  • Thus, you do it many times, taking small steps each time
14
Q

Explain Momentum

A

Use a running average of gradients as the direction to take, instead of the gradient from the current batch alone

15
Q

Explain learning rate decay

A

Decreasing the learning rate over time during the training process (e.g., lowering it every time the loss reaches a plateau, or applying exponential decay)

16
Q

What is ADAGRAD

A
  • modification of SGD which implicitly does momentum and learning rate decay by default
  • Makes learning less sensitive to hyperparameters
17
Q

Whats a scalar

A

A single value representing a zero-dimensional shape (e.g. 1, 2.4, -0.3)

18
Q

Whats a vector

A

a single-dimension shape with a certain length

19
Q

What is a matrix and how do you describe it?

A

A 2-dimensional grid of values. If it has 2 rows and 3 columns, it’s a 2x3 matrix

20
Q

describe a vector as a matrix

A

A 1 x n matrix, where n is the length of the vector.

21
Q

describe indices of a matrix

A

row then column index

22
Q

How to reshape vectors from horizontal to vertical

A

Less common way
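
The pic is missing from this export; a hedged sketch of the reshape-based approach (values illustrative):

    import numpy as np

    v = np.array([1, 2, 3, 4])    # shape (4,) - a flat "horizontal" vector
    col = v.reshape(len(v), 1)    # shape (4, 1) - a vertical (column) vector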

23
Q

More common way of reshaping data

A
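
The answer image is missing; the more common idiom usually taught alongside reshape is None (np.newaxis) slicing - a hedged sketch:

    import numpy as np

    v = np.array([1, 2, 3, 4])    # shape (4,)
    col = v[:, None]              # shape (4, 1) - column vector
    row = v[None, :]              # shape (1, 4) - row vector
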
24
Q

Describe element-wise operations within Matrix

A
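
The answer image is missing; a minimal sketch of element-wise operations between a matrix and a scalar:

    import numpy as np

    m = np.array([[1, 2, 3],
                  [4, 5, 6]])
    print(m + 5)      # adds 5 to every element
    print(m * 0.5)    # halves every element
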
25
Q

Example of element wise operation -normalizing an image

A
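
The answer image is missing; a sketch matching card 9's recipe (subtract 128, divide by 128), assuming 8-bit pixel values:

    import numpy as np

    image = np.random.randint(0, 256, size=(28, 28)).astype(np.float32)
    normalized = (image - 128.0) / 128.0    # element-wise; values now roughly in [-1, 1)
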
26
Q

How to perform element wise operations between matrices

A

Must be the same size

27
Q

How does matrix multiplication work

A

Take an element-wise multiply-and-sum (a dot product) for each row and column pair. Treat the rows of the left matrix and the columns of the right matrix as separate horizontal and vertical vector pairs.
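
A quick NumPy illustration (not from the card):

    import numpy as np

    a = np.array([[1, 2, 3],
                  [4, 5, 6]])      # 2x3
    b = np.array([[7, 8],
                  [9, 10],
                  [11, 12]])       # 3x2
    c = np.matmul(a, b)            # 2x2; c[0, 0] = 1*7 + 2*9 + 3*11 = 58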

28
Q

When can you NOT take a dot product of two matrices

A

If the length of the horizontal vectors (rows of the left matrix) doesn’t match the length of the vertical vectors (columns of the right matrix). Since you take an element-wise operation on these vector pairs, they must have the same length

29
Q

Important Reminders about Matrix Multiplication

A
  • The number of columns in the left matrix must equal the number of rows in the right matrix. When viewing the shapes of the two matrices side by side, if the dimensions match on the inside, you’re good: 2x3 and 3x2 — the values on the inside (3) match.
  • The answer matrix always has the same number of rows as the left matrix and the same number of columns as the right matrix (the outside values: 2x2).
  • Order matters (not commutative). Multiplying A•B is not the same as multiplying B•A.
  • Data in the left matrix should be arranged as rows, while data in the right matrix should be arranged as columns.
30
Q

What is the transpose of a matrix

A

Switch the values in the rows to be values in the columns. If not square, a 2x4 becomes a 4x2, which may assist in matrix multiplication

31
Q

When can you safely use a matrix transpose

A
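
The answer image is missing. The usual rule taught with this material (stated here as an assumption): a transpose is safe when the data in both original matrices is arranged as rows. A hedged sketch:

    import numpy as np

    inputs = np.array([[0.5, 0.1, -0.2, 0.3]])             # 1x4, one example per row
    weights = np.array([[0.02, 0.001, -0.03, 0.036],
                        [0.04, -0.003, 0.025, 0.009],
                        [0.012, -0.045, 0.28, -0.067]])    # 3x4, one unit's weights per row

    hidden_in = np.matmul(inputs, weights.T)               # 1x4 @ 4x3 -> 1x3
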
32
Q

How to call the transpose function in numpy

A

.T

33
Q

If you modify a transposed matrix, what are the implications

A
  • it modifies both the transpose and the original matrix, too!
  • They are sharing the same copy of data.
  • Consider the transpose just as a different view of your matrix, rather than a different matrix entirely.
34
Q

What is a gradient

A
  • just a derivative generalized to functions with more than one variable
  • think slope, or the direction of greatest ascent. Taking the negative of the gradient gives the direction of steepest descent
35
Q

Explain local minima

A

An area where loss is low, but not the lowest. Getting stuck in one is primarily caused by poor weight initialization

36
Q

When calculating the error of a function, what are upsides of using SSE

A
  • The squared error penalizes outliers more than small errors
  • Makes values all positive
37
Q

Brief explaination of chain rule

A

Think of the network as a bunch of dominos. Some dominos are weights, biases, dot products, activation values, error. The goal is to fine-tune the dominos that contain weights and biases. In order to do so, we take the derivative of the final domino (the error domino). Then, if you want to find a weight update, you just find the derivative of each connecting domino and multiply the values together. This is the chain rule.
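
A tiny numeric version of the domino picture (a hedged sketch, not the card's image):

    import numpy as np

    x, w, target = 0.5, 0.8, 1.0
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    h = x * w                        # domino 1: linear combination
    a = sigmoid(h)                   # domino 2: activation
    error = 0.5 * (target - a) ** 2  # domino 3: error

    # chain rule: dE/dw = dE/da * da/dh * dh/dw
    dE_da = -(target - a)
    da_dh = a * (1 - a)
    dh_dw = x
    dE_dw = dE_da * da_dh * dh_dw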

38
Q

What is the gradient of really small and large values

A

Effectively zero (the function is flat at both extremes), so in turn the steps are zero

39
Q

How to do calculate weight update

A
  1. Calculate the error term
  2. Multiply by the input x, then divide by the number of records to get the average delta weight change (see the sketch below)
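
A hedged sketch of this update for a single-layer network with a sigmoid output (toy data):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    learnrate = 0.5
    features = np.random.rand(10, 3)              # 10 records, 3 features
    targets = np.random.randint(0, 2, size=10)
    weights = np.random.normal(scale=0.1, size=3)

    del_w = np.zeros(weights.shape)
    for x, y in zip(features, targets):
        output = sigmoid(np.dot(x, weights))
        error = y - output
        error_term = error * output * (1 - output)    # 1. error term
        del_w += error_term * x                       # 2. multiply by input x
    weights += learnrate * del_w / features.shape[0]  # average delta weight change
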
40
Q

Describe components of Output Error Term

A

Output Error * Derivative of Output (post-activation function)

Pic uses the sigmoid and the derivative of the sigmoid.

The error term (output or hidden) is used to update the weights that connect into a layer.

41
Q

For multilayer perceptron, how is the weight matrix written?

A

Just remember, in matrix notation, it’s always rows then columns

42
Q

For backpropagation, how do you interpret the weights for each hidden layer

A

Weights determine how much impact the hidden and input layers have on the total error. Since inputs and hidden-layer activations are all multiplied by weights, it makes sense to trace the error back to these weights.

43
Q

What is Backpropagation

A

Basically flipping the network around once you’ve found an output error, and using this output error as the input. The output error is fed in as input and multiplied by the weights to identify the hidden-unit error terms (the amount of error caused by each hidden unit). These error terms are then multiplied by the hidden-unit gradients to identify the magnitude and direction the network should move.
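
A hedged sketch of one backward pass for a network with one hidden layer of sigmoid units (shapes and values are illustrative):

    import numpy as np

    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    x = np.array([0.5, 0.1, -0.2])
    target = 0.6
    w_in_hidden = np.random.normal(scale=0.1, size=(3, 2))
    w_hidden_out = np.random.normal(scale=0.1, size=2)

    # forward pass
    hidden = sigmoid(np.dot(x, w_in_hidden))
    output = sigmoid(np.dot(hidden, w_hidden_out))

    # backward pass: the output error flows back through the weights
    error = target - output
    out_error_term = error * output * (1 - output)
    hidden_error = out_error_term * w_hidden_out              # error assigned to each hidden unit
    hidden_error_term = hidden_error * hidden * (1 - hidden)  # scaled by hidden-unit gradients

    # weight steps (magnitude and direction to move)
    dw_hidden_out = out_error_term * hidden
    dw_in_hidden = hidden_error_term * x[:, None]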

44
Q

How to create dummy variables for categorical variables

A

Use a for loop: select the categorical columns from the data, specify a prefix name, then concat the resulting columns to the data. pd.get_dummies will identify the unique values and add them after the prefix.
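
A hedged sketch of that loop (the column names are assumed for illustration):

    import pandas as pd

    data = pd.DataFrame({'season': [1, 2, 3], 'cnt': [16, 40, 32]})  # toy frame
    for each in ['season']:                       # the categorical columns
        dummies = pd.get_dummies(data[each], prefix=each)
        data = pd.concat([data, dummies], axis=1)
    # data now also contains season_1, season_2, season_3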

45
Q

How do you scale variables to ensure they have zero mean and std of 1

A

take each variable, subtract the mean then divide by standard dev.

The code loops through the features, calculates the mean and std dev for each column, and stores these values in a dictionary as a tuple. Use data.loc[:, each] to perform the calculation on every row of the looped column.
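
A hedged sketch of that loop (column names assumed):

    import pandas as pd

    data = pd.DataFrame({'cnt': [16.0, 40.0, 32.0], 'temp': [0.24, 0.22, 0.80]})
    quant_features = ['cnt', 'temp']

    scaled_features = {}
    for each in quant_features:
        mean, std = data[each].mean(), data[each].std()
        scaled_features[each] = (mean, std)             # stored as a tuple
        data.loc[:, each] = (data[each] - mean) / std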

46
Q

How to assign features and targets

A

Create a list of target fields, then use the .drop method to remove them from the data set, assigning the dropped fields as targets and the remaining fields as features
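
A hedged sketch (column names assumed):

    import pandas as pd

    data = pd.DataFrame({'temp': [0.2, 0.3], 'hum': [0.8, 0.7],
                         'cnt': [16, 40], 'casual': [3, 8], 'registered': [13, 32]})
    target_fields = ['cnt', 'casual', 'registered']
    features = data.drop(target_fields, axis=1)   # everything except the targets
    targets = data[target_fields]                 # just the target fields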

47
Q

how to split the feature and target data into train_features, test_features, val_features, val_targets

A

Just slice the data and assign to variables. The pic selects whole rows: since there is no comma in the index, you take just the last N rows (e.g. for the test set) or the first N rows (for training).
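
Continuing the sketch above (split sizes are illustrative):

    # hold out the last 60 rows for testing and the 60 before that for validation
    test_features, test_targets = features[-60:], targets[-60:]
    val_features, val_targets = features[-120:-60], targets[-120:-60]
    train_features, train_targets = features[:-120], targets[:-120]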

48
Q

How do you line up the features with targets

A

zip the two data sets

49
Q

How to implement SGD

A

Create batches (which are just arrays of index values from the data set) via np.random.choice.

Using a batch (list of index values), select those rows and all columns from train_features and train_targets
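
A hedged sketch (.loc stands in for the course-era .ix; the 'cnt' target column is assumed):

    import numpy as np

    batch = np.random.choice(train_features.index, size=128)
    X = train_features.loc[batch].values
    y = train_targets.loc[batch]['cnt']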

50
Q

During Training, how do you get a vector of features as input

A

During SGD, you create a batch, which is an index of values used to select data from train_features and train_targets. We’ll call the features data a matrix of 128 rows by 56 columns, and the targets a vector of 128 values. In order to get a vector of feature values to input into the neural network, you zip the selected features with the targets and iterate, assigning each result to X, y variables. Now X is a single vector of 56 features, with a target y of 1 value.

  • Create batch of random index values
  • Select data using batch
  • Zip data with targets
52
Q

Basic idea of K-Fold Cross Validation

A
  • Avoids setting aside test data that is never used for training, while at the same time not cheating (overfitting)
  • Break data into k buckets, train k times and each time use a new bucket as the test bucket while training on the others
  • Average results
53
Q

Implement K-Fold using SKLearn

A
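
The answer is missing from the export; a minimal sklearn sketch:

    import numpy as np
    from sklearn.model_selection import KFold

    X, y = np.random.rand(12, 4), np.random.randint(0, 2, size=12)
    kf = KFold(n_splits=3, shuffle=True)

    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # train and score the model here, then average the k scores
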
54
Q

What is the purpose of the test set

A
  1. Does the model generalize well to unseen data?
  2. Is it overfitting (memorizing)?
55
Q

discuss confusion matrix

A
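
The answer image is missing. In brief: a table of predicted vs. actual classes - diagonal entries are correct predictions, off-diagonal entries are the mistakes. A hedged sketch:

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]
    print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
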
56
Q

Code Accuracy

A
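
The answer is missing; accuracy is just the fraction of correct predictions:

    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 1])
    y_pred = np.array([1, 0, 0, 1, 0, 1])
    accuracy = np.mean(y_true == y_pred)   # correct / total
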
57
Q

Two types of errors in machine learning (basic)

A
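
The answer image is missing; given the surrounding cards, the intended pair is almost certainly underfitting (oversimplifying, high bias) and overfitting (memorizing the training data, high variance).
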
58
Q

Explain Model Complexity Graph

A

As models become more complex, training errors become smaller and smaller. However, to ensure the model generalizes well rather than just performing well on the training data, look for the point where validation error starts to increase while training error keeps decreasing. Remember to use the Validation Set, not the Test Set

59
Q

For neural networks, how do you make the model more complicated

A
  1. More hidden nodes, layers
60
Q

General rule of thumb when selecting # of hidden nodes

A

Max = Twice the # of inputs

61
Q

create a function which counts how many words are in a review

A
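
The answer is missing; a minimal sketch using collections.Counter (interpreting the card as counting word occurrences):

    from collections import Counter

    def count_words(review):
        """Count how often each word appears in a review."""
        return Counter(review.lower().split())

    print(count_words("great movie great acting"))   # Counter({'great': 2, ...})
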
62
Q

Describe basic neural net class functions

A
  1. Preprocess data
  2. Initialize architecture
  3. Store helper functions
63
Q

Code Softmax Function

A
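
The answer is missing; the usual NumPy version:

    import numpy as np

    def softmax(x):
        """Compute softmax values for each set of scores in x."""
        return np.exp(x) / np.sum(np.exp(x), axis=0)

    print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.66, 0.24, 0.10]
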
64
Q

What happens if you multiply logits by 10 or divide by 10

A

multiply = probabilities get closer to 1 or 0 (more peaked, more confident)

divide = since all the scores decrease in magnitude, the resulting softmax probabilities will be closer to each other (more uniform, less certain)
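
A quick check (outputs approximate):

    import numpy as np

    def softmax(x):
        return np.exp(x) / np.sum(np.exp(x))

    logits = np.array([1.0, 2.0, 3.0])
    print(softmax(logits))        # [0.09, 0.24, 0.67] - moderate confidence
    print(softmax(logits * 10))   # [0.00, 0.00, 1.00] - nearly one-hot
    print(softmax(logits / 10))   # [0.30, 0.33, 0.37] - close to uniform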

65
Q

Why does softmax use e(natural exponential function)

A

Using e, high scores become much more probable than low scores. e also has a nice derivative

66
Q

What is a linear combination in math?

A

an expression constructed from a set of terms by multiplying each term by a constant and adding the results (e.g. a linear combination of x and y would be any expression of the form ax + by, where a and b are constants).

67
Q

How to mathematically say a change in the layer 2 value causes a change to the model loss/cost

A

This is just a derivative/gradient.

So a node with a larger gradient with respect to the cost is going to contribute a larger change to the cost. In this way, we can assign blame for the cost to each node. The larger the gradient for a node, the more blame it gets for the final cost. And the more blame a node has, the more we’ll update it in the gradient descent step.

68
Q

Conceptually (or mathematically speaking), how do you update weights via gradient descent?

A

First find how much blame the layer value placed on the error, then how much blame the weights placed on the layer value. Mathematically, this is expressed with the chain rule. It’s a domino effect: changing something in the network will domino all the way down.

C = Cost, L2 is the final layer value, W2 are the weights fed into L2
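
In symbols, using the card's names (partial derivatives): dC/dW2 = dC/dL2 * dL2/dW2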

69
Q

How to counter the drawbacks of Relu

A

Use Leaky ReLU, where negative values are multiplied by 0.01 instead of always being zero.
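
A hedged one-liner:

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)   # negatives are scaled, not zeroed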

70
Q

When to use a Sigmoid versus Softmax

A

Sigmoid = binary classification (yes or no)

Softmax = multinomial classification (possible values can be a car, boat, bike, etc., not just yes or no)

71
Q

Why use cross-entropy compared to sum of squared errors as a measure of loss

A

SSE = scalar inputs

CE = vector inputs (still a scalar output). When assessing the difference between a softmax output and a one-hot encoded vector, your prediction is itself a vector, so you need a vector-aware measure. Low loss means the prediction vector is closely aligned with the one-hot encoded vector.

72
Q

Explain the cross entropy calculation

A

Multiply the jth element of the one-hot encoded vector by the natural log of the jth element of the softmax output, sum over j, then take the negative of the result.
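
In symbols, for one-hot vector y and softmax output y_hat: D(y_hat, y) = -sum_j y_j * ln(y_hat_j). A one-line NumPy version:

    import numpy as np

    def cross_entropy(y_hat, y):
        return -np.sum(y * np.log(y_hat))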

73
Q

Downsides of Gradient Descent

A
  1. Computation is expensive. The derivative calculation, and the number of derivatives calculated through the iterative process (multiple epochs), is costly.
  2. Doesn’t scale well
74
Q

Best Practice for preparing inputs and weights

A
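
The answer image is missing; based on the surrounding cards (8 and 11), the likely answer: give the inputs zero mean and equal (small) variance, and initialize the weights randomly from a Gaussian with mean zero and a small sigma.
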
75
Q

What is Mini-Batching, Pro/Con

A

A technique for training on subsets of the dataset instead of all the data at one time.

Pro - lets you train when the full dataset doesn’t fit in computer memory

Con - computationally inefficient, since you can’t simultaneously calculate the loss across all samples

76
Q

Explain Mini_batching

A

Instead of feeding every single example in one by one, you feed a batch of inputs (features, labels) to the NN. Thus, the input is a matrix instead of a vector. SGD utilizes mini-batching to calculate the loss for a group of images, not just a single image. Then the weights are updated based on the batch, not just a single example.

77
Q

Basics of implementing mini-batching

A
  1. Just split the train_features data into batches
  2. Specify batch_size (typically 128)
  3. Divide the batch size into the length of the features (the number of rows). This gives the total number of batches.
  4. Use a for loop to create the batches. Use the range criteria to specify the start, end, and step size (see the sketch below)
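
A hedged sketch of that loop:

    def batches(batch_size, features, labels):
        """Split features and labels into batches of at most batch_size rows."""
        assert len(features) == len(labels)
        out = []
        for start in range(0, len(features), batch_size):   # start, end, step
            end = start + batch_size
            out.append([features[start:end], labels[start:end]])
        return out
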
78
Q

What is an Epoch

A

A single forward and backward pass of the WHOLE dataset. Running multiple epochs is used to increase the accuracy of the model without requiring more data

79
Q

Purpose of adding a Bias

A

Allows the neural net to shift values in one direction or another. If no bias were present, you would be relying solely on the weights for the linear combination. For example, if the weights are negative, you will never get above 0, and certain activation functions will never “activate”, thus hindering the learning capability of the network

80
Q

Compare a perceptron and their inputs to a biological neuron

A

Inputs = dendrites

perceptron = neuron

If the perceptron activates (calculates a 1), a positive value is passed on. If a neuron fires, the output signal is sent along the axon

81
Q

For gradient descent, the error function must have what properties?

A

Continuous - small variations in position translate to small variations in error

Differentiable - you can calculate the derivative of the function

82
Q

Explain the sigmoid function

A
  • For large positive numbers, you get values close to 1
  • For large negative numbers, you get values close to 0
  • For numbers close to 0, you get values close to 0.5 (see the sketch below)
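
The function itself, for reference:

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))   # sigmoid(0) = 0.5
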
83
Q

explain this diagram

A

The example shows the probability of points being blue. Going diagonally across the chart is a line (Wx + b = 0); anything above the line is blue, below the line is red. This illustrates taking the sigmoid of the function Wx + b. For values close to the line you get a 0.5 probability, but as you move farther into the blue region, the probability gets higher.

84
Q
A

Plug the coordinate values in to confirm the value of x. Input x into the sigmoid function. You don’t need to actually calculate the formula, because you know if x = 0, then the probability will be 0.5 (per the definition of the sigmoid)

85
Q

How to one-hot encode the picture

A
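
The picture is missing; a generic hedged sketch of one-hot encoding integer class labels:

    import numpy as np

    labels = np.array([0, 2, 1, 2])
    one_hot = np.eye(3)[labels]   # each row has a single 1 in its label's position
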
86
Q

Describe the Maximum Likeihood Concept

A
  1. find the probabilities of each occurrence
  2. multiply these probabilities together (JUST IN THIS EXAMPLE - PRODUCTS ARE BAD, SUMS ARE GOOD). Whatever this value is, we try to maximize it to find the best model. In the picture, the model on the right displays a higher maximum probability P(all) and thus is considered the superior model (it classifies all occurrences better than the other model)
87
Q

Best way to calculate Maximum Likelihood -

A

Use the log function, because it turns products into sums. We don’t want products because, with many data points, a change to one value can alter the result drastically

88
Q

Log(1)? What’s the implication for calculating maximum likelihood?

A
  1. log(1) = 0, and the log of any probability (a number between 0 and 1) is negative, so we take the negative log of each probability to get a positive quantity we can minimize.
89
Q

Explain Cross Entropy and how goal of Maximizing Probability has changed

A

If I have a bunch of events and a bunch of probabilities, how likely is it that those events happen based on those probabilities? If likely, small CE, and vice versa.

  1. calculate the probabilities of the occurrences
  2. take the sum of the negative logs of each probability

low = good, high = bad

  3. the goal is now to minimize cross entropy, not maximize probabilities
90
Q

Explain this Cross Entropy Diagram

A

In the left chart, the two points with -log values of 2.3 and 1.6 correspond to the incorrectly classified points (a red dot in the blue region, a blue dot in the red region). You can see these values are large in comparison to the correctly classified points (red in the red region). In turn, the gap between correctly and incorrectly classified points is a measurable error that we try to minimize. The right chart has a lower cross entropy error because each point is correctly classified.

91
Q

Describe this non-linear photo in terms of probability in regards to the line that separates the two regions

A

The white line represents a non-linear equation on which every point has a probability of 50%. Anything in the blue region has a probability above 50%, and the red region is below 50%.

In the basic sense, we linearly combine probabilities for each point, and since the value has to be between 0 and 1, we apply a non-linear activation function such as the sigmoid. We do this for all points

92
Q

How would these linear expressions be diagrammed as a neural net

A
93
Q

How can you combine these two models using a neural net diagram

A
94
Q

How to clean up this diagram

A
95
Q

When you see this diagram of a neural net, what should you be thinking it represents?

A

It represents the non-linear boundary shown in white.

96
Q

How could you write this with the bias outside?

A
97
Q

describe this photo

A

When you increase your variables (x1, x2, x3), you increase your dimensions. After 3 dimensions it becomes very hard to visualize. But in 3 dimensions, you just have planes in 3D space, and the final product is some non-linear plane.

98
Q

describe this photo

A

Combine linear models into a non-linear model. Then combine these non-linear models to make a more complex non-linear model. This is a deep neural net, which allows us to generate highly complex probability boundaries

99
Q

describe this photo

A

For multi-class classification, we add a layer to the model (with n nodes representing the items being classified, in this case 3 nodes for 3 animals) that identifies the probability of each animal.

100
Q

How many parameters

A

weights = 28 * 28 * 10 = 7,840

+

bias = 1 * 10 = 10

(7,850 parameters in total)

101
Q

Why are linear models stable?

A
  1. Small changes in input don’t lead to big changes in output
  2. Derivatives are constant values
103
Q

Regularization - Skinny Jeans Analogy

A

Skinny jeans “fit” great, but they are really hard to get into. Thus, people wear jeans that are a little too big. In turn, the deep model that is just the right size for your data is very hard to optimize (“get into”). Thus, most models are a little (or a lot) too big for our data, and we try our best not to overfit.

104
Q

Best way of preventing overfitting

A

Early Termination - stop training when our validation set stops improving.

105
Q

Another way to prevent overfitting(not early termination) - Just the basic concept

A

Regularization - applying artificial constraints to the network that implicitly reduce the number of free parameters while not making it more difficult to optimize (think yoga pants: not hard to get into, but they fit great)

106
Q

L2 Regularization

A

Add the L2 norm of the weights (multiplied by a small constant) to the loss. The squared L2 norm is just the sum of squares of the individual elements of a vector.
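
A hedged sketch of the modified loss:

    import numpy as np

    weights = np.array([0.5, -0.3, 0.8])
    data_loss = 0.42                                  # toy value
    beta = 0.01                                       # the small constant
    loss = data_loss + beta * np.sum(weights ** 2)    # add the squared-L2 penalty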

107
Q

Explain Dropout

A

Randomly setting activation values to zero. You are destroying up to half of your data flowing through the model. Then you randomly do it again.

108
Q

Why use Dropout as a way to prevent overfitting?

A

Network can’t rely on any given activations to be present. Forced to learn redundant representations. Takes consensus over ensemble of networks. If dropout doesn’t work, you may need a bigger network.

109
Q

Should you implement drop-out on validation or test sets?

A

No, you should only use drop-out on training. Set keep_prob to 1 or remove on validation and test sets to ensure maximum accuracy

110
Q

create model using one hidden layer, relu and dropout

A
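
The answer is missing; a hedged TF1-style sketch (matching the keep_prob convention from the previous card; sizes assumed):

    import tensorflow as tf   # TF1-style API

    n_input, n_hidden, n_classes = 784, 256, 10
    x = tf.placeholder(tf.float32, [None, n_input])
    keep_prob = tf.placeholder(tf.float32)

    w1 = tf.Variable(tf.truncated_normal([n_input, n_hidden], stddev=0.1))
    b1 = tf.Variable(tf.zeros([n_hidden]))
    w2 = tf.Variable(tf.truncated_normal([n_hidden, n_classes], stddev=0.1))
    b2 = tf.Variable(tf.zeros([n_classes]))

    hidden = tf.nn.relu(tf.add(tf.matmul(x, w1), b1))   # one hidden layer + ReLU
    hidden = tf.nn.dropout(hidden, keep_prob)           # dropout on the hidden layer
    logits = tf.add(tf.matmul(hidden, w2), b2)
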
111
Q

Cross Entropy - What does it connect, what does it tell

A

Cross entropy is a connection between probabilities and error functions. It describes the difference between two vectors. The vectors could be predictions versus actuals. So, if you want to predict an event, a lower error is desired.

small cross entropy = low error, more likely to occur

large cross entropy = large error, less likely to occur

112
Q

Gradient Descent - Describe how to find gradient

A

Calculate the partial derivatives of the error with respect to each input. The picture below has two inputs (two-dimensional)

113
Q

Gradient Descent - Weight and Bias step

A

Learning rate * error term (either the output error term or the hidden-unit error term) * input (either a raw input value or the hidden-unit activation value)

114
Q

Gradient Descent - Write a function for weight and bias step

A
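
The answer is missing; a hedged sketch following the formula on the previous card:

    def step(weights, bias, x, error_term, learnrate=0.01):
        """One gradient-descent step for a single unit."""
        weights = weights + learnrate * error_term * x   # learning rate * error term * input
        bias = bias + learnrate * error_term             # the bias input is effectively 1
        return weights, bias
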
115
Q

Gradient Descent - Why does error have to be continuous?

A

If it were discrete, you couldn’t detect small variations in error, and those small variations are what give us an idea of which direction to take.

116
Q

Gradient Descent - What do correctly and incorrectly classified points tell a line?

A

correctly classified = go farther away so my error is smaller, prediction closer to 1

incorrectly classified = come closer, so the error is smaller and the prediction closer to 1

117
Q

Neural Nets - How to configure for multi-classification?

A

Add as many output nodes as classes

118
Q

Chain Rule?

A

In the picture, if you want to find the partial derivative of B with respect to x, you just multiply the partial derivative of B with respect to A by the partial derivative of A with respect to x.

Applies when you have functions of functions

119
Q

Gradient Descent - Error Term

A

The output error term is slightly different from the hidden-unit error terms. The output error term incorporates the model error as well as the gradient.

Output error term = output error * gradient of the output activation function

Hidden-unit error term = output error term * weights * gradient of the hidden-unit activation function

Since the model error is already incorporated in the output error term, we simply take the output error term and scale it by the weights and the gradient of the hidden-unit activation function.

The error term is used in calculating the weight steps. Once the error term is calculated, you multiply it by the input values (either raw values or hidden-unit activation values) and the learning rate.

120
Q

Training Neural Nets - Early Stopping

A

Accounting for model complexity, where the # of epochs indicates model complexity, the early-stopping algo will stop training once the validation (testing) error stops decreasing and starts to increase, while the training error continues to decrease

121
Q

Training Neural Nets - Regularization

A

The sigmoid of a larger number is closer to 1, and the sigmoid of a smaller number is closer to 0.

The concept illustrated is multiplying the activation function’s input by a scalar. For the sigmoid, this leads to steeper slopes and a higher chance that values are either 1 or 0, nothing in between.

Be wary of super-accurate training models, as they may be overfitting.

122
Q

Training Neural Nets - Regularization. Whats the issue with steeper slopes?

A

Harder to do gradient descent with a lower range of continuous values.

Harder to tune the model to correct errors.

The model on the right is too certain.

123
Q

Training Neural Nets - Regularization - Basic concept

A

Punish large coefficients/weights to avoid steep slopes

Add a term to error function

124
Q

Training Neural Nets - Regularization - Difference between L1 and L2 regulariztaion

A

L1 = lambda * sum of the absolute values of the weights

L2 = lambda * sum of the squares of the weights

125
Q

Training Neural Nets - Regularization - Why choose L1 or L2?

A

L1 = weights driven toward sparsity (many exactly 0).

Good for feature selection, as 0 weights indicate features that are not valuable for prediction (you may have hundreds of features).

Good if you want fewer weights and to end with a small set.

L2 = weights converted to a small, homogeneous set. Typically best for training models, as weights can be tuned (versus the on-or-off setting in L1).

The example shows taking the absolute value versus the sum of squares.

126
Q

Training Neural Nets - Dropout - What is it? Whats a good comparison?

A

The probability that each node will be turned off during each epoch. This requires the other nodes to pick up the slack and prevents overfitting.

Like having a dominant right hand in basketball, and using only your left hand during practice to improve your overall skill.

127
Q

Training Neural Nets - How to avoid local minimum?

A

Random Restarts - Start from a handful of different places and perform gradient descent from there.

128
Q

Training Neural Nets - Describe Vanishing Gradients?

A

Using the sigmoid function as an example: when you approach values on the far left or right side of the function, the gradient is really close to zero (because the slope is flat)

129
Q

Training Neural Nets - How to mitigate vanishing gradients?

A

Use different activation functions that allow a wider range of gradients than nearly zero:

ReLU, tanh

130
Q

Training Neural Nets - Batch versus Stochastic Gradient Descent

A

Batch - all the data is run through the neural net to take one step; computationally expensive

Stochastic - take small batches of data and many steps; fast

131
Q

Training Neural Nets - Learning rate(high versus low)

A

Lower is typically better, as smaller steps will lead to convergence. Large steps may skip over the minimum.

132
Q

Training Neural Nets - Momentum(basic concept), advantages

A

Use a weighted average of the previous steps to avoid a local minimum. The most recent step is weighted highest, with the weight decreasing for each step further back. It may even bounce off the global minimum, but not by much.

Beta is between 0 & 1
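
A hedged sketch of one common formulation:

    import numpy as np

    def momentum_step(weights, velocity, gradient, learnrate=0.01, beta=0.9):
        """Step in the direction of a decaying average of past gradients."""
        velocity = beta * velocity + gradient     # recent gradients weighted highest
        weights = weights - learnrate * velocity
        return weights, velocity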