Neural Networks Flashcards
How to separate Test, Train and Validation Data
Separate base data set into 6 components
- Train Features
- Train Targets
- Validation Features
- Validation Targets
- Test Features
- Test Targets
(The accompanying image shows only the train and test split.)
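A minimal sketch of one way to produce the six pieces, assuming hypothetical features/targets arrays and scikit-learn's train_test_split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target vector
features = np.random.rand(1000, 10)
targets = np.random.randint(0, 2, size=1000)

# First carve off the test set, then split the remainder into train/validation
train_val_X, test_X, train_val_y, test_y = train_test_split(
    features, targets, test_size=0.2, random_state=42)
train_X, val_X, train_y, val_y = train_test_split(
    train_val_X, train_val_y, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2 of the total
```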
What is logistic regression?
Outputs the probability that a given input belongs to a certain class, by squashing a weighted sum of the inputs through a sigmoid (or a softmax for multiple classes)
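A minimal sketch of the idea for the binary case, with hypothetical weights w, bias b, and input x:

```python
import numpy as np

def predict_proba(x, w, b):
    """Binary logistic regression: probability that x belongs to the positive class."""
    z = np.dot(w, x) + b             # weighted sum of the inputs
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid squashes the score into (0, 1)

print(predict_proba(np.array([0.5, -1.2]), np.array([2.0, 0.3]), 0.1))
```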
Explain the softmax function, why use e?
Turns scores (logits, the raw real-valued outputs of the neural net) into probabilities that sum to 1
Exponentiating with e maps every score, negative or positive, to a positive number while preserving order, so the results can be normalized into a valid probability distribution
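A small softmax sketch (subtracting the max is a standard numerical-stability trick, not part of the card):

```python
import numpy as np

def softmax(scores):
    """Convert a vector of raw scores (logits) into probabilities that sum to 1."""
    exps = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, -0.5])))  # roughly [0.69, 0.25, 0.06]
```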
Explain why sigmoids are no longer used as an activation function
- Derivatives of the sigmoid max out at 0.25, so during backprop, errors going back into the network are shrunk by 75%-100% at every layer
- For models with many layers, the layers near the input receive tiny weight updates and take a long time to train
- Not zero-centered (more of an inconvenience): the sigmoid feeds only positive values to the next layer, so the gradients of the weights on an input value x are always all positive or all negative, depending on the gradient of the whole expression f. This leads to undesirable zig-zagging of the gradient updates for the weights
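A quick check of the 0.25 ceiling on the sigmoid's derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 with value 0.25 and collapses toward 0 for large |x|
for x in [0.0, 2.0, 5.0]:
    print(x, sigmoid_prime(x))
# 0.0 -> 0.25, 2.0 -> ~0.105, 5.0 -> ~0.0066
```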
Explain why Relu is used as an activation function
- If the input is positive, the derivative is 1, so there isn't the vanishing effect you see with backpropagated errors through sigmoids.
- Leads to faster training
Drawbacks of Relu
- A large learning rate coupled with a large gradient may lead to a correspondingly large negative adjustment of the weights and biases (our step down)
- That adjustment may lead to a negative input into the ReLU calculation. Negative input = 0 output for ReLU, and during backprop the derivative on that side is zero, so the chain rule produces a zero update. This leads to a "dead" neuron, which wastes computation and reduces learning
- It's very hard, if not impossible, to get a large positive adjustment that counters this, since we are moving down toward a local minimum. Not sure if this would even matter mathematically, but it may help when viewed over a batch
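A small sketch showing ReLU and its gradient; a unit whose input has been pushed negative gets a zero gradient and stops updating:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_prime(x):
    return (x > 0).astype(float)  # 1 for positive inputs, 0 otherwise

# A neuron whose pre-activation input has been pushed negative gets zero gradient,
# so its incoming weights stop receiving updates (a "dead" ReLU).
z = np.array([-3.0, 0.5, 2.0])
print(relu(z))        # [0.  0.5 2. ]
print(relu_prime(z))  # [0. 1. 1.]
```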
Training Loss
Average cross-entropy loss
S = softmax, D = cross-entropy, L = loss: L = (1/N) Σᵢ D(S(W·xᵢ + b), yᵢ), summed over all N training examples
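A sketch of the average cross-entropy loss over a hypothetical batch of logits and one-hot targets:

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores, axis=1, keepdims=True))
    return exps / np.sum(exps, axis=1, keepdims=True)

def average_cross_entropy(logits, one_hot_targets):
    """L = (1/N) * sum_i D(S(logits_i), y_i), with D the cross-entropy."""
    probs = softmax(logits)
    return -np.mean(np.sum(one_hot_targets * np.log(probs + 1e-12), axis=1))

# Hypothetical batch of 2 examples, 3 classes
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
targets = np.array([[1, 0, 0], [0, 1, 0]])
print(average_cross_entropy(logits, targets))
```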
When performing numerical optimization, what are good things to do with variables
- Mean of zero
- Equal Variance
This minimizes search performed by optimizer
For images, how to you prepare your data for optimization
Take each channel, subtract 128 and divide by 128. This doesn't change the information in the data, it just rescales the values so they are easier for numerical optimization.
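A sketch assuming 8-bit pixel values in [0, 255]:

```python
import numpy as np

# Hypothetical 8-bit RGB image: values in [0, 255]
image = np.random.randint(0, 256, size=(32, 32, 3)).astype(np.float32)

# Shift and scale so pixel values land roughly in [-1, 1] with mean near 0
normalized = (image - 128.0) / 128.0
print(normalized.min(), normalized.max())
```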
discuss gaussian distribution and sigma
- Mean zero and standard deviation sigma.
- Sigma determines the order of magnitude of the outputs at the initial point of the optimization.
- Because the softmax sits on top of those outputs, the order of magnitude also determines how peaked the initial probability distribution is.
- Large sigma = large scores, a sharply peaked (opinionated) distribution.
- Small sigma = small scores, a nearly uniform (uncertain) distribution.
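A quick demonstration of how sigma controls the peakiness of the initial softmax output (the sigma values are arbitrary):

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

rng = np.random.default_rng(0)
for sigma in [0.01, 1.0, 10.0]:
    scores = rng.normal(0.0, sigma, size=5)   # scores drawn with standard deviation sigma
    print(sigma, np.round(softmax(scores), 3))
# Small sigma -> probabilities near uniform (uncertain); large sigma -> one class dominates (peaked)
```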
For optimization, what are basic ideas on how to initialize weights and biases
Start from a Gaussian distribution with a small sigma. A small sigma keeps the initial probabilities near uniform (uncertain), letting the optimizer grow more confident as training progresses.
What are the 3 sets of data used to measure performance and how are they used
- training - optimize loss
- validation - measure performance of training
- test - never use until the final measurements
Explain Stochastic Gradient Descent
- Average loss for very small random fraction of training data
- This average loss is typically a bad estimate at first and may actually increase error
- Thus, you do it many times, taking small steps each time
Explain Momentum
Use a running average of the gradients as the direction to take, instead of the gradient of the current batch alone.
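A minimal momentum sketch with assumed hyperparameters (learning_rate, beta) and a stand-in gradient function:

```python
import numpy as np

learning_rate, beta = 0.01, 0.9
velocity = np.zeros(3)           # running average of gradients
weights = np.random.rand(3)

def fake_gradient(w):            # stand-in for the gradient of the current batch
    return 2 * w

for _ in range(100):
    grad = fake_gradient(weights)
    velocity = beta * velocity + (1 - beta) * grad   # update the running average
    weights -= learning_rate * velocity              # step in the averaged direction
print(weights)
```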
Explain learning rate decay
Decreasing the learning rate over time during the training process, e.g. lowering it each time the loss reaches a plateau, or applying exponential decay.
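A sketch of exponential decay with assumed initial_lr and decay_rate values:

```python
# Exponential learning-rate decay sketch
initial_lr, decay_rate = 0.1, 0.96
for epoch in range(5):
    lr = initial_lr * decay_rate ** epoch
    print(epoch, round(lr, 5))
# The step size shrinks smoothly as training progresses
```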
What is ADAGRAD
- A modification of SGD which implicitly does momentum and learning rate decay by default
- Makes learning less sensitive to hyperparameters
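The core mechanism is a per-parameter step size scaled down by the accumulated squared gradients; a minimal sketch with assumed lr and eps values and a stand-in gradient:

```python
import numpy as np

lr, eps = 0.1, 1e-8
weights = np.random.rand(3)
grad_accum = np.zeros(3)

def fake_gradient(w):
    return 2 * w

for _ in range(100):
    grad = fake_gradient(weights)
    grad_accum += grad ** 2                              # accumulate squared gradients
    weights -= lr * grad / (np.sqrt(grad_accum) + eps)   # per-parameter adaptive step
print(weights)
```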
Whats a scalar
A single value with a zero-dimensional shape, e.g. 1, 2.4, -0.3
Whats a vector
A single-dimensional shape with a certain length
What is a matrix and how do you describe it?
A 2-dimensional grid of values. If it has 2 rows and 3 columns, it's a 2x3 matrix
describe a vector as a matrix
A 1 x len matrix (one row, len columns).
describe indices of a matrix
row then column index
How to reshape vectors from horizontal to vertical
Less common way
More common way of reshaping data
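The card's images aren't reproduced here; a sketch assuming the two approaches are an explicit reshape versus None-slicing to add an axis:

```python
import numpy as np

v = np.array([1, 2, 3, 4])        # shape (4,)

# Explicit reshape into a column vector
col = v.reshape(len(v), 1)        # shape (4, 1)

# Adding an axis with None (np.newaxis) -- often the more common idiom
col_alt = v[:, None]              # shape (4, 1)
row = v[None, :]                  # shape (1, 4)
print(col.shape, col_alt.shape, row.shape)
```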
Describe element-wise operations within Matrix
Example of an element-wise operation - normalizing an image
How to perform element wise operations between matrices
Must be the same size
How does matrix multiplication work
Take the dot product of each row of the left matrix with each column of the right matrix: multiply element-wise and sum, treating them as separate horizontal and vertical vector pairs.
When can you NOT take a dot product of two matrices
When the length of the horizontal vectors (rows of the left matrix) doesn't match the length of the vertical vectors (columns of the right matrix). Since you take an element-wise operation on these vector pairs, they must have the same length.
Important Reminders about Matrix Multiplication
- The number of columns in the left matrix must equal the number of rows in the right matrix. When viewing the shapes of the two matrices side by side, if the dimensions on the inside match, you're good: 2x3 and 3x2, the inside values (3) match.
- The answer matrix always has the same number of rows as the left matrix and the same number of columns as the right matrix (the outside values: 2x2).
- Order matters (not commutative): multiplying A·B is not the same as multiplying B·A.
- Data in the left matrix should be arranged as rows, while data in the right matrix should be arranged as columns.
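A quick shape check of these rules in numpy:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # 2x3
B = np.array([[ 7,  8],
              [ 9, 10],
              [11, 12]])           # 3x2

print(np.matmul(A, B).shape)       # (2, 2): inner dimensions (3) match, outer ones give the result shape
print(np.matmul(B, A).shape)       # (3, 3): order matters, A@B != B@A
# np.matmul(A, A) would raise a ValueError: columns of the left (3) != rows of the right (2)
```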
What is the transpose of a matrix
Switch the values in the rows to be values in the columns. If the matrix isn't square, a 2x4 becomes a 4x2, which may assist in matrix multiplication.
When can you safely use a matrix transpose
How to call the transpose function in numpy
.T
If you modify a transposed matrix, what are the implications?
- it modifies both the transpose and the original matrix, too!
- They are sharing the same copy of data.
- Consider the transpose just as a different view of your matrix, rather than a different matrix entirely.
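A quick demonstration that the transpose is a view sharing the original's data:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])
t = m.T                    # a view of the same data, not a copy

t[0, 0] = 99               # modify the transpose...
print(m[0, 0])             # ...and the original changes too: 99
```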
What is a gradient
- just a derivative generalized to functions with more than one variable
- Think slope, or the direction of greatest ascent; taking the negative of the gradient gives the direction of steepest descent
Explain local minima
An area where the loss is low, but not the lowest. Getting stuck in one is often attributed to poor weight initialization.
When calculating the error of a function, what are upsides of using SSE
- The squared error penalizes outliers more than small errors
- Makes all error values positive, so errors in opposite directions don't cancel out
Brief explaination of chain rule
Think of the network as a bunch of dominos. Some dominos are weights, biases, dot products, activation values, and the error. The goal is to fine-tune the dominos that contain weights and biases. To do so, we take the derivative of the final domino (the error domino). Then, to find a weight update, you find the derivative of each connecting domino and multiply the values together. This is the chain rule.
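A tiny numeric sketch of the domino picture for a single weight, assuming a sigmoid activation and squared error:

```python
import numpy as np

# Chain rule sketch for one weight: x -> z = w*x -> a = sigmoid(z) -> E = 0.5*(y - a)**2
x, w, y = 0.5, 1.2, 1.0
z = w * x
a = 1.0 / (1.0 + np.exp(-z))

dE_da = -(y - a)                 # derivative of the error "domino"
da_dz = a * (1.0 - a)            # derivative of the activation "domino"
dz_dw = x                        # derivative of the dot-product "domino"
dE_dw = dE_da * da_dz * dz_dw    # multiply the connecting derivatives together
print(dE_dw)
```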
What is the gradient of really small and large values
Essentially zero (the activation saturates), so the resulting update steps are zero as well
How to do calculate weight update
- Calculate the error term
- Multiply by the input x, then divide by the number of records to get the average delta weight change
- Multiply the result by the learning rate and add it to the current weights
Describe components of Output Error Term
Output error * derivative of the output's activation function
(The card's image uses the sigmoid and its derivative.)
The error term (output or hidden) is used to update the weights that connect into that layer.
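A sketch of the error term and weight update for a single sigmoid output unit (inputs, weights, target, and learning rate are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.1, 0.3])        # inputs into the output layer
w = np.array([0.4, -0.2])       # weights connecting into the output layer
y = 1.0                         # target
learnrate = 0.5

output = sigmoid(np.dot(w, x))
error = y - output
error_term = error * output * (1 - output)   # output error * derivative of the activation
del_w = learnrate * error_term * x           # update for the weights into this layer
print(del_w)
```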
For multilayer perceptron, how is the weight matrix written?
Just remember, in matrix notation it's always rows then columns
For backpropagation, how do you interpret the weights for each hidden layer
The weights determine how much each input and hidden unit contributes to the total error. Since the inputs and hidden activations are all multiplied by weights, it makes sense to attribute the error back through those same weights.
What is Backpropagation
Basically flipping the network around once you've found an output error and using that output error as the input. The output error is fed in and multiplied by the weights to identify the hidden-unit error terms (the amount of error attributed to each hidden unit). These error terms are then multiplied by the hidden-unit gradients to identify the magnitude and direction the network's weights should move.
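A minimal backprop sketch for a hypothetical 3-input, 2-hidden-unit, 1-output network with sigmoid activations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5
w_in_hidden = np.array([[0.5, -0.6], [0.1, -0.2], [0.1, 0.7]])   # 3x2
w_hidden_out = np.array([0.1, -0.3])                              # 2,

# Forward pass
hidden = sigmoid(np.dot(x, w_in_hidden))
output = sigmoid(np.dot(hidden, w_hidden_out))

# Backward pass: flip the network around and feed the output error back through the weights
output_error_term = (target - output) * output * (1 - output)
hidden_error_term = w_hidden_out * output_error_term * hidden * (1 - hidden)

delta_w_hidden_out = learnrate * output_error_term * hidden
delta_w_in_hidden = learnrate * hidden_error_term * x[:, None]
print(delta_w_hidden_out, delta_w_in_hidden, sep="\n")
```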
How to create dummy variables for categorical variables
Use a for loop: select each categorical column from the data, call pd.get_dummies with a prefix name, then concat the resulting columns onto the data. pd.get_dummies identifies the unique values and names the new columns after the prefix.
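A sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical DataFrame with categorical columns
data = pd.DataFrame({'season': [1, 2, 3], 'weekday': [0, 5, 6], 'cnt': [16, 40, 32]})

for each in ['season', 'weekday']:
    dummies = pd.get_dummies(data[each], prefix=each, drop_first=False)
    data = pd.concat([data, dummies], axis=1)

data = data.drop(['season', 'weekday'], axis=1)   # optionally drop the original categoricals
print(data.columns.tolist())
```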
How do you scale variables to ensure they have zero mean and std of 1
Take each variable, subtract its mean, then divide by its standard deviation.
The code loops through the features, calculates the mean and standard deviation of each column, and stores these values in a dictionary as a tuple. It uses data.loc[:, each] to apply the calculation to every row of the looped column.
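A sketch with hypothetical feature names, keeping the (mean, std) tuples for un-scaling predictions later:

```python
import pandas as pd

data = pd.DataFrame({'temp': [0.2, 0.4, 0.8], 'hum': [0.81, 0.60, 0.55]})
quant_features = ['temp', 'hum']

scaled_features = {}
for each in quant_features:
    mean, std = data[each].mean(), data[each].std()
    scaled_features[each] = (mean, std)              # keep for un-scaling later
    data.loc[:, each] = (data[each] - mean) / std
print(data)
```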
How to assign features and targets
Create a list of the target fields, then use the .drop method to remove them from the data set, assigning the dropped fields as the targets and the remaining fields as the features.
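A sketch assuming a single hypothetical target field named 'cnt':

```python
import pandas as pd

data = pd.DataFrame({'temp': [0.2, 0.4], 'hum': [0.8, 0.6], 'cnt': [16, 40]})

target_fields = ['cnt']
features, targets = data.drop(target_fields, axis=1), data[target_fields]
print(features.columns.tolist(), targets.columns.tolist())
```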
how to split the feature and target data into train_features, test_features, val_features, val_targets
Just slice the data and assign the pieces to variables. The card's example slices only rows (there is no comma in the index), taking the last chunk of rows for test/validation and the earlier rows for training.
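A sketch assuming 1000 rows, holding out the last 60 for test and the 60 before that for validation (the counts are arbitrary):

```python
import numpy as np

features = np.random.rand(1000, 56)
targets = np.random.rand(1000, 1)

# No comma in the index, so only rows are sliced
test_features, test_targets = features[-60:], targets[-60:]
val_features, val_targets = features[-120:-60], targets[-120:-60]
train_features, train_targets = features[:-120], targets[:-120]
print(train_features.shape, val_features.shape, test_features.shape)
```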
How do you line up the features with targets
zip the two data sets
How to implement SGD
Create batches (which are just arrays of index values into the data set) via np.random.choice.
Using this batch(list of index values), select rows(batch) and all columns from train_features and train_targets
During Training, how do you get a vector of features as input
During SGD you create a batch, which is an index of values used to select data from train_features and train_targets. Say the selected features form a matrix of 128 rows by 56 columns and the targets a vector of 128 values. To get a vector of feature values to input into the neural network, zip the selected features with the selected targets and unpack each pair into X, y. Each X is then a single vector of 56 features paired with a single target value.
- Create batch of random index values
- Select data using batch
- Zip data with targets
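A sketch tying the steps together, with assumed data sizes:

```python
import numpy as np

# Hypothetical training set: 1000 records, 56 features
train_features = np.random.rand(1000, 56)
train_targets = np.random.rand(1000)
batch_size = 128

# Create a batch: an array of random row indices into the training data
batch = np.random.choice(train_features.shape[0], size=batch_size)

# Select those rows, then zip features with targets so each X is one 56-value
# feature vector paired with a single target y
for X, y in zip(train_features[batch], train_targets[batch]):
    pass  # feed X forward through the network, compare with y, accumulate weight deltas
print(X.shape)  # (56,)
```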
Basic idea of K-Fold Cross Validation
- Avoid permanently setting aside data that is never trained on, while at the same time not cheating (overfitting to one validation set)
- Break the data into k buckets, train k times, and each time use a different bucket as the held-out bucket while training on the others
- Average results
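A sketch using scikit-learn's KFold with a trivial stand-in for the model:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data set
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # Train a model on X_train/y_train here; as a stand-in, score a majority-class baseline
    scores.append((y_val == y_train.mean().round()).mean())
print(np.mean(scores))   # average the k results
```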