Neural Networks Flashcards

Question

Example of an element-wise operation - normalizing an image

Answer 1

Must be the same size

Answer 2

Take the element-wise operation for each row and column pair. Treat them as separate horizontal and vertical vector pairs.

Answer 3

If the length of the horizontal vector doesn't match the length of the vertical vector. Since you take the element-wise operation on these vector pairs, they must have the same length.

Answer 4

* The number of columns in the left matrix must equal the number of rows in the right matrix. When viewing the shapes of two matrices side by side, if the inside dimensions match, you're good: for 2x3 and 3x2, the inside values (3) match.
* The answer matrix always has the same number of rows as the left matrix and the same number of columns as the right matrix (the outside values, 2x2).
* Order matters (not commutative). Multiplying A•B is not the same as multiplying B•A.
* Data in the left matrix should be arranged as rows, while data in the right matrix should be arranged as columns.
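A minimal NumPy sketch of the shape rules above. The matrices A and B are made-up values for illustration only.

```python
import numpy as np

# Hypothetical example matrices: A is 2x3, B is 3x2.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

AB = A @ B  # inner dimensions (3 and 3) match, result takes the outer shape 2x2
BA = B @ A  # inner dimensions (2 and 2) match, result takes the outer shape 3x3

print(AB.shape)  # (2, 2)
print(BA.shape)  # (3, 3)
```

The two products do not even share a shape here, which is one way to see that order matters.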

Answer 5

switch the values in the rows to be values in the columns. If not a square, then 2x4 becomes 4x2, which may assist in matrix multiplication

Answer 6

* It modifies both the transpose and the original matrix, too!
* They are sharing the same copy of data.
* Consider the transpose just as a different view of your matrix, rather than a different matrix entirely.
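A short sketch of that behavior in NumPy, assuming the card is about `ndarray.T`. The array values are invented for the example.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])
m_t = m.T            # a view onto the same data, not a copy

m_t[2, 0] = 99       # writes through to m[0, 2]
print(m[0, 2])       # 99
print(np.shares_memory(m, m_t))  # True

m_copy = m.T.copy()  # take an explicit copy when independence is needed
m_copy[0, 0] = -1
print(m[0, 0])       # 1, the original is untouched
```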

Answer 7

* Just a derivative generalized to functions with more than one variable.
* Think slope, or direction of greatest ascent. The negative of the gradient is the direction of steepest descent.

Answer 8

area where loss is low, but not the lowest. Primarily caused by poor weight initialization

Answer 9

* The squared error penalizes outliers more than small errors
* Makes values all positive

Answer 10

Think of the network as a bunch of dominos. Some dominos are weights, biases, dot products, activation values, error. The goal is to fine tune the dominos that contain weights and biases. In order to do so, we take the derivative of the final domino (the error domino). Then, if you want to find a weight update, you just find the derivative of each connecting domino and multiply the values together. This is the chain rule.

Answer 11

zero, so in turn steps are zero

Answer 12

1. Calculate the error term
2. Multiply by the input x, then divide by the number of records to get the average weight change
3.

Answer 13

Output Error \* Derivative of Output (Post Activation Function). The pic uses sigmoid and the derivative of sigmoid. The error term (output or hidden) is used to update the weights that connect into each layer.

Answer 14

Just remember, in matrix notation, it's always rows then columns

Answer 15

weights assign how impactful hidden and input layers are on the total error. Since inputs and hidden layers are all multiplied by weights, it only makes sense that the error stems from these weights.

Answer 16

Basically flipping the network around once you have found an output error, and using this output error as the input. The output error is fed in as input and multiplied by the weights to identify the hidden unit error terms (the amount of error caused by each hidden unit). These error terms are then multiplied by the hidden unit gradients to identify the magnitude and direction the network should move.

Answer 17

In a for loop, select each column from the data, specify a prefix name, then concat the resulting dummy columns to the data. pd.get\_dummies identifies the unique values and adds them after the prefix.
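A minimal sketch of that loop. The DataFrame, its column names, and the final drop of the original columns are assumptions for illustration; the card itself only describes the loop and the concat.

```python
import pandas as pd

# Hypothetical data; column names are made up for this example.
data = pd.DataFrame({
    "season": [1, 2, 1],
    "weather": [1, 1, 3],
    "count": [16, 40, 32],
})

dummy_fields = ["season", "weather"]
for each in dummy_fields:
    # One new column per unique value, named "<prefix>_<value>".
    dummies = pd.get_dummies(data[each], prefix=each)
    data = pd.concat([data, dummies], axis=1)

# Assumption: the original categorical columns are no longer needed.
data = data.drop(dummy_fields, axis=1)
print(list(data.columns))
```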

Answer 18

take each variable, subtract the mean then divide by standard dev. Code loops through features, calculates mean and std dev for each column, store these values in a dictionary as a tuple. Use data.loc[:,each] to perform calculation for every row of the looped column.
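The loop described above might look like the sketch below. The data and the names `quant_features` and `scaled_features` are assumptions for the example.

```python
import pandas as pd

# Hypothetical continuous features.
data = pd.DataFrame({"temp": [10.0, 20.0, 30.0],
                     "hum": [0.2, 0.4, 0.9]})

quant_features = ["temp", "hum"]
scaled_features = {}  # feature name -> (mean, std), kept to undo the scaling later
for each in quant_features:
    mean, std = data[each].mean(), data[each].std()
    scaled_features[each] = (mean, std)
    # Apply the calculation to every row of the looped column.
    data.loc[:, each] = (data[each] - mean) / std

print(scaled_features["temp"])
```

Storing the (mean, std) tuple is what lets predictions be converted back to the original units.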

Answer 19

create list of target fields, then use .drop method to remove these from data set, assigning dropped fields as targets and remaining fields as features
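A small sketch of that split, assuming a DataFrame whose column names are invented here.

```python
import pandas as pd

# Hypothetical data set with two target columns.
data = pd.DataFrame({
    "temp": [0.1, 0.5, 0.9],
    "hum": [0.3, 0.2, 0.8],
    "cnt": [16, 40, 32],
    "casual": [3, 8, 5],
})

target_fields = ["cnt", "casual"]
# Everything that is not a target stays as a feature.
features = data.drop(target_fields, axis=1)
targets = data[target_fields]

print(list(features.columns), list(targets.columns))
```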

Answer 20

Just slice the data and assign to variables. The pic slices rows only (there is no comma in the index), taking either the last so many rows or the first so many rows.

Answer 21

zip the two data sets

Answer 22

create batches (which are just an array of index values from the data set) via np.random.choice. Using this batch (list of index values), select rows (batch) and all columns from train\_features and train\_targets

Answer 23

During SGD, you create a batch, which is an index of values used to select data from train\_features and train\_targets. We'll call the features data a matrix of 128 rows by 56 columns and the targets a vector of 128 values. In order to get a vector of feature values to input into the neural network, you zip the train\_features with the train\_targets and assign the result to an X, y variable. Now the X variable is a single vector of 56 features with a target of 1 value.
* Create batch of random index values
* Select data using batch
* Zip data with targets
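The three steps can be sketched as follows. The random data stands in for the real training set, and the 500-row size is an assumption; only the 128 x 56 batch shape comes from the card.

```python
import numpy as np
import pandas as pd

# Hypothetical training data: 500 records, 56 features, one target each.
rng = np.random.default_rng(0)
train_features = pd.DataFrame(rng.normal(size=(500, 56)))
train_targets = pd.Series(rng.normal(size=500))

# 1. Create a batch of random index values.
batch = np.random.choice(train_features.index, size=128)

# 2. Select those rows (and all columns) using the batch.
X_batch = train_features.loc[batch].values   # shape (128, 56)
y_batch = train_targets.loc[batch].values    # shape (128,)

# 3. Zip features with targets: each X is one vector of 56 features,
#    each y is its single target value.
for X, y in zip(X_batch, y_batch):
    pass  # the forward and backward pass for one record would go here

print(X_batch.shape, y_batch.shape, X.shape)
```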

Answer 24

* Avoids setting aside data that is never used for training, while at the same time not cheating (overfitting)
* Break the data into k buckets, train k times, and each time use a new bucket as the test bucket while training on the others
* Average the results
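A hand-rolled sketch of the bucket rotation, using only NumPy. The function name `k_fold_indices` and the shuffling seed are assumptions; libraries such as scikit-learn provide their own version of this.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs, one per bucket."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)  # k buckets of near-equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    # A model would be trained on train_idx and scored on test_idx here;
    # the k scores are then averaged.
    print(len(train_idx), len(test_idx))
```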

Answer 25

1. Does the model generalize well to unseen data?
2. Is it overfitting (memorizing)?

Answer 26

As models become more complex, training errors become smaller and smaller. However, to ensure the model generalizes well and performs well on training data, look for the point where validation errors start to increase while training errors keep decreasing. Remember to use the Validation Set, not the Test Set.

Answer 27

1. More hidden nodes, layers

Answer 28

Max = Twice the # of inputs

Answer 29

1. Preprocess data 2. Initialize architecture 3. Store helper functions

Answer 30

multiply = probabilities get closer to 1 or 0
divide = since all the scores decrease in magnitude, the resulting softmax probabilities will be closer to each other.
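A quick numerical check of both effects. The scores and the factor of 10 are arbitrary choices for the example.

```python
import numpy as np

def softmax(scores):
    # Subtracting the max does not change the result but avoids overflow.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

scores = np.array([3.0, 1.0, 0.2])  # hypothetical scores

p = softmax(scores)
p_times = softmax(scores * 10)  # pushed toward 1 and 0
p_div = softmax(scores / 10)    # pulled toward a uniform distribution

print(p, p_times, p_div)
```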

Answer 31

Using e, high scores become much more probable than low scores. e also has a nice derivative.

Answer 32

an expression constructed from a set of terms by multiplying each term by a constant and adding the results (e.g. a linear combination of x and y would be any expression of the form ax + by, where a and b are constants).

Answer 33

This is just a derivative/gradient. So a node with a larger gradient with respect to the cost is going to contribute a larger change to the cost. In this way, we can assign blame for the cost to each node. The larger the gradient for a node, the more blame it gets for the final cost. And the more blame a node has, the more we'll update it in the gradient descent step.

Answer 34

First find how much blame the layer value caused on the error, then how much blame the weights caused on the layer value. This is expressed mathematically using the chain rule. It's a domino effect: changing something in the network will domino all the way down. C = Cost, L2 is the final layer value, W2 is the weights fed into L2.

Answer 35

Use Leaky ReLU, where negative values are multiplied by 0.01 instead of always being zero.
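A one-function sketch of Leaky ReLU with the 0.01 slope from the card. The function and parameter names are my own.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # Positive values pass through; negative values are scaled, not zeroed.
    return np.where(x > 0, x, negative_slope * x)

out = leaky_relu(np.array([-2.0, 0.0, 3.0]))
print(out)
```

Because the negative side keeps a small slope, its gradient is never exactly zero.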

Answer 36

Sigmoid = Binary Classification (Yes or No)
Softmax = Multinomial Classification (possible values can be a car, boat, bike etc, not just yes or no)

Answer 37

SSE = scalar inputs
CE = vector inputs, still a scalar output. When assessing the difference between the softmax output and one-hot encoded vectors, you need a loss that takes vectors as input. Low loss means the prediction vector is closely aligned to the one-hot encoded vector.

Answer 38

Multiply the jth element of the one-hot encoded vector by the natural log of the jth element of the softmax output, sum over j, then take the negative value.
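The recipe above as a short sketch. The one-hot label and the softmax output are made-up numbers.

```python
import numpy as np

def cross_entropy(one_hot, softmax_out):
    # -sum_j one_hot[j] * ln(softmax_out[j])
    return -np.sum(one_hot * np.log(softmax_out))

one_hot = np.array([0.0, 1.0, 0.0])      # hypothetical true class: index 1
softmax_out = np.array([0.2, 0.7, 0.1])  # hypothetical prediction

loss = cross_entropy(one_hot, softmax_out)
print(loss)  # equals -ln(0.7), since only the true class contributes
```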

Answer 39

1. Computation is expensive. The derivative calculation and the number of derivatives calculated through the iterative process (multiple epochs) is costly.
2. Doesn't scale well

Answer 40

Technique for training on subsets of the dataset instead of all the data at one time.
Pro - lets you train when the full dataset does not fit in computer memory
Con - computationally inefficient, since you can't simultaneously calculate the loss across all samples

Answer 41

Instead of feeding every single example input one by one, you feed a batch of inputs (features, labels) to the NN. Thus, the input is a matrix instead of a vector. SGD utilizes mini-batching to calculate the loss for a group of images, not just a single image. Then, the weights are updated based on the batch, not just a single example.

Answer 42

1. Just splitting the train\_features data into batches
2. Specify batch\_size (typically 128)
3. Divide batch size into length of features (the rows). This gives us total batches.
4. Use a for loop to create batches. Use range criteria to specify starting, ending and step size
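The steps above can be sketched as a small helper. The name `batches` and the toy data are assumptions; note that the last batch may be smaller when the row count is not a multiple of the batch size.

```python
import math

def batches(batch_size, features, labels):
    """Split features and labels into consecutive batches."""
    assert len(features) == len(labels)
    output = []
    # range(start, stop, step) walks the rows one batch at a time.
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        output.append((features[start:end], labels[start:end]))
    return output

# Hypothetical data: 300 rows with a batch size of 128.
features = [[i, i + 1] for i in range(300)]
labels = [i % 2 for i in range(300)]

result = batches(128, features, labels)
total_batches = math.ceil(len(features) / 128)
print(total_batches, [len(f) for f, _ in result])
```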

Answer 43

A single forward and backward pass of the WHOLE dataset. This is used to increase the accuracy of the model without requiring more data

Answer 44

Allows the neural net to move values in one direction or another. If no bias was present, you are relying solely on the weights for the linear combination. For example, if weights are negative, you will never get above 0 and certain activation functions will never "activate", thus hindering the learning capability of the network.

Answer 45

Inputs = dendrites
Perceptron = neuron
If the perceptron activates (calculates a 1), a positive value is processed. If the neuron fires, the output signal is sent along the axon.

Answer 46

Continuous - small variations in position translate to small variations in error
Differentiable - you can calculate the derivative of the function

Answer 47

* For large positive numbers, you get values close to 1
* For large negative numbers, you get values close to 0
* For numbers close to 0, you get values close to .5
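A quick check of those three regimes. The inputs 10 and -10 are arbitrary stand-ins for "large".

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

large_pos = sigmoid(10.0)   # close to 1
large_neg = sigmoid(-10.0)  # close to 0
at_zero = sigmoid(0.0)      # exactly 0.5

print(large_pos, large_neg, at_zero)
```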

Answer 48

Example shows the probability of points being blue. Going diagonally across the chart is a line (Wx + b = 0); anything above the line is blue, anything below the line is red. This illustrates taking the sigmoid of the function Wx + b. For values close to the line, you get a .5 probability, but as you move farther into blue, the probability gets higher.

Answer 49

Plug in the coordinate values to confirm the value of X. Input X into the sigmoid function. You don't need to actually calculate the formula because you know that if X = 0, the probability will be .5 (per the definition of a sigmoid).

Answer 50

1. Find the probabilities for each occurrence
2. Multiply these probabilities together (JUST IN THIS EXAMPLE. PRODUCTS ARE BAD, SUMS ARE GOOD.)
Whatever this value is, we try to maximize it to find the best model. In the picture, the model on the right displays a higher maximum probability (P | ALL) and thus is considered the superior model (it classifies all occurrences better than the other model).

Answer 51

Use the log function because it turns products into sums. We don't want products because, with many data points, a change to one value can alter the result drastically.

Answer 52

0. The log of any probability will be negative, so we must take the negative log of the probability.

Answer 53

If I have a bunch of events and a bunch of probabilities, how likely are those events to happen based on those probabilities? If likely, small CE, and vice versa.
1. Calculate the probabilities of the occurrences
2. Take the sum of the negative logs of each probability: low = good, high = bad
3. Goal is to minimize cross entropy, not maximize probabilities
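A small sketch showing that the sum of negative logs is the same quantity as the negative log of the product. The probabilities are invented for the example.

```python
import numpy as np

# Hypothetical probabilities the model assigns to what actually happened.
probs = np.array([0.6, 0.2, 0.1, 0.7])

product = np.prod(probs)                # shrinks quickly with many points
cross_entropy = -np.sum(np.log(probs))  # sum of negative logs

print(product, cross_entropy)
```

Maximizing the product and minimizing the sum pick the same model, but the sum stays numerically manageable as the number of points grows.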

Answer 54

In the left chart, the two points with 2.3 and 1.6 -log values correspond to the incorrectly classified points (red dot is in blue region, blue dot is in red region). You can see the values are larger in comparison to the correctly classified points (red in red region). In turn, the distance between correctly and incorrectly classified points is a measurable error that we try to minimize. The right chart has a lower cross entropy error because each point is correctly classified.

Answer 55

The white line represents a non-linear equation in which everything on the line has a probability of 50%. Anything in the blue space has a probability above 50% and the red space is below 50%. In the basic sense, we linearly combine probabilities for this point, and since the value has to be between 0 and 1, we apply a non-linear activation function such as sigmoid. We do this for all points.

Answer 56

It represents the non-linear boundary in white.

Answer 57

when you increase your variables(x1, x2, x3), you increase your dimensions. After 3 dimensions, it becomes very hard to visualize. But in 3 dimensions, now you just have planes in 3d space and the final product is some non linear plane.

Answer 58

Combine linear models into a non-linear model. Then you combine these non-linear models to make a more complex non-linear model. This is a deep neural net, which allows us to generate highly complex probability boundaries.

Answer 59

for multi-class classification, we add a layer in the model(with n nodes that represent the items being classified, in this case 3 nodes for 3 animals) that identifies the probability for each animal.

Answer 60

weights = 28\*28\*10 + Bias = 1\*10

Answer 61

1. Small changes in input don't lead to big changes in output 2. Derivatives are constant values

Answer 62

Skinny jeans "fit" great but they are really hard to get into. Thus, people wear jeans which are a little too big. In turn, the deep model that is just the right size for your data is very hard to optimize ("get into"). Thus, most models are a little or a lot too big for our data and we try our best not to overfit.

Answer 63

Early Termination - stop training when our validation set stops improving.

Answer 64

Regularization - applying artificial constraints to the network that implicitly reduce the number of free parameters while not making it more difficult to optimize (think yoga pants: not hard to get into, but fit great)

Answer 65

Add the L2 norm of the weights (multiplied by a small constant) to the loss. The term used here is the squared L2 norm, which is just the sum of squares of the individual elements in a vector.

Answer 66

Randomly setting activation values to zero. You are destroying up to half of your data flowing through the model. Then you randomly do it again.

Answer 67

Network can't rely on any given activations to be present. Forced to learn redundant representations. Takes consensus over ensemble of networks. If dropout doesn't work, you may need a bigger network.

Answer 68

No, you should only use drop-out on training. Set keep\_prob to 1 or remove on validation and test sets to ensure maximum accuracy
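A sketch of dropout with a keep probability, written in plain NumPy. The rescaling by 1 / keep\_prob ("inverted dropout") is an assumption of this example, not something the card states, and the function name is my own.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    # With keep_prob = 1 nothing is dropped (validation / test behavior).
    if keep_prob >= 1.0:
        return activations
    # Keep each activation with probability keep_prob, zero the rest.
    mask = rng.random(activations.shape) < keep_prob
    # Assumption: rescale survivors so the expected sum is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((4, 5))

train_out = dropout(activations, keep_prob=0.5, rng=rng)  # training
eval_out = dropout(activations, keep_prob=1.0, rng=rng)   # validation / test

print(train_out)
print(eval_out)
```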

Answer 69

Cross entropy is a connection between probabilities and error functions. It describes the difference between two vectors. The vectors could be predictions versus actuals. So, if wanting to predict an event, a lower error is desired.
small cross entropy = low error, more likely to occur
large cross entropy = large error, less likely to occur

Answer 70

Calculate the partial derivatives of the error with respect to each input. The picture below has two inputs (two dimensional).

Answer 71

Learning rate \* error term(either the output error term or the hidden unit error term) \* input(either a raw input value or the hidden unit activation value)

Answer 72

If the error function were discrete, small changes would not show up as small variations in error, so it could not give us an idea of what direction to take.

Answer 73

correctly classified = go farther away so my error is smaller, prediction closer to 1
incorrectly classified = come closer, error is smaller and prediction closer to 1

Answer 74

Add as many output nodes as classes

Answer 75

In the picture, if you want to find the partial derivative of B with respect to X, you just multiply the partial derivative of B with respect to A by the partial derivative of A with respect to X. Applies when you have functions of functions.

Answer 76

The output error term is slightly different from the hidden unit error terms. The output error term incorporates the model error as well as the gradient.
Output error term = Output Error \* Gradient of output activation function
Hidden unit error term = Output error term \* weights \* Gradient of hidden unit activation function
Since the model error is already incorporated in the output error term, we simply take the output error term and scale it by the weights and the gradient of the hidden unit activation function. The error term is used in calculating the weight steps. Once the error term is calculated, you multiply it by the input values (either a raw value or a hidden unit activation value) and the learning rate.
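A minimal sketch of both error terms and the resulting weight steps for one record, assuming a single hidden layer with sigmoid activations everywhere. All numbers, shapes, and variable names are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical record, target, learning rate, and weights.
x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5
w_in_hid = np.array([[0.5, -0.6],
                     [0.1, -0.2],
                     [0.1, 0.7]])   # 3 inputs -> 2 hidden units
w_hid_out = np.array([0.1, -0.3])   # 2 hidden units -> 1 output

# Forward pass.
hidden_out = sigmoid(x @ w_in_hid)
output = sigmoid(hidden_out @ w_hid_out)

# Output error term = output error * gradient of the output activation.
error = target - output
output_error_term = error * output * (1 - output)

# Hidden error term = output error term * weights * hidden gradient.
hidden_error_term = output_error_term * w_hid_out * hidden_out * (1 - hidden_out)

# Weight step = learning rate * error term * input to that layer.
delta_w_hid_out = learnrate * output_error_term * hidden_out
delta_w_in_hid = learnrate * hidden_error_term * x[:, None]

print(delta_w_hid_out.shape, delta_w_in_hid.shape)
```

The sigmoid gradient is written as `out * (1 - out)`, which is why the activation values are reused in the error terms.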

Answer 77

Accounting for model complexity, where # of epochs indicates model complexity, the early stopping algo will stop the training once the testing error stops decreasing and starts to increase, while the training error continues to decrease

Answer 78

Sigmoid of a larger number is closer to 1 and sigmoid of a smaller number is closer to zero. The concept illustrates multiplying the activation formula by a scalar. For the sigmoid activation formula, this leads to steeper slopes and a higher chance that values are either 1 or 0, with nothing in between. Be wary of super accurate training models as they may overfit.

Answer 79

Harder to do gradient descent with a lower range of continuous values. Harder to tune the model to correct errors. Model on right is too certain.

Answer 80

Punish large coefficients/weights to avoid steep slopes. Add a term to the error function.

Answer 81

L1 = Lambda \* sum of absolute values of weights
L2 = Lambda \* sum of squares of weights
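Both penalty terms as a short sketch. The weights and the lambda value are arbitrary; in practice the chosen term is added to the error function.

```python
import numpy as np

weights = np.array([0.5, -1.5, 2.0])  # hypothetical weights
lam = 0.01                            # hypothetical lambda

l1_penalty = lam * np.sum(np.abs(weights))  # lambda * sum of |w|
l2_penalty = lam * np.sum(weights ** 2)     # lambda * sum of w^2

print(l1_penalty, l2_penalty)
```

Here the sums are 4.0 and 6.5, so the squared version weighs the largest weight much more heavily.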

Answer 82

L1 = weights pushed toward a sparse vector (many weights end up at 0). Good for feature selection, as 0 weights indicate features that are not valuable in prediction. May have 100's of features. Good if you want fewer weights and to end with a small set.
L2 = weights converted to a small homogeneous set. Typically best for training models, as weights can be tuned (versus the on or off setting in L1).
Example shows taking the absolute value versus the sum of squares.

Answer 83

Probability that each node will be turned off during each epoch. Requires other nodes to pick up the slack, which prevents overfitting. Like having a dominant right hand in basketball, and using the left hand only during practice to better the overall skill.

Answer 84

Random Restarts - Start from a handful of different places and perform gradient descent from there.

Answer 85

Using the sigmoid function as an example, when you approach values on the left or right side of the function, the gradient is really close to zero(because the slope is flat)

Answer 86

Use different activation functions which allow for a wider range of gradients other than zero: ReLU, tanh

Answer 87

Batch - All data run through the neural net, one step, computationally expensive
Stochastic - Taking small batches of data, multiple steps, fast

Answer 88

lower typically better as smaller steps will lead to convergence. Large steps may skip over minimum.

Answer 89

Use a weighted average of previous steps to avoid getting stuck in a local minimum. The most recent step is weighted highest, with the weight decreasing for each earlier step. May even bounce past the global minimum, but not by very much. Beta is between 0 & 1.
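One way to read that weighting, sketched below: each earlier step is multiplied by one more power of beta. The function name and the example steps are assumptions for illustration.

```python
def momentum_step(previous_steps, beta):
    """Combine steps, most recent first, weighting step k by beta**k."""
    return sum(step * beta ** k for k, step in enumerate(previous_steps))

# Hypothetical history: three equal steps, most recent first.
step = momentum_step([1.0, 1.0, 1.0], beta=0.5)
print(step)  # 1.0 + 0.5 + 0.25 = 1.75
```

With beta between 0 and 1, older steps fade out, so the accumulated push can carry the weights over a small hump without dominating the latest gradient.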


(132 cards)