Deep Learning, xgb, BERT Flashcards
Intuitively, why should a NN/MLP with a bunch of successive layers of processing be good at finding patterns, like identifying images of digits?
The intuitive idea is that each subsequent layer is being trained to recognize higher-level patterns. So maybe layer 1 is edge detection, layer 2 is finding a shape like a circle, and layer 3 can identify full digits.
In a more complex image, maybe layer 1 is lines, layer 3 is texture, etc.
In a “vanilla NN”, or MLP, how does a given layer of processing work? How do we go from layer i of size N to layer i+1 of size M?
Each of the M neurons in the output layer is computed by taking a weighted sum of all the values of the input layer (plus a bias), then passing it through an activation function. Typically the weights are learned but the activation is not, it’s something like relu or sigmoid.
So in order to get one of the output neurons, you take the N inputs, plus an input of 1 that’ll be multiplied by the bias, as a column vector, and multiply them by a length-(N+1) row vector of weights; then you pass that output through the activation.
So if you want a length-M output, you need M row vectors, and thus you’re multiplying the length-(N+1) input by an M x (N+1) matrix to get the length-M output (which goes through the activation).
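A tiny numpy sketch of one such layer (my own toy example; here the bias is kept separate rather than folded in as an extra constant-1 input, which is equivalent):

```python
import numpy as np

def dense_layer(x, W, b, activation):
    """One MLP layer: weighted sums plus bias, then an elementwise activation."""
    # x: (n_in,) input vector; W: (n_out, n_in) weight matrix; b: (n_out,) biases
    z = W @ x + b          # length n_out vector of weighted sums
    return activation(z)   # elementwise nonlinearity, e.g. relu or sigmoid

rng = np.random.default_rng(0)
x = rng.normal(size=4)                  # layer i has N=4 neurons
W = rng.normal(size=(3, 4))             # layer i+1 has M=3 neurons
b = np.zeros(3)
out = dense_layer(x, W, b, lambda z: np.maximum(z, 0))  # relu activation
print(out.shape)  # (3,)
```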
What is the sigmoid activation? What is its formula, and what does the graph look like? What does it functionally “do”?
It squishes all the real numbers to between 0 and 1, like in logistic regression. The formula is sigmoid(x) = 1 / (1 + e^(-x)); the graph is an S-shaped curve through (0, 0.5), approaching 0 for very negative inputs and 1 for very positive inputs.
What does the relu activation function look like?
ReLU(x) = max(0, x): flat at zero for negative inputs, then the identity line for positive inputs.
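A quick numpy sketch of both activations:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into (0, 1); S-shaped curve centered at 0.5
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(xs))  # ~[0.007, 0.269, 0.5, 0.731, 0.993]
print(relu(xs))     # [0. 0. 0. 1. 5.]
```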
What is the softmax function? How is it computed, and what is it used for?
The softmax is the go-to output layer if you’re predicting a categorical variable with more than 2 categories. All the layer outputs are between 0 and 1, and they sum to 1 – so they’re basically probabilities (they aren’t exactly but can often kinda be interpreted that way), and whichever outcome class is being predicted as having highest probability is chosen.
The formula, where there are K classes and each has a corresponding raw score (logit) z_i: softmax(z_i) = e^(z_i) / (e^(z_1) + ... + e^(z_K)).
It’s similar to sigmoid (in the 2-class case it reduces to a sigmoid of the difference of the two logits).
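A minimal numpy softmax (with the standard subtract-the-max trick so exp doesn’t overflow):

```python
import numpy as np

def softmax(z):
    # Subtracting the max doesn't change the result but avoids overflow in exp
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)           # ~[0.659, 0.242, 0.099]
print(probs.sum())     # 1.0
print(probs.argmax())  # 0 -> the predicted class
```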
When learning an NN, what gradient are you calculating during optimization, and why? How does gradient descent work?
In order to optimize a neural network, you need to find the derivative of the loss function with respect to each of the weights in the network (maybe thousands or millions), and then you update the weights by taking a small step in the opposite direction of the gradient (downhill on the loss).
If you want the partial derivative of a function with respect to each input variable, that’s the gradient: the gradient of the loss function is the vector of the function’s partial derivatives with respect to each parameter. So that’s what we calculate and optimize based on.
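A toy sketch of the update rule on a made-up quadratic loss (not a real network, just gradient descent itself):

```python
import numpy as np

# Toy loss: L(w) = sum((w - target)^2), so dL/dw = 2 * (w - target)
target = np.array([3.0, -1.0])
w = np.array([0.0, 0.0])
learning_rate = 0.1

for step in range(50):
    grad = 2 * (w - target)       # the gradient: partial derivative w.r.t. each weight
    w = w - learning_rate * grad  # step in the *opposite* direction of the gradient

print(w)  # close to [3.0, -1.0], the minimum of the loss
```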
Conceptually, how does backpropagation work?
Basically you use the chain rule to efficiently get the partial derivatives one layer at a time.
You start by setting up the formulas to get the partial derivatives of the loss function with respect to the weights in the last layer. **These formulas will depend on the activation of the previous layer**, but you just hold that value constant while simply calculating the partial derivatives of this layer.
Then, basically using the chain rule, you substitute in the formula for the activation from the previous layer, and now holding constant the stuff from the subsequent layer, you simply calculate your next round of partial derivatives.
Then repeat, because now the formula is dependent on the activation of the previous layer, which you can again substitute in, etc.! I’m not gonna get totally into the weeds memorizing the exact math.
New simple and valuable thing to remember: The chain rule is just dy/dx = dy/du * du/dx, so it makes sense that dLoss/dLayer_i = dLoss/dLayer_(i+1) * dLayer_(i+1)/dLayer_i, working backwards from the output. And that shows clearly how each gradient is built from the ones after it, so gradients are eventually long chains of multiplied terms (which could lead to vanishing gradients).
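A tiny made-up numeric example of chaining gradients backwards through a one-hidden-unit network:

```python
import numpy as np

# Tiny network: h = sigmoid(w1*x), y_hat = w2*h, loss = (y_hat - y)^2
x, y = 2.0, 1.0
w1, w2 = 0.5, -0.3

h = 1 / (1 + np.exp(-w1 * x))
y_hat = w2 * h
loss = (y_hat - y) ** 2

# Backprop: start at the output and chain backwards
dL_dyhat = 2 * (y_hat - y)
dL_dw2 = dL_dyhat * h              # gradient for the last layer's weight
dL_dh = dL_dyhat * w2              # reuse the upstream gradient...
dL_dw1 = dL_dh * h * (1 - h) * x   # ...then chain through sigmoid'(z) = h*(1-h) and z = w1*x
print(dL_dw2, dL_dw1)
```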
What is one-hot encoding? Why is it needed for neural networks?
Basically if you have a categorical variable with N>2 categories, you’ll represent each row’s value of that variable with N columns, each pertaining to one of the N categories. There’ll be a 1 for the category in that row, and 0s otherwise.
You need to one-hot encode because NNs need numerical inputs, so they can do computations by multiplying input vectors by weight matrices, and use derivatives of numerical formulas to optimize.
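A minimal numpy sketch (category names made up):

```python
import numpy as np

categories = ["red", "green", "blue"]
rows = ["green", "blue", "green", "red"]

# One column per category; 1 for the row's category, 0 otherwise
index = {c: i for i, c in enumerate(categories)}
one_hot = np.zeros((len(rows), len(categories)), dtype=int)
for r, value in enumerate(rows):
    one_hot[r, index[value]] = 1

print(one_hot)
# [[0 1 0]
#  [0 0 1]
#  [0 1 0]
#  [1 0 0]]
```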
Why is the activation function important?
Without a nonlinear activation, you would just be learning a bunch of complex weighted sums of the inputs; it would be all linear. Nonlinear activations let you learn nonlinear relationships, which is where the magic happens.
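A quick numpy demo of that point: stacking linear layers with no activation collapses into a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, W2 = rng.normal(size=(5, 4)), rng.normal(size=(3, 5))

# Two stacked linear layers with no activation...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer with weights W2 @ W1
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True
```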
How are log loss and cross entropy loss related? How do they work?
New:
Remember the specifics here: it’s the sum over classes of -isCorrect * log(predictedProb).
So a term only has weight if it’s the predicted probability for the correct label (I had misremembered it as the other classes’ terms having weight, rather than that one).
And log(1) = 0, while the log of a small decimal is a big negative number. So if you predict a low probability for the true class, you get a big negative log value, then take its negative to get a big positive loss, as desired. (This is sketched in code at the end of this card.)
I’m confident this is the case.
Original:
Log loss (also called binary cross entropy) is for a binary categorical variable and cross entropy is for 3+ outcomes, but they’re basically the same thing; it’s like sigmoid vs softmax.
These loss functions are just the negative log likelihood. So we are trying to find the maximum likelihood estimate of the parameters: we try to find the parameters such that “the likelihood that those parameters, and the probabilities they yield, would have produced this dataset” is maximized.
So like, when we’re predicting a categorical variable, our model’s output is a bunch of probabilities. We want to get those probabilities close to 1 for the correct answer and 0 for everything else, because that is the maximum likelihood solution: those are the probabilities most likely to have produced these labels, and thus the parameters that yield those probabilities are the maximum likelihood parameters.
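A quick numpy sketch of the “New” version above: only the predicted probability for the true class contributes, and predicting low for the truth gives a big loss (numbers made up):

```python
import numpy as np

def cross_entropy(predicted_probs, true_class):
    # -sum(isCorrect * log(p)) reduces to -log of the prob assigned to the true class
    return -np.log(predicted_probs[true_class])

confident_right = np.array([0.9, 0.05, 0.05])
confident_wrong = np.array([0.05, 0.9, 0.05])
print(cross_entropy(confident_right, 0))  # ~0.105 (small loss)
print(cross_entropy(confident_wrong, 0))  # ~3.0   (big loss)
```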
What’s the formula for log likelihood, aka binary cross entropy?
For a single example with true label y in {0, 1} and predicted probability p, it’s -(y*log(p) + (1-y)*log(1-p)), averaged (or summed) over the dataset. Hopefully this isn’t that important to memorize if you’ve got the concept.
What final activation is typically used, and what loss function is typically used, for predicting a binary categorical variable?
What about a categorical with 3+ options?
Activation is sigmoid, loss is BCE.
For 3+, activation is softmax, loss is cross entropy.
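In PyTorch terms (a sketch, assuming torch): nn.BCEWithLogitsLoss bundles the sigmoid into the BCE loss, and nn.CrossEntropyLoss bundles the (log-)softmax into cross entropy, so the model just outputs raw logits.

```python
import torch
import torch.nn as nn

# Binary case: one logit per example, target is 0.0 or 1.0
binary_logits = torch.tensor([1.5, -0.3])
binary_targets = torch.tensor([1.0, 0.0])
bce = nn.BCEWithLogitsLoss()   # applies sigmoid internally
print(bce(binary_logits, binary_targets))

# Multi-class case: one logit per class, target is the class index
multi_logits = torch.tensor([[2.0, 0.5, -1.0]])
multi_targets = torch.tensor([0])
ce = nn.CrossEntropyLoss()     # applies log-softmax internally
print(ce(multi_logits, multi_targets))
```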
Why is it important to normalize all of your input columns?
So all of the input columns have the same scale, making it easier to learn at approximately the same rate (and using the same learning rate parameter) for each input.
If one column had a really big scale and another had a really small scale, a single learning rate would be hard to get right for both: a step size that works for one column could be way too big or too small for the other.
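A minimal sketch (made-up numbers); the key detail is fitting the mean/std on the training data and reusing them for new data:

```python
import numpy as np

X_train = np.array([[1000.0, 0.01],
                    [2000.0, 0.03],
                    [1500.0, 0.02]])  # two columns on wildly different scales

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_norm = (X_train - mean) / std  # each column now has mean 0, std 1

# Apply the *same* mean/std to validation/test data
X_new = np.array([[1800.0, 0.015]])
X_new_norm = (X_new - mean) / std
```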
What is the learning rate?
When would you decrease the learning rate? When would you increase it?
The learning rate is a positive scalar that determines how large of a step you take in the opposite direction of the gradient each time you take a step.
You would increase it if you’re learning too slowly, and decrease it if training is jagged or the loss is bouncing around/overshooting instead of converging.
What is dropout regularization? Why does it work as a regularization tactic?
Dropout regularization is when we give nodes in the network a probability that they will be turned off on a training pass. So each time the model is run during training, we look at each node that might turn off, and if we pull the appropriate random number, set it to zero for this training run.
So for every training evaluation, we’re using a random subset of the nodes; the other nodes, and by extension their incoming and outgoing connections, are removed. (We don’t do dropout during validation or testing.)
My intuitive understanding of why it works for overfitting: first of all, it on average decreases the size of the model during training, and smaller/less complex models overfit less.
Also, because the model can’t count on a specific node being present on a given run, it’s harder to, say, memorize one specific training point’s outcome in one specific node. If the model were trying to encode each training point’s individual outcome using one node each, that wouldn’t work super well with high dropout.
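A PyTorch sketch: dropout is active in train() mode and automatically a no-op in eval() mode.

```python
import torch
import torch.nn as nn

layer = nn.Dropout(p=0.5)  # each element has a 50% chance of being zeroed during training
x = torch.ones(8)

layer.train()
print(layer(x))  # roughly half the entries zeroed, survivors scaled up by 1/(1-p)
layer.eval()
print(layer(x))  # all ones: dropout does nothing at evaluation time
```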
What causes vanishing gradients in neural networks, especially deep neural networks?
Certain activation functions have areas where their derivatives are very near zero: for example, the extreme values of sigmoid. So if all or most of the neurons get to the extreme values of sigmoid, the gradients will have a lot of very-near-zero values, which causes very slow training.
This is exacerbated by the fact that derivatives in NNs are often basically the product of several of these individual derivatives, chained together by the chain rule. So you’ve got a bunch of near-zero values multiplied together.
Intuitively, why does using the relu activation function combat vanishing gradients, and exploding gradients?
A derivative in an NN is usually a bunch of individual derivatives of the activation function multiplied together, because of the use of the chain rule in backpropagation.
If the activation derivative tends to be less than 1 (as with the extremes of sigmoid), these products will tend toward zero and vanish. If the derivatives tend to be greater than 1, the products will tend toward infinity and explode.
But the derivative of relu is always either zero or 1. So the product of a bunch of these derivatives will be either zero or 1, and some of them will typically be 1, because the network needs some information flowing through for each point. So there are usually some gradients that aren’t vanishing and aren’t exploding.
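A quick numpy illustration (made-up inputs): a chain of 20 sigmoid derivatives at largish inputs vanishes, while a chain of relu derivatives on the active path stays at 1.

```python
import numpy as np

def sigmoid_deriv(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)  # maxes out at 0.25, near zero at the extremes

# Chain-rule product of 20 activation derivatives
sigmoid_chain = np.prod(sigmoid_deriv(np.full(20, 4.0)))
relu_chain = np.prod(np.ones(20))  # relu derivative is 1 on the active path
print(sigmoid_chain)  # on the order of 1e-35: vanished
print(relu_chain)     # 1.0
```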
How do you get the best of both worlds of normal (full-batch) gradient descent and stochastic gradient descent?
Mini-batch gradient descent: take a step every batch of k datapoints, rather than every single datapoint or only once per epoch. Super common (and in practice usually what people mean by “SGD”).
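A sketch of the mini-batch loop (toy data, no real model; batch size 32 is just a common choice):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(1000, 5))
y = np.random.default_rng(1).normal(size=1000)
batch_size = 32  # k datapoints per step

for epoch in range(3):
    order = np.random.permutation(len(X))  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        X_batch, y_batch = X[idx], y[idx]
        # ...compute loss and gradient on just this batch, then take one step...
```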
Why is learning rate decay useful?
Usually we want to take large steps at the beginning and small steps at the end: at the end we’re near a local minimum and just want to refine slightly, whereas at the beginning we probably have quite a ways to go.
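One common schedule is exponential decay; a tiny sketch (decay rate made up):

```python
initial_lr, decay_rate = 0.1, 0.95

# Exponential decay: big steps early, smaller steps as training goes on
for epoch in range(5):
    lr = initial_lr * decay_rate ** epoch
    print(epoch, lr)  # 0.1, 0.095, ~0.090, ~0.086, ~0.081
```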
How does momentum work, and what purpose does it attempt to solve?
In momentum, rather than taking a step in the direction of the current gradient, you take a step in the direction of an exponentially decaying weighted sum of all past gradients.
The hope is that it helps you “power through” local minima to reach global minima. For example, if you reach the bottom of a shallow local minimum, the current gradient is zero, but the previous gradients still point in the direction you were moving and carry you through.
Another benefit is that momentum helps decrease jagged training. If the objective function points in a consistent direction along one dimension but is prone to jumping around along another (picture a long, narrow, oval-shaped valley), momentum smooths this out.
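A sketch of the classic momentum update on a made-up stretched-bowl loss (beta = 0.9 is a common default):

```python
import numpy as np

def toy_grad(w):
    # Gradient of a toy bowl-shaped loss, much steeper in one dimension than the other
    return np.array([0.1 * w[0], 10.0 * w[1]])

w = np.array([10.0, 1.0])
velocity = np.zeros_like(w)
learning_rate, beta = 0.05, 0.9

for step in range(100):
    velocity = beta * velocity + toy_grad(w)  # exponentially decaying sum of past gradients
    w = w - learning_rate * velocity          # step along the smoothed direction

print(w)  # both coordinates have shrunk toward the minimum at [0, 0]
```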
What little optimization can often be made to the pairing of softmax output and cross entropy loss?
Rather than having softmax output probabilities, have it output the logs of the probabilities, and alter cross entropy to receive them. As we know, optimizing based on the logs achieves the same optimization, and working in log space is often more computationally efficient and numerically stable.
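In PyTorch terms (a sketch): log_softmax plus NLLLoss is exactly the fused cross_entropy, just done explicitly in log space.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])

log_probs = F.log_softmax(logits, dim=1)  # model outputs log-probabilities
loss_a = F.nll_loss(log_probs, target)    # negative log likelihood on the log-probs
loss_b = F.cross_entropy(logits, target)  # same thing, fused into one op
print(loss_a, loss_b)                     # identical values
```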
What is a good default approach to randomly initializing the weights and biases of an NN?
Init biases to zero; this is just super common.
Weights: when choosing outgoing weights from a layer with n nodes, we sample weights from a normal distribution with mean zero and stddev 1/sqrt(n)
The general idea is to scale the weights down as the number of nodes in the previous layer grows: the spread of the weights is inversely proportional to (the square root of) the number of inputs feeding each neuron.
Intuitively, we can say that by scaling based on the number of nodes feeding into the next layer, the inputs to the next layer aren’t too big or small, and they aren’t really dependent on the number of weights. But this is super hand-wavey, so I feel fine basically just saying “experimentally, this works really well.”
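A numpy sketch of that default scheme:

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    # Biases start at zero; weights ~ Normal(0, 1/sqrt(n_in))
    weights = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n_in), size=(n_out, n_in))
    biases = np.zeros(n_out)
    return weights, biases

rng = np.random.default_rng(0)
W, b = init_layer(n_in=256, n_out=128, rng=rng)
print(W.std())  # roughly 1/sqrt(256) = 0.0625
```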
What are word embeddings?
A set of word embeddings is a mapping from each word in your vocabulary to a vector of a fixed length, say 768 (much shorter than your vocab size), where each word’s embedding contains meaningful information about the word’s meaning, its grammatical function, its relationship to other words, etc.
What is one potential danger of word embeddings?
Retaining the biases of the training data. For example, the embedding for “homemaker” might be closer to “woman” than “man”. Debiasing strategies become important for this reason.
What are 2 big advantages of using word embeddings over just the one-hot encodings of the words?
- It significantly reduces the dimensionality. Training is inefficient with one-hot because, if the input vector has one 1 and thousands of zeros, very few weights are lit up and thus are optimized per input.
- They learn semantically meaningful information. They learn that “sandwich” and “hoagie” are similar, so the learning from one can immediately apply to another. That’s better than having to relearn similar weights coming out of the node for sandwich and the node for hoagie.
Generally speaking, how are word embeddings learned?
In an unsupervised way, from a large and general text corpus like the set of all wikipedia articles.
What are 2 different ways word embeddings could be included in an NN, and how would each scenario work computationally and during training?
1: Each word is mapped to its pre-learned embedding using a key-value lookup table, and that is fed to the NN rather than the word itself. This key-value lookup is the first layer in the NN.
2: Custom word embeddings are learned as part of the NN you’re training (maybe using an initialization via transfer learning or maybe not). So the start of your network would be a sub-architecture creating the word embeddings instead of just a lookup table.
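A PyTorch sketch of option 2 (a learnable embedding table as the first layer; the vocab size and embedding dimension are made up), with option 1 noted in a comment:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10_000, 128
embedding = nn.Embedding(vocab_size, embedding_dim)  # a learnable lookup table

word_ids = torch.tensor([[12, 407, 9981]])  # a "sentence" of 3 token ids
vectors = embedding(word_ids)               # shape (1, 3, 128)
print(vectors.shape)

# For option 1 (frozen, pre-learned embeddings), load them and turn off gradients:
# embedding = nn.Embedding.from_pretrained(pretrained_matrix, freeze=True)
```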
What is the general heuristic of the word2vec algorithm for word embeddings? Which words does it want to have similar embeddings?
The algorithm assumes words that often have similar contexts are similar, and thus should have similar embeddings.
What is a super fun example of doing arithmetic in the vector space of word embeddings?
King - Man + Woman = Queen :D
I’m not gonna record the nuts and bolts of the algorithms that learn word2vec embeddings, but what is the super general idea of the two algorithms you can use?
One is to use a word to predict its context (skip-gram), and one is to use a word’s context to predict that word (CBOW, continuous bag of words).
What is the general idea of attention in an NN?
The computer pays attention to the relevant parts of the inputs at each step of learning or prediction. For example, in image classification, just looking at the pixels that contain the phenomena of interest, or in language translation, looking at the relevant words before writing the next word in your translation.
This method is big for seq-to-seq tasks like language translation.
Encoder/decoder architectures have been used to learn information from one source and translate it to another source. For example, an encoder CNN might understand an image, and the decoder RNN will write a text description of it.
But commonly, the encoder and decoder are both RNNs, for tasks like language translation. How does this traditional architecture work, and what is one big drawback?
The encoder RNN iterates over the input, updating its hidden state at every time step. Then the final hidden state of the encoder RNN is passed to the decoder RNN, which iteratively creates the translated sentence one word at a time.
The big drawback is the decoder only has access to the final hidden state of the encoder RNN, so it only has access to what that final state was “thinking about”. Even if it’s an LSTM, maybe the long-term memory is only holding onto a certain portion of the input by the time we reach the end, rather than the whole thing. It’s just gonna be difficult for the encoder to learn to cram all the necessary information about the input into that one state for the decoder.
How does attention work, at a high level, for an encoder-decoder where both are RNNs?
The encoder passes all of its hidden states, from every time step, to the decoder, rather than just the last one. This is great: the decoder has access to a hidden state for each individual part of the input, so it can sort of understand all parts of the input equally well.
Then, at each time step in the decoder (i.e. at each word it’s trying to produce), it focuses on the most important parts of the input. It learns parameters during training that figure out which parts of an input are important to focus on based on what part of the output it’s trying to produce.
It does this basically by learning to, at a given time step, assign each of the encoder’s hidden states a weight based on how important it is, and then calculating a “context vector”, which is just the weighted sum of all the encoder’s hidden states.
I’m not gonna get more in the weeds than that.
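That said, the core computation is small enough to sketch in numpy (simplified: relevance scored with a plain dot product, where real models use learned parameters):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder hidden states: one vector per input word, shape (num_words, hidden_dim)
encoder_states = np.random.default_rng(0).normal(size=(6, 16))
decoder_state = np.random.default_rng(1).normal(size=16)  # current decoder hidden state

scores = encoder_states @ decoder_state  # how relevant is each input word right now?
weights = softmax(scores)                # attention weights, sum to 1
context = weights @ encoder_states       # weighted sum of encoder states, shape (16,)
print(weights.round(2), context.shape)
```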
What is an example of when the decoder would need to have attention on more than one word in the input/encoder? Say for example we’re translating from English to Spanish?
Say we’re translating the sentence “I throw the ball.” In English, the conjugation “throw” can be used for the subject “I”, “we”, or “they”.
But in Spanish, each of these has a unique conjugation. So when writing the word “throw” in Spanish, the decoder can’t just look at the corresponding word “throw” in English; it also has to look at the subject of the sentence, so it knows how to conjugate in Spanish.
How might attention be used if the encoder is interpreting an image, and the decoder is writing a text description of that image?
The decoder figures out where in the image is relevant to the particular part of the description it’s currently writing. Awesome.
Do you use dropout during evaluation, or just training?
Just training.
This makes some intuitive sense: if you had regularization as a part of your objective function rather than through dropout, you would want to penalize the model for complex weights during training, but when examining the validation set you really just wanna see how good your predictions are regardless of how complex the model parameters are. So I suppose the analog is also true for dropout.
Greyscale images are stored with pixels between 0 and 255. What preprocessing step is basically always done, and why is it helpful for learning?
Normalizing the input, as usual! Subtract mean, divide by std dev. (There are some tricks to quickly approximate this process that probably aren’t important rn.)
This way, the network is receiving a standardized distribution of pixel values regardless of the input image, which helps training. Otherwise dim images vs. bright images would be hard to treat similarly, for example.
What are 3 advantages to using a CNN for computer vision as opposed to a normal MLP?
- Fewer parameters: the same set of parameters is applied again and again, making the network simpler and probably decreasing overfitting.
- Because it’s using the same weights in different places, learning from one place can be applied elsewhere. A bird will look the same in the top right vs the bottom left; an MLP would have to re-learn that at every location in the image, but the CNN can learn it once and apply it elsewhere. In that way it’s like an RNN: a word means the same thing at the beginning or end of a sentence.
- Because we’re using a square convolution, it uses spatial information much more naturally than an MLP, which would receive the input flattened into one long vector and lose the 2D neighborhood structure of the pixels (vertically adjacent pixels end up far apart in the flattened vector).
How does a convolutional layer work in its most basic form? Say we have a square input greyscale image, and we’re applying a single convolution to it with dimension 3x3. How would the next layer be calculated?
The convolutional layer is going to have a convolution, or ‘filter’, which is a 3x3 array of learned weights. To perform a convolution, you apply it to a part of the grid by multiplying the pixel values by the corresponding weight, then summing the results, and then passing the sum through an activation function.
You do that for all parts of the image (depending on stride and padding and such, but ignore that for now): you scan across the image continually applying the convolution to form the output of the layer, which is still square.
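A naive numpy sketch of exactly that (stride 1, no padding, relu activation):

```python
import numpy as np

def conv2d(image, kernel):
    # image: (H, W) greyscale; kernel: (3, 3) learned weights
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # elementwise multiply, then sum
    return np.maximum(out, 0)                   # relu activation

image = np.random.default_rng(0).normal(size=(28, 28))
kernel = np.random.default_rng(1).normal(size=(3, 3))
print(conv2d(image, kernel).shape)  # (26, 26): no padding, so it shrinks by 2
```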
If you’re trying to have an application of the convolution centered at every pixel in the input image, handling the edges gets weird: with a 3x3 convolution, if you’re on an edge, there will be parts of the image with no input pixel to multiply?
What are three ways this can be handled? What is the most common?
Three common options: (1) zero-pad the border so the kernel always has inputs to multiply, (2) only apply the kernel where it fully fits (a “valid” convolution, so the output shrinks), or (3) extend the image by replicating or mirroring the edge pixels. I’m pretty sure the most common is (zero-)padding. It feels common.
How are color images, or ‘RGB’ images, represented as numerical input to a CNN? How does this compare to a greyscale image?
Say a greyscale image is 28x28 pixels. It is represented by a 28x28 grid of scalar values between 0 and 255, denoting brightness at a given pixel.
RGB images need to keep track of not just one color (and not just one “brightness level”), but three: red, green and blue. So it is represented by three 28x28 grids of scalar values between 0 and 255, with one pertaining to the “red brightness”, one to blue, and one to green.
So the greyscale image is represented as a (28,28) matrix. The RGB image is a (28,28,3) array: it has three “channels”, and is referred to as having a “depth” of 3: its width and height are 28, and its depth is 3.
What is a convolution’s stride?
The amount of pixels it moves at a time. If it’s one, it scans one pixel at a time. If it’s 2, it skips every other pixel. And so on.
Suppose we’re using a 3x3 kernel, a stride of 1, and we haven’t padded or extended the image. If the input is NxN, what size will the output be?
What if it’s instead a 5x5 kernel?
3x3: On every side of the image, there will be one row/column that the kernel can’t be centered on, so each dimension shrinks by 2. The output is (N-2)x(N-2).
5x5: now there are two rows/columns on each side that the kernel can’t be centered on, so each dimension shrinks by 4: the output is (N-4)x(N-4).
Suppose we’ve padded an image such that, with a stride of 1, the output image will be the same size as the input.
If the stride is 2, what size will the output be?
If the input is NxN, it’ll become (N/2)x(N/2) (roughly, rounding up), because an output pixel is only being produced for every other input pixel, along both the rows and the columns.
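The general output-size bookkeeping as a tiny sketch (covers the last two cards):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    # Standard formula: floor((n + 2*padding - kernel) / stride) + 1
    return (n + 2 * padding - kernel) // stride + 1

print(conv_output_size(28, kernel=3))                       # 26 -> (N-2)
print(conv_output_size(28, kernel=5))                       # 24 -> (N-4)
print(conv_output_size(28, kernel=3, stride=2, padding=1))  # 14 -> roughly N/2
```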
How is a convolution applied to an RGB image? What shape would the filter be, and how would the resulting output be calculated?
An RGB image has 3 channels, so its shape is something like (28,28,3).
In a normal 28x28 image, we’d have a filter like a 5x5 array of weights, and we’d apply it at a point by multiplying the weights by the corresponding pixels, summing all the resulting numbers, and passing through an activation.
The RGB case is similar, except now the filter is 5x5x3. The height and width can be whatever, but the depth of the kernel will equal the depth of the image, so we can learn about each of the input channels. This way we basically have three 5x5 kernels being applied to the image: one to the red values, one to blue, and one to green. Then all 5x5x3=75 results are added together, across all 3 channels, and then passed through an activation function.
So conceptually, an edge detector could learn how to detect edges separately in each of the 3 colors, having one detector for each color. For example.
Suppose an input is of shape (28x28x3) (as with an RGB input image), and we want the output of our convolutional layer to be of size (28x28x4). Don’t get bogged down with getting the output 28x28 part right, and focus on the 4.
How would this work? Describe what weights the convolutional layer has, and how it applies them to get the output with a depth of 4.
Say we use a 5x5x3 filter, and we pad the image such that with a stride of 1, the outcome will be 28x28x_. What will the depth be?
Well we scan the width and height of the image, and at each point we apply our “three separate 5x5 filters” to the three channels, sum the 75 outputs across all 3 channels into one scalar, and pass through activation. So we’re getting one scalar at each point. That means the output is depth 1: 28x28x1.
So how do we get 28x28x4? We learn 4 different, separate 5x5x3 filters. Each will result in its own 28x28x1 output, yielding a 28x28x4 output.
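A PyTorch sketch of that layer: 3 input channels, 4 filters of size 5x5x3, and padding chosen so the spatial size stays 28x28.

```python
import torch
import torch.nn as nn

# in_channels=3 (RGB), out_channels=4 (four separate 5x5x3 filters), padding=2 keeps 28x28
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=5, stride=1, padding=2)

image = torch.randn(1, 3, 28, 28)  # batch of 1 RGB image (channels-first layout)
output = conv(image)
print(output.shape)       # torch.Size([1, 4, 28, 28])
print(conv.weight.shape)  # torch.Size([4, 3, 5, 5]): 4 filters, each 3x5x5
```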
Why would we want to have a convolutional layer whose output depth is higher than 1? Why would we wanna have, say, 5 different 5x5x3 filters to apply to a 28x28x3 RGB image, so the output dimension is 28x28x5?
Each filter can learn something different about the input! Maybe one detects edges, one records how bright it is, one checks if the dominant color is red, etc. Or maybe they just all detect different types of edges. One filter can only really learn one thing, but using multiple allows us to learn more complex and varied information during each layer.
We’ve learned that, to apply a convolution to an RGB input, we need a filter of a shape like (5x5x3): the depth then corresponds to the three color channels.
What if we’re later in the CNN, and the input is of shape (128,128,25)? How would we make a filter for that?
Something like (5x5x25)! Whether it’s the input layer or not, all we need is for the depth of the kernel to match the depth of the input, so we can apply a 2D filter to each of the channels and learn about all of them.