Language Models Flashcards

1
Q

What is an embedding?

A

It is a low-dimensional representation of sparse data.

2
Q

For what kind of data is an embedding specifically useful? Example?

A

For sparse data – such as a one-hot encoding.
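
For illustration, a minimal NumPy sketch (toy 5-word vocabulary and a random, untrained embedding matrix) of how a sparse one-hot vector relates to a dense embedding lookup:

import numpy as np

vocab = ["cat", "dog", "sushi", "bratwurst", "tiger"]  # toy vocabulary
embedding_dim = 3
rng = np.random.default_rng(0)

# In a real model this matrix is learned; here it is random for illustration.
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

word = "dog"
index = vocab.index(word)

# Sparse representation: a one-hot vector, mostly zeros.
one_hot = np.zeros(len(vocab))
one_hot[index] = 1.0

# Dense representation: multiplying the one-hot vector by the embedding
# matrix is the same as just looking up that word's row.
dense = one_hot @ embedding_matrix
assert np.allclose(dense, embedding_matrix[index])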

3
Q

What is word2vec?

A

A technique (and the resulting family of models) for learning word embeddings from large text corpora.

4
Q

How do you create an embedding?

A

You train a neural network on a prediction task based on similarity (for example, predicting a word from the words that appear around it). The first hidden layer – the neurons the input values are mapped to – is called the “projection layer”; it projects the input into the space that is most useful for solving that task, and that space is your embedding.
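
As a minimal sketch (PyTorch, with made-up vocabulary size and dimensions), the embedding/projection layer sits in front of a prediction head, and its weights are what you keep after training:

import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 16  # assumed sizes, for illustration only

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),  # the "projection layer"
    nn.Linear(embed_dim, vocab_size),     # prediction head for the training task
)

# One toy training step: predict a target word from a context word.
context = torch.tensor([42])  # hypothetical context-word index
target = torch.tensor([7])    # hypothetical target-word index

logits = model(context)       # shape: (1, vocab_size)
loss = nn.functional.cross_entropy(logits, target)
loss.backward()               # gradients flow back into the embedding weights

# After training, model[0].weight is the learned embedding matrix.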

5
Q

What is the projection layer?

A

It’s the first hidden layer in a neural network, because it’s taking the input values and projecting them onto an embedding space.

6
Q

What is an example of a task you would use to train a neural network to create an embedding?

A

Predicting words that frequently appear in the same context – for example, given a word, predict the words that appear near it (as in word2vec).

7
Q

How is embedding related to generalization?

A

The embedding space captures similar concepts, relationships, and differences between examples. This is what allows the model to generalize outside of the training set, because it can reason about new examples based on what they are conceptually, and what their relationships to known examples are.

8
Q

What, specifically, is a language model?

A

It’s a model that predicts the probability of a token, or sequence, occurring within a longer sequence.

9
Q

What is a representation model?

A

It’s a model that learns useful vector representations of text.

10
Q

How do “language model” and “representation model”, as concepts, relate to transformers?

A

A representation model is for encoding – so the encoder part is one. A language model is for generating – so the decoder part is one.

11
Q

What is an N-gram? What is a bi-gram?

A

A sequence of N words. For a bigram, N=2.
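
For illustration, a minimal Python sketch of extracting N-grams from a sentence (here N=2, i.e., bigrams):

def ngrams(words, n):
    # Return every length-n run of consecutive words.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the cat sat on the mat".split()
print(ngrams(sentence, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]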

12
Q

What does the output of a language model look like? How does the application use it?

A

A probability distribution over possible words or phrases – for example, 30% cat, 70% tiger. The application usually picks probabilistically from the words above a certain probability threshold.
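
A minimal NumPy sketch of that selection step (probabilities roughly follow the card's 30%/70% example; the cutoff threshold is an assumed value):

import numpy as np

rng = np.random.default_rng(0)
words = np.array(["cat", "tiger", "sofa"])
probs = np.array([0.30, 0.69, 0.01])  # distribution produced by the language model

# Drop words below a threshold, renormalize, then pick probabilistically.
threshold = 0.05
keep = probs >= threshold
kept_probs = probs[keep] / probs[keep].sum()
print(rng.choice(words[keep], p=kept_probs))  # "cat" ~30% of the time, "tiger" ~70%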

13
Q

How are N-grams related to context? What is the tradeoff?

A

The bigger N is, the more context you have. However, this reduces the number of times in your dataset you would see each N-gram.

14
Q

What is the transformer?

A

It is the architecture used for LLM applications. It is composed of an encoder and a decoder, and relies on attention for context.

15
Q

What is an encoder?

A

The encoder processes input text into some intermediate representation.

16
Q

What is a decoder?

A

The decoder converts an intermediate representation into output text.

17
Q

When would you want to use an encoder-only architecture?

A

If you only want the embedding.

18
Q

When would you want to use a decoder-only architecture?

A

If you only care about generating new tokens.

19
Q

How do transformers solve the context problem?

A

They use a self-attention layer.

20
Q

What does BERT stand for?

A

Bidirectional Encoder Representations from Transformers.

21
Q

What is the name of the really famous transformers paper?

A

“Attention Is All You Need.”

22
Q

BERT stands for Bidirectional Encoder Representations from Transformers. What does “bidirectional” mean?

A

It means the self-attention looks at both the preceding and the following tokens.

23
Q

Is BERT a representation model or a language model?

A

It’s a representation model, since its purpose is to give good encodings of words.

24
Q

How do you avoid “cheating” with a bidirectional self-attention layer?

A

You train using masked language modeling.

25
Q

What is masked language modeling?

A

It’s a form of self-supervised learning (the labels come from the text itself) where you take complete sequences of words and then “mask” (or erase) certain tokens to create training examples – the model then tries to predict the missing tokens.
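
A minimal Python sketch of turning a complete sentence into a masked training example (the mask rate and [MASK] token are illustrative, not BERT's exact recipe):

import random

random.seed(0)
tokens = "the quick brown fox jumps over the lazy dog".split()

mask_rate = 0.15              # fraction of tokens to hide (illustrative)
inputs, labels = [], []
for tok in tokens:
    if random.random() < mask_rate:
        inputs.append("[MASK]")   # what the model sees
        labels.append(tok)        # what the model must predict
    else:
        inputs.append(tok)
        labels.append(None)       # no prediction needed at this position

print(inputs)
print(labels)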

26
Q

What does the “foundational LLM” refer to?

A

A foundational LLM is a general-purpose base model that has not yet been adapted to your application’s specific needs.

27
Q

What is fine-tuning?

A

It’s when you take your trained LLM and do follow-up training on examples specific to your task.

28
Q

What is the cost of fine-tuning in terms of training data?

A

Usually small: you only need a few thousand examples.

29
Q

What is the cost of fine-tuning in terms of performance?

A

It is quite expensive to fine-tune because you have to do backpropagation on the full set of parameters, which could be in the billions.

30
Q

What is parameter-efficient tuning?

A

It’s when you adjust only a small subset of the model’s parameters (or a small set of newly added parameters) during fine-tuning, leaving the rest frozen.
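
One simple form of this is freezing most of the network and training only a small piece; a minimal PyTorch sketch with a toy stand-in model (real methods like LoRA or adapters are more involved):

import torch.nn as nn

# Toy stand-in for a large pretrained model.
model = nn.Sequential(
    nn.Linear(128, 128),  # pretend these are the big pretrained layers
    nn.ReLU(),
    nn.Linear(128, 10),   # small task-specific head
)

# Freeze everything, then unfreeze only the head.
for p in model.parameters():
    p.requires_grad = False
for p in model[2].parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
# An optimizer would then be built over `trainable` only.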

31
Q

What is distillation?

A

It’s when you create a smaller version of an LLM.

32
Q

Why would you want to do distillation?

A

Because the smaller “distilled” model will be faster and more efficient.

33
Q

What is the most common way to do distillation?

A

Using a teacher-student relationship where the student model is the distilled model.
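
A minimal PyTorch sketch of the teacher–student objective: the student is trained so its (temperature-softened) output distribution matches the teacher’s; the temperature value here is illustrative:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between the softened student and teacher distributions.
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T

# Toy logits over a 5-token vocabulary for a batch of 2 examples.
teacher_logits = torch.randn(2, 5)
student_logits = torch.randn(2, 5, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student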

34
Q

What is the difference between one-shot and few-shot?

A

These terms are used in the context of prompt engineering, where you provide examples to the LLM in the prompt to teach it what kind of response you want. One-shot means you give only one example. Few-shot means you give a handful of examples (more than one, but still only a few).

35
Q

What is the difference between online and offline inference?

A

Online means you have the LLM respond at serving time. Offline means you’re caching predictions made in bulk in advance.

36
Q

What is a “sparse encoding”?

A

It’s when we try to regularize an encoding layer by encouraging the embedding to have nonzero values in only a few dimensions, and 0 in the rest.

37
Q

What are “basis functions”?

A

They are the functions that are applied to the input data to get its embedding.

38
Q

How do basis functions relate to sparse encoding?

A

Sparse encoding would be trying to reduce the embedding to a weighted combination of just a few of the basis functions, and regularize out the rest.

39
Q

What is an autoencoder?

A

An autoencoder is an architecture focused on finding an embedding. It has an encoding layer and a decoding layer.

40
Q

What is a “latent space”?

A

It’s a lower-dimensional representation of your data that still captures its essential meaning and patterns.

41
Q

How are autoencoders and transformers different?

A

Autoencoders are designed specifically to find an embedding for individual examples, and don’t try to model relationships between the elements of a sequence. That is because their purpose is to find a low-dimensional latent space that can be used to compress data. Transformers are designed for sequential data, which is why they include the attention mechanism that autoencoders lack.

42
Q

What does VAE stand for?

A

Variational AutoEncoder.

43
Q

What is a variational autoencoder?

A

It’s an autoencoder whose encoder learns a probability distribution over the latent space (e.g., a mean and a variance) rather than mapping each input to one exact point.

44
Q

Why would you want to use a variational autoencoder?

A

Because you can use it for data generation. Take your original examples, encode them, and then sample from the resulting distribution to get new examples with a similar encoding – which will be similar to your original unencoded examples, if the embedding is good.

45
Q

How do you train an autoencoder?

A

You take the input, reduce it to the latent space in the encoder, and then decode it from that latent space. So basically, take 100 dimensions down to 2, then expand back to 100. The loss function compares the output, which is a reconstruction of the input, to the real input. The better the latent space is at giving the decoder the information it needs, the lower the loss will be.
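
A minimal PyTorch sketch of one such training step, using the card’s 100-dimensions-down-to-2 example and mean-squared error as the reconstruction loss:

import torch
import torch.nn as nn

encoder = nn.Linear(100, 2)   # 100 dims -> 2-dim latent space
decoder = nn.Linear(2, 100)   # 2-dim latent space -> back to 100 dims
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

x = torch.randn(32, 100)            # a toy batch of inputs
latent = encoder(x)                 # compress
reconstruction = decoder(latent)    # expand back
loss = nn.functional.mse_loss(reconstruction, x)  # compare to the real input

optimizer.zero_grad()
loss.backward()
optimizer.step()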

46
Q

How are autoencoders related to EBR?

A

You use an autoencoder to map points to the latent space, and then search by similarity.

47
Q

What is a denoising autoencoder and why would you want to use it?

A

A denoising autoencoder adds noise to the input data during training, and then compares the output to the non-noisy data. So it learns to deal with noise. You’d use this when real-world data is noisy.

48
Q

What is Softmax?

A

It’s a function that turns a set of numbers into positive values that add up to 1, by exponentiating each number and dividing by the sum of the exponentials.

49
Q

Why would you want to use Softmax, in general?

A

You would use it when each value in the set is supposed to represent a probability. In that case you need all the probabilities to add up to 1.

50
Q

Why would you use Softmax, in deep learning?

A

You would use it in the output layer, when you are outputting a probability, such as predicting what word will come next.

51
Q

What is the temperature parameter in Softmax and what happens when you change it?

A

The temperature parameter controls how flat the distribution is – higher T means more chances are given to less-popular options, making the probability distribution flatter and thus less confident and more random.
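
A minimal NumPy sketch of softmax with a temperature parameter, showing the distribution getting flatter as T grows:

import numpy as np

def softmax(logits, temperature=1.0):
    scaled = np.asarray(logits) / temperature
    exps = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exps / exps.sum()

logits = [2.0, 1.0, 0.1]
print(softmax(logits, temperature=0.5))  # peaked: the top option dominates
print(softmax(logits, temperature=1.0))
print(softmax(logits, temperature=5.0))  # flat: probabilities nearly equal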

52
Q

What is an embedding matrix?

A

It’s the matrix you get when you learn an embedding for your whole vocabulary: one row per vocabulary word, where each row is that word’s embedding vector. Looking up a word’s embedding is just selecting its row.

53
Q

What does it mean to say that you can do math on embeddings?

A

It means that since each direction in the embedding space represents a concept, you can add that concept to a word to get a new word, where you find the concept as the difference between two other words. Like “sushi” + (“germany” - “japan”) = “bratwurst” (swapping the Japan concept for the Germany concept) or “cat” + (“dogs” - “dog”) = “cats” (adding the plurality concept).
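
A minimal NumPy sketch of the idea, using made-up 2-D vectors chosen so the analogy works (real embeddings are learned and much higher-dimensional):

import numpy as np

emb = {
    "japan":     np.array([1.0, 0.0]),
    "germany":   np.array([2.0, 0.0]),
    "sushi":     np.array([1.0, 1.0]),
    "bratwurst": np.array([2.0, 1.0]),
}

# Find the "switch Japan for Germany" concept as a difference, add it to "sushi".
query = emb["sushi"] + (emb["germany"] - emb["japan"])

# Nearest vocabulary vector by dot product (excluding the starting word).
candidates = {w: v for w, v in emb.items() if w != "sushi"}
print(max(candidates, key=lambda w: candidates[w] @ query))  # "bratwurst"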

54
Q

What do you use to measure vector similarity?

A

Dot product.

55
Q

What is an attention head trying to do?

A

It is trying to let each word ask some question of each other word, and then conceptually update itself based on that other word if the answer is “yes.”

56
Q

What are the three parts of an attention head and what do they do?

A

Query matrix – lets each word represent itself as some conceptual “question”; Key matrix – lets each word represent itself as some conceptual “answer”; Value matrix – lets each word represent itself as some conceptual “update” to apply to another word.

57
Q

What is the query vector in attention?

A

It’s a representation of a question that an embedded vector asks – like “are there any adjectives in front of me?”

58
Q

What is the query matrix in attention?

A

It’s a matrix you multiply your embedded vector by to get a question specific to that (embedded) word/phrase/concept.

59
Q

What is the key vector, in attention?

A

It’s a representation of an answer that an embedded vector provides for a particular question, like “I am an adjective.”

60
Q

What is the key matrix in attention?

A

It’s a matrix you multiply your embedded vector by to get an answer specific to that (embedded) word/phrase/concept.

61
Q

What is the value vector in attention?

A

For some embedded vector, it’s the “update” that that embedded vector provides, based on the question being asked. Like if the question is about adjectives, it might be “here is my adjective concept, you should add this to yourself.”

62
Q

How do you know if you should apply an update in attention?

A

You take the dot product of the question (query for vector A) and the answer (key for vector B) and if they’re similar, you apply B’s update (value for vector B) to A.
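
A minimal NumPy sketch of this query/key/value scheme for a tiny sequence, with random matrices standing in for the learned ones:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8   # toy sizes

X = rng.normal(size=(seq_len, d_model))    # embedded input vectors, one row per word
W_q = rng.normal(size=(d_model, d_head))   # query matrix (learned in practice)
W_k = rng.normal(size=(d_model, d_head))   # key matrix
W_v = rng.normal(size=(d_model, d_head))   # value matrix

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# "Question" meets "answer": dot products between every query and every key.
scores = Q @ K.T / np.sqrt(d_head)

# Softmax turns the scores into weights: how strongly each word's update applies.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each word receives a weighted sum of the other words' value ("update") vectors.
updates = weights @ V
print(updates.shape)  # (4, 8)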

63
Q

Explain what an “attention head” is and what “multi-headed attention” means

A

An attention head is the set of attention operations applied on one set of query/key/value matrices, to let all the words ask each other one question. “Multi-headed attention” means you actually have a bunch of different attention heads, so you’re capturing a bunch of different concepts with this question/answer scheme.

64
Q

Distinguish between “many attention layers” and “multi-headed attention”

A

An attention layer is one set of multi-headed attention operations – like 100 different attention heads running in parallel. Because they’re running in parallel, all the questions are independent of each other. If you want concepts to be related and learn from each other, you want multiple attention layers sequentially, which is of course more expensive.

65
Q

Distinguish between “encoder” and “projector”

A

Projector means a layer that maps data from one space to another. Encoder means mapping data to a different representation, which captures structures and patterns. An encoder is technically a class of projector, but it’s so specific and unique that we delineate it with its own term.