Language Models Flashcards

1
Q

What is an embedding?

A

It is a low-dimensional representation of sparse data.

2
Q

For what kind of data is an embedding specifically useful? Example?

A

For sparse data – such as a one-hot encoding.
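
For illustration, a minimal NumPy sketch (toy 5-word vocabulary and a random, untrained embedding matrix) of how a sparse one-hot vector relates to a dense embedding lookup:

import numpy as np

vocab = ["cat", "dog", "sushi", "bratwurst", "tiger"]  # toy vocabulary
embedding_dim = 3
rng = np.random.default_rng(0)

# In a real model this matrix is learned; here it is random for illustration.
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

word = "dog"
index = vocab.index(word)

# Sparse representation: a one-hot vector, mostly zeros.
one_hot = np.zeros(len(vocab))
one_hot[index] = 1.0

# Dense representation: multiplying the one-hot vector by the embedding
# matrix is the same as just looking up that word's row.
dense = one_hot @ embedding_matrix
assert np.allclose(dense, embedding_matrix[index])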

3
Q

What is word2vec?

A

A technique (and the resulting family of models) for learning word embeddings from large text corpora.

4
Q

How do you create an embedding?

A

You train a neural network on a prediction task based on similarity (for example, predicting a word from the words that appear around it). The first hidden layer – the neurons the input values are mapped to – is called the “projection layer”; it projects the input into the space that is most useful for solving that task, and that space is your embedding.
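
As a minimal sketch (PyTorch, with made-up vocabulary size and dimensions), the embedding/projection layer sits in front of a prediction head, and its weights are what you keep after training:

import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 16  # assumed sizes, for illustration only

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),  # the "projection layer"
    nn.Linear(embed_dim, vocab_size),     # prediction head for the training task
)

# One toy training step: predict a target word from a context word.
context = torch.tensor([42])  # hypothetical context-word index
target = torch.tensor([7])    # hypothetical target-word index

logits = model(context)       # shape: (1, vocab_size)
loss = nn.functional.cross_entropy(logits, target)
loss.backward()               # gradients flow back into the embedding weights

# After training, model[0].weight is the learned embedding matrix.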

5
Q

What is the projection layer?

A

It’s the first hidden layer in a neural network, because it’s taking the input values and projecting them onto an embedding space.

6
Q

What is an example of a task you would use to train a neural network to create an embedding?

A

Predicting words that frequently appear in the same context – for example, given a word, predict the words that appear near it (as in word2vec).

7
Q

How is embedding related to generalization?

A

The embedding space captures similar concepts, relationships, and differences between examples. This is what allows the model to generalize outside of the training set, because it can reason about new examples based on what they are conceptually, and what their relationships to known examples are.

8
Q

What, specifically, is a language model?

A

It’s a model that predicts the probability of a token, or sequence, occurring within a longer sequence.

9
Q

What is a representation model?

A

It’s a model that learns useful vector representations of text.

10
Q

How do “language model” and “representation model”, as concepts, relate to transformers?

A

A representation model is for encoding – so the encoder part is one. A language model is for generating – so the decoder part is one.

11
Q

What is an N-gram? What is a bi-gram?

A

A sequence of N words. For a bigram, N=2.
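
For illustration, a minimal Python sketch of extracting N-grams from a sentence (here N=2, i.e., bigrams):

def ngrams(words, n):
    # Return every length-n run of consecutive words.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the cat sat on the mat".split()
print(ngrams(sentence, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]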

12
Q

What does the output of a language model look like? How does the application use it?

A

A probability distribution over possible words or phrases – for example, 30% cat, 70% tiger. The application usually picks probabilistically from the words above a certain probability threshold.
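
A minimal NumPy sketch of that selection step (probabilities roughly follow the card's 30%/70% example; the cutoff threshold is an assumed value):

import numpy as np

rng = np.random.default_rng(0)
words = np.array(["cat", "tiger", "sofa"])
probs = np.array([0.30, 0.69, 0.01])  # distribution produced by the language model

# Drop words below a threshold, renormalize, then pick probabilistically.
threshold = 0.05
keep = probs >= threshold
kept_probs = probs[keep] / probs[keep].sum()
print(rng.choice(words[keep], p=kept_probs))  # "cat" ~30% of the time, "tiger" ~70%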

13
Q

How are N-grams related to context? What is the tradeoff?

A

The bigger N is, the more context you have. However, this reduces the number of times in your dataset you would see each N-gram.

14
Q

What is the transformer?

A

It is the architecture used for LLM applications. It is composed of an encoder and a decoder, and relies on attention for context.

15
Q

What is an encoder?

A

The encoder processes input text into some intermediate representation.

16
Q

What is a decoder?

A

The decoder converts an intermediate representation into output text.

17
Q

When would you want to use an encoder-only architecture?

A

If you only want the embedding.

18
Q

When would you want to use a decoder-only architecture?

A

If you only care about generating new tokens.

19
Q

How do transformers solve the context problem?

A

They use a self-attention layer.

20
Q

What does BERT stand for?

A

Bidirectional Encoder Representations from Transformers.

21
Q

What is the name of the really famous transformers paper?

A

“Attention Is All You Need.”

22
Q

BERT stands for Bidirectional Encoder Representations from Transformers. What does “bidirectional” mean?

A

It means the self-attention looks at both the preceding and the following tokens.

23
Q

Is BERT a representation model or a language model?

A

It’s a representation model, since its purpose is to give good encodings of words.

24
Q

How do you avoid “cheating” with a bidirectional self-attention layer?

A

You train using masked language modeling.

25
Q

What is masked language modeling?

A

It’s a form of self-supervised learning (the labels come from the text itself) where you take complete sequences of words and then “mask” (or erase) certain tokens to create training examples – the model then tries to predict the missing tokens.
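
A minimal Python sketch of turning a complete sentence into a masked training example (the mask rate and [MASK] token are illustrative, not BERT's exact recipe):

import random

random.seed(0)
tokens = "the quick brown fox jumps over the lazy dog".split()

mask_rate = 0.15              # fraction of tokens to hide (illustrative)
inputs, labels = [], []
for tok in tokens:
    if random.random() < mask_rate:
        inputs.append("[MASK]")   # what the model sees
        labels.append(tok)        # what the model must predict
    else:
        inputs.append(tok)
        labels.append(None)       # no prediction needed at this position

print(inputs)
print(labels)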

26
Q

What does the “foundational LLM” refer to?

A

A foundational LLM is a general-purpose base model that has not yet been adapted to your application’s specific needs.

27
Q

What is fine-tuning?

A

It’s when you take your trained LLM and do follow-up training on examples specific to your task.

28
Q

What is the cost of fine-tuning in terms of training data?

A

Usually small: you only need a few thousand examples.

29
Q

What is the cost of fine-tuning in terms of performance?

A

It is quite expensive to fine-tune because you have to do backpropagation on the full set of parameters, which could be in the billions.

30
Q

What is parameter-efficient tuning?

A

It’s when you adjust only a small subset of the model’s parameters (or a small set of newly added parameters) during fine-tuning, leaving the rest frozen.
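
One simple form of this is freezing most of the network and training only a small piece; a minimal PyTorch sketch with a toy stand-in model (real methods like LoRA or adapters are more involved):

import torch.nn as nn

# Toy stand-in for a large pretrained model.
model = nn.Sequential(
    nn.Linear(128, 128),  # pretend these are the big pretrained layers
    nn.ReLU(),
    nn.Linear(128, 10),   # small task-specific head
)

# Freeze everything, then unfreeze only the head.
for p in model.parameters():
    p.requires_grad = False
for p in model[2].parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
# An optimizer would then be built over `trainable` only.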

31
Q

What is distillation?

A

It’s when you create a smaller version of an LLM.

32
Q

Why would you want to do distillation?

A

Because the smaller “distilled” model will be faster and more efficient.

33
Q

What is the most common way to do distillation?

A

Using a teacher-student relationship where the student model is the distilled model.
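
A minimal PyTorch sketch of the teacher–student objective: the student is trained so its (temperature-softened) output distribution matches the teacher’s; the temperature value here is illustrative:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between the softened student and teacher distributions.
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T

# Toy logits over a 5-token vocabulary for a batch of 2 examples.
teacher_logits = torch.randn(2, 5)
student_logits = torch.randn(2, 5, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student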

34
Q

What is the difference between one-shot and few-shot?

A

These terms are used in the context of prompt engineering, where you provide examples to the LLM in the prompt to teach it what kind of response you want. One-shot means you give only one example. Few-shot means you give a handful of examples (more than one, but still only a few).

35
Q

What is the difference between online and offline inference?

A

Online means you have the LLM respond at serving time. Offline means you’re caching predictions made in bulk in advance.

36
Q

What is a “sparse encoding”?

A

It’s when we try to regularize an encoding layer by encouraging the embedding to have nonzero values in only a few dimensions, and 0 in the rest.

37
Q

What are “basis functions”?

A

They are the functions that are applied to the input data to get its embedding.

38
Q

How do basis functions relate to sparse encoding?

A

Sparse encoding would be trying to reduce the embedding to a weighted combination of just a few of the basis functions, and regularize out the rest.

39
Q

What is an autoencoder?

A

An autoencoder is an architecture focused on finding an embedding. It has an encoding layer and a decoding layer.

40
Q

What is a “latent space”?

A

It’s a lower-dimensional representation of your data that still captures its essential meaning and patterns.

41
Q

How are autoencoders and transformers different?

A

Autoencoders are designed specifically to find an embedding for individual examples, and don’t try to model relationships between the elements of a sequence. That is because their purpose is to find a low-dimensional latent space that can be used to compress data. Transformers are designed for sequential data, which is why they include the attention mechanism that autoencoders lack.

42
Q

What does VAE stand for?

A

Variational AutoEncoder.

43
Q

What is a variational autoencoder?

A

It’s an autoencoder whose encoder learns a probability distribution over the latent space (e.g., a mean and a variance) rather than mapping each input to one exact point.

44
Q

Why would you want to use a variational autoencoder?

A

Because you can use it for data generation. Take your original examples, encode them, and then sample from the resulting distribution to get new examples with a similar encoding – which will be similar to your original unencoded examples, if the embedding is good.

45
Q

How do you train an autoencoder?

A

You take the input, reduce it to the latent space in the encoder, and then decode it from that latent space. So basically, take 100 dimensions down to 2, then expand back to 100. The loss function compares the output, which is a reconstruction of the input, to the real input. The better the latent space is at giving the decoder the information it needs, the lower the loss will be.
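
A minimal PyTorch sketch of one such training step, using the card’s 100-dimensions-down-to-2 example and mean-squared error as the reconstruction loss:

import torch
import torch.nn as nn

encoder = nn.Linear(100, 2)   # 100 dims -> 2-dim latent space
decoder = nn.Linear(2, 100)   # 2-dim latent space -> back to 100 dims
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

x = torch.randn(32, 100)            # a toy batch of inputs
latent = encoder(x)                 # compress
reconstruction = decoder(latent)    # expand back
loss = nn.functional.mse_loss(reconstruction, x)  # compare to the real input

optimizer.zero_grad()
loss.backward()
optimizer.step()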

46
Q

How are autoencoders related to EBR?

A

You use an autoencoder to map points to the latent space, and then search by similarity.

47
Q

What is a denoising autoencoder and why would you want to use it?

A

A denoising autoencoder adds noise to the input data during training, and then compares the output to the non-noisy data. So it learns to deal with noise. You’d use this when real-world data is noisy.

48
Q

What is Softmax?

A

It’s a function that turns a set of numbers into positive values that add up to 1, by exponentiating each number and dividing by the sum of the exponentials.

49
Q

Why would you want to use Softmax, in general?

A

You would use it when each value in the set is supposed to represent a probability. In that case you need all the probabilities to add up to 1.

50
Q

Why would you use Softmax, in deep learning?

A

You would use it in the output layer, when you are outputting a probability, such as predicting what word will come next.

51
Q

What is the temperature parameter in Softmax and what happens when you change it?

A

The temperature parameter controls how flat the distribution is – higher T means more chances are given to less-popular options, making the probability distribution flatter and thus less confident and more random.
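
A minimal NumPy sketch of softmax with a temperature parameter, showing the distribution getting flatter as T grows:

import numpy as np

def softmax(logits, temperature=1.0):
    scaled = np.asarray(logits) / temperature
    exps = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exps / exps.sum()

logits = [2.0, 1.0, 0.1]
print(softmax(logits, temperature=0.5))  # peaked: the top option dominates
print(softmax(logits, temperature=1.0))
print(softmax(logits, temperature=5.0))  # flat: probabilities nearly equal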

52
Q

What is an embedding matrix?

A

It’s the matrix you get when you learn an embedding for your whole vocabulary: one row per vocabulary word, where each row is that word’s embedding vector. Looking up a word’s embedding is just selecting its row.

53
Q

What does it mean to say that you can do math on embeddings?

A

It means that since each direction in the embedding space represents a concept, you can add that concept to a word to get a new word, where you find the concept as the difference between two other words. Like “sushi” + (“germany” - “japan”) = “bratwurst” (swapping the Japan concept for the Germany concept) or “cat” + (“dogs” - “dog”) = “cats” (adding the plurality concept).
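
A minimal NumPy sketch of the idea, using made-up 2-D vectors chosen so the analogy works (real embeddings are learned and much higher-dimensional):

import numpy as np

emb = {
    "japan":     np.array([1.0, 0.0]),
    "germany":   np.array([2.0, 0.0]),
    "sushi":     np.array([1.0, 1.0]),
    "bratwurst": np.array([2.0, 1.0]),
}

# Find the "switch Japan for Germany" concept as a difference, add it to "sushi".
query = emb["sushi"] + (emb["germany"] - emb["japan"])

# Nearest vocabulary vector by dot product (excluding the starting word).
candidates = {w: v for w, v in emb.items() if w != "sushi"}
print(max(candidates, key=lambda w: candidates[w] @ query))  # "bratwurst"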

54
Q

What do you use to measure vector similarity?

A

Dot product.

55
Q

What is an attention head trying to do?

A

It is trying to let each word ask some question of each other word, and then conceptually update itself based on that other word if the answer is “yes.”

56
Q

What are the three parts of an attention head and what do they do?

A

Query matrix – lets each word represent itself as some conceptual “question”; Key matrix – lets each word represent itself as some conceptual “answer”; Value matrix – lets each word represent itself as some conceptual “update” to apply to another word.

57
Q

What is the query vector in attention?

A

It’s a representation of a question that an embedded vector asks – like “are there any adjectives in front of me?”

58
Q

What is the query matrix in attention?

A

It’s a matrix you multiply your embedded vector by to get a question specific to that (embedded) word/phrase/concept.

59
Q

What is the key vector, in attention?

A

It’s a representation of an answer that an embedded vector provides for a particular question, like “I am an adjective.”

60
Q

What is the key matrix in attention?

A

It’s a matrix you multiply your embedded vector by to get an answer specific to that (embedded) word/phrase/concept.

61
Q

What is the value vector in attention?

A

For some embedded vector, it’s the “update” that that embedded vector provides, based on the question being asked. Like if the question is about adjectives, it might be “here is my adjective concept, you should add this to yourself.”

62
Q

How do you know if you should apply an update in attention?

A

You take the dot product of the question (query for vector A) and the answer (key for vector B) and if they’re similar, you apply B’s update (value for vector B) to A.
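
A minimal NumPy sketch of this query/key/value scheme for a tiny sequence, with random matrices standing in for the learned ones:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8   # toy sizes

X = rng.normal(size=(seq_len, d_model))    # embedded input vectors, one row per word
W_q = rng.normal(size=(d_model, d_head))   # query matrix (learned in practice)
W_k = rng.normal(size=(d_model, d_head))   # key matrix
W_v = rng.normal(size=(d_model, d_head))   # value matrix

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# "Question" meets "answer": dot products between every query and every key.
scores = Q @ K.T / np.sqrt(d_head)

# Softmax turns the scores into weights: how strongly each word's update applies.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each word receives a weighted sum of the other words' value ("update") vectors.
updates = weights @ V
print(updates.shape)  # (4, 8)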

63
Q

Explain what an “attention head” is and what “multi-headed attention” means

A

An attention head is the set of attention operations applied on one set of query/key/value matrices, to let all the words ask each other one question. “Multi-headed attention” means you actually have a bunch of different attention heads, so you’re capturing a bunch of different concepts with this question/answer scheme.

64
Q

Distinguish between “many attention layers” and “multi-headed attention”

A

An attention layer is one set of multi-headed attention operations – like 100 different attention heads running in parallel. Because they’re running in parallel, all the questions are independent of each other. If you want concepts to be related and learn from each other, you want multiple attention layers sequentially, which is of course more expensive.

65
Q

Distinguish between “encoder” and “projector”

A

Projector means a layer that maps data from one space to another. Encoder means mapping data to a different representation, which captures structures and patterns. An encoder is technically a class of projector, but it’s so specific and unique that we delineate it with its own term.