Jay Alammar Flashcards

1
Q

How do we transform words into things that the Blocks can work with?

A

Words are tokenized (each token becomes an integer ID)
Each token ID is then converted into its embedding (a large vector looked up by that integer ID)
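Below is a minimal sketch of those two steps, using a made-up whitespace tokenizer and a random embedding table; every name and value here is hypothetical (real models use a learned subword tokenizer such as BPE and learned embedding weights).

```python
import numpy as np

# Hypothetical toy vocabulary; real tokenizers learn a subword vocabulary
vocab = {"the": 0, "chicken": 1, "crossed": 2, "road": 3}
embedding_dim = 8
embedding_table = np.random.randn(len(vocab), embedding_dim)  # random stand-in for learned weights

def tokenize(text):
    # Step 1: words -> integer token IDs
    return [vocab[word] for word in text.split()]

def embed(token_ids):
    # Step 2: each integer ID -> its embedding vector (a row lookup in the table)
    return embedding_table[token_ids]

ids = tokenize("the chicken crossed the road")
vectors = embed(ids)   # shape (5, 8): one embedding vector per token
```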

2
Q

What are the two components of an Encoder/Decoder block?

A

Self-attention
Feed Forward Neural Network

3
Q

What’s the name for the scores given to each token in the vocabulary?

A

Logits
(these are then passed through a softmax in the final layer, turning them into probabilities that sum to 1)
We can pick the highest one, or sample to choose from a wider selection of likely tokens
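A minimal sketch of this step, assuming we already have a vector of logits over a tiny hypothetical vocabulary: softmax turns them into probabilities, and we either take the argmax or sample.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0, -2.0])   # hypothetical scores for a 5-token vocabulary

# Softmax: exponentiate and normalise so the values sum to 1
probs = np.exp(logits - logits.max())
probs /= probs.sum()

greedy_choice = int(np.argmax(probs))                    # always the highest-probability token
sampled_choice = int(rng.choice(len(probs), p=probs))    # sampling lets lower-ranked tokens win sometimes
```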

4
Q

Picking the highest probability for the next word is known as …

A

Greedy sampling

5
Q

In Self-Attention, what are the three matrices used to manipulate the inputs?

A

Query
Key
Value
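A minimal single-head sketch of how those three matrices are used (scaled dot-product attention with random weights standing in for learned ones; not the full multi-head version).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    Q = x @ Wq   # Queries: what each token is looking for
    K = x @ Wk   # Keys: what each token offers to be matched against
    V = x @ Wv   # Values: the information each token actually passes along
    # Compare every Query with every Key, then use the scores to weight the Values
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V

d = 8
x = np.random.randn(4, d)                          # 4 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)                # shape (4, 8): one context-aware vector per token
```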

6
Q

What’s the difference in architecture between GPT and BERT?

A

GPT - a stack of decoder blocks
BERT - a stack of encoder blocks

7
Q

What is auto-regression (in the context of LLMs)?

A

After each new token is generated, it is appended to the input sequence, which becomes the prompt for generating the next token
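A minimal sketch of that loop; `model` here is a hypothetical stand-in for any function that maps a token-ID sequence to next-token logits.

```python
import numpy as np

def generate(model, prompt_ids, num_new_tokens):
    ids = list(prompt_ids)
    for _ in range(num_new_tokens):
        logits = model(ids)                 # scores over the vocabulary for the next position
        next_id = int(np.argmax(logits))    # greedy pick (could sample instead)
        ids.append(next_id)                 # the new token becomes part of the input next step
    return ids
```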

8
Q

What’s the difference between Encoder and Decoder stacks?

A

Both have Self-Attention and a Feed Forward Neural Network

In addition, the Decoder has an **Encoder-Decoder Attention** layer, which attends over the Encoder's output
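A minimal single-head sketch of that extra layer (random weights, hypothetical names): the Queries come from the decoder's tokens, while the Keys and Values come from the encoder's output, which is how the decoder attends over the source sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(decoder_x, encoder_out, Wq, Wk, Wv):
    Q = decoder_x @ Wq     # Queries from the decoder's own tokens
    K = encoder_out @ Wk   # Keys from the encoder output...
    V = encoder_out @ Wv   # ...and Values from the encoder output too
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V      # one encoder-informed vector per decoder token

d = 8
decoder_x, encoder_out = np.random.randn(3, d), np.random.randn(5, d)   # 3 target tokens, 5 source tokens
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = encoder_decoder_attention(decoder_x, encoder_out, Wq, Wk, Wv)     # shape (3, 8)
```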

9
Q

What is the length of the embedding vector in GPT-3?

A

The largest version uses embedding vectors of dimension 12,288

10
Q

When a token is sent to a Transformer (in GPT), how is it processed? (two steps)

A
  1. Embedding vector is looked up
  2. To this is added the position encoding vector

The result is then passed on to the first Decoder block
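A minimal sketch of those two steps, assuming GPT-style learned position embeddings stored in a second lookup table; the sizes are small toy values (not real model dimensions) and the weights are random stand-ins.

```python
import numpy as np

vocab_size, max_positions, d_model = 1000, 64, 16              # toy sizes for illustration only
token_embeddings = np.random.randn(vocab_size, d_model)        # learned in a real model
position_embeddings = np.random.randn(max_positions, d_model)  # learned in a real model

def prepare_input(token_ids):
    tok = token_embeddings[token_ids]                      # step 1: embedding lookup
    pos = position_embeddings[np.arange(len(token_ids))]   # step 2: add the position encoding per slot
    return tok + pos                                       # this sum goes to the first Decoder block

x = prepare_input([464, 901, 260])   # arbitrary example token IDs
```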

11
Q

Describe what Self-Attention does (give an example)

A

Its purpose is to create a vector based on the current token, but modified for context

E.g. “the chicken crossed the road and then painted it”

When processing the current word “it”, each word gets a self-attention score; “the” and “road” score highly (along with “it” itself). Each word’s vector is then weighted by its score, and the weighted sum produces a single vector that is passed on to the FFNN (Feed Forward Neural Network)

12
Q

What is the list of tokens and their scores, output from the model, called?

A

Logits

13
Q

How can we consider only the top 5 highest-probability words?

A

Set top_k to 5
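A minimal sketch of top-k filtering over a vector of logits (not tied to any particular library; libraries such as Hugging Face Transformers expose the same idea through a top_k argument when sampling).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_top_k(logits, top_k=5):
    top_ids = np.argsort(logits)[-top_k:]       # keep only the top_k highest-scoring tokens
    top_logits = logits[top_ids]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()                        # softmax over just the survivors
    return int(rng.choice(top_ids, p=probs))    # sample among the top 5

logits = rng.normal(size=100)                   # hypothetical scores over a 100-token vocabulary
next_token = sample_top_k(logits, top_k=5)
```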
