Jay Alammar Flashcards
How do we transform words into things that the Blocks can work with?
Words are tokenized into integer token IDs
Each token ID is then mapped to its embedding (a large vector looked up by that ID)
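A minimal sketch with a toy word-level vocabulary (real models use a learned subword tokenizer such as BPE, and the IDs index into a learned embedding matrix):

```python
# Toy tokenizer: map words to integer token IDs.
# Real models use a learned subword tokenizer (e.g. BPE), not a word-level dict.
toy_vocab = {"the": 0, "chicken": 1, "crossed": 2, "road": 3}

def tokenize(text):
    return [toy_vocab[word] for word in text.lower().split()]

print(tokenize("The chicken crossed the road"))  # [0, 1, 2, 0, 3]
```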
What are the two components of an Encoder/Decoder block?
Self-attention
Feed Forward Neural Network
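A rough, runnable sketch of the block structure (assumed simplifications: single head, no layer norm, random weights):

```python
import numpy as np

d_model, seq_len = 8, 4
x = np.random.randn(seq_len, d_model)          # token embeddings entering the block

def self_attention(x):
    # Simplified: each token scores every other token, then mixes them.
    scores = x @ x.T / np.sqrt(d_model)
    w = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return w @ x

def feed_forward(x):
    # Position-wise two-layer MLP (weights would be learned in a real model).
    W1, W2 = np.random.randn(d_model, 32), np.random.randn(32, d_model)
    return np.maximum(0, x @ W1) @ W2

out = x + self_attention(x)                    # component 1 + residual connection
out = out + feed_forward(out)                  # component 2 + residual connection
```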
What’s the name for the scores given to each token in the vocabulary?
Logits
(these are then soft-maxed in the final layer, where they sum up to 1)
We can pick the highest one, or sample from the distribution to allow a wider choice of next tokens
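A small sketch of both options, assuming toy logits:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, -1.0])                   # one score per vocab token
probs = np.exp(logits) / np.exp(logits).sum()               # softmax: sums to 1

greedy_pick = int(np.argmax(probs))                          # always take the top token
sampled_pick = int(np.random.choice(len(probs), p=probs))    # sample for more variety
```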
Picking the highest-probability next word is known as …
Greedy sampling
In Self-Attention what are three matrices that are used to manipulate the inputs?
Query
Key
Value
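A minimal sketch of how the three matrices are used: each one projects the input vectors into queries, keys, and values (random weights here; they are learned in a real model):

```python
import numpy as np

d_model, seq_len = 8, 5
x = np.random.randn(seq_len, d_model)     # embeddings for 5 tokens

W_q = np.random.randn(d_model, d_model)   # Query projection
W_k = np.random.randn(d_model, d_model)   # Key projection
W_v = np.random.randn(d_model, d_model)   # Value projection

Q = x @ W_q   # queries: what each token is looking for
K = x @ W_k   # keys: what each token offers to be matched against
V = x @ W_v   # values: the content that gets mixed into the output
```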
What’s the difference in architecture between GPT and BERT?
GPT - built from a stack of decoder blocks
BERT - built from a stack of encoder blocks
What is auto-regression (in the context of LLMs)?
After each new token is generated, it is appended to the input sequence, which then becomes the input for generating the next token
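A sketch of the loop, assuming a hypothetical `next_token_id(token_ids)` helper that runs the model and returns the chosen next token ID:

```python
def generate(prompt_ids, steps, next_token_id):
    token_ids = list(prompt_ids)
    for _ in range(steps):
        new_id = next_token_id(token_ids)   # model sees everything generated so far
        token_ids.append(new_id)            # new token becomes part of the next input
    return token_ids
```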
What’s the difference between Encoder and Decoder stacks?
They both have Self-Attention and a Feed Forward Neural Network
In addition, the Decoder has an **Encoder-Decoder Attention** layer (cross-attention), and its self-attention is masked so each position can only attend to earlier positions
What is the length of the embedding vector in GPT-3?
The largest version uses 12,288 dimensions
When a token is sent to a Transformer (in GPT), how is it processed? (two steps)
- Embedding vector is looked up
- To this is added the position encoding vector
This is then passed on to the first Decoder block
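A toy sketch of the two steps (tiny sizes for the demo; GPT-3's largest embedding dimension is 12,288):

```python
import numpy as np

vocab_size, max_pos, d_model = 100, 16, 8                 # toy sizes
token_embeddings = np.random.randn(vocab_size, d_model)   # learned lookup table
position_encodings = np.random.randn(max_pos, d_model)    # one vector per position

token_id, position = 42, 0
vector = token_embeddings[token_id] + position_encodings[position]
# `vector` is what gets passed on to the first Decoder block.
```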
Describe what Self-Attention does (give an example)
Its purpose is to create a vector for the current token, modified to take its context into account
E.g. “the chicken crossed the road and then painted it”
When processing the current word “it”, each word in the context gets a self-attention score; here “the” and “road” will score highly. The token vectors are weighted by these scores and summed, producing a vector which is then passed on to the FFNN (Feed Forward Neural Network)
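A toy numeric sketch of the scoring and weighted sum (plain dot products between random vectors stand in for the real query/key/value projections):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d = 8
words = ["the", "chicken", "crossed", "road", "and", "then", "painted", "it"]
vectors = {w: np.random.randn(d) for w in words}    # stand-ins for token vectors

current = vectors["it"]                              # the token being processed
scores = np.array([vectors[w] @ current for w in words])
weights = softmax(scores / np.sqrt(d))               # one score per context word

# Weighted sum: the contextualized vector for "it", passed on to the FFNN.
output = sum(w * vectors[word] for w, word in zip(weights, words))
```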
What is the list of scores for each token in the vocabulary, output from the model, called?
Logits
How can we consider only the top 5 most probable words?
Set top_k to 5
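A sketch of top-k filtering on the logits (k = 5), assuming NumPy:

```python
import numpy as np

def top_k_probs(logits, k=5):
    # Keep only the k highest-scoring tokens, then renormalize.
    logits = np.asarray(logits, dtype=float)
    top = np.argsort(logits)[-k:]                 # indices of the top-k logits
    filtered = np.full_like(logits, -np.inf)
    filtered[top] = logits[top]
    probs = np.exp(filtered - filtered.max())
    return probs / probs.sum()

probs = top_k_probs(np.random.randn(50), k=5)     # only 5 tokens get non-zero probability
```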