Language Models Flashcards
What is an embedding?
It is a relatively low-dimensional vector representation of high-dimensional data, learned so that similar inputs end up close together.
For what kind of data is an embedding specifically useful? Example?
For high-dimensional sparse data – such as a one-hot encoding of a large vocabulary.
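A minimal sketch of the contrast (plain Python; the vocabulary size, dimension, and word index are hypothetical, and random values stand in for trained weights):

```python
import random

vocab_size = 10_000   # hypothetical vocabulary size
embedding_dim = 64    # hypothetical embedding dimension
word_index = 1234     # position of some word in the vocabulary

# Sparse one-hot representation: 10,000 numbers, all but one of them zero.
one_hot = [0] * vocab_size
one_hot[word_index] = 1

# Dense embedding: a vocab_size x embedding_dim table of learned weights;
# random values stand in for trained ones here.
random.seed(0)
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

embedded = embedding_table[word_index]  # 64 numbers instead of 10,000
```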
What is word2vec?
A widely used technique (and the resulting model) for learning word embeddings from text – not limited to English.
How do you create an embedding?
You train a neural network on a prediction task that forces similar inputs to be treated similarly (for words, e.g. predicting surrounding context). The first hidden layer – the one the input values are mapped to – is called the “projection layer”: it projects the input into the space most useful for solving that task, and that space is the embedding.
What is the projection layer?
It’s the first hidden layer in a neural network, because it’s taking the input values and projecting them onto an embedding space.
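A sketch of why the first layer “projects”, assuming a one-hot input: multiplying a one-hot vector by the layer’s weight matrix simply selects one row of it, so that row is the projected (embedded) representation. The tiny matrix below is made up for illustration.

```python
# Hypothetical projection-layer weights: 5 vocabulary entries, 3 dimensions.
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2],
     [1.3, 1.4, 1.5]]

def project(x, W):
    # Ordinary matrix-vector product: the linear map the layer computes
    # (bias and activation omitted for clarity).
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

one_hot = [0, 0, 1, 0, 0]   # one-hot encoding of word #2
print(project(one_hot, W))  # -> [0.7, 0.8, 0.9], i.e. row 2 of W
```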
What is an example of a task you would use to train a neural network to create an embedding?
Predicting words that frequently appear in the same context – e.g. predicting a word’s neighbors from the word (skip-gram) or the word from its neighbors (CBOW).
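A word2vec-style sketch of where such training examples come from: (target, context) pairs extracted with a sliding window over a toy sentence.

```python
text = "the cat sat on the mat".split()
window = 2  # how many neighbors on each side count as "context"

# Build (target, context) training pairs from a sliding window.
pairs = []
for i, target in enumerate(text):
    for j in range(max(0, i - window), min(len(text), i + window + 1)):
        if j != i:
            pairs.append((target, text[j]))

print(pairs[:3])  # the first few pairs, starting with targets "the" and "cat"
```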
How is embedding related to generalization?
The embedding space captures similar concepts, relationships, and differences between examples. This is what allows the model to generalize outside of the training set, because it can reason about new examples based on what they are conceptually, and what their relationships to known examples are.
What, specifically, is a language model?
It’s a model that predicts the probability of a token, or sequence, occurring within a longer sequence.
What is a representation model?
It’s a model that learns useful vector representations of text.
How do “language model” and “representation model”, as concepts, relate to transformers?
A representation model is for encoding – so the encoder part is one. A language model is for generating – so the decoder part is one.
What is an N-gram? What is a bi-gram?
A sequence of N words. For a bigram, N=2.
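A minimal bigram language model (a sketch on a toy corpus): estimate P(word | previous word) from raw counts.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

# Count each (previous, next) pair and each context word.
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def p_next(word, context):
    # Maximum-likelihood estimate: count(context, word) / count(context).
    return bigram_counts[(context, word)] / context_counts[context]

print(p_next("cat", "the"))  # "the" occurs 3x as a context, followed by "cat" 2x
```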
What does the output of a language model look like? How does the application use it?
A probability distribution over possible next words or phrases – for example, 30% “cat”, 70% “tiger”. The application typically samples from this distribution, often after discarding words below a certain probability threshold.
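A sketch of that selection step, with a hypothetical output distribution and cutoff:

```python
import random

# Hypothetical distribution over next words, as a model might output it.
dist = {"cat": 0.30, "tiger": 0.55, "dog": 0.10, "lion": 0.05}

def sample_above_threshold(dist, threshold=0.10):
    # Drop low-probability words, renormalize, then sample proportionally.
    kept = {w: p for w, p in dist.items() if p >= threshold}
    total = sum(kept.values())
    words = list(kept)
    weights = [kept[w] / total for w in words]
    return random.choices(words, weights=weights, k=1)[0]

print(sample_above_threshold(dist))  # "cat", "tiger", or "dog" -- never "lion"
```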
How are N-grams related to context? What is the tradeoff?
The bigger N is, the more context you have. However, this reduces the number of times in your dataset you would see each N-gram.
What is the transformer?
It is the neural-network architecture on which modern LLMs are built. The original design is composed of an encoder and a decoder, and relies on attention to capture context.
What is an encoder?
The encoder processes input text into some intermediate representation.
What is a decoder?
The decoder converts an intermediate representation into output text.
When would you want to use an encoder-only architecture?
If you only need the embedding (a representation of the input) – for example, for classification or semantic search.
When would you want to use a decoder-only architecture?
If you only care about generating new tokens.
How do transformers solve the context problem?
They use a self-attention layer.
What does BERT stand for?
Bidirectional Encoder Representations from Transformers.
What is the name of the really famous transformers paper?
“Attention Is All You Need” (2017).
BERT stands for Bidirectional Encoder Representations from Transformers. What does “bidirectional” mean?
It means the self-attention looks at both the preceding and the following tokens.
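Concretely, the difference shows up in the attention mask. A sketch, where entry [i][j] = 1 means “position i may attend to position j”:

```python
n = 4  # sequence length

# Bidirectional (BERT-style): every position attends to every position.
bidirectional = [[1] * n for _ in range(n)]

# Causal (decoder-style): position i attends only to positions j <= i,
# i.e. to itself and the preceding tokens.
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal:
    print(row)  # lower-triangular: each row "sees" one more position
```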
Is BERT a representation model or a language model?
It’s a representation model, since its purpose is to give good encodings of words.
How do you avoid “cheating” with a bidirectional self-attention layer?
You train using masked language modeling.
What is masked language modeling?
It’s a form of unsupervised learning where you take complete sequences of words and then “mask” (or erase) certain tokens to create training examples – the model then tries to predict the missing tokens.
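A sketch of the masking step (simplified – real BERT-style pipelines also sometimes swap in random tokens instead of always using [MASK]; the 15% rate is the commonly cited default):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Replace a random subset of tokens with [MASK]; the hidden originals
    # become the targets the model must learn to predict.
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

random.seed(1)
sentence = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(sentence)
print(masked, targets)
```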
What does the “foundational LLM” refer to?
A foundation (or “foundational”) LLM is a general-purpose model pre-trained on broad data that has not yet been adapted – e.g. fine-tuned – to your application’s specific needs.