Training Flashcards

1
Q

Steps to Data Preparation

A
  1. First, clean the data by applying techniques such as filtering, deduplication, and normalization.
  2. Next, tokenize the cleaned data: the dataset is converted into tokens using a scheme such as byte-pair encoding. Tokenization generates a vocabulary, the set of unique tokens that serves as the model’s ‘language’ for processing and understanding text.
  3. Finally, the data is typically split into a training dataset used to train the model and a test dataset used to evaluate the model’s performance. (A minimal code sketch of the whole pipeline follows below.)
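
A minimal sketch of this pipeline in Python; the cleaning rules, the whitespace tokenizer, and the 90/10 split are illustrative assumptions rather than any particular library’s approach:

    import random

    def clean(docs):
        """Filtering, deduplication, and normalization (toy rules)."""
        seen, cleaned = set(), []
        for doc in docs:
            norm = " ".join(doc.lower().split())   # normalize whitespace and case
            if len(norm) < 20:                     # filter out very short documents
                continue
            if norm in seen:                       # deduplicate exact matches
                continue
            seen.add(norm)
            cleaned.append(norm)
        return cleaned

    def tokenize(docs):
        """Toy whitespace tokenizer; real LLMs use subword schemes such as BPE."""
        tokenized = [doc.split() for doc in docs]
        vocab = sorted({tok for doc in tokenized for tok in doc})  # the model's 'language'
        return tokenized, vocab

    def train_test_split(items, test_fraction=0.1, seed=0):
        """Hold out a test set for evaluating the trained model."""
        items = list(items)
        random.Random(seed).shuffle(items)
        cut = int(len(items) * (1 - test_fraction))
        return items[:cut], items[cut:]

    corpus = [
        "The cat sat on the mat.",
        "The cat sat on the mat.",    # duplicate, dropped by clean()
        "A longer example document about training large language models.",
    ]
    tokenized, vocab = tokenize(clean(corpus))
    train_set, test_set = train_test_split(tokenized)
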
2
Q

Context Window / Length

A

The number of previous tokens the model can ‘remember’ and use to predict the next token in the sequence. Longer context lengths allow the model to capture more complex relationships and dependencies within the text, potentially leading to better performance. However, longer contexts also require more computational resources and memory, which can slow down
training and inference. Choosing an appropriate context length involves balancing these trade-offs based on the specific task and available resources.
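
As a toy illustration of the ‘remembering’ trade-off (the token IDs and window sizes below are made up):

    def build_model_input(token_ids, context_length):
        """Keep only the most recent `context_length` tokens; anything earlier
        falls outside the window and cannot influence the next prediction."""
        return token_ids[-context_length:]

    history = list(range(10_000))   # pretend these are token IDs from a long conversation
    print(len(build_model_input(history, context_length=2048)))  # 2048 tokens kept
    print(len(build_model_input(history, context_length=128)))   # 128 kept; older tokens are 'forgotten'

In a standard transformer, self-attention cost grows roughly quadratically with the window size, which is one source of the compute trade-off described above.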

3
Q

Unsupervised Pre-Training

A

Training an LLM on a large corpus of unlabeled data. Then, supervised data is used to fine-tune the model for a specific task, such as translation or sentiment classification.

Before GPT-1 (2018), most language models were trained using a supervised learning objective: the model was trained on a dataset of labeled examples, where each example had a corresponding label. Unsupervised pre-training relaxes this requirement, since unlabeled data can be collected far more easily and cheaply than labeled data. Additionally, the pre-trained model can generalize to tasks that differ from the tasks it was trained on.
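
A compressed sketch of the two-stage recipe, assuming PyTorch is available; the tiny GRU ‘body’, fake data, and hyperparameters are placeholders standing in for a real transformer and corpus, not GPT-1’s actual setup:

    import torch
    import torch.nn as nn

    vocab_size, d_model, num_classes = 100, 32, 2

    # Shared "body": token embeddings plus a small recurrent core.
    embed = nn.Embedding(vocab_size, d_model)
    body = nn.GRU(d_model, d_model, batch_first=True)
    lm_head = nn.Linear(d_model, vocab_size)      # used only during pre-training
    cls_head = nn.Linear(d_model, num_classes)    # used only during fine-tuning

    def run_body(tokens):
        hidden, _ = body(embed(tokens))
        return hidden                              # (batch, seq, d_model)

    # Stage 1: unsupervised pre-training (next-token prediction on unlabeled text).
    unlabeled = torch.randint(0, vocab_size, (64, 16))   # fake token IDs
    opt = torch.optim.Adam(list(embed.parameters()) + list(body.parameters()) + list(lm_head.parameters()))
    for _ in range(10):
        logits = lm_head(run_body(unlabeled[:, :-1]))    # predict token t+1 from tokens up to t
        loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), unlabeled[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: supervised fine-tuning (e.g. sentiment classification on labeled data).
    labeled_x = torch.randint(0, vocab_size, (32, 16))
    labeled_y = torch.randint(0, num_classes, (32,))
    opt = torch.optim.Adam(list(embed.parameters()) + list(body.parameters()) + list(cls_head.parameters()))
    for _ in range(10):
        logits = cls_head(run_body(labeled_x)[:, -1])    # classify from the final hidden state
        loss = nn.functional.cross_entropy(logits, labeled_y)
        opt.zero_grad()
        loss.backward()
        opt.step()
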

4
Q

Limitations of Using Only Labeled Data for Training

A

This approach has two main limitations. First, it requires a large amount of labeled data, which can be expensive and time-consuming to collect. Second, the model can only generalize to tasks similar to those it was trained on.

5
Q

Semi-Supervised Training

A

Unsupervised pre-training followed by supervised fine-tuning on labeled data. This combined approach proved superior to supervised training alone.

6
Q

Effect of More Parameters

A

The main innovation of GPT-2 (2019) was a direct scale-up, with a roughly tenfold increase in both its parameter count and the size of its training dataset. GPT-2 had 1.5 billion parameters, an order of magnitude more than its predecessor. More parameters increase the model’s learning capacity, and the model with the most parameters performed better on every downstream task.

Increasing the number of parameters had a significant impact on logical reasoning and reading comprehension, but it did not improve performance as much on tasks such as general knowledge, where performance eventually almost plateaued.
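
For a concrete sense of what a ‘parameter count’ is, here is one common way to count a model’s trainable parameters with PyTorch (the small two-layer example model is arbitrary):

    import torch.nn as nn

    # A toy two-layer feed-forward block, similar in shape to one transformer MLP layer.
    model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

    num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{num_params:,} trainable parameters")   # ~4.7 million here, versus 1.5 billion for GPT-2
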

7
Q

“Test-Time Compute”

A

Enhances existing AI models during the “inference” phase, i.e., while the model is being used. For example, instead of immediately committing to a single answer, the model can generate and evaluate multiple candidate answers in real time and choose the best one. This lets the model dedicate more processing power to challenging tasks such as math or coding problems, or other complex operations that demand human-like reasoning and decision-making.
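
One simple form of test-time compute is best-of-N sampling: generate several candidate answers and keep the one a scoring function prefers. In this sketch, generate() and score() are random stand-ins for a real model and verifier:

    import random

    def generate(prompt):
        """Stand-in for sampling one candidate answer from a language model."""
        return f"candidate {random.randint(0, 999)} for: {prompt}"

    def score(prompt, answer):
        """Stand-in for a verifier or reward model that rates an answer."""
        return random.random()

    def best_of_n(prompt, n=8):
        """Spend more inference-time compute (n samples) to pick a better final answer."""
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda ans: score(prompt, ans))

    print(best_of_n("Prove that the sum of two even numbers is even.", n=8))
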

8
Q

Examples of Test-Time Compute Techniques

A

Techniques such as Monte Carlo Tree Search, Chain of Code, and Mixture of Agents are used to boost model performance at inference time.
