[NLP] Lecture 3: Efficiency, Efficiency, Efficiency (Max Müller-Eberstein) Flashcards
What are the three pillars of efficiency?
- Compute: expensive in electricity (in Denmark we are fortunate to have a lot of green energy)
- Data (efficiency)
- Effort (e.g. how difficult is it to get started with NLP?)
How many times more power does a ChatGPT question take compared to a Google search?
10 times more
What percentage of its electricity does Denmark use on data centres?
Almost 20%
Where does the model live?
In the cloud, which in turn runs on physical hardware. We train on GPUs; for a really big model we use more GPUs, and we can also spread them over more servers.
Why does it work to split the model across servers?
Transformer models are built from blocks that run one after another, so we can place different blocks on different servers: one server finishes its blocks and sends the result on to the next server, which holds the following blocks.
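The idea of handing the data from one server's blocks to the next can be sketched roughly as below. This is only a toy illustration: `make_block`, `partition` and `run_pipeline` are hypothetical names, the "blocks" are stand-in functions rather than real transformer blocks, and the "servers" are plain Python lists.

```python
# Toy sketch (hypothetical names): a stack of blocks split across "servers".

def make_block(i):
    # Stand-in for a transformer block: it just adds i to every value.
    def block(hidden):
        return [h + i for h in hidden]
    return block

def partition(blocks, num_servers):
    # Cut the stack into chunks of consecutive blocks, one chunk per server.
    per_server = -(-len(blocks) // num_servers)  # ceiling division
    return [blocks[s:s + per_server] for s in range(0, len(blocks), per_server)]

def run_pipeline(servers, hidden):
    # Each server runs its own blocks in order, then passes the result on.
    for server_blocks in servers:
        for block in server_blocks:
            hidden = block(hidden)
        # In a real system, this is where the data would cross the network.
    return hidden

blocks = [make_block(i) for i in range(8)]
servers = partition(blocks, num_servers=4)

single = run_pipeline([blocks], [0.0, 1.0])   # everything on one server
split = run_pipeline(servers, [0.0, 1.0])     # two blocks per server
```

Because the blocks only depend on the output of the previous block, `split` and `single` give the same result; only where the computation happens changes.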
Explain what a transformer is made of
We feed word IDs (looked up as word vectors) into the transformer, and it outputs a vector of probabilities over words.
Vectors -> attention head -> feedforward
In attention, each input vector is multiplied by learned query, key and value matrices. The names do not really mean anything on their own; we do not know exactly what the learned weights do.
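To make the query/key/value step more concrete, here is a minimal sketch of a single attention head in plain Python, assuming the standard scaled dot-product formulation (the lecture may have shown it differently). The helper names `matmul`, `softmax` and `attention` and the tiny identity weight matrices are made up for illustration, and there is no masking or multi-head logic.

```python
import math

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    # Project every input vector into query, key and value vectors.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    # Score every query against every key, scaled by sqrt(d_k).
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output is a weighted average of the value vectors.
    return matmul(weights, v), weights

# Three "word vectors" of size 2; identity matrices as placeholder weights.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how much one word attends to every other word; in a trained model the three matrices are learned rather than set by hand.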
What is considered a small LLM?
32 blocks
Why is the transformer not efficient?
If you want to change something in the model, you need to change the whole model, because we don't know which part does what.
How can we make it more efficient?
- Only train some parts of the network. The problem is that if we tweak something, we don't know how it will affect other things.
We can also do what is called parameter-efficient fine-tuning (PEFT): adding a small number of new parameters to the model, such as adapters (provide a new hidden state?), prefix tuning (pretends there is more context to the sentence) and LoRA (like an adapter but at a different location).
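As a rough illustration of why adding parameters can be cheaper than retraining everything, here is a minimal LoRA-style sketch in plain Python. It assumes the usual LoRA formulation (a frozen weight matrix W plus a scaled low-rank update B·A, with B initialised to zero); the function name `lora_weight` and the toy sizes are made up for this example.

```python
def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha, r):
    # Effective weight: W + (alpha / r) * B @ A. W itself is never changed.
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

d, r = 4, 1  # toy sizes: model width d, LoRA rank r (assumed values)

# Frozen pretrained weight (identity here, just as a placeholder).
w = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

# Trainable low-rank factors: A is r x d, B is d x r and starts at zero,
# so the model behaves exactly like the original before training.
a = [[0.1, 0.2, 0.3, 0.4]]
b = [[0.0] for _ in range(d)]

w_eff = lora_weight(w, a, b, alpha=1.0, r=r)

full_params = d * d          # what full fine-tuning would have to update
lora_params = r * d + d * r  # what LoRA updates instead
```

With these toy sizes LoRA trains 8 numbers instead of 16; the gap grows quickly for realistic widths, since the cost is 2·d·r rather than d².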
What do we do now
Explain how an adapter works, how prefix tuning works and how LoRA works
What ways can we make efficient use of the data we have?
- Gather as much data as possible for your target language
- Cross-lingual transfer:
  - Train on all the data you can find and hope for the best
- Learning dynamics:
  - Understand how and when models learn certain things
  - Also cross-lingual transfer
Why can it be incorrect to call smaller languages low-resource?
Smaller language communities are under-resourced rather than low-resource.
Explain how compute, data and effort are the three pillars:
- Compute: bring the compute cost down, e.g. with PEFT
- Data:
  - Quality > quantity
  - Leverage transfer learning
  - Understand learning dynamics
- Effort:
  - Less effort on model/feature engineering
  - More effort on software/hardware engineering
  - The important parts still require the most effort