Bare Min Must Know Flashcards
What is Top-P?
What is Top-K?
What is Temperature?
What is Chain of Thought Prompting?
What is Least to Most Prompting?
What is Self-ask prompting?
ReAct prompting
Iterative Prompting
See https://cobusgreyling.medium.com/12-prompt-engineering-techniques-644481c857aa to fill in the answers for the prompting techniques above.
How do you mitigate latency in GenAI?
On the model side: Knowledge Distillation, Quantization.
Note: 4-bit quantization compresses parameters, and sometimes intermediate calculations, from high-precision numbers such as 32-bit floats down to 4 bits. This can reduce the model size significantly, by up to roughly 8x relative to 32-bit weights.
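To make the quantization note concrete, here is a minimal sketch of symmetric 4-bit quantization in plain Python. It assumes a single scale factor for the whole list of weights; real libraries typically quantize in small groups and pack two 4-bit values per byte. The function names and sample weights are made up for illustration.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit integers in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # The largest magnitude maps to 7; fall back to 1.0 for an all-zero list.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 4-bit integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -1.40, 0.07]  # illustrative values only
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)

for w, r in zip(weights, restored):
    print(f"{w:+.2f} -> {r:+.2f}")
```

Each weight is stored as an integer between -8 and 7 plus one shared scale, so the restored values are close to the originals but not exact. That rounding error is the price paid for the smaller model.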
On the token processing side:
Parallel processing of tokens and caching of frequently generated tokens.
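One possible reading of the caching idea is to reuse whole responses for prompts that repeat. The sketch below assumes that reading; `slow_generate` is a hypothetical stand-in for an expensive model call, not a real API. Inference engines also cache attention key/value states during generation, which this sketch does not show.

```python
from functools import lru_cache

call_count = 0  # counts how often the "model" is actually invoked


def slow_generate(prompt):
    """Hypothetical stand-in for an expensive model call."""
    global call_count
    call_count += 1
    return f"response to: {prompt}"


@lru_cache(maxsize=1024)
def cached_generate(prompt):
    """Return a cached response when the exact same prompt was seen before."""
    return slow_generate(prompt)


first = cached_generate("What is Top-K?")
second = cached_generate("What is Top-K?")  # served from the cache
print(call_count)
```

Exact-match caching only helps when prompts repeat verbatim, so it suits FAQ-style traffic better than free-form chat.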
What is Grounding?
It’s a way to keep the LLM on track with the “story” we’re trying to tell. It helps the model remember why we’re working on the problem.
How does grounding work?
Similar to RAG: a retriever fetches relevant documents based on the user input.
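The retriever step can be sketched as follows. This is a toy illustration that assumes word overlap as the relevance score; production retrievers usually rely on embeddings or a search index. The documents, function names, and prompt template are all made up for the example.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Prepend the retrieved documents to the question as context."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "quantization reduces model size by lowering numeric precision",
    "temperature controls randomness when sampling tokens",
]
prompt = build_prompt("how does quantization reduce model size", docs)
print(prompt)
```

The model then answers from the supplied context rather than from memory alone, which is the part grounding and RAG have in common.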
Difference between RAG and Grounding