ML and Gen AI Refresh Flashcards
What kind of DB does RAG use?
Vector database
What is a vector database?
It is a database that is designed to index, store, and query data in a vector format (e.g. an n-dimensional vector embedding).
How does a vector database work?
It works by querying for the k vectors that are closest to a given vector in terms of a distance metric like cosine similarity, dot product, etc. Instead of exact k-nearest neighbors (kNN), we typically use approximate nearest neighbors (ANN). This diminishes recall (it may drop some documents that are in fact similar, i.e. false negatives) but is more performant computationally.
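To make the distinction concrete, here is a minimal sketch of brute-force kNN with cosine similarity over made-up toy embeddings; a real vector database would replace this exact scan with an ANN index (e.g. HNSW) to trade a little recall for a lot of speed:

```python
import numpy as np

def knn_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> np.ndarray:
    # Normalize so a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                   # similarity of the query to every stored vector
    return np.argsort(-sims)[:k]   # indices of the k most similar vectors

rng = np.random.default_rng(0)
store = rng.normal(size=(1_000, 384))    # pretend these are document embeddings
print(knn_cosine(rng.normal(size=384), store, k=5))
```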
Why would I want to optimize recall?
We optimize recall when false negatives are costly, e.g. telling someone they don't have an STD when they do.
Why would I want to optimize precision?
We optimize precision when false positives are costly, e.g. telling someone they have cancer when they don't.
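To make the trade-off concrete, here is a tiny sketch of both metrics computed from raw confusion counts (the counts themselves are made up):

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)   # of everything we flagged, how much was actually positive?

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)   # of everything truly positive, how much did we catch?

print(precision(tp=80, fp=20))  # 0.8
print(recall(tp=80, fn=40))     # ~0.667
```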
What are some issues with vector queries?
There aren't many great algorithms for efficient kNN queries that guarantee finding the exact k nearest neighbors to a given vector. Hence we typically opt for ANN, which gives up some accuracy but is efficient.
What is zero-shot prompting?
Zero-shot prompting can be thought of like you're asking someone to solve a problem with no context and hoping they get the right answer.
What is few-shot prompting?
Few-shot prompting can be thought of like you're adding some context (examples) to help the person solve the problem.
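A hypothetical few-shot prompt, where the two labeled reviews are the "shots" that show the model the task before it sees the real input:

```python
# Hypothetical example prompt; the reviews and labels are made up.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "Loved every minute of it." -> positive
Review: "A total waste of two hours." -> negative
Review: "The acting was phenomenal." ->"""
```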
What are some limitations of few-shot prompting?
They're not great at dealing with complex reasoning tasks. In those cases we need more structure in the model's reasoning, such as Chain of Thought (CoT) prompting.
What is CoT Prompting?
CoT prompting is like a Q-A-Q-A answering technique that aims to get to the right answer by breaking the reasoning out into steps. For example, you could ask a simple math problem, get the answer back, and then ask a follow-up question that expands on the first; you get back the correct response, whereas asking that second question up front would likely have failed.
How does CoT come into play with zero-shot?
You can combine zero-shot prompting with CoT by simply adding "Let's think step by step" to the prompt.
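A hypothetical zero-shot CoT prompt; note there are no examples, just the step-by-step nudge appended at the end:

```python
# Hypothetical example; the question is a classic brainteaser, not from any API.
zero_shot_cot = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 more than "
    "the ball. How much does the ball cost? Let's think step by step."
)
```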
What are chains in LangChain?
They are a sequence of calls, whether to an LLM, tool, or a data preprocessing step.
Why do we use chains?
Chains allow you to go beyond just a single API call to a language model and instead chain together multiple calls in a logical sequence.
Give me an example of the input for chains.
A prompt and a model (LLM).
Give me an example of the input for chain.run
query and text, where query is the base prompt and text is what flows into the chain (see the sketch below).
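A minimal sketch using LangChain's classic LLMChain/chain.run API (deprecated in newer releases, which favor LCEL pipelines like `prompt | llm`); the model choice, the template, and the inputs are placeholders:

```python
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI  # requires an OpenAI API key in the environment

prompt = PromptTemplate(
    input_variables=["query", "text"],
    template="{query}\n\n{text}",
)
chain = LLMChain(llm=OpenAI(), prompt=prompt)

# `query` is the base prompt, `text` is what flows into the chain.
result = chain.run(query="Summarize the following passage:", text="...")
```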
What is the main architecture powering Foundational Models
The Transformer architecture. Essentially, it provides the ability to perform parallel training of gigantic neural networks with billions of parameters.
What is an encoder-only architecture?
BERT is an example of this. Essentially it only contains the encoder piece and transforms the text into its vector representation.
Explain to me the transformer architecture
Reference: https://blue-season.github.io/transformer-in-5-minutes/
What is a decoder-only architecture?
GPT-3 is an example. It contains only the decoder. Decoder-only models extend an input text sequence by generating continuations: text completion and generation.
What is an encoder-decoder architecture?
Contains both. The decoder consumes the encoded embeddings to generate output text. This can be used for text-to-text tasks, e.g. translation.
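As a hedged illustration, each of the three architectures maps onto a familiar Hugging Face pipeline; the checkpoint names below are common public models, chosen here as assumptions:

```python
from transformers import pipeline

# Encoder-only (BERT): represent text / fill in masked tokens.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Paris is the [MASK] of France.")[0]["token_str"])

# Decoder-only (GPT-style): extend an input sequence with a continuation.
gen = pipeline("text-generation", model="gpt2")
print(gen("Once upon a time", max_new_tokens=20)[0]["generated_text"])

# Encoder-decoder (T5): text-to-text tasks like translation.
translate = pipeline("translation_en_to_fr", model="t5-small")
print(translate("The cat sleeps.")[0]["translation_text"])
```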
What makes Foundational Models different?
Scale, architecture, pretraining, customization, versatility, infrastructure.
What are some different types of FMs?
Language, computer vision, generative, and multimodal models.
Walk me through a basic RAG architecture
Let's break out what is in RAG. At an extremely high level, RAG consists of four important things:
1. The question (user)
2. The external knowledge database (the library of current and or relevant knowledge)
3. The retriever (the librarian tasked with fetching documents related to the question and returning them to better help answer it)
4. A really smart, but also out-of-date or out-of-touch model (the LLM)
So here’s what happens:
1. The librarian takes the user's question and finds similar documents that are relevant to that question from the library.
2. Those similar documents are then added to the user's question to help the LLM answer it far more accurately.
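A toy end-to-end sketch of that flow; the embedding model, the two-document "library", and the final ask_llm call are all stand-ins for real components:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed public checkpoint
library = [
    "Hot chocolate: heat milk, whisk in cocoa and sugar.",
    "Sponge cake: cream butter and sugar, fold in flour, bake at 350F.",
]
doc_vecs = embedder.encode(library, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    # The "librarian": find the documents most similar to the question.
    q = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(-(doc_vecs @ q))[:k]
    return [library[i] for i in top]

question = "How do I make hot chocolate?"
context = "\n".join(retrieve(question))
augmented_prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
# ask_llm(augmented_prompt)  # hypothetical call handing the stuffed prompt to the LLM
```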
What should all RAGs integrate? NNIC
Noise Robustness
* Ability to handle noisy or irrelevant data contained in the retrieved documents.
* Ability to be like, hey, these documents I pulled are totally irrelevant to the question the user is asking. E.g. I want to know how to make hot chocolate, and here are some documents about how to bake a cake :(.
Negative Rejection
* Reject answering when there is insufficient knowledge (i.e. the LLM gave a poor answer and/or our database returned poor documents because it has nothing to say about, for instance, a very nuanced question on how to formalize a university-grade class on underwater basket weaving).
Information Integration
* Ability to integrate information from multiple sources to answer more complex questions. E.g. think of our library not being limited to just English, but covering science, music, and dare I say, information even about yourself!
Counterfactual Robustness
* Ability to recognize when the retrieved documents contain known factual errors and avoid blindly repeating them.
What are some quality scores that should be used to assess our RAG?
Happy that you asked. Think of it like a little calf…or CAF.
I. Context Relevance
* Retrieved context NEEDS to be relevant for answering the user's question.
II. Answer Relevance
* The answer has to directly address the user's question. I.e. no asking "I want to bake a cake" and the output ending up being "Top 10 hot sexy things to do in Austin this Tuesday".
III. Faithfulness
* The answer must be faithful to the retrieved context. I.e. we ask which planet has the most moons, the retrieved context literally contains the exact answer, and BOOM, we still get the wrong answer. Wamp wamp.
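As a toy illustration only, here is a heuristic sketch that scores CAF with embedding similarity; real evaluation frameworks (e.g. RAGAS) use LLM judges rather than cosine similarity, so treat these formulas as assumptions for intuition:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed public checkpoint

def caf_scores(question: str, context: str, answer: str) -> dict:
    q, c, a = model.encode([question, context, answer])
    return {
        "context_relevance": float(util.cos_sim(q, c)),  # does the context fit the question?
        "answer_relevance": float(util.cos_sim(q, a)),   # does the answer address the question?
        "faithfulness": float(util.cos_sim(c, a)),       # is the answer grounded in the context?
    }
```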
What are the two primary MUST haves of a RAG?
Good retrieval makes good answers:
* Retrieval: the retriever must be able to find the most relevant documents for answering the user's question.
* Generation: the answer generator must be able to make good use of those documents.
Name some ways we can address the retrieval issue with RAGs?
- Chunk Size Optimization: chunking too small or too large may result in inaccurate answers.
- Structured Knowledge: enables recursive retrievals and query routing.
- Sliding Window Chunking: overlapping chunks help alleviate long documents.
- Metadata Attachments: enable more efficient search via filtering, like on keywords!
What is chunking in RAGs?
Chunking in Large Language Model (LLM) applications breaks down extensive texts into smaller, manageable segments.
What is Sliding Window Chunking as it relates to RAGs?
In this method, chunks have some overlap, ensuring that information at the edges of chunks is not lost. Sliding window chunking provides a balance between fixed-length and context-aware techniques.
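A minimal sliding-window chunker, using whitespace words as stand-in tokens; size and overlap are made-up tuning knobs:

```python
def sliding_window_chunks(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    words = text.split()
    step = size - overlap          # how far the window slides each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):   # last window already covers the tail
            break
    return chunks

demo = "word " * 95
print(len(sliding_window_chunks(demo, size=40, overlap=10)))  # 3 overlapping chunks
```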
Name some ways a RAG can address the good answer generation issue!
Information Compression
* Reduces noise and helps alleviate context-window constraints
Generator Fine-Tuning
* Fine-tune the LLM to help ensure the retrieved docs are aligned with the LLM
Result Re-Rank
* the process of reordering an initial list of retrieved documents or passages to improve the ranking quality.
* Alleviates the lost-in-the-middle phenomenon in LLMs (see the sketch below)
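Here is the promised sketch of result re-ranking with a cross-encoder from sentence-transformers; the checkpoint name is a common public one, used here as an assumption:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str]) -> list[str]:
    # Cross-encoders score each (query, doc) pair jointly: slower than
    # embedding search, but much better at judging relevance.
    scores = reranker.predict([(query, d) for d in docs])
    return [d for _, d in sorted(zip(scores, docs), reverse=True)]
```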
What is Knowledge Distillation composed of?
The Teacher Network: the big, power-hungry sensei that contains all this vast knowledge.
The Student Network: the eager young grasshopper, the faster, lighter-weight model we are going to train.
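A minimal sketch of the classic distillation loss in PyTorch, where the student mimics the teacher's temperature-softened output distribution (the temperature value is a made-up default):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    # Soften both distributions with temperature T, then penalize the
    # KL divergence between them; T**2 rescales the gradients.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T ** 2)
```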
What is the Lost in the Middle problem for LLMs?
It's where LLMs really underperform on certain tasks when relevant information sits in the middle of the prompt. The more we expand the size of the prompt, the more the information in the middle gets lost.
Just like humans, LLMs respond well to information at the beginning or end of a piece of content; information in the middle tends to get lost.
What is the context window?
The context window is the number of tokens the LLM can process at once in the input prompt.
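One hedged way to see how much of the window a prompt consumes is to count its tokens with tiktoken (the model name here is an assumption):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # assumed model name
prompt = "How many moons does Jupiter have?"
print(len(enc.encode(prompt)), "tokens")    # tokens consumed out of the context window
```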
Why do we use logistic regression?
Logistic regression is used if we want to calculate something discrete, like whether people like Troll 2.
It fits a squiggle (an S-shaped curve) to the data that tells us the predicted probability for a discrete variable on the y-axis.
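A toy sketch with scikit-learn: fit the squiggle to a made-up "hours of Troll 2 watched" feature and read off a predicted probability:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1.0], [2.0], [3.0], [4.0], [5.0]])  # made-up data
likes_troll2 = np.array([0, 0, 0, 1, 1, 1])                    # discrete outcome

model = LogisticRegression().fit(hours, likes_troll2)
print(model.predict_proba([[2.5]])[0, 1])  # predicted probability of liking Troll 2
```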
What are the assumptions of linear regression? LIHN
Linearity: Relationship between predictors and the response is linear.
Independence: All observations are independent of each other. E.g. observation 1 does not influence observation 2 (one data point's outcome doesn't affect another's).
Homoscedasticity: The spread of the residuals is constant across all levels of the predictors. Or the variance of our error terms are similar across the independent variables
Normality of Residuals: The residuals should follow a normal distribution. The residuals should form a bell curve when you plot them.
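A hedged sketch of quick assumption checks with made-up data: fit a line, then probe the residuals for normality and constant spread:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(scale=1.0, size=100)  # linear signal + constant-variance noise

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

print(stats.shapiro(residuals))                  # normality of residuals (high p = looks normal)
print(np.corrcoef(x, np.abs(residuals))[0, 1])   # near 0 suggests homoscedasticity
```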
How would you explain the Bias v. Variance Tradeoff to a high school student!
OK KIDS! Let's talk about what the Bias vs. Variance Trade-Off is. To understand this we have to understand what Bias and Variance even are in relation to a model, so let's hop in!
Bias: Bias is actually exactly what it sounds like, BIAS! For example, let's say we are using a model and we make some "assumptions" about the data. Let's say, in our case, we assume everyone, at every weight, is going to be 3 inches tall. We absolutely 100% REFUSE to believe otherwise; well, that's BIAS! Now, we say in statistics that a model that is overly biased is underfitting the data. And when we think of something "fitting" the data, we think of how well our model, or in this case a line, matches up with all the points, a perfect fit being all points lying on the line. When we draw the line where every x value maps to a height of 3 inches, we are essentially drawing a completely horizontal line and missing all the data points! This is called underfitting.
Variance: Now variance is exactly related to what it sounds like, how much things vary! Imagine the model is now so worried about being too biased that it wants to ensure every possible solution and minuscule consideration is taken into account. Therefore, for all weights down to the ounce level, we try to make sure we get the exact possible height; say, for example, in the real world the average height is 3 inches for people between weights of 1-2 ounces. When we have a lot of variance, we're adding in all these possible heights for a ton of different values between 1-2 ounces. What happens is we start to lose sight of the "trend" in the answer, get too focused on the "exact" answer, and get lost in the weeds. A way to see this is to imagine drawing a squiggly line through all the data points. What this does is add a ton of COMPLEXITY into our model, meaning just that: it's super complex and gets lost in the weeds of the overall answer. When we have a high-variance or overly complex model, we often worry about the model overfitting the data, meaning that when the model sees new data it may not be able to get the correct answer because it was soooo focused on the data it saw before.
So this is all great, but what is the Bias vs. Variance Trade-Off?? Well, you actually already have the pieces to know what it is! The Bias vs. Variance trade-off is ensuring that the model we make is not too biased (leading to underfitting the data) and does not have too much variance (overfitting the data), so that we can find the sweet spot: a model that has reasonable assumptions about the data, but isn't so hyperfocused on being "overly" correct that it loses sight of the broader trend at large. You can think of it as: bias is being so concerned with one particular right answer that it's wrong, and variance is being so obsessed with all possible right answers that it's wrong.
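To see the trade-off in code, here is a toy sketch fitting the same noisy made-up data with a too-simple, a too-wiggly, and a balanced polynomial:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # made-up noisy data

flat_line = np.polyfit(x, y, deg=0)  # high bias: one horizontal line, underfits
squiggle = np.polyfit(x, y, deg=9)   # high variance: chases every point, overfits
balanced = np.polyfit(x, y, deg=3)   # the sweet spot in between
```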
What is Regularization?
Think of regularization like a spanker ready to punish your model when it does something as horrible as overfitting the data (gasp, bad model)!
Regularization spanks your model into shape by adding a penalty on the size of your model's coefficients. There are a few types (L1, L2, Elastic Net, etc.) and each has its own way of SPANKING, or, well, shrinking the coefficients toward zero to simplify your overly complex complicado model that loves to overfit.
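A hedged scikit-learn sketch of the two classic penalties; the data and the alpha penalty strengths are made up:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] * 3 + rng.normal(size=50)    # only the first feature truly matters

print(Ridge(alpha=1.0).fit(X, y).coef_)  # L2: shrinks all coefficients toward zero
print(Lasso(alpha=0.1).fit(X, y).coef_)  # L1: drives irrelevant ones exactly to zero
```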
When should I use regularization?
Use regularization when you think your model is being naughty and overfitting the training data, e.g. capturing noise instead of the underlying pattern (bad model!).