Practice Exam 1 (Fine-Tuning, Evals, Deployment) Flashcards
When optimizing and deploying a large language model on NVIDIA GPUs, which of the following techniques is most effective for reducing memory footprint without significantly sacrificing model performance?
Quantization
Quantization reduces the precision of the model parameters, decreasing memory and computation costs, but careful tuning is needed to avoid performance loss.
Quantization converts high-precision numbers (like 32-bit floating-point values) into lower-precision representations (like 16-bit floats or 8-bit integers).
Can be applied to:
- Weights of a neural network
- Activations (intermediate values during inference)
- Sometimes gradients, during training
Memory usage drops by 2x to 4x or more depending on the precision level, enabling:
- Fitting larger models on the GPU
- Reducing memory bandwidth usage
- Faster inference and training (especially with tensor cores)
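A minimal sketch of post-training dynamic quantization, assuming PyTorch's `torch.quantization` API and a toy two-layer model standing in for a real LLM:

```python
import torch
import torch.nn as nn

# Toy stand-in for a much larger model.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Dynamic quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Rough illustration of the saving: int8 weights are ~4x smaller than fp32.
fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"fp32 weights: {fp32_bytes / 1e6:.1f} MB (int8 would be ~{fp32_bytes / 4e6:.1f} MB)")
```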
What is model parallelism?
Model parallelism splits the model across multiple GPUs, which decreases each GPU's memory load but can introduce communication overhead.
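A naive sketch of model parallelism, assuming two visible GPUs and hypothetical layer sizes; the device-to-device copy in `forward` is exactly the communication overhead mentioned above:

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        # Each half of the network lives on a different GPU.
        self.part1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(dim, dim).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Device-to-device transfer: the communication cost of model parallelism.
        return self.part2(x.to("cuda:1"))

out = TwoGPUModel()(torch.randn(8, 1024))
```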
What is gradient checkpointing?
Gradient checkpointing reduces memory usage by storing only a subset of intermediate values and recomputing others, but it might increase computation time.
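A minimal sketch using PyTorch's `torch.utils.checkpoint`: activations inside the checkpointed block are discarded in the forward pass and recomputed during backward, trading compute for memory as described above.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Intermediate activations inside `block` are not stored;
# they are recomputed when backward() needs them.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```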
What is knowledge distillation?
Knowledge distillation involves training a smaller model to replicate the performance of a larger one, which can reduce memory usage but may require extensive retraining.
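A sketch of the classic (Hinton-style) distillation loss, assuming logits from a frozen teacher and a smaller student; `T` softens the distributions and `alpha` balances soft vs. hard targets:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: student mimics the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```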
When fine-tuning a large language model (LLM) with a limited labeled dataset, which approach is most effective in preventing overfitting, while still improving the model’s performance?
Correct Answer
Implementing early stopping based on validation loss and using data augmentation techniques.
- Early stopping ensures you don’t train too long and overfit.
- Data augmentation ensures you don’t under-train on a limited dataset.
Early Stopping
- You train your model for multiple epochs (passes over the data).
- After each epoch, you calculate the validation loss (how well the model performs on data it hasn’t seen).
- If the validation loss doesn't improve for a number of epochs (called patience), training stops early, as sketched below.
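A schematic training loop implementing the patience rule above; `train_one_epoch`, `evaluate`, and `save_checkpoint` are hypothetical helpers:

```python
best_loss = float("inf")
patience, bad_epochs = 3, 0

for epoch in range(100):
    train_one_epoch(model)       # hypothetical: one pass over the training data
    val_loss = evaluate(model)   # hypothetical: loss on held-out validation data
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        save_checkpoint(model)   # keep the best-performing weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                # no improvement for `patience` epochs: stop early
```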
Data augmentation is a way to increase the variety of your training data without collecting more real data. It’s especially common in image, text, and audio tasks.
- Prevents overfitting by forcing the model to generalize.
- Simulates real-world variability — making the model more robust.
Applies random transformations to training samples (see the sketch after this list), e.g., for images:
- Flipping
- Rotation
- Cropping
- Zooming
- Adding noise
- Color shifting
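A minimal image-augmentation pipeline, assuming torchvision; each transform is applied randomly per sample, so the model rarely sees the exact same input twice:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                     # flipping
    transforms.RandomRotation(degrees=15),                 # rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # cropping / zooming
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # color shifting
    transforms.ToTensor(),
])
```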
Incorrect Answers
Reducing the learning rate gradually throughout the training process.
- While gradually reducing the learning rate can make fine-tuning steps more precise, it doesn't directly address overfitting, and the longer training it encourages can make overfitting worse.
Increasing the number of layers and the width of the model to capture more features.
- Increasing the model’s capacity can lead to overfitting, especially with a small dataset, as it allows the model to memorize the training data rather than generalize.
Utilizing a regularization technique such as Dropout or L2 regularization during training.
- Regularization techniques like Dropout or L2 add a penalty for complexity or encourage simplicity, which helps prevent overfitting by reducing the model's capacity to memorize the training data; on its own, though, regularization does nothing to enrich a limited dataset, which is why the combined approach above is preferred.
In the context of evaluating a Generative AI Language Model, which metric is most appropriate for assessing the diversity and novelty of generated text samples?
Self-BLEU
Self-BLEU measures diversity in generated text by evaluating how different each text is from others in the same set.
- Low Self-BLEU = High diversity
- High Self-BLEU = Low diversity (i.e., the model is generating very similar outputs)
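A computation sketch, assuming NLTK's `sentence_bleu` (smoothing is needed for short samples): each generated sample is scored against all the others as references, and the scores are averaged.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(samples):
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(samples):
        # Every other sample in the set serves as a reference.
        refs = [s.split() for j, s in enumerate(samples) if j != i]
        scores.append(sentence_bleu(refs, hyp.split(), smoothing_function=smooth))
    return sum(scores) / len(scores)

print(self_bleu(["the cat sat", "a dog ran fast", "birds fly south"]))  # diverse -> low
```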
Incorrect Answers
ROUGE Score measures the overlap of n-grams between generated text and reference texts, which is more indicative of recall and relevance, not diversity or novelty.
BLEU Score is used to evaluate the quality of machine-translated text by comparing it with one or more reference translations, focusing on accuracy rather than diversity or novelty.
Perplexity evaluates how well the language model predicts a sample, relating to fluency and coherence rather than diversity or novelty.
While fine-tuning a Large Language Model (LLM) on a specific dataset, you notice that the model’s performance on the validation set starts degrading after several epochs, even though training loss is still decreasing. What is the most likely cause of this issue?
Model is overfitting the training data
Overfitting occurs when the model learns the training data too well, capturing noise and outliers, which results in good training performance but poor generalization to unseen data, thus increasing validation loss.
Incorrect Answers
The model is underfitting the training data
- Underfitting occurs when the model is too simple to capture the underlying structure of the data, which would result in both training and validation losses being high. This is not the case here as training loss is decreasing.
The learning rate is set too high, causing divergence.
- A high learning rate can cause the loss to diverge, but this would typically result in erratic behavior in both training and validation losses, not just a degradation in validation performance.
The weight decay regularization parameter is too low.
- Low weight decay may contribute to overfitting, since large weights are not sufficiently penalized and can grow to memorize the training data, but it is a possible contributing factor rather than the most likely direct cause.
In the context of Generative AI, what is the primary challenge associated with training large language models (LLMs) as it relates to scalability?
The communication overhead between distributed systems grows as model parameters increase, affecting synchronization efficiency.
- As models grow, distributing them across multiple processors leads to increased communication overhead, posing significant scalability challenges.
Incorrect Answers
Larger models tend to memorize training data rather than generalize, which limits scalability in practical applications.
- Memorization is a concern for generalization but doesn’t directly impact the scalability relative to computational infrastructure and resource constraints.
Training larger models requires exponentially greater amounts of labeled data, which is often not available.
- Though data needs grow with model size, LLMs are designed to leverage self-supervised learning on unlabeled data rather than strictly requiring exponentially more labeled data.
The computational complexity increases exponentially with the size of the model due to nonlinear activations.
- While nonlinear activations contribute to complexity, the dominant scaling cost comes from the number of parameters and data throughput, not an exponential blow-up from the activations.
Which of the following techniques is most critical for improving the training stability and convergence of a Generative Adversarial Network (GAN)?
Batch normalization
Batch normalization helps stabilize training by normalizing inputs, reducing internal covariate shift, and allowing higher learning rates, which can be critical when training GANs to ensure stable updates to both generator and discriminator.
- Generator (G): learns to create fake data that looks like the real thing (e.g., fake images). It starts from random noise and tries to fool the discriminator.
- Discriminator (D): learns to detect whether input is real or fake. It is a binary classifier: real (from dataset) vs. fake (from generator).
✅ Why Batch Normalization Helps in GANs:
For each mini-batch, it (see the code sketch after this list):
- Computes the mean and variance of the batch.
- Normalizes each feature (zero mean, unit variance).
- Applies a learnable scale and shift (γ and β) to retain flexibility.
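Those three steps in a minimal sketch (batch statistics only; running averages and the train/eval distinction are omitted for brevity):

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=0)                         # 1) per-feature batch mean
    var = x.var(dim=0, unbiased=False)           #    and variance
    x_hat = (x - mean) / torch.sqrt(var + eps)   # 2) normalize: zero mean, unit variance
    return gamma * x_hat + beta                  # 3) learnable scale (γ) and shift (β)
```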
✅ Impact
- Stabilizes training by preventing exploding or vanishing gradients.
- Allows deeper networks by smoothing the optimization landscape.
- Helps generator gradients flow better through layers.
You often apply batch norm in the generator, but avoid it in the discriminator.
Why? In the discriminator, BN may leak information between samples (e.g., real and fake), which can hurt adversarial training.
In the context of evaluating generative AI language models, which metric is most suitable for measuring the alignment of generated text with human-like reasoning, especially in scenarios where subjective judgment is crucial?
Human Evaluation
Human evaluation involves subjective judgment and is often used to assess alignment with human-like reasoning since it considers context, coherence, and plausibility.
Incorrect Answers
Perplexity measures how well a probability model predicts a sample, which is more about model uncertainty than alignment with human-like reasoning.
The BLEU score measures the similarity between machine-generated text and reference text but does not capture the reasoning or intention of the text.
Precision quantifies the number of relevant instances retrieved and is not specifically designed for evaluating reasoning within generated text.
In the context of evaluating a Generative AI model capable of producing text, which metric is most suitable for assessing the diversity of the generated outputs while maintaining semantic coherence with the input?
Embedding-based Similarity
Embedding-based similarity measures, such as cosine similarity using pre-trained language model embeddings, can capture semantic coherence between generated outputs and inputs, providing a balance between diversity and semantic cohesion.
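A sketch using the sentence-transformers library; the model name here is one common choice, not something prescribed by the question:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb_in = model.encode("The meeting was moved to Friday.", convert_to_tensor=True)
emb_out = model.encode("We rescheduled the meeting for Friday.", convert_to_tensor=True)

# Cosine similarity near 1.0 -> semantically coherent with the input,
# even when the wording is diverse.
print(util.cos_sim(emb_in, emb_out).item())
```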
Incorrect Answers
BLEU Score
BLEU (Bilingual Evaluation Understudy) score is primarily used to evaluate the quality of machine-translated text by comparing it to one or more reference translations. It is not primarily focused on diversity or semantic coherence.
ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used for evaluating text summaries. It focuses on overlap with reference summaries, but does not directly measure diversity or semantic coherence.
In the context of evaluating large language models (LLMs), what is the main limitation of using BLEU score for assessing model performance on open-ended text generation tasks?
BLEU score is sensitive to minor word order variations, which may not impact the quality of coherent text.
BLEU score primarily measures n-gram precision and is more suited for tasks with fixed outputs like translation. It can penalize valid outputs in open-ended tasks because it does not accommodate synonymous or rephrased responses.
It can unfairly penalize creative and contextually appropriate outputs just because they don’t match the reference output exactly, making it less effective in tasks requiring high degrees of linguistic variability.
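A quick illustration with NLTK: a perfectly valid paraphrase gets a low BLEU score simply because its n-grams don't match the reference.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the meeting was postponed until friday".split()
candidate = "they delayed the meeting to friday".split()  # valid paraphrase

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)  # low, despite the candidate being contextually appropriate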
Incorrect Answers
BLEU score was specifically designed for measuring the grammatical correctness of text outputs.
- BLEU was not specifically designed to measure grammatical correctness; it instead evaluates n-gram overlap, which does not inherently relate to grammar assessment.
BLEU score requires significant computational resources for calculation, making it impractical for large datasets.
- Calculating BLEU is not computationally expensive compared to other evaluation metrics, hence this is not primarily a limitation in evaluating LLMs.
BLEU score does not consider the relevance of output with the context of conversation.
- While BLEU does focus on surface-level n-gram precision and is less effective at capturing the alignment of generated text with conversational context, this is a by-product of its exact-match design rather than its main limitation for open-ended generation.
Which of the following strategies is most effective for preventing catastrophic forgetting when fine-tuning a large pre-trained generative AI model on a specific domain?
Applying Elastic Weight Consolidation (EWC) during fine-tuning.
EWC is designed to prevent catastrophic forgetting by penalizing changes to important weights of the pre-trained model.
🧠 What is Catastrophic Forgetting?
Catastrophic forgetting happens when a model forgets previously learned knowledge (like general language skills) while being fine-tuned on a narrow domain (e.g., legal documents, medical records, etc.).
🔹 Elastic Weight Consolidation (EWC)
Idea: When fine-tuning, prevent the model from changing weights that are important for previous tasks.
EWC adds a regularization term to the loss function that penalizes deviation from the original weights, weighted by their importance (estimated using the Fisher Information Matrix).
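In symbols, the fine-tuning objective becomes L(θ) = L_task(θ) + (λ/2) · Σᵢ Fᵢ (θᵢ − θ*ᵢ)², where θ* are the pre-trained weights and Fᵢ their Fisher importance. A sketch of the penalty term, assuming `old_params` and `fisher` are dicts precomputed from the original model:

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=1000.0):
    # Quadratic penalty on drifting away from the pre-trained weights,
    # weighted by each parameter's estimated importance (Fisher information).
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return (lam / 2.0) * penalty

# During fine-tuning:
# loss = task_loss + ewc_penalty(model, old_params, fisher)
```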
In the context of evaluating language models, which of the following metrics is best suited for assessing the semantic coherence of generated text across multiple sentences?
BERTScore
BERTScore aligns token embedding similarity using BERT, providing a context-aware evaluation that better captures semantic coherence across sentences.
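A minimal usage sketch, assuming the `bert-score` package:

```python
from bert_score import score  # pip install bert-score

cands = ["The storm knocked out power across the city."]
refs = ["Electricity failed citywide because of the storm."]

# P/R/F1 come from contextual BERT embeddings, so paraphrases
# that preserve meaning still score highly.
P, R, F1 = score(cands, refs, lang="en")
print(F1.mean().item())
```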
Incorrect Answers
ROUGE-L
- ROUGE-L evaluates the longest common subsequence but is more relevant for assessing recall than semantic coherence across sentences.
BLEU Score
- The BLEU score is used for evaluating the accuracy of machine-translated text against a reference but does not specifically address semantic coherence across multiple sentences.
Perplexity
- Perplexity measures how well a probability model predicts a sample and is primarily used for evaluating the fluency of text rather than semantic coherence.
When fine-tuning a pre-trained language model to improve its performance on a specific domain, which of the following techniques is most effective for avoiding catastrophic forgetting while ensuring domain-specific adaptation?
Multi-task Learning
- Multi-task learning involves training with multiple objectives, which can help models retain information from the original task while adapting to the new domain.
🧠 What is Multi-Task Learning (MTL)?
Multi-Task Learning is a machine learning paradigm where a model is trained on multiple tasks at the same time, rather than one task at a time.
🧪 In the context of LLM fine-tuning:
Let’s say you want your LLM to specialize in legal documents, but not forget general English capabilities.
In Multi-Task Learning, you might train it on:
- Task A: General language modeling (e.g., Wikipedia, Common Crawl)
- Task B: Legal domain documents
During fine-tuning, the model sees both types of tasks/data, and learns to perform well on both.
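A schematic mixed-data loop; `load_batches` and the 50/50 sampling are hypothetical choices, not part of the question:

```python
import random

general = load_batches("wikipedia")    # hypothetical loader: Task A, general text
legal = load_batches("legal_corpus")   # hypothetical loader: Task B, target domain

for step in range(10_000):
    batch = next(random.choice([general, legal]))  # sample one task per step
    loss = model(batch)       # language-modeling loss on whichever task was drawn
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```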
In the context of Generative AI, particularly focusing on LLMs, which of the following techniques is primarily used to improve the coherence and contextual relevance of generated text?
Transformer Architectures
Transformers, through mechanisms like self-attention, excel at maintaining context over long sequences and have significantly improved coherence and contextual relevance in generated text.
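A minimal single-head self-attention sketch (scaled dot-product, no masking or multi-head machinery) showing how every token's representation is mixed with every other token's:

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model). Q/K/V are projections of the same sequence.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.transpose(-2, -1) / (K.shape[-1] ** 0.5)  # token-to-token affinities
    weights = F.softmax(scores, dim=-1)  # each row: attention over all positions
    return weights @ V                   # context-aware mixture of value vectors

x = torch.randn(10, 64)
W = lambda: torch.randn(64, 64) / 8
print(self_attention(x, W(), W(), W()).shape)  # torch.Size([10, 64])
```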