Midterm Review Flashcards

(104 cards)

1
What does NLP stand for?
Natural Language Processing
2
What are the three main eras of NLP?
  • Pre-neural
  • Post-neural and pre-LLM
  • Post-LLM
3
What is the purpose of sequence tagging/classification tasks?
They mostly serve as features for downstream applications
4
What is the role of Part of Speech Tagging in NLP?
Widely adopted as features in ML models
5
What does morphology tagging involve?
Lemmatization, i.e., reducing words to their base form
6
What is Dependency Parsing used for?
Understanding sentence structure and semantic disambiguation
7
What does Semantic Role Labeling (SRL) identify?
Identifies predicates and roles of elements in sentences
8
What is Named Entity Recognition (NER)?
Identifies all named entities in text that belong to certain types
9
What type of models are Markov Chains?
Chains of states with transition probabilities
10
What is a language model?
A model that predicts the next word/token in a sequence
11
What is the objective of a language model?
Compute the probability of a sentence or sequence of words
12
Define perplexity in the context of language models.
Inverse probability of test text normalized by the number of words
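As a worked formula, the standard definition this card describes (W = w_1 ... w_N is the test text):

    \mathrm{PPL}(W) \;=\; P(w_1 \ldots w_N)^{-1/N} \;=\; \sqrt[N]{\frac{1}{P(w_1 \ldots w_N)}}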
13
What are some common classification tasks in NLP?
  • Document Classification
  • Spam Detection
  • Sentiment Analysis
  • Textual Entailment
14
What is the goal of generative classifiers?
To model P(x, y) = P(x | y) P(y)
15
What is the main advantage of discriminative classifiers?
Few assumptions; they model P(y | x) directly
16
What is the function of the softmax function in multi-class logistic regression?
Converts logits into probabilities that sum to 1
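A minimal NumPy sketch of the softmax on this card; the logits in the example are made up, and the max is subtracted for numerical stability:

    import numpy as np

    def softmax(logits):
        # Subtracting the max does not change the output but avoids overflow.
        z = logits - np.max(logits)
        exp_z = np.exp(z)
        return exp_z / exp_z.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1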
17
Fill in the blank: In Naive Bayes, you estimate P(x|c) and P(c) for classification based on _______.
features
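Written out, the standard Naive Bayes decision rule the last two cards point at (the x_i are the features):

    \hat{c} \;=\; \arg\max_{c}\; P(c) \prod_{i} P(x_i \mid c)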
18
What is the purpose of TF-IDF in text representation?
To weigh the importance of a term in a document relative to the corpus
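A minimal sketch of one common TF-IDF variant (raw term frequency times a smoothed inverse document frequency; exact weighting schemes vary, and the toy corpus is made up):

    import math

    def tf_idf(term, doc, corpus):
        tf = doc.count(term)                      # term frequency in this document
        df = sum(1 for d in corpus if term in d)  # number of documents containing the term
        idf = math.log(len(corpus) / (1 + df))    # rare terms get higher weight
        return tf * idf

    corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
    print(tf_idf("dog", corpus[1], corpus))  # ~0.405: "dog" appears in only one document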
19
What is Word2Vec used for?
To learn word embeddings by training a classifier to predict context words
20
What is the concept behind the skip-gram architecture in Word2Vec?
Uses the target word to predict nearby context words
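A sketch of how skip-gram builds its training data: each target word is paired with the context words in a window around it (the window size and sentence are arbitrary choices):

    def skipgram_pairs(tokens, window=2):
        # Emit (target, context) pairs for each word and its neighbors.
        pairs = []
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pairs.append((target, tokens[j]))
        return pairs

    print(skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"]))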
21
What is the primary function of a neural unit in a neural network?
To take a weighted sum of its inputs and produce a non-linear activation value
22
What is the significance of using non-linear activation functions in neural networks?
They enable the network to approximate complex functions
23
What does RNN stand for?
Recurrent Neural Network
24
What is the purpose of training an RNN?
To handle sequences and maintain context from previous inputs
25
What is the primary limitation of feed-forward neural networks?
They assume a fixed input size and struggle with temporal order
26
What is cross-entropy loss used for in classification tasks?
To measure the performance of a model whose output is a probability value
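A minimal sketch of cross-entropy loss for a single example, assuming the model already outputs a probability distribution over classes:

    import math

    def cross_entropy(probs, true_class):
        # Negative log probability assigned to the true class.
        return -math.log(probs[true_class])

    print(cross_entropy([0.7, 0.2, 0.1], 0))  # low loss: confident and correct
    print(cross_entropy([0.1, 0.2, 0.7], 0))  # high loss: confident and wrong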
27
True or False: Stochastic Gradient Descent (SGD) typically converges slower than Gradient Descent (GD).
False
28
What is the goal of optimization in machine learning?
To minimize the loss function
29
What do we call the process of updating weights in a neural network based on the loss?
Backpropagation
30
What is the purpose of training an RNN?
To handle sequences and generate predictions based on previous input.
31
What does 'unrolling' an RNN entail?
Generating a feedforward network specific to the input sequence.
32
What is a common loss function used in RNNs for language modeling?
Cross-entropy loss.
33
What is the size of the input matrix representing words in an RNN LM?
|V| × H.
34
What matrix is used for output word probability distributions in an RNN?
H × |V|.
35
What is a drawback of basic RNNs?
Early tokens receive very little attention in later context.
36
What does LSTM stand for?
Long Short-Term Memory Network.
37
What are the two subproblems addressed by LSTMs?
  • Removing unnecessary info from context
  • Adding relevant info
38
What is the function of the Forget Gate in an LSTM?
Decides what info to forget from the current cell state.
39
What is the purpose of the Add Gate in an LSTM?
To decide which parts of new info should be added to the cell state.
40
What does the Output Gate in an LSTM do?
Decides which parts of cell state should be output for predictions.
41
What is a variant of LSTMs that allows gates to peek at cell states?
Peephole Connections.
42
What is a simpler model that merges cell states and hidden states?
Gated Recurrent Unit (GRU).
43
What is the main challenge when using Bidirectional LSTMs?
The backward layer has already seen next words that are being predicted.
44
What is ELMo used for?
To compute contextualized word embeddings (used, e.g., for entity typing).
45
What is the first step in using ELMo for entity typing?
Find all common contexts for each Wikipedia entity.
46
What are common tasks RNNs struggle with?
  • Machine translation
  • Summarization
  • Question answering
47
What is the role of an encoder in encoder-decoder models?
To generate contextualized representations of the input sequence.
48
What does Beam Search do during inference?
Keeps the top-k most probable partial sequences at each decoding step.
49
What is the purpose of Top-K Sampling?
To improve diversity and creativity in generated sequences.
50
What does the term 'temperature' refer to in sampling methods?
It scales logits before softmax to control randomness.
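A sketch combining cards 49-50: temperature scales the logits before softmax, then sampling is restricted to the k most probable tokens; the logits and settings here are hypothetical:

    import numpy as np

    def sample_top_k(logits, k=2, temperature=1.0, seed=0):
        rng = np.random.default_rng(seed)
        z = np.asarray(logits) / temperature  # lower temperature sharpens the distribution
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        top = np.argsort(probs)[-k:]          # indices of the k most probable tokens
        p = probs[top] / probs[top].sum()     # renormalize over the top-k
        return rng.choice(top, p=p)           # returns a sampled token index

    print(sample_top_k([3.0, 1.0, 0.5, -1.0], k=2, temperature=0.7))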
51
What does BLEU measure?
N-gram precision in translations.
52
What does ROUGE-L measure?
Longest common subsequence between generated and reference text.
53
What is BERTScore based on?
Cosine similarity between contextualized word embeddings.
54
What is the main advantage of self-attention in transformers?
It helps compute latent representations by attending to surrounding tokens.
55
What are the three roles in the actual attention mechanism?
  • Query
  • Key
  • Value
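A minimal single-head NumPy sketch of scaled dot-product attention using the Query/Key/Value roles from this card (shapes are arbitrary):

    import numpy as np

    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)  # how much each query attends to each key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V               # weighted average of the values

    Q = K = V = np.random.rand(4, 8)     # 4 tokens, dimension 8
    print(attention(Q, K, V).shape)      # (4, 8)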
56
What is Multi-head Attention used for?
To capture more comprehensive and diverse information.
57
What is the purpose of positional embeddings in transformers?
To represent the temporal order of words.
58
What is a common type of positional encoding?
Sinusoidal Positional Encoding.
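A sketch of the sinusoidal encoding from the original Transformer paper: even dimensions use sine, odd use cosine, with geometrically increasing wavelengths:

    import numpy as np

    def sinusoidal_pe(max_len, d_model):
        pos = np.arange(max_len)[:, None]
        i = np.arange(0, d_model, 2)[None, :]
        angle = pos / (10000 ** (i / d_model))
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angle)  # even dimensions
        pe[:, 1::2] = np.cos(angle)  # odd dimensions
        return pe

    print(sinusoidal_pe(50, 16).shape)  # (50, 16)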
59
What is the advantage of learned positional embeddings?
Positions are learned as regular parameters during training, so they can adapt to the data.
60
What is the main function of T5 in NLP?
To unify different tasks into input-output sequences.
61
What is the goal of instruction tuning in models like ChatGPT?
To train the model with human-labeled data for specific tasks.
62
What is a key characteristic of autoregressive models?
They generate tokens sequentially.
63
What is the Byte Pair Encoding technique used for?
To handle out-of-vocabulary words efficiently.
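A toy sketch of one BPE training step: count adjacent symbol pairs across the corpus and merge the most frequent pair everywhere (the corpus is made up):

    from collections import Counter

    def bpe_merge_step(words):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair
        merged = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])  # merge the pair into one symbol
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged.append(out)
        return best, merged

    print(bpe_merge_step([list("lower"), list("lowest"), list("low")]))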
64
What is one of the benefits of Mixture of Experts (MoE)?
Computation efficiency at inference time.
65
What does in-context learning refer to in LLMs?
Learning tasks by examples provided in the input context.
66
What is zero-shot prompting?
Providing only the task description for the model to generate an answer.
67
What is the limitation of GPT-3 in instruction-following?
It often fails when the task is unclear or lacks proper context.
68
What kind of data did BERT use for training?
Wikipedia and books.
69
What is the significance of larger model sizes in LLMs?
They often lead to better performance and capabilities.
70
What is the first real LLM developed by OpenAI?
GPT-3.
71
What is the main limitation of pretrained LLMs?
Out of the box they only predict the next word, so they don't follow task instructions well.
72
What is instruction tuning?
Training the model with human-labeled data on how specific instructions or tasks should be handled.
73
What does GPT-3 primarily focus on during its training?
Next word prediction.
74
What is ChatGPT?
GPT-3 further trained on human-labeled supervised data to acquire its 'assistant' behavior.
75
What type of responses can instruction tuning target?
Natural language explanations, not just answers.
76
What is UnifiedQA primarily focused on?
Question answering.
77
What does Flan-T5 build upon?
UnifiedQA but with more diverse data.
78
What is InstructGPT?
GPT-3 trained on instruction data from human labeling.
79
What does ChatGPT's instruction-tuning data contain more of?
Multi-turn conversational instances.
80
What is the purpose of Alpaca?
To replicate instruction following capabilities by letting GPT-3.5 generate instruction examples.
81
What is a characteristic of generative QA?
Asks models to provide a short answer.
82
What is SQuAD?
Stanford Question Answering Dataset.
83
What does selective QA allow for?
Flexibility with deterministic evaluation metric (accuracy).
84
What does GSM8K refer to?
Grade school math problems.
85
What is DROP?
Discrete Reasoning Over the content of Paragraphs.
86
What is the purpose of the Putnam Bench?
To evaluate mathematical proof correctness.
87
What is a hallucination in the context of LLMs?
When LLMs use incorrect facts, knowledge, or reasoning paths.
88
True or False: Factual hallucinations are when LLMs provide incorrect historical facts.
True.
89
True or False: Commonsense hallucinations involve LLMs making logical errors based on everyday reasoning.
True.
90
What is the inverse scaling evaluation?
Evaluates tasks on which larger LMs actually perform worse, not better.
91
What is alignment in the context of LLMs?
The process of aligning model behavior to human expectations.
92
What is safety alignment?
Providing instructions as supervision to ensure models do not generate harmful content.
93
What is the challenge with labeling 'good' and 'bad' responses?
Good and bad are often relative and hard to quantify.
94
What is Reinforcement Learning (RL)?
A machine learning technique that teaches models to learn by trial and error based on reward signals.
95
What is PPO?
Proximal Policy Optimization, a standard RL algorithm used in RLHF.
96
What does Direct Preference Optimization (DPO) aim to achieve?
To train models directly on preference data, without needing a separate reward model.
97
What is 'jailbreaking' in the context of LLMs?
Bypassing a model's alignment safeguards.
98
What is cognitive overload?
When a model is overwhelmed by too much information, leading to incorrect outputs.
99
What is RLAIF?
Reinforcement Learning from AI Feedback: uses AI-generated feedback instead of human feedback for alignment.
100
What is the purpose of inference-time methods?
To acquire desired behaviors from models without changing model weights.
101
What technique involves sampling multiple chains of thought and taking the majority answer?
Self-Consistency.
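A sketch of self-consistency with a hypothetical sample_answer() standing in for one sampled chain of thought; the most common final answer wins:

    import random
    from collections import Counter

    def self_consistency(sample_answer, n=5):
        # Sample several chains of thought and majority-vote over their final answers.
        answers = [sample_answer() for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]

    # Toy stand-in for an LLM: most sampled chains land on the same answer.
    print(self_consistency(lambda: random.choice(["42", "42", "42", "41"])))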
102
What is the goal of multi-LLM debating?
To reach an agreement between multiple models.
103
What does RAP stand for?
Reasoning via Planning.
104
What is RAG?
Retrieval Augmented Generation.