lecture 7 Flashcards

1
Q

what is deep learning

A
  • subfield of machine learning
  • ML becomes just a matter of optimizing weights to make the best final prediction
2
Q

representation learning vs deep learning

A

representation learning: attempts to automatically learn good features or representations

deep learning: extends this by adding layers. attempts to learn multiple levels of representation and an output

3
Q

reasons for exploring deep learning

A
  1. whereas manually designed features are often over-specified, incomplete, and take a long time to design and validate, learned features are easy to adapt and fast to learn
  2. provides a flexible learnable framework for representing information
  3. can learn unsupervised and supervised
4
Q

representations at NLP levels: phonology

A

traditional: phoneme table

DL: trained on speech data to predict phonemes/words from acoustic sound features, representing them as numerical vectors

5
Q

representations at NLP levels: morphology

A

traditional: breaking down words into morphemes (prefixes, stems, and suffixes)

DL: every morpheme is a numerical vector. a neural network combines two vectors into one vector representing the whole word.

6
Q

representations at NLP levels: syntax

A

traditional: phrases are discrete categories like NP, VP

DL: every word and phrase is a vector. An NN combines two vectors into one.

7
Q

representations at NLP levels: semantics

A

traditional: lambda calculus - functions that take other specific functions as inputs. However, there is no notion of similarity or fuzziness of language

DL: every word/phrase/logical expression is a vector encapsulating semantics. NN combines two vectors into one

8
Q

language models: narrow sense

A

a probabilistic model that assigns a probability P(W1, W2, ..., Wn) to every conceivable finite word sequence (grammatical or not)

uses conditional probability (see the sketch below):
- probability of a word given the previous words in the sentence
- reflects implicit word order
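
A minimal sketch (with made-up bigram probabilities, not from the lecture) of how the chain rule turns per-word conditional probabilities into a sequence probability:

```python
# Toy illustration of P(w1..wn) = prod_i P(w_i | w_1..w_{i-1}),
# here approximated with hypothetical bigram probabilities.
bigram_prob = {
    ("<s>", "the"): 0.4,
    ("the", "cat"): 0.1,
    ("cat", "sat"): 0.3,
}

def sequence_prob(words):
    prob = 1.0
    prev = "<s>"  # start-of-sentence symbol
    for w in words:
        prob *= bigram_prob.get((prev, w), 1e-6)  # tiny probability for unseen pairs
        prev = w
    return prob

print(sequence_prob(["the", "cat", "sat"]))  # 0.4 * 0.1 * 0.3 = 0.012
```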

9
Q

language models: broad sense

A
  1. encoder only models (BERT, RoBERTa, ELECTRA)
  2. encoder-decoder models (T5, BART)
  3. decoder only models (GPT-x models)
10
Q

encoder

A

converts raw input into contextual representation

attention can access sentence information at all stages (bi-directional)

output is provided all at once

  • for: sentence classification, NER
11
Q

decoder

A

converts the representation into output

attention can only access previous words (auto-regressive)

  • for: (iterative) text generation
12
Q

encoder/decoder

A

useful if input and output have different lengths

  • for: summarization, translation
13
Q

how large are LLMs

A

current language models have massively increased in:
1. their number of parameters
2. the size of the datasets they are trained on (large corpus size)

This scaling-up allows these models to learn more complex patterns and generate more coherent and contextually relevant text.

14
Q

pre-training and adaptation

A

pretraining: training models on huge amounts of unlabeled text using ‘self-supervised’ training objectives

adaptation: pretraining is followed by adaptation, which fine-tunes the pre-trained model on annotated examples so that it performs well on a downstream task while leveraging the broad knowledge from the initial training

15
Q

BERT: key contributions

A
  • fine-tuning approach based on a deep transformer encoder
  • key is to learn representations based on bidirectional context (because both left and right contexts are important to understand the meaning of words)
  • state-of-the art performance on a large set of sentence-level and token-level tasks
16
Q

BERT: pre-training objectives

A

masked language modeling

next sentence prediction

17
Q

pretraining objective 1: masked language modeling (MLM)

A
  • using both future and past contexts (bidirectional) simultaneously could lead to peeking at the target word, which defeats the purpose of language modeling
  • solution is to mask out k% of the input words, and then predict the masked words
18
Q

MLM: 80-10-10 corruption

A

When 15% of the words in a sentence are chosen for prediction (see the sketch after this list):

  1. 80% of the time those words are replaced with the [MASK] token. This teaches the model to predict the masked word based on its context
    –> learn to predict the missing word based on the context
  2. 10% of the time they are replaced with a random word in the vocabulary
    –> adds noise
  3. 10% of the time they keep it unchanged
    –> ensures the model doesn’t always expect a masked word
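
A minimal sketch of the 80-10-10 corruption rule, assuming a toy tokenized sentence and a hypothetical replacement vocabulary:

```python
import random

def corrupt_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style MLM corruption sketch: pick ~15% of tokens as prediction
    targets, then apply the 80/10/10 rule to each chosen position."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:              # chosen for prediction
            targets[i] = tok                      # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random word (adds noise)
            # else: 10% keep the token unchanged
    return corrupted, targets

tokens = "the cat sat on the mat".split()
# high mask_rate here only so the toy demo visibly corrupts something
print(corrupt_tokens(tokens, vocab=["dog", "ran", "blue"], mask_rate=0.5))
```
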
19
Q

MLM rationale

A

[mask] tokens are not present during the fine-tuning phase, so the model learns to generalize better by not becoming dependent solely on predicting masked tokens

The model needs to generalize from the training data where [MASK] tokens are used, to real-world scenarios where no [MASK] tokens will be present.

By using a mix of masking, random replacements, and unchanged words, the model learns to handle various situations and contexts effectively.

20
Q

pretraining objective 2: next sentence prediction

A

motivation: many NLP downstream tasks require understanding the relationship between two sentences

NSP is designed to reduce the gap between pre-training and fine-tuning by teaching the model to comprehend sentence relationships

This setup instructs the model on distinguishing between sentences that logically follow each other and those that do not
–> to indicate whether Sentence B logically follows Sentence A or not

21
Q

pretraining objectives

A
  1. masked language modeling
  2. next sentence prediction
22
Q

BERT: architecture

A
  1. encoder: receives list of vectors as input
  2. self-attention: looks at all tokens for clues to better understand the target token
  3. positional encoding: represent the order of tokens within a sequence with a vector (see the sketch below)
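
A minimal sketch of one way to turn positions into vectors, using the sinusoidal encoding from the original Transformer paper (note that BERT itself learns its position embeddings as parameters):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings: each position gets a unique vector,
    so token order can be injected into otherwise order-blind attention."""
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

print(sinusoidal_positions(seq_len=4, d_model=8).shape)  # (4, 8)
```
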
23
Q

NSP input structure

A

CLS: A special token that is always placed at the beginning of the input sequence. It helps in classification tasks as it holds the aggregated representation of the input.

SEP: A special token used to separate different segments (sentences) in the input.

24
Q

NSP 50% split

A

IsNext: 50% of the time, two contiguous segments are sampled, so the second segment naturally follows the first

NotNext: 50% of the time, two random segments are sampled, so the pair does not naturally follow each other

By distinguishing whether one sentence follows another, the model learns to understand the context and relationship between sentences.
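
A minimal sketch of how IsNext/NotNext training pairs could be constructed, assuming a hypothetical toy corpus (real BERT samples NotNext segments from a different document):

```python
import random

def make_nsp_pair(segments, rng=random):
    """NSP example construction sketch: 50% IsNext (adjacent segments),
    50% NotNext (second segment sampled at random).
    `segments` is a list of sentences in document order."""
    i = rng.randrange(len(segments) - 1)
    if rng.random() < 0.5:
        return segments[i], segments[i + 1], "IsNext"
    j = rng.randrange(len(segments))        # BERT would sample from another document
    return segments[i], segments[j], "NotNext"

corpus = ["the cat sat .", "it purred loudly .", "stocks fell today ."]
print(make_nsp_pair(corpus))
```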

25
Q

BERT: putting MLM and NSP together

A
  1. unlabeled sentence input: two sentences with [MASK], [SEP], [CLS] input embeddings
  2. token embeddings: convert each token into a vector representation
  3. segment embeddings: tell the model which sentence the token belongs to
  4. positional embeddings: indicate the position of each token in the sentence (the three embeddings are summed, as in the sketch below)

Outcome: The model learns to understand both the internal context within sentences and the relationships between sentences, making it versatile for various NLP tasks.
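
A minimal sketch of how the three embeddings combine into the model input, with made-up toy sizes and random embedding tables:

```python
import numpy as np

vocab_size, n_segments, max_len, d = 100, 2, 16, 8   # hypothetical toy sizes
rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(vocab_size, d))   # token embeddings
seg_emb = rng.normal(size=(n_segments, d))   # segment (sentence A/B) embeddings
pos_emb = rng.normal(size=(max_len, d))      # positional embeddings (learned in practice)

token_ids   = np.array([1, 5, 7, 2, 9, 3])   # e.g. [CLS] w1 w2 [SEP] w3 [SEP]
segment_ids = np.array([0, 0, 0, 0, 1, 1])   # which sentence each token belongs to
positions   = np.arange(len(token_ids))

# BERT input representation: element-wise sum of the three embeddings
x = tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]
print(x.shape)  # (6, 8)
```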

26
Q

Problem with fixed source representation

A

For the encoder, it is challenging to compress the entire sentence into a single fixed-size vector representation.
–> Important information might be lost in the process because the fixed-size vector cannot capture all the nuances and details of the source sentence.

For the decoder, different parts of the source sentence might be relevant at different steps of the decoding process.
–> A single fixed representation might not provide the necessary context for generating each token in the target sequence effectively.

27
Q

Solution to fixed source representation problem

A

attention mechanism in decoder

  • at each decoder step, it decides which source parts are more important
  • now, the encoder does not have to compress the whole source into a single vector - it gives representations for all source tokens
28
Q

encoder self-attention mechanism

A

Each token in the input sequence looks at all the other tokens (including itself) to gather contextual information.

computes the importance of each token within the sequence
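
A minimal single-head sketch of scaled dot-product self-attention (random toy matrices, no multi-head machinery), showing how every token's output mixes information from all tokens:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Each token's query is compared against every token's key; the resulting
    weights decide how much of each token's value flows into the output."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # token-to-token importance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                           # 5 tokens, dimension 8
W = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(X, *W).shape)                    # (5, 8)
```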

29
Q

feed-forward network

A

After the self-attention layer, a feed-forward network processes the information to further refine the token representations.

30
Q

transformer architecture

A
  1. inputs: encoder self-attention –> feed-forward network
  2. outputs: decoder self-attention (masked) –> decoder-encoder attention –> feed-forward network
31
Q

decoder self-attention (masked)

A

Each token in the output sequence looks at the previous tokens up to the current position. This is masked to prevent the model from seeing future tokens.
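
A minimal sketch of the causal mask, assuming toy random attention scores: future positions are pushed to effectively zero weight before the softmax:

```python
import numpy as np

seq_len = 5
# Causal (look-ahead) mask: position i may attend to positions j <= i only.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
scores = np.where(mask, -1e9, scores)     # future positions get a huge negative score
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))               # upper triangle is (numerically) zero
```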

32
Q

decoder-encoder attention

A

Each token in the output sequence attends to all tokens in the input sequence to gather relevant information for generating the next token.

33
Q

sentence-level tasks

A
  1. sentence pair classification tasks
    - MNLI: entailment
    - QQP: duplicate (semantic equivalence)
  2. single sentence classification tasks
    - SST2: positive/negative

GLUE benchmark provides a comprehensive suite of tasks to evaluate the performance of NLP models

34
Q

token-level tasks

A
  1. extractive question answering
    –> extract the exact span of text from the context that forms the answer
  2. named entity recognition (NER)
    –> classify each word in a sentence into predefined categories of named entities
35
Q

BERT rediscovers the classical NLP pipeline

A

particularly through probing tasks

i.e., use the encoded representations of one system to train another classifier on some other (probing) task of interest

paper: the focus is on edge probing, which measures the ability to extract linguistic structure information from BERT’s pre-trained encoder.

36
Q

classical NLP pipeline

A
  1. tokenization
  2. POS tagging
  3. lemmatization/stemming
  4. named entity recognition (NER)
  5. syntactic parsing
  6. coreference resolution (identifying when different expressions in text refer to the same entity)
  7. semantic role labeling
  8. sentiment analysis
  9. information extraction
37
Q

shifting paradigms in NLP

A
  1. word vectors + task specific architectures
  2. multi-layer RNNs
  3. pre-trained transformers + finetuning
38
Q

limitations of pretraining and finetuning

A
  1. practical issues: need for large task-specific datasets for finetuning - end up with many copies of the same model tailored for different tasks
  2. humans don’t need large supervised datasets: they can learn from simple directives to mix and match skills and task switch
  3. overfitting: large models fine-tune on very narrow task distributions and struggle to generalize to new unseen data
39
Q

advancements in the size of language models (LM) before and after the introduction of GPT-3

A

Pre-GPT-3 Landscape: Shows a gradual increase in the size of language models, with significant models like ELMo, GPT, BERT, and GPT-2 pushing the boundaries.

With GPT-3: Highlights a massive jump in model size, underscoring the trend towards scaling up to achieve better performance in NLP tasks.

40
Q

rationale behind scaling up language models

A
studies how the performance of neural language models is affected by their scale

key findings:
  • performance depends strongly on scale (number of parameters, dataset size, compute) and only weakly on model shape
  • the relationship between empirical performance and each of these key variables follows a power law of the form y = ax^k (see the sketch after this list)
  • transfer improves with test performance: larger models not only perform better on their primary task but also transfer better to related tasks, improving overall utility
  • larger models are more sample-efficient: bigger models achieve better performance with fewer training examples than smaller models, making them more efficient at learning from limited data
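
A minimal sketch of the power-law form above, with made-up coefficients purely to show the shape of the relationship (loss falling slowly but steadily as scale grows):

```python
# Scaling-law shape y = a * x^k; the coefficients below are hypothetical,
# chosen only to illustrate a loss that decreases as a power law of scale.
def power_law(x, a=1.0, k=-0.05):
    return a * x ** k

for n_params in [1e6, 1e8, 1e10, 1e12]:
    print(f"{n_params:.0e} params -> loss ~ {power_law(n_params):.3f}")
```
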
41
Q

in-context learning in large language models (LLMs)

A

The process by which LLMs use examples within the input to perform tasks without parameter updates.

42
Q

types of in-context learning

A

Different modes of in-context learning based on the number of examples provided (example prompts are sketched after the list).

  1. zero-shot
    no prompt: the model is asked to perform the task without any examples or context
    prompt: the model is given an instruction but no examples
  2. 1-shot
    no prompt: the model is given one example of the task
    prompt: the model is given an instruction and one example
  3. few-shot
    no prompt: the model is given a few examples of the task
    prompt: the model is given an instruction and a few examples
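
A minimal sketch of what the three prompt styles look like, using hypothetical English-to-French translation prompts in the style of the GPT-3 paper:

```python
# Hypothetical prompts; the model simply completes the text, with no parameter updates.
zero_shot = "Translate English to French:\ncheese =>"

one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)
print(zero_shot, one_shot, few_shot, sep="\n\n")
```
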
43
Q

emergent behavior in LLMs

A

LM performs a task just by conditioning on input-output examples, without optimizing parameters
–> no explicit training or finetuning

Highlights the model’s capability to learn and generalize from context alone, emphasizing the versatility and efficiency of large pre-trained models.

44
Q

which type of in-context learning to choose

A
  1. fine-tuning (FT):
    + strongest performance
    - needs a curated and labeled dataset for each new task
    - poor out-of-distribution generalization, risk of exploiting spurious features
  2. few shot (FS):
    + much less task-specific data needed
    + no spurious feature exploitation
    - challenging
  3. one-shot (1S):
    + most natural (e.g., giving humans instructions)
    - challenging
  4. zero-shot (0S):
    + most convenient
    - challenging, can be ambiguous

gradient: from strongest task-specific performance (FT) towards more convenient and general approaches that need less data (0S)

45
Q

perplexity

A

standard scoring system for measuring uncertainty in prediction of the next word

can be interpreted as the average branching factor at each decision point, as if the choice among next words were uniform

Perplexity is related to cross-entropy, which measures the difference between two probability distributions

lower perplexity = lower uncertainty = better
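
A minimal sketch, assuming hypothetical per-token probabilities, of how perplexity is computed as the exponentiated average negative log-likelihood (i.e. exponentiated cross-entropy):

```python
import math

def perplexity(token_probs):
    """token_probs are the model's probabilities for the words that actually
    occurred (hypothetical values here); lower perplexity = less uncertainty."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([1/6] * 10))          # uniform over 6 options -> perplexity 6.0
print(perplexity([0.5, 0.25, 0.125]))  # more confident model -> lower perplexity
```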

46
Q

explain ‘perplexity of 6’

A

This means that, on average, the model is as uncertain as if it had to choose between 6 equally likely options at each step of the prediction.

47
Q

perplexity scores

A

human: 12

LLM: 20

ML NLP: 110

48
Q

importance of context window

A

a larger context window allows for better comprehension and generation of text because it can take more information into account.

traditional machine learning-based NLP models have an average context window of 3, meaning they typically look at trigrams (three-word sequences) to make predictions or understand sentences. This small context window limits their understanding of longer, more complex sentences.

Large Language Models (like GPT-3 and GPT-4) have significantly larger context windows, capable of understanding and processing over 8000 words at a time. This allows them to grasp the context of much larger chunks of text, making them better at understanding and generating coherent text over longer passages.

49
Q

dependency parsing

A

method in NLP that analyzes the grammatical structure of a sentence and establishes relationships between “head” words and words that modify those heads.

helps in understanding the syntactic structure of a sentence

Proper dependency parsing can help NLP models understand and disambiguate sentences by considering different possible structures.

50
Q

linguistic categories

A
  1. POS tags: Help in identifying the grammatical category of each word in a sentence
  2. Features (inflectional and lexical): Provide additional details about the words, which can influence their grammatical role and relationship
  3. syntactic relations: Establish how words are connected grammatically within a sentence, forming a dependency tree that reflects the structure and meaning.
51
Q

dependency grammar

A

focus on the relationships between words in a sentence, where one word is the head and the other is the dependent

every word depends on exactly one other word (except for the root word)

normally with binary asymmetric relations (each dependency is a directed link between two words)

52
Q

how is a dependency tree built

A

by determining which word every word depends on

53
Q

how dependency parses work

A
  1. trees: sentences are parsed into tree structures where each node (word) has ONE PARENT
  2. label edges to indicate the head –> modifier relations
  3. usually one word is the root
  4. we don't want cycles (a minimal sketch of these constraints follows)
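
A minimal sketch of these constraints, storing a hypothetical parse as one head index per word and checking the single-root, single-parent, no-cycle properties:

```python
# A dependency parse stored as one head index per word (0 = root).
# Hypothetical parse of: "I saw a girl with a telescope" ("with" attached to "saw")
words = ["I", "saw", "a", "girl", "with", "a", "telescope"]
heads = [2, 0, 4, 2, 2, 7, 5]   # 1-based head of each word; "saw" is the root

assert heads.count(0) == 1                            # exactly one root
assert all(h != i + 1 for i, h in enumerate(heads))   # no word heads itself

def check_reaches_root(i):
    """Walk head links upward; a well-formed tree always reaches the root."""
    seen = set()
    while heads[i - 1] != 0:
        assert i not in seen, "cycle detected"
        seen.add(i)
        i = heads[i - 1]

for i in range(1, len(words) + 1):
    check_reaches_root(i)          # raises if the parse contains a cycle
print("valid dependency tree")
```
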
54
Q

why is dependency parsing useful

A
  1. resolves attachment ambiguities that can matter for meaning (e.g., “I saw a girl with a telescope”)
  2. grammatical structure of a sentence based on the relationships (dependencies) between the words
  3. syntactic dependencies can be close to semantic relations
  4. language agnostic: can be applied across different languages because the fundamental principles of grammatical relationships are similar in many languages
55
Q

types of dependency parsing

A
  1. projective: if every subtree is a contiguous span of the sentence (no crossing edges)
  2. prioritizing content words as heads
  3. prioritizing functional heads
56
Q

dealing with structural ambiguity in dependency parsing

A
  1. identify two words in the sentence that are ambiguous wrt their POS tags
  2. identify a structural ambiguity in the sentence
  3. assign the contextually correct POS to words
  4. draw two dependency trees
57
Q

transition-based dependency parsing

goal + how

A

goal: construct a dependency parse of a sentence

how: process words from left to right, deciding at each step whether two words should be attached; builds the dependency parse using a stack and a buffer

iteratively:
1. consult oracle
2. modify the configuration (state of stack, buffer, relations) according to the action

58
Q

transition-based dependency parsing mechanisms

A
  1. input buffer: words of the sentence
  2. stack: to manipulate words
  3. dependency relations: list of relations that culminate in the dependency parse
  4. transitions:
    - shift: remove the first word in the buffer and push it onto the stack
    - left-arc: add a dependency arc from the topmost word in the stack to the second-topmost word
    - right-arc: add a dependency arc from the second-topmost word in the stack to the topmost word
    - in both arc actions the head stays in the stack, the child is removed (a worked sketch follows)
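
A minimal arc-standard sketch with a hand-scripted "oracle" (in practice the oracle is a trained classifier); the sentence and action sequence are made up for illustration:

```python
def parse(words, oracle):
    """SHIFT moves the next buffer word onto the stack; LEFT-ARC / RIGHT-ARC
    attach the two topmost stack words (the head stays, the dependent is removed).
    `oracle` picks the next action from the current configuration."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":            # top of stack -> second-topmost
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))     # (head, dependent)
        elif action == "RIGHT-ARC":           # second-topmost -> top of stack
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# A hand-written action script standing in for the oracle, for "book me a flight":
script = iter(["SHIFT", "SHIFT", "RIGHT-ARC", "SHIFT", "SHIFT",
               "LEFT-ARC", "RIGHT-ARC", "RIGHT-ARC"])
print(parse("book me a flight".split(), lambda s, b: next(script)))
```
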
59
Q

oracle

A

algorithm that determines the next action during parsing