lecture 7 Flashcards

1
Q

what is deep learning

A
  • subfield of machine learning
  • ML becomes just a matter of optimizing weights to make the best final prediction
2
Q

representation learning vs deep learning

A

representation learning: attempts to automatically learn good features or representations

deep learning: extends this by adding layers. attempts to learn multiple levels of representation and an output

3
Q

reasons for exploring deep learning

A
  1. whereas manually designed features are often over-specified, incomplete, and take a long time to design and validate, learned features are easy to adapt and fast to learn
  2. provides a flexible learnable framework for representing information
  3. can learn unsupervised and supervised
4
Q

representations at NLP levels: phonology

A

traditional: phoneme table

DL: trained on speech data to predict phonemes/words from acoustic sound features, representing them as numerical vectors

5
Q

representations at NLP levels: morphology

A

traditional: breaking down words into morphemes (prefixes, stems, and suffixes)

DL: every morpheme is a numerical vector. a neural network combines two vectors into one vector representing the whole word.

6
Q

representations at NLP levels: syntax

A

traditional: phrases are discrete categories like NP, VP

DL: every word and phrase is a vector. An NN combines two vectors into one.

7
Q

representations at NLP levels: semantics

A

traditional: lambda calculus - functions that take other specific functions as inputs. However, there is no notion of similarity or fuzziness of language

DL: every word/phrase/logical expression is a vector encapsulating semantics. NN combines two vectors into one

8
Q

language models: narrow sense

A

a probabilistic model that assigns a probability P(W1, W2, ..., Wn) to every conceivable finite word sequence (grammatical or not)

uses conditional probability (see the sketch below):
- probability of a word given the previous words in the sentence
- reflects implicit word order
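
A minimal sketch (with made-up bigram probabilities, not from the lecture) of how the chain rule turns per-word conditional probabilities into a sequence probability:

```python
# Toy illustration of P(w1..wn) = prod_i P(w_i | w_1..w_{i-1}),
# here approximated with hypothetical bigram probabilities.
bigram_prob = {
    ("<s>", "the"): 0.4,
    ("the", "cat"): 0.1,
    ("cat", "sat"): 0.3,
}

def sequence_prob(words):
    prob = 1.0
    prev = "<s>"  # start-of-sentence symbol
    for w in words:
        prob *= bigram_prob.get((prev, w), 1e-6)  # tiny probability for unseen pairs
        prev = w
    return prob

print(sequence_prob(["the", "cat", "sat"]))  # 0.4 * 0.1 * 0.3 = 0.012
```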

9
Q

language models: broad sense

A
  1. encoder only models (BERT, RoBERTa, ELECTRA)
  2. encoder-decoder models (T5, BART)
  3. decoder only models (GPT-x models)
10
Q

encoder

A

converts raw input into contextual representation

attention can access sentence information at all stages (bi-directional)

output is provided all at once

  • for: sentence classification, NER
11
Q

decoder

A

converts the representation into output

attention can only access previous words (auto-regressive)

  • for: (iterative) text generation
12
Q

encoder/decoder

A

useful if input and output have different lengths

  • for: summarization, translation
13
Q

how large are LLMs

A

current language models have massively increased in:
1. their number of parameters
2. the size of the datasets they are trained on (large corpus size)

This scaling-up allows these models to learn more complex patterns and generate more coherent and contextually relevant text.

14
Q

pre-training and adaptation

A

pretraining: training models on huge amounts of unlabeled text using ‘self-supervised’ training objectives

adaptation: pretraining is followed by adaptation, which fine-tunes the pre-trained model on annotated examples so that it performs well on a downstream task while leveraging the broad knowledge from the initial training

15
Q

BERT: key contributions

A
  • fine-tuning approach based on a deep transformer encoder
  • key is to learn representations based on bidirectional context (because both left and right contexts are important to understand the meaning of words)
  • state-of-the art performance on a large set of sentence-level and token-level tasks
16
Q

BERT: pre-training objectives

A

masked language modeling

next sentence prediction

17
Q

pretraining objective 1: masked language modeling (MLM)

A
  • using both future and past contexts (bidirectional) simultaneously could lead to peeking at the target word, which defeats the purpose of language modeling
  • solution is to mask out k% of the input words, and then predict the masked words
18
Q

MLM: 80-10-10 corruption

A

When 15% of the words in a sentence are chosen for prediction (see the sketch after this list):

  1. 80% of the time those words are replaced with the [MASK] token. This teaches the model to predict the masked word based on its context
    –> learn to predict the missing word based on the context
  2. 10% of the time they are replaced with a random word in the vocabulary
    –> adds noise
  3. 10% of the time they keep it unchanged
    –> ensures the model doesn’t always expect a masked word
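
A minimal sketch of the 80-10-10 corruption rule, assuming a toy tokenized sentence and a hypothetical replacement vocabulary:

```python
import random

def corrupt_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style MLM corruption sketch: pick ~15% of tokens as prediction
    targets, then apply the 80/10/10 rule to each chosen position."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:              # chosen for prediction
            targets[i] = tok                      # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random word (adds noise)
            # else: 10% keep the token unchanged
    return corrupted, targets

tokens = "the cat sat on the mat".split()
# high mask_rate here only so the toy demo visibly corrupts something
print(corrupt_tokens(tokens, vocab=["dog", "ran", "blue"], mask_rate=0.5))
```
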
19
Q

MLM rationale

A

[mask] tokens are not present during the fine-tuning phase, so the model learns to generalize better by not becoming dependent solely on predicting masked tokens

The model needs to generalize from the training data where [MASK] tokens are used, to real-world scenarios where no [MASK] tokens will be present.

By using a mix of masking, random replacements, and unchanged words, the model learns to handle various situations and contexts effectively.

20
Q

pretraining objective 2: next sentence prediction

A

motivation: many NLP downstream tasks require understanding the relationship between two sentences

NSP is designed to reduce the gap between pre-training and fine-tuning by teaching the model to comprehend sentence relationships

This setup instructs the model on distinguishing between sentences that logically follow each other and those that do not
–> to indicate whether Sentence B logically follows Sentence A or not

21
Q

pretraining objectives

A
  1. masked language modeling
  2. next sentence prediction
22
Q

BERT: architecture

A
  1. encoder: receives list of vectors as input
  2. self-attention: looks at all tokens for clues to better understand the target token
  3. positional encoding: represent the order of tokens within a sequence with a vector (see the sketch below)
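
A minimal sketch of one way to turn positions into vectors, using the sinusoidal encoding from the original Transformer paper (note that BERT itself learns its position embeddings as parameters):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings: each position gets a unique vector,
    so token order can be injected into otherwise order-blind attention."""
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

print(sinusoidal_positions(seq_len=4, d_model=8).shape)  # (4, 8)
```
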
23
Q

NSP input structure

A

CLS: A special token that is always placed at the beginning of the input sequence. It helps in classification tasks as it holds the aggregated representation of the input.

SEP: A special token used to separate different segments (sentences) in the input.

24
Q

NSP 50% split

A

IsNext: 50% of the time, two contiguous segments are sampled, so the second segment naturally follows the first

NotNext: 50% of the time, two random segments are sampled, so the pair does not naturally follow each other

By distinguishing whether one sentence follows another, the model learns to understand the context and relationship between sentences.
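
A minimal sketch of how IsNext/NotNext training pairs could be constructed, assuming a hypothetical toy corpus (real BERT samples NotNext segments from a different document):

```python
import random

def make_nsp_pair(segments, rng=random):
    """NSP example construction sketch: 50% IsNext (adjacent segments),
    50% NotNext (second segment sampled at random).
    `segments` is a list of sentences in document order."""
    i = rng.randrange(len(segments) - 1)
    if rng.random() < 0.5:
        return segments[i], segments[i + 1], "IsNext"
    j = rng.randrange(len(segments))        # BERT would sample from another document
    return segments[i], segments[j], "NotNext"

corpus = ["the cat sat .", "it purred loudly .", "stocks fell today ."]
print(make_nsp_pair(corpus))
```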

25
Q

BERT: putting MLM and NSP together

A
  1. unlabeled sentence input: two sentences with [MASK], [SEP], [CLS] input embeddings
  2. token embeddings: convert each token into a vector representation
  3. segment embeddings: tell the model which sentence the token belongs to
  4. positional embeddings: indicate the position of each token in the sentence (the three embeddings are summed, as in the sketch below)

Outcome: The model learns to understand both the internal context within sentences and the relationships between sentences, making it versatile for various NLP tasks.
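
A minimal sketch of how the three embeddings combine into the model input, with made-up toy sizes and random embedding tables:

```python
import numpy as np

vocab_size, n_segments, max_len, d = 100, 2, 16, 8   # hypothetical toy sizes
rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(vocab_size, d))   # token embeddings
seg_emb = rng.normal(size=(n_segments, d))   # segment (sentence A/B) embeddings
pos_emb = rng.normal(size=(max_len, d))      # positional embeddings (learned in practice)

token_ids   = np.array([1, 5, 7, 2, 9, 3])   # e.g. [CLS] w1 w2 [SEP] w3 [SEP]
segment_ids = np.array([0, 0, 0, 0, 1, 1])   # which sentence each token belongs to
positions   = np.arange(len(token_ids))

# BERT input representation: element-wise sum of the three embeddings
x = tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]
print(x.shape)  # (6, 8)
```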

26
Q

Problem with fixed source representation

A

For the encoder, it is challenging to compress the entire sentence into a single fixed-size vector representation.
–> Important information might be lost in the process because the fixed-size vector cannot capture all the nuances and details of the source sentence.

For the decoder, different parts of the source sentence might be relevant at different steps of the decoding process.
–> A single fixed representation might not provide the necessary context for generating each token in the target sequence effectively.

27
Q

Solution to fixed source representation problem

A

attention mechanism in decoder

  • at each decoder step, it decides which source parts are more important
  • now, the encoder does not have to compress the whole source into a single vector - it gives representations for all source tokens
28
Q

encoder self-attention mechanism

A

Each token in the input sequence looks at all the other tokens (including itself) to gather contextual information.

computes the importance of each token within the sequence
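
A minimal single-head sketch of scaled dot-product self-attention (random toy matrices, no multi-head machinery), showing how every token's output mixes information from all tokens:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Each token's query is compared against every token's key; the resulting
    weights decide how much of each token's value flows into the output."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # token-to-token importance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                           # 5 tokens, dimension 8
W = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(X, *W).shape)                    # (5, 8)
```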

29
Q

feed-forward network

A

After the self-attention layer, a feed-forward network processes the information to further refine the token representations.

30
Q

transformer architecture

A
  1. inputs: encoder self-attention –> feed-forward network
  2. outputs: decoder self-attention (masked) –> decoder-encoder attention –> feed-forward network
31
Q

decoder self-attention (masked)

A

Each token in the output sequence looks at the previous tokens up to the current position. This is masked to prevent the model from seeing future tokens.
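
A minimal sketch of the causal mask, assuming toy random attention scores: future positions are pushed to effectively zero weight before the softmax:

```python
import numpy as np

seq_len = 5
# Causal (look-ahead) mask: position i may attend to positions j <= i only.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
scores = np.where(mask, -1e9, scores)     # future positions get a huge negative score
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))               # upper triangle is (numerically) zero
```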

32
Q

decoder-encoder attention

A

Each token in the output sequence attends to all tokens in the input sequence to gather relevant information for generating the next token.

33
Q

sentence-level tasks

A
  1. sentence pair classification tasks
    - MNLI: entailment
    - QQP: duplicate (semantic equivalence)
  2. single sentence classification tasks
    - SST2: positive/negative

GLUE benchmark provides a comprehensive suite of tasks to evaluate the performance of NLP models

34
Q

token-level tasks

A
  1. extractive question answering
    –> extract the exact span of text from the context that forms the answer
  2. named entity recognition (NER)
    –> classify each word in a sentence into predefined categories of named entities
35
Q

BERT rediscovers the classical NLP pipeline

A

particularly through probing tasks

i.e., use the encoded representations of one system to train another classifier on some other (probing) task of interest

paper: the focus is on edge probing, which measures the ability to extract linguistic structure information from BERT’s pre-trained encoder.

36
Q

classical NLP pipeline

A
  1. tokenization
  2. POS tagging
  3. lemmatization/stemming
  4. named entity recognition (NER)
  5. syntactic parsing
  6. coreference resolution (identifying when different expressions in text refer to the same entity)
  7. semantic role labeling
  8. sentiment analysis
  9. information extraction
37
Q

shifting paradigms in NLP

A
  1. word vectors + task specific architectures
  2. multi-layer RNNs
  3. pre-trained transformers + finetuning
38
Q

limitations of pretraining and finetuning

A
  1. practical issues: need for large task-specific datasets for finetuning - end up with many copies of the same model tailored for different tasks
  2. humans don’t need large supervised datasets: they can learn from simple directives to mix and match skills and task switch
  3. overfitting: large models fine-tune on very narrow task distributions and struggle to generalize to new unseen data
39
Q

advancements in the size of language models (LM) before and after the introduction of GPT-3

A

Pre-GPT-3 Landscape: Shows a gradual increase in the size of language models, with significant models like ELMo, GPT, BERT, and GPT-2 pushing the boundaries.

With GPT-3: Highlights a massive jump in model size, underscoring the trend towards scaling up to achieve better performance in NLP tasks.

40
Q

rationale behind scaling up language models

A
studies how the performance of neural language models is affected by their scale

key findings:
  • performance depends strongly on scale (number of parameters, dataset size, compute) and only weakly on model shape
  • the relationship between empirical performance and each of these key variables follows a power law of the form y = ax^k (see the sketch after this list)
  • transfer improves with test performance: larger models not only perform better on their primary task but also transfer better to related tasks, improving overall utility
  • larger models are more sample-efficient: bigger models achieve better performance with fewer training examples than smaller models, making them more efficient at learning from limited data
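
A minimal sketch of the power-law form above, with made-up coefficients purely to show the shape of the relationship (loss falling slowly but steadily as scale grows):

```python
# Scaling-law shape y = a * x^k; the coefficients below are hypothetical,
# chosen only to illustrate a loss that decreases as a power law of scale.
def power_law(x, a=1.0, k=-0.05):
    return a * x ** k

for n_params in [1e6, 1e8, 1e10, 1e12]:
    print(f"{n_params:.0e} params -> loss ~ {power_law(n_params):.3f}")
```
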
41
Q

in-context learning in large language models (LLMs)

A

The process by which LLMs use examples within the input to perform tasks without parameter updates.

42
Q

types of in-context learning

A

Different modes of in-context learning based on the number of examples provided (example prompts are sketched after the list).

  1. zero-shot
    no prompt: the model is asked to perform the task without any examples or context
    prompt: the model is given an instruction but no examples
  2. 1-shot
    no prompt: the model is given one example of the task
    prompt: the model is given an instruction and one example
  3. few-shot
    no prompt: the model is given a few examples of the task
    prompt: the model is given an instruction and a few examples
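
A minimal sketch of what the three prompt styles look like, using hypothetical English-to-French translation prompts in the style of the GPT-3 paper:

```python
# Hypothetical prompts; the model simply completes the text, with no parameter updates.
zero_shot = "Translate English to French:\ncheese =>"

one_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)

few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)
print(zero_shot, one_shot, few_shot, sep="\n\n")
```
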
43
Q

emergent behavior in LLMs

A

LM performs a task just by conditioning on input-output examples, without optimizing parameters
–> no explicit training or finetuning

Highlights the model’s capability to learn and generalize from context alone, emphasizing the versatility and efficiency of large pre-trained models.

44
Q

which type of in-context learning to choose

A
  1. fine-tuning (FT):
    + strongest performance
    - needs a curated and labeled dataset for each new task
    - poor out-of-distribution generalization, risk of exploiting spurious features
  2. few shot (FS):
    + much less task-specific data needed
    + no spurious feature exploitation
    - challenging
  3. one-shot (1S):
    + most natural (e.g., giving humans instructions)
    - challenging
  4. zero-shot (0S):
    + most convenient
    - challenging, can be ambiguous

gradient: from strongest task-specific performance (FT) towards more convenient and general approaches that need less data (0S)

45
Q

perplexity

A

standard scoring system for measuring uncertainty in prediction of the next word

can be interpreted as the average branching factor at each decision point, as if the choice among next words were uniform

Perplexity is related to cross-entropy, which measures the difference between two probability distributions

lower perplexity = lower uncertainty = better
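
A minimal sketch, assuming hypothetical per-token probabilities, of how perplexity is computed as the exponentiated average negative log-likelihood (i.e. exponentiated cross-entropy):

```python
import math

def perplexity(token_probs):
    """token_probs are the model's probabilities for the words that actually
    occurred (hypothetical values here); lower perplexity = less uncertainty."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([1/6] * 10))          # uniform over 6 options -> perplexity 6.0
print(perplexity([0.5, 0.25, 0.125]))  # more confident model -> lower perplexity
```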

46
Q

explain ‘perplexity of 6’

A

This means that, on average, the model is as uncertain as if it had to choose between 6 equally likely options at each step of the prediction.

47
Q

perplexity scores

A

human: 12

LLM: 20

ML NLP: 110

48
Q

importance of context window

A

a larger context window allows for better comprehension and generation of text because it can take more information into account.

traditional machine learning-based NLP models have an average context window of 3, meaning they typically look at trigrams (three-word sequences) to make predictions or understand sentences. This small context window limits their understanding of longer, more complex sentences.

Large Language Models (like GPT-3 and GPT-4) have significantly larger context windows, capable of understanding and processing over 8000 words at a time. This allows them to grasp the context of much larger chunks of text, making them better at understanding and generating coherent text over longer passages.

49
Q

dependency parsing

A

method in NLP that analyzes the grammatical structure of a sentence and establishes relationships between “head” words and words that modify those heads.

helps in understanding the syntactic structure of a sentence

Proper dependency parsing can help NLP models understand and disambiguate sentences by considering different possible structures.

50
Q

linguistic categories

A
  1. POS tags: Help in identifying the grammatical category of each word in a sentence
  2. Features (inflectional and lexical): Provide additional details about the words, which can influence their grammatical role and relationship
  3. syntactic relations: Establish how words are connected grammatically within a sentence, forming a dependency tree that reflects the structure and meaning.
51
Q

dependency grammar

A

focus on the relationships between words in a sentence, where one word is the head and the other is the dependent

every word depends on exactly one other word (except for the root word)

normally with binary asymmetric relations (each dependency is a directed link between two words)

52
Q

how is a dependency tree built

A

by determining which word every word depends on

53
Q

how dependency parses work

A
  1. trees: sentences are parsed into tree structures where each node (word) has ONE PARENT
  2. label edges to indicate the head –> modifier relations
  3. usually one word is the root
  4. we don't want cycles (a minimal sketch of these constraints follows)
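
A minimal sketch of these constraints, storing a hypothetical parse as one head index per word and checking the single-root, single-parent, no-cycle properties:

```python
# A dependency parse stored as one head index per word (0 = root).
# Hypothetical parse of: "I saw a girl with a telescope" ("with" attached to "saw")
words = ["I", "saw", "a", "girl", "with", "a", "telescope"]
heads = [2, 0, 4, 2, 2, 7, 5]   # 1-based head of each word; "saw" is the root

assert heads.count(0) == 1                            # exactly one root
assert all(h != i + 1 for i, h in enumerate(heads))   # no word heads itself

def check_reaches_root(i):
    """Walk head links upward; a well-formed tree always reaches the root."""
    seen = set()
    while heads[i - 1] != 0:
        assert i not in seen, "cycle detected"
        seen.add(i)
        i = heads[i - 1]

for i in range(1, len(words) + 1):
    check_reaches_root(i)          # raises if the parse contains a cycle
print("valid dependency tree")
```
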
54
Q

why is dependency parsing useful

A
  1. resolves attachment ambiguities that can matter for meaning (e.g., “I saw a girl with a telescope”)
  2. grammatical structure of a sentence based on the relationships (dependencies) between the words
  3. syntactic dependencies can be close to semantic relations
  4. language agnostic: can be applied across different languages because the fundamental principles of grammatical relationships are similar in many languages
55
Q

types of dependency parsing

A
  1. projective: if every subtree is a contiguous span of the sentence (no crossing edges)
  2. prioritizing content words as heads
  3. prioritizing functional heads
56
Q

dealing with structural ambiguity in dependency parsing

A
  1. identify two words in the sentence that are ambiguous wrt their POS tags
  2. identify a structural ambiguity in the sentence
  3. assign the contextually correct POS to words
  4. draw two dependency trees
57
Q

transition-based dependency parsing

goal + how

A

goal: construct a dependency parse of a sentence

how: process words from left to right, deciding at each step whether two words should be attached; builds the dependency parse using a stack and a buffer

iteratively:
1. consult oracle
2. modify the configuration (state of stack, buffer, relations) according to the action

58
Q

transition-based dependency parsing mechanisms

A
  1. input buffer: words of the sentence
  2. stack: to manipulate words
  3. dependency relations: list of relations that culminate in the dependency parse
  4. transitions:
    - shift: remove the first word in the buffer and push it onto the stack
    - left-arc: add a dependency arc from the topmost word in the stack to the second-topmost word
    - right-arc: add a dependency arc from the second-topmost word in the stack to the topmost word
    - in both arc actions the head stays in the stack, the child is removed (a worked sketch follows)
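
A minimal arc-standard sketch with a hand-scripted "oracle" (in practice the oracle is a trained classifier); the sentence and action sequence are made up for illustration:

```python
def parse(words, oracle):
    """SHIFT moves the next buffer word onto the stack; LEFT-ARC / RIGHT-ARC
    attach the two topmost stack words (the head stays, the dependent is removed).
    `oracle` picks the next action from the current configuration."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":            # top of stack -> second-topmost
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))     # (head, dependent)
        elif action == "RIGHT-ARC":           # second-topmost -> top of stack
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# A hand-written action script standing in for the oracle, for "book me a flight":
script = iter(["SHIFT", "SHIFT", "RIGHT-ARC", "SHIFT", "SHIFT",
               "LEFT-ARC", "RIGHT-ARC", "RIGHT-ARC"])
print(parse("book me a flight".split(), lambda s, b: next(script)))
```
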
59
Q

oracle

A

algorithm that determines the next action during parsing