lecture 7 Flashcards
what is deep learning
- subfield of machine learning
- ML becomes just optimizing weights to best make a final prediction
representation learning vs deep learning
representation learning: attempts to automatically learn good features or representations
deep learning: extends this by adding layers. attempts to learn multiple levels of representation and an output
reasons for exploring deep learning
- whereas manually designed features are often over-specified, incomplete, and take a long time to design and validate, learned features are easy to adapt and fast to learn
- provides a flexible learnable framework for representing information
- can learn in both unsupervised and supervised settings
representations at NLP levels: phonology
traditional: phoneme table
DL: trains on speech data to predict phonemes/words from sound features and represents them as numerical vectors
representations at NLP levels: morphology
traditional: breaking down words into morphemes (prefixes, stems, and suffixes)
DL: every morpheme is a numerical vector. a neural network combines two vectors into one vector representing the whole word.
representations at NLP levels: syntax
traditional: phrases are discrete categories like NP, VP
DL: every word and phrase is a vector. An NN combines two vectors into one.
representations at NLP levels: semantics
traditional: lambda calculus - functions take other specific functions as inputs; however, there is no notion of similarity or fuzziness of language
DL: every word/phrase/logical expression is a vector encapsulating semantics. NN combines two vectors into one
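The cards above share one mechanism: a neural network that merges two child vectors into one parent vector. Below is a minimal sketch of such a composition function; the dimensions, the tanh nonlinearity, and the example morphemes are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Compose two child vectors (e.g. two morphemes, or two phrases) into one
# parent vector, recursive-neural-network style. All numbers are toy values.
d = 4                                        # embedding dimension (assumed)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, 2 * d))   # learned composition matrix
b = np.zeros(d)                              # bias term

def compose(left, right):
    """Combine two child vectors into a single parent vector."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

v_un = rng.normal(size=d)                    # vector for the prefix "un-"
v_kind = rng.normal(size=d)                  # vector for the stem "kind"
v_unkind = compose(v_un, v_kind)             # vector for the whole word "unkind"
```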
language models: narrow sense
a probabilistic model that assigns a probability P(w1, w2, ..., wn) to every conceivable finite word sequence (grammatical or not)
uses conditional probability:
- probability of a word given the previous words in the sentence
- reflects implicit word order
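A toy illustration of the narrow-sense definition: factor P(w1, ..., wn) with the chain rule into conditionals of each word given its history, here truncated to one previous word (a bigram model). The corpus and counting approach are illustrative, not part of the lecture.

```python
from collections import Counter, defaultdict

# Count bigrams in a tiny corpus and use them to score a word sequence.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def p_word_given_prev(word, prev):
    counts = bigram_counts[prev]
    return counts[word] / sum(counts.values()) if counts else 0.0

def p_sequence(words):
    # P(w1, ..., wn) ~= product of P(w_i | w_{i-1}); the start term is skipped.
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p_word_given_prev(word, prev)
    return prob

print(p_sequence("the cat sat on the mat".split()))   # 0.0625 for this corpus
```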
language models: broad sense
- encoder only models (BERT, RoBERTa, ELECTRA)
- encoder-decoder models (T5, BART)
- decoder only models (GPT-x models)
encoder
converts raw input into contextual representation
attention can access information from the whole sentence (bi-directional)
output is provided all at once
- for: sentence classification, NER
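A hedged sketch of the encoder side, assuming the Hugging Face transformers library and torch are installed; "bert-base-uncased" is just an example checkpoint. It shows the whole sentence being encoded in one bidirectional pass, yielding one contextual vector per token that a classification or NER head could use.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Encode a sentence with an encoder-only model; all tokens are visible to
# attention at once (bidirectional), and the output comes in a single pass.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes the whole sentence at once.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

token_vectors = outputs.last_hidden_state    # shape: (1, num_tokens, hidden_size)
```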
decoder
converts representations into output
attention can only access previous words (auto-regressive)
- for: (iterative) text generation
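A small sketch of the causal masking that makes a decoder auto-regressive: position i may only attend to positions up to i. The scores and sequence length are made up for illustration.

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)    # raw attention scores (toy values)

# Mask out everything above the diagonal so no token can attend to the future.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

# Row-wise softmax: each row i now puts zero weight on positions j > i.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```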
encoder/decoder
useful if input/output have different lengths
- for: summarization, translation
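A hedged example of an encoder-decoder model applied to summarization, assuming the Hugging Face transformers pipeline API; "t5-small" is one commonly used checkpoint, chosen here only for illustration.

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")
text = ("Deep learning extends representation learning by stacking layers that "
        "learn multiple levels of representation and an output.")
# The encoder reads the full input; the decoder generates a shorter output.
print(summarizer(text, max_length=20, min_length=5)[0]["summary_text"])
```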
how large are LLMs
current language models have massively increased in:
1. their number of parameters
2. the size of the datasets they are trained on (large corpus size)
This scaling-up allows these models to learn more complex patterns and generate more coherent and contextually relevant text.
pre-training and adaptation
pretraining: training models on huge amounts of unlabeled text using 'self-supervised' training objectives
adaptation: pretraining is followed by adaptation, which leverages the broad knowledge gained in pretraining while fine-tuning the pre-trained model on annotated examples to perform well on a downstream task
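A minimal sketch of the adaptation step, assuming the Hugging Face transformers library: load pre-trained weights, attach a task head, and update on a few annotated examples. The checkpoint name, labels, and single optimization step are illustrative only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Two toy annotated examples for a downstream sentiment task.
batch = tokenizer(["great product", "terrible service"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)   # loss against the annotated labels
outputs.loss.backward()
optimizer.step()                          # pre-trained weights are fine-tuned
```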
BERT: key contributions
- fine-tuning approach based on a deep transformer encoder
- key is to learn representations based on bidirectional context (because both left and right contexts are important to understand the meaning of words)
- state-of-the-art performance on a large set of sentence-level and token-level tasks
BERT: pre-training objectives
masked language modeling
next sentence prediction
pretraining objective 1: masked language modeling (MLM)
- using both future and past contexts (bidirectional) simultaneously could lead to peeking at the target word, which defeats the purpose of language modeling
- solution is to mask out k% of the input words, and then predict the masked words
MLM: 80-10-10 corruption
When 15% of the words in a sentence are chosen for prediction:
- 80% of the time those words are replaced with the [MASK] token
–> the model learns to predict the missing word based on its context
- 10% of the time they are replaced with a random word from the vocabulary
–> adds noise
- 10% of the time they are kept unchanged
–> ensures the model doesn't always expect a masked word
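A minimal sketch of this corruption procedure; the vocabulary, sentence, and random seed are toy choices.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]

def corrupt(tokens, mask_prob=0.15, seed=None):
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                              # position not chosen for prediction
        targets[i] = tok                          # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(VOCAB)      # 10%: replace with a random word
        # else (10%): keep the token unchanged
    return corrupted, targets

print(corrupt("the cat sat on the mat".split(), seed=0))
```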
MLM rationale
[MASK] tokens are not present during the fine-tuning phase, so the model learns to generalize better by not becoming dependent solely on predicting masked tokens
The model needs to generalize from the training data where [MASK] tokens are used, to real-world scenarios where no [MASK] tokens will be present.
By using a mix of masking, random replacements, and unchanged words, the model learns to handle various situations and contexts effectively.
pretraining objective 2: next sentence prediction
motivation: many NLP downstream tasks require understanding the relationship between two sentences
NSP is designed to reduce the gap between pre-training and fine-tuning by teaching the model to comprehend sentence relationships
This setup teaches the model to distinguish between sentences that logically follow each other and those that do not
–> given a sentence pair, the model predicts whether Sentence B logically follows Sentence A
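A sketch of how NSP training pairs could be built: half the time sentence B really is the next sentence (IsNext), half the time it is a random sentence (NotNext). The mini document is illustrative; a real pipeline samples the negative from a different document.

```python
import random

document = [
    "He went to the store.",
    "He bought a gallon of milk.",
    "The weather was sunny that day.",
]

def make_nsp_pair(doc, i, rng=random):
    sentence_a = doc[i]
    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"      # true next sentence
    return sentence_a, rng.choice(doc), "NotNext"    # randomly chosen sentence

print(make_nsp_pair(document, 0))
```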
BERT: architecture
- encoder: receives list of vectors as input
- self-attention: looks at all tokens for clues to better understand target token
- positional encoding: represent order of tokens within a sequence with a vector
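A compact sketch of the two pieces named above, with illustrative dimensions: sinusoidal positional encodings added to the input vectors, then scaled dot-product self-attention in which every token looks at every other token.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding: each position gets a distinct vector.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # pairwise token similarity
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # softmax over all positions
    return weights @ V                             # contextual token vectors

seq_len, d = 6, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d)) + positional_encoding(seq_len, d)
out = self_attention(X, *(rng.normal(scale=0.3, size=(d, d)) for _ in range(3)))
```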
NSP input structure
CLS: A special token that is always placed at the beginning of the input sequence. It helps in classification tasks as it holds the aggregated representation of the input.
SEP: A special token used to separate different segments (sentences) in the input.
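A hedged illustration of the resulting input layout, assuming the Hugging Face BERT tokenizer; passing two sentences produces [CLS] sentence A [SEP] sentence B [SEP].

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("He went to the store.", "He bought a gallon of milk.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'he', 'went', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
```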