C7 Flashcards
Recurrent Neural Networks (RNN)
- connections between the hidden layers of subsequent ‘time steps’ (words in a text)
- internal state that is updated in every time step
- hidden layer weights determine how the network should make use of past context in calculating the output for the current input (trained via backpropagation)
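A minimal sketch of the recurrence, assuming PyTorch and toy dimensions (the layer sizes and the 7-word sequence are made up for illustration):
```python
import torch

# An RNN keeps an internal hidden state that is updated at every time step
# from the previous state and the current input embedding.
rnn = torch.nn.RNN(input_size=50, hidden_size=100, batch_first=True)

x = torch.randn(1, 7, 50)        # one sequence of 7 word embeddings
output, h_n = rnn(x)             # output: the hidden state at every time step
print(output.shape, h_n.shape)   # (1, 7, 100) and (1, 1, 100)
```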
LSTM
Long Short-Term Memory: more powerful (and more complex) RNNs that take longer contexts into account by removing information that is no longer needed from the context and adding information that is likely to be needed for later decision-making
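A minimal sketch, assuming PyTorch and toy dimensions; the cell state is the "context" that the gates update:
```python
import torch

lstm = torch.nn.LSTM(input_size=50, hidden_size=100, batch_first=True)
output, (h_n, c_n) = lstm(torch.randn(1, 7, 50))
# c_n is the cell state: the memory from which gates remove information that
# is no longer needed and to which they add information needed later.
print(output.shape, c_n.shape)   # (1, 7, 100) and (1, 1, 100)
```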
bi-LSTM
bidirectional neural model for NER:
- first, word and character embeddings are computed for input word w_i and the context words
- these are passed through a bidirectional LSTM, whose outputs are concatenated to produce a single output layer at position i
Simplest approach: pass the bi-LSTM output directly to a softmax layer to choose tag t_i
But for NER the softmax approach is insufficient: strong constraints on neighboring tags are needed (e.g., the tag I-PER must follow I-PER or B-PER) => use a CRF layer on top of the bi-LSTM output: biLSTM-CRF (see the sketch below)
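A minimal sketch of the bi-LSTM part, assuming PyTorch and made-up vocabulary/tag sizes; the CRF layer that enforces the tag constraints would sit on top of these per-token scores:
```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=128, n_tags=9):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)   # both directions concatenated

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))      # (batch, seq, 2*hidden)
        return self.out(h)                         # per-token tag scores

scores = BiLSTMTagger()(torch.randint(0, 5000, (1, 12)))
print(scores.shape)                                # (1, 12, 9)
```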
transformer models
- encoder-decoder architecture
- much more efficient than Bi-LSTMs and other RNNs because input is processed in parallel instead of sequentially
- can model longer-term dependencies because the complete input is processed at once
- but it uses a lot of memory because of quadratic complexity: O(n^2) for input length of n items
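A minimal sketch, assuming PyTorch's built-in encoder-decoder Transformer and toy dimensions; the point is that the whole input sequence is passed in at once rather than token by token:
```python
import torch

model = torch.nn.Transformer(d_model=64, nhead=4, batch_first=True)
src = torch.randn(1, 10, 64)   # encoder input: all 10 tokens at once
tgt = torch.randn(1, 6, 64)    # decoder input
out = model(src, tgt)
print(out.shape)               # (1, 6, 64)
```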
the attention mechanism
When processing each item in the input, the model has access to all of the input items
Self-attention: each input token is compared to all other input tokens
=> comparison via the dot product of the two vectors (the larger the value, the more similar the compared vectors)
- Self-attention represents how words contribute to the representation of longer inputs and how strongly words are related to each other => allows us to model longer-distance relations between words
Disadvantage: attention is quadratic in the length of the input (computing dot products between each pair of tokens in the input at each layer)
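A minimal sketch of (scaled) dot-product self-attention, assuming PyTorch and toy numbers; in a real transformer Q, K and V are separate learned projections of the input:
```python
import torch
import torch.nn.functional as F

n, d = 5, 8                          # 5 tokens, embedding dimension 8
Q = K = V = torch.randn(n, d)        # simplification: same vectors for Q, K, V
scores = Q @ K.T / d ** 0.5          # (n, n) pairwise dot products -> O(n^2)
weights = F.softmax(scores, dim=-1)  # how strongly each token attends to the others
context = weights @ V                # attention-weighted token representations
print(weights.shape, context.shape)  # (5, 5) and (5, 8)
```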
BERT
Pre-training of Deep Bidirectional Transformers for Language Understanding
- Pre-training: language modelling
- Bidirectional: predicting randomly masked words in context
- Transformers: efficient neural architectures with self-attention
- Language understanding: encoding, not decoding (not generation)
=> the encoder half of the transformer
Core idea of BERT: self-supervised pre-training based on language modelling
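A minimal sketch, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint: BERT is an encoder, so it maps input tokens to contextual embeddings rather than generating text:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes language; it does not generate it.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, number_of_tokens, 768)
```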
masked language modelling
- Predicting randomly masked words in context to capture the meaning of words
- Next-sentence classification to capture the relationship between sentences
both objectives are trained jointly during pre-training
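A minimal sketch of the masked-word prediction, assuming the Hugging Face fill-mask pipeline and a public BERT checkpoint (the example sentence is made up):
```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
# The model predicts the masked word from its bidirectional context.
for pred in fill("The doctor asked the [MASK] to describe the symptoms."):
    print(pred["token_str"], round(pred["score"], 3))
```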
WordPiece
specific type of tokenization used by BERT
A fixed-size vocabulary is defined to model huge corpora
The WordPiece vocabulary is optimized to cover as many words as possible
- frequent words are single tokens, e.g. “walking” and “talking”
- less frequent words are split into subwords, e.g. “bi” + “##king”, “bio” + “##sta” + “##tist” + “##ics”
- this is not linguistically motivated, but purely computational
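A minimal sketch, assuming the bert-base-uncased WordPiece vocabulary; the exact subword splits depend on the vocabulary, so the outputs shown are only indicative:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("walking"))        # frequent word -> single token, e.g. ['walking']
print(tokenizer.tokenize("biostatistics"))  # rarer word -> subwords, e.g. ['bio', '##sta', ...]
```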
success of BERT
- achieves state-of-the-art results on a large range of tasks, and even across a large range of domains
- pre-trained models can easily be fine-tuned
- pre-trained models are available for many languages, as well as domain-specific pre-trained BERT models: bioBERT etc.
BERT for similarity
With BERT, if we want to compute the similarity (or some other relation) between two sentences, we concatenate them in the input and then feed them to the BERT encoder
Finding the most similar pair in a collection of 10,000 sentences takes about 65 hours with BERT.
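A minimal sketch of this joint encoding, assuming the Hugging Face transformers library; the two sentences are concatenated into a single [SEP]-separated input:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("A man is playing guitar.", "Someone plays an instrument.")
print(tokenizer.decode(enc["input_ids"]))
# [CLS] a man is playing guitar. [SEP] someone plays an instrument. [SEP]
```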
SBERT
- independent encoding of two sentences with a BERT encoder
- then measure similarity between the two embeddings
=> reduces the effort for finding the most similar pair from 65 hours with BERT to about 5 seconds with SBERT, while maintaining the accuracy from BERT
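A minimal sketch, assuming the sentence-transformers library and the public all-MiniLM-L6-v2 checkpoint (any SBERT-style model works the same way): each sentence is encoded independently and similarity is computed between the embeddings:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["A man is playing guitar.",
                           "Someone plays an instrument."])
print(util.cos_sim(embeddings[0], embeddings[1]))   # cosine similarity of the pair
```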
transfer learning with neural language models
Inductive transfer learning: transfer the knowledge from pretrained language models to any NLP task
- During pre-training, the model is trained on unlabeled data (self-supervision) over different pre-training tasks
- For fine-tuning, the BERT model is first initialized with the pre-trained parameters
- All of the parameters are fine-tuned using labeled data from the downstream tasks (supervised learning)
Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.
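A minimal sketch of one fine-tuning step, assuming Hugging Face transformers, PyTorch, and two made-up labelled examples; the classification head is newly initialised on top of the pre-trained parameters, and all parameters are then updated:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)        # pre-trained encoder + new head

batch = tokenizer(["great movie", "terrible movie"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])                 # labelled downstream data

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss     # supervised loss
loss.backward()
optimizer.step()                              # fine-tunes all parameters
```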
zero-shot use
using a pre-trained model without fine-tuning
We also use the term 'zero-shot' for the use of models that were fine-tuned by someone else or on a different task, e.g.:
- trained on a newspaper benchmark, applied to Twitter data
- trained on English, used for Dutch
few-shot learning
fine-tuning with a small number of samples
challenges of state-of-the-art methods
time- and memory-expensive:
- pre-training takes time (days) and computing power
- fine-tuning takes time (hours) and computing power
- use of a fine-tuned model (inference) needs computing power
Hyperparameter tuning:
- optimization on the development set takes time
- adopting hyperparameters from the pre-training task might be suboptimal