Session 2 Flashcards
1
Q
Named entity recognition
A
Find spans of text that constitute proper names and tag the type of the entity. Four entity tags are most common: PER (person), LOC (location), ORG (organization), and GPE (geopolitical entity)
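For illustration only, a minimal NER sketch with spaCy (the library and the en_core_web_sm model are not part of the card and are assumed to be installed); note spaCy's label set (PERSON/ORG/GPE/LOC) differs slightly from the PER/LOC/ORG/GPE tags above:

```python
# Illustrative sketch only: NER with spaCy (assumed installed:
#   pip install spacy && python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook visited Apple's offices in Cork, Ireland.")

for ent in doc.ents:
    # each entity is a span of text plus a type label, e.g. PERSON, ORG, GPE
    print(ent.text, ent.label_)
```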
2
Q
POS tagging
A
- part-of-speech tagging
- the process of taking a sequence of words and assigning each word a part-of-speech label such as NOUN or VERB
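A minimal sketch, again using spaCy purely for illustration (same assumed model as in the NER sketch above); token.pos_ holds a coarse universal POS tag such as NOUN or VERB:

```python
# Illustrative sketch only: POS tagging with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She books a flight to Berlin.")

for token in doc:
    # one part-of-speech label per word in the sequence
    print(token.text, token.pos_)
```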
3
Q
RNN
A
- any neural network that contains a cycle in its connections, meaning the value of some unit depends (directly or indirectly) on its own earlier outputs as an input; each prediction is conditioned on what came before, like a memory of the sequence so far
- The same weight matrices (W for the input, U for the recurrent connection, V for the output) are shared across all time steps (minimal sketch below)
- Inputs can therefore be of arbitrary length: the network is simply unrolled once per time step
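A minimal NumPy sketch (not from the card; sizes and initialization are arbitrary) of a simple Elman-style RNN, showing that the same W, U, V are reused at every step, so any sequence length works:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 8, 3                 # example sizes (assumed)
W = rng.normal(size=(d_h, d_in))           # input  -> hidden
U = rng.normal(size=(d_h, d_h))            # hidden -> hidden (the recurrent cycle)
V = rng.normal(size=(d_out, d_h))          # hidden -> output

def rnn_forward(xs):
    h = np.zeros(d_h)                      # initial hidden state
    ys = []
    for x in xs:                           # one step per input token, any length
        h = np.tanh(W @ x + U @ h)         # new state depends on input AND previous state
        ys.append(V @ h)                   # output at this step
    return ys

outputs = rnn_forward([rng.normal(size=d_in) for _ in range(5)])
print(len(outputs), outputs[0].shape)      # 5 outputs, each of size d_out
```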
4
Q
RNN Relevance
A
- More efficient than transformer-based models
- Can be trained from scratch (even on your own machine)
5
Q
RNN pro
A
- Handles inputs of arbitrary length
- Can be combined with other NNs
- Can handle a variety of task types, e.g. classification, generation
- Good performance
- No fixed context limit: an FFN only sees a small context window (e.g. 3 words) and ignores everything outside it, whereas an RNN can combine the meanings of the words of a phrase across the whole preceding sequence
6
Q
RNN limit
A
- Uni-directional: later words can be informative, e.g. in "the old man the boat" you need "the boat" to realize that "man" is a verb (the old people man the boat) rather than the head of the noun phrase "the old man"
- Arbitrarily long context is hard in practice: vanishing gradients (multiplying many small values drives the gradient toward 0, e.g. 0.9^50 ≈ 0.005)
- Computationally inconvenient: each step depends on the output of the previous step, so time steps cannot be computed in parallel
7
Q
LSTM
A
- Long Short-Term Memory Network
- c_{t-1} = context vector = what the previous time step wanted us to remember (the important part of the information so far)
- 3 gates (each one "masks" its input & decides what to take from the previous state):
- each gate consists of a feed-forward layer, followed by a sigmoid activation function, followed by a pointwise multiplication with the layer being gated
8
Q
LSTM gate
A
- Forget gate = delete information from the context that is no longer needed
- Add gate = select new information that is added to the current context
- Output gate = select the information required for the current hidden state, i.e. the output that can be used for a label etc. (sketch below)
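A minimal NumPy sketch (not from the card; weight names and sizes are assumptions) of one LSTM time step, spelling out the three gates; each gate is a feed-forward layer followed by a sigmoid whose output pointwise-multiplies ("masks") the vector being gated:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # forget gate: decide what to delete from the previous context c_{t-1}
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev)
    # add (input) gate: decide what new information to write into the context
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev)
    g_t = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev)   # candidate new content
    c_t = f_t * c_prev + i_t * g_t                       # updated context vector
    # output gate: decide which part of the context goes into the hidden state
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
params = {name: rng.normal(size=(d_h, d_in if name.startswith("W") else d_h))
          for name in ["W_f", "U_f", "W_i", "U_i", "W_g", "U_g", "W_o", "U_o"]}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)
```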
9
Q
LSTM advantage over RNN
A
- RNN: despite having access to the entire preceding sequence, the information encoded in hidden states tends to be fairly local, more relevant to the most recent parts of the input and recent decisions, because the hidden layer is asked to perform two tasks simultaneously: provide information useful for the current decision, and update & carry forward information needed for future decisions; the LSTM separates these jobs by keeping an explicit context vector managed by its gates
- RNN: vanishing gradient problem
10
Q
BiLSTM
A
- bidirectional LSTM
- consists of two LSTMs that do not share weights: one reads the sequence left-to-right, the other right-to-left, and their outputs are typically concatenated (sketch below)
- Limit: fixed-vocabulary assumption (a word must have occurred in the training data)
- Solutions: 1. add an unknown-word (UNK) token, or 2. train the model at the character level (character & word level can also be combined)
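A minimal PyTorch sketch (not from the card; sizes are arbitrary) showing that a bidirectional LSTM is two LSTMs with separate weights whose per-step outputs are concatenated:

```python
import torch
import torch.nn as nn

d_in, d_h, seq_len = 16, 32, 10            # example sizes (assumed)
bilstm = nn.LSTM(input_size=d_in, hidden_size=d_h,
                 bidirectional=True, batch_first=True)

x = torch.randn(1, seq_len, d_in)          # (batch, time, features)
out, _ = bilstm(x)
print(out.shape)                           # (1, 10, 64): forward + backward states concatenated
```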
11
Q
Multi-task learning
A
- solving multiple tasks while sharing common patterns
12
Q
Multi-task learning - how?
A
- pre-training on raw text
- fine-tuning on annotated data
13
Q
Sequential multi-task learning
A
- Train on one task & save the parameters
- Use (part of) the saved parameters to initialize a new model & train it on the new task (sketch below)
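A minimal PyTorch sketch of this idea (not from the card; the Tagger class, layer names and sizes are made up for illustration): train on task A, save the parameters, then reuse only the shared encoder weights to initialize a model for task B:

```python
import torch
import torch.nn as nn

class Tagger(nn.Module):
    def __init__(self, n_labels):
        super().__init__()
        self.encoder = nn.LSTM(input_size=50, hidden_size=100, batch_first=True)  # shared part
        self.head = nn.Linear(100, n_labels)                                      # task-specific part

    def forward(self, x):
        h, _ = self.encoder(x)
        return self.head(h)

# ... train model_a on task A (e.g. POS tagging), then save its parameters
model_a = Tagger(n_labels=17)
torch.save(model_a.state_dict(), "task_a.pt")

# initialize a new model for task B (e.g. NER) with the saved encoder weights;
# strict=False keeps the new, randomly initialized head for the new label set
model_b = Tagger(n_labels=4)
state = torch.load("task_a.pt")
missing, unexpected = model_b.load_state_dict(
    {k: v for k, v in state.items() if k.startswith("encoder")}, strict=False
)
# ... fine-tune model_b on task B's annotated data
```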
14
Q
Multi-task learning - why?
A
- Efficiency: one model for multiple tasks, no separate training for each (pre-training is very expensive and takes a lot of time)
- Performance: many tasks are related to each other, so it is beneficial to share information between them
- Needs less annotated data, which is expensive to produce