Introduction to Transformers for NLP Flashcards

1
Q

RNN

A

Recurrent Neural Network

In an RNN, information cycles through a loop: the hidden state from the previous step is fed back in, so each output depends on the inputs seen so far
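
A minimal sketch of that loop in Python (NumPy assumed, toy dimensions): the same weights are reused at every step, and the hidden state h carries information forward.

import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(3, 3))   # hidden-to-hidden weights (reused every step)
W_x = rng.normal(size=(3, 2))   # input-to-hidden weights
h = np.zeros(3)                 # hidden state carried through the loop

for x_t in rng.normal(size=(5, 2)):    # a sequence of 5 input vectors
    h = np.tanh(W_h @ h + W_x @ x_t)   # new state depends on the previous state
print(h)                               # summary of the sequence seen so far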

2
Q

Bag of Words

A
3
Q

n-grams

A
4
Q

trigram

A

A trigram model uses the last two words as context to predict the next word in the sequence
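
A minimal sketch of a trigram model on a toy corpus (the corpus and names here are illustrative): it counts which word follows each pair of words and predicts the most frequent continuation.

from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat sat on the rug".split()

counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    counts[(w1, w2)][w3] += 1          # how often w3 follows the pair (w1, w2)

context = ("cat", "sat")               # the last two words seen
print(counts[context].most_common(1)[0][0])   # -> 'on'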

5
Q

LSTM

A

Long Short-Term Memory

6
Q

GRU

A

Gated Recurrent Unit

7
Q

Feed-forward Neural Network

A
8
Q

BP Mechanism

A

Backpropagation

9
Q

Gradient Descent

A
10
Q

T5 model

A
11
Q

seq2seq

A

sequence-to-sequence neural network

12
Q

multi-head attention

A

Multiple self-attention modules run in parallel, each capturing a different kind of attention; their outputs are concatenated.
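
A minimal NumPy sketch of the idea (toy sizes; the learned Q/K/V and output projections of a real model are omitted): the embedding is split across two heads, each head runs its own self-attention, and the head outputs are concatenated.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    scores = x @ x.T / np.sqrt(x.shape[-1])   # scaled dot-product
    return softmax(scores) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, embedding size 8
heads = np.split(x, 2, axis=-1)               # 2 heads of size 4 each
out = np.concatenate([self_attention(h) for h in heads], axis=-1)
print(out.shape)                              # (4, 8): concatenated head outputs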

13
Q

feed-forward

A
14
Q

masked multi-head attention

A
15
Q

linear

A
16
Q

softmax

A

The softmax function is a mathematical function that converts a vector of real numbers into a probability distribution, where each value is between 0 and 1 and all values sum to 1.
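
A minimal sketch in Python (NumPy assumed) with a small worked example:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # subtracting the max avoids overflow
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)                      # ≈ [0.659, 0.242, 0.099], each between 0 and 1
print(probs.sum())                # 1.0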

17
Q

input embeddings

18
Q

output embeddings

19
Q

tokenize

20
Q

vectorize

21
Q

positional encoding

22
Q

self-attention

A

Self-attention allows the model to associate each word in the input with the other words in the same sentence
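
A minimal NumPy sketch for one sentence (toy sizes; the projection matrices are random here but learned in a real model): each word's query is compared against every word's key, the scores are softmaxed into attention weights, and those weights mix the value vectors.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))              # one 6-dim embedding per word, 4 words
W_q, W_k, W_v = (rng.normal(size=(6, 6)) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v      # query, key, value vectors per word
scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each word relates to each other word
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
out = weights @ V                        # each output is a weighted mix of all words
print(weights.round(2))                  # row i: attention of word i over the 4 words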

23
Q

query vector

24
Q

key vector

25
Q

value vector

26
Q

embedding vector

27
Q

residual connection

A

The multi-head attention output vector is added back to the original positional input embedding, so the sublayer's input is carried around it and summed with its output
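
A minimal sketch of that addition (NumPy assumed, stand-in values; the two shapes just have to match):

import numpy as np

x = np.ones((4, 8))                        # positional input embeddings (stand-in values)
attention_output = 0.1 * np.ones((4, 8))   # stand-in for the multi-head attention output
residual = x + attention_output            # input added back to the sublayer output
print(residual[0, 0])                      # 1.1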

28
Q

decoder

A
A decoder block stacks three sublayers (sketched below):
  1. Multi-headed attention layer
  2. Add and norm layers
  3. Feed-forward layer
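
A minimal PyTorch sketch of one decoder block built from those three sublayers (torch assumed installed, toy sizes; the full Transformer decoder also attends to the encoder output, which is omitted here):

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff),
                                nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask=None):
        a, _ = self.attn(x, x, x, attn_mask=causal_mask)  # multi-head attention
        x = self.norm1(x + a)                             # add & norm
        return self.norm2(x + self.ff(x))                 # feed-forward + add & norm

out = DecoderBlock()(torch.randn(1, 10, 512))             # batch of 1, 10 tokens
print(out.shape)                                          # torch.Size([1, 10, 512])
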
29
Q

encoder

30
Q

BERT

A

Bidirectional Encoder Representations from Transformers

31
Q

BERT-Base

A

BERT-Base has a total of 110 million parameters, 12 attention heads, 768 hidden nodes, and 12 layers.
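
A sketch of checking those numbers with the Hugging Face transformers library (assumed installed along with PyTorch); BertConfig's defaults correspond to BERT-Base:

from transformers import BertConfig, BertModel

config = BertConfig()                     # BERT-Base defaults
print(config.num_hidden_layers)           # 12 layers
print(config.hidden_size)                 # 768 hidden nodes
print(config.num_attention_heads)         # 12 attention heads

model = BertModel(config)                 # randomly initialised, no weight download
print(sum(p.numel() for p in model.parameters()))   # ≈ 110 million parameters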

32
Q

BERT-Large

A

BERT-Large is characterized by having 24 layers, 1024 hidden nodes, 16 attention heads, and 340 million parameter values.

33
Q

Masked-LM

34
Q

NSP

A

Next Sentence Prediction

35
Q

CLS

A

The first token of every input sequence is always a special classification token, [CLS]

36
Q

SEP

A

The [SEP] token serves to demarcate the break between the two sentences.