IR – Word Representation, Transformer/BERT Flashcards
Lecture 4
Why do we have to map words to numerical vectors for neural techniques?
Because neural networks operate in continuous spaces, and without continuous values there is no training/learning (no gradient descent).
What is One-Hot Encoding?
It is an encoding strategy where the vectors have the size of the vocabulary. Each word in the vocabulary is mapped to a vector that has exactly one “1” and “0”s everywhere else.
It is a bad encoding strategy because it is very sparse: a lot of wasted memory.
What are word embeddings?
It is a way to represent words as dense vectors of continuous values. The dimensions of the vectors are abstract (typically 100-300), and they allow for some math operations: semantically similar words are closer together in the high-dimensional space.
How are the word embeddings created?
Using unsupervised methods, such as Word2Vec, GloVe, Transformer-based methods (BERT), etc.
It is unsupervised because we don’t give explicit labels; we just have real text the way people use it, and the model’s job is to predict the next word in a sequence.
What are some problems of the Word2Vec algorithm?
Mapping a one-hot encoding to a dense representation (what Word2Vec does) still has the same problems as one-hot encoding (apart from the sparsity problem):
- Ordering of words matters: “it was not good, it was actually quite bad” vs “it was not bad, it was actually quite good”
- Different meanings of the same word (the vector doesn’t change based on the context)
- What happens if there is an out-of-vocabulary (OOV) word?
Explain the Word2Vec algorithm
We have a neural network with 1 hidden layer that is trained to predict words. The input word is encoded using one-hot encoding. The output layer should predict the next word in the sequence, but we don’t care about that in this algorithm; we care about the hidden layer. By training this NN, the hidden layer learns how to make an abstraction of the input word using fewer neurons (values). So instead of a one-hot vector (the size of the vocabulary), the hidden layer has 100-300 neurons (values), and the NN learns how to ‘map’ the one-hot encoding to this lower-dimensional representation.
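A minimal sketch of training such embeddings, assuming the gensim library and a tiny toy corpus (the data and dimensions here are illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative data)
sentences = [
    ["the", "query", "returned", "a", "sample", "document"],
    ["the", "question", "returned", "an", "example", "document"],
    ["we", "sample", "documents", "for", "the", "query"],
]

# Train skip-gram embeddings with a small vector size for the toy example
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# The learned hidden-layer weights are the word embeddings
vec = model.wv["query"]          # a 50-dimensional dense vector
print(vec.shape)                 # (50,)
```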
What is query expansion?
It is a way to expand the search query with similar words. This can be done with word embeddings: we find similar words (vectors close to the query term’s vector) and append them to the query (see the sketch after the examples below).
sample -> example, sampling
query -> inquire, question
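A sketch of this idea, reusing the gensim model trained in the previous sketch (an assumption, not from the lecture):

```python
def expand_query(query_terms, model, topn=2):
    """Append the nearest-neighbour words of each query term (if it is in the vocabulary)."""
    expanded = list(query_terms)
    for term in query_terms:
        if term in model.wv:
            expanded += [word for word, _ in model.wv.most_similar(term, topn=topn)]
    return expanded

# 'model' is the Word2Vec model from the sketch above
print(expand_query(["query", "sample"], model))
# e.g. ['query', 'sample', 'question', ...] depending on the training data
```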
How can CNNs be used with word embeddings?
Word embeddings of N-grams are not feasible directly (sparsity problem: many values would be zero, or the vectors would be very large, and there is usually not enough training data: no connection between ‘quite good’ and ‘very good’).
CNNs can be used to do this. If we want to create 2-gram embeddings, we use a filter of size 2 and slide it across the input embedding vectors (a sliding window). This filter is learned during the training. The output is a sequence of N-gram representations.
At the end, there is the same number of vectors, but possibly with a smaller dimension; they represent the N-gram embeddings. The goal is to capture some surrounding context, i.e., to get a signal from the surrounding words so that the embeddings are more accurate.
Zero-padding is used at the edges so the number of output positions matches the input. The filters are learned during training. In code, the embedding matrix needs to be transposed so that the convolution runs over the sequence dimension (see the sketch below).
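A minimal sketch of 2-gram embeddings with a 1D convolution, assuming PyTorch (the dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, emb_dim, ngram_dim = 1, 6, 300, 128

# Input: one sequence of 6 word embeddings (random here, normally from an embedding layer)
x = torch.randn(batch, seq_len, emb_dim)

# Conv1d expects (batch, channels, length), so transpose the embedding matrix
x = x.transpose(1, 2)                       # (1, 300, 6)

# Pad one zero vector on the right so a size-2 filter yields seq_len outputs
x = F.pad(x, (0, 1))                        # (1, 300, 7)

# A bank of learned filters of width 2 -> one 128-dim vector per 2-gram position
conv = nn.Conv1d(in_channels=emb_dim, out_channels=ngram_dim, kernel_size=2)
ngram_embeddings = conv(x).transpose(1, 2)  # back to (1, 6, 128)
print(ngram_embeddings.shape)               # torch.Size([1, 6, 128])
```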
Explain the Byte-Pair Encoding
Initialization:
- Define a vocabulary: set of all possible individual characters
Algorithm:
- Choose two symbols that appear as a pair most frequently
- Add the new merged symbol
- Replace all occurrences of the two symbols with the new merged symbol
Repeat until k merges have been done (k is predefined; see the sketch after the pros list below)
Main pros:
- Handles OOV text (can just use basic chars)
- Compression efficiency (common sequences become one token)
- The learned merges reflect the training corpus, so the encoding can be adapted (optimized) to different types of text
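A minimal sketch of the merge-learning loop in plain Python (the toy corpus, word frequencies, and variable names are illustrative):

```python
from collections import Counter

# Toy corpus: words as tuples of symbols with an end-of-word marker, mapped to frequencies
corpus = Counter({("l", "o", "w", "</w>"): 5,
                  ("l", "o", "w", "e", "r", "</w>"): 2,
                  ("n", "e", "w", "e", "s", "t", "</w>"): 6})

def most_frequent_pair(corpus):
    """Count all adjacent symbol pairs and return the most frequent one."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, corpus):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = Counter()
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

k = 5  # predefined number of merges
for _ in range(k):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(pair, corpus)
    print("merged", pair)
```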
How can RNNs be used with word embeddings?
The problem with CNNs is that you have to define the size of the N-gram (how many words you want to combine). This is quite limiting, because words from far back can influence later words. RNNs can help us solve this: they take an arbitrarily large number of word embeddings (words) and embed them together sequentially. Ideally, at the end, the output embedding represents an embedding of the full input sequence.
Problem: RNNs can’t be parallelized, and after a certain input length the RNN loses accuracy.
Si = RNN(Si-1, Xi)   (the current token combined with the previous state)
Si = g(Si-1 * Ws + Xi * Wx + b)
g is some activation function
Ws and Wx are matrices learned during training and are the same in all iterations
b is some bias, trainable parameter
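A minimal NumPy sketch of one RNN step as defined above, assuming tanh as the activation g (dimensions are illustrative):

```python
import numpy as np

emb_dim, state_dim = 4, 3

rng = np.random.default_rng(0)
Ws = rng.normal(size=(state_dim, state_dim))  # learned, shared across all steps
Wx = rng.normal(size=(emb_dim, state_dim))    # learned, shared across all steps
b = np.zeros(state_dim)                       # trainable bias

def rnn_step(s_prev, x_i):
    """Si = g(Si-1 * Ws + Xi * Wx + b), with g = tanh."""
    return np.tanh(s_prev @ Ws + x_i @ Wx + b)

# Run over a sequence of word embeddings; the final state embeds the whole sequence
sequence = rng.normal(size=(5, emb_dim))      # 5 word embeddings (illustrative)
s = np.zeros(state_dim)
for x in sequence:
    s = rnn_step(s, x)
print(s)                                      # embedding of the full input sequence
```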
Explain the Encoder - Decoder architecture
It is an architecture that supports sequences of input and output tokens.
Encoder:
- Can be an RNN that takes the sequence of input tokens and tries to output an embedding of the whole sequence. The RNN works sequentially, which means parallelism doesn’t work (this holds for RNNs, not for Transformers).
Decoder:
- Takes the encoder’s output (sentence embedding) and tries to generate a sequence of tokens based on it. It works as an RNN that takes the context (encoder’s output) and the last output token, and tries to find the next most probable token. This is repeated until the <EOS> (end-of-sequence) token is produced (see the sketch below).
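A sketch of the decoding loop; the encoder and decoder_step callables are hypothetical names used only to show the control flow (not from the lecture):

```python
def generate(input_tokens, encoder, decoder_step, eos_token="<EOS>", max_len=50):
    """Greedy encoder-decoder generation (encoder and decoder_step are hypothetical)."""
    context = encoder(input_tokens)                 # one embedding for the whole input sequence
    output, last_token = [], "<BOS>"
    for _ in range(max_len):
        probs = decoder_step(context, last_token)   # probability per vocabulary token
        last_token = max(probs, key=probs.get)      # pick the most probable next token
        if last_token == eos_token:
            break
        output.append(last_token)
    return output
```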
What is softmax function used for?
We use it when, based on some arbitrary scores, we want to get a probability distribution where the probabilities sum to 1. It is used, for example, in decoders: each token in the vocabulary gets some score, and the higher the score, the more probable it is that this word is selected as the next token. To turn these scores into a probability distribution, we apply the softmax function.
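A minimal NumPy sketch (subtracting the max is only for numerical stability and was not part of the notes):

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)      # numerical stability, does not change the result
    exp = np.exp(shifted)
    return exp / exp.sum()

vocab_scores = np.array([2.0, 1.0, 0.1])   # illustrative decoder scores over 3 tokens
print(softmax(vocab_scores))               # approx. [0.659 0.242 0.099], sums to 1
```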
What is an attention mechanism?
It is a way to find relevant parts of the input.
In an encoder-decoder architecture, instead of having one encoder embedding as context, we want to find the relevant information from the input sequence for that particular decoder step. For that, we save every encoder state (assuming an RNN). The attention mechanism takes all of them, together with the previously generated output embedding, runs them through a small neural network (fully connected) and a softmax, and we get a probability distribution that tells us which encoder steps are the most important for that decoder step. Then we take a WEIGHTED SUM of the encoder states to create the context for the decoder.
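A NumPy sketch of this weighted-sum idea; the single scoring matrix W stands in for the small fully connected network (a simplification assumed here for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

state_dim = 4
rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, state_dim))   # one saved state per input token
decoder_state = rng.normal(size=state_dim)         # previously generated output embedding
W = rng.normal(size=(2 * state_dim,))              # learned scoring weights (simplified FC)

# Score each encoder state against the current decoder state, then softmax
scores = np.array([np.concatenate([h, decoder_state]) @ W for h in encoder_states])
weights = softmax(scores)                          # which encoder steps matter most

# Weighted sum of encoder states = the context for this decoder step
context = weights @ encoder_states
print(weights, context.shape)                      # weights sum to 1; context is (4,)
```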
What is a self-attention?
It is a way to contextualize embeddings. It is assumed that every word (token) in a sequence influences (attends to) every other token. This is represented as an n x n matrix (O(n²), a very intensive computation). This approach changes the meaning of words based on their surroundings (it changes the word embedding of the word).
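A NumPy sketch using the standard scaled dot-product formulation with query/key/value projections (that formulation is the usual Transformer one and is assumed here, not spelled out in these notes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d = 5, 8                                   # 5 tokens, 8-dimensional embeddings
rng = np.random.default_rng(2)
X = rng.normal(size=(n, d))                   # input token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # learned projections

Q, K, V = X @ Wq, X @ Wk, X @ Wv
attention = softmax(Q @ K.T / np.sqrt(d))     # n x n matrix: every token attends to every token
contextualized = attention @ V                # new, context-dependent embeddings
print(attention.shape, contextualized.shape)  # (5, 5) (5, 8)
```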
What is the contextualization?
It is changing the embedding (meaning) of a token/word based on its surroundings
What is BERT?
Bidirectional Encoder Representations from Transformers (BERT) is a way of encoding a word based on its surroundings (left and right of the word).
It uses the WordPiece tokenization technique (similar to BPE). It is a large ML model (the base version has 12 layers and 768 dimensions). It uses special tokens which are learned during training:
- CLS: Classification token used to represent an entire sequence as one vector/embedding.
- MASK: used to mask a word to predict
- SEP: Separator token to indicate the next sentence
Explain the training of BERT
If we have some text, we can mask any word in the sequence and ask the model to predict it based on its surroundings. Of course, we know what the actual word is, so we can update the model’s weights based on the loss (the difference between predictions and the actual value). This can be done in parallel, so we can utilize GPU power.
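A quick way to see masked-word prediction in action, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint (not part of the lecture):

```python
from transformers import pipeline

# Load a pretrained BERT with its masked-language-modelling head
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the [MASK] token from its left and right context
for prediction in fill_mask("The retrieval system returned the most [MASK] documents."):
    print(prediction["token_str"], round(prediction["score"], 3))
```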
How does the input for BERT look like?
If we have a sentence, the CLS token is added at its beginning, and SEP tokens are added between (and after) sentences. Then the tokens are embedded into vectors, and trained position embeddings and segment (sequence) embeddings are added to them.
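A sketch of what the tokenized input looks like, again assuming the transformers library (the token list in the comment is illustrative of typical WordPiece output):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Two sentences: [CLS] is prepended, [SEP] separates and terminates them
encoding = tokenizer("How are embeddings trained?", "With masked words.")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', 'how', 'are', 'em', '##bed', '##ding', '##s', 'trained', '?',
#       '[SEP]', 'with', 'masked', 'words', '.', '[SEP]']
print(encoding["token_type_ids"])   # segment ids: 0 for the first sentence, 1 for the second
```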
What is the model of BERT?
The BERT model is just stacked Transformer (encoder) layers. Every layer gets as input the output of the previous layer.
The embeddings of the tokens (including the special tokens) are learned during the training process; we train the model to learn them.
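One way to inspect the stack, assuming the transformers library and the bert-base-uncased checkpoint:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# BERT-base: 12 stacked Transformer encoder layers, 768-dimensional hidden states
print(model.config.num_hidden_layers, model.config.hidden_size)  # 12 768
print(len(model.encoder.layer))                                  # 12 stacked layers
```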
IMPORTANT: SPLADE from the exercises (TODO: look this up)