NLP Flashcards
General methods to create Word embeddings?
- One Hot Encoding
- Bag of Words
- N-Grams
- Tf-Idf
- Integer Encoding
OHE : Advantages & Disadvantages
Advantages - Easy & Intuitive
Disadvantages -
1. Sparse vectors
2. Out of Vocabulary words
3. No semantic meaning
4. Vector size is fixed by the vocabulary size (and grows very large)
BOW : Advantages & Disadvantages
Advantages - Easy & Intuitive
Disadvantages -
1. Sparse vectors
2. Out of Vocabulary words
3. No semantic meaning
N-grams : Advantages & Disadvantages
Advantages - Easy & Intuitive; Captures local word order and context to an extent
Disadvantages -
1. Out of Vocabulary words
2. Computationally expensive
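A minimal sketch of bag-of-words and n-gram vectorization with scikit-learn's CountVectorizer; the two-sentence corpus is illustrative and the library is assumed to be installed.

```python
# Minimal sketch: bag-of-words vs. n-gram features with scikit-learn (assumed installed).
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["today is a sunny day", "the weather is nice today"]

# Plain bag of words (unigrams only)
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# Unigrams + bigrams: the vocabulary (and the vectors) grow quickly,
# which is the "computationally expensive" disadvantage above.
ngrams = CountVectorizer(ngram_range=(1, 2))
print(ngrams.fit_transform(corpus).toarray())
print(ngrams.get_feature_names_out())
```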
What is Tf-Idf
Term Frequency-Inverse Document Frequency
TF measures the frequency of a word in a document.
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
IDF measures how rare a term is across the whole corpus.
IDF(t) = log( (Total number of documents) / (Number of documents containing term t) )
Why do we use Log in IDF?
Without the log, words that occur in very few documents would get an extremely large IDF value, and the contribution of the TF value would be drowned out. Taking the log dampens these values so TF and IDF stay on comparable scales.
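A minimal pure-Python sketch of the TF and IDF formulas above; the three-document corpus is illustrative, and it assumes every queried term appears in at least one document.

```python
# Minimal sketch of the TF and IDF formulas above (pure Python, illustrative corpus).
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

def tf(term, doc):
    # Number of times the term appears in the document / total number of terms in it
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log(total number of documents / number of documents containing the term);
    # assumes the term appears in at least one document.
    n_containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("cat", docs[0], docs))  # appears in 1 of 3 docs: higher IDF
print(tf_idf("the", docs[0], docs))  # appears in 2 of 3 docs: lower IDF
```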
Advanced methods to create Word embeddings?
- Word2Vec
- GloVe
- fastText
- Transformers
What are Word Embeddings?
Vector representation of words.
Embeddings capture the semantic meaning of the text.
Ex - The embeddings of the sentences “today is a sunny day” and “the weather is nice today” will be similar.
What is Word2Vec & GloVe?
Words that appear in similar contexts have similar embeddings.
However, these techniques struggle when a word has multiple meanings, because GloVe and Word2Vec assign a single, fixed representation to each word.
Word2Vec
By Google
Trained on a very large corpus using a shallow neural network, with one of two objectives:
1. Predict the surrounding words given the center word. - Skip-Gram
2. Predict the center word given the surrounding words. - CBOW
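A minimal sketch of training Word2Vec with gensim (assumed installed); the toy corpus and hyperparameters are illustrative, and the `sg` flag switches between Skip-Gram and CBOW.

```python
# Minimal sketch of training Word2Vec with gensim (toy corpus, illustrative hyperparameters).
from gensim.models import Word2Vec

sentences = [
    ["today", "is", "a", "sunny", "day"],
    ["the", "weather", "is", "nice", "today"],
]

# sg=1 -> Skip-Gram (predict surrounding words from the center word)
# sg=0 -> CBOW      (predict the center word from the surrounding words)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["today"].shape)         # the embedding vector for "today"
print(model.wv.most_similar("today"))  # nearest neighbours in embedding space
```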
GloVe
By Stanford
Trained by looking at the co-occurrence matrix of words (how often words appear together within a certain distance) and then using that matrix to obtain the embeddings.
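A minimal sketch of loading pre-trained GloVe vectors through gensim's downloader; it assumes gensim (and access to gensim-data) is available, and the model name is one of the published GloVe releases.

```python
# Minimal sketch: loading pre-trained GloVe vectors via gensim's downloader
# (downloads the vectors on first use; the model name is one published GloVe release).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # 50-dimensional GloVe vectors
print(glove["king"].shape)                  # (50,)
print(glove.most_similar("king", topn=3))   # words with the closest embeddings
```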
fastText
By Facebook
Extends Word2Vec by representing each word as a bag of character n-grams, which helps handle rare and out-of-vocabulary words.
Embeddings using Transformers
Transformers learn embeddings in the context of their task.
For example, BERT learns word embeddings in the context of masked language modeling (predicting which word to fill in the blank) and next sentence prediction (whether sentence B follows sentence A).
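A minimal sketch of extracting contextual token embeddings with Hugging Face transformers; it assumes torch and transformers are installed, and bert-base-uncased is just an illustrative model choice.

```python
# Minimal sketch: contextual token embeddings from BERT with Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("today is a sunny day", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token, including [CLS] and [SEP]
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # (1, number_of_tokens, 768)
```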
CLS token
Classification Token - Represents the entire sentence
SEP token
Separator Token - Separates sentences
Word embedding for a word that is split into more than one token
It can be obtained with a pooling strategy: take the embedding of each sub-word token and average them to get the word embedding.
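A minimal sketch of this sub-word pooling, assuming torch, transformers, and a fast tokenizer (so that word_ids() is available); the sentence and target word are illustrative.

```python
# Minimal sketch: average the sub-word token embeddings of one word
# (assumes a fast tokenizer so that word_ids() is available; the sentence is illustrative).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

encoded = tokenizer("embeddings are useful", return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state[0]  # (seq_len, 768)

# word_ids() maps each token position to the word it came from (None for [CLS]/[SEP])
word_ids = encoded.word_ids()
positions = [i for i, w in enumerate(word_ids) if w == 0]  # sub-word tokens of the first word

word_embedding = hidden[positions].mean(dim=0)  # average the sub-word token embeddings
print(word_embedding.shape)                     # torch.Size([768])
```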
What are Sentence Embeddings?
Vector representation of a sentence.
Sentence embeddings are inherently compressions of the information in a sequence of text (words) and compressions are inherently lossy. This implies that sentence embeddings are representations with a lower level of granularity.
Three ways to compute them:
1. CLS Pooling
2. Max Pooling
3. Mean Pooling
CLS Pooling
Embedding corresponding to the [CLS] token.
The CLS token captures the meaning of the entire sentence.
Used when the transformer model has been fine-tuned on a specific downstream task that makes the [CLS] token very useful.
BERT vs RoBERTa
BERT performs NSP in its original training process so a pre-trained BERT produces meaningful CLS representations out-of-the-box.
RoBERTa does not use NSP in its pre-training process nor does it perform any other task that tunes the CLS token representation. Therefore, it does not produce meaningful CLS representations out-of-the-box.
Why do we need sentence embeddings over word embeddings?
Sentence embeddings are trained for tasks that require knowledge of the meaning of a sequence as a whole rather than the individual tokens.
Some concrete examples of such tasks are:
👉 Sentiment analysis
👉 Semantic similarity
👉 NSP (in BERT’s pre-training)
Cosine Similarity
Cosine of the angle between the embeddings.
It compares how similar two vectors are, regardless of their magnitude.
If the angle between the embeddings is small, the cosine will be close to 1 and as the angle grows, the cosine of the angle decreases.
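A minimal NumPy sketch of cosine similarity; the vectors are illustrative.

```python
# Minimal sketch of cosine similarity between two embedding vectors (NumPy, illustrative values).
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, larger magnitude
c = np.array([-1.0, 0.0, 1.0])  # points in a different direction

print(cosine_similarity(a, b))  # 1.0: zero angle, magnitude ignored
print(cosine_similarity(a, c))  # ~0.38: larger angle, smaller cosine
```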
Mean Pooling
Averaging all the word embeddings of the sentence
More effective on models that have not been fine-tuned on a downstream task. It ensures that all parts of the sentence are represented equally in the embedding.
Max Pooling
Taking the maximum value of each dimension of the word embeddings.
Useful to capture the most important features in a sentence. This can be very useful if particular keywords are very informative, but it might miss the subtler context.
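A minimal sketch of all three pooling strategies applied to BERT token embeddings, assuming torch and transformers; the masking details are one reasonable way to ignore padding, not a specific library's implementation.

```python
# Minimal sketch of the three pooling strategies on BERT token embeddings;
# the masking is one reasonable way to ignore padding, not a specific library's implementation.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("today is a sunny day", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)

mask = inputs["attention_mask"].unsqueeze(-1)       # (1, seq_len, 1), 1 for real tokens

# 1. CLS pooling: the embedding of the first ([CLS]) token
cls_emb = hidden[:, 0]

# 2. Mean pooling: average over the real (non-padding) tokens
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# 3. Max pooling: per-dimension maximum over the real tokens
max_emb = hidden.masked_fill(mask == 0, float("-inf")).max(dim=1).values

print(cls_emb.shape, mean_emb.shape, max_emb.shape)  # each torch.Size([1, 768])
```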
Is the [CLS] pooling strategy better than Word2Vec or GloVe embeddings?
Not necessarily. The [CLS] token of a pre-trained model is not trained to be a good general-purpose sentence embedding; it is trained to be a good sentence representation for next-sentence prediction!