NLP Flashcards

1
Q

General methods to create Word embeddings?

A
  1. One Hot Encoding
  2. Bag of Words
  3. N-Grams
  4. Tf-Idf
  5. Integer Encoding
2
Q

OHE : Advantages & Disadvantages

A

Advantages - Easy & Intuitive
Disadvantages -
1. Sparse vectors
2. Out of Vocabulary words
3. No semantic meaning
4. Fixed Size

3
Q

BOW : Advantages & Disadvantages

A

Advantages - Easy & Intuitive
Disadvantages -
1. Sparse vectors
2. Out of Vocabulary words
3. No semantic meaning

4
Q

N-grams : Advantages & Disadvantages

A

Advantages - Easy & Intuitive; Captures semantic meaning to an extent
Disadvantages -
1. Out of Vocabulary words
2. Computationally expensive
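
A minimal sketch of Bag of Words and n-gram count features, assuming scikit-learn is available; the two-sentence corpus is made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["today is a sunny day", "the weather is nice today"]  # toy corpus

# Bag of Words: sparse unigram count vectors, one per document.
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# N-grams: adding bigrams captures some word order, but the vocabulary
# (and therefore the cost) grows quickly.
bigrams = CountVectorizer(ngram_range=(1, 2))
print(bigrams.fit_transform(corpus).toarray())
print(bigrams.get_feature_names_out())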

5
Q

What is Tf-Idf

A

Term Frequency-Inverse Document Frequency

TF measures how frequently a word occurs within a document:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

IDF measures how rare a term is across the whole corpus:
IDF(t) = log(total number of documents / number of documents containing term t)

The TF-IDF score of a term in a document is TF(t, d) × IDF(t).
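
A minimal sketch, assuming scikit-learn (its TfidfVectorizer uses a smoothed variant of the IDF formula above, but the idea is the same); the corpus is made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["today is a sunny day", "the weather is nice today"]  # toy corpus

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())  # vocabulary terms
print(X.toarray())                         # TF-IDF weight of each term per document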

6
Q

Why do we use Log in IDF?

A

For words that occur in very few documents, the raw ratio inside the IDF would be very large, and the contribution of the TF value would be drowned out. Taking the log dampens the IDF value and keeps the two factors on a comparable scale.

7
Q

Advanced methods to create Word embeddings?

A
  1. Word2Vec
  2. GloVe
  3. fastText
  4. Transformers
8
Q

What are Word Embeddings?

A

Vector representation of words.
Embeddings capture the semantic meaning of the text.

Ex - The embeddings of the sentences “today is a sunny day” and “the weather is nice today” will be similar.

9
Q

What is Word2Vec & GloVe?

A

Words that appear in similar contexts have similar embeddings.
However, these techniques struggle when a word has multiple meanings, because GloVe and Word2Vec have a single, fixed representation for each word.

10
Q

Word2Vec

A

By Google

Trained on a very large corpus with a shallow neural network, using one of two objectives:
1. Predict the surrounding words given the center word. - Skip-Gram
2. Predict the center word given the surrounding words. - CBOW
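
A minimal sketch with gensim (version 4.x API assumed; the tiny corpus and hyperparameters are purely illustrative), showing how the sg flag switches between the two objectives:

from gensim.models import Word2Vec

sentences = [
    ["today", "is", "a", "sunny", "day"],
    ["the", "weather", "is", "nice", "today"],
]

# sg=1 -> Skip-Gram (predict surrounding words from the center word)
# sg=0 -> CBOW      (predict the center word from the surrounding words)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(model.wv["today"].shape)          # (100,) dense embedding for one word
print(model.wv.most_similar("today"))   # nearest words by cosine similarity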

11
Q

GloVe

A

By Stanford

Trained by looking at the co-occurrence matrix of words (how often words appear together within a certain distance) and then using that matrix to obtain the embeddings.

12
Q

fastText

A

By Facebook

Extends Word2Vec by representing each word as a bag of character n-grams, so it can build embeddings for rare and out-of-vocabulary words.

13
Q

Embeddings using Transformers

A

Transformers learn embeddings in the context of their task.

For example, BERT learns word embeddings in the context of masked language modeling (predicting which word to fill in the blank) and next sentence prediction (whether sentence B follows sentence A).
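
A minimal sketch of pulling contextual token embeddings out of a pre-trained BERT, assuming the Hugging Face transformers library and PyTorch:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("today is a sunny day", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual embedding per token, including [CLS] and [SEP].
print(outputs.last_hidden_state.shape)   # (1, num_tokens, 768)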

14
Q

CLS token

A

Classification Token - Represents the entire sentence

15
Q

SEP token

A

Separator Token - Separates sentences

16
Q

Word embedding for a word that is split into more than one token

A

It can be obtained with a pooling strategy: take the embedding of each sub-word token and then average them to get the word embedding (see the sketch below).
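
A minimal sketch of that averaging, assuming Hugging Face transformers with a fast BERT tokenizer; the example word is arbitrary and may be split into several WordPiece tokens:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

enc = tokenizer("tokenization", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]   # (num_tokens, 768)

# word_ids() maps each token position back to its source word
# (None for the special [CLS] and [SEP] tokens).
pieces = [i for i, w in enumerate(enc.word_ids()) if w == 0]
word_embedding = hidden[pieces].mean(dim=0)      # average over the sub-word pieces
print(len(pieces), word_embedding.shape)         # number of pieces, torch.Size([768])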

17
Q

What are Sentence Embeddings?

A

Vector representation of a sentence.
Sentence embeddings are inherently compressions of the information in a sequence of text (words) and compressions are inherently lossy. This implies that sentence embeddings are representations with a lower level of granularity.

Three ways to obtain them (see the sketch after this list):
1. CLS Pooling
2. Max Pooling
3. Mean Pooling
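
A minimal sketch of the three strategies applied to BERT token embeddings, assuming Hugging Face transformers and PyTorch (padding is ignored for brevity):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("today is a sunny day", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]   # (num_tokens, 768)

cls_pooled = hidden[0]                  # 1. CLS Pooling: the [CLS] token's embedding
max_pooled = hidden.max(dim=0).values   # 2. Max Pooling: maximum of each dimension
mean_pooled = hidden.mean(dim=0)        # 3. Mean Pooling: average of all token embeddings
print(cls_pooled.shape, max_pooled.shape, mean_pooled.shape)   # each torch.Size([768])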

18
Q

CLS Pooling

A

Embedding corresponding to the [CLS] token.
The [CLS] token captures the meaning of the entire sentence.

Used when the transformer model has been fine-tuned on a specific downstream task that makes the [CLS] token very useful.

19
Q

BERT vs RoBERTa

A

BERT performs NSP in its original training process so a pre-trained BERT produces meaningful CLS representations out-of-the-box.

RoBERTa does not use NSP in its pre-training process nor does it perform any other task that tunes the CLS token representation. Therefore, it does not produce meaningful CLS representations out-of-the-box.

20
Q

Why do we need sentence embeddings over word embeddings?

A

Sentence embeddings are trained for tasks that require knowledge of the meaning of a sequence as a whole rather than the individual tokens.
Some concrete examples of such tasks are:
👉 Sentiment analysis
👉 Semantic similarity
👉 NSP (in BERT’s pre-training)

21
Q

Cosine Similarity

A

Cosine of the angle between the embeddings.
It compares how similar two vectors are regardless of their magnitude.

If the angle between the embeddings is small, the cosine will be close to 1 and as the angle grows, the cosine of the angle decreases.

22
Q

Mean Pooling

A

Averaging all the word embeddings of the sentence

More effective on models that have not been fine-tuned on a downstream task. It ensures that all parts of the sentence are represented equally in the embedding.

23
Q

Max Pooling

A

Taking the maximum value of each dimension of the word embeddings.

Useful to capture the most important features in a sentence. This can be very useful if particular keywords are very informative, but it might miss the subtler context.

24
Q

Is the [CLS] pooling strategy better than Word2Vec or GloVe for sentence embeddings?

A

No. The [CLS] token is not trained to be a general-purpose sentence embedding; it is trained to be a good representation for next-sentence prediction.

25
Q

Sentence Transformer

A

Also called SBERT.

It uses Mean or Max Pooling.
Another step is normalization, which ensures that the embedding vector has unit length (magnitude = 1).
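
A minimal sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model mentioned later in these cards:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional sentence embeddings

sentences = ["today is a sunny day", "the weather is nice today"]
# normalize_embeddings=True rescales each vector to unit length (magnitude = 1),
# so the dot product of two embeddings equals their cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)                # (2, 384)
print(embeddings[0] @ embeddings[1])   # cosine similarity of the two sentences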

26
Q

Why do we need Normalization?

A

This is particularly helpful when we want to compare vectors.

The magnitude of the embeddings is not relevant when computing the cosine similarity, but it is required for other distance measures.

27
Q

Lexical search system

A

Looks at exact matches of the input question

28
Q

Fuzzy search system

A

Looks at approximate matches of the input question

29
Q

Statistical search system

A

Looks at the frequency of words in the input question

30
Q

Embedding dimensions

A

all-MiniLM-L6-v2 - 384
OpenAI - 1536

Open Source Models - 384 to 1024
Closed Source Models - 384 to 4096

Embedding dimension is a trade-off.
Very large embeddings can potentially give better results, but they are costlier to host and run inference on, and vector databases charge more to store them.

31
Q

Formula of Cosine similarity

A

Cosine similarity is defined as the dot product of the vectors divided by the product of their magnitudes:

cosine(A, B) = (A · B) / (||A|| × ||B||)

32
Q

Dot Product

A

When the magnitude of the vector is important.
Dot Product measures how much one vector extends in the direction of another vector. It focuses on both magnitude and angle.

Cosine similarity is a normalized dot product.

33
Q

Euclidean Distance

A

Distance between two vectors by measuring a straight line between them.
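
A small numpy sketch of the three measures from the last few cards, using toy vectors:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # b points in the same direction as a, just longer

dot = np.dot(a, b)                                       # sensitive to magnitude and angle
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # normalized dot product, in [-1, 1]
euclidean = np.linalg.norm(a - b)                        # straight-line distance

print(dot, cosine, euclidean)   # cosine is 1.0 even though the Euclidean distance is not 0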

34
Q

Temperature Scaled Mixing

A

The temperature parameter allows you to weight certain examples (or datasets, when mixing multiple tasks) more than others.

35
Q

Stemming vs Lemmatization

A

Stemming chops off the last few characters of a word, which often produces stems with incorrect meanings or spellings.

Lemmatization considers the context and converts the word to its meaningful base form, called the lemma.
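
A minimal sketch with NLTK, assuming the package is installed and the WordNet data has been fetched with nltk.download("wordnet"):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "caring", "running"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))

# Stemming just chops characters (e.g. "studies" -> "studi"), while
# lemmatization returns a real base form (e.g. "studies" -> "study").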

36
Q

POS tagging

A

Part-of-Speech tagging: assigning each word in a sentence its grammatical category (noun, verb, adjective, etc.).

37
Q

N-gram

A

A contiguous sequence of N words.
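
A minimal sketch of generating word n-grams from a tokenized sentence:

def ngrams(tokens, n):
    """Return every contiguous sequence of n words from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the weather is nice today".split()
print(ngrams(tokens, 2))   # bigrams: ('the', 'weather'), ('weather', 'is'), ...
print(ngrams(tokens, 3))   # trigrams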

38
Q

Perplexity

A
  • Perplexity measures how uncertain (or “surprised”) a language model is when predicting a sample of text.
  • Perplexity is directly related to entropy, which measures uncertainty: it is the exponential of the average per-token cross-entropy (negative log-likelihood).

Perplexity is used to tell us whether a set of sentences looks like they were written by humans rather than by a simple program choosing words at random. A text that is written by humans is more likely to have lower perplexity, whereas a text generated by random word choice would have a higher perplexity.
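
A minimal sketch of that relationship: perplexity is the exponential of the average negative log-likelihood a language model assigns to the tokens (the probabilities here are made up for illustration):

import math

def perplexity(token_probs):
    """token_probs: probability the model assigned to each observed token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

predictable_text = [0.4, 0.5, 0.3, 0.6]   # the model finds each token fairly likely
random_text = [0.01, 0.02, 0.005]         # the model is "surprised" by almost every token
print(perplexity(predictable_text))       # low perplexity (~2.3)
print(perplexity(random_text))            # high perplexity (~100)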

39
Q

BERT

A

BERT employs an encoder-only architecture. The decision to use an encoder-only architecture in BERT suggests a primary emphasis on understanding input sequences rather than generating output sequences.

Traditional language models process text sequentially, either from left to right or right to left. This limits the model’s awareness to the context preceding the target word. BERT uses a bidirectional approach, considering both the left and right context of each word: instead of analyzing the text sequentially, BERT looks at all the words in a sentence simultaneously.

40
Q

BERT undergoes two steps:
Pre-training & Fine tuning

Explain Pre-training

A

During pre-training, BERT is trained on a large amount of unlabeled data to learn contextual embeddings. It has two primary objectives:

  1. Masked Language Modelling (MLM)
  2. Next Sentence Prediction (NSP)

The model aims to minimize the combined loss function of the Masked LM and Next Sentence Prediction objectives.

41
Q

BERT undergoes two steps:
Pre-training & Fine tuning

Explain Fine tuning

A

BERT is fine-tuned using labeled data specific to the downstream task of interest, such as sentiment analysis, question answering, or named entity recognition.

42
Q

Masked Language Model (MLM)

A

In BERT’s pre-training process, a portion of the words (around 15%) in each input sequence is masked (replaced with the special [MASK] token), and the model is trained to predict the original values of these masked words based on the context provided by the surrounding words.
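
A minimal sketch with the Hugging Face fill-mask pipeline, which uses a pre-trained BERT to predict the masked word from its context:

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The weather is [MASK] today."):
    print(prediction["token_str"], round(prediction["score"], 3))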

43
Q

Next Sentence Prediction (NSP)

A

In the pre-training process, BERT learns to understand the relationship between pairs of sentences, predicting if the second sentence follows the first in the original document.
50% of the input pairs have the second sentence as the subsequent sentence in the original document, and the other 50% have a randomly chosen sentence.

44
Q

Why to train MLM and NSP together?

A

MLM helps BERT to understand the context within a sentence and NSP helps BERT grasp the connection or relationship between pairs of sentences.
Hence, training both the strategies together ensures that BERT learns a broad and comprehensive understanding of language, capturing both details within sentences and the flow between sentences.

45
Q

BERT architecture

A

The architecture of BERT is a multilayer bidirectional Transformer encoder. A Transformer is an encoder-decoder network that uses self-attention on the encoder side and self-attention plus encoder-decoder (cross) attention on the decoder side.

  1. BERTBASE has 12 layers in the Encoder stack while BERTLARGE has 24 layers in the Encoder stack. These are more than the Transformer architecture described in the original paper (6 encoder layers).
  2. BERT architectures (BASE and LARGE) also have larger hidden sizes (768 and 1024 respectively) and more attention heads (12 and 16 respectively) than the original Transformer architecture (512 hidden units and 8 attention heads).
  3. BERTBASE contains 110M parameters while BERTLARGE has 340M parameters.