NLP Flashcards
General methods to create Word embeddings?
- One Hot Encoding
- Bag of Words
- N-Grams
- Tf-Idf
- Integer Encoding
OHE : Advantages & Disadvantages
Advantages - Easy & Intuitive
Disadvantages -
1. Sparse vectors
2. Out of Vocabulary words
3. No semantic meaning
4. Vector size is fixed by the vocabulary size (and grows very large)
BOW : Advantages & Disadvantages
Advantages - Easy & Intuitive
Disadvantages -
1. Sparse vectors
2. Out of Vocabulary words
3. No semantic meaning
N-grams : Advantages & Disadvantages
Advantages - Easy & Intuitive; Captures local word order and context to an extent
Disadvantages -
1. Out of Vocabulary words
2. Computationally expensive
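A minimal sketch of bag-of-words and n-gram vectorization with scikit-learn's CountVectorizer; the two-sentence corpus is illustrative and the library is assumed to be installed.

```python
# Minimal sketch: bag-of-words vs. n-gram features with scikit-learn (assumed installed).
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["today is a sunny day", "the weather is nice today"]

# Plain bag of words (unigrams only)
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# Unigrams + bigrams: the vocabulary (and the vectors) grow quickly,
# which is the "computationally expensive" disadvantage above.
ngrams = CountVectorizer(ngram_range=(1, 2))
print(ngrams.fit_transform(corpus).toarray())
print(ngrams.get_feature_names_out())
```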
What is Tf-Idf
Term Frequency-Inverse Document Frequency
TF measures the frequency of a word in a document.
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
IDF measures how rare a term is across the whole corpus.
IDF(t) = log( (Total number of documents) / (Number of documents containing term t) )
Why do we use Log in IDF?
Without the log, words that occur in very few documents would get an extremely large IDF value, and the contribution of the TF value would be drowned out. Taking the log dampens these values so TF and IDF stay on comparable scales.
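A minimal pure-Python sketch of the TF and IDF formulas above; the three-document corpus is illustrative, and it assumes every queried term appears in at least one document.

```python
# Minimal sketch of the TF and IDF formulas above (pure Python, illustrative corpus).
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

def tf(term, doc):
    # Number of times the term appears in the document / total number of terms in it
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log(total number of documents / number of documents containing the term);
    # assumes the term appears in at least one document.
    n_containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("cat", docs[0], docs))  # appears in 1 of 3 docs: higher IDF
print(tf_idf("the", docs[0], docs))  # appears in 2 of 3 docs: lower IDF
```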
Advanced methods to create Word embeddings?
- Word2Vec
- GloVe
- fastText
- Transformers
What are Word Embeddings?
Vector representation of words.
Embeddings capture the semantic meaning of the text.
Ex - The embeddings of the sentences “today is a sunny day” and “the weather is nice today” will be similar.
What is Word2Vec & GloVe?
Words that appear in similar contexts have similar embeddings.
However, these techniques struggle when a word has multiple meanings, because GloVe and Word2Vec assign a single, fixed representation to each word.
Word2Vec
By Google
Trained on a very large corpus using a shallow neural network, with one of two objectives:
1. Predict the surrounding words given the center word. - Skip-Gram
2. Predict the center word given the surrounding words. - CBOW
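A minimal sketch of training Word2Vec with gensim (assumed installed); the toy corpus and hyperparameters are illustrative, and the `sg` flag switches between Skip-Gram and CBOW.

```python
# Minimal sketch of training Word2Vec with gensim (toy corpus, illustrative hyperparameters).
from gensim.models import Word2Vec

sentences = [
    ["today", "is", "a", "sunny", "day"],
    ["the", "weather", "is", "nice", "today"],
]

# sg=1 -> Skip-Gram (predict surrounding words from the center word)
# sg=0 -> CBOW      (predict the center word from the surrounding words)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["today"].shape)         # the embedding vector for "today"
print(model.wv.most_similar("today"))  # nearest neighbours in embedding space
```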
GloVe
By Stanford
Trained by looking at the co-occurrence matrix of words (how often words appear together within a certain distance) and then using that matrix to obtain the embeddings.
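A minimal sketch of loading pre-trained GloVe vectors through gensim's downloader; it assumes gensim (and access to gensim-data) is available, and the model name is one of the published GloVe releases.

```python
# Minimal sketch: loading pre-trained GloVe vectors via gensim's downloader
# (downloads the vectors on first use; the model name is one published GloVe release).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # 50-dimensional GloVe vectors
print(glove["king"].shape)                  # (50,)
print(glove.most_similar("king", topn=3))   # words with the closest embeddings
```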
fastText
By Facebook
Extends Word2Vec by representing each word as a bag of character n-grams, which helps handle rare and out-of-vocabulary words.
Embeddings using Transformers
Transformers learn embeddings in the context of their task.
For example, BERT learns word embeddings in the context of masked language modeling (predicting which word to fill in the blank) and next sentence prediction (whether sentence B follows sentence A).
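A minimal sketch of extracting contextual token embeddings with Hugging Face transformers; it assumes torch and transformers are installed, and bert-base-uncased is just an illustrative model choice.

```python
# Minimal sketch: contextual token embeddings from BERT with Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("today is a sunny day", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token, including [CLS] and [SEP]
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # (1, number_of_tokens, 768)
```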
CLS token
Classification Token - Represents the entire sentence
SEP token
Separator Token - Separates sentences
Word embedding for a word that is split into more than one token
It can be obtained with a pooling strategy: take the embedding of each sub-word token and average them to get the word embedding.
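A minimal sketch of this sub-word pooling, assuming torch, transformers, and a fast tokenizer (so that word_ids() is available); the sentence and target word are illustrative.

```python
# Minimal sketch: average the sub-word token embeddings of one word
# (assumes a fast tokenizer so that word_ids() is available; the sentence is illustrative).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

encoded = tokenizer("embeddings are useful", return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state[0]  # (seq_len, 768)

# word_ids() maps each token position to the word it came from (None for [CLS]/[SEP])
word_ids = encoded.word_ids()
positions = [i for i, w in enumerate(word_ids) if w == 0]  # sub-word tokens of the first word

word_embedding = hidden[positions].mean(dim=0)  # average the sub-word token embeddings
print(word_embedding.shape)                     # torch.Size([768])
```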
What are Sentence Embeddings?
Vector representation of a sentence.
Sentence embeddings are inherently compressions of the information in a sequence of text (words) and compressions are inherently lossy. This implies that sentence embeddings are representations with a lower level of granularity.
Three ways to compute them:
1. CLS Pooling
2. Max Pooling
3. Mean Pooling
CLS Pooling
Embedding corresponding to the [CLS] token.
The CLS token captures the meaning of the entire sentence.
Used when the transformer model has been fine-tuned on a specific downstream task that makes the [CLS] token very useful.
BERT vs RoBERTa
BERT performs NSP in its original training process so a pre-trained BERT produces meaningful CLS representations out-of-the-box.
RoBERTa does not use NSP in its pre-training process nor does it perform any other task that tunes the CLS token representation. Therefore, it does not produce meaningful CLS representations out-of-the-box.
Why do we need sentence embeddings over word embeddings?
Sentence embeddings are trained for tasks that require knowledge of the meaning of a sequence as a whole rather than the individual tokens.
Some concrete examples of such tasks are:
👉 Sentiment analysis
👉 Semantic similarity
👉 NSP (in BERT’s pre-training)
Cosine Similarity
Cosine of the angle between the embeddings.
It compares how similar two vectors are, regardless of their magnitude.
If the angle between the embeddings is small, the cosine will be close to 1 and as the angle grows, the cosine of the angle decreases.
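A minimal NumPy sketch of cosine similarity; the vectors are illustrative.

```python
# Minimal sketch of cosine similarity between two embedding vectors (NumPy, illustrative values).
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, larger magnitude
c = np.array([-1.0, 0.0, 1.0])  # points in a different direction

print(cosine_similarity(a, b))  # 1.0: zero angle, magnitude ignored
print(cosine_similarity(a, c))  # ~0.38: larger angle, smaller cosine
```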
Mean Pooling
Averaging all the word embeddings of the sentence
More effective on models that have not been fine-tuned on a downstream task. It ensures that all parts of the sentence are represented equally in the embedding.
Max Pooling
Taking the maximum value of each dimension of the word embeddings.
Useful to capture the most important features in a sentence. This can be very useful if particular keywords are very informative, but it might miss the subtler context.
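A minimal sketch of all three pooling strategies applied to BERT token embeddings, assuming torch and transformers; the masking details are one reasonable way to ignore padding, not a specific library's implementation.

```python
# Minimal sketch of the three pooling strategies on BERT token embeddings;
# the masking is one reasonable way to ignore padding, not a specific library's implementation.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("today is a sunny day", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)

mask = inputs["attention_mask"].unsqueeze(-1)       # (1, seq_len, 1), 1 for real tokens

# 1. CLS pooling: the embedding of the first ([CLS]) token
cls_emb = hidden[:, 0]

# 2. Mean pooling: average over the real (non-padding) tokens
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# 3. Max pooling: per-dimension maximum over the real tokens
max_emb = hidden.masked_fill(mask == 0, float("-inf")).max(dim=1).values

print(cls_emb.shape, mean_emb.shape, max_emb.shape)  # each torch.Size([1, 768])
```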
Is the [CLS] pooling strategy better than Word2Vec or GloVe embeddings?
Not necessarily. The [CLS] token of a pre-trained model is not trained to be a good general-purpose sentence embedding; it is trained to be a good sentence representation for next-sentence prediction!