NLP and Deep Learning Flashcards

1
Q

What is Syntactic Analysis?

A

Syntactic Analysis (syntax analysis, parsing): text analysis that tells us the logical structure behind a sentence or part of a sentence.

It focuses on the relationships between words and the grammatical structure of sentences. You can also say that it is the process of analysing natural language using grammatical rules.

2
Q

How would you reduce the inference time of a trained transformer model?

A

To improve inference time, we can use:

  • GPU, TPU, or FPGA for acceleration.
  • GPU with fp16 support
  • Pruning to reduce parameters
  • Knowledge distillation
  • Hierarchical softmax or adaptive softmax
  • Cache predictions
  • Parallel/batch computing
  • Reduce the model size
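
A rough sketch of two of the options above (FP16 plus batching) in PyTorch; the linear model and random batches are placeholders for a real trained transformer and real inputs:

import torch
import torch.nn as nn

# Placeholder model and batches -- stand-ins for a trained transformer and real data
model = nn.Linear(512, 2)
batches = [torch.randn(32, 512) for _ in range(4)]

model.eval()                                  # inference mode: disable dropout etc.
if torch.cuda.is_available():                 # FP16 acceleration needs a suitable GPU
    model = model.half().to("cuda")
    with torch.no_grad():                     # no gradient tracking at inference time
        for batch in batches:
            logits = model(batch.half().to("cuda"))
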
3
Q

What is FP16 support (when talking about GPUs)?

A

Using half-precision (also known as FP16) data instead of higher-precision FP32 or FP64 reduces the memory usage of a neural network.

FP32 and FP16 mean 32-bit floating point and 16-bit floating point. GPUs originally focused on FP32.
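
A small PyTorch sketch showing the memory difference (the tensor is just an illustration):

import torch

x32 = torch.randn(1024, 1024)   # FP32: 4 bytes per value
x16 = x32.half()                # FP16: 2 bytes per value

print(x32.element_size() * x32.nelement())  # ~4 MB
print(x16.element_size() * x16.nelement())  # ~2 MB -- roughly half the memory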

4
Q

What is Reinforcement Learning?

A

Reinforcement learning uses trial and error to reach goals. It is a goal-oriented algorithm and it learns from the environment by taking correct steps to maximize the cumulative reward.

In typical reinforcement learning:

  1. At the start, the agent receives state zero from the environment
  2. Based on the state, the agent will take an action
  3. The state has changed, and the agent is at a new place in the environment.
  4. The agent receives the reward if it has made the correct move.
  5. The process will repeat until the agent has learned the best possible path to reach the goal by maximizing the cumulative rewards.
5
Q

What is the activation function in Deep Learning?

A

The activation function is a non-linear transformation in neural networks. We pass the input through the activation function before passing it to the next layer.

The net input to a neuron can be anywhere between -inf and +inf, and on its own the neuron has no way of bounding these values or deciding its firing pattern. The activation function bounds the net input values and decides whether the neuron should be activated or not.

Most common types of Activation Functions:

Step Function
Sigmoid Function
ReLU
Leaky ReLU
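
Minimal NumPy sketches of these functions (illustrative only):

import numpy as np

def step(x):
    return np.where(x >= 0, 1.0, 0.0)          # binary on/off

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # squashes values into (0, 1)

def relu(x):
    return np.maximum(0.0, x)                  # zero for negatives, identity for positives

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)      # small slope for negatives instead of zero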

6
Q

What is the Naive Bayes algorithm, and when can we use it in NLP?

A

Naive Bayes algorithm: supervised learning algorithm, based on Bayes theorem and used for solving classification problems.

It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other.

Compared with discriminative models such as logistic regression, a Naive Bayes model takes less time to train, converges faster, and requires less training data.

Assumes that each feature makes an:
- independent
- equal
contribution to the outcome.

This algorithm works well with multiple classes and with text classification where the data is dynamic and changes frequently.
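
A minimal text-classification sketch with scikit-learn (the tiny spam/ham dataset is made up purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at 10 tomorrow", "free prize waiting", "see you at the meeting"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)            # bag-of-words counts

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["free meeting prize"])))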

7
Q

What is Bayes' Theorem?

A

Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred.

Stated mathematically by:

P(A|B) = P(B|A) · P(A) / P(B)

where A and B are events and P(B) ≠ 0.

We are trying to find the probability of event A, given that event B is true.

Event B is also termed the evidence.
P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, event B).

P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
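
A tiny worked example with hypothetical numbers:

# P(A): prior, e.g. the probability an email is spam before seeing any evidence
p_a = 0.2
# P(B|A): probability of the evidence (e.g. a particular word) given A
p_b_given_a = 0.5
# P(B): overall probability of the evidence
p_b = 0.15

p_a_given_b = (p_b_given_a * p_a) / p_b   # Bayes' theorem
print(p_a_given_b)                        # ~0.667 -- the posterior P(A|B)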

8
Q

What is Bag of Words?

A

A commonly used model that depends on word frequencies or occurrences to train a classifier.

This model creates an occurrence matrix for documents or sentences, irrespective of their grammatical structure or word order.
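
A minimal sketch using scikit-learn's CountVectorizer (the sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)        # occurrence matrix; word order is ignored

print(vectorizer.get_feature_names_out())      # the learned vocabulary
print(matrix.toarray())                        # per-document word counts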

9
Q

What is Pragmatic Ambiguity in NLP?

A

Pragmatic ambiguity refers to words that have more than one meaning, where the intended meaning in a given sentence depends entirely on the context.

e.g. ‘the river bank’ and ‘withdraw from the bank’

10
Q

How can you compute the distance between two word vectors in NLP?

A

Euclidean distance: the length of the shortest path connecting two points

Cosine Similarity: the cosine of the angle between the vectors of two words
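
Both can be computed directly with NumPy (the vectors below are placeholders):

import numpy as np

v1 = np.array([1.0, 2.0, 3.0])   # stand-in word vectors
v2 = np.array([2.0, 1.0, 4.0])

euclidean = np.linalg.norm(v1 - v2)                                 # straight-line distance
cosine_sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))    # cosine of the angle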

11
Q

How can you reduce the dimensions of NLP data?

A

Keyword Normalization
Latent Semantic Indexing
Latent Dirichlet Allocation

12
Q

What is TF-IDF?

A

TF (term frequency) and IDF (inverse document frequency)

K = number of occurrences of the term in the document
T = total number of terms in the document

TF = K / T
IDF = log(total number of documents / number of documents containing the term)
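
A small worked sketch with hypothetical counts:

import math

k = 3             # occurrences of the term in the document
t = 100           # total terms in the document
n_docs = 1000     # total documents in the corpus
n_docs_term = 10  # documents containing the term

tf = k / t                            # 0.03
idf = math.log(n_docs / n_docs_term)  # log(100), natural log here
tf_idf = tf * idf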

13
Q

What is Attention?

A

Like gravity! But based on similarity of pre-conceived word embeddings

When words have different meanings in different contexts (pragmatic ambiguity), context specific embeddings are created based on cosine similarity between embeddings

e.g.
We have an original embedding for bank
When ‘bank’ is seen in the context of ‘river’ a new embedding is created for ‘bank’ which is weighted towards ‘river’

How?
1. Suppose the cosine similarity between 'river' and 'bank' is 0.11 (and between 'bank' and itself it is 1)
2. To calculate the ratios for the new embedding, we take:
1 + 0.11 = 1.11 (cosine for 'bank + bank' plus cosine for 'river + bank')
3. Normalise the cosines to calculate the weights for the new embedding:
Bank = 1 / 1.11 ≈ 0.9
River = 0.11 / 1.11 ≈ 0.1
4. Create new embedding using 90% bank original embedding + 10% river embedding

In translation tasks, attention is needed to assist translation of each word; e.g. the chair -> we need to know if the noun is masculine/feminine to accurately translate ‘the’
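
A tiny sketch of the weighting in the example above (a simplified illustration; real attention uses learned query/key/value projections rather than raw cosine similarities):

sim_bank_bank = 1.0     # similarity of 'bank' with itself
sim_bank_river = 0.11   # similarity of 'bank' with 'river'

total = sim_bank_bank + sim_bank_river   # 1.11
w_bank = sim_bank_bank / total           # ~0.90
w_river = sim_bank_river / total         # ~0.10

# new_bank_embedding = w_bank * bank_embedding + w_river * river_embedding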

14
Q

What is a Transformer model?

A

Most popular model architecture for NLP

Formed of encoder and decoder sections

Can have different model types:

Encoder only - BERT etc
Decoder only - GPT-3, Transformer-XL
Seq-to-seq (encoder-decoder) - BART

15
Q

What are the architectural features of a transformer model?

A

Encoder and/or decoder parts

Encoder:
1. Tokenisation
2. Embedding generation
3. Positional encoding

Decoder:
1. Series of transformer blocks, each containing a self-attention step and a feed-forward step
2. Takes encoder embeddings and previously generated decoder output as inputs

16
Q

What is ‘positional encoding’ in transformer models?

A

In the encoder, positional encoding:

  1. Takes word embedding
  2. Adds pre-defined sequential vector to original embedding

This:
- Preserves order of words
- Ensures sentences with different word orders have different embeddings
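
One common choice of pre-defined vector is the sinusoidal encoding from the original Transformer paper; a rough NumPy sketch:

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                          # token positions
    i = np.arange(d_model)[None, :]                            # embedding dimensions
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])                     # sine on even dimensions
    enc[:, 1::2] = np.cos(angles[:, 1::2])                     # cosine on odd dimensions
    return enc

# embeddings_with_position = word_embeddings + positional_encoding(seq_len, d_model)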

17
Q

What does ‘multi-headed attention’ mean?

A

In a transformer model, the decoder is formed from a series of transformer blocks which modify word embeddings.

Each block contains a self-attention step and a feed-forward mechanism. Within the self-attention step, several attention 'heads' run in parallel, each computing attention over its own learned projection of the embeddings, and their outputs are combined. This is what is meant by multi-headed attention.

18
Q

What is a feed-forward mechanism?

A

In a transformer model, the decoder is formed from a series of transformer blocks which modify word embeddings.

The feed-forward mechanism is a neural network contained in each of these blocks.

19
Q

What is SoftMax?

A

Softmax is a post-processing step that takes place in a transformer model after the decoder (on the logits).

It applies a lower and upper bound to the logits so that they are interpretable.

The outputs then sum to 1, allowing a probabilistic interpretation.

Softmax converts transformer scores into probabilities; for text-generation tasks, a high-probability word is then sampled from this distribution (but not necessarily the same one each time). This helps the model avoid giving exactly the same answer every time.
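
A minimal NumPy sketch of softmax followed by probabilistic word selection (the logits are made up):

import numpy as np

def softmax(logits):
    exp = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))          # bounded to (0, 1) and summing to 1
next_word = np.random.choice(len(probs), p=probs)   # usually a high-probability word, but not always the same one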

20
Q

What is semantic search?

A

Using word/sentence MEANING to conduct a search.

Uses K-nearest neighbours or Annoy (approximate nearest neighbours)

Note the direction of the scale: with distance measures, lower numbers mean closer, whereas with similarity measures (e.g. cosine similarity), higher numbers mean closer.

21
Q

What is the problem with K-Nearest Neighbours and what are 4 solutions for this?

A

K-Nearest Neighbours: compute grows with dataset size, because the distance to every data point must be calculated for each query

Solutions:
1. Inverted file index (cluster then search for nearest neighbours) e.g. Annoy
2. Hierarchical Navigable Small Worlds (HNSW) (start with a few points, search there, then add more points using previous knowledge and iterate)
3. Re-ranking (separate model to rank and select best answer from all options)
4. Positive and negative pairs (update sentence embeddings to make correct answers closer and incorrect further from question embedding)

22
Q

What are the five evaluation metrics for binary classifiers?

A
  • Accuracy
  • Precision
  • Recall
  • Specificity
  • F1 score
23
Q

How do you calculate Accuracy and why is it often not the best metric?

A

Number of correct predictions
/
Total number of predictions

Accuracy can be misleading when data is imbalanced; if True Negatives are dominant, the model will have high accuracy just by predicting the majority class

24
Q

How do you calculate precision?

A

TP
/
TP + FP

Precision is the proportion of POSITIVE PREDICTIONS that were correct - of all the positive predictions, how many were actually positive

25
Q

How do you calculate recall?

A

TP
/
TP + FN

Recall is the proportion of ACTUAL POSITIVES that were captured - of all the truly positive data points, how many did the model correctly predict

26
Q

How do you calculate specificity?

A

TN
/
TN + FP

Specificity is the proportion of ACTUAL NEGATIVES that were captured (the same as recall, but for the negative class)

27
Q

How do you calculate F1 score and when should this be used?

A

2 * Precision * Recall
/
Precision + Recall

There is a balance that needs to be achieved between precision and recall which depends on business use case. F1 should be used when both precision and recall are equally important
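
A small sketch computing all five metrics from hypothetical confusion-matrix counts:

# Hypothetical counts purely for illustration
tp, fp, fn, tn = 40, 10, 5, 45

accuracy    = (tp + tn) / (tp + tn + fp + fn)       # 0.85
precision   = tp / (tp + fp)                        # 0.80
recall      = tp / (tp + fn)                        # ~0.89
specificity = tn / (tn + fp)                        # ~0.82
f1          = 2 * precision * recall / (precision + recall)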

28
Q

How do you calculate precision and recall for a multiclass classifier?

A

Precision and recall are calculated separately for each class.

-True Positives are correct predictions of the class in question
-False Negatives are when the true class was predicted as any other category
-False Positives are when any other category was predicted as the class in question

29
Q

What is the Macro-Average?

A

In a multiclass classifier, where reporting precision/recall etc. for every class individually is impractical, the average of these per-class metrics can be taken to evaluate overall model performance

e.g.
macro-precision
macro-recall etc

30
Q

What is UMAP?

A

Uniform Manifold Approximation and Projection for Dimension Reduction

This is a method of dimensionality reduction which can be used to reduce word embeddings down to 2 dimensions

31
Q

What is PCA?

A

Principal Component Analysis

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space

Can project down to any number of dimensions
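
A minimal scikit-learn sketch (the embeddings are random stand-ins for real word embeddings):

import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(100, 300)    # stand-in for 100 word embeddings of dimension 300

pca = PCA(n_components=2)                # project down to 2 dimensions, e.g. for plotting
reduced = pca.fit_transform(embeddings)
print(reduced.shape)                     # (100, 2)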

32
Q

What are the four main uses of word/document embeddings?

A
  1. Semantic search - recommender systems, Q&A
  2. Clustering - grouping documents
  3. Classification - spam/no spam
  4. Topic modelling (similar to clustering)
33
Q

What are some approaches to Topic Modelling?

A
  1. K-Means clustering
  2. BERTopic (uses HDBSCAN - clustering without a pre-defined number of clusters)
  3. LDA (Latent Dirichlet Allocation) - dimensionality reduction
  4. Gaussian mixture models
34
Q

What is model fine-tuning?

A

Adapting foundational language models to your own data

e.g. for a classification task, embeddings will be modified to more distinctly separate the classes

35
Q

What is Generative AI?

A

AI that creates or generates new data

e.g. GPT-3, DALL-E etc

36
Q

What is a tensor?

A

Essentially a multi-dimensional array, like a NumPy array, but with support for GPU acceleration and automatic differentiation. Tensors are the data structure deep-learning models work with.

37
Q

What is the syntax to pre-process text into tensors ready for a model with the Hugging Face AutoTokenizer?

A

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "list_of_text_strings",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

Here we are preparing tensors for a PyTorch implementation.

38
Q

How do you instantiate a pre-trained model in Hugging Face?

A

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

39
Q

What is the model head in a transformer model?

A

An additional component, usually made up of one or a few layers, to convert the transformer predictions to a task-specific output

e.g. language modelling heads, question answering heads, sequence classification heads…

40
Q

What is zero shot classification?

A

Zero-shot-classification allows you to specify which labels to use for a classification, so you don’t have to rely on the labels of the pretrained model

This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise.

41
Q

What are common tasks for Transformer models in NLP?

A

feature-extraction (get the vector representation of a text)
fill-mask
ner (named entity recognition)
question-answering
sentiment-analysis
summarization
text-generation
translation
zero-shot-classification

42
Q

What is self-supervised learning?

A

Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model.

That means that humans are not needed to label the data!

43
Q

What is transfer learning?

A

A general pretrained model is fully or partially frozen, additional layers are added and trained on a given task via supervised learning, e.g. fill-mask, text-generation etc

Different to standard fine-tuning because this is training for a TASK rather than training for your data

Can be followed by full model fine-tuning

44
Q

How can you evaluate model training CO2 emissions?

A

ML CO2 Impact or Code Carbon which is integrated in 🤗 Transformers

45
Q

What is a model architecture?

A

This is the skeleton of the model — the definition of each layer and each operation that happens within the model.

There are multiple model architectures available in Hugging Face, including:

*Model (retrieve the embeddings)
*ForCausalLM
*ForMaskedLM
*ForMultipleChoice
*ForQuestionAnswering
*ForSequenceClassification
*ForTokenClassification
and others 🤗

46
Q

What are model checkpoints?

A

These are the weights that will be loaded in a given architecture.

47
Q

What is post-processing with transformer models?

A

All 🤗 Transformers models output the logits: the raw, unnormalized scores outputted by the last layer of the model.

To be converted to probabilities, they need to go through a SoftMax layer and then be assigned a class:

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

model.config.id2label
{0: 'NEGATIVE', 1: 'POSITIVE'}

if the output tensor is:
tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)

We can see that the model predicted:
First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005

48
Q

What are the pre-processing steps needed for transformer models?

A

A tokeniser is used for this process. HuggingFace AutoTokenizer handles all of this for you (except step 4!)

  1. Convert input sentences to tokens
  2. Convert tokens to vocabulary indices (aka input IDs) - a list of lists containing ints
  3. Convert input IDs into tensors specific to the deep-learning framework in use, e.g. for PyTorch call:
    torch.tensor(input_ids)
  4. Tensors can be fed to model:
    output = model(model_inputs)

There are multiple rules that can govern step 1, which is why we need to instantiate the tokenizer using the same model we want to run, to make sure we use the same rules that were used when the model was pretrained.

Tokenisers can also decode input IDs back to strings

49
Q

What are the three different types of tokenisation?

A
  1. Word-based
  2. Character based
  3. Subword based - frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords - e.g. WordPiece, BPE, Unigram
50
Q

What is tensor padding?

A

Padding is used to make tensors have a rectangular shape.

Padding makes sure all sentences have the same length by adding a special word called the padding token to the sentences with fewer values.

For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words.

Padding needs to be ignored by models - this is done with Attention Masks

51
Q

What is an Attention Mask?

A

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s:

  • 1s indicate the corresponding tokens should be attended to
  • 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
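
A small sketch (reusing the tokenizer checkpoint from the earlier card) showing padding and the attention mask it produces:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

batch = tokenizer(
    ["A short sentence.", "A rather longer sentence that needs no padding at all."],
    padding=True, return_tensors="pt",
)

print(batch["input_ids"])       # the shorter sequence is padded with the padding token ID
print(batch["attention_mask"])  # 1 = attend to this token, 0 = padding token, ignore
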
52
Q

What can you do if you have very long sequences of text that need to be processed by a transformer?

A
  1. Use a model with a longer supported sequence length, such as Longformer or LED
  2. Truncate your sequences by specifying the max_sequence_length parameter:

sequence = sequence[:max_sequence_length]

53
Q

What are special tokens in Transformers?

A

The HF tokenizer may add special words such as [CLS] and [SEP] at the beginning/end of a sequence.

This occurs when the model is pretrained with those, so to get the same results for inference we need to add them as well.

Some models don’t add special words, or add different ones; models may also add special words only at the beginning, or only at the end.

The HF tokenizer knows which ones are expected and will deal with this for you.

54
Q

How many dimensions does the tensor output by the base Transformer model have, and what are they?

A

3: The batch size, the sequence length, and the hidden size

55
Q

What is an AutoModel in HuggingFace?

A

An object that returns the correct architecture based on the model checkpoint provided

This is a general HF object that can be used with multiple different transformer architectures

56
Q

What are the attributes of a PyTorch tensor?

A

Tensor attributes describe their shape, datatype, and the device on which they are stored.

  • tensor.shape
  • tensor.dtype
  • tensor.device
57
Q

Where are PyTorch tensors created?

A

By default, tensors are created on the CPU.

We need to explicitly move tensors to the GPU using the .to method (after checking for GPU availability).

Keep in mind that copying large tensors across devices can be expensive in terms of time and memory!

if torch.cuda.is_available():
    tensor = tensor.to("cuda")

58
Q

What does ToTensor() do in PyTorch?

A

ToTensor converts a PIL image or NumPy ndarray into a FloatTensor and scales the image's pixel intensity values into the range [0., 1.]

59
Q

In PyTorch, how do Modules work together to form neural network architectures?

A

Neural networks are composed of layers/modules that perform operations on data.

The torch.nn namespace provides all the building blocks you need to build your own neural network.

Every module in PyTorch subclasses the nn.Module. A neural network is a module itself that consists of other modules (layers). This nested structure allows for building and managing complex architectures easily.

60
Q

In PyTorch, what does nn.Flatten do?

A

Converts each 2D array into a contiguous 1D array.

E.g. a 28x28 image will be converted to a contiguous array of 784 pixel values.

A batch of three images:
torch.Size([3, 28, 28])

will be transformed into 3 long rows:
torch.Size([3, 784])

61
Q

What is back-propagation?

A

Backpropagation is a process involved in training a neural network.

Parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.

(Taking the error rate of a forward propagation and feeding this loss backward through the neural network layers to fine-tune the weights.)

To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradient for any computational graph.

62
Q

What is PyTorch?

A

A deep-learning framework for Python, making it easy to develop and deploy deep-learning models, and to leverage GPUs

63
Q

What happens when you train a Neural Network?

A

Neural networks (NNs) are a collection of nested functions that are executed on some input data. These functions are defined by parameters (consisting of weights and biases), which in PyTorch are stored in tensors.

In a neural network, weights are updated as follows:

Step 1: Take a batch of training data.
Step 2: Perform forward propagation (best guess about the correct output) to obtain the corresponding loss.
Step 3: Backpropagate the loss to get the gradients (traverse backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (gradients))
Step 4: Use the gradients to update the weights of the network.
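
A minimal PyTorch sketch of these four steps (the model and data are toy stand-ins, not a real network or dataset):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # toy model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    inputs = torch.randn(32, 10)               # Step 1: a batch of training data
    targets = torch.randint(0, 2, (32,))
    loss = loss_fn(model(inputs), targets)     # Step 2: forward propagation and loss
    optimizer.zero_grad()
    loss.backward()                            # Step 3: backpropagate to get the gradients
    optimizer.step()                           # Step 4: use the gradients to update the weights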

64
Q

In deep learning, what is Learning Rate?

A

The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.

Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process.

Adam is a popular method that adapts the learning rate dynamically during training.

65
Q

What is the stochastic gradient descent algorithm?

A

Stochastic gradient descent (SGD) is an optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm, referred to as simply backpropagation.

The amount that the weights are updated during training is referred to as the step size or the “learning rate.”

66
Q

What is momentum in deep learning?

A

Momentum is applied with the stochastic gradient descent algorithm.

It calculates exponentially weighted averages of the gradients computed on the training data, which can provide a better estimate that is closer to the actual derivative than our noisy calculations.

SGD estimates the derivative of our loss function on a small batch, rather than calculating the exact value. This means we're not always moving in the optimal direction, because our derivatives are 'noisy' - momentum helps with this.

67
Q

What are the two types of spelling error?

A
  1. Non-word errors
    e.g graffe -> giraffe
    Context insensitive
  2. Real word errors
    e.g. too -> two
    Context sensitive
68
Q

How can you detect and correct non-word spelling errors?

A

Detect: Any word not in a dictionary

Correct: Generate candidates: real words that are similar to the error (similar pronunciation or similar spelling)
Choose the one which is best:
- Shortest weighted edit distance
- Highest noisy channel probability
- Noisy channel model for correction

69
Q

What is Levenshtein distance?

A

The minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one word into another

e.g. there -> three = 2
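
A small dynamic-programming sketch:

def levenshtein(a, b):
    # Minimum number of single-character insertions, deletions, or substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("there", "three"))  # 2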

70
Q

What are the four main types of recommender system?

A
  1. Popularity based - recommends items that are currently trending
  2. Content based filtering - similarity score to item of interest based on pre-defined item features
  3. Collaborative filtering
    a. User based - recommend items to a user that similar users have liked
    b. Item based - recommend items similar to the items that the user has previously liked
  4. Knowledge-based systems
71
Q

What are three approaches to evaluate recommender systems?

A
  1. User studies - users are presented with recommendations from different algorithms and choose which is best
  2. Online evaluations (A/B tests)
  3. Offline evaluations - based on historic data (how user previously rated movie)
72
Q

What are four commonly used distance metrics in ML?

A
  1. Euclidean (distance between vectors)
  2. Manhattan (sum of the absolute differences along each axis - the right-angled path between two vectors)
  3. Cosine (degree of angle between vectors)
  4. Chebyshev
73
Q

Which algorithms are commonly used for search?

A
  1. K Nearest Neighbours - Euclidean distance to extract K closest data points to query BUT have to calculate distance for every data point
  2. Approximate Nearest Neighbours - make use of techniques like indexing, clustering, hashing, and quantization to significantly improve computation and storage at the cost of some loss in accuracy
74
Q

In information retrieval, what are the three algorithmic approaches for Approximate Nearest Neighbour search?

A
  1. Tree based (e.g. Annoy - recursively split vector space into further sub-spaces at each tree node, until contain x data points)
  2. Locality-sensitive hashing (LSH) (e.g. FAISS - bucket similar samples through a hash function, usually shingling, MinHashing, and banding)
  3. Quantization (split dataset vectors into smaller vectors, use k-means on each data subset, use centroids for computing distance from query vector, sort data in that cluster)
75
Q

What are the steps involved in transfer learning?

A
  1. Take layers from a previously trained model.
  2. Freeze them to avoid altering any of the information they contain during future training rounds.
  3. Disregard pretrained model head
  4. Add some new, trainable layers on top of the frozen layers. They will learn to turn old features into predictions on a new dataset.
    e.g. replace with a randomly initialized classification head. You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.
  5. Train the new layers on your dataset.
    a. Decide hyperparameters and ranges to train with
    b. Define evaluation metrics
    c. Create Trainer object with model, eval metrics, hyperparameter args, then train
76
Q

What is few-shot learning?

A

Few-shot learning, also known as low-shot learning, uses a small set of examples from new data to learn a new task.

Few-shot learning is commonly used with OpenAI models, as GPT-3 is a few-shot learner.

77
Q

What is fine tuning?

A

A pretrained model (either all layers or a subset of layers) is trained on a dataset specific to the task or domain (e.g. legal documents)

This can follow transfer learning where task-specific layers are added to the model

78
Q

What is Dropout in neural network model training?

A

Dropout is a technique meant to prevent overfitting the training data by dropping out units in a neural network.

In practice, neurons are either dropped with probability p or kept with probability 1−p.

79
Q

In information retrieval, what are the key differences between TF-IDF and BM25?

A

These are both popular ranking algorithms, aiming to evaluate the relevance of documents to a query

TF-IDF: rewards term frequency and penalizes document frequency. Assumes all query terms are independent

BM25 uses a probabilistic approach to account for document length and term-frequency saturation. It calculates probability based on term and document frequency, and document length. It does not assume independent query terms - it allows for co-occurrence

80
Q

Do gradient descent methods always converge to similar points?

A

They do not: in some cases they reach a local minimum or a local optimum rather than the global optimum. This is governed by the data and the starting conditions.

81
Q

What is Active Learning?

A

Active learning is an approach where the model actively selects the most informative examples from unlabeled data for an expert to label.

This method allows us to efficiently utilize the expert’s time and knowledge while maximizing the impact of their input on the model’s performance.

82
Q

What are the steps taken when a transformer model is run?

A
  1. Each word forming an input sequence is transformed into an embedding vector.
  2. Each embedding vector representing an input word is augmented by summing it (element-wise) to a positional encoding vector of the same length, hence introducing positional information into the input.
  3. The augmented embedding vectors are fed into the encoder block consisting of a multi-head self-attention mechanism and a feed-forward block. Since the encoder attends to all words in the input sequence, irrespective of whether they precede or succeed the word under consideration, the Transformer encoder is bidirectional.
  4. The decoder receives as input its own predicted output word from the previous time-step. The input to the decoder is also augmented by positional encoding in the same manner done on the encoder side.
  5. The augmented decoder input is fed into the three sublayers comprising the decoder block: masked multi-head attention, multi-head attention, and feed-forward blocks. Masking is applied in the first sublayer in order to stop the decoder from attending to the succeeding words. At the second sublayer, the decoder also receives the output of the encoder, which now allows the decoder to attend to all the words in the input sequence.
  6. The output of the decoder finally passes through a fully connected layer, followed by a softmax layer, to generate a prediction for the next word of the output sequence. For text generation a word with high probability is selected.
83
Q

What is Retrieval Augmented Generation (RAG)?

A

RAG is used to get an LLM to answer prompts grounded in your custom data.

Documents that you want to use to provide the LLM custom context must be stored in a vector database, so that when needed the relevant documents can be looked up, retrieved, and consumed by the LLM as extra context prior to generating a response.

This not only provides the LLM custom data and context where it previously had none, but it also helps prevent hallucinations, since you provide the LLM with relevant information, greatly reducing the chances it will make something up to fulfil the prompt.

The approach combines the language model with an external storage provider, and creates an overall software system that can orchestrate the interactions with and between these components in order to create a "chat with your data" experience.

Weaviate Cloud Services can be used for this

84
Q

What is LlamaIndex?

A

LlamaIndex provides a simple interface between LLMs and external data sources.

It is designed to simplify the process of searching and summarizing documents using a conversational interface powered by large language models (LLMs).

LangChain serves as the foundation for much of the tool’s functionality.

Graph indexes are a key feature of LlamaIndex, helping to efficiently organize and optimize the data it processes.

Overall, the goal of LlamaIndex is to enhance document management through advanced technology, providing an intuitive and efficient way to search and summarize documents using LLMs and innovative indexing techniques.

85
Q

What is LangChain?

A

LangChain provides a framework for building and managing LLM-powered applications.

86
Q

When it comes to vector DBs, what is chunking?

A

A standard way to index unstructured data is to split the source documents into text “chunks”, embed each chunk, and store each chunk/embedding in a vector database.

This creates a more accurate vector representation, and works within the token-size limitations of LLMs

87
Q

With LLMs, how can you enable conversation memory (with LangChain)?

A

Langchain provides memory objects that can be passed around in chains or you can use them standalone to investigate the history of an interaction, extract a summary, etc.

The three key options are:
1. ChatMessageHistory object
2. ConversationBufferMemory object
3. Saving History in a file

88
Q

In LangChain, what are chains?

A

Chains allow us to combine multiple components together to create a single, coherent application. For example, we can create a chain that takes user input, formats it with a PromptTemplate, and then passes the formatted response to an LLM.

We can build more complex chains by combining multiple chains together, or by combining chains with other components.

89
Q

In LangChain, what are Agents?

A

Some applications require a flexible chain of calls to LLMs and other tools based on user input. The Agent interface provides the flexibility for such applications.

An agent has access to a suite of tools, and determines which ones to use depending on the user input. Agents can use multiple tools, and use the output of one tool as the input to the next.

There are two main types of agents:
1. Action agents: at each timestep, decide on the next action using the outputs of all previous actions
2. Plan-and-execute agents: decide on the full sequence of actions up front, then execute them all without updating the plan

90
Q

In LangChain, what are Tools?

A

Tools are interfaces that an agent can use to interact with the world.

e.g. ‘openweathermap-api’, ‘llm-math’

An important aspect of a tool is its description, which is the main piece of information the agent uses to decide whether it should use that tool for any given query

91
Q

What are the key components of a vector database?

A

Each text-embedding pair is stored in a vector database or a <KEY, VALUE> store, with the KEY being the vector embedding and the VALUE being the text chunk.

The unique feature of a vector database is the capability to perform approximate nearest-neighbour (ANN) search on vectors efficiently for KEY matching, instead of performing exact KEY matches as in a traditional database.

92
Q

What is a hybrid search model?

A

Hybrid search is a technique that combines multiple search algorithms to improve the accuracy and relevance of search results.

It uses the best features of both keyword-based search algorithms (sparse embeddings) with vector (dense embedding) search techniques.

Sparse and dense vectors are calculated with distinct algorithms. Sparse vectors have mostly zero values with only a few non-zero values, while dense vectors mostly contain non-zero values.

Sparse embeddings are generated from algorithms like BM25 and SPLADE. Dense embeddings are generated from machine learning models like GloVe and Transformers.

Reciprocal Rank Fusion (RRF) is used to combine the results of BM25 and dense vector search into a single ranked list.

93
Q

What is BM25?

A

A vectorisation approach that builds on the keyword scoring method TF-IDF (Term-Frequency Inverse-Document Frequency)

It takes the Binary Independence Model from the IDF calculation and adds a normalization penalty that weighs a document’s length relative to the average length of all the documents in the database.

94
Q

What is Reciprocal Rank Fusion (RRF)?

A

RRF is used in the hybrid search model to combine the results of BM25 and dense vector search into a single ranked list.

The RRF score is calculated by taking the sum of the reciprocal ranks a document receives in each list (the numerator is 1, and the document's rank goes in the denominator). Putting the rank in the denominator penalizes documents that are ranked lower in a list.

This means you can combine multiple ranking approaches into one
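
A minimal sketch of the idea (two hypothetical ranked lists; production implementations typically add a constant k to the denominator, i.e. 1 / (k + rank), often with k = 60):

def rrf(*ranked_lists, k=0):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)   # reciprocal rank
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]      # hypothetical keyword-search results
dense_ranking = ["doc_b", "doc_a", "doc_d"]     # hypothetical dense-vector results
print(rrf(bm25_ranking, dense_ranking))         # doc_a and doc_b rise to the top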

95
Q

What are the key limitations of Vector DBs?

A
  1. Any changes or updates to the LLM require re-indexing everything in the Vector DB.
  2. You need the exact same LLM for querying. Changing dimensions is not allowed.
  3. Privacy Risk: All text must go to the embedding models and the vector database. If both are different managed services, you create two copies of your COMPLETE data at two different places.
  4. Be Cost Aware: Every token in the complete text corpus goes to LLM and the Vector DB. In the future, if you update your LLM by fine-tuning, upgrading the model, or even increasing your dimensionality, you need to re-index and pay the full cost again.