NLP and Deep Learning Flashcards
What is Syntactic Analysis?
Syntactic Analysis (syntax analysis, parsing): text analysis that determines the grammatical structure of a sentence or part of a sentence.
It focuses on the relationships between words and the grammatical structure of sentences. You can also say it is the process of analysing natural language by applying grammatical rules.
How would you reduce the inference time of a trained transformer model?
To improve inference time, we can use:
- GPU, TPU, or FPGA for acceleration.
- GPU with FP16 (half-precision) support
- Pruning to reduce parameters
- Knowledge distillation
- Hierarchical softmax or adaptive softmax
- Cache predictions
- Parallel/batch computing
- Reduce the model size
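As a minimal sketch of two of these ideas in PyTorch (using the DistilBERT checkpoint that appears later in these cards; treat the options as alternatives, since both modify the model):

import torch
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()  # inference mode

# Dynamic int8 quantization of the linear layers (smaller model, runs on CPU)
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Alternatively, on a GPU with FP16 support, halve the precision in place:
# model.half().to("cuda")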
What is FP16 support (when talking about GPUs)?
Half precision (FP16) halves the memory usage of a neural network compared with the higher-precision FP32 and FP64 formats.
FP32 and FP16 mean 32-bit floating point and 16-bit floating point. GPUs originally focused on FP32; modern GPUs add dedicated FP16 support.
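A quick way to see the memory saving, assuming PyTorch is installed:

import torch

x32 = torch.ones(1_000_000, dtype=torch.float32)
x16 = torch.ones(1_000_000, dtype=torch.float16)
print(x32.element_size() * x32.nelement())  # 4000000 bytes
print(x16.element_size() * x16.nelement())  # 2000000 bytes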
What is Reinforcement Learning?
Reinforcement learning uses trial and error to reach a goal. It is a goal-oriented approach: the agent learns from the environment by taking actions that maximize the cumulative reward.
In typical reinforcement learning:
- At the start, the agent receives state zero from the environment
- Based on the state, the agent will take an action
- The state has changed, and the agent is at a new place in the environment.
- The agent receives the reward if it has made the correct move.
- The process will repeat until the agent has learned the best possible path to reach the goal by maximizing the cumulative rewards.
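A self-contained Q-learning sketch of this loop on a toy line-world (the environment, rewards, and hyperparameters here are all made up for illustration):

import random

# Toy line-world: states 0..4, reaching state 4 yields reward 1
GOAL = 4
ACTIONS = [-1, +1]  # move left, move right

def env_step(state, move):
    next_state = max(0, min(GOAL, state + move))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = [[0.0, 0.0] for _ in range(GOAL + 1)]  # action values per state
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(200):
    state = 0                                   # agent receives state zero
    done = False
    while not done:
        # epsilon-greedy: explore sometimes, otherwise exploit (ties -> random)
        if random.random() < eps or Q[state][0] == Q[state][1]:
            a = random.randrange(2)
        else:
            a = 0 if Q[state][0] > Q[state][1] else 1
        next_state, reward, done = env_step(state, ACTIONS[a])
        # update the value estimate toward reward + discounted future value
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state

print(Q)  # after training, "right" (index 1) dominates in every state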
What is an activation function in Deep Learning?
An activation function is a non-linear transformation in a neural network. We pass the input through the activation function before passing it to the next layer.
The net input to a neuron can be any value from -inf to +inf, and on its own the neuron has no way to bound this value or decide its firing pattern; the activation function bounds the net input and decides whether the neuron should be activated.
Most common types of Activation Functions:
Step Function
Sigmoid Function
ReLU
Leaky ReLU
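Minimal numpy definitions of these four functions (the 0.01 leaky slope is a common but arbitrary choice):

import numpy as np

def step(x):
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x))  # squashed into (0, 1)
print(relu(x))     # negatives clipped to 0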
What is the Naive Bayes algorithm, and when can we use it in NLP?
Naive Bayes algorithm: supervised learning algorithm, based on Bayes theorem and used for solving classification problems.
It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other.
Compared with discriminative models such as logistic regression, a Naive Bayes model takes less time to train, converges faster, and requires less training data.
Assumes that each feature makes an independent and equal contribution to the outcome.
This algorithm works well for multi-class problems and for text classification where the data is dynamic and changes frequently.
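A minimal text-classification sketch with scikit-learn (the tiny spam/ham dataset is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["free prize, claim now", "meeting at noon", "win cash now", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["claim your free cash"]))  # -> ['spam']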
What is Bayes' Theorem?
Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred.
Stated mathematically:
P(A|B) = P(B|A) × P(A) / P(B)
where A and B are events and P(B) ≠ 0.
We are trying to find the probability of event A, given that event B is true.
Event B is also termed the evidence.
P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, event B).
P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
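A one-line numeric check with made-up probabilities (say P(A) = 0.01, P(B|A) = 0.9, P(B) = 0.05):

p_a, p_b_given_a, p_b = 0.01, 0.9, 0.05   # illustrative values
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.18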
What is Bag of Words?
A commonly used model that depends on word frequencies or occurrences to train a classifier.
This model creates an occurrence matrix for documents or sentences, irrespective of their grammatical structure or word order.
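A minimal sketch with scikit-learn's CountVectorizer (the two sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # the vocabulary (word order is ignored)
print(X.toarray())                  # occurrence matrix: one row per document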
What is Pragmatic Ambiguity in NLP?
Pragmatic ambiguity refers to words that have more than one meaning, where the intended meaning in a sentence depends entirely on the context.
e.g. ‘the river bank’ and ‘withdraw from the bank’
How can you compute the distance between two word vectors in NLP?
Euclidean distance: the length of the shortest path connecting the two points
Cosine similarity: the cosine of the angle between the two word vectors
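Both in a few lines of numpy, with made-up 3-d word vectors:

import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 2.0, 1.0])

euclidean = np.linalg.norm(v1 - v2)                               # shortest path length
cosine_sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))  # cos of the angle
print(euclidean, cosine_sim)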
How can you reduce the dimensions of NLP data?
Keyword Normalization
Latent Semantic Indexing
Latent Dirichlet Allocation
What is TF-IDF?
TF (term frequency) and IDF (inverse document frequency) combine as TF-IDF = TF × IDF.
TF = K / T, where K = number of occurrences of the term in a document and T = total number of terms in that document
IDF = log(total number of documents / number of documents containing the term)
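A manual computation on a made-up tokenised corpus, scoring the term ‘data’ in the first document:

import math

docs = [["data", "science", "is", "fun"],
        ["big", "data", "tools"],
        ["machine", "learning"]]

term = "data"
tf = docs[0].count(term) / len(docs[0])                   # K / T = 1/4
idf = math.log(len(docs) / sum(term in d for d in docs))  # log(3 / 2)
print(tf * idf)                                           # the TF-IDF score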
What is Attention?
Like gravity! But based on the similarity of pre-existing word embeddings.
When a word has different meanings in different contexts (pragmatic ambiguity), a context-specific embedding is created, weighted by the cosine similarity between embeddings.
e.g.
We have an original embedding for bank
When ‘bank’ is seen in the context of ‘river’, a new embedding is created for ‘bank’ that is weighted towards ‘river’
How?
1. Suppose the cosine similarity between ‘river’ and ‘bank’ is 0.11
2. To calculate the weights for the new embedding, sum the similarities:
1 + 0.11 = 1.11 (cosine of ‘bank’ with ‘bank’, which is 1, plus cosine of ‘river’ with ‘bank’)
3. Normalise the cosines to get the proportions for the new embedding:
Bank = 1 / 1.11 ≈ 0.9
River = 0.11 / 1.11 ≈ 0.1
4. Create the new embedding using 90% of the original ‘bank’ embedding + 10% of the ‘river’ embedding
In translation tasks, attention is needed to assist the translation of each word; e.g. for ‘the chair’, we need to know whether the noun is masculine or feminine to translate ‘the’ accurately.
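Recomputing the worked example above in numpy (the embeddings are made up; only the 0.11 similarity comes from the card):

import numpy as np

bank  = np.array([0.5, 0.1, 0.3])   # made-up embeddings, for illustration only
river = np.array([0.2, 0.9, 0.4])

sim = 0.11                          # assumed cosine similarity of 'river' and 'bank'
total = 1.0 + sim                   # cos(bank, bank) = 1, plus cos(river, bank)
new_bank = (1.0 / total) * bank + (sim / total) * river
print(new_bank)                     # ~90% original 'bank' + ~10% 'river'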
What is a Transformer model?
Most popular model architecture for NLP
Formed of encoder and decoder sections
Can have different model types:
Encoder only - BERT etc.
Decoder only - GPT-3, Transformer-XL
Encoder-decoder (seq-to-seq) - BART etc.
What are the architectural features of a transformer model?
Encoder and/or decoder parts
Encoder:
1. Tokenisation
2. Embedding generation
3. Positional encoding
4. A series of transformer blocks, each containing a self-attention step and a feed-forward step
Decoder:
1. Series of transformer blocks, each containing a self-attention step and a feed-forward step
2. Takes encoder embeddings and previously generated decoder output as inputs
What is ‘positional encoding’ in transformer models?
In the encoder, positional encoding:
- Takes each word embedding
- Adds a pre-defined, position-dependent vector to the original embedding
This:
- Preserves order of words
- Ensures sentences with different word orders have different embeddings
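A numpy sketch of the classic sinusoidal scheme from ‘Attention Is All You Need’ (one common choice of pre-defined vector; learned positional encodings are another):

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]   # position of each token
    i = np.arange(d_model)[None, :]     # index of each embedding dimension
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

embeddings = np.random.randn(10, 16)                # 10 tokens, 16-d embeddings (made up)
encoded = embeddings + positional_encoding(10, 16)  # add the positional vectors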
What does ‘multi-headed attention’ mean?
In a transformer model, the decoder is formed from a series of transformer blocks which modify word embeddings.
Each block contains a self-attention step and a feed-forward mechanism. The attention step runs several attention ‘heads’ in parallel, each with its own learned weights so that each head can capture a different kind of relationship between words; this is known as multi-headed attention.
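PyTorch ships this as a module; a minimal self-attention call with 8 heads over made-up 64-d embeddings:

import torch
from torch import nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(1, 10, 64)       # batch of 1, 10 tokens, 64-d embeddings
out, weights = attn(x, x, x)     # self-attention: query = key = value = x
print(out.shape)                 # torch.Size([1, 10, 64])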
What is a feed-forward mechanism?
In a transformer model, the decoder is formed from a series of transformer blocks which modify word embeddings.
The feed-forward mechanism is a small fully connected neural network inside each of these blocks, applied to each position independently.
What is SoftMax?
Softmax is a post-processing step that takes place in a transformer model after the decoder (on the logits).
It bounds each score between 0 and 1, and the total sum of the output is 1, giving the scores a probabilistic interpretation.
Softmax converts transformer scores into probabilities; for text generation tasks, the model samples high-probability words rather than always taking the top one, which stops it giving the same answer every time.
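A minimal numpy version, with made-up logits:

import numpy as np

def softmax(logits):
    z = logits - logits.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())       # bounded in (0, 1) and summing to 1

# sampling instead of argmax is what gives varied generations
next_word = np.random.choice(len(probs), p=probs)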
What is semantic search?
Using word/sentence MEANING to conduct a search.
Uses K-nearest neighbours or Annoy (approximate nearest neighbours)
Opposite direction to a distance metric: with distances, smaller = closer, whereas with similarity scores (e.g. cosine) used in semantic search, higher numbers = closer
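A brute-force sketch in numpy (random vectors stand in for real sentence embeddings, which would come from an embedding model):

import numpy as np

docs = np.random.randn(100, 384)    # made-up document embeddings
query = np.random.randn(384)        # made-up query embedding

# cosine similarity against every document: higher = closer in meaning
sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
top_5 = np.argsort(sims)[::-1][:5]  # indices of the 5 most similar documents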
What is the problem with K-Nearest Neighbours and what are 4 solutions for this?
K-Nearest Neighbours is brute force: the distance to every data point must be computed for each query, so compute grows with the size of the dataset and becomes prohibitively expensive at scale
Solutions:
1. Inverted file index (cluster the data first, then search for nearest neighbours only within the closest clusters; see the sketch after this list)
2. Hierarchical Navigable Small Worlds (HNSW) (start with a few points, search there, then add more points using the previous results and iterate)
3. Re-ranking (separate model to rank and select best answer from all options)
4. Positive and negative pairs (update sentence embeddings to make correct answers closer and incorrect further from question embedding)
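A minimal sketch of idea 1, the inverted file index, using scikit-learn's KMeans (the data and cluster count are made up):

import numpy as np
from sklearn.cluster import KMeans

data = np.random.randn(10_000, 64)            # illustrative embedding vectors
km = KMeans(n_clusters=100, n_init=10).fit(data)

def ivf_search(query, k=5):
    # search only the query's nearest cluster instead of all 10,000 points
    cluster = km.predict(query[None, :])[0]
    members = np.where(km.labels_ == cluster)[0]
    dists = np.linalg.norm(data[members] - query, axis=1)
    return members[np.argsort(dists)[:k]]

print(ivf_search(np.random.randn(64)))        # 5 approximate nearest neighbours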
What are the five evaluation metrics for binary classifiers?
- Accuracy
- Precision
- Recall
- Specificity
- F1 score
How do you calculate Accuracy and why is it often not the best metric?
Number of correct predictions
/
Total number of predictions
Accuracy can be misleading when data is imbalanced; if True Negatives are dominant, the model will have high accuracy just by predicting the majority class
How do you calculate precision?
TP
/
TP + FP
Precision is the proportion of POSITIVE PREDICTIONS that were correct - of all the model's positive predictions, how many were actually positive
How do you calculate recall?
TP
/
TP + FN
Recall is the proportion of ACTUALLY POSITIVE DATA that was captured - of all the truly positive examples, how many did the model correctly predict
How do you calculate specificity?
TN
/
TN + FP
Specificity is the proportion of TRUE NEGATIVE DATA that were captured (same as recall but for negatives)
How do you calculate F1 score and when should this be used?
2 * Precision * Recall
/
Precision + Recall
There is a balance to strike between precision and recall, and where it lies depends on the business use case. F1 should be used when precision and recall are equally important
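All five metrics from one set of made-up confusion-matrix counts:

TP, FP, TN, FN = 40, 10, 35, 15

accuracy    = (TP + TN) / (TP + TN + FP + FN)        # 0.75
precision   = TP / (TP + FP)                         # 0.80
recall      = TP / (TP + FN)                         # ~0.73
specificity = TN / (TN + FP)                         # ~0.78
f1 = 2 * precision * recall / (precision + recall)   # ~0.76
print(accuracy, precision, recall, specificity, f1)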
How do you calculate precision and recall for a multiclass classifier?
Precision and recall are calculated separately for each class.
- True Positives are correct predictions of the class
- False Negatives are cases where the true class was predicted as some other category
- False Positives are cases where some other category was predicted as the class
What is the Macro-Average?
In a multiclass classifier, per-class metrics such as precision and recall can be averaged (unweighted) across all classes to give a single number for overall model performance
e.g.
macro-precision
macro-recall etc
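scikit-learn computes both per-class and macro-averaged metrics directly (the labels are made up):

from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 1, 2, 1, 0, 0]

print(precision_score(y_true, y_pred, average=None))     # per-class precision
print(precision_score(y_true, y_pred, average="macro"))  # macro-precision
print(recall_score(y_true, y_pred, average="macro"))     # macro-recall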
What is UMAP?
Uniform Manifold Approximation and Projection for Dimension Reduction
This is a method of dimensionality reduction which can be used to reduce word embeddings down to 2 dimensions
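A minimal sketch with the umap-learn package (random vectors stand in for real embeddings):

import numpy as np
import umap  # pip install umap-learn

embeddings = np.random.randn(500, 300)        # made-up word vectors
coords = umap.UMAP(n_components=2).fit_transform(embeddings)
print(coords.shape)                           # (500, 2), ready to plot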
What is PCA?
Principal Component Analysis
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space
Can project down to any number of dimensions
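The scikit-learn equivalent (again with random stand-in data):

import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.randn(500, 300)   # made-up high-dimensional data
pca = PCA(n_components=2)
coords = pca.fit_transform(embeddings)
print(pca.explained_variance_ratio_)     # variance captured by each component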
What are the four main uses of word/document embeddings?
- Semantic search - recommender systems, Q&A
- Clustering - grouping documents
- Classification - spam/no spam
- Topic modelling (similar to clustering)
What are some approaches to Topic Modelling?
- K-Means clustering
- BERTopic (uses HDBSCAN - clustering without a pre-defined number of clusters)
- LDA (Latent Dirichlet Allocation) - dimensionality reduction
- Gaussian mixture models
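A minimal LDA sketch with scikit-learn and a made-up four-document corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats and dogs play", "dogs bark loudly",
        "stocks rose sharply", "markets and stocks fell"]
X = CountVectorizer().fit_transform(docs)   # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))                     # per-document topic proportions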
What is model fine-tuning?
Adapting foundational language models to your own data
e.g. for a classification task, embeddings will be modified to more distinctly separate the classes
What is Generative AI?
AI that creates or generates new data
e.g. GPT-3, DALL-E etc
What is a tensor?
Essentially a multi-dimensional array, like a numpy array, but built for deep learning: tensors can live on a GPU and support automatic differentiation, which deep learning models need
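For example, in PyTorch:

import torch

t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # a 2-d tensor
print(t.shape, t.dtype)                      # torch.Size([2, 2]) torch.float32
if torch.cuda.is_available():                # unlike numpy arrays, tensors can move to a GPU
    t = t.to("cuda")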
What is the syntax to pre-process text into Tensors ready for a model with Hugging Face AutoTokeniser?
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "list_of_text_strings",
]
# pad/truncate to a common length and return tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

Here, return_tensors="pt" prepares the tensors for a PyTorch implementation.
How do you instantiate a pre-trained model in Hugging Face?
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
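Putting the two cards together, the tokenizer's tensors feed straight into the model (inputs is from the previous card):

outputs = model(**inputs)                # forward pass on the tokenised batch
print(outputs.last_hidden_state.shape)   # (batch_size, sequence_length, hidden_size)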