Lecture 7 and 8 Flashcards
Neural Networks, Word Vectors
Introduction to Neural Nets
In 2018, Google introduced new text processing techniques heavily dependent on the use of deep learning neural networks. To understand deep learning, it is valuable to first understand how “regular” artificial neural networks (ANNs) work in a simpler form.
Deep Learning
Representation learning for automatically learning good features or representations
Representation learning:
learning representations of the data that make it easier to extract useful information when building classifiers or other predictors
Overview of Neural Networks
- Weights: These are calculated during the training process
- Bias: Like an intercept value in a regression
- Inputs: Observed Variables
Overview of Neural Networks
The activation function can be virtually any formula that will produce an output from the summed input, but for gradient-based learning to work properly, the function must generally be differentiable. The original perceptron activation function was a step (threshold) function, which is not differentiable at the threshold.
Overview of Neural Networks
Activation Function - f(x)
Output - y
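A minimal sketch of how the pieces on these cards fit together for one node, assuming the usual form y = f(w·x + b). The function names, the example numbers, and the threshold of 0 in the step function are illustrative assumptions, not values from the lecture.

```python
def neuron_output(inputs, weights, bias, activation):
    """Compute y = f(sum_i w_i * x_i + b) for a single node."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # summed input
    return activation(z)


def step(z):
    """Perceptron-style threshold activation (assumed threshold: 0)."""
    return 1.0 if z >= 0.0 else 0.0


x = [1.0, 0.5]    # inputs: observed variables
w = [0.4, -0.6]   # weights: normally learned in training (made-up values)
b = 0.1           # bias: plays the role of an intercept

y = neuron_output(x, w, b, step)
print(y)  # 1.0, because the summed input (about 0.2) is above the threshold
```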
Common Activation Functions
The activation function of a node defines the output of that node given an input or set of inputs. Programmers choose different activation functions based on the system performance exhibited for various applications.
Common Activation Functions
- Hyperbolic Tangent (tanh) Function
- ReLU Function
- Sigmoid Function
- Identity Function
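The four functions listed above can be written in a few lines each. This is a sketch using their standard textbook definitions; the sample inputs are arbitrary.

```python
import math


def tanh(z):
    """Hyperbolic tangent: squashes z into (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Rectified linear unit: 0 for negative inputs, z otherwise."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid: squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    """Identity: passes the summed input through unchanged."""
    return z


for z in (-2.0, 0.0, 2.0):
    print(z, round(tanh(z), 3), relu(z), round(sigmoid(z), 3), identity(z))
```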
ReLU Function
Perceptron neuron model (left) and activation function (right).
Neural Network Models with Hidden Layers
A typical neural network consists of a few layers: an input layer, an optional hidden layer, and an output layer. Using an identity activation function and no hidden layers, the analysis is equivalent to OLS regression.
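One way to check the OLS claim is to train a single node with an identity activation and compare it with the closed-form least-squares fit. The sketch below assumes one input variable, a squared-error loss, and made-up data; the learning rate and epoch count are arbitrary choices that happen to converge here.

```python
def ols_fit(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_linear_neuron(xs, ys, learning_rate=0.05, epochs=5000):
    """Gradient descent on one node with identity activation: y = w*x + b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean of 0.5 * (prediction - target)^2
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]   # made-up data
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

ols_slope, ols_intercept = ols_fit(xs, ys)
w, b = train_linear_neuron(xs, ys)
print(ols_slope, ols_intercept)
print(w, b)  # should agree with the OLS values to several decimals
```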
Deep Learning:
Deep learning is simply a more complex neural network. There are often many hidden layers – sometimes dozens – and multiple output nodes to estimate multidimensional output(s). It is also possible to use different activation functions on different nodes.
Training – Forward pass
The forward pass
Initially, the weights (filter values) are randomly assigned, so performance is expected to be (very) bad.
The loss function
E(total) = Σ½(target - output)²
Cost/Error function
(mean squared error)
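The error formula above translates directly into code. The target and output values in this sketch are made up for illustration.

```python
def total_error(targets, outputs):
    """E(total) = sum over output nodes of 0.5 * (target - output)^2."""
    if len(targets) != len(outputs):
        raise ValueError("targets and outputs must have the same length")
    return sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))


targets = [0.01, 0.99]   # desired values for two output nodes (made up)
outputs = [0.75, 0.77]   # what an untrained network might produce (made up)

error = total_error(targets, outputs)
print(round(error, 4))  # 0.298
```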
The backward pass
One way of visualizing this idea of minimizing the loss is to consider a 3-D graph where the weights of the neural net (there are obviously more than 2 weights, but let’s go for simplicity) are the independent variables and the dependent variable is the loss. The task of minimizing the loss involves trying to adjust the weights so that the loss decreases. In visual terms, we want to get to the lowest point in our bowl-shaped object. To do this, we have to take a derivative of the loss (visual terms: calculate the slope in every direction) with respect to the weights.
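The bowl picture can be sketched with a two-weight loss. The specific bowl used here, (w1 - 3)^2 + (w2 + 1)^2, along with the starting point, learning rate, and step count, are assumptions chosen only to make the descent easy to follow.

```python
def loss(w1, w2):
    """A bowl-shaped surface whose lowest point is at (3, -1)."""
    return (w1 - 3.0) ** 2 + (w2 + 1.0) ** 2


def gradient(w1, w2):
    """Slope of the loss in each weight direction (partial derivatives)."""
    return 2.0 * (w1 - 3.0), 2.0 * (w2 + 1.0)


def descend(w1, w2, learning_rate=0.1, steps=100):
    """Repeatedly step downhill, against the gradient."""
    for _ in range(steps):
        g1, g2 = gradient(w1, w2)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2


start = (0.0, 0.0)
w1, w2 = descend(*start)
print(loss(*start), loss(w1, w2))  # loss at the start vs. near the bottom
```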
Learning Through Backpropagation
Backpropagation takes the difference between the predicted value and the actual value and uses that error term to adjust each node’s weights.
Learning Through Backpropagation
The process works backwards from the final layers to earlier layers, one layer at a time, and computes the contribution that each weight in the given layer made to the loss value.
Learning Through Backpropagation
The algorithm that uses the computed gradients to update the weights is called “gradient descent”: it iteratively moves the weights in the direction of greatest improvement in prediction (the steepest decrease in loss).
Backpropagation
includes: the forward pass, loss function, backward pass, and parameter update
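The four steps can be seen in one loop. This is a minimal sketch for a single sigmoid node learning the logical OR function; the task, the seed, the learning rate, and the number of epochs are all illustrative assumptions, and a real network would repeat the backward step layer by layer.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(inputs, weights, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(data, learning_rate=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in range(2)]  # random start
    bias = rng.uniform(-0.5, 0.5)
    history = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for inputs, target in data:
            # 1. Forward pass
            output = predict(inputs, weights, bias)
            # 2. Loss function: 0.5 * (target - output)^2
            epoch_loss += 0.5 * (target - output) ** 2
            # 3. Backward pass (chain rule): dE/dz = (output - target) * f'(z)
            delta = (output - target) * output * (1.0 - output)
            grads = [delta * x for x in inputs]
            # 4. Parameter update
            weights = [w - learning_rate * g for w, g in zip(weights, grads)]
            bias -= learning_rate * delta
        history.append(epoch_loss)
    return weights, bias, history


OR_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

weights, bias, history = train(OR_DATA)
predictions = [round(predict(x, weights, bias)) for x, _ in OR_DATA]
print(history[0], history[-1])  # loss in the first and last epoch
print(predictions)
```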
Epoch:
One epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE.
Batch:
One batch contains the training examples used for one weight update (recommended: no more than 32).
Iteration:
Number of iterations (batches) per epoch = total number of training examples / batch size
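A short sketch of the arithmetic, with made-up dataset sizes. Rounding up when the data does not divide evenly (so the last batch is smaller) is an assumption; some training setups drop the incomplete batch instead.

```python
import math


def iterations_per_epoch(n_examples, batch_size):
    """Number of weight updates needed to pass over the whole dataset once."""
    return math.ceil(n_examples / batch_size)


def make_batches(examples, batch_size):
    """Split the examples into consecutive batches."""
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]


print(iterations_per_epoch(2000, 32))   # 63 (62 full batches plus one of 16)
print(iterations_per_epoch(1000, 10))   # 100

batches = make_batches(list(range(10)), 4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```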
Language analysis
- Speech
- Morphology
- Syntax
- Semantics
Word Embedding
One-hot Vector
TF-IDF
Word2Vec
GloVe
fastText
ELMo
Attention Mechanism – BERT
XLNet
One-hot Vector
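In a vocabulary set, each word is represented as a vector that is 0 everywhere except for a single 1 at the word's index. For example, if the word chair is the 5391st word in that vocabulary, we can represent it as O(5391), the vector with a 1 in position 5391 and 0 elsewhere.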
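The chair example can be sketched as follows. The vocabulary size of 10,000 is an assumption (the card does not give one), and Python lists are 0-indexed, so the 5391st word sits at index 5390.

```python
def one_hot(index, vocab_size):
    """Vector of zeros with a single 1 at the given 0-based index."""
    if not 0 <= index < vocab_size:
        raise ValueError("index is outside the vocabulary")
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


print(one_hot(2, 5))  # [0, 0, 1, 0, 0]

VOCAB_SIZE = 10_000            # assumed vocabulary size
chair_position = 5391          # 1-based position from the example
chair = one_hot(chair_position - 1, VOCAB_SIZE)
print(len(chair), sum(chair), chair.index(1))
```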
TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus). The TF-IDF value increases proportionally to the number of times a word appears in a document but is offset by the frequency of the word in the corpus. This helps to highlight words that are more specific to a particular document and are less common in the entire corpus.
Term Frequency (TF):
This measures how often a term (word) appears in a document. It is calculated as the ratio of the number of times the term appears in the document to the total number of terms in the document. TF is usually normalized to prevent it from biasing towards longer documents.
Inverse Document Frequency (IDF):
This measures how important a term is within the entire corpus. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. Terms that occur in many documents have a lower IDF, and vice versa.
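The TF and IDF definitions above combine into one score by multiplication. This sketch uses a made-up three-document corpus and the natural logarithm; returning 0 for a term that appears in no document is an assumption made to avoid dividing by zero.

```python
import math


def term_frequency(term, document):
    """Share of the document's tokens that are this term."""
    return document.count(term) / len(document)


def inverse_document_frequency(term, corpus):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        return 0.0  # assumption: unseen terms get no weight
    return math.log(len(corpus) / containing)


def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)


corpus = [
    ["the", "chair", "is", "red"],
    ["the", "table", "is", "blue"],
    ["the", "cat", "sat", "on", "the", "table"],
]

print(tf_idf("the", corpus[2], corpus))    # 0.0: "the" is in every document
print(tf_idf("chair", corpus[0], corpus))  # positive: specific to document 0
```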
Word2Vec
popular technique in natural language processing (NLP) and machine learning that is used to represent words as vectors in a continuous vector space. Developed by a team at Google led by Tomas Mikolov, Word2Vec is designed to capture semantic relationships between words based on their context in a given corpus.
Continuous Bag of Words (CBOW):
This model predicts a target word based on its context. It takes a context of surrounding words (the “bag of words”) as input and tries to predict the target word.
Skip-Gram:
In contrast to CBOW, the Skip-Gram model predicts the context words (surrounding words) given a target word. It takes a target word as input and tries to predict the words that are likely to appear in its context.
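The difference between the two models shows up in how training examples are built from a sentence. The sketch below only generates those examples; it does not train any vectors. The example sentence and the window size of 1 are assumptions.

```python
def context_window(tokens, position, window):
    """Words within `window` positions of the target, excluding the target."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: (context words, target word), predict the target from context."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


def skipgram_pairs(tokens, window=2):
    """Skip-Gram: (target word, one context word), predict context from target."""
    return [(target, context_word)
            for i, target in enumerate(tokens)
            for context_word in context_window(tokens, i, window)]


tokens = "the cat sat on the mat".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_pairs(tokens, window=1)

print(cbow[1])       # (['the', 'sat'], 'cat')
print(skipgram[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```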
GloVe
“Global Vectors for Word Representation,” is another popular word embedding technique in natural language processing (NLP). Developed by researchers at Stanford University, GloVe is designed to capture the global context of words in a corpus and create vector representations that encode semantic relationships between words
fastText
open-source, free, lightweight library developed by Facebook’s AI Research (FAIR) lab for efficient learning of word representations and text classification. It extends the Word2Vec skip-gram model, and Tomas Mikolov, who led the Word2Vec work at Google, is among its authors. What sets fastText apart is that it represents each word as a bag of character n-grams, enabling it to capture morphological information and handle out-of-vocabulary words more effectively.
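The character n-gram idea can be sketched without the library itself. The boundary markers `<` and `>` and the 3-to-6 length range follow the description in the fastText paper as best recalled here; the function name is hypothetical, and the real library additionally hashes n-grams into buckets and sums their vectors, which this sketch does not do.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# An out-of-vocabulary word still shares n-grams with a known word,
# which is what lets the model build a vector for it.
shared = set(char_ngrams("where")) & set(char_ngrams("wherever"))
print(sorted(shared))
```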
ELMo
“Embeddings from Language Models,” is a deep contextualized word representation model developed by researchers at the Allen Institute for Artificial Intelligence (AI2). Unlike traditional word embeddings that assign a fixed vector to each word regardless of its context, ELMo produces word representations that are sensitive to the surrounding words in a given sentence. ELMo captures the context-dependent meaning of words by considering their usage in different contexts
Attention Mechanism – BERT
Bidirectional Encoder Representations from Transformers, is a natural language processing (NLP) model that utilizes an attention mechanism. The attention mechanism is a key component in BERT and many other transformer-based models.
Attention Mechanism – BERT
Attention Mechanism:
The attention mechanism allows a model to focus on different parts of the input sequence when making predictions. In the context of NLP, this input sequence is often a sequence of words in a sentence. Traditional sequence-to-sequence models based on recurrent neural networks (RNNs) process input sequences one step at a time, but attention mechanisms enable models to consider all words in the sequence simultaneously.
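A minimal sketch of the idea, assuming the scaled dot-product form of attention used in transformer models. The three 2-dimensional "word" vectors are made up, and they are used directly as queries, keys, and values; BERT itself applies learned projections and multiple attention heads, which are omitted here.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """For each query, weight every value by how well its key matches."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up embeddings for a three-word sentence
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(embeddings, embeddings, embeddings)

for row in weights:
    print([round(w, 3) for w in row])  # every word attends to all three words
```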
XLNet
XLNet is a generalized autoregressive pretraining method for language understanding. It is a language model that learns unsupervised representations of text sequences. XLNet builds on Transformer-XL and trains with a permutation-based autoregressive objective, which lets it model bidirectional contexts without the masked (corrupted) input that denoising-autoencoding approaches like BERT rely on¹. XLNet has been reported to outperform BERT on 20 tasks, including question answering, natural language inference, sentiment analysis, and document ranking¹.
Cosine Similarity
Two vectors pointing in the same direction have a cosine similarity of 1; two orthogonal vectors (90-degree angle) have a cosine similarity of 0.
Cosine Similarity
Cosine distance can be expressed in different ways, e.g., 1 – sim, where sim is the cosine similarity.
Cosine Similarity
Cosine similarity can be computed as the normalized dot product of the two vectors, which is very efficient in Python.
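A plain-Python sketch of the normalized dot product, along with the 1 – sim form of cosine distance. The example vectors are made up; for large vectors, a vectorized library such as NumPy (`numpy.dot`, `numpy.linalg.norm`) would be the usual way to get the efficiency mentioned above.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product / (norm_a * norm_b)


def cosine_distance(a, b):
    """One common definition of cosine distance: 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction: about 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))    # 1.0
```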