Quiz #4 Flashcards
Name five types of neural network architectures.
- Fully connected Neural Networks
- Convolutional Neural Networks
- Recurrent Neural Networks
- Attention-Based Networks
- Graph-Based Networks
What is an embedding?
A learned map from entities to vectors of numbers that encode similarities. For example, you can have a word embedding that maps a word -> vector, or a graph embedding that maps a node -> vector.
Why were architectures like RNNs, Attention-Based Networks, and Graph-Based Networks developed?
Generally, we want to develop models that can learn relationships between objects:
- We want to model hierarchical composition in additional types of data, like speech and natural language.
- Additionally, we want to model the relationships between elements in a scene (as in a scene graph).
- We may want to model inter-relationships between things like words or concepts.
What are three important things needed to represent structural information?
- State: compactly representing all the data we’ve processed so far. These are the nodes in a graph.
- Neighborhoods: These are the edges in the graph. They represent relationships and can be calculated using something like a similarity measure or attention.
- Propagation of information: creating states, or vectors, that represent concepts
Given a set of vectors U = {u_1, … , u_n}, provide an equation you can use to find the most similar vector (p) to a given vector q.
You can select the most similar vector using softmax over the dot products:
p = Softmax(Uq), where U is the matrix whose rows are u_1, …, u_n.
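A minimal NumPy sketch of this selection (the vectors and variable names are illustrative; U is stacked as a matrix with one candidate vector per row):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# U stacks the candidate vectors u_1, ..., u_n as rows; q is the query vector.
U = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
q = np.array([0.9, 0.1])

p = softmax(U @ q)                # soft weights over the candidates
most_similar = U[np.argmax(p)]    # the u_i with the highest weight
print(p, most_similar)
```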
Language models allow us to [1.] and [2.].
- estimate probabilities of sequences of words such as p(“I eat an apple”).
- perform comparisons: p(“I eat an apple”) > p(“Dromiceiomimus does yoga with T-Rex”)
How can you express the probability of a sentence, s, as a product of probabilities?
Expressing the probability of s as p(s) = p(w_1, w_2, …, w_n), where w_1 is the first word in the sentence, we can use the chain rule to express this probability as:
p(w_1)p(w_2|w_1)…p(w_n|w_{n-1},…,w_1)
Or more generally:
Product_i p(w_i | w_{i-1}, … , w_1)
Language models are generative models of language: we can generate new sequences of words from them given a history of past words. (T/F)
True. We can generate new words by repeatedly sampling each next word from its conditional probability given the history:
p(w_i | w_{i-1}, ..., w_1)
Here, w_i is the next word and w_{i-1}, ..., w_1 is the history.
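A hedged sketch of both ideas (chain-rule probability and generation by sampling), using a made-up conditional-probability table in place of a trained model and conditioning only on the previous word for brevity:

```python
import random

# Toy conditional distributions p(next word | last word); a real language model
# would condition on the full history w_1, ..., w_{i-1}.
cond = {
    "<s>":   {"I": 0.9, "apple": 0.1},
    "I":     {"eat": 0.8, "apple": 0.2},
    "eat":   {"an": 0.7, "apple": 0.3},
    "an":    {"apple": 1.0},
    "apple": {"</s>": 1.0},
}

def sentence_prob(words):
    """Chain rule: p(s) = product_i p(w_i | history)."""
    p, prev = 1.0, "<s>"
    for w in words + ["</s>"]:
        p *= cond[prev].get(w, 0.0)
        prev = w
    return p

def sample():
    """Generate by repeatedly sampling the next word given the history."""
    prev, out = "<s>", []
    while prev != "</s>":
        words, probs = zip(*cond[prev].items())
        prev = random.choices(words, probs)[0]
        if prev != "</s>":
            out.append(prev)
    return out

print(sentence_prob(["I", "eat", "an", "apple"]))  # 0.9 * 0.8 * 0.7 * 1.0 * 1.0
print(sample())
```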
List 3 applications of language modeling and give an example for each.
Predictive typing:
- Search fields (e.g., Google)
- Text completion on phones
- Assisted typing (e.g., sentence completion)
Automatic speech recognition:
- How likely is the user to have said “my hair is wet” vs “my hairy sweat”?
Basic grammar correction:
- p(“They’re happy together”) > p(“Their happy together”)
What is the product you can use to calculate p(s|c), where s is a sentence and c is a provided context?
p(s|c) = Product_i p(w_i | c, w_{i-1}, … , w_1).
Note, this is like a standard language model, but conditioned on the added context, c.
Provide 3 examples of how one can use conditional language models in NLP tasks.
- Topic-aware language model: c = topic, s = text
- Text summarization: c = long document, s = summary
- Machine translation: c = French text, s = English text
Provide 3 examples of how one can use conditional language models in non-NLP tasks.
- Image captioning: c = an image, s = its caption
- Optical character recognition: c = image of a line of text, s = its content
- Speech recognition: c = a recording, s = its content
Speech recognition and optical character recognition are [ ] -> [ ] sequence models. These types of models are also referred to as [ ] -> [ ]
many -> many
encoder -> decoder
Sentiment analysis and topic classification are [ ] -> [ ] sequence models.
many -> one
Image captioning models are [ ] -> [ ] sequence models.
one -> many
How can you use one-hot encoding to represent words in a vocabulary?
The vector is the length of the vocabulary, and a vector is created for each word in the vocabulary. All elements in each vector are zero, except for a one at the index that corresponds to the word’s position in the vocabulary.
The dog barks:
The: [1,0,0]
dog: [0,1,0]
barks: [0,0,1]
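A small sketch of this encoding for the example above (the vocabulary order is assumed):

```python
import numpy as np

vocab = ["The", "dog", "barks"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))      # one slot per vocabulary word, all zeros
    v[index[word]] = 1.0          # set a 1 at the word's index in the vocabulary
    return v

print(one_hot("dog"))             # [0. 1. 0.]
```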
What are the pain points of using a multi-layer perceptron to process sequence data?
- cannot easily support variable-sized sequences as inputs or outputs
- no inherent temporal structure. The multi-layer perceptron doesn’t keep track of which element in the sequence came first, second, third, etc.
- There is no practical way of holding state. There is no memory of which words came before, e.g., “bank” appearing after “river” vs. “money at the”.
- Size of the network grows with the maximum allowed size of the input and output sequence we want to support
Describe how a single RNN node works at a given time-step, t during training vs inference.
At time-step t, the node receives an input, which it uses to update a state, h_t.
To update the state, the node also has access to the state at the previous time-step, h_{t-1} and either:
- the ground truth (expected) output from the previous time step during training when using teacher forcing
- the predicted output from the previous time step during inference.
This becomes a recursive algorithm, in which f_theta is repeatedly called to update the state.
h_t = f_theta(h_{t-1}, x_t)
Describe the steps of backpropagation through an RNN
- Run the network and compute the outputs.
- Compute the loss (typically a function of all outputs).
- Perform the backward step to compute gradients.
- For models with a large number of time-steps (and thus layers), we can use truncated back-prop through time.
In an RNN model, effectively you have as many layers as [ ].
In an RNN model, effectively you have as many layers as time-steps.
What is the formal definition of an RNN?
A neural network whose information flow does not follow a directed acyclic graph.
What is the equation to update the state (h_t) using an Elman (Vanilla) RNN?
h_t = activation_function(U_theta x_t + V_theta h_{t-1} + bias_theta)
h_t = next state
x_t = input
U_theta = learned matrix applied to the input (affine transformation)
V_theta = learned matrix applied to the previous state (affine transformation)
h_{t-1} = previous state
bias_theta = learned bias term
What is the equation to update the output (y_t) using an Elman (Vanilla) RNN?
y_t = activation_function(W_theta h_t + Beta_theta)
y_t = output
h_t = recently updated state
W_theta = learned matrix applied to the recently updated state (affine transformation)
Beta_theta = learned bias term
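A minimal sketch of one Elman-RNN step built from the two equations above; the sizes, random weights, and the tanh/sigmoid activation choices are illustrative assumptions:

```python
import numpy as np

def elman_step(x_t, h_prev, U, V, b, W, beta):
    """One time-step of a vanilla (Elman) RNN."""
    h_t = np.tanh(U @ x_t + V @ h_prev + b)        # state update
    y_t = 1 / (1 + np.exp(-(W @ h_t + beta)))      # output (sigmoid activation)
    return h_t, y_t

# Illustrative sizes: input dim 3, hidden dim 4, output dim 2.
rng = np.random.default_rng(0)
U, V, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
W, beta = rng.normal(size=(2, 4)), np.zeros(2)

h = np.zeros(4)
for x in [np.array([1., 0., 0.]), np.array([0., 1., 0.])]:   # a short input sequence
    h, y = elman_step(x, h, U, V, b, W, beta)
```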
Activations used for RNN
Sigmoid, Tanh, other nonlinear functions
RNNs can be difficult to train due to [ ] and [ ].
RNNs can be difficult to train due to vanishing gradients and exploding gradients.
Example - simple RNN that updates hidden state as follows:
h_t = sigmoid(w_theta h_{t-1})
With the chain rule:
dh_t/dh_{t-1} = sigmoid(w_theta h_{t-1}) * (1 - sigmoid(w_theta h_{t-1})) * w_theta
So generally,
dh_t / dh_0 is roughly proportional to w_theta^t
If |w_theta| > 1, this explodes because of the t-exponent.
If |w_theta| < 1, this vanishes.
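A quick numeric illustration of why the w_theta^t factor matters (pure arithmetic, no training):

```python
# dh_t/dh_0 picks up a factor of roughly w_theta at every time-step.
for w in (1.2, 0.8):
    print(f"w_theta = {w}: w_theta^50 = {w ** 50:.3g}")
# w_theta = 1.2: w_theta^50 = 9.1e+03   -> gradient explodes
# w_theta = 0.8: w_theta^50 = 1.43e-05  -> gradient vanishes
```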
The LSTM architecture was created to attempt to alleviate [ ] and [ ] [ ] .
The LSTM architecture was created to attempt to alleviate vanishing and exploding gradients.
What are the three gates used in an LSTM called and what is the purpose of each?
- f_t : forget gate - this gate decides how much of the previous cell state we want to keep around. Value = 0 means forget everything. Value = 1 means remember everything.
- i_t : the input gate - how much we let that particular input impact the cell state.
- o_t : the output gate - decides how much of the cell state we want to surface.
What does LSTM introduce to avoid the vanishing gradient problem?
A summation to calculate a “cell state” which is used to update the state at a time-step t.
c_t = dot(f_t, c_{t-1}) + dot(i_t, u_t)
h_t = dot(o_t, tanh(c_t))
Here dot(·, ·) denotes element-wise multiplication.
How do you calculate the forget gate in an LSTM
f_t = sigmoid(w_theta [x_t, h_{t-1}] + b_theta)
which is equivalent to
f_t = sigmoid(U x_t + V h_{t-1} + b_theta), where w_theta = [U, V]
What is one of the consequences of vanishing gradients in RNNs?
Because of vanishing gradients, RNNs have a difficult time learning relationships over a larger number of time-steps. Additionally, information in the deepest layers of the network has a difficult time percolating to the first layers.
What are the equations used to calculate the forget, input, and output gates in an LSTM?
f_t = sigmoid(W_f * [h_{t-1},x_t] + b_f)
i_t = sigmoid(W_i * [h_{t-1},x_t] + b_i)
o_t = sigmoid(W_o * [h_{t-1},x_t] + b_o)
What does LSTM in LSTM Networks stand for?
Long Short-Term Memory
What is the equation used to update the new cell state in an LSTM node?
c_t = dot(f_t, c_{t-1}) + dot(i_t, u_t)
What is the equation used to update the LSTM hidden state?
h_t = o_t * tanh(c_t)
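A minimal sketch of one LSTM step combining the gate, cell-state, and hidden-state equations above; the candidate update u_t = tanh(W_u [h_{t-1}, x_t] + b_u) is an assumed standard choice, and all weights and sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_o, b_o, W_u, b_u):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    u_t = np.tanh(W_u @ z + b_u)           # candidate cell update
    c_t = f_t * c_prev + i_t * u_t         # new cell state (element-wise products)
    h_t = o_t * np.tanh(c_t)               # new hidden state
    return h_t, c_t

# Illustrative sizes: input dim 3, hidden dim 4.
rng = np.random.default_rng(0)
dims = (4, 7)                              # hidden x (hidden + input)
W_f, W_i, W_o, W_u = (rng.normal(size=dims) for _ in range(4))
b_f = b_i = b_o = b_u = np.zeros(4)

h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(np.array([1., 0., 0.]), h, c, W_f, b_f, W_i, b_i, W_o, b_o, W_u, b_u)
```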
When unrolled, RNNs are essentially feed-forward neural networks with affine transformations and non-linearities. (T/F)
True
What is the equation for per-word cross-entropy and what does it calculate?
This calculates the cross entropy averaged over all of the words in the sequence. The referenced distribution is the empirical distribution of the words in the sequence.
H = -(1/N) Sum_i^N log( p(w_i | w_{i-1}, … ) )
This is a way to measure how good the model is at estimating probabilities.
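A small sketch of this average, assuming we already have the model's conditional probability for each word in the sequence (the probabilities are made up):

```python
import numpy as np

# Hypothetical per-word probabilities p(w_i | w_{i-1}, ...) assigned by a model.
word_probs = [0.2, 0.5, 0.1, 0.4]

H = -np.mean(np.log(word_probs))   # per-word cross-entropy
print(H)                           # ~ 1.38 nats
```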
What is perplexity (definition and intuitive explanation)
The geometric mean of the inverse probability of a sequence of words according to the model.
The perplexity of a discrete uniform distribution over k events is k: if you flip a fair coin, the perplexity is 2; if you roll a fair die, the perplexity is 6.
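A quick sketch tying perplexity to the per-word cross-entropy (perplexity = exp(H) when H uses natural logs), checked on the fair-coin and fair-die cases:

```python
import numpy as np

def perplexity(word_probs):
    """Geometric mean of the inverse probabilities: exp of the cross-entropy."""
    return float(np.exp(-np.mean(np.log(word_probs))))

print(perplexity([0.5, 0.5]))      # fair coin -> 2.0
print(perplexity([1/6] * 6))       # fair die  -> 6.0
```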
The higher the perplexity, the better the language model is. (T/F)
False. The lower the perplexity, the better the language model is.
What is the perplexity of flipping a normal coin?
2
When training an RNN on a many-to-many problem, loss is calculated at each time step. (T/F)
True
Describe the RNN training process for many-to-many tasks.
- Feed a vector-representation (like one-hot encoded) of each word to a node of the RNN. Use a symbol to mark the start of the sentence.
- After every time step, project our hidden state into a space whose dimension equals the number of words in our vocabulary.
- Turn that into a probability distribution using softmax.
- Calculate the loss using cross entropy.
- At the next time step, feed the next node the next word in the sequence and the ground-truth word from the previous node (teacher forcing).
In an RNN, when using teacher forcing during training, a node receives the predicted word from the previous node. (T/F and why)
False. With teacher forcing, at each time step the node receives the ground-truth word from the training data, not the word predicted at the previous time step.
Learn more about teacher forcing: https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/
What is teacher forcing?
The practice of feeding the ground-truth previous word (taken from the training data), rather than the model’s own prediction, as input to the next time step of an RNN.
How is the overall loss calculated for an RNN.
x-to-many: the overall loss is calculated by aggregating (e.g., averaging) the losses calculated at each time-step.
many-to-one: loss is calculated at the final time-step when the prediction is made.
Describe the steps an RNN uses during inference.
- We feed all of the words in our history into our model until we run out of history.
- At the time step, t, when we want to make a prediction, we take our hidden state h and perform a transformation to project it into a high-dimensional space that is the size of our entire vocabulary.
- We normalize this vector using softmax, and this gives us a probability distribution over all words in the vocab.
- We select the word with the highest probability.
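A minimal sketch of the prediction step at time t, assuming we already have the hidden state h and a learned projection matrix to vocabulary size (the vocabulary and weights are illustrative):

```python
import numpy as np

vocab = ["the", "dog", "barks", "runs"]

def predict_next(h, W_proj):
    logits = W_proj @ h                       # project hidden state to vocab size
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax -> distribution over vocab
    return vocab[int(np.argmax(probs))]       # greedily pick the most likely word

rng = np.random.default_rng(0)
h = rng.normal(size=8)                        # hidden state produced by the RNN
W_proj = rng.normal(size=(len(vocab), 8))     # illustrative projection matrix
print(predict_next(h, W_proj))
```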
What is a pre-training task?
An auxiliary task, different from the final task we’re really interested in, and one that can help us achieve better performance by finding good initial parameters for the model.
How do masked language models process input data?
- They take in a sequence of words and mark the beginning and end of the sentence with a special token (e.g., <s>).
- They cover up certain words with a mask token (e.g., [MASK]).
- Words are embedded, with an added positional embedding.
- The final result is fed into a transformer encoder.
What predictions does a masked language model make?
It tries to predict the words that were masked in the input data.
Why do we train masked language models?
A model that learns to solve this problem well will learn about the structure of language and common sense knowledge. If we train this model to perform a specific task we’re interested in, it will retain some of the knowledge it learned to perform masked language modeling. This can boost performance on our final task.
Why do we add positional embeddings to the words in the input word-sequence used to train a masked language model?
Because we feed the input into a transformer encoder, which does not have an inherent notion of the position of its inputs, and this information is important for determining masked words and for other NLP tasks.
What is a token-level task?
For each output position, we want to perform a classification, e.g., named entity recognition.
How would you train a pre-trained masked language model to perform named entity recognition?
- Input a sentence with no masked tokens.
- For the outputs at each position, train the network to perform the right classification (e.g., person, date, etc.).
What are sentence-level tasks?
Tasks where we are interested in the global meaning of the sentence, e.g., sentence classification.
How would you train a pre-trained masked language model to perform sentence classification (sentiment analysis)?
Take the first output of the transformer encoder in the top layer, and use that to classify the sentence.
What is cross-lingual masked language modeling?
When you create a masked language model input that consists of a phrase in two languages. The languages are separated by a special separator token, and the phrases are marked by special symbols.
We mask certain words in both of the languages, and the model learns to look at both translations simultaneously and learn what the masked words are.
What is a strength of cross-lingual masked language models?
A strength of these models is that they can perform cross-lingual tasks well.
Examples:
- classifying phrases in different languages while using only English labels during training.
- you can train the model on a natural language inference dataset in one language, and the model can then perform inference in a variety of other languages.
What is natural language inference?
Given two sentences, the task is to determine whether the first sentence implies the second, contradicts it, or whether they are unrelated.
How does knowledge distillation work in model training?
Idea: We use a larger, pre-trained model to teach a smaller model.
Training Process:
- The input text is passed to both the pretrained “teacher” model and the smaller “student” model.
- We encourage the student model’s predictions to align with both the ground-truth labels (via a standard loss function) and the pre-trained teacher’s predictions (via a distillation loss: a loss that penalizes differences between student and teacher predictions).
Knowledge distillation can help reduce model size (T/F).
True
How can knowledge distillation be used to augment training data?
We can take any unlabeled piece of text we have, and have the pre-trained model make a prediction on the text. We can use that prediction and text to augment training data.
List two loss functions commonly used for distillation.
Cross entropy: - Sum_i (t_i log(s_i))
KL divergence: D_{KL}(t || s)
s_i = student prediction for input data object i
t_i = teacher prediction for input data object i
What is distillation loss?
This measures the difference between a student and teacher model’s prediction on a given piece of text (input data).
When using knowledge distillation during training, how do you combine the distillation loss and the student loss to arrive at a final, total loss for the student model?
Take a linear combination of the two losses:
L = a * L_dist + b * L_student
Where:
a = weight for distillation loss (L_dist)
b = weight for student loss (L_student)
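A hedged sketch of the combined loss, using cross-entropy against the teacher's soft predictions as the distillation term; the toy distributions and weights are illustrative:

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    return float(-np.sum(target * np.log(pred + eps)))

# Toy distributions over 3 classes for one example.
teacher = np.array([0.7, 0.2, 0.1])          # teacher's soft prediction
student = np.array([0.6, 0.3, 0.1])          # student's prediction
label   = np.array([1.0, 0.0, 0.0])          # ground-truth label (one-hot)

a, b = 0.5, 0.5                              # weights for the two losses
L_dist = cross_entropy(teacher, student)     # distillation loss
L_student = cross_entropy(label, student)    # standard student loss
L = a * L_dist + b * L_student
```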
What are distributional semantics or distributional similarity?
This is the idea that the meaning of a word comes from its context, or the other nearby words that frequently appear around it.
In Collobert and Weston vectors, what is a positive and negative example?
Given a sample context, a positive example is one in which all words make sense in their context. A negative example is one in which a random word appears in that context.
What is the difference between intrinsic and extrinsic evaluation of word embeddings?
Intrinsic: evaluation on a specific/intermediate sub-task (e.g., word similarity or analogy tests), rather than on a downstream application.
Extrinsic: evaluation on a task the word embeddings are used in, for example, text classification.
Most of the complexity of Feed-forward NN and RNN language models is caused by non-linear hidden layers. (T/F)
True - see https://arxiv.org/pdf/1709.03856.pdf
Skip-gram predicts context (surrounding) words given the target word. (T/F)
True - see https://arxiv.org/pdf/1709.03856.pdf
Continuous bag of words predicts the context (surrounding) words based on the target word. (T/F)
False. CBOW predicts the target word from the context words. See https://arxiv.org/pdf/1709.03856.pdf
What is the goal of the word2vec objective function?
To cause words that occur in similar contexts to have similar embeddings.
What two algorithms does word2vec use to generate vectors from words?
- Continuous bag of words (CBOW)
- Skip Gram
There is no natural notion of similarity between embedded words in a set of one-hot encoded vectors. (T/F)
True
In word2vec, words are represented by [ ] [ ], and words with similar contexts have vectors with a high [ ] [ ] (like the dot product between the vectors).
In word2vec, words are represented by dense vectors, and words with similar contexts have vectors with a high similarity measure (like the dot product between the vectors).
What objective function do we minimize in word2vec?
J(theta) = -(1/T) Sum_{t=1}^T Sum_{-m <= j <= m, j != 0} log p(w_{t+j} | w_t)
T = number of words in the training text
m = window size
w_t = target (center) word at position t in the text
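A small sketch of one term inside the sum, computing p(context word | target word) as a softmax over dot products of “input” and “output” embeddings (a real implementation would typically replace this full softmax with negative sampling); the embeddings here are random and illustrative:

```python
import numpy as np

def log_p_context_given_target(ctx_idx, tgt_idx, W_in, W_out):
    """log p(w_ctx | w_tgt) = log softmax(W_out @ W_in[tgt])[ctx]."""
    scores = W_out @ W_in[tgt_idx]            # dot product with every vocab word
    scores -= scores.max()                    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return log_probs[ctx_idx]

# Illustrative embeddings: vocab of 5 words, embedding dim 4.
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))

# One term of J(theta): the target at position t is word 2, a context word is word 4.
print(-log_p_context_given_target(ctx_idx=4, tgt_idx=2, W_in=W_in, W_out=W_out))
```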
The input to the skipgram model can be a [ ] encoding of the [ ] word.
The input to the skipgram model can be a one-hot encoding of the target (center) word.
For a visual, see: https://youtu.be/ERibwqs9p38?t=2331
What are graph embeddings?
Graph embeddings are a specific type of embedding that translates graphs, or parts of graphs, into fixed-length vectors.
When you train a skip gram model, the hidden layer contains the word embeddings for your target words. (T/F)
True. For a visualization, see https://youtu.be/oQPCxwmBiWo?t=667
Graph embeddings are a form of [ ] learning on graphs.
Graph embeddings are a form of unsupervised learning on graphs.
List three innovations the inventors of word2vec proposed to improve training of the algorithm used in word2vec (i.e., skip-gram).
- Treating common word pairs or phrases as single “words” in their model.
- Subsampling frequent words to decrease the number of training examples. For each word we encounter in the training text, there is a chance it will be deleted from the text, and this probability is related to the word’s frequency.
- Modifying the optimization objective with a technique they called “Negative Sampling”, which causes each training sample to update only a small percentage of the model’s weights. The probability for selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples.
The StarSpace model consists of learning [ ], each of which is described by a set of discrete [ ] coming from a fixed-length dictionary.
The StarSpace model consists of learning entities, each of which is described by a set of discrete features (a bag-of-features) coming from a fixed-length dictionary.
The StarSpace model cannot be used to compare entities of different kinds. For example, a user entity cannot be compared with an item entity (recommendation), or a document entity with label entities (text classification), and so on. (T/F)
False. One of the important features of StarSpace is that the model can be used to compare entities of different kinds.
What distribution does t-SNE use to measure distance between points in high dimensions?
A normal distribution.
What distribution does t-SNE use to measure distances between points in the lower dimension (e.g., 2 dimensions)?
A t-distribution. This is the “T” in t-SNE.
What is t-SNE used for?
t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map.
It is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data. In simpler terms, t-SNE gives you a feel or intuition of how the data is arranged in a high-dimensional space
What does perplexity balance in t-SNE?
Perplexity balances the attention t-SNE gives to local and global aspects of the data and can have large effects on the resulting plot.
Perplexity is roughly a guess of the number of close neighbors each point has. Thus, a denser dataset usually requires a higher perplexity value.
Graph Embedding
Optimize, via gradient descent, an objective under which connected nodes have more similar embeddings than unconnected nodes.
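A hedged sketch of this idea using a simple margin/hinge objective (one of several possible choices) optimized by gradient steps; the toy graph and hyperparameters are illustrative:

```python
import numpy as np

# Toy graph on 4 nodes: node 3 is not connected to the others.
edges = [(0, 1), (1, 2), (0, 2)]          # connected pairs
non_edges = [(0, 3), (1, 3), (2, 3)]      # unconnected pairs

rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(4, 2))  # one 2-d embedding per node
lr, margin = 0.1, 1.0

for _ in range(200):
    for (u, v), (a, b) in zip(edges, non_edges):
        pos = emb[u] @ emb[v]             # similarity of a connected pair
        neg = emb[a] @ emb[b]             # similarity of an unconnected pair
        if margin - pos + neg > 0:        # hinge loss: want pos > neg + margin
            gu, gv = emb[v].copy(), emb[u].copy()
            ga, gb = emb[b].copy(), emb[a].copy()
            emb[u] += lr * gu             # pull connected embeddings together
            emb[v] += lr * gv
            emb[a] -= lr * ga             # push unconnected embeddings apart
            emb[b] -= lr * gb
```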