Quiz #4 Flashcards
Name five types of neural network architectures.
- Fully connected Neural Networks
- Convolutional Neural Networks
- Recurrent Neural Networks
- Attention-Based Networks
- Graph-Based Networks
What is an embedding?
A learned map from entities to vectors of numbers that encode similarities. For example, you can have a word embedding that maps a word -> vector, or a graph embedding that maps a node -> vector.
Why were architectures like RNNs, Attention-Based Networks, and Graph-Based Networks developed?
Generally, we want to develop models that can learn relationships between objects:
- We want to model hierarchical composition in additional types of data, like speech and natural language.
- Additionally, we want to model the relationships between elements in a scene (as in a scene graph).
- We may want to model inter-relationships between things like words or concepts.
What are three important things needed to represent structural information?
- State: compactly representing all the data we’ve processed so far. These are the nodes in a graph.
- Neighborhoods: These are the edges in the graph. They represent relationships and can be calculated using something like a similarity measure or attention.
- Propagation of information: creating states, or vectors, that represent concepts
Given a set of vectors U = {u_1, … , u_n}, provide an equation you can use to find the most similar vector (p) to a given vector q.
You can select the most similar vector using softmax over the dot products:
p = Softmax(Uq), where U is the matrix whose rows are u_1, …, u_n.
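A minimal NumPy sketch of this selection (the vectors and variable names are illustrative; U is stacked as a matrix with one candidate vector per row):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# U stacks the candidate vectors u_1, ..., u_n as rows; q is the query vector.
U = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
q = np.array([0.9, 0.1])

p = softmax(U @ q)                # soft weights over the candidates
most_similar = U[np.argmax(p)]    # the u_i with the highest weight
print(p, most_similar)
```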
Language models allow us to [1.] and [2.].
- estimate probabilities of sequences of words such as p(“I eat an apple”).
- perform comparisons: p(“I eat an apple”) > p(“Dromiceiomimus does yoga with T-Rex”)
How can you express the probability of a sentence, s, as a product of probabilities?
Expressing the probability of s as p(s) = p(w_1, w_2, …, w_n), where w_1 is the first word in the sentence, we can use the chain rule to express this probability as:
p(w_1)p(w_2|w_1)…p(w_n|w_{n-1},…,w_1)
Or more generally:
Product_i p(w_i | w_{i-1}, … , w_1)
Language models are generative models of language: we can generate new sequences of words from them given a history of past words. (T/F)
True. We can generate new words by repeatedly sampling each next word from its conditional probability given the history:
p(w_i | w_{i-1}, ..., w_1)
Here, w_i is the next word and w_{i-1}, ..., w_1 is the history.
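A hedged sketch of both ideas (chain-rule probability and generation by sampling), using a made-up conditional-probability table in place of a trained model and conditioning only on the previous word for brevity:

```python
import random

# Toy conditional distributions p(next word | last word); a real language model
# would condition on the full history w_1, ..., w_{i-1}.
cond = {
    "<s>":   {"I": 0.9, "apple": 0.1},
    "I":     {"eat": 0.8, "apple": 0.2},
    "eat":   {"an": 0.7, "apple": 0.3},
    "an":    {"apple": 1.0},
    "apple": {"</s>": 1.0},
}

def sentence_prob(words):
    """Chain rule: p(s) = product_i p(w_i | history)."""
    p, prev = 1.0, "<s>"
    for w in words + ["</s>"]:
        p *= cond[prev].get(w, 0.0)
        prev = w
    return p

def sample():
    """Generate by repeatedly sampling the next word given the history."""
    prev, out = "<s>", []
    while prev != "</s>":
        words, probs = zip(*cond[prev].items())
        prev = random.choices(words, probs)[0]
        if prev != "</s>":
            out.append(prev)
    return out

print(sentence_prob(["I", "eat", "an", "apple"]))  # 0.9 * 0.8 * 0.7 * 1.0 * 1.0
print(sample())
```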
List 3 applications of language modeling and give an example for each.
Predictive typing:
- Search fields (e.g., Google)
- Text completion on phones
- Assisted typing (e.g., sentence completion)
Automatic speech recognition:
- How likely is the user to have said “my hair is wet” vs “my hairy sweat”?
Basic grammar correction:
- p(“They’re happy together”) > p(“Their happy together”)
What is the product you can use to calculate p(s|c), where s is a sentence and c is a provided context?
p(s|c) = Product_i p(w_i | c, w_{i-1}, … , w_1).
Note, this is like a standard language model, but conditioned on the added context, c.
Provide 3 examples of how one can use conditional language models in NLP tasks.
- Topic-aware language model: c = topic, s = text
- Text summarization: c = long document, s = summary
- Machine translation: c = French text, s = English text
Provide 3 examples of how one can use conditional language models in non-NLP tasks.
- Image captioning: c = an image, s = its caption
- Optical character recognition: c = image of a line of text, s = its content
- Speech recognition: c = a recording, s = its content
Speech recognition and optical character recognition are [ ] -> [ ] sequence models. These types of models are also referred to as [ ] -> [ ]
many -> many
encoder -> decoder
Sentiment analysis and topic classification are [ ] -> [ ] sequence models.
many -> one
Image captioning models are [ ] -> [ ] sequence models.
one -> many
How can you use one-hot encoding to represent words in a vocabulary?
The vector is the length of the vocabulary, and a vector is created for each word in the vocabulary. All elements in each vector are zero, except for a one at the index that corresponds to the word’s position in the vocabulary.
The dog barks:
The: [1,0,0]
dog: [0,1,0]
barks: [0,0,1]
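A small sketch of this encoding for the example above (the vocabulary order is assumed):

```python
import numpy as np

vocab = ["The", "dog", "barks"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))      # one slot per vocabulary word, all zeros
    v[index[word]] = 1.0          # set a 1 at the word's index in the vocabulary
    return v

print(one_hot("dog"))             # [0. 1. 0.]
```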
What are the pain points of using a multi-layer perceptron to process sequence data?
- cannot easily support variable-sized sequences as inputs or outputs
- no inherent temporal structure. The multi-layer perceptron doesn’t keep track of which element in the sequence came first, second, third, etc.
- There is no practical way of holding state. There is no memory of which words came before, e.g., “bank” appearing after “river” vs. “money at the”.
- Size of the network grows with the maximum allowed size of the input and output sequence we want to support
Describe how a single RNN node works at a given time-step, t during training vs inference.
At time-step t, the node receives an input, which it uses to update a state, h_t.
To update the state, the node also has access to the state at the previous time-step, h_{t-1} and either:
- the ground truth (expected) output from the previous time step during training when using teacher forcing
- the predicted output from the previous time step during inference.
This becomes a recursive algorithm, in which f_theta is repeatedly called to update the state.
h_t = f_theta(h_{t-1}, x_t)
Describe the steps of backpropagation through an RNN
- Run the network and compute the outputs.
- Compute the loss (typically a function of all outputs).
- Perform the backward step to compute gradients.
- For models with a large number of time-steps (and thus layers), we can use truncated back-prop through time.
In an RNN model, effectively you have as many layers as [ ].
In an RNN model, effectively you have as many layers as time-steps.
What is the formal definition of an RNN?
A neural network whose information flow does not follow a directed acyclic graph.
What is the equation to update the state (h_t) using an Elman (Vanilla) RNN?
h_t = activation_function(U_theta x_t + V_theta h_{t-1} + bias_theta)
h_t = next state
x_t = input
U_theta = learned matrix applied to the input (affine transformation)
V_theta = learned matrix applied to the previous state (affine transformation)
h_{t-1} = previous state
bias_theta = learned bias term
What is the equation to update the output (y_t) using an Elman (Vanilla) RNN?
y_t = activation_function(W_theta h_t + Beta_theta)
y_t = output
h_t = recently updated state
W_theta = learned matrix applied to the recently updated state (affine transformation)
Beta_theta = learned bias term
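A minimal sketch of one Elman-RNN step built from the two equations above; the sizes, random weights, and the tanh/sigmoid activation choices are illustrative assumptions:

```python
import numpy as np

def elman_step(x_t, h_prev, U, V, b, W, beta):
    """One time-step of a vanilla (Elman) RNN."""
    h_t = np.tanh(U @ x_t + V @ h_prev + b)        # state update
    y_t = 1 / (1 + np.exp(-(W @ h_t + beta)))      # output (sigmoid activation)
    return h_t, y_t

# Illustrative sizes: input dim 3, hidden dim 4, output dim 2.
rng = np.random.default_rng(0)
U, V, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
W, beta = rng.normal(size=(2, 4)), np.zeros(2)

h = np.zeros(4)
for x in [np.array([1., 0., 0.]), np.array([0., 1., 0.])]:   # a short input sequence
    h, y = elman_step(x, h, U, V, b, W, beta)
```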
Activations used for RNN
Sigmoid, Tanh, other nonlinear functions
RNNs can be difficult to train due to [ ] and [ ].
RNNs can be difficult to train due to vanishing gradients and exploding gradients.
Example - simple RNN that updates hidden state as follows:
h_t = sigmoid(w_theta h_{t-1})
With the chain rule:
dh_t/dh_{t-1} = sigmoid(w_theta h_{t-1}) * (1 - sigmoid(w_theta h_{t-1})) * w_theta
So generally,
dh_t / dh_0 is roughly proportional to w_theta^t
If |w_theta| > 1, this explodes because of the t-exponent.
If |w_theta| < 1, this vanishes.
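A quick numeric illustration of why the w_theta^t factor matters (pure arithmetic, no training):

```python
# dh_t/dh_0 picks up a factor of roughly w_theta at every time-step.
for w in (1.2, 0.8):
    print(f"w_theta = {w}: w_theta^50 = {w ** 50:.3g}")
# w_theta = 1.2: w_theta^50 = 9.1e+03   -> gradient explodes
# w_theta = 0.8: w_theta^50 = 1.43e-05  -> gradient vanishes
```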
The LSTM architecture was created to attempt to alleviate [ ] and [ ] [ ] .
The LSTM architecture was created to attempt to alleviate vanishing and exploding gradients.
What are the three gates used in an LSTM called and what is the purpose of each?
- f_t : forget gate - this gate decides how much of the previous cell state we want to keep around. Value = 0 means forget everything. Value = 1 means remember everything.
- i_t : the input gate - how much we let that particular input impact the cell state.
- o_t : the output gate - decides how much of the cell state we want to surface.
What does LSTM introduce to avoid the vanishing gradient problem?
A summation to calculate a “cell state” which is used to update the state at a time-step t.
c_t = dot(f_t, c_{t-1}) + dot(i_t, u_t)
h_t = dot(o_t, tanh(c_t))
Here dot(·, ·) denotes element-wise multiplication.
How do you calculate the forget gate in an LSTM
f_t = sigmoid(w_theta [x_t, h_{t-1}] + b_theta)
which is equivalent to
f_t = sigmoid(U x_t + V h_{t-1} + b_theta), where w_theta = [U, V]
What is one of the consequences of vanishing gradients in RNNs?
Because of vanishing gradients, RNNs have a difficult time learning relationships over a larger number of time-steps. Additionally, information in the deepest layers of the network has a difficult time percolating to the first layers.
What are the equations used to calculate the forget, input, and output gates in an LSTM?
f_t = sigmoid(W_f * [h_{t-1},x_t] + b_f)
i_t = sigmoid(W_i * [h_{t-1},x_t] + b_i)
o_t = sigmoid(W_o * [h_{t-1},x_t] + b_o)
What does LSTM in LSTM Networks stand for?
Long Short-Term Memory
What is the equation used to update the new cell state in an LSTM node?
c_t = dot(f_t, c_{t-1}) + dot(i_t, u_t)
What is the equation used to update the LSTM hidden state?
h_t = o_t * tanh(c_t)
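A minimal sketch of one LSTM step combining the gate, cell-state, and hidden-state equations above; the candidate update u_t = tanh(W_u [h_{t-1}, x_t] + b_u) is an assumed standard choice, and all weights and sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_o, b_o, W_u, b_u):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    u_t = np.tanh(W_u @ z + b_u)           # candidate cell update
    c_t = f_t * c_prev + i_t * u_t         # new cell state (element-wise products)
    h_t = o_t * np.tanh(c_t)               # new hidden state
    return h_t, c_t

# Illustrative sizes: input dim 3, hidden dim 4.
rng = np.random.default_rng(0)
dims = (4, 7)                              # hidden x (hidden + input)
W_f, W_i, W_o, W_u = (rng.normal(size=dims) for _ in range(4))
b_f = b_i = b_o = b_u = np.zeros(4)

h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(np.array([1., 0., 0.]), h, c, W_f, b_f, W_i, b_i, W_o, b_o, W_u, b_u)
```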
When unrolled, RNNs are essentially feed-forward neural networks with affine transformations and non-linearities. (T/F)
True
What is the equation for per-word cross-entropy and what does it calculate?
This calculates the cross entropy averaged over all of the words in the sequence. The referenced distribution is the empirical distribution of the words in the sequence.
H = -(1/N) Sum_i^N log( p(w_i | w_{i-1}, … ) )
This is a way to measure how good the model is at estimating probabilities.
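A small sketch of this average, assuming we already have the model's conditional probability for each word in the sequence (the probabilities are made up):

```python
import numpy as np

# Hypothetical per-word probabilities p(w_i | w_{i-1}, ...) assigned by a model.
word_probs = [0.2, 0.5, 0.1, 0.4]

H = -np.mean(np.log(word_probs))   # per-word cross-entropy
print(H)                           # ~ 1.38 nats
```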
What is perplexity (definition and intuitive explanation)
The geometric mean of the inverse probability of a sequence of words according to the model.
The perplexity of a discrete uniform distribution over k events is k: if you flip a fair coin, the perplexity is 2; if you roll a fair die, the perplexity is 6.
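A quick sketch tying perplexity to the per-word cross-entropy (perplexity = exp(H) when H uses natural logs), checked on the fair-coin and fair-die cases:

```python
import numpy as np

def perplexity(word_probs):
    """Geometric mean of the inverse probabilities: exp of the cross-entropy."""
    return float(np.exp(-np.mean(np.log(word_probs))))

print(perplexity([0.5, 0.5]))      # fair coin -> 2.0
print(perplexity([1/6] * 6))       # fair die  -> 6.0
```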
The higher the perplexity, the better the language model is. (T/F)
False. The lower the perplexity, the better the language model is.
What is the perplexity of flipping a normal coin?
2
When training an RNN on a many-to-many problem, loss is calculated at each time step. (T/F)
True
Describe the RNN training process for many-to-many tasks.
- Feed a vector-representation (like one-hot encoded) of each word to a node of the RNN. Use a symbol to mark the start of the sentence.
- After every time step, project our hidden state into a space whose dimension equals the number of words in our vocabulary.
- Turn that into a probability distribution using softmax.
- Calculate the loss using cross entropy.
- At the next time step, feed the next node the next word in the sequence and the ground-truth word from the previous node (teacher forcing).
In an RNN, when using teacher forcing during training, a node receives the predicted word from the previous node. (T/F and why)
False. With teacher forcing, at each time step the node receives the ground-truth word from the training data, not the word predicted at the previous time step.
Learn more about teacher forcing: https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/
What is teacher forcing?
The practice of feeding the ground-truth previous word (taken from the training data), rather than the model’s own prediction, as input to the next time step of an RNN.
How is the overall loss calculated for an RNN.
x-to-many: the overall loss is calculated by aggregating (e.g., averaging) the losses calculated at each time-step.
many-to-one: loss is calculated at the final time-step when the prediction is made.
Describe the steps an RNN uses during inference.
- We feed all of the words in our history into our model until we run out of history.
- At the time step, t, when we want to make a prediction, we take our hidden state h and perform a transformation to project it into a high-dimensional space that is the size of our entire vocabulary.
- We normalize this vector using softmax, and this gives us a probability distribution over all words in the vocab.
- We select the word with the highest probability.
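A minimal sketch of the prediction step at time t, assuming we already have the hidden state h and a learned projection matrix to vocabulary size (the vocabulary and weights are illustrative):

```python
import numpy as np

vocab = ["the", "dog", "barks", "runs"]

def predict_next(h, W_proj):
    logits = W_proj @ h                       # project hidden state to vocab size
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax -> distribution over vocab
    return vocab[int(np.argmax(probs))]       # greedily pick the most likely word

rng = np.random.default_rng(0)
h = rng.normal(size=8)                        # hidden state produced by the RNN
W_proj = rng.normal(size=(len(vocab), 8))     # illustrative projection matrix
print(predict_next(h, W_proj))
```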
What is a pre-training task?
An auxiliary task, different from the final task we’re really interested in, and one that can help us achieve better performance by finding good initial parameters for the model.
How do masked language models process input data?
- They take in a sequence of words and mark the beginning and end of the sentence with a special token (e.g., <s>).
- They cover up certain words with a mask token (e.g., [MASK]).
- Words are embedded, with an added positional embedding.
- The final result is fed into a transformer encoder.
What predictions does a masked language model make?
It tries to predict the words that were masked in the input data.
Why do we train masked language models?
A model that learns to solve this problem well will learn about the structure of language and common sense knowledge. If we train this model to perform a specific task we’re interested in, it will retain some of the knowledge it learned to perform masked language modeling. This can boost performance on our final task.
Why do we add positional embeddings to the words in the input word-sequence used to train a masked language model?
Because we feed the input into a transformer encoder, which does not have an inherent notion of the position of its inputs, and this information is important for determining masked words and for other NLP tasks.
What is a token-level task?
For each output position, we want to perform a classification, e.g., named entity recognition.
How would you train a pre-trained masked language model to perform named entity recognition?
- Input a sentence with no masked tokens.
- For the outputs at each position, train the network to perform the right classification (e.g., person, date, etc.).
What are sentence-level tasks?
Tasks where we are interested in the global meaning of the sentence, e.g., sentence classification.
How would you train a pre-trained masked language model to perform sentence classification (sentiment analysis)?
Take the first output of the transformer encoder in the top layer, and use that to classify the sentence.
What is cross-lingual masked language modeling?
When you create a masked language model input that consists of a phrase in two languages. The languages are separated by a special separator token, and the phrases are marked by special symbols.
We mask certain words in both of the languages, and the model learns to look at both translations simultaneously and learn what the masked words are.
What is a strength of cross-lingual masked language models?
A strength of these models is that they can perform cross-lingual tasks well.
Examples:
- classifying phrases in different languages while using only English labels during training.
- you can train the model on a natural language inference dataset in one language, and the model can then perform inference in a variety of other languages.
What is natural language inference?
Given two sentences, the task is to determine whether the first sentence implies the second, contradicts it, or whether they are unrelated.
How does knowledge distillation work in model training?
Idea: We use a larger, pre-trained model to teach a smaller model.
Training Process:
- The input text is passed to both the pretrained “teacher” model and the smaller “student” model.
- We encourage the student model’s predictions to align with both the ground-truth labels (via a standard loss function) and the pre-trained teacher’s predictions (via a distillation loss: a loss that penalizes differences between student and teacher predictions).
Knowledge distillation can help reduce model size (T/F).
True
How can knowledge distillation be used to augment training data?
We can take any unlabeled piece of text we have, and have the pre-trained model make a prediction on the text. We can use that prediction and text to augment training data.
List two loss functions commonly used for distillation.
Cross entropy: - Sum_i (t_i log(s_i))
KL divergence: D_{KL}(t || s)
s_i = student prediction for input data object i
t_i = teacher prediction for input data object i
What is distillation loss?
This measures the difference between a student and teacher model’s prediction on a given piece of text (input data).
When using knowledge distillation during training, how do you combine the distillation loss and the student loss to arrive at a final, total loss for the student model?
Take a linear combination of the two losses:
L = a * L_dist + b * L_student
Where:
a = weight for distillation loss (L_dist)
b = weight for student loss (L_student)
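A hedged sketch of the combined loss, using cross-entropy against the teacher's soft predictions as the distillation term; the toy distributions and weights are illustrative:

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    return float(-np.sum(target * np.log(pred + eps)))

# Toy distributions over 3 classes for one example.
teacher = np.array([0.7, 0.2, 0.1])          # teacher's soft prediction
student = np.array([0.6, 0.3, 0.1])          # student's prediction
label   = np.array([1.0, 0.0, 0.0])          # ground-truth label (one-hot)

a, b = 0.5, 0.5                              # weights for the two losses
L_dist = cross_entropy(teacher, student)     # distillation loss
L_student = cross_entropy(label, student)    # standard student loss
L = a * L_dist + b * L_student
```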
What are distributional semantics or distributional similarity?
This is the idea that the meaning of a word comes from its context, or the other nearby words that frequently appear around it.
In Collobert and Weston vectors, what is a positive and negative example?
Given a sample context, a positive example is one in which all words make sense in their context. A negative example is one in which a random word appears in that context.
What is the difference between intrinsic and extrinsic evaluation of word embeddings?
Intrinsic: evaluation on a specific/intermediate sub-task (e.g., word similarity or analogy tests), rather than on a downstream application.
Extrinsic: evaluation on a task the word embeddings are used in, for example, text classification.
Most of the complexity of Feed-forward NN and RNN language models is caused by non-linear hidden layers. (T/F)
True - see https://arxiv.org/pdf/1709.03856.pdf
Skip-gram predicts context (surrounding) words given the target word. (T/F)
True - see https://arxiv.org/pdf/1709.03856.pdf
Continuous bag of words predicts the context (surrounding) words based on the target word. (T/F)
False. CBOW predicts the target word from the context words. See https://arxiv.org/pdf/1709.03856.pdf
What is the goal of the word2vec objective function?
To cause words that occur in similar contexts to have similar embeddings.
What two algorithms does word2vec use to generate vectors from words?
- Continuous bag of words (CBOW)
- Skip Gram
There is no natural notion of similarity between embedded words in a set of one-hot encoded vectors. (T/F)
True
In word2vec, words are represented by [ ] [ ], and words with similar contexts have vectors with a high [ ] [ ] (like the dot product between the vectors).
In word2vec, words are represented by dense vectors, and words with similar contexts have vectors with a high similarity measure (like the dot product between the vectors).
What objective function do we minimize in word2vec?
J(theta) = -(1/T) Sum_{t=1}^T Sum_{-m <= j <= m, j != 0} log p(w_{t+j} | w_t)
T = number of words in the training text
m = window size
w_t = target (center) word at position t in the text
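A small sketch of one term inside the sum, computing p(context word | target word) as a softmax over dot products of “input” and “output” embeddings (a real implementation would typically replace this full softmax with negative sampling); the embeddings here are random and illustrative:

```python
import numpy as np

def log_p_context_given_target(ctx_idx, tgt_idx, W_in, W_out):
    """log p(w_ctx | w_tgt) = log softmax(W_out @ W_in[tgt])[ctx]."""
    scores = W_out @ W_in[tgt_idx]            # dot product with every vocab word
    scores -= scores.max()                    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return log_probs[ctx_idx]

# Illustrative embeddings: vocab of 5 words, embedding dim 4.
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))

# One term of J(theta): the target at position t is word 2, a context word is word 4.
print(-log_p_context_given_target(ctx_idx=4, tgt_idx=2, W_in=W_in, W_out=W_out))
```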
The input to the skipgram model can be a [ ] encoding of the [ ] word.
The input to the skipgram model can be a one-hot encoding of the target (center) word.
For a visual, see: https://youtu.be/ERibwqs9p38?t=2331
What are graph embeddings?
Graph embeddings are a specific type of embedding that translates graphs, or parts of graphs, into fixed-length vectors.
When you train a skip gram model, the hidden layer contains the word embeddings for your target words. (T/F)
True. For a visualization, see https://youtu.be/oQPCxwmBiWo?t=667
Graph embeddings are a form of [ ] learning on graphs.
Graph embeddings are a form of unsupervised learning on graphs.
List three innovations the inventors of word2vec proposed to improve training of the algorithm used in word2vec (i.e., skip-gram).
- Treating common word pairs or phrases as single “words” in their model.
- Subsampling frequent words to decrease the number of training examples. For each word we encounter in the training text, there is a chance it will be deleted from the text, and this probability is related to the word’s frequency.
- Modifying the optimization objective with a technique they called “Negative Sampling”, which causes each training sample to update only a small percentage of the model’s weights. The probability for selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples.
The StarSpace model consists of learning [ ], each of which is described by a set of discrete [ ] coming from a fixed-length dictionary.
The StarSpace model consists of learning entities, each of which is described by a set of discrete features (a bag-of-features) coming from a fixed-length dictionary.
The StarSpace model cannot be used to compare entities of different kinds. For example, a user entity cannot be compared with an item entity (recommendation), or a document entity with label entities (text classification), and so on. (T/F)
False. One of the important features of StarSpace is that the model can be used to compare entities of different kinds.
What distribution does t-SNE use to measure distance between points in high dimensions?
A normal distribution.
What distribution does t-SNE use to measure distances between points in the lower dimension (e.g., 2 dimensions)?
A t-distribution. This is the “T” in t-SNE.
What is t-SNE used for?
t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map.
It is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data. In simpler terms, t-SNE gives you a feel or intuition of how the data is arranged in a high-dimensional space
What does perplexity balance in t-SNE?
Perplexity balances the attention t-SNE gives to local and global aspects of the data and can have large effects on the resulting plot.
Perplexity is roughly a guess of the number of close neighbors each point has. Thus, a denser dataset usually requires a higher perplexity value.
Graph Embedding
Optimize, via gradient descent, an objective under which connected nodes have more similar embeddings than unconnected nodes.
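A hedged sketch of this idea using a simple margin/hinge objective (one of several possible choices) optimized by gradient steps; the toy graph and hyperparameters are illustrative:

```python
import numpy as np

# Toy graph on 4 nodes: node 3 is not connected to the others.
edges = [(0, 1), (1, 2), (0, 2)]          # connected pairs
non_edges = [(0, 3), (1, 3), (2, 3)]      # unconnected pairs

rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(4, 2))  # one 2-d embedding per node
lr, margin = 0.1, 1.0

for _ in range(200):
    for (u, v), (a, b) in zip(edges, non_edges):
        pos = emb[u] @ emb[v]             # similarity of a connected pair
        neg = emb[a] @ emb[b]             # similarity of an unconnected pair
        if margin - pos + neg > 0:        # hinge loss: want pos > neg + margin
            gu, gv = emb[v].copy(), emb[u].copy()
            ga, gb = emb[b].copy(), emb[a].copy()
            emb[u] += lr * gu             # pull connected embeddings together
            emb[v] += lr * gv
            emb[a] -= lr * ga             # push unconnected embeddings apart
            emb[b] -= lr * gb
```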