Quiz #4 Flashcards
What is an embedding?
A mapping from objects to vectors through a trainable function. Generally, we want that function to create a map in which similar objects are grouped together. Examples:
- Word Embeddings: Word –> Vector
- Graph Embeddings: Node –> Vector
How is a graph embedding learned?
We optimize, via gradient descent, an objective stating that connected nodes should have more similar embeddings than unconnected nodes.
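A minimal sketch of that objective, assuming PyTorch and a made-up toy graph (the node indices, pairs, and margin are illustrative, not from the lectures):

```python
import torch
import torch.nn as nn

# Hypothetical toy graph: node indices and (connected, unconnected) pairs.
num_nodes, dim = 100, 16
emb = nn.Embedding(num_nodes, dim)          # trainable node -> vector map
optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)

def score(u, v):
    # similarity of two node embeddings (dot product)
    return (emb(u) * emb(v)).sum(dim=-1)

# One gradient-descent step: push connected pairs to score higher
# than unconnected ("negative") pairs by a margin of 1.
u = torch.tensor([0, 1, 2])        # anchor nodes
v_pos = torch.tensor([3, 4, 5])    # neighbors of the anchors
v_neg = torch.tensor([6, 7, 8])    # random non-neighbors
loss = torch.relu(1.0 - score(u, v_pos) + score(u, v_neg)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```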
When representing structured information, what three things are important?
1. State: compactly representing all the data we have processed thus far.
2. "Neighborhoods": What other elements to incorporate? (e.g. spatial, part-of-speech, etc.)
   - Can be seen as selecting from a set of elements
   - Typically uses some similarity measure or attention
3. Propagation of Information: How to update information given the selected elements.
In a fully connected network the weights that are applied to the input are data-dependent? (True/False)
False. In an FCN, the weights are learned and applied to the input regardless of the input values. This is an important driver behind the use of non-local style neural networks. The idea is that instead of outputting a simple dot product of the weights and the input, we use a similarity function f (for instance, the exponentiated dot product exp(x_i^T x_j)) to modulate a representation of input element j, such as W_g * x_j. This is a powerful concept because it makes the WEIGHTS of the network DATA-DEPENDENT, since we are modulating our feature representation by the similarity of two features. It allows the network to LEARN, for each piece of data, what is SALIENT. This is really the main idea behind ATTENTION MECHANISMS. See the 14:00/16:19 mark in Module 3 Lesson 11 "Structures and Representations" for a review of this concept.
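A toy sketch of that non-local/attention computation, assuming PyTorch; all shapes and tensors below are made up for illustration:

```python
import torch

# The aggregation weights come from the data via a similarity function
# f(x_i, x_j) = exp(x_i^T x_j), rather than from fixed learned weights alone.
N, d = 5, 8                       # sequence length, feature dim (made up)
x = torch.randn(N, d)             # input elements x_1 ... x_N
W_g = torch.randn(d, d)           # learned linear map g(x_j) = W_g x_j

sim = torch.exp(x @ x.t())        # f(x_i, x_j) for all pairs, shape (N, N)
weights = sim / sim.sum(dim=1, keepdim=True)   # normalize per query i
g = x @ W_g.t()                   # representation of each element j
y = weights @ g                   # y_i = sum_j f(x_i, x_j) * g(x_j) / Z_i
```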
What is a conditional language model?
It's just like a standard language model (i.e. the probability of a word given all the previously occurring words) but conditioned on an extra context 'c'. Examples:
- Topic-aware language model
- Text summarization
- Machine Translation
- Image Captioning
- Optical Character Recognition
- Speech recognition
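One possible (purely illustrative, not from the lectures) way to wire the extra context 'c' into a neural language model is to concatenate it to each word embedding; a PyTorch sketch with made-up sizes:

```python
import torch
import torch.nn as nn

# Minimal sketch of conditioning a language model on extra context c
# (e.g. a topic vector or an image feature); all names/sizes are illustrative.
vocab, emb_dim, ctx_dim, hidden = 1000, 32, 16, 64
embed = nn.Embedding(vocab, emb_dim)
rnn = nn.GRU(emb_dim + ctx_dim, hidden, batch_first=True)
out = nn.Linear(hidden, vocab)

tokens = torch.randint(0, vocab, (1, 7))    # previously generated words
c = torch.randn(1, ctx_dim)                 # the extra context 'c'

x = embed(tokens)                                        # (1, 7, emb_dim)
x = torch.cat([x, c.unsqueeze(1).expand(-1, 7, -1)], dim=-1)
h, _ = rnn(x)
logits = out(h[:, -1])                      # scores over the next word
p_next = logits.softmax(dim=-1)             # P(w_t | w_<t, c)
```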
What are four problems that arise if you try to use MLPs/FC networks for modeling sequential data?
1. Cannot easily support variable-sized sequences as inputs or outputs.
2. No inherent temporal structure.
3. No practical way of holding state.
4. The size of the network grows with the maximum allowed size of the input or output sequences.
The lower the PERPLEXITY score, the better a model is? (True/False)
True. Perplexity is the exponentiated average negative log-likelihood per word, so a lower score means the model assigns higher probability to the observed text.
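Concretely, under that definition, a computation sketch (assuming PyTorch; the logits and targets below are random placeholders, not real model output):

```python
import torch
import torch.nn.functional as F

# Perplexity = exp(average negative log-likelihood per token),
# so a better model gives a lower score.
logits = torch.randn(20, 1000)            # scores for 20 tokens, vocab of 1000
targets = torch.randint(0, 1000, (20,))   # the tokens that actually occurred
nll = F.cross_entropy(logits, targets)    # mean negative log-likelihood
perplexity = torch.exp(nll)
```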
What are language models fundamentally used for?
To estimate the probability of a sequence of words, i.e. the probability of each word given all the preceding words.
What is Masked Language Modeling?
It is an auxiliary task, different from the final task we're interested in, but which can help us achieve better performance by finding good initial parameters for the model. Concretely, a fraction of the input tokens is hidden (masked), and the model is trained to predict them from the surrounding context.
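An illustrative masking step in that spirit (assuming PyTorch; MASK_ID, the 15% rate, and the commented-out `model` call are placeholders, not any specific library's API):

```python
import torch

# Hide a random subset of tokens and train the model to recover them.
MASK_ID = 103                              # placeholder id for the [MASK] token
tokens = torch.randint(1000, 30000, (1, 12))
mask = torch.rand(tokens.shape) < 0.15     # mask ~15% of positions
inputs = tokens.clone()
inputs[mask] = MASK_ID                     # the model sees [MASK] here
labels = tokens.clone()
labels[~mask] = -100                       # only masked positions contribute to the loss
# loss = F.cross_entropy(model(inputs).view(-1, vocab), labels.view(-1), ignore_index=-100)
```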
A recurrent network unfolded in time is really just a very deep feedforward network with shared weights? (True/False)
True. (Bengio et al., 1994)
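A sketch that makes the weight sharing explicit (PyTorch, made-up dimensions): the same W_x and W_h are reused at every step, so the unrolled computation is a T-layer feedforward net with tied weights.

```python
import torch

T, d_in, d_h = 6, 4, 8
W_x, W_h, b = torch.randn(d_h, d_in), torch.randn(d_h, d_h), torch.zeros(d_h)
x = torch.randn(T, d_in)
h = torch.zeros(d_h)
for t in range(T):                 # one "layer" per time step, shared parameters
    h = torch.tanh(W_x @ x[t] + W_h @ h + b)
```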
Gradient descent becomes increasingly inefficient when the temporal span of the dependencies increases? (True/False).
True. See Bengio et al., 1994 for a good discussion of the problems associated with training neural networks with long-term dependencies.
What are the four main components of an LSTM network?
1. Input gate: decides what new information we're going to store in the cell state. This has two parts: first, a sigmoid layer called the "input gate layer" decides which values we'll update; next, a tanh layer creates a vector of new candidate values, C~t, that could be added to the state.
2. Forget gate: responsible for deciding what information is to be thrown away or kept from the last step. This is done by the first sigmoid layer.
3. Cell state: essentially the memory of an LSTM, and the key that makes them much more performant on long sequences than vanilla RNNs. At each time step the previous cell state (C_t-1) combines with the forget gate to decide what information is carried forward, which in turn combines with the input gate (i_t and C~t) to form the new cell state, i.e. the new memory of the cell.
4. Output gate: produces the final output of the LSTM cell. The cell state obtained above is passed through the hyperbolic tangent (tanh) so its values are squashed between -1 and 1, and the result is multiplied by the output gate's sigmoid activation to form the new hidden state.

https://towardsdatascience.com/lstm-gradients-b3996e6a0296
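One LSTM step written out gate-by-gate to match the description above (a PyTorch sketch with illustrative shapes and random weights, not a trained model):

```python
import torch

d_x, d_h = 4, 8
x_t = torch.randn(d_x)
h_prev, c_prev = torch.zeros(d_h), torch.zeros(d_h)
W = {k: torch.randn(d_h, d_x + d_h) for k in ("i", "f", "o", "c")}
xh = torch.cat([x_t, h_prev])

i_t = torch.sigmoid(W["i"] @ xh)          # input gate: what to write
f_t = torch.sigmoid(W["f"] @ xh)          # forget gate: what to keep from C_t-1
o_t = torch.sigmoid(W["o"] @ xh)          # output gate: what to expose
c_tilde = torch.tanh(W["c"] @ xh)         # candidate cell values C~t
c_t = f_t * c_prev + i_t * c_tilde        # new cell state (the "memory")
h_t = o_t * torch.tanh(c_t)               # new hidden state / output
```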
What is the ‘Cell State’ in an LSTM network?
Essentially the memory of an LSTM, and the key that makes them much more performant on long sequences than vanilla RNNs. The cell state acts as a transport highway that carries relevant information all the way down the sequence chain. You can think of it as the "memory" of the network. The cell state, in theory, can carry relevant information throughout the processing of the sequence, so even information from the earlier time steps can make its way to later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information gets added to or removed from it via gates. The gates are different neural networks that decide which information is allowed on the cell state; they can learn what information is relevant to keep or forget during training. https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
What is the output range of the tanh function?
[-1, 1]
What is the output range of the sigmoid function? Why is this output range significant in the context of recurrent style NNs?
[0, 1]. It can be useful, for example, as part of the "forget gate" structure. If an element of the sigmoid's output is 0, multiplying it elementwise with another vector zeroes out the corresponding element, i.e. "forgets" it, while an output of 1 passes the element through unchanged.
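A tiny demonstration of that gating idea (PyTorch, made-up numbers): sigmoid outputs near 0 erase elements, outputs near 1 pass them through.

```python
import torch

state = torch.tensor([2.0, -1.0, 0.5, 3.0])
gate_logits = torch.tensor([-10.0, 10.0, 0.0, 10.0])
gate = torch.sigmoid(gate_logits)     # approx. [0, 1, 0.5, 1]
gated = gate * state                  # approx. [0, -1, 0.25, 3]; first element is "forgotten"
```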
The price recurrent networks pay for their reduced number of parameters is that optimizing the parameters may be difficult? (True/False)
True. See page 379 of DL book.
What are the four major components of an RNN?
1. Input
2. Hidden State
3. Weights/Parameters
4. Output
Why is the use of fully connected layers/MLPs problematic for sequential/time-series data?
Since each of the weights and biases in a fully-connected network is INDEPENDENT, there’s no real way of maintaining the structure and order in the data. You could in theory make the network so large that it would have the capacity to memorize the order/structure information, but this would be so brittle and prone to overfitting that it doesn’t work in any practical setting.