Quiz #4 Flashcards
(86 cards)
Name five types of neural network architectures.
- Fully connected Neural Networks
- Convolutional Neural Networks
- Recurrent Neural Networks
- Attention-Based Networks
- Graph-Based Networks
What is an embedding?
A learned map from entities to vectors of numbers that encode similarities. For example, you can have a word embedding that maps a word -> vector, or a graph embedding that maps a node -> vector.
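A minimal sketch of the idea (not a trained model): the embedding is just a lookup table from words to vectors, and similarity between vectors stands in for similarity between words. The words, dimensions, and values below are made up for illustration.

```python
import numpy as np

# Hypothetical 4-dimensional word embedding: each word maps to a vector.
# In practice these vectors are learned during training.
embedding = {
    "dog":  np.array([ 0.2, -0.1,  0.7,  0.0]),
    "cat":  np.array([ 0.3, -0.2,  0.6,  0.1]),
    "bank": np.array([-0.5,  0.4,  0.0,  0.8]),
}

def cosine_similarity(u, v):
    """Similarity encoded by the embedding: nearby vectors mean similar words."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embedding["dog"], embedding["cat"]))   # high: similar words
print(cosine_similarity(embedding["dog"], embedding["bank"]))  # lower: dissimilar words
```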
Why were architectures like RNNs, Attention-Based Networks, and Graph-Based Networks developed?
Generally, we want to develop models that can learn relationships between objects:
- We want to model hierarchical composition in additional types of data, such as speech and natural language.
- Additionally, we want to model the relationships between elements in a scene (as in a scene graph).
- We may want to model inter-relationships between things like words or concepts.
What are three important things needed to represent structural information?
- State: compactly representing all the data we’ve processed so far. These are the nodes in a graph.
- Neighborhoods: These are the edges in the graph. They represent relationships and can be calculated using something like a similarity measure or attention.
- Propagation of information: creating states, or vectors, that represent concepts
Given a set of vectors U = {u_1, … , u_n}, provide an equation you can use to find the most similar vector (p) to a given vector q.
You can select a most similar vector using softmax. Stacking the u_i as the rows of a matrix U:
p = Softmax(Uq)
The softmax turns the similarity scores Uq into selection weights, with the largest weight on the u_i most similar to q.
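A minimal numpy sketch of this selection, assuming the u_i are stacked as the rows of U (all values illustrative): Softmax(Uq) gives one weight per candidate vector, and the most similar u_i gets the largest weight.

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Rows of U are the candidate vectors u_1, ..., u_n (illustrative values).
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.7, 0.7]])
q = np.array([0.9, 0.1])               # query vector

weights = softmax(U @ q)               # Softmax(Uq): one weight per u_i
most_similar = U[np.argmax(weights)]   # hard selection: the highest-weight u_i
soft_selection = weights @ U           # soft (attention-style) selection

print(weights, most_similar, soft_selection)
```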
Language models allow us to [1.] and [2.].
- estimate probabilities of sequences of words such as p(“I eat an apple”).
- perform comparisons: p(“I eat an apple”) > p(“Dromiceiomimus does yoga with T-Rex”)
How can you describe a sentence, s, as a product of probabilities?
Expressing the probability of s as p(s) = p(w_1, w_2, ..., w_n), where w_1 is the first word in the sentence, we can use the chain rule to express this probability as:
p(w_1)p(w_2|w_1)…p(w_n|w_{n-1},…,w_1)
Or more generally:
Product_i p(w_i | w_{i-1}, … , w_1)
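A minimal sketch of this product, assuming a hypothetical cond_prob(word, history) stand-in for a trained language model; here it only looks at the previous word, and the probabilities are made up.

```python
# Minimal sketch of p(s) = Product_i p(w_i | w_{i-1}, ..., w_1).
def sentence_probability(words, cond_prob):
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, words[:i])          # p(w_i | w_{i-1}, ..., w_1)
    return p

# Toy table keyed by (previous word, word); None marks the sentence start.
toy_table = {(None, "I"): 0.2, ("I", "eat"): 0.3, ("eat", "an"): 0.4, ("an", "apple"): 0.1}

def cond_prob(word, history):
    prev = history[-1] if history else None
    return toy_table.get((prev, word), 0.01)

print(sentence_probability(["I", "eat", "an", "apple"], cond_prob))  # 0.2*0.3*0.4*0.1 = 0.0024
```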
Language models are generative models of language. We can generate new sequences of words from them given a history of past words. (T/F)
True. We can generate new words by repeatedly sampling each next word from its conditional probability given the history:
p(w_i | w_{i-1}, ... , w_1)
Here, w_i is the next word and w_{i-1}, ... , w_1 is the history.
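A minimal generation sketch. For brevity the hypothetical model below conditions only on the previous word (a bigram-style approximation of the full history), and all probabilities are made up.

```python
import random

# Hypothetical next-word distributions; "<s>" marks the start of the sequence.
next_word = {
    "<s>": {"I": 0.6, "They": 0.4},
    "I":   {"eat": 0.5, "run": 0.5},
    "eat": {"an": 0.7, "the": 0.3},
    "an":  {"apple": 1.0},
}

def generate(max_len=5):
    history = ["<s>"]
    for _ in range(max_len):
        dist = next_word.get(history[-1])
        if not dist:
            break
        words, probs = zip(*dist.items())
        history.append(random.choices(words, weights=probs)[0])  # sample w_i ~ p(. | history)
    return history[1:]

print(generate())
```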
List 3 applications of language modeling and give an example for each.
Predictive typing:
- Search fields (e.g. Google)
- text completion on phone
- assisted typing (e.g. sentence completion)
Automatic speech recognition:
- How likely is the user to have said “my hair is wet” vs “my hairy sweat”?
Basic grammar correction:
- p(“They’re happy together”) > p(“Their happy together”)
What is the product you can use to calculate p(s|c), where s is a sentence and c is a provided context?
p(s|c) = Product_i p(w_i | c, w_{i-1}, … , w_1).
Note: this is like a standard language model, but with the added conditioning context, c.
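A minimal sketch of the same chain-rule product with the context c added, assuming a hypothetical cond_prob(word, context, history) stand-in for a trained conditional model.

```python
# Sketch of p(s | c) = Product_i p(w_i | c, w_{i-1}, ..., w_1).
def conditional_sentence_probability(words, context, cond_prob):
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, context, words[:i])   # every factor is also conditioned on c
    return p
```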
Provide 3 examples of how one can use conditional language models in NLP tasks
- Topic-aware language model: c = topic, s = text
- Text summarization: c = long document, s = summary
- Machine translation: c = French text, s = English text
Provide 3 examples of how one can use conditional language models in non-NLP tasks
- Image captioning: c = an image, s = its caption
- Optical character recognition: c = image of a line of text, s = its content
- Speech recognition: c = a recording, s = its content
Speech recognition and optical character recognition are [ ] -> [ ] sequence models. These types of models are also referred to as [ ] -> [ ]
many -> many
encoder -> decoder
Sentiment analysis and topic classification are [ ] -> [ ] sequence models.
many -> one
Image captioning models are [ ] -> [ ] sequence models.
one -> many
How can you use one-hot encoding to represent words in a vocabulary?
A vector is created for each word in the vocabulary, with length equal to the vocabulary size. All elements are zero, except for a one at the index corresponding to that word’s position in the vocabulary (in the example below, the vocabulary is just the three words of the sentence, in order).
The dog barks:
The: [1,0,0]
dog: [0,1,0]
barks: [0,0,1]
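A minimal sketch of this encoding for the example vocabulary above:

```python
import numpy as np

vocab = ["The", "dog", "barks"]               # vocabulary from the example above
index = {w: i for i, w in enumerate(vocab)}   # word -> position in the vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0                      # single 1 at the word's vocabulary index
    return v

print(one_hot("dog"))   # [0. 1. 0.]
```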
What are the pain points of using a multi-layer perceptron to process sequence data?
- cannot easily support variable-sized sequences as inputs or outputs
- no inherent temporal structure. The multi-layer perceptron doesn’t keep track of which element in the sequence came first, second, third, etc.
- There is no practical way of holding state: there is no memory of which words came before, e.g. “bank” appearing after “river” vs. “money at the”.
- Size of the network grows with the maximum allowed size of the input and output sequence we want to support
Describe how a single RNN node works at a given time-step t during training vs. inference.
At time-step t, the node receives an input, which it uses to update a state, h_t.
To update the state, the node also has access to the state at the previous time-step, h_{t-1} and either:
- the ground truth (expected) output from the previous time step during training when using teacher forcing
- the predicted output from the previous time step during inference.
This becomes a recursive algorithm, in which f_theta is repeatedly called to update the state.
h_t = f_{theta}(h_{t-1}, x_t)
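A minimal sketch of that recursion: the same f_theta (with the same parameters) is applied at every time-step to produce the next state. The toy usage below is purely illustrative.

```python
def run_rnn(f_theta, xs, h0):
    """Unroll the recursion h_t = f_theta(h_{t-1}, x_t) over an input sequence."""
    h = h0
    states = []
    for x_t in xs:
        h = f_theta(h, x_t)      # same function, same parameters, at every time-step
        states.append(h)
    return states

# Toy usage: a "state" that just accumulates the inputs seen so far.
print(run_rnn(lambda h, x: h + x, xs=[1, 2, 3], h0=0))   # [1, 3, 6]
```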
Describe the steps of backpropagation through an RNN
- Run the network and compute the outputs.
- Compute the loss (typically a function of all outputs).
- Perform the backward step to compute gradients.
- For models with a large number of time-steps (and thus layers), we can use truncated back-prop through time.
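A PyTorch-style sketch of truncated back-prop through time (model sizes, chunk length, and data are illustrative, not from the course): the key step is detaching the state at each chunk boundary so gradients only flow through a limited number of time-steps.

```python
import torch
import torch.nn as nn

cell, readout = nn.RNNCell(8, 16), nn.Linear(16, 1)     # illustrative sizes
opt = torch.optim.SGD(list(cell.parameters()) + list(readout.parameters()), lr=0.01)
loss_fn = nn.MSELoss()

xs = torch.randn(100, 1, 8)       # 100 time-steps, batch of 1, 8 features
ys = torch.randn(100, 1, 1)       # dummy targets
h = torch.zeros(1, 16)
chunk = 20                        # truncation length

for start in range(0, 100, chunk):
    h = h.detach()                # cut the graph: gradients stop at the chunk boundary
    loss = 0.0
    for t in range(start, start + chunk):
        h = cell(xs[t], h)        # forward through the chunk
        loss = loss + loss_fn(readout(h), ys[t])
    opt.zero_grad()
    loss.backward()               # backprop only through this chunk's time-steps
    opt.step()
```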
In an RNN model, effectively you have as many layers as [ ].
In an RNN model, effectively you have as many layers as time-steps.
What is the formal definition of an RNN?
A neural network whose information flow does not follow a directed acyclic graph.
What is the equation to update the state (h_t) using an Elman (Vanilla) RNN?
h_t = activation_function(U_theta x_t + V_theta h_{t-1} + bias_theta)
- h_t = next state
- x_t = input
- U_theta = learned matrix we multiply the input by to perform an affine transformation
- V_theta = learned matrix we multiply the previous state by to perform an affine transformation
- h_{t-1} = previous state
- bias_theta = learned bias term
What is the equation to update the output (y_t) using an Elman (Vanilla) RNN?
y_t = activation_function(W_theta h_t + Beta_theta)
- y_t = output
- h_t = recently updated state
- W_theta = learned matrix we multiply the recently updated state by to perform an affine transformation
- Beta_theta = learned bias term
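A minimal numpy sketch that transcribes the two Elman equations directly (dimensions and parameter values are illustrative, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 4, 2                      # illustrative dimensions

# "Learned" parameters (random here, just to make the sketch runnable).
U, V, b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
W, beta = rng.normal(size=(d_out, d_h)), np.zeros(d_out)

def elman_step(h_prev, x_t):
    h_t = np.tanh(U @ x_t + V @ h_prev + b)     # h_t = act(U_theta x_t + V_theta h_{t-1} + bias_theta)
    y_t = np.tanh(W @ h_t + beta)               # y_t = act(W_theta h_t + Beta_theta)
    return h_t, y_t

h, x = np.zeros(d_h), rng.normal(size=d_in)
h, y = elman_step(h, x)
print(h.shape, y.shape)   # (4,) (2,)
```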
Activations used for RNN
Sigmoid, Tanh, other nonlinear functions