Quiz #4 Flashcards

1
Q

What is an embedding?

A

A mapping from objects to vectors via a trainable function. Generally, we want that function to place similar objects close together in the vector space. Examples: word embeddings (word –> vector), graph embeddings (node –> vector).

2
Q

How is a graph embedding learned?

A

Via gradient descent, we optimize an objective that encourages connected nodes to have more similar embeddings than unconnected nodes.
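A minimal sketch of this idea, assuming a toy edge list and a simple margin objective (not necessarily the exact formulation from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: edges over 5 nodes (hypothetical example data).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
num_nodes, dim, lr = 5, 8, 0.1

Z = rng.normal(scale=0.1, size=(num_nodes, dim))  # one embedding per node

for step in range(200):
    i, j = edges[rng.integers(len(edges))]   # connected (positive) pair
    k = rng.integers(num_nodes)              # random node, likely unconnected
    # Hinge loss: want sim(i, j) to exceed sim(i, k) by a margin of 1.
    if 1.0 - Z[i] @ Z[j] + Z[i] @ Z[k] > 0:
        Z[i] -= lr * (Z[k] - Z[j])           # gradient steps on the violated triple
        Z[j] -= lr * (-Z[i])
        Z[k] -= lr * Z[i]
```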

3
Q

When representing structured information, what three things are important?

A
  1. State: compactly representing all the data we have processed thus far.
  2. “Neighborhoods”: what other elements to incorporate (e.g. spatial, part-of-speech, etc.)?
     * Can be seen as selecting from a set of elements
     * Typically use some similarity measure or attention
  3. Propagation of information: how to update information given the selected elements.
4
Q

In a fully connected network the weights that are applied to the input are data-dependent? (True/False)

A

False. In an FCN, the weights are learned and applied to the input regardless of the input values. This is an important driver behind the use of non-local-style neural networks. The idea is that instead of outputting a simple dot product of the weights and the input, we use a similarity function ‘f’ (for instance, the exponentiated dot product exp(x_i^T x_j)) and use it to modulate a representation of input element j, such as W_g x_j. This is a powerful concept because it makes the WEIGHTS of the network DATA-DEPENDENT, since we’re modulating our feature representation by the similarity of two features. It allows the network to LEARN, for each piece of data, what is SALIENT. This is really the main idea behind ATTENTION MECHANISMS. See the 14:00/16:19 mark in Module 3 Lesson 11 “Structures and Representations” for a review of this concept.
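A minimal NumPy sketch of such a non-local operation; the name W_g follows the description above, and the softmax-style normalization over j is an assumption:

```python
import numpy as np

def non_local(X, W_g):
    """X: (n, d) input elements; W_g: (d, d) learned transform.
    output_i = sum_j softmax_j(x_i . x_j) * (W_g x_j)."""
    scores = X @ X.T                             # pairwise similarities f(x_i, x_j)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # normalize over j
    return attn @ (X @ W_g.T)                    # data-dependent weighted sum

rng = np.random.default_rng(0)
X, W_g = rng.normal(size=(4, 8)), rng.normal(size=(8, 8))
out = non_local(X, W_g)   # shape (4, 8)
```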

5
Q

What is a conditional language model?

A

It’s just like the standard language model (i.e. the probability of a word given all the previously occurring words), but conditioned on an extra context ‘c’. Examples:
* Topic-aware language model
* Text summarization
* Machine translation
* Image captioning
* Optical character recognition
* Speech recognition

6
Q

What are four problems that arise if you try to use MLPs/FC networks for modeling sequential data?

A
  1. Cannot easily support variable-sized sequences as inputs or outputs.
  2. No inherent temporal structure.
  3. No practical way of holding state.
  4. The size of the network grows with the maximum allowed size of the input or output sequences.
7
Q

The lower the PERPLEXITY score, the better a model is? (True/False)

A

True. Perplexity is the exponentiated average negative log-likelihood of held-out text, so a lower score means the model assigns higher probability to the observed sequences.
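A minimal sketch of the computation, using made-up per-token probabilities:

```python
import math

# Hypothetical per-token probabilities a language model assigned to a held-out sequence.
token_probs = [0.2, 0.05, 0.4, 0.1]

# Perplexity = exp(average negative log-likelihood); lower is better.
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(math.exp(nll))  # ~7.07 for these values
```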

8
Q

What are language models fundamentally used for?

A

To estimate the probability of a sequence of words, i.e. the probability of each word given all the preceding words.

9
Q

What is Masked Language Modeling?

A

It is an auxiliary (pretraining) task, different from the final task we’re interested in: some input tokens are masked out and the model is trained to predict them from the surrounding context. It can help us achieve better performance on the final task by finding good initial parameters for the model.

10
Q

A recurrent network unfolded in time is really just a very deep feedforward network with shared weights? (True/False)

A

True. (Bengio et al., 1994)

11
Q

Gradient descent becomes increasingly inefficient when the temporal span of the dependencies increases? (True/False).

A

True. See Bengio et al. (1994) for a good discussion of the problems associated with training NNs on long-term dependencies.

12
Q

What are the four main components of an LSTM network?

A
  1. Input gate: decides what new information we’re going to store in the cell state. This has two parts: first, a sigmoid layer called the “input gate layer” decides which values we’ll update; next, a tanh layer creates a vector of new candidate values, C~t, that could be added to the state.
  2. Forget gate: responsible for deciding what information is to be thrown away or kept from the last step. This is done by the first sigmoid layer.
  3. Cell state: essentially the memory of an LSTM, and the key that makes them much more performant on long sequences than vanilla RNNs. At each time step the previous cell state (C_t-1) combines with the forget gate to decide what information is carried forward, which in turn combines with the input gate (i_t and C~t) to form the new cell state, i.e. the new memory of the cell.
  4. Output gate: produces the final output of the LSTM cell. The cell state obtained above is passed through tanh so its values are squashed between -1 and 1, and this is multiplied by a sigmoid layer that decides which parts of the cell state to output.

https://towardsdatascience.com/lstm-gradients-b3996e6a0296
13
Q

What is the ‘Cell State’ in an LSTM network?

A

Essentially the memory of an LSTM, and the key that makes them much more performant on long sequences than vanilla RNNs. The cell state acts as a transport highway that transfers relevant information all the way down the sequence chain. You can think of it as the “memory” of the network. The cell state, in theory, can carry relevant information throughout the processing of the sequence, so even information from earlier time steps can make its way to later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information gets added to or removed from it via gates. The gates are different neural networks that decide which information is allowed onto the cell state; they can learn what information is relevant to keep or forget during training. https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

14
Q

What is the output range of the tanh function?

A

[-1, 1]

15
Q

What is the output range of the sigmoid function? Why is this output range significant in the context of recurrent style NNs?

A

[0, 1]. It can be useful, for example, as part of the “forget gate” structure. If the output of the sigmoid is 0, then we could multiply it with some other input vector in order to zero out the elements in the vector, i.e. “forgetting” it.

16
Q

The price recurrent networks pay for their reduced number of parameters is that optimizing the parameters may be difficult? (True/False)

A

True. See page 379 of DL book.

17
Q

What are the four major components of an RNN?

A
  1. Input
  2. Hidden state
  3. Weights/parameters
  4. Output
18
Q

Why is the use of fully connected layers/MLPs problematic for sequential/time-series data?

A

Since each of the weights and biases in a fully-connected network is INDEPENDENT, there’s no real way of maintaining the structure and order in the data. You could in theory make the network so large that it would have the capacity to memorize the order/structure information, but this would be so brittle and prone to overfitting that it doesn’t work in any practical setting.

19
Q

What role does the hidden state h(t) play in an RNN, and what is it a function of?

A

It’s a contextual vector at time t that acts as a “memory” of the past state(s) of the network. It is calculated as a function of the current input x(t) and the previous hidden state h(t-1).

20
Q

The hidden state and the inputs use the same weights and biases in an RNN? (True/False)

A

False. This is really a trick question. RNNs DO share weights/biases, but they are shared across time. However, the hidden state and the input don’t use copies of the same weights; each has its own set. More concretely, say we have the input weight matrix U and the hidden-state weight matrix V. If we have the sentence fragment “The quick brown fox”, we would feed each word in one at a time, and U and V would be applied at every step. Then in the backprop update, the weights of both U and V would be updated based on the gradient.

21
Q

RNNs share the same weights across time? (True/False)

A

True. This parameter sharing is one of the primary benefits of RNNs, as it allows the model to retain information about structure and temporal ordering.

22
Q

What is the “Distributional Semantics” concept?

A

It is the idea that if you understand the context that a word is used in, then it means you must have some understanding of the word itself. “You shall know a word by the company it keeps.” (Firth, 1957) It is one of the most successful ideas of modern statistical NLP.

23
Q

What is the CONTEXT of a word ‘w’ that appears in a text?

A

It is the words that appear nearby, i.e. within some fixed-size window. We can use the many different contexts that w appears in to build up a representation of the word.

24
Q

What is the idea behind Collobert & Weston vectors?

A

A word and its context comprise a POSITIVE training sample; a random word inserted in that sample context comprises a NEGATIVE training sample.

25
Q

What is the idea behind Word2Vec (i.e. the skip-gram model)?

A

Using a word to predict its context words, where the context is a fixed window of m words on either side (2m context words in total). For each position t, it defines a probability distribution over each context word at positions t-m to t+m (excluding t itself) conditioned on the center word at position t.

26
Q

What is the objective function of the Word2Vec model?

A

Average negative log-likelihood, where the likelihood is computed by multiplying together the probabilities of all the words within the context window conditioned on the center word at position ‘t’, across all positions t, with the parameters theta being the quantities optimized.
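In the usual skip-gram notation (corpus of length T, window of m words on each side), the objective can be written as:

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)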

27
Q

Word2Vec is fast to compute compared to earlier NN-based models? (True/False)

A

True. This is because it only optimizes the word vectors themselves, unlike the earlier NN-based models, which also had to train hidden layers.

28
Q

How many sets of vectors are used for each word in the vocabulary for the Word2Vec model?

A

Two. U_w when w is a center word, and V_o when o is a context word. To measure how likely word w appears with context word o, we use the inner product between U_w and V_o (with a softmax formulation so that the output is a probability in the range [0, 1]).
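With this notation, and vocabulary V, the softmax formulation is:

P(o \mid w) = \frac{\exp(U_w^\top V_o)}{\sum_{o' \in V} \exp(U_w^\top V_{o'})}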

29
Q

How many parameters are used in the Word2Vec model (in terms of the center-word and context-word vectors U_w and V_o)?

A

The total parameters are the set of all center-word vectors {U_w} and the set of all context-word vectors {V_o}. With d-dimensional vectors and a vocabulary of size |V|, that is 2 · d · |V| parameters.

30
Q

Why is SGD expensive to compute for the Word2Vec model (when using the basic softmax formulation)? What are two approaches for dealing with this?

A

Because the softmax denominator requires a sum over our entire vocabulary, which is huge. Two approaches mentioned that can mitigate this:
1. Hierarchical softmax
2. Negative sampling

31
Q

What are the two main ways of evaluating word embeddings?

A
  1. Intrinsic
     * Evaluation on a specific/intermediate subtask
     * Fast to compute
     * Helps to understand the system
     * Not clear if it is really helpful unless a correlation to the real task is established
  2. Extrinsic
     * Evaluation on a real task (e.g. text classification)
     * Can take a long time to compute
     * Difficult to debug: unclear if the subsystem is the problem or its interaction with other components
     * If replacing exactly one subsystem with another improves accuracy –> winning!
32
Q

What is the idea behind ‘intrinsic’ word embedding evaluation?

A

Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions. Example: Man:Woman::King:?
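A minimal sketch of this analogy test; the `embeddings` dictionary and its random values are placeholders standing in for real pre-trained vectors:

```python
import numpy as np

# Placeholder "pre-trained" word vectors (random values for illustration only).
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ["man", "woman", "king", "queen", "apple"]}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# man : woman :: king : ?   ->   query vector = king - man + woman
query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
candidates = {w: v for w, v in embeddings.items() if w not in {"man", "woman", "king"}}
print(max(candidates, key=lambda w: cosine(query, candidates[w])))
# With real embeddings this should print "queen".
```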

33
Q

Graph embeddings can be thought of as a generalization of word embeddings? (True/False)

A

True. In a graph context, an embedding is a learned map from entities to vectors of numbers that encodes similarity. This is similar to word embeddings, except that instead of mapping from word –> vector we map from node –> vector, and we want similar nodes to have similar vectors.

34
Q

Graph embeddings are a form of unsupervised learning on graphs? (True/False)

A

True.

35
Q

What are three reasons graph embeddings are useful?

A
  1. They are task-agnostic entity representations.
  2. The features are useful on downstream tasks without much data.
  3. Nearest neighbors are semantically meaningful.
36
Q

The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word? (True/False)

A

True. See page 5 of https://arxiv.org/pdf/1301.3781.pdf for a very useful diagram of this.

37
Q

What is t-SNE?

A

t-Distributed Stochastic Neighbor Embedding. It’s a dimensionality reduction tool that is useful for visualizing high-dimensional datasets.

38
Q

How does t-SNE differ from PCA?

A

t-SNE differs from PCA by preserving only small pairwise distances or local similarities whereas PCA is concerned with preserving large pairwise distances to maximize variance.

39
Q

t-SNE is good for clustering?

A

I would say this is more False than True, although it isn’t totally black and white. While t-SNE might appear to be a good candidate for clustering because of its frequent use in dimensionality reduction to facilitate visualization of high-dimensional data, it’s important to remember that t-SNE is a non-linear transformation that does NOT preserve distances or densities. See this link for a good discussion: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne

40
Q

How is t-SNE performed?

A

The t-SNE algorithm calculates a similarity measure between pairs of instances in the high dimensional space and in the low dimensional space. It then tries to optimize these two similarity measures using a cost function (KL Divergence). https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1
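A minimal usage sketch with scikit-learn; the data X here is random placeholder data:

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder high-dimensional data: 200 points in 50 dimensions.
X = np.random.default_rng(0).normal(size=(200, 50))

# Reduce to 2D for visualization; perplexity controls the effective
# neighborhood size used when matching pairwise similarities.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)  # (200, 2)
```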

41
Q

What is a language model?

A

Fundamentally, it’s a way of estimating the probability of a sequence occurring. For instance, P(“I ate an apple”) intuitively has a much higher probability than P(“I ate an Hawaii”). The ability to perform comparisons like this is one of the most important reasons we use language models! Mathematically: P(sequence) = product over positions of P(next word | history).

42
Q

What is one main reason vanilla RNNs are difficult to train?

A

Vanishing/exploding gradient problems. In backprop, the derivative of the output at time t with respect to the state at time t0 is proportional to the weight matrix raised to the power of (t - t0). This repeated application of the weights means that if the magnitude of W is > 1, we end up with exploding gradients, and if the magnitude of W is < 1, we get vanishing gradients.
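A tiny numerical illustration of the effect, using a toy recurrence h_t = W h_{t-1} with inputs and nonlinearities ignored:

```python
import numpy as np

def norm_after_steps(scale, steps=50, dim=4, seed=0):
    W = scale * np.eye(dim)                      # toy recurrent weight matrix
    h = np.random.default_rng(seed).normal(size=dim)
    for _ in range(steps):
        h = W @ h                                # repeated application across time
    return np.linalg.norm(h)

print(norm_after_steps(0.9))   # shrinks toward 0 -> vanishing
print(norm_after_steps(1.1))   # blows up         -> exploding
```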

43
Q

What problem do LSTMs aim to remedy that is inherent to vanilla RNNs?

A

Vanishing gradient issues.

44
Q

What is the update rule for the forward pass of vanilla RNNs?

A

https://towardsdatascience.com/recurrent-neural-networks-rnns-3f06d7653a85
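In the standard formulation (notation may differ slightly from the linked article):

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \qquad y_t = W_{hy} h_t + b_y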

45
Q

What are the update equations for an LSTM?

A
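In the standard formulation, with \sigma the sigmoid, \odot elementwise multiplication, and [h_{t-1}, x_t] the concatenation of the previous hidden state and the current input:

\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && \text{(candidate values)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(new cell state)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(new hidden state)}
\end{aligned}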
46
Q

What is teacher/student forcing?

A

Teacher forcing is a strategy for training recurrent neural networks that feeds the ground-truth output from the previous time step in as input, instead of the model’s own output from that time step.

https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/
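A self-contained toy sketch of a decoder trained with teacher forcing; the toy RNN, embeddings, and token ids are all hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 16                                   # toy vocabulary and hidden sizes
E = rng.normal(scale=0.1, size=(V, H))          # token embeddings
W_hh = rng.normal(scale=0.1, size=(H, H))
W_out = rng.normal(scale=0.1, size=(H, V))

def decoder_step(token_id, h):
    """One toy RNN decoder step: returns next-token logits and the new hidden state."""
    h = np.tanh(E[token_id] + W_hh @ h)
    return W_out.T @ h, h

target = [3, 7, 2, 9]                           # hypothetical gold token ids
h, prev, loss = np.zeros(H), 0, 0.0             # token 0 plays the role of <start>
for gold in target:
    logits, h = decoder_step(prev, h)
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    loss += -log_probs[gold]                    # cross-entropy against the gold token
    prev = gold                                 # teacher forcing: feed the GROUND TRUTH,
                                                # not argmax(logits), as the next input
print(loss)
```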

47
Q

What is knowledge distillation?

A

Knowledge distillation refers to the idea of model compression by teaching a smaller ‘student’ network, step by step, what to do using a bigger, already-trained ‘teacher’ network. The teacher’s ‘soft labels’ (its softened output probabilities, and in some variants its intermediate feature maps) are used as training targets, so the student learns to replicate the teacher’s behavior rather than matching only the final hard labels.

https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764
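A minimal sketch of the classic soft-label distillation loss (softened teacher probabilities with temperature T), assuming PyTorch; the logits, labels, alpha, and T values below are placeholders:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend of (a) KL divergence between softened teacher/student distributions
    and (b) ordinary cross-entropy against the hard labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)       # teacher's soft labels
    soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Usage with random placeholder logits: batch of 8 examples, 10 classes.
student_logits, teacher_logits = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```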