Quiz #4 Flashcards

1
Q

What is an embedding?

A

A mapping from objects to vectors via a trainable function. Generally, we want that function to place similar objects close together in the vector space. Examples: word embeddings (word –> vector), graph embeddings (node –> vector).

2
Q

How is a graph embedding learned?

A

Via gradient descent, we optimize an objective that encourages connected nodes to have more similar embeddings than unconnected nodes.
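A minimal sketch of this idea, assuming a toy edge list and a simple margin objective (not necessarily the exact formulation from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: edges over 5 nodes (hypothetical example data).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
num_nodes, dim, lr = 5, 8, 0.1

Z = rng.normal(scale=0.1, size=(num_nodes, dim))  # one embedding per node

for step in range(200):
    i, j = edges[rng.integers(len(edges))]   # connected (positive) pair
    k = rng.integers(num_nodes)              # random node, likely unconnected
    # Hinge loss: want sim(i, j) to exceed sim(i, k) by a margin of 1.
    if 1.0 - Z[i] @ Z[j] + Z[i] @ Z[k] > 0:
        Z[i] -= lr * (Z[k] - Z[j])           # gradient steps on the violated triple
        Z[j] -= lr * (-Z[i])
        Z[k] -= lr * Z[i]
```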

3
Q

When representing structured information, what three things are important?

A
  1. State: compactly representing all the data we have processed thus far.
  2. “Neighborhoods”: what other elements to incorporate (e.g. spatial, part-of-speech, etc.)?
     * Can be seen as selecting from a set of elements
     * Typically use some similarity measure or attention
  3. Propagation of information: how to update information given the selected elements.
4
Q

In a fully connected network the weights that are applied to the input are data-dependent? (True/False)

A

False. In an FCN, the weights are learned and applied to the input regardless of the input values. This is an important driver behind the use of non-local-style neural networks. The idea is that instead of outputting a simple dot product of the weights and the input, we use a similarity function ‘f’ (for instance, the exponentiated dot product exp(x_i^T x_j)) and use it to modulate a representation of input element j, such as W_g x_j. This is a powerful concept because it makes the WEIGHTS of the network DATA-DEPENDENT, since we’re modulating our feature representation by the similarity of two features. It allows the network to LEARN, for each piece of data, what is SALIENT. This is really the main idea behind ATTENTION MECHANISMS. See the 14:00/16:19 mark in Module 3 Lesson 11 “Structures and Representations” for a review of this concept.
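A minimal NumPy sketch of such a non-local operation; the name W_g follows the description above, and the softmax-style normalization over j is an assumption:

```python
import numpy as np

def non_local(X, W_g):
    """X: (n, d) input elements; W_g: (d, d) learned transform.
    output_i = sum_j softmax_j(x_i . x_j) * (W_g x_j)."""
    scores = X @ X.T                             # pairwise similarities f(x_i, x_j)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # normalize over j
    return attn @ (X @ W_g.T)                    # data-dependent weighted sum

rng = np.random.default_rng(0)
X, W_g = rng.normal(size=(4, 8)), rng.normal(size=(8, 8))
out = non_local(X, W_g)   # shape (4, 8)
```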

5
Q

What is a conditional language model?

A

It’s just like the standard language model (i.e. the probability of a word given all the previously occurring words), but conditioned on an extra context ‘c’. Examples:
* Topic-aware language model
* Text summarization
* Machine translation
* Image captioning
* Optical character recognition
* Speech recognition

6
Q

What are four problems that arise if you try to use MLPs/FC networks for modeling sequential data?

A
  1. Cannot easily support variable-sized sequences as inputs or outputs.
  2. No inherent temporal structure.
  3. No practical way of holding state.
  4. The size of the network grows with the maximum allowed size of the input or output sequences.
7
Q

The lower the PERPLEXITY score, the better a model is? (True/False)

A

True. Perplexity is the exponentiated average negative log-likelihood of held-out text, so a lower score means the model assigns higher probability to the observed sequences.
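A minimal sketch of the computation, using made-up per-token probabilities:

```python
import math

# Hypothetical per-token probabilities a language model assigned to a held-out sequence.
token_probs = [0.2, 0.05, 0.4, 0.1]

# Perplexity = exp(average negative log-likelihood); lower is better.
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(math.exp(nll))  # ~7.07 for these values
```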

8
Q

What are language models fundamentally used for?

A

To estimate the probability of a sequence of words, i.e. the probability of each word given all the preceding words.

9
Q

What is Masked Language Modeling?

A

It is an auxiliary (pretraining) task, different from the final task we’re interested in: some input tokens are masked out and the model is trained to predict them from the surrounding context. It can help us achieve better performance on the final task by finding good initial parameters for the model.

10
Q

A recurrent network unfolded in time is really just a very deep feedforward network with shared weights? (True/False)

A

True. (Bengio et al., 1994)

11
Q

Gradient descent becomes increasingly inefficient when the temporal span of the dependencies increases? (True/False).

A

True. See Bengio et al. (1994) for a good discussion of the problems associated with training NNs on long-term dependencies.

12
Q

What are the four main components of an LSTM network?

A
  1. Input gate: decides what new information we’re going to store in the cell state. This has two parts: first, a sigmoid layer called the “input gate layer” decides which values we’ll update; next, a tanh layer creates a vector of new candidate values, C~t, that could be added to the state.
  2. Forget gate: responsible for deciding what information is to be thrown away or kept from the last step. This is done by the first sigmoid layer.
  3. Cell state: essentially the memory of an LSTM, and the key that makes them much more performant on long sequences than vanilla RNNs. At each time step the previous cell state (C_t-1) combines with the forget gate to decide what information is carried forward, which in turn combines with the input gate (i_t and C~t) to form the new cell state, i.e. the new memory of the cell.
  4. Output gate: produces the final output of the LSTM cell. The cell state obtained above is passed through tanh so its values are squashed between -1 and 1, and this is multiplied by a sigmoid layer that decides which parts of the cell state to output.

https://towardsdatascience.com/lstm-gradients-b3996e6a0296
13
Q

What is the ‘Cell State’ in an LSTM network?

A

Essentially the memory of an LSTM, and the key that makes them much more performant on long sequences than vanilla RNNs. The cell state acts as a transport highway that transfers relevant information all the way down the sequence chain. You can think of it as the “memory” of the network. The cell state, in theory, can carry relevant information throughout the processing of the sequence, so even information from earlier time steps can make its way to later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information gets added to or removed from it via gates. The gates are different neural networks that decide which information is allowed onto the cell state; they can learn what information is relevant to keep or forget during training. https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

14
Q

What is the output range of the tanh function?

A

[-1, 1]

15
Q

What is the output range of the sigmoid function? Why is this output range significant in the context of recurrent style NNs?

A

[0, 1]. It can be useful, for example, as part of the “forget gate” structure. If the output of the sigmoid is 0, then we could multiply it with some other input vector in order to zero out the elements in the vector, i.e. “forgetting” it.

16
Q

The price recurrent networks pay for their reduced number of parameters is that optimizing the parameters may be difficult? (True/False)

A

True. See page 379 of DL book.

17
Q

What are the four major components of an RNN?

A
  1. Input
  2. Hidden state
  3. Weights/parameters
  4. Output
18
Q

Why is the use of fully connected layers/MLPs problematic for sequential/time-series data?

A

Since each of the weights and biases in a fully-connected network is INDEPENDENT, there’s no real way of maintaining the structure and order in the data. You could in theory make the network so large that it would have the capacity to memorize the order/structure information, but this would be so brittle and prone to overfitting that it doesn’t work in any practical setting.

19
Q

What role does the hidden state h(t) play in an RNN, and what is it a function of?

A

It’s a contextual vector at time t that acts as a “memory” of the past state(s) of the network. It is calculated as a function of the current input x(t) and the previous hidden state h(t-1).

20
Q

The hidden state and the inputs use the same weights and biases in an RNN? (True/False)

A

False. This is really a trick question. RNNs DO share weights/biases, but they are shared across time. However, the hidden state and the input don’t use copies of the same weights; each has its own set. More concretely, say we have the input weight matrix U and the hidden-state weight matrix V. If we have the sentence fragment “The quick brown fox”, we would feed each word in one at a time, and U and V would be applied at every step. Then in the backprop update, the weights of both U and V would be updated based on the gradient.

21
Q

RNNs share the same weights across time? (True/False)

A

True. This parameter sharing is one of the primary benefits of RNNs, as it allows the model to retain information about structure and temporal ordering.

22
Q

What is the “Distributional Semantics” concept?

A

It is the idea that if you understand the context that a word is used in, then it means you must have some understanding of the word itself. “You shall know a word by the company it keeps.” (Firth, 1957) It is one of the most successful ideas of modern statistical NLP.

23
Q

What is the CONTEXT of a word ‘w’ that appears in a text?

A

It is the words that appear nearby, i.e. within some fixed-size window. We can use the many different contexts that w appears in to build up a representation of the word.

24
Q

What is the idea behind Collobert & Weston vectors?

A

A word and its context comprise a POSITIVE training sample; a random word inserted in that sample context comprises a NEGATIVE training sample.

25
Q

What is the idea behind Word2Vec (i.e. the skip-gram model)?

A

Using a word to predict its context words, where the context is a fixed window of m words on either side (2m context words in total). For each position t, it defines a probability distribution over each context word at positions t-m to t+m (excluding t itself) conditioned on the center word at position t.

26
Q

What is the objective function of the Word2Vec model?

A

Average negative log-likelihood, where the likelihood is computed by multiplying together the probabilities of all the words within the context window conditioned on the center word at position ‘t’, across all positions t, with the parameters theta being the quantities optimized.
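In the usual skip-gram notation (corpus of length T, window of m words on each side), the objective can be written as:

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)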

27
Q

Word2Vec is fast to compute compared to earlier NN-based models? (True/False)

A

True. This is because it only optimizes the word vectors themselves, unlike the earlier NN-based models, which also had to train hidden layers.

28
Q

How many sets of vectors are used for each word in the vocabulary for the Word2Vec model?

A

Two. U_w when w is a center word, and V_o when o is a context word. To measure how likely word w appears with context word o, we use the inner product between U_w and V_o (with a softmax formulation so that the output is a probability in the range [0, 1]).
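With this notation, and vocabulary V, the softmax formulation is:

P(o \mid w) = \frac{\exp(U_w^\top V_o)}{\sum_{o' \in V} \exp(U_w^\top V_{o'})}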

29
Q

How many parameters are used in the Word2Vec model (in terms of the center-word and context-word vectors U_w and V_o)?

A

The total parameters are the set of all center-word vectors {U_w} and the set of all context-word vectors {V_o}. With d-dimensional vectors and a vocabulary of size |V|, that is 2 · d · |V| parameters.

30
Q

Why is SGD expensive to compute for the Word2Vec model (when using the basic softmax formulation)? What are two approaches for dealing with this?

A

Because the softmax denominator requires a sum over our entire vocabulary, which is huge. Two approaches mentioned that can mitigate this:
1. Hierarchical softmax
2. Negative sampling

31
Q

What are the two main ways of evaluating word embeddings?

A
  1. Intrinsic
     * Evaluation on a specific/intermediate subtask
     * Fast to compute
     * Helps to understand the system
     * Not clear if it is really helpful unless a correlation to the real task is established
  2. Extrinsic
     * Evaluation on a real task (e.g. text classification)
     * Can take a long time to compute
     * Difficult to debug: unclear if the subsystem is the problem or its interaction with other components
     * If replacing exactly one subsystem with another improves accuracy –> winning!
32
Q

What is the idea behind ‘intrinsic’ word embedding evaluation?

A

Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions. Example: Man:Woman::King:?
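A minimal sketch of this analogy test; the `embeddings` dictionary and its random values are placeholders standing in for real pre-trained vectors:

```python
import numpy as np

# Placeholder "pre-trained" word vectors (random values for illustration only).
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ["man", "woman", "king", "queen", "apple"]}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# man : woman :: king : ?   ->   query vector = king - man + woman
query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
candidates = {w: v for w, v in embeddings.items() if w not in {"man", "woman", "king"}}
print(max(candidates, key=lambda w: cosine(query, candidates[w])))
# With real embeddings this should print "queen".
```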

33
Q

Graph embeddings can be thought of as a generalization of word embeddings? (True/False)

A

True. In a graph context, an embedding is a learned map from entities to vectors of numbers that encodes similarity. This is similar to word embeddings, except that instead of mapping from word –> vector we map from node –> vector, and we want similar nodes to have similar vectors.

34
Q

Graph embeddings are a form of unsupervised learning on graphs? (True/False)

A

True.

35
Q

What are three reasons graph embeddings are useful?

A
  1. They are task-agnostic entity representations.
  2. The features are useful on downstream tasks without much data.
  3. Nearest neighbors are semantically meaningful.
36
Q

The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word? (True/False)

A

True. See page 5 of https://arxiv.org/pdf/1301.3781.pdf for a very useful diagram of this.

37
Q

What is t-SNE?

A

t-Distributed Stochastic Neighbor Embedding. It’s a dimensionality reduction tool that is useful for visualizing high-dimensional datasets.

38
Q

How does t-SNE differ from PCA?

A

t-SNE differs from PCA by preserving only small pairwise distances or local similarities whereas PCA is concerned with preserving large pairwise distances to maximize variance.

39
Q

t-SNE is good for clustering?

A

I would say this is more False than True, although it isn’t totally black and white. While t-SNE might appear to be a good candidate for clustering because of its frequent use in dimensionality reduction to facilitate visualization of high-dimensional data, it’s important to remember that t-SNE is a non-linear transformation that does NOT preserve distances or densities. See this link for a good discussion: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne

40
Q

How is t-SNE performed?

A

The t-SNE algorithm calculates a similarity measure between pairs of instances in the high dimensional space and in the low dimensional space. It then tries to optimize these two similarity measures using a cost function (KL Divergence). https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1
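A minimal usage sketch with scikit-learn; the data X here is random placeholder data:

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder high-dimensional data: 200 points in 50 dimensions.
X = np.random.default_rng(0).normal(size=(200, 50))

# Reduce to 2D for visualization; perplexity controls the effective
# neighborhood size used when matching pairwise similarities.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)  # (200, 2)
```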

41
Q

What is a language model?

A

Fundamentally, it’s a way of estimating the probability of a sequence occurring. For instance, P(“I ate an apple”) intuitively has a much higher probability than P(“I ate an Hawaii”). The ability to perform comparisons like this is one of the most important reasons we use language models! Mathematically: P(sequence) = product over positions of P(next word | history).

42
Q

What is one main reason vanilla RNNs are difficult to train?

A

Vanishing/exploding gradient problems. In backprop, the derivative of the output at time t with respect to the state at time t0 is proportional to the weight matrix raised to the power of (t - t0). This repeated application of the weights means that if the magnitude of W is > 1, we end up with exploding gradients, and if the magnitude of W is < 1, we get vanishing gradients.
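A tiny numerical illustration of the effect, using a toy recurrence h_t = W h_{t-1} with inputs and nonlinearities ignored:

```python
import numpy as np

def norm_after_steps(scale, steps=50, dim=4, seed=0):
    W = scale * np.eye(dim)                      # toy recurrent weight matrix
    h = np.random.default_rng(seed).normal(size=dim)
    for _ in range(steps):
        h = W @ h                                # repeated application across time
    return np.linalg.norm(h)

print(norm_after_steps(0.9))   # shrinks toward 0 -> vanishing
print(norm_after_steps(1.1))   # blows up         -> exploding
```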

43
Q

What problem do LSTMs aim to remedy that is inherent to vanilla RNNs?

A

Vanishing gradient issues.

44
Q

What is the update rule for the forward pass of vanilla RNNs?

A

https://towardsdatascience.com/recurrent-neural-networks-rnns-3f06d7653a85
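In the standard formulation (notation may differ slightly from the linked article):

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \qquad y_t = W_{hy} h_t + b_y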

45
Q

What are the update equations for an LSTM?

A
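In the standard formulation, with \sigma the sigmoid, \odot elementwise multiplication, and [h_{t-1}, x_t] the concatenation of the previous hidden state and the current input:

\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && \text{(candidate values)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(new cell state)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(new hidden state)}
\end{aligned}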
46
Q

What is teacher/student forcing?

A

Teacher forcing is a strategy for training recurrent neural networks that feeds the ground-truth output from the previous time step in as input, instead of the model’s own output from that time step.

https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/
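A self-contained toy sketch of a decoder trained with teacher forcing; the toy RNN, embeddings, and token ids are all hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 16                                   # toy vocabulary and hidden sizes
E = rng.normal(scale=0.1, size=(V, H))          # token embeddings
W_hh = rng.normal(scale=0.1, size=(H, H))
W_out = rng.normal(scale=0.1, size=(H, V))

def decoder_step(token_id, h):
    """One toy RNN decoder step: returns next-token logits and the new hidden state."""
    h = np.tanh(E[token_id] + W_hh @ h)
    return W_out.T @ h, h

target = [3, 7, 2, 9]                           # hypothetical gold token ids
h, prev, loss = np.zeros(H), 0, 0.0             # token 0 plays the role of <start>
for gold in target:
    logits, h = decoder_step(prev, h)
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    loss += -log_probs[gold]                    # cross-entropy against the gold token
    prev = gold                                 # teacher forcing: feed the GROUND TRUTH,
                                                # not argmax(logits), as the next input
print(loss)
```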

47
Q

What is knowledge distillation?

A

Knowledge distillation refers to the idea of model compression by teaching a smaller ‘student’ network, step by step, what to do using a bigger, already-trained ‘teacher’ network. The teacher’s ‘soft labels’ (its softened output probabilities, and in some variants its intermediate feature maps) are used as training targets, so the student learns to replicate the teacher’s behavior rather than matching only the final hard labels.

https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764
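A minimal sketch of the classic soft-label distillation loss (softened teacher probabilities with temperature T), assuming PyTorch; the logits, labels, alpha, and T values below are placeholders:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend of (a) KL divergence between softened teacher/student distributions
    and (b) ordinary cross-entropy against the hard labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)       # teacher's soft labels
    soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Usage with random placeholder logits: batch of 8 examples, 10 classes.
student_logits, teacher_logits = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```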