Revision Flashcards

1
Q

Discuss three types of ambiguities that make automatic processing of language complex

A
  • lexical ambiguity (e.g. ‘pen’)
  • grammatical ambiguity (e.g. ‘I saw the boy with the telescope’)
  • referential ambiguity (‘the cat ate the rat; it was sleeping’)
2
Q

Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages

A
  • Bag of words
  • Handcrafted features
  • Word embeddings
3
Q

Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages

Elaborate on Bag of words

A
  • One-hot vector or frequency vector (see the sketch below)
  • Very sparse (English has ~171K words, per the Oxford Dictionary)
  • Cannot model relatedness between words, e.g. cat & kitten
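
A minimal sketch of a bag-of-words frequency vector, assuming a toy vocabulary (not part of the original card); note how most entries stay zero and how 'cat' and 'kitten' land in unrelated dimensions:

from collections import Counter

vocabulary = ["cat", "kitten", "sat", "on", "the", "mat", "dog"]  # assumed toy vocabulary

def bag_of_words(text: str) -> list[int]:
    # Count word occurrences, then read the counts off in vocabulary order
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

print(bag_of_words("the cat sat on the mat"))  # [1, 0, 1, 1, 2, 1, 0]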
4
Q

Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages

Elaborate on handcrafted features

A

Requires domain knowledge (e.g. lists of positive/negative words for sentiment analysis) and is task-specific

5
Q

Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages

Elaborate on word embeddings

A
  • Learned by modelling similarity between words in corpora
  • Instead of counting co-occurrences, predict the context words around a target word (as in skip-gram)
  • Computationally efficient
  • Possible to add new words to the model: it scales with corpus size
6
Q

In a skip-gram word2vec network using negative sampling, which hyperparameters define the number of parameters to optimise?

A
  • vocabulary size
  • hidden layer dimensionality
  • number of negative/contrastive words
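
A back-of-the-envelope sketch of how these hyperparameters enter the count, assuming the standard two-matrix (target/context embeddings) formulation; the concrete values are made up:

V = 10_000   # vocabulary size (assumed)
d = 300      # hidden layer dimensionality (assumed)
k = 5        # number of negative/contrastive words (assumed)

total_params = 2 * V * d        # target + context embedding matrices: 6000000
per_step     = (1 + 1 + k) * d  # target vector + positive context + k negatives: 2100

print(total_params, per_step)

The vocabulary size and hidden dimensionality fix the total parameter count; the number of negatives fixes how many of those parameters each training step actually updates.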
7
Q

Consider a Convolutional Neural Network with these hyperparameters: word embedding dimensionality = 100, two 1d convolution layers with 30 filters each, with respective window sizes 3 and 5, applied in parallel.
Provide the input and output shapes for each convolutional layer for the following example:

she can also speak french and german.

A

(1) window size 3: 8x100 -> 6x300 -> 6x30

(2) window size 5: 8x100 -> 4x500 -> 4x30
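
A small sketch of the shape arithmetic (the sentence tokenises to 8 tokens, full stop included):

def conv1d_shapes(n_tokens, emb_dim, window, n_filters):
    n_windows = n_tokens - window + 1   # 'valid' windows, no padding
    unrolled = window * emb_dim         # each window flattened into one vector
    return (n_tokens, emb_dim), (n_windows, unrolled), (n_windows, n_filters)

print(conv1d_shapes(8, 100, 3, 30))  # ((8, 100), (6, 300), (6, 30))
print(conv1d_shapes(8, 100, 5, 30))  # ((8, 100), (4, 500), (4, 30))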

8
Q

Setting: Consider a Convolutional Neural Network with these hyperparameters: word embedding dimensionality = 100, two 1d convolution layers with 30 filters each, with respective window sizes 3 and 5, applied in parallel.

Given the window size hyperparameters from the previous question: is padding necessary for the following training example?

he was right.

A

Padding is necessary for this training example (to reach the largest window size of 5 tokens):

he was right . <pad>

The actual length of this example (N=4) is shorter than the largest window size of the given CNN (n=5). Each text input has to be at least as long as the largest CNN window size, so that at least one window can be extracted.
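
A minimal sketch of this padding step (the '<pad>' token string is an assumption):

def pad_to_window(tokens, max_window, pad="<pad>"):
    # Append pad tokens until the input is at least max_window tokens long
    return tokens + [pad] * max(0, max_window - len(tokens))

print(pad_to_window(["he", "was", "right", "."], 5))
# ['he', 'was', 'right', '.', '<pad>']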

9
Q

Setting: Consider a Convolutional Neural Network with these hyperparameters: word embedding dimensionality = 100, two 1d convolution layers with 30 filters each, with respective window sizes 3 and 5, applied in parallel.

Given the hyperparameters from the previous question, what are the filter sizes for both convolutional layers?

A

The filter size is 1 X 300 for the convolutional layer with window size 3, and 1 X 500 for the layer with window size 5. The filter size is computed as follows:

(1 X (window_size * embedding_dim)).

10
Q

Briefly describe the functioning of the convolution and pooling architecture for language tasks.

A

The main operation a CNN performs is convolving with a filter. The size of the filter depends on the window size and the embedding size (for example, for a 1d convolution the filter size is 1 X (window_size * embedding_dim)). The filter is applied to each window in the sequence, yielding one scalar per window. The pooling operation then takes, for example, the max over these per-window scalars. The result is a single d-dimensional vector (for max-pooling, d equals the number of filters).
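
A minimal numpy sketch of this pipeline under the card's hyperparameters (random values stand in for real embeddings and learned filters):

import numpy as np

n_tokens, emb_dim, window, n_filters = 8, 100, 3, 30
X = np.random.randn(n_tokens, emb_dim)            # token embedding matrix
F = np.random.randn(n_filters, window * emb_dim)  # one flattened filter per row

# Flatten each window of 3 embeddings into a single 300-dim vector
windows = np.stack([X[i:i + window].ravel()
                    for i in range(n_tokens - window + 1)])  # (6, 300)
conv = windows @ F.T          # (6, 30): one scalar per window per filter
pooled = conv.max(axis=0)     # (30,): max-pooling over windows

print(windows.shape, conv.shape, pooled.shape)  # (6, 300) (6, 30) (30,)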

11
Q

What is the dimensionality of the output layer for a neural network (any type) built for a binary classification task? Explain how predictions are computed given the output of the network.

A

The dimensionality of the output layer for a binary prediction task is 1. The output layer of the network produces a real-valued number z = w*x + b, which is passed through a sigmoid function:

P(y=1 | x) = sigmoid(z)

The sigmoid function maps this value into the range [0,1]. The following decision rule is then applied:

y_pred = 1 if P(y=1 | x) > 0.5 else 0.
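
A tiny sketch of this decision rule (the logit values are made up):

import math

def predict(z: float) -> int:
    p = 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into [0, 1]
    return 1 if p > 0.5 else 0

print(predict(1.3))   # 1  (sigmoid(1.3) is roughly 0.79)
print(predict(-0.4))  # 0  (sigmoid(-0.4) is roughly 0.40)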

12
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

NAIVE BAYES, logistic regression, feed-forward neural network (ffnn), convolutional nn (cnn), recurrent nn (rnn), cnn & rnn

A

we have data with cancer-related terms annotated by humans

13
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

naive bayes, LOGISTIC REGRESSION, feed-forward neural network (ffnn), convolutional nn (cnn), recurrent nn (rnn), cnn & rnn

A

we have data with medical terms annotated by humans (some of the terms are relevant to the problem, some are not)

14
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

naive bayes, logistic regression, FEED-FORWARD NEURAL NETWORK (FFNN), convolutional nn (cnn), recurrent nn (rnn), cnn & rnn

A

we have data with medical terms and jargon annotated by humans (again we don’t know which annotated words are relevant, but an even more complex problem)

15
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

naive bayes, logistic regression, feed-forward nn (ffnn), CONVOLUTIONAL NN (CNN), recurrent nn (rnn), cnn & rnn

A

we assume that the cancer diagnosis depends on facts in the patient history (we need to detect relevant facts)

16
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

naive bayes, logistic regression, feed-forward nn (ffnn), convolutional nn (cnn), RECURRENT NN (RNN), cnn & rnn

A

we assume that the cancer diagnosis depends on the sequence of facts in the patient history (we need to detect relevant sequences of facts)

17
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

naive bayes, logistic regression, feed-forward nn (ffnn), convolutional nn (cnn), recurrent nn (rnn), CNN & RNN

A

we have a training corpus in German and a test set in English (character and subword level solutions are required).

18
Q

Which problems of Naive Bayes are addressed by each classifier type?

A
  • Logistic Regression, FFNNs: the assumption that all features are equally important; the conditional independence assumption
  • CNNs, RNNs: the assumption that all features are equally important; the conditional independence assumption; context not taken into account; unknown words (addressed by character- and subword-level architectures)
19
Q

Provide examples of pre-processing and architectural solutions that would help decrease the number of unknown words at test time.

A
  • For pre-processing, for example, remove hyphens from hyphenated words: the word ‘above-mentioned’ will be separated into two words ‘above’ and ‘mentioned’ (see the sketch after this list)
  • character- and subword-level CNN and RNN architectures
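
A minimal sketch of the hyphen-splitting pre-processing step (the function name is illustrative):

def split_hyphens(tokens):
    # 'above-mentioned' -> 'above', 'mentioned'
    out = []
    for tok in tokens:
        out.extend(tok.split("-"))
    return out

print(split_hyphens(["the", "above-mentioned", "words"]))
# ['the', 'above', 'mentioned', 'words']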
20
Q

Consider a binary classification task for highly unbalanced data (1:100 positive to negative examples ratio). Which metric will you use to evaluate a classifier built for this problem (F-measure or accuracy)?

A

F-measure, which is the harmonic mean of precision and recall. Recall computes the percentage of relevant instances the classifier detects out of all gold relevant instances. Precision computes the percentage of instances the classifier correctly marks as relevant out of all instances it marks as relevant. Accuracy is usually not fit for highly unbalanced cases: it computes the percentage of all observations the classifier labels correctly, which would be about 99% in our case if the classifier simply assigned the negative label to every example.
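
A small sketch of why accuracy misleads here: an all-negative classifier evaluated on 1 positive and 100 negative examples (counts chosen to match the 1:100 ratio):

tp, fp, fn, tn = 0, 0, 1, 100  # all-negative classifier: misses the one positive

accuracy  = (tp + tn) / (tp + fp + fn + tn)  # ~0.99, looks great
precision = tp / (tp + fp) if tp + fp else 0.0
recall    = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy, precision, recall, f1)  # 0.990..., 0.0, 0.0, 0.0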

21
Q

Language Model, LM Score

A

P(start_token, I, like, cream, end_token) = P(I | start_token) P(like | I) P(cream | like) P(end_token | cream)
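
A toy sketch of this bigram factorisation (all probabilities, and the <s>/</s> names for start_token/end_token, are made up for illustration):

bigram_p = {
    ("<s>", "I"): 0.4, ("I", "like"): 0.3,
    ("like", "cream"): 0.05, ("cream", "</s>"): 0.2,
}

def lm_score(tokens):
    # Multiply P(current | previous) over consecutive token pairs
    score = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        score *= bigram_p[(prev, cur)]
    return score

print(lm_score(["<s>", "I", "like", "cream", "</s>"]))  # 0.0012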

22
Q

Language Model, Perplexity Score

A

PP(I, love, Japanese, food) = P(I, love, Japanese, food)^{-1/4}

PP = LM_score^{-1/N}, where N is the number of tokens.
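
A one-line sketch of this formula, reusing the toy LM score from the previous card:

def perplexity(lm_score: float, n_tokens: int) -> float:
    # Inverse probability, normalised per token
    return lm_score ** (-1.0 / n_tokens)

print(perplexity(0.0012, 4))  # ~5.37 (lower perplexity = better model)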

23
Q

POS tagging

A

P(T | W) ∝ P(W | T) P(T)

For example:
P(start_token, PRON, VERB, NOUN | I, hate, ice) ∝
P(I, hate, ice | PRON, VERB, NOUN) * P(start_token, PRON, VERB, NOUN) =
P(I | PRON) P(hate | VERB) P(ice | NOUN) *
P(PRON | start_token) P(VERB | PRON) P(NOUN | VERB)
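
A toy sketch of this factorisation with made-up emission and transition probabilities (the <s> start symbol stands in for start_token):

emission = {("I", "PRON"): 0.1, ("hate", "VERB"): 0.05, ("ice", "NOUN"): 0.01}
transition = {("<s>", "PRON"): 0.3, ("PRON", "VERB"): 0.5, ("VERB", "NOUN"): 0.4}

words, tags = ["I", "hate", "ice"], ["PRON", "VERB", "NOUN"]

score = 1.0
for i, (w, t) in enumerate(zip(words, tags)):
    prev_tag = "<s>" if i == 0 else tags[i - 1]
    score *= emission[(w, t)] * transition[(prev_tag, t)]  # P(w|t) * P(t|prev)

print(score)  # ~3e-06 (= 0.1*0.3 * 0.05*0.5 * 0.01*0.4)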

24
Q

What are the advantages of RNN-based models compared to n-gram and CNN-based models?

A
  • Compared to n-gram: RNNs can process inputs of any length with almost any history.
  • Compared to CNNs: The model size remains constant for any input length due to recurrent units and weight sharing
25
Q

We saw that we can apply a simple RNN model to machine translation by feeding a French sentence to a simple encoder RNN and obtaining the English sentence from an RNN decoder. Would using ‘neural sequence modelling’ in place of the RNN work? Why or why not?

A

Neural sequence modelling would not work as well, since it only uses the previous n states (n=2 was shown in class) in making its predictions. Once it starts predicting, it will not have access to the full context of the French source sentence but only to the last few words, so it would be suboptimal.

26
Q

How does a bi-directional RNN differ from regular RNNs? Discuss the potential advantages.

A

A bi-directional RNN reads the sequence both from left to right and from right to left. The hidden state at each time step is the concatenation of the left-to-right and right-to-left RNNs’ hidden states. It therefore has access to more context than a regular RNN.

27
Q

How is the Transformer-based architecture different from RNN-based architecture?

A

An RNN-based architecture depends strictly on sequential computations where the future prediction is dependent on previous sequence information. Transformer-based architectures can compute the entire sequence information in parallel as they are based on finding the compatibility of each word with all the words in the sequence, which is a parallel operation.

28
Q

Given a query vector q and a set of key vectors {k1, … , kn}, can you compute a simple attention distribution?

A

We first obtain the scores using a simple dot product and then we obtain the distribution using the softmax function:

score = {q^T k1, …, q^T kn}

attention distribution = softmax(score)
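
A minimal numpy sketch of exactly this computation, with a made-up 3-dimensional query and three keys:

import numpy as np

q = np.array([1.0, 0.0, 1.0])           # query vector
K = np.array([[1.0, 0.0, 1.0],          # key vectors k1..k3, one per row
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])

scores = K @ q                          # [q^T k1, q^T k2, q^T k3]
exp = np.exp(scores - scores.max())     # shift by the max for numerical stability
attention = exp / exp.sum()             # softmax: non-negative, sums to 1

print(scores)     # [2. 0. 1.]
print(attention)  # ~[0.665 0.090 0.245]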