Revision Flashcards

1
Q

Discuss three types of ambiguities that make automatic processing of language complex

A
  • lexical ambiguity (e.g. ‘pen’)
  • grammatical ambiguity (e.g. ‘I saw the boy with the telescope’)
  • referential ambiguity (‘the cat ate the rat; it was sleeping’)
2
Q

Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages

A
  • Bag of words
  • Handcrafted features
  • Word embeddings
3
Q

Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages

Elaborate on Bag of words

A
  • One-hot vector or frequency vector (see the sketch below)
  • Very sparse (English has ~171K words, per the Oxford Dictionary)
  • Cannot model relatedness between words, e.g. cat & kitten
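
A minimal sketch of a bag-of-words frequency vector, assuming a toy vocabulary (not part of the original card); note how most entries stay zero and how 'cat' and 'kitten' land in unrelated dimensions:

from collections import Counter

vocabulary = ["cat", "kitten", "sat", "on", "the", "mat", "dog"]  # assumed toy vocabulary

def bag_of_words(text: str) -> list[int]:
    # Count word occurrences, then read the counts off in vocabulary order
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

print(bag_of_words("the cat sat on the mat"))  # [1, 0, 1, 1, 2, 1, 0]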
4
Q

Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages

Elaborate on handcrafted features

A

Requires domain knowledge (e.g. lists of positive/negative words for sentiment analysis) and is task-specific

5
Q

Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages

Elaborate on word embeddings

A
  • Learned by modelling similarity between words in corpora
  • Instead of counting co-occurrences, predict the context words around a target word (as in skip-gram)
  • Computationally efficient
  • Possible to add new words to the model: it scales with corpus size
6
Q

In a skip-gram word2vec network using negative sampling, which hyperparameters define the number of parameters to optimise?

A
  • vocabulary size
  • hidden layer dimensionality
  • number of negative/contrastive words
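
A back-of-the-envelope sketch of how these hyperparameters enter the count, assuming the standard two-matrix (target/context embeddings) formulation; the concrete values are made up:

V = 10_000   # vocabulary size (assumed)
d = 300      # hidden layer dimensionality (assumed)
k = 5        # number of negative/contrastive words (assumed)

total_params = 2 * V * d        # target + context embedding matrices: 6000000
per_step     = (1 + 1 + k) * d  # target vector + positive context + k negatives: 2100

print(total_params, per_step)

The vocabulary size and hidden dimensionality fix the total parameter count; the number of negatives fixes how many of those parameters each training step actually updates.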
7
Q

Consider a Convolutional Neural Network with these hyperparameters: word embedding dimensionality = 100, two 1d convolution layers with 30 filters each, with respective window sizes 3 and 5, applied in parallel.
Provide the input and output shapes for each convolutional layer for the following example:

she can also speak french and german.

A

(1) window size 3: 8x100 -> 6x300 -> 6x30

(2) window size 5: 8x100 -> 4x500 -> 4x30
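
A small sketch of the shape arithmetic (the sentence tokenises to 8 tokens, full stop included):

def conv1d_shapes(n_tokens, emb_dim, window, n_filters):
    n_windows = n_tokens - window + 1   # 'valid' windows, no padding
    unrolled = window * emb_dim         # each window flattened into one vector
    return (n_tokens, emb_dim), (n_windows, unrolled), (n_windows, n_filters)

print(conv1d_shapes(8, 100, 3, 30))  # ((8, 100), (6, 300), (6, 30))
print(conv1d_shapes(8, 100, 5, 30))  # ((8, 100), (4, 500), (4, 30))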

8
Q

Setting: Consider a Convolutional Neural Network with these hyperparameters: word embedding dimensionality = 100, two 1d convolution layers with 30 filters each, with respective window sizes 3 and 5, applied in parallel.

Given the window size hyperparameters from the previous question: is padding necessary for the following training example?

he was right.

A

Padding is necessary for this training example (to reach the largest window size of 5 tokens):

he was right . <pad>

The actual length of this example (N=4) is shorter than the largest window size of the given CNN (n=5). Each text input has to be at least as long as the largest CNN window size, so that at least one window can be extracted.
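
A minimal sketch of this padding step (the '<pad>' token string is an assumption):

def pad_to_window(tokens, max_window, pad="<pad>"):
    # Append pad tokens until the input is at least max_window tokens long
    return tokens + [pad] * max(0, max_window - len(tokens))

print(pad_to_window(["he", "was", "right", "."], 5))
# ['he', 'was', 'right', '.', '<pad>']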

9
Q

Setting: Consider a Convolutional Neural Network with these hyperparameters: word embedding dimensionality = 100, two 1d convolution layers with 30 filters each, with respective window sizes 3 and 5, applied in parallel.

Given the hyperparameters from the previous question, what are the filter sizes for both convolutional layers?

A

The filter size is 1 X 300 for the convolutional layer with window size 3, and 1 X 500 for the layer with window size 5. The filter size is computed as follows:

(1 X (window_size * embedding_dim)).

10
Q

Briefly describe the functioning of the convolution and pooling architecture for language tasks.

A

The main operation a CNN performs is convolving with a filter. The size of the filter depends on the window size and the embedding size (for example, for a 1d convolution the filter size is 1 X (window_size * embedding_dim)). The filter is applied to each window in the sequence, yielding one scalar per window. The pooling operation then takes, for example, the max over these per-window scalars. The result is a single d-dimensional vector (for max-pooling, d equals the number of filters).
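
A minimal numpy sketch of this pipeline under the card's hyperparameters (random values stand in for real embeddings and learned filters):

import numpy as np

n_tokens, emb_dim, window, n_filters = 8, 100, 3, 30
X = np.random.randn(n_tokens, emb_dim)            # token embedding matrix
F = np.random.randn(n_filters, window * emb_dim)  # one flattened filter per row

# Flatten each window of 3 embeddings into a single 300-dim vector
windows = np.stack([X[i:i + window].ravel()
                    for i in range(n_tokens - window + 1)])  # (6, 300)
conv = windows @ F.T          # (6, 30): one scalar per window per filter
pooled = conv.max(axis=0)     # (30,): max-pooling over windows

print(windows.shape, conv.shape, pooled.shape)  # (6, 300) (6, 30) (30,)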

11
Q

What is the dimensionality of the output layer for a neural network (any type) built for a binary classification task? Explain how predictions are computed given the output of the network.

A

The dimensionality of the output layer for a binary prediction task is 1. The output layer of the network produces a real-valued number z = w*x + b, which is passed through a sigmoid function:

P(y=1 | x) = sigmoid(z)

The sigmoid function maps this value into the range [0,1]. The following decision rule is then applied:

y_pred = 1 if P(y=1 | x) > 0.5 else 0.
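
A tiny sketch of this decision rule (the logit values are made up):

import math

def predict(z: float) -> int:
    p = 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into [0, 1]
    return 1 if p > 0.5 else 0

print(predict(1.3))   # 1  (sigmoid(1.3) is roughly 0.79)
print(predict(-0.4))  # 0  (sigmoid(-0.4) is roughly 0.40)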

12
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

NAIVE BAYES, logistic regression, feed-forward neural network (ffnn), convolutional nn (cnn), recurrent nn (rnn), cnn & rnn

A

we have data with cancer-related terms annotated by humans

13
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

naive bayes, LOGISTIC REGRESSION, feed-forward neural network (ffnn), convolutional nn (cnn), recurrent nn (rnn), cnn & rnn

A

we have data with medical terms annotated by humans (some of the terms are relevant to the problem, some are not)

14
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

naive bayes, logistic regression, FEED-FORWARD NEURAL NETWORK (FFNN), convolutional nn (cnn), recurrent nn (rnn), cnn & rnn

A

we have data with medical terms and jargon annotated by humans (again we don’t know which annotated words are relevant, but an even more complex problem)

15
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

naive bayes, logistic regression, feed-forward nn (ffnn), CONVOLUTIONAL NN (CNN), recurrent nn (rnn), cnn & rnn

A

we assume that the cancer diagnosis depends on facts in the patient history (we need to detect relevant facts)

16
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

naive bayes, logistic regression, feed-forward nn (ffnn), convolutional nn (cnn), RECURRENT NN (RNN), cnn & rnn

A

we assume that the cancer diagnosis depends on the sequence of facts in the patient history (we need to detect relevant sequences of facts)

17
Q

Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.

naive bayes, logistic regression, feed-forward nn (ffnn), convolutional nn (cnn), recurrent nn (rnn), CNN & RNN

A

we have a training corpus in German and a test set in English (character and subword level solutions are required).

18
Q

Which problems of Naive Bayes are addressed by each classifier type?

A
  • Logistic Regression, FFNNs: the assumption that all features are equally important; the conditional independence assumption
  • CNNs, RNNs: the assumption that all features are equally important; the conditional independence assumption; context not taken into account; unknown words (addressed by character- and subword-level architectures)
19
Q

Provide examples of pre-processing and architectural solutions that would help decrease the number of unknown words at test time.

A
  • For pre-processing, for example, remove hyphens from hyphenated words: the word ‘above-mentioned’ will be separated into two words ‘above’ and ‘mentioned’ (see the sketch after this list)
  • character- and subword-level CNN and RNN architectures
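
A minimal sketch of the hyphen-splitting pre-processing step (the function name is illustrative):

def split_hyphens(tokens):
    # 'above-mentioned' -> 'above', 'mentioned'
    out = []
    for tok in tokens:
        out.extend(tok.split("-"))
    return out

print(split_hyphens(["the", "above-mentioned", "words"]))
# ['the', 'above', 'mentioned', 'words']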
20
Q

Consider a binary classification task for highly unbalanced data (1:100 positive to negative examples ratio). Which metric will you use to evaluate a classifier built for this problem (F-measure or accuracy)?

A

F-measure, which is the harmonic mean of precision and recall. Recall computes the percentage of relevant instances the classifier detects out of all gold relevant instances. Precision computes the percentage of instances the classifier correctly marks as relevant out of all instances it marks as relevant. Accuracy is usually not fit for highly unbalanced cases: it computes the percentage of all observations the classifier labels correctly, which would be about 99% in our case if the classifier simply assigned the negative label to every example.
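
A small sketch of why accuracy misleads here: an all-negative classifier evaluated on 1 positive and 100 negative examples (counts chosen to match the 1:100 ratio):

tp, fp, fn, tn = 0, 0, 1, 100  # all-negative classifier: misses the one positive

accuracy  = (tp + tn) / (tp + fp + fn + tn)  # ~0.99, looks great
precision = tp / (tp + fp) if tp + fp else 0.0
recall    = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy, precision, recall, f1)  # 0.990..., 0.0, 0.0, 0.0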

21
Q

Language Model, LM Score

A

P(start_token, I, like, cream, end_token) = P(I | start_token) P(like | I) P(cream | like) P(end_token | cream)
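
A toy sketch of this bigram factorisation (all probabilities, and the <s>/</s> names for start_token/end_token, are made up for illustration):

bigram_p = {
    ("<s>", "I"): 0.4, ("I", "like"): 0.3,
    ("like", "cream"): 0.05, ("cream", "</s>"): 0.2,
}

def lm_score(tokens):
    # Multiply P(current | previous) over consecutive token pairs
    score = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        score *= bigram_p[(prev, cur)]
    return score

print(lm_score(["<s>", "I", "like", "cream", "</s>"]))  # 0.0012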

22
Q

Language Model, Perplexity Score

A

PP(I, love, Japanese, food) = P(I, love, Japanese, food)^{-1/4}

PP = LM_score^{-1/N}, where N is the number of tokens.
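
A one-line sketch of this formula, reusing the toy LM score from the previous card:

def perplexity(lm_score: float, n_tokens: int) -> float:
    # Inverse probability, normalised per token
    return lm_score ** (-1.0 / n_tokens)

print(perplexity(0.0012, 4))  # ~5.37 (lower perplexity = better model)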

23
Q

POS tagging

A

P(T | W) ∝ P(W | T) P(T)

For example:
P(start_token, PRON, VERB, NOUN | I, hate, ice) ∝
P(I, hate, ice | PRON, VERB, NOUN) * P(start_token, PRON, VERB, NOUN) =
P(I | PRON) P(hate | VERB) P(ice | NOUN) *
P(PRON | start_token) P(VERB | PRON) P(NOUN | VERB)
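
A toy sketch of this factorisation with made-up emission and transition probabilities (the <s> start symbol stands in for start_token):

emission = {("I", "PRON"): 0.1, ("hate", "VERB"): 0.05, ("ice", "NOUN"): 0.01}
transition = {("<s>", "PRON"): 0.3, ("PRON", "VERB"): 0.5, ("VERB", "NOUN"): 0.4}

words, tags = ["I", "hate", "ice"], ["PRON", "VERB", "NOUN"]

score = 1.0
for i, (w, t) in enumerate(zip(words, tags)):
    prev_tag = "<s>" if i == 0 else tags[i - 1]
    score *= emission[(w, t)] * transition[(prev_tag, t)]  # P(w|t) * P(t|prev)

print(score)  # ~3e-06 (= 0.1*0.3 * 0.05*0.5 * 0.01*0.4)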

24
Q

What are the advantages of RNN-based models compared to n-gram and CNN-based models?

A
  • Compared to n-gram: RNNs can process inputs of any length with almost any history.
  • Compared to CNNs: The model size remains constant for any input length due to recurrent units and weight sharing
25
Q

We saw that we can apply a simple RNN model to machine translation by feeding a French sentence to a simple encoder RNN and obtaining the English sentence from an RNN decoder. Would using ‘neural sequence modelling’ in place of the RNN work? Why or why not?

A

Neural sequence modelling would not work as well, since it only uses the previous n states (n=2 was shown in class) in making its predictions. Once it starts predicting, it will not have access to the full context of the French source sentence but only to the last few words, so it would be suboptimal.

26
Q

How does a bi-directional RNN differ from regular RNNs? Discuss the potential advantages.

A

A bi-directional RNN reads the sequence both from left to right and from right to left. The hidden state at each time step is the concatenation of the left-to-right and right-to-left RNNs’ hidden states. It therefore has access to more context than a regular RNN.

27
Q

How is the Transformer-based architecture different from RNN-based architecture?

A

An RNN-based architecture depends strictly on sequential computations where the future prediction is dependent on previous sequence information. Transformer-based architectures can compute the entire sequence information in parallel as they are based on finding the compatibility of each word with all the words in the sequence, which is a parallel operation.

28
Q

Given a query vector q and a set of key vectors {k1, … , kn}, can you compute a simple attention distribution?

A

We first obtain the scores using a simple dot product and then we obtain the distribution using the softmax function:

score = {q^T k1, …, q^T kn}

attention distribution = softmax(score)
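
A minimal numpy sketch of exactly this computation, with a made-up 3-dimensional query and three keys:

import numpy as np

q = np.array([1.0, 0.0, 1.0])           # query vector
K = np.array([[1.0, 0.0, 1.0],          # key vectors k1..k3, one per row
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])

scores = K @ q                          # [q^T k1, q^T k2, q^T k3]
exp = np.exp(scores - scores.max())     # shift by the max for numerical stability
attention = exp / exp.sum()             # softmax: non-negative, sums to 1

print(scores)     # [2. 0. 1.]
print(attention)  # ~[0.665 0.090 0.245]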