Revision Flashcards
Discuss three types of ambiguity that make automatic processing of language complex
- lexical ambiguity (e.g. ‘pen’)
- grammatical ambiguity (e.g. ‘I saw the boy with the telescope’)
- referential ambiguity (e.g. ‘the cat ate the rat; it was sleeping’)
Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages
- Bag of words
- Handcrafted features
- Word embeddings
Setting: Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages.
Elaborate on Bag of words
- One-hot vector or frequency (count) vector
- Very sparse (English has ~171K words according to the Oxford English Dictionary)
- Cannot model relatedness between words, e.g. ‘cat’ and ‘kitten’
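A minimal bag-of-words sketch, assuming scikit-learn is available (the two toy sentences are invented for illustration):

```python
# Bag-of-words sketch with scikit-learn; the corpus is a toy example.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "a kitten sat on the mat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # vocabulary as column labels
print(X.toarray())                         # one frequency vector per document
# Note: 'cat' and 'kitten' get unrelated columns; no notion of similarity.
```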
Setting: Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages.
Elaborate on handcrafted features
- Requires domain knowledge (e.g. lists of positive/negative words) and is task-specific, so features must be re-engineered for each new problem
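A hedged sketch of what such features could look like for a sentiment-style task; the word lists below are illustrative stand-ins for a curated domain lexicon:

```python
# Handcrafted feature extraction sketch; POSITIVE/NEGATIVE are toy lexicons.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def handcrafted_features(text: str) -> list[int]:
    tokens = text.lower().split()
    return [
        sum(t in POSITIVE for t in tokens),  # count of positive words
        sum(t in NEGATIVE for t in tokens),  # count of negative words
    ]

print(handcrafted_features("the food was great but the service was terrible"))
# -> [1, 1]; the lexicons have to be rebuilt for every new task or domain.
```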
Setting: Give three examples of representations we can use for NLP tasks and discuss their advantages and disadvantages.
Elaborate on word embeddings
- Learned by modelling similarity between words in large corpora
- Instead of counting co-occurrences, predict context words given a target word (as in skip-gram)
- Computationally efficient
- Possible to add new words to the model: it scales with corpus size
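A minimal training sketch, assuming the gensim library (4.x API); the two-sentence corpus is a toy stand-in for a real one:

```python
# Skip-gram word2vec training sketch with gensim; the corpus is illustrative.
from gensim.models import Word2Vec

sentences = [["the", "cat", "chased", "the", "kitten"],
             ["the", "kitten", "slept", "on", "the", "mat"]]
model = Word2Vec(sentences, vector_size=100, window=2,
                 sg=1,        # skip-gram: predict context words from the target
                 negative=5,  # negative sampling with 5 contrastive words
                 min_count=1)
print(model.wv.similarity("cat", "kitten"))  # learned relatedness score
```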
In a skip-gram word2vec network using negative sampling, which hyperparameters define the number of parameters to optimise?
- vocabulary size
- hidden layer dimensionality
- number of negative/contrastive words
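A back-of-the-envelope sketch of how these hyperparameters set the parameter count; the values of V, d and k below are illustrative, not given in the question:

```python
# Parameter count for skip-gram with negative sampling (illustrative values).
V = 10_000  # vocabulary size
d = 300     # hidden layer (embedding) dimensionality
k = 5       # number of negative/contrastive words per positive pair

total_params = 2 * V * d            # input + output embedding matrices
updated_per_pair = (1 + 1 + k) * d  # target row + context row + k negative rows

print(total_params)      # 6000000 parameters overall
print(updated_per_pair)  # 2100 parameters touched per training pair
```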
Consider a Convolutional Neural Network with these hyperparameters: word embedding dimensionality = 100, two 1d convolution layers with 30 filters, with respective window sizes 3 and 5 applied in parallel.
Provide the input and output shapes for each convolutional layer for the following example:
she can also speak french and german.
(1) window size 3: 8x100 -> 6x300 -> 6x30
(2) window size 5: 8x100 -> 4x500 -> 4x30
The sentence has 8 tokens (including the final period). Each layer first unrolls the 8x100 input into overlapping windows (8 - window_size + 1 of them, each flattened to window_size * 100 values), then its 30 filters map every window to 30 scalars.
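A shape check of these numbers, sketched in PyTorch (an assumed framework; the flashcard does not prescribe one):

```python
# Verify the conv output shapes for an 8-token input with 100-dim embeddings.
import torch
import torch.nn as nn

x = torch.randn(1, 100, 8)  # (batch, embedding_dim, sequence_length)

conv3 = nn.Conv1d(in_channels=100, out_channels=30, kernel_size=3)
conv5 = nn.Conv1d(in_channels=100, out_channels=30, kernel_size=5)

print(conv3(x).shape)  # torch.Size([1, 30, 6]): 6 windows x 30 filters
print(conv5(x).shape)  # torch.Size([1, 30, 4]): 4 windows x 30 filters
# PyTorch unrolls the intermediate 6x300 / 4x500 window matrices internally.
```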
Setting: Consider a Convolutional Neural Network with these hyperparameters: word embedding dimensionality = 100, two 1d convolution layers with 30 filters, with respective window sizes 3 and 5 applied in parallel.
Given the window size hyperparameters from the previous question: is padding necessary for the following training example?
he was right.
Padding is necessary for this training example (to reach a length of 5 tokens, the largest window size):
he was right . <pad>
The actual length of this example (N=4 tokens) is shorter than the largest window size of the given CNN (n=5). Each input has to be at least as long as the largest CNN window size, otherwise that convolution has no window to extract features from.
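A minimal padding sketch; the <pad> token string is an assumed convention:

```python
# Right-pad a token list up to the largest CNN window size.
def pad_to_window(tokens: list[str], max_window: int) -> list[str]:
    return tokens + ["<pad>"] * max(0, max_window - len(tokens))

print(pad_to_window(["he", "was", "right", "."], max_window=5))
# -> ['he', 'was', 'right', '.', '<pad>']
```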
Setting: Consider a Convolutional Neural Network with these hyperparameters: word embedding dimensionality = 100, two 1d convolution layers with 30 filters, with respective window sizes 3 and 5 applied in parallel.
Given the hyperparameters from the previous question, provide the filter sizes for both convolutional layers.
The filter size is 1 x 300 for the convolutional layer with window size 3, and 1 x 500 for the convolutional layer with window size 5. The filter size is computed as 1 x (window_size * embedding_dim).
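The same arithmetic as a trivial sketch (embedding_dim = 100 comes from the setting):

```python
# Filter size per layer: 1 x (window_size * embedding_dim).
embedding_dim = 100
for window_size in (3, 5):
    print((1, window_size * embedding_dim))  # -> (1, 300) and (1, 500)
```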
Briefly describe the functioning of the convolution and pooling architecture for language tasks.
The main operation a CNN performs is convolving a filter over the input. The size of the filter depends on the window size and the embedding dimensionality (for a 1d convolution the filter size is 1 x (window_size * embedding_dim)). The filter is applied to each window in the sequence, producing one scalar per window. The pooling operation then aggregates these per-window scalars, for example by taking the max. The result is a single d-dimensional vector (for max-pooling, d equals the number of filters).
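A NumPy sketch of this convolve-then-pool pipeline; the random embeddings and filter weights are placeholders, not real data:

```python
# 1d convolution as window unrolling + matrix product, then max-pooling.
import numpy as np

N, emb, window, n_filters = 8, 100, 3, 30
X = np.random.randn(N, emb)                   # one embedding per token
W = np.random.randn(n_filters, window * emb)  # one row per filter

# Unroll the sequence into overlapping windows, one flattened row each.
windows = np.stack([X[i:i + window].ravel() for i in range(N - window + 1)])
conv = windows @ W.T       # (6, 30): one scalar per window per filter
pooled = conv.max(axis=0)  # (30,): max over windows, d = number of filters

print(conv.shape, pooled.shape)  # (6, 30) (30,)
```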
What is the dimensionality of the output layer for a neural network (any type) built for a binary classification task? Explain how predictions are computed given the output of the network.
The dimensionality of the output layer for a binary prediction task is 1. The output layer of the network provides a real-valued number z = w*x + b, which is passed through a sigmoid function:
P(y=1 | x) = sigmoid(w*x + b)
The sigmoid function maps this value into the range [0,1]. The following decision rule is then applied:
y_pred = 1 if P(y=1 | x) > 0.5 else 0.
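A minimal sketch of this decision rule; the weights, input and bias are illustrative values:

```python
# Sigmoid output layer and 0.5 threshold for binary classification.
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

w, x, b = np.array([0.4, -1.2]), np.array([1.0, 0.5]), 0.1
p = sigmoid(w @ x + b)        # P(y=1 | x), mapped into [0, 1]
y_pred = 1 if p > 0.5 else 0  # decision rule with threshold 0.5

print(p, y_pred)  # ~0.475, 0
```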
Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.
NAIVE BAYES, logistic regression, feed-forward neural network (ffnn), convolutional nn (cnn), recurrent nn (rnn), cnn & rnn
we have data with cancer-related terms annotated by humans
Setting: Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.
naive bayes, LOGISTIC REGRESSION, feed-forward neural network (ffnn), convolutional nn (cnn), recurrent nn (rnn), cnn & rnn
we have data with medical terms annotated by humans (some of the terms are relevant to the problem, some are not)
Setting: Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.
naive bayes, logistic regression, FEED-FORWARD NEURAL NETWORK (FFNN), convolutional nn (cnn), recurrent nn (rnn), cnn & rnn
we have data with medical terms and jargon annotated by humans (again we don’t know which annotated words are relevant, but this is an even more complex problem)
Setting: Provide examples of scenarios where each classifier type will be good. Take cancer detection as the problem to model.
naive bayes, logistic regression, feed-forward neural network (ffnn), CONVOLUTIONAL NN (CNN), recurrent nn (rnn), cnn & rnn
we assume that the cancer diagnosis depends on facts in the patient history (we need to detect relevant facts)