Final Review Flashcards
(T/F) Supervised learning and unsupervised clustering both require at least one input attribute
True (clustering needs at least one input attribute to group on, and supervised learning needs at least one input attribute to map to the output)
(T/F) Grouping people in a social network is an example of unsupervised machine learning
True
What is topic modelling in natural language processing (NLP)?
Topic modelling is an unsupervised machine learning approach that can scan a series of documents, find word and phrase patterns within them, and automatically cluster word groupings and related expressions that best represent the set
What is a recurrent neural network (RNN)?
Recurrent neural networks are a class of neural networks that are helpful in modelling sequence data.
Derived from feedforward networks, RNNs exhibit behaviour similar to how the human brain functions. Simply put: recurrent neural networks can produce predictive results on sequential data that other algorithms can't
Explain the bias-variance tradeoff
Bias is the degree to which a model's predictions deviate from the true values. High bias implies a simple model that cannot capture the complexity of the data and is underfit.
Variance is the degree to which a model's predictions vary across different training sets. High variance implies a complex model that overfits to the training data.
The bias-variance tradeoff is therefore the balance of model complexity that keeps both bias and variance low enough to neither overfit nor underfit the training data, and thus make more accurate predictions on new, unseen data.
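A minimal sketch, assuming scikit-learn and NumPy are available (the data and polynomial degrees are made up for illustration): fitting polynomials of increasing degree shows training error falling steadily while test error is lowest at intermediate complexity.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, size=60)  # noisy nonlinear target
X_train, X_test, y_train, y_test = X[:40], X[40:], y[:40], y[40:]

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),  # falls as degree grows
          mean_squared_error(y_test, model.predict(X_test)))    # U-shaped in degree
```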
What is lexicon normalization in text preprocessing?
A common type of textual noise is the multiple representations exhibited by a single word. For example, "play", "player", "played", and "plays" are different variations of the word "play". Though they differ in form, they are contextually similar.
Lexicon normalization converts all such variations of a word into their normalized form (also known as the lemma). Normalization is a pivotal step in feature engineering with text, as it converts high-dimensional features into a lower-dimensional space, which is ideal for any machine learning model. The most common lexicon normalization practices are stemming and lemmatization
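A minimal sketch of both practices using NLTK (assumes nltk is installed and the WordNet corpus has been downloaded, e.g. via nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["play", "player", "played", "plays", "playing"]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
# Stemming crudely chops suffixes; lemmatization maps to the dictionary
# lemma (e.g. "played" -> "play" when treated as a verb).
```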
Define confusion matrix, accuracy, precision, and recall
A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives a holistic view of how well the classification model is performing and what kinds of errors it is making.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
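A small worked example of these formulas with hypothetical counts:

```python
# Hypothetical counts read off a 2x2 confusion matrix
TP, TN, FP, FN = 40, 45, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # 0.85
precision = TP / (TP + FP)                   # 0.888...
recall    = TP / (TP + FN)                   # 0.8
print(accuracy, precision, recall)
```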
What are the regularization techniques that can be used for a convolutional neural network?
- L2 & L1 regularization
- Dropout
- Data augmentation
- Early stopping
Explain the steps to create a bag of words
- Tokenization: First, the input text is tokenized. Each sentence is represented as a list of its constituent words; this is done for all the input sentences
- Vocabulary creation: Of all the tokenized words, only the unique words are selected to create the vocabulary, which is then sorted alphabetically
- Vector creation: Finally, a sparse matrix is created from the frequencies of vocabulary words in the input. Each row of this matrix is a sentence vector whose length (the number of columns of the matrix) equals the size of the vocabulary (see the sketch after this list)
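A minimal plain-Python sketch of the three steps (the example sentences are made up; a real pipeline would also lowercase and strip punctuation):

```python
sentences = ["the cat sat", "the dog sat on the mat"]

# 1. Tokenization: each sentence becomes a list of words
tokenized = [s.split() for s in sentences]

# 2. Vocabulary creation: unique words, sorted alphabetically
vocab = sorted({w for toks in tokenized for w in toks})

# 3. Vector creation: one count vector per sentence, one column per vocab word
vectors = [[toks.count(w) for w in vocab] for toks in tokenized]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```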
(T/F) You have classification data with classes Y = {+1, -1} and features Fi = {+1, -1} for i = {1, …, K}. In an attempt to turbocharge your classifier you duplicate each feature so now each example has 2K features with Fk+i = Fi for i = {1, …, K}. The following questions compare the original feature set with the doubled one. You may assume in the case of ties, class +1 is always chosen. Assume there are equal numbers of training examples in each class.
For a Naive Bayes model, which of the following are true:
1. The test accuracy could be higher with the doubled feature set
2. The test accuracy will be the same with either feature set
3. The test accuracy could be higher with the original features
- False
- True: with equal class priors, doubling the features squares each class's likelihood product, and squaring preserves which class has the larger product, so every prediction (including ties) is unchanged and test accuracy is the same
- False
You are training a model and find the training loss is near 0 but the test loss is very high. Which of the following is expected to reduce test loss? (multi)
- Increase training data size
- Decrease training data size
- Increase model complexity
- Decrease model complexity
- Training on a combination of training and test but only test on test
- Conclude that ML doesn’t work
- Increase training data size
- Decrease model complexity
- Training on a combination of training and test but test only on test (would reduce test loss but is not good practice)
You train a linear classifier on 1000 training points and discover that accuracy is only 50%. Which of the following, if done in isolation, has a good chance of improving training accuracy? (multi)
1. Add new features
2. Train on more data
3. Train on less data
- Add new features
- Train on less data
In supervised learning, training data includes:
1. Output
2. Input
3. Both
4. None
Both
You are given reviews of a few Netflix series marked as positive, negative, or neutral. Classifying reviews of a new Netflix series is an example of:
1. Supervised Learning
2. Unsupervised Learning
3. Semisupervised Learning
4. Reinforcement Learning
Supervised learning
Which of the following is the second stage in NLP?
1. Discourse analysis
2. Syntactic analysis
3. Semantic analysis
4. Pragmatic analysis
Syntactic analysis
Text summarization finds the most informative sentences in which of the following:
1. Video
2. Sound
3. Image
4. Document
Document
Why is the XOR problem exceptionally interesting to researchers?
Because it is the simplest linearly inseparable problem that exists
Which of the following gives non-linearity to a neural network?
1. Convolution
2. Stochastic gradient descent
3. Sigmoid activation function
4. Non-zero bias
Sigmoid activation function
A matches the start of the string and B matches the end:
1. A = ^, B = $
2. A = $, B = ^
3. A = $, B = ?
4. A = ?, B = ^
A = ^, B = $
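A quick check of these anchors in Python's re module:

```python
import re

print(bool(re.search(r"^cat", "cat nap")))  # True: 'cat' at the start
print(bool(re.search(r"nap$", "cat nap")))  # True: 'nap' at the end
print(bool(re.search(r"^nap", "cat nap")))  # False: 'nap' is not at the start
```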
If we use K-means on a finite set of examples, which of the following is true:
1. K-Means is not guaranteed to terminate
2. K-Means is guaranteed to terminate but is not guaranteed to find the optimal clustering
3. K-Means is guaranteed to terminate and find the optimal clustering
4. None of the above
K-Means is guaranteed to terminate but is not guaranteed to find the optimal clustering
Given a sound clip of a person speaking, the textual representation of the speech can be determined by what?
Speech-to-text
Naive Bayes Requires:
1. Categorical Values
2. Numerical Values
3. Either 1 or 2
4. Both 1 and 2
Categorical values
Which of the following are the most widely used metrics and tools to assess a classification model?
1. Confusion matrix
2. Precision
3. Area under the ROC curve
4. All of the above
All of the above
In a classification problem if, according to the hypothesis, output should be positive but it is negative it is said to be:
1. False positive
2. False negative
3. Consistent hypothesis
4. None of the above
False negative
In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden layer, and 1 neuron in the output layer, what are the sizes of the weight matrices between the hidden and output layers and between the input and hidden layers?
1. 1x5, 5x8
2. 5x1, 8x5
3. 8x5, 5x1
4. 8x5, 1x5
5x1, 8x5
_____ is a high-level API built on TensorFlow
1. PyBrain
2. Keras
3. PyTorch
4. Theano
Keras
The classification boundary realized by the perceptron is:
1. Parabola
2. Straight line
3. Circle
4. Ellipse
Straight line (a perceptron is a linear classifier: its decision boundary is a hyperplane, i.e., a straight line in two dimensions)
How do we calculate the hidden layer input for a multi-layer perceptron, which is then passed into the activation function?
hidden_layer_input = matrix_dot_product(X, wh) + bh
Where X is the input matrix, wh is the weight matrix, and bh is the bias matrix
What is the sigmoid activation function?
1 / (1 + e^(-x))
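A minimal NumPy sketch tying the two cards above together (the layer sizes are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X  = np.random.rand(4, 8)  # 4 examples, 8 input features
wh = np.random.rand(8, 5)  # input -> hidden weight matrix
bh = np.random.rand(1, 5)  # hidden bias

hidden_layer_input = np.dot(X, wh) + bh  # matrix_dot_product(X, wh) + bh
hidden_layer_activations = sigmoid(hidden_layer_input)
print(hidden_layer_activations.shape)    # (4, 5)
```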
How do we calculate the slope or gradient of hidden and output layer neurons?
By calculating the derivative of the non-linear activation function at each layer for each neuron, evaluated at that neuron's output value.
How do we calculate the error gradient?
Eg = dEt / dw
Where Eg is the error gradient and dEt/dw is the partial derivative of the total error with respect to weight
How do we calculate the change factor (delta) at the output layer?
d_output = Eg * slope_output
Where Eg is the error gradient and slope_output is the derivative of the activation function evaluated at the output
What are the steps in one epoch of training a multi-layer perceptron? (a NumPy sketch follows the list)
- Forward propagation
- Compute the loss
- Backward propagation
- Update the weights
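A minimal sketch of all four steps for a 1-hidden-layer network in NumPy (the layer sizes, learning rate, and squared-error loss are illustrative assumptions; constant factors in the gradient are folded into the learning rate):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.random((4, 8))                         # 4 examples, 8 input features
y = rng.integers(0, 2, (4, 1)).astype(float)   # binary targets
wh, bh = rng.random((8, 5)), np.zeros((1, 5))  # input -> hidden parameters
wo, bo = rng.random((5, 1)), np.zeros((1, 1))  # hidden -> output parameters
lr = 0.1                                       # learning rate (hyperparameter)

# 1. Forward propagation
h = sigmoid(X @ wh + bh)
out = sigmoid(h @ wo + bo)

# 2. Compute the loss (mean squared error)
loss = np.mean((y - out) ** 2)

# 3. Backward propagation: delta = error term * derivative of the activation
d_out = (out - y) * out * (1 - out)  # delta at the output layer
d_h = (d_out @ wo.T) * h * (1 - h)   # delta at the hidden layer

# 4. Update the weights (gradient descent step)
wo -= lr * (h.T @ d_out)
bo -= lr * d_out.sum(axis=0, keepdims=True)
wh -= lr * (X.T @ d_h)
bh -= lr * d_h.sum(axis=0, keepdims=True)
print(loss)
```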
What happens in forward propagation?
Input data is passed through the input layer. Each neuron computes the weighted sum of its inputs and passes it through its activation function to produce an output. This output is passed to the next layer of neurons, and the process repeats until the output layer, which produces the predicted output of the network
What happens in backward propagation?
Once loss is computed, the error is propagated back through the network. We take the derivative of the loss function with respect to the output of each neuron in the layer. We multiply this by the derivative of the activation function to get the delta value. This value serves as the input to the previous layer and we repeat until we’re back to the input layer
What happens when we update the weights of the MLP after backward propagation?
We update weights using an optimization algorithm like stochastic gradient descent to minimize the loss. The amount that weights are updated depends on the given learning rate which is a hyperparameter
What is pooling in a convolutional neural network?
In max pooling (the most common form), we take the input matrix and replace each non-overlapping 2x2 block with the maximum value in that submatrix. The purpose is to reduce the size of the output from the convolutional layer while retaining the most important information. It reduces the number of parameters and helps prevent overfitting
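A minimal NumPy sketch of 2x2 max pooling (assumes the input's dimensions are even):

```python
import numpy as np

def max_pool_2x2(x):
    h, w = x.shape
    # Split into non-overlapping 2x2 blocks and take each block's maximum
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))
# [[ 5  7]
#  [13 15]]
```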
What happens in a convolutional layer of a convolutional neural network?
A filter is moved across the input image to detect patterns and features. The filter consists of weights that are adjusted during training. The convolutional layer reduces the size of the input data while extracting important features
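A minimal sketch of one filter sliding over a 2D input with stride 1 and no padding (the filter values are illustrative; real layers learn them, and they compute cross-correlation as done here):

```python
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Weighted sum of the patch under the filter at position (i, j)
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.rand(5, 5)
edge_filter = np.array([[1., 0., -1.]] * 3)  # crude vertical-edge detector
print(conv2d(image, edge_filter).shape)      # (3, 3): output is smaller
```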
What is stochastic gradient descent?
An optimization algorithm used to adjust the parameters of a model to make more accurate predictions. It calculates the gradient of the error function (how far off the predictions are) on a small subset of the data and uses this to update the model's parameters. This is repeated until the error is minimized
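A minimal sketch for a one-parameter linear model (synthetic data with a true slope of 3; "subsets" here are mini-batches of size 1):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)
y = 3 * x + rng.normal(0, 0.1, 100)

w, lr = 0.0, 0.1
for step in range(200):
    i = rng.integers(len(x))             # pick one random example
    grad = 2 * (w * x[i] - y[i]) * x[i]  # gradient of squared error w.r.t. w
    w -= lr * grad                       # step against the gradient
print(w)  # close to 3
```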
What is data augmentation?
Creates additional training examples by applying small transformations to the inputs (for example, adding Gaussian noise around each data point) while leaving the outputs unchanged. It improves the generalization ability of the model by exposing it to more variation in the input data, making it more robust to such variations
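A minimal noise-based augmentation sketch in NumPy (the noise scale and number of copies are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 4))         # 10 examples, 4 features
y = rng.integers(0, 2, 10)

copies = 3
X_aug = np.concatenate(
    [X] + [X + rng.normal(0, 0.05, X.shape) for _ in range(copies)])
y_aug = np.tile(y, copies + 1)  # labels are repeated, not altered
print(X_aug.shape, y_aug.shape) # (40, 4) (40,)
```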
What are the 3 steps to text preprocessing in natural language processing?
Noise removal, lexicon normalization, object standardization
What is noise removal in NLP?
Removing pieces of text that are not relevant to the context of the data (stopwords, for example). A general approach is to maintain a dictionary of "noisy" entities and eliminate any tokens that appear in that dictionary
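A minimal sketch with a hypothetical noise dictionary:

```python
noise_words = {"is", "this", "the", "a"}

def remove_noise(text):
    # Keep only tokens that are not in the noise dictionary
    return " ".join(t for t in text.split() if t.lower() not in noise_words)

print(remove_noise("this is the sample text"))  # "sample text"
```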
What are the most common lexicon normalization practices?
Stemming, lemmatization
What is lemmatization?
An organized procedure of obtaining the root form of a word to reduce inflections of variant forms to the base form.
Ex: am, are, is -> be
How do you calculate the inverse document frequency for a set of documents?
IDF = log(total # of docs / # of docs containing word W)
What is the formula for the TF-IDF score?
w = tf * log(N/df)
Where tf is the term frequency of the word in the document, N is the total number of documents, and df is the number of documents containing the word
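A worked example of the two formulas above in plain Python (two toy documents; natural log; tf as a raw count, though some definitions normalize it by document length):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "on", "the", "mat"]]
N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word)               # term frequency in this document
    df = sum(word in d for d in docs)  # number of documents containing word
    return tf * math.log(N / df)       # w = tf * log(N / df)

print(tf_idf("cat", docs[0]))  # log(2) ~= 0.693: appears in only one document
print(tf_idf("the", docs[0]))  # 0.0: appears in every document, log(2/2) = 0
```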