Final Review Flashcards
(T/F) Supervised learning and unsupervised clustering both require at least one input attribute
True (clustering needs at least one input attribute to group on, and supervised learning needs at least one input attribute to map to the output)
(T/F) Grouping people in a social network is an example of unsupervised machine learning
True
What is topic modelling in natural language processing (NLP)?
Topic modelling is an unsupervised machine learning approach that can scan a series of documents, find word and phrase patterns within them, and automatically cluster word groupings and related expressions that best represent the set
What is a recurrent neural network (RNN)?
Recurrent neural networks are a class of neural networks that are helpful in modelling sequence data.
Derived from feedforward networks, RNNs exhibit behaviour similar to how the human brain functions. Simply put: recurrent neural networks can produce predictive results on sequential data that other algorithms can't
Explain the bias-variance tradeoff
Bias is the degree to which a model's predictions deviate from the true values. High bias implies a simple model that cannot capture the complexity of the data and is underfit.
Variance is the degree to which a model's predictions vary across different training sets. High variance implies a complex model that overfits to the training data.
The bias-variance tradeoff is therefore the balance of model complexity that keeps both bias and variance low enough to neither overfit nor underfit the training data, and thus make more accurate predictions on new, unseen data.
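A minimal sketch, assuming scikit-learn and NumPy are available (the data and polynomial degrees are made up for illustration): fitting polynomials of increasing degree shows training error falling steadily while test error is lowest at intermediate complexity.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, size=60)  # noisy nonlinear target
X_train, X_test, y_train, y_test = X[:40], X[40:], y[:40], y[40:]

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),  # falls as degree grows
          mean_squared_error(y_test, model.predict(X_test)))    # U-shaped in degree
```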
What is lexicon normalization in text preprocessing?
A common type of textual noise is the multiple representations exhibited by a single word. For example, "play", "player", "played", and "plays" are different variations of the word "play". Though they differ in form, they are contextually similar.
Lexicon normalization converts all such variations of a word into their normalized form (also known as the lemma). Normalization is a pivotal step in feature engineering with text, as it converts high-dimensional features into a lower-dimensional space, which is ideal for any machine learning model. The most common lexicon normalization practices are stemming and lemmatization
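A minimal sketch of both practices using NLTK (assumes nltk is installed and the WordNet corpus has been downloaded, e.g. via nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["play", "player", "played", "plays", "playing"]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))
# Stemming crudely chops suffixes; lemmatization maps to the dictionary
# lemma (e.g. "played" -> "play" when treated as a verb).
```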
Define confusion matrix, accuracy, precision, and recall
A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives a holistic view of how well the classification model is performing and what kinds of errors it is making.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
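A small worked example of these formulas with hypothetical counts:

```python
# Hypothetical counts read off a 2x2 confusion matrix
TP, TN, FP, FN = 40, 45, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # 0.85
precision = TP / (TP + FP)                   # 0.888...
recall    = TP / (TP + FN)                   # 0.8
print(accuracy, precision, recall)
```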
What are the regularization techniques that can be used for a convolutional neural network?
- L2 & L1 regularization
- Dropout
- Data augmentation
- Early stopping
Explain the steps to create a bag of words
- Tokenization: First, the input text is tokenized. Each sentence is represented as a list of its constituent words; this is done for all the input sentences
- Vocabulary creation: Of all the tokenized words, only the unique words are selected to create the vocabulary, which is then sorted alphabetically
- Vector creation: Finally, a sparse matrix is created from the frequencies of vocabulary words in the input. Each row of this matrix is a sentence vector whose length (the number of columns of the matrix) equals the size of the vocabulary (see the sketch after this list)
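A minimal plain-Python sketch of the three steps (the example sentences are made up; a real pipeline would also lowercase and strip punctuation):

```python
sentences = ["the cat sat", "the dog sat on the mat"]

# 1. Tokenization: each sentence becomes a list of words
tokenized = [s.split() for s in sentences]

# 2. Vocabulary creation: unique words, sorted alphabetically
vocab = sorted({w for toks in tokenized for w in toks})

# 3. Vector creation: one count vector per sentence, one column per vocab word
vectors = [[toks.count(w) for w in vocab] for toks in tokenized]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```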
(T/F) You have classification data with classes Y = {+1, -1} and features Fi = {+1, -1} for i = {1, …, K}. In an attempt to turbocharge your classifier you duplicate each feature so now each example has 2K features with Fk+i = Fi for i = {1, …, K}. The following questions compare the original feature set with the doubled one. You may assume in the case of ties, class +1 is always chosen. Assume there are equal numbers of training examples in each class.
For a Naive Bayes model, which of the following are true:
1. The test accuracy could be higher with the doubled feature set
2. The test accuracy will be the same with either feature set
3. The test accuracy could be higher with the original features
- False
- True: with equal class priors, doubling the features squares each class's likelihood product, and squaring preserves which class has the larger product, so every prediction (including ties) is unchanged and test accuracy is the same
- False
You are training a model and find the training loss is near 0 but the test loss is very high. Which of the following is expected to reduce test loss? (multi)
- Increase training data size
- Decrease training data size
- Increase model complexity
- Decrease model complexity
- Training on a combination of training and test but only test on test
- Conclude that ML doesn’t work
- Increase training data size
- Decrease model complexity
- Training on a combination of training and test but test only on test (would reduce test loss but is not good practice)
You train a linear classifier on 1000 training points and discover that accuracy is only 50%. Which of the following, if done in isolation, has a good chance of improving training accuracy? (multi)
1. Add new features
2. Train on more data
3. Train on less data
- Add new features
- Train on less data
In supervised learning, training data includes:
1. Output
2. Input
3. Both
4. None
Both
You are given reviews of a few Netflix series marked as positive, negative, or neutral. Classifying reviews of a new Netflix series is an example of:
1. Supervised Learning
2. Unsupervised Learning
3. Semisupervised Learning
4. Reinforcement Learning
Supervised learning
Which of the following is the second stage in NLP?
1. Discourse analysis
2. Syntactic analysis
3. Semantic analysis
4. Pragmatic analysis
Syntactic analysis
Text summarization finds the most informative sentences in which of the following:
1. Video
2. Sound
3. Image
4. Document
Document
Why is the XOR problem exceptionally interesting to researchers?
Because it is the simplest linearly inseparable problem that exists
Which of the following gives non-linearity to a neural network?
1. Convolution
2. Stochastic gradient descent
3. Sigmoid activation function
4. Non-zero bias
Sigmoid activation function
A matches the start of the string and B matches the end:
1. A = ^, B = $
2. A = $, B = ^
3. A = $, B = ?
4. A = ?, B = ^
A = ^, B = $
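A quick check of these anchors in Python's re module:

```python
import re

print(bool(re.search(r"^cat", "cat nap")))  # True: 'cat' at the start
print(bool(re.search(r"nap$", "cat nap")))  # True: 'nap' at the end
print(bool(re.search(r"^nap", "cat nap")))  # False: 'nap' is not at the start
```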
If we use K-means on a finite set of examples, which of the following is true:
1. K-Means is not guaranteed to terminate
2. K-Means is guaranteed to terminate but is not guaranteed to find the optimal clustering
3. K-Means is guaranteed to terminate and find the optimal clustering
4. None of the above
K-Means is guaranteed to terminate but is not guaranteed to find the optimal clustering
Given a sound clip of a person speaking, the textual representation of the speech can be determined by what?
Speech-to-text
Naive Bayes Requires:
1. Categorical Values
2. Numerical Values
3. Either 1 or 2
4. Both 1 and 2
Categorical values
Which of the following are the most widely used metrics and tools to assess a classification model?
1. Confusion matrix
2. Precision
3. Area under the ROC curve
4. All of the above
All of the above
In a classification problem if, according to the hypothesis, output should be positive but it is negative it is said to be:
1. False positive
2. False negative
3. Consistent hypothesis
4. None of the above
False negative
In a simple MLP model with 8 neurons in the input layer, 5 neurons in the hidden layer, and 1 neuron in the output layer, what are the sizes of the weight matrices between the hidden and output layers and between the input and hidden layers?
1. 1x5, 5x8
2. 5x1, 8x5
3. 8x5, 5x1
4. 8x5, 1x5
5x1, 8x5
_____ is a high-level API built on TensorFlow
1. PyBrain
2. Keras
3. PyTorch
4. Theano
Keras
The classification boundary realized by the perceptron is:
1. Parabola
2. Straight line
3. Circle
4. Ellipse
Straight line (a perceptron is a linear classifier: its decision boundary is a hyperplane, i.e., a straight line in two dimensions)
How do we calculate the hidden layer input for a multi-layer perceptron, which is then passed into the activation function?
hidden_layer_input = matrix_dot_product(X, wh) + bh
Where X is the input matrix, wh is the weight matrix, and bh is the bias matrix
What is the sigmoid activation function?
1 / (1 + e^(-x))
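A minimal NumPy sketch tying the two cards above together (the layer sizes are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X  = np.random.rand(4, 8)  # 4 examples, 8 input features
wh = np.random.rand(8, 5)  # input -> hidden weight matrix
bh = np.random.rand(1, 5)  # hidden bias

hidden_layer_input = np.dot(X, wh) + bh  # matrix_dot_product(X, wh) + bh
hidden_layer_activations = sigmoid(hidden_layer_input)
print(hidden_layer_activations.shape)    # (4, 5)
```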
How do we calculate the slope or gradient of hidden and output layer neurons?
By calculating the derivative of the non-linear activation function at each layer for each neuron, evaluated at that neuron's output value.
How do we calculate the error gradient?
Eg = dEt / dw
Where Eg is the error gradient and dEt/dw is the partial derivative of the total error with respect to weight
How do we calculate the change factor (delta) at the output layer?
d_output = Eg * slope_output
Where Eg is the error gradient and slope_output is the derivative of the activation function evaluated at the output
What are the steps in one epoch of training a multi-layer perceptron? (a NumPy sketch follows the list)
- Forward propagation
- Compute the loss
- Backward propagation
- Update the weights
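A minimal sketch of all four steps for a 1-hidden-layer network in NumPy (the layer sizes, learning rate, and squared-error loss are illustrative assumptions; constant factors in the gradient are folded into the learning rate):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.random((4, 8))                         # 4 examples, 8 input features
y = rng.integers(0, 2, (4, 1)).astype(float)   # binary targets
wh, bh = rng.random((8, 5)), np.zeros((1, 5))  # input -> hidden parameters
wo, bo = rng.random((5, 1)), np.zeros((1, 1))  # hidden -> output parameters
lr = 0.1                                       # learning rate (hyperparameter)

# 1. Forward propagation
h = sigmoid(X @ wh + bh)
out = sigmoid(h @ wo + bo)

# 2. Compute the loss (mean squared error)
loss = np.mean((y - out) ** 2)

# 3. Backward propagation: delta = error term * derivative of the activation
d_out = (out - y) * out * (1 - out)  # delta at the output layer
d_h = (d_out @ wo.T) * h * (1 - h)   # delta at the hidden layer

# 4. Update the weights (gradient descent step)
wo -= lr * (h.T @ d_out)
bo -= lr * d_out.sum(axis=0, keepdims=True)
wh -= lr * (X.T @ d_h)
bh -= lr * d_h.sum(axis=0, keepdims=True)
print(loss)
```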
What happens in forward propagation?
Input data is passed through the input layer. Each neuron computes the weighted sum of its inputs and passes it through its activation function to produce an output. This output is passed to the next layer of neurons, and the process repeats until the output layer, which produces the predicted output of the network
What happens in backward propagation?
Once loss is computed, the error is propagated back through the network. We take the derivative of the loss function with respect to the output of each neuron in the layer. We multiply this by the derivative of the activation function to get the delta value. This value serves as the input to the previous layer and we repeat until we’re back to the input layer
What happens when we update the weights of the MLP after backward propagation?
We update weights using an optimization algorithm like stochastic gradient descent to minimize the loss. The amount that weights are updated depends on the given learning rate which is a hyperparameter
What is pooling in a convolutional neural network?
In max pooling (the most common form), we take the input matrix and replace each non-overlapping 2x2 block with the maximum value in that submatrix. The purpose is to reduce the size of the output from the convolutional layer while retaining the most important information. It reduces the number of parameters and helps prevent overfitting
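A minimal NumPy sketch of 2x2 max pooling (assumes the input's dimensions are even):

```python
import numpy as np

def max_pool_2x2(x):
    h, w = x.shape
    # Split into non-overlapping 2x2 blocks and take each block's maximum
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))
# [[ 5  7]
#  [13 15]]
```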
What happens in a convolutional layer of a convolutional neural network?
A filter is moved across the input image to detect patterns and features. The filter consists of weights that are adjusted during training. The convolutional layer reduces the size of the input data while extracting important features
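A minimal sketch of one filter sliding over a 2D input with stride 1 and no padding (the filter values are illustrative; real layers learn them, and they compute cross-correlation as done here):

```python
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Weighted sum of the patch under the filter at position (i, j)
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.rand(5, 5)
edge_filter = np.array([[1., 0., -1.]] * 3)  # crude vertical-edge detector
print(conv2d(image, edge_filter).shape)      # (3, 3): output is smaller
```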
What is stochastic gradient descent?
An optimization algorithm used to adjust the parameters of a model to make more accurate predictions. It calculates the gradient of the error function (how far off the predictions are) on a small subset of the data and uses this to update the model's parameters. This is repeated until the error is minimized
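A minimal sketch for a one-parameter linear model (synthetic data with a true slope of 3; "subsets" here are mini-batches of size 1):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)
y = 3 * x + rng.normal(0, 0.1, 100)

w, lr = 0.0, 0.1
for step in range(200):
    i = rng.integers(len(x))             # pick one random example
    grad = 2 * (w * x[i] - y[i]) * x[i]  # gradient of squared error w.r.t. w
    w -= lr * grad                       # step against the gradient
print(w)  # close to 3
```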
What is data augmentation?
Creates additional training examples by applying small transformations to the inputs (for example, adding Gaussian noise around each data point) while leaving the outputs unchanged. It improves the generalization ability of the model by exposing it to more variation in the input data, making it more robust to such variations
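A minimal noise-based augmentation sketch in NumPy (the noise scale and number of copies are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 4))         # 10 examples, 4 features
y = rng.integers(0, 2, 10)

copies = 3
X_aug = np.concatenate(
    [X] + [X + rng.normal(0, 0.05, X.shape) for _ in range(copies)])
y_aug = np.tile(y, copies + 1)  # labels are repeated, not altered
print(X_aug.shape, y_aug.shape) # (40, 4) (40,)
```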
What are the 3 steps to text preprocessing in natural language processing?
Noise removal, lexicon normalization, object standardization
What is noise removal in NLP?
Removing pieces of text that are not relevant to the context of the data (stopwords, for example). A general approach is to maintain a dictionary of "noisy" entities and eliminate any tokens that appear in that dictionary
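A minimal sketch with a hypothetical noise dictionary:

```python
noise_words = {"is", "this", "the", "a"}

def remove_noise(text):
    # Keep only tokens that are not in the noise dictionary
    return " ".join(t for t in text.split() if t.lower() not in noise_words)

print(remove_noise("this is the sample text"))  # "sample text"
```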
What are the most common lexicon normalization practices?
Stemming, lemmatization
What is lemmatization?
An organized procedure of obtaining the root form of a word to reduce inflections of variant forms to the base form.
Ex: am, are, is -> be
How do you calculate the inverse document frequency for a set of documents?
IDF = log(total # of docs / # of docs containing word W)
What is the formula for the TF-IDF score?
w = tf * log(N/df)
Where tf is the term frequency of the word in the document, N is the total number of documents, and df is the number of documents containing the word
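A worked example of the two formulas above in plain Python (two toy documents; natural log; tf as a raw count, though some definitions normalize it by document length):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "on", "the", "mat"]]
N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word)               # term frequency in this document
    df = sum(word in d for d in docs)  # number of documents containing word
    return tf * math.log(N / df)       # w = tf * log(N / df)

print(tf_idf("cat", docs[0]))  # log(2) ~= 0.693: appears in only one document
print(tf_idf("the", docs[0]))  # 0.0: appears in every document, log(2/2) = 0
```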