ML Flashcards

Don't flunk

1
Q
  1. High entropy means that the partitions in classification are

a) pure

b) not pure

c) useful

d) useless

A

(b) Not pure

Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.

It is a measure of disorder, impurity, unpredictability, or uncertainty.

Low entropy means less uncertainty; high entropy means more uncertainty.
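
As a quick illustration (not part of the original card; plain Python, no external libraries), the snippet below computes the entropy of a two-class partition: a pure partition gives 0 bits, while a 50/50 partition gives 1 bit.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a partition with the given class proportions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))   # pure partition   -> 0.0 (low entropy)
print(entropy([0.5, 0.5]))   # impure partition -> 1.0 (high entropy)
```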

2
Q
  1. Which of the following is NOT supervised learning?

a) PCA

b) Decision Tree

c) Linear Regression

d) Naive Bayesian

A

a) PCA

3
Q
  1. Which of the following statements about Naive Bayes is incorrect?

a) Attributes are equally important.

b) Attributes are statistically dependent on one another given the class value.

c) Attributes are statistically independent of one another given the class value.

d) Attributes can be nominal or numeric

A

b) Attributes are statistically dependent on one another given the class value

Attributes are statistically independent of one another given the class value.

Naïve Bayes

Naïve Bayes classifier assumes conditional independence between attributes and assigns the MAP class to new instances.

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.

It is called naive Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate P(d1, d2, d3|h) directly, the attribute values are assumed to be conditionally independent given the target value, so the probability is calculated as P(d1|h) * P(d2|h) * P(d3|h), and so on.
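
A minimal sketch of that conditional-independence computation; the class names and probability values below are hypothetical, chosen only to illustrate the P(h) * P(d1|h) * P(d2|h) * ... product and the MAP choice.

```python
# Hypothetical priors and per-attribute likelihoods (not from the original card).
priors = {"spam": 0.4, "ham": 0.6}          # P(h)
likelihoods = {                              # P(d_i | h), one entry per attribute value
    "spam": [0.8, 0.3, 0.6],
    "ham":  [0.1, 0.7, 0.4],
}

scores = {}
for h, prior in priors.items():
    score = prior
    for p in likelihoods[h]:
        score *= p                           # P(h) * P(d1|h) * P(d2|h) * ...
    scores[h] = score

print(max(scores, key=scores.get))           # MAP class for this instance
```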

4
Q
  1. A and B are two events. If P(A, B) decreases while P(A) increases, which of the following is true?

a) P(A|B) decreases

b) P(B|A) decreases

c) P(B) decreases

d) All of above

A

(b) P(B|A) decreases

The joint probability can be factored using conditional probabilities:

P(A, B) = P(A|B)P(B) = P(B|A)P(A).

Let us take the second factorization:

P(A, B) = P(B|A)P(A).

In this equation, if P(A) increases, then only a decrease in P(B|A) can result in a decrease of P(A, B).

5
Q
  1. In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. This suggests that

a) This feature has a strong effect on the model (should be retained)

b) This feature does not have a strong effect on the model (should be ignored)

c) It is not possible to comment on the importance of this feature without additional information

d) Nothing can be determined.

A

(c) It is not possible to comment on the importance of this feature without additional information

A high magnitude suggests that the feature is important. However, it may be the case that another feature is highly correlated with this feature and its coefficient also has a high magnitude with the opposite sign, in effect cancelling out the effect of the former. Thus, we cannot really remark on the importance of a feature just because its coefficient has a relatively large magnitude.

6
Q
  1. As the number of training examples goes to infinity, your model trained on that data will have:

a) Lower variance

b) Higher variance

c) Same variance

d) None of the above

A

Answer: (a) Lower variance

Once you have more training examples, you will have lower test error (the variance of the model decreases, meaning we are overfitting less).

Refer to “In Machine Learning, What is Better: More Data or Better Algorithms” for more details.

High variance – a model that represents the training set well, but is at risk of overfitting to noisy or unrepresentative training data.

High bias – a simpler model that doesn’t tend to overfit, but may underfit the training data, failing to capture important regularities.

7
Q
  1. Which of the following is/are true regarding an SVM?

a) For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line.

b) In theory, a Gaussian kernel SVM cannot model any complex separating hyperplane.

c) For every kernel function used in a SVM, one can obtain an equivalent closed form basis expansion.

d) Overfitting in an SVM is not a function of number of support vectors.

A

a) For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line

SVM, or Support Vector Machine, is a linear model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical problems. The algorithm creates a line or a hyperplane which separates the data into classes.

A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts.

8
Q
  1. Which of the following guidelines is applicable to initialization of the weight vector in a fully connected neural network?

a) Should not set it to zero since otherwise it will cause overfitting

b) Should not set it to zero since otherwise (stochastic) gradient descent will explore a very small space

c) Should set it to zero since otherwise it causes a bias

d) Should set it to zero in order to preserve symmetry across all neurons

A

(b) should not set it to zero since otherwise gradient descent will explore a very small space

If we initialize all the weights to zero, the neural network will train, but all the neurons will learn the same features during training. Setting all weights to zero makes the model equivalent to a linear model. When you set all weights to 0, the derivative of the loss function is the same for every w in the weight matrix; thus, all the weights have the same values in the subsequent iteration. Hence, they must be initialized to random numbers.

9
Q
  1. For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model):

a) The number of hidden nodes

b) The learning rate

c) The initial choice of weights

d) The use of a constant-term unit input

A

(a) The number of hidden nodes

The number of hidden nodes. Zero hidden nodes will result in a linear model, while many hidden nodes (with non-linear activations) significantly increase the variance of the model. A feed-forward neural network without hidden nodes can only find linear decision boundaries.

10
Q
  1. You’ve just finished training a decision tree for spam classification, and it is getting abnormally bad performance on both your training and test sets. You know that your implementation has no bugs, so what could be causing the problem?

a) Your decision trees are too shallow.

b) You need to increase the learning rate.

c) You are overfitting.

d) None of the above.

A

(a) your decision trees are too shallow

Shallow decision trees - trees that are too shallow might lead to overly simple models that can’t fit the data.

A model that is underfit will have high training and high testing error. Hence, bad performance on the training and test sets indicates underfitting, which means the set of hypotheses is not complex enough (shallow decision trees) to include the true but unknown prediction function.

The shallower the tree, the less variance we have in our predictions; however, at some point we can start to inject too much bias, as shallow trees (e.g., stumps) are not able to capture interactions and complex patterns in our data.

11
Q
  1. ___________ refers to a model that can neither model the training data nor generalize to new data.

a) good fitting

b) overfitting

c) underfitting

d) all of the above

A

c) underfitting

12
Q
  1. Which among the following prevents overfitting when we perform bagging?

a) The use of sampling with replacement as the sampling technique

b) The use of weak classifiers

c) The use of classification algorithms which are not prone to overfitting

d) The practice of validation performed on every classifier trained

A

(b) the use of weak classifiers

The presence of over-training (which leads to overfitting) is not generally a problem with weak classifiers. For example, in decision stumps, i.e., decision trees with only one node (the root node), there is no real scope for overfitting. This helps the classifier that combines the outputs of the weak classifiers avoid overfitting.

13
Q
  1. Averaging the output of multiple decision trees helps ________.

a) Increase bias

b) Decrease bias

c) Increase variance

d) Decrease variance

A

(d) decrease variance

Averaging out the predictions of multiple classifiers will drastically reduce the variance.

Averaging is not specific to decision trees; it can work with many different learning algorithms. But it works particularly well with decision trees.

Why averaging?

If two trees pick different features for the very first split at the top of the tree, then it’s quite common for the trees to be completely different. So decision trees tend to have high variance. To fix this, we can reduce the variance of decision trees by taking an average answer of a bunch of decision trees.
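
A rough numerical illustration of why averaging lowers variance (assuming NumPy; the noisy normal samples below simply stand in for the predictions of independent, high-variance trees):

```python
import numpy as np

rng = np.random.default_rng(0)

# One noisy "tree" prediction vs. the average of 25 equally noisy, independent trees.
single  = rng.normal(loc=0.0, scale=1.0, size=10_000)
average = rng.normal(loc=0.0, scale=1.0, size=(10_000, 25)).mean(axis=1)

print(single.var())    # ~1.0
print(average.var())   # ~1/25: averaging shrinks the variance
```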

14
Q
  1. If N is the number of instances in the training dataset, nearest neighbors has a classification run time of

a) O(1)

b) O( N )

c) O(log N )

d) O( N 2 )

A

(b) O(N)

Nearest neighbors needs to compute distances to each of the N training instances. Hence, the classification run time complexity is O(N).
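
A small sketch of this (assuming NumPy; the toy data is hypothetical). The distance computation below touches every one of the N training instances, which is where the O(N) classification cost comes from.

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)          # N distance computations -> O(N)
    nearest = np.argsort(dists)[:k]                       # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 0.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.1, 0.2])))   # -> 0
```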

15
Q
  1. Which of the following is more appropriate to do feature selection?

a) Ridge

b) Lasso

c) both (a) and (b)

d) neither (a) nor (b)

A

Answer: (b) lasso

For feature selection, we would prefer to use lasso since solving the optimization problem when using lasso will cause some of the coefficients to be exactly zero (depending of course on the data) whereas with ridge regression, the magnitude of the coefficients will be reduced, but won’t go down to zero.

Ridge and Lasso

Ridge and Lasso are types of regularization techniques. They are simple techniques to reduce model complexity and prevent the over-fitting that may result from simple linear regression.
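
A small sketch of this difference (assuming scikit-learn and NumPy; the synthetic data is hypothetical): Lasso can drive some coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features are informative; the rest are noise.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)   # some coefficients come out exactly 0
print(Ridge(alpha=1.0).fit(X, y).coef_)   # coefficients are shrunk but generally non-zero
```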

16
Q
  1. The number of test examples needed to get statistically significant results should be _________

a) Larger if the error rate is larger.

b) Larger if the error rate is smaller.

c) Smaller if the error rate is smaller.

d) It does not matter.

A

Answer: (b) Larger if the error rate is smaller

Tests for statistical significance tell us what the probability is that the relationship we think we have found is due only to random chance. They tell us what the probability is that we would be making an error if we assume that we have found that a relationship exists.

Statistical significance is a way of mathematically proving that a certain statistic is reliable. When you make decisions based on the results of experiments that you’re running, you will want to make sure that a relationship actually exists.

Your statistical significance level reflects your risk tolerance and confidence level. For example, if you run an A/B testing experiment with a significance level of 95%, this means that if you determine a winner, you can be 95% confident that the observed results are real and not an error caused by randomness. It also means that there is a 5% chance that you could be wrong.

17
Q
  1. Neural networks:

a) Optimize a convex objective function

b) Can only be trained with stochastic gradient descent

c) Can use a mix of different activation functions

d) None of the above

A

Answer: (c) Can use a mix of different activation functions

Neural networks can use a mix of different activation functions like sigmoid, tanh, and ReLU functions.

Activation function

In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each neuron has a weight, and multiplying the input number with the weight gives the output of the neuron, which is transferred to the next layer. The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.

[Source: Role of the Activation Function in a Neural Network Model]

18
Q
  1. Which one of the following is the main reason for pruning a Decision Tree?
    a) To save computing time during testing
    b) To save space for storing the Decision Tree
    c) To make the training set error smaller
    d) To avoid overfitting the training set
A

Answer: (d) to avoid overfitting the training set
The reason for pruning is that the trees prepared by the base algorithm can be prone to overfitting as they become incredibly large and complex.
Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting. [Wikipedia]

19
Q
  1. Which of the following methods can achieve zero training error on any linearly separable dataset?

a) Decision tree

b) 15-nearest neighbors

c) Perceptron

d) Logistic regression

A

Answer: (a) Decision tree and (c) Perceptron

Decision tree – Standard decision trees have essentially no learning bias; the training set error is always zero in decision trees if there is no label noise.

Perceptron - Since the data set is linearly separable, any subset of the data is also linearly separable. Thus, the perceptron is guaranteed to converge to a perfect solution on the training set. This may not be always true for testing dataset.

20
Q
  1. Consider a point that is correctly classified and distant from the decision boundary. Which of the following methods will be unaffected by this point?

a) Nearest neighbor

b) SVM

c) Logistic regression

d) Linear regression

A

Answer: (b) SVM

The hinge loss used by SVMs gives zero weight to these points, so they are unaffected by this point, whereas the log-loss used by logistic regression still gives a little bit of weight to these points.

21
Q
  1. Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting?

a) Increase the amount of training data.

b) Improve the optimization algorithm being used for error minimization.

c) Decrease the model complexity.

d) Reduce the noise in the training data.

A

Answer: (b) Improve the optimization algorithm being used for error minimization.

Increasing the amount of training data would help in reducing the overfitting problem.

Increased complexity of the underlying model may increase the overfitting problem. Decreasing the complexity may help in reducing the overfitting problem.

Noise in the training data can increase the possibility for overfitting. Noise reduction can help in reducing the overfitting.

22
Q
  1. The error function most suited for gradient descent using logistic regression is

a) The entropy function.

b) The squared error.

c) The cross-entropy function.

d) The number of mistakes.

A

Answer: (c) The cross-entropy function

For logistic regression, the cross-entropy function (loss function or cost function) is convex. A convex function has just one minimum; there are no local minima to get stuck in, so gradient descent starting from any point is guaranteed to find the minimum.

Since the cross-entropy cost function is convex, a variety of local optimization schemes can more easily be used to properly minimize it. For this reason, the cross-entropy cost is used more often in practice for logistic regression than the logistic least-squares cost.

The cost function returns a value representing how well your model performs; it essentially gives you the amount of error.

To find the optimal model that has the minimum error (cost), we use gradient descent.
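
For reference, a minimal sketch of the binary cross-entropy loss for a single example (plain Python; the probabilities below are made up):

```python
import math

def binary_cross_entropy(y, p):
    """Cross-entropy loss for one example: y is the true label (0 or 1),
    p is the predicted probability of class 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_cross_entropy(1, 0.9))   # small loss: confident and correct
print(binary_cross_entropy(1, 0.1))   # large loss: confident and wrong
```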

23
Q
  1. You are given a labeled binary classification data set with N data points and D features. Suppose that N < D. In training an SVM on this data set, which of the following kernels is likely to be most appropriate?

a) Linear kernel

b) Quadratic kernel

c) Higher-order polynomial kernel

d) RBF kernel

A

Answer: (a) Linear kernel

The linear kernel is used when the data is linearly separable, that is, when it can be separated using a single line. It is one of the most common kernels, and it is mostly used when there are a large number of features in a particular data set.

When the number of examples is small in comparison to the number of features, you would not have enough data to fit a non-linear SVM (i.e., an SVM with a non-linear kernel). An SVM with a linear kernel (or without a kernel) is one way to go.

24
Q
  1. You are increasing the size of the layers (more hidden units per layer) in your neural network. What kind of impact it will have on bias and variance?

a) increases, increases

b) increases, decreases

c) decreases, increases

d) decreases, decreases.

A

Answer: (c) decreases, increases

Increasing the size of layers will result in decreasing bias and increasing variance.

Increasing the size of the layers results in increased complexity. High variance means the model performs great on the training data but poorly on the test data. Low bias means the model is fitting the training data well.

25
Q
  1. What is the biggest weakness of decision trees compared to logistic regression classifiers?
    a) Decision trees are more likely to overfit the data
    b) Decision trees are more likely to underfit the data

c) Decision trees do not assume independence of the input features

d) None of the mentioned

A

a) Decision trees are more likely to overfit the data

Decision trees are more likely to overfit the data since they can split on many different combinations of features, whereas in logistic regression we associate only one parameter with each feature.

26
Q
  1. Which of the following classifiers can generate linear decision boundary?
    a) Linear SVM

b) Random forest

c) Logistic regression

d) k-NN

A

Answer: (a) Linear SVM and (c) Logistic regression

Linear SVM and logistic regression are linear classifiers. Random forest and k-NN are non-linear classifiers; they do not produce linear decision boundaries.

27
Q
  1. If we increase the k value in k-nearest neighbor, the model will _____ the bias and ______ the variance.

a) Decrease, Decrease

b) Increase, Decrease

c) Decrease, Increase

d) Increase, Increase

A

Answer: (b) Increase, Decrease

When K increases to a large value, the model becomes very simple: all test data points will belong to the same class (the majority class). This is under-fitting, that is, high bias and low variance.

Bias-Variance tradeoff
The bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs. In other words, model with high bias pays very little attention to the training data and oversimplifies the model.

The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs. In other words, model with high variance pays a lot of attention to training data and does not generalize on the data which it hasn’t seen before. [Source: Refer here]

28
Q
  1. For a large k value the k-nearest neighbor model becomes _____ and ______ .

a) Complex model, Overfit

b) Complex model, Underfit

c) Simple model, Underfit

d) Simple model, Overfit

A

(c) Simple model, Underfit

When K increases towards N, the model becomes the simplest possible: all test data points will belong to the same class (the majority class). This is under-fitting, that is, high bias and low variance.

kNN classification is an averaging operation. To come to a decision, the labels of the K nearest neighbour samples are averaged. The standard deviation (or the variance) of the output of averaging decreases as the number of samples increases. In the case K == N (you select K as large as the size of the dataset), the variance becomes zero.

Underfitting means the model does not fit, in other words does not predict, the (training) data very well.
Overfitting means that the model predicts the (training) data too well. It is too good to be true. If a new data point comes in, the prediction may be wrong.

29
Q
  1. When we have a real-valued input attribute during decision-tree learning, what would be the impact of a multi-way split with one branch for each of the distinct values of the attribute?

a) It is too computationally expensive.

b) It would probably result in a decision tree that scores badly on the training set and a test set.

c) It would probably result in a decision tree that scores well on the training set but badly on a test set.

d) It would probably result in a decision tree that scores well on a test set but badly on a training set.

A

(c) It would probably result in a decision tree that scores well on the training set but badly on a test set

It is usual to make only binary splits because multiway splits break the data into small subsets too quickly. This causes a bias towards splitting predictors with many classes since they are more likely to produce relatively pure child nodes, which results in overfitting. [For more, refer here]

30
Q
  1. The VC dimension of a Perceptron is _____ the VC dimension of a simple linear SVM.

a) Larger than

b) Smaller than

c) Same as

d) Not at all related

A

(c) Same as

Both Perceptron and linear SVM are linear discriminators (i.e. a line in 2D space or a plane in 3D space.), so they should have the same VC dimension.
VC dimension

The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a space of functions that can be learned by a statistical binary classification algorithm. It is defined as the cardinality of the largest set of points that the algorithm can shatter. [Wikipedia]

31
Q
  1. A measure of goodness of fit for the estimated regression equation is the
    (a) Multiple coefficient of determination

(b) Mean square due to error

(c) Mean square due to regression
(d) All of the above

A

(c) Mean square due to regression (MSR)

Mean square due to regression or regression mean square (MSR) is obtained by dividing the regression sum of squares by its degree of freedom. The regression sum of squares (SSR) and the regression mean square (MSR) are always identical for the simple linear regression model.

32
Q
  1. A regression model in which more than one independent variable is used to predict the dependent variable is called
    (a) simple linear regression model

(b) multiple regression model

(c) independent model

(d) none of the above

A

(b) Multiple regression model

Regressions based on more than one independent variable are called multiple regressions. Multiple linear regression is an extension of simple linear regression. Here, a dependent variable is modeled as a function of several independent variables with corresponding coefficients, along with the constant term. Multiple regression requires a minimum of two or more predictor variables, and this is why it is called multiple regression.

Multiple regression will be good at explaining the relationship of the independent variables to the dependent variables if those relationships are linear.

33
Q
  1. The average positive difference between computed and desired outcome values is ______ .

(a) Root mean squared error

(b) Mean squared error

(c) Mean absolute error

(d) Mean positive error

A

(c) Mean absolute error

Absolute Error is the amount of error in your measurements. It is the difference between the measured value and “true” value. Mean absolute error is the average of all absolute errors.

Mean Absolute Error (MAE): MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight. [For more, please refer here]

34
Q
  1. Which of the following cross validation versions may not be suitable for very large datasets with hundreds of thousands of samples?

a) k-fold cross-validation

b) Leave-one-out cross-validation

c) Holdout method

d) All of the above

A

(b) Leave-one-out cross-validation

Leave-one-out cross-validation (LOO cross-validation) is not suitable for very large datasets due to the fact that this validation technique requires one model for every sample in the training set to be created and evaluated.
Cross validation

It is a technique to evaluate a machine learning model, and it is the basis for a whole class of model evaluation methods. The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating it. It works by splitting the dataset into a number of subsets, keeping a subset aside, training the model, and testing the model on the held-out subset.

Leave-one-out cross validation

Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross validation is very expensive to compute at first pass. [For more information on other cross-validation techniques you may refer here]
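
A tiny sketch (assuming scikit-learn) of why leave-one-out scales poorly: the number of train/test splits, and hence the number of models to fit, equals the number of samples.

```python
from sklearn.model_selection import LeaveOneOut

X = [[1], [2], [3], [4], [5]]
# One split (and therefore one model to train) per sample in the dataset.
print(LeaveOneOut().get_n_splits(X))   # -> 5
```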

35
Q
  1. Which of the following cross-validation versions is a suitable, quicker choice for very large datasets with hundreds of thousands of samples?
    a) k-fold cross-validation

b) Leave-one-out cross-validation

c) Holdout method

d) All of the above

A

(c) Holdout method

The holdout cross-validation method is suitable for very large datasets because it is the simplest and quickest-to-compute version of cross-validation.
What is cross-validation? Refer to the answer to the previous card.

Holdout method

In this method, the dataset is divided into two sets namely the training and the test set with the basic property that the training set is bigger than the test set. Later, the model is trained on the training dataset and evaluated using the test dataset.

36
Q
  1. Which of the following is a disadvantage of k-fold cross-validation method?

a) The variance of the resulting estimate is reduced as k is increased.

b) This usually does not take a longer time to compute

c) Reduced bias

d) The training algorithm has to rerun from scratch k times

A

Answer: (d) The training algorithm has to rerun from scratch k times

In k-fold cross-validation, the dataset is divided into k subsets. As in the holdout method, these subsets are divided into training and test sets as follows:

a) One of the subsets is chosen as the test set, and the other subsets put together form the training set.

b) Train a model on the training set and test it using the test set.

c) Keep the score to calculate the average error.

d) Repeat (a) to (c) with each individual subset as the test set.

Here, as the training set changes in every cycle, the training algorithm has to rerun from scratch k times (see the sketch below). Hence, it takes k times as much computation to make an evaluation.
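
A minimal sketch of that loop (assuming scikit-learn and NumPy; the synthetic data and the choice of logistic regression are only illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])   # retrained from scratch each fold
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))   # average score over the k folds
```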

37
Q

When using stacking, we are (choose the correct option):

Select one:
a. averaging the predictions of the classifiers in the meta learner

b. unable to solve regression due to the … of classifiers

c. using a meta learner over our base models’ predictions

d. creating a sequence of classifiers, giving higher influence to more accurate classifiers

e. I don’t want to answer this question

A

c. using a meta learner over our base models’ predictions

38
Q

You finished training a Random Forest, and you are getting abnormally bad performance on your validation set, but good performance on your training set. What might be the problem?

Select one:

a. You have too few trees in your ensemble

b. You should use all features when you choose a split
c. The learning rate should be decreased

d. Your boosting implementation should use deeper trees

A

a. You have too few trees in your ensemble

39
Q

The hold-out method:

Select one:

a. uses all the instances from the original dataset as training and testing

b. I don’t want to answer this question

c. splits data into k subsets of equal size and then each subset is used for testing and the remainder for training

d. splits the data into training data and test data and then builds a classifier using the train data and tests it using the test data

e. samples a dataset of n instances n times with replacement to form a new dataset of n instances (training set)

A

d. splits the data into training data and test data and then builds a classifier using the train data and tests it using the test data

40
Q

Which of the following statements is more accurate about random forest?

Select one:
a. All of the trees in the ensemble are independent of each other and it uses bootstrap aggregation (bagging)

b. I don’t want to answer this question

c. A random forest improves its performance by …ing the results of strong learners.

d. It uses bootstrap aggregation (bagging)


e. It uses stacking and random selection of features

A

a. All of the trees in the ensemble are independent of each other and it uses bootstrap aggregation (bagging)

41
Q

In Linear Regression, what does beta1 represent?

Select one:

a. I don’t want to answer this question

b. The estimated change in average Y per unit change in X1.
c. The variation around the line of regression.

d. The predicted value of Y1.


e. The predicted value of Y when X1 = 0.

A

b. The estimated change in average Y per unit change in X1.

42
Q

The confidence in the performance of a classifier increases with

Select one:

a. decreasing the test dataset size

b. decreasing the training dataset size
c. I don’t want to answer this question
d. increasing the test dataset size


e. increasing the training dataset size

A

e. increasing the training dataset size

43
Q

The repeated hold-out method:

Select one:
a. All options are incorrect
b. splits the data into training data and test data and then builds a classifier using the train data and tests it using the test data
c. splits data into k subsets of equal size and then each subset is used for testing and the remainder for training
d. I don’t want to answer this question

e. samples a dataset of n instances n times with replacement to form a new dataset of n instances (training set); the instances from the original dataset that don’t occur in the training set are used in testing

A

a. All options are incorrect

44
Q

Support Vector machines can be classified in the following machine learning sub-area:
Select one:

a. Evolutionaries

b. Statistical

c. Analogizers

d. | don’t want to answer this question


e. Bayesian

A

c. Analogizers

45
Q

Regarding k-NN, the most accurate sentence is:

Select one:
a. it performs better if features have the same scale

b. it is an eager learner and it requires a value for k


c. I don’t want to answer this question

d. it is considered an eager learner and it performs better if features are in the same scale

e. it performs better if features are in the same scale and it requires a value for k

A

e. it performs better if features are in the same scale and it requires a value for k

46
Q

For k-NN classifiers, which of the following is true?

Select one:
a. The decision boundary is smoother with larger values of k

b. The smoothness of the decision boundary doesn’t depend on the value of k

c. The decision boundary is smoother with smaller values of k

d. The classification accuracy is better with larger values of k

e. I don’t want to answer this question

A

a. The decision boundary is smoother with larger values of k

47
Q

A neuron with 4 inputs has a weight vector w = [1, 2, 1, 2] and a bias b = 1. The activation function is given by f(net) = sqrt(net). If the input vector is x = [2, 1, 2, 1], then the output of the neuron will be:

Select one:

a. 3

b. 20

c. I don’t want to answer this question

d. 10

e. 9

A

a. 3
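
A quick check of that arithmetic (plain Python; not part of the original card):

```python
import math

w = [1, 2, 1, 2]   # weight vector
x = [2, 1, 2, 1]   # input vector
b = 1              # bias

net = sum(wi * xi for wi, xi in zip(w, x)) + b   # 2 + 2 + 2 + 2 + 1 = 9
print(math.sqrt(net))                            # f(net) = sqrt(9) = 3.0
```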

48
Q

Usually, when using the MAE as a splitting criteria, in a regression tree:

Select one:

a. The predicted value in a terminal node is equal to the median value of all the samples included in that node
b. We are not able to apply models with more than two predictors.
c. | don’t want to answer this question

d. The predicted value in a terminal node is influenced by the weights associated to pre-defined classes according to the region it belongs to

e. The predicted value in a terminal node is equal to the mean value of all the samples included in that node

A

a. The predicted value in a terminal node is equal to the median value of all the samples included in that node

49
Q

In a neural network, the activation functions:
Select one:
a. are needed to introduce nonlinearity into the network

b. I don’t want to answer this question

c. are not essential in non-linear problems

d. are used to simplify the results

e. are used to avoid overfitting

A

a. are needed to introduce nonlinearity into the network

50
Q

Nearest-Neighbor classifiers require:

Select one:
a. The number of nearest neighbors to retrieve (k)

b. The set of stored records

c. All options are correct

d. I don’t want to answer this question

e. Distance metric to compute distance between records

A

c. All options are correct

51
Q

Lasso can be interpreted as least-squares linear regression where:

Select one:
a. I don’t want to answer this question
b. weights can be regularized to zero
c. the weights will tend to be higher compared to Least Squares
d. weights are regularized with the L2 norm (weights can be close to zero)

e. the cost function considers the square of the coefficients

A

b. weights can be regularized to zero

52
Q

In a Linear Regression, what does beta0 represent?

Select one:
a. The predicted value of Y.
b. The estimated change in average Y per unit change in X.
c. The variation around the line of regression.
d. The predicted value of Y when X = 0.

e. | don’t want to answer this question

A

d. The predicted value of Y when X = 0.

53
Q

In general, a decision tree:

Select one:

a. Is considered a parametric method
b. can be used as a feature selection technique
c. I don’t want to answer this question

d. Is sensitive to scale factors
e. Is suffering from underfitting when all leaves are pure

A

b. can be used as a feature selection technique

54
Q

In Decision Trees, the information Gain related to an independent variable V1 is higher:

Select one:

a. When the entropy of V1 is higher
b. When the entropy of V1 is lower

c. When V1 has only 2 distinct values

d. When V1 has more than 2 distinct values

A

b. When the entropy of V1 is lower

55
Q

When using the information gain as the attribute selection measure in decision trees:

Select one:

a. we select the attribute with the highest delta gini

b. we select the attribute with the highest information gain
c. we select the attribute with near zero delta gini

d. | don’t want to answer this question

e. we select the attribute with near zero information gain

A

b. we select the attribute with the highest information gain

56
Q

The most widely used metric for modelling/predicting the sales value in a company is:

Select one:

a. Area under the ROC curve

b. Precision

c. | don’t want to answer this question

d. Accuracy

e. RMSE

A

e. RMSE

57
Q

Wrapper methods:

Select one:

a. Use statistics to evaluate the relationship between each input variable and the target variable
b. Evaluate and compare different feature combinations
c. | don’t want to answer this question

d. Introduce additional constraints into the optimization of a predictive algorithm


e. Learn the features that better contribute to the accuracy of the model while the model is being created

A

b. Evaluate and compare different feature combinations

58
Q

Which of the following is a widely used and effective machine learning algorithm based on the idea of bootstrap aggregating?

Select one:

a. Decision Tree

b. AdaBoost

c. Support Vector Machine

d. I don’t want to answer this question
e. Random Forest

A

e. Random Forest

59
Q

We should apply RMSE instead of MAE:

Select one:

a. When we want to consider that all individual differences have equal weight
b. When all the errors have the same magnitude

c. When large errors are particularly undesirable and we want to penalize those

d. | don’t want to answer this question


e. None of the options is correct

A

c. When large errors are particularly undesirable and we want to penalize those

60
Q

Back propagation is a learning technique that adjusts the weights in the neural network by propagating the weight changes:

Select one:


a. | don’t want to answer this question
b. Backward from output to hidden layer
c. Backward from output to input layer

d. Forward from input to hidden layer


e. Forward from input to output layer

A

c. Backward from output to input layer

61
Q
  1. Ridge and Lasso regression are simple techniques to ________ the complexity of the model and prevent over-fitting which may result from simple linear regression.
    a) Increase

b) Decrease

c) Eliminate

d) None of the above

A

Answer: (b) Decrease

Both techniques are used to reduce the complexity of the model.

The Ridge and Lasso regression techniques aim to lower the sizes of the coefficients to avoid over-fitting.

Ridge regression shrinks the regression coefficients that have little contribution to the outcome. This takes the little contributing coefficients close to zero. Whereas, Lasso regression forces the little contributing coefficients to be zero (exactly).

Linear regression: minimize (sum of squared errors)

Ridge regression: minimize (sum of squared errors + alpha * slope^2)

Lasso regression: minimize (sum of squared errors + alpha * |slope|)

62
Q
  1. How does the bias-variance decomposition of a ridge regression estimator compare with that of ordinary least squares regression?
    a) Ridge has larger bias, larger variance

b) Ridge has larger bias, smaller variance

c) Ridge has smaller bias, larger variance
d) Ridge has smaller bias, smaller variance

A

(b) Ridge has larger bias, smaller variance

Ridge regression’s advantage over ordinary least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias

63
Q

  1. When compared with Lasso regression, Ridge regression works well in cases where:

a) if we have more features

b) if we have less features

c) if features have high correlation

d) if features have low correlation

A

(b) if we have less features and (c) if features have high correlation

Ridge regression works better when you have fewer features or when you have features with high correlation.

It performs better in cases where there may be high multicollinearity, or high correlation between certain features. This is because it reduces variance in exchange for bias. [Please refer here for more]

64
Q
  1. The classifier’s behavior is determined by the coefficients. These coefficients are usually referred as ________.

a) Weights

b) Tasks

c) Values

d) Behaviors

A

(a) Weights

The classifier’s behavior is determined by the coefficients, wi. These are usually called weights.

65
Q
  1. Null and alternative hypotheses are statements about:

a) population parameters.

b) sample parameters.

c) sample statistics.

d) it depends - sometimes population parameters and sometimes sample statistics.

A

(a) Population parameters

The null and alternative hypotheses are two mutually exclusive statements about a population. A hypothesis test uses sample data to determine whether to reject the null hypothesis.
Null hypothesis (H0) - The null hypothesis states that a population parameter (such as the mean, the standard deviation, and so on) is equal to a hypothesized value.

Alternative Hypothesis (H1) - The alternative hypothesis states that a population parameter is smaller, greater, or different than the hypothesized value in the null hypothesis.

66
Q
  1. In hypothesis testing, a Type 2 error occurs when

a) The null hypothesis is not rejected when the null hypothesis is true.
b) The null hypothesis is rejected when the null hypothesis is true.

c) The null hypothesis is not rejected when the alternative hypothesis is true.

d) The null hypothesis is rejected when the alternative hypothesis is true.

A

(c) The null hypothesis is not rejected when the alternative hypothesis is true

Type 2 error is caused when the null hypothesis is false and we fail to reject it.

67
Q
  1. What type of penalty is used on regression weights in Ridge regression?

a) L0

b) L1

c) L2

d) None of the above

A

(c) L2

Ridge regression adds “squared magnitude” of coefficient as penalty term to the loss function. L2 regularization adds an L2 penalty, which equals the square of the magnitude of coefficients.

Ridge regression shrinks the regression coefficients, so that variables, with minor contribution to the outcome, have their coefficients close to zero.
The shrinkage of the coefficients is achieved by penalizing the regression model with a penalty term called L2-norm, which is the sum of the squared coefficients. L2 regularization is used to avoid overfitting of data.

When do we use L2 regularization?
L2 regularization is best used in non-sparse outputs, when no feature selection needs to be done, or if you need to predict a continuous output.

68
Q
  1. Which of the following of the coefficient is added as the penalty term to the loss function in Lasso regression?

a) Squared magnitude

b) Absolute value of magnitude

c) Number of non-zero entries

d) None of the above

A

(b) Absolute value of magnitude

Lasso regression adds “absolute value of magnitude” of coefficient as penalty term to the loss function.
Lasso regression shrinks the regression coefficients toward zero by penalizing the regression model with a penalty term called L1-norm, which is the sum of the absolute coefficients.

69
Q
  1. Which of the following is a disadvantage of non-parametric machine learning algorithms?

a) Capable of fitting a large number of functional forms (Flexibility)

b) Very fast to learn (Speed)

c) More of a risk to overfit the training data (Overfitting)

d) They do not require much training data

A

(c) More of a risk to overfit the training data

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data.

70
Q
  1. A decision tree has low training error and a large test error. What is the possible problem?

a) Decision tree is too shallow

b) Learning rate too high

c) There is too much training data

d) Decision tree is overfitting

A

(d) Decision tree is overfitting

Overfitting causes low training error. Overfitting means that the model predicts the (training) data too well. It is too good to be true. If the new data point comes in, the prediction may be wrong.
Pruning can help in reducing the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.

71
Q
  1. Suppose we have a regularized linear regression model. What is the effect of increasing λ on bias and variance?

a) Increases bias, increases variance

b) Increases bias, decreases variance

c) Decreases bias, increases variance

d) Decreases bias, decreases variance

A

b) Increases bias, decreases variance

Increasing λ increases bias and decreases variance

Regularized regression

It is a type of regression where the coefficient estimates are constrained toward zero. The magnitude (size) of the coefficients, as well as the magnitude of the error term, is penalized. Complex models are discouraged, primarily to avoid overfitting. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. [For more, refer here – regularized regression] and [refer here – regularization]

Type of regularized regression

Ridge regression (L2 regularization)

Lasso regression (L1 regularization)

72
Q
  1. What strategies can help reduce over-fitting in decision trees?

a) Pruning

b) Make sure each leaf node is one pure class

c) Enforce a maximum depth for the tree

d) Enforce a maximum number of samples in leaf nodes

A

(a) Pruning and (c) Enforce a maximum depth for the tree

Over-fitting is a significant practical difficulty for decision tree models and many other predictive models. Over-fitting happens when the learning algorithm continues to develop hypotheses that reduce training set error at the cost of an increased test set error.

Unlike other regression models, decision tree doesn’t use regularization to fight against over-fitting. Instead, it employs tree pruning. Selecting the right hyper-parameters (tree depth and leaf size) also requires experimentation, e.g. doing cross-validation with a hyper-parameter matrix.

73
Q
  1. Neural networks

a) cannot be used in ensemble

b) can be used for regression

c) can be used for classification

d) always output values will be between 0 and 1

A

(b) can be used for regression and (c) can be used for classification

Regression refers to predictive modeling problems that involve predicting a numeric value given an input.

Classification refers to predictive modeling problems that involve predicting a class label or probability of class labels for a given input.

Neural networks can be used for either regression or classification. Under a regression model, a single value is output, which may be mapped to a set of real numbers, meaning that only one output neuron is required. Under a classification model, an output neuron is required for each potential class to which the pattern may belong. If the classes are unknown, unsupervised neural network techniques such as self-organizing maps should be used.

74
Q
  1. Lasso can be interpreted as least-squares linear regression where

a) weights are regularized with the l1 norm

b) the weights have a Gaussian prior

c) weights are regularized with the l2 norm

d) the solution algorithm is simpler

A

(a) weights are regularized with the l1 norm

Regularization is a technique to deal with over-fitting problem.

Lasso regression

Lasso regression is a regularization technique. This model uses shrinkage. Shrinkage is where data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). A sparse solution can help avoid over-fitting.

Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of coefficients. This type of regularization can result in sparse models with few coefficients; Some coefficients can become zero and eliminated from the model.

Why l1 norm?

With L1 regularization, you essentially make the weight vector sparse: most of its components become zeros, while the remaining non-zero components are the most “useful” ones.

75
Q
  1. In Lasso regression, if the tuning parameter (lambda) increases ______ increases.

a) Variance

b) Bias

c) Both variance and bias

d) Neither variance nor bias

A

b) Bias

The regularization (tuning or penalty) parameter (lambda) is an input to your model. Lambda is the tuning parameter that controls the bias-variance tradeoff and we estimate its best value via cross-validation. The regularization parameter reduces over-fitting, which reduces the variance of your estimated regression parameters; however, it does this at the expense of adding bias to your estimate. Increasing lambda results in less over-fitting but also greater bias.

Large values of lambda pull weight parameters to zero leading to large bias. It leads to under-fitting.

76
Q
  1. In Lasso regression, if the tuning parameter (lambda) decreases ______ increases.

a) Variance

b) Bias

c) Both variance and bias

d) Neither variance nor bias

A

(a) Variance

Small values of λ allow model to become finely tuned to noise leading to large variance. It leads to over-fitting.

77
Q
  1. “Less important parameters goes close to zero when we increase the value of tuning parameters” in which of the following regressions?

a) Ridge

b) Lasso

c) Both ridge and lasso

A

(b) Lasso

With Lasso, when we increase the value of lambda, the most important parameters shrink a little bit and the less important parameters go close to zero. So, Lasso is able to exclude unimportant parameters from the model.

78
Q
  1. Which of the following is true about generative models?

a) They capture the joint probability

b) The perceptron is a generative model

c) Generative models can be used for classification

d) They capture the conditional probability

A

(a) They capture the joint probability and (c) Generative models can be used for classification

Generative models are useful for unsupervised learning tasks. A generative model learns parameters by maximizing the joint probability P(X,Y). Generative models encode full probability distributions and specify how to generate data that fit such distributions. Bayesian networks are well-known examples of such models. Refer here for more information.

Generative Classifiers tries to model class, i.e., what are the features of the class. In short, it models how a particular class would generate input data. When a new observation is given to these classifiers, it tries to predict which class would have most likely generated the given observation.

79
Q
  1. Which of the following are true about subset selection?

a) Subset selection can substantially decrease the bias of support vector machines

b) Ridge regression frequently eliminates some of the features
c) Finding the true best subset takes exponential time

d) Subset selection can reduce overfitting

A

(d) Subset selection can reduce overfitting

A classifier is said to overfit to a dataset if it models the training data too closely and gives poor predictions on new data. This occurs when there is insufficient data to train the classifier and the data does not fully cover the concept being learned.

Subset selection reduces over-fitting.

Feature subset selection is the process of identifying and removing as much of the irrelevant and redundant information as possible. This reduces the dimensionality of the data and allows learning algorithms to operate faster and more effectively.

80
Q
  1. What can help to reduce overfitting in an SVM classifier?

a) High-degree polynomial features

b) Setting a very low learning rate

c) Use of slack variables

d) Normalizing the data

A

(c) Use of slack variables

The reason that SVMs tend to be resistant to over-fitting, even in cases where the number of attributes is greater than the number of observations, is that they use regularization. The key to avoiding over-fitting lies in careful tuning of the regularization parameter, C, and, in the case of non-linear SVMs, careful choice of kernel and tuning of the kernel parameters.

Without slack variables the SVM would be forced into always fitting the data exactly and would often overfit as a result.
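
A small illustrative sketch (assuming scikit-learn, where the amount of slack is governed by the C parameter; the specific values below are arbitrary):

```python
from sklearn.svm import SVC

# Smaller C tolerates more margin violations (more slack), which regularizes the model.
soft_margin = SVC(kernel="rbf", C=0.1)     # more slack, smoother boundary, less overfitting
hard_margin = SVC(kernel="rbf", C=1000.0)  # little slack, can overfit noisy data
```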

81
Q
  1. Given a kNN classifier, which one of the following statements is true?

a) The more examples are used for classifying an example, the higher accuracy we obtain

b) The more attributes we use to describe the examples the more difficult is to obtain high accuracy

c) The most costly part of this method is to learn the model

d) We can use kNN for classification only

A

(b) The more attributes we use to describe the examples the more difficult is to obtain high accuracy

kNN becomes significantly slower as the number of examples and/or independent variables increases. When the number of features increases, it requires more data. When there’s more data, it creates an overfitting problem, because no one knows which piece of noise will contribute to the model. kNN performs better with low dimensions (a low number of features). For more, you can refer here: https://neptune.ai/blog/knn-algorithm-explanation-opportunities-limitations

82
Q
  1. Decision trees can work with

a) Only numeric values

b) Only nominal values

c) Both numeric and nominal values

d) Neither numeric nor nominal values

A

(c) Both numeric and nominal values

Decision trees can handle both numerical and categorical data. Early decision trees were only capable of handling categorical variables, but more recent versions, such as C4.5 and CART, do not have this limitation. The categorical data are encoded, if required (e.g. one-hot encoding), and used by decision tree algorithms.
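
A small sketch (assuming pandas; the toy data is hypothetical) of one-hot encoding a nominal attribute before passing the data to a tree learner that expects numeric inputs:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "red"], "size": [1.2, 3.4, 2.2]})
# One-hot encode the nominal column; numeric columns are left unchanged.
print(pd.get_dummies(df, columns=["colour"]))   # adds colour_green / colour_red columns
```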

83
Q
  1. Which of the following is true about regularized linear regression model?

a) Increase in the regularization parameter (lambda) will make the model underfit the data and the validation error will go up.

b) Decrease in the regularization parameter (lambda) will make the model overfit the data and the training error will go up.

c) Increase in the regularization parameter (lambda) will make the model underfit the data and the training error will go down.

d) All of the above are true

A

(a) Increase in the regularization parameter (lambda) will make the model underfit the data and the validation error will go up.

The regularization parameter (tuning parameter) λ, used in regularization techniques, controls the impact on bias and variance. As the value of λ rises, it reduces the value of the coefficients and thus reduces the variance. Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence avoiding overfitting) without losing any important properties in the data. But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting.

84
Q
  1. Which of the following is a characteristic of decision tree?

a) High variance

b) High bias

c) Smoothness of prediction surfaces

d) Low variance

A

(a) High variance

A model has high variance if it is very sensitive to (small) changes in the training data. Decision trees are generally unstable considering that a small change in the data set can result in a very different set of splits. This results in high variance. This is mainly due to the hierarchical nature of decision trees, since a change in split points in the initial stages will affect all the subsequent splits.

85
Q
  1. Which of the following is/are true about ensemble methods?

a) Ensemble methods can take the form of using different classifiers

b) Ensemble methods are simple and cheap

c) For data from a linear process, ensemble methods perform better than linear models

d) Using same classification algorithm with different settings is an ensemble method

A

(a) Ensemble methods can take the form of using different classifiers and (d) Using same classification algorithm with different settings is an ensemble method

Ensemble methods can take the form of using different algorithms, using the same algorithm with different settings, or assigning different parts of the dataset to different classifiers.

Ensemble methods - The learning algorithms which construct a set of classifiers and then classify new data points by taking a vote of their predictions are known as ensemble methods. Random forest is an ensemble model where a number of decision trees is used to predict the output.

86
Q
  1. Which of the following is not an example of ensemble method?

a) AdaBoost

b) Decision tree

c) Random Forest

d) Bootstrapping

A

(b) Decision tree

Decision tree is not an ensemble method. It is a single tree used for classification.

Random forest is an ensemble model where we use multiple decision trees to predict outcomes.

AdaBoost is a statistical classification meta-algorithm. It is called Adaptive Boosting as the weights are re-assigned to each instance, with higher weights assigned to incorrectly classified instances.

Bootstrapping generates multiple bootstrap training sets from the original training set and uses each of them to generate a classifier for inclusion in the ensemble.

87
Q
  1. Which of the following is an example of sequential ensemble model?

a) AdaBoost

b) Bootstrapping

c) Random forest

d) All of the above

A

(a) AdaBoost

AdaBoost is an example of sequential ensemble model.

Boosting is an ensemble technique that learns from previous predictors’ mistakes to make better predictions in the future. The technique combines several weak base learners to form one strong learner, thus significantly improving the predictability of models. Boosting works by arranging weak learners in a sequence, such that each learner learns from the mistakes of the previous learner in the sequence, to create better predictive models.

What is sequential ensemble?

Sequential ensemble: base learners are generated sequentially. The basic motivation of sequential methods is to exploit the dependence between the base learners. Overall performance may be improved by weighing previously mislabeled examples with higher weight.

88
Q
  1. Which of the following ensemble model helps in reducing variance?

a) Boosting

b) Bagging

c) Stacking

d) Voting

A

(b) Bagging

Bagging (also called Bootstrap Aggregation) is an ensemble method which is the application of the bootstrap procedure to a high-variance ML algorithm. Averaging reduces variance. Bagging uses the bootstrap to generate L training sets, trains L base learners using an unstable learning procedure, and then, during testing, takes an average.
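
A minimal bagging sketch (assuming scikit-learn and NumPy; the sine-wave data and the number of bootstrap rounds are arbitrary): train trees on bootstrap resamples and average their predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

predictions = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))           # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])   # unstable, high-variance base learner
    predictions.append(tree.predict(X))

bagged = np.mean(predictions, axis=0)                    # averaging the ensemble reduces variance
```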

What is an ensemble model in machine learning?

An ensemble method is a technique which uses multiple independent similar or different models/weak learners to derive an output or make some predictions.

An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.

89
Q
  1. Which of the following helps in avoiding overfitting in decision trees?

a) Adding more irrelevant attributes

b) Generating a tree with fewer branches

c) Generating a complete tree then getting rid of some branches

d) All of the above

A

(b) Generating a tree with fewer branches and (c) Generating a complete tree then getting rid of some branches

Two approaches to avoiding overfitting are distinguished: pre-pruning (generating a tree with fewer branches than would otherwise be the case) and post-pruning (generating a tree in full and then removing parts of it). Results are given for pre-pruning using either a size or a maximum depth cutoff. A method of post-pruning a decision tree based on comparing the static and backed-up estimated error rates at each node is also described.

We need to remove irrelevant attributes.
