ML Flashcards
Não chumbar (don't flunk)
- High entropy means that the partitions in classification are
a) pure
b) not pure
c) useful
d) useless
(b) Not pure
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
It is a measure of disorder, impurity, unpredictability, or uncertainty.
Low entropy means the partition is purer and less uncertain; high entropy means it is more mixed and more uncertain.
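A minimal sketch of how entropy quantifies (im)purity, assuming NumPy is available (the helper name `entropy` is just illustrative):

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy (in bits) of a class distribution."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([10, 0]))   # pure partition     -> 0.0 (low entropy)
print(entropy([5, 5]))    # maximally mixed    -> 1.0 (high entropy)
```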
- Which of the following is NOT supervised learning?
a) PCA
b) Decision Tree
c) Linear Regression
d) Naive Bayesian
a) PCA
- Which of the following statements about Naive Bayes is incorrect?
a) Attributes are equally important.
b) Attributes are statistically dependent on one another given the class value.
c) Attributes are statistically independent of one another given the class value.
d) Attributes can be nominal or numeric
b) Attributes are statistically dependent on one another given the class value
The Naive Bayes assumption is the opposite: attributes are statistically independent of one another given the class value.
Naïve Bayes
Naïve Bayes classifier assumes conditional independence between attributes and assigns the MAP class to new instances.
Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.
It is called naive Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of all attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the calculation becomes P(d1|h) * P(d2|h) and so on.
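A hedged sketch of this with scikit-learn's CategoricalNB (the toy data and integer encodings below are made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy data: two categorical attributes encoded as integers, y is the class label.
X = np.array([[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1], [0, 0], [2, 0]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

clf = CategoricalNB(alpha=1.0)        # Laplace smoothing
clf.fit(X, y)

# Internally the classifier combines P(h) * P(d1|h) * P(d2|h) and
# predicts the MAP class for a new instance.
print(clf.predict([[1, 0]]))
print(clf.predict_proba([[1, 0]]))
```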
- A and B are two events. If P(A, B) decreases while P(A) increases, which of the following is true?
a) P(A|B) decreases
b) P(B|A) decreases
c) P(B) decreases
d) All of above
(b) P(B|A) decreases
The conditional probability equation for a joint probability distribution is:
P(A, B) = P(A|B)P(B) = P(B|A)P(A).
Let us take the second form:
P(A, B) = P(B|A)P(A).
In this equation, if P(A) increases, then P(A, B) can only decrease if P(B|A) decreases.
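A quick numeric check of this reasoning, using made-up probabilities:

```python
# P(A, B) = P(B|A) * P(A), so P(B|A) = P(A, B) / P(A).
p_A, p_AB = 0.4, 0.2
print(p_AB / p_A)      # P(B|A) = 0.5

p_A, p_AB = 0.5, 0.1   # P(A) increased while P(A, B) decreased...
print(p_AB / p_A)      # ...so P(B|A) = 0.2 has decreased
```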
- In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. This suggests that
a) This feature has a strong effect on the model (should be retained)
b) This feature does not have a strong effect on the model (should be ignored)
c) It is not possible to comment on the importance of this feature without additional information
d) Nothing can be determined.
(c) It is not possible to comment on the importance of this feature without additional information
A high magnitude suggests that the feature is important. However, it may be the case that another feature is highly correlated with this feature and its coefficient also has a high magnitude with the opposite sign, in effect cancelling out the effect of the former. Thus, we cannot really remark on the importance of a feature just because its coefficient has a relatively large magnitude.
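A small sketch of this effect with scikit-learn, assuming two nearly duplicated features (all names and numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # x2 is almost a copy of x1
y = 3 * x1 + rng.normal(scale=0.1, size=n)  # only the shared signal matters

model = LinearRegression().fit(np.column_stack([x1, x2]), y)

# The two coefficients can take large opposite-sign values that largely
# cancel; only their sum (roughly 3) is well determined, so a single
# coefficient's magnitude says little about that feature's importance.
print(model.coef_, model.coef_.sum())
```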
- As the number of training examples goes to infinity, your model trained on that data will have:
a) Lower variance
b) Higher variance
c) Same variance
d) None of the above
Answer: (a) Lower variance
With more training examples you will have lower test error (the variance of the model decreases, meaning we overfit less).
Refer here for more details: In Machine Learning, What is Better: More Data or better Algorithms
High variance – a model that represents the training set well, but is at risk of overfitting to noisy or unrepresentative training data.
High bias – a simpler model that doesn’t tend to overfit, but may underfit training data, failing to capture important regularities.
- Which of the following is/are true regarding an SVM?
a) For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line.
b) In theory, a Gaussian kernel SVM cannot model any complex separating hyperplane.
c) For every kernel function used in a SVM, one can obtain an equivalent closed form basis expansion.
d) Overfitting in an SVM is not a function of number of support vectors.
a) For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line
SVM or Support Vector Machine is a linear model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical problems. The algorithm creates a line or a hyperplane which separates the data into classes.
A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts.
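A minimal sketch with scikit-learn showing that, for 2-D inputs, a linear SVM's hyperplane is just a straight line (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(50, 2)),   # class 0 cluster
               rng.normal(loc=2, size=(50, 2))])   # class 1 cluster
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)

# The learnt hyperplane w.x + b = 0 reduces to a line in 2-D.
w, b = clf.coef_[0], clf.intercept_[0]
print(f"decision boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```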
- Which of the following guidelines is applicable to initialization of the weight vector in a fully connected neural network.
a) Should not set it to zero since otherwise it will cause overfitting
b) Should not set it to zero since otherwise (stochastic) gradient descent will explore a very small space
c) Should set it to zero since otherwise it causes a bias
d) Should set it to zero in order to preserve symmetry across all neurons
(b) should not set it to zero since otherwise gradient descent will explore a very small space
If we initialize all the weights to zero, the neural network will train, but all the neurons in a layer will learn the same features during training, so the network behaves as if it had far fewer units. When all weights are set to the same value (e.g., 0), the derivative with respect to the loss function is the same for every w in a layer's weight matrix; thus, all the weights have the same values in the subsequent iteration. Hence, they must be initialized to small random numbers.
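A small NumPy sketch of the symmetry problem; strictly zero weights would also make the first gradients vanish here, so a constant (symmetric) initialisation is used to make the identical-gradient effect visible (all shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))            # 8 samples, 3 inputs
y = rng.normal(size=(8, 1))

# Symmetric initialisation: every hidden unit starts out identical.
W1 = np.full((3, 5), 0.1)
W2 = np.full((5, 1), 0.1)

h = np.tanh(x @ W1)                    # all 5 hidden columns are identical
err = h @ W2 - y                       # squared-error residual
grad_W1 = x.T @ ((err @ W2.T) * (1 - h ** 2))

# Every hidden unit receives exactly the same gradient, so the units stay
# copies of one another after any number of updates.
print(np.allclose(grad_W1, grad_W1[:, [0]]))   # True
```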
- For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model):
a) The number of hidden nodes
b) The learning rate
c) The initial choice of weights
d) The use of a constant-term unit input
(a) The number of hidden nodes
The number of hidden nodes. Zero hidden nodes results in a linear model, while many hidden nodes (with non-linear activations) significantly increase the variance of the model. A feed-forward neural network without hidden nodes can only find linear decision boundaries.
- You’ve just finished training a decision tree for spam classification, and it is getting abnormally bad performance on both your training and test sets. You know that your implementation has no bugs, so what could be causing the problem?
a) Your decision trees are too shallow.
b) You need to increase the learning rate.
c) You are overfitting.
d) None of the above.
(a) your decision trees are too shallow
Shallow decision trees - trees that are too shallow might lead to overly simple models that can’t fit the data.
A model that is underfit will have high training and high testing error. Hence, bad performance on both the training and test sets indicates underfitting, which means the hypothesis space is not complex enough (the decision trees are too shallow) to include the true but unknown prediction function.
The shallower the tree, the less variance we have in our predictions; however, at some point we start to inject too much bias, as shallow trees (e.g., stumps) are not able to capture interactions and complex patterns in our data.
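A sketch of this with scikit-learn decision trees of different depths (the dataset and depths are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):             # a stump, a shallow tree, a full tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# The depth-1 stump tends to score badly on BOTH sets (underfitting),
# while deeper trees fit the training data much better.
```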
- ___________ refers to a model that can neither model the training data nor generalize to new data.
a) good fitting
b) overfitting
c) underfitting
d) all of the above
c) underfitting
- Which among the following prevents overfitting when we perform bagging?
a) The use of sampling with replacement as the sampling technique
b) The use of weak classifiers
c) The use of classification algorithms which are not prone to overfitting
d) The practice of validation performed on every classifier trained
(b) the use of weak classifiers
The presence of over-training (which leads to overfitting) is not generally a problem with weak classifiers. For example, in decision stumps, i.e., decision trees with only one node (the root node), there is no real scope for overfitting. This helps the ensemble, which combines the outputs of the weak classifiers, to avoid overfitting.
- Averaging the output of multiple decision trees helps ________.
a) Increase bias
b) Decrease bias
c) Increase variance
d) Decrease variance
(d) decrease variance
Averaging out the predictions of multiple classifiers will drastically reduce the variance.
Averaging is not specific to decision trees; it can work with many different learning algorithms. But it works particularly well with decision trees.
Why averaging?
If two trees pick different features for the very first split at the top of the tree, then it’s quite common for the trees to be completely different. So decision trees tend to have high variance. To fix this, we can reduce the variance of decision trees by taking an average answer of a bunch of decision trees.
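A hedged sketch of averaging with scikit-learn's BaggingClassifier (the dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=100, random_state=0)

# Averaging many trees trained on bootstrap samples usually gives a
# higher and more stable cross-validated score than a single deep tree.
print(cross_val_score(single_tree, X, y, cv=5).mean())
print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```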
- If N is the number of instances in the training dataset, nearest neighbors has a classification run time of
a) O(1)
b) O(N)
c) O(log N)
d) O(N²)
(b) O(N)
Nearest neighbors needs to compute distances to each of the N training instances. Hence, the classification run time complexity is O(N).
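A brute-force sketch of this in NumPy; the helper name nn_classify is illustrative:

```python
import numpy as np

def nn_classify(x_query, X_train, y_train):
    """1-nearest-neighbour prediction by brute force."""
    # One distance per training instance -> O(N) work per query.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.random.rand(10_000, 5)
y_train = np.random.randint(0, 2, size=10_000)
print(nn_classify(np.random.rand(5), X_train, y_train))
```

Space-partitioning indexes such as KD-trees can reduce the average query cost, but the plain scan shown here is the O(N) baseline the question refers to.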
- Which of the following is more appropriate to do feature selection?
a) Ridge
b) Lasso
c) both (a) and (b)
d) neither (a) nor (b)
Answer: (b) lasso
For feature selection, we would prefer to use lasso since solving the optimization problem when using lasso will cause some of the coefficients to be exactly zero (depending of course on the data) whereas with ridge regression, the magnitude of the coefficients will be reduced, but won’t go down to zero.
Ridge and Lasso
Ridge and Lasso are types of regularization techniques. They are simple techniques to reduce model complexity and prevent the overfitting which may result from simple linear regression.
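A sketch of this difference with scikit-learn, assuming synthetic data where only two of ten features matter (all values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.round(lasso.coef_, 2))  # many coefficients exactly 0 -> feature selection
print(np.round(ridge.coef_, 2))  # coefficients shrunk but typically non-zero
```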
- The number of test examples needed to get statistically significant results should be _________
a) Larger if the error rate is larger.
b) Larger if the error rate is smaller.
c) Smaller if the error rate is smaller.
d) It does not matter.
Answer: (b) Larger if the error rate is smaller
When the error rate is small, errors are rare events, so many more test examples are needed before enough errors are observed to estimate that rate reliably. Tests for statistical significance tell us what the probability is that the relationship we think we have found is due only to random chance. They tell us what the probability is that we would be making an error if we assume that we have found that a relationship exists.
Statistical significance is a way of mathematically proving that a certain statistic is reliable. When you make decisions based on the results of experiments that you’re running, you will want to make sure that a relationship actually exists.
Your statistical significance level reflects your risk tolerance and confidence level. For example, if you run an A/B testing experiment with a significance level of 95%, this means that if you determine a winner, you can be 95% confident that the observed results are real and not an error caused by randomness. It also means that there is a 5% chance that you could be wrong.
- Neural networks:
a) Optimize a convex objective function
b) Can only be trained with stochastic gradient descent
c) Can use a mix of different activation functions
d) None of the above
Answer: (c) Can use a mix of different activation functions
Neural networks can use a mix of different activation functions, such as sigmoid, tanh, and ReLU.
Activation function
In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each neuron has a weight, and multiplying the input number with the weight gives the output of the neuron, which is transferred to the next layer. The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.
[Source: Role of the Activation Function in a Neural Network Model]
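A minimal NumPy sketch of a few common activation functions applied to the same pre-activation values (purely illustrative):

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # pre-activation values

sigmoid = 1 / (1 + np.exp(-z))   # squashes to (0, 1)
tanh    = np.tanh(z)             # squashes to (-1, 1)
relu    = np.maximum(0, z)       # passes positives, zeroes out negatives

print(sigmoid, tanh, relu, sep="\n")
```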
- Which one of the following is the main reason for pruning a Decision Tree?
a) To save computing time during testing
b) To save space for storing the Decision Tree
c) To make the training set error smaller
d) To avoid overfitting the training set
Answer: (d) to avoid overfitting the training set
The reason for pruning is that the trees prepared by the base algorithm can be prone to overfitting as they become incredibly large and complex.
Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting. [Wikipedia]
- Which of the following methods can achieve zero training error on any linearly separable dataset?
a) Decision tree
b) 15-nearest neighbors
c) Perceptron
d) Logistic regression
Answer: (a) Decision tree and (c) Perceptron
Decision tree – Standard (unpruned) decision trees have no learning bias restricting the hypotheses they can represent, so the training set error is always zero if there is no label noise.
Perceptron – Since the data set is linearly separable, any subset of the data is also linearly separable. Thus, the perceptron is guaranteed to converge to a perfect solution on the training set. This may not always be true for the test dataset.
- Consider a point that is correctly classified and distant from the decision boundary. Which of the following methods will be unaffected by this point?
a) Nearest neighbor
b) SVM
c) Logistic regression
d) Linear regression
Answer: (b) SVM
The hinge loss used by SVMs gives zero weight to such points, so the SVM is unaffected by this point, whereas the log loss used by logistic regression still gives a little bit of weight to these points.
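A small numeric sketch comparing the two losses as a function of the signed margin y*f(x); the helper names are illustrative:

```python
import numpy as np

def hinge_loss(margin):    # SVM:                 max(0, 1 - y*f(x))
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):      # logistic regression: log(1 + exp(-y*f(x)))
    return np.log1p(np.exp(-margin))

# A correctly classified point far from the boundary has a large margin.
for m in (0.5, 1.0, 3.0, 10.0):
    print(m, hinge_loss(m), log_loss(m))
# Hinge loss is exactly 0 once the margin reaches 1, so such points
# contribute nothing; log loss is small but never exactly 0.
```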
- Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting?
a) Increase the amount of training data.
b) Improve the optimization algorithm being used for error minimization.
c) Decrease the model complexity.
d) Reduce the noise in the training data.
Answer: (b) Improve the optimization algorithm being used for error minimization.
Increasing the amount of training data would help in reducing the overfitting problem.
Increased complexity of the underlying model increases the risk of overfitting, so decreasing the model complexity may help in reducing it.
Noise in the training data can increase the possibility for overfitting. Noise reduction can help in reducing the overfitting.
- The error function most suited for gradient descent using logistic regression is
a) The entropy function.
b) The squared error.
c) The cross-entropy function.
d) The number of mistakes.
Answer: (c) The cross-entropy function
For logistic regression, the cross-entropy function (loss function or cost function) is convex. A convex function has just one minimum; there are no local minima to get stuck in, so gradient descent starting from any point is guaranteed to find the minimum.
Since the cross-entropy cost function is convex, a variety of local optimization schemes can be used to minimize it properly. For this reason the cross-entropy cost is used more often in practice for logistic regression than the logistic least-squares cost.
The cost function returns a value representing how well your model performs; it is like a function that gives you the amount of error.
To find the optimal model, i.e. the one that minimizes this cost, we use gradient descent, as sketched below.
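A minimal sketch of gradient descent on the cross-entropy cost for logistic regression, in NumPy (the data, step size, and iteration count are illustrative):

```python
import numpy as np

def cross_entropy(w, X, y):
    """Binary cross-entropy cost for logistic regression (labels in {0, 1})."""
    p = 1 / (1 + np.exp(-X @ w))
    eps = 1e-12                                  # avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
for _ in range(500):                 # plain gradient descent
    w -= 0.5 * gradient(w, X, y)     # convex cost: no local minima to get stuck in
print(w, cross_entropy(w, X, y))
```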
- You are given a labeled binary classification data set with N data points and D features. Suppose that N < D. In training an SVM on this data set, which of the following kernels is likely to be most appropriate?
a) Linear kernel
b) Quadratic kernel
c) Higher-order polynomial kernel
d) RBF kernel
Answer: (a) Linear kernel
The linear kernel is used when the data is linearly separable, that is, it can be separated using a single line. It is one of the most common kernels and is mostly used when there is a large number of features in a data set.
When the number of examples is small in comparison to the number of features, you would not have enough data to fit a non-linear SVM, i.e., an SVM with a non-linear kernel. An SVM with a linear kernel (or without a kernel) is one way to go.
- You are increasing the size of the layers (more hidden units per layer) in your neural network. What kind of impact it will have on bias and variance?
a) increases, increases
b) increases, decreases
c) decreases, increases
d) decreases, decreases.
Answer: (c) decreases, increases
Increasing the size of layers will result in decreasing bias and increasing variance.
Increasing the size of the layers results in increased complexity. High variance means the model performs great on training data but poorly on test data. Low bias means the model is fitting the training data well.
- What is the biggest weakness of decision trees compared to logistic regression classifiers?
a) Decision trees are more likely to overfit the data
b) Decision trees are more likely to underfit the data
c) Decision trees do not assume independence of the input features
d) None of the mentioned
a) Decision trees are more likely to overfit the data
Decision trees are more likely to overfit the data since they can split on many different combinations of features, whereas in logistic regression we associate only one parameter with each feature.
- Which of the following classifiers can generate linear decision boundary?
a) Linear SVM
b) Random forest
c) Logistic regression
d) k-NN
Answer: (a) Linear SVM and (c) Logistic regression
Linear SVM and logistic regression are linear classifiers. Random forest and k-NN are non-linear classifiers; in general they do not produce linear decision boundaries.
- If we increase the k value in k-nearest neighbor, the model will _____ the bias and ______ the variance.
a) Decrease, Decrease
b) Increase, Decrease
c) Decrease, Increase
d) Increase, Increase
Answer: (b) Increase, Decrease
When K increases to a large value, the model becomes the simplest possible. All test data points will be assigned to the same class: the majority class. This is underfitting, that is, high bias and low variance.
Bias-Variance tradeoff
Bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs. In other words, a model with high bias pays very little attention to the training data and oversimplifies the model.
Variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs. In other words, a model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. [Source: Refer here]
- For a large k value the k-nearest neighbor model becomes _____ and ______ .
a) Complex model, Overfit
b) Complex model, Underfit
c) Simple model, Underfit
d) Simple model, Overfit
(c) Simple model, Underfit
When K grows toward N, the model is at its simplest. All test data points will belong to the same class: the majority class. This is underfitting, that is, high bias and low variance.
k-NN classification is an averaging operation: to come to a decision, the labels of the K nearest neighbour samples are averaged. The standard deviation (or the variance) of the output of averaging decreases as the number of samples increases. In the case K == N (you select K as large as the size of the dataset), the variance becomes zero.
Underfitting means the model does not fit, in other words, does not predict, the (training) data very well.
Overfitting means that the model predicts the (training) data too well. It is too good to be true. If the new data point comes in, the prediction may be wrong.
- When we have a real-valued input attribute during decision-tree learning, what would be the impact of a multi-way split with one branch for each of the distinct values of the attribute?
a) It is too computationally expensive.
b) It would probably result in a decision tree that scores badly on the training set and a test set.
c) It would probably result in a decision tree that scores well on the training set but badly on a test set.
d) It would probably result in a decision tree that scores well on a test set but badly on a training set.
(c) It would probably result in a decision tree that scores well on the training set but badly on a test set
It is usual to make only binary splits because multiway splits break the data into small subsets too quickly. This causes a bias towards splitting predictors with many classes since they are more likely to produce relatively pure child nodes, which results in overfitting. [For more, refer here]
- The VC dimension of a Perceptron is _____ the VC dimension of a simple linear SVM.
a) Larger than
b) Smaller than
c) Same as
d) Not at all related
(c) Same as
Both the Perceptron and a linear SVM are linear discriminators (i.e., a line in 2D space or a plane in 3D space), so they have the same VC dimension.
VC dimension
The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a space of functions that can be learned by a statistical binary classification algorithm. It is defined as the cardinality of the largest set of points that the algorithm can shatter. [Wikipedia]
- A measure of goodness of fit for the estimated regression equation is the
(a) Multiple coefficient of determination
(b) Mean square due to error
(c) Mean square due to regression
(d) All of the above
(a) Multiple coefficient of determination
The multiple coefficient of determination (R²) measures the proportion of the variability in the dependent variable that is explained by the estimated regression equation, so it is the standard measure of goodness of fit. The mean square due to error and the mean square due to regression are instead used (via their ratio, the F statistic) to test the significance of the regression relationship.
- A regression model in which more than one independent variable is used to predict the dependent variable is called
(a) simple linear regression model
(b) multiple regression model
(c) independent model
(d) none of the above
(b) Multiple regression model
Regressions based on more than one independent variable are called multiple regressions. Multiple linear regression is an extension of simple linear regression: a dependent variable is modeled as a function of several independent variables with corresponding coefficients, along with the constant term. Multiple regression requires two or more predictor variables, and this is why it is called multiple regression.
Multiple regression will be good at explaining the relationship of the independent variables to the dependent variables if those relationships are linear.
- The average positive difference between computed and desired outcome values is ______ .
(a) Root mean squared error
(b) Mean squared error
(c) Mean absolute error
(d) Mean positive error
(c) Mean absolute error
Absolute Error is the amount of error in your measurements. It is the difference between the measured value and “true” value. Mean absolute error is the average of all absolute errors.
Mean Absolute Error (MAE): MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight. [For more, please refer here]
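A small sketch of the MAE computation, with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_pred - y_true))        # average magnitude of the errors
print(mae)                                    # 0.5
print(mean_absolute_error(y_true, y_pred))    # same value via scikit-learn
```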
- Which of the following cross validation versions may not be suitable for very large datasets with hundreds of thousands of samples?
a) k-fold cross-validation
b) Leave-one-out cross-validation
c) Holdout method
d) All of the above
(b) Leave-one-out cross-validation
Leave-one-out cross-validation (LOO cross-validation) is not suitable for very large datasets because this validation technique requires one model to be trained and evaluated for every sample in the dataset.
Cross validation
It is a technique to evaluate a machine learning model and it is the basis for a whole class of model evaluation methods. The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating it. It works by splitting the dataset into a number of subsets, keeping one subset aside, training the model on the rest, and testing the model on the held-out subset.
Leave-one-out cross validation
Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross validation is very expensive to compute at first pass. [For more information on other cross-validation techniques you may refer here]
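A hedged sketch of LOO cross-validation with scikit-learn; the iris dataset (only 150 samples) is used so the example runs quickly, yet it already needs 150 separate fits:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)                 # 150 samples -> 150 model fits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(len(scores), scores.mean())
# With hundreds of thousands of samples this would mean hundreds of
# thousands of fits, which is why LOO CV does not scale to large datasets.
```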
- Which of the following cross validation versions is a suitable, quicker cross-validation method for very large datasets with hundreds of thousands of samples?
a) k-fold cross-validation
b) Leave-one-out cross-validation
c) Holdout method
d) All of the above
(c) Holdout method
The holdout method is suitable for very large datasets because it is the simplest and quickest-to-compute version of cross-validation.
What is cross-validation? Refer to the answer to the previous question on this page.
Holdout method
In this method, the dataset is divided into two sets, namely the training set and the test set, with the basic property that the training set is bigger than the test set. The model is then trained on the training set and evaluated using the test set.
- Which of the following is a disadvantage of k-fold cross-validation method?
a) The variance of the resulting estimate is reduced as k is increased.
b) This usually does not take a longer time to compute
c) Reduced bias
d) The training algorithm has to rerun from scratch k times
Answer: (d) The training algorithm has to rerun from scratch k times
In k-fold cross-validation, the dataset is divided into k subsets. As in the holdout method, these subsets are divided into training and test sets as follows:
a) One of the subsets is chosen as the test set and the other subsets put together forms the training set.
b) Train a model on training set and test using test set
c) Keep the score to calculate the average error.
d) Repeat (a) to (c) for all individual subsets as test sets
Here, as the training set changes in every cycle, the training algorithm has to be rerun from scratch k times. Hence, it takes k times as much computation to make an evaluation, as sketched below.
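A minimal k-fold sketch with scikit-learn, illustrating that a fresh model is trained on every fold (the dataset and model are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, test_idx in kf.split(X):
    # A fresh model is trained from scratch on every fold...
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X[train_idx], y[train_idx])
    # ...and scored on the held-out fold; the scores are averaged afterwards.
    print(model.score(X[test_idx], y[test_idx]))
```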