ML Flashcards
Don't flunk
- High entropy means that the partitions in classification are
a) pure
b) not pure
c) useful
d) useless
(b) Not pure
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
It is a measure of disorder, impurity, unpredictability, or uncertainty.
Low entropy means less uncertainty; high entropy means more uncertainty.
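As a quick illustration, here is a minimal Python/NumPy sketch (the class counts are made up) that computes the entropy of a partition's class distribution: a pure partition has entropy 0, while a 50/50 mixture has the maximum entropy of 1 bit.

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy (in bits) of a partition's class distribution."""
    counts = np.asarray(class_counts, dtype=float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]               # treat 0 * log(0) as 0
    return -np.sum(probs * np.log2(probs))

print(entropy([10, 0]))   # pure partition -> 0.0 (low entropy)
print(entropy([5, 5]))    # 50/50 mixture  -> 1.0 (high entropy, not pure)
```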
- Which of the following is NOT supervised learning?
a) PCA
b) Decision Tree
c) Linear Regression
d) Naive Bayesian
a) PCA
- Which of the following statements about Naive Bayes is incorrect?
a) Attributes are equally important.
b) Attributes are statistically dependent of one another given the class value.
c) Attributes are statistically independent of one another given the class value.
d) Attributes can be nominal or numeric
b) Attributes are statistically dependent of one another given the class value
Attributes are statistically independent of one another given the class value.
Naïve Bayes
Naïve Bayes classifier assumes conditional independence between attributes and assigns the MAP class to new instances.
Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.
It is called naive Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the probability is calculated as P(d1|h) * P(d2|h) and so on.
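The sketch below (plain Python/NumPy; the tiny weather-style dataset and the binary-valued attributes are made-up assumptions) shows the MAP calculation under the conditional independence assumption, with Laplace smoothing for unseen counts.

```python
import numpy as np
from collections import defaultdict

# Toy training data: (outlook, windy) -> play
train = [
    (("sunny", "no"),  "yes"),
    (("sunny", "yes"), "no"),
    (("rain",  "no"),  "yes"),
    (("rain",  "yes"), "no"),
    (("sunny", "no"),  "yes"),
]

class_counts = defaultdict(int)   # class -> count
feat_counts = defaultdict(int)    # (attribute index, value, class) -> count
for features, label in train:
    class_counts[label] += 1
    for i, v in enumerate(features):
        feat_counts[(i, v, label)] += 1

def predict(features):
    """Return the MAP class: argmax_h P(h) * prod_i P(d_i | h)."""
    n = sum(class_counts.values())
    best, best_score = None, -np.inf
    for h, ch in class_counts.items():
        score = np.log(ch / n)            # log P(h)
        for i, v in enumerate(features):
            # P(d_i | h) with Laplace smoothing (each attribute assumed binary)
            score += np.log((feat_counts[(i, v, h)] + 1) / (ch + 2))
        if score > best_score:
            best, best_score = h, score
    return best

print(predict(("sunny", "no")))   # -> 'yes' (the MAP class on this toy data)
```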
- A and B are two events. If P(A, B) decreases while P(A) increases, which of the following is true?
a) P(A|B) decreases
b) P(B|A) decreases
c) P(B) decreases
d) All of above
(b) P(B|A) decreases
From the product rule for the joint probability:
P(A, B) = P(A|B)P(B) = P(B|A)P(A).
Take the second form:
P(A, B) = P(B|A)P(A).
In this equation, if P(A) increases while P(A, B) decreases, then P(B|A) must decrease.
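A quick numeric check of this reasoning, with made-up probabilities:

```python
# P(A, B) = P(B|A) * P(A), with made-up numbers.
p_a, p_b_given_a = 0.5, 0.4
print(p_b_given_a * p_a)        # P(A, B) = 0.20

# Suppose P(A) rises to 0.8 while P(A, B) falls to 0.16:
p_a_new, p_ab_new = 0.8, 0.16
print(p_ab_new / p_a_new)       # new P(B|A) = 0.20 < 0.40, so P(B|A) decreased
```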
- In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. This suggests that
a) This feature has a strong effect on the model (should be retained)
b) This feature does not have a strong effect on the model (should be ignored)
c) It is not possible to comment on the importance of this feature without additional information
d) Nothing can be determined.
(c) It is not possible to comment on the importance of this feature without additional information
A high magnitude suggests that the feature is important. However, it may be the case that another feature is highly correlated with this one and its coefficient also has a high magnitude with the opposite sign, in effect cancelling out the effect of the former. Thus, we cannot really remark on the importance of a feature just because its coefficient has a relatively large magnitude.
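The NumPy sketch below (a made-up dataset in which one feature is almost a copy of another) shows how near-duplicate features often receive large coefficients of opposite sign that largely cancel, so a single large coefficient says little on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)       # x2 is almost a copy of x1
y = x1 + rng.normal(scale=0.1, size=n)    # the true signal uses x1 only

X = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # often one large positive and one large negative coefficient
              # whose sum is close to 1; neither magnitude alone is informative
```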
- As the number of training examples goes to infinity, your model trained on that data will have:
a) Lower variance
b) Higher variance
c) Same variance
d) None of the above
Answer: (a) Lower variance
Once you have more training examples you will have lower test error (the variance of the model decreases, meaning we overfit less).
Refer here for more details: In Machine Learning, What is Better: More Data or better Algorithms
High variance – a model that represents the training set well, but is at risk of overfitting to noisy or unrepresentative training data.
High bias – a simpler model that does not tend to overfit, but may underfit the training data, failing to capture important regularities.
- Which of the following is/are true regarding an SVM?
a) For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line.
b) In theory, a Gaussian kernel SVM cannot model any complex separating hyperplane.
c) For every kernel function used in a SVM, one can obtain an equivalent closed form basis expansion.
d) Overfitting in an SVM is not a function of number of support vectors.
a) For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line
SVM or Support Vector Machine is a linear model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical problems. The algorithm creates a line or a hyperplane which separates the data into classes.
A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts.
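As a small illustration (scikit-learn, with made-up 2D points), a linear SVM trained on two-dimensional data yields a boundary of the form w1*x1 + w2*x2 + b = 0, i.e. a straight line:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 3], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print(f"separating line: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```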
- Which of the following guidelines is applicable to initialization of the weight vector in a fully connected neural network.
a) Should not set it to zero since otherwise it will cause overfitting
b) Should not set it to zero since otherwise (stochastic) gradient descent will explore a very small space
c) Should set it to zero since otherwise it causes a bias
d) Should set it to zero in order to preserve symmetry across all neurons
(b) should not set it to zero since otherwise gradient descent will explore a very small space
If we initialize all the weights to zero, the neural network will train, but all the neurons will learn the same features during training. Setting all weights to zero makes your model equivalent to a linear model. When you set all weights to 0, the derivative of the loss function is the same with respect to every w in the weight matrix, so all the weights have the same values in the subsequent iteration. Hence, they must be initialized to random numbers.
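The NumPy sketch below (toy network and data are made up) illustrates the symmetry problem: with every weight initialized to the same constant (zero being the extreme case), all hidden units compute the same value and receive identical gradients, so they can never learn different features.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # toy inputs
y = rng.normal(size=(8, 1))            # toy targets

W1 = np.full((3, 4), 0.5)              # every first-layer weight is the same constant
W2 = np.full((4, 1), 0.5)              # every second-layer weight is the same constant

h = np.tanh(X @ W1)                    # all hidden units compute identical activations
out = h @ W2
grad_out = out - y                                   # d(squared error)/d(out)
grad_W1 = X.T @ ((grad_out @ W2.T) * (1 - h ** 2))   # backprop to the first layer

# Every column (one per hidden unit) of the gradient is identical:
print(np.allclose(grad_W1, grad_W1[:, [0]]))   # True
```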
- For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model):
a) The number of hidden nodes
b) The learning rate
c) The initial choice of weights
d) The use of a constant-term unit input
(a) The number of hidden nodes
The number of hidden nodes. Zero hidden nodes results in a linear model, while many hidden nodes (with non-linear activations) significantly increase the variance of the model. A feed-forward neural network without hidden nodes can only find linear decision boundaries.
- You’ve just finished training a decision tree for spam classification, and it is getting abnormally bad performance on both your training and test sets. You know that your implementation has no bugs, so what could be causing the problem?
a) Your decision trees are too shallow.
b) You need to increase the learning rate.
c) You are overfitting.
d) None of the above.
(a) your decision trees are too shallow
Shallow decision trees - trees that are too shallow might lead to overly simple models that can’t fit the data.
A model that is underfit will have high training and high testing error. Hence, bad performance on both the training and test sets indicates underfitting, which means the set of hypotheses is not complex enough (decision trees that are too shallow) to include the true but unknown prediction function.
The shallower the tree the less variance we have in our predictions; however, at some point we can start to inject too much bias as shallow trees (e.g., stumps) are not able to capture interactions and complex patterns in our data.
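A short scikit-learn sketch (assuming a synthetic two-circles dataset, chosen for illustration): a depth-1 stump underfits, while a deeper tree fits both the training and test sets noticeably better.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
# The depth-1 stump typically scores well below the deeper tree on BOTH sets (underfitting).
```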
- ___________ refers to a model that can neither model the training data nor generalize to new data.
a) good fitting
b) overfitting
c) underfitting
d) all of the above
c) underfitting
- Which among the following prevents overfitting when we perform bagging?
a) The use of sampling with replacement as the sampling technique
b) The use of weak classifiers
c) The use of classification algorithms which are not prone to overfitting
d) The practice of validation performed on every classifier trained
(b) the use of weak classifiers
The presence of over-training (which leads to overfitting) is not generally a problem with weak classifiers. For example, in decision stumps, i.e., decision trees with only one node (the root node), there is no real scope for overfitting. This helps the ensemble, which combines the outputs of the weak classifiers, to avoid overfitting.
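A minimal scikit-learn sketch of this idea (synthetic data; the estimator keyword is the recent scikit-learn spelling, older versions use base_estimator): bagging decision stumps, each of which has essentially no capacity to overfit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

stump_bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),   # weak learner: a decision stump
    n_estimators=100,                                # bootstrap sampling with replacement
    random_state=0,
)
print(cross_val_score(stump_bagger, X, y, cv=5).mean())
```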
- Averaging the output of multiple decision trees helps ________.
a) Increase bias
b) Decrease bias
c) Increase variance
d) Decrease variance
(d) decrease variance
Averaging out the predictions of multiple classifiers will drastically reduce the variance.
Averaging is not specific to decision trees; it can work with many different learning algorithms. But it works particularly well with decision trees.
Why averaging?
If two trees pick different features for the very first split at the top of the tree, then it’s quite common for the trees to be completely different. So decision trees tend to have high variance. To fix this, we can reduce the variance of decision trees by taking an average answer of a bunch of decision trees.
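The sketch below (NumPy + scikit-learn on synthetic regression data) measures this directly: it repeatedly refits either a single deep tree or an average of 25 trees on bootstrap samples and compares how much the predictions at fixed evaluation points vary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_eval = X[:20]                    # fixed points at which we compare predictions
rng = np.random.default_rng(0)

def prediction_spread(n_trees, repeats=30):
    """Refit the (averaged) model `repeats` times and collect its predictions."""
    all_preds = []
    for _ in range(repeats):
        per_tree = []
        for _ in range(n_trees):
            idx = rng.integers(0, len(X), len(X))   # bootstrap sample
            per_tree.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(X_eval))
        all_preds.append(np.mean(per_tree, axis=0))  # average the trees' answers
    return np.array(all_preds)

print("variance, single tree:    ", prediction_spread(1).var(axis=0).mean())
print("variance, 25-tree average:", prediction_spread(25).var(axis=0).mean())
```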
- If N is the number of instances in the training dataset, nearest neighbors has a classification run time of
a) O(1)
b) O(N)
c) O(log N)
d) O(N²)
(b) O(N)
Nearest neighbors needs to compute distances to each of the N training instances. Hence, the classification run time complexity is O(N).
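A brute-force NumPy sketch (toy data made up) makes the O(N) cost explicit: classifying one query requires a distance to every stored training point.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    dists = np.linalg.norm(X_train - query, axis=1)    # N distance computations -> O(N)
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]                   # majority vote

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))
y_train = (X_train[:, 0] > 0).astype(int)              # label = sign of the first feature
print(knn_predict(X_train, y_train, np.array([0.5, -0.2])))   # expected: 1
```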
- Which of the following is more appropriate to do feature selection?
a) Ridge
b) Lasso
c) both (a) and (b)
d) neither (a) nor (b)
Answer: (b) lasso
For feature selection, we would prefer to use lasso since solving the optimization problem when using lasso will cause some of the coefficients to be exactly zero (depending of course on the data) whereas with ridge regression, the magnitude of the coefficients will be reduced, but won’t go down to zero.
Ridge and Lasso
Ridge and Lasso are types of regularization techniques. They are simple techniques to reduce model complexity and prevent the over-fitting which may result from simple linear regression.
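A short scikit-learn comparison on a synthetic problem where only a few features are informative (the alpha values are arbitrary choices for illustration): Lasso sets many coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("coefficients set exactly to zero by Lasso:", np.sum(lasso.coef_ == 0))
print("coefficients set exactly to zero by Ridge:", np.sum(ridge.coef_ == 0))  # typically 0
```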
- The number of test examples needed to get statistically significant results should be _________
a) Larger if the error rate is larger.
b) Larger if the error rate is smaller.
c) Smaller if the error rate is smaller.
d) It does not matter.
Answer: (b) Larger if the error rate is smaller
Tests for statistical significance tell us what the probability is that the relationship we think we have found is due only to random chance. They tell us what the probability is that we would be making an error if we assume that we have found that a relationship exists.
Statistical significance is a way of mathematically proving that a certain statistic is reliable. When you make decisions based on the results of experiments that you’re running, you will want to make sure that a relationship actually exists.
Your statistical significance level reflects your risk tolerance and confidence level. For example, if you run an A/B testing experiment with a 95% confidence level (a 5% significance level), this means that if you determine a winner, you can be 95% confident that the observed results are real and not an error caused by randomness. It also means that there is a 5% chance that you could be wrong.
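To see why a smaller error rate demands more test examples, here is an illustrative back-of-the-envelope calculation (assuming we want to estimate the error rate to within ±20% of its value at roughly 95% confidence, using the normal approximation):

```python
z, rel = 1.96, 0.20                     # 95% confidence, +/-20% relative margin
for p in (0.20, 0.05, 0.01):            # hypothetical error rates
    n = z ** 2 * p * (1 - p) / (rel * p) ** 2
    print(f"error rate {p:.2f}: roughly {n:,.0f} test examples needed")
```

The required sample size grows as the error rate shrinks, which is why rarer errors need larger test sets to measure reliably.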
- Neural networks:
a) Optimize a convex objective function
b) Can only be trained with stochastic gradient descent
c) Can use a mix of different activation functions
d) None of the above
Answer: (c) Can use a mix of different activation functions
Neural networks can use a mix of different activation functions, such as the sigmoid, tanh, and ReLU functions.
Activation function
In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each input to a neuron has a weight; the weighted sum of the inputs gives the raw value of the neuron, which is transferred to the next layer. The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.
[Source: Role of the Activation Function in a Neural Network Model]
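For reference, a tiny NumPy sketch of three common activation functions applied to the same inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, fn in [("sigmoid", sigmoid), ("tanh", np.tanh), ("ReLU", relu)]:
    print(name, np.round(fn(z), 3))
```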
- Which one of the following is the main reason for pruning a Decision Tree?
a) To save computing time during testing
b) To save space for storing the Decision Tree
c) To make the training set error smaller
d) To avoid overfitting the training set
Answer: (d) to avoid overfitting the training set
The reason for pruning is that the trees prepared by the base algorithm can be prone to overfitting as they become incredibly large and complex.
Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting. [Wikipedia]
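A small scikit-learn sketch using cost-complexity pruning via ccp_alpha (the dataset and the alpha value are arbitrary choices for illustration): a larger alpha removes more of the tree, trading a bit of training accuracy for a simpler, less overfit model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for alpha in (0.0, 0.01):   # 0.0 = unpruned tree
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(f"ccp_alpha={alpha}: nodes={tree.tree_.node_count}, "
          f"train={tree.score(X_tr, y_tr):.3f}, test={tree.score(X_te, y_te):.3f}")
```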
- Which of the following methods can achieve zero training error on any linearly separable dataset?
a) Decision tree
b) 15-nearest neighbors
c) Perceptron
d) Logistic regression
Answer: (a) Decision tree (b) Perceptron
Decision tree – Standard decision trees have essentially no learning bias, so the training set error is always zero if there is no label noise.
Perceptron – Since the data set is linearly separable, any subset of the data is also linearly separable. Thus, the perceptron is guaranteed to converge to a perfect solution on the training set. This may not always be true for the test dataset.
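A quick check with scikit-learn (well-separated synthetic blobs, so the data is linearly separable): both an unrestricted decision tree and a perceptron reach zero training error.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier

# Two well-separated clusters -> linearly separable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)

print(Perceptron().fit(X, y).score(X, y))               # expected: 1.0
print(DecisionTreeClassifier().fit(X, y).score(X, y))   # expected: 1.0
```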
- Consider a point that is correctly classified and distant from the decision boundary. Which of the following methods will be unaffected by this point?
a) Nearest neighbor
b) SVM
c) Logistic regression
d) Linear regression
Answer: (b) SVM
The hinge loss used by SVMs gives zero weight to these points, so the SVM is unaffected by them. The log loss used by logistic regression, in contrast, gives a small but nonzero weight to these points.
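The contrast is easy to see numerically. Writing the margin as m = y * f(x) with y in {-1, +1}, the hinge loss is exactly zero once m >= 1, while the logistic (log) loss is small but never exactly zero (NumPy sketch with made-up margins):

```python
import numpy as np

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)     # zero for any margin >= 1

def log_loss(m):
    return np.log1p(np.exp(-m))         # small but strictly positive for large margins

for m in (0.5, 1.0, 5.0):
    print(f"margin {m}: hinge = {hinge_loss(m):.4f}, log-loss = {log_loss(m):.4f}")
```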
- Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting?
a) Increase the amount of training data.
b) Improve the optimization algorithm being used for error minimization.
c) Decrease the model complexity.
d) Reduce the noise in the training data.
Answer: (b) Improve the optimization algorithm being used for error minimization.
Increasing the amount of training data would help in reducing the overfitting problem.
Increased complexity of the underlying model may increase the overfitting problem; decreasing the complexity may help in reducing it.
Noise in the training data increases the possibility of overfitting, so noise reduction can help in reducing it.
- The error function most suited for gradient descent using logistic regression is
a) The entropy function.
b) The squared error.
c) The cross-entropy function.
d) The number of mistakes.
Answer: (c) The cross-entropy function
For logistic regression, the cross-entropy function (loss function or cost function) is convex. A convex function has just one minimum; there are no local minima to get stuck in, so gradient descent starting from any point is guaranteed to find the minimum.
Since the cross-entropy cost function is convex, a variety of local optimization schemes can be used to minimize it properly. For this reason, the cross-entropy cost is used more often in practice for logistic regression than the logistic least squares cost.
The cost function returns a value representing how well your model performs; it is like a function that gives you the error rate.
To find the optimal model, i.e. the one with the minimum error (cost), we use gradient descent.
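A minimal NumPy sketch of exactly this setup (made-up data and learning rate): logistic regression trained by plain gradient descent on the convex cross-entropy cost.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)            # toy binary labels

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))           # predicted probabilities
    p = np.clip(p, 1e-12, 1 - 1e-12)                 # guard against log(0)
    cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # cross-entropy
    w -= lr * (X.T @ (p - y) / len(y))               # gradient step on w
    b -= lr * np.mean(p - y)                         # gradient step on b
print(np.round(w, 2), round(b, 2), round(cost, 4))   # cost decreases toward the global minimum
```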
- You are given a labeled binary classification data set with N data points and D features. Suppose that N < D. In training an SVM on this data set, which of the following kernels is likely to be most appropriate?
a) Linear kernel
b) Quadratic kernel
c) Higher-order polynomial kernel
d) RBF kernel
Answer: (a) Linear kernel
A linear kernel is used when the data is linearly separable, that is, when it can be separated using a single line. It is one of the most common kernels, and it is mostly used when there is a large number of features in the data set.
When the number of examples is small in comparison to the number of features, you do not have enough data to fit a non-linear SVM, i.e., an SVM with a non-linear kernel. An SVM with a linear kernel (or without a kernel) is the way to go.
- You are increasing the size of the layers (more hidden units per layer) in your neural network. What kind of impact it will have on bias and variance?
a) increases, increases
b) increases, decreases
c) decreases, increases
d) decreases, decreases.
Answer: (c) decreases, increases
Increasing the size of the layers will result in decreasing bias and increasing variance.
Increasing the size of the layers results in increased model complexity. High variance means the model performs well on the training data but poorly on the test data. Low bias means the model is fitting the training data well.