ML Flashcards
Não chumbar (don't flunk)
- High entropy means that the partitions in classification are
a) pure
b) not pure
c) useful
d) useless
(b) Not pure
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
It is a measure of disorder, impurity, unpredictability, or uncertainty.
Low entropy means the partition is purer and less uncertain; high entropy means it is more mixed and more uncertain.
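A minimal sketch of how entropy quantifies (im)purity, assuming NumPy is available (the helper name `entropy` is just illustrative):

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy (in bits) of a class distribution."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([10, 0]))   # pure partition     -> 0.0 (low entropy)
print(entropy([5, 5]))    # maximally mixed    -> 1.0 (high entropy)
```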
- Which of the following is NOT supervised learning?
a) PCA
b) Decision Tree
c) Linear Regression
d) Naive Bayesian
a) PCA
- Which of the following statements about Naive Bayes is incorrect?
a) Attributes are equally important.
b) Attributes are statistically dependent on one another given the class value.
c) Attributes are statistically independent of one another given the class value.
d) Attributes can be nominal or numeric
b) Attributes are statistically dependent on one another given the class value
The Naive Bayes assumption is the opposite: attributes are statistically independent of one another given the class value.
Naïve Bayes
Naïve Bayes classifier assumes conditional independence between attributes and assigns the MAP class to new instances.
Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.
It is called naive Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of all attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the calculation becomes P(d1|h) * P(d2|h) and so on.
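A hedged sketch of this with scikit-learn's CategoricalNB (the toy data and integer encodings below are made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy data: two categorical attributes encoded as integers, y is the class label.
X = np.array([[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1], [0, 0], [2, 0]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

clf = CategoricalNB(alpha=1.0)        # Laplace smoothing
clf.fit(X, y)

# Internally the classifier combines P(h) * P(d1|h) * P(d2|h) and
# predicts the MAP class for a new instance.
print(clf.predict([[1, 0]]))
print(clf.predict_proba([[1, 0]]))
```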
- A and B are two events. If P(A, B) decreases while P(A) increases, which of the following is true?
a) P(A|B) decreases
b) P(B|A) decreases
c) P(B) decreases
d) All of above
(b) P(B|A) decreases
The conditional probability equation for a joint probability distribution is:
P(A, B) = P(A|B)P(B) = P(B|A)P(A).
Let us take the second form:
P(A, B) = P(B|A)P(A).
In this equation, if P(A) increases, then P(A, B) can only decrease if P(B|A) decreases.
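A quick numeric check of this reasoning, using made-up probabilities:

```python
# P(A, B) = P(B|A) * P(A), so P(B|A) = P(A, B) / P(A).
p_A, p_AB = 0.4, 0.2
print(p_AB / p_A)      # P(B|A) = 0.5

p_A, p_AB = 0.5, 0.1   # P(A) increased while P(A, B) decreased...
print(p_AB / p_A)      # ...so P(B|A) = 0.2 has decreased
```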
- In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively high negative value. This suggests that
a) This feature has a strong effect on the model (should be retained)
b) This feature does not have a strong effect on the model (should be ignored)
c) It is not possible to comment on the importance of this feature without additional information
d) Nothing can be determined.
(c) It is not possible to comment on the importance of this feature without additional information
A high magnitude suggests that the feature is important. However, it may be the case that another feature is highly correlated with this feature and its coefficient also has a high magnitude with the opposite sign, in effect cancelling out the effect of the former. Thus, we cannot really remark on the importance of a feature just because its coefficient has a relatively large magnitude.
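A small sketch of this effect with scikit-learn, assuming two nearly duplicated features (all names and numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # x2 is almost a copy of x1
y = 3 * x1 + rng.normal(scale=0.1, size=n)  # only the shared signal matters

model = LinearRegression().fit(np.column_stack([x1, x2]), y)

# The two coefficients can take large opposite-sign values that largely
# cancel; only their sum (roughly 3) is well determined, so a single
# coefficient's magnitude says little about that feature's importance.
print(model.coef_, model.coef_.sum())
```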
- As the number of training examples goes to infinity, your model trained on that data will have:
a) Lower variance
b) Higher variance
c) Same variance
d) None of the above
Answer: (a) Lower variance
With more training examples you will have lower test error (the variance of the model decreases, meaning we overfit less).
Refer here for more details: In Machine Learning, What is Better: More Data or better Algorithms
High variance – a model that represents the training set well, but is at risk of overfitting to noisy or unrepresentative training data.
High bias – a simpler model that doesn’t tend to overfit, but may underfit training data, failing to capture important regularities.
- Which of the following is/are true regarding an SVM?
a) For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line.
b) In theory, a Gaussian kernel SVM cannot model any complex separating hyperplane.
c) For every kernel function used in a SVM, one can obtain an equivalent closed form basis expansion.
d) Overfitting in an SVM is not a function of number of support vectors.
a) For two dimensional data points, the separating hyperplane learnt by a linear SVM will be a straight line
SVM or Support Vector Machine is a linear model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical problems. The algorithm creates a line or a hyperplane which separates the data into classes.
A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts.
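A minimal sketch with scikit-learn showing that, for 2-D inputs, a linear SVM's hyperplane is just a straight line (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(50, 2)),   # class 0 cluster
               rng.normal(loc=2, size=(50, 2))])   # class 1 cluster
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)

# The learnt hyperplane w.x + b = 0 reduces to a line in 2-D.
w, b = clf.coef_[0], clf.intercept_[0]
print(f"decision boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```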
- Which of the following guidelines is applicable to initialization of the weight vector in a fully connected neural network.
a) Should not set it to zero since otherwise it will cause overfitting
b) Should not set it to zero since otherwise (stochastic) gradient descent will explore a very small space
c) Should set it to zero since otherwise it causes a bias
d) Should set it to zero in order to preserve symmetry across all neurons
(b) should not set it to zero since otherwise gradient descent will explore a very small space
If we initialize all the weights to zero, the neural network will train, but all the neurons in a layer will learn the same features during training, so the network behaves as if it had far fewer units. When all weights are set to the same value (e.g., 0), the derivative with respect to the loss function is the same for every w in a layer's weight matrix; thus, all the weights have the same values in the subsequent iteration. Hence, they must be initialized to small random numbers.
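A small NumPy sketch of the symmetry problem; strictly zero weights would also make the first gradients vanish here, so a constant (symmetric) initialisation is used to make the identical-gradient effect visible (all shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))            # 8 samples, 3 inputs
y = rng.normal(size=(8, 1))

# Symmetric initialisation: every hidden unit starts out identical.
W1 = np.full((3, 5), 0.1)
W2 = np.full((5, 1), 0.1)

h = np.tanh(x @ W1)                    # all 5 hidden columns are identical
err = h @ W2 - y                       # squared-error residual
grad_W1 = x.T @ ((err @ W2.T) * (1 - h ** 2))

# Every hidden unit receives exactly the same gradient, so the units stay
# copies of one another after any number of updates.
print(np.allclose(grad_W1, grad_W1[:, [0]]))   # True
```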
- For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model):
a) The number of hidden nodes
b) The learning rate
c) The initial choice of weights
d) The use of a constant-term unit input
(a) The number of hidden nodes
The number of hidden nodes. Zero hidden nodes results in a linear model, while many hidden nodes (with non-linear activations) significantly increase the variance of the model. A feed-forward neural network without hidden nodes can only find linear decision boundaries.
- You’ve just finished training a decision tree for spam classification, and it is getting abnormally bad performance on both your training and test sets. You know that your implementation has no bugs, so what could be causing the problem?
a) Your decision trees are too shallow.
b) You need to increase the learning rate.
c) You are overfitting.
d) None of the above.
(a) your decision trees are too shallow
Shallow decision trees - trees that are too shallow might lead to overly simple models that can’t fit the data.
A model that is underfit will have high training and high testing error. Hence, bad performance on both the training and test sets indicates underfitting, which means the hypothesis space is not complex enough (the decision trees are too shallow) to include the true but unknown prediction function.
The shallower the tree, the less variance we have in our predictions; however, at some point we start to inject too much bias, as shallow trees (e.g., stumps) are not able to capture interactions and complex patterns in our data.
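A sketch of this with scikit-learn decision trees of different depths (the dataset and depths are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):             # a stump, a shallow tree, a full tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
# The depth-1 stump tends to score badly on BOTH sets (underfitting),
# while deeper trees fit the training data much better.
```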
- ___________ refers to a model that can neither model the training data nor generalize to new data.
a) good fitting
b) overfitting
c) underfitting
d) all of the above
c) underfitting
- Which among the following prevents overfitting when we perform bagging?
a) The use of sampling with replacement as the sampling technique
b) The use of weak classifiers
c) The use of classification algorithms which are not prone to overfitting
d) The practice of validation performed on every classifier trained
(b) the use of weak classifiers
The presence of over-training (which leads to overfitting) is not generally a problem with weak classifiers. For example, in decision stumps, i.e., decision trees with only one node (the root node), there is no real scope for overfitting. This helps the ensemble, which combines the outputs of the weak classifiers, to avoid overfitting.
- Averaging the output of multiple decision trees helps ________.
a) Increase bias
b) Decrease bias
c) Increase variance
d) Decrease variance
(d) decrease variance
Averaging out the predictions of multiple classifiers will drastically reduce the variance.
Averaging is not specific to decision trees; it can work with many different learning algorithms. But it works particularly well with decision trees.
Why averaging?
If two trees pick different features for the very first split at the top of the tree, then it’s quite common for the trees to be completely different. So decision trees tend to have high variance. To fix this, we can reduce the variance of decision trees by taking an average answer of a bunch of decision trees.
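A hedged sketch of averaging with scikit-learn's BaggingClassifier (the dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=100, random_state=0)

# Averaging many trees trained on bootstrap samples usually gives a
# higher and more stable cross-validated score than a single deep tree.
print(cross_val_score(single_tree, X, y, cv=5).mean())
print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```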
- If N is the number of instances in the training dataset, nearest neighbors has a classification run time of
a) O(1)
b) O(N)
c) O(log N)
d) O(N²)
(b) O(N)
Nearest neighbors needs to compute distances to each of the N training instances. Hence, the classification run time complexity is O(N).
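A brute-force sketch of this in NumPy; the helper name nn_classify is illustrative:

```python
import numpy as np

def nn_classify(x_query, X_train, y_train):
    """1-nearest-neighbour prediction by brute force."""
    # One distance per training instance -> O(N) work per query.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.random.rand(10_000, 5)
y_train = np.random.randint(0, 2, size=10_000)
print(nn_classify(np.random.rand(5), X_train, y_train))
```

Space-partitioning indexes such as KD-trees can reduce the average query cost, but the plain scan shown here is the O(N) baseline the question refers to.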
- Which of the following is more appropriate to do feature selection?
a) Ridge
b) Lasso
c) both (a) and (b)
d) neither (a) nor (b)
Answer: (b) lasso
For feature selection, we would prefer to use lasso since solving the optimization problem when using lasso will cause some of the coefficients to be exactly zero (depending of course on the data) whereas with ridge regression, the magnitude of the coefficients will be reduced, but won’t go down to zero.
Ridge and Lasso
Ridge and Lasso are types of regularization techniques. They are simple techniques to reduce model complexity and prevent the overfitting which may result from simple linear regression.
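A sketch of this difference with scikit-learn, assuming synthetic data where only two of ten features matter (all values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.round(lasso.coef_, 2))  # many coefficients exactly 0 -> feature selection
print(np.round(ridge.coef_, 2))  # coefficients shrunk but typically non-zero
```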
- The number of test examples needed to get statistically significant results should be _________
a) Larger if the error rate is larger.
b) Larger if the error rate is smaller.
c) Smaller if the error rate is smaller.
d) It does not matter.
Answer: (b) Larger if the error rate is smaller
When the error rate is small, errors are rare events, so many more test examples are needed before enough errors are observed to estimate that rate reliably. Tests for statistical significance tell us what the probability is that the relationship we think we have found is due only to random chance. They tell us what the probability is that we would be making an error if we assume that we have found that a relationship exists.
Statistical significance is a way of mathematically proving that a certain statistic is reliable. When you make decisions based on the results of experiments that you’re running, you will want to make sure that a relationship actually exists.
Your statistical significance level reflects your risk tolerance and confidence level. For example, if you run an A/B testing experiment with a significance level of 95%, this means that if you determine a winner, you can be 95% confident that the observed results are real and not an error caused by randomness. It also means that there is a 5% chance that you could be wrong.
- Neural networks:
a) Optimize a convex objective function
b) Can only be trained with stochastic gradient descent
c) Can use a mix of different activation functions
d) None of the above
Answer: (c) Can use a mix of different activation functions
Neural networks can use a mix of different activation functions, such as sigmoid, tanh, and ReLU.
Activation function
In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each neuron has a weight, and multiplying the input number with the weight gives the output of the neuron, which is transferred to the next layer. The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.
[Source: Role of the Activation Function in a Neural Network Model]
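A minimal NumPy sketch of a few common activation functions applied to the same pre-activation values (purely illustrative):

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # pre-activation values

sigmoid = 1 / (1 + np.exp(-z))   # squashes to (0, 1)
tanh    = np.tanh(z)             # squashes to (-1, 1)
relu    = np.maximum(0, z)       # passes positives, zeroes out negatives

print(sigmoid, tanh, relu, sep="\n")
```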
- Which one of the following is the main reason for pruning a Decision Tree?
a) To save computing time during testing
b) To save space for storing the Decision Tree
c) To make the training set error smaller
d) To avoid overfitting the training set
Answer: (d) to avoid overfitting the training set
The reason for pruning is that the trees prepared by the base algorithm can be prone to overfitting as they become incredibly large and complex.
Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting. [Wikipedia]
- Which of the following methods can achieve zero training error on any linearly separable dataset?
a) Decision tree
b) 15-nearest neighbors
c) Perceptron
d) Logistic regression
Answer: (a) Decision tree and (c) Perceptron
Decision tree – Standard (unpruned) decision trees have no learning bias restricting the hypotheses they can represent, so the training set error is always zero if there is no label noise.
Perceptron – Since the data set is linearly separable, any subset of the data is also linearly separable. Thus, the perceptron is guaranteed to converge to a perfect solution on the training set. This may not always be true for the test dataset.
- Consider a point that is correctly classified and distant from the decision boundary. Which of the following methods will be unaffected by this point?
a) Nearest neighbor
b) SVM
c) Logistic regression
d) Linear regression
Answer: (b) SVM
The hinge loss used by SVMs gives zero weight to such points, so the SVM is unaffected by this point, whereas the log loss used by logistic regression still gives a little bit of weight to these points.
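A small numeric sketch comparing the two losses as a function of the signed margin y*f(x); the helper names are illustrative:

```python
import numpy as np

def hinge_loss(margin):    # SVM:                 max(0, 1 - y*f(x))
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):      # logistic regression: log(1 + exp(-y*f(x)))
    return np.log1p(np.exp(-margin))

# A correctly classified point far from the boundary has a large margin.
for m in (0.5, 1.0, 3.0, 10.0):
    print(m, hinge_loss(m), log_loss(m))
# Hinge loss is exactly 0 once the margin reaches 1, so such points
# contribute nothing; log loss is small but never exactly 0.
```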
- Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting?
a) Increase the amount of training data.
b) Improve the optimization algorithm being used for error minimization.
c) Decrease the model complexity.
d) Reduce the noise in the training data.
Answer: (b) Improve the optimization algorithm being used for error minimization.
Increasing the amount of training data would help in reducing the overfitting problem.
Increased complexity of the underlying model increases the risk of overfitting, so decreasing the model complexity may help in reducing it.
Noise in the training data can increase the possibility for overfitting. Noise reduction can help in reducing the overfitting.
- The error function most suited for gradient descent using logistic regression is
a) The entropy function.
b) The squared error.
c) The cross-entropy function.
d) The number of mistakes.
Answer: (c) The cross-entropy function
For logistic regression, the cross-entropy function (loss function or cost function) is convex. A convex function has just one minimum; there are no local minima to get stuck in, so gradient descent starting from any point is guaranteed to find the minimum.
Since the cross-entropy cost function is convex, a variety of local optimization schemes can be used to minimize it properly. For this reason the cross-entropy cost is used more often in practice for logistic regression than the logistic least-squares cost.
The cost function returns a value representing how well your model performs; it is like a function that gives you the amount of error.
To find the optimal model, i.e. the one that minimizes this cost, we use gradient descent, as sketched below.
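A minimal sketch of gradient descent on the cross-entropy cost for logistic regression, in NumPy (the data, step size, and iteration count are illustrative):

```python
import numpy as np

def cross_entropy(w, X, y):
    """Binary cross-entropy cost for logistic regression (labels in {0, 1})."""
    p = 1 / (1 + np.exp(-X @ w))
    eps = 1e-12                                  # avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
for _ in range(500):                 # plain gradient descent
    w -= 0.5 * gradient(w, X, y)     # convex cost: no local minima to get stuck in
print(w, cross_entropy(w, X, y))
```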
- You are given a labeled binary classification data set with N data points and D features. Suppose that N < D. In training an SVM on this data set, which of the following kernels is likely to be most appropriate?
a) Linear kernel
b) Quadratic kernel
c) Higher-order polynomial kernel
d) RBF kernel
Answer: (a) Linear kernel
The linear kernel is used when the data is linearly separable, that is, it can be separated using a single line. It is one of the most common kernels and is mostly used when there is a large number of features in a data set.
When the number of examples is small in comparison to the number of features, you would not have enough data to fit a non-linear SVM, i.e., an SVM with a non-linear kernel. An SVM with a linear kernel (or without a kernel) is one way to go.
- You are increasing the size of the layers (more hidden units per layer) in your neural network. What kind of impact it will have on bias and variance?
a) increases, increases
b) increases, decreases
c) decreases, increases
d) decreases, decreases.
Answer: (c) decreases, increases
Increasing the size of layers will result in decreasing bias and increasing variance.
Increasing the size of the layers results in increased complexity. High variance means the model performs great on training data but poorly on test data. Low bias means the model is fitting the training data well.
- What is the biggest weakness of decision trees compared to logistic regression classifiers?
a) Decision trees are more likely to overfit the data
b) Decision trees are more likely to underfit the data
c) Decision trees do not assume independence of the input features
d) None of the mentioned
a) Decision trees are more likely to overfit the data
Decision trees are more likely to overfit the data since they can split on many different combinations of features, whereas in logistic regression we associate only one parameter with each feature.
- Which of the following classifiers can generate linear decision boundary?
a) Linear SVM
b) Random forest
c) Logistic regression
d) k-NN
Answer: (a) Linear SVM and (c) Logistic regression
Linear SVM and logistic regression are linear classifiers. Random forest and k-NN are non-linear classifiers; in general they do not produce linear decision boundaries.
- If we increase the k value in k-nearest neighbor, the model will _____ the bias and ______ the variance.
a) Decrease, Decrease
b) Increase, Decrease
c) Decrease, Increase
d) Increase, Increase
Answer: (b) Increase, Decrease
When K increases to a large value, the model becomes the simplest possible. All test data points will be assigned to the same class: the majority class. This is underfitting, that is, high bias and low variance.
Bias-Variance tradeoff
Bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs. In other words, a model with high bias pays very little attention to the training data and oversimplifies the model.
Variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs. In other words, a model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. [Source: Refer here]
- For a large k value the k-nearest neighbor model becomes _____ and ______ .
a) Complex model, Overfit
b) Complex model, Underfit
c) Simple model, Underfit
d) Simple model, Overfit
(c) Simple model, Underfit
When K grows toward N, the model is at its simplest. All test data points will belong to the same class: the majority class. This is underfitting, that is, high bias and low variance.
k-NN classification is an averaging operation: to come to a decision, the labels of the K nearest neighbour samples are averaged. The standard deviation (or the variance) of the output of averaging decreases as the number of samples increases. In the case K == N (you select K as large as the size of the dataset), the variance becomes zero.
Underfitting means the model does not fit, in other words, does not predict, the (training) data very well.
Overfitting means that the model predicts the (training) data too well. It is too good to be true. If the new data point comes in, the prediction may be wrong.
- When we have a real-valued input attribute during decision-tree learning, what would be the impact of a multi-way split with one branch for each of the distinct values of the attribute?
a) It is too computationally expensive.
b) It would probably result in a decision tree that scores badly on the training set and a test set.
c) It would probably result in a decision tree that scores well on the training set but badly on a test set.
d) It would probably result in a decision tree that scores well on a test set but badly on a training set.
(c) It would probably result in a decision tree that scores well on the training set but badly on a test set
It is usual to make only binary splits because multiway splits break the data into small subsets too quickly. This causes a bias towards splitting predictors with many classes since they are more likely to produce relatively pure child nodes, which results in overfitting. [For more, refer here]
- The VC dimension of a Perceptron is _____ the VC dimension of a simple linear SVM.
a) Larger than
b) Smaller than
c) Same as
d) Not at all related
(c) Same as
Both the Perceptron and a linear SVM are linear discriminators (i.e., a line in 2D space or a plane in 3D space), so they have the same VC dimension.
VC dimension
The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a space of functions that can be learned by a statistical binary classification algorithm. It is defined as the cardinality of the largest set of points that the algorithm can shatter. [Wikipedia]
- A measure of goodness of fit for the estimated regression equation is the
(a) Multiple coefficient of determination
(b) Mean square due to error
(c) Mean square due to regression
(d) All of the above
(a) Multiple coefficient of determination
The multiple coefficient of determination (R²) measures the proportion of the variability in the dependent variable that is explained by the estimated regression equation, so it is the standard measure of goodness of fit. The mean square due to error and the mean square due to regression are instead used (via their ratio, the F statistic) to test the significance of the regression relationship.
- A regression model in which more than one independent variable is used to predict the dependent variable is called
(a) simple linear regression model
(b) multiple regression model
(c) independent model
(d) none of the above
(b) Multiple regression model
Regressions based on more than one independent variable are called multiple regressions. Multiple linear regression is an extension of simple linear regression: a dependent variable is modeled as a function of several independent variables with corresponding coefficients, along with the constant term. Multiple regression requires two or more predictor variables, and this is why it is called multiple regression.
Multiple regression will be good at explaining the relationship of the independent variables to the dependent variables if those relationships are linear.
- The average positive difference between computed and desired outcome values is ______ .
(a) Root mean squared error
(b) Mean squared error
(c) Mean absolute error
(d) Mean positive error
(c) Mean absolute error
Absolute Error is the amount of error in your measurements. It is the difference between the measured value and “true” value. Mean absolute error is the average of all absolute errors.
Mean Absolute Error (MAE): MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight. [For more, please refer here]
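A small sketch of the MAE computation, with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_pred - y_true))        # average magnitude of the errors
print(mae)                                    # 0.5
print(mean_absolute_error(y_true, y_pred))    # same value via scikit-learn
```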
- Which of the following cross validation versions may not be suitable for very large datasets with hundreds of thousands of samples?
a) k-fold cross-validation
b) Leave-one-out cross-validation
c) Holdout method
d) All of the above
(b) Leave-one-out cross-validation
Leave-one-out cross-validation (LOO cross-validation) is not suitable for very large datasets because this validation technique requires one model to be trained and evaluated for every sample in the dataset.
Cross validation
It is a technique to evaluate a machine learning model and it is the basis for a whole class of model evaluation methods. The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating it. It works by splitting the dataset into a number of subsets, keeping one subset aside, training the model on the rest, and testing the model on the held-out subset.
Leave-one-out cross validation
Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross validation is very expensive to compute at first pass. [For more information on other cross-validation techniques you may refer here]
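A hedged sketch of LOO cross-validation with scikit-learn; the iris dataset (only 150 samples) is used so the example runs quickly, yet it already needs 150 separate fits:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)                 # 150 samples -> 150 model fits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(len(scores), scores.mean())
# With hundreds of thousands of samples this would mean hundreds of
# thousands of fits, which is why LOO CV does not scale to large datasets.
```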
- Which of the following cross validation versions is a suitable, quicker cross-validation method for very large datasets with hundreds of thousands of samples?
a) k-fold cross-validation
b) Leave-one-out cross-validation
c) Holdout method
d) All of the above
(c) Holdout method
The holdout method is suitable for very large datasets because it is the simplest and quickest-to-compute version of cross-validation.
What is cross-validation? Refer to the answer to the previous question on this page.
Holdout method
In this method, the dataset is divided into two sets, namely the training set and the test set, with the basic property that the training set is bigger than the test set. The model is then trained on the training set and evaluated using the test set.
- Which of the following is a disadvantage of k-fold cross-validation method?
a) The variance of the resulting estimate is reduced as k is increased.
b) This usually does not take a longer time to compute
c) Reduced bias
d) The training algorithm has to rerun from scratch k times
Answer: (d) The training algorithm has to rerun from scratch k times
In k-fold cross-validation, the dataset is divided into k subsets. As in the holdout method, these subsets are divided into training and test sets as follows:
a) One of the subsets is chosen as the test set and the other subsets put together forms the training set.
b) Train a model on training set and test using test set
c) Keep the score to calculate the average error.
d) Repeat (a) to (c) for all individual subsets as test sets
Here, as the training set changes in every cycle, the training algorithm has to be rerun from scratch k times. Hence, it takes k times as much computation to make an evaluation, as sketched below.
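A minimal k-fold sketch with scikit-learn, illustrating that a fresh model is trained on every fold (the dataset and model are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, test_idx in kf.split(X):
    # A fresh model is trained from scratch on every fold...
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X[train_idx], y[train_idx])
    # ...and scored on the held-out fold; the scores are averaged afterwards.
    print(model.score(X[test_idx], y[test_idx]))
```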