05 - ML Flashcards
What is the difference between noise and variance?
While collecting data, the acquisition medium (human or machine) may introduce errors; such errors are called noise. Variance measures how far the data points spread out from the mean of the data.
What is heteroscedasticity? What is generally the shape of heteroscedastic data?
Heteroscedasticity is the phenomenon of the data points having different variance along the regression line. Heteroscedastic data generally has an irregular, cone-like shape when the residuals are plotted against the fitted values.
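As an illustration, a minimal Python sketch (assuming simulated data and using statsmodels' Breusch-Pagan test, neither of which is mentioned in the card) can check whether the residual variance changes along the regression line:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, x)              # error spread grows with x, giving a cone shape
X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(lm_pvalue)                          # a small p-value suggests heteroscedasticity
```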
How do the weights help with making a prediction?
Once the weights are established, the relationship between the dependent and independent variables is fixed; this expression is the model. To predict on unknown data, the values of the independent variables are passed into the model to obtain the dependent variable.
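A minimal sketch of this idea, with purely hypothetical weights and feature values:

```python
import numpy as np

# hypothetical weights obtained from training: intercept first, then one weight per feature
w = np.array([1.5, 0.8, -0.3])
x_new = np.array([1.0, 2.0, 4.0])   # 1 for the intercept, followed by the feature values
y_pred = w @ x_new                  # pass the independent variables through the model
print(y_pred)                       # 1.5 + 0.8*2.0 - 0.3*4.0 = 1.9
```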
Do the coefficients stand for the weights in the model?
Yes, the coefficients established during the training process are the weights of the variables in the model. They play a vital role in making predictions.
What is the difference between endogeneity and multicollinearity?
Endogeneity is the presence of correlation between the independent variables and the error term of the model, while multicollinearity is the presence of correlation among the independent variables themselves. They are two different concepts.
What is the difference between correlation and causation?
Correlation between two features only tells us that a relationship (strong or weak) exists; it says nothing about whether one feature gives rise to the other. For example, age and education can be correlated, but neither causes the other. Causation means that one feature brings about the other; for example, poverty causing starvation is a causal effect.
Is it necessary to run PCA to resolve multicollinearity?
PCA can help, but it is not necessary in all cases. PCA is used to reduce the number of features when their count is very high; this process is called dimensionality reduction. To get rid of multicollinearity, it is often enough to remove one feature out of each pair of highly correlated variables.
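A minimal pandas sketch of the drop-one-of-each-pair approach (the 0.9 threshold is an arbitrary assumption, not something from the card):

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one feature out of every pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```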
What is the maximum likelihood function?
The likelihood function is the combined likelihood of all the events in a sample occurring together. When this function is maximized to obtain the parameters of the model, it is referred to as the maximum likelihood function. It is a function of the parameters, not of the variables in the data.
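A minimal sketch of the idea, assuming a small set of hypothetical coin-flip observations and using scipy to maximize the (log-)likelihood over the parameter p:

```python
import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical coin flips (1 = heads)

def neg_log_likelihood(p):
    # the likelihood combines the probability of every observed event;
    # minimizing the negative log-likelihood is the same as maximizing the likelihood
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)   # close to the sample proportion 6/8 = 0.75
```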
Can autocorrelation cause endogeneity in the data?
Autocorrelation is the presence of correlation between a variable and its lagged version. In such a case the error term may become correlated with that variable, which introduces endogeneity.
How do you detect endogeneity, and how do you mitigate it?
To detect endogeneity, one can collect the error terms and check whether they are correlated with a given feature; if they are, that feature is endogenous. This can also be tested visually: if the variable appears to be related to the error terms, it can be declared endogenous. One possible way of mitigating endogeneity is to encode the categorical variables. Encoding creates additional variables, which may remove the correlation between the variable and the error term.
Is there an overlap between the two terms - heteroscedasticity and endogeneity?
No, they are two different phenomena. Heteroscedasticity is the phenomenon of the data points having different variance along the best-fit regression line, while endogeneity is the presence of correlation between the independent variables and the error terms.
In regression, is it necessary to have a categorical variable?
No. A regression problem is not restricted to only continuous or only categorical variables; it can have only continuous variables, only categorical variables, or a mix of the two.
Should any boolean or binomial data always be converted to a 1/0?
While processing data in Python, it is necessary to convert it into a numeric data type so that mathematical operations can be applied to it. For this reason, it is good practice to convert a binary independent variable to 0 and 1 using any of the encoding methods.
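A minimal pandas sketch (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"is_member": [True, False, True]})    # hypothetical boolean column
df["is_member"] = df["is_member"].astype(int)             # True/False -> 1/0

# a "yes"/"no" style column can be encoded with an explicit mapping instead
df["renewed"] = pd.Series(["yes", "no", "yes"]).map({"yes": 1, "no": 0})
```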
Would log(x2) and x2 have a very high correlation that could cause multicollinearity issues?
log(x2) is a transformed version of x2, and the log is an increasing function. Creating such a column may cause multicollinearity, because after a log transformation the two columns will still have considerable correlation.
Why do we need a non-linear function added to the model?
It is not always possible to get a linear relationship between the dependent and independent variables. In real life, most of the time a non-linear relation captures the actual relation between the variables better. Due to this, we need non-linear functions added to the model.
Would the product of two variables be an interaction between the features?
Yes, it is an interaction between the two variables. When two variables are multiplied, a new feature is generated as an effect of their interaction.
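A minimal sketch of creating such an interaction feature (hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})  # hypothetical features
df["x1_x2"] = df["x1"] * df["x2"]   # the product of the two columns acts as the interaction term
```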
Can you take the log of Y and still have linear regression?
Yes, it is possible. A model is linear when the coefficients enter with power equal to 1; linearity is judged with respect to the parameters, not the variables. So log(y) can still have a linear relationship with the independent variables.
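A minimal sketch, assuming a small hypothetical dataset where y grows roughly exponentially with x:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])     # hypothetical independent variable
y = np.array([2.7, 7.4, 20.1, 54.6])           # roughly exp(x)
model = LinearRegression().fit(X, np.log(y))   # regressing log(y) is still linear in the coefficients
print(model.coef_, model.intercept_)           # slope close to 1, intercept close to 0
```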
Are rescaling, normalizing, and standardizing different?
Normalizing and standardizing are two methods of rescaling/scaling. In standardizing we measure how many standard deviations each data point lies away from the mean of the data, while in normalizing we bring the data back into a certain range of numbers (for example 0 to 1).
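A minimal scikit-learn sketch of the two operations on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])         # hypothetical feature
print(StandardScaler().fit_transform(X).ravel())    # standardizing: (x - mean) / std
print(MinMaxScaler().fit_transform(X).ravel())      # normalizing: rescaled into [0, 1]
```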
Can you provide an example of the weighted least square algorithm? How do we pick the weight?
The weighted least squares algorithm is a method of finding the parameters of a model. It can be applied with any algorithm that uses it as the cost function, for example linear regression. The weights should, ideally, be equal to the reciprocal of the variance of the measurement, and each record has its own weight associated with it.
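A minimal sketch using statsmodels' WLS on simulated heteroscedastic data (the data-generating choices are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 100)
sigma = 0.5 * x                          # noise grows with x, so the data are heteroscedastic
y = 3 + 2 * x + rng.normal(0, sigma)
X = sm.add_constant(x)
w = 1.0 / sigma**2                       # weight each record by the reciprocal of its variance
wls_fit = sm.WLS(y, X, weights=w).fit()
print(wls_fit.params)                    # estimates of the intercept and slope
```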
If adding more variables reduces endogeneity, then how to reduce heteroscedasticity?
One of the most common methods is to use weighted least squares analysis. Giving a different weight to each record (observation), rather than treating them all equally, resolves heteroscedasticity.
Can you give some examples where scaling/ normalizing is necessary and where the regression will work just fine without transforming the data first?
Scaling and normalizing are used when different features of the dataset are on different scales. For example, if a dataset contains weight in kilograms and height in meters, the features have different scales and the data should be scaled. Linear regression works relatively better when there is no large scale difference between the features.
Will the addition of more variables cause overfitting?
When new variables are added to the model, it becomes more complex and may start to capture the noise in the data during training. This can cause the model to overfit.
If we take the log of Y, can we still account for outliers in our prediction model?
Even in the transformed data (after a log transformation), some points may remain far away from the main body of the data points and can still be outliers. Transformation is not a guaranteed fix for outliers.
Can a regression model include both continuous and categorical variables at the same time?
Yes, a regression model can include both categorical and continuous variables at the same time. The only requirement is that the categories in the categorical data be converted into numbers so that the model can be established properly.
Can outliers affect Linear regression?
In linear regression, outliers can adversely affect the model's predictions. A variable with outliers dominates the other variables in terms of its contribution to the model, and it increases the variance of the predictions as well as of the original dataset.
What is a hyperparameter?
When we train a machine learning model, some parameters are estimated during the training process; these are the model parameters. Along with them, there are parameters that we need to pass to the model before training it. We have the freedom to try different values for these parameters and check at which value the model performs best. Such parameters are called the hyperparameters of the model.
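A minimal sketch of searching over one hyperparameter with scikit-learn (the model, the candidate alpha values, and the generated data are all assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
# alpha is a hyperparameter: we pass candidate values and let the search pick the best one
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```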
What does using lasso or ridge regression do to standard errors of coefficient estimates?
When we regularize a model using ridge or lasso regression, we make the model simpler (if it was originally overfitting). This is done by shrinking the values of the coefficients, which causes the standard errors of the coefficient estimates to go down.
What does sparsity mean?
Sparsity means that, out of a given set of values, many are zero. For example, if 60 of the 100 entries of a 10x10 matrix are zero, it is a sparse matrix. In general, if the percentage of zero values is high, we refer to the matrix as sparse.
Why does Lasso force the coefficients to be exactly equal to zero while ridge just shrinks them?
The lasso constraint region has “corners” (in two dimensions it is a diamond). When the sum-of-squares contours hit one of these corners, the coefficient corresponding to that axis is shrunk exactly to zero. The ridge constraint region is a circle with no corners, so ridge shrinks the coefficients toward zero but rarely makes any of them exactly zero.
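A minimal scikit-learn sketch showing the difference on simulated data (the dataset and the alpha value are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print(np.sum(lasso.coef_ == 0))   # lasso typically sets several coefficients exactly to zero
print(np.sum(ridge.coef_ == 0))   # ridge shrinks them but rarely to exactly zero
```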
Is the validation set a subset of the entire dataset? If the validation set and training set work well, then why would we still get errors on our test set?
A validation set is a subset of the data that is set aside to validate the model (to check whether it is performing well or not). Validating a model helps make it as accurate and as general as possible, but this does not mean its accuracy or performance is 100%. Irrespective of its accuracy on the validation set, most models will still make some errors on the test set, because the model cannot capture every pattern present in the test set data.
Could you randomly choose a different validation set rather than setting aside a test set?
One can choose the validation set randomly if desired; it does not have to be the last few records of the dataset. When doing cross-validation in machine learning, different validation sets are chosen in turn to validate the model.
In the k folds method, how do we combine different regression models?
We do not combine them. Cross-validation is a method to find the best-working model out of a set of models on a given dataset: if the average of the performance metrics of a model M1 across the folds is better than that of a model M2, then M1 is preferred, along with its best set of parameters.
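A minimal sketch of comparing two candidate models by their average cross-validation score (the models and the generated data are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
for name, model in [("M1: linear", LinearRegression()), ("M2: ridge", Ridge(alpha=1.0))]:
    scores = cross_val_score(model, X, y, cv=5)   # one score per fold
    print(name, scores.mean())                    # compare models by their average score
```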
Can simulation tell us anything about bias?
Machine learning models can exhibit bias, often because the datasets used to train them are biased. This causes the resulting models to perform poorly on records that are minorities within the training set and ultimately presents higher risks to them. Computer simulations can be used to interrogate and diagnose biases within ML classifiers.
Are “synthetic datasets” created from the population?
Yes, synthetic data is created from the population. If we are given a certain set of features X with n records, the synthetic data is created using these columns and records.
So in bootstrapping we are creating new sample datasets based on permutations/combinations of the original dataset?
In bootstrapping, we create a certain number of new datasets from a given original dataset. This can be done with or without replacement. With replacement means that after we take a record from the data, we put it back before extracting the next record; this causes repetition of records in the bootstrap sample. Without replacement means that a record taken from the data is not put back into the original dataset before the next record is taken.
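A minimal numpy sketch of drawing bootstrap samples with replacement (the toy data and the number of samples are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                 # hypothetical original dataset of 10 records
n_bootstrap = 5
samples = [
    rng.choice(data, size=len(data), replace=True)   # sampling with replacement,
    for _ in range(n_bootstrap)                      # so records can repeat within a sample
]
print(samples[0])
```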
Is bootstrapping different from K-fold cross-validation?
Bootstrapping is different from K-fold cross-validation. Bootstrapping is a method of creating different samples from the same dataset while K-fold cross-validation is a process of splitting the dataset into a number of equal-sized folds in order to train different models.
For the simulation method, how do you draw the new data set? Do you draw from a pdf of the new data points using the current model parameters?
In simulation, new datasets are created from the original dataset and the model parameters. It is used to create a dataset of the desired kind, either to make the data more realistic or for some other real-world purpose.
Does bootstrapping require a minimum amount of samples (n)? Is Monte Carlo simulation some type of bootstrapping?
There is no hard and fast rule for the number of records required to apply bootstrapping. Monte Carlo simulation is different from bootstrapping: bootstrapping uses the original sample as the population from which it extracts samples to create new datasets, whereas Monte Carlo simulation is based on setting up a data-generation process with known values of the parameters.
Is bootstrapping done on a labeled dataset?
Bootstrapping can be done on both labeled and unlabeled data. It is simply a way of creating new sets of data from a given set of records, so it does not need labeled data to create a bootstrap sample.
When we bootstrap are we jointly estimating all parameters in multivariate regression or one at a time?
When we bootstrap, we have n datasets, each with the same set of features. Applying multivariate regression, we train a model on each dataset, so within each model all the parameters are estimated jointly, but the estimation is done separately for each bootstrap dataset. The final prediction is then made by aggregating the predictions of the individual models, for example by majority vote or averaging.
Is bootstrapping a way of simulating the sampling distribution?
Bootstrapping is different from simulation. In bootstrapping we create different samples from the same dataset, while in simulation we create different samples based on certain known parameters.
Does Bootstrapping require a size >= 30 to assure normal distribution of the sample?
In bootstrapping we create a minimum of around 20-30 datasets, but the size of each dataset is equal to the size of the original dataset.
How does taking duplicates make the data different in the case of bootstrapping?
When we bootstrap a dataset, some records are duplicated during the process, but not all of them. When sampling with replacement, on average only about 63% of the original records appear in each bootstrap sample (the rest of the sample consists of duplicates), so the samples differ from one another. If the original dataset itself contains duplicate records, it is preferable to remove them before applying the bootstrap.
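A minimal numpy sketch that demonstrates the roughly 63% figure empirically (the sample size is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.choice(n, size=n, replace=True)     # one bootstrap sample of the indices 0..n-1
print(len(np.unique(sample)) / n)                # close to 1 - 1/e, i.e. about 0.63
```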
Any links/books recommended for the robust methods to handle endogeneity?
Endogeneity is a phenomenon that can occur in any kind of data. To study it in an economics setting, you can refer to the book “Dealing with Endogeneity in Regression Models with Dynamic Coefficients: 6 (Foundations and Trends® in Econometrics)”.
Do graphical models handle causality better?
It cannot be said with certainty. But using a graphical model for a causal relationship has the benefit that the causality can be understood and interpreted well through visual inspection of the model. It represents the causal relations (if they exist) better than non-graphical models.
Will there be different estimates of the model parameters for different folds? Which one should we select?
Cross-validation is a process for finding which model works best among a given set of models. If we have two models M1 and M2, and cross-validation shows that the average performance metric of one of them is better than that of the other, that model is expected to perform better on unseen data, i.e. to generalize better. Once the better model is chosen, it is usually refit on the full training data to obtain a single final set of parameter estimates.
What’s the main difference between MSE and standard error?
The standard error is a statistical term that measures how accurately a sample distribution represents a population, using the standard deviation. MSE is the mean squared error, which is used as the objective that is minimized to find the parameters of a model. They may look like two kinds of error, but they serve different purposes.
Is the benefit of neural nets vs classical machine learning that, in NNs, you don't have to do feature selection, because the algorithm does it for you?
There are many benefits of using neural networks over classical machine learning models, and the one mentioned here is indeed one of them.
For supervised learning, is there a restriction on the number of samples to be used for a model?
It depends on the computational complexity. Generally, the larger the dataset, the better the supervised model trains and the better it predicts, since it has more data to learn from.
Are we already assuming Normal distribution to construct confidence interval?
The interval we derive is exact if we assume a normal distribution. Even if we do not assume it, the sampling distribution is approximately normal due to the Central Limit Theorem, and if the sample is reasonably large (say, more than 50 observations) the normality assumption is pretty safe.
Does the Wald test always have to have the null hypothesis with theta equal to zero?
Not necessarily. In the Wald test we check whether a parameter estimate differs significantly from a hypothesized value, using its standard error: the statistic is (theta_hat - theta_0) / se(theta_hat). The most common choice is theta_0 = 0, which tests whether a coefficient is significant, but the null value does not have to be zero.
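A minimal worked sketch with purely hypothetical numbers for the estimate, its standard error, and the null value:

```python
from scipy import stats

theta_hat, se, theta_0 = 0.42, 0.15, 0.0      # hypothetical estimate, standard error, null value
w = (theta_hat - theta_0) / se                # Wald statistic
p_value = 2 * (1 - stats.norm.cdf(abs(w)))    # two-sided p-value under the normal approximation
print(w, p_value)
```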
Difference between estimator and predictor?
In practice the two terms are often used interchangeably. Strictly speaking, an estimator is a rule for estimating an unknown parameter of the model from data, while a predictor is used to predict the value of the dependent variable for new observations.
What if the normality test on residuals fails?
The residuals should be normally distributed; this is a basic assumption that should not be violated. Maximum likelihood estimation (as used to derive linear regression) does not work if this assumption fails to hold, in which case we should not rely on linear regression.