05 - ML Flashcards

1
Q

What is the difference between noise and variance?

A

While collecting data, the acquisition medium (either human or machine) may make errors; such errors are called noise. Variance, by contrast, measures how far the data points spread from the mean of the data.

2
Q

What is heteroscedasticity? What is generally the shape of heteroscedastic data?

A

Heteroscedasticity is the phenomenon of the data points having different variances along the regression line. Heteroscedastic data generally has an irregular shape, typically a cone, when the residuals are plotted.
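
A minimal sketch of how one might detect this numerically, assuming statsmodels is available; the data is synthetic and illustrative:

```python
# Sketch: detecting heteroscedasticity with the Breusch-Pagan test
# on synthetic data whose error variance grows with x (the cone shape).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 500)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.2 * x)  # noise scale grows with x

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # small p-value -> heteroscedastic
```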

3
Q

How does the weight help with making a prediction?

A

Once the weights are established, the relationship between the dependent and independent variables is fixed; this expression is called the model. To predict unknown data, the values of the independent variables are passed into the model to obtain the dependent variable.

4
Q

Do the coefficients stand for the weights in the model?

A

Yes, the coefficients established after the training process are the weights for each variable in the model. They play a vital role in making the prediction.

5
Q

What is the difference between endogeneity and multicollinearity?

A

Endogeneity is the phenomenon of correlation between the independent variables and the error terms of the model, while multicollinearity is correlation among the independent variables themselves. They are two different concepts.

6
Q

What is the difference between correlation and causation?

A

Correlation between two features only tells us that a relationship (strong or weak) exists; it says nothing about whether one feature gives rise to the other. For example, age and education can be correlated, but neither causes the other. Causation means that one feature brings about the other; for example, poverty causing starvation is a causal effect.

7
Q

Is it necessary to run PCA to resolve multicollinearity?

A

It can help, but it is not necessary in all cases. PCA is used to reduce the number of features when the count is very high; this process is called dimensionality reduction. To get rid of multicollinearity, one can instead remove one feature out of each pair of correlated features.
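
A minimal sketch of the pairwise-removal idea, assuming pandas is available; `drop_correlated` and the 0.9 threshold are illustrative choices, not a standard API:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature out of each highly correlated pair."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [5, 1, 4, 2]})
print(drop_correlated(df).columns.tolist())  # 'b' dropped: perfectly correlated with 'a'
```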

8
Q

What is the maximum likelihood function?

A

The likelihood function combines the likelihoods of occurrence of all the events in a sample. Maximizing this function to obtain the parameters of the model is called maximum likelihood estimation. The likelihood is a function of the parameters, not of the variables of the data.

9
Q

Can autocorrelation cause endogeneity in the data?

A

Autocorrelation is correlation between a variable and a lagged version of itself. In such a case, yes: the error term may end up correlated with the autocorrelated variable, which is endogeneity.

10
Q

How do you detect endogeneity in order to mitigate it?

A

To detect endogeneity, one can collect the error terms and check whether they are correlated with the independent variables. If they are, the feature is endogenous. This can also be tested by visualization: if a variable appears related to the error terms, it can be declared endogenous. One possible way of mitigating endogeneity is to encode categorical variables; encoding creates additional variables that may remove the correlation between a variable and the error term.

11
Q

Is there an overlap between the two terms - heteroscedasticity and endogeneity?

A

No, they are two different phenomena. Heteroscedasticity is the data points having different variances along the best-fit regression line, while endogeneity is correlation between an independent variable and the error terms.

12
Q

In regression, is it needed to have a categorical variable?

A

A regression problem does not need any particular mix of variable types: it can have only continuous variables, only categorical variables, or a mix of the two.

13
Q

Should any boolean or binomial data always be converted to a 1/0?

A

While processing data in Python, it is usually necessary to convert it into a numeric data type so that mathematical operations can be applied. For this reason, it is good practice to convert a binary independent variable to 0 and 1 using any of the encoding methods.
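
A minimal sketch of such an encoding with pandas; the `smoker` column is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"smoker": ["yes", "no", "no", "yes"]})
df["smoker"] = df["smoker"].map({"yes": 1, "no": 0})  # binary -> 0/1
print(df["smoker"].tolist())  # [1, 0, 0, 1]
```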

14
Q

Would log(x2) and x2 have a high enough correlation to cause multicollinearity issues?

A

log(x2) is a transformed version of x2, and the logarithm is an increasing function. Creating such a column may cause multicollinearity, because after a log transformation the two columns will still have some correlation.

15
Q

Why do we need a non-linear function added to the model?

A

It is not always possible to get a linear relationship between the dependent and independent variables. In real life, most of the time a non-linear relation captures the actual relation between the variables better. Due to this, we need non-linear functions added to the model.

16
Q

Would the product of two variables be an interaction between the features?

A

Yes, it is an interaction between the two variables. When two variables are multiplied, as an effect of interaction a new feature is generated.

17
Q

Can you take the log of Y and still have linear regression?

A

Yes, it is possible. Linearity in regression means the model is linear in its parameters: each coefficient appears only to the first power. Linearity is judged with respect to the parameters, not the variables, so log(y) can still have a linear relationship with the independent variables.

18
Q

Are rescaling, normalizing, and standardizing different?

A

Normalizing and standardizing are two methods of rescaling/scaling. In standardizing we measure how many standard deviations each data point lies from the mean of the data, while in normalizing we bring the data into a fixed range of numbers (such as 0 to 1).
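
A minimal sketch contrasting the two methods, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])
print(StandardScaler().fit_transform(X).ravel())  # standardize: (x - mean) / std
print(MinMaxScaler().fit_transform(X).ravel())    # normalize: rescale to [0, 1]
```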

19
Q

Can you provide an example of the weighted least square algorithm? How do we pick the weight?

A

The weighted least squares algorithm is a method of finding the parameters of a model. It can be used with any algorithm that takes it as the cost function, for example linear regression. The weights should ideally equal the reciprocal of the variance of each measurement, and every record has its own weight associated with it.
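
A minimal sketch of weighted least squares with statsmodels on synthetic data where the measurement variance is assumed known:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
sigma = 0.3 * x                           # measurements get noisier as x grows
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma)

X = sm.add_constant(x)
weights = 1.0 / sigma**2                  # weight per record = 1 / variance
wls_fit = sm.WLS(y, X, weights=weights).fit()
print(wls_fit.params)                     # close to the true [1.0, 2.0]
```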

20
Q

If adding more variables reduces endogeneity, then how to reduce heteroscedasticity?

A

One of the prominent methods is weighted least squares analysis. Giving each observation a different weight (typically the reciprocal of its error variance) resolves heteroscedasticity.

21
Q

Can you give some examples where scaling/ normalizing is necessary and where the regression will work just fine without transforming the data first?

A

Scaling and normalizing are used when different features of the dataset are on different scales. For example, if a dataset contains weight in kg and height in metres, the features have different scales and the data should be scaled. Linear regression works relatively better when there is no big scale difference between the features.

22
Q

Will the addition of more variables cause overfitting?

A

Adding new variables makes the model more complex, so during training it will also try to capture the noise in the data. This can cause the model to overfit.

23
Q

If we take the log of Y, can we still account for outliers in our prediction model?

A

It is possible that even in the transformed data (after a log transformation) some points lie far away from the main herd of data points, and those can still be outliers. Transformation is not a guaranteed fix for outliers.

24
Q

Can a regression model include both continuous and categorical variables at the same time?

A

Yes, a regression model can include both the categorical and continuous variables at the same time. The only thing is the categories in the categorical data need to be converted into numbers so that the model can be established properly.

25
Q

Can outliers affect Linear regression?

A

In linear regression, outliers can adversely affect the model's predictions. A variable with outliers dominates the other variables in terms of its contribution to the model, and outliers increase the variance of the predictions as well as of the original dataset.

26
Q

What is a hyperparameter?

A

When we train a machine learning model some parameters are estimated during the training process. These are the model parameters. Along with them, there are some parameters that we need to pass to the model while training it. We have the freedom to pass different values to these parameters and check at what value the model is performing better. Such parameters are called hyperparameters of the model.

27
Q

What does using lasso or ridge regression do to standard errors of coefficient estimates?

A

When we do regularization of the model using Ridge or Lasso regression then we make the model simpler (if it was originally overfitting). This is done by reducing the values of the coefficients. This causes the standard error to go down.

28
Q

What does sparsity mean?

A

Sparsity means that, out of a given number of values, many of them are zero. For example, if a 10x10 matrix has 60 zeros among its 100 entries, it is a sparse matrix. In general, if the percentage of zero values is high, we call the matrix sparse.

29
Q

Why does Lasso force the coefficients to be exactly equal to zero while ridge just shrinks them?

A

The lasso's constraint region has "corners", which in two dimensions form a diamond. If the sum-of-squares contours "hit" one of these corners, the coefficient corresponding to that axis is shrunk exactly to zero.
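
A minimal sketch of this contrast with scikit-learn on synthetic data; the alpha value is arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)  # features 2 and 3 are irrelevant

print(Lasso(alpha=0.5).fit(X, y).coef_)  # irrelevant coefficients become exactly 0
print(Ridge(alpha=0.5).fit(X, y).coef_)  # all coefficients shrunk but non-zero
```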

30
Q

Is it a subset of the entire dataset? If the validation set and training set work well then why would we still get errors in our test set?

A

A validation set is a subset of the data chosen to validate the model, i.e., to check whether it is performing well. Validating a model helps make it as accurate and generic as possible, but it does not mean the accuracy or performance of the model is 100%. Irrespective of the accuracy on the validation set, most models are bound to make some errors on the test set, because the model cannot capture every pattern in the test data perfectly.

31
Q

Could you randomly choose a different validation set rather than setting aside a test set?

A

One can choose the validation set randomly if desired; it need not be the last few records of the dataset. In cross-validation in machine learning, different validation sets are chosen in turn to validate the model.

32
Q

In the k folds method, how do we combine different regression models?

A

We do not combine them. Cross-validation is a method to find the best-working model out of a set of models over a given dataset: if the average performance metric of a model M1 is better than that of a model M2, then M1 is preferred, with its best set of parameters.
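
A minimal sketch of this comparison with scikit-learn; the two candidate models and the scoring choice are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
for name, model in [("M1: OLS", LinearRegression()), ("M2: Ridge", Ridge(alpha=1.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(name, -scores.mean())  # lower average MSE -> preferred model
```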

33
Q

Can simulation tell us anything about bias?

A

Machine learned models exhibit bias, often because the datasets used to train them are biased. This causes the resulting models to perform poorly on records that are minorities within the training set and ultimately present higher risks to them. Computer simulations are used to interrogate and diagnose biases within ML classifiers.

34
Q

Are “synthetic datasets” created from the population?

A

Yes, synthetic data is created from the population. If we are given a certain set of features X with n number of records then the synthetic data is created using these columns and records.

35
Q

So, in bootstrapping, are we creating new sample datasets based on permutations/combinations of the original dataset?

A

In bootstrapping, we create a certain number of datasets from a given original dataset by sampling records from it. Sampling with replacement means that after taking one record we put it back before extracting the next; this causes repetition of records in the bootstrap sample. Sampling without replacement means a drawn record is not put back into the original dataset before the next draw. Classical bootstrapping samples with replacement.
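
A minimal sketch of sampling with replacement using numpy, here to estimate the standard error of the mean:

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.array([4.1, 5.2, 6.3, 7.4, 8.5])

# Each bootstrap sample is the size of the original and may repeat records.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]
print(np.mean(boot_means), np.std(boot_means))  # estimate and its standard error
```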

36
Q

Is bootstrapping different from K-fold cross-validation?

A

Bootstrapping is different from K-fold cross-validation. Bootstrapping is a method of creating different samples from the same dataset while K-fold cross-validation is a process of splitting the dataset into a number of equal-sized folds in order to train different models.

37
Q

For the simulation method, how do you draw the new data set? Do you draw from a pdf of the new data points using the current model parameters?

A

Given the original dataset and the model parameters, new datasets are created in a simulation. Simulation is used to create datasets of the desired kind, either to make the data more realistic or for other real-world purposes.

38
Q

Does bootstrapping require a minimum amount of samples (n)? Is Monte Carlo simulation some type of bootstrapping?

A

There is no hard and fast rule for selecting the number of records when applying the bootstrap. As for the second question: Monte Carlo simulation is different from bootstrapping. Bootstrapping uses the original sample as the population from which it extracts samples to create new datasets, whereas Monte Carlo simulation is based on setting up a data-generating process with known values of the parameters.

39
Q

Is bootstrapping done on a labeled dataset?

A

Bootstrapping can be done on both labeled and unlabelled data. It is simply about creating new sets of data from a given set of records, so it does not need labels.

40
Q

When we bootstrap are we jointly estimating all parameters in multivariate regression or one at a time?

A

With the bootstrap we have n datasets, each with a certain number of features. Applying multivariate regression, we train a model on each dataset and take the final prediction as the majority (or average) of the predictions made by the models. So the parameters are estimated jointly within each model, but separately for every model.

41
Q

Is bootstrapping a way of simulating the sampling distribution?

A

Bootstrapping is different from simulation: in bootstrapping we create different samples from the same dataset, while in simulation we create different samples based on certain known parameters.

42
Q

Does Bootstrapping require a size >= 30 to assure normal distribution of the sample?

A

In the bootstrap we typically create at least 20-30 datasets, but the size of each dataset is kept equal to the size of the original dataset.

43
Q

How does taking duplicates make the data different in bootstrapping?

A

When we bootstrap a dataset, some records are duplicated during the process, but not all of them. Even when sampling with replacement, on average only about 63% of the original records appear in each bootstrap sample. If the dataset already has duplicate records, it is preferable to remove those duplicates before applying the bootstrap.

44
Q

Any links/books recommended for the robust methods to handle endogeneity?

A

Endogeneity is a phenomenon that can occur in any kind of data. To study it in economics, you can refer to the book “Dealing with Endogeneity in Regression Models with Dynamic Coefficients: 6 (Foundations and Trends® in Econometrics)”.

45
Q

Do graphical models handle causality better?

A

It cannot be said with certainty. But using a graphical model for a causal relationship gives the benefit of understanding and interpreting the causality well through visual inspection of the model. It represents causal relations (if they exist) better than non-graphical models.

46
Q

There will be different estimates of the model parameters for different folds. Which one should we select?

A

Cross-validation is a process for finding which model works better among a given set of models. If we have two models M1 and M2 and apply cross-validation to them, the model whose average performance metrics are better is expected to perform better on unseen data; it generalizes better in an unseen scenario. (The fold-wise parameter estimates themselves are not selected; the chosen model is retrained on the training data.)

47
Q

What’s the main difference between MSE and standard error?

A

The standard error is a statistical term that measures how accurately a sample distribution represents a population, using the standard deviation. MSE is the mean squared error, which is minimized in order to find the parameters of a model. They look like two kinds of error, but they have different purposes to fulfill.

48
Q

Is the benefit of neural nets vs. machine learning that in NNs you don't have to do feature selection…the algorithm does it for you?

A

There are a lot of benefits of using neural networks over classical machine learning models. What is mentioned here is one of those benefits.

49
Q

For supervised learning, is there a restriction on the number of samples to be used for a model?

A

It depends on the computational complexity. Generally, the larger the data, the better the supervised model trains and the better it predicts, since it has more data to learn from.

50
Q

Are we already assuming a normal distribution to construct the confidence interval?

A

The equation we derive will be exactly correct if we assume a normal distribution. Even if we do not assume it, the distribution is approximately normal by the Central Limit Theorem, and when there is enough data (say, more than 50 observations), the assumption of normality is pretty safe.

51
Q

Does the Wald test always have to have the null hypothesis with theta equal to zero?

A

Yes. In the Wald test we compare an estimated parameter with its standard error: the test statistic is the estimate divided by its standard error, and it checks whether the parameter differs significantly from zero. Since the test revolves around this ratio, we compute it against zero.

52
Q

Difference between estimator and predictor?

A

In practice, the two terms are often used interchangeably. Strictly speaking, an estimator targets a fixed population parameter, while a predictor targets a new, unobserved outcome.

53
Q

What if the normality test on residuals fails?

A

The residuals should be normally distributed. This is the basic assumption which should not be violated. Maximum likelihood estimation does not work if this assumption doesn’t hold true, in which case we should not use linear regression.

54
Q

Could you please explain “matrix full rank”?

A

A matrix is said to have full rank if its rank i.e. the number of independent columns equals the largest possible for a matrix of the same dimensions, which is the lesser of the number of rows and columns. A matrix is said to be rank-deficient if it does not have full rank.

55
Q

How do we detect high correlation between two variables?

A

We can check by computing the correlation between two desired variables. The correlation coefficient, r, is a number between -1 and 1 and tells us how well a regression line fits the data. It gives the strength and direction of the relationship between two variables. The relationship between two variables is generally considered strong when their r value is larger than 0.7.
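
A minimal sketch of computing r with numpy on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 2 * x + rng.normal(0, 0.5, 100)
r = np.corrcoef(x, y)[0, 1]
print(r)  # |r| > 0.7 is commonly read as a strong relationship
```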

56
Q

Can multicollinearity be addressed by using R-Square?

A

No, R-Square cannot detect or address multicollinearity. R-Square only explains the proportion of variance of the target variable that is explained by the independent variables.

57
Q

Can we remove collinearity by removing variables? And which variable to remove?

A

There are many ways. One is to drop the variable that has less correlation with the target variable. We can also check the variance inflation factor (VIF): variables with high VIF values can be eliminated. This is a common industrial approach.

58
Q

If the removed variable has some causal effect on the dependent variable, does the model account for that?

A

No, the model accounts only for the variables it sees. The model is mathematics; the solution needs human intervention, and that is where domain knowledge comes into play.

59
Q

Can Hypothesis testing help to identify a Causal relationship?

A

No statistical test can establish a causal relation. The only way is to run controlled experiments. Since controlled experiments are not always possible, economists also rely on natural experiments.

60
Q

Will the regression generate the best fit line?

A

Yes. The line of best fit is generated by regression, which gives us the predictions with minimum error.

61
Q

In the case of endogeneity can we perform regression after clustering on each cluster?

A

That is actually a popular approach in practice: segment first, then run a regression separately on each segment. But the better approach is to include as many of the hidden variables as possible, increasing the dimensionality of the data.

62
Q

Will adding Z as a new variable reduce endogeneity?

A

Yes. Z may contain any hidden information and generally adding more variables can mitigate endogeneity.

63
Q

Can we use discrete(categorical) variables in regression?

A

Regression is fine if your independent variables are discrete. It does not affect the model in any way.

64
Q

Do we assume the variables are independent of each other?

A

We never know whether the variables are completely independent of each other; there may be slight associations between them. If there is a very strong correlation that leads to collinearity, it should be taken care of.

65
Q

Why do we use logarithm?

A

It is a part of the data transformation technique to improve the model to best fit the data. Using the logarithm of one or more variables improves the fit of the model by transforming the distribution of the features to a more normally-shaped bell curve.

66
Q

If we add non-linear data, will the model stay linear?

A

Even though we add data points that are nonlinear the coefficients that are learned in the model will be linear. Hence linear regression can be applied.

67
Q

Is augmentation like using more harmonics in the Fourier equation?

A

Yes. Instead of different frequencies, we augment the data by adding new variables, using interactions and polynomials, to better fit the data.

68
Q

Does augmentation affect the model explainability?

A

It slightly affects the explainability but it is always a tradeoff between explainability and accuracy of the model.

69
Q

Will the new feature replace the old one or will be added to the data set?

A

New features will be added along with the existing features and not replaced in the dataset. Features are only removed if they are not helpful for the model building process.

70
Q

What does “ combined effect of TV and Radio “ mean intuitively? TV has affected, Radio has affected, and they seem to be independent variables.

A

It means that the combination of TV and Radio can bring a bigger effect than each of them individually: in some cases they may not be effective individually but will be more effective together.

71
Q

How to choose the best value of alpha in regularization?

A

Different values of alpha are evaluated. Too high an alpha can make the prediction collapse to a constant value (either zero or some fixed high value), so alpha should be picked based on the task at hand. K-fold cross-validation can be used to trial different values of alpha, and the one that gives the least error is selected.
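
A minimal sketch of picking alpha by cross-validation with scikit-learn's LassoCV; the candidate grid is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
model = LassoCV(alphas=np.logspace(-3, 2, 50), cv=5).fit(X, y)
print("best alpha:", model.alpha_)  # alpha with the lowest cross-validated error
```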

72
Q

What is theta hat in lasso? Give a brief explanation of it.

A

Theta hat for lasso represents the coefficients of the variables that are used in the model. So in the case of lasso if we increase the penalty the theta hat tends to move to zero for the variables that are to be dropped.

73
Q

Could you please explain how penalizing large Thetas helps prevent overfitting?

A

As we add more and more parameters to our model, its complexity increases, which results in increasing variance and decreasing bias, i.e., overfitting. We need to find the optimum point where the decrease in bias is balanced by the increase in variance. We can reduce overfitting by penalizing large coefficients: the higher the value of alpha, the bigger the penalty and the more the magnitude of the coefficients is reduced.

74
Q

If there is more noise in the data does it lead to more false positives and less true positives?

A

Yes. Noise tends to produce more false positives and false negatives (and hence fewer true positives), since it degrades the performance of the model.

75
Q

What makes Ridge regression compute faster than Lasso?

A

It all depends on the computing power and data available to perform these techniques on statistical software. Ridge regression is faster compared to lasso but then again lasso has the advantage of completely reducing unnecessary parameters in the model.

76
Q

Can R square be used for measuring the prediction error?

A

R-Square explains the proportion of variance for a target variable explained by the independent variables and it is not used as an error metric. MSE, RMSE, etc are used as a metric for measuring the error.

77
Q

Is the problem with overfitting that it produces too many errors, or is the error due to very high computation?

A

It produces misleading values: we have made the model so complex that it learns the detail and noise in the training data to the extent that its performance on new data is negatively impacted. It is not a computation issue; the error is high due to the overfitting itself.

78
Q

What is overfitting and what does high variance mean?

A

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new data. This corresponds to high variance: as the model is made more complex, it learns the training data too closely, and when it tries to predict on new data it fails, leading to poor performance.

79
Q

What does bias mean?

A

Bias is the difference between the values predicted by the ML model and the correct values. Being highly biased gives a large error on training as well as testing data. An algorithm should be low-bias to avoid the problem of underfitting. In general, a high-bias model is too simple, and we should increase the complexity of the model by adding more variables.

80
Q

What is the difference between the validation set and the test set?

A

We use the validation set to test the performance of each method and the test set to test the performance of the final model/method that is chosen. The test set is untouched till the final testing of the model.

81
Q

With small sample sizes, is this third test set still recommendable?

A

Yes, although the test dataset will be smaller in size. The main dataset can be split, for example, 80:10:10 into train, validation, and test sets. The test set is then used only for the final testing of the selected model and is kept hidden from the training and validation process.
The other, often more useful, approach for small datasets is to use cross-validation.

82
Q

Would it be advisable to repeat training+validation several times with different splits of training/validation, and then use statistics of the error?

A

Yes, we can do that. We can form a grid of training/validation splits and go with whichever split gives us the best result.

83
Q

How big should each of these data set sizes be? Seems like we’re dividing our data up, again and again, to be more thorough, but the resulting subsets are smaller.

A

Generally, the training dataset will be larger in size since the model is trained on this training data and the validation dataset is used for evaluating the model. Then the test dataset which is kept hidden is used to test the final selected model to see how well the model performs. Generally, they are kept in the ratio of 70:20:10 or 80:10:10. The split depends on the size of the dataset and the train-validation-test split is an ideal method.

84
Q

Could LOOCV lead to overfitting?

A

When the data set is small and of the order of a few thousand, it can lead to overfitting.

85
Q

After doing the regression n times in LOOCV, do we take the average thetas?

A

Not really. Cross-validation is only used to check the performance of the model. After that regression is done on the training dataset to train the model.

86
Q

Does irreducible noise mean that the noise is constant at that point?

A

No. Irreducible noise means the noise cannot be removed: even if we knew everything about the variable, there would still be some noise or randomness in the behavior of the variable's data.

87
Q

Do you vote to select the theta of the best performing model when we do cross-validation?

A

No. Cross-validation is used to test the performance, not to build the model. You then run the regression again with a different set of variables and parameters to improve performance.

88
Q

Are the folds in k-fold sequential or random?

A

The folds are chosen randomly for non-time-series data. For time-series data, they need to be chosen sequentially.

89
Q

Do all the folds need to be the same size?

A

Ideally, it should be the same. They can be unequal but there is no reason to not divide them equally.

90
Q

Does MSE account for both bias and variance?

A

Yes. MSE accounts for bias as well as variance. MSE tells us about the error in our prediction which accounts for both bias and variance.

91
Q

Can I use k-fold in clustering?

A

K-Fold is used for supervised learning. Once the model is trained on K-1 folds, it can be tested on the hold-out set. In unsupervised learning, we cannot test the model trained on K-1 folds because we don’t have labels.

92
Q

Why not use random subsets of the dataset?

A

Because the sample size of a random subset differs from that of our original data set, the amount of noise will vary (noise is larger with less data and smaller with more data). We want to avoid such differences in the estimator's performance.

93
Q

In the bootstrap method is there sampling with replacement?

A

Bootstrap Sampling is a method that involves drawing sample data repeatedly with replacement from a data source to estimate a population parameter. It means a data point in a drawn sample can reappear in future drawn samples as well.

94
Q

What is the mean of a sample?

A

Sum of all data points in the sample divided by the number of data points in that sample.

95
Q

What does pick a ‘random sample’ mean?

A

To randomly pick a sample means every data point is equally likely to be picked.

96
Q

Must the sample be equal size to the original data set in the bootstrap method?

A

No, a sample generated using the bootstrap doesn't have to be the same size as the original data, but usually it is kept the same size, and the records not selected while sampling (the out-of-bag records) can be used to validate the model.

97
Q

Will bootstrapped parameters be normally distributed by the CLT?

A

Yes, ideally it should, if one believes the CLT gives a true estimate of the population parameters.

98
Q

What are the hyperparameters?

A

These are adjustable parameters that must be tuned in order to obtain a model with optimal performance. For example, alpha in regularization is a hyperparameter.

99
Q

Isn’t k-fold cross-validation for testing the generalizability of our model?

A

Yes. If the model performs equally well over all folds or subsets, we can go ahead with the model.

100
Q

How can we pick the number k in k-fold?

A

There is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance. However, it is a heuristic to choose k.

101
Q

Why are coefficients reduced to zero in lasso?

A

Lasso selects only some features while reducing the coefficients of the others to zero. This property, known as feature selection, is absent in ridge regression. Lasso is generally used when we have many features, because it does feature selection automatically.
We can use Lasso when we want to discover a sparsity structure. If we do not care about the sparsity structure, we can use either Lasso or Ridge. Lasso is not very useful when there is strong multicollinearity among the independent variables.

102
Q

In the Heteroskedasticity slide, the linear regression image does not look wrong in the case of TV/Sales. Can you give an example for a prediction that does not work well with Linear Regression?

A

The regression line seems good in that image, but it is difficult to interpret the results using the regression line alone. You must interpret all the results before making final decisions, for example whether: 1. all the variables are independent of each other; 2. all the variables are significant; 3. the error terms are normally distributed; etc.

Consider a dataset with one continuous independent variable and a continuous dependent variable. Imagine, those values are distributed over 2-dimensional space which forms a circular shape. If linear regression is applied to this dataset, the prediction will not be efficient. The residuals will be very high for most of the points.

103
Q

What is meant by “full rank”?

A

Multicollinearity - The extent to which independent variables are correlated. A basic assumption of the linear regression model is that the rank of the matrix of observations on independent variables is the same as the number of explanatory variables. In other words, such a matrix is of full column rank. This implies that all the independent variables are independent of each other, and there is no linear relationship among them.

104
Q

Does multicollinearity happen when some of the independent variables are very correlated?

A

Yes. Multicollinearity happens when independent variables in the regression model are highly correlated to each other. The regression algorithm assumes that the independent variables are not correlated with each other, so this assumption must be met if we are to proceed with building a linear regression model.

105
Q

Will PCA explain the relative effect of TV vs Radio? If no, why not?

A

No. It will not explain the relative effect. PCA just tells if our data has redundancy. Then we can throw away those redundant variables. PCA only looks at the independent variables. But to explain the relative effect of these two variables, we have to consider the dependent variable as well.

106
Q

Is there some way to measure multicollinearity?

A

Yes. Look at the eigenvalues of the matrix Xt * X: if the smallest eigenvalue is close to zero, that is evidence of multicollinearity (and its inverse, (Xt * X)^(-1), which appears in the coefficient estimates, becomes unstable).
Here, Xt represents the transpose of the X matrix and (-1) represents the inverse of the matrix.
In practice, we can also use the Variance Inflation Factor (VIF) to check the multicollinearity among variables.
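
A minimal sketch of both checks on synthetic near-collinear data, assuming statsmodels is available:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(0, 0.01, 300)        # nearly collinear with x1
X = np.column_stack([x1, x2, rng.normal(size=300)])

print(np.linalg.eigvalsh(X.T @ X).min())  # near zero -> multicollinearity
Xc = sm.add_constant(X)
print([variance_inflation_factor(Xc, i)   # very large VIF flags x1 and x2
       for i in range(1, Xc.shape[1])])
```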

107
Q

What if the ‘X’ is multicollinear only for a segment of the data?

A

That is not a problem. If the correlation is significant, it will still give enough evidence for multicollinearity that ‘X’ is correlated with some other variables.

108
Q

How to differentiate between multicollinearity vs correlation?

A

Correlation refers to the linear relationship between two variables, while multicollinearity is defined for a regression model, where some features have a strong relationship with other variables or combinations of other variables. It may affect the results and the interpretability of the linear regression model.

109
Q

Correlation does not mean causation - Is that true?

A

Yes, correlation does not imply causation. Causation applies to cases when action A causes outcome B. But correlation is simply a relationship, i.e., action A relates to action B.

110
Q

Is regression better when there is a direct causality?

A

We can also use regression when we do not have causality; we can still use it to make predictions. For example, in a family with two children, the height of one child does not cause the height of the other. There is no causal relation, but using regression we can predict the height of one child from the height of the other.

111
Q

When to add/remove variables?

A

It is a heuristic approach. In linear regression, add variables and see whether they improve the prediction in terms of R-squared; if they give significant improvements, keep them. Also try taking variables out and observe whether the R-squared value suffers; if not, those variables are probably not relevant and can be removed.

112
Q

What if a categorical variable has more than two values?

A

There can be categorical variables with more than two categories. We can create a dummy variable for each category, where each column represents a different category and the entry is 1 in the column belonging to the record's category and 0 elsewhere.
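
A minimal sketch with pandas; the `city` column is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "SF", "NY"]})
print(pd.get_dummies(df["city"], prefix="city"))  # one 0/1 column per category
```

Passing drop_first=True drops one category, which avoids the dummy-variable trap (perfect collinearity with the intercept).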

113
Q

Do we want to check the adjusted R^2 for each new feature separately?

A

Yes, it can be helpful because the R-Squared value never decreases no matter the number of variables we add to our regression model. That is, even if we are adding redundant variables to the data, the value of R-Squared does not decrease. It either remains the same or increases with the addition of new independent variables.
On the other hand, the Adjusted R-squared takes into account the number of independent variables used for predicting the target variable. In doing so, we can determine whether adding new variables to the model increases the model fit. It can decrease if that is not the case.
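
A minimal sketch of the adjusted R-squared formula, where n is the sample size and p the number of predictors:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.85, n=100, p=5))   # ~0.842
print(adjusted_r2(0.85, n=100, p=40))  # ~0.748: heavier penalty for more predictors
```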

114
Q

How do we know when to stop adding variables?

A

We can observe the adjusted R-squared and see if it improves much by adding new variables. We can use the validation set to get a good estimate of whether new variables add value to the model or not.
Note: There is also a trade-off between the computational time and the model performance. If we add, say, 5 variables to the data for the increase of 0.1 Adjusted R-Squared, then depending on the resources and application, we can decide whether it is a significant increase for the dimension and the computation time that we are increasing.

115
Q

Is alpha (Regularization Hyperparameter) <= 1?

A

Alpha is a parameter for the regularization term (penalty term) that combats overfitting by constraining the size of the coefficient values. And the alpha value is just a matter of units. It can take any value.

116
Q

When exactly do we use Ridge and when do we use LASSO?

A

Use Lasso when we want to discover sparsity structure. If we do not care about the sparsity structure, we can use either Lasso or Ridge. These days computers are good enough, we can try both techniques and see which one works best.

117
Q

Is there a threshold for sparsity?

A

It depends on how much data we have. If we have little data, we are not going to believe a model that has 100 coefficients. We may look for a sparse structure that has fewer thetas. But if we have billions of data, then even a structure with 1000 coefficients may be fine. It is relative to the size of the dataset.

118
Q

Is the validation set the same as the test set?

A

No, both data are different. The validation set is the data that is used to validate the results of our trained model and tuning model hyperparameters. The test set is the unseen data that is used to check whether the model is giving a generalized performance or not.

119
Q

What would be the minimum sample size recommended for this validation method?

A

It depends on the amount of data we have. Generally, we consider 60% data as a train set, 20% data as a validation set, and 20% as a test set.

120
Q

Could we change our validation set and run multiple iterations of our experiment until we get a good model?

A

Yes, this is what we do in K-fold cross-validation. Initially, we split the dataset into k groups. For each unique group, iterate the following procedure.
- Take one group as the validation dataset
- Take the remaining groups as the training dataset
- Fit a model on the training set and evaluate it on the validation set
We can summarize the performance of the model using the average scores on each group.

121
Q

Is there any difference between the validation process and cross-validation?

A

Validation - Divides the original training dataset into two different subsets, say a training set and a validation set. The training set is used for training, and the validation set is used for assessing the performance of the model and tuning hyperparameters. The model is never trained on the validation set.
Cross-validation- The original training data is divided into ‘k’ number of subsets. In one epoch, use ‘k-1’ subsets of data for training and use the remaining dataset for validating. Like this, for every epoch, the validation dataset will be different. This is also called K-fold cross-validation.

122
Q

Could we use different random states and have different training/validation/test sets?

A

The random_state parameter is used for initializing the internal random number generator. Setting random_state a fixed value will guarantee that the same sequence of random numbers is generated each time we run the code.
We can get different training/validation/test sets if we change random states, but we keep it the same throughout our analysis to avoid any bias and use cross-validation to assess the model performance.

123
Q

How does RSS compare to MSE?

A

The residual sum of squares (RSS) measures the level of variance in the error term of a regression model. The smaller the residual sum of squares, the better the model fits the data; the greater the residual sum of squares, the poorer the model fits the data.
The mean squared error (MSE) is used to test the performance of the fitted models. It is related to RSS by the following equation:
MSE(Mean Squared Error) = (1/N) * RSS, where ‘N’ is the number of samples.
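
A quick numeric check of that relation:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
rss = np.sum((y_true - y_pred) ** 2)  # 0.25 + 0.25 + 1.0 = 1.5
print(rss, rss / len(y_true))         # RSS = 1.5, MSE = 0.5
```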

124
Q

How do we judge our model performance based on MSE?

A

Mean Square Error (MSE) is defined as the Mean (or Average) of the square of the difference between actual and estimated values. The smaller value of MSE indicates that the predicted values are closer to the actual values, i.e., a better model.

125
Q

In cross-validation, what would happen if the value of ‘k’ is the same as the total number of data points?

A

In cross-validation, if ‘k’ is equal to the number of data points, then for each fold we have only one data point in the validation set and the rest of the observations are used to train the model. It is called Leave-One-Out-Cross-Validation (LOOCV).
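
A minimal sketch of LOOCV with scikit-learn on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=3, noise=2.0, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print(-scores.mean())  # average error over the 50 single-point validation folds
```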

126
Q

What is meant by “With Replacement” in bootstrapping?

A

Bootstrapping is a resampling technique that generates a sampling distribution by repeatedly taking random samples from the known sample, with replacement. Here, ‘with replacement’ means a data point in a drawn sample can reappear in future drawn samples. This makes each drawn sample independent of the samples previously drawn.
If we did not sample with replacement, we would always get the same data (when the number of data points drawn equals the number of data points in the original data).

127
Q

In the bootstrap method, what is the relation between ‘n’ & ‘m’? Is m

A

Here, ‘n’ is the size of the original dataset. ‘m’ is the number of new bootstrap datasets. We can pick any value for ‘m’. It can be smaller, equal, or larger than ‘n’. In general, we keep it ‘m’ to be equal to ‘n’ but in certain scenarios, they can be different. For example, if we want to know details of the tails of the sampling distribution, we would need a bigger ‘m’.

128
Q

Is bootstrapping the same as permutation?

A

No. A permutation would be to choose each one of the data points just once. If every data point is chosen once, then the dataset would be essentially the same as the original dataset. But in bootstrapping, we pick some duplicates (sampling with replacement) to create a new dataset.

129
Q

Does bootstrap allow us to approximate the sampling distribution?

A

Yes, this is where the bootstrap helps us. With the bootstrap we draw samples from an approximation of the sampling distribution, and a histogram of those samples approximates the true sampling distribution. Once we have the sampling distribution, we see how wide it is; its standard deviation gives the standard error.

130
Q

Does the bootstrap method create a new dataset with the same mean and standard deviation of the sample set, or does it choose actual records randomly from the sample set?

A

Bootstrap chooses samples randomly from the actual dataset with replacement. Bootstrapping is a general approach to statistical inference based on building a sampling distribution for a statistic by resampling from the data that we have.

131
Q

What do the diagonal elements represent in the covariance matrix?

A

The covariance matrix is a symmetric matrix that shows the covariance of each pair of variables. Each diagonal element is the covariance of a variable with itself, which is nothing but the variance of that variable. So the diagonal elements represent the variances of the respective variables.
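
A quick numeric check with numpy:

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(size=(100, 3))
cov = np.cov(data, rowvar=False)  # 3x3 covariance matrix
print(np.allclose(np.diag(cov), data.var(axis=0, ddof=1)))  # True: diagonal = variances
```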

132
Q

Is it possible or common to see the p-value high but the CI band not including the value zero? And vice versa?

A

No, the two things are equivalent. If the confidence interval includes zero, that means there is no statistically meaningful or statistically significant proof that the variable helps to predict the target variable. It is the same as saying the p-value is high (>0.05). Similarly, when CI doesn’t capture zero the p-value will be low.

133
Q

How do we decide when to use linear regression or other methods?

A

Linear regression is a basic model and it can be the first model to try for a regression problem. The linear regression model is widely used in many situations before attempting non-linear and more complicated models. It is the most accomplished theoretical model and helps in interpretability; some key concepts can also be explained well using the linear regression model.
In practice, we use multiple models of different kinds and the algorithm that gives the best results depends on the data and the problem on hand.

134
Q

In Lasso regularization, if one theta is 0, do we need to remove that variable and redo theta calculation for the remaining variables?

A

Lasso regularization can be used for feature selection. If any variable coefficient (theta) is zero, we can remove that variable. We can build the model with the remaining variables which have non-zero coefficient values.