05 - ML Flashcards
What is the difference between noise and variance?
While collecting data, the acquisition medium (human or machine) may introduce errors; such errors are called noise. Variance measures how far the data points spread out from the mean of the data.
What is heteroscedasticity? What is generally the shape of heteroscedastic data?
Heteroscedasticity is the phenomenon of the data points having different variance along the regression line. Heteroscedastic data generally has an irregular, cone-like shape when the residuals are plotted against the fitted values.
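As an illustration, a minimal Python sketch (assuming simulated data and using statsmodels' Breusch-Pagan test, neither of which is mentioned in the card) can check whether the residual variance changes along the regression line:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, x)              # error spread grows with x, giving a cone shape
X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(lm_pvalue)                          # a small p-value suggests heteroscedasticity
```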
How do the weights help with making a prediction?
Once the weights are established, the relationship between the dependent and independent variables is fixed; this expression is the model. To predict on unknown data, the values of the independent variables are passed into the model to obtain the dependent variable.
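A minimal sketch of this idea, with purely hypothetical weights and feature values:

```python
import numpy as np

# hypothetical weights obtained from training: intercept first, then one weight per feature
w = np.array([1.5, 0.8, -0.3])
x_new = np.array([1.0, 2.0, 4.0])   # 1 for the intercept, followed by the feature values
y_pred = w @ x_new                  # pass the independent variables through the model
print(y_pred)                       # 1.5 + 0.8*2.0 - 0.3*4.0 = 1.9
```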
Do the coefficients stand for the weights in the model?
Yes, the coefficients established during the training process are the weights of the variables in the model. They play a vital role in making predictions.
What is the difference between endogeneity and multicollinearity?
Endogeneity is the presence of correlation between the independent variables and the error term of the model, while multicollinearity is the presence of correlation among the independent variables themselves. They are two different concepts.
What is the difference between correlation and causation?
Correlation between two features only tells us that a relationship (strong or weak) exists; it says nothing about whether one feature gives rise to the other. For example, age and education can be correlated, but neither causes the other. Causation means that one feature brings about the other; for example, poverty causing starvation is a causal effect.
Is it necessary to run PCA to resolve multicollinearity?
PCA can help, but it is not necessary in all cases. PCA is used to reduce the number of features when their count is very high; this process is called dimensionality reduction. To get rid of multicollinearity, it is often enough to remove one feature out of each pair of highly correlated variables.
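A minimal pandas sketch of the drop-one-of-each-pair approach (the 0.9 threshold is an arbitrary assumption, not something from the card):

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one feature out of every pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```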
What is the maximum likelihood function?
The likelihood function is the combined likelihood of all the events in a sample occurring together. When this function is maximized to obtain the parameters of the model, it is referred to as the maximum likelihood function. It is a function of the parameters, not of the variables in the data.
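A minimal sketch of the idea, assuming a small set of hypothetical coin-flip observations and using scipy to maximize the (log-)likelihood over the parameter p:

```python
import numpy as np
from scipy.optimize import minimize_scalar

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical coin flips (1 = heads)

def neg_log_likelihood(p):
    # the likelihood combines the probability of every observed event;
    # minimizing the negative log-likelihood is the same as maximizing the likelihood
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)   # close to the sample proportion 6/8 = 0.75
```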
Can autocorrelation cause endogeneity in the data?
Autocorrelation is the presence of correlation between a variable and its lagged version. In such a case the error term may become correlated with that variable, which introduces endogeneity.
How do you detect endogeneity, and how do you mitigate it?
To detect endogeneity, one can collect the error terms and check whether they are correlated with a given feature; if they are, that feature is endogenous. This can also be tested visually: if the variable appears to be related to the error terms, it can be declared endogenous. One possible way of mitigating endogeneity is to encode the categorical variables. Encoding creates additional variables, which may remove the correlation between the variable and the error term.
Is there an overlap between the two terms - heteroscedasticity and endogeneity?
No, they are two different phenomena. Heteroscedasticity is the phenomenon of the data points having different variance along the best-fit regression line, while endogeneity is the presence of correlation between the independent variables and the error terms.
In regression, is it necessary to have a categorical variable?
No. A regression problem is not restricted to only continuous or only categorical variables; it can have only continuous variables, only categorical variables, or a mix of the two.
Should any boolean or binomial data always be converted to a 1/0?
While processing data in Python, it is necessary to convert it into a numeric data type so that mathematical operations can be applied to it. For this reason, it is good practice to convert a binary independent variable to 0 and 1 using any of the encoding methods.
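A minimal pandas sketch (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"is_member": [True, False, True]})    # hypothetical boolean column
df["is_member"] = df["is_member"].astype(int)             # True/False -> 1/0

# a "yes"/"no" style column can be encoded with an explicit mapping instead
df["renewed"] = pd.Series(["yes", "no", "yes"]).map({"yes": 1, "no": 0})
```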
Would log(x2) and x2 have a very high correlation that could cause multicollinearity issues?
log(x2) is a transformed version of x2, and the log is an increasing function. Creating such a column may cause multicollinearity, because after a log transformation the two columns will still have considerable correlation.
Why do we need a non-linear function added to the model?
It is not always possible to get a linear relationship between the dependent and independent variables. In real life, most of the time a non-linear relation captures the actual relation between the variables better. Due to this, we need non-linear functions added to the model.
Would the product of two variables be an interaction between the features?
Yes, it is an interaction between the two variables. When two variables are multiplied, a new feature is generated as an effect of their interaction.
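A minimal sketch of creating such an interaction feature (hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})  # hypothetical features
df["x1_x2"] = df["x1"] * df["x2"]   # the product of the two columns acts as the interaction term
```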
Can you take the log of Y and still have linear regression?
Yes, it is possible. A model is linear when the coefficients enter with power equal to 1; linearity is judged with respect to the parameters, not the variables. So log(y) can still have a linear relationship with the independent variables.
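A minimal sketch, assuming a small hypothetical dataset where y grows roughly exponentially with x:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])     # hypothetical independent variable
y = np.array([2.7, 7.4, 20.1, 54.6])           # roughly exp(x)
model = LinearRegression().fit(X, np.log(y))   # regressing log(y) is still linear in the coefficients
print(model.coef_, model.intercept_)           # slope close to 1, intercept close to 0
```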
Are rescaling, normalizing, and standardizing different?
Normalizing and standardizing are two methods of rescaling/scaling. In standardizing we measure how many standard deviations each data point lies away from the mean of the data, while in normalizing we bring the data back into a certain range of numbers (for example 0 to 1).
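A minimal scikit-learn sketch of the two operations on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])         # hypothetical feature
print(StandardScaler().fit_transform(X).ravel())    # standardizing: (x - mean) / std
print(MinMaxScaler().fit_transform(X).ravel())      # normalizing: rescaled into [0, 1]
```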
Can you provide an example of the weighted least square algorithm? How do we pick the weight?
The weighted least squares algorithm is a method of finding the parameters of a model. It can be applied with any algorithm that uses it as the cost function, for example linear regression. The weights should, ideally, be equal to the reciprocal of the variance of the measurement, and each record has its own weight associated with it.
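A minimal sketch using statsmodels' WLS on simulated heteroscedastic data (the data-generating choices are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 100)
sigma = 0.5 * x                          # noise grows with x, so the data are heteroscedastic
y = 3 + 2 * x + rng.normal(0, sigma)
X = sm.add_constant(x)
w = 1.0 / sigma**2                       # weight each record by the reciprocal of its variance
wls_fit = sm.WLS(y, X, weights=w).fit()
print(wls_fit.params)                    # estimates of the intercept and slope
```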
If adding more variables reduces endogeneity, then how to reduce heteroscedasticity?
One of the most common methods is to use weighted least squares analysis. Giving a different weight to each record (observation), rather than treating them all equally, resolves heteroscedasticity.
Can you give some examples where scaling/ normalizing is necessary and where the regression will work just fine without transforming the data first?
Scaling and normalizing are used when different features of the dataset are on different scales. For example, if a dataset contains weight in kilograms and height in meters, the features have different scales and the data should be scaled. Linear regression works relatively better when there is no large scale difference between the features.
Will the addition of more variables cause overfitting?
When new variables are added to the model, it becomes more complex and may start to capture the noise in the data during training. This can cause the model to overfit.
If we take the log of Y, can we still account for outliers in our prediction model?
Even in the transformed data (after a log transformation), some points may remain far away from the main body of the data points and can still be outliers. Transformation is not a guaranteed fix for outliers.
Can a regression model include both continuous and categorical variables at the same time?
Yes, a regression model can include both categorical and continuous variables at the same time. The only requirement is that the categories in the categorical data be converted into numbers so that the model can be established properly.
Can outliers affect Linear regression?
In linear regression, outliers can adversely affect the model's predictions. A variable with outliers dominates the other variables in terms of its contribution to the model, and it increases the variance of the predictions as well as of the original dataset.
What is a hyperparameter?
When we train a machine learning model, some parameters are estimated during the training process; these are the model parameters. Along with them, there are parameters that we need to pass to the model before training it. We have the freedom to try different values for these parameters and check at which value the model performs best. Such parameters are called the hyperparameters of the model.
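A minimal sketch of searching over one hyperparameter with scikit-learn (the model, the candidate alpha values, and the generated data are all assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
# alpha is a hyperparameter: we pass candidate values and let the search pick the best one
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```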
What does using lasso or ridge regression do to standard errors of coefficient estimates?
When we regularize a model using ridge or lasso regression, we make the model simpler (if it was originally overfitting). This is done by shrinking the values of the coefficients, which causes the standard errors of the coefficient estimates to go down.
What does sparsity mean?
Sparsity means that, out of a given set of values, many are zero. For example, if 60 of the 100 entries of a 10x10 matrix are zero, it is a sparse matrix. In general, if the percentage of zero values is high, we refer to the matrix as sparse.
Why does Lasso force the coefficients to be exactly equal to zero while ridge just shrinks them?
The lasso constraint region has “corners” (in two dimensions it is a diamond). When the sum-of-squares contours hit one of these corners, the coefficient corresponding to that axis is shrunk exactly to zero. The ridge constraint region is a circle with no corners, so ridge shrinks the coefficients toward zero but rarely makes any of them exactly zero.
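A minimal scikit-learn sketch showing the difference on simulated data (the dataset and the alpha value are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print(np.sum(lasso.coef_ == 0))   # lasso typically sets several coefficients exactly to zero
print(np.sum(ridge.coef_ == 0))   # ridge shrinks them but rarely to exactly zero
```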
Is the validation set a subset of the entire dataset? If the validation set and training set work well, then why would we still get errors on our test set?
A validation set is a subset of the data that is set aside to validate the model (to check whether it is performing well or not). Validating a model helps make it as accurate and as general as possible, but this does not mean its accuracy or performance is 100%. Irrespective of its accuracy on the validation set, most models will still make some errors on the test set, because the model cannot capture every pattern present in the test set data.
Could you randomly choose a different validation set rather than setting aside a test set?
One can choose the validation set randomly if desired; it does not have to be the last few records of the dataset. When doing cross-validation in machine learning, different validation sets are chosen in turn to validate the model.
In the k folds method, how do we combine different regression models?
We do not combine them. Cross-validation is a method to find the best-working model out of a set of models on a given dataset: if the average of the performance metrics of a model M1 across the folds is better than that of a model M2, then M1 is preferred, along with its best set of parameters.
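A minimal sketch of comparing two candidate models by their average cross-validation score (the models and the generated data are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
for name, model in [("M1: linear", LinearRegression()), ("M2: ridge", Ridge(alpha=1.0))]:
    scores = cross_val_score(model, X, y, cv=5)   # one score per fold
    print(name, scores.mean())                    # compare models by their average score
```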
Can simulation tell us anything about bias?
Machine learning models can exhibit bias, often because the datasets used to train them are biased. This causes the resulting models to perform poorly on records that are minorities within the training set and ultimately presents higher risks to them. Computer simulations can be used to interrogate and diagnose biases within ML classifiers.
Are “synthetic datasets” created from the population?
Yes, synthetic data is created from the population. If we are given a certain set of features X with n records, the synthetic data is created using these columns and records.
So in bootstrapping we are creating new sample datasets based on permutations/combinations of the original dataset?
In bootstrapping, we create a certain number of new datasets from a given original dataset. This can be done with or without replacement. With replacement means that after we take a record from the data, we put it back before extracting the next record; this causes repetition of records in the bootstrap sample. Without replacement means that a record taken from the data is not put back into the original dataset before the next record is taken.
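A minimal numpy sketch of drawing bootstrap samples with replacement (the toy data and the number of samples are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                 # hypothetical original dataset of 10 records
n_bootstrap = 5
samples = [
    rng.choice(data, size=len(data), replace=True)   # sampling with replacement,
    for _ in range(n_bootstrap)                      # so records can repeat within a sample
]
print(samples[0])
```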
Is bootstrapping different from K-fold cross-validation?
Bootstrapping is different from K-fold cross-validation. Bootstrapping is a method of creating different samples from the same dataset while K-fold cross-validation is a process of splitting the dataset into a number of equal-sized folds in order to train different models.
For the simulation method, how do you draw the new data set? Do you draw from a pdf of the new data points using the current model parameters?
In simulation, new datasets are created from the original dataset and the model parameters. It is used to create a dataset of the desired kind, either to make the data more realistic or for some other real-world purpose.
Does bootstrapping require a minimum amount of samples (n)? Is Monte Carlo simulation some type of bootstrapping?
There is no hard and fast rule for the number of records required to apply bootstrapping. Monte Carlo simulation is different from bootstrapping: bootstrapping uses the original sample as the population from which it extracts samples to create new datasets, whereas Monte Carlo simulation is based on setting up a data-generation process with known values of the parameters.
Is bootstrapping done on a labeled dataset?
Bootstrapping can be done on both labeled and unlabeled data. It is simply a way of creating new sets of data from a given set of records, so it does not need labeled data to create a bootstrap sample.
When we bootstrap are we jointly estimating all parameters in multivariate regression or one at a time?
When we bootstrap, we have n datasets, each with the same set of features. Applying multivariate regression, we train a model on each dataset, so within each model all the parameters are estimated jointly, but the estimation is done separately for each bootstrap dataset. The final prediction is then made by aggregating the predictions of the individual models, for example by majority vote or averaging.
Is bootstrapping a way of simulating the sampling distribution?
Bootstrapping is different from simulation. In bootstrapping we create different samples from the same dataset, while in simulation we create different samples based on certain known parameters.
Does Bootstrapping require a size >= 30 to assure normal distribution of the sample?
In bootstrapping we create a minimum of around 20-30 datasets, but the size of each dataset is equal to the size of the original dataset.
How does taking duplicates make the data different in the case of bootstrapping?
When we bootstrap a dataset, some records are duplicated during the process, but not all of them. When sampling with replacement, on average only about 63% of the original records appear in each bootstrap sample (the rest of the sample consists of duplicates), so the samples differ from one another. If the original dataset itself contains duplicate records, it is preferable to remove them before applying the bootstrap.
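A minimal numpy sketch that demonstrates the roughly 63% figure empirically (the sample size is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.choice(n, size=n, replace=True)     # one bootstrap sample of the indices 0..n-1
print(len(np.unique(sample)) / n)                # close to 1 - 1/e, i.e. about 0.63
```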
Any links/books recommended for the robust methods to handle endogeneity?
Endogeneity is a phenomenon that can occur in any kind of data. To study it in an economics setting, you can refer to the book “Dealing with Endogeneity in Regression Models with Dynamic Coefficients: 6 (Foundations and Trends® in Econometrics)”.
Do graphical models handle causality better?
It cannot be said with certainty. But using a graphical model for a causal relationship has the benefit that the causality can be understood and interpreted well through visual inspection of the model. It represents the causal relations (if they exist) better than non-graphical models.
Will there be different estimates of the model parameters for different folds? Which one should we select?
Cross-validation is a process for finding which model works best among a given set of models. If we have two models M1 and M2, and cross-validation shows that the average performance metric of one of them is better than that of the other, that model is expected to perform better on unseen data, i.e. to generalize better. Once the better model is chosen, it is usually refit on the full training data to obtain a single final set of parameter estimates.
What’s the main difference between MSE and standard error?
The standard error is a statistical term that measures how accurately a sample distribution represents a population, using the standard deviation. MSE is the mean squared error, which is used as the objective that is minimized to find the parameters of a model. They may look like two kinds of error, but they serve different purposes.
Is the benefit of neural nets vs classical machine learning that, in NNs, you don't have to do feature selection, because the algorithm does it for you?
There are many benefits of using neural networks over classical machine learning models, and the one mentioned here is indeed one of them.
For supervised learning, is there a restriction on the number of samples to be used for a model?
It depends on the computational complexity. Generally, the larger the dataset, the better the supervised model trains and the better it predicts, since it has more data to learn from.
Are we already assuming Normal distribution to construct confidence interval?
The interval we derive is exact if we assume a normal distribution. Even if we do not assume it, the sampling distribution is approximately normal due to the Central Limit Theorem, and if the sample is reasonably large (say, more than 50 observations) the normality assumption is pretty safe.
Does the Wald test always have to have the null hypothesis with theta equal to zero?
Not necessarily. In the Wald test we check whether a parameter estimate differs significantly from a hypothesized value, using its standard error: the statistic is (theta_hat - theta_0) / se(theta_hat). The most common choice is theta_0 = 0, which tests whether a coefficient is significant, but the null value does not have to be zero.
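A minimal worked sketch with purely hypothetical numbers for the estimate, its standard error, and the null value:

```python
from scipy import stats

theta_hat, se, theta_0 = 0.42, 0.15, 0.0      # hypothetical estimate, standard error, null value
w = (theta_hat - theta_0) / se                # Wald statistic
p_value = 2 * (1 - stats.norm.cdf(abs(w)))    # two-sided p-value under the normal approximation
print(w, p_value)
```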
Difference between estimator and predictor?
In practice the two terms are often used interchangeably. Strictly speaking, an estimator is a rule for estimating an unknown parameter of the model from data, while a predictor is used to predict the value of the dependent variable for new observations.
What if the normality test on residuals fails?
The residuals should be normally distributed; this is a basic assumption that should not be violated. Maximum likelihood estimation (as used to derive linear regression) does not work if this assumption fails to hold, in which case we should not rely on linear regression.