Model Evaluation Metrics Flashcards
What is the difference between noise and variance?
While collecting data, the data acquisition medium (either human or machine) may make errors. Such errors are called noise. Variance measures the spread of the data points around the mean of the data.
What is heteroscedasticity? What is generally the shape of heteroscedastic data?
Heteroscedasticity is the phenomenon of the variance of the data points changing along the regression line. When plotted, heteroscedastic data generally has an irregular, cone-like (fan) shape.
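A minimal sketch of simulating heteroscedastic data, assuming numpy; the noise spread is made to grow with x, which produces the cone shape when the points are plotted:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 10, 200)
    # Noise whose standard deviation grows with x: the variance is not
    # constant along the regression line.
    y = 2.0 * x + rng.normal(scale=0.5 * x)

    # A scatter plot of (x, y) fans out (a cone) as x increases.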
How does the weight help with making a prediction?
When the weights of a model are established, the relationship between the dependent and independent variables is fixed; this expression is called the model. To predict unknown data, the values of the independent variables are passed into the model to obtain the dependent variable.
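A minimal sketch of this, assuming a linear model whose weights have already been learned (the numbers here are hypothetical); prediction is just the weighted sum of the independent variables plus the intercept:

    import numpy as np

    intercept = 1.5                       # hypothetical learned intercept
    weights = np.array([0.8, -0.3])       # hypothetical learned coefficients
    x_new = np.array([2.0, 4.0])          # values of the independent variables

    y_pred = intercept + weights @ x_new  # predicted dependent variable
    print(y_pred)                         # 1.5 + 0.8*2.0 - 0.3*4.0 = 1.9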
Do the coefficients stand for the weights in the model?
Yes, the coefficients established after the training process are the weights for each variable in the model. They play a vital role in making the prediction.
Is it necessary to run PCA to resolve multicollinearity?
PCA can help, but it is not necessary in all cases. PCA is used to reduce the number of features when their count is very high; this process is called dimensionality reduction. To get rid of multicollinearity, one of the two features in each correlated pair can simply be removed.
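A sketch of the simpler alternative, dropping one feature from each highly correlated pair instead of running PCA; the 0.9 threshold is an illustrative assumption:

    import numpy as np
    import pandas as pd

    def drop_correlated(df, threshold=0.9):
        # Upper triangle of the absolute correlation matrix (no diagonal).
        corr = df.corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        # Drop one column out of every pair exceeding the threshold.
        to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
        return df.drop(columns=to_drop)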
What is the maximum likelihood function?
The likelihood function is a function formed by combining the likelihoods of occurrence of all the events in a sample. When this function is maximized to obtain the parameters of the model, the procedure is called maximum likelihood estimation. The likelihood is a function of the parameters, not of the variables of the data.
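A minimal sketch of maximum likelihood estimation for a normal sample, assuming scipy is available; the likelihood is maximized by minimizing the negative log-likelihood, and note that it is a function of the parameters (mu, sigma) while the sample stays fixed:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=3.0, scale=2.0, size=500)

    # Negative log-likelihood as a function of the parameters, not the data.
    def neg_log_likelihood(params):
        mu, sigma = params
        return -np.sum(norm.logpdf(sample, loc=mu, scale=sigma))

    result = minimize(neg_log_likelihood, x0=[0.0, 1.0],
                      bounds=[(None, None), (1e-6, None)])
    print(result.x)  # estimates close to (3.0, 2.0)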
In regression, is it necessary to have a categorical variable?
No. A regression problem does not need to have only continuous or only categorical variables. It can have only continuous variables, only categorical variables, or a mix of the two.
Would log(x2) and x2 have a high correlation that could cause multicollinearity issues?
log(x2) is a transformed version of x2, and the logarithm is an increasing function. Creating such a column may cause multicollinearity because, with a log transformation, the two columns will be strongly correlated.
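A quick illustration, assuming a positive feature x2; over a typical range the correlation between x2 and log(x2) comes out high:

    import numpy as np

    rng = np.random.default_rng(0)
    x2 = rng.uniform(1, 100, size=1000)
    corr = np.corrcoef(x2, np.log(x2))[0, 1]
    print(corr)  # typically around 0.9 or higher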
Why do we need a non-linear function added to the model?
It is not always possible to get a linear relationship between the dependent and independent variables. In real life, a non-linear relation often captures the actual relation between the variables better. Because of this, we need non-linear functions added to the model.
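A sketch of adding a non-linear (squared) term while the model stays linear in its parameters; the quadratic relation here is an assumed illustration:

    import numpy as np

    x = np.linspace(0, 5, 50)
    y = 1.0 + 2.0 * x + 0.5 * x**2            # underlying non-linear relation

    # Design matrix with an added non-linear (squared) column.
    X = np.column_stack([np.ones_like(x), x, x**2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coef)  # approximately [1.0, 2.0, 0.5]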
Would the product of two variables be an interaction between the features?
Yes, it is an interaction between the two variables. When two variables are multiplied, a new feature is generated that captures the effect of their interaction.
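A tiny sketch of creating such an interaction column with pandas; the column names are hypothetical:

    import pandas as pd

    df = pd.DataFrame({"x1": [1, 2, 3], "x2": [4, 5, 6]})
    df["x1_x2"] = df["x1"] * df["x2"]   # interaction feature
    print(df)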
Can you take the log of Y and still have linear regression?
Yes, it is possible. A model is linear when the index/power of each coefficient is equal to 1; linearity is judged with respect to the parameters, not the variables. So log(y) and the independent variables can still have a linear relationship.
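A sketch of fitting a linear regression to log(y), assuming y is positive; np.polyfit stands in for any linear fitting routine:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 2, 100)
    y = np.exp(1.0 + 2.0 * x) * rng.lognormal(sigma=0.05, size=100)

    # Still linear in the parameters after transforming the target.
    slope, intercept = np.polyfit(x, np.log(y), deg=1)
    print(slope, intercept)  # close to (2.0, 1.0)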
Are rescaling, normalizing, and standardizing different?
Normalizing and standardizing are two methods of rescaling/scaling. In standardizing we measure how many standard deviations the data points are away from the mean of the data, while in normalizing we bring the data back into a certain range of numbers (typically [0, 1]).
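A minimal sketch contrasting the two rescaling methods with plain numpy:

    import numpy as np

    data = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

    # Standardizing: how many standard deviations each point is from the mean.
    standardized = (data - data.mean()) / data.std()

    # Normalizing (min-max): bring the data into the range [0, 1].
    normalized = (data - data.min()) / (data.max() - data.min())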
Can you provide an example of the weighted least square algorithm? How do we pick the weight?
The weighted least squares algorithm is a method of finding the parameters of a model. It can be applied with any algorithm that uses it as the cost function, for example linear regression. Ideally, the weights should be equal to the reciprocal of the variance of the measurement, and each record has its own weight associated with it.
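A sketch of weighted least squares with numpy, assuming the per-record noise variances are known; the weights are the reciprocals of those variances:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 10, 100)
    var = 0.1 * x**2                        # assumed per-record noise variance
    y = 3.0 + 1.5 * x + rng.normal(scale=np.sqrt(var))

    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(1.0 / var)                  # one weight per record

    # Solve the weighted normal equations (X^T W X) beta = X^T W y.
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    print(beta)  # close to [3.0, 1.5]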
Can you give some examples where scaling/ normalizing is necessary and where the regression will work just fine without transforming the data first?
Scaling and normalizing are used when different features of the dataset are on different scales. For example, if a dataset contains weight in kilograms and height in meters, the features have different scales and the data needs to be scaled. Linear regression works relatively better when there is no big scale difference between the features.
Will the addition of more variables cause overfitting?
When new variables are added to the model, the model becomes more complex, and hence it will tend to capture the noise in the data while being trained. This can cause the model to overfit.
If we take the log of Y, can we still account for outliers in our prediction model?
It is possible that even in the transformed data (after a log transformation) some points lie far away from the main herd of data points and can still be outliers. Transformation is not a guaranteed fix for outliers.
Can a regression model include both continuous and categorical variables at the same time?
Yes, a regression model can include both categorical and continuous variables at the same time. The only requirement is that the categories in the categorical data be converted into numbers so that the model can be fitted properly.
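A small sketch of converting categories into numbers; one-hot encoding with pandas get_dummies is one common way to do this (the column names are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"height": [1.7, 1.8, 1.6],
                       "city": ["Delhi", "Mumbai", "Delhi"]})

    # One-hot encode the categorical column so the model sees only numbers.
    encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
    print(encoded)  # height plus a 0/1 column city_Mumbai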
Can outliers affect Linear regression?
In linear regression, outliers can adversely affect the predictions of the model. A variable with outliers can dominate the other variables in terms of its contribution to the model, and outliers increase the variance of the predictions as well as of the original dataset.
What is a hyperparameter?
When we train a machine learning model, some parameters are estimated during the training process; these are the model parameters. Along with them, there are parameters that we need to pass to the model before training it. We have the freedom to pass different values for these parameters and check at which values the model performs best. Such parameters are called the hyperparameters of the model.
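A sketch of trying several values of a hyperparameter, assuming scikit-learn; Ridge regression's alpha is chosen by us, while the coefficients are learned during training:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    for alpha in [0.01, 0.1, 1.0, 10.0]:   # hyperparameter values we choose
        score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
        print(alpha, score)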
What does sparsity mean?
Sparsity means that, out of a given set of values, many of the values are zero. For example, in the case of a 10x10 matrix with 100 entries, if 60 are zero then it is a sparse matrix. In general, if the percentage of zero values is high, we refer to the matrix as sparse.
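A quick sketch, assuming scipy; a sparse matrix format stores only the non-zero entries:

    import numpy as np
    from scipy import sparse

    dense = np.zeros((10, 10))
    dense[::3, ::2] = 1.0                 # only a few non-zero entries
    mat = sparse.csr_matrix(dense)
    print(mat.nnz, "non-zero entries out of", dense.size)  # 20 out of 100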
Is it a subset of the entire dataset? If the validation set and training set work well then why would we still get errors in our test set?
A validation set is a subset of the data that is set aside to validate the model (to check whether it is performing well or not). Validating a model is a process of making the model as accurate and generic as possible, but this does not mean that the accuracy or performance of the model is 100%. Irrespective of the accuracy of the model on the validation set, most models are bound to make some errors on the test set. This is because the model cannot capture every pattern present in the test set data perfectly.
Could you randomly choose a different validation set rather than setting aside a test set?
One can choose the validation set randomly if desired; it is not necessary for it to be the last few records of the dataset. While doing cross-validation in machine learning, different validation sets are chosen to validate the model.
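A minimal sketch of randomly holding out a validation set, assuming scikit-learn's train_test_split:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)
    y = np.arange(10)

    # A random 20% validation split rather than the last few records.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0)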
In the k folds method, how do we combine different regression models?
We do not combine the models. Cross-validation is a method of finding the best-performing model out of a set of candidate models on a given dataset: each model is trained and evaluated k times, once per fold, and its performance metrics are averaged. If the average metric of model M1 is better than that of model M2, then M1 is preferred, along with its best set of parameters.
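A sketch of comparing two regression models by their average k-fold score, assuming scikit-learn; the model with the better mean metric is the one preferred:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.2, size=100)

    for name, model in [("M1", LinearRegression()), ("M2", Ridge(alpha=1.0))]:
        scores = cross_val_score(model, X, y, cv=5)   # one score per fold
        print(name, scores.mean())                    # compare the averages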
Can simulation tell us anything about bias?
Machine learning models exhibit bias, often because the datasets used to train them are biased. This causes the resulting models to perform poorly on records that are minorities within the training set, ultimately presenting higher risks to them. Computer simulations can be used to interrogate and diagnose such biases within ML classifiers.