Model Evaluation Metrics Flashcards

1
Q

What is the difference between noise and variance?

A

While collecting data, the data acquisition medium (either human or machine) may make errors. Such errors are called noise. Variance measures the variation of data points from the mean of the data.

2
Q

What is heteroscedasticity? What is generally the shape of heteroscedastic data?

A

Heteroscedasticity is the phenomenon where the variance of the data points (residuals) changes along the regression line. A scatter plot of heteroscedastic data generally has an irregular shape, typically a cone (fan) shape.

3
Q

How do the weights help with making a prediction?

A

Once the weights are estimated, the relationship between the dependent and independent variables is established; this expression is called the model. To predict for unseen data, the values of the independent variables are passed into the model to obtain the predicted value of the dependent variable.
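
A minimal sketch in Python (NumPy), assuming a linear model with hypothetical weights theta already estimated during training:

import numpy as np

# hypothetical weights estimated during training: [theta_0 (intercept), theta_1, theta_2]
theta = np.array([2.0, 0.5, -1.3])

# new observation of the independent variables, with a leading 1.0 for the intercept
x_new = np.array([1.0, 3.0, 2.0])

# prediction: y_hat = theta_0 + theta_1 * x1 + theta_2 * x2
y_hat = theta @ x_new
print(y_hat)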

4
Q

Do the coefficients stand for the weights in the model?

A

Yes, the coefficients established after the training process are the weights for each variable in the model. They play a vital role in making the prediction.

5
Q

Is it necessary to run PCA to resolve multicollinearity?

A

PCA can be used, but it is not necessary in all cases. PCA is used to reduce the number of features when the feature count is very high; this process is called dimensionality reduction. To get rid of multicollinearity, it is often enough to remove one of the two correlated features from each pair of correlated variables.

6
Q

What is the maximum likelihood function?

A

The likelihood function combines the likelihoods of occurrence of all the events in a sample. When this function is maximized to obtain the parameters of the model, the procedure is called maximum likelihood estimation. The likelihood is a function of the parameters, not of the variables of the data.
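
In notation (assuming n independent observations x_1, ..., x_n with density f(x; theta)), the likelihood and the maximum likelihood estimate are:

L(\theta) = \prod_{i=1}^{n} f(x_i; \theta), \qquad \hat{\theta} = \arg\max_{\theta} \log L(\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log f(x_i; \theta)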

7
Q

In regression, is it necessary to have a categorical variable?

A

No. A regression problem does not have to contain only continuous or only categorical variables. It can have only continuous variables, only categorical variables, or a mix of the two.

8
Q

Would log(x2) and x2 have a very high correlation that could cause multicollinearity issues?

A

Log(x2) is a transformed version of x2, and the log is an increasing function. Creating such a column may cause multicollinearity, because the log-transformed column will have some correlation with the original one.

9
Q

Why do we need a non-linear function added to the model?

A

It is not always possible to get a linear relationship between the dependent and independent variables. In real life, most of the time a non-linear relation captures the actual relation between the variables better. Due to this, we need non-linear functions added to the model.

10
Q

Would the product of two variables be an interaction between the features?

A

Yes, it is an interaction between the two variables. Multiplying two variables generates a new feature that captures their interaction effect.
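
A minimal sketch in Python (pandas), with hypothetical column names TV and Radio:

import pandas as pd

# hypothetical advertising data
df = pd.DataFrame({"TV": [230.1, 44.5, 17.2], "Radio": [37.8, 39.3, 45.9]})

# interaction feature: the product of the two variables, added as a new column
df["TV_x_Radio"] = df["TV"] * df["Radio"]
print(df)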

11
Q

Can you take the log of Y and still have linear regression?

A

Yes, it is possible. A model is linear when the power/index of the parameters (coefficients) is equal to 1; linearity is judged with respect to the parameters, not the variables. So log(y) can still have a linear relationship with the independent variables.
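
For example, the log-transformed model below is still linear in the parameters theta_j, so ordinary least squares still applies:

\log(y) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \epsilon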

12
Q

Are rescaling, normalizing, and standardizing different?

A

Normalizing and standardizing are two methods of doing rescaling/scaling. In standardization we measure how many standard deviations each data point is away from the mean of the data, while in normalization we bring the data back to a certain range of numbers (for example 0 to 1).
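
A minimal sketch in Python (NumPy) showing the two formulas:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# standardization: how many standard deviations each point is from the mean
x_std = (x - x.mean()) / x.std()

# normalization (min-max scaling): bring the data into the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_std, x_norm)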

13
Q

Can you provide an example of the weighted least square algorithm? How do we pick the weight?

A

The weighted least squares algorithm is a method of finding the parameters of a model. It can be applied with any algorithm that uses it as the cost function, for example linear regression. Ideally, the weight for each record should be equal to the reciprocal of the variance of its measurement, so every record has its own weight associated with it.
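
A minimal sketch in Python (NumPy) of the weighted least squares closed form, assuming the measurement variances are known:

import numpy as np

# hypothetical data: an intercept column plus one feature
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# weight of each record = 1 / variance of its measurement
var = np.array([0.5, 0.5, 2.0, 2.0])
W = np.diag(1.0 / var)

# weighted least squares: theta = (X'WX)^(-1) X'Wy
theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(theta)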

14
Q

Can you give some examples where scaling/ normalizing is necessary and where the regression will work just fine without transforming the data first?

A

Scaling and normalizing are used when different features of the dataset are on different scales; for example, if a dataset contains weight in kg and height in metres, the features have different scales and scaling of the data is needed. Linear regression works relatively better if there is no big scale difference between the features.

15
Q

Will the addition of more variables cause overfitting?

A

When new variables are added to the model, it becomes more complex and hence may capture the noise in the data while being trained. This can cause the model to overfit.

16
Q

If we take the log of Y, can we still account for outliers in our prediction model?

A

It is possible that even in the transformed data (log transformation) some points are far away from the main herd of the data points and they can be outliers. Transformation is not a fixed solution to outliers.

17
Q

Can a regression model include both continuous and categorical variables at the same time?

A

Yes, a regression model can include both categorical and continuous variables at the same time. The only requirement is that the categories in the categorical data be converted into numbers so that the model can be estimated properly.

18
Q

Can outliers affect Linear regression?

A

In linear regression, outliers can adversely affect the model's predictions. A variable with outliers can dominate the other variables in terms of its contribution to the model, and outliers also increase the variance of the predictions as well as of the original dataset.

19
Q

What is a hyperparameter?

A

When we train a machine learning model some parameters are estimated during the training process. These are the model parameters. Along with them, there are some parameters that we need to pass to the model while training it. We have the freedom to pass different values to these parameters and check at what value the model is performing better. Such parameters are called hyperparameters of the model.

20
Q

What does sparsity mean?

A

Sparsity means that, out of a given number of values, many of the values are zero. For example, in the case of a 10x10 matrix with 100 entries, if 60 of them are zero then it is a sparse matrix. In general, if the percentage of zero values is high we refer to it as a sparse matrix.
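
A quick way to measure sparsity in Python (NumPy):

import numpy as np

M = np.array([[0, 3, 0, 0],
              [1, 0, 0, 2],
              [0, 0, 0, 0]])

sparsity = np.mean(M == 0)   # fraction of entries that are zero
print(sparsity)              # 0.75 -> 75% zeros, so M is sparse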

21
Q

Is it a subset of the entire dataset? If the validation set and training set work well then why would we still get errors in our test set?

A

A validation set is a subset of the data that is chosen to validate the model (whether it is performing well or not). Validating a model is a process of making the model as accurate and generic as possible, but this does not mean that the accuracy or performance of the model is 100%. Irrespective of its accuracy on the validation set, most models are bound to make some errors on the test set, because the model cannot capture every pattern in the test set data perfectly.

22
Q

Could you randomly choose a different validation set rather than setting aside a test set?

A

One can choose the validation set randomly if desired; it is not necessary that it be the last few records of the dataset. When doing cross-validation in machine learning, different validation sets are chosen to validate the model.

23
Q

In the k folds method, how do we combine different regression models?

A

We do not directly combine the regression models from the different folds. Cross-validation is a method for finding the best-working model out of a set of models on a given dataset: if the average of the performance metrics of model M1 is better than that of model M2, then M1 (with its best set of parameters) will be preferred and is then refit on the training data.

24
Q

Can simulation tell us anything about bias?

A

Machine learned models exhibit bias, often because the datasets used to train them are biased. This causes the resulting models to perform poorly on records that are minorities within the training set and ultimately present higher risks to them. Computer simulations are used to interrogate and diagnose biases within ML classifiers.

25
Q

Are “synthetic datasets” created from the population?

A

Yes, synthetic data is created from the population. If we are given a certain set of features X with n number of records then the synthetic data is created using these columns and records.

26
Q

So we are creating new sample data sets based on the permutations/combinations of the original data set in bootstrapping?

A

In bootstrapping, we create a certain number of datasets from a given original dataset. This can be done with replacement or without replacement. With replacement means that when we take one record from the data, we put it back before extracting the next record; this causes repetition of records in the bootstrap sample. Without replacement means that when we take one record from the data, we do not put it back into the original dataset before taking the next record.
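
A minimal sketch in Python (NumPy) of the two sampling schemes:

import numpy as np

rng = np.random.default_rng(42)
data = np.array([10, 20, 30, 40, 50])

# bootstrap sample: same size as the data, drawn with replacement (duplicates possible)
boot = rng.choice(data, size=len(data), replace=True)

# sampling without replacement: no duplicates
no_repl = rng.choice(data, size=3, replace=False)

print(boot, no_repl)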

27
Q

Is bootstrapping different from K-fold cross-validation?

A

Bootstrapping is different from K-fold cross-validation. Bootstrapping is a method of creating different samples from the same dataset while K-fold cross-validation is a process of splitting the dataset into a number of equal-sized folds in order to train different models.

28
Q

For the simulation method, how do you draw the new data set? Do you draw from a pdf of the new data points using the current model parameters?

A

Given the original dataset and the model parameters, new datasets are created in simulation. It is used to create a dataset of the desired kind, either to make the data more realistic or for other real-world purposes.

29
Q

Does bootstrapping require a minimum amount of samples (n)? Is Monte Carlo simulation some type of bootstrapping?

A

There is no hard and fast rule for selecting the number of records when applying the bootstrap. As for the second question, Monte Carlo simulation is different from the bootstrapping methods: bootstrapping uses the original sample as the population from which it extracts samples to create new datasets, whereas Monte Carlo simulation is based on setting up a data-generation process with known values of the parameters.

30
Q

Is bootstrapping done on a labeled dataset?

A

Bootstrapping can be done on both labeled and unlabelled data. It is all about creating new sets of data from a given set of records, and it does not need labeled data to create a bootstrap sample.

31
Q

When we bootstrap are we jointly estimating all parameters in multivariate regression or one at a time?

A

When we bootstrap, we have n datasets, each with a certain number of features. When applying multivariate regression, we train a model on each of the datasets, so within every model all the parameters are estimated jointly, but the estimation is done separately for every bootstrap dataset. The final prediction is then obtained by aggregating (for regression, averaging) the predictions of the individual models.

32
Q

Is bootstrapping a way of simulating the sampling distribution?

A

Bootstrapping is different from simulation. In bootstrapping we create different samples from the same dataset, while in simulation we create different samples based on certain known parameters.

33
Q

Does Bootstrapping require a size >= 30 to assure normal distribution of the sample?

A

In bootstrap, we create a minimum of 20-30 datasets but the size of each dataset is equal to the size of the original dataset.

34
Q

How does taking duplicates make the data different in the case of bootstrapping?

A

When we bootstrap a dataset, some records are duplicated during the process, but not all of them. When sampling with replacement, on average only about 63% of the original records appear in each bootstrap sample, so each sample differs from the original data and from the other samples. If the original dataset itself has duplicate records, it is preferred to remove those duplicates before applying the bootstrap.
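
The 63% figure comes from the probability that any given record is drawn at least once in n draws with replacement:

P(\text{record appears}) = 1 - \left(1 - \tfrac{1}{n}\right)^{n} \approx 1 - e^{-1} \approx 0.632 \quad \text{for large } n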

35
Q

There will be different estimates of model parameters for different folds? Which one to select?

A

The fold-specific estimates are not selected directly; cross-validation is a process to find which model works better among a given set of models. If we have two models M1 and M2 and apply cross-validation to them, the model whose average performance metrics are better is expected to perform better on unseen data, i.e., to generalize better; that model is then retrained on the full training set.

36
Q

What’s the main difference between MSE and standard error?

A

The standard error is a statistical term that measures how accurately a sample distribution represents a population, using the standard deviation. MSE, the mean squared error, is the average squared difference between predictions and actual values; minimizing it is one way to find the parameters of a model. They may seem like two types of error, but they have different purposes to fulfill.

37
Q

Is the benefit of neural nets vs machine learning, that in NN’s, you don’t have to do feature selection…the algorithm does it for you?

A

There are many benefits of using neural networks over classical machine learning models. What is mentioned here, that features do not need to be hand-selected, is indeed one of those benefits.

38
Q

For supervised learning, is there a restriction on the number of samples to be used for a model?

A

It depends on the computational complexity. Generally, the larger the data, the better the training of the supervised model and the better it is at predicting the result, since it has more data to train on.

39
Q

Are we already assuming a normal distribution to construct the confidence interval?

A

The equation we derive is exact if we assume a normal distribution. Even if we do not assume it, the sampling distribution is approximately normal due to the Central Limit Theorem, and if there is more data, say more than 50 observations, then the assumption of a normal distribution is pretty safe.

40
Q

Does the wald test always have to have the null hypothesis with theta equal to zero?

A

Not necessarily. In the Wald test, we compare an estimated parameter against a hypothesized value using the statistic (estimate minus hypothesized value) divided by the standard error. The most common null hypothesis is theta equal to zero, which tests whether the variable has any effect, so in practice we usually compute it against zero.

41
Q

Difference between estimator and predictor?

A

In casual usage, the two terms are often used interchangeably. Strictly speaking, an estimator is a rule for estimating an unknown parameter of the model from the data, while a predictor produces a value of the response for a new observation.

42
Q

What if the normality test on residuals fails?

A

The residuals should be normally distributed. This is the basic assumption which should not be violated. Maximum likelihood estimation does not work if this assumption doesn’t hold true, in which case we should not use linear regression.

43
Q

Could you please explain “matrix full rank”?

A

A matrix is said to have full rank if its rank i.e. the number of independent columns equals the largest possible for a matrix of the same dimensions, which is the lesser of the number of rows and columns. A matrix is said to be rank-deficient if it does not have full rank.
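
A quick check in Python (NumPy), with a hypothetical matrix whose third column equals the sum of the first two, so it is rank-deficient:

import numpy as np

A = np.array([[1.0, 2.0,  3.0],
              [4.0, 5.0,  9.0],
              [7.0, 8.0, 15.0]])   # third column = first column + second column

print(np.linalg.matrix_rank(A))    # 2 < min(3, 3), so A does not have full rank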

44
Q

How do we detect a high correlation between two variables?

A

We can check by computing the correlation between the two variables. The correlation coefficient, r, is a number between -1 and 1 that gives the strength and direction of the linear relationship between the two variables and tells us how well a regression line would fit the data. The relationship between two variables is generally considered strong when |r| is larger than about 0.7.
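
A minimal sketch in Python (NumPy):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(r)                      # close to 1 -> strong positive linear relationship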

45
Q

Can multicollinearity be addressed by using R-Square?

A

No, R-squared cannot detect or address multicollinearity. R-squared only explains the proportion of variance of the target variable that is explained by the independent variables.

46
Q

Can we remove collinearity by removing variables? And which variable to remove?

A

There are many ways. One of them is to drop the variable that has the lower correlation with the target variable. We can also check the variance inflation factor (VIF) for the variables: if VIF gives high values for some variables, they can be eliminated. This is a common industrial approach.

47
Q

Will the regression generate the best fit line?

A

Yes. Regression generates the line of best fit, which gives us the predictions with minimum error.

48
Q

Can we use discrete(categorical) variables in regression?

A

Regression is fine if your independent variables are discrete (categorical); once they are encoded as numbers, it does not adversely affect the model.

49
Q

Do we have independent assumptions of the variables?

A

We never know whether the variables are completely independent of each other; there may be a slight association between them. If there is a very strong correlation that leads to collinearity, it should be taken care of.

50
Q

Why do we use logarithm?

A

It is a part of the data transformation technique to improve the model to best fit the data. Using the logarithm of one or more variables improves the fit of the model by transforming the distribution of the features to a more normally-shaped bell curve.

51
Q

If we add non-linear data, will the model stay linear?

A

Yes. Even though we add terms that are nonlinear in the variables, the coefficients that are learned by the model are still linear, hence linear regression can still be applied.

52
Q

Is augmentation like using more harmonics in the Fourier equation?

A

Yes. Instead of different frequencies, we try to augment the data by adding new variables using interaction and polynomial terms to best fit the data.

53
Q

Does augmentation affect the model explainability?

A

It slightly affects the explainability but it is always a tradeoff between explainability and accuracy of the model.

54
Q

Will the new feature replace the old one or will be added to the data set?

A

New features will be added along with the existing features and not replaced in the dataset. Features are only removed if they are not helpful for the model building process.

55
Q

What does “combined effect of TV and Radio” mean intuitively? TV has an effect, Radio has an effect, and they seem to be independent variables.

A

It means that the combined (interaction) effect of TV and Radio can bring a bigger effect compared to each of them individually: in some cases they may not be effective individually, but together they are more effective.

56
Q

If there is more noise in the data does it lead to more false positives and less true positives?

A

Yes. Noise affects the performance of the model, so it generally means more false positives and false negatives, and hence fewer true positives.

57
Q

Can R square be used for measuring the prediction error?

A

R-Square explains the proportion of variance for a target variable explained by the independent variables and it is not used as an error metric. MSE, RMSE, etc are used as a metric for measuring the error.

58
Q

Is the problem with overfitting that it produces too many errors or the error is due to very high computation?

A

It produces misleading results rather than a computational problem: making the model more complex causes it to learn the detail and noise in the training data to the extent that this negatively impacts the model's performance on new data. The error is high due to overfitting only, not due to computation issues.

59
Q

What is overfitting and what does high variance mean?

A

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. As the model is made more complex it fits the training data more and more closely, but it fails when predicting on new data; this sensitivity to the particular training sample is what high variance means, so high variance and overfitting go hand in hand.

60
Q

What does bias mean?

A

Bias is the difference between the predictions of the ML model and the correct values. High bias gives a large error on training as well as test data, so it is recommended that an algorithm be low-bias to avoid the problem of underfitting. In general, a high-bias model is too simple, and we should increase the complexity of the model, for example by adding more variables.

61
Q

What is the difference between the validation set and the test set?

A

We use the validation set to test the performance of each method and the test set to test the performance of the final model/method that is chosen. The test set is untouched till the final testing of the model.

62
Q

With small sample sizes, is this third test set still recommendable?

A

Yes, although the test dataset will be smaller in size. The main dataset can be split, for example, 80:10:10 into train, validation, and test sets; the test set is finally used for testing the selected model and is kept hidden from the training and validation process.
The other, often more useful, approach for small datasets is to use cross-validation.

63
Q

Would it be advisable to repeat training+validation several times with different splits of training/validation? Then use statistics of error

A

Yes, we can think about doing that. For example, we can form a grid of different training/validation splits, look at the statistics of the resulting errors, and go with whichever split gives the best result.

64
Q

How big should each of these data set sizes be? Seems like we’re dividing our data up, again and again, to be more thorough, but the resulting subsets are smaller.

A

Generally, the training dataset will be larger in size since the model is trained on this training data and the validation dataset is used for evaluating the model. Then the test dataset which is kept hidden is used to test the final selected model to see how well the model performs. Generally, they are kept in the ratio of 70:20:10 or 80:10:10. The split depends on the size of the dataset and the train-validation-test split is an ideal method.

65
Q

Could LOOCV lead to overfitting?

A

When the data set is small and of the order of a few thousand, it can lead to overfitting.

66
Q

After doing the regression n times in LOOCV, do we take the average thetas?

A

Not really. Cross-validation is only used to check the performance of the model. After that regression is done on the training dataset to train the model.

67
Q

Does irreducible noise mean that the noise is constant at that point?

A

Irreducible noise does not mean the noise is constant at that point. Even if we knew everything about the variables, there would still be some noise or randomness in the behavior of the data, and that part of the error cannot be reduced.

68
Q

Do you vote to select the theta of the best performing model when we do cross-validation?

A

No. Cross-validation is used to test performance, not to build the final model. You have to run the regression again with a different set of variables or parameters to improve performance.

69
Q

Are the folds in k-fold sequential or random?

A

The folds are chosen randomly for non-time-series data. For time-series data, the folds need to be sequential.

70
Q

Do all the folds need to be the same size?

A

Ideally, it should be the same. They can be unequal but there is no reason to not divide them equally.

71
Q

Does MSE account for both bias and variance?

A

Yes. MSE accounts for bias as well as variance: the expected prediction error measured by MSE decomposes into squared bias plus variance plus irreducible error.

72
Q

Can I use k-fold in clustering?

A

K-Fold is used for supervised learning. Once the model is trained on K-1 folds, it can be tested on the hold-out set. In unsupervised learning, we cannot test the model trained on K-1 folds because we don’t have labels.

73
Q

Why not use random subsets of the dataset?

A

Because the sample size of a random subset differs from that of our original dataset, the amount of noise will vary (noise is higher with less data and lower with more data). We want to avoid such differences in the estimator's performance.

74
Q

In the bootstrap method is there sampling with replacement?

A

Bootstrap Sampling is a method that involves drawing sample data repeatedly with replacement from a data source to estimate a population parameter. It means a data point in a drawn sample can reappear in future drawn samples as well.

75
Q

What is the mean of a sample?

A

Sum of all data points in the sample divided by the number of data points in that sample.

76
Q

What does pick a ‘random sample’ mean?

A

To randomly pick a sample means every data point is equally likely to be picked.

77
Q

Must the sample be equal size to the original data set in the bootstrap method?

A

No, the sample generated using the bootstrap doesn't have to be the same size as the original data, but usually it is kept the same size, and the records not selected during sampling can be used to validate the model.

78
Q

Bootstrapped parameters will lead to normally distributed parameters by CLT?

A

Yes, ideally it should, if one is to believe that the CLT conditions hold and that the bootstrap gives a true estimate of the population parameters.

79
Q

Isn’t k-fold cross-validation for testing the generalizability of our model?

A

Yes. If the model performs roughly equally over all the folds or subsets, we can go ahead with the model.

80
Q

How can we pick the number k in k-fold?

A

There is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance. However, it is a heuristic to choose k.

81
Q

In the Heteroskedasticity slide, the linear regression image does not look wrong in the case of TV/Sales. Can you give an example for a prediction that does not work well with Linear Regression?

A

The regression line seems good in that image, but it is difficult to interpret the results using the regression line alone. You must check all the diagnostics before making final decisions, for example whether (1) all the variables are independent of each other, (2) all the variables are significant, and (3) the error terms are normally distributed.

Consider a dataset with one continuous independent variable and a continuous dependent variable. Imagine, those values are distributed over 2-dimensional space which forms a circular shape. If linear regression is applied to this dataset, the prediction will not be efficient. The residuals will be very high for most of the points.

82
Q

What is meant by “full rank”?

A

Multicollinearity - The extent to which independent variables are correlated. A basic assumption of the linear regression model is that the rank of the matrix of observations on independent variables is the same as the number of explanatory variables. In other words, such a matrix is of full column rank. This implies that all the independent variables are independent of each other, and there is no linear relationship among them.

83
Q

Does multicollinearity happen when some of the independent variables are very correlated?

A

Yes. Multicollinearity happens when independent variables in the regression model are highly correlated to each other. The regression algorithm assumes that the independent variables are not correlated with each other, so this assumption must be met if we are to proceed with building a linear regression model.

84
Q

Will PCA explain the relative effect of TV vs Radio? If no, why not?

A

No. It will not explain the relative effect. PCA just tells if our data has redundancy. Then we can throw away those redundant variables. PCA only looks at the independent variables. But to explain the relative effect of these two variables, we have to consider the dependent variable as well.

85
Q

Is there some way to measure multicollinearity?

A

Yes. Look at the eigenvalues of the matrix Xt * X, where Xt represents the transpose of the X matrix: if the smallest eigenvalue is close to zero, that is evidence of multicollinearity, and its inverse (Xt * X)^(-1), which is needed for the least-squares solution, becomes unstable.
In practice, we can also use the Variance Inflation Factor (VIF) to check the multicollinearity among variables.
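
A minimal sketch in Python, assuming statsmodels is available; the columns here are hypothetical, with x3 constructed to be nearly collinear with x1:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
X["x3"] = X["x1"] + rng.normal(scale=0.01, size=100)   # nearly a copy of x1

vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, vif)))   # x1 and x3 get very large VIF values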

86
Q

What if the ‘X’ is multicollinear only for a segment of the data?

A

That is not a problem. If the correlation is significant, it will still give enough evidence for multicollinearity that ‘X’ is correlated with some other variables.

87
Q

How to differentiate between multicollinearity vs correlation?

A

Correlation refers to the linear relationship between two variables, while multicollinearity is defined for a regression model, where some features have a strong relationship with other variables or with a combination of other variables. Multicollinearity may affect the results and the interpretability of the linear regression model.

88
Q

Correlation does not mean causation - Is that true?

A

Yes, correlation does not imply causation. Causation applies to cases when action A causes outcome B. But correlation is simply a relationship, i.e., action A relates to action B.

89
Q

When to add/remove variables?

A

It is a heuristic approach. In linear regression, add variables and see whether they improve the prediction in terms of R-squared; if they give significant improvements, we can keep them. Also try taking variables out and observe whether the R-squared value is damaged; if not, those variables are probably not relevant and can be taken away.

90
Q

What if a categorical variable has more than two values?

A

There can be categorical variables with more than two categories. We can create dummy variables, one column per category, where the entry is 1 in the column belonging to the record's specific category and 0 elsewhere.
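
A minimal sketch in Python (pandas), with a hypothetical three-category variable:

import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "SF", "NY"]})

# one 0/1 dummy column per category
dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)
print(dummies)

# for linear regression, drop_first=True avoids perfect collinearity with the intercept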

91
Q

Do we want to check the adjusted R^2 for each new feature separately?

A

Yes, it can be helpful because the R-Squared value never decreases no matter the number of variables we add to our regression model. That is, even if we are adding redundant variables to the data, the value of R-Squared does not decrease. It either remains the same or increases with the addition of new independent variables.
On the other hand, the Adjusted R-squared takes into account the number of independent variables used for predicting the target variable. In doing so, we can determine whether adding new variables to the model increases the model fit. It can decrease if that is not the case.
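
For reference, with n observations and p independent variables, the adjusted R-squared is:

\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}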

92
Q

How do we know when to stop adding variables?

A

We can observe the adjusted R-squared and see if it improves much by adding new variables. We can use the validation set to get a good estimate of whether new variables add value to the model or not.
Note: There is also a trade-off between the computational time and the model performance. If we add, say, 5 variables to the data for the increase of 0.1 Adjusted R-Squared, then depending on the resources and application, we can decide whether it is a significant increase for the dimension and the computation time that we are increasing.

93
Q

Is there a threshold for sparsity?

A

It depends on how much data we have. If we have little data, we are not going to believe a model that has 100 coefficients. We may look for a sparse structure that has fewer thetas. But if we have billions of data, then even a structure with 1000 coefficients may be fine. It is relative to the size of the dataset.

94
Q

Is the validation set the same as the test set?

A

No, both data are different. The validation set is the data that is used to validate the results of our trained model and tuning model hyperparameters. The test set is the unseen data that is used to check whether the model is giving a generalized performance or not.

95
Q

What would be the minimum sample size recommended for this validation method?

A

It depends on the amount of data we have. Generally, we consider 60% data as a train set, 20% data as a validation set, and 20% as a test set.

96
Q

Could we change our validation set and run multiple iterations of our experiment until we get a good model?

A

Yes, this is what we do in K-fold cross-validation. Initially, we split the dataset into k groups. For each unique group, we iterate the following procedure (see the sketch after this list):
- Take one group as the validation dataset
- Take the remaining groups as the training dataset
- Fit a model on the training set and evaluate it on the validation set
We can summarize the performance of the model using the average of the scores over the groups.
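
A minimal sketch in Python, assuming scikit-learn is available (the data here is synthetic, just for illustration):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# 5-fold cross-validation: each fold is used exactly once as the validation set
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores, scores.mean())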

97
Q

Is there any difference between the validation process and cross-validation?

A

Validation - Divides the original training dataset into two different subsets, say a training set and a validation set. The training set is used for training, and the validation set is used for assessing the performance of the model and tuning hyperparameters. The model is never trained on the validation set.
Cross-validation - The original training data is divided into ‘k’ subsets. In each iteration, ‘k-1’ subsets of data are used for training and the remaining subset for validating, so the validation dataset is different in every iteration. This is also called K-fold cross-validation.

98
Q

Could we use different random states and have different training/validation/test sets?

A

The random_state parameter is used for initializing the internal random number generator. Setting random_state a fixed value will guarantee that the same sequence of random numbers is generated each time we run the code.
We can get different training/validation/test sets if we change random states, but we keep it the same throughout our analysis to avoid any bias and use cross-validation to assess the model performance.

99
Q

How does RSS compare to MSE?

A

The residual sum of squares (RSS) measures the level of variance in the error term of a regression model. The smaller the residual sum of squares, the better the model fits the data; the greater the residual sum of squares, the poorer the model fits the data.
The mean squared error (MSE) is used to test the performance of the fitted models. It is related to RSS by the following equation:
MSE(Mean Squared Error) = (1/N) * RSS, where ‘N’ is the number of samples.

100
Q

How do we judge our model performance based on MSE?

A

Mean Square Error (MSE) is defined as the Mean (or Average) of the square of the difference between actual and estimated values. The smaller value of MSE indicates that the predicted values are closer to the actual values, i.e., a better model.

101
Q

In cross-validation, what would happen if the value of ‘k’ is the same as the total number of data points?

A

In cross-validation, if ‘k’ is equal to the number of data points, then for each fold we have only one data point in the validation set and the rest of the observations are used to train the model. It is called Leave-One-Out-Cross-Validation (LOOCV).

102
Q

What is meant by “With Replacement” in bootstrapping?

A

Bootstrapping is a resampling technique that generates a sampling distribution by repeatedly taking random samples from the known sample, with replacement. Here, ‘with replacement’ means a data point in a drawn sample can reappear in future drawn samples. This makes each drawn sample independent of the samples previously drawn.
If we did not sample with replacement, we would always get the same data (if the number of data points selected is equal to the number of data points in the original data).

103
Q

In the bootstrap method, what is the relation between ‘n’ & ‘m’? Is m

A

Here, ‘n’ is the size of the original dataset. ‘m’ is the number of new bootstrap datasets. We can pick any value for ‘m’. It can be smaller, equal, or larger than ‘n’. In general, we keep it ‘m’ to be equal to ‘n’ but in certain scenarios, they can be different. For example, if we want to know details of the tails of the sampling distribution, we would need a bigger ‘m’.

104
Q

Is bootstrapping the same as permutation?

A

No. A permutation would be to choose each one of the data points just once. If every data point is chosen once, then the dataset would be essentially the same as the original dataset. But in bootstrapping, we pick some duplicates (sampling with replacement) to create a new dataset.

105
Q

Does bootstrap allow us to approximate the sampling distribution?

A

Yes, this is where the bootstrap helps us. With the bootstrap, we get samples out of an approximation of the sampling distribution, and when we make a histogram of those samples, that histogram is an approximation of the true sampling distribution. Once we have the sampling distribution, we can see how wide it is (its standard deviation) and use that to determine the standard error.

106
Q

Does the bootstrap method create a new dataset with the same mean and standard deviation of the sample set, or does it choose actual records randomly from the sample set?

A

Bootstrap chooses samples randomly from the actual dataset with replacement. Bootstrapping is a general approach to statistical inference based on building a sampling distribution for a statistic by resampling from the data that we have.

107
Q

What do the diagonal elements represent in the covariance matrix?

A

The covariance matrix is a symmetric matrix that shows the covariances of each pair of variables. Each diagonal element shows the covariance of a variable with itself, which is nothing but the variance of that variable. So each diagonal element represents the variance of the respective variable.
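
A quick check in Python (NumPy):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

C = np.cov(x, y)                      # 2x2 covariance matrix
print(np.diag(C))                     # diagonal entries
print(x.var(ddof=1), y.var(ddof=1))   # equal to the diagonal (sample variances)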

108
Q

Is it possible or common to see the p-value high but the CI band not including the value zero? And vice versa?

A

No, the two things are equivalent. If the confidence interval includes zero, that means there is no statistically meaningful or statistically significant proof that the variable helps to predict the target variable. It is the same as saying the p-value is high (>0.05). Similarly, when CI doesn’t capture zero the p-value will be low.

109
Q

How do we decide when to use linear regression or other methods?

A

Linear regression is a basic model and it can be the first model to try for a regression problem. The linear regression model is widely used in many situations before attempting non-linear and more complicated models. It is the most accomplished theoretical model and helps in interpretability; some key concepts can also be explained well using the linear regression model.
In practice, we use multiple models of different kinds and the algorithm that gives the best results depends on the data and the problem on hand.