Chapter 3: Linear Models Flashcards
what is the difference between supervised and unsupervised learning problems?
supervised: there is a target variable, and our goal is to understand the relationship between the target variable and the predictors and/or to make accurate predictions for the target variable based on the predictors
unsupervised: target variable is absent. we are more interested in extracting relationships and structures between different variables in the data.
what two types of business problems are there in exam PA?
- prediction focused: the primary objective is to make an accurate prediction of the target variable on the basis of other predictors
- interpretation focused: we are interested in using the model to understand the true relationship between the target variable and the predictors (ex. how is the number of exams passed associated with the salary of an actuary)
How does the business problem affect the model that we create? (objective is prediction or interpretation?)
if the objective is to predict, then we will implement a model that produces good predictions even if it is costly to implement
if the objective is to interpret, then we can select a relatively simple, interpretable model that clearly shows the relationship between the target and predictors
why is it important that data are consistent?
so that values can be directly compared to one another across records and datasets
- ex. numeric variables: keep them all in the same units
- ex. categorical variables: use consistent naming for the levels
When datasets contain PII (personally identifiable information), such as social security number, address, etc, what may need to be done with the data?
- anonymize the data to remove the PII
- data security: ensure that personal data receives sufficient protection, such as access restrictions
- terms of use: be aware of the terms of use on the data.
- unethical data: differential treatment based on these variables in a predictive model may lead to unfair discrimination
what proportion of the data should we use to train our models?
around 70-75%
how do you create the test/training split? from what package?
library(caret)
set.seed(42)  # any fixed seed, for reproducibility
partition <- createDataPartition(dataset$targetvariable, p = 0.75, list = FALSE)
data.train <- dataset[partition, ]
data.test <- dataset[-partition, ]
what is the purpose of a training/test split?
train: used to train/develop your model to estimate the signal function. typically done by optimizing a certain objective function
test: where you assess the prediction performance of your trained model according to certain performance metrics (imagining that the test set is a set of future data)
what is a good performance metric to use on regression problems?
test RMSE: write out the formula
we interpret this as the size of a typical prediction error in absolute value
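for reference, with y_i the observed test values and y_hat_i the predictions:
\[
\text{RMSE}_{\text{test}} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left(y_i - \hat{y}_i\right)^2}
\]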
why do we use RMSE instead of MSE?
the RMSE has the same unit as the target variable
what performance metric do we use for classification problems?
test classification error rate: write it out
the sum counts the number of test observations incorrectly classified; dividing by n_test gives the proportion of misclassified observations on the test set (see the formula below)
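for reference:
\[
\text{classification error rate}_{\text{test}} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \mathbf{1}\{y_i \neq \hat{y}_i\}
\]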
when is c-v most typically used?
to select the values of hyperparameters, which are parameters that control some aspect of the fitting itself
why is c-v powerful?
because it can assess the prediction performance of a model without using additional test data
how to perform k-fold cross validation?
- for a given positive integer k, randomly split the training data into k folds of approx. equal size. common choice = 10
- one fold is left out and the model is fitted to the remaining k-1 folds. then the fitted model is used to make a prediction for each observation in the left-out fold and a performance metric is computed on that fold.
- repeat this process with each fold left out in turn to get k performance values (e.g. RMSE for numeric and classification error rate for categorical)
- the overall prediction performance of the model can be estimated as the average of the k performance values (this is the CV error)
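a minimal R sketch of 10-fold CV using the caret package, assuming a training set data.train with a numeric target y (the data frame and variable names are placeholders):
library(caret)
set.seed(42)  # fixed seed so the fold assignment is reproducible
ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation
cv.model <- train(y ~ ., data = data.train, method = "lm", trControl = ctrl)
cv.model$results$RMSE  # CV error: RMSE averaged over the 10 validation folds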
is it possible to validate our model graphically?
yes. plot the observed values against the predicted values on the test set: the points should lie close to the y = x line, with no systematic deviations from it (see the R sketch below)
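a quick base-R sketch, assuming a fitted model object model and a test set data.test with target y (placeholder names):
preds <- predict(model, newdata = data.test)  # predictions on the test set
plot(preds, data.test$y, xlab = "Predicted", ylab = "Observed")  # observed vs. predicted
abline(a = 0, b = 1, col = "red")  # points should hug the y = x line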
T/F: prediction accuracy is the same as goodness of fit
false, it is not.
goodness of fit measures how well a model describes the past (training) data, but it does not necessarily tell us how well the model will perform on future data
what are the components of the expected test error? describe each of them.
- bias (accuracy):
  - the difference between the expected value of the predictive model f_hat and the true value of the signal function
  - bias measures the accuracy of f_hat
  - bias is the part of the test error caused by the model not being flexible enough to capture the signal
- variance (precision):
  - quantifies the amount by which f_hat would change if we were to estimate it on a different training set
  - ideally, f_hat should be stable across different training sets
  - a more flexible model has higher variance because it is more sensitive to the training data
- irreducible error:
  - the variance of the noise, independent of the choice of predictive model but inherent in the target variable
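putting the three pieces together for a new observation (x_0, y_0) under squared-error loss:
\[
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \left[\operatorname{Bias}\big(\hat{f}(x_0)\big)\right]^2 + \operatorname{Var}\big(\hat{f}(x_0)\big) + \operatorname{Var}(\varepsilon)
\]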
as the flexibility of the model increases, what tends to drop quicker, bias or variance?
bias (bias falls while variance rises; at first the drop in bias outpaces the rise in variance, which is why some added flexibility lowers the test error)
what are two commonly used strategies for reducing the dimensionality of a categorical predictor?
- combining sparse categories with others:
  - categories with very few observations should be the first candidates to be combined with other categories
  - it is difficult to reliably estimate the effects of these categories on the target variable when they have few observations
- combining similar categories:
  - if the target variable behaves similarly (with respect to mean, median, etc.) in two categories of a categorical predictor, then we can reduce the dimension of the predictor by consolidating these two categories without losing much information (see the R sketch after this card)
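a minimal base-R sketch of both strategies, assuming a data frame dat with a factor occupation whose level names here are made up:
# fold a sparse level into a broader catch-all level
levels(dat$occupation)[levels(dat$occupation) == "Other"] <- "Misc"
# merge two levels in which the target behaves similarly
levels(dat$occupation)[levels(dat$occupation) %in% c("ClerkA", "ClerkB")] <- "Clerk"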
in terms of a categorical predictor: what is the difference between granularity and dimensionality?
dimensionality:
- applicability: a concept specific to categorical variables
- comparability: we can always order categorical variables by dimension (which has more levels?)
granularity:
- as we make a categorical predictor more granular, the information it stores becomes finer. its dimension increases and there are fewer observations at each level
- reducing the granularity of a categorical predictor makes the information contained by the predictor less detailed but makes the number of factor levels more manageable
- applicability: applies to both categorical and numeric variables
- comparability: it is not always possible to order categorical variables by granularity
what is the optimal level of granularity for a categorical predictor
the level that optimizes the bias-variance trade off
What does the penalty term do in the AIC/BIC?
the penalty term guards against overfitting: for an added feature to improve the AIC/BIC, the increase in loglikelihood it brings must outweigh the increase in the penalty caused by the added complexity (see the formulas below)
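with l the maximized loglikelihood, p the number of estimated parameters, and n_train the number of training observations:
\[
\text{AIC} = -2l + 2p, \qquad \text{BIC} = -2l + p\ln(n_{\text{train}})
\]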
which has a higher penalty term? AIC or BIC?
BIC. its penalty is ln(n_train) per parameter versus the AIC's 2, so for any realistic training set size the BIC penalizes complexity more heavily and is the more conservative criterion
when checking model diagnostics, what two plots are commonly used and what is evaluated on both?
- residuals vs. fitted values plot:
  - equal variance (finger test - checking for homoscedasticity)
  - any patterns in the plot
  - residuals centered around mean 0
- normal Q-Q plot:
  - check that the standardized residuals lie close to the y = x line (approximate normality)
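a base-R sketch, assuming a fitted lm object named model:
plot(model, which = 1)  # residuals vs. fitted values: look for mean 0, constant variance, no patterns
plot(model, which = 2)  # normal Q-Q plot: standardized residuals should track the 45-degree line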
how could we interpret a regression coefficient in a linear model?
“a unit increase in X is associated with an increase of beta (the coefficient of X) in Y on average, holding all other predictors fixed”
in situations where the relationship between Y and X does not appear to be linear, it may be desirable to expand the model to higher powers of X using polynomial regression.
what are the pros (1) and cons (2) to adding higher order terms to our model?
pros:
- we are able to take care of more complex relationships between the target variable and the predictors. The more polynomial terms included, the more flexible the model.
cons:
- the regression coefficients in a polynomial regression model are much more difficult to interpret. we can no longer speak of "a one-unit increase in X, holding other variables fixed" - the other polynomial terms (X^2, X^3, ...) cannot be held fixed when X changes
- there is no simple rule for what power of X to go up to. for large values of the order m, the model becomes overly flexible; deciding on the optimal value of m is an iterative process (see the R sketch below)
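a minimal R sketch, assuming a training set data.train with target y and numeric predictor x (placeholder names):
# cubic polynomial in x; poly() uses orthogonal polynomials by default,
# which reduces the collinearity among the polynomial terms
poly.model <- lm(y ~ poly(x, degree = 3), data = data.train)
summary(poly.model)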
How could we deal with complex, nonlinear relationships (3)
- polynomial regression
- binning: using piecewise constant functions
- using piecewise linear functions
how could we use binning to incorporate nonlinearity into a model? what are the pros (1) and cons (2) to this method?
we don’t treat the numeric variable as numeric. We band the numeric variable and convert it into an ordered categorical variable whose levels are defined as non-overlapping intervals over the range of the variable.
- each level is represented by a dummy variable and receives a separate regression coefficient (see the R sketch at the end of this card)
pros:
- binning liberates the regression function from assuming any particular shape. there is no definite order among the coefficients of the dummy variables corresponding to different bins, allowing the target mean to vary irregularly over the bins
cons:
- no simple rule as to how many bins to use. having many bins leads to sparse categories and unstable coefficient estimates
- binning results in a loss of information.
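a minimal R sketch of binning with base R's cut(), assuming a data frame dat with a numeric variable age and a target y (the variable names and cut points are made up):
# band age into non-overlapping intervals; the result is a factor
dat$age_band <- cut(dat$age, breaks = c(0, 25, 40, 60, Inf), labels = c("0-25", "25-40", "40-60", "60+"))
# each band gets its own dummy variable (and regression coefficient)
bin.model <- lm(y ~ age_band, data = dat)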