what is the difference between supervised and unsupervised learning problems?
supervised: a target variable is present. our goal is to understand the relationship between the target variable and the predictors, and/or to make accurate predictions for the target variable based on the predictors
unsupervised: target variable is absent. we are more interested in extracting relationships and structures between different variables in the data.
what two types of business problems are there in exam PA?
prediction-focused problems and interpretation-focused problems
How does the business problem affect the model that we create? (objective is prediction or interpretation?)
if the objective is to predict, then we will implement a model that produces good predictions even if it is costly to implement
if the objective is to interpret, then we can select a relatively simple, interpretable model that clearly shows the relationship between the target and predictors
why is it important that data are consistent?
so that values can be directly compared to one another
ex. numeric variables: keep them all in the same units
ex. categorical variables: use consistent naming for the levels
When datasets contain PII (personally identifiable information), such as social security number, address, etc, what may need to be done with the data?
the PII may need to be anonymized or de-identified (or removed entirely), and the data handled in compliance with applicable privacy laws and standards
what proportion of the data should we use to train our models?
around 70-75%
how do you create the test/training split? from what package?
library(caret)
set.seed(1)  # any fixed seed, for reproducibility
partition <- createDataPartition(dataset$targetvariable, p = 0.75, list = FALSE)
data.train <- dataset[partition, ]
data.test <- dataset[-partition, ]
what is the purpose of a training/test split?
train: used to train/develop your model to estimate the signal function. typically done by optimizing a certain objective function
test: where you assess the prediction performance of your trained model according to certain performance metrics (imagining that the test set is a set of future data)
what is a good performance metric to use on regression problems?
test RMSE = sqrt( (1/n_test) * sum_{i=1}^{n_test} (y_i - yhat_i)^2 ), where y_i are the observed and yhat_i the predicted values on the test set
we interpret this as the size of a typical prediction error in absolute value
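A minimal sketch of computing test RMSE in R, following the split conventions above (the built-in mtcars data and the model formula are illustrative choices, not from the exam):

```r
# sketch: fit on a training set, compute RMSE on the held-out test set
set.seed(1)
idx   <- sample(seq_len(nrow(mtcars)), size = round(0.75 * nrow(mtcars)))
train <- mtcars[idx, ]                      # ~75% of rows for training
test  <- mtcars[-idx, ]                     # remaining rows held out
model <- lm(mpg ~ wt + hp, data = train)    # train the model
pred  <- predict(model, newdata = test)     # predict on the test set
rmse  <- sqrt(mean((test$mpg - pred)^2))    # size of a typical prediction error
```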
why do we use RMSE instead of MSE?
the RMSE has the same unit as the target variable
what performance metric do we use for classification problems?
test classification error rate = (1/n_test) * sum_{i=1}^{n_test} 1(y_i != yhat_i), where 1(.) is the indicator function
the sum counts the number of test observations incorrectly classified; dividing by n_test returns the proportion of misclassified observations on the test set
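As a sketch, computing this error rate on simulated data (the simulated classes, probabilities, and the 0.5 cutoff are all illustrative):

```r
# sketch: test classification error rate on simulated binary data
set.seed(1)
y    <- rbinom(100, 1, 0.5)        # observed classes on the "test" set
prob <- runif(100)                  # hypothetical predicted probabilities
yhat <- ifelse(prob > 0.5, 1, 0)    # classify using a 0.5 cutoff
err  <- mean(y != yhat)             # proportion misclassified
```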
when is c-v most typically used?
to select the values of hyperparameters, which are parameters that control some aspect of the fitting itself
why is c-v powerful?
because it can assess the prediction performance of a model without using additional test data
how to perform k-fold cross validation?
1. randomly split the training data into k folds of roughly equal size (k = 10 is a common choice)
2. for each j = 1, ..., k: train the model on all folds except fold j, then measure its performance on fold j
3. average the k resulting performance values to get the overall cross-validation estimate
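The k-fold procedure can be sketched in base R (k = 5 here; the mtcars data and the model formula are illustrative):

```r
# sketch: manual k-fold cross validation for a linear model
set.seed(1)
k <- 5
# randomly assign each row to one of k folds of roughly equal size
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv_rmse <- sapply(1:k, function(j) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != j, ])  # train on k-1 folds
  pred <- predict(fit, newdata = mtcars[folds == j, ])    # validate on fold j
  sqrt(mean((mtcars$mpg[folds == j] - pred)^2))           # fold-j RMSE
})
mean(cv_rmse)  # average the k performance values
```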
is it possible to validate our model graphically?
yes. plot the observed values against the predicted values on the test set; for a well-performing model the points should lie close to the y = x line, with no systematic deviations from it
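A sketch of this diagnostic plot in base R (the data and model are illustrative):

```r
# sketch: observed vs. predicted plot on the test set
set.seed(1)
idx   <- sample(seq_len(nrow(mtcars)), size = round(0.75 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
model <- lm(mpg ~ wt + hp, data = train)
pred  <- predict(model, newdata = test)
plot(pred, test$mpg, xlab = "predicted", ylab = "observed")
abline(a = 0, b = 1)  # points should hug this y = x line with no systematic deviation
```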
T/F: prediction accuracy is the same as goodness of fit
false. it is not.
goodness of fit measures how well a model describes the past data, but doesn’t necessarily measure how good a model will perform on future data
what are the components of the expected test error? describe each of them.
squared bias: error from the model being too simple/inflexible to capture the true signal; decreases as flexibility increases
variance: error from the model's sensitivity to the particular training data (fitting the noise); increases as flexibility increases
irreducible error: the variance of the noise, which does not depend on the model chosen and cannot be reduced
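The decomposition above, written out for a prediction \hat{f}(x_0) at a test point x_0 with noise variance \sigma^2:

```latex
\mathbb{E}\!\left[\big(y_0 - \hat{f}(x_0)\big)^2\right]
= \underbrace{\big[\operatorname{Bias}\hat{f}(x_0)\big]^2}_{\text{squared bias}}
+ \underbrace{\operatorname{Var}\hat{f}(x_0)}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
```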
as the flexibility of the model increases, what tends to drop quicker, bias or variance?
bias
what are two commonly used strategies for reducing the dimensionality of a categorical predictor?
1. combining sparse levels with other levels
2. combining levels that have a similar relationship with the target variable
in terms of a categorical predictor: what is the difference between granularity and dimensionality?
dimensionality: the number of levels the categorical predictor has (and hence the number of dummy variables it generates)
granularity: how precisely/finely the information in the predictor is recorded; the more granular the variable, the more specific the information its levels carry
what is the optimal level of granularity for a categorical predictor?
the level that optimizes the bias-variance trade off
What does the penalty term do in the AIC/BIC?
the penalty term penalizes model complexity: for the inclusion of another feature to improve the AIC/BIC, the feature has to increase the loglikelihood by more than the amount it increases the penalty, i.e., the improvement in fit must outweigh the added complexity
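Writing out the standard formulas (with L the maximized likelihood, p the number of estimated parameters, and n_train the number of training observations):

```latex
\mathrm{AIC} = -2\ln L + 2p,
\qquad
\mathrm{BIC} = -2\ln L + p\,\ln n_{\text{train}}
```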
which has a higher penalty term? AIC or BIC?
BIC. its penalty per parameter is ln(n_train), which exceeds the AIC's 2 whenever n_train >= 8, so the BIC penalizes complexity more heavily and is more conservative (it favors simpler models)
when checking model diagnostics, what two plots are commonly used and what is evaluated on both?
the "residuals vs. fitted" plot and the Q-Q plot. on the residuals vs. fitted plot, check that the residuals scatter randomly around zero with no discernible patterns and roughly constant variance; on the Q-Q plot, check that the standardized residuals lie close to the 45-degree straight line, indicating approximate normality