Anki Flashcards
How to check to see if the model is underfitting
Compare the model's performance against a simple benchmark, such as predicting the average target value or a GLM with only a few predictors
✔ Underfitting if performance is the same or worse
✔ It is not sufficient to just look at the training and testing error
BIC
Bayesian Information Criterion
Used to compare GLMs
Lower is better
Minimize error and maximize likelihood
p*log(nrow(train)) - 2log(likelihood)
p = # parameters, nrow(train) = # training observations
__ is when there is no pattern between the missingness and the value of the variable
Missing at random (MAR)
When predicting whether or not a policy will file claims, True Negatives (TN) are policies which
Are predicted to not have a claim and actually do not have a claim
Describe the bias-variance-tradeoff
- the tradeoff between bias (underfitting) and variance (overfitting)
- increasing the bias will often decrease the variance
- increasing the variance will often decrease the bias
Mean Squared Error = variance + bias^2 + irreducible error
When the distribution of a predictor variable is right-skewed you should ___
apply a log transform
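A minimal R sketch, assuming a data frame train with a right-skewed positive predictor income (both names are hypothetical); the +1 offset is only needed if zeros are present:
hist(train$income)                          # right-skewed on the original scale
train$log_income <- log(train$income + 1)   # +1 guards against log(0)
hist(train$log_income)                      # much more symmetric after the transform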
True/False: The goal of feature selection is to choose the features which are most predictive and discard those which are not
False. Features may be predictive but still be excluded because of
✔ Racial or ethical concerns
✔ Limited future availability
✔ Instability of the data over time
✔ Lack of interpretability
Decision trees identify the optimal variable-split-point combination by measuring ____ or ____
Entropy or Gini
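A rough, self-contained R illustration of the two impurity measures for one candidate split (toy vectors, not from the source):
impurity <- function(y) {
  p <- table(y) / length(y)
  c(gini = 1 - sum(p^2),                   # Gini impurity
    entropy = -sum(p * log2(p + 1e-12)))   # entropy (offset avoids log(0))
}
x <- c(1, 2, 3, 6, 7, 8, 9, 10)            # toy predictor
y <- c(0, 0, 0, 1, 1, 0, 1, 1)             # toy binary target
left <- y[x <= 5]; right <- y[x > 5]       # candidate split at x <= 5
(length(left) * impurity(left) + length(right) * impurity(right)) / length(y)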
When predicting whether or not a policy will file claims, True Positives (TP) are policies which
Are predicted to have a claim and actually do have a claim
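A small R sketch of a confusion matrix, assuming hypothetical vectors pred and actual coded "Claim"/"No Claim"; TP and TN sit on the diagonal:
table(Predicted = pred, Actual = actual)   # rows = predicted class, columns = actual class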
The set of simplifying assumptions made by the model is called
bias
GLM response distributions that are strictly positive
Poisson (discrete)
gamma (continuous)
inverse Gaussian (continuous)
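How these map to R's glm() family argument, sketched with a hypothetical data frame train and hypothetical column names:
glm(claim_count ~ age + region, family = poisson(link = "log"),          data = train)
glm(claim_sev   ~ age + region, family = Gamma(link = "log"),            data = train)
glm(claim_sev   ~ age + region, family = inverse.gaussian(link = "log"), data = train)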
For the regression metric RMSE, is higher or lower better?
Lower
“Minimize error and maximize likelihood”
RMSE = root mean squared error
How to check to see if the model is overfitting
The training error is much better (lower) than the test error
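A quick R sketch comparing train vs. test RMSE for a fitted GLM fit (all object and column names are hypothetical):
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(train$target, predict(fit, newdata = train, type = "response"))   # training RMSE
rmse(test$target,  predict(fit, newdata = test,  type = "response"))   # test RMSE
# A training RMSE far below the test RMSE suggests overfitting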
One disadvantage of ___ models is that the predictor variables need to be uncorrelated.
GLM
Penalized regression model(s) where variables are removed by having their coefficients set to zero
LASSO and Elastic Net
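A hedged sketch using the glmnet package (an assumption; the source does not name a library), assuming a data frame train with a numeric target column target. alpha = 1 gives the LASSO, 0 < alpha < 1 gives the elastic net; coefficients shrunk exactly to zero drop those variables from the model:
library(glmnet)
x <- model.matrix(target ~ ., data = train)[, -1]   # predictor matrix (drop intercept column)
y <- train$target
cv_fit <- cv.glmnet(x, y, alpha = 1)                # alpha = 1: LASSO; try alpha = 0.5 for elastic net
coef(cv_fit, s = "lambda.min")                      # zeroed coefficients are removed from the model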
When fitting a GLM, if the distribution of the target variable is right-skewed, you should ____
Use a log link function
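For example (a sketch with hypothetical names), a gamma GLM with a log link keeps the fitted means positive without transforming the target itself:
glm(claim_sev ~ age + vehicle_value, family = Gamma(link = "log"), data = train)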
What is the objective of the k-means algorithm?
To partition the observations into k groups such that the sum of squares from points to the assigned cluster centers is minimized
The variable “body mass index” contains missing values because the laptop that they were stored on had coffee spilled on it. This is an example of
Missing at random (MAR).
✔ There is no pattern between whether the value is missing and the target value.
✔ Observations can safely be omitted from the data with no loss in predictive power besides the smaller sample size.
✔ If > 20% of records are missing, consider removing the variable altogether.
When running k-means clustering, it is best to use multiple starting configurations (nstart between 10 and 50) and keep the run with the lowest total within-cluster sum of squares because this reduces the likelihood of ____
Getting stuck in a local minimum rather than the global minimum of the sum of squared distances between each point and its assigned cluster center
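A minimal R sketch (feature names hypothetical); kmeans() runs nstart random initializations and keeps the one with the lowest total within-cluster sum of squares:
km <- kmeans(scale(train[, c("age", "income")]), centers = 4, nstart = 25)
km$tot.withinss   # the objective that k-means minimizes
km$centers        # cluster centers from the best of the 25 starts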
When a hierarchical clustering algorithm uses single linkage, the distance between two clusters is computed by
Computing the distances between all pairs of points, one from cluster A and one from cluster B, and then using the smallest
Define an interaction effect
When the impact of a predictor variable on the target variable differs based on the value of another predictor variable
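In an R model formula (a sketch with hypothetical predictors), an interaction is added with : or with *, which also includes the main effects:
glm(target ~ age * region, family = Gamma(link = "log"), data = train)   # expands to age + region + age:region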
One disadvantage of ____ models is that they are unable to automatically detect non-linear relationships between the predictor variables and the target.
GLMs
When a hierarchical clustering algorithm uses complete linkage (the default), the distances between two clusters are computed by
Computing the distances between all pairs of points, one from cluster A and one from cluster B, and then using the largest
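A sketch of both linkage choices with R's hclust() (data frame and column names hypothetical); "complete" is hclust's default method:
d <- dist(scale(train[, c("age", "income")]))   # pairwise distances between observations
hc_single   <- hclust(d, method = "single")     # cluster distance = smallest pairwise distance
hc_complete <- hclust(d, method = "complete")   # cluster distance = largest pairwise distance
cutree(hc_complete, k = 4)                      # cut the dendrogram into 4 clusters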
AIC
Akaike Information Criterion
Used to compare GLMs
Lower is better
2p - 2log(likelihood)
p = # parameters
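Both criteria can be pulled from a fitted GLM in R; a sketch for a hypothetical model object fit, with the formulas written out for comparison:
AIC(fit)                                               # 2p - 2log(likelihood)
BIC(fit)                                               # p*log(n) - 2log(likelihood)
ll <- logLik(fit)
2 * attr(ll, "df") - 2 * as.numeric(ll)                # manual AIC (df = # parameters p)
attr(ll, "df") * log(nobs(fit)) - 2 * as.numeric(ll)   # manual BIC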
When fitting a Decision Tree, if the distribution of a predictor variable is right-skewed, you should ____
Do nothing because tree splits are based on the rank ordering of the predictor and so applying a log would make no difference in the performance
Not being complex enough to capture signal in the data is called
the bias of the model
high bias is the same as underfitting
One of the assumptions of a GLM is that the _____ is related to the linear predictor through a link function
mean of the target distribution
You should combine factor levels with few observations into new groups that have more observations because doing so ____
reduces the dimension of the data set and increases predictive power
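A base-R sketch (the threshold of 30 and the column name region are arbitrary choices, not from the source):
lvl <- as.character(train$region)
rare <- names(which(table(lvl) < 30))   # levels with fewer than 30 observations
lvl[lvl %in% rare] <- "Other"           # lump rare levels into a single "Other" group
train$region <- factor(lvl)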
The amount by which the model will change given different training data is also called
Variance
High variance is the same as overfitting
Pearson’s Goodness of Fit Statistic
Used to measure the fit of Poisson (count) models
The lower the better
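For a fitted Poisson GLM fit (a hypothetical object), the statistic is the sum of squared Pearson residuals:
sum(residuals(fit, type = "pearson")^2)   # Pearson goodness-of-fit statistic
df.residual(fit)                          # a statistic far above the residual df suggests a poor fit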
In GLMs, we set the base factor levels to be the ones with the most observations because
This makes the GLM coefficients more stable because the intercept term is estimated with the largest sample size
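A sketch in R (column name hypothetical, region assumed to already be a factor): set the most common level as the base level before fitting:
most_common <- names(which.max(table(train$region)))
train$region <- relevel(train$region, ref = most_common)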
The expected loss (error) from the model being too complex and sensitive to random noise is called _
The variance of the model
During data preparation, the only times that you should not combine factor levels with few observations together are when
- The mean of the target between levels is not similar
- it would make the results less interpretable
- project statement says not to
Advantages of single decision trees
- easy to interpret
- performs variable selection
- categorical variables do not require binarization for each level to be used as a separate predictor
- captures interactions
- captures non-linearities
- handles missing values
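A single-tree sketch using the rpart package (an assumption; the source does not name a package), with hypothetical data and tuning values; factor levels and non-linear effects are handled directly by the splits:
library(rpart)
fit <- rpart(claim ~ ., data = train, method = "class",
             control = rpart.control(cp = 0.001, minbucket = 25))
printcp(fit)   # complexity table used to prune the tree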
Describe imbalanced data
Target is a binary outcome with more observations of one class (majority) than the other (minority)
In binary classification, what is the interpretation of the model metric AUC when it is close to 0.5?
The model is doing no better than random guessing
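A sketch using the pROC package (an assumption, not named in the source), with hypothetical 0/1 labels actual and predicted probabilities prob:
library(pROC)
roc_obj <- roc(actual, prob)   # actual: 0/1 labels, prob: predicted probability of a claim
auc(roc_obj)                   # ~0.5 means no better than random guessing; 1.0 is perfect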