Anki Flashcards
How to check whether the model is underfitting
Compare the model performance against a simple model such as the average target value or a GLM with only a few predictors
✔ Underfitting if performance is the same or worse
✔ It is not sufficient to just look at the training and testing error
BIC
Bayesian Information Criterion
Used to compare GLMs
Lower is better
Minimize error and maximize likelihood
p * log(nrow(train)) - 2log(likelihood)
p = # parameters
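A minimal R sketch (the model, data frame, and column names are hypothetical):
fit <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = train)
BIC(fit)  # p * log(n) - 2 * log-likelihood; lower is better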
__ is when there is no pattern between the missingness and the value of the variable
Missing at random (MAR)
When predicting whether or not a policy will file claims, True Negatives (TN) are policies which
When a policy is predicted to not have a claim and does not have a claim
Describe the bias-variance-tradeoff
- the tradeoff between bias (underfitting) and variance (overfitting)
- increasing the bias will often decrease the variance
- increasing the variance will often decrease the bias
Mean Squared Error = variance + bias^2 + irreducible error
When the distribution of a predictor variable is right-skewed you should ___
apply a log transform
True/False: The goal of feature selection is to choose the features which are most predictive and discard those which are not
False. Features may be predictive but still excluded because of ✔ Racial or ethical concerns ✔ Limitations on future availability ✔ Instability of the data over time ✔ Lack of interpretability.
Decision trees identify the optimal variable-split-point combination by measuring ____ or ____
Entropy or Gini
When predicting whether or not a policy will file claims, True Positives (TP) are policies which
When a policy is predicted to have a claim and actually does have a claim
The set of simplifying assumptions made by the model is called
bias
GLM response distributions that are strictly positive
Poisson (discrete)
gamma (continuous)
inverse gaussian (continuous)
For the regression metric RMSE, is higher or lower better?
Lower
“Minimize error and maximize likelihood”
RMSE = root mean squared error
How to check whether the model is overfitting
The training error is much better than the test error
One disadvantage of ___ models is that the predictor variables need to be uncorrelated.
GLM
Penalized regression model(s) where variables are removed by having their coefficients set to zero
LASSO and Elastic Net
When fitting a GLM, if the distribution of the target variable is right-skewed, you should ____
use a log link function
What is the objective of the k-means algorithm?
To partition the observations into k groups such that the sum of squares from points to the assigned cluster centers is minimized
The variable “body mass index” contains missing values because the laptop that they were stored on had coffee spilled on it. This is an example of
Missing at random (MAR).
✔ There is no pattern between whether the value is missing and the target value.
✔ Observations can safely be omitted from the data with no loss in predictive power besides the smaller sample size.
✔ If > 20% of records are missing, consider removing the variable altogether.
When running k-means clustering, it is best to use multiple starting configurations (nstart between 10-50) and keep the run with the lowest total within-cluster sum of squares because this reduces the likelihood of ____
Getting stuck at a local minimum, rather than the global minimum, of the sum of squared errors between the cluster centers and each of the points
When a hierarchical clustering algorithm uses single linkage, the distances between two clusters are computed by
Computing the distances between all pairs of points between clusters A and B and then using the smallest
Define an interaction effect
When the impact of a predictor variable on the target variable differs based on the value of another predictor variable
One disadvantage of ____ models is that they are unable to detect non-linear relationships between the predictor variables and the target.
GLMs
When a hierarchical clustering algorithm uses complete linkage (the default in R's hclust), the distances between two clusters are computed by
Computing the distances between all pairs of points between clusters A and B and using the largest.
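A minimal R sketch (df is a hypothetical numeric data frame):
d <- dist(scale(df))                  # pairwise distances on scaled variables
hc <- hclust(d, method = "complete")  # complete linkage is hclust()'s default
plot(hc)                              # dendrogram; cut into k clusters with cutree(hc, k = 3)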
AIC
Akaike Information Criterion
Used to compare GLMs
Lower is better
2p - 2log(likelihood)
p = # parameters
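A minimal R sketch, reusing the hypothetical fit from the BIC card:
AIC(fit)  # 2p - 2 * log-likelihood; lower is better
step(fit, direction = "backward")                        # stepwise selection by AIC (k = 2)
step(fit, direction = "backward", k = log(nrow(train)))  # same, but using the BIC penalty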
When fitting a Decision Tree, if the distribution of a predictor variable is right-skewed, you should ____
Do nothing because tree splits are based on the rank ordering of the predictor and so applying a log would make no difference in the performance
Not being complex enough to capture signal in the data is called
the bias of the model
high bias is the same as underfitting
One of the assumptions of a GLM is that the _____ is related to the linear predictor through a link function
mean of the target distribution
You should combine observations from factor levels with few observations into new groups that have more observations because doing so ____
reduces the dimension of the data set and increases predictive power
The amount by which the model will change given different training data is also called
Variance
High variance is the same as overfitting
Pearson’s Goodness of Fit Statistic
Used to measure the fit of Poisson (counting) models
The lower the better
In GLMs, we set the base factor levels to be the ones with the most observations because
This makes the GLM coefficients more stable because the intercept term is estimated with the largest sample size
The expected loss (error) from the model being too complex and sensitive to random noise is called _
The variance of the model
During data preparation, the only times that you should not combine factor levels with few observations together are when
- The mean of the target between levels is not similar
- it would make the results less interpretable
- project statement says not to
Advantages of single decision trees
- easy to interpret
- performs variable selection
- categorical variables can be used directly, without binarizing each level into a separate predictor
- captures interactions
- captures non-linearities
- handles missing values
Describe imbalanced data
Target is a binary outcome with more observations of one class (majority) than the other (minority)
In binary classification, what is the interpretation of the model metric AUC when it is close to 0.5?
The model is doing no better than random guessing
A “drug use” variable has missing values because some respondents were reluctant to admit that they have broken the law. This is an example of ___
Missing not at random (MNAR)
When using this penalized regression model, the sizes of coefficients are reduced (shrunk) but never zero
Ridge Regression
When using a ___ link function, the coefficients can be explained as the impact on a z-score for a Normal distribution
probit
One of the assumptions of ____ is that the target variable has a specific distribution
GLM
For the metric log-likelihood, is higher or lower better?
Higher
Lambda (Elastic Net)
- determines the strength of regularization to use
- R tests a sequence of lambda values using cross validation and then chooses the one with the lowest test error
(1/2)MSE + λ × (penalty)
?glmnet
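A minimal R sketch (the data frame train and target y are hypothetical):
library(glmnet)
X <- model.matrix(y ~ ., data = train)[, -1]  # glmnet requires a numeric matrix
cv <- cv.glmnet(X, train$y, alpha = 0.5)      # alpha = 0 ridge, 1 LASSO, in between elastic net
cv$lambda.min                                 # lambda with the lowest cross-validated error
coef(cv, s = "lambda.min")                    # zeroed coefficients = removed variables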
For the regression metric Mean Absolute Error (MAE), is higher or lower better
lower
Using a ___ link function with a GLM results in a multiplicative model
log
How do GLMs handle missing values?
They get removed automatically by most software. This can result in a loss of useful information that could be predictive if there is any pattern in the missing values.
When the target distribution is strictly positive and continuous, the best GLM response distributions are
gamma
inverse gaussian
How do decision trees handle interactions?
Because decision trees use a series of conditional yes/no questions, the impact that a predictor has can be different depending on which previous splits were used
When predicting whether or not a policy will file claims, False Positives (FP) are policies that
Are predicted to have a claim but did not actually have a claim
Formula: Sensitivity or True Positive Rate (TPR)
TP/(TP + FN)
When predicting whether or not a policy will file claims, False Negatives (FN) are policies which
Are predicted to not have a claim but actually did have a claim
When fitting a Bagged Tree, if the distribution of the target variable is right-skewed, you should ____
Do nothing because tree-based models automatically capture monotonic transformations.
In binary classification, a logit has an AUC of 0.99. What are the approximate sensitivity (TPR) and specificity (TNR) values?
Both are close to 1
Disadvantages of hierarchical clustering
- doesn’t work well on large data sets because the dendrogram becomes difficult to read, making it hard to determine the correct number of clusters
- computation complexity can result in very long computation times (as opposed to kmeans which is faster)
GLM output: Normal Q-Q graph
- The normal quantile-quantile graph shows the theoretical quantiles against the observed quantiles of the deviance residuals
- the deviance residuals should be approximately normally distributed regardless of the GLM response family (except for binomial)
- some deviations along the upper and lower quantiles are acceptable. This indicates that the residuals have a “fat tail”
Describe the Tweedie Distribution
- A GLM response distribution which is a good fit for insurance claims data when there is over-dispersion, such as a large mass at zero
- models frequency and severity at the same time
GLM offset
- A constant term that is added to the linear predictor
- the same as including a variable which has a coefficient equal to 1
- on Exam PA, offsets only appear:
1. with Poisson regression
2. with a log link function
3. as a measure of exposure, such as the length of the policy period (remember to apply a log to the offset when using a log link function)
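A minimal R sketch (column names are hypothetical):
fit <- glm(n_claims ~ age + region + offset(log(exposure)),  # log applied because the link is log
           family = poisson(link = "log"), data = train)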
In binary classification, what is the interpretation of the model metric AUC when it is close to 1.0?
The model predicts the target perfectly
Advantages of hierarchical clustering
- the dendrogram helps to understand the data
- is the best fit for hierarchical data (e.g., geography such as city, state, country)
- shows how much clusters differ based on dendrogram length
- no input parameters (e.g., no need to specify the number of clusters in advance)
In a GLM, what does the p-value of a coefficient represent?
For a given coefficient estimate, the p-value is the probability of observing an estimate of that magnitude (or larger) by pure chance if the true coefficient were zero
How do GLMs handle interactions?
they need to be added manually
In logistic regression, what is the formula to convert the linear predictor, z, to the probability, p?
p = e^z / (1 + e^z)
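A quick check in R (z = 1.2 is a hypothetical linear predictor value):
z <- 1.2
exp(z) / (1 + exp(z))  # 0.7685...
plogis(z)              # the built-in logistic function gives the same result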
The process of k-means
1. Select the number of clusters, k
2. Randomly assign the initial cluster centers
3. Assign each point to the closest cluster center
4. Move each cluster center to the mean of the points assigned to it
5. Repeat steps 3-4 until the centers stop moving
6. Repeat steps 2-5 nstart times to reduce the randomness of choosing the initial cluster centers
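A minimal R sketch (df is a hypothetical numeric data frame):
set.seed(42)
km <- kmeans(scale(df), centers = 3, nstart = 25)  # 25 random starts; R keeps the best run
km$tot.withinss  # total within-cluster sum of squares (the quantity being minimized)
km$cluster       # cluster assignment for each observation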
Define the curse of dimensionality
- When there are more features than observations (p > n), we run the risk of overfitting the model. Using a dimensionality reduction method (PCA) or a model which performs feature selection can help
- When there are too many features, observations become harder to cluster because every observation in the data appears equidistant from the others. If the distances are all approximately equal, then all the observations appear equally alike
Describe Principal Component Analysis (PCA)
A dimensionality reduction method which converts potentially correlated variables into a subset of linearly independent new variables called principal components (PCs)
- each PC is created to retain as much information from the original data as possible while remaining uncorrelated with all prior PCs
- scaling is applied to each variable prior to fitting
- the size and sign of the PC loadings are useful for interpretation
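A minimal R sketch (df is a hypothetical numeric data frame):
pca <- prcomp(df, scale. = TRUE)  # center and scale each variable before fitting
summary(pca)                      # proportion of variance explained by each PC
pca$rotation                      # loadings; their size and sign aid interpretation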
Advantages of boosted trees
- high accuracy
- is effective in a wide range of applications
- handles nonlinearities, interaction effects, and missing data
Claim count model
Target: ? Distribution: ? Link: ? Weight: ? Offset: ?
Target: counting variable
Distribution: Poisson
Link: log
Weight: none
Offset: log(# of exposures)
GLM output: Residuals vs. fitted
Good fit:
- all points are centered near zero on the y-axis and spread out symmetrically along the x-axis
- this indicates that the variance is constant
- the mean of the residuals is near zero
Disadvantages of single decision trees
- lacks predictive power
- can overfit to the data easily
- often oversimplifies the underlying process because all observations in a terminal node receive the same predicted value
Tweedie distribution power variance parameter
The power variance parameter, p, specifies the distribution:
p = 0: Gaussian
p = 1: Poisson
p = 2: Gamma
p = 3: Inverse Gaussian
1 < p < 2: compound Poisson-Gamma (the case used for insurance claims)
Disadvantages of bagged trees
- high complexity
- difficult to interpret
- requires a lot of computation power
Advantages to GLMs
- easy to interpret
- can easily deploy to spreadsheet format
- handles different response distributions
- is commonly used in insurance rate making
Bagging vs Boosting:
Predictions are made?
Easy to overfit?
Improves predictive power by?
Bagging:
Predictions are made: in parallel
Easy to overfit: No
Improves predictive power by: reducing variance
Boosting:
Predictions are made: sequentially
Easy to overfit: yes
Improves predictive power by: reducing variance and bias
Disadvantages of boosted trees
- high complexity
- hard to interpret
- easy to overfit if not tuned correctly
- requires a lot of computation power
Advantages of bagged trees
- high accuracy
- resilient to overfitting due to bagging
- only two parameters to tune (mtry, ntree)
- handles nonlinearities, interaction effects, and missing data
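A minimal R sketch (the data frame train and target y are hypothetical):
library(randomForest)
rf <- randomForest(y ~ ., data = train,
                   mtry = 3,     # predictors sampled at each split
                   ntree = 500)  # number of bagged trees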
Disadvantages to GLMs
- does not select features without techniques
- strict assumptions about the distribution shape and the randomness of error terms
- predictors need to be uncorrelated
- unable to detect nonlinearity (without manual adjustments)
- sensitive to outliers
- low predictive power
Formula: Specificity (True Negative Rate, TNR)
TN / (TN + FP)
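A minimal R sketch (predicted and actual are hypothetical 0/1 vectors):
tab <- table(predicted, actual)
TP <- tab["1", "1"]; TN <- tab["0", "0"]
FP <- tab["1", "0"]; FN <- tab["0", "1"]
TP / (TP + FN)  # sensitivity (TPR)
TN / (TN + FP)  # specificity (TNR)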
How to interpret the coefficients of a probit model
- positive coefficients for an input variable increase the linear predictor, which is a z-score
- negative coefficients decrease it
- coefficients further from zero have larger effects
Define data leakage
When the data you are using to train a machine learning algorithm happens to have the information you are trying to predict
Both AIC and BIC penalize the log-likelihood based on ____
the number of parameters
Why is binarization of dummy (indicator) variables performed?
So that stepwise selection in GLMs can remove individual factor levels, rather than keeping or removing the whole variable
Define multicollinearity in GLMs
- correlation between any two predictors is large
- any predictors are a linear combination of the others
Solutions:
- remove all but one of the predictors
- preprocess data using PCA
- use a tree-based model
Accuracy
- The percentage of observations which are classified correctly
- fails when we have imbalanced classes. In those cases AUC is more appropriate
Decision Tree Complexity Parameter (CP)
?rpart
- CP value represents “minimum benefit” that a split must add to the tree
- cp = 0: no restrictions -> results in a tall tree -> high complexity -> high variance
- cp = 0.01 (default): each split must improve the fit by at least 0.01 to be kept
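A minimal R sketch (the data frame train and target y are hypothetical):
library(rpart)
fit <- rpart(y ~ ., data = train, control = rpart.control(cp = 0))  # unrestricted, tall tree
printcp(fit)                     # cross-validated error for each candidate cp
pruned <- prune(fit, cp = 0.01)  # drop splits that add less than cp of benefit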
BIC favors models with fewer parameters than AIC does when ____
log(nrow(train_data)) > 2, i.e., when the training set has more than e^2 ≈ 7.4 observations (nearly always)
Correlation Key Points
- measures linear association between two variables
- positive correlation: increasing one variable tends to increase the other; negative correlation: increasing one tends to decrease the other
- does not equal causation
Decision tree cost-complexity pruning
- choose a tree that strikes a balance between having a low error and having few splits so that it can be interpreted
- this guards against overfitting (tree too complex) and underfitting (tree too simple)
Steps:
- a decision tree with many leaves is created
- complexity is calculated for all subtrees using cross-validation
- the least important branches are pruned
Area Under the Curve (AUC) probability interpretation
The probability that a randomly chosen positive observation is ranked higher than a randomly chosen negative observation.
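A minimal R sketch using the pROC package (actual and predicted_prob are hypothetical vectors of 0/1 outcomes and predicted probabilities):
library(pROC)
r <- roc(actual, predicted_prob)
auc(r)  # ~0.5 = random guessing; ~1.0 = near-perfect ranking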
Alpha (elastic net)
the elastic net mixing parameter: alpha = 0 gives ridge regression, alpha = 1 gives LASSO, and values in between blend the two penalties
GLM: Claim Frequency Model
Target Variable: average number of claims per policy period
Response Family: ?
Link Function: ?
Offset: ?
Weight: ?
Response Family: Poisson
Link Function: Log
Offset: None
Weight: policy period (or other units of exposure)
results in the same predictions as the claim count model
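A minimal R sketch (column names are hypothetical; R warns about non-integer counts here, which is expected for a rate target):
fit <- glm(n_claims / exposure ~ age + region,      # target is a rate, not a count
           family = poisson(link = "log"),
           weights = exposure, data = train)        # weight = units of exposure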