Anki Flashcards

1
Q

How to check whether the model is underfitting

A

Compare the model performance against a simple model such as the average target value or a GLM with only a few predictors
✔ Underfitting if performance is the same or worse
✔ It is not sufficient to just look at the training and testing error

2
Q

BIC

A

Bayesian Information Criterion
Used to compare GLMs
Lower is better
Minimize error and maximize likelihood

p * log(nrow(train)) - 2log(likelihood)
p = # parameters

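A minimal R sketch comparing two GLMs by AIC/BIC; the data are simulated and the variable names are illustrative:

set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- rpois(100, lambda = exp(0.5 + 0.3 * dat$x1))

mod_full  <- glm(y ~ x1 + x2, data = dat, family = poisson(link = "log"))
mod_small <- glm(y ~ x1,      data = dat, family = poisson(link = "log"))

AIC(mod_full); AIC(mod_small)  # 2p - 2log(likelihood)
BIC(mod_full); BIC(mod_small)  # p*log(n) - 2log(likelihood); lower is better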
3
Q

__ is when there is no pattern between the missingness and the value of the variable

A

Missing at random (MAR)

4
Q

When predicting whether or not a policy will file claims, True Negatives (TN) are policies which

A

Are predicted to not have a claim and actually do not have a claim

5
Q

Describe the bias-variance tradeoff

A
  • the tradeoff between bias (underfitting) and variance (overfitting)
  • increasing the bias will often decrease the variance
  • increasing the variance will often decrease the bias

Mean Squared Error = variance + bias^2 + irreducible error

6
Q

When the distribution of a predictor variable is right-skewed you should ___

A

apply a log transform

7
Q

True/False: The goal of feature selection is to choose the features which are most predictive and discard those which are not

A
False. Features may be predictive but still excluded because of
✔ Racial or ethical concerns
✔ Limitations on future availability
✔ Instability of the data over time
✔ Inexplicability
8
Q

Decision trees identify the optimal variable-split-point combination by measuring ____ or ____

A

Entropy or Gini

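A minimal R sketch of the two impurity measures; the class proportions are illustrative:

impurity <- function(p) c(gini = sum(p * (1 - p)), entropy = -sum(p * log2(p)))
impurity(c(0.5, 0.5))  # maximum impurity for two classes
impurity(c(0.9, 0.1))  # purer split candidate -> lower impurity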
9
Q

When predicting whether or not a policy will file claims, True Positives (TP) are policies which

A

Are predicted to have a claim and actually do have a claim

10
Q

The set of simplifying assumptions made by the model is called

A

bias

11
Q

GLM response distributions that are strictly positive

A

Poisson (discrete)
gamma (continuous)
inverse gaussian (continuous)

12
Q

For the regression metric RMSE, is higher or lower better?

A

Lower

“Minimize error and maximize likelihood”
RMSE = root mean squared error

13
Q

How to check whether the model is overfitting

A

The training error is much better than the test error

14
Q

One disadvantage of ___ models is that the predictor variables need to be uncorrelated.

A

GLM

15
Q

Penalized regression model(s) where variables are removed by having their coefficients set to zero

A

LASSO and Elastic Net

16
Q

When fitting a GLM, if the distribution of the target variable is right-skewed, you should ____

A

use a log link function

17
Q

What is the objective of the k-means algorithm?

A

To partition the observations into k groups such that the sum of squares from points to the assigned cluster centers is minimized

18
Q

The variable “body mass index” contains missing values because the laptop that they were stored on had coffee spilled on it. This is an example of

A

Missing at random (MAR).

✔ There is no pattern between whether the value is missing and the target value.
✔ Observations can safely be omitted from the data with no loss in predictive power besides the smaller sample size.
✔ If > 20% of records are missing, consider removing the variable altogether.

19
Q

When running k-means clustering, it is best to use multiple starting configurations (nstart between 10 and 50) and keep the best result because this reduces the likelihood of ____

A

Getting stuck at a local minimum as opposed to the global minimum of the sum of squared errors between the cluster centers and each of the points

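A minimal R sketch; kmeans() runs nstart random initializations and keeps the run with the lowest total within-cluster sum of squares (iris used for illustration):

km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
km$tot.withinss  # best (lowest) total within-cluster SS across the starts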
20
Q

When a hierarchical clustering algorithm uses single linkage, the distances between two clusters are computed by

A

Computing the distances between all pairs of points across clusters A and B and using the smallest

21
Q

Define an interaction effect

A

When the impact of a predictor variable on the target variable differs based on the value of another predictor variable

22
Q

One disadvantage of ____ models is that they are unable to detect non-linear relationships between the predictor variables and the target.

A

GLMs

23
Q

When a hierarchical clustering algorithm uses complete linkage (the default), the distances between two clusters are computed by

A

Computing the distances between all pairs of points across clusters A and B and using the largest.

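A minimal R sketch contrasting the two linkages (iris used for illustration):

d <- dist(scale(iris[, 1:4]))                  # pairwise distances
hc_single   <- hclust(d, method = "single")    # merge on smallest inter-cluster distance
hc_complete <- hclust(d, method = "complete")  # merge on largest; hclust's default
cutree(hc_complete, k = 3)                     # cluster assignments for 3 clusters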
24
Q

AIC

A

Akaike Information Criterion
Used to compare GLMs
Lower is better

2p - 2log(likelihood)
p = # parameters

25
Q

When fitting a Decision Tree, if the distribution of a predictor variable is right-skewed, you should ____

A

Do nothing because tree splits are based on the rank ordering of the predictor and so applying a log would make no difference in the performance

26
Q

Not being complex enough to capture signal in the data is called

A

the bias of the model

high bias is the same as underfitting

27
Q

One of the assumptions of a GLM is that the _____ is related to the linear predictor through a link function

A

mean of the target distribution

28
Q

You should combine factor levels with few observations into new groups that have more observations because doing so ____

A

reduces the dimension of the data set and increases predictive power

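A minimal R sketch with a simulated factor; the level names and the cutoff of 30 are illustrative:

set.seed(1)
region <- factor(sample(c("A", "B", "C", "D"), 200, replace = TRUE,
                        prob = c(0.45, 0.40, 0.10, 0.05)))
tbl <- table(region)
rare <- names(tbl)[tbl < 30]                 # levels with few observations
region_chr <- as.character(region)
region_chr[region_chr %in% rare] <- "OTHER"  # pool the sparse levels
region <- factor(region_chr)
table(region)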
29
Q

The amount by which the model will change given different training data is also called

A

Variance

High variance is the same as overfitting

30
Q

Pearson’s Goodness of Fit Statistic

A

Used to measure the fit of Poisson (counting) models

The lower the better

31
Q

In GLMs, we set the base factor levels to be the ones with the most observations because

A

This makes the GLM coefficients more stable because the intercept term is estimated with the largest sample size

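A minimal R sketch with a simulated factor; names are illustrative:

set.seed(1)
region <- factor(sample(c("A", "B", "C"), 100, replace = TRUE, prob = c(0.2, 0.5, 0.3)))
region <- relevel(region, ref = names(which.max(table(region))))  # base = most common
levels(region)[1]  # the base level is now the largest group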
32
Q

The expected loss (error) from the model being too complex and sensitive to random noise is called _

A

The variance of the model

33
Q

During data preparation, the only times that you should not combine factor levels with few observations together are when

A
  • the mean of the target is not similar between the levels
  • it would make the results less interpretable
  • project statement says not to
34
Q

Advantages of single decision trees

A
  • easy to interpret
  • performs variable selection
  • categorical variables do not require binarization for each level to be used as a separate predictor
  • captures interactions
  • captures non-linearities
  • handles missing values
35
Q

Describe imbalanced data

A

Target is a binary outcome with more observations of one class (majority) than the other (minority)

36
Q

In binary classification, what is the interpretation of the model metric AUC when it is close to 0.5?

A

The model is doing no better than random guessing

37
Q

A “drug use” variable that has missing values because some respondents were reluctant to admit that they have broken the law. This is an example of ___

A

Missing not at random (MNAR)

38
Q

When using this penalized regression model, the sizes of coefficients are reduced (shrunk) but never set exactly to zero

A

Ridge Regression

39
Q

When using a ___ link function, the coefficients can be explained as the impact on a z-score for a Normal distribution

A

probit

40
Q

One of the assumptions of ____ is that the target variable has a specific distribution

A

GLM

41
Q

For the metric log-likelihood, is higher or lower better?

A

Higher

42
Q

Lambda (Elastic Net)

A
  • determines the strength of regularization to use
  • R tests a sequence of lambda values using cross-validation and then chooses the one with the lowest test error

(1/2)·MSE + λ·(penalty)

?glmnet

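A minimal R sketch of the cross-validated search; the data are simulated, and alpha = 0.5 is an arbitrary elastic net mix:

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), nrow = 100)  # glmnet requires a numeric matrix
y <- x[, 1] - 2 * x[, 2] + rnorm(100)
cv <- cv.glmnet(x, y, alpha = 0.5)       # tests a sequence of lambdas by CV
cv$lambda.min                            # lambda with the lowest CV error
coef(cv, s = "lambda.min")               # some coefficients shrunk to exactly zero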
43
Q

For the regression metric Mean Absolute Error (MAE), is higher or lower better?

A

lower

44
Q

Using a ___ link function with a GLM results in a multiplicative model

A

log

45
Q

How do GLMs handle missing values?

A

They get removed automatically by most software. This can result in a loss of useful information that could be predictive if there is any pattern in the missing values.

46
Q

When the target distribution is strictly positive and continuous, the best GLM response distributions are

A

gamma

inverse gaussian

47
Q

How do decision trees handle interactions?

A

Because decision trees use a series of conditional yes/no questions, the impact that a predictor has can be different depending on which previous splits were used

49
Q

When predicting whether or not a policy will file claims, False Positives (FP) are policies that

A

Predicted to have a claim but didn’t actually have a claim

50
Q

Formula: Sensitivity or True Positive Rate (TPR)

A

TP/(TP + FN)

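A minimal sketch with illustrative counts:

TP <- 40; FN <- 10; TN <- 35; FP <- 15
TP / (TP + FN)  # sensitivity (TPR): share of actual positives predicted positive
TN / (TN + FP)  # specificity (TNR): share of actual negatives predicted negative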
51
Q

When predicting whether or not a policy will file claims, False Negatives (FN) are policies which

A

Predicted to not have a claim but actually did have a claim

52
Q

When fitting a Bagged Tree, if the distribution of the target variable is right-skewed, you should ____

A

Do nothing because tree-based models automatically capture monotonic transformations.

53
Q

In binary classification, a logit has an AUC of 0.99. What are the approximate sensitivity (TPR) and specificity (TNR) values?

A

Both are close to 1

54
Q

Disadvantages of hierarchical clustering

A
  • doesn’t work well on large data sets; with many observations it is difficult to determine the correct number of clusters from the dendrogram
  • computational complexity can result in very long run times (as opposed to k-means, which is faster)
55
Q

GLM output: Normal Q-Q graph

A
  • The normal quantile-quantile graph shows the theoretical quantiles against the observed quantiles of the deviance residuals
  • for a well-fitting model, the deviance residuals should be approximately normally distributed regardless of the GLM response family (except for binomial)
  • some deviations along the upper and lower quantiles are acceptable. This indicates that the residuals have a “fat tail”
56
Q

Describe the Tweedie Distribution

A
  • A GLM response distribution which is a good fit for insurance claims data when there is over-dispersion such as having many values of zero
  • models frequency as well as severity at the same time
57
Q

GLM offset

A
  • A constant term that is added to the linear predictor
  • the same as including a variable which has a coefficient equal to 1
  • on Exam PA, offsets only appear:
    1. with Poisson regression
    2. with a log link function
    3. as a measure of exposure, such as the length of the policy period
  • remember to apply a log to the offset when using a log link function
58
Q

In binary classification, what is the interpretation of the model metric AUC when it is close to 1.0?

A

The model predicts the target perfectly

59
Q

Advantages of hierarchical clustering

A
  • the dendrogram helps to understand the data
  • is the best fit for hierarchical data (e.g., geography such as city, state, country)
  • shows how much clusters differ, based on the heights at which they merge in the dendrogram
  • no input parameters
60
Q

In a GLM, what does the p-value of a coefficient represent?

A

For a given coefficient estimate, the p-value estimates the probability of a value of that magnitude (or larger) arising by pure chance when the true coefficient is zero

61
Q

How do GLMs handle interactions?

A

they need to be added manually

62
Q

In logistic regression, what is the formula to convert the linear predictor, z, to the probability, p?

A

p = e^z / (1 + e^z)

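A minimal R sketch; z = 1.2 is an arbitrary linear predictor value:

z <- 1.2
exp(z) / (1 + exp(z))  # manual conversion
plogis(z)              # built-in equivalent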
63
Q

The process of k-means

A
  1. select the number of clusters, k
  2. randomly assign initial cluster centers
  3. assign each point to the closest cluster center
  4. move each cluster center to the mean of the points assigned to it
  5. repeat steps 3-4 until the centers stop moving
  6. repeat the whole process nstart times to reduce the randomness of choosing the initial cluster centers
64
Q

Define the curse of dimensionality

A
  1. When there are more features than observations (p > n) then we run the risk of overfitting the model. Using a dimensionality reduction method (PCA) or a model which performs feature selection can help this
  2. When there are too many features, observations become harder to cluster because every observation in the data appears equidistant from the others. If the distances are all approximately equal, then all the observations appear equally alike
65
Q

Describe Principal Component Analysis (PCA)

A

A dimensionality reduction method which converts potentially correlated variables into a smaller set of linearly independent new variables called principal components (PCs)

  • each PC is created to retain as much information from the original data as possible while staying uncorrelated with the prior PCs
  • center and scale each variable prior to fitting
  • the size and sign of the PC loadings are useful for interpretation
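A minimal R sketch using prcomp (iris used for illustration):

pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)  # scale before fitting
summary(pca)         # proportion of variance explained by each PC
pca$rotation[, 1:2]  # loadings: size and sign aid interpretation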
66
Q

Advantages of boosted trees

A
  • high accuracy
  • is effective in a wide range of applications
  • handles nonlinearities, interaction effects, and missing data
67
Q

Claim frequency model

Target: ?
Distribution: ?
Link: ?
Weight: ?
Offset: ?
A
Target: Counting variable
Distribution: Poisson
Link: Log
Weight: none
Offset: log(# of exposures)
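A minimal R sketch; the policies data and column names are simulated for illustration:

set.seed(1)
policies <- data.frame(age = rnorm(200, 40, 10), exposure = runif(200, 0.5, 1))
policies$claims <- rpois(200, policies$exposure * exp(-3 + 0.02 * policies$age))

freq_mod <- glm(claims ~ age, data = policies,
                family = poisson(link = "log"),
                offset = log(exposure))  # offset coefficient fixed at 1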
68
Q

GLM output: Residuals vs. fitted

A

Good fit:

  • all points are centered near zero on the y-axis and spread out symmetrically along the x-axis
  • this indicates that the variance is constant
  • the mean of the residuals is near zero
69
Q

Disadvantages of single decision trees

A
  • lacks predictive power
  • can overfit to the data easily
  • often oversimplifies the underlying process because all observations at a terminal node receive the same predicted value
70
Q

Tweedie distribution power variance parameter

A
Power variance parameter, p, specifies the distribution:
  p = 0: Gaussian
  p = 1: Poisson
  1 < p < 2: compound Poisson-gamma (the insurance claims case)
  p = 2: Gamma
  p = 3: Inverse Gaussian
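A hedged R sketch, assuming the statmod package's tweedie() family; the zero-inflated target is simulated crudely for illustration:

library(statmod)
set.seed(1)
dat <- data.frame(x = rnorm(200))
dat$y <- rpois(200, 0.8) * rgamma(200, shape = 2, rate = 2)  # many exact zeros
twd <- glm(y ~ x, data = dat,
           family = tweedie(var.power = 1.5, link.power = 0))  # 1 < p < 2, log link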
71
Q

Disadvantages of bagged trees

A
  • high complexity
  • difficult to interpret
  • requires a lot of computation power
72
Q

Advantages of GLMs

A
  • easy to interpret
  • can easily deploy to spreadsheet format
  • handles different response distributions
  • is commonly used in insurance rate making
74
Q

Bagging vs Boosting:
Predictions are made?
Easy to overfit?
Improves predictive power by?

A

Bagging:
Predictions are made: in parallel
Easy to overfit: No
Improves predictive power by: reducing variance

Boosting:
Predictions are made: sequentially
Easy to overfit: yes
Improves predictive power by: reducing variance and bias

75
Q

Disadvantages of boosted trees

A
  • high complexity
  • hard to interpret
  • easy to overfit if not tuned correctly
  • requires a lot of computation power
76
Q

Advantages of bagged trees

A
  • high accuracy
  • resilient to overfitting due to bagging
  • only two parameters to tune (mtry, ntrees)
  • handles nonlinearities, interaction effects, and missing data
77
Q

Disadvantages of GLMs

A
  • does not perform feature selection without additional techniques (e.g., stepwise selection or regularization)
  • strict assumptions about the target distribution and the randomness of error terms
  • predictors need to be uncorrelated
  • unable to detect nonlinearity (without manual adjustments)
  • sensitive to outliers
  • low predictive power
78
Q

Formula: Specificity (True Negative Rate, TNR)

A

TN / (TN + FP)

79
Q

How to interpret the coefficients of a probit model

A

  • positive coefficients for an input variable increase the linear predictor, which is a z-score
  • negative coefficients decrease it
  • coefficients further from zero have larger effects

80
Q

Define data leakage

A

When the data you are using to train a machine learning algorithm happens to have the information you are trying to predict

81
Q

Both AIC and BIC penalize the log-likelihood based on ____

A

the number of parameters

82
Q

Why is binarization of dummy (indicator) variables performed?

A

For stepwise selection in GLMs, so that individual factor levels can be removed rather than keeping or removing the whole variable

83
Q

Define multicollinearity in GLMs

A
Occurs when:

  • the correlation between any two predictors is large
  • any predictor is a linear combination of the others

Solutions:

  • remove all but one of the predictors
  • preprocess data using PCA
  • use a tree-based model
84
Q

Accuracy

A
  • The percentage of observations which are classified correctly
  • fails when we have imbalanced classes. In those cases AUC is more appropriate
85
Q

Decision Tree Complexity Parameter (CP)

A

?rpart
- the CP value represents the “minimum benefit” that a split must add to the tree

cp = 0: no restrictions -> results in a tall tree -> high complexity -> high variance
cp = 0.01 (default): each split must improve the overall fit by at least a factor of 0.01 before it is kept
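A minimal R sketch of growing a large tree and pruning it back (iris used for illustration):

library(rpart)
full_tree <- rpart(Species ~ ., data = iris, control = rpart.control(cp = 0))
printcp(full_tree)  # cross-validated error (xerror) for each subtree
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned <- prune(full_tree, cp = best_cp)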
86
Q

BIC favors models with fewer parameters than AIC does when ____

A

log(nrow(train_data)) > 2, i.e., whenever the training set has more than e² ≈ 7.4 observations (almost always)

87
Q

Correlation Key Points

A
  • measures linear association between two variables
  • positive correlation is when increasing one tends to increase the other; negative correlation is when increasing one tends to decrease the other
  • does not equal causation
88
Q

Decision tree cost-complexity pruning

A
  • choose a tree that strikes a balance between having a low error and having few splits so that it can be interpreted
  • adjusted for overfitting (tree too complex) or underfitting (tree too simple)

Steps:

  1. a decision tree with many leaves is created
  2. complexity is calculated for all subtrees using cross-validation
  3. the least important branches are pruned
89
Q

Area Under the Curve (AUC) probability interpretation

A

The probability that a randomly chosen observation from the positive class is ranked higher than a randomly chosen observation from the negative class.

90
Q

Alpha (elastic net)

A

the elastic net mixing parameter: alpha = 0 corresponds to ridge regression and alpha = 1 to LASSO

91
Q

GLM: Claim Frequency Model

Target Variable: Average number of claims per policy period
Response Family: ?
Link Function: ?
Offset: ?
Weight: ?
A

Response Family: Poisson
Link Function: Log
Offset: None
Weight: policy period (or other units of exposure)

results in the same predictions as the claim count model (which uses an offset instead of a weight)