6 - Building Your First Model Flashcards

1
Q

What is prior authorization?

A

The process in which an insurance company requires a physician to get clearance for reimbursement before providing a service or procedure.

Prior authorization is used to ensure reimbursement for medically necessary services and to control costs.

2
Q

Why do insurers use prior authorizations?

A

To ensure reimbursement for medically necessary services and to control costs by adding friction to the system for physicians and patients.

3
Q

What is a significant issue with current prior authorization processes?

A

There are too many false positives and false negatives regarding surgery approvals, leading to higher costs and complications for patients.

Patients who need surgery may be denied, while those who don’t need it may be approved.

4
Q

What role does data science play in improving prior authorization decisions?

A

Data science can empower decision-makers with data to make better choices about surgery approvals and denials.

5
Q

What is the challenge faced by reviewers in the prior authorization process?

A

Reviewers have difficulty predicting whether a patient would benefit from nonsurgical treatment options.

6
Q

What is the main goal of the model that David and the team want to build?

A

To accurately determine which patients would benefit from nonsurgical treatment options for back pain.

7
Q

What is the difference between prediction and explanation in data modeling?

A

Prediction focuses on accurately forecasting outcomes, while explanation seeks to understand the variables leading to those outcomes.

8
Q

What is meant by defining the outcome in a predictive model?

A

The outcome is the quantity that the model aims to predict, such as a patient’s chance of success on a nonsurgical treatment plan.

9
Q

How did the team propose to measure success for nonsurgical treatment?

A

By predicting health care utilization related to back pain, such as medications and physical therapy, rather than just recovery from pain.

10
Q

What potential bias did Jenna raise regarding health care utilization predictions?

A

Patients may have lower utilization due to not seeking care, which does not necessarily mean their back pain is well managed.

11
Q

What is feature engineering?

A

The process of creating variables to feed into a predictive model.

12
Q

What type of data will the model use?

A

Tabular data that can be organized in spreadsheet format.

13
Q

What temporal restriction is important when building the model?

A

The model should use only data collected before the prior authorization request, so that the same data will actually be available when the model is applied in practice.

14
Q

What are catastrophic events mentioned as potential outcomes?

A

Hospitalizations or deaths from complications, which are significant for both patient welfare and cost considerations.

15
Q

What time frame did Jenna suggest for measuring health care expenditure?

A

Three years after the prior authorization request for surgery.

16
Q

What is the significance of defining specific outcomes for the model?

A

Specific outcomes help focus the model on what is actually being measured, improving its accuracy and relevance.

17
Q

True or False: The model can use any type of data for predictions.

A

False.

The model requires data to be in a tabular format, which excludes unstructured text data unless it is first engineered into tabular features.

18
Q

What is the primary model outcome discussed in the meeting?

A

Health care utilization related to back pain.

19
Q

What is the potential risk of using health care expenditure as an outcome measure?

A

It may not accurately reflect a patient’s condition if they do not seek care, leading to misleading conclusions.

20
Q

What did David suggest about the prediction window for health care utilization?

A

To try a range of prediction windows from one month to one year to see how model accuracy changes.

21
Q

What was Kamala’s concern about older patients in the model?

A

Older patients may die soon after submitting a request, potentially skewing health care expenditure predictions.

22
Q

What is feature engineering?

A

The process by which we create variables to feed into our model from other data sources.

23
Q

How can clinician’s notes be utilized in modeling?

A

Feature engineering techniques from natural language processing can convert free text into a table of numbers.
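
One way to picture this (a minimal sketch using scikit-learn's bag-of-words vectorizer; the note snippets are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical free-text clinician notes (invented examples)
notes = [
    "chronic low back pain radiating to the left leg, trialed NSAIDs",
    "acute back pain after lifting, no radiculopathy, started physical therapy",
    "back pain with numbness, MRI shows disc herniation at L4-L5",
]

# Bag-of-words: each note becomes a row of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(notes)

# X is now a table of numbers: rows are notes, columns are words
print(vectorizer.get_feature_names_out())
print(X.toarray())
```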

24
Q

What is optical character recognition used for?

A

To extract numerical and text data from scanned PDFs, turning them into machine-readable format.

25
What must be considered before using scanned PDF data in modeling?
Whether all existing tabular data has been exhausted; scanned PDFs are usually worth the extra effort only after simpler data sources have been used.
26
What is an example of creating a feature from claims data?
Counting the total number of times a patient was hospitalized in the year before the prior authorization submission.
27
What are some metrics that can be created from existing claims data?
* Total number of hospital visits
* Total number of medications prescribed
* Average number of claims per month
* Rate of change in prescription drug use
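A rough sketch of how such features could be computed with pandas; the table, column names, and dates are made up for illustration, and the count is restricted to the year before the prior authorization date:

```python
import pandas as pd

# Hypothetical claims table (column names and values are invented)
claims = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "claim_date": pd.to_datetime(
        ["2022-03-01", "2022-07-15", "2021-01-10", "2022-05-20", "2022-06-02"]),
    "claim_type": ["hospitalization", "medication", "hospitalization",
                   "medication", "hospitalization"],
})
prior_auth_date = pd.to_datetime("2022-09-01")

# Temporal restriction: keep only claims from the year before the request
window = claims[
    (claims["claim_date"] < prior_auth_date)
    & (claims["claim_date"] >= prior_auth_date - pd.DateOffset(years=1))
]

# Engineered features, one row per patient
features = window.groupby("patient_id").agg(
    n_hospitalizations=("claim_type", lambda s: (s == "hospitalization").sum()),
    n_medications=("claim_type", lambda s: (s == "medication").sum()),
    claims_per_month=("claim_date", lambda s: len(s) / 12),
)
print(features)
```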
28
Is there a specific number of features needed for a model?
There’s no set rule; the goal is to find the most highly predictive features.
29
What is the role of the model in feature selection?
To identify statistical relationships between variables and find the strongest predictors.
30
True or False: A model can become confused if there are too many features.
True.
31
What is feature selection?
The process of identifying which features to include in your model.
32
What can happen if irrelevant variables are included in a model?
It can add more noise, resulting in worse predictions.
33
What is a common filtering method for variable selection?
Selecting variables that have a high correlation with the outcome variable.
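A small sketch of this filtering step, using a synthetic feature table in pandas:

```python
import numpy as np
import pandas as pd

# Synthetic feature table; the outcome depends mainly on feature_0 and feature_1
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 5)),
                  columns=[f"feature_{i}" for i in range(5)])
df["outcome"] = 2 * df["feature_0"] - df["feature_1"] + rng.normal(size=200)

# Filtering: rank features by absolute correlation with the outcome
correlations = df.drop(columns="outcome").corrwith(df["outcome"]).abs()
print(correlations.sort_values(ascending=False).round(2))

# Keep, say, the two most correlated features
selected = correlations.sort_values(ascending=False).head(2).index.tolist()
print("Selected:", selected)
```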
34
What are wrappers in feature selection?
Methods like forward selection, backward selection, and stepwise selection that examine how useful each feature is.
35
What does forward selection do?
Adds the single most predictive variable first, then keeps adding variables one at a time until none of the remaining variables adds significant predictive value.
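A bare-bones illustration of the forward-selection idea (not the book's exact procedure), using cross-validated R-squared to decide whether adding another feature still helps:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=10, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    # Try adding each remaining feature and keep the one that helps most
    scores = {
        f: cross_val_score(LinearRegression(), X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    best_feature, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score + 1e-4:  # stop when no variable adds a meaningful gain
        break
    selected.append(best_feature)
    remaining.remove(best_feature)
    best_score = score

print("Features chosen by forward selection:", selected)
```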
36
What is an embedded approach in feature selection?
A method that automatically includes variable selection as part of the model’s optimization algorithm.
37
What is the LASSO algorithm?
A variation of regression that minimizes the sum of the squared residuals plus a penalty weight times the sum of the absolute values of the regression coefficients.
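A quick sketch with scikit-learn's Lasso; the penalty weight alpha below is arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=0)

# Roughly: minimize (squared-error term) + alpha * sum(|coefficients|).
# The penalty pushes the coefficients of unhelpful features to exactly zero,
# so variable selection happens as part of fitting the model.
lasso = Lasso(alpha=1.0).fit(X, y)
print("Coefficients:", lasso.coef_.round(2))
print("Features kept:", [i for i, c in enumerate(lasso.coef_) if c != 0])
```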
38
What is the main goal of model training?
To feed the model data and allow it to learn patterns.
39
What does model testing involve?
Showing previously unseen data to the model and evaluating its predictions.
40
How does a model identify statistical patterns?
Using statistical associations and correlations from the training data.
41
What is the concept of fitting a trend line in modeling?
Finding the optimal values of parameters that best fit the data points.
42
What is the equation of a line used in linear prediction models?
Y = mX + b, where m is the slope and b is the y-intercept.
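A toy example of fitting m and b with numpy; the data are simulated from a known line plus noise:

```python
import numpy as np

# Simulated data from Y = 2X + 1 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
Y = 2 * X + 1 + rng.normal(scale=1.0, size=50)

# Training: find the slope m and intercept b that best fit the points
m, b = np.polyfit(X, Y, deg=1)
print(f"estimated slope m = {m:.2f}, intercept b = {b:.2f}")

# The fitted line can then predict Y for a new X
print("prediction at X = 5:", round(m * 5 + b, 2))
```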
43
What happens if the model encounters a data point that is very different from the training data?
It may perform poorly due to lack of representative training data.
44
What happens if the training data is not representative of real-life situations?
The model may perform poorly in unexpected scenarios. For example, if a model is trained only on pictures of cats and dogs, it may struggle with pictures of hippos.
45
How can we ensure our model is robust?
By making our training data as representative as possible and filtering out dissimilar data points. This helps the model learn from a diverse array of cases.
46
What is the trade-off when deciding how much data to use for training versus testing?
Using more data for training generally improves the fitted model but leaves less data for testing, which makes the performance evaluation less reliable. A common rule of thumb is to reserve 20 to 40 percent of the data for testing.
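A minimal train/test split with scikit-learn, holding out 30 percent (within the 20-to-40 percent rule of thumb); the data are synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

# Hold out 30% of the data as a test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on training data:", round(model.score(X_train, y_train), 3))
print("R^2 on held-out test data:", round(model.score(X_test, y_test), 3))
```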
47
How does the size of the data set affect the amount reserved for testing?
In huge data sets, just 1 percent may be sufficient as a representative test set. This allows for effective evaluation without needing a large portion of the data.
48
What should be considered when handling new patient data?
The model's ability to generalize to new populations that may differ from the training population. For example, coal miners may have different health care utilization patterns than the general population.
49
Can we retrain the model if it performs poorly on testing?
Yes, but tweaking the model after testing can lead to overfitting. This is similar to studying for a practice test and then not performing well on the actual test.
50
What is overfitting?
Optimizing the model to perform well on test data at the expense of generalizing to other data sets. This can lead to poor performance when applied to new, unseen data.
51
What is the benefit of using cross-validation?
It reduces the likelihood of overfitting by testing the model on multiple validation sets. This involves splitting the data into folds and validating across them.
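A sketch of k-fold cross-validation with scikit-learn; five folds is just a common default:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

# Split the data into 5 folds; each fold takes one turn as the validation set
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("Validation R^2 per fold:", scores.round(3))
print("Average across folds:", round(scores.mean(), 3))
```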
52
What are the three main levers to improve model performance?
* Changing the data used by the model
* Changing the type of model
* Tuning hyperparameters

These adjustments can significantly affect prediction accuracy.
53
What is a hyperparameter?
A setting knob on a model that can be tuned to optimize performance. For example, the time spent studying a page in a textbook can be viewed as a hyperparameter.
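For instance, the LASSO penalty alpha is a hyperparameter; a grid search over a handful of candidate values (arbitrary here) is one common way to tune it:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5, random_state=0)

# Try several settings of the alpha "knob" and keep the one that
# cross-validates best
search = GridSearchCV(Lasso(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print("Best alpha:", search.best_params_["alpha"])
```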
54
How do we decide when the model is performing well enough?
It depends on how the model will be used and the required accuracy for practical applications. This is often determined by stakeholder needs.
55
What is the mean absolute error?
It is the average absolute difference between predicted and actual expenditures. This metric helps quantify model performance in a comprehensible way.
56
What is a limitation of mean absolute error?
It weights every dollar of error the same regardless of the size of the miss. A $200 error is penalized only twice as much as a $100 error, which can mask occasional very large discrepancies.
57
What is mean absolute error?
The mean absolute error is the average of the absolute values of the differences between predicted and actual values. It provides a single number indicating how much the model is off, on average.
58
What is a limitation of mean absolute error?
Every dollar of error is weighted the same, so an error of $200 is penalized only twice as much as an error of $100 rather than disproportionately more.
59
What is mean squared error?
Mean squared error is the average of the squared differences between predicted and actual values; it penalizes larger discrepancies more than smaller ones.
60
How does squared error compare to absolute error?
If the error is $100, the squared error is $10,000; if the error is $200, the squared error is $40,000.
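The same arithmetic written out in Python; the expenditures are made up:

```python
import numpy as np

actual = np.array([1000, 1000, 1000])      # hypothetical actual expenditures ($)
predicted = np.array([1000, 1100, 1200])   # the model is off by $0, $100, and $200

errors = predicted - actual

# Mean absolute error: the $200 miss counts only twice as much as the $100 miss
mae = np.mean(np.abs(errors))   # (0 + 100 + 200) / 3 = 100

# Mean squared error: the $200 miss counts four times as much as the $100 miss
mse = np.mean(errors ** 2)      # (0 + 10,000 + 40,000) / 3, about 16,667

print(f"MAE = {mae:.0f}, MSE = {mse:.0f}")
```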
61
What is the Brier score?
The Brier score is the mean squared error between predicted probabilities and actual binary outcomes (0 or 1).
62
What does a lower Brier score indicate?
A lower Brier score indicates better model performance.
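A quick calculation; the predicted probabilities and outcomes are invented:

```python
import numpy as np

# Hypothetical predicted probabilities and the outcomes that actually occurred
predicted_prob = np.array([0.9, 0.2, 0.7, 0.1])
actual_outcome = np.array([1, 0, 0, 0])  # 1 = event happened, 0 = it did not

# Brier score: mean squared difference between probability and outcome
brier = np.mean((predicted_prob - actual_outcome) ** 2)
print(f"Brier score = {brier:.3f}")  # lower is better
```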
63
What does it mean for a model to be well calibrated?
A well-calibrated model's predicted probabilities behave like true probabilities.
64
How can you check the calibration of predicted probabilities?
By comparing predicted probabilities to the actual frequency of outcomes within a group of data points.
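One way to eyeball calibration, assuming you already have predicted probabilities and observed outcomes; binning into ten groups is just one convention:

```python
import numpy as np
import pandas as pd

# Simulated probabilities and outcomes drawn to match them,
# i.e. a roughly well-calibrated model
rng = np.random.default_rng(0)
predicted_prob = rng.uniform(0, 1, size=5000)
actual_outcome = rng.binomial(1, predicted_prob)

df = pd.DataFrame({"prob": predicted_prob, "outcome": actual_outcome})
df["bin"] = pd.cut(df["prob"], bins=np.linspace(0, 1, 11))

# In each bin, the mean predicted probability should be close to the
# observed event rate if the model is well calibrated
calibration = df.groupby("bin", observed=True).agg(
    mean_predicted=("prob", "mean"),
    observed_rate=("outcome", "mean"),
)
print(calibration.round(2))
```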
65
What is thresholding in the context of binary outcome models?
Thresholding involves picking a probability threshold and designating predicted probabilities above it as 'yes' and below as 'no.'
66
What common threshold do model builders often use?
Many model builders arbitrarily pick a threshold of 0.5.
67
What assumption does a 0.5 threshold make?
It assumes that false negatives and false positives are equally undesirable.
68
Why might a threshold of 0.5 be inappropriate in certain models?
Because false positives and false negatives are often not equally costly. If false negatives are worse, a threshold closer to 0 may be more suitable; if false positives are worse, a threshold closer to 1 may be.
69
Who should determine the costs of false positives and false negatives?
The person using the model should communicate the relative costs of false positives and false negatives.
70
What are the four important numbers for measuring classifier performance?
True positives, true negatives, false positives, false negatives.
71
What is sensitivity in model performance metrics?
Sensitivity, or true positive rate, is the percentage of all actual positives that the model correctly identifies as positive.
72
What is specificity in model performance metrics?
Specificity, or true negative rate, is the percentage of all actual negatives that the model correctly identifies as negative.
73
What is positive predictive value?
The probability that a predicted positive is a true positive, calculated as true positives divided by true positives plus false positives.
74
What is negative predictive value?
The probability that a predicted negative is a true negative, calculated as true negatives divided by true negatives plus false negatives.
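A compact sketch tying the last several cards together: threshold the predicted probabilities, count the four cell types, and compute the four rates (the 0.5 threshold and the simulated data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
predicted_prob = rng.uniform(0, 1, size=1000)
actual = rng.binomial(1, predicted_prob)       # simulated true outcomes

# Thresholding: call everything at or above 0.5 a "yes"
predicted = (predicted_prob >= 0.5).astype(int)

tp = np.sum((predicted == 1) & (actual == 1))  # true positives
tn = np.sum((predicted == 0) & (actual == 0))  # true negatives
fp = np.sum((predicted == 1) & (actual == 0))  # false positives
fn = np.sum((predicted == 0) & (actual == 1))  # false negatives

sensitivity = tp / (tp + fn)  # share of actual positives the model catches
specificity = tn / (tn + fp)  # share of actual negatives the model catches
ppv = tp / (tp + fp)          # predicted positives that are truly positive
npv = tn / (tn + fn)          # predicted negatives that are truly negative

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"ppv={ppv:.2f}, npv={npv:.2f}")
```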
75
What key questions should be asked when building prediction models?
Questions regarding population, outcome variable, feature selection, training, and model performance should be considered.
76
What should be considered in feature selection?
Which features are most predictive, what steps were taken to select them, how feature importance was assessed, and what feature engineering techniques were used.
77
What strategies can be implemented to avoid overfitting?
Regularization techniques, cross-validation, or early stopping.
78
What performance metrics should be used for model evaluation?
Metrics appropriate for the use case, including sensitivity, specificity, and predictive values.
79
What is the importance of model evaluation techniques?
They assess the performance of trained models and validate generalizability.