Project Questions Flashcards
What is the purpose of the “set.seed()” function in R?
It ensures that output involving randomness is reproducible: the same code with the same seed produces the same results on every run.
I.e., if the seed is omitted, running the same code leads to different results every time because the random number generator starts from a different state.
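A minimal sketch of the effect in R (printed values are illustrative, not actual output):
  runif(3)          # without a seed: different numbers on every run
  set.seed(123)
  runif(3)          # with a seed: a fixed sequence
  set.seed(123)
  runif(3)          # resetting the same seed reproduces that exact sequence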
What is the utility of the “caret” package?
It provides a unified interface for training, tuning, and evaluating machine learning models, e.g., data partitioning, cross-validation, and confusion matrices.
What is the utility of the “corrplot” package?
It visualises correlation matrices, making it easy to inspect associations between numerical variables.
What is the utility of the “tidyverse” package?
It is a collection of packages (e.g., dplyr and ggplot2) for data manipulation, transformation, and visualisation.
What is the utility of the “pROC” package?
It computes and plots ROC curves and calculates the AUC used to evaluate classifier performance.
What is the utility of the “rpart.plot” package?
It plots decision trees fitted with rpart, giving an intuitive visualisation of the splits and terminal nodes.
What is a factor?
In R, a factor is a data type for categorical variables: it stores values as a fixed set of distinct levels (categories).
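A quick illustration (values are hypothetical):
  meal <- c("BB", "HB", "BB", "FB")   # character vector
  meal <- factor(meal)                # convert to factor
  levels(meal)                        # "BB" "FB" "HB" - the finite set of categories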
What is an integer?
An integer is a whole number that does not have any fractional or decimal part.
What is a character?
A character is a data type in programming that represents individual letters, numbers, symbols, or spaces
To incorporate an interaction effect where one or both variables are categorical, dummy variables are used. Why?
In logistic regression, dummy variables provide a way to represent categorical variables numerically. Each dummy variable represents one category of the categorical variable and takes the value 0 or 1, indicating the absence or presence of that category. If the category is absent, the entire term becomes 0; if present, the model multiplies the corresponding coefficient by 1.
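A sketch of how R dummy-codes a factor inside a logistic regression with an interaction (variable and data names are illustrative, not the project's exact ones):
  # The * operator adds the main effects plus their interaction;
  # glm() automatically expands deposit_type into 0/1 dummy columns
  fit <- glm(is_cancelled ~ lead_time * deposit_type,
             family = binomial, data = bookings)
  head(model.matrix(fit))   # shows the dummy and interaction columns the model uses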
Which of the following additional use cases does BDA have in hospitality?
A) facilitates service innovation
B) insights into customer satisfaction through e.g., big data text analysis of customer reviews
C) creating client profiles and enhance customer relationship management
D) all of the above
D) All of the above
What is the goal of data transformation? It involves modifying the structure or content of a dataset.
The goal is to ensure appropriate fit between the type of data and chosen statistical method
Which data transformation steps did you perform?
1) Transformation from character to factor
2) Transformation from integer to factor
3) Combining #kids and #babies
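A sketch of these three steps with dplyr (column names assumed to match the raw dataset):
  library(dplyr)
  bookings <- bookings %>%
    mutate(
      meal            = as.factor(meal),             # character -> factor
      is_repeat_guest = as.factor(is_repeat_guest),  # integer (0/1) -> factor
      is_cancelled    = as.factor(is_cancelled),     # integer (0/1) -> factor
      n_children      = children + babies            # combine kids and babies
    )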
Why did you transform characters to factors?
Factors allow the model to interpret and utilise categorical variables efficiently by providing a finite number of options that the value can take.
Why did you transform integers to factors?
Factors allow the model to distinguish between numerical variables that represent categories and numerical variables that represent continuous quantities.
Give me an example of an integer you transformed into a factor, and explain why it made sense.
Is_repeat_guest and is_cancelled: in reality, these variables can only take one of two values, yes (1) or no (0). Thus, it made sense to transform them from integers to factors to ensure that the model interpreted them as categorical variables.
Give me an example of a character you transformed into a factor, and explain why it made sense.
Meal: there are four meal types, each denoted by two capital letters and stored as characters. We wanted the model to interpret the variable as categorical so that it could map any associations between meal type and the outcome.
As part of data cleaning, you removed some observations based on their deposit type. Elaborate on this
We removed the observations with deposit type = non-refundable: EDA showed that 99% of these bookings were cancelled, which is highly counter-intuitive.
As part of data cleaning, you removed observations with no adults recorded. Why? Is this necessarily an error?
Not necessarily an error. However, bookings with no adults recorded were expected to be perfectly associated with another booking. Thus, these bookings are not representative as stand-alone observations for the model.
Why did you keep the remaining two deposit types rather than deleting the entire variable altogether?
The observations with refundable deposit and no deposit paid behaved intuitively, and no abnormality was detected. E.g., cancellation rate is slightly higher for no deposits, which makes sense
Why did you remove observations with “undefined” meal and distribution channel? Were they missing completely at random?
This resulted in deletion of ~1000 rows, which we evaluated not to impact the model significantly (w. +100k observations remaining after all cleaning)
MCAR (missing completely at random) means the probability that a value is missing is unrelated to both the observed and the unobserved data; under that assumption, deleting the affected rows does not bias the model.
How did you seek to tackle the bias-variance trade-off?
To ensure that the right balance was struck between generalisation (bias) and overfitting (variance), we partitioned the data into a training set (60%) and a test set (40%). Then, we applied 5-fold cross-validation.
What does it mean that you applied 5-fold cross validation?
5-fold cross-validation is done within the TRAINING set, where the data is divided into 5 portions (folds). In each iteration, 4 of the folds are used for training and 1 fold is used for validation. This is repeated 5 times, so that every fold has been used for both training and validation.
Why did you choose 5 folds?
It came down to a trade-off between the degree of cross-validation and computational time, since each additional fold adds training time.
Meanwhile, given the large size of the dataset, each fold still contains a substantial amount of data, suggesting that the bias-variance trade-off was handled sufficiently.
What did you do to deal with dataset imbalance?
We applied stratified random sampling, which ensures that the distribution of 0s and 1s is similar in the training and test sets. This mitigates the risk of drawing a “lucky” or “unlucky” train/test split that would distort the predictive performance.
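A sketch of the partitioning and cross-validation setup with caret (proportions as above; object names are illustrative):
  library(caret)
  set.seed(123)
  # createDataPartition samples within each class of the outcome,
  # so the 60/40 split is stratified on is_cancelled
  train_idx <- createDataPartition(bookings$is_cancelled, p = 0.6, list = FALSE)
  train_set <- bookings[train_idx, ]
  test_set  <- bookings[-train_idx, ]
  # 5-fold cross-validation applied within the training set
  ctrl <- trainControl(method = "cv", number = 5)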
How does the LR model estimate coefficients?
By maximum likelihood estimation, where the model applies an iterative process to estimate the coefficients that maximise the likelihood of observing the actual outcomes in the data.
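In R, this is what glm() does under the hood (via iteratively reweighted least squares); a minimal sketch with illustrative variable names:
  lr_fit <- glm(is_cancelled ~ lead_time + deposit_type,
                family = binomial, data = train_set)
  summary(lr_fit)   # coefficients estimated by maximum likelihood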
Which of the following statements are TRUE about DT?
A) it operates by recursively partitioning a dataset into subsets based on the values of input features, which creates a hierarchical structure of decision nodes
B) Each intermediary node represents a decision point, wherein a given feature is evaluated
C) Each terminal leaf node reflects a final outcome or prediction
D) The recursive partitioning divides the dataset into subsets that minimise impurity measures such as entropy and the Gini index
E) all of the above
E) all of the above
Which of the following options are advantages of LR relative to DT?
A) it offers a precise interpretation of the association between each variable and the output through coefficients
B) it is more flexible in adapting to non-linear relations
C) it provides a visual, intuitive mapping of the decision process, revealing the interplay of various predictors
D) all of the above
Only A is an advantage of LR over DT:
A) it offers a precise interpretation of the association between each variable and the output through coefficients
B and C are advantages of DT over LR:
B) it is more flexible in adapting to non-linear relations
C) it provides a visual, intuitive mapping of the decision process, revealing the interplay of various predictors
Explain the process employed in your selection of variables in the model.
How did you determine the importance of numerical variables?
How did you determine the importance of categorical variables?
How did you determine the importance of different interactions?
For numerical variables, the importance of each predictor was evaluated based on its correlation with the cancellation rate.
For categorical variables, the importance of each predictor was evaluated through graphical illustrations to illuminate any meaningful associations with the cancellation rate.
For interactions, a series of heatmaps was constructed to explore associations across categorical predictors; for numerical variables, those with the highest correlations were chosen for interaction terms.
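A sketch of the numerical screening step with corrplot (assumes the outcome is still stored numerically at this stage):
  library(corrplot)
  num_vars <- bookings[, sapply(bookings, is.numeric)]   # numeric columns only
  corr_mat <- cor(num_vars, use = "complete.obs")
  corrplot(corr_mat, method = "color")   # inspect each predictor's correlation with is_cancelled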
You excluded following explanatory variables. Why?
A) room type
B) agent ID
C) hotel type
D) reservation status
E) company
F) days_waiting_list and arrival_date_day_of_month
A) room type: denoted by letters -> unable to interpret
B) agent ID: denoted by numbers –> unable to interpret
C) hotel type: to enhance generalisability beyond the two hotel types
D) reservation status: for predictive purposes on future bookings, you don’t have this observation prior to expected guest arrival
E) company: 94% were NULL observations
F) days_waiting_list and arrival_date_day_of_month: negligible correlation with cancellation
AUC is a threshold independent metric
TRUE/FALSE
TRUE
AUC provides a general evaluation of a model’s performance across its sensitivity and 1-specificity
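A sketch of how the AUC can be obtained with pROC (object names follow the earlier sketches):
  library(pROC)
  pred_prob <- predict(lr_fit, newdata = test_set, type = "response")
  roc_obj   <- roc(test_set$is_cancelled, pred_prob)
  auc(roc_obj)    # threshold-independent summary of the ROC curve
  plot(roc_obj)   # sensitivity vs. 1 - specificity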
Confusion-matrix-derived metrics are threshold-dependent
TRUE/FALSE
TRUE
Proportion of correct predictions in all records is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value
Proportion of correct predictions in all records is measured by ACCURACY
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Ability to correctly predict the positive class is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value
Ability to correctly predict the positive class is measured by SENSITIVITY (= RECALL)
Sensitivity = TP/(TP+FN)
Ability to correctly predict the negative class is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value
Ability to correctly predict the negative class is measured by SPECIFICITY
Specificity = TN/(TN+FP)
Proportion of correct predictions in those the classifier predicted as positives is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value
Proportion of correct predictions in those the classifier predicted as positives is measured by PRECISION
Precision = TP/(TP+FP)
Proportion of correct predictions in those the classifier predicted as negatives is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value
Proportion of correct predictions in those the classifier predicted as negatives is measured by NEGATIVE PREDICTIVE VALUE
NPV = TN/(TN+FN)
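In practice, all of these threshold-dependent metrics can be read from a single caret confusionMatrix() call; a sketch with illustrative object names and an illustrative 0.40 threshold:
  pred_class <- factor(ifelse(pred_prob > 0.40, 1, 0), levels = c(0, 1))
  cm <- confusionMatrix(pred_class, test_set$is_cancelled, positive = "1")
  cm$overall["Accuracy"]
  cm$byClass[c("Sensitivity", "Specificity", "Precision", "Neg Pred Value")]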
In the context of the core business objective being to improve operational performance, prioritising a slight overestimation of cancellations (with some false positives) is preferable to risking revenue loss due to underestimating cancellations (false negatives). Therefore, the primary focus in the model comparison centres on _____ vs. _____
In the context of the core business objective being to improve operational performance, prioritising a slight overestimation of cancellations (with some false positives) is preferable to risking revenue loss due to underestimating cancellations (false negatives). Therefore, the primary focus in the model comparison centres on SENSITIVITY vs. PRECISION
Hyperparameters play a crucial role in the domain of machine learning, setting themselves apart from model parameters as they operate _______ to the learning process.
Fill in the blank.
Hyperparameters play a crucial role in the domain of machine learning, setting themselves apart from model parameters as they operate EXTERNALLY to the learning process.
What role does the cost complexity parameter play in decision tree modeling, particularly in relation to overfitting?
The cost complexity parameter in decision tree modeling balances the trade-off between the complexity of the tree and its predictive performance, preventing overfitting
What is the purpose of the tune-length hyperparameter in the systematic tuning process for pruning decision trees?
The tune-length defines the number of complexity parameters to be evaluated, and in this study, a tune-length of 15 is chosen to strike a balance between performance, risk of overfitting, and computational demand.
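A sketch of how this pruning search can be set up with caret and rpart (tuneLength as described above; other settings illustrative):
  library(rpart.plot)
  set.seed(123)
  dt_fit <- train(is_cancelled ~ ., data = train_set,
                  method = "rpart",   # decision tree
                  trControl = ctrl,   # the 5-fold CV defined earlier
                  tuneLength = 15)    # evaluate 15 candidate cost complexity (cp) values
  dt_fit$bestTune                     # the cp that balances fit and complexity
  rpart.plot(dt_fit$finalModel)       # visualise the pruned tree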
Why did you set the threshold at below 0.5?
A) asymmetric costs associated with overprediction vs. underprediction
B) the study defines a preference for slight overestimation of cancellations, leading to lower threshold choice
C) to ensure that the model overestimates cancellations rather than underestimating it
D) all of the above
D) all of the above
You set different thresholds for each of the model versions. Why?
A) to ensure an unbiased evaluation of predictive performance across models that allows for comparability
B) each model possesses unique characteristics that impact its performance at various threshold levels
C) it ensures a more impartial performance assessment of each model rather than setting an “unfair” threshold for some of the models
D) all of the above
D) all of the above
How did you set the individual thresholds?
An iterative process. Every model starts with a threshold of 0.25. The threshold is then gradually adjusted upward in two-decimal increments to the value that maximises the accuracy of the model, under the condition that it must predict more false positives than false negatives.
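A sketch of that iterative search in R (illustrative; the actual project may have stepped through thresholds manually):
  thresholds <- seq(0.25, 0.50, by = 0.01)
  acc <- sapply(thresholds, function(t) {
    pred <- factor(ifelse(pred_prob > t, 1, 0), levels = c(0, 1))
    cm   <- confusionMatrix(pred, test_set$is_cancelled, positive = "1")
    fp   <- cm$table["1", "0"]   # false positives
    fn   <- cm$table["0", "1"]   # false negatives
    if (fp > fn) cm$overall["Accuracy"] else NA   # keep only thresholds with FP > FN
  })
  best_threshold <- thresholds[which.max(acc)]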
A Sensitivity of 0.59 in Model 3 indicates that the model correctly predicts 59% of the actual positives, whereas a Precision of 0.57 indicates that 57% of the positive predictions are actually positive.
TRUE/FALSE
TRUE
What does a no information rate of 0.7151 mean?
The NIR tells us the accuracy (predictive performance) of a naive benchmark.
The naive benchmark always predicts the majority class of the outcome.
I.e., an NIR of 0.7151 simply means that about 29% of the dataset has a cancelled outcome (1). By predicting “not cancelled” every time, the naive model will be right about 71% of the time.
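The NIR can be checked directly from the class distribution; a quick sketch:
  prop.table(table(test_set$is_cancelled))        # e.g. 0: 0.7151, 1: 0.2849
  max(prop.table(table(test_set$is_cancelled)))   # = the no information rate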
Which of the DT models was optimal? Why?
DT 4 was rendered the most optimal DT model.
DT 3 and DT 4 are the only models whose accuracy exceeds the NIR. DT 3 had higher precision, but DT 4 had higher sensitivity, so DT 4 is best given the business goal (preference for overprediction at the expense of more false positives).
Which of the LR models was optimal? Why?
LR 3 was rendered the most optimal LR model.
Both sensitivity and precision increase from LR 1-3.
Despite Precision being highest for model 4, sensitivity is highest for LR 3 –> LR 3 best based on business goal (preference for overprediction at the expense of higher false positives)
Best DT model (4) outperforms the best LR model (3) across all evaluated metrics.
This leads to the conclusion that this DT emerges as the better choice based
on predictive performance - both sensitivity and precision are higher for DT 4 than for LR 3.
TRUE/FALSE
TRUE
Which advantages does LR have over DT wrt. practical adoption?
1) Lower computational demand and faster update
2) Easier to interpret with coefficients indicating the association to predictive features - easier to communicate
LR provides a percentage likelihood for a particular booking being cancelled (probability) - do DTs give the same output?
No - a decision tree tells you whether a given booking is most likely to be cancelled or not, not the exact probability. It does tell you, however, that based on the training set, x% of the data was captured in a given leaf node, and a fraction 0.Y of those observations turned out to be cancelled/not cancelled.
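A sketch of the difference in outputs (object names follow the earlier sketches):
  # Logistic regression: a predicted probability per booking
  predict(lr_fit, newdata = test_set, type = "response")[1:3]
  # Decision tree (via caret): a class label, or the leaf-level class proportions from training
  predict(dt_fit, newdata = test_set, type = "raw")[1:3]
  predict(dt_fit, newdata = test_set, type = "prob")[1:3, ]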
What is the business objective of the study?
Based on booking cancellation prediction, hotels can leverage the insight to increase operational performance through more informed revenue management initiatives (overbooking) and minimise cost base (staffing)