Project Questions Flashcards
What is the purpose of the “set.seed()” function in R?
It ensures that output involving randomness is reproducible: the same code with the same seed produces the same results on every run.
I.e., if the seed is omitted, running the same code leads to different results every time because the random number generator starts from a different state.
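A minimal sketch of the effect in R (printed values are illustrative, not actual output):
  runif(3)          # without a seed: different numbers on every run
  set.seed(123)
  runif(3)          # with a seed: a fixed sequence
  set.seed(123)
  runif(3)          # resetting the same seed reproduces that exact sequence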
What is the utility of the “caret” package?
It provides a unified interface for training, tuning, and evaluating machine learning models, e.g., data partitioning, cross-validation, and confusion matrices.
What is the utility of the “corrplot” package?
It visualises correlation matrices, making it easy to inspect associations between numerical variables.
What is the utility of the “tidyverse” package?
It is a collection of packages (e.g., dplyr and ggplot2) for data manipulation, transformation, and visualisation.
What is the utility of the “pROC” package?
It computes and plots ROC curves and calculates the AUC used to evaluate classifier performance.
What is the utility of the “rpart.plot” package?
It plots decision trees fitted with rpart, giving an intuitive visualisation of the splits and terminal nodes.
What is a factor?
In R, a factor is a data type for categorical variables: it stores values as a fixed set of distinct levels (categories).
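A quick illustration (values are hypothetical):
  meal <- c("BB", "HB", "BB", "FB")   # character vector
  meal <- factor(meal)                # convert to factor
  levels(meal)                        # "BB" "FB" "HB" - the finite set of categories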
What is an integer?
An integer is a whole number that does not have any fractional or decimal part.
What is a character?
A character is a data type in programming that represents individual letters, numbers, symbols, or spaces
To incorporate an interaction effect where one or both variables are categorical, dummy variables are used. Why?
In logistic regression, dummy variables provide a way to represent categorical variables numerically. Each dummy variable represents one category of the categorical variable and takes the value 0 or 1, indicating the absence or presence of that category. If the category is absent, the entire term becomes 0; if present, the model multiplies the corresponding coefficient by 1.
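A sketch of how R dummy-codes a factor inside a logistic regression with an interaction (variable and data names are illustrative, not the project's exact ones):
  # The * operator adds the main effects plus their interaction;
  # glm() automatically expands deposit_type into 0/1 dummy columns
  fit <- glm(is_cancelled ~ lead_time * deposit_type,
             family = binomial, data = bookings)
  head(model.matrix(fit))   # shows the dummy and interaction columns the model uses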
Which of the following additional use cases does BDA have in hospitality?
A) facilitates service innovation
B) insights into customer satisfaction through e.g., big data text analysis of customer reviews
C) creating client profiles and enhance customer relationship management
D) all of the above
D) All of the above
What is the goal of data transformation? It involves modifying the structure or content of a dataset.
The goal is to ensure appropriate fit between the type of data and chosen statistical method
Which data transformation steps did you perform?
1) Transformation from character to factor
2) Transformation from integer to factor
3) Combining #kids and #babies
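A sketch of these three steps with dplyr (column names assumed to match the raw dataset):
  library(dplyr)
  bookings <- bookings %>%
    mutate(
      meal            = as.factor(meal),             # character -> factor
      is_repeat_guest = as.factor(is_repeat_guest),  # integer (0/1) -> factor
      is_cancelled    = as.factor(is_cancelled),     # integer (0/1) -> factor
      n_children      = children + babies            # combine kids and babies
    )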
Why did you transform characters to factors?
Factors allow the model to interpret and utilise categorical variables efficiently by providing a finite number of options that the value can take.
Why did you transform integers to factors?
Factors allow the model to distinguish between numerical variables that represent categories and numerical variables that represent continuous quantities.
Give me an example of an integer you transformed into a factor, and explain why it made sense.
Is_repeat_guest and is_cancelled: in reality, these variables can only take one of two values, yes (1) or no (0). Thus, it made sense to transform them from integers to factors to ensure that the model interpreted them as categorical variables.
Give me an example of a character you transformed into a factor, and explain why it made sense.
Meal: there are four meal types, each denoted by two capital letters and stored as characters. We wanted the model to interpret the variable as categorical so that it could map any associations between meal type and the outcome.
As part of data cleaning, you removed some observations based on their deposit type. Elaborate on this
We removed the observations with deposit type = non-refundable: EDA showed that 99% of these bookings were cancelled, which is highly counter-intuitive.
As part of data cleaning, you removed observations with no adults recorded. Why? Is this necessarily an error?
Not necessarily an error. However, bookings with no adults recorded were expected to be perfectly associated with another booking. Thus, these bookings are not representative as stand-alone observations for the model.
Why did you keep the remaining two deposit types rather than deleting the entire variable altogether?
The observations with refundable deposit and no deposit paid behaved intuitively, and no abnormality was detected. E.g., cancellation rate is slightly higher for no deposits, which makes sense
Why did you remove observations with “undefined” meal and distribution channel? Were they missing completely at random?
This resulted in deletion of ~1000 rows, which we evaluated not to impact the model significantly (w. +100k observations remaining after all cleaning)
MCAR (missing completely at random) means the probability that a value is missing is unrelated to both the observed and the unobserved data; under that assumption, deleting the affected rows does not bias the model.
How did you seek to tackle the bias-variance trade-off?
To ensure that the right balance was struck between generalisation (bias) and overfitting (variance), we partitioned the data into a training set (60%) and a test set (40%). Then, we applied 5-fold cross-validation.
What does it mean that you applied 5-fold cross validation?
5-fold cross-validation is done within the TRAINING set, where the data is divided into 5 portions (folds). In each iteration, 4 of the folds are used for training and 1 fold is used for validation. This is repeated 5 times, so that every fold has been used for both training and validation.
Why did you choose 5 folds?
It came down to a trade-off between the degree of cross-validation and computational time, since each additional fold adds training time.
Meanwhile, given the large size of the dataset, each fold still contains a substantial amount of data, suggesting that the bias-variance trade-off was handled sufficiently.
What did you do to deal with dataset imbalance?
We applied stratified random sampling, which ensures that the distribution of 0s and 1s is similar in the training and test sets. This mitigates the risk of drawing a “lucky” or “unlucky” train/test split that would distort the predictive performance.
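A sketch of the partitioning and cross-validation setup with caret (proportions as above; object names are illustrative):
  library(caret)
  set.seed(123)
  # createDataPartition samples within each class of the outcome,
  # so the 60/40 split is stratified on is_cancelled
  train_idx <- createDataPartition(bookings$is_cancelled, p = 0.6, list = FALSE)
  train_set <- bookings[train_idx, ]
  test_set  <- bookings[-train_idx, ]
  # 5-fold cross-validation applied within the training set
  ctrl <- trainControl(method = "cv", number = 5)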
How does the LR model estimate coefficients?
By maximum likelihood estimation, where the model applies an iterative process to estimate the coefficients that maximise the likelihood of observing the actual outcomes in the data.
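In R, this is what glm() does under the hood (via iteratively reweighted least squares); a minimal sketch with illustrative variable names:
  lr_fit <- glm(is_cancelled ~ lead_time + deposit_type,
                family = binomial, data = train_set)
  summary(lr_fit)   # coefficients estimated by maximum likelihood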
Which of the following statements are TRUE about DT?
A) it operates by recursively partitioning a dataset into subsets based on the values of input features, which creates a hierarchical structure of decision nodes
B) Each intermediary node represents a decision point, wherein a given feature is evaluated
C) Each terminal leaf node reflects a final outcome or prediction
D) The recursive partitioning divides the dataset into subsets that minimise impurity measures such as entropy and the Gini index
E) all of the above
E) all of the above
Which of the following options are advantages of LR relative to DT?
A) it offers a precise interpretation of the association between each variable and the output through coefficients
B) it is more flexible in adapting to non-linear relations
C) it provides a visual, intuitive mapping of the decision process, revealing the interplay of various predictors
D) all of the above
Only A is an advantage of LR over DT:
A) it offers a precise interpretation of the association between each variable and the output through coefficients
B and C are advantages of DT over LR:
B) it is more flexible in adapting to non-linear relations
C) it provides a visual, intuitive mapping of the decision process, revealing the interplay of various predictors
Explain the process employed in your selection of variables in the model.
How did you determine the importance of numerical variables?
How did you determine the importance of categorical variables?
How did you determine the importance of different interactions?
For numerical variables, the importance of each predictor was evaluated based on its correlation with the cancellation rate.
For categorical variables, the importance of each predictor was evaluated through graphical illustrations to illuminate any meaningful associations with the cancellation rate.
For interactions, a series of heatmaps was constructed to explore associations across categorical predictors; for numerical variables, those with the highest correlations were chosen for interaction terms.
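A sketch of the numerical screening step with corrplot (assumes the outcome is still stored numerically at this stage):
  library(corrplot)
  num_vars <- bookings[, sapply(bookings, is.numeric)]   # numeric columns only
  corr_mat <- cor(num_vars, use = "complete.obs")
  corrplot(corr_mat, method = "color")   # inspect each predictor's correlation with is_cancelled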
You excluded following explanatory variables. Why?
A) room type
B) agent ID
C) hotel type
D) reservation status
E) company
F) days_waiting_list and arrival_date_day_of_month
A) room type: denoted by letters -> unable to interpret
B) agent ID: denoted by numbers –> unable to interpret
C) hotel type: to enhance generalisability beyond the two hotel types
D) reservation status: for predictive purposes on future bookings, you don’t have this observation prior to expected guest arrival
E) company: 94% were NULL observations
F) days_waiting_list and arrival_date_day_of_month: negligible correlation with cancellation
AUC is a threshold independent metric
TRUE/FALSE
TRUE
AUC provides a general evaluation of a model’s performance across its sensitivity and 1-specificity
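A sketch of how the AUC can be obtained with pROC (object names follow the earlier sketches):
  library(pROC)
  pred_prob <- predict(lr_fit, newdata = test_set, type = "response")
  roc_obj   <- roc(test_set$is_cancelled, pred_prob)
  auc(roc_obj)    # threshold-independent summary of the ROC curve
  plot(roc_obj)   # sensitivity vs. 1 - specificity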
Confusion-matrix-derived metrics are threshold-dependent
TRUE/FALSE
TRUE
Proportion of correct predictions in all records is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value
Proportion of correct predictions in all records is measured by ACCURACY
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Ability to correctly predict the positive class is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value
Ability to correctly predict the positive class is measured by SENSITIVITY (= RECALL)
Sensitivity = TP/(TP+FN)
Ability to correctly predict the negative class is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value
Ability to correctly predict the negative class is measured by SPECIFICITY
Specificity = TN/(TN+FP)
Proportion of correct predictions in those the classifier predicted as positives is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value
Proportion of correct predictions in those the classifier predicted as positives is measured by PRECISION
Precision = TP/(TP+FP)
Proportion of correct predictions in those the classifier predicted as negatives is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value
Proportion of correct predictions in those the classifier predicted as negatives is measured by NEGATIVE PREDICTIVE VALUE
NPV = TN/(TN+FN)
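In practice, all of these threshold-dependent metrics can be read from a single caret confusionMatrix() call; a sketch with illustrative object names and an illustrative 0.40 threshold:
  pred_class <- factor(ifelse(pred_prob > 0.40, 1, 0), levels = c(0, 1))
  cm <- confusionMatrix(pred_class, test_set$is_cancelled, positive = "1")
  cm$overall["Accuracy"]
  cm$byClass[c("Sensitivity", "Specificity", "Precision", "Neg Pred Value")]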
In the context of the core business objective being to improve operational performance, prioritising a slight overestimation of cancellations (with some false positives) is preferable to risking revenue loss due to underestimating cancellations (false negatives). Therefore, the primary focus in the model comparison centres on _____ vs. _____
In the context of the core business objective being to improve operational performance, prioritising a slight overestimation of cancellations (with some false positives) is preferable to risking revenue loss due to underestimating cancellations (false negatives). Therefore, the primary focus in the model comparison centres on SENSITIVITY vs. PRECISION
Hyperparameters play a crucial role in the domain of machine learning, setting themselves apart from model parameters as they operate _______ to the learning process.
Fill in the blank.
Hyperparameters play a crucial role in the domain of machine learning, setting themselves apart from model parameters as they operate EXTERNALLY to the learning process.
What role does the cost complexity parameter play in decision tree modeling, particularly in relation to overfitting?
The cost complexity parameter in decision tree modeling balances the trade-off between the complexity of the tree and its predictive performance, preventing overfitting
What is the purpose of the tune-length hyperparameter in the systematic tuning process for pruning decision trees?
The tune-length defines the number of complexity parameters to be evaluated, and in this study, a tune-length of 15 is chosen to strike a balance between performance, risk of overfitting, and computational demand.
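A sketch of how this pruning search can be set up with caret and rpart (tuneLength as described above; other settings illustrative):
  library(rpart.plot)
  set.seed(123)
  dt_fit <- train(is_cancelled ~ ., data = train_set,
                  method = "rpart",   # decision tree
                  trControl = ctrl,   # the 5-fold CV defined earlier
                  tuneLength = 15)    # evaluate 15 candidate cost complexity (cp) values
  dt_fit$bestTune                     # the cp that balances fit and complexity
  rpart.plot(dt_fit$finalModel)       # visualise the pruned tree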
Why did you set the threshold at below 0.5?
A) asymmetric costs associated with overprediction vs. underprediction
B) the study defines a preference for slight overestimation of cancellations, leading to lower threshold choice
C) to ensure that the model overestimates cancellations rather than underestimating it
D) all of the above
D) all of the above
You set different thresholds for each of the model versions. Why?
A) to ensure an unbiased evaluation of predictive performance across models that allows for comparability
B) each model possesses unique characteristics that impact its performance at various threshold levels
C) it ensures a more impartial performance assessment of each model rather than setting an “unfair” threshold for some of the models
D) all of the above
D) all of the above
How did you set the individual thresholds?
An iterative process. Every model starts with a threshold of 0.25. The threshold is then gradually adjusted upward in two-decimal increments to the value that maximises the accuracy of the model, under the condition that it must predict more false positives than false negatives.
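A sketch of that iterative search in R (illustrative; the actual project may have stepped through thresholds manually):
  thresholds <- seq(0.25, 0.50, by = 0.01)
  acc <- sapply(thresholds, function(t) {
    pred <- factor(ifelse(pred_prob > t, 1, 0), levels = c(0, 1))
    cm   <- confusionMatrix(pred, test_set$is_cancelled, positive = "1")
    fp   <- cm$table["1", "0"]   # false positives
    fn   <- cm$table["0", "1"]   # false negatives
    if (fp > fn) cm$overall["Accuracy"] else NA   # keep only thresholds with FP > FN
  })
  best_threshold <- thresholds[which.max(acc)]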
A Sensitivity of 0.59 in Model 3 indicates that the model correctly predicts 59% of the actual positives, whereas a Precision of 0.57 indicates that 57% of the positive predictions are actually positive.
TRUE/FALSE
TRUE
What does a no information rate of 0.7151 mean?
The NIR tells us the accuracy (predictive performance) of a naive benchmark.
The naive benchmark always predicts the majority class of the outcome.
I.e., an NIR of 0.7151 simply means that about 29% of the dataset has a cancelled outcome (1). By predicting “not cancelled” every time, the naive model will be right about 71% of the time.
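The NIR can be checked directly from the class distribution; a quick sketch:
  prop.table(table(test_set$is_cancelled))        # e.g. 0: 0.7151, 1: 0.2849
  max(prop.table(table(test_set$is_cancelled)))   # = the no information rate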
Which of the DT models was optimal? Why?
DT 4 was rendered the most optimal DT model.
DT 3 and DT 4 are the only models whose accuracy exceeds the NIR. DT 3 had higher precision, but DT 4 had higher sensitivity, so DT 4 is best given the business goal (preference for overprediction at the expense of more false positives).
Which of the LR models was optimal? Why?
LR 3 was rendered the most optimal LR model.
Both sensitivity and precision increase from LR 1-3.
Despite Precision being highest for model 4, sensitivity is highest for LR 3 –> LR 3 best based on business goal (preference for overprediction at the expense of higher false positives)
Best DT model (4) outperforms the best LR model (3) across all evaluated metrics.
This leads to the conclusion that this DT emerges as the better choice based
on predictive performance - both sensitivity and precision are higher for DT 4 than for LR 3.
TRUE/FALSE
TRUE
Which advantages does LR have over DT wrt. practical adoption?
1) Lower computational demand and faster update
2) Easier to interpret with coefficients indicating the association to predictive features - easier to communicate
LR provides a percentage likelihood for a particular booking being cancelled (probability) - do DTs give the same output?
No - a decision tree tells you whether a given booking is most likely to be cancelled or not, not the exact probability. It does tell you, however, that based on the training set, x% of the data was captured in a given leaf node, and a fraction 0.Y of those observations turned out to be cancelled/not cancelled.
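A sketch of the difference in outputs (object names follow the earlier sketches):
  # Logistic regression: a predicted probability per booking
  predict(lr_fit, newdata = test_set, type = "response")[1:3]
  # Decision tree (via caret): a class label, or the leaf-level class proportions from training
  predict(dt_fit, newdata = test_set, type = "raw")[1:3]
  predict(dt_fit, newdata = test_set, type = "prob")[1:3, ]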
What is the business objective of the study?
Based on booking cancellation prediction, hotels can leverage the insight to increase operational performance through more informed revenue management initiatives (overbooking) and minimise cost base (staffing)