Project Questions Flashcards
What is the purpose of the “set.seed()” function in R?
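It fixes the starting state of R's pseudo-random number generator, so code that relies on randomness produces the same output on every run given the same inputs.
I.e., if the seed is omitted, running the same code will lead to different results every time, because the generator starts from a different point of departure.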
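A minimal sketch of this behaviour in base R. The seed value 123 and the variable names are arbitrary choices for illustration, not taken from the project.

```r
# Same seed, same draws: the generator restarts from the same state
set.seed(123)
first_run <- sample(1:100, 5)

set.seed(123)
second_run <- sample(1:100, 5)

identical(first_run, second_run)  # TRUE

# Without resetting the seed, the generator continues from its current
# state, so this draw is not tied to the two runs above
third_run <- sample(1:100, 5)
```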
What is the utility of the “caret” package?
What is the utility of the “corrplot” package?
What is the utility of the “tidyverse” package?
What is the utility of the “pROC” package?
What is the utility of the “rpart.plot” package?
What is a factor?
In R, a factor is a data type for categorical variables: each value is stored as one of a fixed set of distinct levels (categories).
What is an integer?
An integer is a whole number that does not have any fractional or decimal part.
What is a character?
A character is a data type that represents text: letters, digits, symbols, or spaces. In R, a character value is a string of any length.
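The three data types above can be told apart with class(). This is a small sketch in base R; the variable names and values are hypothetical examples, not taken from the project data.

```r
# A factor: categorical values restricted to a fixed set of levels
meal <- factor(c("BB", "HB", "BB"))
class(meal)     # "factor"
levels(meal)    # "BB" "HB"

# An integer: a whole number (the L suffix marks an integer literal in R)
nights <- 3L
class(nights)   # "integer"

# A character: free text, with no predefined set of allowed values
note <- "late arrival"
class(note)     # "character"
```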
Dummy variables are used to incorporate an interaction effect when one or both variables are categorical. Why?
In logistic regression, dummy variables provide a way to represent categorical variables numerically. Each dummy variable represents one category of the categorical variable and takes the value 0 or 1, indicating the absence or presence of that category. If the category is absent, the entire interaction term becomes 0. If it is present, the corresponding coefficient is multiplied by 1, so the term contributes its full effect.
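To make the 0/1 mechanics concrete, here is a hedged sketch using base R's model.matrix(), which builds the design matrix that model-fitting functions such as glm() construct from a formula. The data frame, the column names lead_time and deposit, and the values are hypothetical.

```r
# Toy data: one numeric predictor and one two-level categorical predictor
bookings <- data.frame(
  lead_time = c(10, 45, 3, 120),
  deposit   = factor(c("none", "refundable", "none", "refundable"))
)

# lead_time * deposit expands to both main effects plus their interaction
design <- model.matrix(~ lead_time * deposit, data = bookings)

colnames(design)
# "(Intercept)" "lead_time" "depositrefundable" "lead_time:depositrefundable"

# The interaction column is lead_time multiplied by the 0/1 dummy:
# 0 for the reference level "none", lead_time itself for "refundable"
design[, "lead_time:depositrefundable"]
```

The reference level ("none", first alphabetically) gets no dummy column of its own; its effect is absorbed into the intercept and the lead_time main effect.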
Which of the following additional use cases does BDA have in hospitality?
A) facilitating service innovation
B) providing insights into customer satisfaction, e.g., through big data text analysis of customer reviews
C) creating client profiles and enhancing customer relationship management
D) all of the above
D) All of the above
Data transformation involves modifying the structure or content of a dataset. What is its goal?
The goal is to ensure an appropriate fit between the type of data and the chosen statistical method.
Which data transformation steps did you perform?
1) Transformation from character to factor
2) Transformation from integer to factor
3) Combining #kids and #babies
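The three steps above could look roughly like this in base R. This is a sketch under stated assumptions: the toy data, the column names (meal, is_repeat_guest, kids, babies, kids_and_babies) and the choice to combine the two counts by summing them are illustrative guesses, not the project's actual code.

```r
# Toy data standing in for the bookings dataset (hypothetical values)
bookings <- data.frame(
  meal            = c("BB", "HB", "SC", "BB"),
  is_repeat_guest = c(0L, 1L, 0L, 0L),
  kids            = c(0L, 2L, 1L, 0L),
  babies          = c(0L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# 1) Character to factor
bookings$meal <- factor(bookings$meal)

# 2) Integer to factor, with readable labels for the two categories
bookings$is_repeat_guest <- factor(
  bookings$is_repeat_guest,
  levels = c(0, 1),
  labels = c("no", "yes")
)

# 3) Combine the two child counts into one variable (assumed here: a sum)
bookings$kids_and_babies <- bookings$kids + bookings$babies

str(bookings)
```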
Why did you transform characters to factors?
Factors allow the model to interpret and utilise categorical variables efficiently by providing a finite number of options that the value can take.
Why did you transform integers to factors?
Factors allow the model to distinguish between numerical variables that represent categories and numerical variables that represent continuous quantities.
Give me an example of an integer you transformed into a factor, and explain why it made sense
Is_repeat_guest and is_cancelled: in reality, these variables can take one of two values: yes (1) or no (0). Thus, it made sense to transform them from integers to factors to ensure that the model interpreted them as categorical variables.
Give me an example of a character you transformed into a factor, and explain why it made sense
Meal: there are 4 meal types, each denoted by two capital letters and stored as characters. We wanted the model to interpret the variable as categorical so that it could map any association between meal type and the outcome.
As part of data cleaning, you removed some observations based on their deposit type. Elaborate on this
We removed the observations with deposit type = non-refundable: EDA showed that 99% of these bookings were cancelled, which is highly counter-intuitive.
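A hedged sketch of how such a check and removal might be done in base R. The data are a made-up toy sample (so the rates do not reproduce the 99% figure), and the column names and the label "Non Refund" are assumptions about how the dataset encodes deposit types.

```r
# Toy data (hypothetical): deposit type and cancellation flag per booking
bookings <- data.frame(
  deposit_type = c("No Deposit", "Non Refund", "Non Refund",
                   "Refundable", "No Deposit"),
  is_cancelled = c(0, 1, 1, 0, 1)
)

# EDA step: share of cancelled bookings within each deposit type
cancel_rate <- tapply(bookings$is_cancelled, bookings$deposit_type, mean)
cancel_rate

# Cleaning step: drop the non-refundable bookings, keep the other two types
cleaned <- bookings[bookings$deposit_type != "Non Refund", ]
```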
As part of data cleaning, you removed observations with no adults recorded. Why? Is this necessarily an error?
Not necessarily an error. However, rooms with no adults were expected to be perfectly associated with another booking. Thus, these rooms are not representative as standalone observations to be considered in the model.
Why did you keep the remaining two deposit types rather than deleting the entire variable altogether?
The observations with a refundable deposit and with no deposit paid behaved intuitively, and no abnormality was detected. E.g., the cancellation rate is slightly higher for bookings with no deposit, which makes sense.