L:2 Data Handling Flashcards
What does CRISP-DM stand for?
Cross-Industry Standard Process for Data Mining
Which of the following statements are TRUE about CRISP-DM?
A) It is a framework that can help structure the approach for data analytics projects
B) It is useful for turning vague business questions into explicit analytical tasks
C) The model includes a data mapping stage
D) The model includes a deployment stage
E) The model includes a data preparation stage
A) It is a framework that can help structure the approach for data analytics projects
B) It is useful for turning vague business questions into explicit analytical tasks
D) The model includes a deployment stage
E) The model includes a data preparation stage
What are the 6 stages of CRISP-DM?
1) Business understanding
2) Data understanding
3) Data preparation
4) Modeling
5) Evaluation
6) Deployment
Business understanding (stage 1) is about turning vague, verbally stated business objectives into explicit, quantitative data analytics tasks
TRUE/FALSE
TRUE
Data understanding (stage 2) concerns identifying and inspecting the data in order to understand its limitations, missingness, need for transformation, etc.
TRUE/FALSE
TRUE
Data preparation (stage 3) entails formatting, cleaning, transforming, and combining data to enable the intended analysis.
TRUE/FALSE
TRUE
Modelling (stage 4) entails applying analytical techniques to the data in order to identify the appropriate technique for the given business problem, and includes tuning considerations
TRUE/FALSE
TRUE
_______ involves adjusting hyperparameters to optimise performance. The process can also include feature selection, which involves adjusting the variables or features used in the model to improve its predictive capabilities
a) model tuning
b) iterative modelling
c) tuning parameters
d) modelling parameters
a) model tuning
A typical model tuning process involves iteratively running several models with default parameters, fine-tuning them, and running them again.
A single run of a single model or parameterisation will not sufficiently address the use case
TRUE/FALSE
TRUE
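The iterative run, adjust, re-run loop described above can be sketched as a small grid search. This is only a minimal illustration under stated assumptions: the toy data, the nearest-neighbour model, and the names `knn_predict`, `mean_squared_error`, and `tune_k` are hypothetical and not part of the course material.

```python
# Hypothetical toy data: (x, y) pairs where y is roughly 2 * x.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 12.0)]
valid = [(1.5, 3.0), (3.5, 7.0), (5.5, 11.0)]


def knn_predict(x, data, k):
    """Predict y as the mean y of the k points in `data` nearest to x."""
    nearest = sorted(data, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(eval_data, train_data, k):
    """Average squared error on `eval_data` for a model built on `train_data`."""
    errors = [(knn_predict(x, train_data, k) - y) ** 2 for x, y in eval_data]
    return sum(errors) / len(errors)


def tune_k(candidates, train_data, valid_data):
    """Run the model once per candidate k and keep the lowest validation error."""
    scores = {k: mean_squared_error(valid_data, train_data, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# k is the hyperparameter being tuned; each candidate is one model run.
best_k, scores = tune_k([1, 2, 3, 4], train, valid)
```

Here `best_k` is the candidate with the lowest error on the validation pairs. In practice the candidate grid would typically be refined and the loop repeated.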
Evaluation (stage 5) involves assessing how well the model performs, e.g., in terms of predictive capability
In this assessment, the generalisability of the model is evaluated directly, based on the model's performance on new data
TRUE/FALSE
TRUE
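One common way to check performance on new data is a hold-out split: fit on one part of the data and measure error on the part the model never saw. The sketch below assumes synthetic data and a simple least-squares line; `fit_line` and `mse` are hypothetical helper names chosen for illustration.

```python
import random


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(points, slope, intercept):
    """Mean squared error of the fitted line on `points`."""
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)


# Synthetic data (an assumption for this sketch): y = 3x + 1 plus noise.
rng = random.Random(0)
data = [(x, 3 * x + 1 + rng.gauss(0, 1)) for x in range(20)]
rng.shuffle(data)

# Hold out 5 of the 20 observations as "new" data.
train, test = data[:15], data[15:]
slope, intercept = fit_line(train)

train_error = mse(train, slope, intercept)  # error on data the model saw
test_error = mse(test, slope, intercept)    # error on unseen data
```

A `test_error` far above `train_error` would suggest the model does not generalise well.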
After model evaluation, which step is crucial before moving on to the deployment stage?
It is crucial to return to the business goals documented in stage 1 (Business understanding) and reflect on whether the results address them comprehensively
Deployment (stage 6) entails putting the model into practice to produce value, and considering questions such as how the model can be integrated into existing business operations.
TRUE/FALSE
TRUE
Data that is missing completely at random (MCAR) should not be immediately removed from the data
TRUE/FALSE
FALSE
If data is MCAR, the usual approach is simply to delete the observations with missing values. This reduces the sample size but does not bias the results
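Deleting observations with missing values (listwise deletion) can be sketched as follows. The records and the helper name `drop_incomplete` are hypothetical, and `None` stands in for NA.

```python
# Hypothetical records; None marks a missing value (NA).
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]


def drop_incomplete(records):
    """Keep only the records in which every field is observed."""
    return [r for r in records if all(v is not None for v in r.values())]


complete_rows = drop_incomplete(rows)
```

Only the first and last records survive, which shows the cost of deletion: the sample shrinks from four observations to two.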
If data is NOT MCAR, which of the following statements are FALSE?
A) These observations should be deleted
B) Observations where data is not MCAR should not be deleted
C) An alternative to deletion is to replace every NA with the mean of the observed values in the column
FALSE: A) These observations should be deleted
Observations can be deleted when the data is MCAR. When data is not MCAR, deleting them can introduce bias, so you should keep them, for example by replacing each NA with zero or with the column mean
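Replacing NA values with the column mean can be sketched as below. The helper name `impute_mean` and the example values are hypothetical, and `None` again stands in for NA.

```python
def impute_mean(values):
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical column with two missing entries.
incomes = [52000, None, 61000, 48000, None]
imputed = impute_mean(incomes)
```

All five observations are kept. Note that filling every gap with the same value reduces the variance of the column, so this is a simple option rather than a universally safe one.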