CRISP-DM Flashcards
What are the stages of CRISP-DM?
Project Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
What happens in the Project Understanding Phase?
Problem formulation
Mapping problem to a data mining task
Understanding resources
Define Problem Formulation
Defining:
-objectives/deliverables
-criteria for success
-potential benefits/risks
-constraints/assumptions
for a data analysis project.
Problem Formulation stage - time used vs importance
Time: 20%
Importance: 80%
What does ‘mapping problem to data mining task’ mean?
Part of the Project Understanding Phase.
Choosing which data analysis method to use, e.g. classification, regression, deviation analysis, or association analysis.
What can be done to reduce communication problems in the Project Understanding phase?
For example:
Cognitive mapping
Rephrasing
What are cognitive maps?
A visual representation of a problem that supports domain understanding.
Nodes are domain variables, arrows show direction and type of influence between nodes.
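As an illustration, a cognitive map can be sketched as a signed directed graph. The domain variables below (price, advertising, demand, revenue) and the helper names are hypothetical and not taken from the cards; +1 stands for "increases" and -1 for "decreases".

```python
# Hypothetical example domain; none of these variables come from the flashcards.
# Each arrow is (source, target, sign): +1 = increases, -1 = decreases.
edges = [
    ("price", "demand", -1),
    ("advertising", "demand", +1),
    ("demand", "revenue", +1),
]

def build_map(edges):
    """Return an adjacency dict: node -> list of (target, sign)."""
    graph = {}
    for source, target, sign in edges:
        graph.setdefault(source, []).append((target, sign))
        graph.setdefault(target, [])  # make sure sink nodes appear as keys too
    return graph

def path_influence(graph, path):
    """Multiply the signs along a path of nodes; raise if an arrow is missing."""
    total = 1
    for source, target in zip(path, path[1:]):
        signs = dict(graph[source])
        if target not in signs:
            raise ValueError(f"no arrow from {source} to {target}")
        total *= signs[target]
    return total

cognitive_map = build_map(edges)
```

Calling path_influence(cognitive_map, ["price", "demand", "revenue"]) multiplies the signs along the chain, which is one simple way to read off an indirect influence from the map.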
What happens in the Data Understanding Phase?
Collecting the data
Using exploratory data analysis to get familiar with the data and gain insights.
Evaluating data quality
Selecting interesting subsets that may contain patterns.
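A first pass at exploring the data and evaluating its quality might look like the sketch below. The records, attribute names, and the summarize helper are made up for illustration, and None is assumed to mark a missing value.

```python
import statistics

# Hypothetical records; None is assumed to mark a missing value.
records = [
    {"age": 34, "income": 52000.0, "city": "Oslo"},
    {"age": 41, "income": None, "city": "Bergen"},
    {"age": None, "income": 48000.0, "city": "Oslo"},
    {"age": 29, "income": 61000.0, "city": None},
]

def summarize(records):
    """Per attribute: missing count, plus min/max/mean for numeric values
    or the number of distinct values otherwise."""
    summary = {}
    for name in records[0]:
        values = [r[name] for r in records]
        present = [v for v in values if v is not None]
        info = {"missing": len(values) - len(present)}
        if present and all(isinstance(v, (int, float)) for v in present):
            info.update(min=min(present), max=max(present),
                        mean=statistics.mean(present))
        else:
            info["distinct"] = len(set(present))
        summary[name] = info
    return summary

report = summarize(records)
```

The resulting report gives one small dictionary per attribute, which is often enough to spot attributes with many missing values or suspicious ranges before any modeling starts.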
What kind of goals can a data mining project have?
Type of data mining task (regression, classification, clustering)
Predictive accuracy, model flexibility, interpretability, run-time
What types of methods are available for solving different types of problems?
Classification, regression, clustering, association analysis, neural networks.
How can different requirements or desirable properties affect our choice of method?
If the model needs to be highly interpretable, neural networks are a poor fit. Some methods also need more computational resources than others, which matters when run-time is limited.
How can a project go wrong from the very start?
If the data does not support the project goals.
What is the main goal of the Data Understanding phase?
Rule of thumb for Data Understanding?
Never trust any data before some plausibility tests!
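Plausibility tests can be as simple as a set of per-attribute rules applied to every record. The sketch below assumes hypothetical attributes and thresholds (age between 0 and 120, non-negative income); real rules would come from domain knowledge.

```python
# Hypothetical plausibility rules; the thresholds are assumptions, not facts
# from the flashcards.
rules = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
}

# Hypothetical records with two deliberately implausible values.
records = [
    {"age": 34, "income": 52000.0},
    {"age": 430, "income": 48000.0},  # likely a typo
    {"age": 29, "income": -1.0},      # possibly a sentinel for "unknown"
]

def implausible(records, rules):
    """Return (row index, attribute, value) for every value that fails its rule."""
    problems = []
    for i, record in enumerate(records):
        for name, is_plausible in rules.items():
            value = record.get(name)
            if not is_plausible(value):
                problems.append((i, name, value))
    return problems

issues = implausible(records, rules)
```

Each flagged entry points back to a row and attribute, so the suspicious values can be checked with a domain expert instead of being silently dropped.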
What types of attributes are there?
Can categorical attributes be represented as numerical?
What are some typical problems with data quality?