Project Life Cycle Flashcards
CRISP-DM methodology (general)
- Business Understanding
- Data Understanding
- Data Preparation
- Modelling
- Evaluation
- Deployment
What is Business Understanding?
a. Determine business objectives:
Investigate the background of the business objectives
Formulate business success criteria
b. Assess situation
Inventory of resources, requirements, assumptions and constraints
Risks and contingencies
Terminology (domain knowledge)
Costs and benefits
c. Determine Data Science goals
Data science project goals/hypotheses
Data science success criteria (when is a project done?)
d. Produce a project plan
Write a project plan
Make an initial assessment of tools, techniques and resources
What is Data Understanding?
- Collect initial data
- Get familiar with data landscape; what data sources and files can you get access to or do you need?
- Identify problems/challenges
- Make and communicate initial data collection report
- Describe data
- Write a data description report: including format, quantity, identities of the fields and other surface features
- Explore data
- Address data mining questions using querying, data visualisation and reporting techniques
- Do simple statistical analyses such as distributions, simple aggregations and relations between sub-populations
- Write a data exploration report; first findings on hypothesis/hypotheses?
- Verify data quality
- Examine the quality of the data (e.g. missing values, outliers, errors, incorrect data formats)
- Communicate data quality report and impact
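The "verify data quality" step above can be sketched as a small check that counts missing values and flags outliers. This is only a minimal illustration: the `records` data, the `quality_report` helper and the 1.5 x IQR outlier rule are assumptions chosen for the example, not part of CRISP-DM itself.

```python
import statistics

# Hypothetical sample records; None marks a missing value.
records = [
    {"age": 34, "income": 42000},
    {"age": None, "income": 39000},
    {"age": 29, "income": 41000},
    {"age": 41, "income": None},
    {"age": 38, "income": 40000},
    {"age": 36, "income": 250000},
]


def quality_report(rows, field):
    """Count missing values and flag outliers for one numeric field.

    Outliers are values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; this rule
    is an assumption made for the sketch, other rules are equally valid.
    """
    present = [r[field] for r in rows if r[field] is not None]
    missing = len(rows) - len(present)
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]
    return {"field": field, "missing": missing, "outliers": outliers}


for name in ("age", "income"):
    print(quality_report(records, name))
```

The resulting dictionaries are the kind of raw material that would go into the data quality report, together with a note on the expected impact on modelling.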
What is Data Preparation?
Select data
Describe data set
Logic for inclusion/exclusion
Clean data
- Raise the data quality (e.g. estimating missing values and removing outliers)
- Write down the decisions and assumptions you made in a data cleaning report
Construct (new) required data
- e.g. derived attributes, generated records
Integrate and format data
- Merge data
- Aggregate data
- Feature engineering
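The cleaning, aggregation, merging and attribute-construction steps above might look as follows on a toy data set. The `customers` and `orders` tables, the median imputation and the derived `is_buyer` attribute are all hypothetical choices made for this sketch; a real project would record each of them in the data cleaning report.

```python
import statistics

# Hypothetical source tables.
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": None},
    {"customer_id": 3, "age": 40},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 3, "amount": 10.0},
]

# Clean: estimate missing ages with the median of the observed ages
# (an assumption; other imputation strategies are possible).
median_age = statistics.median(
    c["age"] for c in customers if c["age"] is not None
)
cleaned = [
    {**c, "age": c["age"] if c["age"] is not None else median_age}
    for c in customers
]

# Aggregate: total order amount per customer.
totals = {}
for order in orders:
    key = order["customer_id"]
    totals[key] = totals.get(key, 0.0) + order["amount"]

# Integrate and construct: merge the totals into the customer records
# and derive a new attribute that flags customers with at least one order.
prepared = [
    {
        **c,
        "total_spent": totals.get(c["customer_id"], 0.0),
        "is_buyer": c["customer_id"] in totals,
    }
    for c in cleaned
]

for row in prepared:
    print(row)
```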
What is Modelling?
Which methods/tools must I use to get my results?
a. Select modelling technique
- Make and document choices for modelling techniques and assumptions
Three types of modelling techniques:
- Regression: given an input, predict a numeric value (Supervised learning)
- Linear Regression
- Classification: given an input, assign a class (Supervised learning)
- Linear classifiers (logistic regression and Support Vector Machines, SVM)
- Decision trees
- Random forest
- Nearest Neighbour
- K-Nearest Neighbour
- Clustering: given data, organise groups (Unsupervised learning)
- Neural Networks
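As an illustration of the classification family listed above, K-Nearest Neighbour can be sketched in a few lines: the class of a new point is decided by a majority vote among the k closest training points. The `knn_predict` function, the Euclidean distance and the toy `train` data are assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled training data: two well-separated groups.
train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"),
    ((4.1, 3.9), "B"),
    ((3.8, 4.0), "B"),
]

print(knn_predict(train, (1.1, 1.0)))  # nearest points all carry label "A"
print(knn_predict(train, (4.0, 4.0)))  # nearest points all carry label "B"
```

The value of k is a parameter setting that would be documented, with its rationale, in the "build model" step.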
b. Generate test design
- Describe the intended plan for training, testing and evaluating models
- Use of Cross-validation
c. Build model
- List the parameter settings along with the rationale for each
- Program/build the model and document results and difficulties encountered
d. Assess model
- Summarise model results according to the evaluation plan (e.g. accuracy and precision)
- Revise parameter settings (and document)
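The use of cross-validation in the test design can be sketched as follows: the data is split into k folds, each fold serves once as the test set while the remaining folds train the model, and the per-fold scores are then summarised. The helper names, the 1-nearest-neighbour stand-in model and the toy `data` are assumptions for this sketch; real data would normally be shuffled before splitting.

```python
import math


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]  # every n_folds-th sample
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


def nearest_neighbour(train, query):
    """1-nearest-neighbour classifier, used here only as a stand-in model."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


def cross_validate(data, n_folds=3):
    """Return the accuracy obtained on each fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(data), n_folds):
        train = [data[i] for i in train_idx]
        correct = sum(
            nearest_neighbour(train, data[i][0]) == data[i][1]
            for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return scores


# Hypothetical labelled data: (features, label) pairs.
data = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"),
    ((4.1, 3.9), "B"),
    ((3.8, 4.0), "B"),
]

scores = cross_validate(data, n_folds=3)
print(scores, sum(scores) / len(scores))
```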
What is Evaluation?
What insight can I extract from my results?
Accuracy (TP+TN)/(TP+TN+FP+FN)
Precision (TP)/(TP+FP)
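The two formulas above can be checked on a small worked example. The labels in `y_true` and `y_pred` are made up for illustration, and returning 0.0 for precision when there are no positive predictions is a convention assumed here (the ratio is otherwise undefined).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 by convention if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
print(precision(tp, fp))         # 0.75
```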
- Evaluate results
- Approve Models
- Get the approval of the modelling results (come back to project definition of done)
- Review of the process
- Did we correctly build the model?
- Did we have access to all the necessary data?
- Highlight activities that have been missed
- Determine next steps
- List of possible actions with pros and cons
- Write and communicate an action plan
What is Deployment?
Plan deployment
- Summarise deployment strategy
Initiate monitoring and maintenance
- How can we assure the data science result will be used in day-to-day business and its environment?
Produce final report
- Produce final report
- Plan presentation?
Review project
- Document experience
- Opportunities for new/follow-up projects?