Definitions Flashcards
Data Science is the art of turning data into actions.
Combines: Domain Expertise, Statistics, and Computing Skills
Flows back and forth between deductive and inductive reasoning
Relatively new discipline in which methodologies and frameworks are still being solidified
Inter-related concepts of Data Science
Analytics, Business Analytics, Data Science, Business Intelligence, Data Analytics, Big Data, Statistical Learning.
Deductive Reasoning
Theory Driven: Hypothesis -> Analytics
Inductive Reasoning
Empirically Driven: Analytics -> Hypothesis
Big Data:
Data in which the volume, variety, or velocity of information prohibits analysis via conventional desktop or server scale tools.
Distributed Processing (or computing):
A solution to the big data problem. Platforms which allow the power of individual machines to be simultaneously utilized to solve big data problems (e.g. Hadoop)
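A minimal sketch of the split-work-then-merge idea behind distributed processing. It is not Hadoop's API: threads on one machine stand in for the separate machines of a cluster, and the names count_words and distributed_word_count are hypothetical, chosen only for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Map step: count the words in one chunk of lines (one 'worker')."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines, n_workers=3):
    """Split the lines into chunks, count each chunk in parallel, then merge."""
    # Partition: worker i takes every n_workers-th line, starting at line i.
    chunks = [lines[i::n_workers] for i in range(n_workers)]
    # Assumption: threads stand in for the machines of a real cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Reduce step: merge the partial results into one total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = ["big data big tools", "data science", "big science"]
word_counts = distributed_word_count(lines)
print(word_counts["big"], word_counts["data"], word_counts["science"])
```

Each worker only ever sees its own chunk, which is what lets the same pattern scale to data too large for one machine.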
Machine Learning:
Most closely associated with Inductive reasoning. Algorithms that allow computers to learn from data without explicit instructions from the operator.
Supervised Learning:
Machine learning in which the outcome is defined by the operator. Think of predicting outcomes.
Unsupervised Learning:
Machine learning in which the outcome is not defined. Think of grouping observations or dimensions.
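A small sketch of the unsupervised case, assuming a one-dimensional k-means style grouping as the example technique (the flashcards do not name one). No outcome is supplied; the algorithm only sees the values and two starting centers, both made up for illustration.

```python
def k_means_1d(values, centers, n_iter=10):
    """Group 1-D values around the nearest center, then re-center."""
    centers = list(centers)
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to the index of its closest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (unchanged if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]  # illustrative data, no labels
centers, groups = k_means_1d(values, centers=[0.0, 5.0])
print(groups)
```

A supervised version of the same problem would start from values that already carry an operator-defined outcome, which is exactly what is missing here.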
Regression:
A class of problems in which the objective is to predict the value of an outcome.
Classification:
A class of problems in which the objective is to predict which group or “class” an observation is likely to belong to.
Parametric Techniques:
Techniques in which there are specific assumptions about the nature and/or shape of relationships between variables. E.g. in linear regression the slope of a line is being fit.
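A minimal sketch of the linear regression example, using ordinary least squares for a single predictor. The assumed shape of the relationship is a straight line, so fitting reduces to estimating two parameters, a slope and an intercept. The function name fit_line and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]  # illustrative data lying exactly on y = 2x + 1
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

This is also a regression problem in the sense defined above: the fitted line predicts the value of an outcome for a new x.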
Non-parametric Techniques:
Techniques in which there are not specific assumptions about the nature and/or shape of the relationships between variables. E.g. decision trees.
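A sketch of the simplest possible decision tree, a one-split "stump", assuming a single numeric feature and binary labels 0 and 1. Nothing about the shape of the relationship is assumed; the split point is found by searching the data. The names fit_stump and predict_stump are hypothetical.

```python
def fit_stump(xs, labels):
    """Find the single threshold on x that misclassifies the fewest points."""
    best = None
    points = sorted(set(xs))
    # Candidate thresholds: midpoints between neighbouring distinct values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        # Try both ways of labelling the two sides of the split.
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= threshold else right) != y
                for x, y in zip(xs, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    return best[1:]  # (threshold, left_label, right_label)


def predict_stump(stump, x):
    """Predict the class of x from a fitted stump."""
    threshold, left, right = stump
    return left if x <= threshold else right


xs = [1, 2, 3, 10, 11, 12]  # illustrative data
labels = [0, 0, 0, 1, 1, 1]
stump = fit_stump(xs, labels)
print(stump, predict_stump(stump, 4), predict_stump(stump, 20))
```

A full decision tree repeats this search recursively inside each side of the split; the stump is also an instance of classification, since it predicts which class an observation belongs to.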
Un-Structured Data:
Data that has no easily identified structure (e.g. free-form text responses)
Types of Analytics
Descriptive Analytics: What is or has been?
Predictive Analytics: What is likely to happen?
Prescriptive Analytics: What should you do?
Good Analytics
Creates Action: What will be different?
Understands context: What are the physics of the problem?
Avoids Bias: In the model and in the setup
Focuses on Impact: What value is generated?
Data Science is a response and solution to the data deluge
Tools and process to deal with “Big Data”
Creates advantage to companies that use it effectively
Data Science can handle a breadth of problems
Different domains, different outcomes, different purposes (Descriptive, Predictive, Prescriptive)
CRISP-DM Definition
Cross-Industry Standard Process for Data Mining
CRISP-DM Components
Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
CRISP-DM Business Understanding
Determine Business Objectives:
Business background
Objectives and Success Criteria
Assess the situation:
Resource Inventory (e.g. budget, people)
Requirements, Assumptions, Constraints, Risks, Contingencies
Cost/Benefit
Determine Data Mining Goals and Success Criteria
Produce a Project Plan
CRISP-DM Data Understanding
Collect Initial Data
Describe the Data
Explore the Data
Verify Data Quality
CRISP-DM Data Preparation
Select Data
Cleaning Data:
Missing, Invalids
Construct New Data:
Transformations, Structure Data
Integrate Data
Format Data
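A sketch of the cleaning and construction steps above on a toy dataset. The field names age and income, the rule that a negative age is invalid, the median fill for missing income, and the log transform are all assumptions made for illustration; the right choices depend on the data and the business problem.

```python
import math
import statistics


def prepare(records):
    """Clean raw records, then construct a transformed field."""
    # Cleaning (invalids): drop records whose age is missing or negative.
    valid = [r for r in records if r["age"] is not None and r["age"] >= 0]
    # Cleaning (missing): fill missing income with the median observed income.
    observed = [r["income"] for r in valid if r["income"] is not None]
    fill_value = statistics.median(observed)
    prepared = []
    for r in valid:
        income = r["income"] if r["income"] is not None else fill_value
        # Construct new data: add a log transform of income.
        prepared.append(
            {"age": r["age"], "income": income, "log_income": math.log(income)}
        )
    return prepared


raw = [
    {"age": 34, "income": 52000},
    {"age": -1, "income": 61000},  # invalid age, dropped
    {"age": 45, "income": None},   # missing income, filled
    {"age": 29, "income": 48000},
]
clean = prepare(raw)
print(len(clean), clean[1]["income"])
```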
CRISP-DM Modeling
Select Modeling Technique
Generate a Test Design
Build the Model
Assess the Model
Revise the Model
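A sketch of one common test design, a random holdout split, assuming that is an acceptable design for the problem (the flashcards do not prescribe one). The model is built on the training rows only and assessed on rows it has not seen. The baseline model, the 25% test fraction, and the synthetic data are all illustrative assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def mean_absolute_error(model, rows):
    """Average absolute gap between predicted and actual outcomes."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


# Synthetic (x, outcome) pairs, for illustration only.
rows = [(x, 2 * x + 1) for x in range(20)]
train, test = train_test_split(rows)

# Build the model: a baseline that always predicts the mean training outcome.
mean_y = sum(y for _, y in train) / len(train)


def baseline(x):
    return mean_y


# Assess the model on the held-out rows only.
error = mean_absolute_error(baseline, test)
print(len(train), len(test))
```

Revising the model then means changing the technique or its inputs and comparing the new held-out error against this baseline.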
CRISP-DM Evaluate the Results
Evaluate the Results Relative to Objectives
Review
Determine Next Steps
CRISP-DM Plan for Deployment
Deployment Plan
Monitoring and Maintenance Plan
Final Report
Final Review
Common Pitfalls in Data Science Projects
Assume model build and evaluation are a linear process:
In reality, they are very iterative
Agile methodologies are valuable here
Requires tight integration between data scientist and domain knowledge
Do not allocate enough time for data gathering, clean-up, and understanding
Often the longest pole in the tent
Often iterative as analysis leads to more questions requiring more data
Build solutions that are not compatible with infrastructure and implementation
Complexity of model overwhelms the ability to implement
Speed of execution not compatible with use case
Do not match monitoring and maintenance to the velocity of the problem
The world is not static: just because a model works today does not mean it will work tomorrow
Need to update the model in a way that is consistent with the business problem.
Poorly defined business problem
Contextual differences between build and application
Analytics do not answer the core question
Human bias
Confirmation
Uncertainty
The CRISP-DM methodology is an attempt to define the common analytical process that occurs regardless of industry context
The CRISP-DM methodology lays out the specific steps involved in an analytics project
Even with the CRISP-DM methodology, there are a series of common pitfalls to watch out for