Definitions Flashcards

1
Q

Data Science is the art of turning data into actions.

A

Combines: Domain Expertise, Statistics, and Computing Skills
Flows back and forth between deductive and inductive reasoning
Relatively new discipline in which methodologies and frameworks are still being solidified

2
Q

Inter-related concepts of Data Science

A

Analytics, Business Analytics, Data Science, Business Intelligence, Data Analytics, Big Data, Statistical Learning.

3
Q

Deductive Reasoning

A

Theory driven: Hypothesis -> Analytics.

4
Q

Inductive Reasoning

A

Empirically driven: Analytics -> Hypothesis.

5
Q

Big Data:

A

Data in which the volume, variety, or velocity of information prohibits analysis via conventional desktop or server scale tools.

6
Q

Distributed Processing (or computing):

A

A solution to the big data problem: platforms that allow the power of many individual machines to be used simultaneously on a single problem (e.g. Hadoop).
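
A minimal sketch of the split-and-combine idea behind such platforms, assuming a simple word-count task. Threads on one machine stand in for separate machines here, and `count_words`, `merge_counts`, `distributed_word_count`, and the sample chunks are hypothetical names and data, not any platform's API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Map step: each worker counts the words in its own chunk of the data."""
    return Counter(chunk.lower().split())


def merge_counts(partials):
    """Reduce step: combine the partial counts returned by all workers."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def distributed_word_count(chunks, workers=2):
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return merge_counts(partials)


# Hypothetical data, already split into chunks.
chunks = ["big data big tools", "data science turns data into actions"]
word_counts = distributed_word_count(chunks)
```

The point of the pattern is that no single worker ever needs to hold the whole data set; only the small partial results are brought together at the end.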

7
Q

Machine Learning:

A

Most closely associated with Inductive reasoning. Algorithms that allow computers to learn from data without explicit instructions from the operator.

8
Q

Supervised Learning:

A

Machine learning in which the outcome is defined by the operator. Think of it as predicting outcomes.

9
Q

Unsupervised Learning:

A

Machine learning in which the outcome is not defined. Think of it as grouping observations (clustering) or reducing dimensions.
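
As one illustration of grouping without a defined outcome, here is a small k-means sketch on one-dimensional data. The function name `kmeans_1d`, the starting centers, and the sample values are assumptions chosen for the example; k-means is only one of many unsupervised techniques.

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to its group's mean."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical measurements with no labels attached.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
```

No outcome column is supplied anywhere: the two groups emerge from the data alone.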

10
Q

Regression:

A

A class of problems in which the objective is to predict the value of an outcome.

11
Q

Classification:

A

A class of problems in which the objective is to predict which group or “class” an observation is likely to belong to.
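
A rough sketch of a classifier, assuming a single numeric feature and a nearest-centroid rule (one simple technique among many). The names `fit_centroids` and `classify` and the customer data are hypothetical.

```python
def fit_centroids(samples):
    """Average the feature values of the training samples in each class."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(value, centroids):
    """Predict the class whose centroid is closest to the new observation."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


# Hypothetical labeled data: (monthly spend, customer class).
training = [(20.0, "low"), (30.0, "low"), (80.0, "high"), (100.0, "high")]
centroids = fit_centroids(training)
prediction = classify(70.0, centroids)
```

Because the classes are supplied with the training data, this is also an instance of supervised learning.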

12
Q

Parametric Techniques:

A

Techniques in which there are specific assumptions about the nature and/or shape of relationships between variables. E.g. in linear regression the slope of a line is being fit.
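
To make the linear regression example concrete, the sketch below fits a slope and intercept by ordinary least squares. It assumes a single predictor; `fit_line` and the sample points are illustrative names and data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the assumed form y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
predicted = intercept + slope * 5.0
```

The whole model is two numbers, which is what makes it parametric: the straight-line shape is assumed up front and only its parameters are estimated.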

13
Q

Non-parametric Techniques:

A

Techniques in which there are not specific assumptions about the nature and/or shape of the relationships between variables. E.g. decision trees.
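
A decision stump (a decision tree with a single split) gives a feel for how trees work. This is a sketch under simplifying assumptions: one numeric predictor, distinct x values, and squared error as the split criterion. `fit_stump` and `predict_stump` are hypothetical names.

```python
def fit_stump(xs, ys):
    """Find the split on x that minimizes the squared error of the two leaf means."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        # Place the threshold halfway between neighbouring x values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(x, stump):
    """Return the mean of whichever leaf the observation falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


# Hypothetical data with a step rather than a straight-line relationship.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
stump = fit_stump(xs, ys)
```

Nothing here assumes a line or curve; the split point is learned from the data, which is why a straight-line fit would struggle on the same points.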

14
Q

Unstructured Data:

A

Data that has no easily identified structure (e.g. free-form text responses)

15
Q

Types of Analytics

A

Descriptive Analytics: What is or has been?

Predictive Analytics: What is likely to happen?

Prescriptive Analytics: What should you do?

16
Q

Good Analytics

A

Creates Action: What will be different?

Understands context: What are the physics of the problem?

Avoids Bias: In the model and in the setup

Focuses on Impact: What value is generated?

17
Q

Data Science is a response and solution to the data deluge

A

Tools and process to deal with “Big Data”

Creates advantage to companies that use it effectively

18
Q

Data Science can handle a breadth of problems

A
Different domains
Different outcomes
Different purposes (Descriptive, Predictive, Prescriptive)
19
Q

CRISP-DM Definition

A

Cross-Industry Standard Process for Data Mining

20
Q

CRISP-DM

Components

A
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
21
Q

CRISP-DM

Business Understanding

A

Determine Business Objectives:
Business background
Objectives and Success Criteria

Assess the situation:
Resource Inventory (e.g. budget, people)
Requirements, Assumptions, Constraints, Risks, Contingencies
Cost/Benefit

Determine Data Mining Goals and Success Criteria

Produce a Project Plan

22
Q

CRISP-DM

Data Understanding

A

Collect Initial Data

Describe the Data

Explore the Data

Verify Data Quality

23
Q

CRISP-DM

Data Preparation

A

Select Data

Cleaning Data:
Missing, Invalids

Construct New Data:
Transformations, Structure Data

Integrate Data

Format Data
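
The cleaning, construction, and formatting steps above might look like this in a very small case. The field names (`age`, `income`, `income_band`), the validity ranges, and the 50000 cutoff are all assumptions made up for the example.

```python
def prepare(records):
    """Apply cleaning, construction, and formatting steps to raw records."""
    prepared = []
    for rec in records:
        age, income = rec.get("age"), rec.get("income")
        # Cleaning: drop rows with missing values.
        if age is None or income is None:
            continue
        # Cleaning: drop rows with invalid values.
        if not (0 <= age <= 120) or income < 0:
            continue
        prepared.append({
            # Format: store each field with a consistent type.
            "age": int(age),
            "income": float(income),
            # Construct: derive a new field from an existing one.
            "income_band": "high" if income >= 50000 else "low",
        })
    return prepared


# Hypothetical raw records.
raw = [
    {"age": 34, "income": 62000},
    {"age": None, "income": 41000},   # missing age
    {"age": 210, "income": 38000},    # invalid age
    {"age": 51, "income": 45000},
]
clean = prepare(raw)
```

Dropping rows is only one way to handle missing or invalid values; imputing or flagging them may suit a given project better.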

24
Q

CRISP-DM

Modeling

A

Select Modeling Technique

Generate a Test Design

Build the Model

Assess the Model

Revise the Model
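
One possible reading of the test design, build, and assess steps is a hold-out split followed by an error measure on the held-out rows. The sketch below assumes a deliberately trivial baseline model (predict the training mean); all function names and the sample outcomes are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Test design: hold out a random share of the rows for assessment."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def build_mean_model(train):
    """Build: a baseline model that always predicts the training mean."""
    return sum(train) / len(train)


def mean_absolute_error(model, test):
    """Assess: average absolute gap between the prediction and held-out values."""
    return sum(abs(y - model) for y in test) / len(test)


# Hypothetical outcome values.
outcomes = [10.0, 12.0, 11.0, 13.0, 9.0, 14.0, 10.0, 12.0]
train, test = train_test_split(outcomes)
model = build_mean_model(train)
error = mean_absolute_error(model, test)
```

Revising the model then means changing the technique or its inputs and repeating the same assessment on the same held-out rows, so results stay comparable.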

25
Q

CRISP-DM

Evaluate the Results

A

Evaluate the Results Relative to Objectives

Review

Determine Next Steps

26
Q

CRISP-DM

Plan for Deployment

A

Deployment Plan

Monitoring and Maintenance Plan

Final Report

Final Review

27
Q

Common Pitfalls in Data Science Projects

A

Assume model build and evaluation are a linear process:
In reality, they are very iterative
Agile methodologies valuable here
Requires tight integration between data scientist and domain knowledge

Do not allocate enough time for data gathering, clean-up, and understanding
Often the longest pole in the tent
Often iterative, as analysis leads to more questions requiring more data

Build solutions that are not compatible with infrastructure and implementation
Complexity of model overwhelms the ability to implement
Speed of execution not compatible with use case

Do not match monitoring and maintenance to the velocity of the problem
The world is not static: a model that works today will not necessarily work tomorrow
Need to update the model in a way that is consistent with the business problem

Poorly defined business problem
Contextual differences between build and application
Analytics do not answer the core question

Human bias
Confirmation
Uncertainty

The CRISP-DM methodology is an attempt to define the common analytical process that occurs regardless of industry context

The CRISP-DM methodology lays out the specific steps involved in an analytics project

Even with the CRISP-DM methodology, there is a series of common pitfalls to watch out for