1 - Introduction Flashcards

1
Q

What is the growth rate of job openings in data science from 2012 to 2017?

A

6.5 times as many job openings in 2017 as in 2012

This indicates a significant increase in demand for data science professionals.

2
Q

What was IBM’s projection for the yearly demand for data scientists by 2020?

A

Nearly 700,000 openings

This projection reflects the increasing need for data professionals in various industries.

3
Q

What is the primary reason for the high demand for data scientists in America?

A

There is a shortage of talent

This shortage has made data science one of the top jobs in the U.S.

4
Q

Define data science.

A

The systematic analysis of data within a scientific framework

This involves an adaptive, iterative, and phased approach to data analysis.

5
Q

What are the three main components that data science combines?

A
  • Data-driven approach of statistical data analysis
  • Computational power and programming acumen of computer science
  • Domain-specific business intelligence

These components work together to uncover actionable insights from data.

6
Q

What is the Data Science Methodology (DSM)?

A

A framework that helps the analyst keep track of the analysis phases

DSM is adaptive and iterative, allowing for revisiting previous phases as needed.

7
Q

List the seven phases of the Data Science Methodology.

A
  • Problem Understanding Phase
  • Data Preparation Phase
  • Exploratory Data Analysis Phase
  • Setup Phase
  • Modeling Phase
  • Evaluation Phase
  • Deployment Phase

Each phase plays a crucial role in the data science process.

8
Q

What is the focus of the Problem Understanding Phase?

A

Clearly enunciate project objectives and formulate a solvable problem

This phase aims to align teams on the problem to be addressed.

9
Q

What is the most labor-intensive phase of the data science process?

A

Data Preparation Phase

This phase involves cleaning and preparing raw data for analysis.

10
Q

What tasks are performed during the Exploratory Data Analysis Phase?

A
  • Exploring univariate relationships
  • Exploring multivariate relationships
  • Binning based on predictive value
  • Deriving new variables

Simple exploratory methods are used to gain preliminary insights.
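
As a rough illustration of binning based on predictive value, the sketch below groups a numeric predictor into bins and compares the target rate across them. The records, the `age_bin` helper, and the cut points 30 and 50 are all assumptions made up for this example, not part of the original card.

```python
# Hypothetical records: (age, responded) pairs invented for illustration.
records = [(22, 0), (25, 0), (31, 1), (38, 1), (45, 1), (52, 0), (61, 0), (67, 0)]


def age_bin(age):
    """Assign an age to a bin; the cut points 30 and 50 are assumptions."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30 to 49"
    return "50 and over"


def response_rate_by_bin(rows):
    """Return the proportion of positive target values within each bin."""
    counts = {}
    for age, responded in rows:
        label = age_bin(age)
        total, positives = counts.get(label, (0, 0))
        counts[label] = (total + 1, positives + responded)
    return {label: positives / total for label, (total, positives) in counts.items()}


rates = response_rate_by_bin(records)
```

Bins whose response rates differ sharply suggest that the binned variable carries predictive value for the target.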

11
Q

What is the purpose of the Setup Phase?

A

To prepare for modeling by carrying out preliminary tasks such as cross-validation and establishing baseline performance

This ensures data is ready for effective modeling.
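
One simple way to establish baseline performance for a classification problem is to always predict the most common class in the training data. The sketch below assumes that approach; the function name `majority_class_baseline` and the label lists are hypothetical, and other baselines may suit other problems better.

```python
from collections import Counter


def majority_class_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test record.

    Returns the predicted label and the accuracy it achieves on the test labels.
    """
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, correct / len(test_labels)


# Hypothetical labels, invented for illustration.
train = ["no", "no", "yes", "no", "yes", "no"]
test = ["no", "yes", "no", "no"]

label, baseline_accuracy = majority_class_baseline(train, test)
```

Any model built later would need to beat `baseline_accuracy` on the same test data to be worth keeping.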

12
Q

What does the Modeling Phase involve?

A
  • Selecting and implementing modeling algorithms
  • Ensuring models outperform baseline models
  • Fine-tuning model algorithms

This phase is crucial for uncovering profitable relationships in the data.

13
Q

What is evaluated during the Evaluation Phase?

A
  • Model performance against baseline measures
  • Whether models solve the original problem
  • Application of error costs intrinsic to the data

This phase determines the effectiveness of the models.

14
Q

What is the final phase of the Data Science Methodology?

A

Deployment Phase

This phase involves reporting results and adapting models for real-world use.

15
Q

What are the most common data science tasks?

A
  • Description
  • Estimation
  • Classification
  • Clustering
  • Prediction
  • Association

Each task serves a specific purpose in data analysis.

16
Q

Define the Description task in data science.

A

Describing patterns and trends within the data

This task is often used by both specialists and nonspecialists.

17
Q

What is Estimation in data science?

A

Approximating the value of a numeric target variable using predictor variables

Estimation models learn from known target values to predict unknowns.

18
Q

What distinguishes Classification from Estimation?

A

Classification deals with categorical target variables, while estimation deals with numeric target variables

This makes classification a crucial task for many applications.

19
Q

What is the goal of the Clustering task?

A

Identifying groups of records that are similar

Clusters can provide insights and serve as inputs for further analysis.

20
Q

What does the Prediction task involve?

A

Forecasting future outcomes based on current data

Prediction can relate to both numeric and categorical variables.

21
Q

What is the Association task in data science?

A

Determining which attributes are associated with each other

This helps in understanding relationships between different variables.

22
Q

What is data science?

A

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

23
Q

Which areas of study does data science combine?

A

Data science combines statistics, computer science, and domain knowledge.

24
Q

What is the goal of data science?

A

The goal of data science is to extract meaningful insights and knowledge from data.

25
Name the seven phases of the DSM.
  • Problem Understanding Phase
  • Data Preparation Phase
  • Exploratory Data Analysis Phase
  • Setup Phase
  • Modeling Phase
  • Evaluation Phase
  • Deployment Phase
26
Why is it a good idea to have a Problem Understanding Phase?
The Problem Understanding Phase helps to clarify the objectives and scope of the data science project.
27
Why do we need a Data Preparation Phase? Name three issues that are handled in this phase.
The Data Preparation Phase is needed to clean and format the raw data for analysis. Issues handled include:
  • Handling missing values
  • Removing duplicates
  • Data normalization
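
The three issues named above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the `prepare` function, the `(id, income)` rows, mean imputation, and min-max scaling are choices made for the example, not methods prescribed by the card.

```python
def prepare(rows):
    """Drop exact duplicates, fill missing values with the mean, then min-max normalize.

    Each row is an (id, value) tuple; a missing value is represented by None.
    """
    # Remove exact duplicate rows while preserving the original order.
    unique_rows = list(dict.fromkeys(rows))

    # Impute missing values with the mean of the known values (an assumed strategy).
    known = [value for _, value in unique_rows if value is not None]
    mean = sum(known) / len(known)
    imputed = [(key, mean if value is None else value) for key, value in unique_rows]

    # Min-max normalization to the range [0, 1]; guard against a zero range.
    low = min(value for _, value in imputed)
    high = max(value for _, value in imputed)
    span = high - low
    return [(key, (value - low) / span if span else 0.0) for key, value in imputed]


# Hypothetical (id, income) rows with one duplicate and one missing value.
raw = [(1, 40.0), (2, None), (3, 60.0), (1, 40.0), (4, 80.0)]
clean = prepare(raw)
```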
28
In which phase does the data analyst begin to explore the data to learn some simple information?
Exploratory Data Analysis Phase
29
Explain in your own words why we need to establish baseline performance for our models. Which phase does this occur in?
Baseline performance provides a benchmark: a model is only worth keeping if it outperforms the baseline. This occurs in the Setup Phase.
30
Which phase represents the heart of your data scientific investigation? Why might we apply more than one algorithm to solve a problem?
Modeling Phase; more than one algorithm may be applied to find the best solution or to improve accuracy.
31
How do we determine whether our predictions are any good? During which phase does this occur?
We compare model performance against the baseline measures and check whether the models solve the original problem. This occurs during the Evaluation Phase.
32
True or false: The data scientist’s work is done with the Evaluation Phase. Explain.
False; the data scientist's work continues into the Deployment Phase, which involves reporting results and adapting models for real-world use.
33
Explain how the DSM is adaptive.
The DSM is adaptive as it can adjust to new information and changes in project requirements.
34
Describe how the DSM is iterative.
The DSM is iterative because it allows for revisiting and refining previous phases based on findings.
35
List the most common data science tasks.
  • Description
  • Estimation
  • Classification
  • Clustering
  • Prediction
  • Association
36
Which of these tasks have many nonspecialists been doing all along?
Description; describing patterns and trends in data is done by specialists and nonspecialists alike.
37
What is estimation? In estimation, what must be true of the target variable?
Estimation is approximating the value of a numeric target variable using predictor variables; the target variable must be numeric.
38
What is the most widespread task in data science? For this task, what must be true of the target variable?
Classification; the target variable must be categorical.
39
What are cluster profiles?
Cluster profiles are descriptions of the characteristics of each cluster formed during clustering.
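
A cluster profile can be as simple as the size of each cluster and the mean of each variable within it. The sketch below assumes cluster labels have already been assigned; the `cluster_profiles` function and the customer values are hypothetical and exist only to illustrate the idea.

```python
from statistics import mean

# Hypothetical customers: (cluster label, age, monthly spend); values are invented.
customers = [("A", 25, 40.0), ("A", 35, 60.0), ("B", 60, 20.0), ("B", 70, 30.0)]


def cluster_profiles(rows):
    """Summarize each cluster by its size and the mean of each variable."""
    grouped = {}
    for label, age, spend in rows:
        grouped.setdefault(label, []).append((age, spend))
    return {
        label: {
            "size": len(members),
            "mean_age": mean(age for age, _ in members),
            "mean_spend": mean(spend for _, spend in members),
        }
        for label, members in grouped.items()
    }


profiles = cluster_profiles(customers)
```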
40
True or false: Prediction can only be used for categorical target variables. Explain.
False; prediction can be used for both numeric and categorical target variables.
41
For an association rule, what do we mean by support?
Support refers to the proportion of records to which the rule applies, that is, records containing both the antecedent and the consequent.
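
To make support concrete, the sketch below counts the share of records that contain every item in a rule. It assumes the common definition in which a rule "if A then B" applies to a record when the record contains both A and B; the `rule_support` function and the baskets are hypothetical.

```python
def rule_support(transactions, antecedent, consequent):
    """Return the proportion of transactions containing both sides of the rule."""
    items = set(antecedent) | set(consequent)
    matches = sum(1 for transaction in transactions if items <= set(transaction))
    return matches / len(transactions)


# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

# Support for the rule "if bread, then milk": 3 of the 5 baskets contain both.
support = rule_support(baskets, {"bread"}, {"milk"})
```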