Data Science using Python and R - 1 Flashcards

1
Q

What is the growth rate of job openings in data science from 2012 to 2017?

A

There were 6.5 times as many job openings in 2017 as in 2012

This indicates a significant increase in demand for data science professionals.

2
Q

What was IBM’s projection for the yearly demand for data scientists by 2020?

A

Nearly 700,000 openings

This projection reflects the increasing need for data professionals in various industries.

3
Q

What is the primary reason for the high demand for data scientists in America?

A

There is a shortage of talent

This shortage has made data science one of the top jobs in the U.S.

4
Q

Define data science.

A

The systematic analysis of data within a scientific framework

This involves an adaptive, iterative, and phased approach to data analysis.

5
Q

What are the three main components that data science combines?

A
  • Data-driven approach of statistical data analysis
  • Computational power and programming acumen of computer science
  • Domain-specific business intelligence

These components work together to uncover actionable insights from data.

6
Q

What is the Data Science Methodology (DSM)?

A

A framework that helps the analyst keep track of the analysis phases

DSM is adaptive and iterative, allowing for revisiting previous phases as needed.

7
Q

List the seven phases of the Data Science Methodology.

A
  • Problem Understanding Phase
  • Data Preparation Phase
  • Exploratory Data Analysis Phase
  • Setup Phase
  • Modeling Phase
  • Evaluation Phase
  • Deployment Phase

Each phase plays a crucial role in the data science process.

8
Q

What is the focus of the Problem Understanding Phase?

A

Clearly enunciate project objectives and formulate a solvable problem

This phase aims to align teams on the problem to be addressed.

9
Q

What is the most labor-intensive phase of the data science process?

A

Data Preparation Phase

This phase involves cleaning and preparing raw data for analysis.

10
Q

What tasks are performed during the Exploratory Data Analysis Phase?

A
  • Exploring univariate relationships
  • Exploring multivariate relationships
  • Binning based on predictive value
  • Deriving new variables

Simple exploratory methods are used to gain preliminary insights.
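
As a minimal sketch of two of these tasks, the snippet below derives a new variable and bins a numeric variable, then checks the response rate in each bin. The records, field names, and bin cutoffs are hypothetical and chosen only for illustration; in practice the cutoffs would be chosen by inspecting how the target behaves across the variable.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative only.
records = [
    {"age": 22, "visits": 2, "spend": 40.0, "responded": 0},
    {"age": 35, "visits": 5, "spend": 250.0, "responded": 1},
    {"age": 41, "visits": 4, "spend": 120.0, "responded": 1},
    {"age": 58, "visits": 1, "spend": 30.0, "responded": 0},
    {"age": 63, "visits": 3, "spend": 90.0, "responded": 0},
    {"age": 29, "visits": 6, "spend": 300.0, "responded": 1},
]

def age_bin(age):
    """Assign an age to one of three bins (the cutoffs are assumptions)."""
    if age < 30:
        return "under_30"
    if age < 50:
        return "30_to_49"
    return "50_plus"

# Deriving new variables: average spend per visit, and the age bin.
for r in records:
    r["spend_per_visit"] = r["spend"] / r["visits"]
    r["age_bin"] = age_bin(r["age"])

# Response rate per bin: a simple check of how well the bins separate the target.
totals = defaultdict(int)
responders = defaultdict(int)
for r in records:
    totals[r["age_bin"]] += 1
    responders[r["age_bin"]] += r["responded"]

response_rate = {b: responders[b] / totals[b] for b in totals}
print(response_rate)
```

Bins whose response rates differ sharply suggest that the binned variable carries predictive value for the target.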

11
Q

What is the purpose of the Setup Phase?

A

To prepare for modeling by performing necessary tasks such as cross-validation and establishing baseline performance

This ensures data is ready for effective modeling.
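
The two Setup Phase tasks named above can be sketched as follows. This is an illustration under assumptions, not the deck's own method: the target values are hypothetical, the baseline is taken to be "always predict the most common class", and cross-validation is read here as k-fold splitting, which the card does not specify.

```python
from collections import Counter

# Hypothetical categorical target values for ten records.
target = ["no", "no", "yes", "no", "no", "yes", "no", "no", "no", "yes"]

def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def k_fold_indices(n_records, k):
    """Split record indices 0..n_records-1 into k folds of near-equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n_records):
        folds[i % k].append(i)
    return folds

baseline = baseline_accuracy(target)  # 7 of the 10 records are "no"
folds = k_fold_indices(len(target), k=5)

for held_out in folds:
    training = [i for i in range(len(target)) if i not in held_out]
    # A real model would be fit on `training` and checked on `held_out`.
    print(len(training), len(held_out))
```

Any model built later would need to beat the `baseline` value to be worth keeping.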

12
Q

What does the Modeling Phase involve?

A
  • Selecting and implementing modeling algorithms
  • Ensuring models outperform baseline models
  • Fine-tuning model algorithms

This phase is crucial for uncovering profitable relationships in the data.

13
Q

What is evaluated during the Evaluation Phase?

A
  • Model performance against baseline measures
  • Whether models solve the original problem
  • Application of error costs intrinsic to the data

This phase determines the effectiveness of the models.
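
To illustrate how error costs can change an evaluation, here is a small sketch that compares a model with a baseline on cost per record rather than accuracy alone. The confusion counts and the cost values are hypothetical assumptions, and the baseline is assumed to predict "negative" for every record.

```python
# Hypothetical confusion counts for a two-class model.
counts = {"true_pos": 40, "false_pos": 10, "false_neg": 20, "true_neg": 130}

# Assumed costs per error type; correct decisions cost nothing here.
cost_false_pos = 1.0
cost_false_neg = 5.0

total = sum(counts.values())
accuracy = (counts["true_pos"] + counts["true_neg"]) / total
error_cost = (counts["false_pos"] * cost_false_pos
              + counts["false_neg"] * cost_false_neg)
cost_per_record = error_cost / total

# Baseline that predicts "negative" for every record: it misses every actual positive.
actual_positives = counts["true_pos"] + counts["false_neg"]
baseline_cost_per_record = actual_positives * cost_false_neg / total

print(accuracy, cost_per_record, baseline_cost_per_record)
```

With these assumed costs, the model is preferred when its cost per record is lower than the baseline's.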

14
Q

What is the final phase of the Data Science Methodology?

A

Deployment Phase

This phase involves reporting results and adapting models for real-world use.

15
Q

What are the most common data science tasks?

A
  • Description
  • Estimation
  • Classification
  • Clustering
  • Prediction
  • Association

Each task serves a specific purpose in data analysis.

16
Q

Define the Description task in data science.

A

Describing patterns and trends within the data

This task is often used by both specialists and nonspecialists.

17
Q

What is Estimation in data science?

A

Approximating the value of a numeric target variable using predictor variables

Estimation models learn from known target values to predict unknowns.

18
Q

What distinguishes Classification from Estimation?

A

Classification deals with categorical target variables, while estimation deals with numeric target variables

This makes classification a crucial task for many applications.

19
Q

What is the goal of the Clustering task?

A

Identifying groups of records that are similar

Clusters can provide insights and serve as inputs for further analysis.

20
Q

What does the Prediction task involve?

A

Forecasting future outcomes based on current data

Prediction can relate to both numeric and categorical variables.

21
Q

What is the Association task in data science?

A

Determining which attributes are associated with each other

This helps in understanding relationships between different variables.

22
Q

What is data science?

A

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

23
Q

Which areas of study does data science combine?

A

Data science combines statistics, computer science, and domain knowledge.

24
Q

What is the goal of data science?

A

The goal of data science is to extract meaningful insights and knowledge from data.

25
Q

Name the seven phases of the DSM.

A
  • Problem Understanding Phase
  • Data Preparation Phase
  • Exploratory Data Analysis Phase
  • Setup Phase
  • Modeling Phase
  • Evaluation Phase
  • Deployment Phase
26
Q

Why is it a good idea to have a Problem Understanding Phase?

A

The Problem Understanding Phase helps to clarify the objectives and scope of the data science project.

27
Q

Why do we need a Data Preparation Phase? Name three issues that are handled in this phase.

A

The Data Preparation Phase is needed to clean and format the data for analysis. Issues handled include:
  • Handling missing values
  • Removing duplicates
  • Data normalization
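
The three issues listed above can be sketched in a few lines. The records are hypothetical, and the specific choices (mean imputation for missing values, min-max scaling for normalization) are assumptions made for illustration; other strategies are equally valid.

```python
# Hypothetical raw records: (customer_id, income); None marks a missing value.
raw = [(1, 30000.0), (2, None), (3, 90000.0), (1, 30000.0), (4, 60000.0)]

# 1. Remove exact duplicate records while keeping the original order.
deduplicated = list(dict.fromkeys(raw))

# 2. Handle missing values: here, replace them with the mean of the known values.
known = [income for _, income in deduplicated if income is not None]
mean_income = sum(known) / len(known)
filled = [(cid, mean_income if income is None else income)
          for cid, income in deduplicated]

# 3. Normalize: min-max scaling to the range [0, 1].
low = min(income for _, income in filled)
high = max(income for _, income in filled)
normalized = [(cid, (income - low) / (high - low)) for cid, income in filled]

print(normalized)
```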

28
Q

In which phase does the data analyst begin to explore the data to learn some simple information?

A

Exploratory Data Analysis Phase

29
Q

Explain in your own words why we need to establish baseline performance for our models. Which phase does this occur in?

A

Baseline performance gives a reference point for judging a model: a model is only useful if it outperforms the baseline. This occurs in the Setup Phase.

30
Q

Which phase represents the heart of your data scientific investigation? Why might we apply more than one algorithm to solve a problem?

A

Modeling Phase; more than one algorithm may be applied to find the best solution or to improve accuracy.

31
Q

How do we determine whether our predictions are any good? During which phase does this occur?

A

We compare model performance against baseline measures and check whether the models solve the original problem. This occurs during the Evaluation Phase.

32
Q

True or false: The data scientist’s work is done with the Evaluation Phase. Explain.

A

False; the data scientist’s work continues into the Deployment Phase, which involves reporting results and adapting models for real-world use.

33
Q

Explain how the DSM is adaptive.

A

The DSM is adaptive as it can adjust to new information and changes in project requirements.

34
Q

Describe how the DSM is iterative.

A

The DSM is iterative because it allows for revisiting and refining previous phases based on findings.

35
Q

List the most common data science tasks.

A
  • Description
  • Estimation
  • Classification
  • Clustering
  • Prediction
  • Association
36
Q

Which of these tasks have many nonspecialists been doing all along?

A

Description

37
Q

What is estimation? In estimation, what must be true of the target variable?

A

Estimation is approximating the value of a numeric target variable using predictor variables; the target variable must be numeric.

38
Q

What is the most widespread task in data science? For this task, what must be true of the target variable?

A

Classification; the target variable must be categorical.

39
Q

What are cluster profiles?

A

Cluster profiles are descriptions of the characteristics of each cluster formed during clustering.

40
Q

True or false: Prediction can only be used for categorical target variables. Explain.

A

False; prediction can be used for both numeric and categorical target variables.

41
Q

For an association rule, what do we mean by support?

A

Support refers to the proportion of records the rule applies to.
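
A small worked example may help here. The sketch below computes support for a hypothetical rule "bread → milk" over made-up market-basket records, assuming that a rule "applies to" a record when the record contains both the antecedent and the consequent.

```python
# Hypothetical market-basket records; each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter"},
    {"butter", "milk"},
]

def support(antecedent, consequent, records):
    """Proportion of records containing both the antecedent and the consequent."""
    items = set(antecedent) | set(consequent)
    matches = sum(1 for record in records if items <= record)
    return matches / len(records)

rule_support = support({"bread"}, {"milk"}, transactions)
print(rule_support)  # 2 of the 5 records contain both bread and milk
```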