Data Science using Python and R - 1 Flashcards

Question 1

Q

What is the growth rate of job openings in data science from 2012 to 2017?

Answer

A

There were 6.5 times as many job openings in 2017 as in 2012

This indicates a significant increase in demand for data science professionals.

Question 2

Q

What was IBM’s projection for the yearly demand for data scientists by 2020?

Answer

A

Nearly 700,000 openings

This projection reflects the increasing need for data professionals in various industries.

Question 3

Q

What is the primary reason for the high demand for data scientists in America?

Answer

A

There is a shortage of talent

This shortage has made data science one of the top jobs in the U.S.

Question 4

Q

Define data science.

Answer

A

The systematic analysis of data within a scientific framework

This involves an adaptive, iterative, and phased approach to data analysis.

Question 5

Q

What are the three main components that data science combines?

Answer

A

Data-driven approach of statistical data analysis
Computational power and programming acumen of computer science
Domain-specific business intelligence

These components work together to uncover actionable insights from data.

Question 6

Q

What is the Data Science Methodology (DSM)?

Answer

A

A framework that helps the analyst keep track of the analysis phases

DSM is adaptive and iterative, allowing for revisiting previous phases as needed.

Question 7

Q

List the seven phases of the Data Science Methodology.

Answer

A

Problem Understanding Phase
Data Preparation Phase
Exploratory Data Analysis Phase
Setup Phase
Modeling Phase
Evaluation Phase
Deployment Phase

Each phase plays a crucial role in the data science process.

Question 8

Q

What is the focus of the Problem Understanding Phase?

Answer

A

Clearly enunciate project objectives and formulate a solvable problem

This phase aims to align teams on the problem to be addressed.

Question 9

Q

What is the most labor-intensive phase of the data science process?

Answer

A

Data Preparation Phase

This phase involves cleaning and preparing raw data for analysis.

Question 10

Q

What tasks are performed during the Exploratory Data Analysis Phase?

Answer

A

Exploring univariate relationships
Exploring multivariate relationships
Binning based on predictive value
Deriving new variables

Simple exploratory methods are used to gain preliminary insights.
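
As a rough illustration of binning by predictive value and deriving a new variable, the sketch below uses made-up records with hypothetical fields (an age and a churn flag). The cut points are assumptions chosen for the example, not values from the text.

```python
from collections import defaultdict

# Hypothetical records: (age, churned) pairs invented for illustration.
records = [(22, 1), (25, 1), (31, 0), (38, 0), (45, 0), (52, 1), (61, 1), (67, 1)]

def age_bin(age):
    """Derive a categorical variable from a numeric one (assumed cut points)."""
    if age < 30:
        return "under_30"
    if age < 50:
        return "30_to_49"
    return "50_plus"

def churn_rate_by_bin(rows):
    """Return the proportion of churned records within each age bin."""
    counts = defaultdict(int)
    churned = defaultdict(int)
    for age, flag in rows:
        label = age_bin(age)
        counts[label] += 1
        churned[label] += flag
    return {label: churned[label] / counts[label] for label in counts}

rates = churn_rate_by_bin(records)
print(rates)
```

Bins whose target rates differ sharply, as they do in this toy data, are the kind that carry predictive value.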

Question 11

Q

What is the purpose of the Setup Phase?

Answer

A

To prepare for modeling by performing necessary tasks such as setting up cross-validation and establishing baseline performance

This ensures data is ready for effective modeling.
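
One simple way to establish baseline performance for a categorical target is to always predict the most common class. The sketch below assumes that approach and uses invented labels; other baselines are possible.

```python
from collections import Counter

# Hypothetical labels, invented for illustration.
train_labels = ["no", "no", "no", "yes", "no", "yes", "no", "no"]
test_labels = ["no", "yes", "no", "no"]

def majority_class(labels):
    """Return the most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Return the proportion of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# The baseline "model" predicts the majority class for every test record.
baseline_label = majority_class(train_labels)
baseline_predictions = [baseline_label] * len(test_labels)
baseline_accuracy = accuracy(baseline_predictions, test_labels)
print(baseline_label, baseline_accuracy)
```

Any model built later should beat this accuracy to be worth keeping.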

Question 12

Q

What does the Modeling Phase involve?

Answer

A

Selecting and implementing modeling algorithms
Ensuring models outperform baseline models
Fine-tuning model algorithms

This phase is crucial for uncovering profitable relationships in the data.

Question 13

Q

What is evaluated during the Evaluation Phase?

Answer

A

Model performance against baseline measures
Whether models solve the original problem
Application of error costs intrinsic to the data

This phase determines the effectiveness of the models.
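
To show why error costs matter, the sketch below compares two hypothetical models by total error cost rather than by error count. All counts and per-error costs are assumptions made up for the example.

```python
# Assumed cost of each kind of error (hypothetical values).
COST_FALSE_POSITIVE = 5.0
COST_FALSE_NEGATIVE = 50.0

def total_error_cost(false_positives, false_negatives):
    """Weight each error type by its assumed cost and sum the result."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)

# Hypothetical error counts for two candidate models.
model_a = {"false_positives": 30, "false_negatives": 10}  # 40 errors in total
model_b = {"false_positives": 10, "false_negatives": 20}  # 30 errors in total

cost_a = total_error_cost(**model_a)
cost_b = total_error_cost(**model_b)
print(cost_a, cost_b)
```

In this toy case model B makes fewer errors overall but costs more, because its errors are of the expensive kind.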

Question 14

Q

What is the final phase of the Data Science Methodology?

Answer

A

Deployment Phase

This phase involves reporting results and adapting models for real-world use.

Question 15

Q

What are the most common data science tasks?

Answer

A

Description
Estimation
Classification
Clustering
Prediction
Association

Each task serves a specific purpose in data analysis.

Question 16

Q

Define the Description task in data science.

Answer

A

Describing patterns and trends within the data

This task is often used by both specialists and nonspecialists.

Question 17

Q

What is Estimation in data science?

Answer

A

Approximating the value of a numeric target variable using predictor variables

Estimation models learn from known target values to predict unknowns.
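
A minimal sketch of estimation, assuming a single predictor and a straight-line least-squares fit (one of many possible estimation models). The data points are invented for illustration.

```python
# Hypothetical (predictor, numeric target) values with known targets.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def fit_line(x_values, y_values):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(xs, ys)

# Estimate the numeric target for a record whose target value is unknown.
estimate = slope * 5.0 + intercept
print(slope, intercept, estimate)
```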

Question 18

Q

What distinguishes Classification from Estimation?

Answer

A

Classification deals with categorical target variables, while estimation deals with numeric target variables

This makes classification a crucial task for many applications.

Question 19

Q

What is the goal of the Clustering task?

Answer

A

Identifying groups of records that are similar

Clusters can provide insights and serve as inputs for further analysis.

Question 20

Q

What does the Prediction task involve?

Answer

A

Forecasting future outcomes based on current data

Prediction can relate to both numeric and categorical variables.

Question 21

Q

What is the Association task in data science?

Answer

A

Determining which attributes are associated with each other

This helps in understanding relationships between different variables.

Question 22

Q

What is data science?

Answer

A

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

Question 23

Q

Which areas of study does data science combine?

Answer

A

Data science combines statistics, computer science, and domain knowledge.

Question 24

Q

What is the goal of data science?

Answer

A

The goal of data science is to extract meaningful insights and knowledge from data.

Question 25

Q

Name the seven phases of the DSM.

Answer

A

Problem Understanding Phase
Data Preparation Phase
Exploratory Data Analysis Phase
Setup Phase
Modeling Phase
Evaluation Phase
Deployment Phase

Question 26

Q

Why is it a good idea to have a Problem Understanding Phase?

Answer

A

The Problem Understanding Phase helps to clarify the objectives and scope of the data science project.

Question 27

Q

Why do we need a Data Preparation Phase? Name three issues that are handled in this phase.

Answer

A

The Data Preparation Phase is needed to clean and format the data for analysis. Issues handled include:
* Handling missing values
* Removing duplicates
* Data normalization
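
The three issues above can be sketched in plain Python as follows. The records, the mean-imputation choice, and min-max scaling are all assumptions made for illustration; other strategies are equally valid.

```python
# Hypothetical raw records: (customer_id, income); None marks a missing value.
raw = [(1, 40000.0), (2, None), (3, 60000.0), (1, 40000.0), (4, 80000.0)]

def remove_duplicates(rows):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def impute_mean(rows):
    """Replace missing incomes with the mean of the known incomes."""
    known = [value for _, value in rows if value is not None]
    mean = sum(known) / len(known)
    return [(key, mean if value is None else value) for key, value in rows]

def min_max_normalize(rows):
    """Rescale incomes to the range 0 to 1 (assumes they are not all equal)."""
    values = [value for _, value in rows]
    low, high = min(values), max(values)
    return [(key, (value - low) / (high - low)) for key, value in rows]

cleaned = min_max_normalize(impute_mean(remove_duplicates(raw)))
print(cleaned)
```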

Question 28

Q

In which phase does the data analyst begin to explore the data to learn some simple information?

Answer

A

Exploratory Data Analysis Phase

Question 29

Q

Explain in your own words why we need to establish baseline performance for our models. Which phase does this occur in?

Answer

A

Baseline performance gives a benchmark for judging our models: a model is only useful if it outperforms the baseline. This occurs in the Setup Phase.

Question 30

Q

Which phase represents the heart of your data scientific investigation? Why might we apply more than one algorithm to solve a problem?

Answer

A

Modeling Phase; more than one algorithm may be applied to find the best solution or to improve accuracy.

Question 31

Q

How do we determine whether our predictions are any good? During which phase does this occur?

Answer

A

We compare model performance against the baseline measures and check whether the models solve the original problem. This occurs during the Evaluation Phase.

Question 32

Q

True or false: The data scientist’s work is done with the Evaluation Phase. Explain.

Answer

A

False; the data scientist’s work continues into the Deployment Phase, where results are reported and models are adapted for real-world use.

Question 33

Q

Explain how the DSM is adaptive.

Answer

A

The DSM is adaptive as it can adjust to new information and changes in project requirements.

Question 34

Q

Describe how the DSM is iterative.

Answer

A

The DSM is iterative because it allows for revisiting and refining previous phases based on findings.

Question 35

Q

List the most common data science tasks.

Answer

A

Description
Estimation
Classification
Clustering
Prediction
Association

Question 36

Q

Which of these tasks have many nonspecialists been doing all along?

Answer

A

Description; describing patterns and trends in data is something both specialists and nonspecialists do.

Question 37

Q

What is estimation? In estimation, what must be true of the target variable?

Answer

A

Estimation is approximating the value of a numeric target variable using predictor variables; the target variable must be numeric.

Question 38

Q

What is the most widespread task in data science? For this task, what must be true of the target variable?

Answer

A

Classification; the target variable must be categorical.

Question 39

Q

What are cluster profiles?

Answer

A

Cluster profiles are descriptions of the characteristics of each cluster formed during clustering.
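
As a sketch of what a cluster profile might look like, the code below summarizes each cluster by its size and the mean of each variable. The records, cluster labels, and field names are hypothetical, and the clustering itself is assumed to have been done already.

```python
from collections import defaultdict

# Hypothetical records already assigned to clusters: (cluster, age, spend).
clustered = [(0, 25, 100.0), (0, 35, 140.0), (1, 60, 400.0), (1, 70, 500.0)]

def cluster_profiles(rows):
    """Summarize each cluster by its size and per-variable means."""
    groups = defaultdict(list)
    for label, age, spend in rows:
        groups[label].append((age, spend))
    profiles = {}
    for label, members in groups.items():
        size = len(members)
        profiles[label] = {
            "size": size,
            "mean_age": sum(age for age, _ in members) / size,
            "mean_spend": sum(spend for _, spend in members) / size,
        }
    return profiles

profiles = cluster_profiles(clustered)
print(profiles)
```

Here cluster 0 reads as younger, lower-spending records and cluster 1 as older, higher-spending ones.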

Question 40

Q

True or false: Prediction can only be used for categorical target variables. Explain.

Answer

A

False; prediction can also be used for continuous target variables.

Question 41

Q

For an association rule, what do we mean by support?

Answer

A

Support is the proportion of all records in the data set to which the rule applies.
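
A small sketch of computing support, assuming the common convention that a rule "applies" to a record when the record contains both the antecedent and the consequent. The transactions are invented for illustration.

```python
# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(antecedent, consequent, baskets):
    """Proportion of baskets that contain every item in the rule."""
    rule_items = antecedent | consequent
    matches = sum(rule_items <= basket for basket in baskets)
    return matches / len(baskets)

# Support of the rule "if bread, then milk".
bread_milk_support = support({"bread"}, {"milk"}, transactions)
print(bread_milk_support)
```

Three of the five baskets contain both bread and milk, so the rule's support is 0.6.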
