Data Science using Python and R - 1 Flashcards
What is the growth rate of job openings in data science from 2012 to 2017?
6.5 times as many job openings in 2017 as in 2012
This indicates a significant increase in demand for data science professionals.
What was IBM’s projection for the yearly demand for data scientists by 2020?
Nearly 700,000 openings
This projection reflects the increasing need for data professionals in various industries.
What is the primary reason for the high demand for data scientists in America?
There is a shortage of talent
This shortage has made data science one of the top jobs in the U.S.
Define data science.
The systematic analysis of data within a scientific framework
This involves an adaptive, iterative, and phased approach to data analysis.
What are the three main components that data science combines?
- Data-driven approach of statistical data analysis
- Computational power and programming acumen of computer science
- Domain-specific business intelligence
These components work together to uncover actionable insights from data.
What is the Data Science Methodology (DSM)?
A framework that helps the analyst keep track of the analysis phases
DSM is adaptive and iterative, allowing for revisiting previous phases as needed.
List the seven phases of the Data Science Methodology.
- Problem Understanding Phase
- Data Preparation Phase
- Exploratory Data Analysis Phase
- Setup Phase
- Modeling Phase
- Evaluation Phase
- Deployment Phase
Each phase plays a crucial role in the data science process.
What is the focus of the Problem Understanding Phase?
Clearly enunciate project objectives and formulate a solvable problem
This phase aims to align teams on the problem to be addressed.
What is the most labor-intensive phase of the data science process?
Data Preparation Phase
This phase involves cleaning and preparing raw data for analysis.
What tasks are performed during the Exploratory Data Analysis Phase?
- Exploring univariate relationships
- Exploring multivariate relationships
- Binning based on predictive value
- Deriving new variables
Simple exploratory methods are used to gain preliminary insights.
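As a rough illustration of binning and deriving a new variable, the sketch below turns a numeric predictor into a categorical one and checks how the target behaves within each bin. The records, the cut points, and the `age_bin` helper are invented for this example and are not taken from the cards.

```python
from collections import defaultdict

# Hypothetical records: (age, responded) pairs invented for illustration.
records = [(22, 0), (25, 0), (31, 1), (38, 1), (44, 1), (52, 0), (58, 0), (63, 0)]

def age_bin(age):
    """Derive a categorical variable from a numeric one using assumed cut points."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30 to 49"
    return "50 and over"

# Tally responders and totals per bin to see whether the bins separate the target.
counts = defaultdict(lambda: [0, 0])  # bin label -> [responders, total]
for age, responded in records:
    label = age_bin(age)
    counts[label][0] += responded
    counts[label][1] += 1

response_rate = {label: hits / total for label, (hits, total) in counts.items()}
print(response_rate)
```

If the response rates differ sharply between bins, the binned variable is likely to carry predictive value.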
What is the purpose of the Setup Phase?
To prepare for modeling by performing necessary tasks such as setting up cross-validation and establishing baseline performance
This ensures data is ready for effective modeling.
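Two Setup Phase tasks can be sketched in a few lines: partitioning the data and computing a baseline. The example assumes a simple holdout split as the form of cross-validation and a majority-class rule as the baseline model; the labels, the 75/25 split, and the seed are all invented.

```python
import random
from collections import Counter

# Hypothetical categorical target values, invented for illustration.
labels = ["no"] * 70 + ["yes"] * 30

# Partition the records into training and test sets (a simple holdout split,
# assumed here as the cross-validation setup).
rng = random.Random(42)
shuffled = labels[:]
rng.shuffle(shuffled)
split = int(0.75 * len(shuffled))
train, test = shuffled[:split], shuffled[split:]

# Baseline model: always predict the most common class seen in training.
majority_class = Counter(train).most_common(1)[0][0]
baseline_accuracy = sum(1 for y in test if y == majority_class) / len(test)
print(majority_class, round(baseline_accuracy, 2))
```

Any model built later would need to beat `baseline_accuracy` on the test set to be worth keeping.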
What does the Modeling Phase involve?
- Selecting and implementing modeling algorithms
- Ensuring models outperform baseline models
- Fine-tuning model algorithms
This phase is crucial for uncovering profitable relationships in the data.
What is evaluated during the Evaluation Phase?
- Model performance against baseline measures
- Whether models solve the original problem
- Application of error costs intrinsic to the data
This phase determines the effectiveness of the models.
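To see why error costs matter alongside plain accuracy, here is a small sketch that scores a set of predictions both ways. The actual and predicted values and the two cost constants are assumptions made up for the example.

```python
# Hypothetical actual and predicted classes for ten records.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

# Assumed costs: a missed positive is taken to be five times as costly as a false alarm.
COST_FALSE_NEGATIVE = 5.0
COST_FALSE_POSITIVE = 1.0

pairs = list(zip(actual, predicted))
false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)
false_positives = sum(1 for a, p in pairs if a == 0 and p == 1)

# Accuracy treats both error types alike; total cost weights them differently.
accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
total_cost = (false_negatives * COST_FALSE_NEGATIVE
              + false_positives * COST_FALSE_POSITIVE)
print(accuracy, total_cost)
```

Two models with the same accuracy can have very different total costs, which is why the comparison against the baseline is often made on cost as well.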
What is the final phase of the Data Science Methodology?
Deployment Phase
This phase involves reporting results and adapting models for real-world use.
What are the most common data science tasks?
- Description
- Estimation
- Classification
- Clustering
- Prediction
- Association
Each task serves a specific purpose in data analysis.
Define the Description task in data science.
Describing patterns and trends within the data
This task is often used by both specialists and nonspecialists.
What is Estimation in data science?
Approximating the value of a numeric target variable using predictor variables
Estimation models learn from known target values to predict unknowns.
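A minimal sketch of estimation, assuming a single predictor and an ordinary least-squares line as the model. The records (square footage and price) and the `estimate` helper are hypothetical.

```python
# Hypothetical training records with known target values:
# (square_feet, price_in_thousands).
known = [(1000, 200.0), (1500, 275.0), (2000, 350.0), (2500, 425.0)]

n = len(known)
mean_x = sum(x for x, _ in known) / n
mean_y = sum(y for _, y in known) / n

# Least-squares slope and intercept learned from the known target values.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in known)
         / sum((x - mean_x) ** 2 for x, _ in known))
intercept = mean_y - slope * mean_x

def estimate(square_feet):
    """Approximate the numeric target for a record whose target is unknown."""
    return intercept + slope * square_feet

print(estimate(1800))
```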
What distinguishes Classification from Estimation?
Classification deals with categorical target variables, while estimation deals with numeric target variables
This makes classification a crucial task for many applications.
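For contrast with estimation, the sketch below predicts a categorical target. It assumes a one-nearest-neighbor rule, chosen only because it is short; the training records and the `classify` helper are invented.

```python
# Hypothetical labeled records: (income_in_thousands, age) -> income class.
training = [
    ((30, 25), "low"),
    ((35, 30), "low"),
    ((80, 45), "high"),
    ((90, 50), "high"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def classify(point):
    """Return the class label of the closest training record."""
    _, label = min(training, key=lambda rec: squared_distance(rec[0], point))
    return label

print(classify((40, 28)), classify((85, 48)))
```

The output is a class label rather than a number, which is the distinction the card draws.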
What is the goal of the Clustering task?
Identifying groups of records that are similar
Clusters can provide insights and serve as inputs for further analysis.
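One way to picture clustering is a tiny one-dimensional k-means loop. This is a sketch under stated assumptions: two clusters, starting centers taken from the smallest and largest values, and invented data.

```python
# Hypothetical one-dimensional values with two visible groups.
values = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centers = [values[0], values[-1]]  # assumed starting centers

for _ in range(10):
    # Assign each value to its nearest center.
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    # Move each center to the mean of its cluster (keep it if the cluster is empty).
    new_centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    if new_centers == centers:
        break  # assignments have stabilized
    centers = new_centers

print(centers, clusters)
```

No target variable is involved; the groups come from similarity among the records alone.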
What does the Prediction task involve?
Forecasting future outcomes based on current data
Prediction can relate to both numeric and categorical variables.
What is the Association task in data science?
Determining which attributes are associated with each other
This helps in understanding relationships between different variables.
What is data science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
Which areas of study does data science combine?
Data science combines statistics, computer science, and domain knowledge.
What is the goal of data science?
The goal of data science is to extract meaningful insights and knowledge from data.
Name the seven phases of the DSM.
- Problem Understanding Phase
- Data Preparation Phase
- Exploratory Data Analysis Phase
- Setup Phase
- Modeling Phase
- Evaluation Phase
- Deployment Phase
Why is it a good idea to have a Problem Understanding Phase?
The Problem Understanding Phase helps to clarify the objectives and scope of the data science project.
Why do we need a Data Preparation Phase? Name three issues that are handled in this phase.
The Data Preparation Phase is needed to clean and format the data for analysis. Issues handled include:
- Handling missing values
- Removing duplicates
- Data normalization
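The three issues listed above can be sketched on a toy data set. The records are invented, and the specific choices (keep the first duplicate, impute with the mean, min-max scaling) are assumptions for illustration rather than the only reasonable options.

```python
# Hypothetical raw records; None marks a missing income value.
raw = [
    {"id": 1, "income": 40000.0},
    {"id": 2, "income": None},
    {"id": 3, "income": 60000.0},
    {"id": 3, "income": 60000.0},  # duplicate of the previous record
    {"id": 4, "income": 80000.0},
]

# 1. Remove duplicates, keeping the first record seen for each id.
seen, cleaned = set(), []
for row in raw:
    if row["id"] not in seen:
        seen.add(row["id"])
        cleaned.append(dict(row))

# 2. Handle missing values by imputing the mean of the observed incomes.
observed = [r["income"] for r in cleaned if r["income"] is not None]
mean_income = sum(observed) / len(observed)
for r in cleaned:
    if r["income"] is None:
        r["income"] = mean_income

# 3. Normalize income to the range [0, 1] with min-max scaling.
low = min(r["income"] for r in cleaned)
high = max(r["income"] for r in cleaned)
for r in cleaned:
    r["income_scaled"] = (r["income"] - low) / (high - low)

print(cleaned)
```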
In which phase does the data analyst begin to explore the data to learn some simple information?
Exploratory Data Analysis Phase
Explain in your own words why we need to establish baseline performance for our models. Which phase does this occur in?
Establishing baseline performance gives us a benchmark against which to judge whether a model's results are any good; a model that cannot beat the baseline adds no value. This occurs in the Setup Phase.
Which phase represents the heart of your data scientific investigation? Why might we apply more than one algorithm to solve a problem?
Modeling Phase; more than one algorithm may be applied to find the best solution or to improve accuracy.
How do we determine whether our predictions are any good? During which phase does this occur?
We compare model performance against the baseline measures and check whether the models solve the original problem. This occurs during the Evaluation Phase.
True or false: The data scientist’s work is done with the Evaluation Phase. Explain.
False; the data scientist’s work continues into the Deployment Phase, where results are reported and models are adapted for real-world use.
Explain how the DSM is adaptive.
The DSM is adaptive as it can adjust to new information and changes in project requirements.
Describe how the DSM is iterative.
The DSM is iterative because it allows for revisiting and refining previous phases based on findings.
List the most common data science tasks.
- Description
- Estimation
- Classification
- Clustering
- Prediction
- Association
Which of these tasks have many nonspecialists been doing all along?
Description
What is estimation? In estimation, what must be true of the target variable?
Estimation is approximating the value of a target variable using predictor variables; the target variable must be numeric.
What is the most widespread task in data science? For this task, what must be true of the target variable?
Classification; the target variable must be categorical.
What are cluster profiles?
Cluster profiles are descriptions of the characteristics of each cluster formed during clustering.
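A cluster profile can be as simple as a per-cluster summary table. The sketch below assumes cluster labels have already been assigned and summarizes each cluster by its size and variable means; the records and field names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records with an already-assigned cluster label.
records = [
    {"cluster": "A", "age": 25, "spend": 200.0},
    {"cluster": "A", "age": 35, "spend": 300.0},
    {"cluster": "B", "age": 60, "spend": 900.0},
    {"cluster": "B", "age": 70, "spend": 1100.0},
]

# Group the records by cluster label.
grouped = defaultdict(list)
for r in records:
    grouped[r["cluster"]].append(r)

# Profile each cluster by its size and the mean of each variable.
profiles = {
    label: {
        "size": len(rows),
        "mean_age": mean(r["age"] for r in rows),
        "mean_spend": mean(r["spend"] for r in rows),
    }
    for label, rows in grouped.items()
}
print(profiles)
```

Here cluster A reads as younger, lower-spending customers and cluster B as older, higher-spending ones.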
True or false: Prediction can only be used for categorical target variables. Explain.
False; prediction can also be used for continuous target variables.
For an association rule, what do we mean by support?
Support refers to the proportion of records the rule applies to.
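As a worked illustration, the sketch below computes support for the rule "bread implies milk", taking "applies to" to mean that both the antecedent and the consequent hold in a record (a common reading, assumed here). Confidence is shown alongside for comparison. The transactions are invented.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

antecedent, consequent = {"bread"}, {"milk"}

# Count records containing the antecedent, and those containing both sides.
has_antecedent = sum(1 for t in transactions if antecedent <= t)
has_both = sum(1 for t in transactions if antecedent <= t and consequent <= t)

# Support: share of all records where the whole rule holds.
support = has_both / len(transactions)
# Confidence: share of antecedent records where the consequent also holds.
confidence = has_both / has_antecedent
print(support, confidence)
```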