1 - Introduction Flashcards
By what factor did data science job openings grow from 2012 to 2017?
6.5 times as many job openings in 2017 as in 2012
This indicates a significant increase in demand for data science professionals.
What was IBM’s projection for the yearly demand for data scientists by 2020?
Nearly 700,000 openings
This projection reflects the increasing need for data professionals in various industries.
What is the primary reason for the high demand for data scientists in America?
There is a shortage of talent
This shortage has made data science one of the top jobs in the U.S.
Define data science.
The systematic analysis of data within a scientific framework
This involves an adaptive, iterative, and phased approach to data analysis.
What are the three main components that data science combines?
- Data-driven approach of statistical data analysis
- Computational power and programming acumen of computer science
- Domain-specific business intelligence
These components work together to uncover actionable insights from data.
What is the Data Science Methodology (DSM)?
A framework that helps the analyst keep track of the analysis phases
DSM is adaptive and iterative, allowing for revisiting previous phases as needed.
List the seven phases of the Data Science Methodology.
- Problem Understanding Phase
- Data Preparation Phase
- Exploratory Data Analysis Phase
- Setup Phase
- Modeling Phase
- Evaluation Phase
- Deployment Phase
Each phase plays a crucial role in the data science process.
What is the focus of the Problem Understanding Phase?
Clearly enunciate project objectives and formulate a solvable problem
This phase aims to align teams on the problem to be addressed.
What is the most labor-intensive phase of the data science process?
Data Preparation Phase
This phase involves cleaning and preparing raw data for analysis.
What tasks are performed during the Exploratory Data Analysis Phase?
- Exploring univariate relationships
- Exploring multivariate relationships
- Binning based on predictive value
- Deriving new variables
Simple exploratory methods are used to gain preliminary insights.
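A minimal sketch of these exploratory tasks, using only the Python standard library. The records, the age cut points, and the derived variable are hypothetical, chosen only to illustrate a univariate summary, a relationship between a binned predictor and the target, and a derived variable.

```python
from statistics import mean

# Hypothetical records: (age, income, purchased), where purchased is the target.
records = [
    (22, 28000, 0),
    (35, 52000, 1),
    (47, 61000, 1),
    (51, 45000, 0),
    (63, 39000, 1),
    (29, 33000, 0),
]


def age_bin(age):
    """Assign an age to a coarse bin (the cut points are illustrative)."""
    if age < 30:
        return "under_30"
    if age < 50:
        return "30_to_49"
    return "50_plus"


# Univariate exploration: summarize one variable on its own.
mean_income = mean(income for _, income, _ in records)

# Multivariate exploration: compare the target across bins of a predictor.
targets_by_bin = {}
for age, _, purchased in records:
    targets_by_bin.setdefault(age_bin(age), []).append(purchased)
purchase_rate = {name: mean(values) for name, values in targets_by_bin.items()}

# Derived variable: a new column computed from existing ones.
income_per_year_of_age = [income / age for age, income, _ in records]
```

Bins whose purchase rates differ clearly are the ones worth keeping as separate categories. Bins with similar rates could be merged.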
What is the purpose of the Setup Phase?
To prepare for modeling by performing tasks such as cross-validation and establishing baseline performance
This ensures data is ready for effective modeling.
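A rough sketch of two setup tasks, assuming k-fold splitting as the form of cross-validation and a "predict the most common class" rule as the baseline. The card specifies neither, and the labels and function names here are hypothetical.

```python
from collections import Counter


def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) for k folds of near-equal size."""
    indices = list(range(n_records))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_indices = indices[start:start + size]
        train_indices = indices[:start] + indices[start + size:]
        yield train_indices, test_indices
        start += size


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical target values for ten records.
labels = ["no", "no", "yes", "no", "yes", "no", "no", "yes", "no", "no"]

folds = list(k_fold_indices(len(labels), k=3))
baseline = baseline_accuracy(labels)
```

Any model built later should beat `baseline` on the held-out folds to be worth keeping.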
What does the Modeling Phase involve?
- Selecting and implementing modeling algorithms
- Ensuring models outperform baseline models
- Fine-tuning model algorithms
This phase is crucial for uncovering profitable relationships in the data.
What is evaluated during the Evaluation Phase?
- Model performance against baseline measures
- Whether models solve the original problem
- Application of error costs intrinsic to the data
This phase determines the effectiveness of the models.
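To illustrate why error costs matter, here is a small sketch with a hypothetical cost table in which a missed "yes" is ten times as costly as a false alarm. The labels, predictions, and costs are made up for the example.

```python
# Hypothetical error costs, keyed by (actual, predicted).
costs = {
    ("yes", "yes"): 0,
    ("no", "no"): 0,
    ("yes", "no"): 10,  # missed positive: expensive
    ("no", "yes"): 1,   # false alarm: cheap
}


def accuracy(actual, predicted):
    """Fraction of records whose predicted class matches the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def total_cost(actual, predicted, costs):
    """Sum the cost of every prediction under the given cost table."""
    return sum(costs[(a, p)] for a, p in zip(actual, predicted))


actual = ["no", "no", "yes", "no", "yes", "no"]
model_predictions = ["no", "yes", "yes", "no", "yes", "yes"]
baseline_predictions = ["no"] * len(actual)  # always predict the majority class

model_accuracy = accuracy(actual, model_predictions)
baseline_model_accuracy = accuracy(actual, baseline_predictions)
model_cost = total_cost(actual, model_predictions, costs)
baseline_cost = total_cost(actual, baseline_predictions, costs)
```

In this toy data the model and the baseline have the same accuracy, but the model's total cost is far lower because its mistakes are the cheap kind. Accuracy alone would hide that difference.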
What is the final phase of the Data Science Methodology?
Deployment Phase
This phase involves reporting results and adapting models for real-world use.
What are the most common data science tasks?
- Description
- Estimation
- Classification
- Clustering
- Prediction
- Association
Each task serves a specific purpose in data analysis.
Define the Description task in data science.
Describing patterns and trends within the data
This task is often used by both specialists and nonspecialists.
What is Estimation in data science?
Approximating the value of a numeric target variable using predictor variables
Estimation models learn from known target values to predict unknowns.
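As one possible illustration, simple least-squares linear regression is an estimation model: it learns a slope and intercept from records with known target values and then estimates the target for a new record. The data and function names below are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def estimate(x, intercept, slope):
    """Estimate the numeric target for a new predictor value."""
    return intercept + slope * x


# Hypothetical training data with known numeric target values.
predictor = [1, 2, 3, 4]
target = [3, 5, 7, 9]

intercept, slope = fit_line(predictor, target)
estimated_target = estimate(5, intercept, slope)
```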
What distinguishes Classification from Estimation?
Classification deals with categorical target variables, while estimation deals with numeric target variables
Classification applies whenever the outcome to be predicted is a category, such as whether a customer will churn.
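A minimal sketch of a classifier, assuming a 1-nearest-neighbor rule (one of many possible algorithms, not one the card names). The training records and the labels "low" and "high" are hypothetical.

```python
import math

# Hypothetical training records: (numeric predictors, categorical target).
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((5.0, 5.0), "high"),
    ((6.0, 4.5), "high"),
]


def classify(point, training):
    """Return the class label of the training record closest to `point`."""
    _, nearest_label = min(training, key=lambda record: math.dist(point, record[0]))
    return nearest_label


first_label = classify((1.2, 1.4), training)
second_label = classify((5.5, 5.0), training)
```

The output is a category label, not a number, which is what separates this from an estimation model.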
What is the goal of the Clustering task?
Identifying groups of records that are similar
Clusters can provide insights and serve as inputs for further analysis.
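One way to sketch clustering is one-dimensional k-means, shown here as an assumption since the card does not name an algorithm. The values and the starting centers are hypothetical. Note that no target variable is involved.

```python
def k_means_1d(values, centers, max_iter=100):
    """Group numeric values around cluster centers using basic k-means."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            for v in values
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # centers stopped moving, so the clusters are stable
        centers = new_centers
    return centers, labels


values = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
final_centers, cluster_labels = k_means_1d(values, centers=[min(values), max(values)])
```

The resulting `cluster_labels` could then be fed into a later model as an extra predictor.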
What does the Prediction task involve?
Forecasting future outcomes based on current data
Prediction can relate to both numeric and categorical variables.
What is the Association task in data science?
Determining which attributes are associated with each other
This helps in understanding relationships between different variables.
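Association is commonly quantified with support and confidence. Those measures are an assumption here, since the card does not name them, and the transactions below are hypothetical.

```python
def count_containing(transactions, items):
    """Count the transactions that contain every item in `items`."""
    required = set(items)
    return sum(1 for transaction in transactions if required <= transaction)


def support(transactions, antecedent, consequent):
    """Share of all transactions containing both antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of antecedent transactions that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / count_containing(transactions, antecedent)


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

rule_support = support(transactions, {"bread"}, {"milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

Here the rule "bread implies milk" holds in three of five transactions overall and in three of the four transactions that contain bread.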
What is data science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Which areas of study does data science combine?
Data science combines statistics, computer science, and domain knowledge.
What is the goal of data science?
The goal of data science is to extract meaningful insights and knowledge from data.