Study Flashcards
Also known as the discovery phase
Business understanding
Analyst defines the major questions of interest that need to be answered
Business understanding
The phase of collecting data
Data acquisition
Alternative names include data cleansing, data wrangling, data munging, and feature engineering
Data cleaning
When ignored, the results from analysis may be irrelevant
No one common tool, may use SQL, Python, R, or Excel
Data quality is measured in terms of uniqueness and relevance
Data cleaning
Analyst begins to understand the basic nature of data and the relationships within
Often relies on visualization tools and numerical summaries such as central tendency and variability
Central tendency is a single value that attempts to describe a set of data by identifying the central position
Variability describes how far apart data points lie from each other and from the center of a distribution
Data exploration
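As an illustration of the numerical summaries named on this card, here is a minimal sketch using Python's standard statistics module. The order counts are made-up values, not data from the course.

```python
import statistics

# Hypothetical sample: daily order counts (illustrative values only).
orders = [12, 15, 11, 15, 30, 14, 15]

# Central tendency: single values that describe the center of the data.
mean_orders = statistics.mean(orders)
median_orders = statistics.median(orders)
mode_orders = statistics.mode(orders)

# Variability: how far values spread from each other and from the center.
value_range = max(orders) - min(orders)
std_dev = statistics.stdev(orders)  # sample standard deviation

print(mean_orders, median_orders, mode_orders, value_range, round(std_dev, 2))
```

The single large value (30) pulls the mean above the median, which is one reason to look at more than one summary.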
Creating models that enable predictions of outcomes of interest
Tools such as Python and R play an important role in automating the training and use of models
Predictive modeling
Sometimes machine learning is used as a synonym
Data mining
Ability of computers to look for patterns in large amounts of data
Tools such as Python and R play an important role
Data mining
An analyst tells the story of the data and uses graphs or interactive dashboards to inform others of the findings from the analyses
Reporting and visualization
The goal is to provide actionable insights for various stakeholders
Reporting and visualization
Scope Project
Identify stakeholders and research questions/KPIs
Identify timeline, budget, and participants
Business Understanding
Gather/collect data from a variety of sources
Provide structure to data accessible via relational databases (SQL)
Build data pipeline (ETL)
Use of API to download data from an external source
Data acquisition
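A data pipeline (ETL) like the one on this card can be sketched with the standard library alone. The CSV text, the table name sales, and the function names are assumptions made for the example; a real pipeline would usually read from files, an API, or a production database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this might come from a file or an API.
RAW_CSV = """customer,amount
alice, 10.50
bob,7
alice,4.25
"""

def extract(text):
    """Extract: read CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace and convert amounts to floats."""
    return [(r["customer"].strip(), float(r["amount"])) for r in rows]

def load(records, conn):
    """Load: store the cleaned records in a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
load(transform(extract(RAW_CSV)), conn)

# Once loaded, the data is accessible through SQL.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```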
Estimate/project future values or likelihood of an event.
Extend correlations found in EDA to mathematical models
Predict/determine output values based on input values
Cross-validation of predictive models to check accuracy on unseen data.
Predictive Modeling
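The ideas on this card (predict an output from an input, then cross-validate) can be sketched as follows. This is an illustrative example, assuming a simple straight-line model and made-up spend/sales numbers; fit_line and k_fold_mae are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def k_fold_mae(xs, ys, k):
    """Average held-out mean absolute error over k folds."""
    fold_errors = []
    for fold in range(k):
        # Every k-th point (offset by the fold number) is held out for testing.
        test_idx = set(range(fold, len(xs), k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        slope, intercept = fit_line(train_x, train_y)
        errors = [abs(ys[i] - (slope * xs[i] + intercept)) for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Hypothetical data: advertising spend (input) and sales (output).
spend = [1, 2, 3, 4, 5, 6]
sales = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

slope, intercept = fit_line(spend, sales)
cv_error = k_fold_mae(spend, sales, k=3)
print(round(slope, 2), round(intercept, 2), round(cv_error, 2))
```

Each fold is scored only on points the model did not see during fitting, so a low cv_error gives some evidence that the model generalizes.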
Creating training and testing datasets to build models from
Identify/detect patterns
Determine if groups (clusters) exist in data
Classify data into groups
Create models that “learn” and improve (e.g., machine/deep learning, AI, etc.)
Data mining
Tell a story with data
Provide a summary of the analysis
Provide insights to stakeholders
Create insightful graphs that showcase trends and forecasts
Reporting and visualization
What happened?
Descriptive Analytics
Why did it happen?
Diagnostic Analytics
What will happen?
Predictive Analytics
How can we make it happen?
Prescriptive Analytics
Is a relationship between two variables: when one variable changes, you know the degree to which the other variable changes
Correlation
Is when there is a real-world explanation for why this is logically happening; it implies a cause and effect
Causation
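One common way to measure correlation is the Pearson coefficient, sketched below. The study-hours data is invented for illustration, and pearson_r is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 75]

r = pearson_r(hours, scores)
print(round(r, 3))
```

A value of r near 1 or -1 indicates a strong linear relationship. It does not by itself show causation; that still requires a real-world explanation.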
Which phase of the data analytics life cycle is also known as the discovery phase?
Business understanding
Which phase of the data analytics life cycle allows an analyst to use graphs or interactive dashboards to tell the story of the data?
Reporting and visualization
In which phase of the data analytics life cycle does the analyst begin to understand the nature of the data?
Data exploration
Which phase of the data analytics life cycle provides structure to data accessible via relational databases?
Data acquisition
Which term is defined as a relationship between two variables?
Correlation
A way to graph numerical data in groups or bins that allow bars to represent frequencies
Histogram
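The binning behind a histogram can be sketched without a plotting library. The ages and the bin width of 10 are assumptions for the example; histogram_counts is a hypothetical helper name.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical data: ages of survey respondents.
ages = [21, 25, 29, 33, 35, 38, 41, 47, 52]
width = 10
bins = histogram_counts(ages, width)

# Text rendering: one bar per bin, bar length equals the frequency.
for start, count in bins.items():
    print(f"{start}-{start + width - 1}: {'#' * count}")
```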
Provides a concise summary of the quartiles of numerical data (i.e., cut points that divide the data into four segments, each holding 25% of the values)
Boxplot
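The quartiles a boxplot draws can be computed with statistics.quantiles. The values below are illustrative, and the "inclusive" method is one of several quartile conventions, so other tools may give slightly different cut points.

```python
import statistics

# Hypothetical sorted sample (illustrative values only).
values = [2, 4, 4, 5, 7, 8, 9, 11, 12]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

print(min(values), q1, q2, q3, max(values), iqr)
```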
Colorful graph that can visually show frequency or interaction using a range of colors
Heatmap
Two-dimensional graph
Great to visualize correlation or relationships
Scatterplot
Predict an outcome based on a set of predictor variables
Regression
Technique in which the analyst wants to assign an item to a specific category based on various conditions
Classification
Groupings are unknown and the analyst wishes to determine if the objects belong to any groups
Clustering
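One widely used clustering technique is k-means, which the card itself does not name. Below is a minimal one-dimensional sketch, assuming two clusters and hand-picked starting centers; k_means_1d is a hypothetical helper name.

```python
def k_means_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center
    to the mean of its group; repeat for a fixed number of iterations."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical data with two visible groups; starting centers are guesses.
values = [1, 2, 3, 10, 11, 12]
centers, groups = k_means_1d(values, centers=[1, 12])
print(centers, groups)
```

No group labels are supplied up front; the groups emerge from the distances alone, which is what separates clustering from classification.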
Looks for trends in data over time
Focused on breaking apart different reasons for the variation (decomposition)
Time series
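A simple form of decomposition estimates the trend with a moving average and treats what is left as the remaining variation. The series and the window of 3 are assumptions for the example; full decompositions usually also separate a seasonal component.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical monthly sales (illustrative values only).
series = [10, 12, 14, 13, 15, 17, 16, 18, 20]

trend = moving_average(series, 3)
# Align each trend value with the middle point of its window.
remainder = [actual - t for actual, t in zip(series[1:-1], trend)]

print(trend)
print(remainder)
```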
Technique attempts to group variables into meaningful groups
Principal Component Analysis
Tool - Data Science (Deep Learning/AI), Web Development, Embedded Systems
Python
Tool - Data analysis and statistical modeling
R
Tool - Can easily perform matrix computation as well as optimization
Python
Tool - Consists of many easy-to-use packages
R
Key Characteristic: Often numbers or labels, stored in a structured framework of columns and rows relating to pre-set parameters
Typical File Types: Databases
Structured Data File Type
Key Characteristic: Loosely organized into categories using meta tags
Typical File Types: JSON, XML, Email, Web pages
Semi-structured Data File Type
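To illustrate the difference from structured data, the sketch below reads JSON records whose fields vary from record to record and flattens them into fixed columns. The records and column names are made up for the example.

```python
import json

# Hypothetical semi-structured input: records share tags (keys) but not a fixed schema.
RAW_JSON = """
[
  {"name": "Ana", "email": "ana@example.com"},
  {"name": "Raj", "tags": ["vip"]}
]
"""

records = json.loads(RAW_JSON)

# Impose a structured, row-and-column layout; missing fields become None.
columns = ["name", "email", "tags"]
rows = [[record.get(column) for column in columns] for record in records]

for row in rows:
    print(row)
```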
Key Characteristic: Text-heavy information that’s not organized in a clearly defined framework or model
Typical File Types: Audio, Video, Image data, Natural Language, Documents
Unstructured Data File Type