Phases of Data Life Cycle Flashcards
Business Understanding
Planning & Discovery, Scope project
Identify stakeholders and research questions/KPIs
Identify timeline, budget, and participants
A lack of clear focus on stakeholders, timeline, limitations, and budget could derail an analysis
Data Acquisition
Extraction, data gathering, data query, data collection, ETL (extract, transform, load)
Gather/collect data from a variety of sources
Provide structure to data accessible via relational databases (SQL)
Build data pipeline (ETL)
Use of API to download data from an external source
Quality and type of data may make access more difficult
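The extract, transform, load steps named on this card can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not a production pipeline: the CSV text, the `orders` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for a real relational database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this might be a file or an API response.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,carol,
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(raw_rows):
    """Transform: normalize names, convert types, skip rows with no amount."""
    cleaned = []
    for row in raw_rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # missing amount: leave the row out of the load
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# Once loaded, the data is accessible through ordinary SQL queries.
rows = conn.execute(
    "SELECT customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(rows)
```

Note that the transform step silently drops the row with a missing amount; whether to drop, flag, or impute such rows is a data cleaning decision.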
Data Cleaning
Wrangling
Scrubbing
Munging
Fixing improperly formatted values
Dealing with duplicates, missing data, and outliers
Data reduction
Some cleaning techniques could dramatically change data/outcomes
Outliers not dealt with can cause problems with statistical models due to excessive variability.
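A small sketch of the cleaning steps on this card: removing duplicates, dropping missing values, and flagging outliers. The records are made up, and the 1.5 × IQR fence used here is one common convention for flagging outliers, assumed for illustration rather than prescribed by these notes.

```python
import statistics

# Hypothetical records: id 2 is duplicated, id 3 is missing, id 6 is extreme.
records = [
    {"id": 1, "value": 12.0},
    {"id": 2, "value": 15.0},
    {"id": 2, "value": 15.0},
    {"id": 3, "value": None},
    {"id": 4, "value": 14.0},
    {"id": 5, "value": 13.0},
    {"id": 6, "value": 400.0},
    {"id": 7, "value": 16.0},
    {"id": 8, "value": 11.0},
]

# 1. Remove duplicates, keeping the first record seen for each id.
seen = set()
unique = []
for rec in records:
    if rec["id"] not in seen:
        seen.add(rec["id"])
        unique.append(rec)

# 2. Drop records with missing values.
values = [rec["value"] for rec in unique if rec["value"] is not None]

# 3. Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = [v for v in values if low <= v <= high]
outliers = [v for v in values if not (low <= v <= high)]

# Compare summary statistics before and after removing outliers.
print("mean before:", statistics.mean(values), "after:", statistics.mean(kept))
print("stdev before:", statistics.stdev(values), "after:", statistics.stdev(kept))
```

The before/after comparison shows both cautions from this card: a single extreme value inflates variability, and removing it changes the summary statistics substantially, so the choice should be deliberate and documented.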
Data Exploration
Exploratory Data Analysis (EDA)
Descriptive Statistics
Central tendency/measures of center (e.g., mean, median, mode), variability (e.g., standard deviation and quartiles), and distributions (e.g., normal, skewed, etc.)
Identify basic correlations between variables
Pattern discovery
Skipping this step can lead to faulty perceptions of the data, which hurts advanced analytics.
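The descriptive statistics and basic correlation described on this card can be computed with Python's standard library. The sketch below uses made-up data (hours studied and exam scores), and the `pearson` helper is a hypothetical name written out by hand so the formula is visible.

```python
import math
import statistics

# Hypothetical sample: hours studied and exam scores for seven students.
hours = [1, 2, 2, 3, 4, 5, 6]
scores = [52, 55, 60, 61, 70, 74, 80]

# Measures of center.
center = {
    "mean": statistics.mean(hours),
    "median": statistics.median(hours),
    "mode": statistics.mode(hours),
}

# Measures of variability.
spread = {
    "stdev": statistics.stdev(hours),                # sample standard deviation
    "quartiles": statistics.quantiles(hours, n=4),   # Q1, Q2, Q3
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson(hours, scores)
print(center)
print(spread)
print("correlation:", round(r, 3))
```

A correlation near 1 or -1 suggests a linear relationship worth carrying into the modeling phase; it does not by itself show that one variable causes the other.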
Predictive Modeling
Data Modeling
Correlation based models
Regression models
Time series
Estimate/project future values or likelihood of an event.
Extend correlations found in EDA to mathematical models
Predict/determine output values based on input values
Cross-validation of predictive models to ensure accuracy.
Too many input variables (predictors) can cause problems
Correlation does not imply causation.
Time series models often need a sufficiently long history of data to identify trends precisely.
Predictive model accuracy should be assessed using cross-validation.
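Cross-validation, mentioned twice on this card, can be illustrated with a simple linear regression. The sketch below is a minimal k-fold version in plain Python under stated assumptions: the data is made up, `fit_line` and `k_fold_mse` are hypothetical helper names, and folds are formed by taking every k-th observation rather than by random shuffling.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def k_fold_mse(xs, ys, k):
    """Average held-out mean squared error across k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Every k-th observation, starting at `fold`, is held out for testing.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        slope, intercept = fit_line(train_x, train_y)
        squared = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k

# Hypothetical data: roughly y = 2x + 1 with small measurement noise.
xs = list(range(10))
ys = [1.2, 2.9, 5.1, 6.8, 9.3, 10.9, 13.2, 14.7, 17.1, 19.0]

cv_mse = k_fold_mse(xs, ys, k=5)
print("cross-validated MSE:", round(cv_mse, 4))
```

Because each fold's error is measured on observations the model never saw during fitting, the averaged error is a more honest estimate of predictive accuracy than the error on the training data. For time series, folds that respect time order are generally more appropriate than the interleaved folds used here.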
Data Mining
Machine Learning
Deep Learning
AI (artificial intelligence)
Supervised/Unsupervised Models
Creating training and testing datasets to build models from
Identify/detect patterns
Determine if groups (clusters) exist in data
Classify data into groups
Create models that “learn” and improve (e.g., machine/deep learning, AI, etc)
Building and evaluating a model on the entire dataset is problematic; subset the data into training and testing datasets to build models.
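The training/testing split on this card can be sketched as follows. This is an illustrative example only: the labeled points are made up, `train_test_split` and `predict_1nn` are hypothetical helpers written for this sketch (not a library API), and a 1-nearest-neighbor rule stands in for whatever supervised classifier would actually be used.

```python
import math
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train_rows, point):
    """Classify a point with the label of its nearest training neighbor."""
    nearest = min(train_rows, key=lambda row: math.dist(row[0], point))
    return nearest[1]

# Hypothetical labeled data: two well-separated groups.
data = [
    ((1.0, 1.2), "low"), ((0.8, 0.9), "low"), ((1.1, 0.7), "low"),
    ((1.3, 1.0), "low"), ((0.9, 1.4), "low"), ((1.2, 1.1), "low"),
    ((5.0, 5.2), "high"), ((4.8, 5.1), "high"), ((5.3, 4.9), "high"),
    ((5.1, 4.7), "high"), ((4.9, 5.4), "high"), ((5.2, 5.0), "high"),
]

train, test = train_test_split(data)

# Accuracy is measured only on rows the model never saw during training.
correct = sum(predict_1nn(train, point) == label for point, label in test)
accuracy = correct / len(test)
print(len(train), "training rows,", len(test), "testing rows, accuracy:", accuracy)
```

Evaluating on the held-out rows is what makes the accuracy figure meaningful; a model scored on the same rows it was built from can look far better than it will perform on new data.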
Reporting and Visualization
Dashboards
Tell a story with data
Provide a summary of the analysis
Provide insights to stakeholders
Create insightful graphs that showcase trends and forecasts
Because reports may reach a large audience, mistakes can cause bad business decisions and loss of revenue
Improper scales in graphs can push the audience toward an inaccurate interpretation of the story
Iron Triangle
Quality, Time, Cost
1. High Quality, Rapid Delivery, High Cost
2. Low Cost, Rapid Delivery, Low Quality
3. High Quality, Low Cost, Slow Delivery
Find the sweet spot
Data Privacy
Responsibly collecting, using and storing data about people, in line with the expectations of those people, your customers, regulations and laws.
Data Ethics
Doing the right thing with data, considering the human impact from all sides, and making decisions based on your brand values.
IRAC
Issue: State the legal issue(s) to be discussed.
Rule: State the relevant statutes and case law.
Application: Apply the relevant rules to the facts that created the issue.
Conclusion: State the most likely conclusions using the logic of the application section and whether there has been a violation.
Ethics Assessments starting in 2019 should be asking…
What is fair? What is the right thing to do?
Regulators will have to decide if the focus for regulation should be…
Rights-based, risk- and harms-based, accountability-based, or a combination of all three
Policy makers must consider:
What harms are they trying to protect people from?
What rights do they want to guarantee?
What problems are they trying to solve?
What are the privacy outcomes they hope to achieve for their citizens?
8 Ethical Guidelines for Data Analytics
- The ethical statistician uses methodology and data that are relevant and appropriate; without favoritism or prejudice; and in a manner intended to produce valid, interpretable, and reproducible results.
- The ethical statistician is candid about any known or suspected limitations, defects, or biases in the data that may affect the integrity or reliability of the statistical analysis.
- The ethical statistician supports valid inferences, transparency, and good science in general, keeping the interests of the public, funder, client, or customer in mind.
- The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project.
- Even in adversarial settings, discourse tends to be most successful when statisticians treat one another with mutual respect and focus on scientific principles, methodology, and the substance of data interpretations.
- The ethical statistician understands the differences between questionable statistical, scientific, or professional practices and practices that constitute misconduct.
- Those employing any person to analyze data are implicitly relying on the profession’s reputation for objectivity. However, this creates an obligation on the part of the employer to understand and respect statisticians’ obligation of objectivity.
- Science and statistical practice are often conducted in teams made up of professionals with different professional standards. The statistician must know how to work ethically in this environment.