Introduction to Data Literacy Flashcards
What can help us learn how data can be used to connect the dots and create value?
Data Literacy
The ability to read, work with, analyze, and communicate insights with data.
Data Literacy
Three main components of data literacy?
Reading data
Working with and analyzing data
Communicating insights with data
What does reading data consist of?
Identifying data sources
Collecting data
Managing data
Allow you to store, organize, and share your data
Databases
Main tools for communication?
Visualizations and Storytelling
In the DIKW pyramid, this consists of raw observations or measurements?
Data
In the DIKW pyramid, this is unorganized, unprocessed, and does not have meaning (yet)
Data
In the DIKW pyramid, this refers to raw data placed into context.
Information
In the DIKW pyramid, this is typically done by organizing or aggregating data.
Information
In the DIKW pyramid, this refers to combining information and making connections to learn and gain meaning.
Knowledge
In the DIKW pyramid, this is typically done by detecting patterns, making generalizations or predictions.
Knowledge
In the DIKW pyramid, this is applied knowledge, or knowledge in action, as it allows you to act proactively.
Wisdom
In the DIKW pyramid, this is typically done by combining knowledge logically to determine the course of action.
Wisdom
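The four DIKW levels can be sketched in a few lines. This is a minimal illustration only: the temperature readings, the 20-degree cutoff, and the suggested action are made-up assumptions, not part of the deck.

```python
# Data: hypothetical raw observations (day, temperature in Celsius).
data = [("Mon", 18), ("Tue", 21), ("Wed", 24), ("Thu", 27)]

# Information: data placed into context by aggregating it.
temps = [t for _, t in data]
average = sum(temps) / len(temps)

# Knowledge: a pattern detected across the information.
rising = all(later > earlier for earlier, later in zip(temps, temps[1:]))

# Wisdom: knowledge applied to decide on a course of action.
# The cutoff of 20 and the action text are illustrative assumptions.
action = "schedule extra cooling" if rising and average > 20 else "no change"
```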
Characteristics of insights?
Allow you to get closer to wisdom
Valuable, realistically achieved
Apply knowledge and take action
Approached, but not quite reached
The process of using data to make an informed decision about a specific problem and acting upon it.
Data-driven decision making
5 main steps that underpin every data-driven process:
Problem statement
Data Collection
Data Analysis
Communication
Action and reflection
Problem statement answers the question:
What is the problem that you want to solve?
Step in data-driven decision making that guides the data-driven process?
Problem statement
Typical problem categories:
Describing the state of an organization or process
Diagnosing causes of events
Detecting anomalies or predicting events
Guiding questions on how to define a problem:
What is the current situation?
What do we need to know?
Where do we want to be?
A good problem statement is:
Clearly defined
Actionable
Realistic
Data comes in different forms
Images and text
Network and spatial data
Different sources of data?
Open Data and Internal data
Open data includes:
Public databases and records
The data type is important because it affects:
How to collect the data
How to store the data
How to analyze the data
Data in tabular form
Structured Data
Easy to search and organize
Structured Data
Requires less preprocessing
Structured Data
Stored in relational databases
Structured Data
Data without pre-defined structure
Unstructured data
Difficult to search and organize
Unstructured data
Requires more preprocessing
Unstructured Data
Stored in document databases
Unstructured Data
Examples of structured data
Spreadsheets
Data tables
Examples of unstructured data
Images
Videos
Sound
Text
Describes something with numbers
Quantitative
Can be measured or counted
Quantitative
Wider range of statistics and analysis methods
Quantitative
Describes something with categories
Qualitative
Can be observed
Qualitative
More restricted range of statistics and analysis methods
Qualitative
Allows the user to store, retrieve, and access the data
Database management system (DBMS)
Different types of databases
Relational vs. document databases
Data warehouse vs. data lake
Document databases store what type of data?
Unstructured data
Relational databases store what type of data?
Structured Data
Contains processed, organized data in preparation for future analysis
Data warehouse
Used to store raw data that has not been prepared yet.
Data Lake
Designing and optimizing database systems is typically the responsibility of a _______
Data engineer
Data is stored on remote servers and accessed over the internet
Data storage in the cloud
Data storage in the cloud has services provided by a specialized third party
True
Cost-effective, but still dependent on a third party for security
True
The purpose of ___________ is to move data from one database to another.
Pipelines
Pipelines can automate data collection and storage via the _____________
ETL Process
ETL process stands for?
Extract, Transform, and Load
Making use of pipelines ensures what?
The availability of up-to-date and accurate data
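A pipeline's three ETL stages can be sketched with Python's standard library. This is a minimal sketch, not a production pipeline: the CSV text, the `sales` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical source export; in practice this would come from a file or API.
RAW_CSV = "name,amount\nalice, 10.5 \nbob,7\n"

def extract(text):
    """Extract: read rows out of the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy the names and convert amounts to numbers."""
    return [(r["name"].strip().title(), float(r["amount"])) for r in rows]

def load(rows, conn):
    """Load: write the prepared rows into the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
```

Running the three functions on a schedule is what keeps the target database up to date.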
Accessing and Retrieving data from databases?
Querying
Industry standard for querying?
SQL
SQL stands for?
Structured Query Language
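A short sketch of querying, using Python's built-in `sqlite3` module to run SQL against an in-memory database. The `orders` table, its columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ben", 40.0), ("ana", 15.0)],
)

# Retrieve the total spend per customer, largest first.
query = """
    SELECT customer, SUM(total) AS spend
    FROM orders
    GROUP BY customer
    ORDER BY spend DESC
"""
result = conn.execute(query).fetchall()
```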
Another way to leverage the data available in databases?
Dashboards
Alternative, non-technical way of collecting, managing, and sharing data between teams.
Dashboards
Provides information at a glance?
Dashboards
Receives data from a linked database
Dashboards
Data is presented in a very visual way
Dashboards
A multipurpose tool used for exploratory analysis of the data and for communicating insights
Dashboards
Dirty data is categorized as what?
Incorrect
Incomplete
Inconsistent
Caused by human error, technical issues, or issues with the data collection process
Dirty data
Consists of data that is incorrect or inconsistent
Data Errors
Data errors are typically caused by _____________ in recording the value or the format
Human or Technical error
Techniques to counter data errors:
Original value or valid format is known: correct data
If unknown: drop data
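The "correct if known, drop if unknown" rule might look like this in practice. The records and the `KNOWN` lookup table are hypothetical; a real project would define its own list of valid formats.

```python
# Hypothetical records where "country" was entered inconsistently.
records = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "usa"},
    {"id": 3, "country": "??"},
]

# Known corrections: the valid value for each recognized variant.
KNOWN = {"us": "US", "usa": "US", "u.s.": "US"}

cleaned = []
for rec in records:
    fixed = KNOWN.get(rec["country"].strip().lower())
    if fixed is not None:
        # The valid format is known, so correct the value.
        cleaned.append({**rec, "country": fixed})
    # Otherwise the original value is unknown, so the record is dropped.
```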
When data is incomplete, what do we call it?
Missing data
Missing data will be problematic if:
Many data points are missing
There are underlying patterns in the missing data
What techniques can we use to counter missing data?
Dropping data
Imputation
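Both techniques can be sketched on a small list. The ages are made up, and mean imputation is only one possible choice; the median or a model-based estimate may suit other data better.

```python
from statistics import mean

# Hypothetical survey answers; None marks a missing value.
ages = [34, None, 29, 41, None, 36]

# Option 1: drop the missing data points.
dropped = [a for a in ages if a is not None]

# Option 2: impute, here by filling gaps with the mean of the observed values.
fill = mean(dropped)
imputed = [a if a is not None else fill for a in ages]
```

Dropping shrinks the data set, while imputation keeps every row at the cost of introducing estimated values.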
Societal bias can be reflected in data
Data Bias
Leads to unrepresentative data and results
Data Bias
Techniques to counter or avoid data bias:
Sound data collection process
Awareness in conclusions
Explainable AI models
Set of techniques to counter data problems
Data Cleaning
Important preparation step for any data analysis
Data Cleaning
Not all data problems are completely solvable
True
Four main types of analytics:
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
What is being asked in Descriptive analytics?
What happened?
What is being asked in Diagnostic Analytics?
Why is it happening?
What is being asked in Predictive Analytics?
What will happen?
What is being asked in Prescriptive Analytics?
What should we do?
What type of analytics is responsible for finding the root causes of events?
Diagnostic Analytics
What type of analytics summarizes and visualizes the data?
Descriptive Analytics
What type of analytics identifies the possible outcomes and the probability that they will happen?
Predictive Analytics
What type of analytics determines the best course of action given the outcome we want to achieve?
Prescriptive Analytics
Common techniques for Descriptive analytics
Descriptive statistics
Visualizations
Outlier Detection
EDA
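Two of these techniques, descriptive statistics and outlier detection, can be sketched with Python's `statistics` module. The visit counts are invented, and the 1.5 x IQR rule used here is one common convention for flagging outliers, not the only one.

```python
from statistics import mean, median, quantiles

# Hypothetical daily website visits, with one unusual day.
visits = [10, 12, 11, 13, 12, 50]

# Descriptive statistics: summarize the main characteristics.
summary = {
    "mean": mean(visits),
    "median": median(visits),
    "min": min(visits),
    "max": max(visits),
}

# Outlier detection: flag values far outside the interquartile range.
q1, _, q3 = quantiles(visits, n=4)
iqr = q3 - q1
outliers = [v for v in visits if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```

The gap between the mean (18) and the median (12) already hints that a single extreme value is pulling the average up.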
Why should we use descriptive analytics?
Get to know the data
Investigate relationships in the data
Preparation for more advanced techniques
Focus on exploring the data:
Assessing main characteristics
Finding relationships, patterns or groups
Suggesting hypotheses for future analysis
Exploratory Data Analysis
Groundwork for further analysis but also valuable on its own
EDA
Why use diagnostic analytics?
Find potential causes of events or reasons for behaviors
Investigate causal relationships
Suggest solutions based on the identified causes
Common techniques of Diagnostic Analytics:
Drill-down analytics
Correlation and regression analysis
Hypothesis testing
Root cause analysis
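As an illustration of correlation analysis, the sketch below computes the Pearson correlation coefficient by hand. The `ad_spend` and `sales` figures are hypothetical, and a strong correlation only suggests a candidate cause; it does not prove one.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical monthly figures.
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 15, 19, 22, 27]

r = pearson(ad_spend, sales)  # close to 1: strong positive relationship
```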
Formal set of steps to look beyond superficial causes that have a direct effect
Root cause Analysis
Steps of Root cause analysis
Define the event
Collect relevant data
Determine Contributing factors
Find root causes
Recommend possible solutions
Why use Predictive analytics?
Anticipate most likely outcomes
Forecast a process or sequence
Estimate an unknown based on the information that is available
Two types of machine learning models:
Classification-Based
Regression-Based
Common techniques used in Predictive Analytics:
Machine Learning Models
Time Series forecasting
Predictive text analysis
Predicting housing prices based on neighborhood characteristics
Regression-based
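A regression-based prediction can be sketched with an ordinary least-squares line. The room counts and prices are invented for illustration, and a single feature stands in for the richer neighborhood characteristics a real model would use.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: rooms per home and price in thousands.
rooms = [2, 3, 4, 5]
price = [150, 200, 250, 300]

slope, intercept = fit_line(rooms, price)

# Predict a numeric outcome for an unseen home with 6 rooms.
predicted = slope * 6 + intercept
```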
Predicting cancellation of subscriptions
Classification-based
Predicting sales revenue over time
Time series Forecasting
Predicting whether an email is spam or not
Predictive text analysis
Steps in Predictive Modeling
Define the outcome
Collect and Prepare data
Build Predictive model
Interpret and evaluate the model
Implement / Fine-tune
In the predictive modeling phase, data is split into __________________ to build the predictive model
Training and Test Set
Predictions are interpreted and evaluated on the test data, using pre-determined metrics like accuracy (the percentage of correct predictions)
True
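The split-then-evaluate idea can be sketched as follows. Everything here is an assumption made for illustration: the rows are generated from a made-up rule (fewer than 5 logins means cancellation), the "model" is just a learned cutoff, and the split uses a fixed stride so the result is reproducible, whereas real splits are usually random.

```python
# Hypothetical labeled rows: (monthly logins, cancelled?).
rows = [(i % 11, i % 11 < 5) for i in range(100)]

# Split into a test set (every 5th row, 20%) and a training set (the rest).
test = rows[::5]
train = [row for i, row in enumerate(rows) if i % 5 != 0]

def fit_threshold(training_rows):
    """Choose the login cutoff with the most correct training predictions."""
    def n_correct(cutoff):
        return sum((logins < cutoff) == label for logins, label in training_rows)
    return max(range(12), key=n_correct)

# Build the model on the training set only.
cutoff = fit_threshold(train)

# Evaluate on the held-out test set: accuracy is the percentage correct.
hits = sum((logins < cutoff) == label for logins, label in test)
accuracy = 100 * hits / len(test)
```

Because the labels follow the rule exactly, the accuracy here is perfect; real data is noisier, which is why evaluation on unseen test data matters.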
Primary purpose of prescriptive analytics
To help decide what best to do
Why use prescriptive analytics?
Make informed, data-driven decisions
Optimize processes
Mitigate Risks
Common techniques used in Prescriptive Analytics
Rule-based systems
Reinforcement Learning
Scenario and simulation analysis
Consists of generating a set of rules or decision logic to get the best outcome
Rule-based systems
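A rule-based system can be as small as a few hand-written conditions. The function name, the inventory scenario, and the thresholds below are hypothetical; they only show the shape of the decision logic.

```python
def restock_decision(stock, daily_sales, lead_time_days):
    """Return an action from simple, hand-written decision rules."""
    # Days until the shelf is empty; with no sales, stock never runs out.
    days_left = stock / daily_sales if daily_sales else float("inf")
    if days_left < lead_time_days:
        return "order now"
    if days_left < 2 * lead_time_days:
        return "order soon"
    return "hold"

decision = restock_decision(stock=10, daily_sales=5, lead_time_days=3)
```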
An algorithm learns to achieve a particular objective or optimize an outcome by receiving positive and negative feedback when running through a set of actions.
Reinforcement Learning
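The feedback loop can be sketched with a heavily simplified, bandit-style example. This is not a full reinforcement learning algorithm: the actions and their fixed rewards are made up, and exploration follows a fixed schedule rather than random choice so that the run is reproducible.

```python
# Hypothetical environment: each action yields positive or negative feedback.
REWARDS = {"discount": 1.0, "free_shipping": 3.0, "no_offer": -1.0}

actions = list(REWARDS)
value = {a: 0.0 for a in actions}  # learned estimate of each action's reward
count = {a: 0 for a in actions}    # how often each action was tried

for step in range(60):
    if step % 10 < 3:
        # Explore: periodically try each action in turn.
        action = actions[step % 10]
    else:
        # Exploit: pick the action with the best estimate so far.
        action = max(actions, key=value.get)
    reward = REWARDS[action]
    count[action] += 1
    # Move the estimate toward the observed feedback (running average).
    value[action] += (reward - value[action]) / count[action]

best = max(actions, key=value.get)
```

After a few rounds of feedback the loop settles on the action with the highest reward while still checking the alternatives from time to time.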
Running through a set of pre-determined scenarios or simulating multiple outcomes to help select the decision that leads to the best outcome
Scenario and simulation analysis
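A scenario analysis can be sketched by scoring each candidate decision against every pre-determined scenario. The demand scenarios, their probabilities, the price, and the unit cost are all assumed values chosen for illustration.

```python
# Hypothetical demand scenarios: (name, assumed probability, units demanded).
scenarios = [("low", 0.25, 80), ("medium", 0.5, 100), ("high", 0.25, 140)]
PRICE, UNIT_COST = 10, 6

def expected_profit(order_qty):
    """Probability-weighted profit of one decision across all scenarios."""
    total = 0.0
    for _, prob, demand in scenarios:
        sold = min(order_qty, demand)  # cannot sell more than was ordered
        total += prob * (sold * PRICE - order_qty * UNIT_COST)
    return total

# Candidate decisions: how many units to order.
options = [80, 100, 140]
results = {qty: expected_profit(qty) for qty in options}
best_qty = max(results, key=results.get)
```

Ordering 100 units wins here because it balances lost sales in the high scenario against unsold stock in the low one.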
Predicts interests based on past behavior and provides recommendations based on the predicted interests
Recommendation engine
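A toy recommendation engine can score unseen items by how similar their viewers are to the current user. The viewing history and the overlap-based scoring are illustrative assumptions; production systems use far richer signals.

```python
from collections import Counter

# Hypothetical past behavior: items each user has viewed.
history = {
    "ana": {"A", "B", "C"},
    "ben": {"A", "B", "D"},
    "cy": {"B", "D"},
}

def recommend(user, k=2):
    """Recommend up to k unseen items, weighted by overlap with other users."""
    seen = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(seen & items)  # shared interests with this other user
        for item in items - seen:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:k]

suggestions = recommend("cy")
```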