ETL Flashcards
What is Data?
is anything we can quantify whether it be figures, numbers, texts, pictures, values, dates, etc. (Researching & absorbing)
What is Information?
Is any meaningful insight we can gather from the data. Any conclusions we can make by reviewing the data (doing)
What is Knowledge?
Gathering of Information to understand what is happening in the Data (interacting)
What is Wisdom?
Using that knowledge to make future decisions (reflecting)
What is Structured Data?
Eazy to analyze and turn into information
What is Unstructured Data?
Difficult or tedious to extract the data into information
What is Semi-Structured Data?
Some structure on how the data is presented but may have tough parts to analyze or difficult to extract
What are the Data Quality Traits?
Consistent, Relevant, Complete, Accurate, Granular
Describe Consistent Data
Data doesn’t contradict or conflict with itself
Describe Relevant Data
Data is up-to-date ( Is this data that we still need )
Describe Complete Data
No missing records or fields (when using that data in comparison to other data. The data should have same fields)
Describe Accurate Data
Represents reality ( The collected data should be realistic and not skewed)
Describe Granular Data
How specific is our data ( We need to determine how specific our data is. What kind of conclusion can we draw with how specific our data is )
What is Enterprise Data?
Data (Data that is collected across the aspect of our business)
What are the different types of Enterprise Data?
Transactional Data, Analytical Data, Master Data, Reference Data, Reporting Data, Meta Data
What is transactional Data?
data that is used in on-going operations within the company.
What is analytical Data?
numerical values or metrics that provide business intelligence and help make business decisions.
What is Master Data?
Data that is persistent and define the primary business entities within the company. This could include
What is Reference Data?
Subset of master data. Any information that is usually standardized by some other party.
What is Reporting Data?
Aggregated data compiled for the purpose of analytics
What is Meta Data?
Data about the data.
What is OLTP (Online Transaction Processing) ?
A data system that is based off transactions. Is optimized to run as many concurrent transactions as possible. Focuses on very fast write and update capabilities. Highly normalized and relational. For Client Facing Applications
What is OLAP (Online Analysis Processing) ?
A data system for business facing applications. The databases are optimized for reads/querying. Is a denormalized database. Easier to store
What is ETL?
EXTRACT TRANSFORM LOAD - A procedure of copying data from one or more sources into a new location called a data target.
What is Extract? And the different types
taking data from a data source’
Full Extraction - Taking all data from a data source
Incremental Extraction - taking only new data from a data source
What is transformation? And the different types
Converting/modifying extracted data to match a desired data target.
Cleaning(Making data Consistent), DeDuplication, Format Revision, Key Restructing
Advanced: Validation, filtering, derivation
What is load? And the different types
Adding the data into the data target, such as an ODS, Staging area, Data Warehouse, or Data Mart.
- Initial Load - populating all the data warehouse tables
- Incremental Load - Adding in new incremenetal data periodically as needed.
- Full Refresh - Removing all data from one or more tables and reloading all fresh data. (kind of truncating and reloading)
What is a data source?
- Anywhere that we are gathering our data from ( could be data gathered from structured/unstructured/semistructured)
- Client facing applications
- External Data
What is a data warehouse?
Single central hub for the entire company’s dataset (All of the Enterprise Data). Everything extracted from all data sources.
What is a staging area?
is a temporary holding location for data that has just been extracted (or potentially through the ETL process)
What is an Operational Data Store (ODS)
interim area used before a Data Warehouse
Provides a snapshop of the latest data from multiple data sources
What is a Data Mart?
A custom designed database for business intelligence for a specific department / aspect of your company that you wish to draw conclusions from Dimensional Database
What is a Dimensional Cube?
A visual representation of the relationship between fact tables and dimensions
What is Data Mining?
Process of collecting the data, and modeling the data to find relationships / patterns, and making business decisions based on these models
What is Data Purging?
Process of deleting data from a data warehouse. The frequency of a data purge depend entirely on the business requirements. Data is often kept
for 5-10 years before it is purged
What is ETL Testing?
The process of validating, verifying, and qualifying data while preventing duplicate records and data loss
What is the ETL Testing Lifecycle?
Identify business requirements, Validate our data sources, Design Test cases, ETL, Summary Report, Test Closure
What are some different types of ETL tests?
Source to Target Testing, Data Transformation Testing, Data Integration Testing
How would you do a source to target Testing with sql?
Can select count(*) from one table and ensure that count equals our target and source
How would you do data transformation test with sql?
Changing null values to 0 - SELECT ISNULL(myColumn, 0 ) FROM myTable
Transforming with replace - Update , Set (Replace), Where
What is Data Migration Testing?
Process of verifying that data migrates from source to target with no loss of information/data
What is data validation?
Method for checking the accuracy and quality of your data
How would you do some data validation testing with sql?
isnull function, update, replace
Some ETL Testing Scenarios :
Validation, Null validation, Duplicate Validation, Data Cleaning
What is business Reporting?
Process of gathering data by utilizing different software and tools to extract relavent information.
Can use google data studios to create this visual representation
How would you check for duplicates?
SELECT name, COUNT(email)
FROM users
GROUP BY email
HAVING COUNT(email) > 1
You can select by what you to output with the count of what duplicates you’re looking for from the table you want and group by whatever duplicate you’re looking for with having count greater than 1