Databases Flashcards
Database
Where a collection of data is store.
Helps to power applications and day to day operations.
Allows for easier management and analysis of data.
Database Management System (DBMS)
Software used to store, retrieve, define, and manage data in a database.
OLAP
OLTP
SQL vs NoSQL
Data Warehouse
OLAP
Data Lake
Data Lake basically stores everything structured data or unstructured data, images, logs, files, etc, and acts as your single source of truth. Data is in original form with no transformation applied. At most, stored data files are time stamped and organized for easier consumption from various services. Due to expansion in cloud tech, it’s really cost effective to store large volumes of data. This is why you want to store everything in your data lake without throwing any data. You keep on adding new data. Then you ingest this data in your data warehouse and pick and choose what you need for analysis purposes. Due to the fact that all your data is now stored at one place, you no longer have data silo issues, and it also gives you easier access to control data governance.
Data Lake can store pretty much any type of data and from data sources directly. While data warehouse can store source data directly, it is not optimized to all different types of data. At most, it is good at structured and unstructured to some extent, and it performs better when data is cleaned and processed and transformed. Data lake is usually accessed by data engineers as they are the one who will be building data ingestion pipelines and applying security controls and make data available to data warehouse. Data scientists can also access data lake directly, but for very specific use cases. Data warehouse on the other side is mostly used by data warehouse developers, by developers to build semantic layer and by analysts can access relevant data mart layers. Schema for data Lake is on Read. What this means is you create schema when only reading the data giving much needed agility while data warehouse schema is on. Right. What this means is you need to define schema for the data before writing into the database. Benefit of this is it gives you very fast results. Datalake stores all of your data while data warehouse is for specific purpose. Business tracks or enable business intelligence. Data Lake provides a very low cost for storing all of your data, while data warehouse storage costs
Single-Tier Architecture
Data Source > Data Warehouse > BI Tools
Pros:
- Minimizes data stored and redundancies
Cons:
- Not the best design for multiple data streams and complex enterprise requirements
- Slow down performance of BI tools due to querying data directly from your data warehouse
Two-Tier Architecture
Data Source > Data Lake > Staging Layer > Data Warehouse Layer > BI Tools
Pros:
- More scalable
- Prevents using the data warehouse as main area for storing raw data
- Single storage area for all of our source data
- Staging layer allows for data transformation cleaning prior to storing our data inside the data warehouse layer
Three-Tier Architecture
Data Source > Data Lake > Staging Layer > Data Warehouse Layer > Data Mart > BI Tools
Datamarts basically contain domain specific information, and datasets are usually smaller, close to or less than 100 gigabyte. It can make business specific reporting possible by focusing on specific business tracks such as marketing, finance, customers, etcetera. They are also an important component of bi enablement. Because BI tools can connect to the data mart layer instead of your big data warehouse layer.
Bottom vs Middle vs Top Tier
Five Vs
Important to think about with Data Lakes
Five Vs:
- Volume: Amount of raw data
- Velocity: Speed of data processing
- Variety: Different data types
- Veracity: Level of trust in data
- Value: Potential unlocked