Ch 5.4 Data Warehouses and Data Marts Flashcards
If a manaageer wants to know the profit margin on used books, how would they find this?
use SQL or query by example
this allows managers to look up things in the database
What is QBE
making database queries that allow user to search the doc based on an example (string of text )
Why do we need data warehouses and data marts if be have QBE or SQL
this is because if we want to look at trends, we need a complicated system
what do data warehouses and data marts have that sql doesnt
1) info is organized better than sql is
2) these databases are read only!!! no need for updating or processing
3) SQL is used to access one record at a time; Data warehouss and mart is designed to access large groups of related records
Data warehouse
holds historiccal data organized by subject
-> very expensive
data mart
cheaper alternative to data warehouses,
low cost and scaled down
-> implemented in less than 90 days
how are data warehouses and marts organized?
1) organized by business dimension or subject (customer, vendor, product, price etc)
2) use online analytical processing (analyzes data from end users)
3)integrated: data collected from multiple systems and integrated around subjects
4) time variant: have HISTORICAL data
5) non volatile: users cant change or update data
6) multidimensional: store data in more than 2 dimensions “data cube”
what are components of the data warehouse environment:
1) source systems that provide data
2) data integration tech
3) diff architectures for storing data
4) diff tools and applications for users
5) metadata (Data about the data)
Source systems
of data warehouses and data marts
“organizational pain point” motivates businesses to develop BI, this BI needs data to work
PROVIDE DATA TO THE WAREHOUSE OR MART
EX: operational data, erp, website data, 3rd party data,
relational database vs multidinensional data warehouses and data marts
Relational database is like ERD tables: separate tables w PKs and FKs
Data cube: its a matrix of 3 dimensiosns
Data integration- data warehouse environment
USES ETL
tech that processes and prepares data for usage
steps of ETL
EXTRACTION: handwritted code or SQL queries
TRANSFORMATION: data integrated around a common key (student id, and all data around the student classes courses major parking)
- sometimes transformation is changing 0 and 1 to M and F
LOADED: loading data into the mart/warehouse (loading window)
STORING DATA- environment of data warehouses/marts
1) most common: data warehouse
-> one single read-only version of truth
2) independant data mart: store data for a single/few applications (ex: marketing or finance) FUNCTIONAL AREA CENTRIC
-> not very effective, inconsistent data
3) Hub and spoke: a central data warehouse that has the data plus many dependant data martsMe
METADATA- environment of data warehouses/marts
data about the data, things like data quality and the data wrehouse status
INCLUDES UPDATE SCHEDULES
DATA QUALITY- environment of data warehouses/marts
QUALITY of data in the warehouse must be high!
DATA cleansing needs to be used to ensure data meets users needs
BI governance
establishing people committees processes to mainatin the data warehouse and provide for changes in data type or processings
users
business value for users rises when data can be accessed quickly and easily for analysis/consolidation
data lake
storage of all org data regardless of its source or format
-recieve data in any format (unstrcuture/structure)
- recieive inconsistent data/ overlapping data
benefit of data lake
1) Organizations can derive value from unlimited types of data.
2) Organizations do not need to have all of the answers in advance.
3) Organizations have no limits on how they can query the data.
4) Organizations do not create silos. Instead, data lakes provide a single, unified view of data across the organization
how is data loaded into a data lake
1) Define the incoming data from a business perspective.
2) Document the context, origin, and frequency of the incoming data.
3) Classify the security level (public, internal, sensitive, restricted) of the incoming data.
4) Document the creation, usage, privacy, regulatory, and encryption business rules that apply to the incoming data.
5) Identify the owner (sponsor) of the incoming data.
6) Identify the data steward(s) charged with monitoring the health of the specific datasets
Schema on read
A data lake works on a principle called schema-on-read. This means that there is no predefined schema into which data needs to be fitted before storage. Only when the data is read during processing is it parsed and adapted into a schema as needed.
most big orgs have an EDW… WHAT IS THIS?
Enterprise data warehouse