Database Architecture Flashcards
What is ETL?
ETL stands for Extract, Transform, and Load. In ETL, data flows from the data source to staging to the data destination. ETL can help with data privacy and compliance, cleansing sensitive data before loading into the data destination.
What is ELT? How does it differ from ETL?
ELT stands for Extract, Load, and Transform. ELT lets the data destination do the transformation, eliminating the need for data staging. ELT is simpler than ETL and is for companies with minor data needs.
What is a data warehouse? Where is the data from?
A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and analysis and often contain large amounts of historical data. The data within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications.
Name common types of data warehouse architecture (4 types).
The architecture of a data warehouse is determined by the organization’s specific needs. Common architectures include:
Simple: All data warehouses share a basic design in which metadata, summary data, and raw data are stored within the central repository of the warehouse.
Simple with a staging area: Operational data must be cleaned and processed before being put in the warehouse. Many data warehouses add a staging area for data before it enters the warehouse, to simplify data preparation
Hub and spoke: Adding data marts between the central repository and end users allows an organization to customise its data warehouse to serve various lines of business. When the data is ready for use, it is moved to the appropriate data mart
Sandboxes: Sandboxes are private, secure, safe areas that allow companies to quickly and informally explore new datasets or ways of analyzing data without having to conform to or comply with the formal rules and protocol of the data warehouse
Name main types of data warehouse (3 types).
Enterprise Data Warehouse (EDW): Centralised warehouse providing decision support service across the enterprise. It offers a unified approach for organising and representing data
Operational Data Store (ODS): Only supports daily operations, so their view of historical data is very limited. Works well with current data as it is refreshed in real time
Data Mart: A data mart is a subset of the data warehouse. It specially designed for a particular line of business, such as sales, finance, sales or finance
What are the three levels of database architecture?
Physical/internal level: Lowest level in the three level architecture, describes how data is actually stored in the database, as well as compression and encryption techniques
Conceptual/logical level: Higher level than the physical level. It describes how the database appears to the users conceptually and the relationships between various data tables.
External/view level: Highest level and closest to the user. The external level only shows the relevant database content to the users in the form of views and hides the rest of the data
What is staging?
Staging is an intermediate storage area used for data processing during the extract, transform and load (ETL) process. The data staging area sits between the data source(s) and the data target(s), which are often data warehouses. Data staging areas are often transient in nature, with their contents being erased prior to running an ETL process or immediately following successful completion of an ETL process.
What are the differences between OLTP and OLAP?
OLTP stands for Online Transaction Process System. It is known for maintaining transactional level data of the organization and generally is highly normalised (3NF). Main operations are insert, update and delete.
OLAP stands for Online Analytical Process System. It is known for a lot of analysis and fulfils reporting purposes. It may not be normalised. Main operation is to extract multidimensional data for analysis.
What are the benefits of a data warehouse (name 4)?
Subject-oriented: They can analyze data about a particular subject or functional area (such as sales)
Integrated: Data warehouses create consistency among different data types from disparate sources
Nonvolatile: Once data is in a data warehouse, it’s stable and doesn’t change
Time-variant: Data warehouse analysis looks at change over time
What is a virtual data warehouse?
A virtual data warehouse is a set of separate databases, which can be queried together, so a user can effectively access all the data as if it was stored in one data warehouse.
Explain the different data models that are available in detail (3 types).
Conceptual data model: As the name itself implies, this data model depicts the high-level design of the available physical data.
Logical data model: Within the logical model, the entity names, entity relationships, attributes, primary keys, and foreign keys will show up.
Physical data model: Based on this data model, the view will give out more information and showcases how the model is implemented in the database. All the primary keys, foreign keys, table names, and column names will show up.
What are the differences between normalisation and denormalisation?
Normalisation and denormalisation are both methods used in databases.
Normalisation is the process of creating a set schema to store non-redundant and consistent data in order to reduce data redundancy and inconsistency. It optimises disk space and increase number of tables.
Denormalisation is the process of combining the data so that it can be queried speedily in order to achieve faster execution of queries through introducing redundancy. It use more disk space but decreases number of tables.
What is the difference between star schema and snowflake schema, and what are the advantages and disadvantages of each?
The star schema and snowflake schema are two ways to structure a data warehouse.
The star schema has a centralised data repository, stored in a fact table. The schema splits the fact table into a series of denormalised dimension tables. where there is only one link between fact table and dimension. The star schema’s simpler design makes it much easier to write complex queries. However, it uses a lot of disk space and isn’t the best option for data integrity.
The snowflake schema is different because it normalises the data. The snowflake schema uses less disk space and better preserves data integrity. The main disadvantage is the complexity of queries required to access data—each query must dig deep to get to the relevant data because there are multiple joins.
What are the rules of 1NF?
Each table cell should contain a single value
Each record needs to be unique
What are the rules of 2NF?
Must be in 1NF
There are no partial dependencies i.e. a non-prime attribute is functionally dependent to a candidate key