Database Design Flashcards
OLTP
Online Transaction Processing (OLTP)
operational system
- purpose is to support frequent transactions
- Enables the real-time execution of large numbers of database transactions by large numbers of people
- enables multi-user access to the same data while ensuring data integrity
- rapid data processing
OLTP systems are designed for use by frontline workers (e.g., cashiers, bank tellers, parts desk clerks) or for customer self-service applications (e.g., online banking, e-commerce, travel reservations).
Google Cloud SQL = optimized for OLTP workloads
OLAP
Online Analytical Processing (OLAP)
analytical system optimized for conducting complex data analysis / supporting decision making
Enable complex queries
Typically read-intensive (reads far outnumber writes)
Amazon Redshift = a data warehouse designed for OLAP
Database Transaction
A change, insertion, deletion or query of data in a database
In OLTP, a defining characteristic of a database transaction is its atomicity (indivisibility).
A transaction either succeeds as a whole or fails - it cannot remain in a pending state
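Atomicity can be sketched with Python's built-in `sqlite3` module (the table and account names are illustrative, not from the cards). The connection's context manager commits on success and rolls back on error, so a transfer that fails partway through leaves no half-applied changes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

# Transfer 200 from alice to bob. The credit to bob succeeds, but the debit
# from alice violates the CHECK constraint, so the whole transaction rolls back.
try:
    with conn:  # sqlite3 context manager: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} -- unchanged: the transfer failed as a whole
```

Note that bob's balance was successfully incremented inside the transaction, yet the rollback undid it: the transaction never reached a partially applied state.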
How do OLTP and OLAP data work together
OLTP systems often serve as a source of information for OLAP systems
OLTP data is usually stored in an operational database, which is pulled and cleaned to populate an OLAP data warehouse.
Without transactional data, data analysis can’t be done.
Analyses from OLAP systems inform business transactions, which can provide a basis for making improvements to the OLTP system
How are OLAP systems and OLTP Systems Different?
OLTP systems use relational databases that can accommodate many concurrent users, frequent queries and updates
OLAP systems typically use data warehouses, which are optimized for complex queries over large volumes of current and historical data
OLTP workloads involve a balance of read and write; OLAP workloads are read-intensive.
OLTP systems require frequent or concurrent backups; OLAP systems can be backed up far less frequently.
Structured Data
Structured data is data whose elements are addressable for effective analysis. It is defined by schemas where datatypes and tables are defined.
Easier to analyze, but less flexible and not very scalable
EX: excel files, POS data
Semi-Structured Data
Semi-structured data is information that has some organizational properties that can make it easier to analyze, but doesn’t follow a larger schema.
Often has an ad-hoc, self-describing nature
Examples: JSON, XML
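A small sketch of the "self-describing" property using Python's `json` module (the records are made up for illustration): each record carries its own field names, and records in the same collection can have different fields without violating any table schema.

```python
import json

# Semi-structured: each record describes its own fields, and the fields
# can vary from record to record.
raw = """
[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Grace", "tags": ["pioneer", "navy"]}
]
"""
records = json.loads(raw)
for rec in records:
    print(sorted(rec.keys()))
# ['email', 'id', 'name']
# ['id', 'name', 'tags']
```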
Unstructured Data
Unstructured data is not organized in a predefined manner and doesn’t follow a model. It is not a good fit for most relational databases.
Most data in the world is unstructured.
Examples: photos, chat logs
Traditional Databases
Traditional databases are based on a fixed schema that is static in nature
Used for storing real-time relational, structured data for OLTP Systems
Schema-on-write - requires that data be validated against a schema before being written to the database
Example: MySQL
Data Warehouses
Data warehouses are used for analyzing archived, structured data (OLAP)
Optimized for analytics, reading and aggregating data
Oftentimes read-only
Example: BigQuery
Blob Storage
BLOB = Binary Large Object
Cloud storage for unstructured data
Blob storage keeps masses of data in storage areas called data lakes
Data Lakes
Data lakes are used for storing data of all structures without a set schema. Lets you store data without knowing what you’re going to use it for in the future
Cheaper because it uses object storage
Schema-on-read
You need to catalog the data or it can become a data swamp
You can run big-data analytics on data stored in a data lake using Apache Spark.
Example: Google Cloud Storage
Object Storage
A computer storage architecture designed to handle large amounts of unstructured data
It designates data as distinct units, bundled w/ metadata and a unique ID that can be used to locate and access each unit
schema-on-write
vs
schema-on-read
In traditional databases (schema-on-write), the table's schema is imposed at load time. Data that doesn't match the schema is rejected.
Schema-on-read doesn’t require data to match any internal schema when it is loaded. The schema is created only once you are reading the data. More commonly used with data lakes
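The contrast can be sketched in Python, using `sqlite3` for schema-on-write and a list of raw JSON strings as a stand-in for files in a data lake (the table and field names are illustrative):

```python
import json
import sqlite3

# Schema-on-write: the schema is enforced at insert time.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER NOT NULL, name TEXT NOT NULL)")
try:
    db.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, None))  # violates NOT NULL
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # bad data never enters the database
print("rejected at write time:", rejected)

# Schema-on-read: raw records are stored as-is; the reader decides the schema.
lake = ['{"id": 1, "name": "Ada"}', '{"id": 2}']  # stand-in for raw files in a data lake
parsed = [json.loads(line) for line in lake]
names = [rec.get("name", "<missing>") for rec in parsed]  # schema applied at read time
print(names)  # ['Ada', '<missing>']
```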
MPP
Massively Parallel Processing
An architecture designed to handle multiple operations simultaneously across several processing units. Each processing unit works independently with its own operating system and dedicated memory.
This is a data warehouse architecture
Database Design
Database design is a collection of steps that help create, implement, and maintain a business’s data management systems.
The primary purpose of designing a database is to produce physical and logical models of designs for the proposed database system
Database Models
Database models provide high-level specs for database structure
The most popular model is the relational model, which defines rows as records and columns as attributes
Schemas
Schemas provide a blueprint for a database
They are implementations of the database model.
Schemas define the tables, fields, relationships and views that a database will have.
A schema must be respected when inserting data into a structured database
Conceptual Data Model
The conceptual data model defines WHAT the system contains.
It is the highest-level, and thus the least detailed.
Defines entities, attributes, and the relationships between them
Typically created by the business stakeholders and data architects.
Its purpose is to organize, scope and define business concepts and rules
Logical Data Model
The logical data model defines HOW entities and relationships can be mapped into tables
The purpose is to develop a detailed structure for the data element and data entity relationships
We explicitly define: primary key, foreign key, parent-child relationships
Physical Data Model
The Physical Data Model defines HOW the system will be implemented. It provides a layout of the actual database with all its components and services.
It’s based on the technology choices that are made
Relational Database Model
Data in a Relational Database is organized into two-dimensional tables with rows and columns.
Primary keys in this model are what make a record unique, and foreign keys refer to another table’s primary key
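A minimal sketch of primary and foreign keys using `sqlite3` (table names are illustrative; note SQLite only enforces foreign keys when the `foreign_keys` pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in to FK enforcement
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key: makes each record unique
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
        REFERENCES customers(customer_id)  -- foreign key: refers to another table's PK
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1)")       # ok: customer 1 exists
try:
    conn.execute("INSERT INTO orders VALUES (11, 99)")  # no customer 99 -> rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```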
Dimensional Modeling
Adaptation of the Relational Model, but for Data Warehouse design
Optimized for OLAP queries (aggregating data, not frequent updates)
Uses the “Star Schema”
Comprised of Dimensions and Facts
Dimensions (Dimensional Modeling)
A dimension table holds the descriptive attributes (dimensions) of a fact in a dimensional model
Examples:
An Order_Id in a fact table will lead to an orders dimension table with order_id, order_time, order_price, etc.
Facts (Dimensional Modeling)
The fact table is the primary table in a dimensional model. It holds the measures being analyzed, plus foreign keys that reference the primary keys of the dimension tables.
Examples:
- Orders Table: order_id, customer_id, store_id
Star Schema
Star schemas are the simplest form of the dimensional model.
They are comprised of fact and dimension tables.
Fact tables hold records of metrics which are described further by dimension tables
A star schema is denormalized
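A tiny star schema sketched with `sqlite3` (the store/sales tables are invented for illustration): the fact table holds metrics and foreign keys, the dimension table holds descriptive attributes, and a typical OLAP query joins and aggregates them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- dimension table: descriptive attributes
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
-- fact table: metrics, plus a foreign key into the dimension
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, store_id INTEGER, amount REAL);
INSERT INTO dim_store VALUES (1, 'Oslo'), (2, 'Lima');
INSERT INTO fact_sales VALUES (100, 1, 9.5), (101, 1, 20.0), (102, 2, 5.0);
""")
# Typical OLAP query: aggregate the facts, described by a dimension attribute
rows = conn.execute("""
    SELECT d.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_store d ON d.store_id = f.store_id
    GROUP BY d.city
    ORDER BY d.city
""").fetchall()
print(rows)  # [('Lima', 5.0), ('Oslo', 29.5)]
```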
Snowflake Schema
The snowflake schema is an extension of the star schema.
Star schemas extend only one level past the fact table, whereas snowflake schemas extend dimensions over multiple levels (dimensions are further normalized into sub-dimensions)
A snowflake schema is more normalized than a star schema
Normalization
Normalization is a database technique that divides tables into smaller tables and connects them via relationships.
The goal is to reduce redundancy and increase data integrity.
To normalize a table, you identify repeating groups of data and create new tables for them.
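A sketch of that step in plain Python (the order/customer data is invented): the repeating group (each customer's city, copied onto every order) is pulled into its own table, and orders keep only a reference to the customer.

```python
# Denormalized: the customer's city is repeated on every order row.
orders_denorm = [
    {"order_id": 1, "customer": "Ada",   "city": "London"},
    {"order_id": 2, "customer": "Ada",   "city": "London"},
    {"order_id": 3, "customer": "Grace", "city": "Arlington"},
]

# Normalized: the repeating group (customer -> city) moves to its own table,
# and orders reference the customer instead of repeating the city.
customers = {}  # one row per customer: name -> city
orders = []     # order rows keep only the reference
for row in orders_denorm:
    customers[row["customer"]] = row["city"]
    orders.append({"order_id": row["order_id"], "customer": row["customer"]})

print(customers)  # {'Ada': 'London', 'Grace': 'Arlington'}
print(orders)
```

Updating Ada's city now means changing one row in `customers` instead of every matching order row, which is exactly the redundancy/integrity win the cards describe.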
Pros and Cons of Normalization
Pros:
- saves space because it prevents us from repeating data
- enforces data integrity because of referential integrity (CA vs. California)
- work well for OLTP because it prioritizes safe, fast insertion of data
Cons:
- Slower queries because there are many joins within the tables
Types of Anomaly Errors In Denormalized Databases
Update Anomaly: Data inconsistency that arises when updating a database with redundancies
Insertion Anomaly - When we are unable to insert a new record because of missing attributes
Deletion Anomaly - When deletion of record(s) causes unintentional loss of data
Database View
A view is a virtual table that is not part of the physical schema
The query (not the resulting data) is what the database stores
You can query a view as you would a normal table, without having to type out the same queries over and over
Views do not take up any storage aside from storing the query statement
View are also useful for access control and for masking the complexity of queries. This is essential for highly normalized schemas
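A minimal view in `sqlite3` (the employee data is invented): the view stores only the query, and querying it works like querying a table. It also illustrates the access-control point, since the view exposes names without the salary column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
INSERT INTO employees VALUES ('Ada', 'eng', 120), ('Bob', 'eng', 100), ('Cy', 'hr', 90);
-- The view stores this query, not its results.
CREATE VIEW eng_staff AS
    SELECT name FROM employees WHERE department = 'eng';
""")
# Query the view like a normal table; salary is hidden behind it.
eng = conn.execute("SELECT name FROM eng_staff ORDER BY name").fetchall()
print(eng)  # [('Ada',), ('Bob',)]
```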
Materialized View
Materialized views store the query results rather than storing the query. They must be refreshed or you risk querying stale data.
Materialized views are very useful in data warehouses where data is updated less frequently, so you don’t have to worry about the view showing stale data. It can also help with queries that have long execution times.
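SQLite has no native materialized views, so the sketch below simulates one with a results table plus a manual refresh step (the sales data is invented). The staleness trade-off the card describes shows up directly: after new rows arrive, the stored results are wrong until refreshed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('north', 10), ('north', 5), ('south', 7)")

def refresh():
    # Re-materialize the aggregated results as a real table.
    conn.execute("DROP TABLE IF EXISTS mv_sales_by_region")
    conn.execute("""CREATE TABLE mv_sales_by_region AS
                    SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")

refresh()
conn.execute("INSERT INTO sales VALUES ('south', 3)")  # the materialized results are now stale
stale = conn.execute("SELECT total FROM mv_sales_by_region WHERE region='south'").fetchone()[0]
refresh()
fresh = conn.execute("SELECT total FROM mv_sales_by_region WHERE region='south'").fetchone()[0]
print(stale, fresh)  # 7.0 10.0
```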
Table Partitioning
Breaks a table into segments that can make it easier to manage and query your data
This can help to improve query performance and control costs by reducing the number of bytes read by a query
Partitioning is part of the physical data model, because we are distributing the same data over several physical entities
Vertical Partitioning
Vertical partitioning splits up a table vertically based on the columns
This can be done even if the table is totally normalized
Horizontal Partitioning
Partitions table based on rows
Sharding
When a horizontal partition is used to spread a table over several machines
For example, partition user data based on geographic location of users.
PRO: query can be easily directed to correct shard, meaning less scanning of the data
CON: can lead to unbalanced load, for example if distribution of users across regions is uneven
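A hypothetical sketch of that geographic sharding in plain Python (the shard names and region mapping are invented): a routing function sends each user's row to the shard for their region, so a region-scoped query scans only one shard.

```python
# Hypothetical geographic sharding: route each user's row to the shard
# for their region.
SHARDS = {"emea": [], "amer": [], "apac": []}
REGION_TO_SHARD = {"DE": "emea", "FR": "emea", "US": "amer", "BR": "amer", "JP": "apac"}

def route(user):
    # Pick the shard from the user's country and store the row there.
    SHARDS[REGION_TO_SHARD[user["country"]]].append(user)

for u in [{"id": 1, "country": "US"}, {"id": 2, "country": "JP"}, {"id": 3, "country": "DE"}]:
    route(u)

# PRO: a query for German users scans only the "emea" shard...
print([u["id"] for u in SHARDS["emea"]])  # [3]
# CON: ...but if most users live in one region, that shard carries most of the load.
print({name: len(rows) for name, rows in SHARDS.items()})
```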
Benefits of Partitioning Data
Rarely-accessed partitions can be moved to a slower medium
Optimizes indices, which increases the chance that heavily-used indices will remain accessible in memory
SQL DBMS
Relational Database Management System (RDBMS), based on the relational model of data
Queried with SQL
Beneficial when working with structured data that will benefit from a pre-defined schema
example: MySQL
NoSQL DBMS
non-relational database management system
less structured and is document-centered rather than being table-centered
Also offers greater flexibility
Types of NoSQL Storage Models
Key-value stores (Redis)
Document Databases (MongoDB)
Columnar - best suited for analyzing large datasets (Cassandra)
Graph - Used to store data best represented as a graph - InfiniteGraph
DevOps
A methodology: a set of practices and tools that automate the work of software development and IT operations
Airflow
open-source platform that creates, schedules, and monitors data workflows (ETL / ELT)
dbt
an open-source framework for running, testing and documenting SQL queries
Looker
A business intelligence platform for building and viewing analytics dashboards
S3
scalable object storage
Redshift
Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes
CI/CD Technologies (Circle CI)
Continuous Integration, Continuous Deployment
A method that applies automation to the stages of app development and deployment