Database Design Flashcards

1
Q

OLTP

A

Online Transaction Processing (OLTP)

operational system
- purpose is to support frequent transactions
- Enables the real-time execution of large numbers of database transactions by large numbers of people
- enables multi-user access to the same data while ensuring data integrity
- rapid data processing

OLTP systems are designed for use by frontline workers (e.g., cashiers, bank tellers, part desk clerks) or for customer self-service applications (e.g., online banking, e-commerce, travel reservations).

Google Cloud SQL = optimized for OLTP workloads

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

OLAP

A

Online Analytical Processing (OLAP)

operational system optimized for conducting complex data analysis / supporting decision making

Enable complex queries
Typical read-intensive (rather than writing data)

Amazon Redshift = a data warehouse designed for OLAP

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Database Transaction

A

A change, insertion, deletion or query of data in a database

In OLTP, a defining characteristic of a database transaction is its atomicity (indivisibility).
A transaction either succeeds as a whole or fails - it cannot remain in a pending state

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

How do OLTP and OLAP data work together

A

OLTP systems often serve as a source of information for OLAP systems

OLTP data is usually stored in an operational database, that is pulled and cleaned to create an OLAP data warehouse.

Without transactional data, data analysis can’t be done.

Analyses from OLAP systems inform business transactions, which can provide a basis for making improvements to the OLTP system

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How are OLAP systems and OLTP Systems Different?

A

OLTP systems use relational databases that can accommodate many concurrent users, frequent queries and updates

OLAP systems typically use data warehouses, which are optimized for complex queries involving multiple data facts from current and historical data

OLTP workloads involve a balance of read and write; OLAP workloads are read-intensive.

OLTP systems require frequent or concurrent backups; OLAP systems can be backed up far less frequently.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Structured Data

A

Structured data is data whose elements are addressable for effective analysis. It is defined by schemas where datatypes and tables are defined.

Easier to analyze, but less flexible and not very scalable

EX: excel files, POS data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Semi-Structured Data

A

Semi-structured data is information that has some organizational properties that can make it easier to analyze, but doesn’t follow a larger schema.

Often has an ad-hoc, self describing nature

Examples: JSON, XML

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Unstructured Data

A

Unstructured data is not organized in a predefined manner and doesn’t follow a model. It is not a good fit for most relational databases.

Most data in the world is unstructured.

Examples: photos, chat logs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Traditional Databases

A

Traditional databases are based on a fixed schema that is static in nature

Used for storing real-time relational, structured data for OLTP Systems

Schema-on-write - requires that data be validated against a schema before being written to the database

Example: MySQL

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Data Warehouses

A

Data warehouses are used for analyzing archived, structured data (OLAP)

Optimized for analytics, reading and aggregating data

Oftentimes read-only

Example: BigQuery

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Blob Storage

A

BLOB = Binary Large Object Storage

Cloud storage for unstructured data
Blob storage keeps masses of data in storage areas called data lakes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Data Lakes

A

Data lakes are used for storing data of all structures without a set schema. Lets you store data without knowing what you’re going to use it for in the future

Cheaper because it uses object storage

Schema-on-read

You need to catalog the data or it can become a data swamp

You can run big-data analytics on data stored in a data lake using Apache Spark.

Example: Google Cloud Storage

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Object Storage

A

A computer storage architecture designed to handle large amounts of unstructured data

It designates data as distinct units, bundled w/ metadata and a unique ID that can be used to locate and access each unit

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

schema-on-write
vs
schema-on-read

A

In traditional databases, the table’s schema is imposed during the load time. Data that doesn’t match the schema is rejected.

Schema-on-read doesn’t require data to match any internal schema when it is loaded. The schema is created only once you are reading the data. More commonly used with data lakes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

MPP

A

Massively Parallel Processing

A storage structure designed to handle multiple operations simultaneously by several processing units. Each processing unit works independently with it’s own operating system and dedicated memory.

This is a data warehouse architecture

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Object Storage

A

Object storage is a data storage architecture for handling large amounts of unstructured data. This data does not conform to, or cannot be easily organized into a traditional relational database with columns and rows

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Database Design

A

Database design is a collection of steps that help create, implement, and maintain a business’s data management systems.

The primary purpose of designing a database is to produce physical and logical models of designs for the proposed database system

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Database Models

A

Database models provide high-level specs for database structure
The most popular model is the relational model, which defines rows as records and columns as attributes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Schemas

A

Schemas provide a blueprint for a database
They are implementations of the database model.
Schemas define the tables, fields, relationships and views that a database will have.
A schema must be respected when inserting data into a structured database

20
Q

Conceptual Data Model

A

The conceptual data model defines WHAT the system contains.
It is the highest-level, and thus the least detailed.
Defines entities, attributes, and the relationships between them
Typically created by the business stakeholders and data architects.
It’s purpose is to organize, scope and define business concepts and rules

21
Q

Logical Data Model

A

The logical data model defines HOW entities and relationships can be mapped into tables

The purpose is to develop a detailed structure for the data element and data entity relationships

We explicitly define: primary key, foreign key, parent-child relationships

22
Q

Physical Data Model

A

The Physical Data Model defines HOW the system will be implemented. It provides a layout of the actual database with all its components and services.

It’s based on the technology choices that are made

23
Q

Relational Database Model

A

Data in a Relational Database is organized into two-dimensional tables with rows and columns.

Primary keys in this model are what make a record unique, and foreign keys refer to another table’s primary key

24
Q

Dimensional Modeling

A

Adaptation of the Relational Model, but for Data Warehouse design

Optimized for OLAP queries (aggregate data, not being frequently updates)

Uses the “Star Schema”

Comprised of Dimensions and Facts

25
Q

Dimensions (Dimensional Modeling)

A

A Dimension Table contains dimensions of a fact in a dimensional model

Examples:
An Order_Id in a fact table will lead to an orders dimension table with order_id, order_time, order_price, etc.

26
Q

Facts (Dimensional Modeling)

A

Fact table is a primary table in a dimensional model. It holds the primary keys referenced by dimension tables.

Examples:

  1. Orders Table: order_id, customer_id, store_id,
27
Q

Star Schema

A

Star schemas are the simplest form of the dimensional model.
They are comprised of fact and dimension tables.
Fact tables hold records of metrics which are described further by dimension tables

A star schema is denormalized

28
Q

Snowflake Schema

A

The snowflake schema is an extension of the star schema.

Start schemas are extended over one dimension, whereas snowflake schemas are extended over many dimensions

A snowflake schema is more normalized than a star schema

29
Q

Normalization

A

Normalization is a database technique that divides tables into smaller tables and connects them via relationships.

The goal is to reduce redundancy and increase data integrity.

To normalize a table, you identify repeating groups of data and create new tables for them.

30
Q

Pros and Cons of Normalization

A

Pros:
- saves space because it prevents us from repeating data
- enforces data integrity because of referential integrity (CA vs. California)
- work well for OLTP because it prioritizes safe, fast insertion of data

Cons:
- Slower queries because there are many joins within the tables

31
Q

Types of Anomaly Errors In Denormalized Databases

A

Update Anomaly: Data inconsistency that arises when updating a database with redundancies

Insertion Anomaly - When we are unable to insert a new record because of missing attributes

Deletion Anomaly - When deletion of record(s) causes unintentional loss of data

32
Q

Database View

A

A view is a virtual table that is not part of the physical schema

The query (not the resulting data) is stored in memory

You can query a view as you would a normal table, without having to type out the same queries over and over

Views do not take up any storage aside from storing the query statement

View are also useful for access control and for masking the complexity of queries. This is essential for highly normalized schemas

33
Q

Materialized View

A

Materialized views store the query results rather than storing the query. They must be refreshed or you risk querying stale data.

Materialized views are very useful in data warehouses where data is updated less frequently, so you don’t have to worry about the view showing stale data. It can also help with queries that have long execution times.

34
Q

Table Partitioning

A

Breaks a table into segments that can make it easier to manage and query your data

This can help to improve query performance and control costs by reducing the number of bytes read by a query

Partitioning is part of the physical data model, because we are distributing the same data over several physical entities

35
Q

Vertical Partitioning

A

Vertical partitioning splits up a table vertically based on the columns

This can be done even if the table is totally normalized

36
Q

Horizontal Partitioning

A

Partitions table based on rows

37
Q

Sharding

A

When a horizontal partition is used to spread a table over several machines

For example, partition user data based on geographic location of users.
PRO: query can be easily directed to correct shard, meaning less scanning of the data
CON: can lead to unbalanced load, for example if distribution of users across regions is uneven

38
Q

Benefits of Partitioning Data

A

Rarely-accessed partitions can be moved to a slower medium

Optimizes indices, which increases the chance that heavily-used indices will remain accessible in memory

39
Q

SQL DBMS

A

Relational Data Base Management System, based on the relational model of data

Queried with SQL

Beneficial when working with structured data that will benefit from a pre-defined schema

40
Q

NoSQL DBMS

A

non-relational database management system
less structured and is document-centered rather than being table-centered

Also offers greater flexibility

41
Q

DevOps

A

methodology used as a set of practices and tools to automate the work of software developers

42
Q

Airflow

A

open-source platform that creates, schedules, and monitors data workflows (ETL / ELT)

43
Q

dbt

A

an open-source framework for running, testing and documenting SQL queries

44
Q

Looker

A

dashboards for viewing and visualizing analytics dashboards

45
Q

S3

A

scalable object storage

46
Q

Redshift

A

Redshift - AmazonRedshiftuses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes,

47
Q

CI/CD Technologies (Circle CI)

A

Continuous Integration, Continuous Deployment
Method that involves automation to stages of app development and deployment