Data Modeling Flashcards

1
Q

What is Data Modeling?

A data model is an abstract model that organizes elements of data and standardizes how they relate to one another.

The process of creating data models for an information system

Data modeling can easily translate to database modeling, as this is the essential end state.

  1. Conceptual Data Modeling with Entity Mapping.
  2. Logical Data Modeling
  3. Physical Data Modeling (DDL: Data Definition Language)
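
For illustration, the physical modeling step might produce DDL such as this sketch (the customers table and its columns are hypothetical):

  -- Physical data model expressed as DDL
  CREATE TABLE customers (
      customer_id INT PRIMARY KEY,
      name        VARCHAR(100) NOT NULL,
      email       VARCHAR(255) UNIQUE,
      city        VARCHAR(100)
  );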
A
2
Q

Key points about Data Modeling

  • Data Organization: The organization of the data for your applications is extremely important and makes everyone’s life easier.
  • Use cases: Having a well thought out and organized data model is critical to how that data can later be used. Queries that could have been straightforward and simple might become complicated queries if data modeling isn’t well thought out.
  • Starting early: Thinking and planning ahead will help you be successful. This is not something you want to leave until the last minute.
  • Iterative Process: Data modeling is not a fixed process. It is iterative as new requirements and data are introduced. Having flexibility will help as new information becomes available.
A
3
Q

Intro to Relational Databases

Relational Model: organizes data into one or more tables (or “relations”) of columns and rows, with a unique key identifying each row. Generally, each table represents one “entity type” (such as customer or product).

A relational database is a digital database based on the relational model of data… a software system used to maintain relational databases is a relational database management system (RDBMS).

SQL (Structured Query Language) is the language used across almost all relational database systems for querying and maintaining the database.

A
4
Q

Relational DB - Basics

  • Database/Schema = Collection of tables
  • Tables/Relation = A group of rows sharing the same labeled elements (e.g. Customers)
  • Columns/Attribute = Labeled element (e.g. Name, email, city)
  • Rows/Tuple = A single item (e.g. Amanda, jdoe@xyz.com, NYC)
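
A minimal sketch of how these terms map to SQL (table and column names mirror the examples above):

  -- Table/Relation with three Columns/Attributes
  CREATE TABLE Customers (
      name  VARCHAR(100),
      email VARCHAR(255),
      city  VARCHAR(100)
  );

  -- One Row/Tuple
  INSERT INTO Customers (name, email, city)
  VALUES ('Amanda', 'jdoe@xyz.com', 'NYC');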
A
5
Q

ACID Transactions

Properties of database transactions intended to guarantee validity even in the event of errors or power failures.

  • Atomicity: The whole transaction is processed or nothing is processed. A commonly cited example of an atomic transaction is a money transfer between two bank accounts. The transfer is made up of two operations: first you withdraw money from one account, and second you deposit it into the other account. An atomic transaction, i.e., one in which either all operations occur or nothing occurs, keeps the database in a consistent state. This ensures that if either operation (withdrawing money from the first account or depositing it into the second) fails, the money is neither lost nor created. See the sketch after this list.
  • Consistency: Only transactions that abide by constraints and rules are written into the database; otherwise the database keeps its previous state. The data should be correct across all rows and tables.
  • Isolation: Transactions are processed independently and securely; order does not matter. A low level of isolation allows many users to access the data simultaneously, but it also increases the possibility of concurrency effects (e.g., dirty reads or lost updates). A high level of isolation reduces the chance of concurrency effects, but it uses more system resources and can cause transactions to block each other.
  • Durability: Completed transactions are saved to the database even in cases of system failure. A commonly cited example is tracking flight seat bookings: once the system records a confirmed seat booking, the seat remains booked even if a system failure occurs.
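
A minimal sketch of the bank-transfer example above (table and column names are assumptions; exact transaction syntax varies by RDBMS):

  -- Atomicity: both operations succeed together or not at all
  BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
  COMMIT;
  -- on any error, ROLLBACK instead and the database keeps its previous state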
A
6
Q

When Not to Use a Relational Database

  • Have large amounts of data: Relational databases are not distributed, so they can only scale vertically by adding more storage to the machine itself. You are limited by how much you can scale and how much data you can store on one machine; you cannot add more machines the way you can with NoSQL databases.
  • Need to be able to store different data type formats: Relational databases are not designed to handle unstructured data.
  • Need high throughput – fast reads: While ACID transactions bring benefits, they also slow down the process of reading and writing data. If you need very fast reads and writes, using a relational database may not suit your needs.
  • Need a flexible schema: A flexible schema allows columns to be added that do not have to be used by every row, saving disk space.
  • Need high availability: Because relational databases are not distributed (and even when they are, they use a coordinator/worker architecture), they have a single point of failure. When that database goes down, failing over to a backup system takes time.
  • Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and space for data.
A
7
Q

Basics of Apache Cassandra

  • Keyspace: Collection of Tables
  • Table: A group of partitions
  • Partition: Fundamental unit of access. Collection of row(s). How data is distributed.
  • Primary Key: made up of a partition key and clustering columns
  • Columns: Clustering columns and data columns. A labeled element
  • Rows: A single item
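
A minimal CQL sketch tying these terms together (keyspace, table, and column names are hypothetical):

  CREATE KEYSPACE IF NOT EXISTS music_library
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

  -- Primary Key = partition key (user_id) + clustering column (song_order)
  CREATE TABLE music_library.songs_by_user (
      user_id    INT,
      song_order INT,
      song_title TEXT,
      PRIMARY KEY (user_id, song_order)
  );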
A
8
Q

Importance of Relational Model

  • Standardization of data model
  • Flexibility in adding and altering tables
  • Data Integrity
  • Standard Query Language (SQL)
  • Simplicity
  • Intuitive Organization
A
9
Q

Online Transactional Processing (OLTP):

is involved in the operation of a particular system. OLTP is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing and maintaining data integrity in multi-access environments, with effectiveness measured by the number of transactions per second. An OLTP database contains detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF). Queries typically access individual records, such as updating your email address in a company database.
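
The email-update example might look like this (a sketch; table and column names are assumptions):

  -- Short OLTP transaction touching a single record
  UPDATE employees
  SET email = 'new.address@example.com'
  WHERE employee_id = 42;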

A
10
Q

Online Analytical Processing (OLAP):

deals with historical or archival data. OLAP is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used in data mining. An OLAP database contains aggregated, historical data stored in multi-dimensional schemas (usually the star schema). Queries sometimes need to access large amounts of data in management records, such as finding your company's profit for the last year.
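
The profit question might translate to an aggregation like this (a sketch; table and column names are hypothetical):

  -- Complex OLAP query scanning a large amount of historical data
  SELECT SUM(revenue - cost) AS profit
  FROM sales
  WHERE sale_year = 2023;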

A
11
Q

Objectives of Normal Form

  1. To free the database from unwanted insertion, update, and deletion dependencies.
  2. To reduce the need for refactoring the database as new types of data are introduced.
  3. To make the relational model more informative to users.
  4. To make the database neutral to the query statistics.
A
12
Q

Normal Form

Normalization is a step-by-step process:

  • First Normal Form (1NF)
  • Second Normal Form (2NF)
  • Third Normal Form (3NF)

How to reach First Normal Form (1NF):

  • Atomic values: each cell contains unique and single values
  • Be able to add data without altering tables
  • Separate different relations into different tables
  • Keep relationships between tables together with foreign keys

Second Normal Form (2NF):

  • Have reached 1NF
  • All non-key columns must depend on the entire Primary Key (no partial dependencies on part of a composite key)

Third Normal Form (3NF):

  • Must be in 2nd Normal Form
  • No transitive dependencies
  • Remember, with transitive dependencies, to get from A -> C you want to avoid going through B: if A determines B and B determines C, then C depends on A only transitively, through B. See the sketch after this list.
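
A sketch of removing a transitive dependency (hypothetical tables): if employee_id determines department_id, and department_id determines department_name, 3NF splits the department name into its own table:

  -- After 3NF: department_name no longer depends transitively on employee_id
  CREATE TABLE departments (
      department_id   INT PRIMARY KEY,
      department_name VARCHAR(100)
  );

  CREATE TABLE employees (
      employee_id   INT PRIMARY KEY,
      name          VARCHAR(100),
      department_id INT REFERENCES departments (department_id)
  );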
A
13
Q

Denormalization

The process of trying to improve the read performance of a database at the expense of losing some write performance by adding redundant copies of data.

JOINs on the database allow for outstanding flexibility but can be extremely slow. If you are dealing with heavy reads on your database, you may want to think about denormalizing your tables. You get your data into normalized form, and then you proceed with denormalization; so denormalization comes after normalization.
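
A sketch of the trade-off (hypothetical tables), where the customer's name is copied into the orders table so a common read avoids a JOIN:

  -- Normalized: the read requires a JOIN
  SELECT o.order_id, c.name
  FROM orders o
  JOIN customers c ON c.customer_id = o.customer_id;

  -- Denormalized: customer_name is redundantly stored in orders, so the read
  -- skips the JOIN, but every change to a customer's name must touch both tables
  SELECT order_id, customer_name
  FROM orders;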

Logical Design Change

  1. The designer is in charge of keeping data consistent
  2. Reads will be faster (select)
  3. Writes will be slower (insert, update, delete)
A
14
Q

Normalization vs Denormalization

Normalization is about trying to increase data integrity by reducing the number of copies of the data. Data that needs to be added or updated will be done in as few places as possible.

  • The process of structuring a relational database in accordance with a series of normal forms in order to reduce data redundancy and increase data integrity.

Denormalization is trying to increase performance by reducing the number of joins between tables (as joins can be slow). Data integrity will take a bit of a potential hit, as there will be more copies of the data (to reduce JOINS).

A
15
Q

Fact and Dimension Tables

  • Work together to create an organized data model
  • While fact and dimension tables are not created differently in the DDL, the distinction is conceptual and extremely important for organization.

A Fact table consists of the measurements, metrics, or facts of a business process.

A Dimension table is a structure that categorizes facts and measures in order to enable users to answer business questions. Common dimensions include people, products, places, and time.

In the example below, the unique primary key of each Dimension table is included in the Fact table as a foreign key.

In this example, it helps to think about the Dimension tables providing the following information:

  • Where was the product bought? (Dim_Store table)
  • When was the product bought? (Dim_Date table)
  • What product was bought? (Dim_Product table)

The Fact table provides the metric of the business process (here, Sales):

  • How many units of the product were bought? (Fact_Sales table)
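
A minimal DDL sketch of this example (table names follow the Dim_/Fact_ names above; column details are assumptions):

  -- Dimension tables hold descriptive context
  CREATE TABLE Dim_Store   (store_key   INT PRIMARY KEY, store_name   VARCHAR(100));
  CREATE TABLE Dim_Date    (date_key    INT PRIMARY KEY, sale_date    DATE);
  CREATE TABLE Dim_Product (product_key INT PRIMARY KEY, product_name VARCHAR(100));

  -- The Fact table holds the metric plus a foreign key to each dimension
  CREATE TABLE Fact_Sales (
      store_key   INT REFERENCES Dim_Store (store_key),
      date_key    INT REFERENCES Dim_Date (date_key),
      product_key INT REFERENCES Dim_Product (product_key),
      units_sold  INT
  );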
A
16
Q

Star Schema

Star Schema is the simplest style of data mart schema. The star schema consists of one or more fact tables referencing any number of dimension tables.

  • Gets its name from the physical model resembling a star shape
  • A fact table is at its center
  • Dimension tables surround the fact table, representing the star's points.

Benefits

  • Denormalized
  • Simplifies queries
  • Fast aggregations
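
For example, a fast aggregation over a star schema like the one sketched in the previous card (hypothetical names) needs only one join hop from the central fact table:

  -- Total units sold per store
  SELECT s.store_name, SUM(f.units_sold) AS total_units
  FROM Fact_Sales f
  JOIN Dim_Store s ON s.store_key = f.store_key
  GROUP BY s.store_name;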

Drawbacks

  • Issues that come with denormalization
  • Reduced data integrity
  • Decreased query flexibility
  • Many-to-many relationships are hard to support
A
17
Q

Snowflake Schema

Logical arrangement of tables in a multidimensional database represented by centralized fact tables which are connected to multiple dimensions.

A complex snowflake shape emerges when the dimensions of a snowflake schema are elaborated: dimensions have multiple levels of relationships, and child tables have multiple parents.
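
A minimal sketch of how this differs from a star schema (hypothetical names): the product dimension is itself normalized into a further level:

  -- In a snowflake schema, dimensions are normalized into sub-dimensions
  CREATE TABLE Dim_Category (
      category_key  INT PRIMARY KEY,
      category_name VARCHAR(100)
  );

  CREATE TABLE Dim_Product (
      product_key  INT PRIMARY KEY,
      product_name VARCHAR(100),
      category_key INT REFERENCES Dim_Category (category_key)  -- the extra level
  );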

A
18
Q

Eventual Consistency

A consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.

A
19
Q

CAP Theorem

A theorem in computer science that states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees of consistency, availability, and partition tolerance.

  • Consistency: Every read from the database gets the latest (and correct) piece of data or an error
  • Availability: Every request is received and a response is given – without a guarantee that the data is the latest update
  • Partition Tolerance: The system continues to work regardless of losing network connectivity between nodes

Is Eventual Consistency the opposite of what is promised by SQL database per the ACID principle?
Much has been written about how Consistency is interpreted in the ACID principle and the CAP theorem. Consistency in the ACID principle refers to the requirement that only transactions that abide by constraints and database rules are written into the database; otherwise the database keeps its previous state. In other words, the data should be correct across all rows and tables. However, consistency in the CAP theorem refers to every read from the database getting the latest piece of data or an error.

A
20
Q

Data Modeling in Apache Cassandra

Denormalization of tables in Apache Cassandra is absolutely critical. The biggest takeaway when doing data modeling in Apache Cassandra is to think about your queries first. There are no JOINs in Apache Cassandra.

  • Denormalization is not just okay – it’s a must
  • Denormalization must be done for fast reads
  • Apache Cassandra has been optimized for fast writes
  • ALWAYS think Queries first
  • One table per query is a great strategy
  • Apache Cassandra does not allow for JOINs between tables
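
A minimal CQL sketch of the one-table-per-query strategy (hypothetical tables in an existing keyspace): the same song data is duplicated, with each copy keyed for the query it serves:

  -- Query 1: fetch songs by artist -> partition on artist
  CREATE TABLE songs_by_artist (
      artist TEXT,
      title  TEXT,
      year   INT,
      PRIMARY KEY (artist, title)
  );

  -- Query 2: fetch songs by year -> a second, denormalized table partitioned on year
  CREATE TABLE songs_by_year (
      year   INT,
      title  TEXT,
      artist TEXT,
      PRIMARY KEY (year, title)
  );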
A