Data Modeling Flashcards
What is Data Modeling?
A data model is an abstract model that organizes elements of data and standardizes how they relate to one another.
Data modeling is the process of creating data models for an information system.
Data modeling can easily translate to database modeling, as this is the essential end state.
- Conceptual Data Modeling with Entity Mapping
- Logical Data Modeling
- Physical Data Modeling (DDL: Data Definition Language)
Key points about Data Modeling
- Data Organization: The organization of the data for your applications is extremely important and makes everyone’s life easier.
- Use cases: Having a well-thought-out and organized data model is critical to how that data can later be used. Queries that could have been straightforward and simple might become complicated if data modeling isn't well thought out.
- Starting early: Thinking and planning ahead will help you be successful. This is not something you want to leave until the last minute.
- Iterative Process: Data modeling is not a fixed process. It is iterative as new requirements and data are introduced. Having flexibility will help as new information becomes available.
Intro to Relational Databases
Relational Model: organizes data into one or more tables (or “relations”) of columns and rows, with a unique key identifying each row. Generally, each table represents one “entity type” (such as customer or product).
A relational database is a digital database based on the relational model of data; a software system used to maintain relational databases is a relational database management system (RDBMS).
SQL (Structured Query Language) is the language used across almost all relational database systems for querying and maintaining the database.
Relational DB - Basics
- Database/Schema = Collection of tables
- Tables/Relation = A group of rows sharing the same labeled elements (e.g. Customers)
- Columns/Attribute = Labeled element (e.g. Name, email, city)
- Rows/Tuple = A single item (e.g. Amanda, jdoe@xyz.com, NYC)
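For example, a minimal sketch of these basics in SQL; the customers table and its columns are hypothetical, echoing the examples above:

```sql
-- The table is the relation, each column is an attribute,
-- and each row inserted is a tuple.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,      -- unique key identifying each row
    name        VARCHAR(100),
    email       VARCHAR(100),
    city        VARCHAR(50)
);

INSERT INTO customers (customer_id, name, email, city)
VALUES (1, 'Amanda', 'jdoe@xyz.com', 'NYC');
```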
ACID Transactions
Properties of database transactions intended to guarantee validity even in the event of errors or power failures.
- Atomicity: The whole transaction is processed or nothing is processed. A commonly cited example of an atomic transaction is a money transfer between two bank accounts. Transferring money from one account to the other is made up of two operations: first, you withdraw money from one account, and second, you deposit the withdrawn money into the second account. An atomic transaction, i.e., one where either all operations occur or nothing occurs, keeps the database in a consistent state. This ensures that if either of those two operations (withdrawing money from the 1st account or depositing the money into the 2nd account) fails, the money is neither lost nor created.
- Consistency: Only transactions that abide by constraints and rules are written into the database, otherwise the database keeps the previous state. The data should be correct across all rows and tables.
- Isolation: Transactions are processed independently and securely; order does not matter. A low level of isolation enables many users to access the data simultaneously, however this also increases the possibility of concurrency effects (e.g., dirty reads or lost updates). On the other hand, a high level of isolation reduces the chances of these concurrency effects, but also uses more system resources and increases the risk of transactions blocking each other.
- Durability: Completed transactions are saved to database even in cases of system failure. A commonly cited example includes tracking flight seat bookings. So once the flight booking records a confirmed seat booking, the seat remains booked even if a system failure occurs.
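A minimal sketch of the bank-transfer example in standard SQL; the accounts table and its columns are hypothetical:

```sql
-- Both UPDATEs commit together or not at all (atomicity).
-- A constraint like CHECK (balance >= 0) on the table would enforce
-- consistency by rejecting any transaction that violates it.
BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;  -- withdraw
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;  -- deposit

COMMIT;  -- once committed, the transfer survives a system failure (durability)
```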
When Not to Use a Relational Database
- Have large amounts of data: Relational databases are not distributed, so they can only scale vertically by adding more storage to the machine itself. You are limited by how much you can scale and how much data you can store on one machine. You cannot add more machines the way you can with NoSQL databases.
- Need to be able to store different data type formats: Relational databases are not designed to handle unstructured data.
- Need high throughput – fast reads: While ACID transactions bring benefits, they also slow down the process of reading and writing data. If you need very fast reads and writes, using a relational database may not suit your needs.
- Need a flexible schema: Flexible schema can allow for columns to be added that do not have to be used by every row, saving disk space.
- Need high availability: Because relational databases are not distributed (and even when they are, they use a coordinator/worker architecture), they have a single point of failure. When that database goes down, a failover to a backup system occurs, and that takes time.
- Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and space for data.
Basics of Apache Cassandra
- Keyspace: Collection of Tables
- Table: A group of partitions
- Partition: Fundamental unit of access. Collection of row(s). How data is distributed.
- Primary Key: made up of a partition key and clustering columns
- Columns: Clustering columns and data columns; a labeled element
- Rows: A single item
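A minimal sketch in CQL (Cassandra Query Language); the keyspace, table, and column names are hypothetical:

```sql
-- song_library is the keyspace. In the PRIMARY KEY, artist_name is
-- the partition key (it determines how data is distributed across
-- nodes) and album_name is a clustering column (it orders rows
-- within each partition).
CREATE KEYSPACE IF NOT EXISTS song_library
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS song_library.songs (
    artist_name TEXT,
    album_name  TEXT,
    year        INT,
    PRIMARY KEY (artist_name, album_name)
);
```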
Importance of Relational Model
- Standardization of data model
- Flexibility in adding and altering tables
- Data Integrity
- Standard Query Language (SQL)
- Simplicity
- Intuitive Organization
Online Transactional Processing (OLTP):
is involved in the day-to-day operation of a particular system. OLTP is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing and maintaining data integrity in multi-access environments; effectiveness is measured in transactions per second. An OLTP database contains detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF). Queries typically access individual records, such as updating your email in a company database.
Online Analytical Processing (OLAP):
deals with historical or archival data. OLAP is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used in data mining. An OLAP database contains aggregated, historical data stored in multi-dimensional schemas (usually a star schema). Queries sometimes need to access large amounts of data to answer management questions, such as what the company's profit was last year.
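A sketch contrasting the two workloads; the employees and sales tables and their columns are hypothetical:

```sql
-- OLTP: a short transaction touching an individual record.
UPDATE employees
SET email = 'new.email@company.com'
WHERE employee_id = 42;

-- OLAP: a complex aggregation over large amounts of historical data.
SELECT EXTRACT(YEAR FROM order_date) AS year,
       SUM(revenue - cost)           AS profit
FROM sales
GROUP BY EXTRACT(YEAR FROM order_date);
```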
Objectives of Normal Form
- To free the database from unwanted insertions, updates, and deletion dependencies.
- To reduce the need for refactoring the database as new types of data are introduced.
- To make the relational model more informative to users.
- To make the database neutral to the query statistics.
Normal Form
Normalization is a step-by-step process:
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
How to reach First Normal Form (1NF):
- Atomic values: each cell contains a single value
- Be able to add data without altering tables
- Separate different relations into different tables
- Keep relationships between tables together with foreign keys
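For example, a sketch of moving a repeating group into its own table to reach 1NF; the schema is hypothetical and assumes the customers table sketched earlier:

```sql
-- Violates 1NF: one cell holds several values.
-- customer_id | phone_numbers
-- 1           | '555-1234, 555-5678'

-- 1NF fix: one atomic value per cell, with the one-to-many relation
-- split into its own table and linked back by a foreign key.
CREATE TABLE customer_phones (
    customer_id  INT REFERENCES customers (customer_id),
    phone_number VARCHAR(20)
);
```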
Second Normal Form (2NF):
- Have reached 1NF
- All columns in the table must depend on the entire primary key; no column may depend on only part of a composite key (a partial dependency), as in the sketch below
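A sketch of removing a partial dependency; the schema is hypothetical:

```sql
-- Violates 2NF: with PRIMARY KEY (order_id, product_id), a
-- product_name column would depend only on product_id (a partial
-- dependency). Fix: move it into its own products table.
CREATE TABLE products (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100)
);

CREATE TABLE order_items (
    order_id   INT,
    product_id INT REFERENCES products (product_id),
    quantity   INT,
    PRIMARY KEY (order_id, product_id)
);
```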
Third Normal Form (3NF):
- Must be in 2nd Normal Form
- No transitive dependencies
- Remember: a transitive dependency means that to get from A to C you go through B (A -> B -> C). These are the dependencies you want to eliminate, as in the sketch below.
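A sketch of removing a transitive dependency; the schema is hypothetical:

```sql
-- Violates 3NF: employee_id -> department_id -> department_name,
-- so department_name depends on the key only through department_id.
-- Fix: split the transitive dependency into two tables.
CREATE TABLE departments (
    department_id   INT PRIMARY KEY,
    department_name VARCHAR(100)
);

CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    name          VARCHAR(100),
    department_id INT REFERENCES departments (department_id)
);
```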
Denormalization
The process of trying to improve the read performance of a database at the expense of losing some write performance by adding redundant copies of data.
JOINs on the database allow for outstanding flexibility but can be slow, especially at scale. If you are dealing with heavy reads on your database, you may want to think about denormalizing your tables. You get your data into normalized form first, and then you proceed with denormalization; denormalization comes after normalization.
Logical Design Change
- The designer is in charge of keeping data consistent
- Reads will be faster (select)
- Writes will be slower (insert, update, delete)
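A sketch of the trade-off, assuming hypothetical orders and customers tables:

```sql
-- Normalized read: every query pays for a JOIN.
SELECT o.order_id, c.city
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;

-- Denormalized alternative: copy the city onto orders so reads need
-- no JOIN. The cost: a customer's city now lives in two places, and
-- the designer must keep both copies consistent on every update.
ALTER TABLE orders ADD COLUMN customer_city VARCHAR(50);

SELECT order_id, customer_city FROM orders;
```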
Normalization vs Denormalization
Normalization is about trying to increase data integrity by reducing the number of copies of the data. Data that needs to be added or updated will be done in as few places as possible.
- The process of structuring a relational database in accordance with a series of normal forms in order to reduce data redundancy and increase data integrity.
Denormalization is trying to increase performance by reducing the number of joins between tables (as joins can be slow). Data integrity will take a bit of a potential hit, as there will be more copies of the data (to reduce JOINS).
Fact and Dimension Tables
- Work together to create an organized data model
- While fact and dimension are not created differently in the DDL, they are conceptual and extremely important for organization.
Fact table consists of the measurements, metrics or facts of a business process.
Dimension table is a structure that categorizes facts and measures in order to enable users to answer business questions. Dimensions are people, products, place and time.
In the star schema sketched below, the unique primary key for each Dimension table is included in the Fact table.
In this example, it helps to think about the Dimension tables providing the following information:
- Where was the product bought? (Dim_Store table)
- When was the product bought? (Dim_Date table)
- What product was bought? (Dim_Product table)
- The Fact table provides the metric of the business process (here Sales).
- How many units of the product were bought? (Fact_Sales table)
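A minimal star schema sketch in SQL using the table names from this example; the column definitions are assumptions:

```sql
CREATE TABLE Dim_Store   (store_id   INT PRIMARY KEY, city VARCHAR(50));
CREATE TABLE Dim_Date    (date_id    INT PRIMARY KEY, sale_date DATE);
CREATE TABLE Dim_Product (product_id INT PRIMARY KEY, product_name VARCHAR(100));

-- The fact table holds the metric (units_sold) plus a foreign key
-- to each dimension table's primary key.
CREATE TABLE Fact_Sales (
    store_id   INT REFERENCES Dim_Store (store_id),
    date_id    INT REFERENCES Dim_Date (date_id),
    product_id INT REFERENCES Dim_Product (product_id),
    units_sold INT
);

-- Answering a business question across the schema:
-- how many units were bought in each city?
SELECT s.city, SUM(f.units_sold) AS units
FROM Fact_Sales f
JOIN Dim_Store s ON s.store_id = f.store_id
GROUP BY s.city;
```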