Storage and Databases Flashcards
Relational Databases - Basics
Disk types:
- SSD is faster and more expensive than HDD
- SSD is used for frequently accessed/modified data, and HDD for rarely accessed/modified data
Relational database:
- A structured database where data is stored in tabular format
- Not all relational databases support SQL
Non-relational/NoSQL database: a database that does not impose a tabular structure on its data
Relational Databases - ACID
- Atomicity: the operations of a transaction either all succeed or all fail
- Consistency:
  - The transaction cannot bring the database to an invalid state
  - After the transaction is committed or rolled back, the rules for each record will still apply, and all future transactions will see the effect of the transaction
- Isolation: the execution of multiple transactions concurrently will have the same effect as if they had been executed sequentially
- Durability: any committed transaction is written to non-volatile storage. It won't be undone by a crash or power loss
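Atomicity can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the `transfer` helper, and the account names are assumptions made up for this example. The credit runs first and the debit second, so a failing debit shows the earlier credit being rolled back with it.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src cannot cover the amount.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account returns False and leaves both balances unchanged, because the credit that had already been applied is undone together with the failed debit.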
Relational Databases - Indexes
- Allow certain queries to run faster
- Not exclusive to relational databases: many NoSQL databases (for example MongoDB) also support indexes
- They greatly speed up read queries at the cost of slightly slower writes, because each write must also update the relevant indexes
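The effect of an index on a read query can be observed with SQLite's EXPLAIN QUERY PLAN statement. This is a small sketch: the `users` table, the `idx_users_email` index, and the `query_plan` helper are hypothetical names chosen for illustration, and the exact plan wording varies between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's human-readable plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); the detail column is the text.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

lookup = "SELECT name FROM users WHERE email = ?"

# Without an index, the lookup has to scan the whole table.
plan_before = query_plan(conn, lookup, ("user42@example.com",))

# With an index on email, the same lookup becomes an index search.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, ("user42@example.com",))

print(plan_before)
print(plan_after)
```

The first plan reports a full scan of `users`, while the second mentions `idx_users_email`. The write-side cost is that every later INSERT or UPDATE touching `email` must also maintain that index.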
Relational Databases - Consistency types and tools
- Strong consistency: refers to the consistency of ACID
- Eventual consistency:
  - Reads might return a view of the system that is stale
  - Will guarantee that the state of the database eventually reflects writes within a time period (could be seconds or minutes)
- Tools: Postgres, MySQL, MSSQL
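Stale reads under eventual consistency can be sketched with a toy primary and replica. The `Primary` and `Replica` classes and the explicit `sync` step are assumptions for illustration only; real databases replicate in the background through their own protocols.

```python
class Primary:
    """Accepts writes and records them in an ordered log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Serves reads from its own copy, which may lag behind the primary."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries already applied

    def read(self, key):
        return self.data.get(key)

    def sync(self):
        """Apply every write the replica has not seen yet, in log order."""
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)


primary = Primary()
replica = Replica(primary)
primary.write("theme", "dark")
stale = replica.read("theme")   # replica has not caught up yet
replica.sync()
fresh = replica.read("theme")   # replica now reflects the write
```

Between the write and the sync, the replica returns a stale view (here, no value at all). After the sync it converges to the primary's state.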
Key-Value Stores
- A flexible NoSQL database that’s often used for caching and dynamic configuration.
- Tools: Etcd, Redis, ZooKeeper
Blob (Binary Large Object) Storage
- Stores and retrieves unstructured data (blobs) based on the name of the blob
- They might be slower than KV stores but values can be MB or GB large
- Used to store large binaries, database snapshots, images, or other static assets a website might have
- Usually only large cloud providers have the infrastructure to run blob stores, so most companies use a hosted service
- Tools: Google Cloud Storage, Amazon S3
Time Series database (TSDB) and Spatial database
Time Series database (TSDB):
- Optimized for storing and analyzing time-indexed data
- Time-indexed data: data points that each occur at a given moment in time
- Tools: InfluxDB, Prometheus
Spatial database:
- Optimized for storing and querying spatial data, like locations on a map
- Relies on spatial indexes, such as quadtrees, to quickly perform queries like finding all locations in the vicinity of a region
Graph Database 1
- Stores data following the graph data model
- Data entries can have explicitly defined relationships
- Performs complex and fast queries on deeply connected data
Graph Database 2
- Often preferred over relational databases when dealing with data points that naturally form a graph and have multiple levels of relationships
- Cypher:
  - It's a graph query language developed for the Neo4j graph database
  - It's widely used in graph databases, but it's not the only graph query language
- Tools: Neo4j
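The kind of multi-level relationship query that graph databases are built for can be sketched in plain Python as a breadth-first traversal. The `FRIENDS` adjacency map and the `within_hops` function are hypothetical, and a real graph database would use its own storage and query engine rather than an in-memory dict.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return every node reachable from `start` in at most `max_hops` edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    seen.discard(start)
    return seen


# Hypothetical friendship graph, stored as an adjacency map.
FRIENDS = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}

friends_of_friends = within_hops(FRIENDS, "ann", 2)
```

In a relational schema, each extra level of relationship typically means another self-join, which is one reason graph-shaped data is often a better fit for a graph database.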
Quadtree 1
- A tree data structure that’s most commonly used to index two-dimensional spatial data
- Each node has either zero (a leaf node) or four child nodes
- Nodes:
  - Contain some form of spatial data, like locations on a map, with a specified maximum capacity
  - When nodes aren't at capacity they remain as leaf nodes
  - Once they reach capacity, they are given four children nodes, and their data entries are split between those children
Quadtree 2
- Well suited to querying spatial data:
  - It can be represented as a grid filled with rectangles that are recursively divided into four sub-rectangles
  - Each node is represented by a rectangle, which represents a spatial region
- Finding a location in a perfect quadtree runs in O(log4(x)) time (logarithm base 4), where x is the total number of locations
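The splitting behavior and region queries described on the quadtree cards can be sketched as follows. This is a minimal illustration, not a production index: the `QuadTree` class, its square half-open regions, and the default capacity are assumptions, and it splits a leaf when an insert arrives at a full leaf. It does not guard against more than `capacity` identical points, which would recurse without end.

```python
class QuadTree:
    """Point quadtree over the half-open square [x, x+size) x [y, y+size)."""

    def __init__(self, x, y, size, capacity=4):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity
        self.points = []      # data held while this node is a leaf
        self.children = None  # None for a leaf, else a list of four subtrees

    def contains(self, px, py):
        return (self.x <= px < self.x + self.size
                and self.y <= py < self.y + self.size)

    def insert(self, px, py):
        """Insert a point; return False if it lies outside this node's region."""
        if not self.contains(px, py):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((px, py))
                return True
            self._split()
        # Exactly one child contains the point; any() stops at the first hit.
        return any(child.insert(px, py) for child in self.children)

    def _split(self):
        """Create four children and move this node's points into them."""
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx, self.y + dy, half, self.capacity)
            for dx in (0, half) for dy in (0, half)
        ]
        for px, py in self.points:
            any(child.insert(px, py) for child in self.children)
        self.points = []

    def query(self, qx, qy, qsize):
        """Return all points inside the square [qx, qx+qsize) x [qy, qy+qsize)."""
        if (qx >= self.x + self.size or qx + qsize <= self.x
                or qy >= self.y + self.size or qy + qsize <= self.y):
            return []  # query square does not overlap this node's region
        if self.children is None:
            return [(px, py) for px, py in self.points
                    if qx <= px < qx + qsize and qy <= py < qy + qsize]
        found = []
        for child in self.children:
            found.extend(child.query(qx, qy, qsize))
        return found
```

A region query skips every subtree whose square does not overlap the query square, which is what makes "find all locations near this area" cheaper than scanning every point.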
Replication
- Act of duplicating data from one database server to others
- Most often used to increase redundancy and fault tolerance, for example by keeping copies in several regions or other locations
- Other times used to move data closer to clients, which decreases the latency of accessing that data
Sharding - Basics
- Sometimes called data partitioning. Act of splitting a database into two or more pieces called shards
- Typically done to increase the throughput of the database
- A reverse proxy is usually used to route requests from application servers to database shards
Sharding - Strategies
- Based on the client’s region
- Based on the type of data being stored. For example, user data gets stored in one shard, payment data gets stored in another shard
- Based on the hash of a column. Only for structured data
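Hash-based sharding can be sketched as a function that maps a column value to a shard number. The `shard_for` name and the choice of SHA-256 are assumptions for illustration. A stable hash is used because Python's built-in `hash()` for strings is randomized per process by default, so it would route the same key differently after a restart.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a sharding key to a shard index in the range [0, num_shards)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo
    # the number of shards.
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical routing decision for a few user IDs across 4 shards.
routing = {user_id: shard_for(user_id, 4) for user_id in ("u1", "u2", "u3")}
```

With plain modulo routing, changing the number of shards remaps most keys, so resizing usually requires moving a large share of the data.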
Replication and Sharding - Hot Spot
- Occurs when a workload distributed across a set of servers spreads unevenly, so some servers receive far more traffic than others
- This can happen if the sharding key or hashing function is suboptimal, or if the workload is naturally skewed
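A naturally skewed workload can produce a hot spot even with a reasonable hash function, as this small simulation suggests. The `shard_for` and `load_per_shard` helpers and the request keys are hypothetical, chosen only to show the effect.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Map a sharding key to a shard index using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def load_per_shard(request_keys, num_shards):
    """Count how many requests land on each shard."""
    counts = Counter(shard_for(key, num_shards) for key in request_keys)
    return [counts.get(shard, 0) for shard in range(num_shards)]


# 1000 requests spread over 1000 different keys.
uniform = [f"user-{i}" for i in range(1000)]

# 1000 requests where one key receives 90% of the traffic.
skewed = ["celebrity"] * 900 + [f"user-{i}" for i in range(100)]

uniform_load = load_per_shard(uniform, 4)
skewed_load = load_per_shard(skewed, 4)
```

In the skewed case, whichever shard owns the popular key receives at least 900 of the 1000 requests, regardless of how well the hash spreads the remaining keys.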