L7 Flashcards
traditional ways of data storage
- structured, relational data
- rows and columns
- very expensive when data is high in volume, velocity, etc.
- time consuming
- not scalable
- a rigid schema check is made on every insert or update, which poses a problem for data arriving at high velocity
RDBMS, data warehouse/marts, SQL database
-> only ~20% of data nowadays is traditional structured data
RDBMS vs non-RDBMS
- data: stored in tables vs stored according to different models (key-value, wide-column store, multi-model, etc.)
- schema: supports a rigid pre-defined data schema vs supports a dynamic data schema
- scalability: vertically scalable vs horizontally scalable
- language: structured query language (SQL) vs NoSQL languages
- transactions: ACID-compliant vs CAP theorem, possibly ACID-compliant
- best for: transactions and routine data analysis through complex queries vs storing and modelling all types of data
- examples: Amazon Redshift, Oracle Database, MySQL vs Amazon DynamoDB, Oracle NoSQL Database
ACID compliant
set of properties that guarantee data validity in an RDBMS despite errors, power failures, etc.
1. atomicity: either all changes made by transaction are made or none
2. consistency: prevents transactions from violating integrity constraints or database rules
3. isolation: concurrent transactions do not interfere with each other
4. durability: finalized transactions are permanent and will survive system failures
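A minimal sketch of atomicity using Python's built-in sqlite3 module (the `accounts` table and balances are made up for illustration): a transfer that violates an integrity constraint is rolled back as a whole, so no partial change is ever visible.

```python
import sqlite3

# In-memory database with a hypothetical accounts table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

# Atomicity + consistency: alice cannot afford 200, so the CHECK
# constraint fires and the whole transaction is rolled back
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # neither UPDATE took effect

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # both rows unchanged: {'alice': 100, 'bob': 50}
```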
CAP theorem
any distributed data store can only guarantee 2/3 of the following:
1. consistency: all users see same view of data
2. availability: every request gets a response even if nodes fail (users can always reach a replica)
3. partition tolerance: works even in presence of network failure
def BD storage
infrastructure designed specifically to store, manage and retrieve massive amounts of data;
a compute-and-storage architecture that collects and manages large datasets and enables real-time data analytics
5 methods to store BD
- Hadoop Distributed File System (HDFS)
- NoSQL
- NewSQL
- Clustered Network Attached Storage (NAS)
- Cloud Computing
HDFS
- one of the components of Apache Hadoop
- designed to run on commodity hardware
- an HDFS cluster consists of one namenode (stores metadata: indices and block names/locations) and many datanodes (where the data is stored), organized into racks
- divides files into blocks (eg. of 128MB) and distributes them across many datanodes (computers)
- each block is replicated on more than one node -> fault-tolerance
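The block-splitting and replication idea can be sketched in Python (illustrative only; real HDFS placement is rack-aware and far more involved than this round-robin helper):

```python
# Sketch of HDFS-style block splitting and replica placement (not the real API)
BLOCK_SIZE = 128 * 1024 * 1024  # default 128 MB block size
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, length) pairs covering a file of the given size."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def place_replicas(block_id, datanodes, replication=REPLICATION):
    """Pick `replication` distinct datanodes for a block (round-robin sketch)."""
    return [datanodes[(block_id + i) % len(datanodes)] for i in range(replication)]

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                              # 3 blocks: 128 + 128 + 44 MB
datanodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(0, datanodes))             # ['dn1', 'dn2', 'dn3']
```

Because each block lives on three datanodes, losing any single node leaves at least two copies of every block -- the fault tolerance the notes describe.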
HDFS Advantages
Pros:
- no single point of failure due to replication -> fault tolerance
- possible to communicate with closest available node to reduce latency and network traffic
- good at handling extremely large files
- streaming data access (large sequential reads at sustained throughput, not random access) -> suited to read-intensive workloads
- commodity hardware
NoSQL database
what is it
features
- Not Only SQL
- designed to meet big data processing demands
- supports structured, semi-structured and unstructured data
- better for lots of unstructured data
- CAP theorem
- BASE model
- Schemaless
- Horizontal scalability
BASE model
Basically Available: always available despite network failure
Soft state: node information could be temporarily inconsistent
Eventual consistency: changes will eventually propagate to all nodes
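A toy Python sketch of the BASE idea (hypothetical replicas, not a real replication protocol): a write lands on one node first, the other replicas are temporarily stale (soft state), and a sync step makes them converge (eventual consistency).

```python
# Three replicas of the same logical store
replicas = [{}, {}, {}]

def write(key, value):
    replicas[0][key] = value  # soft state: replicas 1 and 2 are now stale

def sync():
    # eventual consistency: propagate the newest state to all replicas
    for r in replicas[1:]:
        r.update(replicas[0])

write("x", 1)
stale = replicas[1].get("x")       # None: other nodes not yet updated
sync()
consistent = replicas[2].get("x")  # 1: all nodes converge after sync
print(stale, consistent)
```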
Schemaless
- doesn’t require pre-defined structure to store data
- can be in any format (structured or unstructured)
- no prior knowledge necessary about data
Horizontal Scalability
- scale-out (add more instances) vs scale-up (add more RAM, CPU)
- sharding (distribution of datasets into smaller chunks) and replication (for fault tolerance)
- capable of growing dynamically
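Hash-based sharding can be sketched like this (the `shard_for` helper is made up for illustration; production systems typically use consistent hashing so that adding a node moves few keys):

```python
# Sketch: route each key deterministically to one of N shards
def shard_for(key: str, num_shards: int) -> int:
    # Simple deterministic hash over the key's bytes
    return sum(key.encode()) % num_shards

shards = {i: {} for i in range(4)}  # scale-out: 4 independent nodes

def put(key, value):
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    return shards[shard_for(key, len(shards))].get(key)

put("user:42", {"name": "Ada"})
print(get("user:42"))  # {'name': 'Ada'}
```

Each shard holds only its slice of the dataset, so capacity grows by adding instances (scale-out) rather than by buying a bigger machine (scale-up); replication of each shard would then add the fault tolerance mentioned above.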
4 types of NoSQL
- key-value store db
- column-store db
- document db
- graph db
Key-value store DB
- simplest and most efficient
- stores data as key-value pairs where key is unique identifier
- value can be a simple object or a complex compound object
- key is a string, value is the data
- similar to hash tables
- OLTP-suitable (fast reads and writes of individual records by key)
eg) Amazon DynamoDB
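A key-value store behaves essentially like a hash table keyed by a unique string; a minimal Python sketch (not DynamoDB's actual API):

```python
# Minimal key-value store: unique string key -> arbitrary value
class KVStore:
    def __init__(self):
        self._table = {}  # backed by a hash table, as the notes describe

    def put(self, key: str, value):
        self._table[key] = value  # O(1) average insert/overwrite

    def get(self, key: str, default=None):
        return self._table.get(key, default)  # O(1) average lookup by key

db = KVStore()
# the value can be a complex compound object
db.put("session:9f3a", {"user": "bob", "cart": ["book", "pen"]})
print(db.get("session:9f3a"))
```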
Column-store DB
- stores the values of each field together, column by column, instead of row by row
- increases the speed of queries that access only a few columns
- OLAP-suitable (analytical queries, eg. cubes of multidimensional data)
eg) Cassandra
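The row-vs-column layout difference can be sketched in Python (made-up data): an analytical query such as an average touches a single contiguous column array instead of scanning every full row.

```python
# Row-oriented layout: one record per entry
rows = [
    {"id": 1, "city": "Oslo", "temp": 4},
    {"id": 2, "city": "Rome", "temp": 18},
    {"id": 3, "city": "Cairo", "temp": 29},
]

# Column-oriented layout of the same data: one array per field
columns = {
    "id": [1, 2, 3],
    "city": ["Oslo", "Rome", "Cairo"],
    "temp": [4, 18, 29],
}

# An analytical (OLAP-style) query reads only the one column it needs
avg_temp = sum(columns["temp"]) / len(columns["temp"])
print(avg_temp)  # 17.0
```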