Beyond relational databases Flashcards
NoSQL: birth and main features
- Strozzi ‘98: lightweight shell based open-source RDBMS not using SQL standard.
- Johan Oskarasson’s event ‘09: about recent advances on non-relational databases
Features:
- No joins
- Schema-less
- Horizontal scalability
NoSQL comparison with SQL
- Table based vs Document-based, key-value pairs, graph, columnar storage
- Predefined schema vs Schema-less, good for unstructured/semi-structured data
- Vertically vs Horizontally scalavle
- SQL vs customer query languages.
- Complex queries based on joins vs No standard interface to perform complex queries
- Suitable for flat and structured data storage vs Complex (hierarchical) data, similar to JSON/XML
Key-value databases
- Simplest NoSQL data store.
- Match keys with values
- No structured
- Great performance
- Easily scaled
- Example: Redis, Riak, Memcached
Column-oriented databases
- Stored data in columnar format
- Column is, possibly, a complex attribute
- Key-value pairs sotred and retrieved on key in a parallel system (similar to indexes)
- Rows can be contructed to column values
- Useful for data-warehousing: easy for example to compute the average of the votes of exams
- Transparent to application
- Examples: Cassanda, DynamoDB
Graph databases
- Made up by vertex and edges
- Used to store information about networks
- Good fit for several real world applications
- Examples: Neo4J, OrentDB
Document Databases
- Database stores and retrieves documents
- keys are mapped to documents
- Documents are self-describing: include attribute and value
- Has hiererchical-tree nested strctures: maps, lists, datetime
- Documents are heterogenous
- Examples: MongoDB, CouchDB
CouchDB
- Document oriented database can be queried and indexed in a MapReduce fashion
- Offers incremental replication and bidirectional conflict detection and resolution.
- Written in Erlang: functional pl, ideal for concurrent/distributed systems. Allows for flexible design that is easily scalable.
- Provides RESTful JSON API that can be accessed from any environment.
MapReduce
- Distributed programming model
- Process large data sets with parallel algorithms on a cluster of common machines.
- Great for parallel jobs requiring pieces of computations to be exectued on all data records.
- Data locality in MapReduce refers to the ability to move the computation close to where the actual data resides on the node, instead of moving large data to computation. This minimizes network congestion and increases the overall throughput of the system.
Map function:
- Describes what values to extract from a single document
Reduce function:
- Describes what to do with a list of values associated with the same key
- Returns just one value
- Multiple levels of reduce are possible
CouchDB views
- Is the only way to query CouchDB
- Produced by MapReduce
- Predefined view for each db has
- document ID as key
- whole document as value
- no reduce
- Views are materialized as values sorted by key
- this allows the same db to have multiple primary indexes
- When writing CouchDB map functions, the primary goal is to build an index that stores related data under nearby index.
Replication
- Portion of the whole dataset (chunks) in different places
- Goals:
- Redundancy help surviving failures (availability)
- Better performance
- Approaches:
- Master-slave replication
- A-synchronous replication
Master-slave replication
- Master server takes all the writes, updates, inserts
- One or more Slave servers take all the reads
- Master is a single point of failure
- CouchDB supports Master-Master Replication
Synchronous/Asynchronous replication
Synchronous
- Before committing a transaction, master waits for all the slaves to commit
- Similar to 2PC
- Performance killer
- Trade-off: wait for a subset of slaves to commit (majority of them)
Asynchronous
- Master commits locally, it does not wait for any slave
- Each slave independently fetches updates from master, which may fail:
- if no slave has replicated, then you’ve lost the data
- if some slaves have replicated you’ve to reconcile
- Faster and unreliable
Key features of distributed databases
- Consistency
- all the databases provide the same data
- Availability
- failures don’t prevent survivors from continuiing to operate
- Partition tolerance
- systems continues to work despite arbitrary message loss, when connectivity failures cause network partitions
CAP Theorem
Also known as Brewer’s theorem, states that it is impossible for distributed systems to provide all three features at the same time.
- Think of two nodes on opposite sides of a partition
- Allowing at least one node to update will cause inconsitency, forfeiting C
- If we choose to preserve C, one side of the partition must be unavailable, forfeiting A
- Only when no network partition exists, it is possible to preserve both consistency and availability, forfeiting P.
CA without P
It is equivalent to local database being consistent and available. Distributed systems are not possible with this setup.
Partiotioning means having multiple independent systems that do not need to interact -> Local rather than global consistency.