Beyond relational databases Flashcards

Question 1

Q

NoSQL: birth and main features

Answer

A

Strozzi, 1998: a lightweight, shell-based, open-source RDBMS that does not use the SQL standard.
Johan Oskarsson’s event, 2009: a meetup about recent advances in non-relational databases.

Features:

No joins
Schema-less
Horizontal scalability

Question 2

Q

NoSQL comparison with SQL

Answer

A

Table-based vs document-based, key-value, graph, or columnar storage
Predefined schema vs schema-less (good for unstructured/semi-structured data)
Vertically scalable vs horizontally scalable
SQL vs custom query languages
Complex queries based on joins vs no standard interface for complex queries
Suited to flat, structured data vs suited to complex (hierarchical) data, similar to JSON/XML

Question 3

Q

Key-value databases

Answer

A

Simplest NoSQL data store.
Match keys with values
No structure imposed on values
Great performance
Easily scaled
Example: Redis, Riak, Memcached
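
As a rough illustration of why matching keys to opaque values scales easily, here is a minimal in-memory sketch. The class name `ShardedKV` and the hash-modulo placement are assumptions made for this example; they are not how Redis, Riak, or Memcached are implemented.

```python
import hashlib


class ShardedKV:
    """Toy key-value store that spreads keys over several in-memory shards."""

    def __init__(self, num_shards=3):
        # Each shard stands in for a separate server (hypothetical setup).
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash sends the same key to the same shard every time.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKV()
# Values are opaque: the store never looks inside them.
store.put("user:42", b"any bytes at all")
store.put("session:abc", {"cart": ["book"]})
```

Because a lookup needs only the key, no shard ever has to consult another one, which is what makes adding shards straightforward in this model.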

Question 4

Q

Column-oriented databases

Answer

A

Data is stored in columnar format
A column is, possibly, a complex attribute
Key-value pairs are stored and retrieved by key in a parallel system (similar to indexes)
Rows can be reconstructed from column values
Useful for data-warehousing: easy for example to compute the average of the votes of exams
Transparent to application
Examples: Cassandra, DynamoDB
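
The exam-votes example can be sketched in a few lines. The sample data and the helper `reconstruct_row` are invented for illustration; a real column store adds compression, partitioning, and on-disk layout that this ignores.

```python
# Row-oriented view of the data (made-up sample records).
rows = [
    {"student": "anna", "exam": "databases", "vote": 28},
    {"student": "bruno", "exam": "databases", "vote": 24},
    {"student": "carla", "exam": "networks", "vote": 30},
]

# Columnar layout: one list per attribute, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# The aggregate reads only the "vote" column, not whole rows.
average_vote = sum(columns["vote"]) / len(columns["vote"])


def reconstruct_row(cols, i):
    """Rebuild row i by taking the i-th entry of every column."""
    return {name: values[i] for name, values in cols.items()}
```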

Question 5

Q

Graph databases

Answer

A

Made up of vertices and edges
Used to store information about networks
Good fit for several real world applications
Examples: Neo4j, OrientDB
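
A small sketch of the kind of network query a graph model makes natural: finding everyone within a given number of hops. The sample edges and the function `within_hops` are assumptions for this example and do not reflect the query language of Neo4j or OrientDB.

```python
from collections import deque

# Undirected "knows" edges in a made-up social network.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "erin")]

# Adjacency sets: vertex -> set of neighbouring vertices.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def within_hops(graph, start, max_hops):
    """Breadth-first search returning {vertex: distance} for distance <= max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]  # the start vertex is not part of the answer
    return seen


nearby = within_hops(graph, "alice", 2)
```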

Question 6

Q

Document Databases

Answer

A

Database stores and retrieves documents
Keys are mapped to documents
Documents are self-describing: they include attribute names and values
They have hierarchical, tree-like nested structures: maps, lists, datetimes
Documents are heterogeneous
Examples: MongoDB, CouchDB
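
To make "self-describing" and "heterogeneous" concrete, here is a minimal sketch using plain dictionaries as documents. The keys, fields, and the helper `titles_with_tag` are invented for illustration and are not the MongoDB or CouchDB API.

```python
import json

# Keys map to documents; documents need not share the same fields.
documents = {
    "post:1": {
        "type": "post",
        "title": "Hello",
        "tags": ["intro", "nosql"],                            # nested list
        "author": {"name": "Ada", "email": "ada@example.org"},  # nested map
    },
    "post:2": {"type": "post", "title": "No tags here"},
    "user:7": {"type": "user", "name": "Ada", "joined": "2015-03-01T10:00:00Z"},
}


def titles_with_tag(docs, tag):
    """Return titles of posts carrying the tag, tolerating missing fields."""
    return sorted(
        doc["title"]
        for doc in docs.values()
        if doc.get("type") == "post" and tag in doc.get("tags", [])
    )


# Each document carries its own attribute names, so it serializes on its own.
serialized = json.dumps(documents["post:1"])
```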

Question 7

Q

CouchDB

Answer

A

Document-oriented database that can be queried and indexed in a MapReduce fashion.
Offers incremental replication with bidirectional conflict detection and resolution.
Written in Erlang, a functional programming language well suited to concurrent/distributed systems. This allows for a flexible design that is easily scalable.
Provides RESTful JSON API that can be accessed from any environment.

Question 8

Q

MapReduce

Answer

A

Distributed programming model
Processes large data sets with parallel algorithms on a cluster of commodity machines.
Great for parallel jobs that require pieces of computation to be executed on all data records.
Data locality in MapReduce refers to the ability to move the computation close to where the actual data resides on the node, instead of moving large data to computation. This minimizes network congestion and increases the overall throughput of the system.

Map function:

Describes what values to extract from a single document

Reduce function:

Describes what to do with a list of values associated with the same key
Returns just one value
Multiple levels of reduce are possible
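
The map and reduce roles described above can be sketched on a single machine as follows. This is a toy word count under stated assumptions: the driver `map_reduce` and the sample documents are made up for the example, and nothing here is distributed.

```python
from collections import defaultdict


def map_fn(doc):
    """Extract (key, value) pairs from a single document."""
    for word in doc["text"].lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Collapse all values that share a key into one value."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    # Group every emitted value under its key, then reduce each group.
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


docs = [{"text": "to be or not to be"}, {"text": "to do"}]
counts = map_reduce(docs, map_fn, reduce_fn)

# A second level of reduce: reduce each document separately, then reduce
# the partial results again. This works here because summing is associative.
partials = [map_reduce([d], map_fn, reduce_fn) for d in docs]
merged = map_reduce(partials, lambda partial: partial.items(), reduce_fn)
```

In a real cluster the per-document step would run on the node that holds the data, and only the small partial results would travel over the network.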

Question 9

Q

CouchDB views

Answer

A

Views are the main way to query CouchDB (the only way before Mango queries were added in CouchDB 2.0)
Produced by MapReduce
Each database has a predefined view with:
- document ID as key
- whole document as value
- no reduce
Views are materialized as values sorted by key
- this allows the same db to have multiple primary indexes
When writing CouchDB map functions, the primary goal is to build an index that stores related data under nearby keys.
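
A sketch of why "nearby keys" matter: if a view is a list of rows sorted by key, related rows form one contiguous slice. Real CouchDB map functions are written in JavaScript and use its own key collation; the Python stand-ins below (`map_fn`, `build_view`, `query_range`) and the sample documents are assumptions for illustration only.

```python
import bisect

docs = [
    {"_id": "a1", "type": "comment", "post": "p1", "date": "2015-01-03", "text": "nice"},
    {"_id": "a2", "type": "comment", "post": "p2", "date": "2015-01-01", "text": "hm"},
    {"_id": "a3", "type": "comment", "post": "p1", "date": "2015-01-02", "text": "first"},
]


def map_fn(doc):
    # Compound key: comments of the same post end up next to each other,
    # ordered by date within the post.
    if doc.get("type") == "comment":
        yield (doc["post"], doc["date"]), doc["text"]


def build_view(docs, map_fn):
    """Materialize the view as (key, value, doc_id) rows sorted by key."""
    rows = [(key, value, doc["_id"]) for doc in docs for key, value in map_fn(doc)]
    rows.sort(key=lambda row: (row[0], row[2]))
    return rows


def query_range(view, startkey, endkey):
    """Return the contiguous slice of rows with startkey <= key <= endkey."""
    keys = [row[0] for row in view]
    lo = bisect.bisect_left(keys, startkey)
    hi = bisect.bisect_right(keys, endkey)
    return view[lo:hi]


view = build_view(docs, map_fn)
# All comments on post p1, oldest first.
p1_comments = query_range(view, ("p1", ""), ("p1", "\uffff"))
```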

Question 10

Q

Replication

Answer

A

Copies of portions of the whole dataset (chunks) are kept in different places
Goals:
- Redundancy helps survive failures (availability)
- Better performance
Approaches:
- Master-slave replication
- Synchronous/asynchronous replication

Question 11

Q

Master-slave replication

Answer

A

Master server takes all the writes, updates, inserts
One or more Slave servers take all the reads
Master is a single point of failure
CouchDB supports Master-Master Replication

Question 12

Q

Synchronous/Asynchronous replication

Answer

A

Synchronous

Before committing a transaction, master waits for all the slaves to commit
Similar to 2PC
Performance killer
Trade-off: wait for a subset of slaves to commit (majority of them)

Asynchronous

Master commits locally, it does not wait for any slave
Each slave independently fetches updates from the master, which may fail:
- if no slave has replicated, you have lost the data
- if some slaves have replicated, you have to reconcile
Faster but less reliable
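
The difference between the two modes can be simulated in memory. This is a simplified sketch: the class `Replica` and the functions `write_sync`, `write_async`, and `pull` are hypothetical names, the "prepare" step is reduced to checking whether a slave is up, and no real network or failure handling is modelled.

```python
class Replica:
    def __init__(self, name, up=True):
        self.name, self.up, self.log = name, up, []

    def apply(self, entry):
        self.log.append(entry)


def write_sync(master, slaves, entry, required_acks):
    """Commit only if at least required_acks slaves can acknowledge."""
    ready = [s for s in slaves if s.up]  # simplified "prepare" phase
    if len(ready) < required_acks:
        return False  # nothing is applied anywhere
    for slave in ready:
        slave.apply(entry)
    master.apply(entry)
    return True


def write_async(master, entry):
    """Commit locally and return at once; slaves catch up later."""
    master.apply(entry)
    return True


def pull(slave, master):
    """A slave fetches whatever entries it is missing from the master."""
    if slave.up:
        slave.log.extend(master.log[len(slave.log):])


master = Replica("master")
s1, s2, s3 = Replica("s1"), Replica("s2"), Replica("s3", up=False)
slaves = [s1, s2, s3]

all_ok = write_sync(master, slaves, "x=1", required_acks=3)     # one slave is down
quorum_ok = write_sync(master, slaves, "x=1", required_acks=2)  # majority suffices

write_async(master, "x=2")
lag_before = list(s1.log)  # s1 has not seen x=2 yet
pull(s1, master)
```

Waiting for all three slaves blocks the write while one is down, whereas the majority variant goes through. After the asynchronous write, `s1` lags behind until it pulls; if the master failed in that window, `x=2` would exist nowhere else.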

Question 13

Q

Key features of distributed databases

Answer

A

Consistency
- all the databases provide the same data
Availability
- failures don’t prevent survivors from continuing to operate
Partition tolerance
- the system continues to work despite arbitrary message loss, when connectivity failures cause network partitions

Question 14

Q

CAP Theorem

Answer

A

Also known as Brewer’s theorem, states that it is impossible for distributed systems to provide all three features at the same time.

Think of two nodes on opposite sides of a partition
Allowing at least one node to update will cause inconsistency, forfeiting C
If we choose to preserve C, one side of the partition must be unavailable, forfeiting A
Only when no network partition exists is it possible to preserve both consistency and availability, forfeiting P.
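
The two-node argument can be played out in a toy model. The classes `Node` and `Cluster` and the `"CP"`/`"AP"` mode flag are assumptions made for this sketch; real systems make this choice in far more nuanced ways.

```python
class Node:
    def __init__(self):
        self.value = None


class Cluster:
    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP" (hypothetical switch)
        self.nodes = [Node(), Node()]
        self.partitioned = False

    def write(self, node_index, value):
        if not self.partitioned:
            # No partition: every node gets the update.
            for node in self.nodes:
                node.value = value
            return True
        if self.mode == "CP":
            return False  # refuse the write: availability is forfeited
        # AP: accept on the reachable side only; consistency is forfeited.
        self.nodes[node_index].value = value
        return True

    def consistent(self):
        return self.nodes[0].value == self.nodes[1].value


cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.write(0, "v1")      # replicated to both nodes
    cluster.partitioned = True  # the link between the nodes goes down

cp_accepted = cp.write(0, "v2")
ap_accepted = ap.write(0, "v2")
```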

Question 15

Q

CA without P

Answer

A

It is equivalent to local database being consistent and available. Distributed systems are not possible with this setup.

Partitioning means having multiple independent systems that do not need to interact, which gives local rather than global consistency.

Question 16

Q

CP without A

Answer

A

Transaction locking
A system is allowed not to answer
Tolerates partitioning/faults because we block all responses if a partition occurs.
Once partition is healed and consistency can once again be verified, we can restore availability and leave this mode.
In short: block access to replica sets that are not in sync to achieve global consistency and correctness.

Question 17

Q

AP without C

Answer

A

Best effort
We don’t care about global consistency; every part of the system makes available what it knows.
Each part might be able to answer someone, even if the system as a whole has been broken up into regions that cannot communicate (partitions).
Without consistency means without assurance of global consistency at all times.

Question 18

Q

Amazon CTO about CAP theorem

Answer

A

Each node should be able to make decisions based only on its local state
Agreement algorithms will eventually become a bottleneck, if you are concerned with scalability

Question 19

Q

Why the “2 of 3” view is misleading

Answer

A

Partitions are rare, so there is almost no reason to forfeit A or C when no partition exists.
The choice between A and C can occur many times at different granularities inside the same system: subsystems can make different choices, and the choice can change with the operation and the user involved.
The 3 properties are continuous, not binary. There are many levels of consistency and availability (0-100), and even partitions have nuances, including disagreement within the system about whether a partition exists.

Question 20

Q

BASE, and comparison with ACID

Answer

A

Basically available
Soft state: state of system may change over time, even with no input, because of consistency model.
Eventual consistency: the system will become consistent over time, provided it receives no input for some time.

ACID focuses on consistency, while BASE makes explicit both the choice and the spectrum of availability wanted; availability is the main focus of BASE.

Example:

DNS: Eventually consistent.

Question 21

Q

Conflict resolution on distributed databases

Answer

A

A conflict on a document is detectable by all nodes involved, and each node resolves it locally.
CouchDB guarantees that each instance that sees the same conflict will come up with the same winning solution, using a deterministic algorithm to pick the winner.
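
The key property is that the winner depends only on the set of conflicting revisions, not on the order in which a node received them. The sketch below assumes revision identifiers of the form `generation-hash` and an ordering rule (higher generation first, then the larger hash string) that is only loosely modelled on CouchDB; the function `pick_winner` and the sample revision strings are hypothetical.

```python
def pick_winner(revisions):
    """Choose one revision deterministically from a set of conflicting ones."""

    def sort_key(rev):
        generation, _, digest = rev.partition("-")
        # Assumed rule: more edits wins; ties are broken by the hash string.
        return (int(generation), digest)

    return max(revisions, key=sort_key)


# Two nodes learn about the same conflict in a different order.
node_a_sees = ["2-7051cbe5", "2-c1592ce7"]
node_b_sees = ["2-c1592ce7", "2-7051cbe5"]

winner_a = pick_winner(node_a_sees)
winner_b = pick_winner(node_b_sees)
```

Since both nodes apply the same pure function to the same set, they agree without exchanging any further messages; the losing revision can be kept around so an application can merge it later.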

Question 22

Q

MongoDB

Answer

A

Pros:

Business ready representation of data
Flexible and rich
Mapping into developer-language objects

Cons:

Relations among documents are inefficient
Temptation to go too schema-free even with structured data

Question 23

Q

Hadoop

Answer

A

Big-data everything platform.

Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce

Features:

Distributed, fault-tolerant storage
Parallel, scalable processing
High availability
Low cost

(23 cards)