Beyond relational databases Flashcards

1
Q

NoSQL: birth and main features

A
  • Strozzi, 1998: a lightweight, shell-based, open-source RDBMS that did not use the SQL standard.
  • Johan Oskarsson’s event, 2009: about recent advances in non-relational databases.

Features:

  • No joins
  • Schema-less
  • Horizontal scalability
2
Q

NoSQL comparison with SQL

A
  • Table based vs Document-based, key-value pairs, graph, columnar storage
  • Predefined schema vs Schema-less, good for unstructured/semi-structured data
  • Vertically vs Horizontally scalable
  • SQL vs custom query languages
  • Complex queries based on joins vs No standard interface to perform complex queries
  • Suitable for flat and structured data storage vs Complex (hierarchical) data, similar to JSON/XML
3
Q

Key-value databases

A
  • Simplest NoSQL data store.
  • Match keys with values
  • No structure imposed on values
  • Great performance
  • Easily scaled
  • Example: Redis, Riak, Memcached
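
The "match keys with values" idea can be illustrated with a toy in-memory store. This is only a sketch: the `KeyValueStore` class and its method names are hypothetical and do not mirror the API of Redis, Riak, or Memcached.

```python
from typing import Optional


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value: it is an opaque blob.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Lookup is by exact key only; there is no query on the value.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:42", b'{"name": "Ada"}')
profile = store.get("user:42")
missing = store.get("user:43")
```

Because every operation touches a single key, keys can be spread over many machines, which is one reason such stores tend to scale easily.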
4
Q

Column-oriented databases

A
  • Data is stored in columnar format
  • A column is, possibly, a complex attribute
  • Key-value pairs are stored and retrieved by key in a parallel system (similar to indexes)
  • Rows can be reconstructed from column values
  • Useful for data warehousing: for example, it is easy to compute the average of exam grades
  • Transparent to the application
  • Examples: Cassandra, DynamoDB
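
The exam-average example can be sketched in a few lines. The sample records and the names `rows`, `columns`, and `reconstruct_row` are made up for illustration; real column stores use far more elaborate on-disk layouts.

```python
# Row-oriented layout: one record per exam (sample data, invented here).
rows = [
    {"student": "s1", "course": "DB", "grade": 28},
    {"student": "s2", "course": "DB", "grade": 30},
    {"student": "s3", "course": "ML", "grade": 26},
]

# Column-oriented layout: one list per attribute, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate only needs to scan a single column.
average_grade = sum(columns["grade"]) / len(columns["grade"])


def reconstruct_row(index):
    """Rebuild a row by reading the same position from every column."""
    return {name: values[index] for name, values in columns.items()}
```

The aggregate reads only the `grade` column, which is why analytic queries tend to favour this layout.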
5
Q

Graph databases

A
  • Made up of vertices and edges
  • Used to store information about networks
  • Good fit for several real world applications
  • Examples: Neo4j, OrientDB
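
As a rough illustration of the vertex-and-edge model, the sketch below stores a small social network as an adjacency map and answers a "who is within N hops" question. The data and the `within_hops` helper are hypothetical; graph databases expose this through their own query languages rather than hand-written traversals.

```python
from collections import deque

# Undirected "knows" edges between people (sample data, invented here).
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("erin", "frank")]

# Adjacency map: vertex -> set of neighbouring vertices.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def within_hops(start, max_hops):
    """Breadth-first search returning every vertex at most max_hops away."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    seen.discard(start)
    return seen
```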
6
Q

Document Databases

A
  • Database stores and retrieves documents
  • Keys are mapped to documents
  • Documents are self-describing: they include both attribute names and values
  • Documents have hierarchical, tree-like nested structures: maps, lists, datetimes
  • Documents are heterogeneous
  • Examples: MongoDB, CouchDB
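
A minimal sketch of what "self-describing" and "heterogeneous" mean in practice, using plain Python dictionaries in place of a real document store. The order documents and the helper names are invented for illustration.

```python
# Keys map to documents; each document carries its own field names.
documents = {
    "order:1": {
        "customer": "Ada",
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
        "shipping": {"city": "Turin"},  # nested map
    },
    "order:2": {
        "customer": "Bob",
        "items": [{"sku": "A1", "qty": 5}],
        "gift_note": "Happy birthday",  # field absent from order:1
    },
}


def total_quantity(doc):
    """Sum quantities over the nested list of items, if present."""
    return sum(item["qty"] for item in doc.get("items", []))


def field_names(doc):
    """The document itself tells us which attributes it has."""
    return sorted(doc)
```

The two orders have different fields, and no shared schema has to be declared beforehand.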
7
Q

CouchDB

A
  • Document-oriented database that can be queried and indexed in a MapReduce fashion
  • Offers incremental replication with bidirectional conflict detection and resolution.
  • Written in Erlang: a functional programming language, ideal for concurrent/distributed systems. Allows for a flexible design that is easily scalable.
  • Provides a RESTful JSON API that can be accessed from any environment.
8
Q

MapReduce

A
  • Distributed programming model
  • Process large data sets with parallel algorithms on a cluster of common machines.
  • Great for parallel jobs requiring pieces of computation to be executed on all data records.
  • Data locality in MapReduce refers to the ability to move the computation close to where the actual data resides on the node, instead of moving large data to computation. This minimizes network congestion and increases the overall throughput of the system.

Map function:

  • Describes what values to extract from a single document

Reduce function:

  • Describes what to do with a list of values associated with the same key
  • Returns just one value
  • Multiple levels of reduce are possible
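
The map and reduce roles described above can be sketched on a single machine as a word count. The `map_reduce` driver and the sample documents are assumptions made for illustration; a real framework would run the map calls in parallel across nodes and shuffle the intermediate pairs over the network.

```python
from collections import defaultdict


def map_fn(doc):
    """Map: decide which (key, value) pairs to extract from one document."""
    for word in doc["text"].lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: collapse all values sharing a key into a single value."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    # Group the mapped values by key (the "shuffle" step).
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Apply reduce once per key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


docs = [{"text": "to be or not to be"}, {"text": "to do"}]
counts = map_reduce(docs, map_fn, reduce_fn)
```

Because this reduce is a sum, it can also be applied to its own partial results, which is what makes multiple levels of reduce possible.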
9
Q

CouchDB views

A
  • Is the only way to query CouchDB
  • Produced by MapReduce
  • Each database has a predefined view with:
    • document ID as key
    • whole document as value
    • no reduce
  • Views are materialized as values sorted by key
    • this allows the same db to have multiple primary indexes
  • When writing CouchDB map functions, the primary goal is to build an index that stores related data under nearby keys.
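
The idea of a view as rows sorted by key can be imitated in plain Python. This is not CouchDB's API (its map functions are normally written in JavaScript and queried over HTTP); `build_view`, `query_range`, and the sample documents are hypothetical stand-ins.

```python
import bisect

docs = [
    {"_id": "a", "type": "post", "date": "2009-03-01", "title": "Hello"},
    {"_id": "b", "type": "post", "date": "2009-01-15", "title": "First"},
    {"_id": "c", "type": "comment", "date": "2009-02-01"},
    {"_id": "d", "type": "post", "date": "2009-02-20", "title": "Second"},
]


def map_posts_by_date(doc):
    """Emit (key, value) so that posts end up indexed by date."""
    if doc["type"] == "post":
        yield doc["date"], doc["title"]


def build_view(docs, map_fn):
    """Materialize the view as (key, value, doc_id) rows sorted by key."""
    rows = [(key, value, doc["_id"]) for doc in docs for key, value in map_fn(doc)]
    rows.sort(key=lambda row: (row[0], row[2]))
    return rows


def query_range(view, startkey, endkey):
    """Related rows sit next to each other, so a range is one slice."""
    keys = [row[0] for row in view]
    lo = bisect.bisect_left(keys, startkey)
    hi = bisect.bisect_right(keys, endkey)
    return view[lo:hi]


view = build_view(docs, map_posts_by_date)
recent = query_range(view, "2009-02-01", "2009-03-31")
```

Choosing the date as the key is what makes "all posts in a period" a cheap contiguous read.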
10
Q

Replication

A
  • Copies of portions of the dataset (chunks) are kept in different places
  • Goals:
    • Redundancy helps survive failures (availability)
    • Better performance
  • Approaches:
    • Master-slave replication
    • Synchronous/asynchronous replication
11
Q

Master-slave replication

A
  • Master server takes all the writes, updates, inserts
  • One or more Slave servers take all the reads
  • Master is a single point of failure
  • CouchDB supports Master-Master Replication
12
Q

Synchronous/Asynchronous replication

A

Synchronous

  • Before committing a transaction, master waits for all the slaves to commit
  • Similar to 2PC
  • Performance killer
  • Trade-off: wait for a subset of slaves to commit (majority of them)

Asynchronous

  • Master commits locally, it does not wait for any slave
  • Each slave independently fetches updates from master, which may fail:
    • if no slave has replicated, then you’ve lost the data
    • if some slaves have replicated, you have to reconcile
  • Faster but less reliable
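
The trade-off between the two modes can be simulated with a toy master and three slaves. Everything here is a simplification under stated assumptions: the `Master` and `Slave` classes are hypothetical, `required_acks` is an invented parameter (all slaves = fully synchronous, a majority = the trade-off, 0 = asynchronous), and a reachability check stands in for a real commit protocol.

```python
class Slave:
    def __init__(self, online=True):
        self.online = online
        self.data = {}

    def fetch_from(self, master):
        """Asynchronous catch-up: the slave pulls updates on its own."""
        if self.online:
            self.data.update(master.data)


class Master:
    def __init__(self, slaves):
        self.slaves = slaves
        self.data = {}

    def write(self, key, value, required_acks):
        """Commit only if enough slaves can acknowledge the write."""
        reachable = [s for s in self.slaves if s.online]
        if len(reachable) < required_acks:
            return False  # cannot gather enough acknowledgements
        self.data[key] = value
        if required_acks > 0:
            for slave in reachable:
                slave.data[key] = value
        return True


slaves = [Slave(), Slave(), Slave(online=False)]
master = Master(slaves)

fully_sync = master.write("x", 1, required_acks=3)    # needs all three slaves
majority = master.write("x", 1, required_acks=2)      # needs two of three
asynchronous = master.write("y", 2, required_acks=0)  # master commits alone
y_replicated_before_fetch = "y" in slaves[0].data
slaves[0].fetch_from(master)
```

With one slave offline the fully synchronous write is refused, the majority write succeeds, and the asynchronous write exists only on the master until a slave fetches it.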
13
Q

Key features of distributed databases

A
  • Consistency
    • all the databases provide the same data
  • Availability
    • failures don’t prevent survivors from continuing to operate
  • Partition tolerance
    • the system continues to work despite arbitrary message loss, when connectivity failures cause network partitions
14
Q

CAP Theorem

A

Also known as Brewer’s theorem, states that it is impossible for distributed systems to provide all three features at the same time.

  • Think of two nodes on opposite sides of a partition
  • Allowing at least one node to update will cause inconsistency, forfeiting C
  • If we choose to preserve C, one side of the partition must be unavailable, forfeiting A
  • Only when no network partition exists is it possible to preserve both consistency and availability, forfeiting P.
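
The two-node thought experiment can be played out in code. The `TwoNodeCluster` class, its `mode` flag, and the `UNAVAILABLE` sentinel are hypothetical names for this sketch; in "CP" mode side 1 is arbitrarily chosen as the side that stops answering during a partition.

```python
UNAVAILABLE = object()  # sentinel returned when a node refuses to answer


class TwoNodeCluster:
    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP"
        self.values = [None, None]  # one replica on each side
        self.partitioned = False

    def write(self, side, value):
        if not self.partitioned:
            self.values = [value, value]  # replicate to both nodes
            return True
        if self.mode == "CP":
            if side == 1:
                return False              # forfeits A on this side
            self.values[0] = value
            return True
        self.values[side] = value         # AP: accept locally, forfeits C
        return True

    def read(self, side):
        if self.partitioned and self.mode == "CP" and side == 1:
            return UNAVAILABLE
        return self.values[side]


cp = TwoNodeCluster("CP")
cp.write(0, "v1")
cp.partitioned = True
cp_write_accepted = cp.write(1, "v2")
cp_read = cp.read(1)

ap = TwoNodeCluster("AP")
ap.write(0, "v1")
ap.partitioned = True
ap.write(0, "left")
ap.write(1, "right")
```

In the CP cluster one side goes silent rather than risk a stale answer; in the AP cluster both sides keep answering and end up disagreeing.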
15
Q

CA without P

A

It is equivalent to a local database being consistent and available. Distributed systems are not possible with this setup.

Partitioning means having multiple independent systems that do not need to interact -> Local rather than global consistency.

16
Q

CP without A

A
  • Transaction locking
  • A system is allowed not to answer
  • Tolerates partitioning/faults because we block all responses if a partition occurs.
  • Once the partition is healed and consistency can once again be verified, we can restore availability and leave this mode.
  • In short: block access to replica sets that are not in sync to achieve global consistency and correctness.
17
Q

AP without C

A
  • Best effort
  • We don’t care about global consistency; every part of the system makes available what it knows.
  • Each part might be able to answer someone, even if the system as a whole has been broken up into incommunicable regions (partitions)
  • Without consistency = without assurance of global consistency at all times.
18
Q

Amazon CTO about CAP theorem

A
  • Each node should be able to make decisions based only on its local state
  • Agreement algorithms will eventually become a bottleneck, if you are concerned with scalability
19
Q

Why the “2 of 3” view is misleading

A
  1. Partitions are rare; there is little reason to forfeit A or C when no partition exists.
  2. The choice between A and C can occur many times, at different granularities, inside the same system: subsystems can make different choices, and the choice can change with the operation and the user involved.
  3. The three properties are continuous, not binary. There are many levels of consistency and availability (0-100), and even partitions have nuances, including disagreement within the system about whether a partition exists.
20
Q

BASE, and comparison with ACID

A
  • Basically available
  • Soft state: state of system may change over time, even with no input, because of consistency model.
  • Eventual consistency: the system will become consistent over time, provided it receives no new input for some time.

ACID focuses on consistency, while BASE makes explicit both the choice and the desired degree of availability, which is its main focus.

Example:

DNS: Eventually consistent.
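
Eventual consistency can be sketched with three replicas of a tiny DNS-like table that exchange state until they agree. The versioned records, the last-writer-wins `merge` rule, and the ring-shaped `anti_entropy_round` are assumptions chosen for illustration; they do not describe how DNS itself propagates records.

```python
def merge(local, remote):
    """Last-writer-wins: per key, keep the entry with the higher version."""
    merged = dict(local)
    for key, (version, value) in remote.items():
        if key not in merged or version > merged[key][0]:
            merged[key] = (version, value)
    return merged


# Each replica maps name -> (version, address); they start out of sync.
replicas = [
    {"www": (2, "10.0.0.2")},
    {"www": (1, "10.0.0.1"), "mail": (1, "10.0.0.9")},
    {},
]


def converged(replicas):
    return all(replica == replicas[0] for replica in replicas)


def anti_entropy_round(replicas):
    """Each replica exchanges state with its neighbour in a ring."""
    n = len(replicas)
    for i in range(n):
        j = (i + 1) % n
        merged = merge(replicas[i], replicas[j])
        replicas[i] = dict(merged)
        replicas[j] = dict(merged)


consistent_before = converged(replicas)
anti_entropy_round(replicas)
consistent_after = converged(replicas)
```

No new writes arrive during the exchange, which matches the condition stated above: the replicas converge once input stops.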

21
Q

Conflict resolution on distributed databases

A

A conflict on a document is detectable by all the nodes involved, and each node resolves it locally.
CouchDB guarantees that each instance that sees the same conflict will come up with the same winning solution, using a deterministic algorithm to pick the winner.
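
A deterministic winner rule can be sketched as follows. This is a simplified approximation, not CouchDB's exact algorithm: it assumes revision identifiers of the form `generation-digest`, prefers the higher generation, and breaks ties by comparing digests as strings. The `pick_winner` function and the sample revision IDs are hypothetical.

```python
def pick_winner(revisions):
    """Choose the same winner regardless of the order revisions arrive in."""
    def sort_key(rev):
        generation, _, digest = rev.partition("-")
        # Higher generation first; ties broken by the digest string.
        return (int(generation), digest)
    return max(revisions, key=sort_key)


# Two nodes see the same conflicting revisions in different orders.
node_a_sees = ["2-aaa111", "2-bbb222"]
node_b_sees = ["2-bbb222", "2-aaa111"]

winner_on_a = pick_winner(node_a_sees)
winner_on_b = pick_winner(node_b_sees)
```

Since the rule depends only on the revisions themselves, no coordination between nodes is needed to agree on the winner.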

22
Q

MongoDB

A

Pros:

  • Business ready representation of data
  • Flexible and rich
  • Mapping into developer-language objects

Cons:

  • Relations among documents are inefficient
  • Temptation to go too schema-free even with structured data
23
Q

Hadoop

A

Big-data everything platform.

  • Hadoop distributed file system
  • Hadoop YARN
  • Hadoop MapReduce

Features:

  • Distributed, fault-tolerant storage
  • Parallel, scalable processing
  • High availability
  • Low cost