Week 5: NoSQL Databases and MongoDB Flashcards

1
Q

NoSQL

A

NoSQL databases are non-relational, highly scalable, and fault-tolerant. They are designed for large, distributed, semi-structured and unstructured data, built mostly for queries with few asynchronous inserts and updates, and accessed through API-based query interfaces and data-specific query languages.

Availability is favoured over consistency, approximate answers are acceptable, and overall the system is simpler and faster.

2
Q

ACID

A

Relational databases have the following 4 properties:

  1. Atomicity: each transaction is a single, indivisible unit.
  2. Consistency: the data is accurate and meets pre-existing requirements after each transaction.
  3. Isolation: concurrent transactions don’t affect each other.
  4. Durability: changes resulting from transactions are stored even in the event of failures.
3
Q

BASE

A

This acronym describes the properties of NoSQL databases.

  1. Basically Available: the client’s request will always be acknowledged. Availability is prioritised even if system failures may jeopardise successful completion of the client’s request.
  2. Soft State: the data may be inconsistent when it’s read.
  3. Eventually Consistent: read requests after write requests may not return consistent results, but they’ll be updated once changes are propagated to all nodes.
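
Eventual consistency can be sketched in plain JavaScript. This is only an illustration under assumed names: the EventuallyConsistentStore class, its two in-memory replicas, and the propagate() step are hypothetical and not part of any real database API.

```javascript
// Hypothetical two-replica store; names are illustrative, not a real API.
class EventuallyConsistentStore {
  constructor() {
    this.replicas = [new Map(), new Map()];
    this.pending = []; // writes not yet propagated to replica 1
  }

  // A write is acknowledged as soon as replica 0 has it.
  write(key, value) {
    this.replicas[0].set(key, value);
    this.pending.push([key, value]);
  }

  // A read may be served by any replica, so it can be stale.
  read(key, replicaIndex) {
    return this.replicas[replicaIndex].get(key);
  }

  // Propagation brings replica 1 up to date.
  propagate() {
    for (const [key, value] of this.pending) {
      this.replicas[1].set(key, value);
    }
    this.pending = [];
  }
}

const store = new EventuallyConsistentStore();
store.write("qty", 20);
const staleRead = store.read("qty", 1); // replica 1 has not seen the write yet
store.propagate();
const freshRead = store.read("qty", 1); // replicas have converged
```

Between write() and propagate() the state is "soft": the answer depends on which replica serves the read.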
4
Q

3 V’s of Big Data

A

Volume: NoSQL databases allow scaling out (adding more commodity servers as nodes).

Velocity: fast writes using schema-on-read (the schema is applied to the data as they leave the database, not when they are written). This allows for low write latency (adding nodes decreases latency).

Variety: can store semi-structured and unstructured data (schema is loose or non-existent).
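
A rough sketch of schema-on-read, assuming a hypothetical in-memory store: write() keeps whatever it is given, and the schema (item and qty here, chosen only for illustration) is applied when the data is read back.

```javascript
// Writes store the raw record as-is: no validation, so the write path is cheap.
const rawStore = [];
function write(record) {
  rawStore.push(JSON.stringify(record));
}

// The (hypothetical) schema is applied only when data is read back.
const schema = { item: String, qty: Number };
function read() {
  return rawStore.map((raw) => {
    const doc = JSON.parse(raw);
    const shaped = {};
    for (const [field, cast] of Object.entries(schema)) {
      // Missing fields become null; present fields are cast to the schema type.
      shaped[field] = field in doc ? cast(doc[field]) : null;
    }
    return shaped;
  });
}

write({ item: "box", qty: "20" });      // qty arrives as a string
write({ item: "mat", colour: "gray" }); // no qty, plus an extra field
const rows = read();
```

Both writes succeed even though neither record matches the schema exactly; the mismatch is only resolved at read time.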

5
Q

RDBMS vs NoSQL

A

Elastic Scaling:
- RDBMS scales up, with a bigger server handling bigger loads.
- NoSQL scales out by distributing data across multiple hosts seamlessly.

Big Data:
- RDBMS doesn’t scale up well to handle big data.
- NoSQL is designed for big data.

DBA Specialists:
- RDBMS requires highly trained experts to monitor DB.
- NoSQL requires less management, automatically repairs itself, and has simpler data models.

Flexible Data Models:
- RDBMS needs careful schema change management.
- NoSQL databases don’t need complicated schema management.

Economic Cost:
- RDBMS relies on expensive proprietary servers to manage data.
- NoSQL uses clusters of cheap commodity servers to manage data and transaction volumes. The cost per gigabyte or per transaction/second for NoSQL can be lower than the cost for RDBMS.

Lack of Expertise:
- There are plenty of experienced RDBMS developers.
- There are fewer NoSQL developers.

Analytics and Business Intelligence:
- RDBMS is designed for analytics.
- NoSQL is designed for the needs of Web 2.0, not for ad hoc data queries.

6
Q

NoSQL Database Types:

A

Key/Value: “Hashtable” of keys
Examples: Redis, Riak

Document: stores documents comprised of tagged elements
Examples: MongoDB, CouchDB

Column-family: each storage block contains data from one column
Examples: Cassandra, HBase

Graph: stores graph-structured data (nodes and edges)
Examples: Neo4j, HyperGraphDB

7
Q

Key-value Databases

A

They store key-value pairs, with keys being unique. Values are only retrievable using keys and are opaque to the database. Key-value pairs are organised into collections/buckets. Data are partitioned across nodes by keys. The partition for a key is determined by hashing the key.

Pros:
- Very fast, simple model, able to scale horizontally
- Good for unstructured data, fast read/writes, when a key suffices for identifying a value, no dependencies among values, and simple insert/delete/select operations.

Cons:
- Many data structures (objects) can’t be easily modelled as key-value pairs
- Not good for operations (search, filter, update) on individual attributes of a value, and operations on multiple keys in a single transaction.
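
Partitioning by hashed key can be sketched as follows. The hash function and the PartitionedStore class are illustrative assumptions, not what Redis, Riak, or any particular database actually uses.

```javascript
// Simple string hash (illustrative only).
function hashKey(key) {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// Hypothetical store that spreads keys over a fixed number of nodes.
class PartitionedStore {
  constructor(nodeCount) {
    this.nodes = Array.from({ length: nodeCount }, () => new Map());
  }

  // The partition for a key is determined by hashing the key.
  nodeFor(key) {
    return hashKey(key) % this.nodes.length;
  }

  put(key, value) {
    this.nodes[this.nodeFor(key)].set(key, value);
  }

  // Values are retrievable only by key; the store never looks inside them.
  get(key) {
    return this.nodes[this.nodeFor(key)].get(key);
  }
}

const kv = new PartitionedStore(3);
kv.put("user:1", { name: "Ada" });
kv.put("user:2", { name: "Lin" });
const ada = kv.get("user:1");
```

Note there is no way to ask "which users are named Ada" without scanning every node, which is the con listed above.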

8
Q

Document Databases

A

These store documents in semi-structured form. A document is a nested structure in JSON or XML format.

Suitable for:
- Semi-structured data with a flat or nested schema.
- Search for different values of the document.
- Updates on subsets of values.
- CRUD (Create, Read, Update, Delete) operations.
- Schema changes are likely.

Unsuitable for:
- Binary data.
- Updates on multiple documents in a single transaction.
- Joins between multiple documents.

9
Q

Key-value vs Document Databases

A
  1. In document databases, each document has a unique key.
  2. Document databases provide more support for value operations because they’re aware of values: selection operations can retrieve fields or parts of values, subsets of values can be updated together, indexes are supported, and each document has a schema that can be inferred from the structure of the value.
10
Q

Column-family Databases

A

These databases store columns, with each column having a name and value. Columns related to each other are grouped into rows. Rows don’t necessarily have a fixed schema or number of columns.

Suitable for:
- Data that has a tabular structure with many columns and sparsely populated rows.
- Columns that are interrelated and accessed together often.
- OLAP (Online Analytical Processing).
- Realtime random read-write is needed.
- Insert/select/update/delete operations.

Unsuitable for:
- Joins.
- ACID support is needed.
- Binary Data.
- SQL-compliant queries.
- Frequently changing query patterns that lead to column restructuring.

Applications:
- Data warehousing
- Data Mining
- Google BigTable
- RDF (Resource Description Framework)
- Info Retrieval
- Scientific Datasets

11
Q

Graph Databases

A

Data is stored in a graph-like structure. Nodes represent entities and have sets of attributes. Edges represent relationships and have sets of attributes. These databases are optimised for representing connections, as adding and removing edges and attributes are easy. The underlying storage can be native graph storage, relational database, key/value database, document database, etc.

Suitable for:
- Data comprised of interconnected entities.
- Queries are based on entity relationships.
- Need to find groups of interconnected entities.
- Need to find distances between entities.

Unsuitable for:
- Joins.
- ACID support is needed.
- Binary data.
- SQL-compliant queries.
- Frequently changing query patterns that lead to column restructuring.

Applications
- Social
- Recommendation
- Geography
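
The "distances between entities" use case can be sketched with a breadth-first search over an adjacency list. The follows graph below is made-up data, and a real graph database would expose this through its own query language rather than hand-written traversal code.

```javascript
// Adjacency list for a tiny, made-up social graph (node -> nodes it points to).
const follows = {
  alice: ["bob", "carol"],
  bob: ["dave"],
  carol: ["dave"],
  dave: ["erin"],
  erin: [],
};

// Breadth-first search: returns the number of edges on a shortest path
// from start to goal, or -1 if goal is unreachable.
function distance(graph, start, goal) {
  const seen = new Set([start]);
  let frontier = [start];
  let hops = 0;
  while (frontier.length > 0) {
    if (frontier.includes(goal)) return hops;
    const next = [];
    for (const node of frontier) {
      for (const neighbour of graph[node] || []) {
        if (!seen.has(neighbour)) {
          seen.add(neighbour);
          next.push(neighbour);
        }
      }
    }
    frontier = next;
    hops += 1;
  }
  return -1;
}

const aliceToErin = distance(follows, "alice", "erin");
const erinToAlice = distance(follows, "erin", "alice");
```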

12
Q

MongoDB

A

It’s a document database. It’s hash-based, meaning that it stores hashes (system-assigned _id) with keys and values for each document. MongoDB has a dynamic schema and uses the BSON (Binary JSON) format. It has APIs for many languages.

13
Q

MongoDB: Insert

A

Example:
To insert a document with _id of 10, field item with value of “box”, and field qty with a value of 20,

db.products.insert({_id: 10, item: "box", qty: 20})

Example:
Inserting multiple documents,

db.inventory.insertMany([
  {item: "journal", qty: 25, tags: ["blank", "red"], size: {h: 14, w: 21, uom: "cm"}},
  {item: "mat", qty: 85, tags: ["gray"], size: {h: 27.9, w: 35.5, uom: "cm"}}
])

14
Q

MongoDB: Find

A

Example:
Finding documents with a quantity greater than 4,

db.products.find({qty: {$gt: 4}})
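
For intuition, the same "quantity greater than 4" selection can be written in plain JavaScript over an in-memory array. The products array here is made-up sample data, and this only mirrors the meaning of the $gt filter, not how MongoDB executes it.

```javascript
// Made-up sample documents.
const products = [
  { _id: 1, item: "box", qty: 20 },
  { _id: 2, item: "pen", qty: 3 },
  { _id: 3, item: "mat", qty: 85 },
];

// Plain-JavaScript equivalent of the filter {qty: {$gt: 4}}.
const matches = products.filter((doc) => doc.qty > 4);
```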

15
Q

MongoDB: Update

A

Example:

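db.books.update(
  {_id: 1},                                // filter: which document to update
  {
    $inc: {stock: 5},                      // add 5 to stock
    $set: {
      item: "ABC123",
      "info.publisher": "2222",            // dot notation: nested field
      tags: ["software"],
      "ratings.1": {by: "xyz", rating: 3}  // dot notation: array index 1
    }
  }
)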
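
What $inc and $set with dot notation do to a document can be sketched in plain JavaScript. The setPath and applyUpdate helpers and the sample book document are hypothetical, and this is a simplified stand-in for the operators rather than MongoDB's implementation.

```javascript
// Set a value at a dotted path such as "info.publisher" or "ratings.1".
function setPath(doc, path, value) {
  const parts = path.split(".");
  let target = doc;
  for (const part of parts.slice(0, -1)) {
    if (target[part] === undefined) target[part] = {};
    target = target[part];
  }
  target[parts[parts.length - 1]] = value;
}

// Simplified $inc (top-level fields only) and $set.
function applyUpdate(doc, update) {
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    doc[field] = (doc[field] || 0) + amount;
  }
  for (const [path, value] of Object.entries(update.$set || {})) {
    setPath(doc, path, value);
  }
  return doc;
}

// Made-up document standing in for the book with _id 1.
const book = {
  _id: 1,
  item: "TBD",
  stock: 0,
  info: { publisher: "1111", pages: 430 },
  tags: ["technology", "computer"],
  ratings: [{ by: "ijk", rating: 4 }, { by: "lmn", rating: 5 }],
};

applyUpdate(book, {
  $inc: { stock: 5 },
  $set: {
    item: "ABC123",
    "info.publisher": "2222",
    tags: ["software"],
    "ratings.1": { by: "xyz", rating: 3 },
  },
});
```

Only the targeted fields change: info.pages and ratings[0] are left as they were, while tags is replaced as a whole.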

16
Q

MongoDB: Remove

A

Example:
Remove all documents in the collection “products”,

db.products.remove({})

Example:
Remove all documents with item=box

db.products.remove({"item": "box"})

Example:
Remove all documents with a quantity greater than 20,

db.products.remove(
  {qty: {$gt: 20}}
)

17
Q

MongoDB: Index Support

A

Users can create, view, and drop indexes.

Commands:
View
db.system.indexes.find()

Get Indexes on collectionA
db.collectionA.getIndexes()

Drop all indexes (other than the required one on _id)
db.collectionA.dropIndexes()

Drop index with name “catIdx”
db.collectionA.dropIndex("catIdx")

Create 2dsphere Index on “loc” field
db.collectionA.createIndex({loc: "2dsphere"})

18
Q

MongoDB: Replication

A

This is a feature of MongoDB. Multiple replicas of datasets are stored. This provides scalability, availability, and fault tolerance. The primary instance (replica) receives operation requests. The secondary instances apply operations to their data. If the primary instance doesn’t communicate with its secondaries for over 10 seconds, one of the secondary instances becomes the new primary instance after elections.
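
The failover rule can be sketched as below. This is a heavy simplification under stated assumptions: the member objects, the checkFailover function, and the "most recently applied operation wins" rule are illustrative only, and MongoDB's real elections also take things like member priority and votes into account.

```javascript
const ELECTION_TIMEOUT_MS = 10000; // the 10-second window described above

// Returns the name of the primary after checking for a silent primary.
function checkFailover(members, now) {
  const primary = members.find((m) => m.role === "primary");
  if (now - primary.lastHeartbeat <= ELECTION_TIMEOUT_MS) {
    return primary.name; // primary is still in contact: nothing changes
  }
  // Simplified election (assumption): the secondary that has applied
  // the most recent operation becomes the new primary.
  const candidates = members.filter((m) => m.role === "secondary");
  const winner = candidates.reduce((best, m) =>
    m.lastApplied > best.lastApplied ? m : best
  );
  primary.role = "secondary";
  winner.role = "primary";
  return winner.name;
}

// Made-up replica set: times in milliseconds, lastApplied is an operation counter.
const members = [
  { name: "n1", role: "primary", lastHeartbeat: 0, lastApplied: 42 },
  { name: "n2", role: "secondary", lastHeartbeat: 14000, lastApplied: 41 },
  { name: "n3", role: "secondary", lastHeartbeat: 14000, lastApplied: 42 },
];

const primaryAt5s = checkFailover(members, 5000);   // within the window
const primaryAt15s = checkFailover(members, 15000); // n1 silent for over 10 seconds
```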

19
Q

MongoDB: Sharding

A

This process horizontally partitions the dataset into shards that are distributed across multiple nodes. Each node is only responsible for its shard. If shards are unavailable, partial reads and writes help with availability.

Benefits:
- Efficient reads and writes, they’re distributed across shards
- Storage capacity, each shard has part of the dataset
- High availability, partial read/write operations are performed if shards are unavailable

20
Q

MongoDB: App Server

A

Each App Server has a single Router (mongos), which acts as an interface between the applications and the sharded cluster. It processes all requests and decides how the query is distributed based on the metadata from the config server.
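
The routing decision can be sketched as a lookup against chunk metadata. This assumes range-based sharding on a numeric shard key; the chunks table, shard names, and both functions are hypothetical stand-ins for what the router reads from the config server.

```javascript
// Hypothetical chunk metadata: each chunk covers shard-key values in [min, max).
const chunks = [
  { min: -Infinity, max: 100, shard: "shardA" },
  { min: 100, max: 200, shard: "shardB" },
  { min: 200, max: Infinity, shard: "shardC" },
];

// A query on one shard-key value goes to exactly one shard.
function routeKey(key) {
  return chunks.find((c) => key >= c.min && key < c.max).shard;
}

// A query on a key range [low, high) goes to every shard whose chunk overlaps it.
function routeRange(low, high) {
  return chunks
    .filter((c) => c.min < high && low < c.max)
    .map((c) => c.shard);
}

const single = routeKey(150);        // one target shard
const several = routeRange(50, 150); // spans two chunks
```

A query that does not constrain the shard key at all would have to be sent to every shard.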

21
Q

MongoDB: Config Servers (replica set)

A

These store the metadata and configuration settings for clusters. Config servers in sharded clusters can be implemented as replica sets.

22
Q

MongoDB: Shard (replica set)

A

To benefit from replication, shards and config servers may be implemented as replica sets.

23
Q

MongoDB: MapReduce Functionality

db.orders.mapReduce(
function() {emit(this.cust_id,this.amount);},
function(key,values) {return Array.sum(values)},
{
query: {status: "A"},
out: "order_totals"
}
)

A

function() line is the map function: it maps each input document to a key (cust_id) and a value (amount) and emits the key-value pair.

function(key,values) line reduces all values associated with a particular key to a single object.

query line selects the input documents to the map function.

out line is the location of the result.

“this” refers to the document that the map-reduce operation is processing
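
The same flow can be traced in plain JavaScript over an in-memory array. This is a sketch, not MongoDB's implementation: the mapReduce helper and the orders data are made up, emit is passed in as an argument instead of being a global, the query is a predicate function, and a manual sum replaces the shell's Array.sum.

```javascript
// Made-up input documents.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" },
];

// Hypothetical helper mirroring the query -> map -> reduce -> out steps.
function mapReduce(docs, map, reduce, options) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // query selects the input documents; "this" inside map is the current document.
  docs.filter(options.query).forEach((doc) => map.call(doc, emit));
  // reduce collapses all values emitted for a key into a single value.
  return [...groups].map(([key, values]) => ({
    _id: key,
    value: reduce(key, values),
  }));
}

const orderTotals = mapReduce(
  orders,
  function (emit) { emit(this.cust_id, this.amount); },
  function (key, values) { return values.reduce((sum, v) => sum + v, 0); },
  { query: (doc) => doc.status === "A" }
);
```

The order with status "D" never reaches the map function, so it does not contribute to the total for A123.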

24
Q

CAP Theorem

A

This theorem applies to distributed systems and deals with the trade-offs between three properties:
1. Consistency
2. Availability
3. Partition Tolerance (the system continues to operate even if network partitions divide the system into isolated groups).

A distributed system can guarantee at most 2 of these 3 properties at the same time.