Week 5: NoSQL Databases and MongoDB Flashcards
NoSQL
NoSQL databases are non-relational, highly scalable, and fault-tolerant. They’re designed for large, distributed, semi-structured and unstructured data, are built mostly for queries with few asynchronous inserts and updates, and are accessed through API-based query interfaces and data-specific query languages.
Availability is favoured over consistency, approximate answers are acceptable, and overall the system is simpler and faster.
ACID
Transactions in relational databases have the following 4 properties:
- Atomicity: each transaction is a single, indivisible unit.
- Consistency: the data is accurate and meets pre-existing requirements after each transaction.
- Isolation: concurrent transactions don’t affect each other.
- Durability: changes resulting from transactions are stored even in the event of failures.
BASE
This acronym describes the properties of NoSQL databases.
- Basically Available: the client’s request will always be acknowledged. Availability is prioritised even if system failures may jeopardise successful completion of the client’s request.
- Soft State: the data may be inconsistent when it’s read.
- Eventually Consistent: read requests after write requests may not return consistent results, but they’ll be updated once changes are propagated to all nodes.
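To make soft state and eventual consistency concrete, here is a minimal in-memory sketch in plain JavaScript. It isn't how any particular database replicates data: the three `replicas` and the `write`, `read`, and `propagate` helpers are hypothetical names made up for illustration.

```javascript
// Toy model: a write is acknowledged by one replica immediately
// and copied to the others later. All names here are illustrative.
const replicas = [new Map(), new Map(), new Map()];
const pending = []; // changes not yet copied to every replica

function write(key, value) {
  replicas[0].set(key, value); // acknowledged straight away
  pending.push([key, value]);  // propagation is deferred
}

function read(replicaIndex, key) {
  return replicas[replicaIndex].get(key);
}

function propagate() {
  // Apply pending changes, in order, to the other replicas.
  for (const [key, value] of pending) {
    for (const replica of replicas.slice(1)) replica.set(key, value);
  }
  pending.length = 0;
}

write("stock", 5);
const staleRead = read(2, "stock"); // replica 2 hasn't seen the write yet
propagate();
const freshRead = read(2, "stock"); // now agrees with replica 0
```

Between `write` and `propagate` the replicas disagree (soft state); after `propagate` every replica returns the same value (eventual consistency).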
3 V’s of Big Data
Volume: NoSQL databases allow scaling out (adding more commodity servers, or nodes, to the cluster).
Velocity: fast writes using schema-on-read (the schema is applied to the data when they’re read from the database, not when they’re written). This allows for low write latency, and adding nodes decreases latency further.
Variety: can store semi-structured and unstructured data (schema is loose or non-existent).
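Schema-on-read can be sketched in a few lines of plain JavaScript. This is only an illustration under assumed names: `store`, `writeRaw`, `readSchema`, and `readWithSchema` are hypothetical and don't belong to any real database API.

```javascript
// Writes store whatever arrives; no schema check happens here.
const store = [];
function writeRaw(record) {
  store.push(JSON.stringify(record));
}

// A hypothetical schema that is applied only when data are read back.
const readSchema = { item: String, qty: Number };

function readWithSchema(raw) {
  const parsed = JSON.parse(raw);
  const shaped = {};
  for (const [field, convert] of Object.entries(readSchema)) {
    // Missing fields become null rather than failing the read.
    shaped[field] = field in parsed ? convert(parsed[field]) : null;
  }
  return shaped;
}

writeRaw({ item: "box", qty: "20" });      // qty arrives as a string
writeRaw({ item: "mat", colour: "gray" }); // no qty, one extra field

const rows = store.map(readWithSchema);
```

Both writes succeed without validation, which is where the low write latency comes from; the cost of shaping the data is paid at read time instead.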
RDBMS vs NoSQL
Elastic Scaling:
- RDBMS scales up, with a bigger server handling bigger loads.
- NoSQL scales out by distributing data across multiple hosts seamlessly.
Big Data:
- RDBMS doesn’t scale up well to handle big data.
- NoSQL is designed for big data.
DBA Specialists:
- RDBMS requires highly trained experts to monitor DB.
- NoSQL requires less management, automatically repairs itself, and has simpler data models.
Flexible Data Models:
- RDBMS needs careful schema change management.
- NoSQL databases don’t need complicated schema management.
Economic Cost:
- RDBMS relies on expensive proprietary servers to manage data.
- NoSQL uses clusters of cheap commodity servers to manage data and transaction volumes. The cost per gigabyte or per transaction/second for NoSQL can be lower than the cost for RDBMS.
Lack of Expertise:
- There are plenty of experienced RDBMS developers.
- There are fewer NoSQL developers.
Analytics and Business Intelligence:
- RDBMS is designed for analytics.
- NoSQL is designed for the needs of Web 2.0, not for ad hoc data queries.
NoSQL Database Types:
Key/Value: “Hashtable” of keys
Examples: Redis, Riak
Document: stores documents comprised of tagged elements
Examples: MongoDB, CouchDB
Column-family: each storage block contains data from one column
Examples: Cassandra, HBase
Graph: stores graph-structured data (nodes and edges)
Examples: Neo4j, HyperGraphDB
Key-value Databases
They store key-value pairs, with keys being unique. Values are only retrievable using keys and are opaque to the database. Key-value pairs are organised into collections/buckets. Data are partitioned across nodes by keys. The partition for a key is determined by hashing the key.
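The hash partitioning described above can be sketched as follows. This is a simplified illustration, not the scheme of any specific product: the three-node cluster, the `hashKey` function, and the `put`/`get` helpers are all assumptions made for the example.

```javascript
// Hypothetical cluster of three nodes, each holding its own key-value map.
const nodes = [new Map(), new Map(), new Map()];

// Simple string hash, for illustration only. Real systems use stronger
// hashes, and often consistent hashing so adding a node moves fewer keys.
function hashKey(key) {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep an unsigned 32-bit value
  }
  return hash;
}

// The partition (node) for a key is chosen from the key's hash.
function nodeFor(key) {
  return nodes[hashKey(key) % nodes.length];
}

function put(key, value) {
  nodeFor(key).set(key, value);
}

function get(key) {
  return nodeFor(key).get(key);
}

put("user:1", '{"name":"Ada"}'); // the value is an opaque string to the store
put("user:2", '{"name":"Lin"}');
```

Because the same key always hashes to the same node, `get("user:1")` goes straight to the node that stored it, without searching the others.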
Pros:
- Very fast, simple model, able to scale horizontally
- Good for unstructured data, fast read/writes, when a key suffices for identifying a value, no dependencies among values, and simple insert/delete/select operations.
Cons:
- Many data structures (objects) can’t be easily modelled as key-value pairs
- Not good for operations (search, filter, update) on individual attributes of a value, and operations on multiple keys in a single transaction.
Document Databases
These store documents in semi-structured form. A document has a nested structure, typically in JSON or XML format.
Suitable for:
- Semi-structured data with a flat or nested schema.
- Searches on different values within a document.
- Updates on subsets of values.
- CRUD (Create, Read, Update, Delete) operations.
- Schema changes are likely.
Unsuitable for:
- Binary data.
- Updates on multiple documents in a single transaction.
- Joins between multiple documents.
Key-value vs Document Databases
- In document databases, each document has a unique key
- Document databases provide more support for operations on values because they’re aware of each value’s structure: selection operations can retrieve fields or parts of values, subsets of values can be updated together, indexes are supported, and each document has a schema that can be inferred from the structure of its value.
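The difference can be illustrated with two tiny in-memory stores in plain JavaScript. Both stores and their sample data are made up for this sketch; neither mimics a real product's API.

```javascript
// Key-value style: the store sees each value as an opaque string,
// so the only lookup it offers is by key.
const kvStore = new Map([
  ["p1", '{"item":"box","qty":20}'],
  ["p2", '{"item":"mat","qty":85}'],
]);
const byKey = kvStore.get("p1"); // the client must parse this itself

// Document style: the store understands the fields inside each value,
// so it can filter on a field and return only part of a document.
const docStore = [
  { _id: "p1", item: "box", qty: 20 },
  { _id: "p2", item: "mat", qty: 85 },
];
const bigItems = docStore
  .filter((doc) => doc.qty > 50) // selection on a field
  .map((doc) => doc.item);       // projection of one field
```

Answering "which items have a quantity above 50?" against `kvStore` would mean fetching and parsing every value on the client side.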
Column-family Databases
These databases store columns, with each column having a name and value. Columns related to each other are grouped into rows. Rows don’t necessarily have a fixed schema or number of columns.
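A rough in-memory picture of this model, in plain JavaScript: each row is a map from column name to value, and rows don't have to share the same columns. The `table`, `putColumn`, and `getColumn` names are hypothetical, and real column-family stores add column families, timestamps, and on-disk layouts that this sketch ignores.

```javascript
// Row key -> (column name -> value). Rows may have different columns.
const table = new Map();

function putColumn(rowKey, column, value) {
  if (!table.has(rowKey)) table.set(rowKey, new Map());
  table.get(rowKey).set(column, value);
}

function getColumn(rowKey, column) {
  const row = table.get(rowKey);
  return row ? row.get(column) : undefined;
}

putColumn("user:1", "name", "Ada");
putColumn("user:1", "email", "ada@example.com");
putColumn("user:2", "name", "Lin"); // this row has no email column

// Number of columns actually stored per row.
const columnCounts = [...table].map(([key, row]) => [key, row.size]);
```

A column that a row doesn't have simply takes no space, which is why sparsely populated rows are cheap in this model.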
Suitable for:
- Data that has a tabular structure with many columns and sparsely populated rows.
- Columns that are interrelated and accessed together often.
- OLAP (Online Analytical Processing).
- Realtime random read-write is needed.
- Insert/select/update/delete operations.
Unsuitable for:
- Joins.
- ACID support is needed.
- Binary Data.
- SQL-compliant queries.
- Frequently changing query patterns that lead to column restructuring.
Applications:
- Data warehousing
- Data Mining
- Google BigTable
- RDF (Resource Description Framework)
- Info Retrieval
- Scientific Datasets
Graph Databases
Data is stored in a graph-like structure. Nodes represent entities and have sets of attributes. Edges represent relationships and also have sets of attributes. These databases are optimised for representing connections, as adding and removing edges and attributes is easy. The underlying storage can be native graph storage, a relational database, a key/value database, a document database, etc.
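As a small illustration of nodes, edges, and a distance query, here is a sketch in plain JavaScript. The people, the `FRIEND` edges, and the `distance` helper are made up for the example; a real graph database would expose this through its own query language.

```javascript
// Nodes and edges both carry attributes; all names and data are made up.
const people = new Map([
  ["ann", { name: "Ann" }],
  ["bob", { name: "Bob" }],
  ["cat", { name: "Cat" }],
  ["dan", { name: "Dan" }],
]);
const edges = [
  { from: "ann", to: "bob", type: "FRIEND", since: 2019 },
  { from: "bob", to: "cat", type: "FRIEND", since: 2021 },
];

// Build an undirected adjacency list from the edge list.
const neighbours = new Map([...people.keys()].map((id) => [id, []]));
for (const { from, to } of edges) {
  neighbours.get(from).push(to);
  neighbours.get(to).push(from);
}

// Breadth-first search: number of edges on the shortest path,
// or -1 if the goal can't be reached from the start.
function distance(start, goal) {
  const seen = new Set([start]);
  let frontier = [start];
  for (let hops = 0; frontier.length > 0; hops++) {
    if (frontier.includes(goal)) return hops;
    const next = [];
    for (const id of frontier) {
      for (const n of neighbours.get(id)) {
        if (!seen.has(n)) {
          seen.add(n);
          next.push(n);
        }
      }
    }
    frontier = next;
  }
  return -1;
}
```

Here `distance("ann", "cat")` follows two `FRIEND` edges, while `dan` has no edges and so is unreachable from anyone else.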
Suitable for:
- Data comprised of interconnected entities.
- Queries are based on entity relationships.
- Need to find groups of interconnected entities.
- Need to find distances between entities.
Unsuitable for:
- Joins.
- ACID support is needed.
- Binary data.
- SQL-compliant queries.
- Frequently changing query patterns that lead to column restructuring.
Applications:
- Social
- Recommendation
- Geography
MongoDB
It’s a document database. It’s hash-based, meaning that it stores hashes (each with a system-assigned _id) of keys and values for each document. MongoDB has a dynamic schema and uses the BSON (Binary JSON) format. It has APIs for many languages.
MongoDB: Insert
Example:
To insert a document with an _id of 10, a field item with a value of “box”, and a field qty with a value of 20,
db.products.insert({_id:10,item:"box",qty:20})
Example:
Inserting multiple documents,
db.inventory.insertMany([
{item:"journal",qty:25,tags:["blank","red"],size:{h:14,w:21,uom:"cm"}},
{item:"mat",qty:85,tags:["gray"],size:{h:27.9,w:35.5,uom:"cm"}}
])
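To show how _id values behave on insert without needing a running server, here is an in-memory sketch in plain JavaScript. It's an approximation under stated assumptions: the `inventory` map and the local `insertOne`/`insertMany` functions are stand-ins written for this example, and the `auto-N` ids are made up (MongoDB itself generates ObjectId values when _id is omitted).

```javascript
const inventory = new Map(); // _id -> stored document
let generated = 0;           // counter behind the made-up generated ids

function insertOne(doc) {
  // Use the caller's _id if given, otherwise generate one.
  const _id = doc._id ?? `auto-${++generated}`;
  // _id must be unique within the collection.
  if (inventory.has(_id)) throw new Error(`duplicate _id: ${_id}`);
  inventory.set(_id, { ...doc, _id });
  return _id;
}

function insertMany(docs) {
  return docs.map((doc) => insertOne(doc));
}

const ids = insertMany([
  { _id: 10, item: "box", qty: 20 },
  { item: "journal", qty: 25, tags: ["blank", "red"] },
  { item: "mat", qty: 85, tags: ["gray"] },
]);
```

Inserting another document with `_id: 10` would throw in this sketch, mirroring the rule that _id is unique per collection.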
MongoDB: Find
Example:
Finding documents with a quantity greater than 4,
db.products.find({qty:{$gt:4}})
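What a "greater than" filter does can be sketched on an in-memory array in plain JavaScript. The `products` data and the `matches` helper are hypothetical and cover only plain equality and `$gt`; MongoDB's real matching handles many more operators and type rules.

```javascript
// Tiny in-memory stand-in for a collection (sample data is made up).
const products = [
  { _id: 1, item: "pen", qty: 3 },
  { _id: 10, item: "box", qty: 20 },
  { _id: 11, item: "mat", qty: 85 },
];

// Hypothetical matcher: supports plain equality and the $gt operator only.
function matches(doc, filter) {
  return Object.entries(filter).every(([field, condition]) => {
    if (condition !== null && typeof condition === "object" && "$gt" in condition) {
      return doc[field] > condition.$gt;
    }
    return doc[field] === condition;
  });
}

// Filter document: qty greater than 4.
const found = products.filter((doc) => matches(doc, { qty: { $gt: 4 } }));
```

The filter is itself a document: the field name maps to an operator object, and the operator maps to the value to compare against.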
MongoDB: Update
Example:
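db.books.update(
  {_id:1},                              // filter: the document to update
  {
    $inc:{stock:5},                     // add 5 to stock
    $set:{                              // overwrite the listed fields
      item:"ABC123",
      "info.publisher":"2222",          // dot notation reaches a nested field
      tags:["software"],
      "ratings.1":{by:"xyz",rating:3}   // dot notation with an array index
    }
  }
)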
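The effect of `$inc`, `$set`, and dotted field paths can be traced with an in-memory sketch in plain JavaScript. The `resolvePath` and `applyUpdate` helpers and the starting `book` document are assumptions made for this example; they imitate only the two operators used in this flashcard, not MongoDB's full update semantics.

```javascript
// Walk a dotted path such as "info.publisher" and return the parent
// container plus the final key. Missing intermediate objects are created.
function resolvePath(doc, path) {
  const parts = path.split(".");
  let target = doc;
  for (const part of parts.slice(0, -1)) {
    if (target[part] === undefined) target[part] = {};
    target = target[part];
  }
  return [target, parts[parts.length - 1]];
}

// Hypothetical helper covering only $inc and $set.
function applyUpdate(doc, update) {
  for (const [path, amount] of Object.entries(update.$inc ?? {})) {
    const [target, key] = resolvePath(doc, path);
    target[key] = (target[key] ?? 0) + amount;
  }
  for (const [path, value] of Object.entries(update.$set ?? {})) {
    const [target, key] = resolvePath(doc, path);
    target[key] = value;
  }
  return doc;
}

// Made-up starting document.
const book = {
  _id: 1,
  item: "TBD",
  stock: 0,
  info: { publisher: "1111", pages: 430 },
  tags: ["technology", "computer"],
  ratings: [{ by: "ijk", rating: 4 }, { by: "lmn", rating: 5 }],
};

applyUpdate(book, {
  $inc: { stock: 5 },
  $set: {
    item: "ABC123",
    "info.publisher": "2222",
    tags: ["software"],
    "ratings.1": { by: "xyz", rating: 3 },
  },
});
```

After the update, `info.pages` and `ratings[0]` are untouched: dot notation changes only the addressed nested field or array element, whereas setting `tags` replaces the whole array.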