The NoSQL Ecosystem Flashcards

1
Q

NoSQL: definition, according to the community

A

Not Only SQL: providing an alternative to SQL rather than a wholesale replacement for it

2
Q

SQL’s expressiveness makes it challenging to ___

A

reason about the cost of each query, and thus the cost of a workload

3
Q

Application developers may find using relational data models to be challenging because ___

A

it may not be perfect for modeling every kind of data (e.g. lists, queues, sets)

4
Q

If relational data grows past the capacity of one server, then ___

A

the tables in the database will have to be partitioned across computers, and the data will have to be denormalized to avoid JOINs that cross the network

5
Q

Complex query logic is typically left to the application, resulting in ___

A

a data store with more predictable query performance because of lack of variability in queries

6
Q

Two characteristics of Google’s BigTable

A

hierarchical range-based partitioning scheme

strict consistency

7
Q

Two characteristics of Amazon’s Dynamo

A

Maps keys to application-specific blobs of data

Loose consistency makes the partitioning model resilient to failure

8
Q

Considerations regarding NoSQL systems (SPACTSDD)

A
Scalability
Partitioning
Analytical workloads
Consistency
Transactional semantics
Single-server performance
Data and query model
Durability
9
Q

The simplest form of a NoSQL store is a ___

A

key-value store

10
Q

key-value store

A

each key is mapped to a value containing arbitrary data

11
Q

store popularized by Redis

A

key-data structure store

12
Q

key-data structure store

A

assigns each value a type (e.g. integer, string, list, set, sorted set)

13
Q

store common to CouchDB, MongoDB, Riak

A

key-document store

14
Q

key-document store

A

map a key to some document that contains structured information in a JSON or JSON-like format

15
Q

key-document stores grant a lot of freedom in document modeling, however ___

A

application-based query logic can become exceedingly complex

16
Q

store common to HBase, Cassandra

A

BigTable column family store

17
Q

column family store

A
  • complex key identifies a row containing data stored in one or more Column Families
  • each row can contain multiple columns within a Column Family (CF)
  • values within each column are timestamped
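As a rough illustration of the layout described above, the sketch below models a column family store with nested Python dicts. The row key, the family and column names, and the `latest` helper are all hypothetical; this is not the API of BigTable, HBase, or Cassandra.

```python
# Hypothetical in-memory model:
# row key -> column family -> column -> list of (timestamp, value)
table = {
    "user:42": {
        "profile": {
            "name": [(1, "Ada"), (3, "Ada L.")],
        },
        "activity": {
            "last_login": [(2, "2011-01-01")],
        },
    },
}


def latest(table, row_key, family, column):
    """Return the value with the highest timestamp for one cell."""
    versions = table[row_key][family][column]
    return max(versions, key=lambda tv: tv[0])[1]
```

Keeping every timestamped version in a list is only one possible choice; a real store would typically bound how many versions it retains.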
18
Q

store common to HyperGraphDB, Neo4J

A

graph store

19
Q

exception to key-only lookup: MongoDB

A

allows indexing of data based on any number of properties and has a relatively high-level language for specifying which data to retrieve

20
Q

exception to key-only lookup: BigTable-based systems

A

support scanners to iterate over a column family and select particular items by a filter on a column

21
Q

exception to key-only lookup: CouchDB

A

allows creation of different views of the data and running MapReduce tasks across the table to facilitate more complex lookups and updates

22
Q

ACID

A

Atomicity
Consistency
Isolation
Durability

23
Q

Most NoSQL systems choose performance over ___

A

full ACID guarantees

24
Q

Redis is an exception to NoSQL’s no-transaction trend, in that ___

A

it provides a MULTI command to combine multiple operations atomically and a WATCH command to allow isolation

25
benefits of schema-free storage
supports less structured data requirements and requires less performance expense when modifying schemas on the fly
26
after a few iterations of a project relying on sloppy-schema NoSQL systems, ___
data and schema versioning is usually present in application-level code
27
single-server durability
ensures that any data modification will survive a server restart or power loss
28
the OS may not immediately write data to an on-disk file, instead ___
buffering the write to group several writes together in a single operation
29
typical hard drives can perform ___ random accesses (seeks) per second
100 - 200
30
typical hard drives are limited to ___ of sequential writes
30 - 100 MB/s
31
ensuring efficient single-server durability means ___
limiting the number of random writes the system incurs and increasing the number of sequential writes per hard drive
32
Ideally, one wants to minimize ___ and maximize ___, all while ___
the number of writes between fsync calls | the number of those writes that are sequential | never telling the user their data has been written until the write has been fsynced
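The "never acknowledge before fsync" rule can be sketched in a few lines. The function name `durable_append` and the file layout are assumptions made for illustration only.

```python
import os


def durable_append(path, payload: bytes) -> bool:
    """Append payload and return True only after fsync has completed."""
    with open(path, "ab") as f:
        f.write(payload)      # may still sit in Python's or the OS's buffers
        f.flush()             # hand Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to force the data to the device
    return True               # only now is it safe to acknowledge the write
```

Calling fsync on every write is the safest and slowest option; the later cards on logging and group commit describe ways to pay that cost less often.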
33
Techniques for improving performance of single-server durability guarantees
Control fsync frequency | Increase sequential writes by logging | Increase throughput by grouping writes
34
Memcached offers ___ in exchange for ___
no on-disk durability | extremely fast in-memory operations
35
Redis offers several options for ___
when to call fsync
36
To reduce random writes, some systems ___
append update operations to a sequentially written file (a log)
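A minimal sketch of the logging idea, assuming a hypothetical one-JSON-record-per-line file format: updates are only ever appended, and the current state is rebuilt by replaying the log.

```python
import json


def append_update(log_path, key, value):
    """Record an update by appending one JSON line; nothing is rewritten in place."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")


def replay(log_path):
    """Rebuild the current state by reading the log from start to end."""
    state = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            state[record["key"]] = record["value"]  # later entries win
    return state
```

This sketch leaves out fsync and crash recovery for partially written lines, both of which a real log would need.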
37
log-structured merge tree / log-structured hash table
combining logs and lookup data structures into one
38
Techniques such as log-structured merge trees / hash tables and modified B+ trees ___
result in improved write throughput, but require a periodic log compaction
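Log compaction can be pictured as dropping every record that a later record for the same key has superseded. The sketch below works on an in-memory list of `(key, value)` tuples, which is an assumed stand-in for an on-disk log.

```python
def compact(log):
    """Keep only the last record for each key, preserving the survivors' order."""
    last_index = {key: i for i, (key, _value) in enumerate(log)}
    return [record for i, record in enumerate(log) if last_index[record[0]] == i]
```

For example, compacting `[("a", 1), ("b", 2), ("a", 3)]` leaves `[("b", 2), ("a", 3)]`, since the first write to `a` is no longer needed to answer reads.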
39
group commit
grouping multiple concurrent updates within a short window into a single fsync call
40
benefit of group commit
increase in throughput, as multiple log appends can happen in a single fsync
41
drawback of group commit
higher latency per update, as users must wait on several concurrent updates for acknowledgement of their own update
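The trade-off in the group commit cards can be sketched as a log that buffers appends and acknowledges them together. The class name `GroupCommitLog` and its counters are hypothetical; a real system would trigger `commit` from a timer or a batch-size threshold.

```python
import os


class GroupCommitLog:
    def __init__(self, path):
        self._file = open(path, "ab")
        self._pending = 0      # appended but not yet fsynced
        self.fsync_calls = 0

    def append(self, record: bytes):
        """Buffer one update; the caller is NOT acknowledged yet."""
        self._file.write(record + b"\n")
        self._pending += 1

    def commit(self):
        """Flush and fsync once; return how many updates this acknowledges."""
        self._file.flush()
        os.fsync(self._file.fileno())
        self.fsync_calls += 1
        acknowledged, self._pending = self._pending, 0
        return acknowledged

    def close(self):
        self._file.close()
```

Three appends followed by one `commit` cost a single fsync, which is the throughput gain; the first of those three writers waited for the other two, which is the latency cost.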
42
Multi-server durability varies between systems as either ___ or ___
traditional primary-replica structure | replication where multiple servers store copies of the data
43
scaling up
adding more RAM and disks to handle load on one machine
44
scaling out
replicate data and spread requests across multiple machines
45
the ideal horizontal scalability goal is
linear scalability
46
linear scalability
doubling the number of machines in your storage system doubles the query capacity of the system
47
sharding
the act of splitting your read and write workloads across multiple machines to scale out your storage system
48
sharding your data means that no one machine ___ but also that no one machine can ___
has to handle the write workload on the entire dataset | answer queries about the entire dataset
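A small sketch of both halves of that card, assuming a simple hash-modulo placement (one of several possible schemes, and not the consistent hashing covered later). `shard_for` and `ShardedStore` are hypothetical names.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


class ShardedStore:
    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, value):
        # Each write touches exactly one shard.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def count_all(self):
        # A whole-dataset question has to visit every shard.
        return sum(len(shard) for shard in self.shards)
```

Key lookups stay cheap because they go to one shard, while `count_all` shows why queries over the entire dataset need a scatter-gather step.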
49
sharding adds ___
system complexity
50
two ways to scale without sharding
read replicas | caching
51
read replica structure
make copies of the data on multiple machines, while write requests go to a primary node
52
Generally, the less stringent the demands for freshness of content, the more you can ___
use read replicas to improve read-only query performance
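The read replica structure and its freshness trade-off can be sketched as follows. `ReplicatedStore` is a hypothetical class, and the explicit `sync` call stands in for replication that real systems perform asynchronously.

```python
import itertools


class ReplicatedStore:
    def __init__(self, num_replicas):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self._next = itertools.cycle(range(num_replicas))

    def write(self, key, value):
        self.primary[key] = value  # all writes go to the primary

    def sync(self):
        # Replication step; until it runs, replicas lag behind the primary.
        for replica in self.replicas:
            replica.update(self.primary)

    def read(self, key):
        # Reads rotate across replicas and may return stale data.
        return self.replicas[next(self._next)].get(key)
```

A read issued between `write` and `sync` misses the new value, which is acceptable only when the application tolerates slightly stale content.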
53
___ and ___ allow you to scale up your read-heavy workloads
read replicas | caching
54
To add memory to Memcached's cache pool:
just add another Memcached host
55
sharding through coordinators: Lounge and BigCouch
a coordinator distributes requests to individual CouchDB instances based on the key of the requested doc
56
sharding through coordinators: Gizzard
takes standalone data stores and arranges them in trees of any depth to partition keys by key range
57
NoSQL systems built around Dynamo's consistent hashing technique
Voldemort | Riak | Cassandra
58
consistent hashing
a kind of hashing such that when a hash table is resized, only K/n keys need to be remapped on average, where K is the number of keys, and n is the number of slots
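A minimal hash ring illustrates the remapping property. This sketch assumes one position per node and MD5 as the hash; Dynamo-style systems typically add virtual nodes and replication, which are left out here. `HashRing` and the node names are hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class HashRing:
    def __init__(self, nodes=()):
        self._points = []  # sorted ring positions
        self._owners = {}  # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str):
        point = _hash(node)
        bisect.insort(self._points, point)
        self._owners[point] = node

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping around.
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owners[self._points[i]]
```

When a node joins, the only keys that change owner are the ones the new node takes over from its clockwise neighbor; every other key stays where it was.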
59
range partitioning differs from consistent hashing in that ___
two keys that are next to each other in the keys' sort order are likely to appear in the same partition
60
range partitioning allows active management of load by ___
having a load manager that can reduce the size of a range on an overloaded server
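Range partitioning and the load manager's "shrink a range" action can be sketched with sorted split points. `RangePartitioner`, its `split` method, and the server names are hypothetical, and the sketch assumes `at_key` is not already a split point.

```python
import bisect


class RangePartitioner:
    def __init__(self, split_points, servers):
        # servers[i] owns keys in [split_points[i-1], split_points[i])
        if len(servers) != len(split_points) + 1:
            raise ValueError("need exactly one more server than split points")
        self.split_points = list(split_points)
        self.servers = list(servers)

    def server_for(self, key):
        return self.servers[bisect.bisect_right(self.split_points, key)]

    def split(self, at_key, new_server):
        """Shrink the range containing at_key: keys >= at_key in it move to new_server."""
        i = bisect.bisect_right(self.split_points, at_key)
        self.split_points.insert(i, at_key)
        self.servers.insert(i + 1, new_server)
```

With split points `["g", "p"]`, the adjacent keys `"h"` and `"i"` land on the same server; splitting that range at `"k"` hands its upper half to a new server without touching the others.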
61
tablets in BigTable
stores a range of row keys and values within a column family, maintaining all necessary logs and data structures to answer queries
62
as BigTable's tablets change in size, ___
two small tablets may merge or a big tablet splits in two
63
a primary server in BigTable manages ___
tablet size, load, and availability
64
to recognize and handle machine failures, the BigTable paper describes the use of Chubby, which is ___
a distributed locking system for managing server membership and liveness
65
ZooKeeper is used in several Hadoop-based projects to ___
manage secondary leader servers and tablet server reassignment
66
BigTable employs a hierarchical approach to range-partitioning by ___
maintaining tablet assignment in a metadata table, which is also sharded into tablets
67
HBase uses BigTable's hierarchical approach to range-partitioning by ___
using HDFS to handle data storage, replication, and consistency, leaving the rest to servers
68
MongoDB handles range-partitioning by ___
using config nodes to specify key ranges, staying in sync with a two-phase commit protocol
69
Cassandra allows fast range scans over data by ___
preserving order in its partitioning, mapping data to the server directly managing its key range
70
Gizzard's routing servers ___
form routing hierarchies of any depth, assigning ranges of keys to servers below them in the hierarchy
71
range partitioning is the obvious choice when ___
one will frequently be performing range scans over the keys of the data, avoiding random node jumps over the network
72
range partitioning requires the up-front cost of ___
maintaining routing and configuration nodes
73
when executed well, range partitioning data can be load-balanced ___
in small chunks which can be re-assigned in high-load situations
74
In practice, maintaining replicas is hard, and the following will happen:
replicas will crash and get out of sync | replicas will crash and never come back | networks will partition two sets of replicas | messages between machines will get delayed or lost
75
two major approaches to data consistency in NoSQL ecosystem
strong consistency | eventual consistency
76
systems that promote strong consistency ___
ensure that the replicas of a data item will always be able to come to consensus on the value of a key
77
the minimum R, W, and N choices for ensuring strong consistency while allowing temporary replica disagreements is
R + W = N + 1
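The reason R + W = N + 1 is the minimum is that any read set and any write set must share at least one replica, which holds exactly when R + W > N. The brute-force check below is a sketch for small N; both function names are hypothetical.

```python
from itertools import combinations


def satisfies_quorum_rule(n, r, w):
    """The arithmetic condition: read and write quorums are large enough to overlap."""
    return r + w > n


def quorums_always_overlap(n, r, w):
    """Enumerate every write set and read set and check that they intersect."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

For N = 3, the choice R = 2, W = 2 passes both checks, while R = 1, W = 2 fails both because a read can land on the one replica the write skipped.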
78
in HDFS, a write cannot succeed until ___, while a read ___
it has been replicated to all N servers (W=N) | will be satisfied by a single replica (R=1)
79
Dynamo-based systems use a type of versioning called
vector clocks
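A vector clock can be modeled as a map from node name to counter, and two versions compared entry by entry. The sketch below assumes plain dicts and a hypothetical `compare` function; it is not the representation any particular Dynamo-based system uses.

```python
def compare(a: dict, b: dict) -> str:
    """Classify clock a relative to clock b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b has seen everything a has, and more
    if b_le_a:
        return "after"
    return "concurrent"      # neither descends from the other: a conflict
```

Only the `"concurrent"` outcome signals a true conflict that needs resolution; in the other cases one version simply supersedes the other.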
80
Voldemort handles conflicts by ___
returning multiple copies of the key to the requesting client application
81
Cassandra resolves conflicts by ___
using the most recently timestamped version of the data
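The two strategies from the last two cards can be contrasted in a few lines. The version records (dicts with `value` and `timestamp` fields) and both function names are assumptions for illustration, not the data formats of Voldemort or Cassandra.

```python
def resolve_return_all(versions):
    """Voldemort-style: hand every conflicting version back to the client."""
    return list(versions)


def resolve_last_write_wins(versions):
    """Cassandra-style: keep the version with the most recent timestamp."""
    return max(versions, key=lambda v: v["timestamp"])
```

Returning all versions pushes the merge decision to the application, while last-write-wins is simpler but silently discards the older update.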
82
Voldemort's and Cassandra's conflict resolution are both present in ___
Riak
83
CouchDB provides a hybrid of Voldemort's and Cassandra's conflict resolution:
it identifies a conflict and allows users to query for conflicted keys for manual repair, but deterministically picks a version to return to users until conflicts are repaired
84
read repair is handled in Dynamo-based systems by ___
repairing out-of-sync replicas of the data in the background while returning the non-conflicting data to the requestor
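Read repair can be sketched with replicas modeled as dicts of `key -> (version, value)`. This is a simplified, synchronous sketch: the repair here happens inline rather than in the background, and an integer version stands in for a vector clock. `read_with_repair` is a hypothetical name.

```python
def read_with_repair(replicas, key):
    """Read from every replica, return the newest value, and fix stale copies."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    found = [entry for _replica, entry in responses if entry is not None]
    if not found:
        return None
    newest = max(found, key=lambda entry: entry[0])  # highest version wins
    for replica, entry in responses:
        if entry != newest:
            replica[key] = newest  # bring the out-of-sync replica up to date
    return newest[1]
```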
85
hinted handoff
assigning a node to temporarily take over an unavailable node's write workload, forwarding all those writes when the node is available again
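A toy model of hinted handoff, assuming a hypothetical `Node` class with an `up` flag and a list of stored hints. Failure detection and the choice of stand-in node are outside the sketch.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (intended target, key, value) held for other nodes


def write(key, value, target, stand_in):
    """Write to target, or leave a hint with stand_in if target is down."""
    if target.up:
        target.data[key] = value
    else:
        stand_in.hints.append((target, key, value))


def deliver_hints(stand_in):
    """Forward stored writes to any targets that have come back."""
    remaining = []
    for target, key, value in stand_in.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    stand_in.hints = remaining
```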
86
Replicas in Cassandra and Riak synchronize with one another using ___
Merkle trees
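A Merkle tree lets two replicas compare a single root hash and then descend only into subtrees that differ. The sketch below assumes both replicas hold the same number of non-empty leaf blocks in the same order and uses SHA-256; the function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(leaves):
    """Return the tree as a list of levels, from leaf hashes up to the root."""
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # pad odd levels by repeating the last hash
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(leaves_a, leaves_b):
    """Indices of leaves that differ, visiting only mismatched subtrees."""
    a, b = merkle_levels(leaves_a), merkle_levels(leaves_b)
    suspects = [0] if a[-1][0] != b[-1][0] else []
    for depth in range(len(a) - 2, -1, -1):
        children = []
        for i in suspects:
            for child in (2 * i, 2 * i + 1):
                if child < len(a[depth]) and a[depth][child] != b[depth][child]:
                    children.append(child)
        suspects = children
    return suspects
```

When the roots match, the replicas are in sync and no further hashes need to be exchanged.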
87
gossip
periodically (~1s) a node will communicate with a random node to exchange knowledge on other nodes' health
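One way to picture gossip is as nodes exchanging heartbeat counters with a random peer each round. The `GossipNode` class, the heartbeat map, and `gossip_round` are hypothetical; real implementations also track timeouts to decide when a node is considered down.

```python
import random


class GossipNode:
    def __init__(self, name):
        self.name = name
        self.heartbeats = {name: 0}  # node name -> highest heartbeat seen

    def merge(self, other_view):
        """Adopt any heartbeat that is newer than what this node knows."""
        for node, beat in other_view.items():
            if beat > self.heartbeats.get(node, -1):
                self.heartbeats[node] = beat


def gossip_round(nodes, rng):
    """Each node bumps its own heartbeat and swaps views with one random peer."""
    for node in nodes:
        node.heartbeats[node.name] += 1
        peer = rng.choice([n for n in nodes if n is not node])
        snapshot = dict(node.heartbeats)
        node.merge(peer.heartbeats)
        peer.merge(snapshot)
```

A node whose heartbeat stops advancing in its peers' views is a candidate for being marked unhealthy.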