The NoSQL Ecosystem Flashcards

1
Q

NoSQL: definition, according to the community

A

"Not Only SQL": these systems provide an alternative to SQL rather than a wholesale replacement for it

2
Q

SQL’s expressiveness makes it challenging to ___

A

reason about the cost of each query, and thus the cost of a workload

3
Q

Application developers may find using relational data models to be challenging because ___

A

it may not be a good fit for every kind of data (e.g. lists, queues, sets)

4
Q

If relational data grows past the capacity of one server, then ___

A

the tables in the database will have to be partitioned across computers, leading to denormalization

5
Q

Complex query logic is typically left to the application, resulting in ___

A

a data store with more predictable query performance because of lack of variability in queries

6
Q

Two characteristics of Google’s BigTable

A

hierarchical range-based partitioning scheme

strict consistency

7
Q

Two characteristics of Amazon’s Dynamo

A

Maps keys to application-specific blobs of data

Loose consistency makes the partitioning model resilient to failure

8
Q

Considerations regarding NoSQL systems (SPACTSDD)

A
Scalability
Partitioning
Analytical workloads
Consistency
Transactional semantics
Single-server performance
Data and query model
Durability
9
Q

The simplest form of a NoSQL store is a ___

A

key-value store

10
Q

key-value store

A

each key is mapped to a value containing arbitrary data

11
Q

store popularized by Redis

A

key-data structure store

12
Q

key-data structure store

A

assigns each value a type (e.g. integer, string, list, set, sorted set)

13
Q

store common to CouchDB, MongoDB, Riak

A

key-document store

14
Q

key-document store

A

map a key to some document that contains structured information in a JSON or JSON-like format

15
Q

key-document stores grant a lot of freedom in document modeling, however ___

A

application-based query logic can become exceedingly complex

16
Q

store common to HBase, Cassandra

A

BigTable column family store

17
Q

column family store

A
  • complex key identifies a row containing data stored in one or more Column Families
  • each row can contain multiple columns within a CF
  • values within each column are timestamped
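The row / column family / timestamped value layering above can be pictured as nested maps. The following is only a toy sketch under that assumption; the class name `ColumnFamilyRow` and its methods are hypothetical and do not reflect the API of HBase, Cassandra, or BigTable.

```python
import time
from collections import defaultdict


class ColumnFamilyRow:
    """Toy row: column family -> column -> list of (timestamp, value)."""

    def __init__(self):
        self.families = defaultdict(lambda: defaultdict(list))

    def put(self, family, column, value, timestamp=None):
        # Every write keeps its own timestamp, so older versions remain visible.
        ts = time.time() if timestamp is None else timestamp
        self.families[family][column].append((ts, value))

    def get(self, family, column):
        """Return the most recently timestamped value, or None if absent."""
        versions = self.families.get(family, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]


# A table is then a map from row key to row (hypothetical example data).
table = {"user:42": ColumnFamilyRow()}
table["user:42"].put("profile", "email", "old@example.com", timestamp=1)
table["user:42"].put("profile", "email", "new@example.com", timestamp=2)
```

Reading `table["user:42"].get("profile", "email")` picks the version with the highest timestamp.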
18
Q

store common to HyperGraphDB, Neo4J

A

graph store

19
Q

exception to key-only lookup: MongoDB

A

allows indexing of data based on any number of properties and has a relatively high-level language for specifying which data to retrieve

20
Q

exception to key-only lookup: BigTable-based systems

A

support scanners to iterate over a column family and select particular items by a filter on a column

21
Q

exception to key-only lookup: CouchDB

A

allows creation of different views of the data and running MapReduce tasks across the table to facilitate more complex lookups and updates

22
Q

ACID

A

Atomicity
Consistency
Isolation
Durability

23
Q

Most NoSQL systems choose performance over ___

A

full ACID guarantees

24
Q

Redis is an exception to NoSQL’s no-transaction trend, in that ___

A

it provides a MULTI command to combine multiple operations atomically and a WATCH command to allow isolation

25
Q

benefits of schema-free storage

A

supports less structured data requirements and makes on-the-fly schema changes less expensive

26
Q

after a few iterations of a project relying on sloppy-schema NoSQL systems, ___

A

data and schema versioning is usually present in application-level code

27
Q

single-server durability

A

ensures that any data modification will survive a server restart or power loss

28
Q

the OS may not immediately write data to an on-disk file, instead ___

A

buffering the write to group several writes together in a single operation

29
Q

typical hard drives can perform ___ random accesses (seeks) per second

A

100 - 200

30
Q

typical hard drives are limited to ___ of sequential writes

A

30 - 100 MB/s

31
Q

ensuring efficient single-server durability means ___

A

limiting the number of random writes the system incurs and increasing the number of sequential writes per hard drive

32
Q

Ideally, one wants to minimize ___ and maximize ___, all while ___

A

the number of writes between fsync calls

the number of those writes that are sequential

never telling the user their data has been written until the write has been fsynced

33
Q

Techniques for improving performance of single-server durability guarantees

A

Control fsync frequency
Increase sequential writes by logging
Increase throughput by grouping writes

34
Q

Memcached offers ___ in exchange for ___

A

no on-disk durability

extremely fast in-memory operations

35
Q

Redis offers several options for ___

A

when to call fsync

36
Q

To reduce random writes, some systems ___

A

append update operations to a sequentially written file (a log)
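As a rough illustration of the append-only log idea, the sketch below writes each update to the end of a file and rebuilds state by replaying it. The function names `append_update` and `replay` and the JSON-lines record format are assumptions made for this example, not any particular system's on-disk format.

```python
import json
import os


def append_update(log_path, key, value):
    """Append one update record to the end of the log and force it to disk."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")
        log.flush()             # push Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to push its buffer to the disk


def replay(log_path):
    """Rebuild the key -> value map by reading the log from start to end."""
    state = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            state[record["key"]] = record["value"]  # later records win
    return state
```

Because every write lands at the end of the file, the disk sees sequential writes only; the cost is that lookups need either a replay or a separate index.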

37
Q

log-structured merge tree / log-structured hash table

A

combining logs and lookup data structures into one

38
Q

Techniques such as log-structured merge trees / hash tables and modified B+ trees ___

A

result in improved write throughput, but require a periodic log compaction

39
Q

group commit

A

grouping multiple concurrent updates within a short window into a single fsync call

40
Q

benefit of group commit

A

increase in throughput, as multiple log appends can happen in a single fsync

41
Q

drawback of group commit

A

higher latency per update, as users must wait on several concurrent updates for acknowledgement of their own update
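The group commit trade-off in the three cards above can be sketched as a log that queues appends and makes the whole batch durable with one fsync. `GroupCommitLog` is a hypothetical, single-threaded toy; a real implementation would also handle concurrency and a time window for closing the batch.

```python
import os


class GroupCommitLog:
    """Buffers appends and makes a whole batch durable with one fsync."""

    def __init__(self, path):
        self._file = open(path, "ab")
        self._pending = []
        self.fsync_calls = 0

    def append(self, record: bytes):
        """Queue a record; it is not durable (or acknowledged) yet."""
        self._pending.append(record)

    def commit(self):
        """Write all queued records, fsync once, return how many were made durable."""
        count = len(self._pending)
        if count == 0:
            return 0
        self._file.write(b"".join(record + b"\n" for record in self._pending))
        self._file.flush()
        os.fsync(self._file.fileno())
        self.fsync_calls += 1
        self._pending.clear()
        return count

    def close(self):
        self.commit()
        self._file.close()
```

Three queued appends followed by one `commit()` cost a single fsync, which is the throughput gain; the first of those three callers waited for the other two, which is the latency cost.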

42
Q

Multi-server durability varies between systems as either ___ or ___

A

traditional primary-replica structure

replication where multiple servers store copies of the data

43
Q

scaling up

A

adding more RAM and disks to handle load on one machine

44
Q

scaling out

A

replicate data and spread requests across multiple machines

45
Q

the ideal horizontal scalability goal is

A

linear scalability

46
Q

linear scalability

A

doubling the number of machines in your storage system doubles the query capacity of the system

47
Q

sharding

A

the act of splitting your read and write workloads across multiple machines to scale out your storage system

48
Q

sharding your data means that no one machine ___ but also is unable to ____

A

has to handle the write workload on the entire dataset

answer queries about the entire dataset

49
Q

sharding adds ___

A

system complexity

50
Q

two ways to scale without sharding

A

read replicas

caching

51
Q

read replica structure

A

make copies of the data on multiple machines, while write requests go to a primary node

52
Q

Generally, the less stringent the demands for freshness of content, the more you can ___

A

use read replicas to improve read-only query performance

53
Q

___ and ___ allow you to scale up your read-heavy workloads

A

read replicas

caching

54
Q

To add memory to Memcached’s cache pool:

A

just add another Memcached host

55
Q

sharding through coordinators: Lounge and BigCouch

A

a coordinator distributes requests to individual CouchDB instances based on the key of the requested doc

56
Q

sharding through coordinators: Gizzard

A

takes standalone data stores and arranges them in trees of any depth to partition keys by key range

57
Q

NoSQL systems built around Dynamo’s consistent hashing technique

A

Voldemort
Riak
Cassandra

58
Q

consistent hashing

A

a kind of hashing such that when a hash table is resized, only K/n keys need to be remapped on average, where K is the number of keys, and n is the number of slots
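A minimal sketch of a consistent hash ring, assuming MD5 for placing points on the ring and 100 virtual nodes per server (both are choices made for this example, not taken from Dynamo). The class name `HashRing` is hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each server owns several points so load spreads more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        """Return the node owning the first ring point at or after the key's position."""
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect_right(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]
```

Adding a server only inserts new points on the ring, so the only keys that change owner are those that now fall to the new server; every other key stays where it was.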

59
Q

range partitioning differs from consistent hashing in that ___

A

two keys that are next to each other in the key’s sort order are likely to appear in the same partition
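To make the contrast concrete, here is a small sketch of range partitioning using sorted split points. `RangePartitioner`, the split points, and the server names are all hypothetical.

```python
import bisect


class RangePartitioner:
    def __init__(self, split_points, servers):
        # n sorted split points define n + 1 contiguous key ranges.
        if len(servers) != len(split_points) + 1:
            raise ValueError("need exactly one more server than split points")
        self._splits = list(split_points)
        self._servers = list(servers)

    def server_for(self, key):
        """Keys below the first split go to server 0, and so on."""
        return self._servers[bisect.bisect_right(self._splits, key)]

    def servers_for_scan(self, start, end):
        """Servers holding any key in the inclusive range [start, end]."""
        lo = bisect.bisect_right(self._splits, start)
        hi = bisect.bisect_right(self._splits, end)
        return self._servers[lo:hi + 1]


partitioner = RangePartitioner(["h", "p"], ["s0", "s1", "s2"])
```

Here "apple" and "apricot" both land on `s0`, and a scan from "a" to "c" touches only that one server, whereas a hash-based scheme would typically scatter those keys.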

60
Q

range partitioning allows active management of load by ___

A

having a load manager that can reduce the size of a range on an overloaded server

61
Q

tablets in BigTable

A

stores a range of row keys and values within a column family, maintaining all necessary logs and data structures to answer queries

62
Q

as BigTable’s tablets change in size, ___

A

two small tablets may merge or a big tablet splits in two

63
Q

a primary server in BigTable manages ___

A

tablet size, load, and availability

64
Q

to recognize and handle machine failures, the BigTable paper describes the use of Chubby, which is ___

A

a distributed locking system for managing server membership and liveness

65
Q

ZooKeeper is used in several Hadoop-based projects to ___

A

manage secondary leader servers and tablet server reassignment

66
Q

BigTable employs a hierarchical approach to range-partitioning by ___

A

maintaining tablet assignment in a metadata table, which is also sharded into tablets

67
Q

HBase uses BigTable’s hierarchical approach to range-partitioning by ___

A

using HDFS to handle data storage, replication, and consistency, leaving the rest to servers

68
Q

MongoDB handles range-partitioning by ___

A

using config nodes to specify key ranges, staying in sync with a two-phase commit protocol

69
Q

Cassandra allows fast range scans over data by ___

A

preserving order in its partitioning, mapping data to the server directly managing its key range

70
Q

Gizzard’s routing servers ___

A

form routing hierarchies of any depth, assigning ranges of keys to servers below them in the hierarchy

71
Q

range partitioning is the obvious choice when ___

A

one will frequently be performing range scans over the keys of the data, avoiding random node jumps over the network

72
Q

range partitioning requires the up-front cost of ___

A

maintaining routing and configuration nodes

73
Q

when executed well, range partitioning data can be load-balanced ___

A

in small chunks which can be re-assigned in high-load situations

74
Q

In practice, maintaining replicas is hard because:

A

replicas will crash and get out of sync
replicas will crash and never come back
networks will partition two sets of replicas
messages between machines will get delayed or lost

75
Q

two major approaches to data consistency in NoSQL ecosystem

A

strong consistency

eventual consistency

76
Q

systems that promote strong consistency ___

A

ensure that the replicas of a data item will always be able to come to consensus on the value of a key

77
Q

the minimum choice of R, W, and N that ensures strong consistency while allowing temporary replica disagreements is ___

A

R + W = N + 1
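The reason R + W = N + 1 is the minimum can be checked by brute force: once R + W exceeds N, every read set must share at least one replica with every write set. The function names below are hypothetical and the check is a sketch for small N only.

```python
from itertools import combinations


def is_strongly_consistent(n, r, w):
    """The quorum condition: read and write sets are too large to miss each other."""
    return r + w > n


def quorums_always_overlap(n, r, w):
    """Brute force: does every R-replica read set intersect every W-replica write set?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )
```

With N = 3, choosing R = 2 and W = 2 always overlaps, while R = 1 and W = 2 can read from the one replica that missed the write.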

78
Q

in HDFS, a write cannot succeed until ___, while a read ___

A

it has been replicated to all N servers (W=N)

will be satisfied by a single replica (R=1)

79
Q

Dynamo-based systems use a type of versioning called

A

vector clocks
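A minimal sketch of how vector clocks order versions, assuming each clock is a plain dict from node name to counter. The helper names `increment` and `compare` are hypothetical.

```python
def increment(clock, node):
    """Return a copy of the clock with the given node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent' for clock a relative to b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither version descends from the other: a conflict
```

Two versions that each descend from a common ancestor but were written on different nodes compare as "concurrent", which is the case the conflict-resolution cards that follow deal with.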

80
Q

Voldemort handles conflicts by ___

A

returning multiple copies of the key to the requesting client application

81
Q

Cassandra resolves conflicts by ___

A

using the most recently timestamped version of the data

82
Q

Voldemort’s and Cassandra’s conflict resolution are both present in ___

A

Riak

83
Q

CouchDB provides a hybrid of Voldemort’s and Cassandra’s conflict resolution:

A

it identifies a conflict and allows users to query for conflicted keys for manual repair, but deterministically picks a version to return to users until conflicts are repaired

84
Q

read repair is handled in Dynamo-based systems by ___

A

repairing out-of-sync replicas of the data in the background while returning the non-conflicting data to the requestor

85
Q

hinted handoff

A

assigning a node to temporarily take over an unavailable node’s write workload, forwarding all those writes when the node is available again
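Hinted handoff can be pictured with a toy model of two nodes, sketched below. The `Node` class and the `write` and `hand_off` functions are hypothetical and ignore replication, timeouts, and hint expiry.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (intended node name, key, value) held for others


def write(key, value, owner, fallback):
    """Write to the owner if it is up; otherwise leave a hint on the fallback node."""
    if owner.up:
        owner.data[key] = value
    else:
        fallback.hints.append((owner.name, key, value))


def hand_off(fallback, recovered):
    """Forward every hint meant for the recovered node, keeping the rest."""
    remaining = []
    for target, key, value in fallback.hints:
        if target == recovered.name and recovered.up:
            recovered.data[key] = value
        else:
            remaining.append((target, key, value))
    fallback.hints = remaining
```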

86
Q

Cassandra and Riak synchronize with one another using ___

A

Merkle trees
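The idea behind using Merkle trees for synchronization is that two replicas can compare one root hash instead of every key. The sketch below assumes SHA-256 and string keys and values; `merkle_root` and `in_sync` are hypothetical names, and only the root is compared here, whereas a fuller version would descend into differing subtrees to find the out-of-sync keys.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(items):
    """Root hash over (key, value) string pairs, independent of input order."""
    level = [_h(f"{k}={v}".encode("utf-8")) for k, v in sorted(items)]
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def in_sync(replica_a, replica_b):
    """Two replicas (dicts) hold the same data if their root hashes match."""
    return merkle_root(replica_a.items()) == merkle_root(replica_b.items())
```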

87
Q

gossip

A

periodically (~1s) a node will communicate with a random node to exchange knowledge on other nodes’ health
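A toy simulation of gossip, assuming each node's view is a dict from node name to the latest heartbeat counter it has heard of. The function names and the heartbeat representation are assumptions for this sketch; real systems also track timing to decide when a node is considered down.

```python
import random


def exchange(view_a, view_b):
    """Merge two views, keeping the newer heartbeat for each node."""
    merged = dict(view_a)
    for node, beat in view_b.items():
        merged[node] = max(beat, merged.get(node, -1))
    return merged


def gossip_round(views, rng=random):
    """Each node picks one random peer; both end up holding the merged view."""
    names = list(views)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        merged = exchange(views[name], views[peer])
        views[name] = dict(merged)
        views[peer] = dict(merged)
```

Starting from views in which each node knows only about itself, repeated rounds spread every node's heartbeat to all the others.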