Cloud Storage - Mabel Flashcards
Mabel to Fill
What constraints do we let go of when entering the “NoSQL universe”?
(Expecting 4)
- Tabular Integrity
- Domain Integrity
- Atomic Integrity
- Normal Forms
In NoSQL, “data denormalization” design covers which 2 concepts?
- Heterogeneous Data
- Nested Data
This can be referred to broadly as “Denormalised Data”
Define “Heterogeneous Data”
Heterogeneous data fulfils neither domain integrity (it may not even have a schema) nor relational integrity.
Define “Nested Data”
Nested data is not in first normal form (violating atomic integrity), for example tables nested inside tables.
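For illustration (record names and fields are invented), heterogeneous and nested data in Python:

```python
# Heterogeneous, nested records (names and fields invented for illustration).
# "orders" is a table nested inside a record: not in first normal form.
# The two records also have different attributes: no fixed schema.
customers = [
    {"name": "Alice", "orders": [{"item": "pen", "qty": 2},
                                 {"item": "ink", "qty": 1}]},
    {"name": "Bob", "vip": True},
]
print(customers[0]["orders"][1]["item"])  # ink
```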
What are the two main paradigms to store data?
- “Traditional” Database
- Data lakes
“Traditional” Databases
e.g. PostgreSQL
Data must first be loaded via ETL:
E(xtract)
T(ransform)
L(oad)
Data Lakes
Read directly from a file system (in situ)
Stored “as is”
More convenient if you only want to read the data (“read-intensive”/OLAP)
e.g. using pandas in Python
How is data stored locally?
- Files are organised in a hierarchy (files and directories)
- File content is stored and read in blocks (roughly 4kB)
How does local storage scale?
- Storage on Local Machine
- LAN (NAS) (hard drive accessible from multiple places on a network)
Does not support WAN
NAS = network-attached storage
WAN = wide-area network
1,000 to 1,000,000 files are fine on a laptop, but 1,000,000,000 will break it
LAN = local-area network
What are ways we can make storage scale?
- Get rid of the hierarchy
- Make the metadata flexible: attributes can differ from file to file (no schema)
- Use a simple data model: a flat list of files (called objects), each identified with an identifier (ID); blocks are not exposed to the user
- Use a large number of cheap machines rather than some “super-computer”
Scaling Up vs Scaling Out
Scaling Up - A bigger machine: more memory, more or faster CPU cores, a larger disk
Scaling Out - One can buy more, similar machines and share the work across them
With scaling out, the price increases roughly linearly with capacity
A better option is to optimise the code, which should always be done first
Data Centers (in numbers)
1,000-100,000 machines in a data center
1-200 cores per server
100,000 seems to be a hard limit, due to electricity consumption and cooling
Servers (in numbers)
How many cores?
How much local storage?
How much RAM?
Server = Node = Machine
Between 1 and 64 cores per server
1-30 TB local storage per server
16 GB to 24 TB of RAM per server
Laptops typically have up to 24 cores
Networks (in numbers)
Network Bandwidth?
The network bandwidth goes from 1 to 200 Gbit/s (HPC allows for higher)
Bandwidth is the highest within the same cluster
Network speeds are quoted in bits per second, as opposed to the bytes typically used for storage
How do we measure distance in data centers?
Rack Units
Rack Servers can be between 1-4 RUs
A cluster is just a room filled with racks put next to each other.
Edit: shouldn't this be the number of hops between nodes?
Describe Amazon S3
Simple Storage Service
Objects and Buckets - IDs (Can PUT, GET, DELETE)
An object can be at most 5 TB
Only possible to upload an object in a single chunk if it is less than 5 GB
By default users get 100 buckets
5TB is size that fits on a single disk typically
Object just means file
S3 Service Level Agreements (SLAs)
Durability - S3 loses less than 1 object in 100 billion
Availability - S3 will be available > 99.99% of the year (< 1 h of downtime/year)
99% = < 4 days of downtime per year
99.9% = < 10 hrs
99.999% = ~ six minutes
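These downtime figures follow directly from the availability percentage. A minimal sketch (my own, not from the course):

```python
# Convert an availability percentage (SLA) into maximum downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8760

def max_downtime_hours(availability_percent: float) -> float:
    """Maximum yearly downtime (in hours) permitted by an SLA."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

print(round(max_downtime_hours(99.0), 1))         # 87.6 h, i.e. under 4 days
print(round(max_downtime_hours(99.9), 2))         # 8.76 h, i.e. under 10 hrs
print(round(max_downtime_hours(99.99), 2))        # 0.88 h, i.e. under 1 h/year
print(round(max_downtime_hours(99.999) * 60, 1))  # 5.3 minutes
```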
What is the CAP theorem?
Yet another impossibility triangle
- Consistency - Machines will all have the same answer (atomic consistency - all nodes see the same data)
- Availability - If people make requests they get an answer
- Partition Tolerance - If network gets partitioned, subnetworks still allow delivery to customers
Can’t have all 3 in a network partition
Will either be AP or CP
CP - not available until network reconnected
AP - not consistent until reconnected (eventual consistency)
Unavailable ≠ Partition Intolerant
REST API
REpresentational State Transfer
Queries are sent over HTTP(S); a REST API supports integration with many host languages.
Generally successful response status codes are 200-299 and client error response status codes are 400-499
Requests have Method, URI, [Header], [Body]. Responses have Status Code, [Header], [Body]
Deconstruct this URI/IRI
http://www.example.com/api/collection/foobar?id=foobar#head
- “http” is the scheme;
- “//www.example.com” is the authority
- “/api/collection/foobar” is the path
- “?id=foobar” is the query
- “#head” is the fragment
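Python's standard library can perform this deconstruction, which makes a handy sanity check:

```python
from urllib.parse import urlsplit

# Split the example URI into its five components.
parts = urlsplit("http://www.example.com/api/collection/foobar?id=foobar#head")
print(parts.scheme)    # http
print(parts.netloc)    # www.example.com (the authority)
print(parts.path)      # /api/collection/foobar
print(parts.query)     # id=foobar
print(parts.fragment)  # head
```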
Describe the main HTTP methods
- GET - no body, returns representation of resource, makes no changes
- PUT - creates or updates a resource from a representation of a newer version of it
- DELETE - no body, this method deletes a resource
- POST - the generic catch-all method; if you don't know which one to use, use this
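These methods can be illustrated with Python's standard library by building requests without sending them (the URL is the illustrative one from the URI card):

```python
from urllib.request import Request

# Build (but do not send) requests with different HTTP methods.
get_req = Request("http://www.example.com/api/collection/foobar")  # GET by default, no body
put_req = Request("http://www.example.com/api/collection/foobar",
                  data=b'{"name": "foobar"}', method="PUT")        # body = newer representation
del_req = Request("http://www.example.com/api/collection/foobar",
                  method="DELETE")                                 # no body

print(get_req.get_method())  # GET
print(put_req.get_method())  # PUT
print(del_req.get_method())  # DELETE
```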
Which of the following most generally designates a relation that is transitive, reflexive, and antisymmetric?
* A total order
* A preorder
* An equivalence relation
* A (non-strict) partial order
Lecture Question - Antisymmetric means: if a ≠ b and a → b, then b ↛ a
A (non-strict) partial order
Partial Order is a DAG
Preorder is only Transitive and Reflexive (an antisymmetric Preorder is a Partial Order, a symmetric Preorder is an equivalence relation)
Total Order is a partial order where everything is in relation to something else
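A quick way to internalise these definitions is to check them on a concrete relation; divisibility on positive integers is a standard example (mine, not from the lecture) of a non-strict partial order that is not total:

```python
# Divisibility on positive integers is a non-strict partial order:
# reflexive, transitive, and antisymmetric, but not total.
def divides(a: int, b: int) -> bool:
    return b % a == 0

# Reflexive: a | a
assert divides(6, 6)
# Antisymmetric: a | b and b | a would force a == b
assert not (divides(2, 6) and divides(6, 2))
# Transitive: 2 | 6 and 6 | 12, hence 2 | 12
assert divides(2, 6) and divides(6, 12) and divides(2, 12)
# Not total: 4 and 6 are incomparable, so this is not a total order
assert not divides(4, 6) and not divides(6, 4)
```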
Azure Blob Storage
Object IDs are given by Account + Container + Blob
Object API - Block/Append/Page
Blocks are for data - e.g. datasets
Append is for logging
Limits
* 190.7 TB block
* 195 GB append
* 8 TB page
Blob = Binary Large OBject
What is a storage stamp?
(Azure)
10-20 racks * 18 storage nodes/rack (30 PB)
kept below 70-80% storage capacity
What is storage replication like within/between stamps?
Inter-stamp replication (asynchronous) of the partition layer
Intra-stamp replication (synchronous) of the stream layer
Why are there data centers in many different regions?
- Optimise latency
- Resilience to natural catastrophes
Why do we not use cloud storage for relational databases?
Latency - S3 can take around 100-200ms to query as opposed to 1-9ms in a typical database
How are key-value stores used?
Similar model to object storage
Objects are now at most 400 KB (Dynamo)
Streaming systems use object stores like S3 and download the smaller objects in a “stream”
What does the basic API for key-value storage look like?
get(key, context) -> value, context
put(key, value, context)
delete is also in the API
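A toy in-memory version of this API might look as follows (the context handling is drastically simplified; a real store like Dynamo passes vector clocks as the context):

```python
# Toy in-memory key-value store with the get/put/delete API shape above.
# "context" is an opaque token, here just a version counter for illustration.
class KeyValueStore:
    def __init__(self):
        self._data = {}  # key -> (value, context)

    def get(self, key):
        return self._data[key]  # -> (value, context)

    def put(self, key, value, context=0):
        self._data[key] = (value, context + 1)

    def delete(self, key):
        del self._data[key]

store = KeyValueStore()
store.put("user:42", {"name": "Mabel"})
value, context = store.get("user:42")
print(value, context)  # {'name': 'Mabel'} 1
store.delete("user:42")
```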
Why do we use Key-Value stores?
- Its API is considerably simpler than that of a relational database (which comes with query languages)
- It does not ensure atomic consistency; instead, it guarantees eventual consistency
- A key-value store scales out well, in that it remains very fast even at large scales.
It is AP (in CAP terms)
Amazon Dynamo
Amazon Dynamo is a specific key-value store based on the Chord protocol, which is a Distributed Hash Table.
Distributed Hash Tables are generally highly scalable, robust against failure and self organizing. However, they do not provide any support for range queries, guarantees on data integrity (which is pushed to the user), and do not deal with any security issues.
On the physical level, a distributed hash table is made of nodes (the machines we have in a data center, piled up in racks) that work following a few design principles.
What design principles does a Distributed Hash Table follow?
- Incremental stability. Nodes can enter and leave in a way that does not affect the stability of the system
- Symmetry. All machines run the same software and have exactly the same behaviour
- Decentralization. There is no “central node” that orchestrates the others
- Heterogeneity. The nodes may have different CPU power, amounts of memory, etc.
Only peer-to-peer network shown in lectures
How many bits are Dynamo hashes?
128 bits
Describe how Amazon Dynamo functions
Dynamo generates 128 bit hashes (16 bytes)
IDs are organised in a ring and machines get a random position on the ring. When objects are hashed, they are stored on the machine that is reached first when travelling clockwise around the ring
pages 67 onwards in textbook
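The ring lookup can be sketched with a toy consistent-hashing implementation (node names and the key are invented; MD5 is used here simply because it yields 128-bit hashes, like Dynamo's):

```python
import hashlib
from bisect import bisect_right

# Toy consistent-hashing ring in the style of Dynamo/Chord.
def h(s: str) -> int:
    """128-bit hash of a string (MD5, as in the card above)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes):
        # Each node sits at a pseudo-random position on the ring.
        self._ring = sorted((h(n), n) for n in nodes)

    def lookup(self, key: str) -> str:
        """First node reached travelling clockwise from hash(key)."""
        positions = [pos for pos, _ in self._ring]
        i = bisect_right(positions, h(key)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["node-A", "node-B", "node-C"])
print(ring.lookup("my-object"))  # deterministic: same key, same node
```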
What happens when a new machine joins the ring?
The new machine n gets assigned a position. This position is in the domain of responsibility of some existing node m.
This ring interval is now going to be split: everything between m and n remains within the responsibility of m, but the rest of the domain of responsibility is newly the responsibility of n.
The corresponding data needs to be transferred from m to n.
What happens when a machine leaves the ring?
Graceful Exit The node (n) informs the rest of the cluster before leaving. Then, it will transfer all its interval of responsibility to the next node clockwise (m) on the ring.
Abrupt Exit Data will be gone if no data replication is in place.
How do we do data replication in a ring?
Data within an interval is stored not only on the node that follows the interval clockwise, but on the next N nodes that follow the interval clockwise. Equivalently explained, a node is responsible not only for the interval that follows it counterclockwise, but for the next N intervals that follow it counterclockwise.
How do we find out which node has our data in Amazon Dynamo?
Preference Lists
Each node knows, for every key (or key range), which node(s) are responsible for (and hold a copy of) it. This is done by associating every key (key range) with a list of nodes in decreasing priority, going down the ring clockwise (the highest-priority node is the first one reached on the ring).
Explain:
R+W>N
Upon reading a value, R is the minimum number of nodes (among the N nodes responsible for the key) from which a copy of the value must be obtained and compared. Upon writing a new value, W is the minimum number of nodes (among the N nodes responsible for the key) that must have acknowledged receipt and confirmed storage of the value before the put or delete transaction succeeds.
At least N nodes in each preference list
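The intuition for R + W > N is the pigeonhole principle: any write quorum and any read quorum must overlap in at least one node, so every read sees at least one up-to-date copy. A small exhaustive check (my own illustration):

```python
from itertools import combinations

# Why R + W > N guarantees that a read overlaps the latest write:
# any W written nodes and any R read nodes out of N replicas intersect.
N, R, W = 3, 2, 2  # a common Dynamo-style configuration
replicas = range(N)

for written in combinations(replicas, W):
    for read in combinations(replicas, R):
        # Since R + W > N, the two sets always share a node.
        assert set(written) & set(read)
print("every read quorum overlaps every write quorum")
```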
Describe how an initial request for data is handled in Dynamo
- Connect to load balancer
- Directed to a random node on the ring
- Random node connects you to the “coordinator node” - the node first on the preference list
- Connects to N-1 nodes hosting replicas
- Stops when R of them have responded
R is the min number of nodes read from
Describe the Pros and Cons of Distributed Hash Tables
Pro
* Highly scalable
* Robust against failure
* Self organising
Con
* Lookup, no search
* Data Integrity
* Security Issues
What are some potential issues with Distributed Hash Tables? And what is the solution?
Looking for 2
- Poor distribution - All machines end up in one section of the ring
- Heterogeneous Performance
Tokens!! :) We split up the ring into tokens (some multiple of the number of machines) and assign tokens randomly to nodes.
How does adding/deletion of nodes work with tokens?
An added node gets assigned tokens to take over from other nodes (not all tokens have to come from same node)
A deleted node gets its tokens redistributed to other nodes in the network (not all tokens have to go to the same node)
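A sketch of token assignment (numbers and node names invented):

```python
import random

# The ring is split into Q tokens, some multiple of the number of
# machines, and tokens are assigned to nodes at random.
nodes = ["node-A", "node-B", "node-C"]
Q = 4 * len(nodes)  # 12 tokens

random.seed(0)  # deterministic for the example
owner = {token: random.choice(nodes) for token in range(Q)}

# Adding a node: it takes over tokens from several existing nodes
# (not necessarily all from the same one).
nodes.append("node-D")
for token in random.sample(range(Q), Q // len(nodes)):
    owner[token] = "node-D"

# Each node now owns a subset of tokens scattered around the ring.
print(sorted(t for t, n in owner.items() if n == "node-D"))
```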
What is a vector clock?
In AP systems, where partitions may exist, vector clocks are used to reconcile versions following a DAG structure.
A vector clock can logically be seen as a map from nodes (machines) to integers, i.e., the version number is incremented per machine rather than globally. A version number increases for a machine when this machine processes the corresponding entry (key-value) and updates it
pages 74-76 of the textbook illustrate this
+ tutorial 2
How to answer this type of question?
Given a list of versions, draw the version DAG that the coordinator node will build for returning available versions.
- If no partial ordering exists between nodes, then those nodes should be on the same level, i.e., they are both valid versions.
- If there is an edge between two nodes, then the parent node should be smaller than the child.
- You cannot have skip connections, i.e., there cannot be an edge from an ancestor node (excluding the parent node) directly to a child node.
- Transitive edges shouldn’t be present in the version DAG.
- Each edge represents the update of exactly one entry of the vector clock.
How do we define a partial order with regards to vector clocks?
Vector clocks can be compared to each other with a partial order relation ≤
A vector clock is <= another vector clock if for each machine, the associated integer in the first vector clock is also smaller than or equal to the associated integer in the second vector clock.
e.g. {"A":2,"C":1} ≤ {"A":3,"B":1,"C":1}
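This comparison is easy to express in code; the sketch below (my own implementation, matching the definition above) treats a machine missing from a clock as version 0:

```python
# Partial-order comparison of vector clocks (maps from machine -> counter).
def vc_leq(a: dict, b: dict) -> bool:
    """True iff a <= b: every counter in a is <= the matching one in b."""
    return all(v <= b.get(k, 0) for k, v in a.items())

v1 = {"A": 2, "C": 1}
v2 = {"A": 3, "B": 1, "C": 1}
print(vc_leq(v1, v2))  # True: the example from the card
print(vc_leq(v2, v1))  # False

# Incomparable clocks are concurrent versions (both kept in the DAG):
v3 = {"A": 1, "B": 2}
print(vc_leq(v1, v3), vc_leq(v3, v1))  # False False
```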
Order SSD, HDD, SDRAM w.r.t
* Price
* Speed (Read/Write)
* Capacity
Solid State Drive, Hard Disk Drive, Synchronous Dynamic Random Access Memory
- From cheapest to most expensive: HDD SSD SDRAM
- From slowest to fastest (in terms of read/write speed): HDD SSD SDRAM
- By their capacity in increasing order: SDRAM SSD HDD