Cloud Storage - Mabel Flashcards

Mabel to Fill

1
Q

What constraints do we let go of when entering the “NoSQL universe”

(Expecting 4)

A
  • Tabular Integrity
  • Domain Integrity
  • Atomic Integrity
  • Normal Forms
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

In NoSQL, “data denormalization” design covers which 2 concepts?

A
  • Heterogeneous Data
  • Nested Data

This can be referred to broadly as “Denormalised Data”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Define “Heterogeneous Data”

A

Heterogeneous data does not fulfil domain integrity (it may not even have a schema) and also not relational integrity.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Define “Nested Data”

A

Nested data is not in first normal form (violating atomic integrity). For example tables inside tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are the two main paradigms to store data?

A
  • “Traditional” Database
  • Data lakes
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

“Traditional” Databases

A

e.g PostgreSQL
E(xtract)
T(ransform)
L(oad)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Data Lakes

A

Read directly from a file system (in situ)
Stored “as is”
More convenient if you only want to read the data (“read-intensive”/OLAP)
e.g using pandas in Python

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How is data stored locally?

A
  • Files are organised in a heirarchy (files and directories)
  • File content is stored and read in blocks (roughly 4kB)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How does local storage scale?

A
  • Storage on Local Machine
  • LAN(NAS) (Harddrive accesible from multiple places on a network)
    Does not support WAN

NAS = network-attached storage
WAN = wide-area

1,000 to 1,000,000 files ok on a laptop, but 1,000,000,000 will break

LAN = local-area network

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are ways we can make storage scale?

A
  • Get rid of heirarchy
  • make the metadata flexible: attributes can differ from file to file
    (no schema)
  • use simple, data model: a flat list of files (called objects) identified with an identifier (ID); blocks are not exposed to the user,
  • use a large number of cheap machines rather than some “super- computer”
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Scaling Up vs Scaling Out

A

Scaling Up - A bigger machine: more memory, more or faster CPU cores, a larger disk
Scaling Out - One can buy more, similar machines and share the work across them

Scaling Out price increases linearly

A better way is to optimise code which should always be done first

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Data Centers (in numbers)

A

1,000-100,000 machines in a data center
1-200 cores per server

100,000 seems to be a hard limit - electricity consumption and cooling

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Servers (in numbers)
How many cores?
How much local storage?
How much RAM?

A

Server = Node = Machine
1 and 64 cores per server
1-30 TB local storage per server
16GB - 24 TB of RAM per server

Laptops typically have up to 24 cores

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Networks (in numbers)

A

The network bandwidth goes from 1 to 200 Gbit/s (HPC allows for higher)

Bandwidth is the highest within the same cluster

Bits for network as opposed to typical bytes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

How do we measure distance in data centers?

A

Rack Units
Rack Servers can be between 1-4 RUs
A cluster is just a room filled with racks put next to each other.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Describe Amazon S3

A

Simple Storage Service
Objects and Buckets - IDs (Can PUT, GET, DELETE)
An object can be at most 5 TB
Only possible to upload an object in a single chunk if it is less than 5 GB
By default users get 100 buckets

5TB is size that fits on a single disk typically

Object just means file

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Service Level Agreements (SLAs)

A

Durability - S3 loses less than 1 in 100 billion
Availability - S3 will be available > 99.99% of year (1h/year)

99.9% = < 10hrs
99% = < 4 days
99.999% = six minutes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What is the CAP theorem?

Yet another impossibility triangle

A
  • Consistency - Machines will all have the same answer (atomic consistency - all nodes see the same data)
  • Availability - If people make requests they get an answer
  • Partition Tolerance - If network gets partitioned, subnetworks still allow delivery to customers

Can’t have all 3 in a network partition
Will either be AP or CP
CP - not available until network reconnected
AP - not consistent until reconnected (eventual consistency)
Unavailable /= Partition Intolerant

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

RestAPI

REpresentational State Transfer

A

Sending queries over HTTPs - Rest API supports integration with many host languages.

Generally successful response status codes are 200-299 and client error response status codes are 400-499

Requests have Method, URI, [Header], [Body]. Responses have Status Code, [Header], [Body]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Deconstruct this URI/IRI
http://www.example.com/api/collection/foobar?id=foobar#head

A
  • “http” is the scheme;
  • “//www.example.com“ is the authority
  • “/api/collection/foobar” is the path
  • “?id=foobar” is the query
  • “#head” is the fragment
21
Q

Describe the main HTTP methods

A
  • GET - no body, returns representation of resource, makes no changes
  • PUT - creates or updates a resource from a representation of a newer version of it
  • DELETE - no body, this method deletes a resource
  • POST - just generic, if you don’t know what to use, use this
22
Q

Which of the following most generally designates a relation that is transitive, reflexive, and antisymmetric?
* A total order
* A preorder
* An equivalence relation
* A (non-strict) partial order

Lecture Question - Antisymmetric means if a!=b and a -> b then b /-> a

A

A (non-strict) partial order

Partial Order is a DAG
Preorder is only Transitive and Reflexive (an antisymmetric Preorder is a Partial Order, a symmetric Preorder is an equivalence relation)
Total Order is a partial order where everything is in relation to something else

23
Q

Azure Blob Storage

A

Object IDs are given by Account + Container + Blob
Object API - Block/Append/Page
Blocks are for data - e.g datasets
Append is for logging
Limits
* 190.7 TB block
* 195 GB append
* 8 TB page

Blob = Binary Large OBject

24
Q

What is a storage stamp?

(Azure)

A

10-20 racks * 18 storage nodes/rack (30 PB)
kept below 70/80% storage capacity

25
Q

What is storage replication like within/inbetween stamps?

A

Inter-stamp replication (asynchronous) of the partition layer
Intra-stamp replication (synchronous) of the stream layer

26
Q

Why are there data centers in many different regions?

A
  1. Optimise latency
  2. Resillience to natural catastrophes
27
Q

Why do we not use cloud storage for relational databases?

A

Latency - S3 can take around 100-200ms to query as opposed to 1-9ms in a typical database

28
Q

How are key-value stores used?

A

Similar model to object storage
Objects are now 400KB (Dynamo)
Streaming systems use systems like S3 Blob storage and download the smaller objects in a “stream”

29
Q

What does the basic API for key-value storage look like?

A

get(key, context) -> value, context
put(key,value, context)

delete also is in API

30
Q

Why do we use Key-Value stores?

A
  • Its API is considerably simpler than that of a relational database (which comes with query languages)
  • It does not ensure atomic consistency; instead, it guarantees eventual consistency
  • A key-value store scales out well, in that it is very fast also at large scales.

AP

31
Q

Amazon Dynamo

A

Amazon Dynamo is a specific key-value based on the Chord protocol, which is a Distributed Hash Table.

Distributed Hash Tables are generally highly scalable, robust against failure and self organizing. However, they do not provide any support for range queries, guarantees on data integrity (which is pushed to the user), and do not deal with any security issues.

On the physical level, a distributed hash table is made of nodes (the machines we have in a data center, piled up in racks) that work following a few design principles.

32
Q

What design principles does a Distributed Hash Table follow?

A
  1. Incremental stability. Nodes can enter and leave in a way that does not afect the stability of the system
  2. Symmetry. All machines run the same software and have exactly the same behaviour
    3.** Decentralization**. There is no “central node” that orchestrates the others
  3. Heterogeneity. The nodes may have different CPU power, amounts of memory, etc.

Only peer-to-peer network shown in lectures

33
Q

How many bits are Dynamo hashes?

A

128 bits

34
Q

Describe how Amazon Dynamo functions

A

Dynamo generates 128 bit hashes (16 bytes)

IDs are organised in a ring and machines get a random positition on ring. When objects are hashed, they are stored on the machine that is reached first when travelling clockwise around the ring

pages 67 onwards in textbook

35
Q

What happens when a new machine joins the ring?

A

The new machine gets assigned a position. This position is in the domain of responsibility of some existing node m.

This ring interval is now going to be split: everything between m and n remains within the responsibility of m, but the other half of the domain of responsibility is newly the responsibility of n.

The corresponding data needs to be transferred from m to n.

36
Q

What happens when a machine leaves the ring?

A

Graceful Exit The node (n) informs the rest of the cluster before leaving. Then, it will transfer all its interval of responsibility to the next node clockwise (m) on the ring.

Abrupt Exit Data will be gone if no data replication is in place.

37
Q

How do we do data replication in a ring?

A

Data within an interval will not be stored only the node that follow the interval clockwise, but on the next N nodes that follow the interval clockwise. Equivalently explained, a node is responsible not only for the interval that follows it counterclockwise, but for the next N intervals that follow it counterclockwise.

38
Q

How do we find out which node has our data in Amazon Dynamo?

A

Preference Lists

Each node knows, for every key (or key range), which node(s) are responsible (and hold a copy) of it. This is done by associating every key (key range) with a list of nodes, by decreasing priority (going down the ring clockwise) (the highest priority node is the first in the ring and the first assigned machine)

39
Q

Explain:
R+W>N

A

Upon reading a value, R is the minimum number of nodes (among the N nodes responsible for the key) from which a copy of the value must be obtained and compared. Upon writing a new value, W is the minimum number of nodes (among the N nodes responsible for the key) that must have acknowledged receipt and confirmed storage of the value before the put or delete transaction succeeds.

At least N nodes in each preference list

40
Q

Describe how an intial request for data is handled in Dynamo

A
  1. Connect to load balancer
  2. Directed to a random node on the ring
  3. Random node connects you to “coordinator node” - node first om preference list
  4. Connects to N-1 nodes hosting replicas
  5. Stops when R of them have responded

R is the min number of nodes
read from

41
Q

Describe the Pros and Cons of Distributed Hash Tables

A

Pro
* Highly scalable
* Robust against failure
* Self organising
Con
* Lookup, no search
* Data Integrity
* Security Issues

42
Q

What are some potential issues with Distributed Hash Tables? And what is the solution?

Looking for 2

A
  1. Poor distribution - All machines are in one section of ring
  2. Heterogeneous Performance

Tokens!! :) We split up the ring into tokens (some mulptiple of the number of machines) and assign tokens randomly to nodes.

43
Q

How does adding/deletion of nodes work with tokens?

A

An added node gets assigned tokens to take over from other nodes (not all tokens have to come from same node)

A deleted node gets its tokens redistributed to other nodes in the network (not all tokens have to go to the same node)

44
Q

What is a vector clock?

A

In AP systems, where partitions may exist, vector clocks are used to reconcile versions following a DAG structure.

A vector clock can logically be seen as a map from nodes (machines) to integers, i.e., the version number is incremented per machine rather than globally. A version number increases for a machine when this machine processes the corresponding entry (key-value) and updates it

pages 74-76 of the textbook illustrate this
+ tutorial 2

45
Q

How to answer this type of question?

Given a list of versions, draw the version DAG that the coordinator node will build for returning available versions.

A
  • If no partial ordering exists between nodes, then those nodes should be on the same level, i.e., they are both valid versions.
  • If there is an edge between two nodes, then the parent node should be smaller than the child.
  • You cannot have skip connections, i.e., there cannot be an edge from an ancestor node (excluding the parent node) directly to a child node.
  • Transitive edges shouldn’t be present in the version DAG.
  • Each edge represents the update of exactly one entry of the vector clock.
46
Q

How do we define a partial order with regards to vector clocks?

A

Vector clocks can be compared to each other with a partial order relation ≤

A vector clock is <= another vector clock if for each machine, the associated integer in the first vector clock is also smaller than or equal to the associated integer in the second vector clock.

e.g {“A”:2,”C”:1} ≤ {“A”:3,”B”:1,”C”:1}

47
Q

Order SSD, HDD, SRAM w.r.t
* Price
* Speed (Read/Write)
* Capacity

Solid State Drive, Hard Drive Disk, Synchr. Dynam. Random Access Memory

A
  • From cheapset to most expensive: HDD SSD SDRAM
  • From slowest to fastest (in terms of read/write speed): HDD SSD SDRAM
  • By their capacity in increasing order: SDRAM SSD HDD
48
Q
A