Cloud Storage - Mabel Flashcards
Mabel to Fill
What constraints do we let go of when entering the “NoSQL universe”?
(Expecting 4)
- Tabular Integrity
- Domain Integrity
- Atomic Integrity
- Normal Forms
In NoSQL, “data denormalization” design covers which 2 concepts?
- Heterogeneous Data
- Nested Data
This can be referred to broadly as “Denormalised Data”
Define “Heterogeneous Data”
Heterogeneous data fulfils neither domain integrity (it may not even have a schema) nor relational integrity.
Define “Nested Data”
Nested data is not in first normal form (violating atomic integrity), for example tables nested inside tables.
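For illustration (record names and fields are invented), heterogeneous and nested data in Python:

```python
# Heterogeneous, nested records (names and fields invented for illustration).
# "orders" is a table nested inside a record: not in first normal form.
# The two records also have different attributes: no fixed schema.
customers = [
    {"name": "Alice", "orders": [{"item": "pen", "qty": 2},
                                 {"item": "ink", "qty": 1}]},
    {"name": "Bob", "vip": True},
]
print(customers[0]["orders"][1]["item"])  # ink
```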
What are the two main paradigms to store data?
- “Traditional” Database
- Data lakes
“Traditional” Databases
e.g. PostgreSQL
Data must first be loaded via ETL:
E(xtract)
T(ransform)
L(oad)
Data Lakes
Read directly from a file system (in situ)
Stored “as is”
More convenient if you only want to read the data (“read-intensive”/OLAP)
e.g. using pandas in Python
How is data stored locally?
- Files are organised in a hierarchy (files and directories)
- File content is stored and read in blocks (roughly 4kB)
How does local storage scale?
- Storage on Local Machine
- LAN (NAS) (hard drive accessible from multiple places on a network)
Does not support WAN
NAS = network-attached storage
WAN = wide-area network
1,000 to 1,000,000 files are fine on a laptop, but 1,000,000,000 will break it
LAN = local-area network
What are ways we can make storage scale?
- Get rid of the hierarchy
- Make the metadata flexible: attributes can differ from file to file (no schema)
- Use a simple data model: a flat list of files (called objects), each identified with an identifier (ID); blocks are not exposed to the user
- Use a large number of cheap machines rather than some “super-computer”
Scaling Up vs Scaling Out
Scaling Up - A bigger machine: more memory, more or faster CPU cores, a larger disk
Scaling Out - One can buy more, similar machines and share the work across them
With scaling out, the price increases roughly linearly with capacity
A better option is to optimise the code, which should always be done first
Data Centers (in numbers)
1,000-100,000 machines in a data center
1-200 cores per server
100,000 seems to be a hard limit, due to electricity consumption and cooling
Servers (in numbers)
How many cores?
How much local storage?
How much RAM?
Server = Node = Machine
Between 1 and 64 cores per server
1-30 TB local storage per server
16 GB to 24 TB of RAM per server
Laptops typically have up to 24 cores
Networks (in numbers)
Network Bandwidth?
The network bandwidth goes from 1 to 200 Gbit/s (HPC allows for higher)
Bandwidth is the highest within the same cluster
Network speeds are quoted in bits per second, as opposed to the bytes typically used for storage
How do we measure distance in data centers?
Rack Units
Rack Servers can be between 1-4 RUs
A cluster is just a room filled with racks put next to each other.
Edit: shouldn't this be the number of hops between nodes?
Describe Amazon S3
Simple Storage Service
Objects and Buckets - IDs (Can PUT, GET, DELETE)
An object can be at most 5 TB
Only possible to upload an object in a single chunk if it is less than 5 GB
By default users get 100 buckets
5TB is size that fits on a single disk typically
Object just means file
S3 Service Level Agreements (SLAs)
Durability - S3 loses less than 1 object in 100 billion
Availability - S3 will be available > 99.99% of the year (< 1 h of downtime/year)
99% = < 4 days of downtime per year
99.9% = < 10 hrs
99.999% = ~ six minutes
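These downtime figures follow directly from the availability percentage. A minimal sketch (my own, not from the course):

```python
# Convert an availability percentage (SLA) into maximum downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8760

def max_downtime_hours(availability_percent: float) -> float:
    """Maximum yearly downtime (in hours) permitted by an SLA."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

print(round(max_downtime_hours(99.0), 1))         # 87.6 h, i.e. under 4 days
print(round(max_downtime_hours(99.9), 2))         # 8.76 h, i.e. under 10 hrs
print(round(max_downtime_hours(99.99), 2))        # 0.88 h, i.e. under 1 h/year
print(round(max_downtime_hours(99.999) * 60, 1))  # 5.3 minutes
```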
What is the CAP theorem?
Yet another impossibility triangle
- Consistency - Machines will all have the same answer (atomic consistency - all nodes see the same data)
- Availability - If people make requests they get an answer
- Partition Tolerance - If network gets partitioned, subnetworks still allow delivery to customers
Can’t have all 3 in a network partition
Will either be AP or CP
CP - not available until network reconnected
AP - not consistent until reconnected (eventual consistency)
Unavailable ≠ Partition Intolerant
REST API
REpresentational State Transfer
Queries are sent over HTTP(S); a REST API supports integration with many host languages.
Generally successful response status codes are 200-299 and client error response status codes are 400-499
Requests have Method, URI, [Header], [Body]. Responses have Status Code, [Header], [Body]
Deconstruct this URI/IRI
http://www.example.com/api/collection/foobar?id=foobar#head
- “http” is the scheme;
- “//www.example.com” is the authority
- “/api/collection/foobar” is the path
- “?id=foobar” is the query
- “#head” is the fragment
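Python's standard library can perform this deconstruction, which makes a handy sanity check:

```python
from urllib.parse import urlsplit

# Split the example URI into its five components.
parts = urlsplit("http://www.example.com/api/collection/foobar?id=foobar#head")
print(parts.scheme)    # http
print(parts.netloc)    # www.example.com (the authority)
print(parts.path)      # /api/collection/foobar
print(parts.query)     # id=foobar
print(parts.fragment)  # head
```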
Describe the main HTTP methods
- GET - no body, returns representation of resource, makes no changes
- PUT - creates or updates a resource from a representation of a newer version of it
- DELETE - no body, this method deletes a resource
- POST - the generic catch-all method; if you don't know which one to use, use this
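These methods can be illustrated with Python's standard library by building requests without sending them (the URL is the illustrative one from the URI card):

```python
from urllib.request import Request

# Build (but do not send) requests with different HTTP methods.
get_req = Request("http://www.example.com/api/collection/foobar")  # GET by default, no body
put_req = Request("http://www.example.com/api/collection/foobar",
                  data=b'{"name": "foobar"}', method="PUT")        # body = newer representation
del_req = Request("http://www.example.com/api/collection/foobar",
                  method="DELETE")                                 # no body

print(get_req.get_method())  # GET
print(put_req.get_method())  # PUT
print(del_req.get_method())  # DELETE
```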
Which of the following most generally designates a relation that is transitive, reflexive, and antisymmetric?
* A total order
* A preorder
* An equivalence relation
* A (non-strict) partial order
Lecture Question - Antisymmetric means: if a ≠ b and a → b, then b ↛ a
A (non-strict) partial order
Partial Order is a DAG
Preorder is only Transitive and Reflexive (an antisymmetric Preorder is a Partial Order, a symmetric Preorder is an equivalence relation)
Total Order is a partial order where everything is in relation to something else
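A quick way to internalise these definitions is to check them on a concrete relation; divisibility on positive integers is a standard example (mine, not from the lecture) of a non-strict partial order that is not total:

```python
# Divisibility on positive integers is a non-strict partial order:
# reflexive, transitive, and antisymmetric, but not total.
def divides(a: int, b: int) -> bool:
    return b % a == 0

# Reflexive: a | a
assert divides(6, 6)
# Antisymmetric: a | b and b | a would force a == b
assert not (divides(2, 6) and divides(6, 2))
# Transitive: 2 | 6 and 6 | 12, hence 2 | 12
assert divides(2, 6) and divides(6, 12) and divides(2, 12)
# Not total: 4 and 6 are incomparable, so this is not a total order
assert not divides(4, 6) and not divides(6, 4)
```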
Azure Blob Storage
Object IDs are given by Account + Container + Blob
Object API - Block/Append/Page
Blocks are for data - e.g. datasets
Append is for logging
Limits
* 190.7 TB block
* 195 GB append
* 8 TB page
Blob = Binary Large OBject
What is a storage stamp?
(Azure)
10-20 racks * 18 storage nodes/rack (30 PB)
kept below 70-80% storage capacity
What is storage replication like within/between stamps?
Inter-stamp replication (asynchronous) of the partition layer
Intra-stamp replication (synchronous) of the stream layer
Why are there data centers in many different regions?
- Optimise latency
- Resilience to natural catastrophes
Why do we not use cloud storage for relational databases?
Latency - S3 can take around 100-200ms to query as opposed to 1-9ms in a typical database
How are key-value stores used?
Similar model to object storage
Objects are now at most 400 KB (Dynamo)
Streaming systems use object stores like S3 and download the smaller objects in a “stream”
What does the basic API for key-value storage look like?
get(key, context) -> value, context
put(key, value, context)
delete is also in the API
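A toy in-memory version of this API might look as follows (the context handling is drastically simplified; a real store like Dynamo passes vector clocks as the context):

```python
# Toy in-memory key-value store with the get/put/delete API shape above.
# "context" is an opaque token, here just a version counter for illustration.
class KeyValueStore:
    def __init__(self):
        self._data = {}  # key -> (value, context)

    def get(self, key):
        return self._data[key]  # -> (value, context)

    def put(self, key, value, context=0):
        self._data[key] = (value, context + 1)

    def delete(self, key):
        del self._data[key]

store = KeyValueStore()
store.put("user:42", {"name": "Mabel"})
value, context = store.get("user:42")
print(value, context)  # {'name': 'Mabel'} 1
store.delete("user:42")
```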
Why do we use Key-Value stores?
- Its API is considerably simpler than that of a relational database (which comes with query languages)
- It does not ensure atomic consistency; instead, it guarantees eventual consistency
- A key-value store scales out well, in that it remains very fast even at large scales.
It is AP (in CAP terms)
Amazon Dynamo
Amazon Dynamo is a specific key-value store based on the Chord protocol, which is a Distributed Hash Table.
Distributed Hash Tables are generally highly scalable, robust against failure and self organizing. However, they do not provide any support for range queries, guarantees on data integrity (which is pushed to the user), and do not deal with any security issues.
On the physical level, a distributed hash table is made of nodes (the machines we have in a data center, piled up in racks) that work following a few design principles.
What design principles does a Distributed Hash Table follow?
- Incremental stability. Nodes can enter and leave in a way that does not affect the stability of the system
- Symmetry. All machines run the same software and have exactly the same behaviour
- Decentralization. There is no “central node” that orchestrates the others
- Heterogeneity. The nodes may have different CPU power, amounts of memory, etc.
Only peer-to-peer network shown in lectures
How many bits are Dynamo hashes?
128 bits
Describe how Amazon Dynamo functions
Dynamo generates 128 bit hashes (16 bytes)
IDs are organised in a ring and machines get a random position on the ring. When objects are hashed, they are stored on the machine that is reached first when travelling clockwise around the ring
pages 67 onwards in textbook
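The ring lookup can be sketched with a toy consistent-hashing implementation (node names and the key are invented; MD5 is used here simply because it yields 128-bit hashes, like Dynamo's):

```python
import hashlib
from bisect import bisect_right

# Toy consistent-hashing ring in the style of Dynamo/Chord.
def h(s: str) -> int:
    """128-bit hash of a string (MD5, as in the card above)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes):
        # Each node sits at a pseudo-random position on the ring.
        self._ring = sorted((h(n), n) for n in nodes)

    def lookup(self, key: str) -> str:
        """First node reached travelling clockwise from hash(key)."""
        positions = [pos for pos, _ in self._ring]
        i = bisect_right(positions, h(key)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["node-A", "node-B", "node-C"])
print(ring.lookup("my-object"))  # deterministic: same key, same node
```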
What happens when a new machine joins the ring?
The new machine n gets assigned a position. This position is in the domain of responsibility of some existing node m.
This ring interval is now going to be split: everything between m and n remains within the responsibility of m, but the rest of the domain of responsibility is newly the responsibility of n.
The corresponding data needs to be transferred from m to n.
What happens when a machine leaves the ring?
Graceful Exit The node (n) informs the rest of the cluster before leaving. Then, it will transfer all its interval of responsibility to the next node clockwise (m) on the ring.
Abrupt Exit Data will be gone if no data replication is in place.
How do we do data replication in a ring?
Data within an interval is stored not only on the node that follows the interval clockwise, but on the next N nodes that follow the interval clockwise. Equivalently explained, a node is responsible not only for the interval that follows it counterclockwise, but for the next N intervals that follow it counterclockwise.
How do we find out which node has our data in Amazon Dynamo?
Preference Lists
Each node knows, for every key (or key range), which node(s) are responsible for (and hold a copy of) it. This is done by associating every key (key range) with a list of nodes in decreasing priority, going down the ring clockwise (the highest-priority node is the first one reached on the ring).
Explain:
R+W>N
Upon reading a value, R is the minimum number of nodes (among the N nodes responsible for the key) from which a copy of the value must be obtained and compared. Upon writing a new value, W is the minimum number of nodes (among the N nodes responsible for the key) that must have acknowledged receipt and confirmed storage of the value before the put or delete transaction succeeds.
At least N nodes in each preference list
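The intuition for R + W > N is the pigeonhole principle: any write quorum and any read quorum must overlap in at least one node, so every read sees at least one up-to-date copy. A small exhaustive check (my own illustration):

```python
from itertools import combinations

# Why R + W > N guarantees that a read overlaps the latest write:
# any W written nodes and any R read nodes out of N replicas intersect.
N, R, W = 3, 2, 2  # a common Dynamo-style configuration
replicas = range(N)

for written in combinations(replicas, W):
    for read in combinations(replicas, R):
        # Since R + W > N, the two sets always share a node.
        assert set(written) & set(read)
print("every read quorum overlaps every write quorum")
```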
Describe how an initial request for data is handled in Dynamo
- Connect to load balancer
- Directed to a random node on the ring
- Random node connects you to the “coordinator node” - the node first on the preference list
- Connects to N-1 nodes hosting replicas
- Stops when R of them have responded
R is the min number of nodes read from
Describe the Pros and Cons of Distributed Hash Tables
Pro
* Highly scalable
* Robust against failure
* Self organising
Con
* Lookup, no search
* Data Integrity
* Security Issues
What are some potential issues with Distributed Hash Tables? And what is the solution?
Looking for 2
- Poor distribution - All machines end up in one section of the ring
- Heterogeneous Performance
Tokens!! :) We split up the ring into tokens (some multiple of the number of machines) and assign tokens randomly to nodes.
How does adding/deletion of nodes work with tokens?
An added node gets assigned tokens to take over from other nodes (not all tokens have to come from same node)
A deleted node gets its tokens redistributed to other nodes in the network (not all tokens have to go to the same node)
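A sketch of token assignment (numbers and node names invented):

```python
import random

# The ring is split into Q tokens, some multiple of the number of
# machines, and tokens are assigned to nodes at random.
nodes = ["node-A", "node-B", "node-C"]
Q = 4 * len(nodes)  # 12 tokens

random.seed(0)  # deterministic for the example
owner = {token: random.choice(nodes) for token in range(Q)}

# Adding a node: it takes over tokens from several existing nodes
# (not necessarily all from the same one).
nodes.append("node-D")
for token in random.sample(range(Q), Q // len(nodes)):
    owner[token] = "node-D"

# Each node now owns a subset of tokens scattered around the ring.
print(sorted(t for t, n in owner.items() if n == "node-D"))
```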
What is a vector clock?
In AP systems, where partitions may exist, vector clocks are used to reconcile versions following a DAG structure.
A vector clock can logically be seen as a map from nodes (machines) to integers, i.e., the version number is incremented per machine rather than globally. A version number increases for a machine when this machine processes the corresponding entry (key-value) and updates it
pages 74-76 of the textbook illustrate this
+ tutorial 2
How to answer this type of question?
Given a list of versions, draw the version DAG that the coordinator node will build for returning available versions.
- If no partial ordering exists between nodes, then those nodes should be on the same level, i.e., they are both valid versions.
- If there is an edge between two nodes, then the parent node should be smaller than the child.
- You cannot have skip connections, i.e., there cannot be an edge from an ancestor node (excluding the parent node) directly to a child node.
- Transitive edges shouldn’t be present in the version DAG.
- Each edge represents the update of exactly one entry of the vector clock.
How do we define a partial order with regards to vector clocks?
Vector clocks can be compared to each other with a partial order relation ≤
A vector clock is <= another vector clock if for each machine, the associated integer in the first vector clock is also smaller than or equal to the associated integer in the second vector clock.
e.g. {"A":2,"C":1} ≤ {"A":3,"B":1,"C":1}
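This comparison is easy to express in code; the sketch below (my own implementation, matching the definition above) treats a machine missing from a clock as version 0:

```python
# Partial-order comparison of vector clocks (maps from machine -> counter).
def vc_leq(a: dict, b: dict) -> bool:
    """True iff a <= b: every counter in a is <= the matching one in b."""
    return all(v <= b.get(k, 0) for k, v in a.items())

v1 = {"A": 2, "C": 1}
v2 = {"A": 3, "B": 1, "C": 1}
print(vc_leq(v1, v2))  # True: the example from the card
print(vc_leq(v2, v1))  # False

# Incomparable clocks are concurrent versions (both kept in the DAG):
v3 = {"A": 1, "B": 2}
print(vc_leq(v1, v3), vc_leq(v3, v1))  # False False
```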
Order SSD, HDD, SDRAM w.r.t
* Price
* Speed (Read/Write)
* Capacity
Solid State Drive, Hard Disk Drive, Synchronous Dynamic Random Access Memory
- From cheapest to most expensive: HDD SSD SDRAM
- From slowest to fastest (in terms of read/write speed): HDD SSD SDRAM
- By their capacity in increasing order: SDRAM SSD HDD