Cloud Storage - Mabel Flashcards
Mabel to Fill
What constraints do we let go of when entering the “NoSQL universe”?
(Expecting 4)
- Tabular Integrity
- Domain Integrity
- Atomic Integrity
- Normal Forms
In NoSQL, “data denormalization” design covers which 2 concepts?
- Heterogeneous Data
- Nested Data
This can be referred to broadly as “Denormalised Data”
Define “Heterogeneous Data”
Heterogeneous data fulfils neither domain integrity (it may not even have a schema) nor relational (tabular) integrity.
Define “Nested Data”
Nested data is not in first normal form (it violates atomic integrity). For example, tables inside tables.
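A minimal sketch of both definitions, using Python dictionaries as stand-ins for records. The records, field names and helper function are hypothetical, invented only for illustration.

```python
# Hypothetical records, invented for illustration.
records = [
    {"name": "Ada", "age": 36},
    # Same attribute, different type: breaks domain integrity.
    {"name": "Alan", "age": "forty-one"},
    # Different set of attributes: breaks relational (tabular) integrity.
    {"name": "Grace", "languages": ["COBOL"]},
    # A record inside a record: breaks atomic integrity (not in 1NF).
    {"name": "Edgar", "address": {"city": "Zurich", "zip": 8001}},
]


def is_atomic(value):
    """A value is atomic if it is not itself a collection."""
    return not isinstance(value, (dict, list))


# Relational integrity would require every record to have the same attributes.
same_attributes = len({frozenset(r) for r in records}) == 1

# Atomic integrity would require every value to be atomic.
all_atomic = all(is_atomic(v) for r in records for v in r.values())

print("same attributes everywhere:", same_attributes)
print("all values atomic:", all_atomic)
```

Both checks come out false for this collection, which is what makes it heterogeneous and nested.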
What are the two main paradigms to store data?
- “Traditional” Database
- Data lakes
“Traditional” Databases
e.g. PostgreSQL
Data must first be imported via ETL:
E(xtract)
T(ransform)
L(oad)
Data Lakes
Read directly from a file system (in situ)
Stored “as is”
More convenient if you only want to read the data (“read-intensive”/OLAP)
e.g. using pandas in Python
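A minimal sketch of reading data in situ, assuming a small CSV file whose name and contents are made up here. It uses the standard-library csv module so that it runs without extra packages; pandas.read_csv would play the same role on a real data lake file.

```python
import csv
import os
import tempfile

# Create a small sample file so the example is self-contained (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "sales.csv")
with open(path, "w", newline="") as f:
    f.write("country,amount\nCH,10\nDE,25\nCH,5\n")

# Query the file in place: there is no prior load step into a database.
totals = {}
with open(path, newline="") as f:
    for row in csv.DictReader(f):
        totals[row["country"]] = totals.get(row["country"], 0) + int(row["amount"])

print(totals)
```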
How is data stored locally?
- Files are organised in a hierarchy (files and directories)
- File content is stored and read in blocks (roughly 4 kB)
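A rough sketch of block-wise reading, assuming a block size of 4096 bytes to match the "roughly 4 kB" figure. The file name and the helper read_in_blocks are hypothetical, and the operating system's real block handling sits below this level.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size, in bytes

# Write a 10,000-byte sample file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.bin")
with open(path, "wb") as f:
    f.write(b"x" * 10_000)


def read_in_blocks(file_path, block_size=BLOCK_SIZE):
    """Yield the file content one block at a time."""
    with open(file_path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block


sizes = [len(block) for block in read_in_blocks(path)]
print(sizes)
```

The file is read as two full blocks followed by one partial block holding the remaining bytes.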
How does local storage scale?
- Storage on Local Machine
- LAN (NAS): a hard drive accessible from multiple places on a network
NAS does not support access over a WAN
NAS = network-attached storage
LAN = local-area network
WAN = wide-area network
1,000 to 1,000,000 files are fine on a laptop, but 1,000,000,000 will break it
What are ways we can make storage scale?
- Get rid of the hierarchy
- Make the metadata flexible: attributes can differ from file to file (no schema)
- Use a simple data model: a flat list of files (called objects), each identified by an identifier (ID); blocks are not exposed to the user
- Use a large number of cheap machines rather than one “supercomputer”
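The first three ideas can be sketched as a toy in-memory store. The class FlatObjectStore, its methods and the sample IDs are hypothetical and only illustrate the data model; this is not how any real object store is implemented.

```python
class FlatObjectStore:
    """Toy object store: a flat mapping from ID to (content, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id, content, **metadata):
        # Metadata is free-form: each object may carry different attributes.
        self._objects[object_id] = (bytes(content), dict(metadata))

    def get(self, object_id):
        return self._objects[object_id][0]

    def metadata(self, object_id):
        return self._objects[object_id][1]

    def delete(self, object_id):
        del self._objects[object_id]


store = FlatObjectStore()
# The "/" below is just a character in the ID; there are no directories.
store.put("photos/2024/cat.jpg", b"\xff\xd8", content_type="image/jpeg", camera="hypothetical-cam")
store.put("notes.txt", b"hello", author="mabel")

print(store.get("notes.txt"))
print(store.metadata("photos/2024/cat.jpg"))
```

The user only ever sees whole objects and their IDs; nothing about blocks appears in the interface.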
Scaling Up vs Scaling Out
Scaling Up - A bigger machine: more memory, more or faster CPU cores, a larger disk
Scaling Out - One can buy more, similar machines and share the work across them
With scaling out, the price increases linearly with the number of machines
Optimising the code is often the better option and should always be done first
Data Centers (in numbers)
1,000-100,000 machines in a data center
1-200 cores per server
100,000 seems to be a hard limit because of electricity consumption and cooling
Servers (in numbers)
How many cores?
How much local storage?
How much RAM?
Server = Node = Machine
Between 1 and 64 cores per server
1-30 TB of local storage per server
16 GB to 24 TB of RAM per server
Laptops typically have up to 24 cores
Networks (in numbers)
The network bandwidth goes from 1 to 200 Gbit/s (HPC allows for higher)
Bandwidth is the highest within the same cluster
Network bandwidth is measured in bits per second, whereas storage is typically measured in bytes
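A small worked example of the bits-versus-bytes conversion, assuming decimal units (1 TB = 10^12 bytes, 1 Gbit = 10^9 bits) and an idealised link with no protocol overhead. The function name is hypothetical.

```python
def transfer_seconds(size_bytes, bandwidth_gbit_per_s):
    """Ideal time to move size_bytes over a link of the given bandwidth."""
    size_bits = size_bytes * 8  # 1 byte = 8 bits
    return size_bits / (bandwidth_gbit_per_s * 1e9)


ONE_TB = 10**12  # bytes

slow = transfer_seconds(ONE_TB, 1)    # 1 Gbit/s link
fast = transfer_seconds(ONE_TB, 200)  # 200 Gbit/s link
print(slow, fast)
```

Under these assumptions, 1 TB takes 8,000 seconds at 1 Gbit/s and 40 seconds at 200 Gbit/s.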
How do we measure server size in data centers?
Rack units (RU), which measure the height a server takes up in a rack
Rack servers are typically 1-4 RU tall
A cluster is just a room filled with racks put next to each other.
Describe Amazon S3
Simple Storage Service
Objects are stored in buckets; each is identified by an ID (objects can be PUT, GET and DELETEd)
An object can be at most 5 TB
An object can only be uploaded in a single chunk if it is smaller than 5 GB
By default users get 100 buckets
5 TB is a size that typically fits on a single disk
Object just means file
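The two size limits can be sketched as a simple decision rule. The function upload_strategy and its return values are hypothetical, and decimal units (1 GB = 10^9 bytes) are an assumption; this is not a call into any AWS library.

```python
GB = 10**9
TB = 10**12

MAX_OBJECT_SIZE = 5 * TB    # largest object, as stated in the notes
MAX_SINGLE_UPLOAD = 5 * GB  # single-chunk upload limit, as stated in the notes


def upload_strategy(size_bytes):
    """Decide how an object of the given size would have to be uploaded."""
    if size_bytes > MAX_OBJECT_SIZE:
        return "rejected"   # too large to be stored as one object
    if size_bytes < MAX_SINGLE_UPLOAD:
        return "single"     # fits in one chunk
    return "multipart"      # must be split into several chunks


for size in (1 * GB, 50 * GB, 6 * TB):
    print(size, upload_strategy(size))
```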
Service Level Agreements (SLAs)
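Durability - S3 loses less than 1 in 100 billion objects per year
Availability - S3 is available more than 99.99% of the year (under 1 h of downtime per year)
99.9% = under 10 h of downtime per year
99% = under 4 days of downtime per year
99.999% = under 6 minutes of downtime per year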
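The downtime figures follow from simple arithmetic, sketched here assuming a 365-day year. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_downtime_hours(availability_percent):
    """Maximum downtime per year allowed by an availability guarantee."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99, 99.9, 99.99, 99.999):
    hours = max_downtime_hours(pct)
    print(f"{pct}% -> {hours:.3f} h = {hours * 60:.1f} min")
```

This gives roughly 87.6 h (about 3.65 days) for 99%, 8.76 h for 99.9%, about 53 minutes for 99.99% and about 5.3 minutes for 99.999%.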
What is the CAP theorem?
Yet another impossibility triangle
- Consistency - Machines will all have the same answer (atomic consistency - all nodes see the same data)
- Availability - If people make requests they get an answer
- Partition Tolerance - The system keeps working even if the network is partitioned into disconnected subnetworks
We cannot have all three during a network partition
A system will be either AP or CP
CP - not available until the network is reconnected
AP - not consistent until the network is reconnected (eventual consistency)
Unavailable ≠ partition intolerant
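A toy simulation of the CP versus AP trade-off, assuming just two replicas and one client that can only reach the first replica during a partition. The Replica class and the functions write and heal are hypothetical; real systems use far more elaborate replication protocols.

```python
class Replica:
    def __init__(self):
        self.value = "v1"


def write(replicas, new_value, partitioned, mode):
    """Write via replicas[0]; replicas[1] is unreachable while partitioned."""
    if partitioned and mode == "CP":
        # Refuse the write rather than let the replicas diverge.
        return "unavailable"
    replicas[0].value = new_value
    if not partitioned:
        replicas[1].value = new_value  # replicate immediately
    return "ok"


def heal(replicas):
    """Once the network is reconnected, propagate the latest value."""
    replicas[1].value = replicas[0].value


# CP: consistent, but the write is rejected during the partition.
cp = [Replica(), Replica()]
cp_result = write(cp, "v2", partitioned=True, mode="CP")

# AP: the write succeeds, but the replicas disagree until the partition heals.
ap = [Replica(), Replica()]
ap_result = write(ap, "v2", partitioned=True, mode="AP")
ap_diverged = ap[0].value != ap[1].value
heal(ap)

print(cp_result, ap_result, ap_diverged, ap[1].value)
```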
REST API
REpresentational State Transfer
Queries are sent over HTTP(S), so a REST API can be used from many host languages.
Generally successful response status codes are 200-299 and client error response status codes are 400-499
Requests have Method, URI, [Header], [Body]. Responses have Status Code, [Header], [Body]
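The parts of a request and the status code ranges can be sketched with the Python standard library. The endpoint URL, the payload and the helper status_class are hypothetical, and the request is only constructed, never sent.

```python
import json
from http import HTTPStatus
from urllib.request import Request

# Hypothetical endpoint and payload; nothing is sent over the network.
body = json.dumps({"title": "CAP theorem"}).encode("utf-8")
request = Request(
    url="https://api.example.com/flashcards/42",  # URI
    data=body,                                     # optional body
    headers={"Content-Type": "application/json"},  # optional header
    method="PUT",                                  # method
)


def status_class(code):
    """Classify a response status code using the ranges from the notes."""
    if 200 <= code <= 299:
        return "success"
    if 400 <= code <= 499:
        return "client error"
    return "other"


print(request.get_method(), request.full_url)
print(status_class(HTTPStatus.OK), status_class(HTTPStatus.NOT_FOUND))
```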