Big Data Lecture 03 Cloud Storage Flashcards
What is the issue with big datasets?
They do not fit on a single machine, e.g. the Sloan Digital Sky Survey dataset has 273 TB of data in 680 000 directories and 176 000 000 files.
What things are broken in the world of NoSQL?
<ul><li>Relational integrity,</li><li>domain integrity,</li><li>atomic integrity (1st normal form),</li><li>2nd/3rd/Boyce-Codd normal form.</li></ul>
What new data properties are added in NoSQL?
<ul><li>Heterogeneous data (no schema),</li><li>nested data (break atomic integrity),</li><li>denormalized data (no normal forms).</li></ul>
Describe the tech stack of big data systems.
Data Stores
Data models
What is ETL?
When loading data into a traditional database, we need to Extract, Transform and Load it (ETL).
What is a data lake? What file operations is it meant for?
As opposed to a traditional database, it reads data directly from the file system.<br></br><br></br>Meant for reading and querying, not for editing.
How are files stored in a file system?
File content is stored in blocks, usually of 4 kB. If a file's size is not an exact multiple of the block size, the last block is still taken up in full.
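The block arithmetic can be sketched as follows. This is a minimal illustration assuming a 4 kB (4096-byte) block size; the function names are made up for this example and do not belong to any file system API.

```python
BLOCK_SIZE = 4 * 1024  # assumed block size in bytes (4 kB, as on the card)

def blocks_needed(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of whole blocks a file occupies; a partial last block counts in full."""
    # Ceiling division with integers only, to avoid floating-point rounding.
    return -(-file_size_bytes // block_size)

def wasted_bytes(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes that are allocated to the file but not used by its content."""
    return blocks_needed(file_size_bytes, block_size) * block_size - file_size_bytes

# A 10 000-byte file needs 3 blocks (12 288 bytes), leaving 2 288 bytes unused.
print(blocks_needed(10_000), wasted_bytes(10_000))
```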
What networks does local storage use?
<ul><li>Local machine,</li><li>LAN (local area network), NAS (drive on network),</li><li>not WAN (wide area network).</li></ul>
What is the principle of scaling from simple data storage to a data lake?
<ul><li>Throw away folder structure, use flat objects,</li><li>give data unique ID (key-value model).</li></ul>
How to scale a system expensively and cheaply?
<ul><li>Expensively: scale up - buy a larger, stronger machine,</li><li>cheaply: scale out - buy many cheap machines,</li><li>be smart! Optimize your code!</li></ul>
What are the constraints on the data centers? What are the number of machines and cores?
Due to electricity grid and cooling:<br></br><ul><li>1000 - 100 000 machines in a data center,</li><li>1-200 CPU cores per machine.</li></ul>
How big are local storage, memory and bandwidth per server?
<ul><li>1-30 TB of storage,</li><li>16GB-24TB of RAM,</li><li>1-200 Gbit/s of network bandwidth.</li></ul>
How are servers stored in a data center?
They are in server racks, one rack has 42 rack units.<br></br><br></br>This ensures modularity, as we can stack servers, storage and routers into the same rack.<br></br><br></br>Each module takes up 1-4 rack units.
Describe S3 data storage model.
<ul><li>Data is stored in buckets, each has a (worldwide) unique ID,</li><li>files (max. 5 TB) are stored as objects in the buckets, denoted by (in-the-bucket) unique ID.</li></ul>
What guarantees does S3 offer in SLA (service level agreement)?
<ul><li>Durability: 99.999999999% (lose 1 in 10<sup>11</sup> objects),</li><li>availability: 99.99% (down 1h/year),</li><li>response time: < 10 ms in 99.9% of cases (not mean or average).</li></ul>
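The percentages translate into concrete numbers with simple arithmetic. The sketch below assumes a 365-day year; the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year the service may be unavailable at a given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

def expected_lost_objects(num_objects: int, durability_percent: float) -> float:
    """Expected number of lost objects out of num_objects at a given durability."""
    return num_objects * (1 - durability_percent / 100)

print(f"{downtime_hours_per_year(99.99):.2f} hours of downtime per year")
print(f"{expected_lost_objects(10**11, 99.999999999):.2f} objects lost out of 10^11")
```

At 99.99% availability the allowed downtime is about 0.88 hours, i.e. roughly 53 minutes per year, which the card rounds to 1 hour.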
Explain CAP theorem.
Impossibility triangle - a storage system cannot be:<br></br><ul><li><i>C</i>onsistent (all data agree in all backups and versions),</li><li><i>A</i>vailable (reachable with low latency),</li><li><i>P</i>artition tolerant (keeps working when the network breaks up into disconnected parts),</li></ul><div>all at the same time.</div>
What are REST APIs?
<b>Representational state transfer</b>: an architectural style for client-server APIs on top of HTTP, in which resources are identified by URIs and manipulated with HTTP methods.
How are resources referred to? What are the parts of a URI?
Using URI (uniform resource identifier), which has<br></br><ul><li>scheme: https</li><li>domain:</li><li>path: api/collection/foo/object/bar</li><li>query: ?id=foobar</li><li>fragment: #head</li></ul>
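The parts of a URI can be pulled apart with Python's standard `urllib.parse` module. The host `example.com` below is a placeholder chosen for this sketch, since the card leaves the domain blank; the remaining parts are taken from the card.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URI: "example.com" is a placeholder host, not from the card.
uri = "https://example.com/api/collection/foo/object/bar?id=foobar#head"
parts = urlsplit(uri)

print("scheme:  ", parts.scheme)
print("domain:  ", parts.netloc)
print("path:    ", parts.path)      # urlsplit keeps the leading slash
print("query:   ", parts.query)     # raw query string, without the "?"
print("fragment:", parts.fragment)  # without the "#"

# parse_qs turns the query string into a dict mapping names to lists of values.
query_params = parse_qs(parts.query)
print(query_params)
```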
What HTTP methods are there and what do they do? Are they idempotent?
<ul><li>GET: obtains resource,</li><li>PUT: stores resource,</li><li>DELETE: deletes resource,</li><li>POST: anything else.</li></ul>
<div>Only POST is not idempotent.</div>
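Idempotency means that repeating a request leaves the server in the same state as sending it once. The toy in-memory class below is a hypothetical stand-in for a REST service (it is not a real HTTP library or API) and shows the difference between PUT and POST.

```python
import itertools

class ToyStore:
    """In-memory stand-in for a REST resource collection (illustrative only)."""

    def __init__(self):
        self.resources = {}
        self._ids = itertools.count(1)

    def get(self, key):
        # Reading never changes state, so repeating it is harmless.
        return self.resources.get(key)

    def put(self, key, value):
        # The client names the resource, so repeating overwrites with the same value.
        self.resources[key] = value

    def delete(self, key):
        # Deleting an already deleted resource changes nothing further.
        self.resources.pop(key, None)

    def post(self, value):
        # The server picks a fresh key each time, so repeating creates duplicates.
        key = f"item-{next(self._ids)}"
        self.resources[key] = value
        return key

store = ToyStore()
store.put("bar", "v1")
store.put("bar", "v1")
count_after_puts = len(store.resources)   # still one resource

store.post("v1")
store.post("v1")
count_after_posts = len(store.resources)  # two more resources were created

print(count_after_puts, count_after_posts)
```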
How does the HTTP protocol work?
<ul><li>Request is sent with header and body,</li><li>Response is issued with status code, header and body.</li></ul>
Do URIs on Cloud Storage use file structure with slashes?
No, but you can use slashes to create logical structure for yourself.
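A minimal sketch of this idea: the store is a flat map from keys to data, and "folders" are only a naming convention that is recovered by filtering on a key prefix. The keys and the helper name are made up for illustration.

```python
# Flat key space: the slashes are ordinary characters inside the keys.
objects = {
    "photos/2023/a.jpg": b"...",
    "photos/2024/b.jpg": b"...",
    "notes.txt": b"...",
}

def list_prefix(store: dict, prefix: str) -> list:
    """Return all keys starting with the prefix, emulating a folder listing."""
    return sorted(key for key in store if key.startswith(prefix))

print(list_prefix(objects, "photos/"))
```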
Do data centers get filled up to full?
No, they are filled up to 70-80%, then resources are reallocated or a new center has to be built.
What is intra-stamp replication?
Synchronous method of duplication of data, done on client upload.
What is inter-stamp replication?
After the user has finished uploading, the resources are duplicated asynchronously to other storage stamps.
Why are data centers spread around different regions?
<ul><li>To optimize user latency,</li><li>to increase resilience to natural catastrophes.</li></ul>
Is object storage a database?
No, the retrieval takes too long (>100 ms, cf. <10 ms for typical databases), so it cannot serve the fast queries we expect from a database.
What are key-value stores?
<ul><li>Similar data model to object storage,</li><li>but with smaller objects,</li><li>and no metadata.</li></ul>
Why and how do we simplify to a key-value store?
We settle for:<br></br><ul><li>simplicity,</li><li>only eventual consistency,</li></ul><div>to obtain</div><ul><li>increased performance,</li><li>scalability.</li></ul>
How do we query a key-value store?
Logically it is an associative array (aka map), queried by key. A plain in-memory hash map does not scale to multiple machines, so the map has to be distributed over many nodes.
What is the design principle of incremental scalability?
Possibility to add and remove machines from the network.
What is the design principle of symmetry?
All machines run the same software and have the same role.
What is the design principle of decentralization?
There is no central machine in the system.
What is the design principle of heterogeneity?
The machines can be different, have different resources, and it is okay.
How are nodes connected in a data center?
As a peer-2-peer network.
How are files assigned to devices in a cluster? How are new nodes added and removed?
Keys are hashed (uniformly) onto a logical circle, and each machine takes care of a certain range of the circle.<br></br><br></br>When a node is added or removed, a range is cut up / merged and the affected data is transferred.<br></br><br></br>The data is redundant: the node ranges overlap, so that each key is covered by N nodes.
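The ring can be sketched in a few lines. This is a simplified illustration, not Dynamo's actual implementation: it assumes SHA-256 as the hash function, one ring position per node, and made-up node names.

```python
import bisect
import hashlib

def ring_position(label: str) -> int:
    """Map a string uniformly onto the ring (here: the SHA-256 value as an integer)."""
    return int.from_bytes(hashlib.sha256(label.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas
        # Sorted list of (position, node) pairs, i.e. the circle laid out flat.
        self.points = sorted((ring_position(node), node) for node in nodes)

    def add(self, node):
        bisect.insort(self.points, (ring_position(node), node))

    def remove(self, node):
        self.points.remove((ring_position(node), node))

    def replicas(self, key):
        """The first N nodes met when walking clockwise from the key's position."""
        positions = [pos for pos, _ in self.points]
        start = bisect.bisect_left(positions, ring_position(key))
        count = min(self.n_replicas, len(self.points))
        # The modulo wraps around the end of the list, closing the circle.
        return [self.points[(start + i) % len(self.points)][1] for i in range(count)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"], n_replicas=3)
print(ring.replicas("my-object"))
```

Adding a node only moves the keys that fall into the new node's range; all other keys keep their first replica, which is what makes the scheme cheap to rescale.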
How does the protocol know where to find the data?
The nodes hold a list of pointers:<br></br><ul><li>Chord: finger tables (powers of two of what is where).</li><li>Dynamo: preference lists (every node knows about the ranges of all other nodes).</li></ul>
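A finger table in the Chord sense can be sketched as follows, assuming an m-bit identifier ring where entry i of node n points to the first node at or after n + 2^i. The helper names and the tiny example ring are chosen for illustration.

```python
import bisect

def successor(sorted_node_ids, ident):
    """First node id that is >= ident, wrapping around to the smallest id."""
    i = bisect.bisect_left(sorted_node_ids, ident)
    return sorted_node_ids[i % len(sorted_node_ids)]

def finger_table(node_id, node_ids, m):
    """Entry i points to the node responsible for (node_id + 2**i) mod 2**m."""
    ordered = sorted(node_ids)
    return [successor(ordered, (node_id + 2**i) % 2**m) for i in range(m)]

# A 3-bit ring (identifiers 0..7) with nodes at positions 0, 1 and 3.
nodes = [0, 1, 3]
for n in nodes:
    print(n, finger_table(n, nodes, m=3))
```

Because the jumps double in size, a lookup can halve the remaining distance at each hop, so only a logarithmic number of pointers per node is needed.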
What are R, W and N and how do they relate?
<ul><li>N - number of replicas of each data item,</li><li>R - number of nodes that must answer a read,</li><li>W - number of nodes that must acknowledge a write (synchronously).</li></ul>
<div>It must hold at all times that R + W > N: the set of nodes read from and the set of nodes written to then always share at least one node, so a read sees the latest write and any conflict is detected.</div>
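The overlap argument can be checked by brute force for small N. The sketch below enumerates every possible read set of size R and write set of size W and tests whether they always share a node; the function name is illustrative.

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every R-node read set intersects every W-node write set among N nodes."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )

print(quorums_always_overlap(3, 2, 2))  # R + W = 4 > 3
print(quorums_always_overlap(3, 1, 1))  # R + W = 2 <= 3
```

With N = 3, R = 2, W = 2 any read touches at least one node that took part in the latest write, while R = 1, W = 1 allows a read to miss it completely.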
How does the data lake system handle a request for data?
<ul><li>First, a load balancer picks a random node to handle the request,</li><li>the random node figures out who is the coordinator (e.g. the first node with the data) and asks them,</li><li>the coordinator forwards the request to the N-1 other nodes hosting replicas.</li></ul>
What are pros and cons of distributed hash tables?
+: highly scalable, robust against failure, self-organizing,<br></br>-: only lookup by key (no search), data integrity, security issues.
How to increase the number of nodes and the elasticity of a distributed storage system?
Do not depend on machines for duplication, but introduce tokens (multiple for each machine) as a form of virtualization. The tokens now take over the role that the nodes had previously.
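A minimal sketch of tokens (virtual nodes), assuming SHA-256 for ring positions and made-up machine names: each machine places several tokens on the ring, and a stronger machine can simply be given more of them.

```python
import bisect
import hashlib
from collections import Counter

def ring_position(label: str) -> int:
    """Map a string uniformly onto the ring (here: the SHA-256 value as an integer)."""
    return int.from_bytes(hashlib.sha256(label.encode()).digest(), "big")

def build_ring(tokens_per_machine: dict) -> list:
    """One ring entry per token; each token remembers the machine that owns it."""
    return sorted(
        (ring_position(f"{machine}#{t}"), machine)
        for machine, count in tokens_per_machine.items()
        for t in range(count)
    )

def owner(ring: list, key: str) -> str:
    """Machine owning the first token clockwise from the key's position."""
    positions = [pos for pos, _ in ring]
    i = bisect.bisect_left(positions, ring_position(key)) % len(ring)
    return ring[i][1]

# Hypothetical cluster: the big machine gets eight times as many tokens.
ring = build_ring({"small": 8, "big": 64})
load = Counter(owner(ring, f"key-{i}") for i in range(2000))
print(load)
```

The share of keys per machine follows its share of tokens only approximately, since token positions are pseudo-random.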
How are tokens spread over machines?
They are distributed over different machines to increase robustness.
What is the concept of a vector clock? How are conflicts resolved?
Each node in the system keeps a counter of its modifications, which it increments whenever it writes something. The vector of these counters is passed on to other nodes and clients on request.<br></br><br></br>Different nodes can write, and each one increments only its own counter.<br></br><br></br>The versions form a directed acyclic graph (DAG), i.e. a partial order. When comparing two versions, if neither vector dominates the other (both are maximal elements, but neither is a supremum), the versions conflict and we need to merge them (done by the user).
What is the difference between the Amazon and Azure mindsets?
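Vector clock comparison can be sketched with plain dictionaries mapping node names to counters. The function names and the node names A, B, C are chosen for this example; missing entries count as 0.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with the writing node's counter increased by one."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock

def compare(a: dict, b: dict) -> str:
    """Classify two clocks as 'equal', 'before', 'after' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: a conflict that needs a merge

def merge(a: dict, b: dict) -> dict:
    """Component-wise maximum, the clock of a version that has seen both inputs."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

v1 = increment({}, "A")   # first write, handled by node A
v2 = increment(v1, "B")   # update of v1 handled by node B
v3 = increment(v1, "C")   # independent update of v1 handled by node C
print(compare(v1, v2), compare(v2, v3))

# After the user reconciles v2 and v3, the merged version dominates both.
v4 = increment(merge(v2, v3), "A")
print(compare(v2, v4), compare(v3, v4))
```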
Amazon has many different services, that all do one thing, and you get rerouted to them.<br></br><br></br>Azure has all hardware being able to do everything, and doing a bit of everything.
Describe Azure data storage.
<ul><li>Objects are identified by account, container and blob,</li><li>there are 3 types of blobs: Block Blobs (files, at most 190.7 TB), Append Blobs (at most 195 GB)
and Page Blobs (for storing and accessing the memory of virtual