Big Data Lecture 03 Cloud Storage Flashcards
What is the issue with big datasets?
They do not fit on a single machine. For example, the Sloan Digital Sky Survey dataset has 273 TB of data in 680 000 directories and 176 000 000 files.
What things are broken in the world of NoSQL?
<ul><li>Relational integrity,</li><li>domain integrity,</li><li>atomic integrity (1st normal form),</li><li>2nd/3rd/Boyce-Codd normal form.</li></ul>
What new data properties are added in NoSQL?
<ul><li>Heterogeneous data (no schema),</li><li>nested data (breaks atomic integrity),</li><li>denormalized data (no normal forms).</li></ul>
Describe the tech stack of big data systems.
<img></img>
What is ETL?
When loading data into a traditional database, we need to Extract, Transform and Load it (ETL).
What is a data lake? What file operations is it meant for?
As opposed to a traditional database, a data lake reads data directly from the file system.<br></br><br></br>It is meant for reading and querying, not for editing.
How are files stored in a file system?
File content is stored in blocks, usually of 4 kB. If a file's size is not an exact multiple of the block size, its last block is taken up entirely anyway.
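The block arithmetic can be sketched as follows. This is a minimal illustration, assuming a 4 kB block means 4096 bytes; the function names are made up for this card and do not belong to any file system API.

```python
BLOCK_SIZE = 4096  # assumption: "4 kB" on the card means 4096 bytes


def blocks_needed(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of whole blocks occupied by a file of file_size bytes."""
    # Ceiling division with integers only, so large sizes stay exact.
    return -(-file_size // block_size)


def wasted_bytes(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes allocated to the file but left unused in its last block."""
    return blocks_needed(file_size, block_size) * block_size - file_size
```

For instance, a file of 4097 bytes occupies two blocks and leaves 4095 bytes of the second block unused.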
What networks does local storage use?
<ul><li>Local machine,</li><li>LAN (local area network), NAS (drive on network),</li><li>not WAN (wide area network).</li></ul>
What is the principle of scaling from simple data storage to a data lake?
<div>Simplify!</div>
<ul><li>Throw away folder structure, use flat objects,</li><li>give data unique ID (key-value model).</li></ul>
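One way to picture the simplification is to flatten a folder tree into a key-value mapping. The sketch below is only an illustration: the nested dict standing in for folders, the `/` separator in keys, and the function name `flatten` are all assumptions made for this example.

```python
def flatten(tree: dict, prefix: str = "") -> dict:
    """Turn a nested folder dict into a flat {key: content} mapping."""
    flat = {}
    for name, value in tree.items():
        key = f"{prefix}{name}"
        if isinstance(value, dict):
            # A folder: recurse and extend the key with the folder name.
            flat.update(flatten(value, prefix=key + "/"))
        else:
            # A file: store its content under one unique flat key.
            flat[key] = value
    return flat


# Hypothetical folder structure with placeholder contents.
folders = {"photos": {"2020": {"cat.jpg": b"..."}}, "notes.txt": b"hello"}
store = flatten(folders)
```

After flattening, no folder hierarchy remains: `store` only maps the keys `photos/2020/cat.jpg` and `notes.txt` to their contents.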
How to scale a system expensively and cheaply?
<ul><li>Expensively: scale up - buy a larger, stronger machine,</li><li>cheaply: scale out - buy many cheap machines,</li><li>be smart! Optimize your code!</li></ul>
What are the constraints on the data centers? What are the number of machines and cores?
Due to electricity grid and cooling:<br></br><ul><li>1000 - 100 000 machines in a data center,</li><li>1-200 CPU cores per machine.</li></ul>
How large are the local storage, memory and network bandwidth per server?
<ul><li>1-30 TB of storage,</li><li>16 GB-24 TB of RAM,</li><li>1-200 Gbit/s of network bandwidth.</li></ul>
How are servers stored in a data center?
They are in server racks; one rack has 42 rack units.<br></br><br></br>This ensures modularity, as we can stack servers, storage and routers in the same rack.<br></br><br></br>Each module takes up 1-4 rack units.
Describe S3 data storage model.
<ul><li>Data is stored in buckets, each has a (worldwide) unique ID,</li><li>files (max. 5 TB) are stored as objects in the buckets, denoted by (in-the-bucket) unique ID.</li></ul>
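The bucket/object model can be mimicked with a small in-memory class. This is a toy stand-in, not the real S3 API: the class `ToyObjectStore`, its method names, and the reading of 5 TB as 5 * 10^12 bytes are assumptions made for illustration.

```python
MAX_OBJECT_SIZE = 5 * 10**12  # 5 TB limit from the card (decimal TB assumed)


class ToyObjectStore:
    """In-memory stand-in for the bucket/object model; not the real S3 API."""

    def __init__(self):
        self.buckets = {}  # bucket ID -> {object key -> bytes}

    def create_bucket(self, bucket_id):
        # Bucket IDs must be unique across the whole store.
        if bucket_id in self.buckets:
            raise ValueError("bucket ID must be unique")
        self.buckets[bucket_id] = {}

    def put_object(self, bucket_id, key, data):
        if len(data) > MAX_OBJECT_SIZE:
            raise ValueError("object exceeds the size limit")
        # Keys are unique within a bucket: the same key overwrites.
        self.buckets[bucket_id][key] = data

    def get_object(self, bucket_id, key):
        return self.buckets[bucket_id][key]
```

An object is thus addressed by the pair (bucket ID, object key); there are no nested folders inside a bucket.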
What guarantees does S3 offer in SLA (service level agreement)?
<ul><li>Durability: 99.999999999% (lose 1 in 10<sup>11</sup> objects),</li><li>availability: 99.99% (down 1h/year),</li><li>response time: &lt; 10 ms in 99.9% of cases (not mean or average).</li></ul>
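The percentages translate into concrete numbers with a little arithmetic. The sketch below assumes a 365-day year and treats durability as a per-object loss probability; the function names are made up for this card.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR


def expected_losses(n_objects: int, durability: float) -> float:
    """Expected number of lost objects out of n_objects."""
    return n_objects * (1 - durability)


downtime = downtime_hours_per_year(0.9999)
losses = expected_losses(10**11, 0.99999999999)
```

With 99.99% availability, `downtime` comes out at roughly 0.88 hours, which the card rounds to 1 h/year. With eleven nines of durability, `losses` is about one object out of 10^11.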
Explain CAP theorem.
Impossibility triangle: a storage system cannot be<br></br><ul><li><i>C</i>onsistent (all replicas of the data agree),</li><li><i>A</i>vailable (reachable with low latency),</li><li><i>P</i>artition tolerant (keeps working when the network breaks up into partitions),</li></ul><div>all at the same time.</div>
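The trade-off can be illustrated with a toy store that has two replicas. During a network partition it must either refuse writes (stay consistent, give up availability) or accept them on one side only (stay available, give up consistency). The class and the mode labels "CP" and "AP" are illustrative assumptions, not a real system.

```python
class TwoReplicaStore:
    """Toy store with two replicas; mode picks what to give up in a partition."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False

    def write(self, replica, key, value):
        """Write via the given replica; returns True if the write is accepted."""
        if not self.partitioned:
            for r in self.replicas:  # network is up: update both replicas
                r[key] = value
            return True
        if self.mode == "CP":
            return False  # stay consistent: refuse the write
        self.replicas[replica][key] = value  # stay available: replicas diverge
        return True

    def consistent(self):
        return self.replicas[0] == self.replicas[1]
```

In "CP" mode a write during a partition returns `False` and the replicas still agree; in "AP" mode the write succeeds but `consistent()` becomes `False` until the replicas are reconciled.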
What are REST APIs?
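<b>Representational state transfer</b>: a client-server architectural style for APIs built on HTTP. Resources are identified by URIs and manipulated with HTTP methods such as GET, PUT, POST and DELETE.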
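REST APIs are commonly used over HTTP, where the URI names the resource and the HTTP method names the action. The sketch below only builds requests with Python's standard `urllib.request.Request` and sends nothing; the endpoint `storage.example.com`, the bucket and key names, and the helper `rest_request` are hypothetical.

```python
from urllib.request import Request

BASE_URL = "https://storage.example.com"  # hypothetical endpoint


def rest_request(method, bucket, key, body=None):
    """Build (but do not send) an HTTP request addressing one object by URI."""
    return Request(f"{BASE_URL}/{bucket}/{key}", data=body, method=method)


# The same URI with different HTTP methods: create/replace, read, delete.
put = rest_request("PUT", "my-bucket", "notes.txt", b"hello")
get = rest_request("GET", "my-bucket", "notes.txt")
delete = rest_request("DELETE", "my-bucket", "notes.txt")
```

All three requests target the same URI; only the method (and, for PUT, the body) changes.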