Big Data Lecture 03 Cloud Storage Flashcards

Question 1

Q

What is the issue with big datasets?

Answer

A

They do not fit on a single machine. For example, the Sloan Digital Sky Survey dataset has 273 TB of data in 680 000 directories and 176 000 000 files.

Question 2

Q

What things are broken in the world of NoSQL?

Answer

A

<ul><li>Relational integrity,</li><li>domain integrity,</li><li>atomic integrity (1st normal form),</li><li>2nd/3rd/Boyce-Codd normal form.</li></ul>

Question 3

Q

What new data properties are added in NoSQL?

Answer

A

<ul><li>Heterogeneous data (no schema),</li><li>nested data (break atomic integrity),</li><li>denormalized data (no normal forms).</li></ul>

Question 4

Q

Describe the tech stack of big data systems.

Answer

A

UI
Querying
Data Stores
Indexing
Processing
Validation
Data models
Syntax
Encoding
Storage

Question 5

Q

What is ETL?

Answer

A

When loading data into a traditional database, we need to Extract, Transform and Load it (ETL).

Question 6

Q

What is a data lake? What file operations is it meant for?

Answer

A

As opposed to a traditional database, a data lake reads data directly from the file system. It is meant for reading and querying, not for editing.

Question 7

Q

How are files stored in a file system?

Answer

A

File content is stored in blocks, usually of 4 kB. If a file's size is not an exact multiple of the block size, the last block is still taken up in full.
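The block arithmetic can be sketched as follows. This is a minimal illustration, assuming a block size of 4096 bytes; the function names `blocks_needed` and `wasted_bytes` are made up for this card and do not belong to any file system API.

```python
BLOCK_SIZE = 4096  # bytes; assumed 4 kB block size


def blocks_needed(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of whole blocks occupied by a file of file_size bytes."""
    # Integer ceiling division: a partially filled last block still counts.
    return (file_size + block_size - 1) // block_size


def wasted_bytes(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes allocated on disk but not used by the file content."""
    return blocks_needed(file_size, block_size) * block_size - file_size


for size in (1, 4096, 4097, 10_000):
    print(size, blocks_needed(size), wasted_bytes(size))
```

A file of 4097 bytes needs two blocks, so almost a whole block is left unused.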

Question 8

Q

What networks does local storage use?

Answer

A

<ul><li>Local machine,</li><li>LAN (local area network), NAS (drive on network),</li><li>not WAN (wide area network).</li></ul>

Question 9

Q

What is the principle of scaling from simple data storage to a data lake?

Answer

A

<div>Simplify!</div>

<ul><li>Throw away folder structure, use flat objects,</li><li>give data unique ID (key-value model).</li></ul>

Question 10

Q

How to scale a system expensively and cheaply?

Answer

A

<ul><li>Expensively: scale up - buy a larger, stronger machine,</li><li>cheaply: scale out - buy many cheap machines,</li><li>be smart! Optimize your code!</li></ul>

Question 11

Q

What are the constraints on the data centers? What are the number of machines and cores?

Answer

A

Due to electricity grid and cooling: <ul><li>1000 - 100 000 machines in a data center,</li><li>1-200 CPU cores per machine.</li></ul>

Question 12

Q

How big are the local storage, memory and bandwidth per server?

Answer

A

<ul><li>1-30 TB of storage,</li><li>16GB-24TB of RAM,</li><li>1-200 Gbit/s.</li></ul>

Question 13

Q

How are servers stored in a data center?

Answer

A

They are in server racks; one rack has 42 rack units. This ensures modularity, as we can stack servers, storage and routers in the same rack. Each module takes up 1-4 rack units.

Question 14

Q

Describe S3 data storage model.

Answer

A

<ul><li>Data is stored in buckets, each has a (worldwide) unique ID,</li><li>files (max. 5 TB) are stored as objects in the buckets, denoted by (in-the-bucket) unique ID.</li></ul>

Question 15

Q

What guarantees does S3 offer in SLA (service level agreement)?

Answer

A

<ul><li>Durability: 99.999999999% (lose 1 in 10¹¹ objects),</li><li>availability: 99.99% (down 1h/year),</li><li>response time: < 10 ms in 99.9% of cases (not mean or average).</li></ul>
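The percentages can be turned into more tangible numbers with a little arithmetic. The sketch below assumes a year of 365 days and uses hypothetical helper names; it only restates the figures from the answer above.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours per year for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_losses(num_objects: float, durability: float) -> float:
    """Expected number of lost objects for a given durability fraction."""
    return num_objects * (1.0 - durability)


print(downtime_hours_per_year(0.9999))        # a bit under one hour
print(expected_losses(1e11, 0.99999999999))   # about one object
```

An availability of 99.99% works out to about 0.88 hours (roughly 53 minutes) per year, which the card rounds to 1 h.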

Question 16

Q

Explain CAP theorem.

Answer

A

Impossibility triangle: a storage system cannot be <ul><li>Consistent (all data agree in all backups and versions),</li><li>Available (reachable with low latency),</li><li>Partition tolerant (keeps working when the network breaks up into disconnected parts),</li></ul><div>all at the same time.</div>

Question 17

Q

What are REST APIs?

Answer

A

Representational State Transfer: an HTTP-based, client-server style of API in which resources are identified by URIs and accessed with HTTP methods.

Question 18

Q

How are resources referred to? What are the parts of such a reference?

Answer

A

Using a URI (uniform resource identifier), which has <ul><li>scheme: https</li><li>domain: www.example.com</li><li>path: api/collection/foo/object/bar</li><li>query: ?id=foobar</li><li>fragment: #head</li></ul>
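The same decomposition can be checked with Python's standard library. This is a small sketch: the URI is assembled from the example parts listed above, and note that `urlsplit` reports the path with its leading slash and the query and fragment without their `?` and `#` delimiters.

```python
from urllib.parse import urlsplit

# Example URI assembled from the parts listed on this card.
uri = "https://www.example.com/api/collection/foo/object/bar?id=foobar#head"
parts = urlsplit(uri)

print("scheme:  ", parts.scheme)
print("domain:  ", parts.netloc)
print("path:    ", parts.path)
print("query:   ", parts.query)
print("fragment:", parts.fragment)
```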

Question 19

Q

What HTTP methods are there and what do they do? Are they idempotent?

Answer

A

<ul><li>GET: obtains resource,</li><li>PUT: stores resource,</li><li>DELETE: deletes resource,</li><li>POST: anything else.</li></ul>

<div>Only POST is not idempotent.</div>
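Idempotency means that repeating a request leaves the server in the same state as sending it once. The toy in-memory store below illustrates the difference; `ToyStore` is a hypothetical class for this card, not a real HTTP server, and its `post` method assumes the common case where POST creates a new resource on every call.

```python
class ToyStore:
    """In-memory stand-in for a server holding resources by key."""

    def __init__(self):
        self.objects = {}
        self.next_id = 0

    def get(self, key):
        # Reading does not change state, so repeating it is harmless.
        return self.objects.get(key)

    def put(self, key, value):
        # Storing the same value under the same key twice gives the same state.
        self.objects[key] = value

    def delete(self, key):
        # Deleting an already deleted key changes nothing.
        self.objects.pop(key, None)

    def post(self, value):
        # Assumed behaviour: every call creates a new resource.
        key = f"item-{self.next_id}"
        self.next_id += 1
        self.objects[key] = value
        return key


store = ToyStore()
store.put("a", "hello")
store.put("a", "hello")     # repeated PUT: still one object
store.post("hello")
store.post("hello")         # repeated POST: two more objects
print(len(store.objects))
```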

Question 20

Q

How does the HTTP protocol work?

Answer

A

<ul><li>A request is sent with a header and a body,</li><li>a response is issued with a status code, a header and a body.</li></ul>

Question 21

Q

Do URIs on Cloud Storage use file structure with slashes?

Answer

A

No, but you can use slashes to create logical structure for yourself.
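A flat key space with slashes used only as a naming convention can be imitated with a plain dictionary. This is a sketch with made-up keys; `list_prefix` is a hypothetical helper, not a call from any cloud SDK.

```python
# Flat key space: the slashes are just characters inside the keys.
objects = {
    "photos/2023/cat.jpg": b"...",
    "photos/2023/dog.jpg": b"...",
    "photos/2024/bird.jpg": b"...",
    "notes.txt": b"...",
}


def list_prefix(store, prefix):
    """Return all keys starting with prefix, imitating a folder listing."""
    return sorted(key for key in store if key.startswith(prefix))


print(list_prefix(objects, "photos/2023/"))
```

There is no real directory called `photos/2023/`; the listing works purely by string prefix.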

Question 22

Q

Do data centers get filled up to full?

Answer

A

No, they are filled up to 70-80%, then resources are reallocated or a new data center has to be built.

Question 23

Q

What is intra-stamp replication?

Answer

A

Synchronous method of duplication of data, done on client upload.

Question 24

Q

What is inter-stamp replication?

Answer

A

After the user has finished uploading, the data is duplicated asynchronously to other storage stamps.

Question 25

Q

Why are data centers spread around different regions?

Answer

A

<ul><li>To optimize user latency,</li><li>to increase resilience to natural catastrophes.</li></ul>

Question 26

Q

Is object storage a database?

Answer

A

No, the retrieval takes too long (>100 ms, cf. <10 ms for typical databases), so we cannot do our lovely operations.

Question 27

Q

What are key-value stores?

Answer

A

<ul><li>Similar data model to object storage,</li><li>but with smaller objects,</li><li>and no metadata.</li></ul>

Question 28

Q

Why and how do we simplify to key-value store?

Answer

A

We require: <ul><li>simplicity,</li><li>only eventual consistency,</li></ul><div>to obtain</div><ul><li>increased performance,</li><li>scalability.</li></ul>

Question 29

Q

How do we query key-value store?

Answer

A

As an associative array (aka map): values are looked up by key. A plain hash map is not enough, since it does not scale to multiple machines.

Question 30

Q

What is the design principle of incremental scalability?

Answer

A

Possibility to add and remove machines from the network.

Question 31

Q

What is the design principle of symmetry?

Answer

A

All machines run the same software.

Question 32

Q

What is the design principle of decentralization?

Answer

A

There is no central machine in the system.

Question 33

Q

What is the design principle of heterogeneity?

Answer

A

The machines can be different, have different resources, and it is okay.

Question 34

Q

How are nodes connected in a data center?

Answer

A

As a peer-2-peer network.

Question 35

Q

How are files assigned to devices in a cluster? How are new nodes added and removed?

Answer

A

Keys are hashed (uniformly) onto a logical circle, and each machine takes care of a certain range of the circle. When a node is added or removed, a range is cut up or merged and the affected data is transferred. The data is stored redundantly: the node ranges overlap, so each item is held by N nodes.
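The ring can be sketched in a few lines. This is a minimal illustration, not Dynamo's actual implementation: the class name `Ring`, the use of SHA-256 as the hash function and the node names are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a node name or key to a position on the circle (assumed SHA-256)."""
    return int(hashlib.sha256(name.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas
        # Sorted list of (position, node) pairs represents the circle.
        self.points = sorted((ring_position(node), node) for node in nodes)

    def add(self, node):
        bisect.insort(self.points, (ring_position(node), node))

    def remove(self, node):
        self.points.remove((ring_position(node), node))

    def nodes_for(self, key):
        """The N nodes that follow the key's position clockwise on the circle."""
        positions = [position for position, _ in self.points]
        start = bisect.bisect_right(positions, ring_position(key))
        count = min(self.n_replicas, len(self.points))
        # The modulo wraps around the end of the sorted list, closing the circle.
        return [self.points[(start + i) % len(self.points)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
before = {key: ring.nodes_for(key) for key in ("k1", "k2", "k3", "k4")}
ring.add("node-f")
after = {key: ring.nodes_for(key) for key in before}
print(before)
print(after)
```

Adding `node-f` only changes the replica lists of keys whose range it cuts into; all other keys stay where they were.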

Question 36

Q

How does the protocol know where to find the data?

Answer

A

The nodes hold a list of pointers: <ul><li>Chord: finger tables (each node knows the nodes at power-of-two distances along the ring).</li><li>Dynamo: preference lists (every node knows about the ranges of all other nodes).</li></ul>

Question 37

Q

What are R, W and N and how do they relate?

Answer

A

<ul><li>N - number of data duplicates,</li><li>R - number of nodes each node reads from,</li><li>W - number of nodes each node writes to (synchronously).</li></ul>

<div>It must hold at all times that R + W > N. Then the nodes we read from and the nodes we write to together number more than there are replicas, so every read overlaps every write in at least one node, and any conflict will be noticed.</div>
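The overlap argument can be verified by brute force for small N. The sketch below enumerates every possible read set and write set; the function name `quorums_always_overlap` is made up for this card.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every set of r replicas shares a node with every set of w replicas."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


print(quorums_always_overlap(3, 2, 2))  # R + W = 4 > 3
print(quorums_always_overlap(3, 1, 1))  # R + W = 2 <= 3
```

With N = 3, R = 2, W = 2 any read set must contain at least one node that took part in the latest write; with R = W = 1 a read can miss it entirely.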

Question 38

Q

How does the data lake system handle a request for data?

Answer

A

<ul><li>First, a load balancer picks a random node to ask,</li><li>the random node figures out who the coordinator is (e.g. the first node holding the data) and asks it,</li><li>the coordinator forwards the request to the N-1 other nodes hosting replicas.</li></ul>

Question 39

Q

What are pros and cons of distributed hash tables?

Answer

A

<ul><li>Pros: highly scalable, robust against failure, self-organizing,</li><li>cons: no lookup or search, data integrity, security issues.</li></ul>

Question 40

Q

How to increase the number of nodes and the elasticity of a distributed storage system?

Answer

A

Do not depend on physical machines directly, but introduce tokens (multiple per machine) as a form of virtualization. The tokens now play the role that nodes had previously.

Question 41

Q

How are tokens spread over machines?

Answer

A

They are distributed over different machines to increase robustness.
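Tokens can be sketched as several ring positions per machine. This is an illustration under assumptions: SHA-256 as the hash, made-up machine names, and the idea that a stronger machine is simply given more tokens; the helper names are hypothetical.

```python
import bisect
import hashlib


def position(label: str) -> int:
    """Position of a token or key on the circle (assumed SHA-256)."""
    return int(hashlib.sha256(label.encode()).hexdigest(), 16)


def build_token_ring(tokens_per_machine):
    """Place several tokens per machine on the circle, sorted by position."""
    ring = []
    for machine, count in tokens_per_machine.items():
        for i in range(count):
            ring.append((position(f"{machine}#token{i}"), machine))
    ring.sort()
    return ring


def owner(ring, key):
    """Machine owning the first token clockwise from the key's position."""
    positions = [token_position for token_position, _ in ring]
    index = bisect.bisect_right(positions, position(key)) % len(ring)
    return ring[index][1]


# A machine with more resources gets more tokens, hence a larger share of keys.
ring = build_token_ring({"small-machine": 2, "big-machine": 8})
print([owner(ring, f"key-{i}") for i in range(10)])
```

Because each machine's tokens are scattered around the circle, losing one machine spreads its load over many neighbours instead of a single one.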

Question 42

Q

What is the concept of a vector clock? How are conflicts resolved?

Answer

A

Each node in the system keeps a counter of its modifications, which is incremented whenever the node writes something. The vector of all counters is passed on to other nodes and clients on request. Different nodes can write, and each increments its own counter every time. The vectors form a partial order, drawn as a directed acyclic graph (DAG). When comparing two versions, if neither is at least as large as the other in every component (both are maximal elements, but neither is a supremum), they conflict and we need to merge (done by the user).
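The comparison rule can be sketched with dictionaries mapping node names to counters. This is a minimal illustration with hypothetical function names; the `merge` helper only combines the clocks, while reconciling the conflicting values themselves is left to the user, as stated above.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if clock a is at least as large as clock b in every component."""
    # Nodes missing from a clock are treated as having counter 0.
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a is newer"
    if dominates(b, a):
        return "b is newer"
    return "conflict"  # neither version descends from the other


def merge(a: dict, b: dict) -> dict:
    """Component-wise maximum: the clock of a version that has seen both."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))
print(compare({"A": 2}, {"B": 1}))
print(merge({"A": 2}, {"B": 1}))
```

In the second comparison node A and node B each wrote without seeing the other's update, so the versions are incomparable and must be merged.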

Question 43

Q

What is the difference between the Amazon and Azure mindsets?

Answer

A

Amazon has many different services, that all do one thing, and you get rerouted to them. Azure has all hardware being able to do everything, and doing a bit of everything.

Question 44

Q

Describe Azure data storage.

Answer

A

<ul><li>Objects are denoted by account, container and blob,</li><li>there are 3 types of blobs: Block Blobs (files, at most 190.7 TB), Append Blobs (at most 195 GB) and Page Blobs (for storing and accessing the memory of virtual machines).</li></ul>