Big Data Lecture 03 Cloud Storage Flashcards

1
Q

What is the issue with big datasets?

A

They do not fit on a single machine, e.g. the Sloan Digital Sky Survey dataset has 273 TB of data in 680 000 directories and 176 000 000 files.

2
Q

What things are broken in the world of NoSQL?

A

<ul><li>Relational integrity,</li><li>domain integrity,</li><li>atomic integrity (1st normal form),</li><li>2nd/3rd/Boyce-Codd normal form.</li></ul>

3
Q

What new data properties are added in NoSQL?

A

<ul><li>Heterogeneous data (no schema),</li><li>nested data (breaks atomic integrity),</li><li>denormalized data (no normal forms).</li></ul>

4
Q

Describe the tech stack of big data systems.

A

<img></img>

5
Q

What is ETL?

A

When loading data into a traditional database, we need to Extract, Transform and Load it (ETL).

6
Q

What is a data lake? What file operations is it meant for?

A

As opposed to a traditional database, a data lake reads data directly from the file system.<br></br><br></br>It is meant for reading and querying, not for editing.

7
Q

How are files stored in a file system?

A

File content is stored in blocks, usually of 4 kB each. If the file size is not an exact multiple of the block size, the last block is still taken up in full.
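
As a rough illustration of this rounding, the sketch below computes how many blocks a file occupies and how many bytes of the last block stay unused. It assumes a block size of 4096 bytes; the function names are made up for this example.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes (4 kB)


def blocks_needed(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of whole blocks occupied by a file of file_size bytes."""
    # Ceiling division with integers only: round up to the next full block.
    return -(-file_size // block_size)


def wasted_bytes(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes allocated to the file but not used by its content."""
    return blocks_needed(file_size, block_size) * block_size - file_size
```

For instance, a file of 4097 bytes needs two blocks, of which all but one byte of the second block is unused.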

8
Q

What networks does local storage use?

A

<ul><li>Local machine,</li><li>LAN (local area network), NAS (drive on network),</li><li>not WAN (wide area network).</li></ul>

9
Q

What is the principle for scaling from simple data storage to a data lake?

A

<div>Simplify!</div>

<ul><li>Throw away folder structure, use flat objects,</li><li>give data unique ID (key-value model).</li></ul>

10
Q

How to scale a system expensively and cheaply?

A

<ul><li>Expensively: scale up - buy a larger, stronger machine,</li><li>cheaply: scale out - buy many cheap machines,</li><li>be smart! Optimize your code!</li></ul>

11
Q

What are the constraints on data centers? How many machines and cores do they have?

A

Due to electricity grid and cooling:<br></br><ul><li>1000 - 100 000 machines in a data center,</li><li>1-200 CPU cores per machine.</li></ul>

12
Q

How large are the local storage, memory and network bandwidth per server?

A

<ul><li>1-30 TB of local storage,</li><li>16 GB-24 TB of RAM,</li><li>1-200 Gbit/s of network bandwidth.</li></ul>

13
Q

How are servers stored in a data center?

A

They are mounted in server racks; one rack has 42 rack units.<br></br><br></br>This ensures modularity, as we can stack servers, storage and routers in the same rack.<br></br><br></br>Each module takes up 1-4 rack units.

14
Q

Describe the S3 data storage model.

A

<ul><li>Data is stored in buckets, each of which has a (worldwide) unique ID,</li><li>files (max. 5 TB) are stored as objects in the buckets, each identified by an ID that is unique within its bucket.</li></ul>
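
To make the two levels of identifiers concrete, here is a toy in-memory model of buckets and objects. It is only a sketch of the data model, not the S3 API: the class, its method names and the decimal interpretation of "5 TB" are assumptions made for this example.

```python
MAX_OBJECT_SIZE = 5 * 10**12  # 5 TB, assuming decimal terabytes


class ToyObjectStore:
    """Toy model: bucket ID -> (object ID -> object content)."""

    def __init__(self):
        self._buckets = {}

    def create_bucket(self, bucket_id: str) -> None:
        # Bucket IDs must be unique across the whole store ("worldwide").
        if bucket_id in self._buckets:
            raise ValueError(f"bucket ID {bucket_id!r} is already taken")
        self._buckets[bucket_id] = {}

    def put_object(self, bucket_id: str, object_id: str, data: bytes) -> None:
        if len(data) > MAX_OBJECT_SIZE:
            raise ValueError("object exceeds the maximum object size")
        # Object IDs only need to be unique within their bucket.
        self._buckets[bucket_id][object_id] = data

    def get_object(self, bucket_id: str, object_id: str) -> bytes:
        return self._buckets[bucket_id][object_id]
```

An object is thus addressed by the pair (bucket ID, object ID), and the same object ID can be reused in a different bucket.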

15
Q

What guarantees does S3 offer in its SLA (service level agreement)?

A

<ul><li>Durability: 99.999999999% (lose 1 in 10<sup>11</sup> objects),</li><li>availability: 99.99% (down less than 1 h/year),</li><li>response time: &lt; 10 ms in 99.9% of cases (a percentile, not a mean or average).</li></ul>
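
The percentages can be turned into more tangible numbers with a little arithmetic. The sketch below uses `decimal.Decimal` to avoid floating-point noise; the helper names are invented for this example, and a year is assumed to have 365 days.

```python
from decimal import Decimal

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def failure_fraction(percent: str) -> Decimal:
    """Fraction of cases NOT covered by a guarantee given in percent."""
    return 1 - Decimal(percent) / 100


def downtime_hours_per_year(availability_percent: str) -> Decimal:
    """Maximum yearly downtime allowed by an availability guarantee."""
    return HOURS_PER_YEAR * failure_fraction(availability_percent)


def objects_per_expected_loss(durability_percent: str) -> Decimal:
    """Number of stored objects per one expected lost object."""
    return 1 / failure_fraction(durability_percent)
```

With these definitions, `downtime_hours_per_year("99.99")` gives 0.876 hours (about 52.6 minutes), and `objects_per_expected_loss("99.999999999")` gives 10^11.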

16
Q

Explain the CAP theorem.

A

Impossibility triangle - a distributed storage system cannot be:<br></br><ul><li><i>C</i>onsistent (the data agree across all replicas),</li><li><i>A</i>vailable (reachable with low latency),</li><li><i>P</i>artition tolerant (keeps working when the network breaks up into partitions),</li></ul><div>all at the same time.</div>

17
Q

What are REST APIs?

A

<b>Representational state transfer</b>: an architectural style for APIs in which clients access and transfer resources over HTTP.

18
Q

How are resources referred to? What are the parts of such a reference?

A

Using a URI (uniform resource identifier), which consists of<br></br><ul><li>scheme: https</li><li>domain: www.example.com</li><li>path: api/collection/foo/object/bar</li><li>query: ?id=foobar</li><li>fragment: #head</li></ul>
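
The same decomposition can be reproduced with Python's standard library. The URI below is simply the example parts from this card glued together; note that `urlsplit` reports the path with its leading slash and the query and fragment without their `?` and `#` delimiters.

```python
from urllib.parse import parse_qs, urlsplit

# Example URI assembled from the parts listed on this card.
uri = "https://www.example.com/api/collection/foo/object/bar?id=foobar#head"

parts = urlsplit(uri)
scheme = parts.scheme      # "https"
domain = parts.netloc      # "www.example.com"
path = parts.path          # "/api/collection/foo/object/bar"
query = parts.query        # "id=foobar"
fragment = parts.fragment  # "head"

# The query string can be decoded further into a dict of value lists.
query_params = parse_qs(query)
```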

19
Q

What HTTP methods are there and what do they do? Are they idempotent?

A

<ul><li>GET: obtains resource,</li><li>PUT: stores resource,</li><li>DELETE: deletes resource,</li><li>POST: anything else.</li></ul>

<div>Only POST is not idempotent.</div>
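
Idempotent means that repeating a request leaves the server in the same state as sending it once. The toy in-memory store below illustrates the difference; it is not an HTTP server, and the class and the way `post` generates keys are assumptions made for this sketch (POST is shown in its typical role of creating a new resource).

```python
import itertools


class ToyResourceStore:
    """In-memory stand-in for a server, one method per HTTP verb."""

    def __init__(self):
        self.resources = {}
        self._next_id = itertools.count(1)

    def get(self, key):
        # Reading never changes the state.
        return self.resources.get(key)

    def put(self, key, value):
        # Storing under a client-chosen key: repeating it changes nothing.
        self.resources[key] = value

    def delete(self, key):
        # Deleting an already deleted resource leaves the state unchanged.
        self.resources.pop(key, None)

    def post(self, value):
        # The server picks a fresh key, so every repetition adds a resource.
        key = f"item-{next(self._next_id)}"
        self.resources[key] = value
        return key
```

Calling `put("a", 1)` twice leaves a single resource, whereas calling `post("x")` twice leaves two.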

20
Q

How does the HTTP protocol work?

A

<ul><li>A request is sent with headers and a body,</li><li>a response is issued with a status code, headers and a body.</li></ul>

21
Q

Do URIs on Cloud Storage use file structure with slashes?

A

No, object keys are flat, but you can use slashes in them to create a logical structure for yourself.
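
One way to picture this: the keys are plain strings, and a "folder" is nothing more than a shared prefix. The sketch below filters a flat list of keys by prefix; the keys and the helper name are invented for this example.

```python
def list_prefix(keys, prefix):
    """Return all keys starting with prefix, in sorted order."""
    return sorted(key for key in keys if key.startswith(prefix))


# A flat key space: the slashes are ordinary characters inside the keys.
object_keys = [
    "lectures/03/slides.pdf",
    "lectures/03/notes.txt",
    "lectures/04/slides.pdf",
    "readme.txt",
]

week_three = list_prefix(object_keys, "lectures/03/")
```

Here `week_three` contains the two keys that start with `lectures/03/`, although no such folder exists anywhere.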

22
Q

Do data centers get filled to capacity?

A

No, they are filled up to 70-80%; after that, resources are reallocated or a new data center has to be built.

23
Q

What is intra-stamp replication?

A

Synchronous replication of data within a storage stamp, done during the client upload.

24
Q

What is inter-stamp replication?

A

After the user has finished uploading, the data is replicated asynchronously to other storage stamps.

25
Q

Why are data centers spread around different regions?

A

<ul><li>To optimize user latency,</li><li>to increase resilience to natural catastrophes.</li></ul>

26
Q

Is object storage a database?

A

No, retrieval takes too long (>100 ms, cf. <10 ms for typical databases), so it cannot serve the fast queries we expect from a database.

27
Q

What are key-value stores?

A

<ul><li>Similar data model to object storage,</li><li>but with smaller objects,</li><li>and no metadata.</li></ul>

28
Q

Why and how do we simplify to a key-value store?

A

We settle for:<br></br><ul><li>a simple data model,</li><li>only eventual consistency,</li></ul><div>to obtain</div><ul><li>increased performance,</li><li>scalability.</li></ul>

29
Q

How do we query a key-value store?

A

As an associative array, aka map: values are looked up by key. It cannot be implemented as a plain hash map, since that does not scale to multiple machines.

30
Q

What is the design principle of incremental scalability?

A

Machines can be added to and removed from the network one at a time.

31
Q

What is the design principle of symmetry?

A

All machines run the same software and have the same responsibilities.

32
Q

What is the design principle of decentralization?

A

There is no central machine in the system.

33
Q

What is the design principle of heterogeneity?

A

The machines may differ from each other, e.g. in their resources, and the system copes with that.

34
Q

How are nodes connected in a data center?

A

As a peer-to-peer network.

35
Q

How are files assigned to devices in a cluster? How are new nodes added and removed?

A

Keys are hashed (uniformly) onto a logical circle, and each machine takes care of a certain range of the circle.<br></br><br></br>When a node is added or removed, a range is split up or merged with a neighboring one, and the affected data is transferred.<br></br><br></br>The data is stored redundantly: the node ranges overlap, so that each key is covered by N nodes.
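
A minimal sketch of such a ring is given below. It is an illustration under several assumptions, not the implementation of any particular system: positions come from the first 8 bytes of SHA-256, each node owns the range ending at its own position, replicas go to the next nodes clockwise, and hash collisions between nodes are ignored. All names are invented for this example.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a node name or key to a position on the logical circle."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes=()):
        self._positions = []  # sorted positions of all nodes
        self._owner = {}      # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # The new node takes over part of its clockwise neighbor's range.
        position = ring_position(node)
        bisect.insort(self._positions, position)
        self._owner[position] = node

    def remove_node(self, node: str) -> None:
        # The removed node's range merges into its clockwise neighbor's.
        position = ring_position(node)
        self._positions.remove(position)
        del self._owner[position]

    def replicas(self, key: str, n: int = 1) -> list:
        """First n nodes met when walking clockwise from the key."""
        if not self._positions:
            return []
        start = bisect.bisect_left(self._positions, ring_position(key))
        count = min(n, len(self._positions))
        size = len(self._positions)
        return [self._owner[self._positions[(start + i) % size]]
                for i in range(count)]
```

The point of the circle is that adding a node only moves the keys that now fall into the new node's range; all other keys stay where they were.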

36
Q

How does the protocol know where to find the data?

A

The nodes hold a list of pointers:<br></br><ul><li>Chord: finger tables (each node knows the nodes at power-of-two distances along the circle).</li><li>Dynamo: preference lists (every node knows about the ranges of all other nodes).</li></ul>
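
For the Chord case, the sketch below builds a finger table, assuming the usual definition in which entry i of node n points to the first node at or after position n + 2^i on a circle of 2^m positions. The function names and the tiny example ring are chosen for illustration only.

```python
import bisect


def successor(node_ids, ident):
    """First node at or after ident on the circle (node_ids is sorted)."""
    index = bisect.bisect_left(node_ids, ident)
    return node_ids[index % len(node_ids)]  # wrap around past the end


def finger_table(node_ids, node, m):
    """Finger table of `node` on a circle with 2**m positions."""
    ring_size = 2 ** m
    return [successor(node_ids, (node + 2 ** i) % ring_size)
            for i in range(m)]


# Tiny ring with 2**3 = 8 positions and nodes at positions 0, 1 and 3.
nodes = [0, 1, 3]
fingers_of_0 = finger_table(nodes, 0, 3)  # targets 1, 2, 4
```

Because the distances double with each entry, a node stores only m pointers but can still halve the remaining distance to any key with each hop.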

37
Q

What are R, W and N and how do they relate?

A

<ul><li>N - number of replicas of each piece of data,</li><li>R - number of replicas that must respond to a read,</li><li>W - number of replicas that must acknowledge a write (synchronously).</li></ul>

<div>We require R + W &gt; N at all times: every set of R nodes read from then shares at least one node with every set of W nodes written to, so a read always reaches a replica holding the latest write, and any conflict is noticed.</div>
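
The overlap argument behind R + W > N can be checked by brute force for small N. The sketch below enumerates every possible read set and write set among N replicas and tests whether they always share a node; the function name is invented for this example.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every read set of size r meets every write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )
```

For N = 3, the choice R = W = 2 always overlaps, whereas R = W = 1 does not: a read may hit a replica that the latest write has not reached yet.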

38
Q

How does the data lake system handle a request for data?

A

<ul><li>First, the load balancer picks a random node to receive the request,</li><li>the random node figures out who the coordinator is (e.g. the first node holding the data) and forwards the request to it,</li><li>the coordinator passes the request on to the N-1 other nodes hosting replicas.</li></ul>

39
Q

What are pros and cons of distributed hash tables?

A

+: highly scalable, robust against failure, self-organizing,<br></br>-: no search (only lookup by exact key), data integrity issues, security issues.

40
Q

How to increase the number of nodes and the elasticity of a distributed storage system?

A

Do not tie the ranges to physical machines; instead, introduce tokens (several per machine) as a form of virtualization. Tokens now play the role that nodes had previously.
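
The sketch below shows one way tokens could be placed on the circle, assuming each token's position is the hash of the machine name plus a token index, and that replicas are chosen by walking clockwise and skipping tokens of machines that were already picked. The naming scheme and the helper functions are assumptions made for this example.

```python
import bisect
import hashlib


def position(label: str) -> int:
    """Position of a token or key on the logical circle."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(tokens_per_machine: dict) -> list:
    """Sorted list of (position, machine), one entry per token."""
    ring = [
        (position(f"{machine}#{index}"), machine)
        for machine, count in tokens_per_machine.items()
        for index in range(count)
    ]
    ring.sort()
    return ring


def preference_list(ring: list, key: str, n: int) -> list:
    """First n distinct machines met walking clockwise from the key."""
    positions = [token_position for token_position, _ in ring]
    start = bisect.bisect_left(positions, position(key))
    chosen = []
    for step in range(len(ring)):
        machine = ring[(start + step) % len(ring)][1]
        if machine not in chosen:  # skip further tokens of the same machine
            chosen.append(machine)
            if len(chosen) == n:
                break
    return chosen
```

A stronger machine can simply be given more tokens, e.g. `build_ring({"big": 8, "medium": 4, "small": 2})`, so it ends up responsible for a larger share of the circle.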

41
Q

How are tokens spread over machines?

A

They are distributed over different machines to increase robustness.

42
Q

What is the concept of a vector clock? How are conflicts resolved?

A

Each node in the system keeps a counter of its own modifications, which is incremented whenever the node writes something. The vector of these counters is passed on to other nodes and clients on request.<br></br><br></br>Different nodes can write, and each of them increments only its own counter.<br></br><br></br>The versions form a directed acyclic graph (DAG), as vector clocks are only partially ordered. If, on comparison of two versions, neither dominates the other (both are maximal elements, but neither is a supremum), they conflict and need to be merged (done by the user).
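
A small sketch of the comparison follows, with vector clocks represented as dictionaries from node name to counter (a missing entry counts as 0). The function names and return strings are invented for this example. Note that `merge_clocks` only combines the clocks; merging the conflicting values themselves is left to the user, as stated above.

```python
def increment(clock: dict, node: str) -> dict:
    """Clock after `node` performs one more write."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def dominates(a: dict, b: dict) -> bool:
    """True if a has seen every write that b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a is newer"
    if dominates(b, a):
        return "b is newer"
    return "conflict"  # incomparable: neither version descends from the other


def merge_clocks(a: dict, b: dict) -> dict:
    """Component-wise maximum, the clock of a merged version."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in a.keys() | b.keys()}
```

For example, if node A writes a version and nodes B and C then each write on top of it independently, the two resulting clocks are incomparable and `compare` reports a conflict.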

43
Q

What is the difference between the Amazon and Azure mindsets?

A

Amazon has many different services that each do one thing, and requests are routed to the right one.<br></br><br></br>Azure has all hardware able to do everything, with each machine doing a bit of everything.

44
Q

Describe Azure data storage.

A

<ul><li>Objects are identified by account, container and blob,</li><li>there are 3 types of blobs: Block Blobs (files, at most 190.7 TB), Append Blobs (at most 195 GB) and Page Blobs (for storing and accessing the memory of virtual machines).</li></ul>