Chapters 3 & 4 Knowledge Testers Flashcards
Questions at the end of each section
Do you know how storage and memory technologies (HDD, SSD and RAM) compare in terms of capacity and throughput?
Capacity (least to most): RAM, SSD, HDD
Throughput (slow to fast): HDD, SSD, RAM
Do you know the difference between data and metadata?
Metadata: a small relational table with the following attributes
- name of the file, access rights, owner, group of owner, last modification time, creation time, size
Data: content of the files
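A quick way to see the split on a local file system, in Python's standard library (the file name is a placeholder):

```python
import os, stat, time

path = "example.txt"  # placeholder file name

# Metadata: attributes about the file, not its content
st = os.stat(path)
print("size:", st.st_size, "bytes")
print("owner uid:", st.st_uid)
print("last modified:", time.ctime(st.st_mtime))
print("access rights:", stat.filemode(st.st_mode))

# Data: the content of the file itself
with open(path, "rb") as f:
    data = f.read()
```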
Order of magnitude that can be achieved in terms of number of objects, and object size?
S3: 100 buckets per account (by default), a virtually unlimited number of objects per bucket; objects up to 5 TB, uploaded in chunks of up to 5 GB.
Can you name a few big players for cloud-based object storage (vendors, consumers)?
Vendors: Amazon (S3), Microsoft (Azure Blob Storage).
Can you describe the features of S3 and Azure Blob Storage on a high level? Do you know what a bucket and object are? What block blob storage, append blob storage and page blob storage are and how they work?
S3: buckets that contain objects (each up to 5 TB, so an object fits on a single disk)
- no hierarchies for objects (flat namespace)
Azure Blob Storage: architecture publicly documented; a blob is identified by account, container, and blob name.
- organized in storage stamps (10-20 racks, ~30 PB each)
- exposes more details to users than S3
- block blobs: general-purpose data storage
- append blobs: optimized for append-only workloads such as logging
- page blobs: random-access reads and writes, used to back the virtual disks of virtual machines
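A minimal sketch of the bucket/object model with boto3, the AWS SDK for Python (bucket and key names are made up; note that create_bucket needs a region configuration outside us-east-1):

```python
import boto3

s3 = boto3.client("s3")

# Buckets live in a flat namespace: there is no hierarchy inside them.
s3.create_bucket(Bucket="my-example-bucket")

# An object is identified by (bucket, key); slashes in the key only
# *look* like directories, no real hierarchy exists.
s3.put_object(
    Bucket="my-example-bucket",
    Key="reports/2024/summary.txt",
    Body=b"hello object storage",
)

obj = s3.get_object(Bucket="my-example-bucket", Key="reports/2024/summary.txt")
print(obj["Body"].read())
```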
Describe what the most important SLA (Service Level Agreement) parameters mean (e.g., latency, availability, durability) as well as their typical range?
Latency: how quickly a request is served, usually stated over response times (e.g., 99.9% of requests will be served within 30 seconds). S3 has no latency guarantees.
Availability: how often your data will be available to you (e.g., 99.99% of a year leaves about 52 minutes of downtime).
Durability: how unlikely your data is to be lost (e.g., eleven 9s: 99.999999999%, i.e. nine 9s after the decimal point; an expected loss of roughly 1 object in 100 billion per year).
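The availability and durability figures above can be checked with simple arithmetic:

```python
# 99.99% yearly availability: how much downtime does that allow?
minutes_per_year = 365 * 24 * 60          # 525,600 minutes
print((1 - 0.9999) * minutes_per_year)    # ~52.6 minutes of downtime

# Eleven 9s of durability: annual probability of losing an object
print(1 - 0.99999999999)                  # ~1e-11, i.e. 1 in 100 billion
```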
Do you know what each letter stands for in CAP?
Consistency (atomic): at any point in time, the same request to any server returns the same result (all nodes see the same data)
Availability: the system is available for requests at all times
Partition tolerance: the system continues to function even if the network linking its machines is occasionally partitioned
Can you explain why, for large amounts of data, CAP becomes relevant over ACID?
ACID: transactional guarantees for data stored on a single machine, as in relational databases
CAP: for big data, where many machines are used and replicas of nodes must be kept and updated
Can you explain what a REST API is, what resources and methods are?
Data stores expose their functionality through APIs. REST = REpresentational State Transfer; often described as "HTTP done right".
Resources: anything: a document, a PDF, a person; referred to with a URI (Uniform Resource Identifier)
Methods: the actions applied to resources: GET, PUT, DELETE, POST
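A sketch of the four methods against a hypothetical resource URI, using the requests library (the host and paths are invented for illustration):

```python
import requests

base = "https://api.example.com"  # hypothetical endpoint

# GET: read a resource
doc = requests.get(f"{base}/documents/42")

# PUT: create or replace the resource at a known URI
requests.put(f"{base}/documents/42", json={"title": "draft"})

# POST: create a new resource; the server chooses its URI
requests.post(f"{base}/documents", json={"title": "new"})

# DELETE: remove the resource
requests.delete(f"{base}/documents/42")
```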
Can you describe a typical use case for object storage?
shopping carts for large online stores
Can you explain the difference between block storage and object storage?
Object -> flat, key-value pairs
- a HUGE number (billions/trillions) of large objects (up to 5 TB each)
Block -> hierarchies (a file system is built on top)
- a lot (millions) of HUGE files (>5 TB)
Can you explain the difference between the (logical) key-value model and a file system?
Key-value: flat, no hierarchies
File system: hierarchies (directories)
Do you know the order of magnitude of a block size for a local filesystem and for a distributed file system? Can you explain the rationale behind them with respect to latency and throughput?
Local: 4 kB
DFS block size: 64 or 128 MB
- large enough that time is not lost in latency waiting for a block to arrive (transfer time dominates seek time)
- small enough that a large file is conveniently spread over many machines (parallel access) to improve throughput, and small enough that a block can be re-sent cheaply if an error occurs
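Back-of-the-envelope numbers make the rationale concrete (assuming ~10 ms of seek latency and ~100 MB/s of sequential throughput, both hypothetical round figures):

```python
seek_ms = 10          # assumed latency before a block starts arriving
throughput = 100e6    # assumed 100 MB/s sequential read speed

def latency_share(block_bytes):
    """Fraction of total time spent on seek rather than transfer."""
    transfer_ms = block_bytes / throughput * 1000
    return seek_ms / (seek_ms + transfer_ms)

print(latency_share(4 * 1024))        # 4 kB local block: ~99.6% seek overhead
print(latency_share(128 * 1024**2))   # 128 MB DFS block: <1% seek overhead
```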
Can you contrast centralized architectures to decentralized (peer-to- peer) architectures?
Decentralized: all machines can talk to each other; all nodes play the same role.
Centralized: there is a central machine that all other machines talk to; one node is special (the queen bee), while the other nodes are interchangeable workers.
Can you explain the HDFS architecture, what a NameNode is and what a DataNode is, how blocks are replicated?
HDFS: a distributed file system: a hierarchy of files spread over multiple machines.
NameNode: the special node that all other nodes communicate with. It stores the namespace, the mapping from each file to its list of blocks, and the mapping from each block to the locations of its replicas.
DataNode: stores data blocks on its local disk and communicates with the NameNode through regular heartbeats (the DataNode always initiates contact). DataNodes communicate with each other through replication pipelines.
Blocks are replicated through replication pipelines between DataNodes: blocks are sent as smaller packets in a streaming fashion. The original node sends the first copy of the data to node #2, which then propagates it on. The original node does NOT send the data to all the nodes.
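A toy sketch of the pipeline idea in plain Python (not the real HDFS protocol): the client hands the block to the first DataNode only, and each node stores a packet and forwards it to the next:

```python
def replicate(packets, pipeline):
    """Stream packets down a chain of DataNodes, e.g. ["dn1", "dn2", "dn3"]."""
    for packet in packets:      # streaming: packet by packet, not whole block
        for node in pipeline:   # dn1 stores and forwards to dn2, dn2 to dn3...
            store(node, packet)

def store(node, packet):
    print(f"{node} stores {packet!r}")

block = [b"packet-1", b"packet-2"]  # a block split into small packets
replicate(block, ["dn1", "dn2", "dn3"])
```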
Can you sketch how the various components communicate with each other (client, NameNode, DataNode)?
NameNode: never initiates communication; it only responds to DataNode heartbeats and client requests.
DataNode: communicates with the NameNode through regular heartbeats (always initiating the contact); DataNodes communicate with each other through replication pipelines, forwarding block packets down the chain as described above.
Client: communicates with both the NameNode (for metadata: which blocks, on which DataNodes) and the DataNodes (for the actual data).
Can you point to the single points of failure of HDFS and explain how they can be addressed?
If the NameNode fails, the whole cluster is unusable: the NameNode is the single point of failure. Mitigation: keep an edit log of namespace changes and periodically merge it into a snapshot. In addition, a Standby NameNode keeps the exact same structures in its memory as the active NameNode and performs checkpoints periodically; it can be set up to instantly take over from the NameNode in case of a crash.
Can you explain how the NameNode stores the file system namespace, in memory and on disk? In particular, can you explain how the namespace file and the edit log work together at startup time and how they get modified once the system is up and running?
HDFS does not follow a key-value model: an HDFS cluster organizes its files as a hierarchy, called the file namespace. Files are thus organized in directories, similar to a local file system, and this namespace is kept in memory on the NameNode. The file namespace, containing the directory and file hierarchy as well as the mapping from files to block IDs, is backed up to a so-called snapshot. The snapshot and the edit log are stored either locally or on a network-attached drive (not on HDFS itself); for more resilience, they can also be copied to further backup locations. If the NameNode crashes, it can be restarted: the snapshot is loaded back into memory to restore the file namespace and the file-to-block-ID mapping, and the edit log is then replayed to apply the latest changes. While the system is up and running, every namespace change is first appended to the edit log, which is periodically merged into a new snapshot (a checkpoint).
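The startup sequence can be modeled in a few lines (a toy model, not HDFS's real on-disk format):

```python
# Last checkpoint: namespace mapping files to block IDs
snapshot = {"/a.txt": ["blk_1", "blk_2"]}

# Changes made since the checkpoint, recorded in order
edit_log = [
    ("create", "/b.txt", ["blk_3"]),
    ("delete", "/a.txt", None),
]

namespace = dict(snapshot)             # 1. load the snapshot into memory
for op, path, blocks in edit_log:      # 2. replay the edit log in order
    if op == "create":
        namespace[path] = blocks
    elif op == "delete":
        del namespace[path]

print(namespace)                       # {'/b.txt': ['blk_3']}
```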
Can you explain what a Standby NameNode is? (Note: it has many predecessors that only have historical relevance in the development of HDFS: Backup NameNode, Secondary NameNode, Checkpoint NameNode, etc, but this is not important for the course)
A Standby NameNode keeps the exact same structures in its memory as the active NameNode and performs checkpoints periodically. It can be set up so that the Standby NameNode instantly takes over from the NameNode in case of a crash.
Where does HDFS shine and why?
PBs of data; write once, read many; fault tolerance; high throughput
Do you know that HDFS files are updated by appending atomically and why?
Yes: appends are atomic, which simplifies consistency (no in-place updates to coordinate), aids batch processing, and simplifies logging.
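Against a running cluster this is visible in client libraries; for instance the third-party HdfsCLI package exposes append as a flag (the WebHDFS address and path below are placeholders):

```python
from hdfs import InsecureClient  # third-party package: pip install hdfs

client = InsecureClient("http://namenode:9870")  # placeholder WebHDFS URL

# Files are write-once; the only mutation offered is an atomic append.
client.write("/logs/app.log", data=b"event 1\n")               # create
client.write("/logs/app.log", data=b"event 2\n", append=True)  # atomic append
```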
Do you know how HDFS performs in terms of throughput and latency?
Optimized for high throughput, at the cost of latency (not designed for low-latency random access).
What are the main benefits of HDFS?
Handles massive files; streaming access; scalable (scales out well); fault tolerant; highly available
Describe the limitations of traditional (local) file systems?
Local file systems must fit on one machine, creating size constraints: a single disk cannot store big datasets.
Contrast or relate object storage with block storage, a file system, and the key-value model
Object storage:
- stores each object as data + metadata + a unique identifier
- good with unstructured data
- images, videos, backups, data lakes
Block storage: stores data in fixed-size blocks
- reads/writes blocks directly
- requires a system on top to manage files and directories
File system:
- organizes data into a hierarchical structure of directories
- built on top of block storage
- everyday, general-purpose storage
Key-value model:
- key: identifier; value: data
- access data by key; no schema or hierarchy
- caching, session data, quick lookups
An object store can be seen as a key-value store: object stores target large values,
while key-value stores target small values and fast access.
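The contrast in miniature, in plain Python: a key-value store is one flat mapping, while a file system is a tree that must be traversed:

```python
# Flat key-value model: one namespace, opaque keys
kv = {
    "reports/2024/summary.txt": b"...",  # the slash is part of the key,
    "cat.jpg": b"...",                   # not a directory separator
}
print(kv["reports/2024/summary.txt"])    # one lookup, no traversal

# File system: a hierarchy of directories
fs = {"reports": {"2024": {"summary.txt": b"..."}}}
print(fs["reports"]["2024"]["summary.txt"])
```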
What is a data centre made of (numbers)?
1,000s to 10,000s of machines
- most data centres cannot handle 100,000+ machines due to electricity and cooling constraints
- trend: fewer servers, more cores per server
- a server (also called a node) has 1-64 cores (and increasing)
- memory per node: 16 GB to 6 TB
- SSD/HDD: 1-20 TB
- bandwidth: 1-100 GB/s
- nodes are flat rectangular boxes stacked in a rack
- a cluster is a room full of racks
- a module/node in a rack can be a server, pure storage, or a network switch
- module height is standardized: 1-4 RU (rack units)
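These per-node figures multiply out quickly; a rough aggregate for a large cluster (the per-node values are assumptions picked from the ranges above):

```python
nodes = 10_000            # machines in the data centre
disk_tb_per_node = 20     # top of the 1-20 TB range
ram_gb_per_node = 256     # an assumed mid-range figure

print(f"total disk: {nodes * disk_tb_per_node / 1000:,.0f} PB")  # 200 PB
print(f"total RAM:  {nodes * ram_gb_per_node / 1000:,.0f} TB")   # 2,560 TB
```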
Do you know rough, typical numbers (per-node storage capacity, memory, number of cores, etc.)?
- 1,000s to 10,000s of machines
- 1-64 cores per node (and increasing)
- memory per node: 16 GB to 6 TB
- SSD/HDD: 1-20 TB
- bandwidth: 1-100 GB/s
- module height standardized: 1-4 RU (rack units)
What is object storage?
An alternative to RDBMSs for storing very large amounts of data. Data can be stored in blocks to improve latency, and the information is stored on a cluster of machines.
Benefits of object storage?
Typically latency improves and the system is more fault tolerant.
Many services provide an SLA (service level agreement): a commitment to prevent data loss and to ensure availability.
S3 makes no promises on latency.
Why is scaling out less expensive than scaling up?
Scaling up means buying a new, expensive computer; scaling out means adding cheap new computers to your existing cluster. Commodity hardware is far cheaper per unit of capacity than high-end machines.
Explain what aspects of the design of object storage enable scaling out and why.
There are no hierarchies, and data is split into blocks, so different blocks can be placed on different machines: the data is already broken into pieces. This also works well with heterogeneous machines.
Explain the three different ways, on the physical level, to deal with more data.
Scale up: improve the machines (more capacity per machine)
Scale out: buy more machines
Write better code: improve code efficiency
What is the difference between storage, memory, CPU and network, and how are they paramount in a cluster?
Storage: disks (HDDs/SSDs) can store up to a few terabytes each; non-volatile; hosts the local file system
Memory: volatile "working" memory; temporary
CPU: performs calculations/computations
Network: a disk can be made accessible through a network (LAN or WAN)