Chapters 3 & 4 Knowledge Testers Flashcards
Questions at the end of each section
Do you know how storage and memory technologies (HDD, SSD and RAM) compare in terms of capacity and throughput?
Capacity (least to most): RAM, SSD, HDD
Throughput (slow to fast): HDD, SSD, RAM
Do you know the difference between data and metadata?
Metadata: a small relational table with the following attributes
- name of the file, access rights, owner, group of owner, last modification time, creation time, size
Data: content of the files
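A quick way to see the split on a local file system, in Python's standard library (the file name is a placeholder):

```python
import os, stat, time

path = "example.txt"  # placeholder file name

# Metadata: attributes about the file, not its content
st = os.stat(path)
print("size:", st.st_size, "bytes")
print("owner uid:", st.st_uid)
print("last modified:", time.ctime(st.st_mtime))
print("access rights:", stat.filemode(st.st_mode))

# Data: the content of the file itself
with open(path, "rb") as f:
    data = f.read()
```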
Order of magnitude that can be achieved in terms of number of objects, and object size?
S3: 100 buckets per account (by default), a virtually unlimited number of objects per bucket; objects up to 5 TB, uploaded in chunks of up to 5 GB.
Can you name a few big players for cloud-based object storage (vendors, consumers)?
Vendors: Amazon (S3), Microsoft (Azure Blob Storage).
Can you describe the features of S3 and Azure Blob Storage on a high level? Do you know what a bucket and object are? What block blob storage, append blob storage and page blob storage are and how they work?
S3: buckets that contain objects (each up to 5 TB, so an object fits on a single disk)
- no hierarchies for objects (flat namespace)
Azure Blob Storage: architecture publicly documented; a blob is identified by account, container, and blob name.
- organized in storage stamps (10-20 racks, ~30 PB each)
- exposes more details to users than S3
- block blobs: general-purpose data storage
- append blobs: optimized for append-only workloads such as logging
- page blobs: random-access reads and writes, used to back the virtual disks of virtual machines
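A minimal sketch of the bucket/object model with boto3, the AWS SDK for Python (bucket and key names are made up; note that create_bucket needs a region configuration outside us-east-1):

```python
import boto3

s3 = boto3.client("s3")

# Buckets live in a flat namespace: there is no hierarchy inside them.
s3.create_bucket(Bucket="my-example-bucket")

# An object is identified by (bucket, key); slashes in the key only
# *look* like directories, no real hierarchy exists.
s3.put_object(
    Bucket="my-example-bucket",
    Key="reports/2024/summary.txt",
    Body=b"hello object storage",
)

obj = s3.get_object(Bucket="my-example-bucket", Key="reports/2024/summary.txt")
print(obj["Body"].read())
```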
Describe what the most important SLA (Service Level Agreement) parameters mean (e.g., latency, availability, durability) as well as their typical range?
Latency: how quickly a request is served, usually stated over response times (e.g., 99.9% of requests will be served within 30 seconds). S3 has no latency guarantees.
Availability: how often your data will be available to you (e.g., 99.99% of a year leaves about 52 minutes of downtime).
Durability: how unlikely your data is to be lost (e.g., eleven 9s: 99.999999999%, i.e. nine 9s after the decimal point; an expected loss of roughly 1 object in 100 billion per year).
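The availability and durability figures above can be checked with simple arithmetic:

```python
# 99.99% yearly availability: how much downtime does that allow?
minutes_per_year = 365 * 24 * 60          # 525,600 minutes
print((1 - 0.9999) * minutes_per_year)    # ~52.6 minutes of downtime

# Eleven 9s of durability: annual probability of losing an object
print(1 - 0.99999999999)                  # ~1e-11, i.e. 1 in 100 billion
```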
Do you know what each letter stands for in CAP?
Consistency (atomic): at any point in time, the same request to any server returns the same result (all nodes see the same data)
Availability: the system is available for requests at all times
Partition tolerance: the system continues to function even if the network linking its machines is occasionally partitioned
Can you explain why, for large amounts of data, CAP becomes relevant over ACID?
ACID: transactional guarantees for data stored on a single machine, as in relational databases
CAP: for big data, where many machines are used and replicas of nodes must be kept and updated
Can you explain what a REST API is, what resources and methods are?
Data stores expose their functionality through APIs. REST = REpresentational State Transfer; often described as "HTTP done right".
Resources: anything: a document, a PDF, a person; referred to with a URI (Uniform Resource Identifier)
Methods: the actions applied to resources: GET, PUT, DELETE, POST
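A sketch of the four methods against a hypothetical resource URI, using the requests library (the host and paths are invented for illustration):

```python
import requests

base = "https://api.example.com"  # hypothetical endpoint

# GET: read a resource
doc = requests.get(f"{base}/documents/42")

# PUT: create or replace the resource at a known URI
requests.put(f"{base}/documents/42", json={"title": "draft"})

# POST: create a new resource; the server chooses its URI
requests.post(f"{base}/documents", json={"title": "new"})

# DELETE: remove the resource
requests.delete(f"{base}/documents/42")
```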
Can you describe a typical use case for object storage?
shopping carts for large online stores
Can you explain the difference between block storage and object storage?
Object -> flat, key-value pairs
- a HUGE number (billions/trillions) of large objects (up to 5 TB each)
Block -> hierarchies (a file system is built on top)
- a lot (millions) of HUGE files (>5 TB)
Can you explain the difference between the (logical) key-value model and a file system?
Key-value: flat, no hierarchies
File system: hierarchies (directories)
Do you know the order of magnitude of a block size for a local filesystem and for a distributed file system? Can you explain the rationale behind them with respect to latency and throughput?
Local: 4 kB
DFS block size: 64 or 128 MB
- large enough that time is not lost in latency waiting for a block to arrive (transfer time dominates seek time)
- small enough that a large file is conveniently spread over many machines (parallel access) to improve throughput, and small enough that a block can be re-sent cheaply if an error occurs
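Back-of-the-envelope numbers make the rationale concrete (assuming ~10 ms of seek latency and ~100 MB/s of sequential throughput, both hypothetical round figures):

```python
seek_ms = 10          # assumed latency before a block starts arriving
throughput = 100e6    # assumed 100 MB/s sequential read speed

def latency_share(block_bytes):
    """Fraction of total time spent on seek rather than transfer."""
    transfer_ms = block_bytes / throughput * 1000
    return seek_ms / (seek_ms + transfer_ms)

print(latency_share(4 * 1024))        # 4 kB local block: ~99.6% seek overhead
print(latency_share(128 * 1024**2))   # 128 MB DFS block: <1% seek overhead
```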
Can you contrast centralized architectures to decentralized (peer-to- peer) architectures?
Decentralized: all machines can talk to each other; all nodes play the same role.
Centralized: there is a central machine that all other machines talk to; one node is special (the queen bee), while the other nodes are interchangeable workers.
Can you explain the HDFS architecture, what a NameNode is and what a DataNode is, how blocks are replicated?
HDFS: a distributed file system: a hierarchy of files spread over multiple machines.
NameNode: the special node that all other nodes communicate with. It stores the namespace, the mapping from each file to its list of blocks, and the mapping from each block to the locations of its replicas.
DataNode: stores data blocks on its local disk and communicates with the NameNode through regular heartbeats (the DataNode always initiates contact). DataNodes communicate with each other through replication pipelines.
Blocks are replicated through replication pipelines between DataNodes: blocks are sent as smaller packets in a streaming fashion. The original node sends the first copy of the data to node #2, which then propagates it on. The original node does NOT send the data to all the nodes.
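A toy sketch of the pipeline idea in plain Python (not the real HDFS protocol): the client hands the block to the first DataNode only, and each node stores a packet and forwards it to the next:

```python
def replicate(packets, pipeline):
    """Stream packets down a chain of DataNodes, e.g. ["dn1", "dn2", "dn3"]."""
    for packet in packets:      # streaming: packet by packet, not whole block
        for node in pipeline:   # dn1 stores and forwards to dn2, dn2 to dn3...
            store(node, packet)

def store(node, packet):
    print(f"{node} stores {packet!r}")

block = [b"packet-1", b"packet-2"]  # a block split into small packets
replicate(block, ["dn1", "dn2", "dn3"])
```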
Can you sketch how the various components communicate with each other (client, NameNode, DataNode)?
NameNode: never initiates communication; it only responds to DataNode heartbeats and client requests.
DataNode: communicates with the NameNode through regular heartbeats (always initiating the contact); DataNodes communicate with each other through replication pipelines, forwarding block packets down the chain as described above.
Client: communicates with both the NameNode (for metadata: which blocks, on which DataNodes) and the DataNodes (for the actual data).
Can you point to the single points of failure of HDFS and explain how they can be addressed?
If the NameNode fails, the whole cluster is unusable: the NameNode is the single point of failure. Mitigation: keep an edit log of namespace changes and periodically merge it into a snapshot. In addition, a Standby NameNode keeps the exact same structures in its memory as the active NameNode and performs checkpoints periodically; it can be set up to instantly take over from the NameNode in case of a crash.
Can you explain how the NameNode stores the file system namespace, in memory and on disk? In particular, can you explain how the namespace file and the edit log work together at startup time and how they get modified once the system is up and running?
HDFS does not follow a key-value model: an HDFS cluster organizes its files as a hierarchy, called the file namespace. Files are thus organized in directories, similar to a local file system, and this namespace is kept in memory on the NameNode. The file namespace, containing the directory and file hierarchy as well as the mapping from files to block IDs, is backed up to a so-called snapshot. The snapshot and the edit log are stored either locally or on a network-attached drive (not on HDFS itself); for more resilience, they can also be copied to further backup locations. If the NameNode crashes, it can be restarted: the snapshot is loaded back into memory to restore the file namespace and the file-to-block-ID mapping, and the edit log is then replayed to apply the latest changes. While the system is up and running, every namespace change is first appended to the edit log, which is periodically merged into a new snapshot (a checkpoint).
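The startup sequence can be modeled in a few lines (a toy model, not HDFS's real on-disk format):

```python
# Last checkpoint: namespace mapping files to block IDs
snapshot = {"/a.txt": ["blk_1", "blk_2"]}

# Changes made since the checkpoint, recorded in order
edit_log = [
    ("create", "/b.txt", ["blk_3"]),
    ("delete", "/a.txt", None),
]

namespace = dict(snapshot)             # 1. load the snapshot into memory
for op, path, blocks in edit_log:      # 2. replay the edit log in order
    if op == "create":
        namespace[path] = blocks
    elif op == "delete":
        del namespace[path]

print(namespace)                       # {'/b.txt': ['blk_3']}
```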
Can you explain what a Standby NameNode is? (Note: it has many predecessors that only have historical relevance in the development of HDFS: Backup NameNode, Secondary NameNode, Checkpoint NameNode, etc, but this is not important for the course)
A Standby NameNode keeps the exact same structures in its memory as the active NameNode and performs checkpoints periodically. It can be set up so that the Standby NameNode instantly takes over from the NameNode in case of a crash.
Where does HDFS shine and why?
PBs of data; write once, read many; fault tolerance; high throughput
Do you know that HDFS files are updated by appending atomically and why?
Yes: appends are atomic, which simplifies consistency (no in-place updates to coordinate), aids batch processing, and simplifies logging.
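Against a running cluster this is visible in client libraries; for instance the third-party HdfsCLI package exposes append as a flag (the WebHDFS address and path below are placeholders):

```python
from hdfs import InsecureClient  # third-party package: pip install hdfs

client = InsecureClient("http://namenode:9870")  # placeholder WebHDFS URL

# Files are write-once; the only mutation offered is an atomic append.
client.write("/logs/app.log", data=b"event 1\n")               # create
client.write("/logs/app.log", data=b"event 2\n", append=True)  # atomic append
```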
Do you know how HDFS performs in terms of throughput and latency?
Optimized for high throughput, at the cost of latency (not designed for low-latency random access).
What are the main benefits of HDFS?
Handles massive files; streaming access; scalable (scales out well); fault tolerant; highly available
Describe the limitations of traditional (local) file systems?
Local file systems must fit on one machine, creating size constraints: a single disk cannot store big datasets.
Contrast or relate object storage with block storage, a file system, and the key-value model
Object storage:
- stores each object as data + metadata + a unique identifier
- good with unstructured data
- images, videos, backups, data lakes
Block storage: stores data in fixed-size blocks
- reads/writes blocks directly
- requires a system on top to manage files and directories
File system:
- organizes data into a hierarchical structure of directories
- built on top of block storage
- everyday, general-purpose storage
Key-value model:
- key: identifier; value: data
- access data by key; no schema or hierarchy
- caching, session data, quick lookups
An object store can be seen as a key-value store: object stores target large values,
while key-value stores target small values and fast access.
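The contrast in miniature, in plain Python: a key-value store is one flat mapping, while a file system is a tree that must be traversed:

```python
# Flat key-value model: one namespace, opaque keys
kv = {
    "reports/2024/summary.txt": b"...",  # the slash is part of the key,
    "cat.jpg": b"...",                   # not a directory separator
}
print(kv["reports/2024/summary.txt"])    # one lookup, no traversal

# File system: a hierarchy of directories
fs = {"reports": {"2024": {"summary.txt": b"..."}}}
print(fs["reports"]["2024"]["summary.txt"])
```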
What is a data centre made of (numbers)?
1,000s to 10,000s of machines
- most data centres cannot handle 100,000+ machines due to electricity and cooling constraints
- trend: fewer servers, more cores per server
- a server (also called a node) has 1-64 cores (and increasing)
- memory per node: 16 GB to 6 TB
- SSD/HDD: 1-20 TB
- bandwidth: 1-100 GB/s
- nodes are flat rectangular boxes stacked in a rack
- a cluster is a room full of racks
- a module/node in a rack can be a server, pure storage, or a network switch
- module height is standardized: 1-4 RU (rack units)
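These per-node figures multiply out quickly; a rough aggregate for a large cluster (the per-node values are assumptions picked from the ranges above):

```python
nodes = 10_000            # machines in the data centre
disk_tb_per_node = 20     # top of the 1-20 TB range
ram_gb_per_node = 256     # an assumed mid-range figure

print(f"total disk: {nodes * disk_tb_per_node / 1000:,.0f} PB")  # 200 PB
print(f"total RAM:  {nodes * ram_gb_per_node / 1000:,.0f} TB")   # 2,560 TB
```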
Do you know rough, typical numbers (per-node storage capacity, memory, number of cores, etc.)?
- 1,000s to 10,000s of machines
- 1-64 cores per node (and increasing)
- memory per node: 16 GB to 6 TB
- SSD/HDD: 1-20 TB
- bandwidth: 1-100 GB/s
- module height standardized: 1-4 RU (rack units)
What is object storage?
An alternative to RDBMSs for storing very large amounts of data. Data can be stored in blocks to improve latency, and the information is stored on a cluster of machines.
Benefits of object storage?
Typically latency improves and the system is more fault tolerant.
Many services provide an SLA (service level agreement): a commitment to prevent data loss and to ensure availability.
S3 makes no promises on latency.
Why is scaling out less expensive than scaling up?
Scaling up means buying a new, expensive computer; scaling out means adding cheap new computers to your existing cluster. Commodity hardware is far cheaper per unit of capacity than high-end machines.
Explain what aspects of the design of object storage enable scaling out and why.
There are no hierarchies, and data is split into blocks, so different blocks can be placed on different machines: the data is already broken into pieces. This also works well with heterogeneous machines.
Explain the three different ways, on the physical level, to deal with more data.
Scale up: improve the machines (more capacity per machine)
Scale out: buy more machines
Write better code: improve code efficiency
What is the difference between storage, memory, CPU and network, and how are they paramount in a cluster?
Storage: disks (HDDs/SSDs) can store up to a few terabytes each; non-volatile; hosts the local file system
Memory: volatile "working" memory; temporary
CPU: performs calculations/computations
Network: a disk can be made accessible through a network (LAN or WAN)