Storage: DFS Flashcards
Distributed File System
An abstraction that makes data stored across multiple machines appear as a single, unified storage system
Goals of DFS
- Split files across machines
- Hide complexity from the user
- Provide fault tolerance
Hadoop Distributed File System (4)
- Most widely used DFS
- Runs on clusters of commodity hardware
- Handles up to petabytes of data
- Designed for high-throughput batch processing (not low-latency access)
Blocks
Smallest unit of storage that can be read or written
- Default size: 64 MB (Hadoop 1) / 128 MB (Hadoop 2+)
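Not a real HDFS client, just the block idea in a minimal Python sketch (the file path is hypothetical; 128 MB matches the Hadoop 2+ default):

```python
# Sketch: how HDFS conceptually splits a file into fixed-size blocks.
# The path is illustrative; this is not the actual HDFS write path.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2+ default

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield (block_index, chunk) pairs of at most block_size bytes."""
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(block_size):
            yield index, chunk
            index += 1

# A 300 MB file yields blocks 0 and 1 of 128 MB each, plus block 2 of 44 MB.
```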
How does HDFS ensure data consistency?
Write-once, read-many model: a file is not modified in place once written
Hadoop default fault tolerance
Blocks are replicated at a factor of three so that data remains accessible if one machine fails (hardware failure = the NORM!)
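A toy sketch of the idea (hypothetical datanode names; real HDFS placement is rack-aware, round-robin here is only for illustration):

```python
# Toy sketch of replication factor 3: each block lands on 3 of N datanodes.
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical cluster
REPLICATION = 3

def place_block(block_index):
    """Return the datanodes that hold replicas of this block."""
    n = len(DATANODES)
    return [DATANODES[(block_index + r) % n] for r in range(REPLICATION)]

print(place_block(0))  # ['dn1', 'dn2', 'dn3'] -> one node can fail, two copies remain
```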
Namenode (NN) (2)
Master in master-slave architecture
1. Stores metadata about the location of each block
2. Controls client access to data
Datanodes (DN)
The slaves, because they store and serve the actual data blocks. They send periodic “heartbeats” to update the master
The big issue with Master-Slave architecture
Single Point of Failure (SPOF): there is only one namenode that maintains the filesystem tree
High Availability and how it’s achieved with Master-Slave Architecture
A system that can keep operating in spite of faults. Achieved by running two separate machines as NNs:
- 1 in Active state
- 1 in Standby state (takes over if the active NN fails)
Besides HA, another reason it’s beneficial to configure additional namenodes
The number of blocks in the system is limited by the RAM of the NN, since it stores metadata about blocks in memory
Erasure Coding
A way to store less redundant data: data is split into smaller data cells, grouped into “stripes”, with parity cells as backup pieces to help recover data. If you lose some data cells, you can still rebuild them thanks to the parity cells. (Data cells are, of course, split across nodes.)
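A tiny illustration of the parity idea with a single XOR parity cell. HDFS actually uses Reed-Solomon codes (e.g. RS(6,3)), which tolerate multiple losses; XOR is just the simplest special case:

```python
# Intuition for parity cells: with one XOR parity cell, any single lost
# data cell can be rebuilt from the survivors plus the parity cell.
from functools import reduce

def xor_cells(cells):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*cells))

data_cells = [b"cell-A", b"cell-B", b"cell-C"]  # striped across nodes
parity = xor_cells(data_cells)                  # stored on another node

# Suppose cell-B is lost: XOR of the survivors and the parity rebuilds it.
rebuilt = xor_cells([data_cells[0], data_cells[2], parity])
assert rebuilt == b"cell-B"
```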
Pros of erasure coding (2)
- Reduces data redundancy from 200% to 50% (e.g., RS(6,3): 3 parity cells per 6 data cells = 1.5x stored, vs. 3x for replication)
- Faster writes, since blocks are not replicated
Cons of Erasure Coding (3)
- Higher CPU cost for reads and writes
- Longer recovery time in case of failure
- Loss of data locality (too much splitting!!)
For which types of datasets does erasure coding work best?
Those with low I/O activity (cold data, not HOT or interactive)
HDFS cons in general (2)
- High latency
- Not good if there are many small files: the number of files is limited by memory in the NN (rough math in the sketch below)
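A back-of-the-envelope for the small-files problem. The ~150 bytes of NN heap per namespace object is a commonly cited rule of thumb, not an exact figure:

```python
# Rough estimate of NameNode memory pressure from small files.
BYTES_PER_OBJECT = 150               # ~150 B of heap per file/block (rule of thumb)
small_files = 100_000_000            # 100 million small files
objects = small_files * 2            # roughly 1 file entry + 1 block each
heap_gb = objects * BYTES_PER_OBJECT / 1e9
print(f"~{heap_gb:.0f} GB of NameNode heap")  # ~30 GB just for metadata
```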
Standard data formats (3)
- (Structured) CSV files
- (Semi-structured) JSON files
- (Unstructured) textual files
3 benefits of big-data-specific file formats (BCS)
- Binary serialization
- Compression
- Splittability
Row-oriented file format
Stores data in rows, so it’s easy to read or write a full record of fields.
Good application for Row oriented format
OLTP
Updating a customer’s profile, adding a new order (access to all fields of the record)
Column-oriented file format
Stores data by columns, which makes compression easier because values of the same data type are stored together (type-specific encodings).
Easier to access specific fields of the data for aggregation (reduced I/O for analytical queries)
Good application for column oriented format
OLAP
Average salary of employees by department (get all values of one field with one query!)
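A small sketch contrasting the two layouts (records and fields are made up): the columnar layout answers the salary aggregation by touching one list instead of every record.

```python
# The same records in row-oriented vs column-oriented layout.
rows = [
    {"id": 1, "dept": "eng",   "salary": 90_000},
    {"id": 2, "dept": "sales", "salary": 70_000},
    {"id": 3, "dept": "eng",   "salary": 95_000},
]

# Column-oriented: one sequence per field (same type together -> compresses well).
columns = {
    "id":     [1, 2, 3],
    "dept":   ["eng", "sales", "eng"],
    "salary": [90_000, 70_000, 95_000],
}

record = rows[1]  # OLTP-style access: every field of one record at once

# OLAP-style aggregation only needs one column -> read just that field:
avg_salary = sum(columns["salary"]) / len(columns["salary"])
```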
Row-oriented file format examples (3)
SequenceFiles (key-value pairs, designed for MapReduce)
Apache Thrift (efficient cross-language communication between programs)
Apache Avro (schema evolution)
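A sketch of what “schema evolution” means in Avro, using the fastavro library (assumed installed: pip install fastavro); the User record and its fields are made up for illustration:

```python
# Avro schema evolution: a field with a default can be added to the reader
# schema, and files written with the old schema still resolve correctly.
import io
from fastavro import writer, reader, parse_schema

schema_v1 = parse_schema({
    "type": "record", "name": "User", "fields": [
        {"name": "name", "type": "string"},
    ],
})
schema_v2 = parse_schema({
    "type": "record", "name": "User", "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int", "default": -1},  # new field, with default
    ],
})

buf = io.BytesIO()
writer(buf, schema_v1, [{"name": "ada"}])  # data written with the OLD schema
buf.seek(0)

# Old data read with the NEW schema: the missing field is filled from its default.
for user in reader(buf, reader_schema=schema_v2):
    print(user)  # {'name': 'ada', 'age': -1}
```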
Column-oriented file format examples (2)
ORC (first on Hadoop, designed for Hive - SQL on Hadoop)
Apache Parquet (schema evolution)
Parquet
Columnar storage format that stores nested data structures in a flat columnar layout, for efficient querying
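A sketch using pyarrow (assumed installed: pip install pyarrow); the table, file name, and nested address column are made up to show the two Parquet features on this card, nested data and column pruning:

```python
# Write a table with a nested column to Parquet, then read back one column.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "dept":    ["eng", "sales"],
    "salary":  [90_000, 70_000],
    "address": [{"city": "Oslo",   "zip": "0150"},   # nested structure,
                {"city": "Bergen", "zip": "5003"}],  # stored flat as columns
})
pq.write_table(table, "employees.parquet")

# Column pruning: only the 'salary' column is read from the file.
salaries = pq.read_table("employees.parquet", columns=["salary"])
print(salaries["salary"].to_pylist())  # [90000, 70000]
```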