Storage: DFS Flashcards

1
Q

Distributed File System

A

An abstraction that makes data stored across multiple machines appear as a unified storage system

2
Q

Goals of DFS

A
  1. Split files
  2. Hide complexity
  3. Provide fault tolerance
3
Q

Hadoop Distributed File System (4)

A
  • Most widely used DFS
  • Clusters of commodity hardware
  • Handles up to petabytes of data
  • Designed for high-throughput batch processing (not low-latency access)
4
Q

Blocks

A

The smallest unit of storage that can be read or written
- default size: 64-128 MB
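A quick back-of-the-envelope sketch of how a file maps onto blocks (the file size and the 128 MB block size here are assumed values for illustration):

```python
import math

# Assumed values: 128 MB is a common HDFS default block size.
BLOCK_SIZE_MB = 128
file_size_mb = 1000  # a hypothetical ~1 GB file

num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB
print(num_blocks)     # 8
print(last_block_mb)  # 104 (the last block only occupies its actual size)
```

Note that a block smaller than the block size does not waste the full block on disk; only the actual bytes are stored.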

5
Q

How does HDFS ensure data consistency?

A

Write-once, Read-many model

6
Q

Hadoop default Fault-tolerance

A

Blocks are replicated with a default factor of three so data remains accessible if one machine fails (hardware failure = the NORM!)

7
Q

Namenode (NN) (2)

A

The master in the master-slave architecture
1. Stores metadata about the location of each block
2. Controls client access to data

8
Q

Datanodes (DN)

A

The slaves in the master-slave architecture: they store and process the actual data, and send periodic “heartbeats” to update the master

9
Q

The big issue with Master-Slave architecture

A

Single Point of Failure (SPOF): there is only one namenode maintaining the filesystem tree

10
Q

High Availability and how it’s achieved with Master-Slave Architecture

A

A system that remains operational despite faults

Achieved with two separate machines as NNs:
- 1 in Active state
- 1 in Standby state

11
Q

Besides HA, another reason it’s beneficial to configure additional name nodes

A

The # of blocks in the system is limited by the RAM of the NN, since it stores metadata about blocks in memory
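A rough, hedged estimate of that limit: each block object is often quoted at on the order of 150 bytes of namenode heap (an assumed figure; the real cost varies by Hadoop version and configuration), and the heap size below is hypothetical:

```python
# Assumption: ~150 bytes of NameNode heap per block object (commonly cited
# rough estimate, not an exact figure).
BYTES_PER_BLOCK_OBJECT = 150
nn_heap_gb = 64  # hypothetical NameNode heap size

max_blocks = (nn_heap_gb * 1024**3) // BYTES_PER_BLOCK_OBJECT
print(f"~{max_blocks:,} block objects fit in {nn_heap_gb} GB of heap")
```

This is why many small files hurt: each file consumes at least one block object of metadata regardless of its size.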

12
Q

Erasure Coding

A

A way to store data with less redundancy: files are split into smaller data cells arranged in “stripes”,

with

parity cells as backup pieces that help recover data.

If you lose some data cells, you can still rebuild them thanks to the parity cells. (Data cells are, of course, split across nodes.)
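The parity idea can be sketched with XOR, the simplest possible parity scheme (HDFS actually uses Reed-Solomon codes such as RS(6,3); this toy example only illustrates the recovery principle):

```python
# XOR parity sketch: the simplest form of the idea behind erasure coding.
def xor_cells(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length cells byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

cell1 = b"\x01\x02\x03"
cell2 = b"\x10\x20\x30"
parity = xor_cells(cell1, cell2)  # the parity cell, stored on a third node

# If the node holding cell2 fails, rebuild it from cell1 and the parity cell:
recovered = xor_cells(cell1, parity)
print(recovered == cell2)  # True
```

Reed-Solomon generalizes this: with k parity cells, any k lost cells (data or parity) can be rebuilt.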

13
Q

Pros of erasure coding (2)

A
  1. Reduce data redundancy from 200% to 50%
  2. Faster writes without replicating
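The 200% vs 50% figures follow from the layouts themselves; here is the arithmetic, assuming an RS(6,3)-style scheme (6 data cells + 3 parity cells) against 3x replication:

```python
# Storage overhead: 3x replication vs a 6-data + 3-parity erasure coding layout.
data_units = 6

replicated_total = data_units * 3                        # three full copies
replication_overhead = (replicated_total - data_units) / data_units

ec_total = data_units + 3                                # data + parity
ec_overhead = (ec_total - data_units) / data_units

print(f"replication: {replication_overhead:.0%}")        # 200%
print(f"erasure coding: {ec_overhead:.0%}")              # 50%
```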
14
Q

Cons of Erasure Coding (3)

A
  1. Higher CPU cost for reads and writes
  2. Longer recovery time in case of failure
  3. Loss of data locality (too much splitting!!)
15
Q

For which types of datasets does erasure coding work best?

A

Those with low I/O activity (not HOT or interactive)

16
Q

HDFS cons in general (2)

A
  1. High latency
  2. Not good with many small files (limited by memory in the NN)
17
Q

Standard data formats (3)

A
  1. (Structured) CSV
  2. (Semi-structured) JSON files
  3. (Unstructured) textual files
18
Q

3 benefits of big data specific file formats (BCS)

A
  1. Binary serialization
  2. Compression
  3. Splittability
19
Q

Row-oriented file format

A

Store data in rows, so it is easy to read or write a full record of fields.

20
Q

Good application for Row oriented format

A

OLTP

Updating a customer’s profile, adding a new order (access to all fields of the record)

21
Q

Column-oriented file format

A

Store data by columns, which makes compression easier because values of the same type are stored together (type-specific encodings)

Easier to access specific fields of the data for aggregation (reduced I/O for analytical queries)

22
Q

Good application for column oriented format

A

OLAP

Average salary of employees by department (get all instances for one field with one query!)
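The row vs column trade-off from the last few cards can be sketched with a toy in-memory example (the data and names are invented for illustration):

```python
# Toy sketch of row vs column layouts and why columnar access suits OLAP.
rows = [
    {"name": "Ada",   "dept": "ENG", "salary": 100},
    {"name": "Grace", "dept": "ENG", "salary": 120},
    {"name": "Alan",  "dept": "OPS", "salary": 90},
]

# Row-oriented: one lookup returns a whole record (OLTP-style access).
full_record = rows[1]

# Column-oriented: the same data laid out per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# OLAP-style query: the average salary touches only the "salary" column,
# instead of scanning every field of every record.
avg_salary = sum(columns["salary"]) / len(columns["salary"])
print(round(avg_salary, 2))  # 103.33
```

In a real columnar file the "salary" column would also compress well, since all its values share one type.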

23
Q

Row-oriented file formats examples (3)

A

SequenceFiles (key-value pairs, designed for MapReduce)
Apache Thrift (good communication between programs)
Apache Avro (schema evolution)

24
Q

Column-oriented file formats examples (2)

A

ORC (first on Hadoop, designed for Hive - SQL on Hadoop)
Apache Parquet (schema evolution)

25
Q

Parquet

A

Columnar storage format for efficient querying of nested structures in a flat format