Storage: DFS Flashcards

1
Q

Distributed File System

A

An abstraction that makes data stored across multiple machines appear as a unified storage system

2
Q

Goals of DFS

A
  1. Hide complexity (abstraction)
  2. Split files across machines (big data support)
  3. Fault tolerance
3
Q

Hadoop Distributed File System

A

One of the most widely used DFSs; it runs on clusters of commodity hardware, handles up to petabytes of data, and is designed for high-throughput batch processing rather than low-latency access

4
Q

Blocks

A

The smallest units of storage that can be read or written; the HDFS default block size is 128 MB (64 MB in older versions)
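A minimal sketch (hypothetical helper, not HDFS code) of how a file maps onto fixed-size blocks — note the last block only occupies its actual size:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default in recent versions

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    if file_size == 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

# A 300 MB file needs two full 128 MB blocks plus one 44 MB block.
sizes = split_into_blocks(300 * 1024 * 1024)
print(len(sizes))                   # 3
print(sizes[-1] // (1024 * 1024))   # 44
```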

5
Q

How does HDFS ensure data consistency?

A

write-once, read-many model

6
Q

Fault tolerance

A

Hardware failure is the norm, so blocks are replicated with a default factor of three, ensuring data remains accessible if a machine fails
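The idea can be sketched as follows (hypothetical helper names, not the real HDFS placement policy, which is additionally rack-aware):

```python
import random

REPLICATION_FACTOR = 3  # HDFS default

def place_replicas(datanodes: list[str], factor: int = REPLICATION_FACTOR) -> set[str]:
    """Pick `factor` distinct datanodes to hold copies of a block."""
    return set(random.sample(datanodes, factor))

def is_readable(replica_nodes: set[str], failed_nodes: set[str]) -> bool:
    """The block survives as long as at least one replica is on a live node."""
    return bool(replica_nodes - failed_nodes)

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
replicas = place_replicas(nodes)
# Any single node failure still leaves at least two live copies.
assert all(is_readable(replicas, {failed}) for failed in nodes)
```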

7
Q

Namenode (NN)

A

The master node in the master-slave architecture; it stores metadata about the location of each block and controls client access to files

8
Q

Datanodes (DN)

A

The slave nodes: they store and serve the actual data blocks, and send periodic “heartbeats” so the master knows they are alive
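A toy sketch of heartbeat tracking (the 30 s cutoff here is illustrative only; real HDFS datanodes report every ~3 s but are not declared dead until about 10 minutes of silence):

```python
HEARTBEAT_INTERVAL = 3.0                # seconds between datanode reports
STALE_AFTER = 10 * HEARTBEAT_INTERVAL   # hypothetical cutoff for this sketch

class NamenodeView:
    """Tracks the last heartbeat seen from each datanode."""
    def __init__(self) -> None:
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, datanode: str, now: float) -> None:
        self.last_seen[datanode] = now

    def dead_nodes(self, now: float) -> set[str]:
        """Datanodes that have been silent longer than the cutoff."""
        return {dn for dn, t in self.last_seen.items() if now - t > STALE_AFTER}

nn = NamenodeView()
nn.heartbeat("dn1", now=0.0)
nn.heartbeat("dn2", now=0.0)
nn.heartbeat("dn1", now=29.0)   # dn1 keeps reporting; dn2 goes silent
print(nn.dead_nodes(now=31.0))  # {'dn2'}
```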

9
Q

What is the catch with master-slave architecture?

A

Single Point of Failure (SPOF): there is only one namenode, which maintains the entire filesystem tree

10
Q

High Availability

A

a system that can tolerate faults, achieved by configuring two separate machines as NNs (one in active state, one in standby state) with shared storage

11
Q

another reason it's beneficial to configure additional namenodes

A

the number of blocks is limited by the namenode's memory, since the NN keeps all file and block metadata in RAM
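A back-of-the-envelope estimate, using the commonly cited rule of thumb of roughly 150 bytes of NN heap per metadata object (file, block, or directory); treat the figure as an approximation:

```python
BYTES_PER_OBJECT = 150  # rough rule of thumb, not an exact HDFS constant

def max_objects(nn_heap_gb: int) -> int:
    """Approximate metadata objects a namenode heap of this size can hold."""
    return (nn_heap_gb * 1024**3) // BYTES_PER_OBJECT

# A namenode with 64 GB of heap tops out around ~458 million metadata objects,
# which is why millions of tiny files exhaust the NN long before disks fill up.
print(f"{max_objects(64):,}")
```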

12
Q

Erasure Coding

A

A way to store data with less redundancy: data is split into smaller “data cells” (stripes), stored along with “parity cells” that serve as backup pieces. If some data cells are lost, they can still be rebuilt from the surviving cells plus the parity cells. The data cells are, of course, split across nodes.
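The recovery idea can be shown with the simplest possible scheme, a single XOR parity cell (HDFS itself uses Reed-Solomon, e.g. RS-6-3, which tolerates multiple simultaneous losses):

```python
from functools import reduce

def xor_parity(cells: list[bytes]) -> bytes:
    """Parity cell: byte-wise XOR of all data cells (single-parity scheme)."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), cells))

def recover(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one lost data cell from the survivors plus the parity cell."""
    return xor_parity(surviving + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)
# Lose the middle cell; the parity cell lets us rebuild it.
print(recover([data[0], data[2]], parity))  # b'BBBB'
```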

13
Q

pros of erasure coding (2)

A

reduces storage overhead from 200% (3x replication) to 50% (e.g. RS-6-3)
faster writes (stripes are distributed and smaller than full replicas)
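The arithmetic behind those two percentages, as a quick sketch:

```python
def storage_overhead(data_units: int, redundant_units: int) -> float:
    """Extra storage as a fraction of the raw data size."""
    return redundant_units / data_units

# 3x replication: 1 original copy + 2 extra copies -> 200% overhead.
print(storage_overhead(1, 2))   # 2.0
# Reed-Solomon RS-6-3: 6 data cells + 3 parity cells -> 50% overhead.
print(storage_overhead(6, 3))   # 0.5
```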

14
Q

cons of erasure coding (3)

A

higher CPU cost for reads/writes
longer recovery time in case of failure
loss of data locality (too much splitting!!)

15
Q

where does erasure coding work best (type of dataset)

A

datasets with low I/O activity (cold data, not hot or interactive)

16
Q

HDFS cons in general (2)

A

high latency
not suitable for many small files, because the number of files is limited by the NN's memory

17
Q

data usually comes in standard file formats such as:

A

(structured) CSV, (semi-structured) JSON, (unstructured) plain text files.

18
Q

3 benefits of big data specific file formats (BCS)

A

binary serialization
compression
splittability

19
Q

Row-oriented file format

A

store data in rows, making it easy to read or write a full record of fields (SequenceFiles, Apache Thrift, Apache Avro)

20
Q

example of row orientation application

A

OLTP: updating a customer's profile, adding a new order (needs access to all fields of the record)

21
Q

Column-oriented file format

A

store data by columns: compression is easier because values of the same type are stored together (type-specific encodings), and specific fields can be read directly for aggregation (reduced I/O for analytical queries)
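A toy illustration of the two layouts (plain Python, no real file format involved):

```python
records = [
    {"id": 1, "dept": "eng",   "salary": 100},
    {"id": 2, "dept": "sales", "salary": 80},
    {"id": 3, "dept": "eng",   "salary": 120},
]

# Row-oriented: each record's fields stored together (fast full-record access).
row_layout = [tuple(r.values()) for r in records]

# Column-oriented: each field's values stored together (fast per-column scans,
# and runs of same-typed values compress well).
col_layout = {key: [r[key] for r in records] for key in records[0]}

# An aggregation touches only the one column it needs:
avg_salary = sum(col_layout["salary"]) / len(col_layout["salary"])
print(avg_salary)  # 100.0
```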

22
Q

example of column orientation application

A

OLAP: average salary of employees by department (reads all values of one field in a single scan!)

23
Q

Row-oriented file formats examples

A

SequenceFiles (key-value pairs, designed for MapReduce)
Apache Thrift (cross-language communication between programs)
Apache Avro (schema evolution)

24
Q

Column-oriented file formats examples

A

ORC (first on Hadoop, designed for Hive)
Apache Parquet (schema evolution)

25
Q

Parquet

A

columnar storage format for efficient querying of nested structures in a flat format

26
Q

when did Twitter and Cloudera release Parquet, their joint effort

A

2013

27
Q

how does parquet store nested data in columnar format

A

by mapping the nested schema to a flat list of columns (record shredding, with repetition and definition levels stored so the nesting can be reconstructed)
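A simplified sketch of the shredding step for non-repeated fields — real Parquet additionally records repetition and definition levels per value, which this toy flattener omits:

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Map a nested record onto dotted column paths (levels omitted for brevity)."""
    cols = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            cols.update(flatten(value, prefix=path + "."))
        else:
            cols[path] = value
    return cols

# Illustrative nested record: each leaf field becomes its own column.
record = {"name": "doc1", "links": {"forward": 20, "backward": 10}}
print(flatten(record))
# {'name': 'doc1', 'links.forward': 20, 'links.backward': 10}
```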