General Flashcards

Question 1

Q

What is Avro?

Answer

A

Avro is a row-oriented data serialization and storage format mainly used within the Hadoop framework.
The data definition (schema) is stored in JSON, while the data itself is stored in binary, making it compact and efficient. It supports schema evolution, meaning it can handle schemas that change over time.
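
To make the JSON schema and schema evolution concrete, here is a minimal sketch using only the Python standard library. The `User` schema, its fields, and the `resolve_record` helper are hypothetical and written for illustration; the helper is a hand-rolled stand-in for what an Avro library does during schema resolution, not the Avro API itself.

```python
import json

# Hypothetical Avro record schema, written as JSON (version 1).
SCHEMA_V1 = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
""")

# Version 2 adds a field with a default so that older records stay readable.
SCHEMA_V2 = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")


def resolve_record(record, reader_schema):
    """Illustrative stand-in for schema resolution (not the Avro library).

    Keeps the fields the reader schema knows about and fills missing
    fields from their declared defaults.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"missing field without default: {name}")
    return resolved


old_record = {"id": 1, "name": "Ada"}  # written with SCHEMA_V1
print(resolve_record(old_record, SCHEMA_V2))
# {'id': 1, 'name': 'Ada', 'email': None}
```

The new field is nullable with a default, which is what lets a reader on the newer schema accept records written with the older one.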

Question 2

Q

What is Parquet?

Answer

A

Parquet stores data in a columnar format designed for efficient data storage and retrieval. It is especially good for queries that read only particular columns. It also supports complex data types.
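
As a rough illustration of why a columnar layout helps queries that read only particular columns, the sketch below pivots a few made-up rows into one list per column in plain Python. It does not read or write real Parquet files; the sample rows and column names are assumptions.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 25.5},
    {"id": 3, "country": "US", "amount": 7.25},
]

# Column-oriented layout: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query such as SUM(amount) touches only the "amount" column
# and never has to look at "id" or "country".
total = sum(columns["amount"])
print(total)  # 42.75
```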

Question 3

Q

What is ORC?

Answer

A

ORC is a columnar file format optimized for reading, writing, and processing data in Hive. It was in fact created as part of an initiative to speed up Hive.

Question 4

Q

What should you consider before using Avro, Parquet or ORC?

Answer

A

Read/Write Operations: Row-based data formats are generally better for write-intensive data because appending new records is easier. If only a small subset of columns will be queried frequently, columnar formats are the better choice.

Compression: Columnar formats are better because storing values of the same type together allows more efficient compression. ORC often achieves the best compression ratio of the three, thanks to its stripes.

Schema Evolution: Adding, dropping, and renaming columns happens frequently in Big Data. When this matters most, Avro is generally the best option.

Nested Columns: If you have a lot of complex nested columns and only need to query a subset of the subcolumns, Parquet would be a better option.

Platform: ORC is usually used with Hive, Parquet with Spark, and Avro with Kafka.

Question 5

Q

What is serialization and deserialization?

Answer

A

Serialization is the process of converting a data object into a sequence of bytes so that it can be saved or transmitted more easily.
Deserialization is the process of constructing a data structure or object from a sequence of bytes.
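
A minimal round trip, sketched with Python's standard `json` module (the sample `record` is made up; any serialization format would follow the same two steps):

```python
import json

record = {"id": 7, "name": "Ada", "tags": ["a", "b"]}

# Serialization: object -> sequence of bytes.
payload = json.dumps(record).encode("utf-8")

# Deserialization: sequence of bytes -> object.
restored = json.loads(payload.decode("utf-8"))

print(type(payload).__name__, restored == record)  # bytes True
```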

Question 6

Q

What are the types of data?

Answer

A

Structured Data - Data with a defined model and structure, like a table in a database
Semi-Structured Data - Data with an apparent pattern but no fixed schema, like XML and JSON
Unstructured Data - Data with no inherent structure, usually stored as binary files, like images

Question 7

Q

What is Schema on Read and Schema on Write?

Answer

A

Schema on Write means that a schema is imposed when the data is loaded. For example, when you write to a database, the data is checked against the schema and rejected if it does not conform.

Fast reads
Slower loads
Structured
SQL

Schema on Read means that the data is written as it is, without any changes or transformations, and a schema is imposed only when the data is read.

Slower reads
Fast loads
Structured / Unstructured
NoSQL
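
The contrast between the two approaches can be sketched in plain Python. Everything here is hypothetical (the `SCHEMA` mapping, the helper names, the sample records); it is an illustration of where the check happens, not how any particular database implements it.

```python
import json

SCHEMA = {"id": int, "name": str}  # hypothetical schema: field name -> type


def conforms(record):
    """Return True if the record has exactly the schema's fields and types."""
    return set(record) == set(SCHEMA) and all(
        isinstance(record[key], expected) for key, expected in SCHEMA.items()
    )


def write_with_schema(table, record):
    """Schema on write: validate before storing and reject bad records."""
    if not conforms(record):
        raise ValueError(f"record rejected: {record}")
    table.append(record)


def write_raw(store, line):
    """Schema on read, write side: store the raw text untouched."""
    store.append(line)


def read_applying_schema(store):
    """Schema on read, read side: parse and check each record when reading."""
    for line in store:
        record = json.loads(line)
        if conforms(record):
            yield record


table, raw_store = [], []

write_with_schema(table, {"id": 1, "name": "Ada"})
try:
    write_with_schema(table, {"id": "oops", "name": "Bob"})
except ValueError as err:
    print(err)  # the bad record never reaches the table

write_raw(raw_store, '{"id": 1, "name": "Ada"}')
write_raw(raw_store, '{"id": "oops", "name": "Bob"}')  # accepted at load time
print(list(read_applying_schema(raw_store)))  # filtered only when read
```

The write path does the work in the first case (slower loads, fast reads), and the read path does it in the second (fast loads, slower reads).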

Question 8

Q

What are some of the compression formats?

Answer

A

GZIP: Uses more CPU resources than Snappy or LZO but provides a higher compression ratio. It is often a good choice for cold data, which is accessed infrequently.
BZIP2: Can produce more compression than GZIP for some types of files, at the cost of some speed when compressing and decompressing.
LZO: Focuses on decompression speed at low CPU usage, and offers higher compression at the cost of more CPU. Works better for hot data, which is accessed frequently.
SNAPPY: Aims for very high speeds and reasonable compression. Often performs better than LZO. Works better for hot data, which is accessed frequently.
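
GZIP and BZIP2 can be tried directly from the Python standard library; Snappy and LZO need third-party packages and are left out of this sketch. The sample data below is made up, and the actual sizes and timings depend heavily on the input, so the script prints the sizes rather than assuming which codec wins.

```python
import bz2
import gzip

# Hypothetical, highly repetitive log-like sample data.
data = b"timestamp,level,message\n" + b"2024-01-01T00:00:00,INFO,request served\n" * 5000

gz = gzip.compress(data)
bz = bz2.compress(data)

# Compare original and compressed sizes in bytes.
print(len(data), len(gz), len(bz))

# Both codecs are lossless: decompressing returns the original bytes.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
```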

Question 9

Q

What features should a Big Data solution include?

Answer

A

Scalable
Fault Tolerant
High Availability
Data is widely accessible and secure
Supports analytics, data science and content applications
Supports workflow automation
Self-healing
Integrates with legacy applications

Question 10

Q

What’s the difference between High Availability and Fault Tolerance?

Answer

A

A fault-tolerant environment has no service interruption but comes at a significantly higher cost, while a highly available environment has minimal service interruption.

(10 cards)