General Flashcards

1
Q

What is Avro?

A

Avro is a row-oriented storage format mainly used within the Hadoop framework.
It stores the data definition (schema) in JSON, while the data itself is stored in binary, making it compact and efficient. It can handle schema evolution, that is, schemas that change over time.
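
As a rough illustration of the JSON schema definition and of schema evolution, the sketch below writes two versions of an Avro-style record schema and fills in a default for a field that was added later. The `User` record, its field names, and the `resolve` helper are all hypothetical; `resolve` only mimics the default-filling part of schema resolution and is not the Avro library.

```python
import json

# Hypothetical writer schema (version 1), in Avro's JSON schema syntax.
SCHEMA_V1 = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
""")

# Hypothetical reader schema (version 2): adds a field with a default.
SCHEMA_V2 = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")


def resolve(record, reader_schema):
    """Project a decoded record onto the reader schema, filling defaults.

    Simplified stand-in for schema resolution: fields missing from the
    record take the reader's default; fields the reader lacks are dropped.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


old_record = {"id": 1, "name": "Ada"}  # written with SCHEMA_V1
new_view = resolve(old_record, SCHEMA_V2)
print(new_view)  # {'id': 1, 'name': 'Ada', 'email': None}
```

Because the new field carries a default, records written with the old schema can still be read with the new one.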

2
Q

What is Parquet?

A

Parquet is a columnar storage format designed for efficient data storage and retrieval. It is especially good for queries that read only particular columns, and it supports complex data types.
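
To see why a columnar layout suits queries that read only particular columns, here is a minimal pure-Python sketch. It does not use Parquet itself; the records, field names, and helper functions are made up for illustration.

```python
# Three hypothetical records, stored two ways.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 12.5},
    {"id": 3, "country": "DE", "amount": 7.5},
]

# Row-oriented: one entry per record, all fields kept together.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented: one list per column, values of a column kept together.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def total_amount_rows(store):
    # Every full record is visited even though only one field is needed.
    return sum(record[2] for record in store)


def total_amount_columns(store):
    # Only the "amount" column is touched; other columns are never read.
    return sum(store["amount"])


print(total_amount_rows(row_store), total_amount_columns(column_store))
# 30.0 30.0
```

Both functions return the same total, but the columnar version never touches the `id` or `country` values, which is the access pattern a columnar file format is designed to exploit on disk.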

3
Q

What is ORC?

A

ORC is a columnar file format optimized for reading, writing, and processing data in Hive. It was created as part of an initiative to speed up Hive.

4
Q

Things to consider before using Avro, Parquet, or ORC

A

Read/Write Operations: Row-based data formats are generally better for storing write-intensive data because appending new records is easier. If only a small subset of columns will be queried frequently, columnar formats will be better.

Compression: Columnar formats are better because storing values of the same type together allows more efficient compression. ORC typically achieves the best compression ratio of the three, helped by its stripe-based layout.

Schema Evolution: Adding or dropping columns and renaming columns happens frequently in Big Data. In this case, Avro is the best option.

Nested Columns: If you have a lot of complex nested columns and only need to query a subset of the subcolumns, Parquet would be a better option.

Platform: ORC is usually used with Hive, Parquet with Spark, and Avro with Kafka.
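
The compression point above can be tried out with a small experiment. This is only a sketch: it serializes a made-up table as JSON in a row layout and in a column layout and compresses both with `zlib` from the standard library, rather than using any of the three formats. The sizes depend on the data and the codec, so no expected output is shown.

```python
import json
import zlib

# Hypothetical table: 1,000 records with a few repeated values per column.
rows = [
    {"id": i, "country": ["DE", "FR", "US"][i % 3], "status": "active"}
    for i in range(1000)
]

# Row layout: values of different types are interleaved record by record.
row_bytes = json.dumps([list(r.values()) for r in rows]).encode("utf-8")

# Column layout: all values of one column sit next to each other.
columns = {name: [r[name] for r in rows] for name in rows[0]}
col_bytes = json.dumps(columns).encode("utf-8")

for label, raw in (("row", row_bytes), ("column", col_bytes)):
    packed = zlib.compress(raw)
    print(f"{label:>6}: {len(raw)} bytes raw, {len(packed)} bytes compressed")
```

Running it on your own data is the only reliable way to see how much the layout matters for a given workload.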

5
Q

What is serialization and deserialization?

A

Serialization is the process of converting a data object into a sequence of bytes to more easily save or transmit it.
Deserialization is the process of constructing a data structure or object from a sequence of bytes.
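
A minimal round trip, using JSON from the Python standard library as the byte format. The `order` object is a made-up example; binary formats such as Avro follow the same object-to-bytes-to-object pattern.

```python
import json

# A hypothetical in-memory object.
order = {"order_id": 42, "items": ["book", "pen"], "paid": True}

# Serialization: object -> sequence of bytes, ready to save or transmit.
payload = json.dumps(order).encode("utf-8")

# Deserialization: sequence of bytes -> equivalent object.
restored = json.loads(payload.decode("utf-8"))

print(type(payload).__name__, restored == order)  # bytes True
```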

6
Q

What are the types of data?

A

Structured Data - Data with a defined model and structure, like a table in a database
Semi-Structured Data - Data with an apparent pattern but no fixed schema, like XML and JSON
Unstructured Data - Data with no inherent structure, usually stored as binary files, like images

7
Q

What is Schema on Read and Schema on Write?

A

Schema on Write means that a schema is imposed when the data is loaded. For example, when you write to a database, the data is checked against the schema and rejected if it does not conform.

  • Fast reads
  • Slower loads
  • Structured
  • SQL

Schema on Read means that the data is written as it is, without any changes or transformations, and a schema is imposed only when the data is read.

  • Slower reads
  • Fast loads
  • Structured / Unstructured
  • NoSQL
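
The two approaches can be contrasted with a small sketch. The `SCHEMA` mapping, the `conforms` helper, and both store classes are hypothetical and stand in for a real database or data lake.

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"id": int, "name": str}


def conforms(record):
    return (
        isinstance(record, dict)
        and set(record) == set(SCHEMA)
        and all(isinstance(record[k], t) for k, t in SCHEMA.items())
    )


class SchemaOnWriteStore:
    """Validates on load: slower writes, but every stored row is clean."""

    def __init__(self):
        self.rows = []

    def write(self, record):
        if not conforms(record):
            raise ValueError(f"rejected, does not match schema: {record}")
        self.rows.append(record)

    def read(self):
        return list(self.rows)


class SchemaOnReadStore:
    """Stores raw lines as-is: fast writes, schema applied when reading."""

    def __init__(self):
        self.raw_lines = []

    def write(self, line):
        self.raw_lines.append(line)  # no checks, no transformation

    def read(self):
        parsed = (json.loads(line) for line in self.raw_lines)
        return [r for r in parsed if conforms(r)]  # skip non-conforming


on_write = SchemaOnWriteStore()
on_write.write({"id": 1, "name": "Ada"})
try:
    on_write.write({"id": "two", "name": "Bob"})  # wrong type for "id"
except ValueError as err:
    print(err)

on_read = SchemaOnReadStore()
on_read.write('{"id": 1, "name": "Ada"}')
on_read.write('{"id": "two", "name": "Bob"}')  # accepted without checks
print(on_read.read())  # [{'id': 1, 'name': 'Ada'}]
```

The schema-on-read store accepts both lines at write time and pays the parsing and validation cost on every read, which matches the fast loads and slower reads listed above.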
8
Q

What are some of the compression formats?

A

GZIP: Uses more CPU resources than Snappy or LZO but provides a higher compression ratio. It is often a good choice for cold data, which is accessed infrequently.
BZIP2: Produces more compression than GZIP for some types of files, at the cost of some speed when compressing and decompressing.
LZO: Focuses on decompression speed at low CPU usage; higher compression is available at the cost of more CPU. Works better for hot data, which is accessed frequently.
SNAPPY: Aims for very high speed and reasonable compression. Often performs better than LZO. Works better for hot data, which is accessed frequently.
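
GZIP and BZIP2 ship with the Python standard library, so their trade-off is easy to try on your own data. The sample log line below is invented and highly repetitive, so it compresses unusually well. Snappy and LZO are left out because they need third-party packages.

```python
import bz2
import gzip

# Hypothetical, highly repetitive sample data (compresses well).
data = b"timestamp=2024-01-01 level=INFO msg=heartbeat\n" * 10_000

codecs = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

for name, (compress, decompress) in codecs.items():
    packed = compress(data)
    assert decompress(packed) == data  # lossless round trip
    print(f"{name}: {len(data)} -> {len(packed)} bytes")
```

Wrapping each call with `time.perf_counter()` would add the speed side of the comparison.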

9
Q

What features should a Big Data solution include?

A
  • Scalable
  • Fault tolerant
  • Highly available
  • Data is widely accessible and secure
  • Supports analytics, data science, and content applications
  • Supports workflow automation
  • Self-healing
  • Integrates with legacy applications
10
Q

What’s the difference between High Availability and Fault Tolerance?

A

A fault-tolerant environment has no service interruption but comes at a significantly higher cost, while a highly available environment has minimal service interruption.
