Bigdata Lineage Flashcards

1
Q

What is Hadoop?

A
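Open-source framework for distributed storage and processing of large datasets. Core modules:
  • HDFS (distributed file system)
  • YARN (resource management and job scheduling)
  • MapReduce (batch processing model)
  • Hadoop Common (shared libraries and utilities)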
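
To make the MapReduce module concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases, using word count as the example. It only illustrates the programming model, not Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit one (word, 1) pair per word in an input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the map and reduce calls would run in parallel on different nodes, with the shuffle moving data between them over the network.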
2
Q

What is HBase?

A

Database on top of Hadoop (HDFS). Provides random read/write access. Stores very large tables (billions of rows, millions of columns) on commodity hardware. Offers Bigtable-like capabilities.

3
Q

Hadoop history

A

2006. Grew out of the Nutch project; much of the early development happened at Yahoo.

4
Q

HBase history

A

2007
Hadoop subproject

5
Q

What is Dynamo?

A

Dynamo is an internal, highly available key-value store built at Amazon to handle its scale. It is best known through the paper Amazon published about it in 2007.

6
Q

Where does the term NoSQL come from?

A
The Dynamo paper helped launch the "NoSQL" movement.
7
Q

Cassandra history

A

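Open-sourced in 2008. Origin: Facebook, where two engineers did much of the development; one of them, Avinash Lakshman, had co-authored the Dynamo paper at Amazon. DataStax later built a commercial business around it.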

8
Q

What is Cassandra?

A

Wide-column (column-oriented), scalable distributed database

9
Q

What is a column-oriented database?

A

Stores data by column rather than by row. Excels at handling time series and aggregations ("group by").
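
A small sketch of why a columnar layout suits "group by" queries: the aggregation only has to scan the two columns it needs instead of every full record. This is an in-memory illustration of the general idea, not any specific database's storage format; the sample data and the helpers `to_columns` and `mean_by` are hypothetical.

```python
from collections import defaultdict

# Row-oriented layout: all fields of a record are stored together.
rows = [
    {"ts": 1, "sensor": "a", "temp": 20.0},
    {"ts": 2, "sensor": "b", "temp": 22.0},
    {"ts": 3, "sensor": "a", "temp": 24.0},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = defaultdict(list)
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return dict(columns)


def mean_by(columns, key, value):
    """Average `value` grouped by `key`, reading only those two columns."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, v in zip(columns[key], columns[value]):
        sums[k] += v
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Column-oriented layout: all values of one field are stored together.
columns = to_columns(rows)
```

Calling `mean_by(columns, "sensor", "temp")` never touches the `ts` column, which is the access pattern that columnar storage makes cheap.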

10
Q

What is Apache Pig?

A

High-level scripting language to generate MapReduce jobs

11
Q

Apache Pig history

A
Yahoo. Effectively dead since 2017 (latest release: 0.17).
12
Q

Why is Pig dead?

A

Preference for SQL:
- Hive
- Spark SQL
Performance and ecosystem: Spark

13
Q

Apache Hive history

A
Facebook.
Tends to be replaced by solutions designed for cloud storage, such as Iceberg, Hudi and Delta Lake.
14
Q

What is Apache Hive?

A

Querying tool with a SQL-like interface (HiveQL). Translates queries into MapReduce jobs.

15
Q

Relation between Pig and Hive

A

Both provide a high-level way to create MapReduce jobs. Pig uses its own language (Pig Latin), while Hive uses a SQL-like language (HiveQL).

16
Q

Lakehouse architecture solutions

A

Hive, Iceberg, Hudi, Delta Lake

17
Q

Next generation solutions compared with Hive

A

Iceberg, Hudi and Delta Lake
They offer ACID transactions, schema evolution and data versioning. Designed for cloud storage
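
As a rough illustration of data versioning, the toy below keeps every commit as an immutable snapshot, so readers see either the old or the new table state and can read older versions ("time travel"). It is an in-memory sketch under simplifying assumptions, not how Iceberg, Hudi or Delta Lake are implemented; the class `VersionedTable` and its methods are hypothetical.

```python
class VersionedTable:
    """Toy table where every commit creates a new immutable snapshot."""

    def __init__(self):
        # Snapshot 0 is the empty table.
        self._snapshots = [()]

    def commit(self, new_rows):
        """Publish a new snapshot in one step and return its version id."""
        current = self._snapshots[-1]
        self._snapshots.append(current + tuple(new_rows))
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Read the latest snapshot, or an older one by version id."""
        index = -1 if version is None else version
        return list(self._snapshots[index])
```

For example, after two commits, `read(1)` still returns the table as it was after the first commit, while `read()` returns the latest state.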

18
Q

What came after Hadoop?

A

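What slowed down Hadoop:

- Cloud storage alternatives to HDFS
- New processing frameworks such as Spark
- Containers and orchestration (Docker, Kubernetes)
- Apache Ozone (formerly Hadoop Ozone), an object store alternative to HDFS

But Hadoop is still widely used.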

19
Q

What is Apache Impala?

A

SQL engine for data stored in HDFS
Similar to Spark SQL, but earlier (2013, Cloudera)
Processes data in memory, in contrast to Hive
Largely replaced by Spark

20
Q

Impala vs Hive vs Spark

A

Impala: massively parallel processing (MPP) in memory, like Spark
Hive: not in memory (MapReduce jobs write intermediate results to disk)
Spark: general purpose, not coupled to HDFS