Bigdata Lineage Flashcards
What is Hadoop
- HDFS
- YARN
- MapReduce
- Hadoop Common
What is HBase
Database on top of Hadoop. Random read/write access. Stores very large tables (millions of columns, billions of rows) atop commodity hardware. Bigtable-like capabilities.
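To make "Bigtable-like" more concrete, here is a minimal in-memory sketch of the data model: a sparse map from row key to column to value, with random reads/writes and key-range scans. The class `SparseTable` and its methods are hypothetical and are not the HBase API; real HBase also keeps rows sorted on disk and versions each cell by timestamp, which this sketch ignores.

```python
from collections import defaultdict

class SparseTable:
    """Toy stand-in for a Bigtable-like table: row key -> column -> value."""

    def __init__(self):
        # Only cells that were written take up space, so wide, sparse rows are cheap.
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Random write: set one cell."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Random read: fetch one cell without creating empty rows."""
        return self._rows.get(row_key, {}).get(column, default)

    def scan(self, start_key, stop_key):
        """Yield (row_key, cells) for keys in [start_key, stop_key), in key order."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])

# Column names follow the "family:qualifier" convention (illustrative values).
table = SparseTable()
table.put("user#001", "info:name", "Ada")
table.put("user#002", "info:name", "Grace")
table.put("user#002", "stats:logins", 7)
```

Rows do not need to share the same columns: "user#001" has no "stats:logins" cell and simply returns the default on read.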
Hadoop history
2006
Yahoo
HBase history
2007
Hadoop subproject
What is Dynamo
Dynamo is an internal Amazon database (a highly available key-value store) built to handle their scale. It is best known through the paper Amazon published about it in 2007.
Where does the term NoSQL come from
- Dynamo paper helped launch the ‘NoSQL’ movement
Cassandra history
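Open-sourced in 2008; later commercialized by DataStax. Origin: Facebook. Two Facebook engineers did much of the development, one of whom came from Amazon, where he co-authored the Dynamo paper.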
What is Cassandra
Column-oriented, scalable database
What is a column oriented database
Excels at handling time series and «group by» queries
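A minimal sketch of why a columnar layout suits «group by» and time-series scans, assuming a tiny in-memory table. The names `rows`, `columns`, and `sum_by_key` are illustrative only and do not come from any real database API.

```python
from collections import defaultdict

# Row-oriented layout: all fields of one record are stored together.
rows = [
    {"ts": 1, "sensor": "a", "value": 10.0},
    {"ts": 2, "sensor": "b", "value": 20.0},
    {"ts": 3, "sensor": "a", "value": 30.0},
]

# Column-oriented layout: all values of one column are stored together.
columns = {
    "ts": [r["ts"] for r in rows],
    "sensor": [r["sensor"] for r in rows],
    "value": [r["value"] for r in rows],
}

def sum_by_key(cols, key, measure):
    """Group-by sum that touches only the two columns it needs."""
    totals = defaultdict(float)
    for k, v in zip(cols[key], cols[measure]):
        totals[k] += v
    return dict(totals)

result = sum_by_key(columns, "sensor", "value")
```

The aggregation never reads the "ts" column; on disk, that means whole columns can be skipped, which is where the advantage over a row layout comes from.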
What is Apache Pig
High-level scripting language to generate MapReduce jobs
Apache Pig history
- Yahoo. Effectively dead since 2017 (latest release: 0.17)
Why is Pig dead?
Preference for SQL:
- Hive
- Spark SQL
Performance and ecosystem: Spark
Apache Hive history
- Facebook.
Tends to be replaced by solutions designed for cloud storage, such as Iceberg, Hudi, and Delta Lake
What is Apache Hive
Querying tool, SQL-like interface (HiveQL). Creates MapReduce jobs
Relation between Pig and Hive
Both provide a high-level way to create MapReduce jobs. Pig uses its own language (Pig Latin), while Hive uses a SQL-like language (HiveQL).
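As a rough illustration of the kind of job both tools generate, here is a single-process sketch of the map, shuffle, and reduce phases for a "count rows per key" query. It uses no Hadoop API; the function names, the `country` field, and the sample records are all hypothetical.

```python
from collections import defaultdict

# Roughly what a query like
#   SELECT country, COUNT(*) FROM visits GROUP BY country
# (hypothetical table "visits") boils down to once compiled to MapReduce.

def map_phase(records):
    """Map: emit one (key, 1) pair per input record."""
    for record in records:
        yield record["country"], 1

def shuffle_phase(pairs):
    """Shuffle: group all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

records = [{"country": "FR"}, {"country": "US"}, {"country": "FR"}]
counts = reduce_phase(shuffle_phase(map_phase(records)))
```

Writing these three phases by hand for every query is the boilerplate that Pig Latin and HiveQL were designed to hide.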