Week 7: HIVE and Apache SparkSQL Flashcards

1
Q

HIVE

A

It’s an open-source data warehousing solution built on top of Hadoop. It supports queries written in HiveQL, a language similar to SQL, which are compiled into MapReduce jobs and executed on Hadoop. HIVE acts as a wrapper that lets people familiar with SQL do big data processing in the context of data warehousing.

Pros:
1. An easy way to process large-scale data.
2. Supports SQL-based queries.
3. Provides user-defined interfaces to extend programmability.
4. Efficient execution plans for performance, and interoperability with other databases.

Cons:
1. No easy way to append data.
2. Files in HDFS are immutable.
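
Example code (a minimal sketch; the table and columns are hypothetical). An aggregation query like the one below is the kind of HiveQL that HIVE compiles into a MapReduce job and runs on Hadoop:
CREATE TABLE hits (ip STRING, url STRING, ts STRING);

SELECT url, COUNT(*) AS n_hits
FROM hits
GROUP BY url;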

2
Q

HIVE: Applications

A

Common applications include log processing, text mining, document indexing, customer-facing business intelligence (e.g. Google Analytics), predictive modelling, and hypothesis testing.

3
Q

HIVE Components: Client Components

A

Command Line Interface (CLI), the web UI, JDBC/ODBC driver.

4
Q

HIVE Components: Driver

A

The driver manages the lifecycle of a HiveQL statement as it moves through HIVE, maintains a session handle, and maintains session statistics.

5
Q

HIVE Components: Compiler

A

It compiles HiveQL into MapReduce tasks.

6
Q

HIVE Components: Optimiser

A

Optimises the tasks (improves the execution plan generated from the HiveQL statement).

7
Q

HIVE Components: Executor

A

It executes the tasks in the proper order and interacts with Hadoop.

8
Q

HIVE Components: Metastore

A

It serves as the system catalogue and stores metadata about tables, partitions, their locations in HDFS, etc. It runs on an RDBMS rather than on HDFS, as it needs very low latency.

9
Q

HIVE Components: Thrift Server

A

It provides an interface between the clients and the Metastore, allowing the clients to query or modify the information in the Metastore.

10
Q

Data Units of the Data Model: Table

A

Each table consists of a number of rows, and each row has a specified number of columns.

Example code:
CREATE TABLE t1(st string, fl float, li list<map<string, struct<p1:int, p2:int>>>);

11
Q

Data Units of the Data Model: Partition

A

Partitions come from decomposing a table into partitions based on column values. Partitioning speeds up querying, as only the relevant data in Hadoop is scanned.

Example code:
CREATE TABLE test_part(c1 string, c2 int)
PARTITIONED BY (ds string, hr int);

ALTER TABLE test_part
ADD PARTITION(ds='2009-02-02', hr=11);

12
Q

Data Units of the Data Model: Bucket

A

Buckets come from decomposing a table into buckets based on a hash of a column. They are useful for sampling, especially when partitioning would produce partitions that are too numerous and too small.

Example code:
CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 96 BUCKETS;

Example code:
SELECT product_id, sum(price)
FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
GROUP BY product_id

13
Q

HIVE: Mapping Data Units

A

The following workflow is used to map data units into the HDFS name space:

  1. A table is stored in a directory in HDFS.
  2. A partition of the table is stored in a subdirectory within a table’s HDFS directory.
  3. A bucket is stored in a file within the partition’s or table’s directory depending on whether the table is a partitioned table or not.

Hive prunes the data by scanning only the required sub-directories tied to relevant partitions.

Hive uses the file corresponding to a bucket, making bucketing useful for sampling.
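
Example layout (a sketch, assuming the default warehouse directory /user/hive/warehouse and the test_part table defined earlier):
/user/hive/warehouse/test_part/ (table directory)
/user/hive/warehouse/test_part/ds=2009-02-02/hr=11/ (partition subdirectory)
/user/hive/warehouse/test_part/ds=2009-02-02/hr=11/000000_0 (a bucket file inside the partition directory)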

14
Q

HIVE Query Limitations

A
  1. HIVE doesn’t support inserting into an existing table or data partition; all inserts overwrite the existing data.
  2. Only equality predicates are supported in JOIN.

Otherwise, there are plenty of SQL commands supported.
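
Example code (a sketch illustrating both limitations; table and column names are hypothetical):
-- inserts always overwrite the target table or partition
INSERT OVERWRITE TABLE page_views
SELECT * FROM staging_page_views;

-- only equality predicates are allowed in the JOIN condition
SELECT * FROM t1 JOIN t2 ON (t1.key = t2.key);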

15
Q

HiveQL: Selecting a Database

A

USE database;

16
Q

HiveQL: Listing Databases

A

SHOW DATABASES;

17
Q

HiveQL: Listing Tables in a Database

A

SHOW TABLES;

18
Q

HiveQL: Describing the Format of a Table

A

DESCRIBE (FORMATTED|EXTENDED) table;

19
Q

HiveQL: Creating a Database

A

CREATE DATABASE db_name;

20
Q

HiveQL: Dropping a Database

A

DROP DATABASE db_name [CASCADE];

21
Q

HiveQL: Retrieving Information

A

SELECT from_columns FROM table WHERE conditions;

22
Q

HiveQL: All Values

A

SELECT * FROM table;

23
Q

HiveQL: Some Values

A

SELECT * FROM table WHERE rec_name = "Value";

24
Q

HiveQL: Multiple Criteria

A

SELECT * FROM table WHERE rec1 = "value" AND rec2 = "value2";

25
Q

HiveQL: Selecting Specific Columns

A

SELECT column_name FROM table;

26
Q

HiveQL: Retrieving Unique Output Records

A

SELECT DISTINCT column_name FROM table;

27
Q

HiveQL: Sorting

A

SELECT col1, col2 FROM table ORDER BY col2;

28
Q

HiveQL: Sorting Backwards

A

SELECT col1, col2 FROM table ORDER BY col2 DESC;

29
Q

HiveQL: Counting Rows

A

SELECT COUNT(*) FROM table;

30
Q

HiveQL: Grouping with Counting

A

SELECT owner, COUNT(*) FROM table GROUP BY owner;

31
Q

HiveQL: Maximum Value

A

SELECT MAX(col_name) AS label FROM table;

32
Q

HiveQL: Selecting From Multiple Tables

A

SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);

33
Q

HIVE: JOIN Efficiency

A

In a common (reduce-side) join, the mappers send all rows with the same join key to a single reducer, and the reducer performs the join.

If many rows share the same key, efficiency drops.

To make JOIN operations more efficient, keep the smaller table in memory and join it with a chunk of the larger table at a time (a map-side join). Everything is then done in memory, without the reduce step required in the more common join scenario.

Example code:
set hive.auto.convert.join=true;
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

34
Q

Serializer/Deserializer (SerDe)

A

It describes how to load the data from a file into a representation that makes it look like a table. It’s implemented in Java, and there are several built-in serializers/deserializers.
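
Example code (a sketch; OpenCSVSerde is one of Hive's built-in SerDes, while the table itself is hypothetical):
CREATE TABLE csv_data (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;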

35
Q

Lazy SerDe

A

It doesn’t fully materialise an object until individual attributes are necessary. This reduces the overhead of creating unnecessary objects in Hive, which increases performance.

36
Q

SparkSQL

A

It’s a new module in Apache Spark that integrates relational processing with Spark’s functional programming API. It allows Spark programmers to leverage the benefits of relational processing, and it lets SQL users call complex analytics libraries in Spark.

SparkSQL runs as a library on top of Spark, exposes interfaces accessible through JDBC and command-line, and exposes the DataFrame API which is accessible through different programming languages.

It tries to bridge the relational processing model with the native RDDs in Spark by using a DataFrame API that can perform relational operations on both external data sources and Spark’s built-in RDDs, and a highly extensible optimiser, Catalyst.

Pros:
1. Has access to MapReduce-style procedural processing, a low-level programming model.
2. Integrated with SQL, a declarative language.
3. Great for ETL (Extract, Transform, and Load) to and from various semi or unstructured data sources, and advanced analytics that are hard to express in relational systems.

Cons:
1. MapReduce alone isn’t the best fit for data warehousing operations on Big Data.

37
Q

DataFrame

A

It’s a distributed collection of rows with the same schema, similar to a table in an RDBMS. It can be built from external data sources or RDDs, and supports relational operators (e.g. where, groupBy) as well as Spark operations. It’s evaluated lazily: each DataFrame object represents a logical plan to compute a dataset, but no execution occurs until the user calls an output operation. This enables optimisation.

DataFrames use special column operators (e.g. === for equality comparison), and can also be registered as temporary SQL tables and queried using SQL.

Example code:
ctx = new HiveContext()
users = ctx.table("users")
young = users.where(users("age") < 21)
println(young.count())
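
Continuing the sketch above (registerTempTable and sql are Spark 1.x-era APIs; the table name "young" is arbitrary):
young.registerTempTable("young")
ctx.sql("SELECT COUNT(*) FROM young")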

38
Q

SparkSQL: Data Model

A

SparkSQL uses a nested data model based on HIVE. It supports different data types, which allows for modelling data from HIVE, RDBMS, JSON, and native objects in Java/Python/Scala.

39
Q

Advantages of DataFrames over Relational Query Languages

A
  1. Code can be written in different languages and still benefit from optimisations across the whole logical plan.
  2. Programmers can use control structures (if, loops), as in the sketch below.
  3. The logical plan is analysed eagerly, although query results are computed lazily, so SparkSQL reports an error as soon as the user types an invalid line of code.
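
Example code (a sketch of point 2; the DataFrame users and the column names are hypothetical). A filter built up in an ordinary Scala loop still yields one logical plan that is optimised as a whole:
var filtered = users
for (col <- Seq("age", "income")) {
  filtered = filtered.where(filtered(col) > 0) // each iteration only extends the logical plan
}
println(filtered.count()) // optimisation and execution happen here
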
40
Q

SparkSQL: Querying Native Datasets

A

SparkSQL automatically infers the schema of the native objects of a programming language. This allows running relational operations on existing Spark programs, and combining RDDs with external structured data.
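
Example code (a sketch in Scala, assuming Spark 1.3-style APIs with an existing SparkContext sc and SQLContext ctx; the case class is hypothetical):
case class User(name: String, age: Int)
import ctx.implicits._
// the schema (name: string, age: int) is inferred from the case class by reflection
val users = sc.parallelize(Seq(User("ann", 30), User("bob", 17))).toDF()
users.where(users("age") < 21).show()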

41
Q

SparkSQL: Catalyst

A

It’s a query optimiser written in Scala. It uses Scala’s pattern-matching features to represent Abstract Syntax Trees (ASTs) and applies rules to manipulate them.

Example code:
tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}