SQL Flashcards
Spark SQL is pronounced
Spark Sequel
Spark SQL extends RDDs to a
“DataFrame” object
DataFrames contain _____ objects
Row
DataFrames have a ____, which leads to more efficient storage
schema
DataFrames can run _____ queries
SQL
Parquet is a
popular columnar storage format
Spark SQL can read and write
Hive tables, JSON, Parquet
import
from pyspark.sql import HiveContext, SQLContext, Row
To use Spark SQL, the first thing you do is create a
Hive context
create a Hive context
hiveContext = HiveContext(sc)
load JSON data through the Hive context
inputData = hiveContext.jsonFile(dataFile)  # Spark 1.4+: hiveContext.read.json(dataFile)
JSON is pronounced
Jay Sahn
register inputData (schema inferred on load) as a temp table for SQL queries
inputData.registerTempTable("myStructuredStuff")
run a query and make a DataFrame
myResultDataFrame = hiveContext.sql("""SELECT foo FROM bar ORDER BY footer""")
alternative to HiveContext
SQLContext