Apache Spark Core Concepts Flashcards
What is Spark Context?
Spark Context (SparkContext) is the main entry point for Spark functionality, allowing interaction with the Spark cluster.
What role does Spark Context play in a Spark application?
Spark Context sets up internal services and establishes a connection to the Spark execution environment.
How is Spark Context initialized in a Spark application?
In Spark 2.x and later, the SparkContext is created implicitly when a SparkSession is built and is accessed through it (spark.sparkContext); in Spark 1.x it was constructed directly from a SparkConf.
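A minimal PySpark sketch of this initialization (the application name is illustrative):

    from pyspark.sql import SparkSession

    # Building a SparkSession creates (or reuses) the underlying SparkContext.
    spark = SparkSession.builder.appName("example-app").getOrCreate()
    sc = spark.sparkContext  # the SparkContext behind this session
    print(sc.applicationId)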
What functionalities are available through Spark Context?
Spark Context provides access to various functionalities like creating RDDs (Resilient Distributed Datasets), performing transformations, and executing actions on distributed data.
What is the significance of Spark Context’s stop() method?
The stop() method is used to shut down the Spark application, releasing resources acquired by the Spark Context.
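Continuing the sketch above, the session and its SparkContext are shut down explicitly when the application is done:

    # Stopping the session also stops the underlying SparkContext
    # and releases executors and other cluster resources.
    spark.stop()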
Types of Spark Contexts
SparkContext (Legacy):
In earlier versions of Spark (1.x and prior), SparkContext was the main entry point for creating RDDs and basic operations. It’s now considered the legacy way of interacting with Spark, superseded by SparkSession.
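A legacy-style sketch (Spark 1.x API, still usable in later releases); the master URL and app name are illustrative:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("legacy-app").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd.count())  # 4
    sc.stop()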
SQLContext:
SQLContext is used to work with structured data in Spark. It provides a way to interact with Spark SQL, enabling the execution of SQL queries against Spark DataFrames and tables.
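A legacy SQLContext sketch (superseded by SparkSession), assuming an existing SparkContext sc; the view and column names are illustrative:

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([Row(name="a", value=1), Row(name="b", value=2)])
    df.createOrReplaceTempView("items")
    sqlContext.sql("SELECT name FROM items WHERE value > 1").show()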
HiveContext:
HiveContext is an extension of SQLContext that adds support for HiveQL. It enables access to Hive tables and metastore metadata, making it convenient for those already familiar with Hive.
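A legacy HiveContext sketch (Spark 1.x API, deprecated and later removed; in Spark 2.0+ the equivalent is SparkSession.builder.enableHiveSupport()). It assumes an existing SparkContext sc and an illustrative Hive table named sales:

    from pyspark.sql import HiveContext

    hive_ctx = HiveContext(sc)
    hive_ctx.sql("SELECT COUNT(*) FROM sales").show()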
SparkSession:
Introduced in Spark 2.0, SparkSession unifies the earlier entry points (SQLContext, HiveContext, and the underlying SparkContext) into a single one. It provides a unified interface to Spark features, including the DataFrame API, SQL, and Structured Streaming; DStream-based streaming still uses a separate StreamingContext.
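A sketch of SparkSession as the single entry point (names are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("unified-entry-point")
             # .enableHiveSupport()  # optional: requires Hive classes on the classpath
             .getOrCreate())

    df = spark.createDataFrame([("a", 1), ("b", 2)], ["name", "value"])
    df.createOrReplaceTempView("items")
    spark.sql("SELECT name FROM items WHERE value > 1").show()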
StreamingContext:
StreamingContext is used for creating DStreams (Discretized Streams) to process real-time streaming data. It allows the application to process streaming data in a way similar to working with RDDs.
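A minimal DStream word-count sketch using the legacy streaming API; it assumes a text stream on localhost:9999 (e.g. started with nc -lk 9999), and the host/port are illustrative:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "streaming-example")
    ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()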
PySpark Shell Context:
This context is specific to the PySpark interactive shell. The pyspark shell creates a SparkSession (spark) and a SparkContext (sc) at startup, providing a ready-to-use Spark execution environment for Python code.
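For example, inside the pyspark shell no initialization is needed (the commands below are illustrative):

    >>> sc.parallelize(range(5)).sum()
    10
    >>> spark.range(3).count()
    3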
SparkSession
What is the unified entry point introduced in Spark 2.0?
SparkSession, introduced in Spark 2.0, is the unified entry point, combining the different contexts (such as SQLContext and HiveContext) into a single interface.
What functionalities does SparkSession provide?
It provides access to DataFrame APIs, SQL, and streaming capabilities, unifying various Spark features.
Which context is used to work with structured data in Spark?
SQLContext is used to work with structured data in Spark, providing access to Spark SQL functionalities.
What does SQLContext enable in Spark?
It enables the execution of SQL queries against Spark DataFrames and tables.
What is the main entry point for Spark functionality?
SparkContext is the main entry point for Spark functionality, allowing interaction with the Spark cluster.
What does SparkContext primarily set up?
It sets up internal services and establishes a connection to the Spark execution environment.
What is the primary abstraction in Apache Spark?
RDD (Resilient Distributed Dataset) is the primary abstraction in Apache Spark, representing an immutable collection of elements partitioned across the nodes of a cluster and operated on in parallel.
How are RDDs resilient?
RDDs are resilient because they can recover lost data due to their lineage information, which allows for their reconstruction in case of failure.
What operations can you perform on RDDs?
RDDs support two types of operations: transformations (which create new RDDs from existing ones) and actions (which perform computations and return values).
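A sketch of the two kinds of operations, assuming an existing SparkContext sc:

    numbers = sc.parallelize([1, 2, 3, 4, 5])

    # Transformations are lazy: they only describe new RDDs and extend the lineage
    # that Spark uses to recompute lost partitions.
    squares = numbers.map(lambda x: x * x)
    evens = squares.filter(lambda x: x % 2 == 0)

    # Actions trigger computation and return results to the driver.
    print(evens.collect())   # [4, 16]
    print(squares.count())   # 5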