Spark -- interview questions Flashcards
Q: What is Spark Core?
A: Spark Core provides the basic functionality of Spark, such as memory management, fault recovery, interacting with storage systems, and task scheduling.
Q: What do you understand by SchemaRDD?
A: An RDD that consists of row objects (wrappers around basic string or integer arrays) together with schema information describing the type of data in each column.
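In modern Spark, SchemaRDD was renamed DataFrame (as of Spark 1.3). A minimal sketch of the same idea, building row objects plus an explicit schema (names and values here are invented for illustration):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("schema-rdd-sketch").master("local[*]").getOrCreate()

// Row objects wrap the raw values; the schema describes each column's type.
val rows = spark.sparkContext.parallelize(Seq(Row("alice", 34), Row("bob", 29)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age",  IntegerType, nullable = false)
))

// In current Spark the result is a DataFrame, the successor to SchemaRDD.
val df = spark.createDataFrame(rows, schema)
df.printSchema()
```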
Q: What do you understand by receivers in Spark Streaming?
A: Receivers are special entities in Spark Streaming that consume data from various data sources and move it into Apache Spark. Receivers are usually created by the streaming context as long-running tasks on executors and are scheduled in a round-robin manner, with each receiver taking a single core.
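A minimal sketch of a receiver-based source: socketTextStream creates a receiver that runs as a long-running task on one executor core (the host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Use at least 2 local cores: one for the receiver, one for processing.
val conf = new SparkConf().setAppName("receiver-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

// socketTextStream sets up a receiver that pulls lines from the given socket.
val lines  = ssc.socketTextStream("localhost", 9999)  // placeholder host/port
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```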
Q: What operations are supported by RDD?
A: Transformations and actions.
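A minimal sketch of the difference, assuming an existing SparkContext `sc`: transformations such as filter and map are lazy and only describe a new RDD, while actions such as count and collect trigger the actual computation:

```scala
val nums = sc.parallelize(1 to 10)   // assumes an existing SparkContext `sc`

// Transformations: lazy, each returns a new RDD, nothing runs yet.
val evens   = nums.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Actions: trigger execution and return results to the driver.
println(doubled.count())                   // 5
println(doubled.collect().mkString(", "))  // 4, 8, 12, 16, 20
```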
Q: What is Spark SQL?
A: Spark SQL (which superseded the earlier Shark project) is a module introduced in Spark to work with structured data and perform structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component supports a special RDD called SchemaRDD, composed of row objects and schema objects that define the data type of each column in the row. It is similar to a table in a relational database.
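A minimal sketch of running a relational SQL query through Spark SQL (the table and column names are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-sql-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// A small DataFrame registered as a temporary view, then queried with SQL.
val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()
```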
Q: What is Spark Executor?
A: When SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store data on the worker nodes. The final tasks are transferred by SparkContext to the executors for execution.
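Executor resources are typically requested through configuration when the application starts. A sketch using standard Spark properties (the values are arbitrary and depend on the cluster):

```scala
import org.apache.spark.sql.SparkSession

// Standard Spark properties controlling the executors acquired from the cluster manager.
val spark = SparkSession.builder()
  .appName("executor-config-sketch")
  .config("spark.executor.instances", "4")   // number of executors (YARN/Kubernetes)
  .config("spark.executor.cores", "2")       // cores per executor
  .config("spark.executor.memory", "4g")     // heap memory per executor
  .getOrCreate()
```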
Q: What is SparkContext in Apache Spark? What is the need of SparkContext? What are the responsibilities of SparkContext?
A: A SparkContext is a client of Spark’s execution environment and it acts as the master of the Spark application. SparkContext sets up internal services and establishes a connection to a Spark execution environment. You can create RDDs, accumulators and broadcast variables, access Spark services and run jobs (until SparkContext stops) after the creation of SparkContext. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
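A minimal sketch of creating a SparkContext and using it for RDDs, an accumulator, and a broadcast variable (names and values are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("spark-context-sketch").setMaster("local[*]")
val sc   = new SparkContext(conf)

val acc    = sc.longAccumulator("processed")              // accumulator
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))    // broadcast variable
val rdd    = sc.parallelize(Seq(1, 2, 2, 1))

rdd.foreach(_ => acc.add(1))                              // runs a job on the executors
println(acc.value)                                        // 4
println(rdd.map(i => lookup.value.getOrElse(i, "?")).collect().mkString(", "))

sc.stop()   // stop before creating another SparkContext in the same JVM
```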
Q: List the functions of Spark SQL.
A: Spark SQL is capable of the following (see the sketch after this list):
Loading data from a variety of structured sources
Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau
Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more
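A short sketch of those capabilities, assuming a hypothetical people.json file: loading a structured source, exposing a custom function to SQL, and querying with SQL from inside a program:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-sql-functions-sketch").master("local[*]").getOrCreate()

// 1) Load data from a structured source (the path is a placeholder).
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

// 2) Expose a custom function (UDF) to SQL.
spark.udf.register("shout", (s: String) => s.toUpperCase)

// 3) Query with SQL from inside the program; external BI tools would instead
//    connect through Spark SQL's JDBC/ODBC server.
spark.sql("SELECT shout(name) AS name FROM people WHERE age > 30").show()
```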
Q: What is a Spark DataFrame?
A: It is the main object used in Spark SQL.
A Spark DataFrame is an advanced RDD-like table object that consists of row objects and a schema (columns).
Data = the information (rows), frame = the schema (named columns). It is similar to an Avro file, which also stores a schema alongside the data.
In Spark, a DataFrame is a collection of data distributed over the network with a schema. It can be understood as data formatted in a row/column manner.
A DataFrame can be created from Hive data, JSON files, CSV files, structured data, or raw data that can be framed as structured data.
We can also create a DataFrame from an RDD if a schema can be applied to that RDD.
A temporary view or table can also be created from a DataFrame, since it carries both data and schema, and SQL can then be run on that table/view to get results faster (see the sketch below).
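A sketch of the last two points: applying a schema (column names) to an RDD with toDF, then registering a temporary view and querying it (the names are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dataframe-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Apply a schema (column names) to an RDD to obtain a DataFrame.
val rdd = spark.sparkContext.parallelize(Seq(("alice", 34), ("bob", 29)))
val df  = rdd.toDF("name", "age")

// Create a temporary view from the DataFrame and run SQL against it.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people ORDER BY age DESC").show()
```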
Q: What are the advantages of DataFrame in Apache Spark?
A: A DataFrame is a distributed collection of data organized into named columns. The DataFrame APIs are available in several programming languages, for example Java, Scala, Python, and R.
Q: What is Spark Core?
❑ Spark Core is the base engine for large-scale parallel and distributed data processing. Java, Scala, and Python APIs offer a platform for distributed ETL application development.
❑ Spark Core performs various important functions like memory management, monitoring jobs, fault tolerance, job scheduling, and interaction with storage systems (see the caching sketch after this list). Further, additional libraries built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:
➢ Memory management and fault recovery
➢ Scheduling, distributing and monitoring jobs on a cluster
➢ Interacting with storage systems
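As a small illustration of the memory-management side (a sketch only, assuming an existing SparkContext `sc` and a placeholder file path), Spark Core lets you persist an RDD in memory so later actions reuse it instead of recomputing it, while lost partitions are rebuilt from lineage on failure:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`; the path is a placeholder.
val logs   = sc.textFile("logs.txt")
val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)

println(errors.count())   // first action: computes and caches the partitions
println(errors.count())   // second action: served from memory, not recomputed
```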
Q: What are the four libraries of Spark SQL?
1) Data Source API – the universal API
➢ It has built-in support for Hive, Avro, JSON, JDBC, Parquet, etc.
➢ It supports third-party integration through Spark packages (a short read/write sketch follows this list).
2) DataFrame API
➢ It is lazily evaluated, like a Spark RDD.
➢ It can process data ranging in size from kilobytes to petabytes, on anything from a single-node cluster to multi-node clusters.
3) Interpreter & Optimizer
4) SQL Service
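A brief sketch of the Data Source API in use, reading and writing a couple of built-in formats (all file paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("data-source-sketch").master("local[*]").getOrCreate()

// Read structured data through the universal reader (format + load).
val users  = spark.read.format("parquet").load("users.parquet")   // placeholder path
val events = spark.read.json("events.json")                       // shorthand for a built-in format

// Write back out in another built-in format; third-party formats plug in the
// same way via Spark packages, using the format name the package provides.
users.write.format("json").save("users-json")                     // placeholder output path
```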