Revature Spark Flashcards
What does Cluster Computing refer to?
Cluster computing refers to the use of a group of interconnected computers working together as a single system to perform tasks, enabling parallel processing and increased computing power.
What is a Working Set?
A working set refers to the subset of data actively used by computations, stored in memory to optimize access and performance.
What does RDD stand for?
Resilient Distributed Dataset. An RDD is a fundamental data structure in Apache Spark: an immutable, distributed collection of objects that can be processed in parallel. RDDs are designed to handle large-scale data processing efficiently and are the building blocks of Spark's distributed computing model.
What does it mean when we say an RDD is a collection of objects partitioned across a set of machines?
It means that the data in an RDD is divided into chunks (partitions), and these chunks are distributed across different nodes in a cluster for parallel processing.
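A minimal PySpark sketch of this idea (the local[*] master is an assumption for a local demo; on a real cluster the partitions would live on different worker nodes):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

rdd = sc.parallelize(range(100), numSlices=4)  # split the data into 4 partitions
print(rdd.getNumPartitions())                  # 4
```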
Why do we say that MapReduce has an acyclic data flow?
MapReduce has an acyclic data flow because data moves in a directed manner from input to output without cycles, ensuring deterministic processing.
Explain the deficiency in using Hive for interactive analysis on datasets. How does Spark alleviate this problem?
Hive processes queries using MapReduce, which has high latency due to repeated disk I/O. Spark uses in-memory computation, significantly reducing latency and improving interactivity.
What is the lineage of an RDD?
The lineage of an RDD is the sequence of transformations applied to create it from other RDDs, allowing Spark to recompute lost data partitions.
RDDs are lazy and ephemeral. What does this mean?
Lazy means RDDs do not compute their transformations until an action is triggered. Ephemeral means RDDs are not persisted unless explicitly cached or stored.
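A sketch of laziness in PySpark (data.txt is a hypothetical input file):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

lines = sc.textFile("data.txt")              # nothing is read yet
words = lines.flatMap(lambda l: l.split())   # still nothing computed
print(words.count())                         # action: the pipeline runs now
print(words.count())                         # recomputed from scratch unless cached
```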
What are the 4 ways provided to construct an RDD?
1. Parallelizing an existing collection.
2. Reading data from an external storage system.
3. Transforming an existing RDD.
4. Creating RDDs from Hadoop datasets (e.g., files in HDFS).
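A minimal PySpark sketch of three of these (the HDFS path is a hypothetical example):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

nums = sc.parallelize([1, 2, 3, 4])            # 1. parallelize a collection
lines = sc.textFile("hdfs:///data/input.txt")  # 2./4. read external / Hadoop storage
evens = nums.filter(lambda n: n % 2 == 0)      # 3. transform an existing RDD
```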
What does it mean to transform an RDD?
Transforming an RDD involves applying operations like map, filter, or join to produce a new RDD.
What does it mean to cache an RDD?
Caching an RDD means storing its data in memory for reuse across operations, improving performance for iterative computations.
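A sketch of caching in PySpark (app.log is a hypothetical input file):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

errors = sc.textFile("app.log").filter(lambda l: "ERROR" in l).cache()
print(errors.count())   # first action: computes the filter and caches the result
print(errors.take(5))   # second action: served from the in-memory copy
```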
What does it mean to perform a parallel operation on an RDD?
Performing a parallel operation on an RDD means executing computations concurrently across the partitions on multiple nodes of the cluster.
Why does Spark need special tools for shared variables, instead of just declaring them?
Spark tasks run on different nodes, and variables in closures are copied to each node, making updates inconsistent without special tools like broadcast variables or accumulators.
What is a broadcast variable?
A broadcast variable is a read-only variable that is cached and shared with all nodes in the cluster, ensuring efficient distribution of large values.
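A minimal PySpark sketch of a broadcast variable (the lookup table here is invented for illustration):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

prices = sc.broadcast({"apple": 2, "banana": 1})   # shipped to each executor once
orders = sc.parallelize(["apple", "banana", "apple"])
total = orders.map(lambda item: prices.value[item]).sum()
print(total)  # 5
```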
What is an accumulator?
An accumulator is a shared variable that tasks can only add to, used for aggregating values across nodes, such as counters or sums; only the driver can read its value.
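A minimal PySpark sketch of an accumulator used as a counter (foreach is an action, so the updates are applied exactly once):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

bad_records = sc.accumulator(0)

def check(x):
    if x < 0:
        bad_records.add(1)   # tasks can only add; only the driver reads .value

sc.parallelize([1, -2, 3, -4]).foreach(check)
print(bad_records.value)     # 2
```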
Be comfortable enough with the following terms to recognize them.
RDD, Action, Transformation, lineage, cache, lazy evaluation, broadcast variable, accumulator
What are some transformations available on an RDD?
Transformations create a new RDD from an existing one. Common transformations include map, filter, flatMap, groupByKey, reduceByKey, union, join, and distinct.
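A short PySpark sketch chaining two of these:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

words = sc.parallelize(["a", "b", "a", "c"])
pairs = words.map(lambda w: (w, 1))              # map: one output per input
counts = pairs.reduceByKey(lambda a, b: a + b)   # reduceByKey: combine per key
print(counts.collect())  # e.g. [('a', 2), ('b', 1), ('c', 1)]
```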
What is the difference between a wide and a narrow transformation in Spark?
In a narrow transformation like map or filter, each output partition depends on a single input partition, so partitions are processed independently with no data movement. In a wide transformation like groupByKey or join, an output partition can depend on many input partitions, so Spark must shuffle (redistribute) data across the cluster, creating a stage boundary.
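A sketch contrasting the two in PySpark:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
doubled = pairs.mapValues(lambda v: v * 2)  # narrow: each partition processed alone
grouped = pairs.groupByKey()                # wide: forces a shuffle across partitions
```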
What are some actions available on an RDD?
collect, count, reduce, take, saveAsTextFile, foreach.
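A few of these in a minimal PySpark sketch:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

nums = sc.parallelize([1, 2, 3, 4])
print(nums.count())                     # 4
print(nums.take(2))                     # [1, 2]
print(nums.reduce(lambda a, b: a + b))  # 10
```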
What is a shuffle in Spark?
A shuffle in Spark is the process of redistributing data across partitions, typically required for wide transformations.
What’s the difference in output between MapReduce wordcount in Hadoop and .map followed by .reduceByKey in Spark?
Both produce the same (word, count) output, but Spark is typically much faster because intermediate results stay in memory instead of being written to disk between the map and reduce phases.
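A sketch of the Spark side of this comparison (a flatMap is added first to split lines into words; the input and output paths are hypothetical):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("wordcount_out")  # output directory must not already exist
```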
Why should we be careful about using accumulators outside of an action?
Accumulator updates are only guaranteed to be applied exactly once when made inside actions; updates made inside transformations can be re-applied if tasks are re-executed, leading to inconsistent results.
What is the closure of a task? Can we use variables in a closure?
The closure of a task is the set of variables captured from the driver program for use in a task. Variables can be used but are copied to each node, and updates do not reflect back in the driver.
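A sketch of the closure pitfall in PySpark: the driver-side variable is copied into each task, so the driver never sees the updates.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

counter = 0

def add(x):
    global counter
    counter += x      # updates a *copy* shipped to each executor

sc.parallelize(range(10)).foreach(add)
print(counter)        # still 0 on the driver; use an accumulator instead
```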
How can we see the lineage of an RDD?
By using the toDebugString method on the RDD.
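A minimal sketch (in PySpark, toDebugString() returns bytes, hence the decode):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

rdd = sc.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 4)
print(rdd.toDebugString().decode())  # prints the chain of parent RDDs
```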
What are actions?
Actions are operations that trigger the evaluation of an RDD's transformations and return a result to the driver (or write it to storage), such as collect or reduce.
Transformations vs. actions
Transformations create a new dataset from an existing one, such as filter, map, and join.
Actions trigger the computation and return the result to the driver program.