Revature Spark Flashcards
What does Cluster Computing refer to?
Cluster computing refers to the use of a group of interconnected computers working together as a single system to perform tasks, enabling parallel processing and increased computing power.
What is a Working Set?
A working set refers to the subset of data actively used by computations, stored in memory to optimize access and performance.
What does RDD stand for?
Resilient Distributed Dataset. An RDD is a fundamental data structure in Apache Spark: an immutable, distributed collection of objects that can be processed in parallel. RDDs are designed to handle large-scale data processing efficiently and are the building blocks of Spark's distributed computing model.
What does it mean when we say an RDD is a collection of objects partitioned across a set of machines?
It means that the data in an RDD is divided into chunks (partitions), and these chunks are distributed across different nodes in a cluster for parallel processing.
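A minimal PySpark sketch of this idea (the local[*] master is an assumption for a local demo; on a real cluster the partitions would live on different worker nodes):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

rdd = sc.parallelize(range(100), numSlices=4)  # split the data into 4 partitions
print(rdd.getNumPartitions())                  # 4
```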
Why do we say that MapReduce has an acyclic data flow?
MapReduce has an acyclic data flow because data moves in a directed manner from input to output without cycles, ensuring deterministic processing.
Explain the deficiency in using Hive for interactive analysis on datasets. How does Spark alleviate this problem?
Hive processes queries using MapReduce, which has high latency due to repeated disk I/O. Spark uses in-memory computation, significantly reducing latency and improving interactivity.
What is the lineage of an RDD?
The lineage of an RDD is the sequence of transformations applied to create it from other RDDs, allowing Spark to recompute lost data partitions.
RDDs are lazy and ephemeral. What does this mean?
Lazy means RDDs do not compute their transformations until an action is triggered. Ephemeral means RDDs are not persisted unless explicitly cached or stored.
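A sketch of laziness in PySpark (data.txt is a hypothetical input file):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

lines = sc.textFile("data.txt")              # nothing is read yet
words = lines.flatMap(lambda l: l.split())   # still nothing computed
print(words.count())                         # action: the pipeline runs now
print(words.count())                         # recomputed from scratch unless cached
```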
What are the 4 ways provided to construct an RDD?
1. Parallelizing an existing collection.
2. Reading data from an external storage system.
3. Transforming an existing RDD.
4. Creating RDDs from Hadoop datasets (e.g., files in HDFS).
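A minimal PySpark sketch of three of these (the HDFS path is a hypothetical example):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

nums = sc.parallelize([1, 2, 3, 4])            # 1. parallelize a collection
lines = sc.textFile("hdfs:///data/input.txt")  # 2./4. read external / Hadoop storage
evens = nums.filter(lambda n: n % 2 == 0)      # 3. transform an existing RDD
```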
What does it mean to transform an RDD?
Transforming an RDD involves applying operations like map, filter, or join to produce a new RDD.
What does it mean to cache an RDD?
Caching an RDD means storing its data in memory for reuse across operations, improving performance for iterative computations.
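A sketch of caching in PySpark (app.log is a hypothetical input file):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

errors = sc.textFile("app.log").filter(lambda l: "ERROR" in l).cache()
print(errors.count())   # first action: computes the filter and caches the result
print(errors.take(5))   # second action: served from the in-memory copy
```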
What does it mean to perform a parallel operation on an RDD?
Performing a parallel operation on an RDD means executing computations concurrently across the partitions on multiple nodes of the cluster.
Why does Spark need special tools for shared variables, instead of just declaring them?
Spark tasks run on different nodes, and variables in closures are copied to each node, making updates inconsistent without special tools like broadcast variables or accumulators.
What is a broadcast variable?
A broadcast variable is a read-only variable that is cached and shared with all nodes in the cluster, ensuring efficient distribution of large values.
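A minimal PySpark sketch of a broadcast variable (the lookup table here is invented for illustration):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

prices = sc.broadcast({"apple": 2, "banana": 1})   # shipped to each executor once
orders = sc.parallelize(["apple", "banana", "apple"])
total = orders.map(lambda item: prices.value[item]).sum()
print(total)  # 5
```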
What is an accumulator?
An accumulator is a shared variable that tasks can only add to, used for aggregating values across nodes, such as counters or sums; only the driver can read its value.
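A minimal PySpark sketch of an accumulator used as a counter (foreach is an action, so the updates are applied exactly once):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

bad_records = sc.accumulator(0)

def check(x):
    if x < 0:
        bad_records.add(1)   # tasks can only add; only the driver reads .value

sc.parallelize([1, -2, 3, -4]).foreach(check)
print(bad_records.value)     # 2
```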
Be comfortable enough with the following terms to recognize them.
RDD, Action, Transformation, lineage, cache, lazy evaluation, broadcast variable, accumulator
What are some transformations available on an RDD?
Transformations create a new RDD from an existing one. Common transformations include map, filter, flatMap, groupByKey, reduceByKey, union, join, and distinct.
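A short PySpark sketch chaining two of these:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

words = sc.parallelize(["a", "b", "a", "c"])
pairs = words.map(lambda w: (w, 1))              # map: one output per input
counts = pairs.reduceByKey(lambda a, b: a + b)   # reduceByKey: combine per key
print(counts.collect())  # e.g. [('a', 2), ('b', 1), ('c', 1)]
```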
What is the difference between a wide and a narrow transformation in Spark?
In a narrow transformation like map or filter, each output partition depends on a single input partition, so partitions are processed independently with no data movement. In a wide transformation like groupByKey or join, an output partition can depend on many input partitions, so Spark must shuffle (redistribute) data across the cluster, creating a stage boundary.
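A sketch contrasting the two in PySpark:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
doubled = pairs.mapValues(lambda v: v * 2)  # narrow: each partition processed alone
grouped = pairs.groupByKey()                # wide: forces a shuffle across partitions
```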
What are some actions available on an RDD?
collect, count, reduce, take, saveAsTextFile, foreach.
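A few of these in a minimal PySpark sketch:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

nums = sc.parallelize([1, 2, 3, 4])
print(nums.count())                     # 4
print(nums.take(2))                     # [1, 2]
print(nums.reduce(lambda a, b: a + b))  # 10
```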
What is a shuffle in Spark?
A shuffle in Spark is the process of redistributing data across partitions, typically required for wide transformations.
What’s the difference in output between MapReduce wordcount in Hadoop and .map followed by .reduceByKey in Spark?
Both produce the same (word, count) output, but Spark is typically much faster because intermediate results stay in memory instead of being written to disk between the map and reduce phases.
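A sketch of the Spark side of this comparison (a flatMap is added first to split lines into words; the input and output paths are hypothetical):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("wordcount_out")  # output directory must not already exist
```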
Why should we be careful about using accumulators outside of an action?
Accumulator updates are only guaranteed to be applied exactly once when made inside actions; updates made inside transformations can be re-applied if tasks are re-executed, leading to inconsistent results.
What is the closure of a task? Can we use variables in a closure?
The closure of a task is the set of variables captured from the driver program for use in a task. Variables can be used but are copied to each node, and updates do not reflect back in the driver.
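A sketch of the closure pitfall in PySpark: the driver-side variable is copied into each task, so the driver never sees the updates.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

counter = 0

def add(x):
    global counter
    counter += x      # updates a *copy* shipped to each executor

sc.parallelize(range(10)).foreach(add)
print(counter)        # still 0 on the driver; use an accumulator instead
```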
How can we see the lineage of an RDD?
By using the toDebugString method on the RDD.
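A minimal sketch (in PySpark, toDebugString() returns bytes, hence the decode):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

rdd = sc.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 4)
print(rdd.toDebugString().decode())  # prints the chain of parent RDDs
```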
What are actions?
Actions are operations that trigger the evaluation of an RDD's transformations and return a result to the driver (or write it to storage), such as collect or reduce.
Transformations vs. actions
Transformations create a new dataset from an existing one, such as filter, map, and join.
Actions trigger the computation and return the result to the driver program.