Introduction to Big Data with PySpark Flashcards

1
Q

Introduction to Big Data

Big Data Definition

A

Big data is a term for datasets that are too large and complex to be processed with conventional computing resources. Describing data as “big” is relative to modern computing power.

2
Q

Introduction to Big Data

Big Data 3 Vs

A

Big Data can be characterized by what are known as the 3 Vs:

  1. Volume: the size of big data exceeds what the available computing resources can store and process.
  2. Velocity: big data grows rapidly as it becomes faster, cheaper, and easier to collect automatically and continuously.
  3. Variety: big data comes in a variety of formats, such as structured (data tables with rows and columns), semi-structured (think JSON files with nested data), and unstructured (audio, image, and video data).
3
Q

Introduction to Big Data

Big Data and RAM

A

Big data analysis is limited by the amount of Random Access Memory (RAM) available to the computing resources. Many big data systems use a computing cluster to increase the total amount of RAM.

4
Q

Introduction to Big Data

HDFS Overview

A

One system for big data storage is the Hadoop Distributed File System (HDFS). In this system, a cluster of computing resources stores the data. The cluster consists of a manager node, which sends commands to the worker nodes that house the data.

5
Q

Introduction to Big Data

MapReduce Overview

A

MapReduce is a framework that can be used to process large datasets stored in a Hadoop Distributed File System (HDFS) cluster. MapReduce consists of two main functions, map and reduce, which can perform complex operations over a distributed system.
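To make the map and reduce steps concrete, here is a minimal word-count sketch in plain Python (not the actual Hadoop API); the input lines are made up for illustration:

from collections import defaultdict

lines = ["big data", "big clusters"]          # illustrative input

# "map" step: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# "reduce" step: sum the counts for each word
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))   # {'big': 2, 'data': 1, 'clusters': 1}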

6
Q

Introduction to Big Data

MapReduce Process Overview

A

MapReduce works by sending commands from the manager node down to the numerous worker nodes, which process subsets of data in parallel. This speeds up processing when compared to traditional data processing frameworks.

8
Q

Spark RDDs with PySpark

Spark Overview

A

Spark is a framework designed to process large amounts of data. Originally built for creating data pipelines for machine learning workloads, Spark can query, transform, and analyze big data on a variety of data systems.

9
Q

Spark Process Overview

Spark Process Overview

A

Spark is able to process data quickly because it leverages the Random Access Memory (RAM) of a computing cluster. When processing data, Spark keeps the data in RAM, which is much faster to access than disk, and it does this in parallel across all worker nodes in a cluster. This differs from MapReduce, which processes data on each node’s disk, and explains why Spark is a faster framework than MapReduce.

10
Q

Spark RDDs with PySpark

PySpark Overview

A

The Spark framework is written in Scala but can be used from several languages, namely Python, Java, SQL, and R.

PySpark is the Python API for Spark and can be installed directly from the leading Python repositories (PyPI and conda). PySpark is a particularly popular framework because it makes Spark’s big data processing available to Python programmers, and Python is a more approachable and familiar language for many data practitioners than Scala.
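As a minimal sketch (assuming a local installation; the application name and master setting are arbitrary), PySpark can be installed from PyPI and used to start a local Spark session:

# pip install pyspark   (or: conda install -c conda-forge pyspark)
from pyspark.sql import SparkSession

# start a local Spark session and grab its SparkContext
spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()
sc = spark.sparkContext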

11
Q

Spark RDDs with PySpark

Properties of RDDs

A

The three key properties of RDDs (Resilient Distributed Datasets):

  • Fault-tolerant (resilient): data is recoverable in the event of failure
  • Partitioned (distributed): datasets are cut up and distributed to nodes
  • Operated on in parallel (parallelization): tasks are executed on all the chunks of data at the same time
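A small sketch of the distributed and parallel properties, assuming `sc` is an existing SparkContext:

# assumes `sc` is an existing SparkContext
rdd = sc.parallelize(range(100), numSlices=4)   # data is cut into 4 partitions
print(rdd.getNumPartitions())                   # 4
print(rdd.map(lambda x: x * 2).count())         # 100; work runs on the partitions in parallel
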
12
Q

Spark RDDs with PySpark

Transforming an RDD

A

A transformation is a Spark operation that takes an existing RDD as an input and provides a new RDD that has been modified by the transformation as an output.
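For example (assuming `sc` is an existing SparkContext), .map() and .filter() are transformations that each return a new RDD and leave the input RDD unchanged:

# assumes `sc` is an existing SparkContext
rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)            # new RDD: 1, 4, 9, 16, 25
evens = squared.filter(lambda x: x % 2 == 0)  # new RDD: 4, 16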

13
Q

Spark RDDs with PySpark

Lambdas in Spark Operations

A

Lambda expressions allow us to apply a simple operation to an object without needing to define it as a named function. This improves readability by condensing what could be a few lines of code into a single line. Using lambdas in Spark operations lets us apply an arbitrary function to all RDD elements specified by the transformation or action.
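A sketch comparing a named function with the equivalent lambda, assuming `sc` is an existing SparkContext:

# assumes `sc` is an existing SparkContext
rdd = sc.parallelize(["spark", "rdd", "lambda"])

# with a named function
def to_upper(word):
    return word.upper()

upper_rdd = rdd.map(to_upper)

# the same operation condensed into a lambda
upper_rdd = rdd.map(lambda word: word.upper())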

14
Q

Spark RDDs with PySpark

Executing Actions on RDDs

A

An action is a Spark operation that takes an RDD as input but outputs a value (such as a number or a list) instead of a new RDD.
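For example (assuming `sc` is an existing SparkContext), these actions each return a plain Python value:

# assumes `sc` is an existing SparkContext
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.count()                      # 5
rdd.sum()                        # 15
rdd.reduce(lambda a, b: a + b)   # 15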

15
Q

Spark RDDs with PySpark

Spark Transformations are Lazy

A

Transformations in Spark are not performed until an action is called. Spark optimizes and reduces overhead once it has the full list of transformations to perform. This behavior is called lazy evaluation. In contrast, pandas transformations use eager evaluation: each operation is executed as soon as it is called.
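A sketch of lazy evaluation, assuming `sc` is an existing SparkContext; nothing runs until the action at the end:

# assumes `sc` is an existing SparkContext
rdd = sc.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)          # not executed yet
filtered = doubled.filter(lambda x: x > 5)  # still not executed
filtered.collect()                          # action: both transformations run now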

16
Q

Spark RDDs with PySpark

Viewing RDDs

A

Two common functions used to view RDDs are:

  • .collect(), which pulls the entire RDD into memory. This method can exhaust available memory if the RDD is large.
  • .take(n), which pulls only the first n elements of the RDD into memory.
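For example, assuming `sc` is an existing SparkContext:

# assumes `sc` is an existing SparkContext
rdd = sc.parallelize(range(1000))
rdd.take(5)     # [0, 1, 2, 3, 4] -- only the first 5 elements are pulled in
rdd.collect()   # all 1,000 elements -- use with caution on large RDDs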

17
Q

Spark RDDs with PySpark

Reducing RDDs

A

When executing .reduce() on an RDD, the reducing function must be both commutative and associative because RDDs are partitioned and sent to different nodes. Enforcing these two properties guarantees that parallelized tasks can be executed and completed in any order without affecting the output. Examples of operations with these properties include addition and multiplication.
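For example, assuming `sc` is an existing SparkContext:

# assumes `sc` is an existing SparkContext
rdd = sc.parallelize([1, 2, 3, 4])
rdd.reduce(lambda a, b: a + b)   # 10 -- addition is commutative and associative
rdd.reduce(lambda a, b: a * b)   # 24 -- multiplication also qualifies
# subtraction is neither, so its result would depend on how the data is partitioned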

18
Q

Spark RDDs with PySpark

Aggregating with Accumulators

A

Accumulator variables are shared variables that can only be updated through associative and commutative operations. They are primarily used as counters or sums in parallel computing, since they are updated on each node separately and then combined. However, they are only reliable when used in actions, because Spark transformations can be re-executed after a failure, which would incorrectly increment the accumulator.
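A sketch of an accumulator used as a counter inside an action, assuming `sc` is an existing SparkContext and using made-up log lines:

# assumes `sc` is an existing SparkContext; the log lines are illustrative
error_count = sc.accumulator(0)

def count_errors(line):
    if "ERROR" in line:
        error_count.add(1)

logs = sc.parallelize(["INFO ok", "ERROR disk full", "ERROR timeout"])
logs.foreach(count_errors)   # .foreach() is an action, so the update is reliable
print(error_count.value)     # 2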

20
Q

Spark RDDs with PySpark

Sharing Broadcast Variables

A

In Spark, broadcast variables are cached input datasets that are sent to each node. This provides a performance boost when running operations that use the broadcast dataset, since all nodes have access to the same data. We should avoid broadcasting large amounts of data, because serializing it and sending it to every node over the network becomes too expensive.
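A sketch of a small lookup table shared as a broadcast variable, assuming `sc` is an existing SparkContext and using illustrative data:

# assumes `sc` is an existing SparkContext; the lookup data is illustrative
region_lookup = sc.broadcast({"US": "Americas", "FR": "Europe"})

orders = sc.parallelize([("US", 100), ("FR", 250)])
labeled = orders.map(lambda order: (region_lookup.value[order[0]], order[1]))
labeled.collect()   # [('Americas', 100), ('Europe', 250)]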

21
Q

Spark DataFrames with PySpark SQL

Starting a SparkSession

A

A SparkSession is the entry point to Spark SQL. The session is a wrapper around a SparkContext and contains all the metadata required to start working with distributed data.

# start a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
22
Q

Spark DataFrames with PySpark SQL

PySpark DataFrames

A

PySpark DataFrames are distributed collections of data in tabular format that are built on top of RDDs. They provide an interface similar to pandas DataFrames and make it much easier to manipulate data in Spark than working with RDDs directly.
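For example, assuming `spark` is an existing SparkSession and using made-up rows:

# assumes `spark` is an existing SparkSession; the rows are illustrative
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"],
)
df.show()   # prints the DataFrame as a small text table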

23
Q

Spark DataFrames with PySpark SQL

RDDs to DataFrames

A

PySpark DataFrames can be created from RDDs using rdd.toDF(). They can also be converted back to RDDs with DataFrame.rdd.
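For example, assuming an active SparkSession `spark` and its SparkContext `sc`:

# assumes an active SparkSession `spark` and its SparkContext `sc`
rdd = sc.parallelize([("Alice", 34), ("Bob", 45)])
df = rdd.toDF(["name", "age"])   # RDD -> DataFrame
back_to_rdd = df.rdd             # DataFrame -> RDD of Row objects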

24
Q

Spark DataFrames with PySpark SQL

Inspecting DataFrame Schemas

A

All DataFrames have a schema that defines their structure: the columns, their datatypes, and value restrictions such as whether nulls are allowed. We can use DataFrame.printSchema() to show a DataFrame’s schema.
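For example, assuming `df` is the name/age DataFrame from the earlier sketch, the printed schema looks roughly like the comments below:

# assumes `df` is an existing DataFrame with name and age columns
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)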

25
Q

Spark DataFrames with PySpark SQL

Summarizing PySpark DataFrames

A

Similarly to pandas, we can display a high-level summary of PySpark DataFrames by using the .describe() function to quickly inspect the stored data.
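For example, assuming `df` is an existing DataFrame:

# assumes `df` is an existing DataFrame
df.describe().show()        # count, mean, stddev, min, max for every column
df.describe("age").show()   # summary statistics for a single column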

26
Q

Spark DataFrames with PySpark SQL

DataFrame Columns

A
27
Q

Spark DataFrames with PySpark SQL

Querying DataFrames with SQL

A
28
Q

Spark DataFrames with PySpark SQL

Creating a Temp View

A

If a query is executed often, we can save time by registering it as a temporary view. This makes the result available as a table that can be kept in memory and used in future analysis.
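For example, assuming an existing SparkSession `spark` and DataFrame `df`; the view name "people" is arbitrary:

# assumes an existing SparkSession `spark` and DataFrame `df`
df.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()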

29
Q

Spark DataFrames with PySpark SQL

Using Parquet Files

A

Parquet is a file format used with Spark to save DataFrames. Parquet format offers many benefits over traditional file formats like CSV:

  • Parquet files are efficiently compressed, so they are smaller than CSV files.
  • Parquet files preserve information about a DataFrame’s schema.
  • Performing analysis on Parquet files is often faster than on CSV files.
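For example, assuming an existing SparkSession `spark` and DataFrame `df`; the output path is hypothetical:

# assumes an existing SparkSession `spark` and DataFrame `df`; the path is hypothetical
df.write.parquet("people.parquet", mode="overwrite")   # write, schema included
people = spark.read.parquet("people.parquet")          # read back with schema intact
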
30
Q

Spark DataFrames with PySpark SQL

Reading and Writing using PySpark

A
31
Q

Putting it all together

Lambdas in Spark Operations

A

Lambda expressions allow us to apply a simple operation to an object without needing to define it as a named function. This improves readability by condensing what could be a few lines of code into a single line. Using lambdas in Spark operations lets us apply an arbitrary function to all RDD elements specified by the transformation or action.

32
Q

Putting it all together

Viewing RDDs

A

Two common functions used to view RDDs are:

  • .collect(), which pulls the entire RDD into memory. This method can exhaust available memory if the RDD is large.
  • .take(n), which pulls only the first n elements of the RDD into memory.

33
Q

Putting it all together

Reducing RDDs

A

When executing .reduce() on an RDD, the reducing function must be both commutative and associative because RDDs are partitioned and sent to different nodes. Enforcing these two properties guarantees that parallelized tasks can be executed and completed in any order without affecting the output. Examples of operations with these properties include addition and multiplication.

34
Q

Putting it all together

Starting a SparkSession

A

A SparkSession is the entry point to Spark SQL. The session is a wrapper around a SparkContext and contains all the metadata required to start working with distributed data.

# start a SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
35
Q

Putting it all together

RDDs to DataFrames

A

PySpark DataFrames can be created from RDDs using rdd.toDF(). They can also be converted back to RDDs with DataFrame.rdd.

36
Q

Putting it all together

Reading and Writing using PySpark

A
37
Q

Putting it all together

Inspecting DataFrame Schemas

A
38
Q

Putting it all together

DataFrame Columns

A
39
Q

Putting it all together

Creating a Temp View

A

If a query is executed often, we can save time by registering it as a temporary view. This makes the result available as a table that can be kept in memory and used in future analysis.

40
Q

Putting it all together

Using Parquet Files

A

Parquet is a file format used with Spark to save DataFrames. Parquet format offers many benefits over traditional file formats like CSV:

  • Parquet files are efficiently compressed, so they are smaller than CSV files.
  • Parquet files preserve information about a DataFrame’s schema.
  • Performing analysis on Parquet files is often faster than on CSV files.
41
Q

Putting it all together

Querying DataFrames with SQL

A