Introduction to Big Data with PySpark Flashcards
Introduction to Big Data
Big Data Definition
Big data is a term for data that are too large and complex to be processed with conventional computing resources. Describing data as “big” is relative to our modern computing power.
Introduction to Big Data
Big Data 3 Vs
Big Data can be characterized by what are known as the 3 Vs:
- Volume: the size of the data exceeds what the available computing resources can process at once.
- Velocity: big data grows rapidly as it becomes faster, cheaper, and easier to collect automatically and continuously.
- Variety: big data comes in a variety of formats, such as structured (data tables with rows and columns), semi-structured (think JSON files with nested data), and unstructured (audio, image, and video data).
Introduction to Big Data
Big Data and RAM
Big data analysis is limited by the amount of Random Access Memory (RAM) that the available computing resources have. Many big data systems will use a computing cluster to increase the amount of total RAM.
Introduction to Big Data
HDFS Overview
One system for big data storage is called the Hadoop Distributed File System (HDFS). In this system, a cluster of computing resources stores the data. This cluster consists of a manager node, which sends commands to the worker nodes that house the data.
Introduction to Big Data
MapReduce Overview
MapReduce is a framework that can be used to process large datasets stored in a Hadoop Distributed File System (HDFS) cluster. MapReduce consists of two main functions, map and reduce, which can perform complex operations over a distributed system.
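As a rough, single-machine illustration of the idea (plain Python, not the Hadoop API): the map step produces word counts per chunk of data, and the reduce step merges those counts into one result. In a real cluster, each chunk would live on a different worker node.

```python
from functools import reduce
from collections import Counter

# Toy sketch of the MapReduce pattern; each chunk stands in for data on a worker node.
chunks = [
    "spark makes big data simple",
    "big data needs big clusters",
]

# Map step: turn each chunk into per-chunk word counts.
mapped = [Counter(chunk.split()) for chunk in chunks]

# Reduce step: merge the per-chunk counts into a single result.
total = reduce(lambda a, b: a + b, mapped)
print(total["big"])  # 3
```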
Introduction to Big Data
MapReduce Process Overview
MapReduce works by sending commands from the manager node down to the numerous worker nodes, which process subsets of data in parallel. This speeds up processing when compared to traditional data processing frameworks.
Spark RDDs with PySpark
Spark Overview
Spark is a framework designed to process large amounts of data. Originally built for creating data pipelines for machine learning workloads, Spark can query, transform, and analyze big data on a variety of data systems.
Spark RDDs with PySpark
Spark Process Overview
Spark is able to process data quickly because it leverages the Random Access Memory (RAM) of a computing cluster. When processing data, Spark keeps it in RAM, which is much faster to access than disk, and does so in parallel across all worker nodes in the cluster. This differs from MapReduce, which reads and writes data on each node’s disk, and explains why Spark is a faster framework than MapReduce.
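As a minimal sketch (assuming an existing SparkContext named sc and a hypothetical input file), an RDD can be kept in memory explicitly with .cache() so later computations reuse it rather than re-reading from disk:

```python
# Assumes an existing SparkContext `sc`; "logs.txt" is a hypothetical input path.
rdd = sc.textFile("logs.txt")
errors = rdd.filter(lambda line: "ERROR" in line).cache()  # keep the filtered RDD in RAM

# Both actions below reuse the cached data instead of recomputing from disk.
print(errors.count())
print(errors.take(5))
```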
Spark RDDs with PySpark
PySpark Overview
The Spark framework is written in Scala but can be used from several languages, namely Python, Java, SQL, and R.
PySpark is the Python API for Spark and can be installed directly from the leading Python repositories (PyPI and conda). PySpark is a particularly popular framework because it makes the big data processing of Spark available to Python programmers. Python is a more approachable and familiar language for many data practitioners than Scala.
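A minimal setup sketch: install PySpark from PyPI, then start a local session (the app name used here is arbitrary).

```python
# pip install pyspark
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; "example" is an arbitrary application name.
spark = SparkSession.builder.appName("example").getOrCreate()
sc = spark.sparkContext  # entry point for the RDD API used in the following cards
```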
Spark RDDs with PySpark
Properties of RDDs
The three key properties of RDDs (see the sketch after this list):
- Fault-tolerant (resilient): data is recoverable in the event of failure
- Partitioned (distributed): datasets are cut up and distributed to nodes
- Operated on in parallel (parallelization): tasks are executed on all the chunks of data at the same time
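A small sketch of the partitioned/parallel idea, assuming an existing SparkContext sc: .parallelize() splits a local collection into partitions that Spark can operate on in parallel.

```python
# Assumes an existing SparkContext `sc`.
rdd = sc.parallelize(range(100), numSlices=4)  # split the data into 4 partitions

print(rdd.getNumPartitions())  # 4
print(rdd.glom().collect())    # view the elements grouped by partition
```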
Spark RDDs with PySpark
Transforming an RDD
A transformation is a Spark operation that takes an existing RDD as input and returns a new RDD that has been modified by the transformation.
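For example (assuming an existing SparkContext sc), .map() is a transformation: it returns a new RDD and leaves the original unchanged.

```python
# Assumes an existing SparkContext `sc`.
numbers = sc.parallelize([1, 2, 3, 4, 5])

def double(x):
    return x * 2

doubled = numbers.map(double)  # a new RDD; `numbers` itself is unchanged
print(doubled.collect())       # [2, 4, 6, 8, 10]
```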
Spark RDDs with PySpark
Lambdas in Spark Operations
Lambda expressions allow us to apply a simple operation to an object without needing to define it as a named function. This improves readability by condensing what could be a few lines of code into a single line. Utilizing lambdas in Spark operations allows us to apply any arbitrary function to all RDD elements specified by the transformation or action.
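The same kind of transformation as in the previous card, written with lambdas instead of named functions (assuming an existing SparkContext sc):

```python
# Assumes an existing SparkContext `sc`.
numbers = sc.parallelize([1, 2, 3, 4, 5])

doubled = numbers.map(lambda x: x * 2)        # inline function, no def needed
evens = numbers.filter(lambda x: x % 2 == 0)  # lambdas work in any transformation or action
print(doubled.collect(), evens.collect())     # [2, 4, 6, 8, 10] [2, 4]
```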
Spark RDDs with PySpark
Executing Actions on RDDs
An action is a Spark operation that takes an RDD as input, but always outputs a value instead of a new RDD.
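For example (assuming an existing SparkContext sc), .count(), .first(), and .reduce() are actions: each returns a plain Python value rather than an RDD.

```python
# Assumes an existing SparkContext `sc`.
numbers = sc.parallelize([1, 2, 3, 4, 5])

print(numbers.count())                     # 5  (an int, not an RDD)
print(numbers.first())                     # 1
print(numbers.reduce(lambda a, b: a + b))  # 15
```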
Spark RDDs with PySpark
Spark Transformations are Lazy
Transformations in Spark are not performed until an action is called. Spark optimizes and reduces overhead once it has the full list of transformations to perform. This behavior is called lazy evaluation. In contrast, Pandas transformations use eager evaluation: they are executed as soon as they are called.
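A small sketch of lazy evaluation (assuming an existing SparkContext sc): the transformation only builds up a plan, and nothing is computed until the action runs.

```python
# Assumes an existing SparkContext `sc`.
numbers = sc.parallelize(range(1_000_000))

# Transformation: returns immediately; no data is actually processed yet.
squares = numbers.map(lambda x: x * x)

# Action: triggers the chain of transformations to execute.
print(squares.take(3))  # [0, 1, 4]
```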
Spark RDDs with PySpark
Viewing RDDs
Two common functions used to view RDDs are:
- .collect(), which pulls the entire RDD into memory. This can exhaust available memory if the RDD is large.
- .take(n), which pulls only the first n elements of the RDD into memory.
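A quick sketch contrasting the two (assuming an existing SparkContext sc):

```python
# Assumes an existing SparkContext `sc`.
rdd = sc.parallelize(range(1_000_000))

print(rdd.take(5))  # safe: only the first 5 elements return to the driver
# rdd.collect()     # pulls all 1,000,000 elements into driver memory; use with care
```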