AWS Flashcards
What is the lineage of an RDD?
A lineage is a graph of all the parent RDDs of an RDD. It keeps track of all the transformations that have been applied to that RDD, including the location from which the data has to be read.
How can we see the Lineage?
By calling .toDebugString on the RDD, or through the Spark web UI.
What is the logical plan?
The logical plan is an abstraction of all the transformation steps (e.g. join, filter) that need to be executed on your data. It describes what you expect as output, without specifying how it will be computed.
The physical plan?
The translation of the logical plan into actual processes that occur when we call an action upon our RDD.
It consists of stages, which consist of tasks.
How many partitions does a single task work on?
Each task operates on its own partition, meaning that the number of partitions in our dataset determines the number of tasks that are spawned.
What’s the difference between cluster mode and client mode on YARN?
Cluster mode - the Spark driver program runs inside the ApplicationMaster on the cluster, and output from the driver appears in that container's logs.
Client mode - the driver runs outside of YARN, on the machine from which the job was submitted (on EMR this is typically the master node), and communicates with the ApplicationMaster to ensure resources are allocated appropriately. Driver output appears on the machine where it is running.
What is an executor? What are executors when we run Spark on YARN?
The worker node processes that are in charge of running individual tasks in a given Spark job. This is where RDDs are evaluated, and where RDDs are cached in memory.
When we run Spark on YARN, executors run as containers on the YARN NodeManagers.
What is AWS?
AWS stands for Amazon Web Services, a cloud computing platform that offers services such as database and storage options, computing power, content delivery, and networking, among other functionalities, to help organizations scale.
EC2?
Elastic Compute Cloud provides virtual machines that act as physical servers on which to deploy your applications. When you set up an EC2 instance you decide how much memory, processing power, and disk space you want. Common instance families include:
General purpose - balanced compute, memory, and networking
Compute optimized - high processing
Memory optimized - large datasets in memory
Storage optimized - high sequential read and write access
S3?
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers scalability, data availability, security, and performance.
Amazon S3 allows people to store objects (files) in buckets (directories). Bucket names must be globally unique, and buckets are defined at the region level.
EMR?
Amazon Elastic MapReduce (EMR) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
When we spin up an EMR cluster, AWS will spin up some number of EC2 instances in that cluster, network them together, and install + run Hadoop (YARN) on those machines.
What does it mean to run an EMR Step Execution?
Step execution runs a series of steps (e.g. MapReduce or Spark jobs) and then terminates the cluster once they finish. Using this method gives an easy-to-understand workflow when running jobs on EMR and helps prevent accidentally leaving a cluster running after it is no longer in use.
What are some benefits to using the cloud?
High Speed
Cloud computing allows you to deploy your service quickly, in a few clicks, so you can provision the resources your system needs within minutes.
Cost Savings
Cost saving is one of the biggest benefits of cloud computing. It saves you substantial capital cost, since no physical hardware investment is needed, and you do not need trained personnel to maintain the hardware; the buying and managing of equipment is done by the cloud service provider.
What is the Spark History Server?
The Spark History Server is a monitoring tool that displays information about completed Spark applications. Execution is usually represented as a directed acyclic graph (DAG) that gives a visual representation of the stages and tasks in a given Spark job, as well as the resources and time used to perform various tasks.
What does it mean to “spill to disk” when executing spark tasks?
That your task ran out of space in memory, so Spark temporarily wrote some of its data to disk.
When during a Job do we need to pay attention to the number of partitions and adjust if necessary?
Whenever we first create an RDD by reading in some form of input data, before we perform the initial stage of transformations.