AWS Flashcards
What is the lineage of an RDD?
A lineage is a graph of all the parent RDDs of an RDD. It keeps track of all the transformations that have been applied to that RDD, including the location from which the data has to be read.
How can we see the Lineage?
By calling .toDebugString on the RDD, or through the Spark web UI.
What is the logical plan?
The logical plan is an abstraction of all the transformation steps (e.g. join, filter) that need to be executed on your data. It describes what you expect as output, without specifying how it will be computed.
The physical plan?
The translation of the logical plan into actual processes that occur when we call an action upon our RDD.
It consists of stages, which consist of tasks.
How many partitions does a single task work on?
Each task operates on its own partition, meaning that the number of partitions in our dataset determines the number of tasks that are spawned.
What’s the difference between cluster mode and client mode on YARN?
Cluster mode - the Spark driver program runs inside the ApplicationMaster on the cluster, and output from the driver appears in that container's logs.
Client mode - the driver runs outside of YARN, on the machine from which the job was submitted (on EMR this is typically the master node), and communicates with the ApplicationMaster to ensure resources are allocated appropriately. Driver output appears on the machine where it is running.
What is an executor? What are executors when we run Spark on YARN?
The worker node processes that are in charge of running individual tasks in a given Spark job. This is where RDDs are evaluated, and where RDDs are cached in memory.
When we run Spark on YARN, executors run as containers on the YARN NodeManagers.
What is AWS?
AWS stands for Amazon Web Services, a cloud computing platform that offers services such as database and storage options, computing power, content delivery, and networking, among other functionalities, to help organizations scale.
EC2?
Elastic Compute Cloud provides virtual machines that act as physical servers on which to deploy your applications. When you set up an EC2 instance you decide how much memory, processing power, and disk space you want. Common instance families include:
General purpose - balanced compute, memory, and networking
Compute optimized - high processing
Memory optimized - large datasets in memory
Storage optimized - high sequential read and write access
S3?
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers scalability, data availability, security, and performance.
Amazon S3 allows people to store objects (files) in buckets (directories). Bucket names must be globally unique, and buckets are defined at the region level.
EMR?
Amazon Elastic MapReduce (EMR) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
When we spin up an EMR cluster, AWS will spin up some number of EC2 instances in that cluster, network them together, and install + run Hadoop (YARN) on those machines.
What does it mean to run an EMR Step Execution?
Step execution runs a series of steps (e.g. MapReduce or Spark jobs) and then terminates the cluster once they finish. Using this method gives an easy-to-understand workflow when running jobs on EMR and helps prevent accidentally leaving a cluster running after it is no longer in use.
What are some benefits to using the cloud?
High Speed
Cloud computing allows you to deploy your service quickly, in a few clicks, so you can provision the resources your system needs within minutes.
Cost Savings
Cost saving is one of the biggest benefits of cloud computing. It saves you substantial capital cost, since no physical hardware investment is needed, and you do not need trained personnel to maintain the hardware; the buying and managing of equipment is done by the cloud service provider.
What is the Spark History Server?
The Spark History Server is a monitoring tool that displays information about completed Spark applications. Execution is usually represented as a directed acyclic graph (DAG) that gives a visual representation of the stages and tasks in a given Spark job, as well as the resources and time used to perform various tasks.
What does it mean to “spill to disk” when executing spark tasks?
That your task ran out of space in memory, so Spark temporarily wrote some of its data to disk.
When during a Job do we need to pay attention to the number of partitions and adjust if necessary?
Whenever we first create an RDD by reading in some form of input data, before we perform the initial stage of transformations.