AWS Flashcards

1
Q

What is the lineage of an RDD?

A

A lineage is a graph of all the parent RDDs of an RDD. It keeps track of all the transformations that have been applied to that RDD, including the location from which it has to read the data.
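As a rough illustration only, here is a toy pure-Python model of lineage. `ToyRDD` and its methods are hypothetical names invented for this sketch and are not Spark's API, and the input path is made up. The point is just that each dataset remembers its parent and the transformation that produced it, all the way back to the data source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: records how it was derived, holds no data."""
    description: str                    # source location or transformation name
    parent: Optional["ToyRDD"] = None   # None means this node reads the input data

    def transform(self, name: str) -> "ToyRDD":
        # A transformation returns a new node that points back at its parent.
        return ToyRDD(description=name, parent=self)

    def lineage(self) -> list:
        # Walk the parent links from this node back to the data source.
        node, steps = self, []
        while node is not None:
            steps.append(node.description)
            node = node.parent
        return steps


source = ToyRDD("textFile(/data/input.txt)")  # hypothetical input path
result = source.transform("map").transform("filter")
print(" <- ".join(result.lineage()))
```

Because every node keeps a link to its parent, a lost partition could in principle be rebuilt by replaying these steps from the source.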

2
Q

How can we see the Lineage?

A

By calling .toDebugString on the RDD, or by looking at the job's DAG in the Spark web UI.

3
Q

What is the logical plan?

A

The logical plan is an abstract description of all the transformation steps (join, filter, and so on) that need to be executed on your data. It describes what you expect as output, not how that output is computed.

4
Q

What is the physical plan?

A

The translation of the logical plan into actual processes that occur when we call an action upon our RDD.
It consists of stages, which consist of tasks.

5
Q

How many partitions does a single task work on?

A

Each task operates on exactly one partition, so the number of partitions in our dataset determines the number of tasks that are spawned.
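The one-task-per-partition idea can be sketched in plain Python. This is a toy model, not Spark code: `split_into_partitions` and `run_stage` are hypothetical helpers written for this card, and the "tasks" simply run one after another in a list comprehension.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions roughly equal contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_stage(partitions, task):
    """Run one task per partition and return (results, number_of_tasks)."""
    results = [task(partition) for partition in partitions]
    return results, len(results)


data = list(range(10))
partitions = split_into_partitions(data, 4)
partial_sums, num_tasks = run_stage(partitions, sum)
print(num_tasks, partial_sums)
```

Changing the second argument of `split_into_partitions` changes how many tasks `run_stage` reports, which is the relationship the card describes.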

6
Q

What’s the difference between cluster mode and client mode on YARN?

A

Cluster mode: the Spark driver program runs inside the ApplicationMaster, and output from the driver program appears inside that container.

Client mode: the driver runs separately from YARN and communicates with the ApplicationMaster to ensure resources are allocated appropriately. The driver runs on the machine from which the application was submitted (typically the master node), and its output appears on that same machine.

7
Q

What is an executor? What are executors when we run Spark on YARN?

A

Executors are the worker-node processes in charge of running the individual tasks in a given Spark job. This is where RDDs are evaluated and where cached RDDs are held in memory.

When we run Spark on YARN, executors run as containers on the YARN NodeManagers.

8
Q

What is AWS?

A

AWS stands for Amazon Web Services. It is a cloud computing platform that offers services such as database and storage options, computing power, content delivery, and networking, among other functionality, to help organizations scale.

9
Q

EC2?

A

Elastic Compute Cloud (EC2) provides virtual machines that stand in for physical servers on which you deploy your applications. When you set up an EC2 instance, you can decide how much memory, processing power, and disk space you want.
Instance types include:
General purpose
Compute optimized - high processing power
Memory optimized - large datasets in memory
Storage optimized - high sequential read and write access

10
Q

S3?

A

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers scalability, data availability, security, and performance.

Amazon S3 allows people to store objects (files) in buckets (similar to top-level directories). Bucket names must be globally unique, and each bucket is defined at the region level.

11
Q

EMR?

A

Amazon EMR (Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.

When we spin up an EMR cluster, AWS will spin up some number of EC2 instances in that cluster, network them together, and install + run Hadoop (YARN) on those machines.

12
Q

What does it mean to run an EMR Step Execution?

A

Step execution runs a number of MapReduce jobs as a series of steps and then terminates the EMR cluster. This method gives an easy-to-understand workflow when running jobs on an EMR cluster and helps prevent accidentally leaving a cluster running after it is no longer in use.

13
Q

What are some benefits to using the cloud?

A

High Speed
Cloud computing allows you to deploy your service quickly, in a few clicks. This faster deployment gets you the resources your system requires within minutes.

Cost Savings
Cost saving is one of the biggest benefits of cloud computing. It saves substantial capital cost because it requires no investment in physical hardware. You also do not need trained personnel to maintain the hardware: buying and managing the equipment is done by the cloud service provider.

14
Q

What is the Spark History Server?

A

The Spark History Server is a monitoring tool that displays information about completed Spark applications. Its UI includes a Directed Acyclic Graph (DAG) that gives a visual representation of the execution of stages and tasks in a given Spark job, as well as the resources and time used to perform various tasks.

15
Q

What does it mean to “spill to disk” when executing spark tasks?

A

It means that your task ran out of space in memory, so it temporarily wrote some data to disk.

16
Q

When during a Job do we need to pay attention to the number of partitions and adjust if necessary?

A

Whenever we first create an RDD by reading in some form of input data, before we perform the initial stage of transformations.

17
Q

What is spark.driver.memory?

A

The memory allocated to the driver program, which is 1GB by default.

18
Q

What about spark.executor.memory?

A

The amount of RAM on a worker node that is allocated to each executor for processing Spark tasks (1 GB by default).

19
Q

What is a Spark Application?

A

A Spark application is a self-contained computation that runs user-supplied code to compute a result.
It corresponds to one call to spark-submit with a .jar file: one program, one SparkContext. RDDs that are cached in an application can be reused within it.

20
Q

What is a Spark job?

A

A parallel computation consisting of multiple tasks that is spawned when we call a Spark action, for example collect() or saveAsTextFile().
collect() - returns a list of the elements in the RDD
saveAsTextFile() - saves the RDD as a text file

21
Q

What is a Spark Stage?

A

A Spark stage is a set of tasks that run the same computation, each on a different partition of the same data. Jobs are divided into stages based on how many times we shuffle: each shuffle starts a new stage, which operates on the splits produced by that shuffle. A job might have one or many stages.
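To make the "each shuffle starts a new stage" rule concrete, here is a small pure-Python sketch. It is a simplification rather than Spark's scheduler: `WIDE_TRANSFORMATIONS` is a short, non-exhaustive set chosen for this example, and `split_into_stages` is a hypothetical helper that treats every listed wide transformation as a shuffle.

```python
# Transformations assumed to need a shuffle (a small, non-exhaustive set for this sketch).
WIDE_TRANSFORMATIONS = {"reduceByKey", "groupByKey", "join", "repartition", "sortByKey"}


def split_into_stages(transformations):
    """Group an ordered list of transformation names into stages.

    A new stage is started whenever a shuffle-causing transformation is reached.
    """
    stages = [[]]
    for name in transformations:
        if name in WIDE_TRANSFORMATIONS:
            stages.append([])  # the shuffle starts a new stage
        stages[-1].append(name)
    return stages


job = ["textFile", "flatMap", "map", "reduceByKey", "map", "sortByKey"]
stages = split_into_stages(job)
for number, stage in enumerate(stages):
    print(number, stage)
```

With two shuffles in `job`, the sketch yields three stages, matching the rule of thumb that a job has one more stage than it has shuffles.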

22
Q

What is a Spark Task?

A

A task is a single unit of work that corresponds to one RDD partition. A task applies transformations such as map or filter to a single split. Tasks run in order within stages. Transformations that cause a shuffle also produce tasks.

23
Q

What happens when we cache an RDD?

A

RDDs are typically not shared across jobs. However, they can be shared if they are written to local disk as part of a shuffle.

RDDs are shared across tasks when they are cached and can also be shared across stages.

24
Q

Some storage levels have _SER, what does this mean?

A

“Serialized”. When we serialize something, we transform it into bytes that can be written to disk or sent over the network and reconstituted later.

Serialized storage requires less space but more processing power.

25
Q

Some levels have _2, what does this mean?

A

Replication. These storage levels cache each of your RDD partitions on two machines in the cluster. This can make jobs faster (due to data locality and redundant copies) but uses more resources.
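The naming convention behind the _SER and _2 suffixes can be sketched as a small name decoder. This is an illustration of the convention only: `describe_storage_level` is a hypothetical helper, it works purely on the level's name, and its `serialized` flag reflects nothing more than the presence of the _SER suffix.

```python
def describe_storage_level(name):
    """Decode a storage level name such as MEMORY_AND_DISK_SER_2 into its parts."""
    parts = name.split("_")

    # A trailing number is the replication factor (for example, _2).
    replication = 1
    if parts[-1].isdigit():
        replication = int(parts.pop())

    # A trailing SER means the data is stored in serialized form.
    serialized = False
    if parts[-1] == "SER":
        serialized = True
        parts.pop()

    return {
        "use_memory": "MEMORY" in parts,
        "use_disk": "DISK" in parts,
        "serialized": serialized,   # based on the name suffix only
        "replication": replication,
    }


for level in ["MEMORY_ONLY", "MEMORY_ONLY_SER", "MEMORY_AND_DISK_SER_2", "DISK_ONLY"]:
    print(level, describe_storage_level(level))
```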

26
Q

If the storage level for a persist is MEMORY_ONLY and there isn’t enough memory, what happens?

A

Partitions that do not fit in memory are not cached; they are recomputed each time they are needed.

27
Q

What is the storage level for .cache()?

A

MEMORY_ONLY (for RDDs)

28
Q

What is an AWS Availability Zone?

A

Each region has multiple availability zones (AZs): usually 3, with a minimum of 2 and a maximum of 6. Each AZ is one or more discrete data centers with its own power, networking, and connectivity. AZs are separate from each other so that they are isolated from disasters.

29
Q

Classic Ports to know?

A

22 - SSH (Secure Shell)
21 - FTP (File Transfer Protocol)
22 - SFTP (Secure File Transfer Protocol)
80 - HTTP (access unsecured websites)
443 - HTTPS (access secured websites)
3389 - RDP (Remote Desktop Protocol)

30
Q

How do I check storage level?

A

Call the getStorageLevel method on the RDD.

31
Q

Different ways to repartition?

A

repartition() - full shuffle
coalesce() - avoids a full shuffle by merging existing partitions
repartition() is used to increase or decrease the number of RDD, DataFrame, or Dataset partitions, whereas coalesce() is used only to decrease the number of partitions.
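The difference in data movement can be sketched with a toy model in plain Python. These are not Spark's implementations: `toy_coalesce` and `toy_repartition` are hypothetical helpers, partitions are just lists, and the round-robin placement is a simplification chosen for this example.

```python
def toy_coalesce(partitions, n):
    """Merge existing partitions into n groups without splitting any of them."""
    if n >= len(partitions):
        # Merging alone cannot increase the number of partitions.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions land in the same target group, kept whole.
        merged[i * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions, n):
    """Redistribute every record round-robin across n new partitions (full shuffle)."""
    records = [record for part in partitions for record in part]
    new_parts = [[] for _ in range(n)]
    for i, record in enumerate(records):
        new_parts[i % n].append(record)
    return new_parts


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(toy_coalesce(parts, 2))
print(toy_repartition(parts, 6))
```

In the coalesce case every original partition stays intact inside one merged partition, while in the repartition case each record may move, which is why only the latter can spread data over more partitions.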

32
Q

What are core services that AWS offers?

A

IaaS - Infrastructure as a Service
PaaS - Platform as a Service
SaaS - Software as a Service

33
Q

Launch modes in EMR?

A

Cluster and Step Execution

34
Q

In IaaS, what is managed by you?

A

The applications and the operating system (along with the data, runtime, and middleware in between). Example: EC2

35
Q

What is the correct command utilized to transfer a local file into an EMR cluster?

A

scp -i privateKeyPath /localPath hadoop@emr-dns:/remotePath

36
Q

What is horizontal scaling?

A

Adding more resources of the same kind, such as more servers or physical machines.

37
Q

What is elasticity in computing?

A

Dynamically (de)allocate capacity on an existing resource

38
Q

In SaaS, what is managed by you?

A

With software as a service (SaaS) products, you deploy software hosted on AWS infrastructure and grant buyers access to the software in your AWS environment. You are responsible for managing customer access, account creation, resource provisioning, and account management within your software.

39
Q

What is vertical scaling?

A

Increasing the capacity of an existing resource, for example adding CPU or RAM.

40
Q

What is the SSH command syntax used to connect to an EMR cluster?

A

ssh -i privateKeyPath hadoop@emr-dns
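If you script these connections, one possible approach is to build the ssh and scp commands from this deck as argument lists in Python. This is a minimal sketch: `build_ssh_command` and `build_scp_command` are hypothetical helpers, the values passed in are the same placeholders used on the cards, and nothing is actually executed.

```python
import shlex


def build_ssh_command(key_path, host, user="hadoop"):
    """Return the argument list for: ssh -i key_path user@host"""
    return ["ssh", "-i", key_path, f"{user}@{host}"]


def build_scp_command(key_path, local_path, host, remote_path, user="hadoop"):
    """Return the argument list for: scp -i key_path local_path user@host:remote_path"""
    return ["scp", "-i", key_path, local_path, f"{user}@{host}:{remote_path}"]


# Placeholder values that mirror the placeholders on the cards.
ssh_cmd = build_ssh_command("privateKeyPath", "emr-dns")
scp_cmd = build_scp_command("privateKeyPath", "/localPath", "emr-dns", "/remotePath")

# shlex.join renders each list as a single shell-style command line for display.
print(shlex.join(ssh_cmd))
print(shlex.join(scp_cmd))
```

Keeping the command as a list means it could later be handed to `subprocess.run` without any manual shell quoting.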
