Data Pipelines on Cloud (Computing) Flashcards

Question 1

Q

IaaS vs PaaS

Answer

A

IaaS: outsource virtual machines to the cloud (e.g. AWS EC2)
PaaS: outsource the data ecosystem to the cloud (e.g. AWS EMR)

Question 2

Q

Amazon Machine Image (AMI)

Answer

A

Template that helps launch an instance by defining its software, settings, etc. (you did this already, with an m4.large instance)

Question 3

Q

Amazon EMR

Answer

A

Automates the configuration of clusters of EC2 instances that run the big data framework of your choice (Hadoop, Spark, HBase, Hive, Presto, etc.)

Question 4

Q

Example of data workload with AWS EMR

Answer

A

Upload input data to AWS S3
EMR launches the specified EC2 instances
EMR pulls the data from S3 and begins execution
EMR transfers the output data back to S3
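
The workload above can be sketched as a request for boto3's EMR `run_job_flow` call. This is a minimal sketch, not a definitive setup: the bucket name, script path, cluster name, release label, and instance sizes are all hypothetical, and the function only builds the request dictionary without contacting AWS.

```python
# Sketch only: builds the request, does not call AWS.
# All URIs, names, and sizes below are hypothetical placeholders.

def build_emr_request(input_uri, output_uri, script_uri):
    """Return keyword arguments shaped for boto3's EMR run_job_flow call."""
    return {
        "Name": "flashcard-demo-cluster",   # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",       # assumed EMR release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            # EMR launches the EC2 instances specified here.
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            # Terminate automatically once the step has finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process-data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # The script reads from S3 and writes its output back to S3.
                "Args": ["spark-submit", script_uri, input_uri, output_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request(
    "s3://example-bucket/input/",
    "s3://example-bucket/output/",
    "s3://example-bucket/scripts/job.py",
)
print(request["Steps"][0]["HadoopJarStep"]["Args"])
```

With valid AWS credentials, the dictionary could be passed on as `boto3.client("emr").run_job_flow(**request)`.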

Question 5

Q

Amazon EMR cluster

Answer

A

Collection of EC2 instances

Question 6

Q

3 types of nodes in EMR cluster

Answer

A

Master node - manages the cluster, schedules tasks, monitors health
Core nodes - run tasks and store data in HDFS
Task nodes (optional) - only help process tasks; they do not store data

Question 7

Q

On-demand vs Spot instance

Answer

A

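On-Demand instances are billed at the regular rate and are not interrupted. Spot instances are unused EC2 capacity available at lower cost, but they are reclaimed when the data center needs the capacity back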

Question 8

Q

Which nodes are best to use spot instances for?

Answer

A

Task nodes, because if they are reclaimed you may lose compute capacity, but no data is lost
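
One way to express this choice is in the instance group list that boto3's `run_job_flow` accepts under `Instances["InstanceGroups"]`. The sketch below is an assumption-laden illustration: the group names, node counts, and the m4.large default are hypothetical, and nothing is sent to AWS.

```python
# Sketch only: builds an InstanceGroups list, does not call AWS.
# Group names, counts, and the instance type are hypothetical.

def instance_groups(core_count, task_count, instance_type="m4.large"):
    """Keep master and core nodes On-Demand; put only task nodes on Spot."""
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",   # losing the master would lose the cluster
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",   # core nodes hold HDFS data
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append({
            "Name": "task",
            "InstanceRole": "TASK",
            "Market": "SPOT",        # only compute capacity is at risk
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups


for group in instance_groups(core_count=2, task_count=4):
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```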

Question 9

Q

AWS EBS

Answer

A

Elastic Block Store

Question 10

Q

HDFS vs EMRFS

Answer

A

Both are options for a distributed file system (DFS) within EMR, but EMRFS is a specific implementation of the Hadoop file system on top of Amazon S3, so you don’t need to use EBS in this case

Question 11

Q

A cluster cannot be stopped, only:

Answer

A

terminated, which is why you should store any output data you want to keep outside the cluster (e.g. in S3). Anything stored on the cluster will get deleted!
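
As a rough illustration of this point together with the HDFS vs EMRFS card, the sketch below checks an output location by its URI scheme. The helper name and the example paths are hypothetical, and the rule is a simplification: it treats `s3://` locations (EMRFS) as surviving termination and everything else as living on the cluster.

```python
from urllib.parse import urlparse

# Hypothetical helper: classifies an output URI by its scheme only.
def survives_termination(uri):
    """Return True if the location is on S3 (EMRFS) rather than on the cluster."""
    return urlparse(uri).scheme == "s3"


# Hypothetical example paths.
for path in ["s3://example-bucket/output/", "hdfs:///user/hadoop/output/"]:
    status = "kept" if survives_termination(path) else "deleted with the cluster"
    print(path, "->", status)
```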

Question 12

Q

Step

Answer

A

User-defined unit of processing (algorithm manipulating data)

Question 13

Q

Cluster lifecycle (5 steps)

Answer

A

starting (launch the EC2 instances)
bootstrapping (install Hadoop, Spark, etc.)
running (steps run)
waiting (idle time after the steps have run)
terminating (after manual shutdown)
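
The five steps above can be sketched as a simple linear state sequence. This is a simplified model of the card, not EMR's full state machine, and the class and function names are hypothetical.

```python
from enum import Enum


class ClusterState(Enum):
    """The five lifecycle steps from the card, in order."""
    STARTING = 1       # EC2 instances are launched
    BOOTSTRAPPING = 2  # Hadoop, Spark, etc. are installed
    RUNNING = 3        # a step is running
    WAITING = 4        # idle after the steps have run
    TERMINATING = 5    # shutting down


def next_state(state):
    """Return the following state, or None once the cluster is terminating."""
    members = list(ClusterState)
    index = members.index(state)
    return members[index + 1] if index + 1 < len(members) else None


def walk_lifecycle():
    """Collect the state names from starting through terminating."""
    state = ClusterState.STARTING
    path = []
    while state is not None:
        path.append(state.name.lower())
        state = next_state(state)
    return path


print(" -> ".join(walk_lifecycle()))
```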
