Data Pipelines on Cloud (Computing) Flashcards

1
Q

IaaS vs PaaS

A

IaaS: rent virtual machines from the cloud provider (e.g. AWS EC2)
PaaS: rent a managed data platform from the cloud provider (e.g. AWS EMR)

2
Q

Amazon Machine Image (AMI)

A

Template for launching an instance: defines the operating system, software, settings, etc. (The instance type, e.g. m4.large, is chosen separately at launch.)

3
Q

Amazon EMR

A

Automates the provisioning and configuration of clusters of EC2 instances that run the big data framework of your choice (Hadoop, Spark, HBase, Hive, Presto, etc.)

4
Q

Example of data workload with AWS EMR

A

Upload input data to AWS S3
EMR launches the EC2 instances specified
EMR pulls the data from S3 and begins execution
EMR transfers the output data to S3
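
The workload above can be sketched as a single step definition, assuming the job is a Spark script. The dict keys follow the step shape that boto3's EMR client accepts; the bucket, script, and path names are hypothetical placeholders, and nothing here calls AWS.

```python
# Sketch only: builds the step description as a plain dict, no AWS call is made.
# All S3 URIs below are hypothetical placeholders.
def build_spark_step(script_uri: str, input_uri: str, output_uri: str) -> dict:
    """Describe one EMR step that reads its input from S3 and writes output to S3."""
    return {
        "Name": "example-spark-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run a command such as spark-submit
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", script_uri, input_uri, output_uri],
        },
    }


step = build_spark_step(
    "s3://example-bucket/scripts/job.py",
    "s3://example-bucket/input/",
    "s3://example-bucket/output/",
)
print(step["HadoopJarStep"]["Args"])
```

Because input and output both live in S3, the results survive after the cluster itself is gone.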

5
Q

Amazon EMR cluster

A

Collection of EC2 instances

6
Q

3 types of nodes in EMR cluster

A

Master Node - manages the cluster, schedules tasks, monitors health
Core Nodes - run tasks and store data in HDFS
Task Nodes (optional) - only help process tasks; do not store data

7
Q

On-demand vs Spot instance

A

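Spot instances are spare EC2 capacity offered at a lower price than on-demand, but AWS can reclaim them at short notice when it needs the capacity back. On-demand instances cost more but are not interrupted.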

8
Q

Which nodes are best to use spot instances for?

A

Task nodes, because if a spot instance is reclaimed you lose compute capacity, but no data is lost
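
A minimal sketch of that choice, assuming the instance-group shape used by boto3's EMR client (`InstanceRole`, `Market`). The helper name, group names, and default instance type are illustrative only, and no AWS call is made.

```python
# Sketch only: returns plain dicts describing the three node types.
# build_instance_groups is a hypothetical helper, not part of any AWS library.
def build_instance_groups(core_count: int, task_count: int,
                          instance_type: str = "m4.large") -> list:
    """Master and core nodes on on-demand capacity, optional task nodes on spot."""
    groups = [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes hold HDFS data, so losing one can mean losing data
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes only compute, so an interruption costs capacity, not data
        groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count}
        )
    return groups


groups = build_instance_groups(core_count=2, task_count=4)
markets = {g["InstanceRole"]: g["Market"] for g in groups}
print(markets)
```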

9
Q

AWS EBS

A

Elastic Block Store: block-level storage volumes that attach to EC2 instances

10
Q

HDFS vs EMRFS

A

Both are distributed file system (DFS) options within EMR. HDFS stores data on the cluster's own instances; EMRFS is an implementation of the Hadoop file system on top of Amazon S3, so you don't need to use EBS in that case.

11
Q

A cluster cannot be stopped, only:

A

terminated, which is why you should store any output data you want to keep outside the cluster (e.g. in S3). Anything stored on the cluster is deleted on termination!

12
Q

Step

A

User-defined unit of processing (algorithm manipulating data)

13
Q

Cluster lifecycle (5 states)

A

starting (launch EC2 instances)
bootstrapping (install Hadoop, Spark, etc.)
running (a Step is running)
waiting (idle after Steps finish)
terminating (after manual shutdown)
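
The five states above can be sketched as a small state machine. The allowed transitions are a simplifying assumption for illustration (real clusters also have terminal and error states that this card does not cover), and `can_transition` is a hypothetical helper.

```python
from enum import Enum


class ClusterState(Enum):
    STARTING = "starting"
    BOOTSTRAPPING = "bootstrapping"
    RUNNING = "running"
    WAITING = "waiting"
    TERMINATING = "terminating"


# Assumed transitions: any live state may go to TERMINATING,
# and a waiting cluster returns to RUNNING when a new step is submitted.
TRANSITIONS = {
    ClusterState.STARTING: {ClusterState.BOOTSTRAPPING, ClusterState.TERMINATING},
    ClusterState.BOOTSTRAPPING: {ClusterState.RUNNING, ClusterState.TERMINATING},
    ClusterState.RUNNING: {ClusterState.WAITING, ClusterState.TERMINATING},
    ClusterState.WAITING: {ClusterState.RUNNING, ClusterState.TERMINATING},
    ClusterState.TERMINATING: set(),
}


def can_transition(current: ClusterState, target: ClusterState) -> bool:
    """Return True if the sketch allows moving from current to target."""
    return target in TRANSITIONS[current]


print(can_transition(ClusterState.WAITING, ClusterState.RUNNING))
print(can_transition(ClusterState.TERMINATING, ClusterState.RUNNING))
```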
