Data Pipelines on Cloud (Computing) Flashcards
IaaS vs PaaS
IaaS: outsource virtual machines to the cloud (e.g. AWS EC2)
PaaS: outsource the data ecosystem to the cloud (e.g. AWS EMR)
Amazon Machine Image (AMI)
Helps launch an instance by defining its software, settings, etc. (you did this with an m4.large instance)
Amazon EMR
Automates the configuration of clusters of EC2 instances that run the big data framework of your choice (Hadoop, Spark, HBase, Hive, Presto, etc.)
Example of data workload with AWS EMR
Upload input data to AWS S3
EMR launches the specified EC2 instances
EMR pulls data from S3 and begins execution
EMR transfers output data to S3
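The four steps above can be sketched as a single cluster request. This is a minimal sketch under stated assumptions, not a definitive setup: it only builds a request dictionary in the shape that boto3's EMR `run_job_flow` call accepts, and the bucket name, script path, cluster name, and release label are hypothetical placeholders.

```python
def build_workload_request(bucket: str) -> dict:
    """Build an EMR request for an S3 -> EMR -> S3 workload.

    `bucket` and every path below are hypothetical placeholders.
    """
    input_uri = f"s3://{bucket}/input/"           # data uploaded beforehand
    output_uri = f"s3://{bucket}/output/"         # where results are written
    script_uri = f"s3://{bucket}/scripts/job.py"  # assumed Spark script

    return {
        "Name": "flashcard-example",              # assumed cluster name
        "ReleaseLabel": "emr-6.15.0",             # assumed release
        "LogUri": f"s3://{bucket}/logs/",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            # Terminate once the step has finished instead of waiting.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process-data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # The job reads from S3 and writes its output back to S3.
                "Args": ["spark-submit", script_uri, input_uri, output_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_workload_request("my-example-bucket")
```

With AWS credentials configured, this dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`. The upload of the input data and script to S3 has to happen before that call.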
Amazon EMR cluster
Collection of EC2 instances
3 types of nodes in EMR cluster
Master Node - Manages the cluster, schedules tasks, monitors health
Core Nodes - Run tasks and store data in HDFS
Task Nodes (optional) - Only help process tasks; they do not store data
On-demand vs Spot instance
Spot instances are unused EC2 capacity offered at a lower cost, but they can be reclaimed when AWS needs the capacity back. On-demand instances cost more but are not reclaimed.
Which nodes are best to use spot instances for?
Task nodes, because if they are reclaimed you may lose compute capacity, but no data is lost
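The node types and the spot recommendation can be combined in one configuration. A minimal sketch, assuming the instance-group format used by boto3's EMR `run_job_flow` (`Instances.InstanceGroups`); the helper name, group names, and default instance type are hypothetical.

```python
def build_instance_groups(core_count: int, task_count: int,
                          instance_type: str = "m4.large") -> list:
    """Return EMR instance groups with only the task nodes on spot."""
    groups = [
        {   # One master node manages the cluster: keep it on-demand.
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {   # Core nodes store HDFS data: keep them on-demand too.
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append({
            # Task nodes hold no data, so losing one only costs compute.
            "Name": "task",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups


groups = build_instance_groups(core_count=2, task_count=4)
```

Because task nodes are optional, passing `task_count=0` simply leaves the task group out.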
AWS EBS
Elastic Block Store
HDFS vs EMRFS
Both are options for the distributed file system (DFS) within EMR. EMRFS is an implementation of the Hadoop file system on top of Amazon S3, so you don't need EBS in that case.
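In practice the two are usually told apart by the URI scheme of a path: `hdfs://` for HDFS and `s3://` for EMRFS. A minimal sketch of that distinction; the helper names and example paths are hypothetical, and other schemes are deliberately rejected rather than guessed.

```python
from urllib.parse import urlparse


def storage_backend(uri: str) -> str:
    """Classify a path as HDFS or EMRFS by its URI scheme."""
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        return "EMRFS"   # backed by Amazon S3
    if scheme == "hdfs":
        return "HDFS"    # backed by storage on the cluster's nodes
    raise ValueError(f"unrecognized scheme in {uri!r}")


def survives_termination(uri: str) -> bool:
    """Data in S3 outlives the cluster; data in HDFS does not."""
    return storage_backend(uri) == "EMRFS"


print(storage_backend("s3://my-example-bucket/output/"))
print(survives_termination("hdfs:///user/hadoop/tmp/"))
```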
A cluster cannot be stopped, only:
Terminated, which is why you should store any output data you want to keep outside the cluster (e.g. in S3). Anything stored on the cluster gets deleted!
Step
User-defined unit of processing (algorithm manipulating data)
Cluster lifecycle (5 steps)
starting (launch EC2s)
bootstrapping (install Hadoop, Spark, etc)
running (Step runs)
waiting (idle time after the Step has run)
terminating (after manual shutdown)
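The lifecycle above can be modeled as a small state machine. This is a simplified sketch, not EMR's full state model: it covers only the five states on this card, ignores failure paths, and assumes a waiting cluster returns to running when another Step is submitted.

```python
from enum import Enum


class ClusterState(Enum):
    STARTING = "starting"            # launch EC2 instances
    BOOTSTRAPPING = "bootstrapping"  # install Hadoop, Spark, etc.
    RUNNING = "running"              # a Step is running
    WAITING = "waiting"              # idle after the Step has run
    TERMINATING = "terminating"      # shutting down


# Allowed transitions in this simplified model (an assumption).
NEXT_STATES = {
    ClusterState.STARTING: {ClusterState.BOOTSTRAPPING},
    ClusterState.BOOTSTRAPPING: {ClusterState.RUNNING},
    ClusterState.RUNNING: {ClusterState.WAITING, ClusterState.TERMINATING},
    ClusterState.WAITING: {ClusterState.RUNNING, ClusterState.TERMINATING},
    ClusterState.TERMINATING: set(),
}


def can_transition(current: ClusterState, target: ClusterState) -> bool:
    """Return True if the model allows moving from current to target."""
    return target in NEXT_STATES[current]


print(can_transition(ClusterState.WAITING, ClusterState.RUNNING))
print(can_transition(ClusterState.TERMINATING, ClusterState.RUNNING))
```

Nothing leads out of `TERMINATING`, which matches the earlier card: a cluster cannot be stopped and resumed, only terminated.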