AWS EMR Flashcards
What is AWS EMR?
It is Amazon Elastic MapReduce (EMR), a service where data is broken up, some calculation or code is run over each piece, and the results are then compiled. Suppose you had the text of every book in the world and had to look for the word "dog". You would split the books across the map nodes and have each map node search its books for the word "dog". Once every map node is finished, you would combine the results they return in the reduce node.
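The book-search example can be sketched locally in plain Python. This is only an illustration of the map and reduce idea, not how a job is submitted to EMR; the function names (map_book, reduce_counts, search_books) and the sample books are hypothetical.

```python
from functools import reduce

# Hypothetical sample data: book title -> text.
BOOKS = {
    "book_a": "The dog chased the cat. The dog barked.",
    "book_b": "No animals appear in this book.",
    "book_c": "A dog is a loyal friend.",
}


def map_book(title, text, word="dog"):
    """Map step: count occurrences of `word` in a single book."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return (title, tokens.count(word))


def reduce_counts(total, mapped):
    """Reduce step: fold one map result into the running total."""
    _title, count = mapped
    return total + count


def search_books(books, word="dog"):
    # On a cluster each book (or split) would go to a different map node;
    # here the map calls simply run one after another.
    mapped = [map_book(title, text, word) for title, text in books.items()]
    return reduce(reduce_counts, mapped, 0)


if __name__ == "__main__":
    print(search_books(BOOKS))
```

Each call to map_book is independent of the others, which is what lets the work be spread across many map nodes before the reduce step combines the counts.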
As a developer, what two components of code do you have to give EMR?
- Map code component
- Reduce code component
What is a split size?
It is the size of the chunks that the data is split into before the chunks are handed to the map nodes.
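A simplified sketch of splitting by size, assuming the data is just a byte string and the split size is given in bytes. The function name split_by_size is hypothetical, and a real Hadoop job also takes record boundaries into account, which this sketch ignores.

```python
def split_by_size(data: bytes, split_size: int) -> list:
    """Cut `data` into consecutive chunks of at most `split_size` bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]


if __name__ == "__main__":
    # Each chunk would be handed to a different map node.
    for chunk in split_by_size(b"abcdefghij", 4):
        print(chunk)
```

A smaller split size produces more chunks, and therefore more map tasks that can run in parallel.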
Is there input and output data from EMR?
Yes, data comes from a persistent data store and, once processed, is pushed to a persistent data store. S3 is a candidate for both.
Outside of AWS, what is EMR known as?
Hadoop
What are the two frameworks that EMR can run?
Hadoop and Spark. It also uses Hive, Pig, HBase, Hue, and ZooKeeper.
If we were to see Hive, what would it relate to?
EMR
What types of nodes does an EMR cluster have?
- Master node
- Core node
- Task node
What is the master node's job?
The master node controls the cluster, distributes the workload, and monitors the health of the cluster.
What does the EMR cluster run on?
It runs on EC2 instances.
What nodes do the work in an EMR cluster?
Core Nodes
Other than processing, what else do the core nodes do?
They provide the HDFS file system.
Is data replicated between core nodes?
Yes, 100%.
What is the difference between task nodes and core nodes?
Task nodes process data but do not have HDFS.
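The three node types can be written down as a small data structure. The sketch below builds a list shaped like the InstanceGroups setting used when launching a cluster with boto3's EMR run_job_flow call; it does not call AWS, and the group names, instance type, and counts are assumptions picked only for illustration.

```python
def build_instance_groups(core_count: int, task_count: int) -> list:
    """Describe one master group, one core group, and one task group."""
    return [
        # Master node: controls the cluster and distributes the work.
        {"Name": "master", "InstanceRole": "MASTER",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        # Core nodes: process data and provide HDFS.
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": "m5.xlarge", "InstanceCount": core_count},
        # Task nodes: process data only, no HDFS.
        {"Name": "task", "InstanceRole": "TASK",
         "InstanceType": "m5.xlarge", "InstanceCount": task_count},
    ]


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=4):
        print(group["InstanceRole"], group["InstanceCount"])
```

Because task nodes hold no HDFS data, their count is the one that can be changed most freely without affecting stored data.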
Where can we get and put data for EMR?
From and to S3
Where is HDFS run in EMR?
On the core nodes
What is EMRFS?
It is an S3-backed file system that can be used in place of HDFS.
What advantage does using EMRFS have?
The data is in S3, so it lives beyond the life of the cluster.
Is EMR a fully managed service that does not use nodes in a VPC?
No, EMR is a managed service, but not a fully managed one: its nodes are deployed in your VPC.
Is EMR highly available across all availability zones?
No, for speed of processing, EMR (Hadoop) nodes are deployed into a single AZ.
What does Spark do?
It is a batch and stream processing engine for data. It competes against EMR (Hadoop) in the area of batch processing.
Who uses EMR (Hadoop) and Spark?
- Financial sector: if you are looking for fraud
- Health: Scoring potential health risks
What is Hive?
It complements the HDFS file system. It lets you write SQL-like queries that are converted into MapReduce jobs and run on a Hadoop cluster.
Is Hive a good fit for OLTP or relational data?
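To make the conversion concrete, here is a rough sketch of how a SQL-like query such as SELECT sector, COUNT(*) FROM events GROUP BY sector could be expressed as map, shuffle, and reduce phases. It runs locally in plain Python and is not how Hive actually compiles queries; the table, column names, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of an "events" table.
ROWS = [
    {"sector": "finance", "amount": 120},
    {"sector": "health", "amount": 80},
    {"sector": "finance", "amount": 40},
]


def map_phase(rows):
    # Emit (key, 1) for every row, keyed by the GROUP BY column.
    return [(row["sector"], 1) for row in rows]


def shuffle(pairs):
    # Gather all values that share a key, as happens between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # COUNT(*) per key is the sum of the emitted ones.
    return {key: sum(values) for key, values in grouped.items()}


def count_by_sector(rows):
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_sector(ROWS))
```

The query author only writes the SQL-like statement; the map and reduce steps are generated for them.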
No