AWS EMR Flashcards
What is AWS EMR?
It is AWS Elastic MapReduce, a service where data is broken up, some calculation or code is run over each piece, and the results are then compiled. Suppose you had the text from every book in the world and had to look for the word "dog". You would split the books across the map nodes and have each map node search its books for the word "dog". Once every map node has finished, you would combine the returned results in the reduce node.
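The book search above can be sketched in plain Python. This is only an illustration of the idea, not EMR code: the book texts and the function names `map_count` and `reduce_counts` are made up for the example, and the "map nodes" are simulated by a simple loop.

```python
def map_count(book_text, word):
    """Map step: count how often `word` appears in one book."""
    words = book_text.lower().split()
    # Strip simple punctuation so "dog," and "dog!" both match "dog".
    return sum(1 for w in words if w.strip(".,!?;:\"'") == word)


def reduce_counts(partial_counts):
    """Reduce step: combine the per-book counts into one total."""
    return sum(partial_counts)


# Hypothetical sample data standing in for "every book in the world".
books = {
    "book_a": "The dog barked. A dog ran.",
    "book_b": "No animals here.",
    "book_c": "Dog, cat and another dog!",
}

# Each call below stands in for one map node working on its own book.
partials = [map_count(text, "dog") for text in books.values()]
total = reduce_counts(partials)
print(partials, total)
```

On a real cluster each map call would run on a different node, which is where the speed-up comes from.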
As a developer, what two code components do you have to give EMR?
- Map code component
- Reduce code component
What is a split size?
It is the size of the chunks into which the data is split for the map nodes.
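A minimal sketch of splitting by size, assuming the data is just a byte string and the split size is given in bytes. The function name `split_by_size` and the numbers are illustrative only. Real Hadoop input splits also respect record boundaries, which this ignores.

```python
def split_by_size(data: bytes, split_size: int) -> list[bytes]:
    """Cut `data` into consecutive chunks of at most `split_size` bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]


# 1000 bytes of sample data with a hypothetical split size of 256 bytes.
data = b"x" * 1000
splits = split_by_size(data, 256)
print([len(s) for s in splits])
```

The last split is smaller than the others because the data size is not an exact multiple of the split size.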
Is there input and output data from EMR?
Yes, data comes from a persistent data store and once processed is pushed to a persistent data store, s3 is a candidate.
Outside of AWS, what is EMR known as?
Hadoop
What are the two frameworks that EMR can run?
Hadoop and Spark. It can also use Hive, Pig, HBase, Hue and ZooKeeper.
If we were to see Hive, what would it relate to?
EMR
What types of nodes does an EMR cluster have?
- Master node
- Core node
- Task node
What is the master node's job?
The master node controls the cluster, distributes the workload and monitors health.
What does the EMR cluster run on?
It runs on EC2 instances.
What nodes do the work in an EMR cluster?
Core Nodes
Other than processing, what else do the core nodes do?
They provide the HDFS file system.
Is data replicated between core nodes?
Yes, HDFS replicates data across the core nodes.
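To see why replication matters, here is a toy model. It is a sketch under stated assumptions: blocks are placed round-robin (real HDFS placement is more involved), and the node names, block names and replication factor are all made up for the example.

```python
def place_blocks(block_ids, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def surviving_blocks(placement, failed_node):
    """Return the blocks that still have a copy after one node is lost."""
    return {
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


nodes = ["core-1", "core-2", "core-3"]
blocks = ["b0", "b1", "b2", "b3"]

replicated = place_blocks(blocks, nodes, replication=2)
unreplicated = place_blocks(blocks, nodes, replication=1)

print(sorted(surviving_blocks(replicated, "core-2")))
print(sorted(surviving_blocks(unreplicated, "core-2")))
```

With two copies of every block, losing one core node loses no data. With a single copy, the blocks held only by the failed node are gone.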
What is the difference between task nodes and core nodes?
Task nodes process data but do not host HDFS.
Where can we get and put data for EMR?
From S3
Where is HDFS run in EMR?
On the core nodes
What is EMRFS?
It is an S3-backed file system and can be used to replace HDFS.
What advantage does using EMRFS have?
The data is in S3, so it lives beyond the life of the cluster.
Is EMR a fully managed service that runs its nodes outside your VPC?
No, EMR is not fully managed. It is a managed service that is deployed in your VPC.
Is EMR highly available across all availability zones?
No, for speed of processing, EMR (Hadoop) nodes are deployed into a single AZ.
What does Spark do?
It is a batch and stream processing engine for data. It competes against EMR (Hadoop) in the area of batch processing.
Who uses EMR (Hadoop) and Spark?
- Financial sector: looking for fraud
- Health: scoring potential health risks
What is Hive?
It complements the HDFS file system. It enables you to use SQL-like queries that are converted into MapReduce jobs to be run on a Hadoop cluster.
Is Hive a good fit for OLTP or relational data?
No
Is EMR good for use with OLTP and relational data?
No
What is Pig used for?
Before Pig, people using EMR (Hadoop) had to interact with the cluster through low-level tasks written in Java. Pig provides a scripting language instead.
What is the minimum size of an EMR cluster?
One node, but this is for development only.
I am thinking of running the master node on a spot instance. Is there any potential issue, and why?
Yes. The master node controls the EMR (Hadoop) cluster, so if it fails the whole cluster fails, and spot instances can and will go away at any point in time.
What EMR (Hadoop) nodes should I use spot instances for?
Use spot instances for task nodes.
Can I use instance fleets with EMR nodes?
Yes, this gives you the ability to select up to five different instance types. You specify the desired number of nodes and the price, and the fleet tries to make it happen.
How do I secure the EMR (Hadoop) cluster?
Using security groups and NACLs.
Do you want to use spot instances for core nodes?
You can, but you could lose the node and part of the HDFS file system.
What should I use to run my EMR task nodes?
Spot instances as the task nodes have no data.
If I am using instance fleets, how many fleets are used for the different node types in EMR?
You will have three fleet types:
- Master node fleet
- Task node fleet
- Core node fleet
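The three fleets can be sketched as a Python data structure. The field names below follow my understanding of the `InstanceFleets` part of a boto3 `run_job_flow` request, so check them against the current documentation before relying on them. The fleet names, instance types, capacities and the helper `build_instance_fleets` are assumptions made for illustration. Nothing here calls AWS.

```python
def build_instance_fleets(core_on_demand_units, task_spot_units):
    """Build one fleet definition per EMR node type (master, core, task)."""
    return [
        {
            "Name": "master-fleet",  # hypothetical name
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{"InstanceType": "m4.large"}],
        },
        {
            "Name": "core-fleet",  # hypothetical name
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": core_on_demand_units,
            "InstanceTypeConfigs": [
                {"InstanceType": "m4.large", "WeightedCapacity": 1},
                {"InstanceType": "m4.xlarge", "WeightedCapacity": 2},
            ],
        },
        {
            "Name": "task-fleet",  # hypothetical name
            "InstanceFleetType": "TASK",
            "TargetSpotCapacity": task_spot_units,
            "InstanceTypeConfigs": [
                {"InstanceType": "m4.large", "WeightedCapacity": 1},
                {"InstanceType": "m4.xlarge", "WeightedCapacity": 2},
            ],
        },
    ]


fleets = build_instance_fleets(core_on_demand_units=2, task_spot_units=4)
for fleet in fleets:
    types = [cfg["InstanceType"] for cfg in fleet["InstanceTypeConfigs"]]
    print(fleet["InstanceFleetType"], types)
```

In this sketch the master and core fleets ask for on-demand capacity and only the task fleet asks for spot capacity, matching the advice that task nodes hold no data.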
I have data in us-east-1 region in s3, where should I run my EMR cluster?
As close to the data as possible, in this case us-east-1. The reason for this is latency: as a rule of thumb, you get 1 ms per 90 miles of distance.
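The rule of thumb above is simple arithmetic. The sketch below just applies it; the function name and the example distance are made up, and real network latency depends on far more than distance.

```python
MILES_PER_MS = 90  # rule of thumb from the card above


def estimated_latency_ms(distance_miles):
    """Estimate one-way latency in milliseconds for a given distance."""
    return distance_miles / MILES_PER_MS


# Hypothetical distance between a cluster and its data.
print(estimated_latency_ms(900))
```

So a cluster 900 miles from its data would pay roughly 10 ms on every trip under this rule, which is why keeping the cluster in the same region as the S3 bucket is preferred.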
I am calculating pi. Should I use a general purpose, compute optimised or memory optimised node?
Compute optimised, as it is going to use a lot of CPU.
What is the recommended instance type for Hadoop cluster nodes?
m4.large for a cluster with fewer than 50 nodes. For a cluster with more than 50 nodes, step up to the next size, m4.xlarge.
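The sizing rule can be written as a small helper. The card does not say what to do at exactly 50 nodes, so treating 50 as the larger case is an assumption of this sketch, as is the function name.

```python
def recommended_instance_type(node_count):
    """Pick an instance type from the cluster size, per the card above."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    # Assumption: exactly 50 nodes is treated like the larger case.
    return "m4.large" if node_count < 50 else "m4.xlarge"


print(recommended_instance_type(10))
print(recommended_instance_type(120))
```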
For EMR, when should I use reserved instances?
When you know the cluster is going to be used long term (1, 2, 3 years).
For long-running EMR, or where EMR is a data warehouse, how should I set up the cost of the nodes?
- Master node = On-demand
- Core node = on-demand or fleet
- Task node = on-demand or fleet
For cost-driven EMR, how should I set up the cost of the nodes?
- Master node = Spot
- Core node = Spot
- Task node = Spot
For data-warehouse-critical EMR, how should I set up the cost of the nodes?
- Master node = On-demand
- Core node = On-demand
- Task node = on-demand or fleet
For app-testing EMR, how should I set up the cost of the nodes?
- Master node = Spot
- Core node = Spot
- Task node = Spot
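The four cost scenarios above can be collected into one lookup table, which makes them easier to compare side by side. The dictionary keys and the function name are invented for this sketch. The values are copied from the cards.

```python
# Purchasing options per scenario and node type, as listed on the cards.
PURCHASING_PLAN = {
    "long_running": {
        "master": ["on-demand"],
        "core": ["on-demand", "fleet"],
        "task": ["on-demand", "fleet"],
    },
    "cost_driven": {
        "master": ["spot"],
        "core": ["spot"],
        "task": ["spot"],
    },
    "data_warehouse_critical": {
        "master": ["on-demand"],
        "core": ["on-demand"],
        "task": ["on-demand", "fleet"],
    },
    "app_testing": {
        "master": ["spot"],
        "core": ["spot"],
        "task": ["spot"],
    },
}


def purchasing_options(scenario, node_type):
    """Look up the purchasing options for one node type in one scenario."""
    return PURCHASING_PLAN[scenario][node_type]


print(purchasing_options("data_warehouse_critical", "core"))
print(purchasing_options("cost_driven", "master"))
```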
Do you have to provide code, or does EMR just do the map-reduce for me and generate the code?
You have to provide the map and reduce code. This is the code EMR will push to the map and reduce nodes, and it is the code that will run on those nodes to perform the map and reduce processing.
What are a split and a split size?
A split is a chunk of the input data, and the split size is how big each chunk is. The data is split into chunks of that size to be stored on separate nodes.
What is a map job?
The map phase takes data, say date, name and address, and stores it in the nodes, splitting the data by, say, the date. This way each node has a subset of the data.
What is the reduce job?
Data is shuffled to the reduce phase, where it is aggregated, for example counted.
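The map, shuffle and reduce steps described in the last two cards can be sketched as three small functions. The records, names and addresses are made-up sample data, and everything runs in one process here rather than across nodes.

```python
from collections import defaultdict

# Hypothetical records of (date, name, address).
records = [
    ("2024-01-01", "Ann", "1 Main St"),
    ("2024-01-01", "Bob", "2 Oak Ave"),
    ("2024-01-02", "Cy", "3 Elm Rd"),
]


def map_phase(record):
    """Map: emit a (key, value) pair keyed by the record's date."""
    date, _name, _address = record
    return (date, 1)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: count the records seen for one key."""
    return key, sum(values)


pairs = [map_phase(r) for r in records]
grouped = shuffle(pairs)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)
```

The shuffle is the step that moves every value for a given date to the same reducer, so each reducer can count its dates without seeing the rest of the data.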
I require Hadoop, how do I configure Redshift?
You do not. Redshift is a data warehouse. You need EMR, which is the AWS implementation of MapReduce and Hadoop.
I require Spark, what product in AWS should I be configuring?
EMR
What is HIVE?
Hive is a data warehouse on top of Hadoop that gives you SQL query abilities. It has a metadata store, plus ODBC and JDBC drivers that enable you to easily query from your apps.
What is PIG?
Pig is a high-level language to analyze data in Hadoop. For example, you can use Pig to:
- Load CSV file: listings = LOAD 'k.csv' USING PigStorage(',') AS (id:int, date:chararray);
- Create new data listings: FOREACH listings GENERATE list_id, ToDate
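For readers who do not know Pig, here is roughly what the two statements above do, written in Python. This is an analogy, not Pig itself: the sample rows, the function names and the `%Y-%m-%d` date format are all assumptions, since the card does not show the file contents.

```python
import csv
import io
from datetime import datetime

# Hypothetical contents standing in for k.csv.
sample_csv = "1,2024-01-15\n2,2024-02-20\n"


def load_listings(text):
    """Like LOAD ... AS (id:int, date:chararray): parse rows into typed fields."""
    reader = csv.reader(io.StringIO(text))
    return [{"id": int(row[0]), "date": row[1]} for row in reader]


def to_date_projection(listings, fmt="%Y-%m-%d"):
    """Like FOREACH ... GENERATE: emit the id and the date parsed from text."""
    return [
        (row["id"], datetime.strptime(row["date"], fmt).date())
        for row in listings
    ]


listings = load_listings(sample_csv)
projected = to_date_projection(listings)
print(projected)
```

The difference is that Pig compiles such statements into jobs that run across the Hadoop cluster, while this sketch runs on one machine.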
How are EMR clusters created?
An EMR cluster can be created by you through the console/CLI/API or through another product such as Data Pipeline. When you create a cluster, it is a long-running cluster.
Can I ssh to the master node?
Yes, you can ssh to the master node.
I am using Hadoop and Hive and I want to use ODBC. Do I need to move the data to Redshift?
No, Hive is a data warehouse on top of Hadoop, and one of the features of Hive is its ability to use ODBC.
I am using Hadoop and Hive and I want to use JDBC. Do I need to move the data to Redshift?
No, Hive is a data warehouse on top of Hadoop, and one of the features of Hive is its ability to use JDBC.
What is HBase?
HBase is a database like Google Bigtable. It runs on top of Hadoop HDFS.
I have to write some code. Are map and reduce one application?
No, they are two separate applications: there is a map app and a reduce app.
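One way to picture the two applications is the Hadoop Streaming style, where the mapper and the reducer are separate programs that exchange tab-separated text lines and the framework sorts the mapper output by key in between. The sketch below models each program as a function over lines. The word-count task and the sample input are assumptions chosen for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Map app: emit one "word<TAB>1" line per word of input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Reduce app: sum the counts of consecutive lines with the same key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _key, value in group)}"


mapped = list(mapper(["Dog cat", "dog"]))
# The sort stands in for the framework's shuffle between the two apps.
result = list(reducer(sorted(mapped)))
print(result)
```

Because the reducer only relies on its input being sorted by key, it never needs to know how many mappers ran or which one produced a given line.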
Does EMR support Spark?
Yes