Scale - EMR Flashcards
What is Elastic MapReduce (EMR)?
A managed Hadoop framework for processing huge amounts of data; it also supports Apache Spark, HBase, Presto, and Flink.
What is the EMR use case?
Most commonly used for log analysis, financial analysis, and extract, transform, and load (ETL) activities. More generally, it performs data-intensive tasks for applications such as web indexing, data mining, machine learning, scientific simulation, and bioinformatics research.
What is a step in EMR?
A Step is a programmatic task that performs some process on the data (e.g. counting words).
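As a rough illustration, a step can be written down as a small piece of structured data. The sketch below builds a word-count step in the dictionary shape that boto3's EMR client accepts in the `Steps` list of `run_job_flow` or `add_job_flow_steps`. The helper name `word_count_step`, the bucket, and the script path are hypothetical, and nothing is submitted to AWS.

```python
def word_count_step(script_uri, input_uri, output_uri):
    """Build a step definition that runs a Spark word-count script.

    All URIs are placeholders supplied by the caller. This function only
    builds the dictionary; it does not contact AWS.
    """
    return {
        "Name": "Count words",
        # What the cluster should do if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", script_uri, input_uri, output_uri],
        },
    }


# Hypothetical S3 locations, for illustration only.
step = word_count_step(
    "s3://example-bucket/scripts/word_count.py",
    "s3://example-bucket/input/",
    "s3://example-bucket/output/",
)
print(step["Name"], step["HadoopJarStep"]["Args"])
```

A dictionary like this would be passed to a running cluster, which executes its steps in order.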
What is a cluster in EMR?
A Cluster is a collection of EC2 instances provisioned by EMR to run your Steps.
What are the components of EMR?
Master, core, and task nodes.
What is AWS Glue?
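AWS Glue is a flexible, easily scalable, serverless ETL service: AWS provisions and manages the underlying infrastructure for you. Amazon EMR, by contrast, runs on clusters of EC2 instances that you provision and configure yourself, which gives more control but takes more operational work.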
Why Glue?
Because AWS Glue is a dedicated, serverless ETL service, there are no clusters to manage, so an ETL-only job is often quicker to get running on Glue than on EMR. This gives Glue the edge over EMR in operational flexibility.
What is a core node?
Hosts persistent data in the Hadoop Distributed File System (HDFS) and runs Hadoop tasks.
What is a task node?
Only runs Hadoop tasks; it does not store data in HDFS.
What is a master node?
Manages the cluster and distributes processing across the other nodes. There is usually only one, but a cluster can be launched with three master nodes for high availability (HA).
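The three node roles can be sketched as the instance groups of a cluster request. The example below builds an `InstanceGroups` list in the shape used by boto3's `run_job_flow`. The helper name `instance_groups`, the instance type, and the node counts are assumptions chosen for illustration, and no cluster is launched.

```python
def instance_groups(core_count, task_count, high_availability=False,
                    instance_type="m5.xlarge"):
    """Describe the master, core, and task node groups of a cluster."""
    # One master node by default; three when high availability is requested.
    master_count = 3 if high_availability else 1
    groups = [
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": master_count},
        # Core nodes store HDFS data and run tasks.
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    # Task nodes are optional: they only run tasks and store no HDFS data.
    if task_count > 0:
        groups.append(
            {"Name": "Task", "InstanceRole": "TASK",
             "InstanceType": instance_type, "InstanceCount": task_count}
        )
    return groups


# Hypothetical layout: HA masters, two core nodes, four task nodes.
layout = instance_groups(core_count=2, task_count=4, high_availability=True)
for group in layout:
    print(group["InstanceRole"], group["InstanceCount"])
```

Leaving `task_count` at zero omits the task group entirely, since a cluster can run with only master and core nodes.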