AWS EMR Flashcards

Question 1

Q

What is AWS EMR?

Answer

A

It is AWS Elastic MapReduce. Data is broken up, a calculation or piece of code is run over each part, and the results are then compiled. Suppose you had the text of every book in the world and had to look for the word "dog". You would split the books across the map nodes, have each map node search its books for the word "dog", and once each map node has finished, combine the returned results in the reduce node.
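The book search described above can be sketched in a few lines of Python. This is a minimal, single-machine illustration only: the function names `map_count` and `reduce_counts` and the sample `books` list are made up for the example and are not part of EMR or Hadoop.

```python
from collections import Counter

def map_count(book_text, word):
    """Map step: count occurrences of `word` in one book."""
    words = book_text.lower().split()
    # Strip simple punctuation so "dog." still matches "dog".
    return Counter(w.strip(".,!?;:\"'") for w in words)[word]

def reduce_counts(partial_counts):
    """Reduce step: combine the per-book counts into one total."""
    return sum(partial_counts)

# Hypothetical input: each string stands in for one book.
books = ["The dog barked.", "A cat and a dog.", "No pets here."]

# On a real cluster each map call would run on a different node.
partials = [map_count(book, "dog") for book in books]
total = reduce_counts(partials)
```

Each map call only needs its own book, which is what lets the work be spread across nodes before the reduce step combines the results.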

Question 2

Q

As a developer, what two components of code do you have to give EMR?

Answer

A

Map code component
Reduce code component

Question 3

Q

What is a split size?

Answer

A

This is the size of the chunks that the data is split into for the map nodes.
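As a rough illustration of splitting by size, the sketch below cuts a list of records into fixed-size chunks. It assumes the split size is counted in records to keep the example short; Hadoop itself measures splits in bytes. The name `split_records` is hypothetical.

```python
def split_records(records, split_size):
    """Yield consecutive chunks of at most `split_size` records."""
    for start in range(0, len(records), split_size):
        yield records[start:start + split_size]

# Hypothetical data set of ten records, split into chunks of four.
records = list(range(10))
splits = list(split_records(records, split_size=4))
```

The last split is smaller than the others because ten records do not divide evenly by four.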

Question 4

Q

Is there input and output data from EMR?

Answer

A

Yes, data comes from a persistent data store and, once processed, is pushed to a persistent data store. S3 is a candidate.

Question 5

Q

Outside of AWS, what is EMR known as?

Question 6

Q

What are the two frameworks that EMR can run?

Answer

A

Hadoop and Spark. It can also run Hive, Pig, HBase, Hue and ZooKeeper.

Question 7

Q

If we were to see Hive, what would it relate to?

Answer

A

It would relate to EMR.

Question 8

Q

What types of node does an EMR cluster have?

Answer

A

Master node
Core node
Task node

Question 9

Q

What is the master node's job?

Answer

A

The master node controls the cluster, distributes the workload and monitors its health.

Question 10

Q

What does the EMR cluster run on?

Answer

A

It runs on EC2 instances.

Question 11

Q

What nodes do the work in an EMR cluster?

Answer

A

Core Nodes

Question 12

Q

Other than processing, what else do the core nodes do?

Answer

A

They provide the HDFS file system.

Question 13

Q

Is data replicated between core nodes?

Question 14

Q

What is the difference between task nodes and core nodes?

Answer

A

Task nodes process data but do not host HDFS.

Question 15

Q

Where can we get and put data for EMR?

Question 16

Q

Where is HDFS run in EMR?

Answer

A

On the core nodes.

Question 17

Q

What is EMRFS?

Answer

A

It is an S3-backed file system and can be used in place of HDFS.

Question 18

Q

What advantage does using EMRFS have?

Answer

A

It is in S3 so it lives beyond the life of the cluster.

Question 19

Q

Is EMR a fully managed service that does not deploy nodes into a VPC?

Answer

A

No, EMR is not fully managed but is a managed service that is deployed in your VPC.

Question 20

Q

Is EMR highly available across all availability zones?

Answer

A

No, for speed of processing, EMR (Hadoop) nodes are deployed into a single AZ.

Question 21

Q

What does Spark do?

Answer

A

It is a batch and stream processing engine for data. It competes against EMR (Hadoop) in the area of batch processing.

Question 22

Q

Who uses EMR (Hadoop) and Spark?

Answer

A

Financial sector: looking for fraud
Health: scoring potential health risks

Question 23

Q

What is Hive?

Answer

A

It complements the HDFS file system. It enables you to use SQL-like queries that are converted into MapReduce jobs to be run on a Hadoop cluster.

Question 24

Q

Is Hive a good fit for OLTP or relational data?

Question 25

Q

Is EMR good for use with OLTP and relational data?

Question 26

Q

What is Pig used for?

Answer

A

Before Pig, people using EMR (Hadoop) had to interact with the cluster through low-level tasks written in Java. Pig is a scripting language that works at a higher level.

Question 27

Q

What is the minimum size of an EMR cluster?

Answer

A

One node, but this is for development only.

Question 28

Q

I am thinking of running the master node on a spot instance, is there any potential issue and why?

Answer

A

Yes. The master node controls the EMR (Hadoop) cluster, so if it fails the cluster fails. Spot instances can and will go away at any point in time.

Question 29

Q

What EMR (Hadoop) nodes should I use spot instance for?

Answer

A

Use spot instances for task nodes.

Question 30

Q

Can I use instance fleets with EMR nodes?

Answer

A

Yes, this gives you the ability to select up to five different instance types. You specify the desired number of nodes and the price, and the fleet will try to meet them.

Question 31

Q

How do I secure the EMR (Hadoop) cluster?

Answer

A

Using security groups and NACLs.

Question 32

Q

Do you want to use spot instances for core nodes?

Answer

A

You can, but you could lose the node and part of the HDFS file system.

Question 33

Q

What should I use to run my EMR task nodes?

Answer

A

Spot instances as the task nodes have no data.

Question 34

Q

If I am using instance fleets, how many fleets are used for the different node types in EMR?

Answer

A

You will have three fleets, one per node type:

Master node fleet
Task node fleet
Core node fleet
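The three fleets could be written down as plain Python data, as in the sketch below. The field names follow my understanding of the instance fleet shape used by the boto3 `run_job_flow` request and should be checked against the current API reference before use. The fleet names, instance types and capacities are hypothetical placeholders.

```python
# Sketch of one instance fleet per EMR node type.
# Instance types and capacities below are hypothetical placeholders.
instance_fleets = [
    {
        "Name": "master-fleet",
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,
        "InstanceTypeConfigs": [{"InstanceType": "m4.large"}],
    },
    {
        "Name": "core-fleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 2,
        "InstanceTypeConfigs": [{"InstanceType": "m4.large"}],
    },
    {
        "Name": "task-fleet",
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": 4,
        "InstanceTypeConfigs": [
            {"InstanceType": "m4.large"},
            {"InstanceType": "m4.xlarge"},
        ],
    },
]

# One fleet for each node type: master, core and task.
fleet_types = [fleet["InstanceFleetType"] for fleet in instance_fleets]
```

The task fleet here uses only spot capacity, in line with the cards that recommend spot instances for task nodes.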

Question 35

Q

I have data in S3 in the us-east-1 region. Where should I run my EMR cluster?

Answer

A

As close to the data as possible, in this case us-east-1. The reason is latency: you get roughly 1 ms per 90 miles of distance.
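The rule of thumb in this card can be turned into a quick estimate. The sketch below takes the 1 ms per 90 miles figure at face value; the 2,700-mile distance is a made-up example, and real network latency depends on routing as well as distance.

```python
MILES_PER_MS = 90  # rule of thumb quoted in the card above

def estimated_latency_ms(distance_miles):
    """Estimate added latency in milliseconds for a given distance."""
    return distance_miles / MILES_PER_MS

# Cluster in the same region as the data versus a hypothetical
# cluster 2,700 miles away.
same_region = estimated_latency_ms(0)
far_region = estimated_latency_ms(2700)
```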

Question 36

Q

I am calculating pi. Should I use a general purpose, compute optimised or memory optimised node?

Answer

A

Compute optimised, as it is going to use a lot of CPU.

Question 37

Q

What is the recommended instance type for Hadoop cluster nodes?

Answer

A

m4.large for a cluster with fewer than 50 nodes; for a cluster with more than 50 nodes you step up to the next size, m4.xlarge.
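The sizing rule in this card fits in a one-line helper. The function name is hypothetical, and because the card does not say what to do at exactly 50 nodes, the sketch assumes 50 counts as the larger size.

```python
def recommended_instance_type(node_count):
    """Return the starting instance type for a cluster of this size."""
    # Assumption: a cluster of exactly 50 nodes uses the larger size.
    return "m4.large" if node_count < 50 else "m4.xlarge"

small_cluster = recommended_instance_type(10)
large_cluster = recommended_instance_type(80)
```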

Question 38

Q

For EMR, when should I use reserved instances?

Answer

A

When you know the cluster is going to be used long term (1, 2 or 3 years).

Question 39

Q

For long-running EMR, or where EMR is a data warehouse, how should I set up the cost of the nodes?

Answer

A

Master node = On-demand
Core node = on-demand or fleet
Task node = on-demand or fleet

Question 40

Q

For cost driven EMR how should I set up the cost of the nodes?

Answer

A

Master node = Spot
Core node = Spot
Task node = Spot

Question 41

Q

For data warehouse critical EMR how should I set up the cost of the nodes?

Answer

A

Master node = On-demand
Core node = On-demand
Task node = on-demand or fleet

Question 42

Q

For app testing EMR how should I set up the cost of the nodes?

Answer

A

Master node = Spot
Core node = Spot
Task node = Spot

Question 43

Q

Do you have to provide code, or does EMR just do the map-reduce for me and generate the code?

Answer

A

You have to provide the map and reduce code. This is the code EMR will push to the map and reduce nodes, and it is the code that will run on the nodes to perform the map and reduce processing.

Question 44

Q

What are a split and a split size?

Answer

A

A split is a chunk of the data, and the split size is the size of each chunk. The data is split into chunks that are stored on separate nodes.

Question 45

Q

What is a map job?

Answer

A

The map phase takes data such as a date, name and address and stores it on the nodes, splitting the data by, say, the date. This way each node has a subset of the data.

Question 46

Q

What is the reduce job?

Answer

A

Data is shuffled into the reduce phase, where it is, for example, counted.
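To make the map, shuffle and reduce steps concrete, here is a small single-process sketch that counts records per date. The sample records and the function names are invented for the illustration; on a real cluster each phase runs across many nodes.

```python
from collections import defaultdict

# Hypothetical records of (date, name, address).
records = [
    ("2021-01-01", "Ann", "1 Main St"),
    ("2021-01-01", "Bob", "2 Oak Ave"),
    ("2021-01-02", "Cy", "3 Elm Rd"),
]

def map_phase(records):
    """Map: emit a (date, 1) pair for every record."""
    for date, _name, _address in records:
        yield date, 1

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: count the records for each date."""
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(records)))
```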

Question 47

Q

I require Hadoop, how do I configure Redshift?

Answer

A

You do not. Redshift is a data warehouse; you need EMR. EMR is the AWS implementation of MapReduce and Hadoop.

Question 48

Q

I require Spark, what product in AWS should I be configuring?

Question 49

Q

What is Hive?

Answer

A

Hive is a data warehouse on top of Hadoop that gives you SQL query abilities. It has a metadata store, plus ODBC and JDBC drivers that let you query easily from your apps.

Question 50

Q

What is Pig?

Answer

A

Pig is a high-level language for analyzing data in Hadoop. For example, you can use Pig to:

Load a CSV file: listings = LOAD 'k.csv' USING PigStorage(',') AS (id:int, date:chararray);
Create new data listings: FOREACH listings GENERATE list_id, ToDate

Question 51

Q

How are EMR clusters created?

Answer

A

An EMR cluster can be created by you through the console, CLI or API, or by another product such as AWS Data Pipeline. When you create a cluster yourself, it is a long-running cluster.

Question 52

Q

Can I ssh to the master node?

Answer

A

Yes, you can ssh to the master node.

Question 53

Q

I am using Hadoop and Hive and I want to use ODBC. Do I need to move the data to Redshift?

Answer

A

No, Hive is a data warehouse on top of Hadoop, and one of its features is the ability to use ODBC.

Question 54

Q

I am using Hadoop and Hive and I want to use JDBC. Do I need to move the data to Redshift?

Answer

A

No, Hive is a data warehouse on top of Hadoop, and one of its features is the ability to use JDBC.

Question 55

Q

What is HBase?

Answer

A

HBase is like Google's Bigtable database; it runs on top of Hadoop HDFS.

Question 56

Q

I have to write some code. Are map and reduce one application?

Answer

A

No, they are two separate applications: a map app and a reduce app.
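One way to picture the two applications is as a mapper and a reducer that only communicate through lines of text, in the style of Hadoop Streaming. The sketch below keeps both in one file and passes lists instead of standard input so it can run on its own; the tab-separated "key, value" line format and the sample input are assumptions made for the example.

```python
from itertools import groupby

def mapper(lines):
    """Map app: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Reduce app: sum the counts for each word in sorted input."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# The framework sorts the mapper output by key before the reducer
# sees it; sorted() stands in for that step here.
mapped = sorted(mapper(["dog cat", "Dog bird"]))
reduced = list(reducer(mapped))
```

Because the reducer relies on equal keys arriving next to each other, the sort between the two steps is essential.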

Question 57

Q

Does EMR support Spark?

Question 58

Q

How are EMR clusters created?

Answer

A

You create the cluster (long-running cluster)
Another product like AWS Data Pipeline creates the cluster

Question 59

Q

I want to get a deeper understanding of what my EMR cluster is doing. What can I do?

Answer

A

When creating a cluster you have the option to create logs and have them saved to S3.

Question 60

Q

When creating a Hadoop cluster what software configuration do I have available?

Answer

A

Hadoop
HBase
Spark
Presto

Question 61

Q

When creating a Hadoop cluster, do I have the option to select instance size?

Answer

A

Yes, 100%. You are getting a managed cluster of nodes, loaded with software for you, and you can size the nodes as you need.

Question 62

Q

As EMR is a service, do you get nodes, and if so what are they called?

Answer

A

Master node and core nodes

Question 63

Q

If I create a cluster of 4, what nodes am I getting?

Answer

A

You are getting one master node and 3 core nodes.
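The arithmetic behind this answer is simply one master plus the rest as core nodes. The helper below is a hypothetical sketch that assumes a single master node and no task nodes.

```python
def node_layout(cluster_size):
    """Split a cluster size into one master and the rest as core nodes."""
    if cluster_size < 1:
        raise ValueError("a cluster needs at least one node")
    # Assumption: a single master node and no task nodes.
    return {"master": 1, "core": cluster_size - 1}

layout = node_layout(4)
```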

Question 64

Q

My org has a policy of encrypting everything, how can we apply this to EMR?

Answer

A

EMR has encryption:

In transit
At rest (EBS and HDFS encryption)
In transit between nodes

Answer 58

A

You can have the data encrypted between EMR nodes.

Answer 59

A

An IAM role, this role is used to grant the EMR nodes access to S3.

Answer 60

A

Yes 100%, you can ssh to the nodes.

Answer 61

A

You use a ‘Security Configuration’

Answer 62

A

Yes 100%, this is done when you create a ‘security configuration’

Answer 63

A

It is an implementation of HDFS for reading and writing to S3.

Answer 64

A

Yes 100%, EMR is just a cluster of EC2 nodes and is deployed into a VPC.

Answer 65

A

3 nodes, a master and two core nodes.

Answer 66

A

You need to select the Hive configuration when setting up the cluster. Hive is a data warehouse on top of Hadoop and enables you to run SQL queries and to connect over ODBC and JDBC.

Answer 67

A

It manages the cluster, distributes workloads to the cluster and monitors the health of the cluster.

Answer 68

A

SSH to the master node and you can run the queries; you can also connect to Hive using ODBC and JDBC.

Answer 69

A

On HDFS or on EMRFS (S3)

Answer 70

A

It runs on the core nodes.

Answer 71

A

You do not. EMR clusters are architected and deployed to be in one availability zone.

Answer 72

A

A task node is an optional node for running tasks; it does not store HDFS data.

Answer 73

A

A task node can use a spot instance as it can be stopped at any point in time without causing issues.

Answer 74

A

Choose storage classes deliberately: data that can be recreated should be stored in low-cost storage like One Zone-IA.

Answer 75

A

The cluster will fail, it is the most important node.

Answer 76

A

One, the single master node is responsible for cluster management.

Answer 77

A

No! The master node is responsible for managing the cluster and always needs to be present.

Answer 78

A

Redshift data node stores data and searches the data.

EMR core nodes perform jobs on the data like map and reduce.

Answer 79

A

Unlike Redshift, you can use almost all the instance types.

Answer 80

A

No, it is fixed after the cluster is provisioned.

Answer 81

A

It is not a good choice. You will want to keep data close to the EMR cluster; in this case the data is in S3 in us-east-1, so you will want to create the EMR cluster in us-east-1.

Answer 82

A

Select something like m4.large, then evaluate using CloudWatch and resize the nodes.

Answer 83

A

Memory -> select a memory optimised instance type (R)
Compute -> select a compute optimised instance type (C)
Storage -> select a storage optimised instance type ()

Answer 84

A

Consider spot for master, core and task nodes.

Answer 85

A

Consider spot for master, core and task nodes.

Answer 86

A

on-demand for master node
on-demand for core-node
spot for task-node

Answer 87

A

on-demand for master node
on-demand for core-node
spot for task-node or instance fleet

Answer 88

A

on-demand for master node
on-demand for core-node or instance fleet
spot for task-node or instance fleet

Answer 89

A

on-demand for the master node
on-demand for core-node or instance fleet
spot for task-node or instance fleet

Answer 90

A

There is no streaming and analytics service in EMR. EMR is a batch processing service for data that is already captured and stored in storage like S3. For streaming and real-time analytics, it is best to use Kinesis.

Answer 91

A

web indexing, data mining
logfile analysis
machine learning
financial analysis
scientific simulation
bioinformatics research

Answer 92

A

Instances with a better balance of CPU to I/O suit this type of analysis. Smaller instances with higher I/O would be better.

Answer 93

A

Amazon DynamoDB is integrated with Apache Hive, a data warehousing application that runs on Amazon EMR. Hive can read and write data in DynamoDB tables, allowing you to:

Query live DynamoDB data using a SQL-like language (HiveQL).

Copy data from a DynamoDB table to an Amazon S3 bucket, and vice-versa.

Copy data from a DynamoDB table into Hadoop Distributed File System (HDFS), and vice-versa.

Perform join operations on DynamoDB tables.

Answer 94

A

S3
Elasticsearch
Amazon RDS
DynamoDB
Redshift
Kafka
Kinesis

Answer 95

A

Yes 100%, you can load your own libs and code on the nodes in EMR.

Answer 96

A

Yes 100% there is a connector

Answer 97

A

Apache Sqoop is a tool for transferring data between Amazon S3, Hadoop, HDFS, and RDBMS databases

Answer 98

A

Spark -> Redshift connector

Answer 99

A

Compress the data

Answer 100

A

It is a batch service, but if you add the Kinesis connector you can have EMR query and process data coming from a Kinesis stream using Hive, Pig or MapReduce.

Answer 101

A

It enables you to connect EMR with a Kinesis stream for processing and querying. You can query and process using Pig, Hive or MapReduce.

Answer 102

A

EMR, With EMR you a

Answer 103

A

Slave nodes are core nodes, just another name.

Answer 104

A

No, you can have several groups of nodes, and each group could use different node sizes and a mix of on-demand and spot instances.

Answer 105

A

Yes 100%, AWS has machine learning platforms that make it easy to perform ML. But you can also run ML on top of EMR.

Answer 106

A

As ML is compute-intensive, you would use compute-optimized instances (C family).

Answer 107

A

Log processing
Genomic
Clickstream

Answer 108

A

During deployment, you can opt to use the advanced options and deploy more than a single master node.

Answer 109

A

You can opt to run a boot script.

Answer 110

A

Java, but most any language can be used.

Answer 111

A

You can use:
Snowball
Import/Export
AWS CLI (S3)
DataSync
Direct Connect

Answer 112

A

Yes 100%, you can ssh to this node.

Answer 113

A

No, the core and task nodes carry out the map.

Answer 114

A

A task node is used for running the map and reduce functions, but it does not hold HDFS data; it works on data from HDFS or from EMRFS (S3).

Answer 115

A

You will need to use Infrequent Access. IA gives you a good cost because your data is stored untouched for long periods and read about once a year. You cannot use Glacier because the data needs to be instantly available.

Answer 116

A

HDFS does not run on the task nodes; it runs on the core nodes.

Answer 117

A

The cluster will fail, as the master node is the node that takes care of the entire cluster.

Answer 118

A

You can configure EMR to use up to 3 master nodes

Answer 119

A

Use EMRFS (S3) to store the data.

Answer 120

A

Yes 100%.

Answer 121

A

Cluster, you can deploy EMR as nodes in your VPC.

Answer 122

A

You can't; EMR is deployed into a single availability zone.

Answer 123

A

Hadoop
Spark
HBase
Presto

Answer 124

A

It is a data warehouse on top of EMR.

Answer 125

A

Implementation of Google’s Bigtable

Answer 126

A

It is a high-level language for creating applications on Hadoop.

Answer 127

A

A single node. A single-node cluster is only for demos and hosts the master node, HDFS, and the map and reduce work.

Answer 128

A

No, the master node cannot be on a spot instance: if the spot instance is taken back by AWS, the cluster will die.

Answer 129

A

You should run EMR with the master node as reserved or on-demand, the core nodes as on-demand or reserved and the task nodes can be bumped up to deal with the 50 by using spot instances as needed.

Answer 130

A

You can use spot instances.

Answer 131

A

Spot, yes spot for master, core and task, lowest price.

Answer 132

A

Spot, yes spot for master, core and task, lowest price.

Answer 133

A

Master = on-demand
Core = on-demand or instance fleet
Task = spot or instance fleet