Elastic Map Reduce Flashcards

1
Q

_ _ describes the realization of _ _ _ by _, _, and _ data that was previously ignored or siloed due to the limitations of _ _ management technologies

A

Big data
greater business intelligence
storing
processing
analyzing
traditional data

Big data describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management technologies

2
Q

The V’s of Big Data
_ is the _ data travels
_ is the _ data requires
_ is the _ types of _

A

Velocity, speed
Volume, space
Variety, heterogeneous, files

-Velocity is the speed data travels
-Volume is the space data requires
-Variety is the heterogeneous types of files

3
Q

Velocity
_ _ from many sources at a _ _ of _
3 examples

A

Velocity
Ingesting data from many sources at a high rate of speed
-Internet of Things (IoT)
-clickstream data
-environmental data

4
Q

Volume
_ (one character)
_ (1000 bytes)
_ (1000^2 bytes)
_ (1000^3 bytes)
_ (1000^4 bytes)
_ (1000^5 bytes)
_ (1000^6 bytes)
_ (1000^7 bytes)

Fun fact: A single oil well generates _ _ of data per day.

A

Byte (one character)
Kilobyte (1000 bytes)
Megabyte (1000^2 bytes)
Gigabyte (1000^3 bytes)
Terabyte (1000^4 bytes)
Petabyte (1000^5 bytes)
Exabyte (1000^6 bytes)
Zettabyte (1000^7 bytes)

A single oil well generates 15 terabytes of data per day

BKMGTPEZ
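
The unit ladder above can be sketched in a few lines of Python. This is only an illustration of the decimal (power-of-1000) units on the card; the helper names `unit_size` and `to_bytes` are made up for this example.

```python
# Decimal (SI) storage units, in the same order as the card (B K M G T P E Z).
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]


def unit_size(name):
    """Return the number of bytes in one unit, e.g. 1000**4 for a terabyte."""
    return 1000 ** UNITS.index(name)


def to_bytes(amount, name):
    """Convert an amount expressed in the given unit to bytes."""
    return amount * unit_size(name)


# The oil-well figure from the card: 15 terabytes per day.
daily = to_bytes(15, "terabyte")
print(daily)                          # bytes generated per day
print(daily / unit_size("gigabyte"))  # the same volume expressed in gigabytes
```

Each step up the ladder multiplies by 1000, so the position of a unit in the list is also its exponent.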

5
Q

Volume Examples

A

-A standard work year - 2,016 hours
-YouTube (Google) Content ID System
-Looks for copyright violations in uploaded videos
-YouTube's Content ID system processes 250 years of video content in 24 hours
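
To get a feel for the scale, the two figures on this card can be combined with a little arithmetic. The sketch assumes a 365-day calendar year, which the card does not state, so treat the result as a rough estimate.

```python
HOURS_PER_YEAR = 365 * 24   # assumption: non-leap calendar year
WORK_YEAR_HOURS = 2016      # standard work year, from the card

# 250 years of video processed every 24 hours, expressed in hours of content.
video_hours_per_day = 250 * HOURS_PER_YEAR

# How many standard work years one person would need to watch one day's input.
work_years_per_day = video_hours_per_day / WORK_YEAR_HOURS

print(video_hours_per_day)
print(round(work_years_per_day))
```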

6
Q

Variety Examples

A

RDBMS - relational data files
XML files
log files
unstructured text files
HTML files
PDF files
Video files

7
Q

Big Data
Is Big Data just a _ in _?
Is big data just a _ _ for technologies that always existed, but were just called something else?
Completely different _ for _ and _ _

A

fad, technology
new name
architecture, computing, data storage

-Is big data just a fad in technology?
-Is big data just a new name for technologies that always existed but were just called something else?
-Completely different architecture for computing and data storage

8
Q

Traditional computing model
Data stored in a _ _ like a _
Data copied to _ at _ _
_ _ bottleneck on the _ _

A

-Data stored in a central location like a SAN
-Data copied to processors at run time
-Large volumes bottleneck on the transfer rate

9
Q

Hadoop Computing Model
Bring the _ _ _ _
_ and _ data when the _ _ _
Run the _ where the _ _

A

program to the data
replicate, distribute, data is stored
program, data resides

-Bring the program to the data
-Replicate and distribute data when the data is stored
-Run the program where the data resides

10
Q

Distributions
_ is a _ of _ _ _ _ applications that have been tested to _ _
Prominent providers of distributions include…

A

Distribution, collection, open source Apache, work together

Cloudera
Hortonworks
Amazon
Google
MS Azure

Distribution is a collection of open source Apache applications that have been tested to work together

Prominent providers of distributions include
-Cloudera
-MS Azure
-Hortonworks
-Google
-Amazon

11
Q

Hadoop
The _ _ software library is a _ that allows for the _ _ of large data sets across _ _ _ using simple _ _

A

Apache Hadoop, framework, distributed processing, clusters of computers, programming models

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

12
Q

Hadoop Characteristics
_ data storage
inexpensive _
combines up to _ _ _ for _ performance

A

inexpensive
servers
1000, distributed servers, massive

13
Q

Trends- Storage
Is a _
only getting _ and _ _
normalization vs _
Data schema on-_ vs _ on-write
data _
solid-_

A

commodity
cheaper, more abundant
denormalization
on-read, schema
lakes
state

is a commodity
only getting cheaper and more abundant
normalization vs denormalization
data schema on-read vs schema on-write
data lakes
solid-state

14
Q

Trends-memory
Is a _
only getting _ and _ _
the _ the _
In-memory _ _ from _ _ of _
_ _ needs, depending on the size of _, lots of _

A

is a commodity
only getting cheaper and more abundant
the more the merrier
In-memory computing benefiting from massive allocation of RAM
Hadoop NameNode needs, depending on the size of the cluster, lots of RAM

15
Q

Distributed Processing
Cheaper to store _ _ of data using _ _ architecture

Think of _ _ on servers. At a large corporation there are massive quantities of _ _ _. They are used for analysis of _ _, _ _, _ _, _ _ and tuning, and more

Analyzing all of that _ stored data requires _ _ for analysis

A

massive quantities, big data

log files, log files (petabytes), security breaches, clickstream analysis, website statistics, infrastructure analysis, and more

cheaply, different application

-Cheaper to store massive quantities of data using big data architecture
-Think of log files on servers. At a large corporation there are massive quantities of log files (petabytes). They are used for analysis of security breaches, clickstream analysis, website statistics, infrastructure analysis and tuning, and more.
-Analyzing all of that cheaply stored data requires a different application for analysis

16
Q

Hadoop Distributed File System
_ is the data storage layer for a _ _
Inexpensive reliable store for _ _ _ _
uses low cost industry _ _
data is _ and _ to multiple _ of _

A

HDFS, Hadoop system
massive amounts of data
standard hardware
replicated, distributed, nodes, storage

HDFS is the data storage layer for a Hadoop system
Inexpensive reliable store for massive amounts of data
uses low cost industry standard hardware
data is replicated and distributed to multiple nodes of storage
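
The idea of replicating and distributing data across nodes can be sketched as below. This is a toy model, not how HDFS actually places blocks: real HDFS uses a rack-aware policy and much larger blocks, while this sketch uses a simple round-robin and a tiny block size so the output is easy to read. The function name `place_blocks` and the node names are hypothetical.

```python
from typing import Dict, List


def place_blocks(data: bytes, nodes: List[str],
                 block_size: int = 4, replicas: int = 3) -> Dict[int, List[str]]:
    """Split data into fixed-size blocks and assign each block to distinct nodes."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    # Ceiling division: a final partial block still counts as a block.
    n_blocks = (len(data) + block_size - 1) // block_size
    placement = {}
    for i in range(n_blocks):
        # Round-robin: block i starts at node i and wraps around the cluster.
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement


placement = place_blocks(b"abcdefghij", ["n1", "n2", "n3", "n4"])
for block_id, holders in placement.items():
    print(block_id, holders)
```

Because every block lives on several nodes, losing one node leaves at least two other copies of each block it held.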

17
Q

Hadoop application

HDFS is the _ _ _
-distributes _ _ across the cluster in a redundant manner
-Data is lost on _ _

YARN is _ _ _ _
-Manages cluster resources for the _ _ _

MapReduce
-Base code that handles all _ _
-Maps data to _/_ _

A

Hadoop file system
data blocks
cluster termination
Yet another resource negotiator
collection of applications
data processing
key/value pairs

HDFS is the Hadoop file system
-Distributes data blocks across the cluster in a redundant manner
-Data is lost on cluster termination
YARN is Yet Another Resource Negotiator
-Manages cluster resources for the collection of applications
MapReduce
-Base code that handles all data processing
-Maps data to key/value pairs
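
The key/value idea behind MapReduce can be sketched in plain Python with the classic word count. This runs in a single process and only mimics the map, shuffle, and reduce stages; it does not use Hadoop, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (key, value) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data lakes"])
print(counts)
```

In a real cluster the map calls run on the nodes that hold each block of input, and only the grouped key/value pairs move across the network.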

18
Q

Map Reduce
Mechanism for bringing the processing to _ _ _
Maps where data is stored on each _ _
contains a master job tracker managing _ _
uses the task tracker to execute tasks on each _ _

A

the stored data
HDFS node
task resources
HDFS node

Mechanism for bringing the processing to the stored data
Maps where data is on each HDFS node
contains a master job tracker managing task resources
uses the task tracker to execute tasks on each HDFS node

19
Q

What is EMR?
EMR stands for _ _ _
EMR is a managed Hadoop service by _
AWS Distributions provide support for the most popular _ _ applications like _, _, _, _, and _

A

Elastic Map Reduce
AWS
open source
Spark, Hive, HDFS, Presto, and Flink

20
Q

EMR Cluster Architecture
Master Node “leader node”
-manages _ _
-tracks status of _
-Monitors _ _
-Single _ _

Core Node
-Saves _ _
-Used in _ _ _
-Runs _
-can be scaled _ or _

Task Node
-runs _ _
-does not store _
-_ instances can be used

A

Master Node “leader node”
-manages the cluster
-tracks status of tasks
-monitors cluster health
-Single EC2 instance

Core Node
-Saves HDFS data
-Used in multi node clusters
-runs tasks
-can be scaled up or down

Task node
-runs tasks only
-does not store data
-spot instances can be used

21
Q

Transient versus long-running clusters
_ clusters terminate once all steps are complete
- it _ _ _
-perform work and then shut down _ _

_ are manually terminated
-Functions as a data warehouse with periodic processing on _ _ _
-Task nodes can be scaled using _ _
-Setup with termination protection on - _

A

Transient clusters terminate once all steps are complete
-loading data, processing, and storing data
-perform work and then shut down, saving costs
Long-running clusters are manually terminated
-functions as a data warehouse with periodic processing on large data sets
-task nodes can be scaled using spot instances
-setup with termination protection on and auto termination off
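
The difference between the two cluster styles comes down to a couple of settings. The sketch below only builds a Python dictionary and sends nothing to AWS. Its keys are intended to follow the request shape of boto3's EMR `run_job_flow` call, which is an assumption to verify against the boto3 documentation; a real request also needs fields such as a release label and IAM roles, which are left out here. The cluster names, instance type, and node counts are hypothetical.

```python
def build_cluster_request(name, transient, task_nodes=2):
    """Build a partial cluster request for a transient or long-running cluster."""
    return {
        "Name": name,
        "Instances": {
            "InstanceGroups": [
                # One master node manages the cluster.
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # Core nodes store HDFS data, so they stay on-demand.
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
                # Task nodes hold no data, so spot instances are an option.
                {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": task_nodes},
            ],
            # Transient: shut down when the last step finishes.
            "KeepJobFlowAliveWhenNoSteps": not transient,
            # Long-running: guard against accidental termination.
            "TerminationProtected": not transient,
        },
    }


nightly = build_cluster_request("nightly-etl", transient=True)
warehouse = build_cluster_request("warehouse", transient=False, task_nodes=4)
print(nightly["Instances"]["KeepJobFlowAliveWhenNoSteps"])
print(warehouse["Instances"]["TerminationProtected"])
```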

22
Q

Using EMR
_ and _ are part of cluster creation
- users connect directly to the _ _ to run jobs
-configure steps in a _
-submit ordered steps via the _

A

frameworks, applications
master node
cluster
console

Frameworks and applications are part of cluster creation
users connect directly to the master node to run jobs
configure steps in a cluster
submit ordered steps via the console

23
Q

EMR and AWS integration
_ provides the EMR nodes
_ provides a virtual network for nodes
_ stores input and output data
_ _ to schedule and start clusters
_ to configure permissions

A

EC2
VPC
S3
Data Pipeline
IAM

24
Q

EMR capabilities
_ is a by the hour service charge
_ is a separate set of charges
automatically provisions core nodes when they _
cluster core nodes can be resized _ _ _
core nodes can be removed but risk _ _
task nodes can be added on the _

A

EMR is a by the hour service charge
EC2 is a separate set of charges
automatically provisions core nodes when they fail
cluster core nodes can be resized on the fly
core nodes can be removed but risk data loss
task nodes can be added on the fly

25
Q

HIVE
-It is a tool that provides SQL querying of data stored in _ or HBase
-Accessed using the _ _
-Allows for easy _ _ _
-Transforms log file data into structures like _
-Consists of a schema in the Metastore and data in _

A

It is a tool that provides SQL querying of data stored in HDFS or HBase
Accessed using the HiveQL language
Allows for easy ad-hoc queries
Transforms log file data into structures like tables
Consists of a schema in the Metastore and data in HDFS

26
Q

Success of HIVE
Uses familiar SQL syntax for _ _
Interactive and scalable on a _ _ _
Works very well for _ _ _
_/_ driver

A

OLAP queries
big data cluster
data warehouse applications
JDBC/ODBC

uses familiar SQL syntax for OLAP queries
interactive and scalable on a big data cluster
works very well for data warehouse applications
JDBC/ODBC driver

27
Q

Hive Metastore and Glue
___ shares schema across EMR and other AWS services
__ is used to create data lakes

A

Glue Data Catalog

28
Q

Schema on read
-Verifies data organization when a query is _
-Provides much faster loading as structure is not _
-Multiple schemas serving different needs for the __
-Better option when schema is not known at ___

A

-Verifies data organization when a query is issued
-Provides much faster loading as structure is not validated
-Multiple schemas serving different needs for the same data
-Better option when schema is not known at loading time
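
Schema on read can be illustrated with a small Python sketch: raw lines are stored untouched, and a schema is applied only when the data is queried. The sample lines, column names, and the `read_with_schema` helper are all hypothetical. Turning unparseable fields into `None` is meant to resemble Hive returning NULL for values that do not fit the declared type.

```python
# Raw tab-delimited lines are stored as-is; nothing is validated at load time.
RAW_LINES = [
    "2024-01-05\talice\t42",
    "2024-01-06\tbob\tn/a",   # would be rejected under schema on write
]


def read_with_schema(lines, schema):
    """Apply (name, converter) pairs to each line at query time."""
    for line in lines:
        fields = line.split("\t")
        row = {}
        for (name, convert), raw in zip(schema, fields):
            try:
                row[name] = convert(raw)
            except ValueError:
                row[name] = None   # value does not fit the schema
        yield row


# Two schemas over the same stored data, serving different needs.
typed_schema = [("day", str), ("user", str), ("score", int)]
text_schema = [("day", str), ("user", str), ("score", str)]

typed_rows = list(read_with_schema(RAW_LINES, typed_schema))
text_rows = list(read_with_schema(RAW_LINES, text_schema))
print(typed_rows)
print(text_rows)
```

Loading is fast because nothing is checked up front; the cost of interpreting the data is paid by each query instead.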

29
Q

HIVE query example
-Creates a new table(file) for user_active
-Selects all users
-From table called user
-that have an active indicator

A

INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 'A';

30
Q

Loading Data into Hive
-Create a hive table called records to store data
-identify the data type of each field
-define the structure as tab delimited

-Load data into the table from a local file
-OVERWRITE will replace the existing file

A

CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
OVERWRITE INTO TABLE records;

31
Q

Query Against a Hive Table
-SELECT uses familiar SQL commands and specifications
-MAX finds the maximum temperature for each year
-FROM names our records table
-WHERE ensures clean data selections
-GROUP BY groups the values by year

A

hive> SELECT year, MAX(temperature)
> FROM records
> WHERE temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
> GROUP BY year;
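
The same query shape can be tried without a Hive cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, since this particular query uses only standard SQL. The sample rows are hypothetical, and the sentinel for a missing reading is a named constant assumed here to be 9999.

```python
import sqlite3

MISSING = 9999  # assumed sentinel value for a missing temperature reading

rows = [  # hypothetical sample readings: (year, temperature, quality)
    ("1949", 111, 1),
    ("1949", 78, 1),
    ("1950", 22, 1),
    ("1950", MISSING, 1),  # excluded: missing reading
    ("1950", 300, 2),      # excluded: quality code not in the accepted set
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (year TEXT, temperature INTEGER, quality INTEGER)"
)
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

result = conn.execute(
    """
    SELECT year, MAX(temperature)
    FROM records
    WHERE temperature != ? AND quality IN (0, 1, 4, 5, 9)
    GROUP BY year
    ORDER BY year
    """,
    (MISSING,),
).fetchall()
conn.close()

print(result)
```

The WHERE clause drops the two unusable rows before MAX is computed, so each year's maximum comes only from clean readings.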

32
Q

File Storage Formats
___ are available for storage
5 examples

A

Multiple file formats are available for storage

--as-<format> is the nomenclature
--as-textfile
--as-sequencefile
--as-parquetfile
--as-avrodatafile

CREATE TABLE tablename (colname DATATYPE,…)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS format;

33
Q

Binary Column Formats
Column-oriented formats work best when only a few columns are used in _/_
-Hive provides native support for _
-STORED AS..(PARQUET) or (ETC…)

A

-queries/calculations
-parquet
————————————–

CREATE TABLE users_parquet STORED AS PARQUET
AS
SELECT * FROM users;

34
Q

S3DistCp Copy
Transaction for copying large amounts of data from _ to _
Copies in a distributed manner using ___
Provides parallel path copying across _

A

transaction for copying large amounts of data from S3 to HDFS
Copies in a distributed manner using MapReduce
Provides parallel path copying across buckets

s3-dist-cp --src=s3://jb101000/data --dest=hdfs:///data

35
Q

EMR SERVICE
EMR is the encompassing big data service in __
Which applications are in a distribution for EMR service?

A

EMR is the encompassing big data service in AWS
Hadoop, Spark, and Hive applications are in the distribution

36
Q

Clusters
A cluster is a computing network of __
____ in the cluster store Hadoop files

A

A cluster is a computing network of computers
Core nodes in the cluster store Hadoop files

37
Q

Select Application

____ is the distribution of applications tested together
Apache applications like __ and __ are selected

A

EMR release is the distribution of applications tested together
Apache applications like Spark and Hive are selected

38
Q

Configure Cluster Nodes
___ configuration is for the master node
__ configuration is for storage and processing
___ configuration is for processing

A

primary node
core node
task node

39
Q

Cluster Configuration
Master node
-____ is the standard configuration
Core/Task nodes
-____ is a good choice
-external dependencies use ___
-improved performance with ___

___ is a good choice for task nodes as they can scale. Avoid using it for master and core nodes as it may cause __

A

master node
m5.xlarge is the standard configuration
core/task nodes
-m5.xlarge is a good choice
-external dependencies use t2.medium
-improved performance with m4.xlarge

A spot instance is a good choice for task nodes as they can scale. Avoid using it for master and core nodes as it may cause data loss

40
Q

Virtual private cloud
_ is created as a protected network for the cluster

A

VPC

41
Q

EMR Processing Logs

__ is created to capture the processing logs for the cluster
___ are captured in the same bucket

A

S3 bucket
error log files

42
Q

EMR security
___ grant or deny permissions to control cluster access
___ control access to EMRFS data based on user
___ are attached to IAM roles
____ provides a secure connection to the command line interface
___ provides secure user authentication
______ setting prevents public access to data stored on your EMR cluster

A

IAM policies grant or deny permissions to control cluster access
IAM roles control access to EMRFS data based on user
IAM policies are attached to IAM roles
SSH provides a secure connection to the command line interface
Kerberos provides secure user authentication
Block Public Access setting prevents public access to data stored on your EMR cluster
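
An IAM policy is a JSON document of allow and deny statements. The sketch below builds a hypothetical policy as a Python dictionary and serializes it to JSON; the particular combination of statements is invented for illustration and is not a recommended configuration. The action names use the `elasticmapreduce:` prefix and should be checked against the AWS documentation before use.

```python
import json

# Hypothetical policy: allow looking at clusters, deny terminating them.
emr_viewer_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:ListSteps",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Deny",
            "Action": "elasticmapreduce:TerminateJobFlows",
            "Resource": "*",
        },
    ],
}

# IAM expects the policy as a JSON string.
policy_json = json.dumps(emr_viewer_policy, indent=2)
print(policy_json)
```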

43
Q

Define an IAM role

A

Assign IAM roles to the cluster

IAM roles are entities you create and assign specific permissions to, allowing trusted identities such as workforce identities and applications to perform actions in AWS.

44
Q

Define Security for the Cluster

Define the __ approach
Provide the key pair for ___ client access to the cluster

A

Define the encryption approach
Provide the key pair for SSH client access to the cluster

45
Q

Create the cluster
Selecting the ___ button engages the configuration options and creates the cluster

A

create cluster

46
Q

Cluster Operations
The cluster is now available for access by ___
Any ___ selected during cluster creation can now be engaged at the ______

A

The cluster is now available for access by SSH clients
Any Apache application selected during cluster creation can now be engaged at the command line

47
Q

Spark Application

-Apache spark is a fast and general engine for _ _ _ _
-in memory _
-optimized _ _
-Spark SQL _
-Machine Learning __
-Spark _

A

Apache Spark is a fast and general engine for large scale data processing
in-memory caching
optimized query execution
Spark SQL queries
Machine Learning MLlib
Spark Streaming

48
Q

Spark Applications
Consists of SparkContext _ process and _
YARN or Spark can be the cluster _

A

Consists of SparkContext driver process and executors
YARN or Spark can be the cluster manager

49
Q

EMR Notebook
AWS notebook (__)
backed up to _ data storage
provision clusters from the _
accessed via _ _
hosted inside a _

A

AWS notebook (Jupyter)
backed up to S3 data storage
provision clusters from the notebook
accessed via AWS console
hosted inside a VPC

50
Q

-Cluster is a computing network of server nodes
-EMR is the managed Hadoop service provided by AWS
-Distribution is the collection of applications available in the cluster
-Applications like Spark, Hive, and Hadoop are provided by the cluster
-Hive is an application that provides a metadata structure over the Hadoop file system
-Spark is an in-memory high speed analytical application

A