Elastic Map Reduce Flashcards
_ _ describes the realization of _ _ _ by _, _, and _ data that was previously ignored or siloed due to the limitations of _ _ management technologies
Big data
greater business intelligence
storing
processing
analyzing
traditional data
Big data describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management technologies
The V’s of Big Data
_ is the _ data travels
_ is the _ data requires
_ is the _ types of _
Velocity, speed
Volume, space
Variety, heterogeneous, files
-Velocity is the speed data travels
-Volume is the space data requires
-Variety is the heterogeneous types of files
Velocity
_ _ from many sources at a _ _ of _
3 examples
Velocity
Ingesting data from many sources at a high rate of speed
-Internet of Things (IoT)
-clickstream data
-environmental data
Volume
_ (one character)
_ (1000 bytes)
_ (1000^2 bytes)
_ (1000^3 bytes)
_ (1000^4 bytes)
_ (1000^5 bytes)
_ (1000^6 bytes)
_ (1000^7 bytes)
fun fact: A single oil well generates _ _ of data per day.
Byte (one character)
Kilobyte (1000 bytes)
Megabyte (1000^2 bytes)
Gigabyte (1000^3 bytes)
Terabyte (1000^4 bytes)
Petabyte (1000^5 bytes)
Exabyte (1000^6 bytes)
Zettabyte (1000^7 bytes)
Single oil well generates 15 terabytes of data per day
BKMGTPEZ
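The unit ladder above can be checked with a few lines of Python. This is a minimal sketch, assuming the decimal (powers of 1000) definitions used on the card; the function names `unit_size`, `mnemonic`, and `humanize` are hypothetical helpers made up for illustration.

```python
# Decimal units from the card above: each step up is a factor of 1000.
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte",
         "Terabyte", "Petabyte", "Exabyte", "Zettabyte"]


def unit_size(name: str) -> int:
    """Return the number of bytes in one unit, e.g. Terabyte -> 1000**4."""
    return 1000 ** UNITS.index(name)


def mnemonic() -> str:
    """Join the first letter of every unit, smallest to largest."""
    return "".join(unit[0] for unit in UNITS)


def humanize(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    exponent = 0
    while value >= 1000 and exponent < len(UNITS) - 1:
        value /= 1000
        exponent += 1
    return f"{value:g} {UNITS[exponent]}s"


# The oil well figure from the card: 15 terabytes per day.
oil_well_per_day = 15 * unit_size("Terabyte")

print(mnemonic())                        # BKMGTPEZ
print(humanize(oil_well_per_day))        # 15 Terabytes
print(humanize(oil_well_per_day * 365))  # 5.475 Petabytes
```

At 15 terabytes per day, one well passes the petabyte mark in a little over two months, which is why the card treats it as a volume example.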
Volume Examples
-A standard work year - 2,016 hours
-YouTube (Google) Content ID System
-Looks for copyright violations in uploaded videos
-YouTube's Content ID system processes 250 years of video content in 24 hours
Variety Examples
RDBMS - relational data files
XML files
log files
unstructured text files
HTML files
PDF files
Video files
Big Data
Is Big Data just a _ in _?
Is big data just a _ _ for technologies that always existed, but were just called something else?
Completely different _ for _ and _ _
fad, technology
new name
architecture, computing, data storage
-Is big data just a fad in technology?
-Is big data just a new name for technologies that always existed but were just called something else?
-Completely different architecture for computing and data storage
Traditional computing model
Data stored in a _ _ like a _
Data copied to _ at _ _
_ _ bottleneck on the _ _
-Data stored in a central location like a SAN
-Data copied to processors at run time
-Large volumes bottleneck on the transfer rate
Hadoop Computing Model
Bring the _ _ _ _
_ and _ data when the _ _ _
Run the _ where the _ _
program to the data
replicate, distribute, data is stored
program, data resides
-Bring the program to the data
-Replicate and distribute data when the data is stored
-Run the program where the data resides
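The contrast between the two computing models can be sketched as a toy simulation. This is only an illustration under stated assumptions: the cluster is a plain dictionary, the node names are hypothetical, and "moved" simply counts how many values cross the network in each model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cluster: node name -> the records stored on that node.
cluster: Dict[str, List[int]] = {
    "node-a": list(range(0, 1000)),
    "node-b": list(range(1000, 2000)),
    "node-c": list(range(2000, 3000)),
}


def traditional_sum(nodes: Dict[str, List[int]]) -> Tuple[int, int]:
    """Traditional model: copy every record to one processor, then compute."""
    central: List[int] = []
    moved = 0
    for records in nodes.values():
        central.extend(records)   # the data travels to the program
        moved += len(records)
    return sum(central), moved


def data_local_sum(
    nodes: Dict[str, List[int]],
    program: Callable[[List[int]], int] = sum,
) -> Tuple[int, int]:
    """Hadoop-style model: run the program on each node, move only results."""
    partials = [program(records) for records in nodes.values()]
    return sum(partials), len(partials)


total_a, moved_a = traditional_sum(cluster)
total_b, moved_b = data_local_sum(cluster)
print(total_a == total_b)   # True: both models give the same answer
print(moved_a, moved_b)     # 3000 3: records moved vs partial results moved
```

Both approaches return the same total, but the data-local version moves one partial result per node instead of every record, which is the bottleneck the traditional model card describes.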
Distributions
_ is a _ of _ _ _ _ applications that have been tested to _ _
Prominent providers of distributions include…
Distribution, collection, open source Apache, work together
Cloudera
Hortonworks
Amazon
Google
MS Azure
A distribution is a collection of open source Apache applications that have been tested to work together
Prominent providers of distributions include
-Cloudera
-MS Azure
-Hortonworks
-Google
-Amazon
Hadoop
The _ _ software library is a _ that allows for the _ _ of large data sets across _ _ _ using simple _ _
Apache Hadoop, framework, distributed processing, clusters of computers, programming models
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
Hadoop Characteristics
_ data storage
inexpensive _
combines up to _ _ _ for _ performance
inexpensive
servers
1000, distributed servers, massive
Trends - Storage
Is a _
only getting _ and _ _
normalization vs _
Data schema on-_ vs _ on-write
data _
solid-_
commodity
cheaper, more abundant
denormalization
on-read, schema
lakes
state
is a commodity
only getting cheaper and more abundant
normalization vs denormalization
data schema on-read vs schema on-write
data lakes
solid state
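The schema on-read vs schema on-write trend can be illustrated with a small sketch. It assumes a hypothetical two-field schema and made-up clickstream records; `write_with_schema` and `read_with_schema` are illustrative helpers, not part of any Hadoop API.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical schema: field name -> function that converts the raw value.
SCHEMA: Dict[str, Callable[[Any], Any]] = {"user": str, "clicks": int}

raw_lines = [
    '{"user": "ana", "clicks": "3"}',
    '{"user": "bo", "clicks": "5", "referrer": "ad"}',  # extra field
]


def write_with_schema(line: str) -> Dict[str, Any]:
    """Schema on-write: validate and convert before storing; reject mismatches."""
    record = json.loads(line)
    if set(record) != set(SCHEMA):
        raise ValueError(f"record does not match schema: {sorted(record)}")
    return {field: cast(record[field]) for field, cast in SCHEMA.items()}


def read_with_schema(stored_line: str) -> Dict[str, Any]:
    """Schema on-read: the raw line was stored as-is; apply the schema now."""
    record = json.loads(stored_line)
    return {field: cast(record[field])
            for field, cast in SCHEMA.items() if field in record}


# Schema on-write refuses the second record at load time.
try:
    write_with_schema(raw_lines[1])
    rejected = False
except ValueError:
    rejected = True

# Schema on-read keeps the raw line and interprets it at query time.
parsed = [read_with_schema(line) for line in raw_lines]
print(rejected)
print(parsed)
```

Storing the raw lines untouched and interpreting them later is the idea behind a data lake: the extra `referrer` field is never lost, it is just ignored by this particular read.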
Trends - Memory
Is a _
only getting _ and _ _
the _ the _
In-memory _ _ from _ _ of _
_ _ needs, depending on the size of _, lots of _
is a commodity
only getting cheaper and more abundant
the more the merrier
In-memory computing benefiting from massive allocation of RAM
Hadoop NameNode needs, depending on the size of the cluster, lots of RAM
Distributed Processing
Cheaper to store _ _ of data using _ _ architecture
Think of _ _ on servers. At a large corporation there are massive quantities of _ _ _. They are used for analysis of _ _, _ _, _ _, _ _ and tuning, and more
Analyzing all of that _ stored data requires a _ _ for analysis
massive quantities, big data
log files, log files (petabytes), security breaches, clickstream analysis, website statistics, infrastructure analysis
cheaply, different application
-Cheaper to store massive quantities of data using big data architecture
-Think of log files on servers. At a large corporation there are massive quantities of log files (petabytes). They are used for analysis of security breaches, clickstream analysis, website statistics, infrastructure analysis and tuning, and more.
-Analyzing all of that cheaply stored data requires a different application for analysis
Hadoop Distributed File System
_ is the data storage layer for a _ _
Inexpensive reliable storage for _ _ _ _
uses low cost industry _ _
data is _ and _ to multiple _ of _
HDFS, Hadoop system
massive amounts of data
standard hardware
replicated, distributed, nodes, storage
HDFS is the data storage layer for a Hadoop system
Inexpensive reliable storage for massive amounts of data
uses low cost industry standard hardware
data is replicated and distributed to multiple nodes of storage
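Replication and distribution can be sketched as splitting a file into blocks and placing copies of each block on several nodes. This is a simplified illustration, not how HDFS itself chooses nodes: the round-robin placement, the tiny block size, and the node names are all assumptions made for the example.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (round robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement: Dict[int, List[str]],
                    failed_node: str) -> List[int]:
    """Return the blocks that still have a copy after one node fails."""
    return [block_id for block_id, holders in placement.items()
            if any(node != failed_node for node in holders)]


nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical node names
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), nodes, replication=3)

print([len(block) for block in blocks])      # [4, 4, 2]
print(placement[0])                          # ['node-a', 'node-b', 'node-c']
print(readable_blocks(placement, "node-a"))  # [0, 1, 2]
```

Because every block lives on three nodes, losing any single node leaves all blocks readable, which is what lets HDFS run reliably on low cost hardware.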
Hadoop application
HDFS is the _ _ _
-distributes _ _ across the cluster in a redundant manner
-Data is lost on _ _
YARN is _ _ _ _
-Manages cluster resources for the _ _ _
MapReduce
-Base code that handles all _ _
-Maps data to _/_ _
Hadoop file system
data blocks
cluster termination
Yet another resource negotiator
collection of applications
data processing
key/value pairs
HDFS is the Hadoop file system
-distributes data blocks across the cluster in a redundant manner
-data is lost on cluster termination
YARN is Yet Another Resource Negotiator
-manages cluster resources for the collection of applications
MapReduce
-base code that handles all data processing
-maps data to key/value pairs
Map Reduce
Mechanism for bringing the processing to _ _ _
Maps where data is stored on each _ _
contains a master job tracker managing _ _
uses the task tracker to execute tasks on each _ _
the stored data
HDFS node
task resources
HDFS node
Mechanism for bringing the processing to the stored data
Maps where data is stored on each HDFS node
contains a master job tracker managing task resources
uses the task tracker to execute tasks on each HDFS node
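The map to key/value pairs idea is easiest to see in a word count, the usual first MapReduce exercise. The sketch below runs in a single Python process, so it only imitates the phases; it does not use Hadoop, and the function names and sample log lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (key, value) pair of (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical log lines, one per stored record.
log_lines = ["GET /index GET /about", "POST /login GET /index"]
counts = word_count(log_lines)
print(counts["get"], counts["/index"], counts["post"])   # 3 2 1
```

In a real cluster the map calls would run on the nodes that hold each line, and only the grouped pairs would travel to the reducers.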
What is EMR?
EMR stands for _ _ _
EMR is a managed Hadoop service by _
AWS Distributions provide support for the most popular _ _ applications like _, _, _, _, and _
Elastic Map Reduce
AWS
open source
Spark, Hive, HDFS, Presto, and Flink
EMR Cluster Architecture
Master Node “leader node”
-manages _ _
-tracks status of _
-Monitors _ _
-Single _ _
Core Node
-Saves _ _
-Used in _ _ _
-Runs _
-can be scaled _ or _
Task Node
-runs _ _
-does not store _
-_ instances can be used
Master Node “leader node”
-manages the cluster
-tracks status of tasks
-monitors cluster health
-Single EC2 instance
Core Node
-Saves HDFS data
-Used in multi node clusters
-runs tasks
-can be scaled up or down
Task node
-runs tasks only
-does not store data
-spot instances can be used
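The three node types map onto the instance groups of a cluster request. The sketch below only builds a dictionary shaped like the keyword arguments of boto3's EMR `run_job_flow` call; it does not contact AWS. The cluster name, release label, instance types, counts, and the default role names are assumptions chosen for illustration and should be checked against your own account.

```python
from typing import Any, Dict


def build_emr_request(core_count: int = 2, task_count: int = 2) -> Dict[str, Any]:
    """Build keyword arguments shaped like an EMR run_job_flow request."""
    return {
        "Name": "flashcard-demo-cluster",   # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",       # assumed EMR release
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {   # master (leader) node: manages the cluster, single instance
                    "Name": "Master", "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m5.xlarge", "InstanceCount": 1,
                },
                {   # core nodes: store HDFS data and run tasks
                    "Name": "Core", "InstanceRole": "CORE",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m5.xlarge", "InstanceCount": core_count,
                },
                {   # task nodes: run tasks only, so spot capacity is an option
                    "Name": "Task", "InstanceRole": "TASK",
                    "Market": "SPOT",
                    "InstanceType": "m5.xlarge", "InstanceCount": task_count,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request()
for group in request["Instances"]["InstanceGroups"]:
    print(group["InstanceRole"], group["InstanceCount"], group["Market"])

# To launch for real (needs AWS credentials and incurs cost):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Putting only the task group on the spot market follows the cards above: task nodes hold no HDFS data, so losing one interrupts work but does not lose stored blocks.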