Hadoop Flashcards
Hadoop Application consists of
-Hadoop computing _
-Distributed _
-Hadoop _ _ _
-Hadoop _ _
Hadoop Computing Architecture
Distributed Approach
Hadoop Distributed File System
Hadoop File Operations
Current state of our world
-Data is exploding with _
-Social _
-Video _
-Photo _
-Wea___
-Internet of _
Data is exploding with rapid generation of data
Social media
Video streams
Photo libraries
Weather
Internet of things (IoT)
Value of Data
Which of these companies are data companies? Should companies track the value of data on the balance sheet?
More data beats ____
AI cannot run without ___
Google, Facebook, Amazon, Apple
More data beats better algorithms
AI cannot run without data
Traditional Data processing
-Traditionally computation was ___ with __ amounts of data
-earlier approaches increased __ with ___
-Traditionally computation was processor bound with small amounts of data
-Earlier approaches simply increased hardware with faster processors
Hadoop Computing
-Hadoop introduced a _ _ of bringing the program to the _ rather than the _ to the program
-Distributed data storage on ____
-Run applications where the ___
-Hadoop introduced a radical approach of bringing the program to the data rather than the data to the program
-Distributed data storage on multiple server nodes
-Run applications where the data resides
Hadoop Program
-Foundation of _
-Reliable and _
-Open source free + ____
-Primarily focused on ___
-Architected to not move _ around
-Uses __ with processing where the data is stored
Foundation of HDP
Reliable and scalable
Open Source Free + Cost to Support
Primarily focused on data storage
Architected to not move data around
Uses “data locality” with processing where the data is stored
Characteristics of Hadoop
-___ to storing and executing large data files
-HDFS file system has default redundancy of _
-Default block size is __
-Batch _
-Not very useful for _
-Read centric architecture for _
Distributed approach to storing and executing large data files
HDFS file system has default redundancy of 3
Default block size is 128 MB
Batch processing
Not very useful for OLTP
Read centric architecture for OLAP
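A minimal sketch of how these two defaults surface in code, assuming Hadoop's client libraries are on the classpath (dfs.replication and dfs.blocksize are Hadoop's actual property keys; the fallback values here mirror the defaults above):

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsDefaults {
    public static void main(String[] args) {
        // Loads the cluster configuration (core-site.xml, hdfs-site.xml)
        Configuration conf = new Configuration();
        // Each block is kept as 3 copies by default
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));
        // Default block size is 128 MB = 134,217,728 bytes
        System.out.println("dfs.blocksize   = " + conf.get("dfs.blocksize", "134217728"));
    }
}
```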
Hadoop capabilities
-Handles _, _, and _ data
-Schema on-_
-Scales linearly with more disks providing a ____ increase in storage capacity
-Scales _
-Hadoop is ___, avoiding __ as much as possible
-Example of normalized vs. de-normalized
Handles structured, semi-structured, and unstructured data
Schema on-read
Scales linearly with more disks providing almost a 1-to-1 increase in storage capacity
Scales horizontally
Hadoop is de-normalized, avoiding joins as much as possible
Example of normalized vs. de-normalized
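A minimal sketch of the contrast with hypothetical Java records: the normalized form needs a join on customerId to rebuild an order, while the de-normalized form repeats the customer name so every record stands alone.

```java
// Normalized (RDBMS style): two tables, reconstructed with a join on customerId
record Customer(int customerId, String name) {}
record Order(int orderId, int customerId, double amount) {}

// De-normalized (Hadoop style): one self-contained record per order,
// duplicating the customer name so no join is ever needed
record OrderDenormalized(int orderId, String customerName, double amount) {}
```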
MapReduce
___ the universal processing approach
__ updates all of the data by writing it to a new file every time
MapReduce is not good for updating _______
Approach is write _, read many ___
Analyzing historical weather records for the last sales year
MapReduce is the universal processing approach
MapReduce updates all of the data by writing it to a new file every time
MapReduce is not good for updating only some of the data
Approach is write once, read many times scenarios
Analyzing historical weather records for the last sales year
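A minimal MapReduce sketch for the weather example, assuming hypothetical input lines of the form year,temperature (the class names and input format are illustrative; the Mapper and Reducer base classes are Hadoop's own):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs where the data blocks live, emitting (year, temperature)
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        context.write(new Text(fields[0]),
                new IntWritable(Integer.parseInt(fields[1].trim())));
    }
}

// Reduce phase: receives every temperature for one year, keeps the maximum
class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
            max = Math.max(max, t.get());
        }
        context.write(year, new IntWritable(max));
    }
}
```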
Hadoop Application system
___ utilities supporting other Hadoop modules
___ distributed file system with high-throughput access
____ framework for job scheduling and cluster resource management
___ parallel processing of large data sets
Hadoop Common
HDFS
YARN
MapReduce
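A minimal driver sketch showing how the modules cooperate: Hadoop Common supplies the configuration classes, HDFS holds the input and output paths, YARN schedules the containers, and MapReduce runs the tasks (it reuses the hypothetical mapper and reducer sketched earlier):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        // Hadoop Common provides Configuration; YARN will schedule this job
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperatureDriver.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output both live in HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```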
Relational Database Systems
-Relational database management system _____
-Highly structured with ___
-Normalized using joins to ____
-Seek time increases more slowly than ____
-Predominantly scales __ with hardware
-Excels at write updates to only some of the data like an _______
Relational Database Management System (Oracle, DB2, Sybase, SQL Server)
Highly structured with schema on-write
Normalized using joins to reconstruct a dataset
Seek time is increasing more slowly than transfer rate (bandwidth)
Predominantly scales vertically with hardware
Excels at write updates to only some of the data like an address in a CRM system
Traditional RDBMS vs MapReduce
Data Size
Access Updates
Transactions
Structure
Integrity
Scaling
Data size: gigabytes vs. petabytes
Access: interactive and batch vs. batch
Updates: read and write many times vs. write once, read many times
Transactions: ACID vs. none
Structure: schema on-write vs. schema on-read
Integrity: high vs. low
Scaling: nonlinear vs. linear
Data storage in Hadoop
-Storage size is increasing __
-Read time is not increasing as fast as _
-How do you speed up read times?
-Disk failures are managed with multiple copies of __
-MapReduce re-assembles the data into a ___
Storage size is increasing, lowering the price
Read time is not increasing as fast as size
How do you speed up read times? Read from multiple distributed disks at the same time
Disk failures are managed with multiple copies of each record
MapReduce re-assembles the data into a file
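A back-of-the-envelope sketch of why parallel reads help, assuming a hypothetical 100 MB/s transfer rate per disk:

```java
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double dataMB = 1_000_000;   // 1 TB of data, expressed in MB
        double rateMBs = 100;        // assumed transfer rate of a single disk
        // One disk reads everything serially: ~10,000 s (almost 3 hours)
        System.out.printf("1 disk:    %,.0f s%n", dataMB / rateMBs);
        // 100 disks each read 1/100 of the data in parallel: ~100 s
        System.out.printf("100 disks: %,.0f s%n", dataMB / rateMBs / 100);
    }
}
```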
HDFS File System
-___ files across a network of computers, each with its own storage
-It is a ___ using data locality
-More complex than a ___
-Complexity is abstracted _ from user
-Hadoop users do not need to ___
Distributes files across a network of computers, each with its own storage
It is a distributed file system using data locality
More complex than a regular file system
Complexity is abstracted away from user
Hadoop users do not need to choose drives or server nodes
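A minimal sketch of that abstraction using Hadoop's FileSystem API (the path is hypothetical): the caller names a file, and HDFS decides which DataNodes and disks actually serve its blocks.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // FileSystem.get() returns the cluster's default filesystem (HDFS)
        FileSystem fs = FileSystem.get(new Configuration());
        // The user supplies only a path; no drive or server node is chosen
        try (FSDataInputStream in = fs.open(new Path("/data/weather.csv"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```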
Design of HDFS
Very large files (hundreds of megabytes, gigabytes, terabytes, even petabytes) -> streaming data access (write once) -> read many times
File layers in HDFS
-HDFS is a file system written in _
-Sits on top of a ____
-Provides ___ storage for massive amounts of data
HDFS is a filesystem written in Java
Sits on top of a native Linux filesystem
Provides redundant storage for massive amounts of data
File storage in HDFS
-HDFS performs best with a small number of ___
-Millions of large files versus billions of ___
-Files in HDFS are _ as we cannot modify an existing file
-Optimized for large files with data ___
HDFS performs best with a small number of large files
Millions of large files versus billions of small ones
Files in HDFS are Write Once as we cannot modify an existing file
Optimized for large files with data processed in large chunks
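A minimal sketch of the write-once model with Hadoop's FileSystem API (the path is hypothetical): create() always produces a new file, and the closest thing to an update is append(); there is no call for editing bytes in place.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/data/log.txt");
        // create() writes a brand-new file; existing bytes cannot be modified
        try (FSDataOutputStream out = fs.create(p)) {
            out.writeBytes("first and only write\n");
        }
        // append() only adds to the end, where the cluster enables it
        try (FSDataOutputStream out = fs.append(p)) {
            out.writeBytes("appended record\n");
        }
    }
}
```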