Hadoop Flashcards
Hadoop Application consists of
-Hadoop computing _
-Distributed _
-Hadoop _ _ _
-Hadoop _ _
Hadoop Computing Architecture
Distributed Approach
Hadoop Distributed File System
Hadoop File Operations
Current state of our world
-Data is exploding with _
-Social _
-Video _
-Photo _
-Wea___
-Internet of _
Data is exploding with rapid generation of data
Social media
Video streams
Photo libraries
Weather
Internet of Things (IoT)
Value of Data
Which of these companies are data companies? Should companies track the value of data on the balance sheet?
More data beats ____
AI cannot run without ___
Google, Facebook, Amazon, Apple
More data beats better algorithms
AI cannot run without data
Traditional Data processing
-Traditionally computation was ___ with __ amounts of data
-earlier approaches increased __ with ___
-Traditionally computation was processor bound with small amounts of data
-Earlier approaches simply increased hardware with faster processors
Hadoop Computing
-Hadoop introduced a _ _ of bringing the program to the _ rather than the _ to the program
-Distributed data storage on ____
-Run applications where the ___
-Hadoop introduced a radical approach of bringing the program to the data rather than the data to the program
-Distributed data storage on multiple server nodes
-Run applications where the data resides
Hadoop Program
-Foundation of _
-Reliable and _
-Open source free + ____
-Primarily focused on ___
-Architected to not move _ around
-Uses __ with processing where the data is stored
Foundation of HDP
Reliable and scalable
Open Source Free + Cost to Support
Primarily focused on data storage
Architected to not move data around
Uses “data locality” with processing where the data is stored
Characteristics of Hadoop
-___ to storing and executing large data files
-HDFS file system has default redundancy of _
-Default block size is __
-Batch _
-Not very useful for _
-Read centric architecture for _
Distributed approach to storing and executing large data files
HDFS file system has default redundancy of 3
Default block size is 128 MB
Batch processing
Not very useful for OLTP
Read centric architecture for OLAP
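The defaults on this card (3x redundancy, 128 MB blocks) translate directly into storage arithmetic. A minimal Python sketch, assuming only the numbers quoted above (this is illustration, not Hadoop code):

```python
# Sketch of HDFS storage arithmetic using the defaults from the card:
# 128 MB block size and a replication factor of 3.
BLOCK_SIZE_MB = 128   # default HDFS block size
REPLICATION = 3       # default redundancy

def block_count(file_size_mb):
    """Number of 128 MB blocks a file splits into (last block may be partial)."""
    return -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division

def raw_storage_mb(file_size_mb):
    """Total cluster storage one file consumes with 3x replication."""
    return file_size_mb * REPLICATION

# A 500 MB file occupies 4 blocks and 1500 MB of raw cluster storage.
print(block_count(500), raw_storage_mb(500))
```

So a file slightly over a block boundary still only pays for the bytes it holds, but every byte is stored three times across the cluster.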
Hadoop capabilities
-Handles _, _, and _ data
-Schema on-_
-Scales linearly with more disks providing a ____ increase in storage capacity
-Scales _
-Hadoop is ___, avoiding __ as much as possible
-Example of normalized vs. de-normalized
Handles structured, semi-structured, and unstructured data
Schema on-read
Scales linearly with more disks providing almost a 1-to-1 increase in storage capacity
Scales horizontally
Hadoop is de-normalized, avoiding joins as much as possible
Example of normalized vs. de-normalized
MapReduce
___ is the universal processing approach
__ updates all of the data by writing it to a new file every time
MapReduce is not good for updating _______
Approach is write _, read many ___
Analyzing historical weather records for the last sales year
MapReduce is the universal processing approach
MapReduce updates all of the data by writing it to a new file
every time
MapReduce is not good for updating only some of the data
Approach is write once, read many times scenarios
Analyzing historical weather records for the last sales year
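The MapReduce model on this card can be sketched with plain Python built-ins. This is illustrative only: a real Hadoop job uses the Java API and runs across the cluster, but the map, shuffle, and reduce phases are the same idea (word counting is the classic example, not taken from this card):

```python
# The MapReduce model in miniature: map emits key/value pairs,
# shuffle groups them by key, reduce aggregates each group.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """'Map' step: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """'Shuffle' (sort by key) then 'reduce' (sum each group)."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: sum(v for _, v in g) for k, g in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["rainy rainy sunny", "sunny rainy"]))
print(counts)  # {'rainy': 3, 'sunny': 2}
```

Note the write-once character: the output is a brand-new result, not an in-place update of the input, which is why MapReduce suits batch analysis rather than record updates.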
Hadoop Application system
___ utilities supporting other Hadoop
modules
___ distributed file system with high-throughput
____framework for job scheduling and cluster resource
management
___parallel processing of large data sets
Hadoop Common
HDFS
YARN
MapReduce
Relational Database Systems
-Relational database management system _____
-Highly structured with ___
-Normalized using joins to ____
-Seek time increases slower than ____
-Predominantly scales __ with hardware
-Excels at write updates to only some of the data like an _______
Relational Database Management System (Oracle, DB2,
Sybase, SQL Server)
Highly structured with schema on-write
Normalized using joins to reconstruct a dataset
Seek time increases slower than transfer rate (bandwidth)
Predominantly scales vertically with hardware
Excels at write updates to only some of the data like an address
in a CRM system
Traditional RDBMS vs MapReduce
Data Size
Access
Updates
Transactions
Structure
Integrity
Scaling
Data size - RDBMS: gigabytes; MapReduce: petabytes
Access - RDBMS: interactive and batch; MapReduce: batch
Updates - RDBMS: read and write many times; MapReduce: write once, read many times
Transactions - RDBMS: ACID; MapReduce: none
Structure - RDBMS: schema on write; MapReduce: schema on read
Integrity - RDBMS: high; MapReduce: low
Scaling - RDBMS: nonlinear; MapReduce: linear
Data storage in Hadoop
-Storage size is increasing __
-Read time is not increasing as fast as _
-How do you speed up read times?
-Disk failures are managed with multiple copies of __
-MapReduce re-assembles the data into a ___
Storage size is increasing, lowering the price
Read time is not increasing as fast as size
How do you speed up read times? Read from multiple
distributed disks at the same time
Disk failures are managed with multiple copies of each record
MapReduce re-assembles the data into a file
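The "read from multiple distributed disks at the same time" answer is simple arithmetic. A back-of-envelope sketch, assuming an illustrative 100 MB/s transfer rate per disk (my figure, not one from the cards):

```python
# Why distributed reads help: scan time shrinks roughly in proportion
# to the number of disks the file's blocks are spread across.
DISK_MBPS = 100  # assumed per-disk transfer rate for illustration

def read_seconds(file_size_mb, disks=1):
    """Idealized time to scan a file spread evenly over `disks` disks."""
    return file_size_mb / (DISK_MBPS * disks)

# Scanning 1 TB: one disk takes ~10,000 s (~2.8 hours);
# 100 disks in parallel take ~100 s.
print(read_seconds(1_000_000, 1))
print(read_seconds(1_000_000, 100))
```

The idealization ignores seek time and network overhead, but it captures why HDFS spreads a file's blocks across many nodes.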
HDFS File System
-___ files across a network of computers, each with its own storage
-It is a ___ using data locality
-More complex than a ___
-Complexity is abstracted _ from user
-Hadoop users do not need to ___
Distributes files across a network of computers, each with its
own storage
It is a distributed file system using data locality
More complex than a regular file system
Complexity is abstracted away from user
Hadoop users do not need to choose drives or server nodes
Design of HDFS
Very large files (100 megabytes, 100 gigabytes, 100 terabytes, petabytes) -> Streaming data (write once) -> read many times
File layers in HDFS
-HDFS is a file system written in _
-Sits on top of a ____
-Provides ___ storage for massive amounts of data
HDFS is a filesystem written in Java
Sits on top of a native Linux filesystem
Provides redundant storage for massive amounts of data
File storage in HDFS
-HDFS performs best with a small number of _
-Millions of large files versus billions of _
-Files in HDFS are _ as we cannot modify an existing file
-Optimized for large files with data ___
HDFS performs best with a small number of large files
Millions of large files versus billions of small ones
Files in HDFS are Write Once as we cannot modify an existing
file
Optimized for large files with data processed in large chunks
HDFS Limitations
-Response times contain __
-Filesystem metadata is held in ___ (not so for lots of ___)
-Writes are typically from a single _____
-Record locking is formally ____
Response times contain latency
Filesystem metadata is held in memory
Not good for lots of small files
Writes are typically from a single writer appending a file
Record locking is formally not supported
Storing Blocks on Data Nodes
Data files are split into ___ which are distributed at load time
Each block is replicated on ___
File can be larger than any ____
Data files are split into 128MB blocks which are distributed at
load time
Each block is replicated on multiple data nodes
File can be larger than any single storage disk
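The block splitting described on this card can be made concrete. A short sketch, assuming only the 128 MB default block size from the cards (illustration, not Hadoop code):

```python
# Computing the byte-offset boundaries a file would be split into
# with the 128 MB default HDFS block size.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def block_ranges(file_size):
    """Return (start, end) byte offsets for each block of a file."""
    offsets = range(0, file_size, BLOCK_SIZE)
    return [(start, min(start + BLOCK_SIZE, file_size)) for start in offsets]

# A 300 MB file splits into two full 128 MB blocks plus a 44 MB tail block.
ranges = block_ranges(300 * 1024 * 1024)
print(len(ranges))                     # 3
print(ranges[-1][1] - ranges[-1][0])   # 46137344 bytes (44 MB)
```

Because each block lands on a different node (and is replicated), the whole file can indeed be larger than any single storage disk.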
Hadoop Cluster
Cluster is a group of computers working together
Node is an individual server blade in the cluster
Daemon is a program running on a node
Hadoop Cluster Components
Three main components of a cluster
Components work together to provide distributed data
processing
NameNode
-NameNode stores __
-Manages the filesystem __
-Knows where every block is stored for ___
-Composed of 2 files:
1.___
2.___
NameNode stores metadata
Manages the filesystem namespace
Knows where every block is stored for every file
Composed of 2 files
Namespace image
Edit log
Characteristics of NameNode
-Is the single point of ___
-Does not ___
-Reads and loads ___ information in memory
-____ memory requirements
-Users do not interact with ___
Is the single point of failure
Does not store data
Reads and loads block/file information in memory
High RAM memory requirements
Users do not interact with nodes
Secondary NameNode
-NameNode daemon must be _____
-HDFS is set up for ____ with active and standby NameNodes
-Periodically merges the namespace ____
NameNode daemon must be running at all times
HDFS is set up for high availability with active and standby NameNodes
Periodically merges the namespace image and the edit log
DataNodes
-___ and ___ blocks when they are told to do so
-__ to NameNode the list of blocks they are storing
Store and retrieve blocks when they are told to do so
Report to NameNode the list of blocks they are storing
HDFS File Operations
-HDFS is a separate file system running on the big data cluster that lets you view and _____ directories and files
-Necessary to specify the file system using ___
-Many of the Hadoop file commands use the same command as ____
HDFS is a separate file system running on the big data cluster
that lets you view and manage your HDFS directories and files
Necessary to specify the file system using hadoop fs
Many of the Hadoop file commands use the same command as
Linux with a - (dash) preceding it
Hadoop File Storage
-File 031512 split into blocks___
-File 042313 split into
File 031512 split into blocks B1, B2 & B3
File 042313 split into blocks B4 & B5
Hadoop Block Identification
-Client asks NameNode for contents of ___
-NameNode responds it is found in blocks ___
Client asks NameNode for contents of 042313
NameNode responds it is found in blocks B4 & B5
Hadoop File Retrieval
-Client begins retrieval attempts from ____
-Client begins retrieval attempts from ___
-Data is loaded directly from the ______
Client begins retrieval attempts from Nodes A, B & E
Client begins retrieval attempts from Nodes C, E & D
Data is loaded directly from the DataNode to the client
Storing Files with HDFS
-Storage of files larger than an ___
-Blocks are large to reduce the amount of ____
-File metadata is stored in ____
-If a block replica is corrupt a ___
-1 MB file size only consumes _____
Storage of files larger than an entire hard drive
Blocks are large to reduce the amount of seek time
File metadata is stored in another system location
If a block replica is corrupt a replica is selected
1 MB file size only consumes 1 MB of block space
File Writes
File Reads
look at pic
DataNode Replication
-____ placed on the same node as client
-____ placed on a different rack chosen at random
-____ is placed on the same rack as the ___, on a ___, selected at ___
First replica placed on the same node as the client
Second replica is placed on a different rack chosen at random
Third replica is placed on the same rack as the second, on a different node, selected at random
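The three-step placement policy above can be sketched as a small simulation. The rack and node names are made up for the example; this illustrates the policy, it is not NameNode code:

```python
# Simulating the default replica-placement policy:
# 1st replica on the client's node, 2nd on a random different rack,
# 3rd on a different node in the same rack as the 2nd.
import random

def place_replicas(cluster, client_node, rng=random):
    """cluster: {rack: [node, ...]}. Returns three (rack, node) placements."""
    first_rack = next(r for r, nodes in cluster.items() if client_node in nodes)
    first = (first_rack, client_node)                      # same node as client
    other_racks = [r for r in cluster if r != first_rack]
    second_rack = rng.choice(other_racks)                  # different rack
    second = (second_rack, rng.choice(cluster[second_rack]))
    remaining = [n for n in cluster[second_rack] if n != second[1]]
    third = (second_rack, rng.choice(remaining))           # same rack, other node
    return [first, second, third]

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas(cluster, "n1"))
```

The design balances safety and bandwidth: losing one rack can never destroy all three replicas, yet two of the three writes stay within a single rack.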
List the Home Directory
-_____ lists the contents of the user home directory
-_____ lists the contents of the Hadoop root directory
-The __ begins at the Hadoop root
-____ spells out the entire directory structure
hadoop fs -ls lists the contents of the user home directory
hadoop fs -ls / lists the contents of the Hadoop root directory
The / begins at the Hadoop root
-ls -R spells out the entire directory structure
Making Directories
-_____ makes a new directory in hadoop
-The new directory is called ____
-A path without a _ begins at the home directory
hadoop fs -mkdir makes a new directory in hadoop
The new directory is called jdb101000
A path without a leading / begins at the home directory
Put a File
-___ copies a file from the Linux file system to the Hadoop file system
-_____ is the syntax
-____ denotes the Linux source
-Destination is the ____ directory in Hadoop
-put copies a file from the Linux file system to the Hadoop
file system
hadoop fs -put <source> <destination> is the syntax
~/foodsales.csv denotes the Linux source
Destination is the /jdb101000 directory in Hadoop
Get a File
-____
-____ is the syntax
-____ denotes the Linux destination
-Source is the ___ file in the Hadoop directory
-get copies a file from the Hadoop file system to the Linux file
system
hadoop fs -get <source> <destination> is the syntax
~/outbound.csv denotes the Linux destination
Source is the foodsales.csv file in the Hadoop directory
Copy a directory
-___ copies one directory and creates another
-_____ is the syntax
-_____ is the source directory
-___ is the new destination directory
-cp copies one directory and creates another
hadoop fs -cp <source> <destination> is the syntax
/jdb101000 is the source directory
/production is the new destination directory
Copy a File
-___ copies one file and creates another
-____ is the syntax
-____ is the directory for both files
-___ is being copied to ____
-cp copies one file and creates another
hadoop fs -cp <source> <destination> is the syntax
/jdb101000 is the directory for both files
foodsales.csv is being copied to newfile.csv
Reviewing a File
-____ displays the contents of a file
-____can be edited
-____ will exit the command
-cat <filename> displays the contents of a file
File can be edited
Control-D will exit the command
Moving
-____ moves a file or directory to a new location
-____ is the syntax
-_____ is the source directory
-That will be moved into ___ directory
-mv moves a file or directory to a new location
hadoop fs -mv <source> <destination> is the syntax
/production is the source directory
That will be moved into the /jdb101000 directory
Removing
-___ removes files and directories
-_____ is the syntax
-___ option is required to remove a directory
-The ____ directory will be removed
-rm removes files and directories
hadoop fs -rm <object> is the syntax
-r option is required to remove a directory
The /production directory will be removed
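Every card in this section shares one command shape: `hadoop fs -<command> <arguments>`. As a study aid, here is a tiny hypothetical Python helper that assembles those command lines as strings; the command names match the cards, but nothing here actually talks to a cluster:

```python
# Hypothetical helper: build `hadoop fs` command lines as strings.
# Useful for seeing that put/get/cp/mv/rm all follow the same pattern.
def hadoop_fs(cmd, *args):
    """Assemble a `hadoop fs -<cmd> ...` command line."""
    return " ".join(["hadoop", "fs", f"-{cmd}", *args])

print(hadoop_fs("put", "~/foodsales.csv", "/jdb101000"))
# hadoop fs -put ~/foodsales.csv /jdb101000
print(hadoop_fs("rm", "-r", "/production"))
# hadoop fs -rm -r /production
```

Noticing this shared shape makes the individual cards easier to memorize: only the command name and the source/destination arguments change.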
HDFS Recommendations
-___ is a repository for your ___
-Best practices:
-Define a ____
-Include ____ for staging data
Examples
-____ data and configuration belonging to a single user
-___ work in progress in Extract/Transform/Load stage
-___ temporary generated data shared between users
-____ data sets that are processed and available
HDFS is a repository for your big data files
Best practices
Define a standard directory structure
Include separate locations for staging data
Examples
/user - data and configuration belonging to a single user
/etl - work in progress in Extract/Transform/Load stage
/tmp - temporary generated data shared between users
/data - data sets that are processed and available
Summary
Hadoop is a reliable distributed architecture for computing
HDFS is the storage layer file system for Hadoop
HDFS assigns three redundant file blocks to separate nodes and
distributes them across a cluster
Uses a system of NameNodes and DataNodes organized with
daemons
HDFS is accessed using file commands similar to Linux