Hadoop Flashcards
Hadoop Application consists of
-Hadoop computing _
-Distributed _
-Hadoop _ _ _
-Hadoop _ _
Hadoop Computing Architecture
Distributed Approach
Hadoop Distributed File System
Hadoop File Operations
Current state of our world
-Data is exploding with _
-Social _
-Video _
-Photo _
-Wea___
-Internet of _
Data is exploding with rapid generation of data
Social media
Video streams
Photo libraries
Weather
Internet of Things (IoT)
Value of Data
Which of these companies are data companies? Should companies track the value of data on the balance sheet?
More data beats ____
AI cannot run without ___
Google, Facebook, Amazon, Apple
More data beats better algorithms
AI cannot run without data
Traditional Data processing
-Traditionally computation was ___ with __ amounts of data
-earlier approaches increased __ with ___
-Traditionally computation was processor bound with small amounts of data
-Earlier approaches simply increased hardware with faster processors
Hadoop Computing
-Hadoop introduced a _ _ of bringing the program to the _ rather than the _ to the program
-Distributed data storage on ____
-Run applications where the ___
-Hadoop introduced a radical approach of bringing the program to the data rather than the data to the program
-Distributed data storage on multiple server nodes
-Run applications where the data resides
Hadoop Program
-Foundation of _
-Reliable and _
-Open source free + ____
-Primarily focused on ___
-Architected to not move _ around
-Uses __ with processing where the data is stored
Foundation of HDP
Reliable and scalable
Open Source Free + Cost to Support
Primarily focused on data storage
Architected to not move data around
Uses “data locality” with processing where the data is stored
Characteristics of Hadoop
-___ to storing and executing large data files
-HDFS file system has default redundancy of _
-Default block size is __
-Batch _
-Not very useful for _
-Read centric architecture for _
Distributed approach to storing and executing large data files
HDFS file system has default redundancy of 3
Default block size is 128 MB
Batch processing
Not very useful for OLTP
Read centric architecture for OLAP
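The defaults on this card (3x redundancy, 128 MB blocks) translate directly into storage arithmetic. A minimal Python sketch, assuming only the numbers quoted above (this is illustration, not Hadoop code):

```python
# Sketch of HDFS storage arithmetic using the defaults from the card:
# 128 MB block size and a replication factor of 3.
BLOCK_SIZE_MB = 128   # default HDFS block size
REPLICATION = 3       # default redundancy

def block_count(file_size_mb):
    """Number of 128 MB blocks a file splits into (last block may be partial)."""
    return -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division

def raw_storage_mb(file_size_mb):
    """Total cluster storage one file consumes with 3x replication."""
    return file_size_mb * REPLICATION

# A 500 MB file occupies 4 blocks and 1500 MB of raw cluster storage.
print(block_count(500), raw_storage_mb(500))
```

So a file slightly over a block boundary still only pays for the bytes it holds, but every byte is stored three times across the cluster.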
Hadoop capabilities
-Handles _, _, and _ data
-Schema on-_
-Scales linearly with more disks providing a ____ increase in storage capacity
-Scales _
-Hadoop is ___, avoiding __ as much as possible
-Example of normalized vs. de-normalized
Handles structured, semi-structured, and unstructured data
Schema on-read
Scales linearly with more disks providing almost a 1-to-1 increase in storage capacity
Scales horizontally
Hadoop is de-normalized, avoiding joins as much as possible
Example of normalized vs. de-normalized
MapReduce
___ is the universal processing approach
__ updates all of the data by writing it to a new file every time
MapReduce is not good for updating _______
Approach is write _, read many ___
Analyzing historical weather records for the last sales year
MapReduce is the universal processing approach
MapReduce updates all of the data by writing it to a new file
every time
MapReduce is not good for updating only some of the data
Approach is write once, read many times scenarios
Analyzing historical weather records for the last sales year
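The MapReduce model on this card can be sketched with plain Python built-ins. This is illustrative only: a real Hadoop job uses the Java API and runs across the cluster, but the map, shuffle, and reduce phases are the same idea (word counting is the classic example, not taken from this card):

```python
# The MapReduce model in miniature: map emits key/value pairs,
# shuffle groups them by key, reduce aggregates each group.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """'Map' step: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """'Shuffle' (sort by key) then 'reduce' (sum each group)."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: sum(v for _, v in g) for k, g in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["rainy rainy sunny", "sunny rainy"]))
print(counts)  # {'rainy': 3, 'sunny': 2}
```

Note the write-once character: the output is a brand-new result, not an in-place update of the input, which is why MapReduce suits batch analysis rather than record updates.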
Hadoop Application system
___ utilities supporting other Hadoop
modules
___ distributed file system with high-throughput
____framework for job scheduling and cluster resource
management
___parallel processing of large data sets
Hadoop Common
HDFS
YARN
MapReduce
Relational Database Systems
-Relational database management system _____
-Highly structured with ___
-Normalized using joins to ____
-Seek time increases slower than ____
-Predominantly scales __ with hardware
-Excels at write updates to only some of the data like an _______
Relational Database Management System (Oracle, DB2,
Sybase, SQL Server)
Highly structured with schema on-write
Normalized using joins to reconstruct a dataset
Seek time increases slower than transfer rate (bandwidth)
Predominantly scales vertically with hardware
Excels at write updates to only some of the data like an address
in a CRM system
Traditional RDBMS vs MapReduce
Data Size
Access
Updates
Transactions
Structure
Integrity
Scaling
Data size - RDBMS: gigabytes; MapReduce: petabytes
Access - RDBMS: interactive and batch; MapReduce: batch
Updates - RDBMS: read and write many times; MapReduce: write once, read many times
Transactions - RDBMS: ACID; MapReduce: none
Structure - RDBMS: schema on write; MapReduce: schema on read
Integrity - RDBMS: high; MapReduce: low
Scaling - RDBMS: nonlinear; MapReduce: linear
Data storage in Hadoop
-Storage size is increasing __
-Read time is not increasing as fast as _
-How do you speed up read times?
-Disk failures are managed with multiple copies of __
-MapReduce re-assembles the data into a ___
Storage size is increasing, lowering the price
Read time is not increasing as fast as size
How do you speed up read times? Read from multiple
distributed disks at the same time
Disk failures are managed with multiple copies of each record
MapReduce re-assembles the data into a file
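The "read from multiple distributed disks at the same time" answer is simple arithmetic. A back-of-envelope sketch, assuming an illustrative 100 MB/s transfer rate per disk (my figure, not one from the cards):

```python
# Why distributed reads help: scan time shrinks roughly in proportion
# to the number of disks the file's blocks are spread across.
DISK_MBPS = 100  # assumed per-disk transfer rate for illustration

def read_seconds(file_size_mb, disks=1):
    """Idealized time to scan a file spread evenly over `disks` disks."""
    return file_size_mb / (DISK_MBPS * disks)

# Scanning 1 TB: one disk takes ~10,000 s (~2.8 hours);
# 100 disks in parallel take ~100 s.
print(read_seconds(1_000_000, 1))
print(read_seconds(1_000_000, 100))
```

The idealization ignores seek time and network overhead, but it captures why HDFS spreads a file's blocks across many nodes.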
HDFS File System
-___ files across a network of computers, each with its own storage
-It is a ___ using data locality
-More complex than a ___
-Complexity is abstracted _ from user
-Hadoop users do not need to ___
Distributes files across a network of computers, each with its
own storage
It is a distributed file system using data locality
More complex than a regular file system
Complexity is abstracted away from user
Hadoop users do not need to choose drives or server nodes
Design of HDFS
Very large files (100 megabytes, 100 gigabytes, 100 terabytes, petabytes) -> Streaming data (write once) -> read many times
File layers in HDFS
-HDFS is a file system written in _
-Sits on top of a ____
-Provides ___ storage for massive amounts of data
HDFS is a filesystem written in Java
Sits on top of a native Linux filesystem
Provides redundant storage for massive amounts of data
File storage in HDFS
-HDFS performs best with a small number of _
-Millions of large files versus billions of _
-Files in HDFS are _ as we cannot modify an existing file
-Optimized for large files with data ___
HDFS performs best with a small number of large files
Millions of large files versus billions of small ones
Files in HDFS are Write Once as we cannot modify an existing
file
Optimized for large files with data processed in large chunks
HDFS Limitations
-Response times contain __
-Filesystem metadata is held in ___ (not so for lots of ___)
-Writes are typically from a single _____
-Record locking is formally ____
Response times contain latency
Filesystem metadata is held in memory
Not good for lots of small files
Writes are typically from a single writer appending a file
Record locking is formally not supported
Storing Blocks on Data Nodes
Data files are split into ___ which are distributed at load time
Each block is replicated on ___
File can be larger than any ____
Data files are split into 128MB blocks which are distributed at
load time
Each block is replicated on multiple data nodes
File can be larger than any single storage disk
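The block splitting described on this card can be made concrete. A short sketch, assuming only the 128 MB default block size from the cards (illustration, not Hadoop code):

```python
# Computing the byte-offset boundaries a file would be split into
# with the 128 MB default HDFS block size.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def block_ranges(file_size):
    """Return (start, end) byte offsets for each block of a file."""
    offsets = range(0, file_size, BLOCK_SIZE)
    return [(start, min(start + BLOCK_SIZE, file_size)) for start in offsets]

# A 300 MB file splits into two full 128 MB blocks plus a 44 MB tail block.
ranges = block_ranges(300 * 1024 * 1024)
print(len(ranges))                     # 3
print(ranges[-1][1] - ranges[-1][0])   # 46137344 bytes (44 MB)
```

Because each block lands on a different node (and is replicated), the whole file can indeed be larger than any single storage disk.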
Hadoop Cluster
Cluster is a group of computers working together
Node is an individual server blade in the cluster
Daemon is a program running on a node
Hadoop Cluster Components
Three main components of a cluster
Components work together to provide distributed data
processing
NameNode
-NameNode stores __
-Manages the filesystem __
-Knows where every block is stored for ___
-Composed of 2 files:
1.___
2.___
NameNode stores metadata
Manages the filesystem namespace
Knows where every block is stored for every file
Composed of 2 files
Namespace image
Edit log
Characteristics of NameNode
-Is the single point of ___
-Does not ___
-Reads and loads ___ information in memory
-____ memory requirements
-Users do not interact with ___
Is the single point of failure
Does not store data
Reads and loads block/file information in memory
High RAM memory requirements
Users do not interact with nodes
Secondary NameNode
-NameNode daemon must be _____
-HDFS is set up for ____ with active and standby NameNodes
-Periodically merges the namespace ____
NameNode daemon must be running at all times
HDFS is set up for high availability with active and standby NameNodes
Periodically merges the namespace image and the edit log
DataNodes
-___ and ___ blocks when they are told to do so
-__ to NameNode the list of blocks they are storing
Store and retrieve blocks when they are told to do so
Report to NameNode the list of blocks they are storing
HDFS File Operations
-HDFS is a separate file system running on the big data cluster that lets you view and _____ directories and files
-Necessary to specify the file system using ___
-Many of the Hadoop file commands use the same command as ____
HDFS is a separate file system running on the big data cluster
that lets you view and manage your HDFS directories and files
Necessary to specify the file system using hadoop fs
Many of the Hadoop file commands use the same command as
Linux with a - (dash) preceding it
Hadoop File Storage
-File 031512 split into blocks___
-File 042313 split into
File 031512 split into blocks B1, B2 & B3
File 042313 split into blocks B4 & B5
Hadoop Block Identification
-Client asks NameNode for contents of ___
-NameNode responds it is found in blocks ___
Client asks NameNode for contents of 042313
NameNode responds it is found in blocks B4 & B5
Hadoop File Retrieval
-Client begins retrieval attempts from ____
-Client begins retrieval attempts from ___
-Data is loaded directly from the ______
Client begins retrieval attempts from Nodes A, B & E
Client begins retrieval attempts from Nodes C, E & D
Data is loaded directly from the DataNode to the client
Storing Files with HDFS
-Storage of files larger than an ___
-Blocks are large to reduce the amount of ____
-File metadata is stored in ____
-If a block replica is corrupt a ___
-1 MB file size only consumes _____
Storage of files larger than an entire hard drive
Blocks are large to reduce the amount of seek time
File metadata is stored in another system location
If a block replica is corrupt a replica is selected
1 MB file size only consumes 1 MB of block space
File Writes
File Reads
look at pic
DataNode Replication
-____ placed on the same node as client
-____ placed on a different rack chosen at random
-____ is placed on the same rack as the ___, on a ___, selected at ___
First replica placed on the same node as the client
Second replica is placed on a different rack chosen at random
Third replica is placed on the same rack as the second, on a different node, selected at random
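The three-step placement policy above can be sketched as a small simulation. The rack and node names are made up for the example; this illustrates the policy, it is not NameNode code:

```python
# Simulating the default replica-placement policy:
# 1st replica on the client's node, 2nd on a random different rack,
# 3rd on a different node in the same rack as the 2nd.
import random

def place_replicas(cluster, client_node, rng=random):
    """cluster: {rack: [node, ...]}. Returns three (rack, node) placements."""
    first_rack = next(r for r, nodes in cluster.items() if client_node in nodes)
    first = (first_rack, client_node)                      # same node as client
    other_racks = [r for r in cluster if r != first_rack]
    second_rack = rng.choice(other_racks)                  # different rack
    second = (second_rack, rng.choice(cluster[second_rack]))
    remaining = [n for n in cluster[second_rack] if n != second[1]]
    third = (second_rack, rng.choice(remaining))           # same rack, other node
    return [first, second, third]

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas(cluster, "n1"))
```

The design balances safety and bandwidth: losing one rack can never destroy all three replicas, yet two of the three writes stay within a single rack.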
List the Home Directory
-_____ lists the contents of the user home directory
-_____ lists the contents of the Hadoop root directory
-The __ begins at the Hadoop root
-____ spells out the entire directory structure
hadoop fs -ls lists the contents of the user home directory
hadoop fs -ls / lists the contents of the Hadoop root directory
The / begins at the Hadoop root
-ls -R spells out the entire directory structure
Making Directories
-_____ makes a new directory in hadoop
-The new directory is called ____
-A path without a _ begins at the home directory
hadoop fs -mkdir makes a new directory in hadoop
The new directory is called jdb101000
A path without a leading / begins at the home directory
Put a File
-___ copies a file from the Linux file system to the Hadoop file system
-_____ is the syntax
-____ denotes the Linux source
-Destination is the ____ directory in Hadoop
-put copies a file from the Linux file system to the Hadoop
file system
hadoop fs -put <source> <destination> is the syntax
~/foodsales.csv denotes the Linux source
Destination is the /jdb101000 directory in Hadoop
Get a File
-____
-____ is the syntax
-____ denotes the Linux destination
-Source is the ___ file in the Hadoop directory
-get copies a file from the Hadoop file system to the Linux file
system
hadoop fs -get <source> <destination> is the syntax
~/outbound.csv denotes the Linux destination
Source is the foodsales.csv file in the Hadoop directory
Copy a directory
-___ copies one directory and creates another
-_____ is the syntax
-_____ is the source directory
-___ is the new destination directory
-cp copies one directory and creates another
hadoop fs -cp <source> <destination> is the syntax
/jdb101000 is the source directory
/production is the new destination directory
Copy a File
-___ copies one file and creates another
-____ is the syntax
-____ is the directory for both files
-___ is being copied to ____
-cp copies one file and creates another
hadoop fs -cp <source> <destination> is the syntax
/jdb101000 is the directory for both files
foodsales.csv is being copied to newfile.csv
Reviewing a File
-____ displays the contents of a file
-____can be edited
-____ will exit the command
-cat <filename> displays the contents of a file
File can be edited
Control-D will exit the command
Moving
-____ moves a file or directory to a new location
-____ is the syntax
-_____ is the source directory
-That will be moved into ___ directory
-mv moves a file or directory to a new location
hadoop fs -mv <source> <destination> is the syntax
/production is the source directory
That will be moved into the /jdb101000 directory
Removing
-___ removes files and directories
-_____ is the syntax
-___ option is required to remove a directory
-The ____ directory will be removed
-rm removes files and directories
hadoop fs -rm <object> is the syntax
-r option is required to remove a directory
The /production directory will be removed
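Every card in this section shares one command shape: `hadoop fs -<command> <arguments>`. As a study aid, here is a tiny hypothetical Python helper that assembles those command lines as strings; the command names match the cards, but nothing here actually talks to a cluster:

```python
# Hypothetical helper: build `hadoop fs` command lines as strings.
# Useful for seeing that put/get/cp/mv/rm all follow the same pattern.
def hadoop_fs(cmd, *args):
    """Assemble a `hadoop fs -<cmd> ...` command line."""
    return " ".join(["hadoop", "fs", f"-{cmd}", *args])

print(hadoop_fs("put", "~/foodsales.csv", "/jdb101000"))
# hadoop fs -put ~/foodsales.csv /jdb101000
print(hadoop_fs("rm", "-r", "/production"))
# hadoop fs -rm -r /production
```

Noticing this shared shape makes the individual cards easier to memorize: only the command name and the source/destination arguments change.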
HDFS Recommendations
-___ is a repository for your ___
-Best practices:
-Define a ____
-Include ____ for staging data
Examples
-____ data and configuration belonging to a single user
-___ work in progress in Extract/Transform/Load stage
-___ temporary generated data shared between users
-____ data sets that are processed and available
HDFS is a repository for your big data files
Best practices
Define a standard directory structure
Include separate locations for staging data
Examples
/user - data and configuration belonging to a single user
/etl - work in progress in Extract/Transform/Load stage
/tmp - temporary generated data shared between users
/data - data sets that are processed and available
Summary
Hadoop is a reliable distributed architecture for computing
HDFS is the storage layer file system for Hadoop
HDFS assigns three redundant file blocks to separate nodes and
distributes them across a cluster
Uses a system of NameNodes and DataNodes organized with
daemons
HDFS is accessed using file commands similar to Linux