Hadoop Flashcards

1
Q

Hadoop Application consists of
-Hadoop computing _
-Distributed _
-Hadoop _ _ _
-Hadoop _ _

A

Hadoop Computing Architecture
Distributed Approach
Hadoop Distributed File System
Hadoop File Operations

2
Q

Current state of our world
-Data is exploding with _
-Social _
-Video _
-Photo _
-Wea_
-Internet of _

A

Data is exploding with rapid generation of data
Social media
Video streams
Photo libraries
Weather
Internet of Things (IoT)

3
Q

Value of Data
Which of these companies are data companies? Should companies track the value of data on the balance sheet?
More data beats ____
AI cannot run without ___

A

Google, Facebook, Amazon, Apple
More data beats better algorithms
AI cannot run without data

4
Q

Traditional Data processing
-Traditionally computation was ___ with __ amounts of data
-earlier approaches increased __ with ___

A

-Traditionally computation was processor bound with small amounts of data
-Earlier approaches simply increased hardware with faster processors

5
Q

Hadoop Computing
-Hadoop introduced a _ _ of bringing the program to the _ rather than the _ to the program
-Distributed data storage on ____
-Run applications where the ___

A

-Hadoop introduced a radical approach of bringing the program to the data rather than the data to the program
-Distributed data storage on multiple server nodes
-Run applications where the data resides

6
Q

Hadoop Program
-Foundation of _
-Reliable and _
-Open source free + ____
-Primarily focused on ___
-Architected to not move _ around
-Uses __ with processing where the data is stored

A

Foundation of HDP
Reliable and scalable
Open source: free + cost to support
Primarily focused on data storage
Architected to not move data around
Uses “data locality” with processing where the data is stored

7
Q

Characteristics of Hadoop
-___ to storing and executing large data files
-HDFS file system has default redundancy of _
-Default block size is __
-Batch _
-Not very useful for _
-Read centric architecture for _

A

Distributed approach to storing and executing large data files
HDFS file system has a default redundancy (replication factor) of 3
Default block size is 128 MB
Batch processing
Not very useful for OLTP
Read centric architecture for OLAP

8
Q

Hadoop capabilities
-Handles _, _, and _ data
-Schema on-_
-Scales linearly with more disks providing a ____ increase in storage capacity
-Scales _
-Hadoop is ___, avoiding __ as much as possible
-Example of normalized vs. de-normalized

A

Handles structured, semi-structured, and unstructured data
Schema on-read
Scales linearly with more disks providing almost a 1-to-1 increase in storage capacity
Scales horizontally
Hadoop is de-normalized, avoiding joins as much as possible
Example of normalized vs. de-normalized
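
A minimal sketch of the normalized vs. de-normalized idea. The tables, field names, and values are hypothetical and not taken from the cards. A normalized design keeps customers and orders separate and joins them at query time; a de-normalized design copies the customer fields onto every order record so reads need no join.

```python
# Normalized: two tables linked by customer_id (hypothetical sample data).
customers = {1: {"name": "Ana", "city": "Austin"},
             2: {"name": "Bo", "city": "Boston"}}
orders = [{"order_id": 10, "customer_id": 1, "amount": 25.0},
          {"order_id": 11, "customer_id": 2, "amount": 40.0},
          {"order_id": 12, "customer_id": 1, "amount": 15.0}]


def denormalize(orders, customers):
    """Join once up front and store the customer fields on every order."""
    return [{**order, **customers[order["customer_id"]]} for order in orders]


flat_orders = denormalize(orders, customers)

# With the flat records, a per-city total needs no join at read time.
totals = {}
for row in flat_orders:
    totals[row["city"]] = totals.get(row["city"], 0.0) + row["amount"]
```

The trade-off is duplicated customer data in exchange for simple sequential reads, which suits a batch, read-centric system.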

9
Q

MapReduce
-___ the universal processing approach
-__ updates all of the data by writing it to a new file every time
-MapReduce is not good for updating _______
-Approach is write _, read many ___
-Example: analyzing historical weather records for the last sales year

A

MapReduce is the universal processing approach
MapReduce updates all of the data by writing it to a new file every time
MapReduce is not good for updating only some of the data
Approach suits write-once, read-many-times scenarios
Example: analyzing historical weather records for the last sales year
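
The map, shuffle, and reduce steps can be sketched in plain Python. This is an illustration only, not the Hadoop MapReduce API; the record format and values are made up, using the weather-records example from the card.

```python
from collections import defaultdict

# Hypothetical input: one "year,temperature" reading per line.
records = ["2019,31", "2020,35", "2019,38", "2020,29", "2021,33"]


def map_phase(line):
    """Map: emit a (year, temperature) pair for one input record."""
    year, temp = line.split(",")
    return year, int(temp)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(year, temps):
    """Reduce: collapse one key's values to a single result."""
    return year, max(temps)


grouped = shuffle(map_phase(line) for line in records)
# The result is a new dataset; the input records are never modified.
max_by_year = dict(reduce_phase(y, t) for y, t in grouped.items())
```

Note that the output is written as a fresh result rather than patched into the input, which matches the write-once, read-many pattern.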

10
Q

Hadoop Application system
-___ utilities supporting other Hadoop modules
-___ distributed file system with high throughput
-___ framework for job scheduling and cluster resource management
-___ parallel processing of large data sets

A

Hadoop Common
HDFS
YARN
MapReduce

11
Q

Relational Database Systems
-Relational database management system _____
-Highly structured with ___
-Normalized using joins to ____
-Seek time increasing slower than ____
-Predominantly scales __ with hardware
-Excels at write updates to only some of the data like an _______

A

Relational Database Management System (Oracle, DB2, Sybase, SQL Server)
Highly structured with schema on-write
Normalized using joins to reconstruct a dataset
Seek time is improving more slowly than transfer rate (bandwidth)
Predominantly scales vertically with hardware
Excels at write updates to only some of the data, like an address in a CRM system

12
Q

Traditional RDBMS vs MapReduce

Data size
Access
Updates
Transactions
Structure
Integrity
Scaling

A

Attribute: Traditional RDBMS / MapReduce
Data size: gigabytes / petabytes
Access: interactive and batch / batch
Updates: read and write many times / write once, read many times
Transactions: ACID / none
Structure: schema on-write / schema on-read
Integrity: high / low
Scaling: nonlinear / linear

13
Q

Data storage in Hadoop
-Storage size is increasing __
-Read time is not increasing as fast as _
-How do you speed up read times?
-Disk failures are managed with multiple copies of __
-MapReduce re-assembles the data into a ___

A

Storage size is increasing, lowering the price
Read time is not increasing as fast as size
How do you speed up read times? Read from multiple distributed disks at the same time
Disk failures are managed with multiple copies of each record (stored as replicated blocks)
MapReduce re-assembles the data into a file
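
A back-of-the-envelope sketch of why reading from many disks at once helps. The 1 TB dataset size and 100 MB/s transfer rate are illustrative assumptions, not figures from the cards, and the model ignores seek time and network overhead.

```python
def read_seconds(data_mb, transfer_mb_per_s, disks=1):
    """Time to read data_mb spread evenly over `disks` drives read in parallel."""
    return data_mb / (transfer_mb_per_s * disks)


ONE_TB_MB = 1_000_000          # 1 TB expressed in MB (decimal units)
single_disk = read_seconds(ONE_TB_MB, 100)           # one drive
hundred_disks = read_seconds(ONE_TB_MB, 100, 100)    # 100 drives in parallel
```

Under these assumptions one drive needs 10,000 seconds (close to three hours), while 100 drives reading in parallel need about 100 seconds.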

14
Q

HDFS File System
-___ files across a network of computers, each with its own storage
-It is a ___ using data locality
-More complex than a ___
-Complexity is abstracted _ from user
-Hadoop users do not need to ___

A

Distributes files across a network of computers, each with its own storage
It is a distributed file system using data locality
More complex than a regular file system
Complexity is abstracted away from the user
Hadoop users do not need to choose drives or server nodes

15
Q

Design of HDFS

A

Very large files (100 megabytes, 100 gigabytes, 100 terabytes, up to petabytes)
Streaming data access: write once, read many times

16
Q

File layers in HDFS
-HDFS is a file system written in _
-Sits on top of a ____
-Provides ___ storage for massive amounts of data

A

HDFS is a filesystem written in Java
Sits on top of a native Linux filesystem
Provides redundant storage for massive amounts of data

17
Q

File storage in HDFS
-HDFS performs best with a small number of ___
-Millions of large files versus billions of ___
-Files in HDFS are _ as we cannot modify an existing file
-Optimized for large files with data ___

A

HDFS performs best with a small number of large files
Millions of large files versus billions of small ones
Files in HDFS are write-once, as we cannot modify an existing file
Optimized for large files with data processed in large chunks

18
Q

HDFS Limitations
-Response times contain __
-Filesystem metadata is held in ___ (not good for lots of ___)
-Writes are typically from a single _____
-Record locking is formally ____

A

Response times contain latency
Filesystem metadata is held in memory
Not good for lots of small files
Writes are typically from a single writer appending to a file
Record locking is not formally supported

19
Q

Storing Blocks on Data Nodes
Data files are split into ___ which are distributed at load time
Each block is replicated on ___
File can be larger than any ____

A

Data files are split into 128 MB blocks which are distributed at load time
Each block is replicated on multiple data nodes
File can be larger than any single storage disk
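
The block arithmetic can be sketched as follows, assuming the default 128 MB block size and replication factor of 3 stated in the cards. The 300 MB file size is a made-up example.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB      # HDFS default block size
REPLICATION = 3            # HDFS default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)   # last block holds only the leftover bytes
    return sizes


blocks = split_into_blocks(300 * MB)
raw_storage = sum(blocks) * REPLICATION   # bytes used across the cluster
```

Here the file becomes blocks of 128 MB, 128 MB, and 44 MB, and the three replicas together occupy 900 MB of raw disk. A 1 MB file yields a single 1 MB block rather than a padded 128 MB one.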

20
Q

Hadoop Cluster

A

Cluster is a group of computers working together
Node is an individual server blade in the cluster
Daemon is a program running on a node

21
Q

Hadoop Cluster Components

A

Three main components of a cluster
Components work together to provide distributed data processing

22
Q

NameNode
-NameNode stores __
-Manages the filesystem __
-Knows where every block is stored for ___
-Composed of 2 files:
1. ___
2. ___

A

NameNode stores metadata
Manages the filesystem namespace
Knows where every block is stored for every file
Composed of 2 files:
1. Namespace image
2. Edit log

23
Q

Characteristics of NameNode
-Is the single point of ___
-Does not ___
-Reads and loads ___ information in memory
-____ memory requirements
-Users do not interact with ___

A

Is the single point of failure
Does not store data
Reads and loads block/file information in memory
High RAM requirements
Users do not interact with nodes

24
Q

Secondary NameNode
-NameNode daemon must be _____
-HDFS is set up for ____ with active and standby NameNodes
-Periodically merges the namespace ____

A

NameNode daemon must be running at all times
HDFS is set up for high availability with active and standby NameNodes
Periodically merges the namespace image and the edit log
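
A toy model of the merge step, assuming the namespace image is a snapshot of the file-to-block mapping and the edit log is a list of changes made since that snapshot. The real on-disk formats and operation types are far richer; the paths and operation names here are hypothetical.

```python
def checkpoint(namespace_image, edit_log):
    """Replay logged edits on a copy of the image and return the merged image."""
    merged = dict(namespace_image)
    for op, path, blocks in edit_log:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Hypothetical state: file path -> list of block IDs.
image = {"/data/031512": ["B1", "B2", "B3"]}
edits = [("create", "/data/042313", ["B4", "B5"]),
         ("delete", "/data/031512", None)]

new_image = checkpoint(image, edits)
edits = []   # after a checkpoint the edit log can start fresh
```

Merging periodically keeps the edit log short, so a restarting NameNode has fewer changes to replay.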

25
Q

DataNodes
-___ and ___ blocks when they are told to do so
-__ to NameNode the list of blocks they are storing

A

Store and retrieve blocks when they are told to do so
Report to NameNode the list of blocks they are storing

26
Q

HDFS File Operations
-HDFS is a separate file system running on the big data cluster that lets you view and _____ directories and files
-Necessary to specify the file system using ___
-Many of the Hadoop file commands use the same command as ____

A

HDFS is a separate file system running on the big data cluster that lets you view and manage your HDFS directories and files
Necessary to specify the file system using hadoop fs
Many of the Hadoop file commands use the same command as Linux with a - (dash) preceding it

27
Q

Hadoop File Storage
-File 031512 split into blocks ___
-File 042313 split into blocks ___

A

File 031512 split into blocks B1, B2 & B3
File 042313 split into blocks B4 & B5

28
Q

Hadoop Block Identification
-Client asks NameNode for contents of ___
-NameNode responds it is found in blocks ___

A

Client asks NameNode for contents of 042313
NameNode responds it is found in blocks B4 & B5

29
Q

Hadoop File Retrieval
-Client begins retrieval attempts from ____
-Client begins retrieval attempts from ___
-Data is looked at directly from the ______

A

Client begins retrieval attempts from Nodes A, B & E
Client begins retrieval attempts from Nodes C, E & D
Data is transferred directly from the DataNode to the client
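
The lookup-then-retrieve flow can be sketched as below. It assumes block B4 has replicas on Nodes A, B & E and block B5 on Nodes C, E & D, which is inferred from the order of the cards rather than stated outright. The function names are hypothetical and this is not the HDFS client API.

```python
# NameNode metadata only: which blocks make up a file, and where replicas live.
# The block-to-node mapping is an assumption based on the cards above.
file_to_blocks = {"042313": ["B4", "B5"]}
block_to_nodes = {"B4": ["A", "B", "E"], "B5": ["C", "E", "D"]}


def locate(filename):
    """NameNode step: return (block, candidate DataNodes) for each block."""
    return [(b, block_to_nodes[b]) for b in file_to_blocks[filename]]


def read_plan(filename, live_nodes):
    """Client step: pick the first reachable DataNode for each block."""
    plan = []
    for block, nodes in locate(filename):
        node = next((n for n in nodes if n in live_nodes), None)
        if node is None:
            raise IOError(f"no live replica for block {block}")
        plan.append((block, node))
    return plan


# Node A is down, so the client falls back to the next replica of B4.
plan = read_plan("042313", live_nodes={"B", "C", "D", "E"})
```

The NameNode only answers the location question; the block contents themselves come from the DataNodes.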

30
Q

Storing Files with HDFS
-Storage of files larger than an ___
-Blocks are large to reduce the amount of ____
-File metadata is stored in ____
-If a block replica is corrupt a ___
-1 MB file size only consumes _____

A

Storage of files larger than an entire hard drive
Blocks are large to reduce the amount of seek time
File metadata is stored in another system location
If a block replica is corrupt, another replica is selected
1 MB file size only consumes 1 MB of block space

31
Q

File Writes
File Reads

A

See the picture.

32
Q

DataNode Replication
-____ placed on the same node as client
-____ placed on a different rack chosen at random
-____ is placed on the same rack as the ___, on a ___, selected at ___

A

First replica is placed on the same node as the client
Second replica is placed on a different rack chosen at random
Third replica is placed on the same rack as the second, on a different node, selected at random
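
A minimal sketch of the three placement rules above. The rack and node names are hypothetical, and the sketch assumes the client runs on a cluster node and that another rack has at least two nodes. It is not the actual HDFS placement code.

```python
import random


def place_replicas(client_node, racks, rng=random):
    """Pick three DataNodes following the placement rules on the card.

    racks maps a rack name to the list of node names on that rack.
    """
    rack_of = {node: rack for rack, nodes in racks.items() for node in nodes}
    first = client_node                              # same node as the client
    # Assumption: only consider other racks that can host two replicas.
    other_racks = [r for r in racks
                   if r != rack_of[first] and len(racks[r]) >= 2]
    second_rack = rng.choice(other_racks)            # a different rack, at random
    second = rng.choice(racks[second_rack])
    third = rng.choice([n for n in racks[second_rack] if n != second])
    return [first, second, third]


# Hypothetical topology: three racks with two nodes each.
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
replicas = place_replicas("n1", racks, random.Random(0))
```

Spreading replicas over two racks means the data survives the loss of a whole rack, while keeping two copies on one rack limits cross-rack traffic.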

33
Q

List the Home Directory
-_____ lists the contents of the user home directory
-_____ lists the contents of the Hadoop root directory
-The __ begins at the Hadoop root
-____ spells out the entire directory structures

A

hadoop fs -ls lists the contents of the user home directory
hadoop fs -ls / lists the contents of the Hadoop root directory
The / begins at the Hadoop root
-ls -R spells out the entire directory structure

34
Q

Making Directories
-_____ makes a new directory in hadoop
-The new directory is called ____
- The _ begins at the home directory

A

hadoop fs -mkdir makes a new directory in Hadoop
The new directory is called jdb101000
The / begins at the home directory

35
Q

Put a File
-___ copies a file from the Linux file system to the Hadoop file system
-_____ is the syntax
-____ denotes the Linux source
-Destination is the ____ directory in Hadoop

A

-put copies a file from the Linux file system to the Hadoop file system
hadoop fs -put <source> <destination> is the syntax
~/foodsales.csv denotes the Linux source
Destination is the /jdb101000 directory in Hadoop

36
Q

Get a File
-____ copies a file from the Hadoop file system to the Linux file system
-____ is the syntax
-____ denotes the Linux destination
-Source is the ___ file in the Hadoop directory

A

-get copies a file from the Hadoop file system to the Linux file system
hadoop fs -get <source> <destination> is the syntax
~/outbound.csv denotes the Linux destination
Source is the foodsales.csv file in the Hadoop directory

37
Q

Copy a directory
-___ copies one directory and creates another
-_____ is the syntax
-_____ is the source directory
-___ is the new destination directory

A

-cp copies one directory and creates another
hadoop fs -cp <source> <destination> is the syntax
/jdb101000 is the source directory
/production is the new destination directory

38
Q

Copy a File
-___ copies one file and creates another
-____ is the syntax
-____ is the directory for both files
-___ is being copied to ____

A

-cp copies one file and creates another
hadoop fs -cp <source> <destination> is the syntax
/jdb101000 is the directory for both files
foodsales.csv is being copied to newfile.csv

39
Q

Reviewing a File
-____ displays the contents of a file
-____ can be edited
-____ will exit the command

A

-cat <filename> displays the contents of a file
File can be edited
Control D will exit the command

40
Q

Moving
-____ moves a file or directory to a new location
-____ is the syntax
-_____ is the source directory
-That will be moved into ___ directory

A

-mv moves a file or directory to a new location
hadoop fs -mv <source> <destination> is the syntax
/production is the source directory
That will be moved into the /jdb101000 directory

41
Q

Removing
-___ removes files and directories
-_____ is the syntax
-___ option is required to remove a directory
-The ____ directory will be removed

A

-rm removes files and directories
hadoop fs -rm <object> is the syntax
-r option is required to remove a directory
The /production directory will be removed

42
Q

HDFS Recommendations
-___ is a repository for your ___
-Best practices:
-Define a ____
-Include ____ for staging data

Example
____ data and configuration belonging to a single user
___ work in progress in Extract/Transform/Load stage
___ temporary generated data shared between users
____ data sets that are processed and available

A

HDFS is a repository for your big data files
Best practices:
Define a standard directory structure
Include separate locations for staging data
Examples:
/user: data and configuration belonging to a single user
/etl: work in progress in Extract/Transform/Load stage
/tmp: temporary generated data shared between users
/data: data sets that are processed and available

43
Q

Summary

A

Hadoop is a reliable distributed architecture for computing
HDFS is the storage layer file system for Hadoop
HDFS keeps three redundant copies of each file block on separate nodes, distributed across the cluster
Uses a system of NameNodes and DataNodes organized with daemons
HDFS is accessed using file commands similar to those in Linux