Hadoop Tools Flashcards
MapReduce: 4 phases
Mapping
Shuffling
Sorting
Reducing
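The four phases can be sketched in plain Python with a word count over hypothetical input data (the records and counts below are illustrative, not from the source):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: one record per line
lines = ["big data", "big tools"]

# Mapping: emit (key, value) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffling & sorting: bring all pairs with the same key together
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(mapped, key=itemgetter(0))}

# Reducing: aggregate each key's values
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 1, 'tools': 1}
```

In real MapReduce the shuffle/sort happens across machines between the map and reduce tasks; here it is simulated with a single in-memory sort and group.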
What is sqoop?
SQL-to-Hadoop
Imports data from relational (SQL) databases into Hadoop and exports data from Hadoop back into relational databases
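A hypothetical pair of Sqoop invocations (the connection string, table names, and paths are made up for illustration; running them requires a live database and Hadoop cluster):

```shell
# Import a relational table into HDFS (connection details are illustrative)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/analyst/orders

# Export HDFS data back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders_summary \
  --export-dir /user/analyst/orders_summary
```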
What is Flume?
Flume is used to collect and move unstructured or semi-structured data into Hadoop.
This data could be videos, pictures, emails, or websites.
It buffers data in transit, balancing the rate at which producers write data against the rate at which it is read into HDFS.
Describe an Agent (Flume)
3 components:
Source - accepts incoming data and forwards to channel
Channel - Receives data and buffers them until the sink consumes them, avoiding data loss
Sink - Receives data from the channel and stores into HDFS
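The three components above are wired together in an agent's properties file; a minimal configuration sketch, with illustrative agent name, port, and HDFS path:

```properties
# Agent 'a1' with one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: accept incoming events over netcat and forward to the channel
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffer events in memory until the sink consumes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: take events from the channel and store them in HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
```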
What is HBase?
HBase is a column-oriented, non-relational (NoSQL) database built on top of the Hadoop file system. It is designed to provide quick random access to huge amounts of data. Users can store data in HDFS directly or through HBase.
Describe how HBase stores data differently from row storage
It groups data by column family rather than by row: values from the same column family are stored together on disk, so reading a few columns across many rows is fast.
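A toy sketch of the difference (illustrative data, not HBase's actual on-disk format): row storage keeps each record's fields together, while column storage keeps each column's values together.

```python
records = [
    {"id": 1, "name": "ada", "city": "london"},
    {"id": 2, "name": "bob", "city": "paris"},
]

# Row-oriented: whole records stored contiguously
row_store = records

# Column-oriented: each column's values stored contiguously
col_store = {key: [r[key] for r in records] for key in records[0]}

# Reading one column touches one contiguous list instead of every record
print(col_store["city"])  # ['london', 'paris']
```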
What are the 4 components of HBase?
Master Server, Region Server, Regions, & Zookeeper
What is Hive?
It is a data warehouse framework for querying and analyzing data stored in HDFS.
An open-source tool for data analysis
Also created to reduce the tedious low-level programming associated with MapReduce
Unlike PIG, Hive is for processing structured data only
Designed to scale queries across terabytes of data, volumes a traditional RDBMS struggles with
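A hypothetical HiveQL example (the table and columns are made up); Hive compiles statements like these into MapReduce jobs over files in HDFS:

```sql
-- Define a table over delimited files in HDFS
CREATE TABLE page_views (user_id INT, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Familiar SQL-style aggregation, executed as MapReduce under the hood
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```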
In HBase, what is the Master Server?
It assigns regions to region servers with the help of Apache ZooKeeper. It handles load balancing across those servers. It is responsible for schema changes.
In HBase, what is Zookeeper?
Receives keep-alive heartbeats from HRegion and HMaster servers. Keeps track of server metadata.
In HBase, describe the Region Server and Regions.
Region servers hold regions that communicate with the client and handle data-related operations, such as read and write requests, for all regions under them. The region server also decides the size of a region, splitting it when it exceeds the configured region-size threshold.
If MapReduce works, why use other tools?
MapReduce requires programming knowledge, primarily Java (or another language via streaming), and complex code must be written even for simple tasks.
1 line of PIG Latin script is roughly equivalent to 20 lines of MapReduce Java code
Faster deployment, built in functions to achieve complex tasks.
What is Apache PIG?
An open-source, high-level data analysis tool built upon MapReduce
It represents large datasets as data flows
Like a pig that eats anything, PIG accepts any type of data - structured, unstructured, and semi-structured data.
What are the components of Apache PIG? How does it handle scripts?
It is made up of two components – PIG Latin and the PIG runtime environment.
PIG Latin - scripting language (similar to SQL)
PIG Latin scripts are internally converted to Map and Reduce tasks by the PIG MapReduce engine
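A hypothetical PIG Latin word count (the file path and alias names are illustrative); each line below would expand to many lines of MapReduce Java:

```pig
-- Load raw text from HDFS
lines  = LOAD '/data/input.txt' AS (line:chararray);
-- Map: split each line into words
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Shuffle/sort: group identical words together
grpd   = GROUP words BY word;
-- Reduce: count each group
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/data/wordcount';
```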
Describe the HBase architecture