Hadoop Tools Flashcards

1
Q

What are the 4 processes of MapReduce?

A

Mapping
Shuffling
Sorting
Reducing
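
The four processes above can be sketched as a word count in plain Python. This is only an illustration under simplifying assumptions: everything runs in one process, and the function names (`map_phase`, `shuffle_phase`, `sort_phase`, `reduce_phase`, `word_count`) are hypothetical, not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(lines):
    """Mapping: emit a (word, 1) pair for every word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffling: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def sort_phase(groups):
    """Sorting: order the grouped keys before they reach the reducer."""
    return sorted(groups.items())


def reduce_phase(sorted_groups):
    """Reducing: collapse each key's list of values into a single count."""
    return [(key, sum(values)) for key, values in sorted_groups]


def word_count(lines):
    """Chain the four phases in order."""
    return reduce_phase(sort_phase(shuffle_phase(map_phase(lines))))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

In a real cluster the map and reduce steps run in parallel on different nodes; the sketch only shows the order in which the data is transformed.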

2
Q

What is sqoop?

A

SQL to Hadoop (the name combines SQl and hadOOP)
Imports and exports data between SQL databases and Hadoop

3
Q

What is Flume?

A

Flume is used to collect unstructured or semi-structured data and store it in Hadoop.

This data could be videos, pictures, emails, or websites.

It keeps a balance between the speed at which data is read and the speed at which it is written.

4
Q

Describe an Agent (Flume)

A

3 components:
Source - accepts incoming data and forwards it to the channel
Channel - receives the data and buffers it until the sink consumes it, avoiding data loss
Sink - receives data from the channel and stores it in HDFS
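
A minimal sketch of how the three components hand data to each other, assuming a bounded in-memory buffer as the channel and a Python list standing in for HDFS. The `Channel` class and `run_agent` function are hypothetical names used only for illustration, not Flume APIs.

```python
from collections import deque


class Channel:
    """Bounded buffer that sits between the source and the sink."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.buffer = deque()

    def put(self, event):
        """Accept an event unless the buffer is full."""
        if len(self.buffer) >= self.capacity:
            return False
        self.buffer.append(event)
        return True

    def take(self):
        """Hand the oldest buffered event to the sink, or None if empty."""
        return self.buffer.popleft() if self.buffer else None


def run_agent(events, capacity=2):
    """Move every event from the source, through the channel, to the sink."""
    channel = Channel(capacity)
    hdfs = []                 # stand-in for files written to HDFS
    pending = deque(events)   # data waiting at the source

    while pending or channel.buffer:
        # Source: forward as many events as the channel will accept.
        while pending and channel.put(pending[0]):
            pending.popleft()
        # Sink: consume one event from the channel and store it.
        event = channel.take()
        if event is not None:
            hdfs.append(event)
    return hdfs
```

Because the source only removes an event from `pending` once the channel has accepted it, nothing is dropped when the sink is slower than the source, which is the buffering role the card describes.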

5
Q

What is HBase?

A

HBase is a column-oriented, non-relational (NoSQL) database built on top of the Hadoop file system (HDFS). It is designed to provide quick random access to huge amounts of data. Users can store data in HDFS directly or through HBase.

6
Q

Describe how HBase stores data differently from row storage

A

It stores data by column (grouped into column families) rather than by row, so the values of one column are kept together.
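
The difference in layout can be illustrated with a small sketch. It is a simplification: HBase actually groups columns into column families and stores individual key-value cells, and the sample records and the `to_columns` helper here are made up for the example.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Ann", "grade": 90},
    {"id": 2, "name": "Bob", "grade": 85},
]


def to_columns(records):
    """Regroup row records so that each column's values sit together."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


# Column-oriented layout: one list per column.
columns = to_columns(rows)

# Reading a single column touches one list instead of every full row.
average_grade = sum(columns["grade"]) / len(columns["grade"])
```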

7
Q

What are the 4 components of HBase?

A

Master Server, Region Servers, Regions, and ZooKeeper

8
Q

What is Hive?

A

It is a data warehouse framework for querying and analyzing data stored in HDFS.
An open-source tool for data analysis

It was also created to ease the burden of programming MapReduce directly
Unlike PIG, Hive processes structured data only
Built to process terabytes of data, a scale that a traditional RDBMS handles poorly

9
Q

In HBase, what is the Master Server?

A

It assigns regions to region servers with the help of Apache ZooKeeper. It handles load balancing across those servers. It is responsible for schema changes.

10
Q

In HBase, what is ZooKeeper?

A

It receives keep-alive heartbeats from the HRegion and HMaster servers and keeps track of server metadata.

11
Q

In HBase, describe the Region Server and Regions.

A

Region servers host regions, communicate with the client, and handle data-related operations such as read and write requests for all the regions under them. The region server decides the size of a region according to the region size thresholds.

12
Q

If MapReduce works, why use other tools?

A

MapReduce requires programming knowledge (Java and/or Python), and complex code has to be written.

1 line of PIG Latin script is roughly equivalent to 20 lines of MapReduce Java code.

Other tools offer faster deployment and built-in functions for complex tasks.

13
Q

What is Apache PIG?

A

An open-source, high-level data analysis tool built on MapReduce
It represents large datasets as data flows

Like a pig that eats anything, PIG accepts any type of data - structured, unstructured, and semi-structured data.

14
Q

What are the components of Apache PIG? How does it handle scripts?

A

It is made up of two components: PIG Latin and the PIG Runtime environment.
PIG Latin - scripting language (similar to SQL)

PIG Latin scripts are internally converted to Map and Reduce tasks by the PIG MapReduce Engine

15
Q

Describe the HBase architecture

A
16
Q

How do you connect and log in to a database with sqoop?

A

--connect: connection string to the RDBMS
--connect jdbc:mysql://<ip>/<database>
--username: database username
--password: database password
--username ADMIN --password P@$$w0rd

17
Q

How do you set the import table, and target directory in sqoop?

A

● --table: the RDBMS table to use
○ --table STUDENT_RECORDS
● --target-dir: where the data should be imported into in HDFS
○ --target-dir /DESTINATION_DIR
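
Putting the connection, login, table, and target directory options together, a full import command could be assembled as in the sketch below. The helper name `sqoop_import_command`, the IP address, and the database name `school` are assumptions made for illustration; the sketch only builds and prints the command and does not run sqoop.

```python
import shlex


def sqoop_import_command(ip, database, username, password, table, target_dir):
    """Build the argument list for a sqoop import from MySQL into HDFS."""
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{ip}/{database}",  # connection string
        "--username", username,                        # database username
        "--password", password,                        # database password
        "--table", table,                              # RDBMS table to import
        "--target-dir", target_dir,                    # destination in HDFS
    ]


# Hypothetical values; replace them with those of your own database.
command = sqoop_import_command(
    "127.0.0.1", "school", "ADMIN", "P@$$w0rd",
    "STUDENT_RECORDS", "/DESTINATION_DIR",
)

if __name__ == "__main__":
    # shlex.join quotes arguments such as the password for safe pasting.
    print(shlex.join(command))
```

Keeping the arguments as a list avoids quoting problems with characters such as `$` in the password, which a shell would otherwise try to expand.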

18
Q

How do you connect, log in, select the table, and export a table from HDFS to SQL with sqoop?

A

● --connect: connection string to the RDBMS
○ --connect jdbc:mysql://<ip>/<database>
● --username: database username
● --password: database password
○ --username ADMIN --password P@$$w0rd
● --table: the RDBMS table to export to
○ --table STUDENT_RECORDS
● --export-dir: where the data should be exported from in HDFS
○ --export-dir /SOURCE_DIR

19
Q

Sqoop - How to list databases? Tables?

A

● list-databases: list all databases in the RDBMS to which sqoop is connected
○ sqoop list-databases --connect jdbc:mysql://localhost/ --username ADMIN --password P@$$w0rd
● list-tables: list all tables in the RDBMS database to which sqoop is connected
○ sqoop list-tables --connect jdbc:mysql://localhost/DATABASE_NAME --username ADMIN --password P@$$w0rd

20
Q

What is an agent’s source in flume?

A

The source accepts the incoming stream of data and forwards it to the channel.

21
Q

What is an agent’s channel in flume?

A

The channel is a transient storage that buffers between source and sink.

22
Q

What is an agent’s sink in flume?

A

The sink receives data from the channel and stores it in HDFS (think of a funnel).

23
Q

In flume, there are 3 types of data flow. Recite them, and explain what each means.

A

● Multi-hop Flow: an event goes through two or more flume agents before reaching its destination.
● Fan-out Flow: an event flows from one source to multiple channels.
● Fan-in Flow: an event is transferred from many sources to one channel.
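
The three flows can be sketched with plain lists standing in for channels. This is a loose illustration, assuming that a fan-out copies the event into every channel; the function names and the way agents are modelled as simple callables are hypothetical.

```python
def multi_hop(event, agents):
    """Multi-hop: pass one event through each agent in turn."""
    for agent in agents:
        event = agent(event)
    return event


def fan_out(event, channels):
    """Fan-out: one source places a copy of the event in every channel."""
    for channel in channels:
        channel.append(event)


def fan_in(sources, channel):
    """Fan-in: events from many sources are collected in one channel."""
    for source in sources:
        channel.extend(source)


# Multi-hop: each hypothetical agent records that it handled the event.
hops = multi_hop([], [lambda e: e + ["agent1"], lambda e: e + ["agent2"]])

# Fan-out: the same event reaches both channels.
channel_a, channel_b = [], []
fan_out("log line", [channel_a, channel_b])

# Fan-in: two sources feed a single channel.
merged = []
fan_in([["web1 event"], ["web2 event"]], merged)
```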

24
Q

In flume, how do you specify each component in code?

A

agent.sources = specifies the source
agent.channels = specifies the channel
agent.sinks = specifies the sink (the target)

25
Q

Given the flume agent's source, which properties specify the data file path?

A

agent.sources.flumesource.type =
agent.sources.flumesource.spoolDir = path to data file
agent.sources.flumesource.fileHeader = false

26
Q

In flume, how do you specify a channel’s type, and capacity?

A

agent.channels.memoryChannel.type =
agent.channels.memoryChannel.capacity =
agent.channels.memoryChannel.transactionCapacity =

27
Q

In flume, how do you link the hdfs path, as well as the file type?

A

agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = output directory
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
agent.sinks.flumeHDFS.hdfs.rollCount = 0
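
The source, channel, and sink properties from the last few cards can be combined into one agent configuration. The sketch below generates such a file as text. Several values are assumptions, because the cards leave them blank: the source type `spooldir`, the channel type `memory`, and the capacity figures. The `build_flume_config` helper and the example paths are hypothetical as well.

```python
def build_flume_config(spool_dir, hdfs_path):
    """Return the text of a Flume properties file for a single agent."""
    settings = {
        # Name the three components of the agent.
        "agent.sources": "flumesource",
        "agent.channels": "memoryChannel",
        "agent.sinks": "flumeHDFS",
        # Source: watch a directory for new data files.
        "agent.sources.flumesource.type": "spooldir",         # assumption
        "agent.sources.flumesource.spoolDir": spool_dir,
        "agent.sources.flumesource.fileHeader": "false",
        "agent.sources.flumesource.channels": "memoryChannel",
        # Channel: in-memory buffer between source and sink.
        "agent.channels.memoryChannel.type": "memory",        # assumption
        "agent.channels.memoryChannel.capacity": "10000",     # assumption
        "agent.channels.memoryChannel.transactionCapacity": "100",  # assumption
        # Sink: write the events to HDFS.
        "agent.sinks.flumeHDFS.type": "hdfs",
        "agent.sinks.flumeHDFS.hdfs.path": hdfs_path,
        "agent.sinks.flumeHDFS.hdfs.fileType": "DataStream",
        "agent.sinks.flumeHDFS.hdfs.rollCount": "0",
        "agent.sinks.flumeHDFS.channel": "memoryChannel",
    }
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


if __name__ == "__main__":
    # Hypothetical paths, used only to show the generated file.
    print(build_flume_config("/tmp/spool", "/flume/output"))
```

The two extra lines that bind the source and the sink to `memoryChannel` are what connect the three components into one flow.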

28
Q

Recite all the sqoop commands you know

A