Hadoop Tools Flashcards
MapReduce: 4 phases
Mapping
Shuffling
Sorting
Reducing
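The four phases can be imitated on a toy word count with ordinary Unix pipes (a sketch for intuition only; real MapReduce distributes these steps across a cluster):

```shell
# map:     emit one word per input token
# shuffle: bring identical keys together (sort does shuffle + sort here)
# reduce:  aggregate each group of identical keys into a count
printf 'apple banana\napple cherry\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c
# → 2 apple / 1 banana / 1 cherry
```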
What is sqoop?
SQL-to-Hadoop
A tool that imports/exports data between SQL (relational) databases and Hadoop
What is Flume?
Flume is used to collect unstructured or semi-structured data and store it in Hadoop.
This data could be videos, pictures, emails, or websites.
Keeps a balance between data read and write speeds
Describe an Agent (Flume)
3 components:
Source - accepts incoming data and forwards to channel
Channel - Receives data and buffers them until the sink consumes them, avoiding data loss
Sink - Receives data from the channel and stores into HDFS
What is HBase?
HBase is a column-oriented NoSQL database built for the Hadoop file system. It is designed to provide quick random access to huge amounts of data. Users can store data in HDFS directly or through HBase.
Describe how HBase stores data differently from row storage
It stores data column-by-column (grouped into column families) rather than row-by-row, so reads that touch only a few columns scan far less data.
What are the 4 components of HBase?
Master Server, Region Server, Regions, & Zookeeper
What is Hive?
It is a data warehouse framework for querying and analyzing data stored in HDFS.
An open-source tool for data analysis
Also created to relieve the complex programming required by MapReduce
Unlike PIG, Hive is for processing structured data only
Can process terabytes of data much faster than a traditional RDBMS
In HBase, what is the Master Server?
It assigns regions to region servers with the help of Apache ZooKeeper. It handles load balancing across those servers. It is responsible for schema changes.
In HBase, what is Zookeeper?
Receives keep-alive heartbeats from the HRegion and HMaster servers. Keeps track of server metadata.
In HBase, describe the Region Server and Regions.
Region servers contain regions. They communicate with the client and handle data-related operations (read and write requests) for all regions under them. The region server also decides when a region has grown past the configured size threshold and should be split.
If MapReduce works, why use other tools?
MapReduce requires knowledge of programming in Java and/or Python, and complex code must be written.
1 line of PIG Latin script is roughly equivalent to 20 lines of MapReduce Java code.
Faster deployment, with built-in functions to achieve complex tasks.
What is Apache PIG?
An open-source, high-level data analysis tool built on top of MapReduce
It represents large datasets as data flows
Like a pig that eats anything, PIG accepts any type of data - structured, unstructured, and semi-structured data.
What are the components of Apache PIG? How does it handle scripts?
It is made up of two components: PIG Latin and the PIG runtime environment.
PIG Latin - scripting language (similar to SQL)
PIG Latin scripts are internally converted to Map and Reduce tasks by the PIG MapReduce engine.
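As an illustration of that compression, a complete word count is only a handful of PIG Latin lines (the input file name and relation aliases below are hypothetical):

```
-- each relation below becomes one or more Map/Reduce stages under the hood
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words);
DUMP counts;
```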
Describe the HBase architecture
How do you connect and log in to a database with sqoop?
--connect: connection string to the RDBMS
--connect jdbc:mysql://<ip>/<database>
--username: database username
--password: database password
--username ADMIN --password P@$$w0rd
How do you set the import table and target directory in sqoop?
● --table: the RDBMS table to import
○ --table STUDENT_RECORDS
● --target-dir: the HDFS directory the data should be imported into
○ --target-dir /DESTINATION_DIR
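Putting the flags from these cards together, a full import command looks like this (host, database, credentials, and paths are the placeholder values from the cards):

```shell
sqoop import \
  --connect jdbc:mysql://<ip>/<database> \
  --username ADMIN --password P@$$w0rd \
  --table STUDENT_RECORDS \
  --target-dir /DESTINATION_DIR
```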
How do you connect, log in, select the table, and export a table from Hadoop to SQL with sqoop?
● --connect: connection string to the RDBMS
○ --connect jdbc:mysql://<ip>/<database>
● --username: database username
● --password: database password
○ --username ADMIN --password P@$$w0rd
● --table: the RDBMS table to export to
○ --table STUDENT_RECORDS
● --export-dir: the HDFS directory the data should be exported from
○ --export-dir /SOURCE_DIR
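Combined, a full export command looks like this (placeholder values as in the cards above):

```shell
sqoop export \
  --connect jdbc:mysql://<ip>/<database> \
  --username ADMIN --password P@$$w0rd \
  --table STUDENT_RECORDS \
  --export-dir /SOURCE_DIR
```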
Sqoop - How to list databases? Tables?
● list-databases: list all databases in the RDBMS to which sqoop is connected
○ sqoop list-databases --connect jdbc:mysql://localhost/ --username ADMIN --password P@$$w0rd
● list-tables: list all tables in RDBMS database to which sqoop is connected
○ sqoop list-tables --connect jdbc:mysql://localhost/DATABASE_NAME --username ADMIN --password P@$$w0rd
What is an agent’s source in flume?
Incoming stream of data
What is an agent’s channel in flume?
The channel is a transient storage that buffers between source and sink.
What is an agent’s sink in flume?
The sink is what receives data from the channel, and stores in HDFS (Think funnel)
In flume, there are 3 types of data flow. Recite them, and explain what each means.
● Multi-hop Flow: an event goes through two or more flume agents before reaching its destination.
● Fan-out Flow: an event flows from one source to multiple channels.
● Fan-in Flow: an event is transferred from many sources to one channel.
In flume, how do you specify each component in code?
agent.sources = specifies the source
agent.channels = specifies the channel
agent.sinks = specifies the target
Given the flume agent's source, what properties specify the data file path?
agent.sources.flumesource.type =
agent.sources.flumesource.spoolDir = path to data file
agent.sources.flumesource.fileHeader = false
In flume, how do you specify a channel's type and capacity?
agent.channels.memoryChannel.type =
agent.channels.memoryChannel.capacity =
agent.channels.memoryChannel.transactionCapacity =
In flume, how do you set the sink's HDFS path and file type?
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = output directory
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
agent.sinks.flumeHDFS.hdfs.rollCount = 0
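Assembled into a single file, a minimal spooling-directory-to-HDFS agent could look like the following sketch (the spooldir/memory type values, capacities, and paths are illustrative; note the final two binding lines, which the fragments above omit but Flume requires):

```
agent.sources  = flumesource
agent.channels = memoryChannel
agent.sinks    = flumeHDFS

# source: watch a directory for new files
agent.sources.flumesource.type = spooldir
agent.sources.flumesource.spoolDir = /data/incoming
agent.sources.flumesource.fileHeader = false

# channel: in-memory buffer between source and sink
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
agent.channels.memoryChannel.transactionCapacity = 100

# sink: write events into HDFS as plain text
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = /flume/output
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
agent.sinks.flumeHDFS.hdfs.rollCount = 0

# bind source and sink to the channel
agent.sources.flumesource.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel
```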
Recite all the sqoop commands you know