Hadoop Tools Flashcards
MapReduce: 4 phases
Mapping
Shuffling
Sorting
Reducing
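The four phases can be sketched in plain Python with a word count over hypothetical input data (the records and counts below are illustrative, not from the source):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: one record per line
lines = ["big data", "big tools"]

# Mapping: emit (key, value) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffling & sorting: bring all pairs with the same key together
mapped.sort(key=itemgetter(0))
grouped = {key: [v for _, v in pairs]
           for key, pairs in groupby(mapped, key=itemgetter(0))}

# Reducing: aggregate each key's values
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 1, 'tools': 1}
```

In real MapReduce the shuffle/sort happens across machines between the map and reduce tasks; here it is simulated with a single in-memory sort and group.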
What is sqoop?
SQL-to-Hadoop
Imports data from relational (SQL) databases into Hadoop and exports data from Hadoop back into relational databases
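A hypothetical pair of Sqoop invocations (the connection string, table names, and paths are made up for illustration; running them requires a live database and Hadoop cluster):

```shell
# Import a relational table into HDFS (connection details are illustrative)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/analyst/orders

# Export HDFS data back into a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders_summary \
  --export-dir /user/analyst/orders_summary
```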
What is Flume?
Flume is used to collect and move unstructured or semi-structured data into Hadoop.
This data could be videos, pictures, emails, or websites.
It buffers data in transit, balancing the rate at which producers write data against the rate at which it is read into HDFS.
Describe an Agent (Flume)
3 components:
Source - accepts incoming data and forwards to channel
Channel - Receives data and buffers them until the sink consumes them, avoiding data loss
Sink - Receives data from the channel and stores into HDFS
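The three components above are wired together in an agent's properties file; a minimal configuration sketch, with illustrative agent name, port, and HDFS path:

```properties
# Agent 'a1' with one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: accept incoming events over netcat and forward to the channel
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffer events in memory until the sink consumes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: take events from the channel and store them in HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
```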
What is HBase?
HBase is a column-oriented, non-relational (NoSQL) database built on top of the Hadoop file system. It is designed to provide quick random access to huge amounts of data. Users can store data in HDFS directly or through HBase.
Describe how HBase stores data differently from row storage
It groups data by column family rather than by row: values from the same column family are stored together on disk, so reading a few columns across many rows is fast.
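A toy sketch of the difference (illustrative data, not HBase's actual on-disk format): row storage keeps each record's fields together, while column storage keeps each column's values together.

```python
records = [
    {"id": 1, "name": "ada", "city": "london"},
    {"id": 2, "name": "bob", "city": "paris"},
]

# Row-oriented: whole records stored contiguously
row_store = records

# Column-oriented: each column's values stored contiguously
col_store = {key: [r[key] for r in records] for key in records[0]}

# Reading one column touches one contiguous list instead of every record
print(col_store["city"])  # ['london', 'paris']
```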
What are the 4 components of HBase?
Master Server, Region Server, Regions, & Zookeeper
What is Hive?
It is a data warehouse framework for querying and analyzing data stored in HDFS.
An open-source tool for data analysis
Also created to reduce the tedious low-level programming associated with MapReduce
Unlike PIG, Hive is for processing structured data only
Designed to scale queries across terabytes of data, volumes a traditional RDBMS struggles with
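A hypothetical HiveQL example (the table and columns are made up); Hive compiles statements like these into MapReduce jobs over files in HDFS:

```sql
-- Define a table over delimited files in HDFS
CREATE TABLE page_views (user_id INT, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Familiar SQL-style aggregation, executed as MapReduce under the hood
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```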
In HBase, what is the Master Server?
It assigns regions to region servers with the help of Apache ZooKeeper. It handles load balancing across those servers. It is responsible for schema changes.
In HBase, what is Zookeeper?
Receives keep-alive heartbeats from HRegion and HMaster servers. Keeps track of server metadata.
In HBase, describe the Region Server and Regions.
Region servers hold regions that communicate with the client and handle data-related operations, such as read and write requests, for all regions under them. The region server also decides the size of a region, splitting it when it exceeds the configured region-size threshold.
If MapReduce works, why use other tools?
MapReduce requires programming knowledge, primarily Java (or another language via streaming), and complex code must be written even for simple tasks.
1 line of PIG Latin script is roughly equivalent to 20 lines of MapReduce Java code
Faster deployment, built in functions to achieve complex tasks.
What is Apache PIG?
An open-source, high-level data analysis tool built upon MapReduce
It represents large datasets as data flows
Like a pig that eats anything, PIG accepts any type of data - structured, unstructured, and semi-structured data.
What are the components of Apache PIG? How does it handle scripts?
It is made up of two components – PIG Latin and the PIG runtime environment.
PIG Latin - scripting language (similar to SQL)
PIG Latin scripts are internally converted to Map and Reduce tasks by the PIG MapReduce engine
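A hypothetical PIG Latin word count (the file path and alias names are illustrative); each line below would expand to many lines of MapReduce Java:

```pig
-- Load raw text from HDFS
lines  = LOAD '/data/input.txt' AS (line:chararray);
-- Map: split each line into words
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Shuffle/sort: group identical words together
grpd   = GROUP words BY word;
-- Reduce: count each group
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/data/wordcount';
```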
Describe the HBase architecture