3. SAS and Hadoop Flashcards
What is a Cluster of Computers?
*A cluster of computers is a grouping of computers connected by a local area network.
*Each computer is referred to as a node in the cluster.
*The nodes work together as one system.
What is a node?
a computer in a cluster of computers. The nodes communicate with each other via the network and function as one unit.
What is Hadoop?
*An open-source software project supported by Apache
*A framework for distributed processing of large data sets
*Designed to run on computer clusters
*Made up of NameNodes and DataNodes
What are the 2 primary components of a Hadoop cluster?
A traditional Hadoop cluster consists of the NameNode, perhaps a backup NameNode, and many DataNodes.
What is meant by the Hadoop ecosystem and its 3 foundational components?
The ecosystem refers to the software components that make up the Hadoop framework. Each component has a unique function in the Hadoop ecosystem.
The 3 components or modules that serve as a foundation for Hadoop are HDFS, YARN, and MapReduce.
What are some key features of Hadoop?
- Open-source.
- A simple-to-use, distributed file storage system.
- Supports highly parallel processing, which makes it well suited for analysis of huge volumes of data.
- Scales out to handle massive amounts of data; it is easily extended by adding more storage nodes to the cluster.
- Designed to work on low-cost hardware, so the cost-entry point is fairly low.
- Data is replicated across multiple hosts/nodes to make it fault-tolerant.
- SAS has integration points that make using Hadoop familiar to existing SAS customers - procedures, LIBNAME statements, and Data Integration Studio transforms.
What are some commercial distributions of Hadoop?
Cloudera
IBM BigInsights
Hortonworks
AWS EMR (Elastic MapReduce)
MapR
Microsoft Azure HDInsight
What is the Hadoop User Experience (HUE)?
It is an open-source application for browsing, querying, and visualizing data in Hadoop. Its browser-based interface enables you to perform a variety of tasks in Hadoop.
What is HDFS?
One of the 3 core modules, the Hadoop Distributed File System is a virtual file system that distributes files across the Hadoop computer cluster.
What is YARN?
One of the 3 core modules, Yet Another Resource Negotiator is a framework for job scheduling and cluster resource management.
What is MapReduce?
One of the 3 core modules, it is a YARN-based system for automating parallel processing of distributed data.
What does a NameNode do?
The NameNode contains information about where the data is located on each DataNode. It does not hold the physical data.
What is a block in HDFS?
Blocks are the units into which HDFS splits a file. The blocks are distributed across the Hadoop DataNodes.
What does a DataNode do?
DataNodes are the components that contain blocks of data in HDFS. Data is replicated in HDFS in order to support fault tolerance. By default, each file block in HDFS is stored on three DataNodes (a replication factor of 3). If any DataNode goes down, the remaining copies are available for use.
What is the starting syntax for HDFS commands in Linux?
hdfs dfs followed by the command (e.g., hdfs dfs -ls)
What does HDFS DFS -ls do?
hdfs dfs -ls lists the contents of an HDFS directory. When you list the contents of a directory in HDFS, by default, the “home” directory of the current user is listed, such as /user/student.
What does hdfs dfs -mkdir do?
Creates a directory in HDFS. A relative path is created under the user's HDFS home directory.
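For example, hdfs dfs -mkdir sales_data (the directory name here is arbitrary) creates the directory sales_data under the current user's HDFS home directory, such as /user/student/sales_data.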
What does hdfs dfs -copyFromLocal "/data/cust.txt" "/user/std" do?
Copies local, non-distributed data into HDFS. In this example, the local file cust.txt is copied to the /user/std directory in HDFS, where it is stored as blocks on the DataNodes.
Describe the 3 steps of the MapReduce process
- Map - makes initial read of the blocks of data in HDFS and completes initial row-level operations including filtering rows or computing new columns within rows
- Shuffle and Sort - orders and groups necessary rows together
- Reduce - Performs final calculations, including calculating summary statistics within groups and writes the final results to files in HDFS
What is Pig?
Pig is a stepwise, procedural programming method and platform for analysis. Pig programs can be submitted to Hadoop, where they are converted to MapReduce programs so that processing of the data can still occur in parallel.
What is Hive?
Hive is a data warehouse framework for files stored in HDFS built to query and manage large data sets stored in Hadoop. An SQL-like language called HiveQL is used to query the data. Most HiveQL queries are compiled into MapReduce programs.
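As a rough sketch (the table and column names are hypothetical), an SQL-like query such as the following is compiled into MapReduce: the map step reads and filters the rows, the shuffle and sort step groups them by state, and the reduce step computes the totals.
SELECT state, SUM(sales) AS total_sales
FROM customers
WHERE order_year = 2018
GROUP BY state;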
What is the hadoop fs command?
The hadoop fs command can be used to interact with HDFS, as well as other file systems that Hadoop supports, such as the local file system, WebHDFS, and Amazon S3.
What does hdfs dfs -put do?
hdfs dfs -put copies a local file to an HDFS location.
What does hdfs dfs -get do?
hdfs dfs -get copies an HDFS file to a local location.
What does hdfs dfs -cat do?
hdfs dfs -cat displays an HDFS file.
What does hdfs dfs -rm do?
hdfs dfs -rm deletes an HDFS file.
What does hdfs dfs -rm -r do?
hdfs dfs -rm -r recursively deletes an HDFS directory, subdirectories, and files.
What does hdfs dfs -du -h do?
hdfs dfs -du -h displays a summary of HDFS directory and file sizes.
What can you do with HiveQL?
HiveQL enables you to query and manage HDFS files directly in their native storage format. You can impose structure or table schemas on a variety of HDFS formats using HiveQL table definitions
What are the components of the Hive architecture?
The Hive Client and the Hive Services, which interact with the Hadoop cluster.
What 3 application interfaces does the Hive Client support?
The Hive Client supports the Hive Thrift Client, the Hive JDBC Driver, and the Hive ODBC Driver, which submit SQL queries to the Hive Server.
What is the Hive Driver?
The Hive Driver is a component of the Hive Services that interacts with the Hive Server. It compiles and optimizes the Hive query, converting it into MapReduce jobs for final execution in the Hadoop cluster against the data.
What is the Hive Metastore?
Hive Metastore is a separate database, such as Apache Derby, outside Hadoop. It contains Hive table metadata definitions that point to a file in HDFS, such as field names, data types, and so on.
What is HiveQL Data Definition Language (DDL)?
HiveQL DDL enables you to define data structures, such as databases and tables. Using the HiveQL DDL, you can create databases and schemas.
What does the CREATE command in Hive DDL do?
Use this command to create a new database or table. The DATABASE and SCHEMA keywords can be used interchangeably in the CREATE statement. When you create a table, a Hive managed table is created by default.
CREATE DATABASE IF NOT EXISTS dihdm
COMMENT 'database used for this course'
WITH DBPROPERTIES ('creator'='student');
What does the ALTER command in Hive DDL do?
It alters a database's properties without dropping and re-creating the database.
ALTER DATABASE dihdm SET OWNER USER STUDENT;
ALTER DATABASE dihdm SET DBPROPERTIES
('Modified by'='student');
What does the USE command in Beeline do?
Working interactively in Beeline or any other command-line utility, you can submit the USE command to switch to a different database.
USE dihdm;
USE default;
What is Beeline?
Beeline is a command-line interface and JDBC client supported by Hive Server.
What is Beeswax?
Beeswax is a HiveQL editor that accesses the Hive Server through HUE.
What is a Hive Managed table?
A managed table means that the associated data is “managed” by Hive.
* Data is stored in the default Hive warehouse location in HDFS if no location is specified.
* The source file is moved (deleted from its original location) on load unless the LOCAL keyword is used.
* Dropping a table deletes the HDFS data.
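A minimal sketch of a managed table (the table, column, and file names are hypothetical). LOAD DATA INPATH moves the HDFS source file into the table's warehouse location; with the LOCAL keyword, a local file would be copied instead.
CREATE TABLE cust (
  id INT,
  name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/std/cust.txt' INTO TABLE cust;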
What is a Hive External table?
An external table is a table where you manage the storage location of the HDFS data.
* Use the LOCATION keyword to define the HDFS storage location.
* Dropping the table does not delete the HDFS data.
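A minimal sketch of an external table (the names and the HDFS path are hypothetical):
CREATE EXTERNAL TABLE cust_ext (
  id INT,
  name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/std/cust_data';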
What does the TRUNCATE command in Hive DDL do?
The TRUNCATE statement enables you to remove all the rows from a managed table. If you created an external table, you cannot remove all the rows because Hive does not manage the data. To remove the rows from an external table, you can change the table from external to managed, delete the rows using the TRUNCATE statement, and then reset the table back to external.
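One way to implement that workaround (the table name is hypothetical) is to toggle the EXTERNAL table property:
ALTER TABLE cust_ext SET TBLPROPERTIES ('EXTERNAL'='FALSE');
TRUNCATE TABLE cust_ext;
ALTER TABLE cust_ext SET TBLPROPERTIES ('EXTERNAL'='TRUE');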
What does the DROP command in Hive DDL do?
The DROP statement removes a Hive table from the Hive metastore. If the data is deleted, such as for a managed table, it is actually moved to the current user's .Trash directory in HDFS. It can be recovered from that location until the .Trash directory is emptied. To permanently delete the data, use the PURGE option in the DROP TABLE statement.
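For example (the table name is hypothetical):
DROP TABLE IF EXISTS dihdm.cust PURGE;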
What file formats are available in Hadoop?
- Text files (.csv, .tsv) - delimited files that use commas, tabs, and so on.
- JSON records - each row is a self-contained JSON record.
- Avro files - store the schema as metadata with the data, support block compression, and support schema evolution.
- Sequence files - store data in binary key-value pairs with block compression.
- Columnar files (e.g., RC files, ORC, Parquet) - store data in a columnar file format with significant compression.
What are SerDes?
SerDes (Serializers and Deserializers) tell Hive how to process a record when reading (deserializer) and writing (serializer) data.
When a record is read from HDFS via a deserializer, an InputFormat is used. When a record is written to HDFS via a serializer, an OutputFormat is used.
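For example (the table and column names are hypothetical), a SerDe can be chosen implicitly with STORED AS or named explicitly with ROW FORMAT SERDE:
CREATE TABLE cust_orc (id INT, name STRING)
STORED AS ORC;

CREATE TABLE cust_text (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS TEXTFILE;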
What SerDes are built into Hadoop?
- Textfile (aka LazySimpleSerDe)
- Avro
- RCFILE (Record Columnar)
- ORC (Optimized Row Columnar)
- PARQUET
- REGEX
- SEQUENCEFILE
- JSON
Describe the Textfile SerDes
- Textfile (aka LazySimpleSerDe) - the default Hive table SerDe, where data is read and written as plain text files
Describe the Avro SerDes.
- Avro - specifies a JSON schema file that defines the table columns and their data types for data stored in rows. Supports full schema evolution.
Describe the RCFILE SerDes.
- RCFILE - stores data in columnar, compressed storage units known as row groups. Each row group contains a sync marker, a metadata header, and row data.
Describe the ORC SerDes.
- ORC (Optimized Row Columnar) - similar to RCFiles. The columnar groups of data are called stripes. Each stripe contains index data, row data, and a stripe footer.
Describe the PARQUET SerDes.
- PARQUET - stores data in row groups. A row group can contain multiple columns. Each column in a row group has a page that contains column metadata.
Describe the REGEX SerDes.
- REGEX - when reading unstructured data, it might be necessary to use regular expression (or RegEx) code to load the data into columns.
Describe the SEQUENCEFILE SerDes
- SEQUENCEFILE - stores data as flat files consisting of binary key-value pairs. It is a basic Hadoop-native format. The data is stored in blocks with a record length (key length + value length), key length, key, and value. A sync marker denotes the end of the header and enables seeking to a random point in the file for efficient processing of large split files.
Describe the JSON SerDes.
The JsonSerDe enables you to read the JSON formatted file, census.json, stored as plain text in the census_json HDFS location.
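A sketch of such a table definition (the columns and HDFS path are hypothetical; the SerDe class shown is the Hive HCatalog JsonSerDe):
CREATE EXTERNAL TABLE census (
  name STRING,
  age INT,
  state STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/std/census_json';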
What are the 3 methods of storing data for better-performing Hive queries?
- Partitioning - separates data into manageable chunks based on the value of a column
- Clustering or bucketing - uses buckets and a hash algorithm to distribute and find data values across a Hadoop cluster (partitioning and bucketing can be combined, as in the sketch below)
- Indexing - no longer supported as of Hive 3.0
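A sketch of a table definition that combines partitioning and bucketing (all names and the bucket count are hypothetical):
CREATE TABLE orders (
  order_id INT,
  cust_id INT,
  amount DOUBLE)
PARTITIONED BY (order_year INT)
CLUSTERED BY (cust_id) INTO 16 BUCKETS
STORED AS ORC;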
What is Apache Pig?
Apache Pig is a platform or tool for analyzing large data sets in HDFS with a high-level SQL-like language called Pig Latin.
What is Pig Latin?
Pig Latin is a high-level, SQL-like language. It is a data flow language with a pipeline paradigm, meaning that the data cascades from one relational operation to the next in a Pig script. It can handle structured and unstructured data and is able to process big data.
What is Grunt?
Grunt is a command-line interface that enables you to submit Pig Scripts from a Client Node.
In the Pig Architecture, what does the Parser do?
The Parser performs several checks, such as syntax and type checking, on a submitted Pig script.
In the Pig Architecture, what does the Optimizer do?
Once the script has been parsed, the Optimizer performs logical optimizations of the Pig Latin statements and logical operators for a more efficient, logical plan.
In the Pig Architecture, what does the Compiler do?
The Compiler takes the optimized, logical plan and compiles it into a series of MapReduce jobs.
In the Pig Architecture, what does the Execution Engine do?
The Execution Engine submits the MapReduce jobs to Hadoop in sorted order for distributed, parallel execution against the HDFS data stored on the DataNodes.
Describe the general steps of a Pig Script.
- Load the data from HDFS to an Alias (e.g., C1)
- Filter the data and store results in an alias (e.g., C2)
- Select records in C2 for additional processing and store them in an alias (e.g., C3)
- Iterate, storing results in new aliases, until only the required data remains
- Write the final output back to HDFS
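A minimal Pig Latin sketch of those steps (the file, columns, and filter value are hypothetical):
c1 = LOAD '/user/std/cust.txt' USING PigStorage(',') AS (id:int, state:chararray, sales:double);
c2 = FILTER c1 BY state == 'NC';
c3 = FOREACH c2 GENERATE id, sales;
STORE c3 INTO '/user/std/cust_nc' USING PigStorage(',');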
What are the requirements for Pig identifiers?
Pig identifiers, such as aliases and column names, have to start with a letter and can be followed by any number of letters, digits, or underscores.
In Pig notation what are the ways to reference columns in a schema?
You can use the column name, positional notation, or a mixture of both in a Pig script. Positional notation begins with a dollar sign.
If data is loaded with no defined schema, positional notation using dollar zero ($0) is used to reference the data in a single field. This type of load might be required with certain unstructured data that has no defined delimiters, such as a Twitter feed or server log.
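For example (a hypothetical alias and columns), if c1 is loaded with the schema (id, state, sales), these two statements are equivalent because positional notation starts at $0:
c2 = FILTER c1 BY sales > 100.0;
c2 = FILTER c1 BY $2 > 100.0;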