3. SAS and Hadoop Flashcards

1
Q

What is a Cluster of Computers?

A

*A cluster of computers is a grouping of computers connected by a local area network.
*Each computer is referred to as a node in the cluster.
*The nodes work together as one system.

2
Q

What is a node?

A

a computer in a cluster of computers. The nodes communicate with each other via the network and function as one unit.

3
Q

What is Hadoop?

A

*An open-source software project supported by Apache
*A framework for distributed processing of large data sets
*Designed to run on computer clusters
*Made up of NameNodes and DataNodes

4
Q

What are the 2 primary components of a Hadoop cluster?

A

A traditional Hadoop cluster consists of the NameNode, perhaps a backup NameNode, and many DataNodes.

5
Q

What is meant by the Hadoop ecosystem and its 3 foundational components?

A

The ecosystem refers to the software components that make up the Hadoop framework. Each component has a unique function in the Hadoop ecosystem.

The 3 components or modules that serve as a foundation for Hadoop are HDFS, YARN, and MapReduce.

6
Q

What are some key features of Hadoop?

A
  • Open-source.
  • Simple to use, distributed file storage system.
  • Supports highly parallel processing - makes it well-suited for performing analysis on huge volumes of data.
  • Scales up well to handle massive amounts of data – it is easily extensible by adding more storage nodes into the cluster.
  • Designed to work on low-cost hardware, so the cost of entry is fairly low.
  • Data is replicated across multiple hosts/nodes to make it fault-tolerant.
  • SAS has integration points that make using Hadoop familiar to existing SAS customers - procedures, LIBNAME statements, and Data Integration Studio transforms.
7
Q

What are some commercial distributions of Hadoop?

A

Cloudera
IBM BigInsights
Hortonworks
AWS EMR (Elastic MapReduce)
MapR
MSFT Azure HDInsight

8
Q

What is the Hadoop User Experience (HUE)?

A

It is an open-source application for browsing, querying, and visualizing data in Hadoop. Its browser-based interface enables you to perform a variety of tasks in Hadoop.

9
Q

What is HDFS?

A

One of the 3 core modules, Hadoop Distributed File System is a virtual file system that distributes files across the Hadoop computer cluster

10
Q

What is YARN?

A

One of the 3 core modules, Yet Another Resource Negotiator is a framework for job scheduling and cluster resource management.

11
Q

What is MapReduce?

A

One of the 3 core modules, it is a YARN-based system for automating parallel processing of distributed data

12
Q

What does a NameNode do?

A

The NameNode contains information about where the data is located on each DataNode. It does not hold the physical data.

13
Q

What is a block in HDFS?

A

Blocks are the units into which HDFS splits files; the blocks are distributed across the Hadoop DataNodes.

14
Q

What does a DataNode do?

A

DataNodes are the components that contain blocks of data in HDFS. Data is replicated in HDFS in order to support fault tolerance. By default, each file block in HDFS is stored on three DataNodes (a replication factor of 3). If any DataNode goes down, the replicated copies are available for use.

15
Q

What is the starting syntax for HDFS commands in Linux?

A

hdfs dfs followed by the command (e.g., hdfs dfs -ls)

16
Q

What does HDFS DFS -ls do?

A

hdfs dfs -ls lists the contents of an HDFS directory. When you list the contents of a directory in HDFS, by default, the “home” directory of the current user is listed, such as /user/student.

17
Q

What does hdfs dfs -mkdir do?

A

Creates a directory in HDFS, by default within the HDFS home directory of the user.

18
Q

What does hdfs dfs -copyFromLocal "/data/cust.txt" "/user/std" do?

A

Copies local, non-distributed data into HDFS. In this example, the local file /data/cust.txt is copied into the /user/std directory in HDFS, where it is distributed across the DataNodes.

19
Q

Describe the 3 steps of the MapReduce process

A
  1. Map - makes initial read of the blocks of data in HDFS and completes initial row-level operations including filtering rows or computing new columns within rows
  2. Shuffle and Sort - orders and groups necessary rows together
  3. Reduce - Performs final calculations, including calculating summary statistics within groups and writes the final results to files in HDFS
20
Q

What is Pig?

A

Pig is a stepwise, procedural programming method and platform for analysis. Pig programs can be submitted to Hadoop, where they are converted to MapReduce programs so that processing of the data can still occur in parallel.

21
Q

What is Hive?

A

Hive is a data warehouse framework for files stored in HDFS built to query and manage large data sets stored in Hadoop. An SQL-like language called HiveQL is used to query the data. Most HiveQL queries are compiled into MapReduce programs.

22
Q

What is the hadoop fs command?

A

The hadoop fs command can be used to interact with HDFS, along with other file systems that Hadoop supports, such as the local file system, WebHDFS, and Amazon S3.

23
Q

What does hdfs dfs -put do?

A

hdfs dfs -put copies a local file to an HDFS location.

24
Q

What does hdfs dfs -get do?

A

hdfs dfs -get copies an HDFS file to a local location.

25
Q

What does hdfs dfs -cat do?

A

hdfs dfs -cat displays an HDFS file.

26
Q

What does hdfs dfs -rm do?

A

hdfs dfs -rm deletes an HDFS file.

27
Q

What does hdfs dfs -rm -r do?

A

hdfs dfs -rm -r recursively deletes an HDFS directory, subdirectories, and files.

28
Q

What does hdfs dfs -du -h do?

A

hdfs dfs -du -h displays a summary of HDFS directory and file sizes.

29
Q

What can you do with HiveQL?

A

HiveQL enables you to query and manage HDFS files directly in their native storage format. You can impose structure or table schemas on a variety of HDFS formats using HiveQL table definitions

30
Q

What are the components of the Hive architecture?

A

Hive Client and the Hive Services interact with the Hadoop cluster

31
Q

What 3 application interfaces does the Hive Client support?

A

The Hive Client supports the Hive Thrift Client, the Hive JDBC Driver, and the Hive ODBC Driver, which submit SQL queries to the Hive Server.

32
Q

What is the Hive Driver?

A

The Hive Driver is a component of the Hive Services that interacts with the Hive Server. It compiles and optimizes the broken-down Hive query and executes it as MapReduce for final execution in the Hadoop cluster against the data.

33
Q

What is the Hive Metastore?

A

Hive Metastore is a separate database, such as Apache Derby, outside Hadoop. It contains Hive table metadata definitions that point to a file in HDFS, such as field names, data types, and so on.

34
Q

What is HiveQL Data Definition Language (DDL)?

A

HiveQL DDL enables you to define data structures, such as databases and tables. Using the HiveQL DDL, you can create databases and schemas.

35
Q

What does the CREATE command in Hive DDL do?

A

Use this command to create a new database. The database and schema keywords can be used interchangeably with the CREATE statement. A Hive managed table is created by default.

CREATE DATABASE IF NOT EXISTS dihdm
COMMENT 'database used for this course'
WITH DBPROPERTIES ('creator'='student');

36
Q

What does the ALTER command in Hive DDL do?

A

It alters a database's properties without dropping and re-creating the database.

ALTER DATABASE dihdm SET OWNER USER STUDENT;
ALTER DATABASE dihdm SET DBPROPERTIES
('Modified by'='student');

37
Q

What does the USE command in Beeline do?

A

Working interactively in Beeline or any other command-line utility, you can submit the USE command to switch to a different database.

USE dihdm;
USE default;

38
Q

What is Beeline?

A

Beeline is a command-line interface and JDBC client supported by Hive Server.

39
Q

What is Beeswax?

A

Beeswax is a HiveQL editor that accesses Hive Server through HUE.

40
Q

What is a Hive Managed table?

A

A managed table means that the associated data is “managed” by Hive.
* Storage is predetermined if not specified.
* Source is deleted on load unless the LOCAL keyword is used.
* Dropping a table deletes the HDFS data.

41
Q

What is a Hive External table?

A

An external table is a table where you manage the storage location of the HDFS data.
* Use the LOCATION keyword to define the HDFS storage location.
* Dropping the table does not delete the HDFS data.
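
For illustration, a minimal SQL pass-through sketch that creates an external table with a LOCATION clause (the connection values, columns, and HDFS path are hypothetical; the CONNECT/EXECUTE pattern follows the pass-through examples later in this deck):

proc sql;
   connect to hadoop (server='server04.demo.sas.com' port=10000 schema=dihdm user='student');
   /* the HiveQL runs in Hive; dropping cust_ext later would leave the files in /user/student/cust_ext */
   execute (
      create external table cust_ext (
         id   int,
         name string )
      row format delimited fields terminated by ','
      stored as textfile
      location '/user/student/cust_ext'
   ) by hadoop;
   disconnect from hadoop;
quit;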

42
Q

What does the TRUNCATE command in Hive DDL do?

A

The TRUNCATE statement enables you to remove all the rows from a managed table. If you created an external table, you cannot remove all the rows because all data resides outside the Hive metastore. To remove the rows from an external table, you can change the table from external to managed, delete the rows using the TRUNCATE statement, and reset the table back to external.

43
Q

What does the DROP command in Hive DDL do?

A

The DROP statement removes a Hive table from the Hive metastore. If the data is deleted, such as for a managed table, it is actually moved to a .trash HDFS directory for the current user session. It can be recovered from that location until the .trash directory is emptied. To permanently delete the data, use the PURGE option in the DROP TABLE statement.

44
Q

What file formats are available in Hadoop?

A
  • Text Files (.csv, .tsv) - delimited files using commas, tabs, and so on.
  • JSON records - each row is a single JSON record
  • Avro Files - store schema metadata with block compression data and support schema evolution
  • Sequence Files - store data in binary key-value pairs with block compression
  • Columnar files (e.g. RC files, ORC, Parquet) - store data in a columnar file format with significant compression
45
Q

What are SerDes?

A

SerDes, Serializers and Deserializers, tell Hive how to process a record when reading (deserializer) and writing (serializer) data.

When a record is read from HDFS via a deserializer, an InputFormat is used. When a record is written to HDFS via a serializer, an OutputFormat is used.

46
Q

What SerDes are built into Hive?

A
  • Textfile (aka LazySimpleSerDe)
  • Avro
  • RCFILE (Record Columnar)
  • ORC (Optimized Record Columnar)
  • PARQUET
  • REGEX
  • SEQUENCEFILE
  • JSON
47
Q

Describe the Textfile SerDes

A
  • Textfile (aka LazySimpleSerDe) - the default Hive table SerDe, where data is read and written as plain text files
48
Q

Describe the Avro SerDes.

A
  • Avro - specifies a JSON schema file that defines the table columns and their data types for data stored in rows. Supports full schema evolution.
49
Q

Describe the RCFILE SerDes.

A
  • RCFILE - stores data in columnar, compressed storage formats known as row groups. Each row group contains a sync marker, a metadata header, and row data.
50
Q

Describe the ORC SerDes.

A
  • ORC (Optimized Record Columnar) - similar to RCFiles. The columnar groups of data are called stripes. Each unique stripe contains index data, row data, and a stripe footer.
51
Q

Describe the PARQUET SerDes.

A
  • PARQUET - stores data in row groups. A row group can contain multiple columns. Each
    column in a row group has a page that contains column metadata.
52
Q

Describe the REGEX SerDes.

A
  • REGEX - When reading unstructured data, it might be necessary to use regular expression (or RegEx) code to load the data into columns
53
Q

Describe the SEQUENCEFILE SerDes

A
  • SEQUENCEFILE - stores data as flat files consisting of binary key-value pairs. It is a basic Hadoop proprietary format. The data is stored in blocks with a record length (key length + value length), key length, key, and value. The sync marker denotes the end of the header and enables seeking to a random point in the file for efficient processing of large split files.
54
Q

Describe the JSON SerDes.

A

The JsonSerDe enables you to read JSON-formatted data stored as plain text, for example, the file census.json stored in the census_json HDFS location.

55
Q

What are the 3 methods of storing data for better performing Hive queries?

A
  • Partitioning - separates data into manageable chunks based on a value of a column
  • Clustering or Bucketing - Uses buckets and a hash algorithm to distribute
    and find data values across a Hadoop cluster
  • Indexing (no longer supported as of Hive 3.0)
56
Q

What is Apache Pig?

A

Apache Pig is a platform or tool for analyzing large data sets in HDFS with a high-level SQL-like language called Pig Latin.

57
Q

What is Pig Latin?

A

Pig Latin is a high-level, SQL-like language. It is a data flow language with a pipeline paradigm, meaning that the data cascades from one relational operation to the next in a Pig script. It can handle structured and unstructured data and is able to process big data.

58
Q

What is Grunt?

A

Grunt is a command-line interface that enables you to submit Pig Scripts from a Client Node.

59
Q

In the Pig Architecture, what does the Parser do?

A

The Parser performs several checks, such as syntax and type checking on a submitted Pig script

60
Q

In the Pig Architecture, what does the Optimizer do?

A

Once the script has been parsed, the Optimizer performs logical optimizations of the Pig Latin statements and logical operators for a more efficient, logical plan.

61
Q

In the Pig Architecture, what does the Compiler do?

A

The Compiler takes the optimized, logical plan and compiles it into a series of MapReduce jobs.

62
Q

In the Pig Architecture, what does the Execution Engine do?

A

The Execution Engine submits the MapReduce jobs to Hadoop in sorted order for distributed, parallel execution against the HDFS data stored on the DataNodes.

63
Q

Describe the general steps of a Pig Script.

A
  1. Load the data from HDFS to an Alias (e.g., C1)
  2. Filter the data and store results in an alias (e.g., C2)
  3. Select records in C2 for additional processing and store them in an alias (e.g., C3)
  4. Iterate until you get to just the required data adding to aliases
  5. Write the final output back to HDFS
64
Q

What are the requirements for Pig identifiers?

A

Pig identifiers, such as aliases and column names, have to start with a letter and can be followed by any number of letters, digits, or underscores.

65
Q

In Pig notation what are the ways to reference columns in a schema?

A

You can use the column name, positional notation, or a mixture of both in a Pig script. Positional notation begins with a dollar sign.

If data is loaded with no defined schema, positional notation using dollar zero ($0) is used to reference the data in a single field. This type of load might be required with certain unstructured data that has no defined delimiters, such as a Twitter feed or server log.

66
Q

Pig notation is case-sensitive. True or False?

A

True.

Aliases, column names, and functions like COUNT are case-sensitive.

Keywords such as LOAD, USING, AS, GROUP BY, etc. are not case-sensitive.

67
Q

What is the pig -x option?

A

Pig supports several command-line execution modes using the pig -x option. The default execution mode is MapReduce. This mode requires access to the Hadoop cluster and HDFS installation in order to use the Hadoop cluster resources.

68
Q

What is the pig -x local option?

A

Local mode runs all files on a single machine using local host and file system resources.

69
Q

What is the pig -x tez option?

A

Tez mode requires access to the Hadoop cluster where the Tez component, originally created by Hortonworks, is installed, along with HDFS. Tez is built to run on top of YARN and does not require containers to be kept running like MapReduce or Spark. The performance of Tez can be up to 100 times better than MapReduce.

70
Q

What is the pig -x spark option?

A

Spark mode requires access to a Spark or YARN Hadoop cluster and the HDFS installation. Pig Latin commands are translated to Spark transformations and actions and run in memory on the DataNodes. Each command carries out a single data transformation, such as filtering, grouping, or aggregation

71
Q

How are parameters used in Pig?

A

To support running Pig scripts in batch mode, parameters can be passed into the Pig script at run time using several methods. The parameters and values can be specified on the command line or in a parameter file. The value can be set inside the Pig script with %default or %declare. Using %declare, a Pig script that writes to STDOUT can be called as input in a LOAD statement of a different Pig script.

72
Q

What is the advantage of the Grunt shell?

A

To execute a Pig script from the command line, you can use the interactive Grunt shell by entering pig at the UNIX prompt on the Client Node. This enables you to interactively enter and debug Pig Latin commands.

73
Q

How do you run Pig scripts in batch?

A
  1. Create Pig script using favorite editor
  2. Save program with .pig extension
  3. $pig -param_file params.init program.pig
74
Q

What is the PigStorage() function in Pig?

A

The PigStorage function is used to parse the lines of the input files into separate fields. Several options are available with the PigStorage keyword. If the PigStorage keyword is not specified on the LOAD or STORE keywords, the default delimiter is the tab character.

75
Q

What are the -tagFile or -tagPath options in Pig?

A

The -tagFile and -tagPath options prepend the input file name or input file path to each row or tuple loaded. Before using these options, the pig.splitCombination property must be set to FALSE so that Pig does not combine small files and pass the resulting combination to the mapper; combining files would make it difficult to determine the exact input files.

76
Q

What are the 2 categories of data types in Pig?

A

Simple (e.g., int, long, float, double, chararray, datetime, etc.)

Complex (e.g., tuple, bag, relation (alias), map)

77
Q

What is a tuple in Pig?

A

An ordered set of fields, e.g., (5,2)

78
Q

What is a Bag in Pig?

A

A collection of Tuples (e.g., {(5,2),(12,9)} )

79
Q

What is a Relation (aka Alias) in Pig?

A

Outer bag with a collection of inner bags (e.g., {{(5,2),(12,9)}, {(5,2),(12,9)}} )

It is similar to a table in a relational database

80
Q

What is a Map in Pig?

A

Set of unique keys with values within a relation (e.g., [id#1001,name#David,state#NC])

81
Q

What can you do with the CAST operator in Pig?

A

The CAST operator enables you to convert data from one data type to another if conversion is supported.

82
Q

What does the FLATTEN operator do in Pig?

A

The FLATTEN operator enables you to “un-nest” or “transpose” tuples or bags. For tuples, the FLATTEN operator substitutes the fields of a tuple in place of the tuple. For bags, FLATTEN creates new tuples.

83
Q

What does the FOREACH operator do in Pig?

A

The FOREACH operator processes each record and can be used to select desired columns and derive new columns for each record.

84
Q

What does the FILTER BY operator do in Pig?

A

The FILTER BY operator enables you to select the rows of data that you want to work with or remove the data that you don’t want based on the expression.

85
Q

What does the STORE keyword do in Pig?

A

The STORE keyword writes the results to HDFS. If the HDFS output directory does not exist, Pig creates it. If the HDFS output directory already exists, Pig fails.

86
Q

What does the DUMP keyword do in Pig?

A

The DUMP keyword prints the output to the console. Both the STORE and DUMP commands initiate MapReduce execution.

87
Q

What does the DISTINCT keyword do in Pig?

A

The DISTINCT keyword removes duplicate tuples (rows of data) from the alias.

88
Q

What does the LIMIT keyword do in Pig?

A

The LIMIT keyword is used to restrict the number of tuples or rows in the output. It is typically used during the development or testing phase of a Pig script.

89
Q

What are the 2 types of functions in Pig?

A

Built-in
User-Defined - these extend the behavior of Pig, but need to be registered so that Pig knows where to find them. Java is the most widely supported language for writing them.

90
Q

What are the 6 categories of Pig Functions?

A
  1. Evaluation (AVG, SUM, CONCAT…)
  2. Load/Store (PigStorage, JsonLoader, TextLoader…)
  3. Math (ROUND, ABS, SIN, COS, EXP, RANDOM…)
  4. String (STARTSWITH, ENDSWITH, REGEX_EXTRACT, REPLACE, TRIM…)
  5. Datetime (CurrentTime, DaysBetween, GetDate, ToDate…)
  6. Tuple, Bag, Map (TOTUPLE, TOBAG, TOMAP)
91
Q

What is the Piggybank library?

A

The Piggybank library is a collection of user-defined functions for Pig. They are “as-is” Pig-user contributions, so if you find a bug or feel a function is missing, you can fix it or write it yourself.

92
Q

What is the Apache DataFu Pig UDF library?

A

The Apache DataFu Pig UDF library is a collection of user-defined functions for working with large-scale data in Pig, used for data mining and statistics. It was originally from LinkedIn but is now open source under the Apache Foundation.

93
Q

What are the 3 ways to interface with Hadoop using Base SAS?

A
  1. FILENAME statement
  2. PROC HADOOP
  3. PROC SQOOP

Separately, SAS/ACCESS Interface to Hadoop is used to interact with Hive in Hadoop for SQL-based parallel processing.

94
Q

What 3 SAS technologies can use Hadoop for distributed file storage?

A
  1. SAS Viya
  2. SAS LASR
  3. SAS High-Performance Analytics

These solutions leverage a network-connected grid of computing resources to perform parallel distributed processing on distributed data sets

95
Q

What language and SAS technology can execute threaded code within Hadoop?

A

DS2 code executes within Hadoop using the SAS Code Accelerator for Hadoop, which is installed on the Hadoop cluster. DS2 is intended for data manipulation and data management.

96
Q

What does the CONCAT option in the FILENAME method do?

A

The CONCAT option reads all of the files in the directory. The files in the directory should all be of the same structure to avoid any issues when processing the data.

97
Q

How can you read a Hadoop file accessed by the FILENAME method?

A

You can use the SAS DATA step with the INFILE statement pointing to the location specified in the FILENAME method.
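
A minimal sketch, assuming the FILENAME Hadoop access method is configured at your site (the fileref, HDFS path, cfg= file, user, and input layout are all hypothetical):

filename cust hadoop '/user/student/cust.txt'
         cfg='/local/conf/hadoopcfg.xml' user='student';

data work.customers;
   /* read the HDFS file like any other external file */
   infile cust dlm=',' dsd truncover;
   input customer_id name :$40.;
run;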

98
Q

What does the PROC HADOOP method do?

A

PROC HADOOP submits a specific subset of HDFS commands and executes existing MAPREDUCE and PIG programs from Base SAS

99
Q

What are some HDFS commands you can run from PROC HADOOP?

A
  1. COPYTOLOCAL / COPYFROMLOCAL
  2. MKDIR
  3. DELETE - Delete a directory and all its files
  4. RENAME - Rename a file
  5. CAT - displays the contents of an HDFS file or the contents of a directory
  6. CHMOD - changes or sets the permissions on an HDFS location or file using Linux permission syntax in symbolic or octal notation. The RECURSE option sets permissions on all files and subdirectories in an HDFS location.
  7. LS - lists any files in an HDFS location

The CODE= option of the PIG statement executes a Pig program defined in a file (see the sketch below).
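
A hedged sketch of PROC HADOOP (the config file, credentials, paths, and Pig program are hypothetical):

proc hadoop cfg='/local/conf/hadoopcfg.xml' username='student' password='XXXX' verbose;
   hdfs mkdir='/user/student/newdir';                               /* MKDIR */
   hdfs copyfromlocal='/data/cust.txt' out='/user/student/newdir';  /* COPYFROMLOCAL */
   hdfs delete='/user/student/olddir';                              /* DELETE */
   pig code='/code/top_customers.pig';                              /* CODE= runs a Pig program */
run;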

100
Q

What are the 2 main options for querying Hive in Hadoop through SAS/ACCESS?

A

SQL pass-through using PROC SQL to submit SQL and DDL statements

LIBNAME statement in a SAS program with SAS DATA step or procedure which will convert SAS code to HiveQL

101
Q

In PROC SQL, describe the attributes of the CONNECT statement.

A

The CONNECT statement has the following attributes:
* It defines the access method or engine, such as Hadoop or Impala.
* It establishes a connection to the Hive server running on the specified port, with the appropriate credentials.
* It provides information specific to the Hive metastore, such as an existing schema.

proc sql;
connect to hadoop
(server=server04.demo.sas.com
port=10000 schema=dihdm user="student");

102
Q

In PROC SQL, describe the attributes of the SELECT statement.

A

The PROC SQL SELECT statement is used to query Hive tables and bring results back to SAS.

It contains both a SAS SQL SELECT statement and a HiveQL query. The SELECT statement starts with the SAS SELECT statement, then the FROM CONNECTION TO HADOOP clause, and then the HiveQL query. The HiveQL query is enclosed in parentheses.

proc sql;
connect to hadoop (<connection>);
select * from connection to hadoop
( select count(*) as count
from customer );

disconnect from hadoop;
quit;

103
Q

In PROC SQL, describe the EXECUTE statement.

A

The EXECUTE statement allows you to execute Hive DDL statements. The keywords BY HADOOP are used to indicate that the statement should be sent to the Hive server connection established on the CONNECT TO HADOOP statement.

proc sql;
connect to hadoop (<connection>);
execute (drop table salestaff) by hadoop;

disconnect from hadoop;
quit;

104
Q

What are the commands SHOW TABLES and SHOW COLUMNS?

A

The HiveQL commands SHOW TABLES and SHOW COLUMNS allow us to explore a particular database or a particular table. These commands can be issued from various interfaces available for Hive. This includes a Hive Editor available in Hue. It also includes SQL pass-through in SAS.

SHOW TABLES - lists the available tables in a Hive schema
SHOW COLUMNS - lists the columns in a Hive table in the specified schema

105
Q

What is the command DESCRIBE FORMATTED?

A

The DESCRIBE FORMATTED Hive statement gives you column names and data types, detailed table information, and storage information.

106
Q

What does the following code do?
libname AHDHIVE hadoop server=server04.demo.sas.com
user="&std" schema=DIAHD;

proc contents data=AHDHIVE._all_;
run;

A

The SAS/ACCESS LIBNAME statement defines the AHDHIVE library and enables you to connect to a Hive database on Hadoop.

proc contents data=AHDHIVE._all_; - accesses Hive metadata, with overall and detailed information for all tables in the library displayed in the Results tab.

107
Q

What is the syntax for a Hive date and a SAS date?

A

Hive date = 'YYYY-MM-DD' (e.g., '2013-09-07')
SAS date = 'DDMONYYYY'd (e.g., '07SEP2013'd)

108
Q

What is the difference between the SAS statements CREATE TABLE and CREATE VIEW?

A

When a CREATE TABLE statement is executed, the results of the SELECT query are stored as data in the table.

When a CREATE VIEW statement is executed, the SELECT query is not executed and no data is stored in the view. Instead, the SELECT query itself is stored in the view.

109
Q

What is a SAS View?

A
  • Can be referenced in a SAS program in the same way as a table
  • Extracts underlying data each time it is used - most current data
  • Permits transparent access to Hive tables
  • Stores Hive connection information
  • Does not contain any data and is not a physical table
110
Q

What are advantages of a SAS View?

A

Users are given access to current data in Hadoop but are also prevented from inadvertently writing SAS code that accesses Hive tables inefficiently and returns too much data from Hadoop back to the SAS server.

Views also allow organizations to limit access to sensitive data stored in Hive tables that business users should not be able to surface.

Analysts who might not be familiar with HiveQL or SQL can still be given access to the view.

111
Q

Where does processing occur when combining data from multiple data sources (e.g. Hive, SAS, and Oracle)?

A

Processing occurs in SAS after the data is moved.

112
Q

What does the DATA Step Concatenation do?

A

If the DATA step is used for concatenation of two or more Hive tables, the concatenation will occur in SAS. The SAS/ACCESS engine will generate separate HiveQL SELECT statements for each table in the DATA step SET statement, then the results will be concatenated in SAS.

113
Q

What does a SQL set operator do?

A

A UNION is one example of a set operator. With SQL set operators, two queries are written that independently retrieve data. Each query generates a result set, and those result sets are combined vertically. The way the result sets are combined depends on the set operator in use.

114
Q

What does the USING clause do?

A

It is useful anytime the view is stored in a library that is different from the library that the view refers to. The USING clause contains the same text you would supply to a LIBNAME statement to define the library referred to in the FROM clause of the query that is stored. This stores the library definition in the view definition, which is why it is sometimes called an embedded LIBNAME statement.

The library definition that is part of the stored query is defined only temporarily as the view is being executed and is accessible only as long as the view is executing. After the view executes and has returned results to SAS, there will not be a session-level definition for the library defined by the USING clause unless a session-level LIBNAME statement has been submitted independently during that SAS session.

The USING clause is not ANSI-standard SQL.

115
Q

What are disadvantages of a SAS View?

A

The stored query will execute first and return all results to SAS. Then the WHERE clause will execute on the SAS side (not in Hive, where it would preferably run). So you could potentially pull a lot of data back to SAS.

116
Q

How do you join tables in HiveQL?

A

Use the ON clause instead of WHERE…

from salesstaff st
inner join
sales s
on st.manager_id = s.employee_id

117
Q

What is the following SAS/ACCESS code doing?

libname AHDSAS "d:\workshop\diahd";
libname AHDMYHIV hadoop server='server04.demo.sas.com' port=10000 schema=DIAHDMYHIVE user="&std";

proc copy in=AHDSAS out=AHDMYHIV;
select customer_dim;
run;

A

Using SAS libraries, data can be copied to Hive using standard SAS language syntax. In this example, the customer_dim data set is copied from a local SAS library (AHDSAS) to the Hive schema (DIAHDMYHIVE).

SAS implicitly generates a CREATE TABLE statement that executes in Hive. SAS also moves the data from the customer_dim SAS data set into HDFS and places it in the location that the Hive table definition points to.

118
Q

What is the following SAS/ACCESS code doing?

libname AHDSAS "d:\workshop\diahd";
libname AHDMYHIV hadoop server='server04.demo.sas.com' port=10000 schema=DIAHDMYHIVE user="&std";

proc sort data=AHDSAS.order_fact
out=work.order_fact;
by customer_id;
run;
data AHDMYHIV.CustOrd;
merge AHDSAS.customer_dim(in=inC)
work.order_fact(in=inO);
by customer_id;
if inC=1 and inO=1;
run;

A

In this example, SAS processed the DATA step merge of the two SAS data sets. To create the output data set, SAS implicitly generates and sends a HiveQL CREATE TABLE statement to Hive and moves the output data to the HDFS location that the Hive table points to.

119
Q

What is SASTRACE?

A

SASTRACE is a system option that lets you see the generated HiveQL. It displays the code passed to a database, in this case to Hadoop. An OPTIONS statement specifies how and where to display the generated HiveQL.
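
For example, a commonly used combination (these option values are the standard ones for surfacing generated SQL in the log):

/* ',,,d' requests the SQL sent to the DBMS, SASTRACELOC= routes it to the SAS log,
   and NOSTSUFFIX trims the extra trace suffix for readability */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;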

120
Q

Can all SAS functions be implicitly converted when passed to Hive? True or False.

A

False. While many functions can be implicitly converted to HiveQL, some functions, like WEEKDAY, cannot.

121
Q

List some SAS language elements that convert to HiveQL.

A
  • PROC SQL
  • A few additional PROCs
  • WHERE
  • Some SAS functions
  • Some data set options
122
Q

List some SAS language elements that cannot convert to HiveQL.

A
  • DATA step
  • Some SAS functions
  • Most PROCs
123
Q

If a command does not convert to HiveQL, will it fail?

A

No. SAS will do further processing on any data returned from Hive in order to accomplish the task. For example, SAS might need to retrieve all the rows of data from a Hive table and proceed from that point to process the data on the SAS server.

124
Q

What do KEEP= and DROP= do?

A

The KEEP= and DROP= data set options allow you to limit the columns that you read from a source table and will allow you to limit the number of columns you return to SAS from a Hive table.

125
Q

What does OBS= do?

A

The OBS= data set option is often useful during testing to limit the total number of rows that you read from an input data set. You can use this while testing programs that read Hive tables to see what type of HiveQL is implicitly generated by the LIBNAME engine without executing the query on an entire Hive table.
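
A small sketch combining this with the KEEP= option from the previous card (the libref and table are hypothetical):

/* only two columns and at most 20 rows are requested from the Hive table,
   limiting what is pulled back from Hadoop to SAS */
proc print data=ahdhive.customer_dim (keep=customer_id customer_name obs=20);
run;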

126
Q

What does RENAME= do?

A

The RENAME= data set option is used to change the name that is used for a column during processing. This does not affect performance, but this option is still supported when reading Hive tables. Renaming occurs on the output data set.

127
Q

What affect do Reducers have on performance?

A

To take advantage of the parallel processing capabilities of Hadoop and improve query performance, you might need to increase the number of reducers used by the MapReduce tasks generated by the HiveQL processing.

By default, MapReduce uses a single reducer per node. Reducers are used, for example, for group processing, and you can increase performance by increasing the number of reducers to enable different groups to be processed in parallel.

Reducers are also used to write the output of MapReduce processes to HDFS.

The number of reducers can be specified by adding a PROPERTIES= option to the Hadoop LIBNAME statement and setting the value for the Hadoop property mapred.reduce.tasks.
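
A hedged sketch of the PROPERTIES= option (the server, schema, and reducer count are hypothetical):

libname ahdhive hadoop server='server04.demo.sas.com' port=10000 schema=dihdm
        user="&std" properties="mapred.reduce.tasks=8";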

128
Q

What procedures can be used to sort data in Hadoop?

A

both PROC SQL and PROC SORT will generate HiveQL to order data in Hive. Sorting data can be resource intensive, and you should sort the data only when it is needed.

129
Q

What happens when using a “by” statement in a DATA or PROC step for sorting?

A

SAS/ACCESS LIBNAME engine automatically generates an ORDER BY clause for those variables. Do not pre-sort Hive tables with PROC SORT or PROC SQL for the purpose of BY-statement processing. It is more efficient to let the LIBNAME engine generate the HiveQL to retrieve the data directly in the order required.

130
Q

What is DS2?

A

DS2 is a Base SAS programming language. It provides advanced data manipulation techniques that make writing reusable code easier.

It can handle many data types beyond the traditional fixed-width character and double-precision floating-point numeric types available in the Base SAS DATA step.

You can create user-defined methods, store them in packages, and share them easily with your future self or with others in your organization.

A DS2 process can process multiple observations at once in multiple compute threads, even when running on a single computer, and on distributed systems it takes advantage of parallel processing by multithreading across the nodes.

131
Q

When should you use DS2?

A

If your program does complex computations or is CPU bound, your process will probably benefit from multithreading on the Base SAS platform, even if you’re running on a single machine.

If your data is coming from a table on a database with the SAS In-Database Accelerator installed, you can take the code right to where the data lives rather than having to move the data to the code.

If you’re reading and writing to caslibs in the SAS Viya Cloud Analytic Server, then you’re definitely going to want to use DS2 because DS2 runs natively in parallel across all of those massively parallel nodes.

Or you might need DS2 just for the precision that the data types offer.

Use DS2 for the reusable code modules stored in packages

132
Q

What are the 3 types of programs for DS2?

A

Package
Thread
Data

133
Q

How is a DS2 Data program structured?

A

Begins with a DATA statement
Ends with an ENDDATA statement
Requires a RUN statement

134
Q

What is a method?

A

A named executable block of code. In DS2 all executable statements are only valid within the context of a method.

135
Q

What are the 3 system methods that execute automatically when a DS2 Data program is compiled?

A

INIT() - required to kick things off; contains executable statements that run once and only once, immediately when execution begins.
RUN() - repeats execution once for every row of data read in through the SET statement. It is the only method with an implicit OUTPUT at the end.
TERM() - executes its code once and only once, immediately before the DS2 data program stops execution.
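
A minimal sketch of the three system methods (the input table work.in and its amount column are hypothetical):

proc ds2;
data work.totals / overwrite=yes;         /* overwrite=yes replaces any existing output table */
   dcl double total;                      /* global declaration: retained across run() calls */
   method init();                         /* runs once, at the start */
      total = 0;
   end;
   method run();                          /* runs once per row read by SET; implicit OUTPUT at the end */
      set work.in;
      total = total + amount;
   end;
   method term();                         /* runs once, just before the program stops */
      put 'Final total:' total;
   end;
enddata;
run;
quit;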

136
Q

Describe a User Defined Method.

A

Methods that a user can write themselves and which can accept arguments and return a value. They only execute when called

137
Q

What is a declarative statement in DS2?

A

DCL - Either Global (outside a method) or Local (inside a method)
Required to come before any executable statement

138
Q

How do you overwrite output data in a DS2 program?

A

Use the / overwrite=yes option on the DATA statement.

139
Q

How do you control DS2’s reaction to undeclared variables?

A

Modify this behavior using the DS2SCOND= system option or, for a particular PROC DS2 step, the SCOND= option on the PROC DS2 statement.

Using a value of NONE causes DS2 to take no action when undeclared variables are encountered; nothing is written to the log.

140
Q

How do double quotes and single quotes differ in DS2 programs?

A

Double quotes identify the name of something (a delimited identifier). Single quotes delimit character literal values, e.g., 'Text Value'.

141
Q

What version of SQL is integrated with DS2?

A

FedSQL - is a newer version of SQL based on the SQL3 standard. FedSQL has improved ANSI syntax compliance, which means that it is more likely to pass through larger amounts of processing into the database when used in conjunction with a relational database.

In addition, FedSQL has native support for 17 different ANSI data types, unlike PROC SQL, which can only handle fixed width character and double precision floating point. And finally, FedSQL was designed from the ground up for threaded, high-performance processing, making it extremely scalable and fast in operation.
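
FedSQL can also be submitted directly from Base SAS with PROC FEDSQL; a small sketch (the tables and columns are hypothetical):

proc fedsql;
   create table work.order_totals as
   select customer_id, sum(amount) as total_amount
      from work.orders
      group by customer_id;
quit;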

142
Q

What does the FORWARD statement do?

A

The FORWARD statement basically tells the compiler that in the code that follows, a method call is going to be seen before the method definition is produced. This is not uncommon in packages, where we’re packaging up a whole mess of custom methods for future reuse. And frequently, the definition of one method will include a call to another method which has not yet been defined in the code.

It is only valid in the global portion of a program block

143
Q

What Character Data Types can DS2 handle?

A

Fixed length character and Fixed length unicode character
Variable length character and Variable length unicode character

144
Q

What numeric data types can DS2 handle?

A

Double, float, real, decimal, and integer

Precision can be an issue - extremely long number strings stored as DOUBLE precision values in DS2 are, at best, an approximation of the number that was stored.

The DECIMAL type handles precision exactly.

You can add an n-suffix to tell DS2 to compile a large constant as a decimal.

Scale assigns the number of digits in the fraction, e.g., decimal(5,3).
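
A small sketch of declaring a DECIMAL with precision and scale (the values are illustrative):

proc ds2;
data _null_;
   method init();
      dcl decimal(8,3) price;   /* 8 significant digits, 3 of them to the right of the decimal point */
      price = 12345.678;
      put 'price:' price;
   end;
enddata;
run;
quit;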

145
Q

What are the keywords for date and time constants in DS2?

A

Date, Time, Timestamp

Mydate = date'2013-09-07';

146
Q

What is a coercible data type in DS2?

A

Coercible data types will automatically convert to another data type in order to resolve an expression. There is a hierarchy of data types to determine which take precedence.

147
Q

How do SAS and DS2 treat nulls?

A

DS2 treats nulls as true nulls. SAS does not have the concept of null. It converts nulls to missing, which can be represented in various ways like a period or a space.

148
Q

What are the 2 modes for DS2 processing?

A

SAS and ANSI. These modes handle nulls and missing data differently

149
Q

Are ANSI date and time types coercible?

A

No, they are not. When DS2 reads SAS dates, they are converted to character data types, and DS2 must convert those character dates to ANSI date values for use.

150
Q

What is a limitation of VARARRAY?

A

They are global variables that cannot be declared within a method.

151
Q

Can DS2 methods define a variable number of parameters?

A

No. DS2 methods definitions must specify a fixed number of arguments.

152
Q

What does it mean to Overload a Method?

A

Overloading is simply defining more than one method with the same name, but with distinguishing characteristics that allow the system to pick the correct one when called. These distinguishing characteristics are called a method’s signature. The signature consists of the name of that method followed by an ordered list of the parameter data classes. The method is said to be overloaded when there are two or more definitions with the same name, but with clearly distinguishable signatures.

153
Q

What are DS2 Packages?

A

DS2 packages provide a convenient way to aggregate several user-defined methods and store them offline in a way that makes the code easily reusable. DS2 packages are stored in SAS libraries and so are available to most any SAS session that has access to the library where the package is stored.
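
A hedged sketch of defining a user-defined method in a package and calling it from a data program (the names are hypothetical; a one-level package name defaults to the WORK library):

proc ds2;
package shoututil / overwrite=yes;
   method shout(varchar(100) s) returns varchar(100);
      return upcase(s);            /* trivial reusable method */
   end;
endpackage;
run;

data _null_;
   method init();
      dcl package shoututil u();   /* instantiate the stored package */
      dcl varchar(100) msg;
      msg = u.shout('reusable ds2 code');
      put msg;
   end;
enddata;
run;
quit;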

154
Q

What is the SQLSTMT Package?

A

The SQLSTMT package has several built-in methods that allow us to execute a FedSQL statement stored as part of the package information, and dynamically replace placeholders in that SQL statement with values from Program Data Vector variables

It has compile-time instantiation

Can be parameterized

155
Q

What is execution time instantiation?

A

You can declare a package identifier but not provide instantiation parameters.
Instantiation is then delayed until program execution, using the _NEW_ operator.

156
Q

How can you submit programs that use parallel processing?

A

Use a DS2 threaded program. A thread program is a DS2 program whose result is a stored program in a SAS library.

Threads are executed from a small, simple DS2 data program, and a thread cannot cause another thread to execute.

When threading on the Base SAS system, you must declare the number of threads you want to run, using the THREADS= option on the SET FROM statement. Failing to declare the number of threads results in the process running single-threaded.
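
A hedged sketch of a thread program and the data program that runs it (the input table, its columns, and the thread count are hypothetical):

proc ds2;
thread work.t_calc / overwrite=yes;
   method run();
      set work.in;                 /* each thread reads a portion of the rows */
      /* row-level computations would go here */
   end;
endthread;
run;

data work.out / overwrite=yes;
   dcl thread work.t_calc t;
   method run();
      set from t threads=4;        /* THREADS= sets the degree of parallelism on Base SAS */
   end;
enddata;
run;
quit;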

157
Q

What is the SAS Performance Scalable Data Engine, or SPDE?

A

You can use SPDE to store SAS files in the file system by breaking data files into chunks that can be read from disk and written back to disk in parallel.

158
Q

What is the SAS In-Database Code Accelerator?

A

The SAS In-Database Code Accelerator adds additional capabilities to the SAS/ACCESS interface installed on your SAS system. In order for the SAS In-Database Code Accelerator to work, you must first have licensed the traditional SAS/ACCESS interface.

With a properly installed and configured SAS embedded process, the DS2 programs are sent in-database as code, compiled and executed natively on the database hardware. This execution leverages the native massively parallel I/O and compute processing to dramatically improve the speed of your process.

159
Q

What Is SAS Data Loader for Hadoop?

A

SAS Data Loader for Hadoop is designed to provide business users access to data in the Hadoop cluster without requiring them to be programmers.

160
Q

What are the 4 key capabilities for SAS Data Loader for Hadoop?

A
  1. simple, self-service interface to support data preparation.
  2. designed and built to run most of the SAS data manipulation and cleansing functions in parallel on Hadoop to leverage the power of distributed processing.
  3. provides an easy-to-use interface that performs multiple data management and data quality tasks on data in Hadoop without moving the data from the Hadoop data node. Also, users do not need to write specialized ETL code, such as HiveQL, or Pig Latin.
  4. loading data from Hadoop directly into the SAS LASR Analytic Server, or CAS Server, to support Visual Analytics, Statistics, and other applications.
161
Q

What are Directives?

A

Primary functions of SAS Data Loader such as Cluster and Survive, Run a SAS Program, Cleanse Data, Load Data to Hadoop, and Load Data from Hadoop.

162
Q

What are Tasks in Directives?

A

Each directive has a series of tasks to work through to provide information that pertains to the data management activity that you want to process and run.

163
Q

What are transformations within Directives?

A

aid in various tasks that are associated with managing and manipulating data in Hadoop. Many transformations are available within SAS Data Loader directives. These transformations provide you with the ability to perform one or more data transformation steps within the job or directive.

164
Q

What are the 5 steps for Data Preparation Methodology?

A
  1. Acquire and Discover Data
  2. Transform Data
  3. Cleanse Data
  4. Integrate Data
  5. Deliver Data