3. SAS and Hadoop Flashcards
What is a Cluster of Computers?
*A cluster of computers is a grouping of computers connected by a local area network.
*Each computer is referred to as a node in the cluster.
*The nodes work together as one system.
What is a node?
a computer in a cluster of computers. The nodes communicate with each other via the network and function as one unit.
What is Hadoop?
*An open-source software project supported by Apache
*A framework for distributed processing of large data sets
*Designed to run on computer clusters
*Made up of NameNodes and DataNodes
What are the 2 primary components of a Hadoop cluster?
A traditional Hadoop cluster consists of the NameNode, perhaps a backup NameNode, and many DataNodes.
What is meant by the Hadoop ecosystem and its 3 foundational components?
The ecosystem refers to the software components that make up the Hadoop framework. Each component has a unique function in the Hadoop ecosystem.
The 3 components or modules that serve as a foundation for Hadoop are HDFS, YARN, and MapReduce.
What are some key features of Hadoop?
- Open-source.
- A simple-to-use, distributed file storage system.
- Supports highly parallel processing, which makes it well suited for analysis of huge volumes of data.
- Scales out to handle massive amounts of data; it is easily extended by adding more storage nodes to the cluster.
- Designed to work on low-cost hardware, so the cost-entry point is fairly low.
- Data is replicated across multiple hosts/nodes to make it fault-tolerant.
- SAS has integration points that make using Hadoop familiar to existing SAS customers - procedures, LIBNAME statements, and Data Integration Studio transforms.
What are some commercial distributions of Hadoop?
Cloudera
IBM BigInsights
Hortonworks
AWS EMR (Elastic MapReduce)
MapR
Microsoft Azure HDInsight
What is the Hadoop User Experience (HUE)?
It is an open-source application for browsing, querying, and visualizing data in Hadoop. Its browser-based interface enables you to perform a variety of tasks in Hadoop.
What is HDFS?
One of the 3 core modules, the Hadoop Distributed File System is a virtual file system that distributes files across the Hadoop computer cluster.
What is YARN?
One of the 3 core modules, Yet Another Resource Negotiator is a framework for job scheduling and cluster resource management.
What is MapReduce?
One of the 3 core modules, it is a YARN-based system for automating parallel processing of distributed data.
What does a NameNode do?
The NameNode contains information about where the data is located on each DataNode. It does not hold the physical data.
What is a block in HDFS?
Blocks are the units into which HDFS splits a file. The blocks are distributed across the Hadoop DataNodes.
What does a DataNode do?
DataNodes are the components that contain blocks of data in HDFS. Data is replicated in HDFS in order to support fault tolerance. By default, each file block in HDFS is stored on three DataNodes (a replication factor of 3). If any DataNode goes down, the remaining copies are available for use.
What is the starting syntax for HDFS commands in Linux?
hdfs dfs followed by the command (e.g., hdfs dfs -ls)
What does HDFS DFS -ls do?
hdfs dfs -ls lists the contents of an HDFS directory. When you list the contents of a directory in HDFS, by default, the “home” directory of the current user is listed, such as /user/student.
What does hdfs dfs -mkdir do?
Creates a directory in HDFS. A relative path is created under the user's HDFS home directory.
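For example, hdfs dfs -mkdir sales_data (the directory name here is arbitrary) creates the directory sales_data under the current user's HDFS home directory, such as /user/student/sales_data.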
What does hdfs dfs -copyFromLocal "/data/cust.txt" "/user/std" do?
Copies local, non-distributed data into HDFS. In this example, the local file cust.txt is copied to the /user/std directory in HDFS, where it is stored as blocks on the DataNodes.
Describe the 3 steps of the MapReduce process
- Map - makes initial read of the blocks of data in HDFS and completes initial row-level operations including filtering rows or computing new columns within rows
- Shuffle and Sort - orders and groups necessary rows together
- Reduce - Performs final calculations, including calculating summary statistics within groups and writes the final results to files in HDFS
What is Pig?
Pig is a stepwise, procedural programming method and platform for analysis. Pig programs can be submitted to Hadoop, where they are converted to MapReduce programs so that processing of the data can still occur in parallel.
What is Hive?
Hive is a data warehouse framework for files stored in HDFS built to query and manage large data sets stored in Hadoop. An SQL-like language called HiveQL is used to query the data. Most HiveQL queries are compiled into MapReduce programs.
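As a rough sketch (the table and column names are hypothetical), an SQL-like query such as the following is compiled into MapReduce: the map step reads and filters the rows, the shuffle and sort step groups them by state, and the reduce step computes the totals.
SELECT state, SUM(sales) AS total_sales
FROM customers
WHERE order_year = 2018
GROUP BY state;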
What is the hadoop fs command?
The hadoop fs command can be used to interact with HDFS, as well as other file systems that Hadoop supports, such as the local file system, WebHDFS, and Amazon S3.
What does hdfs dfs -put do?
hdfs dfs -put copies a local file to an HDFS location.
What does hdfs dfs -get do?
hdfs dfs -get copies an HDFS file to a local location.
What does hdfs dfs -cat do?
hdfs dfs -cat displays an HDFS file.
What does hdfs dfs -rm do?
hdfs dfs -rm deletes an HDFS file.
What does hdfs dfs -rm -r do?
hdfs dfs -rm -r recursively deletes an HDFS directory, subdirectories, and files.
What does hdfs dfs -du -h do?
hdfs dfs -du -h displays a summary of HDFS directory and file sizes.
What can you do with HiveQL?
HiveQL enables you to query and manage HDFS files directly in their native storage format. You can impose structure or table schemas on a variety of HDFS formats using HiveQL table definitions
What are the components of the Hive architecture?
The Hive Client and the Hive Services, which interact with the Hadoop cluster.
What 3 application interfaces does the Hive Client support?
The Hive Client supports the Hive Thrift Client, the Hive JDBC Driver, and the Hive ODBC Driver, which submit SQL queries to the Hive Server.
What is the Hive Driver?
The Hive Driver is a component of the Hive Services that interacts with the Hive Server. It compiles and optimizes the Hive query, converting it into MapReduce jobs for final execution in the Hadoop cluster against the data.
What is the Hive Metastore?
Hive Metastore is a separate database, such as Apache Derby, outside Hadoop. It contains Hive table metadata definitions that point to a file in HDFS, such as field names, data types, and so on.
What is HiveQL Data Definition Language (DDL)?
HiveQL DDL enables you to define data structures, such as databases and tables. Using the HiveQL DDL, you can create databases and schemas.
What does the CREATE command in Hive DDL do?
Use this command to create a new database or table. The DATABASE and SCHEMA keywords can be used interchangeably in the CREATE statement. When you create a table, a Hive managed table is created by default.
CREATE DATABASE IF NOT EXISTS dihdm
COMMENT 'database used for this course'
WITH DBPROPERTIES ('creator'='student');
What does the ALTER command in Hive DDL do?
It alters a database's properties without dropping and re-creating the database.
ALTER DATABASE dihdm SET OWNER USER STUDENT;
ALTER DATABASE dihdm SET DBPROPERTIES
('Modified by'='student');
What does the USE command in Beeline do?
Working interactively in Beeline or any other command-line utility, you can submit the USE command to switch to a different database.
USE dihdm;
USE default;
What is Beeline?
Beeline is a command-line interface and JDBC client supported by Hive Server.
What is Beeswax?
Beeswax is a HiveQL editor that accesses the Hive Server through HUE.
What is a Hive Managed table?
A managed table means that the associated data is “managed” by Hive.
* Data is stored in the default Hive warehouse location in HDFS if no location is specified.
* The source file is moved (deleted from its original location) on load unless the LOCAL keyword is used.
* Dropping a table deletes the HDFS data.
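A minimal sketch of a managed table (the table, column, and file names are hypothetical). LOAD DATA INPATH moves the HDFS source file into the table's warehouse location; with the LOCAL keyword, a local file would be copied instead.
CREATE TABLE cust (
  id INT,
  name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/std/cust.txt' INTO TABLE cust;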
What is a Hive External table?
An external table is a table where you manage the storage location of the HDFS data.
* Use the LOCATION keyword to define the HDFS storage location.
* Dropping the table does not delete the HDFS data.
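A minimal sketch of an external table (the names and the HDFS path are hypothetical):
CREATE EXTERNAL TABLE cust_ext (
  id INT,
  name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/std/cust_data';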
What does the TRUNCATE command in Hive DDL do?
The TRUNCATE statement enables you to remove all the rows from a managed table. If you created an external table, you cannot remove all the rows because Hive does not manage the data. To remove the rows from an external table, you can change the table from external to managed, delete the rows using the TRUNCATE statement, and then reset the table back to external.
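One way to implement that workaround (the table name is hypothetical) is to toggle the EXTERNAL table property:
ALTER TABLE cust_ext SET TBLPROPERTIES ('EXTERNAL'='FALSE');
TRUNCATE TABLE cust_ext;
ALTER TABLE cust_ext SET TBLPROPERTIES ('EXTERNAL'='TRUE');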
What does the DROP command in Hive DDL do?
The DROP statement removes a Hive table from the Hive metastore. If the data is deleted, such as for a managed table, it is actually moved to the current user's .Trash directory in HDFS. It can be recovered from that location until the .Trash directory is emptied. To permanently delete the data, use the PURGE option in the DROP TABLE statement.
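For example (the table name is hypothetical):
DROP TABLE IF EXISTS dihdm.cust PURGE;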
What file formats are available in Hadoop?
- Text files (.csv, .tsv) - delimited files that use commas, tabs, and so on.
- JSON records - each row is a self-contained JSON record.
- Avro files - store the schema as metadata with the data, support block compression, and support schema evolution.
- Sequence files - store data in binary key-value pairs with block compression.
- Columnar files (e.g., RC files, ORC, Parquet) - store data in a columnar file format with significant compression.
What are SerDes?
SerDes (Serializers and Deserializers) tell Hive how to process a record when reading (deserializer) and writing (serializer) data.
When a record is read from HDFS via a deserializer, an InputFormat is used. When a record is written to HDFS via a serializer, an OutputFormat is used.
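For example (the table and column names are hypothetical), a SerDe can be chosen implicitly with STORED AS or named explicitly with ROW FORMAT SERDE:
CREATE TABLE cust_orc (id INT, name STRING)
STORED AS ORC;

CREATE TABLE cust_text (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS TEXTFILE;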
What SerDes are built into Hadoop?
- Textfile (aka LazySimpleSerDe)
- Avro
- RCFILE (Record Columnar)
- ORC (Optimized Row Columnar)
- PARQUET
- REGEX
- SEQUENCEFILE
- JSON
Describe the Textfile SerDes
- Textfile (aka LazySimpleSerDe) - the default Hive table SerDe, where data is read and written as plain text files
Describe the Avro SerDes.
- Avro - specifies a JSON schema file that defines the table columns and their data types for data stored in rows. Supports full schema evolution.
Describe the RCFILE SerDes.
- RCFILE - stores data in columnar, compressed storage units known as row groups. Each row group contains a sync marker, a metadata header, and row data.
Describe the ORC SerDes.
- ORC (Optimized Row Columnar) - similar to RCFiles. The columnar groups of data are called stripes. Each stripe contains index data, row data, and a stripe footer.
Describe the PARQUET SerDes.
- PARQUET - stores data in row groups. A row group can contain multiple columns. Each column in a row group has a page that contains column metadata.
Describe the REGEX SerDes.
- REGEX - when reading unstructured data, it might be necessary to use regular expression (or RegEx) code to load the data into columns.
Describe the SEQUENCEFILE SerDes
- SEQUENCEFILE - stores data as flat files consisting of binary key-value pairs. It is a basic Hadoop-native format. The data is stored in blocks with a record length (key length + value length), key length, key, and value. A sync marker denotes the end of the header and enables seeking to a random point in the file for efficient processing of large split files.
Describe the JSON SerDes.
The JsonSerDe enables you to read the JSON formatted file, census.json, stored as plain text in the census_json HDFS location.
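A sketch of such a table definition (the columns and HDFS path are hypothetical; the SerDe class shown is the Hive HCatalog JsonSerDe):
CREATE EXTERNAL TABLE census (
  name STRING,
  age INT,
  state STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/std/census_json';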
What are the 3 methods of storing data for better-performing Hive queries?
- Partitioning - separates data into manageable chunks based on the value of a column
- Clustering or bucketing - uses buckets and a hash algorithm to distribute and find data values across a Hadoop cluster (partitioning and bucketing can be combined, as in the sketch below)
- Indexing - no longer supported as of Hive 3.0
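A sketch of a table definition that combines partitioning and bucketing (all names and the bucket count are hypothetical):
CREATE TABLE orders (
  order_id INT,
  cust_id INT,
  amount DOUBLE)
PARTITIONED BY (order_year INT)
CLUSTERED BY (cust_id) INTO 16 BUCKETS
STORED AS ORC;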
What is Apache Pig?
Apache Pig is a platform or tool for analyzing large data sets in HDFS with a high-level SQL-like language called Pig Latin.
What is Pig Latin?
Pig Latin is a high-level, SQL-like language. It is a data flow language with a pipeline paradigm, meaning that the data cascades from one relational operation to the next in a Pig script. It can handle structured and unstructured data and is able to process big data.
What is Grunt?
Grunt is a command-line interface that enables you to submit Pig Scripts from a Client Node.
In the Pig Architecture, what does the Parser do?
The Parser performs several checks, such as syntax and type checking, on a submitted Pig script.
In the Pig Architecture, what does the Optimizer do?
Once the script has been parsed, the Optimizer performs logical optimizations of the Pig Latin statements and logical operators for a more efficient, logical plan.
In the Pig Architecture, what does the Compiler do?
The Compiler takes the optimized, logical plan and compiles it into a series of MapReduce jobs.
In the Pig Architecture, what does the Execution Engine do?
The Execution Engine submits the MapReduce jobs to Hadoop in sorted order for distributed, parallel execution against the HDFS data stored on the DataNodes.
Describe the general steps of a Pig Script.
- Load the data from HDFS to an Alias (e.g., C1)
- Filter the data and store results in an alias (e.g., C2)
- Select records in C2 for additional processing and store them in an alias (e.g., C3)
- Iterate, storing results in new aliases, until only the required data remains
- Write the final output back to HDFS
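A minimal Pig Latin sketch of those steps (the file, columns, and filter value are hypothetical):
c1 = LOAD '/user/std/cust.txt' USING PigStorage(',') AS (id:int, state:chararray, sales:double);
c2 = FILTER c1 BY state == 'NC';
c3 = FOREACH c2 GENERATE id, sales;
STORE c3 INTO '/user/std/cust_nc' USING PigStorage(',');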
What are the requirements for Pig identifiers?
Pig identifiers, such as aliases and column names, have to start with a letter and can be followed by any number of letters, digits, or underscores.
In Pig notation what are the ways to reference columns in a schema?
You can use the column name, positional notation, or a mixture of both in a Pig script. Positional notation begins with a dollar sign.
If data is loaded with no defined schema, positional notation using dollar zero ($0) is used to reference the data in a single field. This type of load might be required with certain unstructured data that has no defined delimiters, such as a Twitter feed or server log.
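For example (a hypothetical alias and columns), if c1 is loaded with the schema (id, state, sales), these two statements are equivalent because positional notation starts at $0:
c2 = FILTER c1 BY sales > 100.0;
c2 = FILTER c1 BY $2 > 100.0;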