SAS Data Curation Practice Exam Flashcards

1
Q

Which statement best describes the difference between the Inventory tab and the Folder tab in the SAS Data Integration Studio interface?

Select one:
a. The Folder tab contains metadata objects in a customizable folder structure. The Inventory tab contains objects in pre-determined folders based on the object type.
b. The Folder tab contains objects created by the current user, while the Inventory tab contains objects created by all users.
c. The Folder tab contains jobs and stored process objects, while the Inventory tab contains source tables and target tables.
d. The Folder tab contains metadata items related to jobs that are open in Job Designer windows. Other objects are contained within the Inventory tab.

A

a. The Folder tab contains metadata objects in a customizable folder structure. The Inventory tab contains objects in pre-determined folders based on the object type.

The Folder tab displays the metadata folder hierarchy that is shared across the SAS platform; it can be customized to organize metadata in a structure that meets the user's needs. The Inventory tab displays metadata in predefined categories that organize the metadata by object type. The other options are not true statements about the Folder tab and the Inventory tab.

2
Q

Identify the format used to move metadata for SAS platform metadata objects within the same metadata repository OR between different SAS metadata repositories.

Select one:
a. Star Schema
b. SAS Package
c. Metadata Bridge
d. Common Warehouse Metamodel

A

b. SAS Package

The type of file created for moving metadata within a metadata repository, or from one metadata repository to another, is known as a “SAS Package.” Metadata Bridges are the SAS product that lets users move third-party metadata from a third-party database to SAS or from SAS to a third-party database. When importing third-party metadata, the format used is a version of the Common Warehouse Metamodel. Therefore, C and D describe technologies for moving third-party metadata. A star schema is a paradigm for organizing data in a data warehouse and is unrelated to importing metadata, so A is also false.

3
Q

What is defined by the Location field in the New Library Wizard?

Select one:
a. The server where the data is located.
b. The file path where the data is located.
c. The metadata folder where the library metadata is stored.
d. The connection profile where the library is stored.

A

c. The metadata folder where the library metadata is stored.

The Location field does not contain information about how to assign the library; rather, it specifies the metadata folder where the library definition should reside. The server and file path describe the actual library assignment. The connection profile tells SAS Management Console where to find metadata and how to access it, and it is already opened and accessed before the New Library Wizard is invoked.

4
Q

Assume that you have completed the Register Tables wizard in SAS Data Integration Studio. Which statement is true?

Select one:
a. The physical table(s) selected are copied to the application server specified in the library.
b. The physical table(s) selected are copied to the SAS Folders location specified in the wizard.
c. Metadata for the physical table(s) selected is stored on the application server specified in the library.
d. Metadata for the physical table(s) selected is stored in the SAS Folders location specified in the wizard.

A

d. Metadata for the physical table(s) selected is stored in the SAS Folders location specified in the wizard.

The Register Tables wizard prompts the user for a library and the table(s) to register, then gathers information about those tables and stores it in metadata at the specified location. Beyond gathering metadata about the tables, the wizard does not move them anywhere, so the “copied” choices are invalid. Metadata is stored on the metadata server, not the application server, so that choice is also invalid.

5
Q

How do you register column metadata from a comma-separated file where the top of the data file looks like this:

To: A.Person
Subject: your data
date: 01/01/1960

fld1, fld2, fld3, fld4
1,1,1,1
2,2,2,2
3,3,3,3

Select one:
a. Get the column names from column headings in this file.
b. Use PROC CIMPORT to read the column headers.
c. Get the column definitions from a COBOL format file.
d. It is not possible to register column metadata from this file.

A

a. Get the column names from column headings in this file.

Often customers need to read in “flat-files” with abnormal structures at the beginning of the file. The easiest way in DI Studio to get column header information mid-file is to use “Get column headings from file” in the “Import column definitions” dialog.

6
Q

Which definition cannot be imported using a SAS Metadata Bridge?

Select one:
a. Column definition
b. Index definition
c. Server definition
d. Table definition

A

c. Server definition

SAS Data Integration Studio can import column, table, and index definitions using a SAS metadata bridge. It cannot import server properties from a SAS metadata bridge.

7
Q

Which two statements are true regarding columns when using the SAS Data Integration Studio New Table wizard?

Select one or more:
a. You can access metadata for indexes from tables already registered in metadata.
b. You can access metadata for any column from tables already registered in metadata.
c. You can define new columns for the table.
d. You cannot override imported columns for the table.

A

b. You can access metadata for any column from tables already registered in metadata.
c. You can define new columns for the table.

The New Table wizard provides access to column metadata in already registered tables and it allows new columns to be defined. The New Table wizard does not provide access to index metadata in already registered tables and it does allow imported column metadata to be overridden.

8
Q

After opening a Join in Data Integration Studio, the following options are available in which parts of the Join’s designer window? (Choose two.)

Choose Columns…
CASE…
Subquery…
Advanced…

Select one or more:

a. the “Expression” window
b. a field’s “Expression” definition
c. a “Group By” clause definition
d. an “Operand” definition

A

b. a field’s “Expression” definition
d. an “Operand” definition

Users can specify these options in a field’s Expression definition and when setting an Operand definition for a clause within the Join transformation, such as Where or Having.

9
Q

Within SAS Data Integration Studio, which type of expression can you create on the Where tab of the Extract transformation?

  1. a SAS expression
  2. an SQL expression
  3. an XML expression
  4. a constant

Select one:
a. 1 and 2 only
b. 1 and 3 only
c. 1, 2 and 4 only
d. 1, 2, 3 and 4

A

c. 1, 2 and 4 only

A SAS expression, an SQL expression, and a constant are valid on the Where tab in the Extract transformation. An XML expression is invalid.

10
Q

Which statement is true regarding the Set Operators transformation?

Select one:
a. The Set Operators transformation performs set operations on queries of the source tables.
b. By default, the set operations keep non-common columns.
c. All of the set operations keep duplicate rows by default.
d. By default, the set operations match columns from different source tables by name.

A

a. The Set Operators transformation performs set operations on queries of the source tables.

In the Set Operators transformation, each source table has its own Select, Where, Having, Group By, and Order By tabs where users can configure a query for that table. By default, the set operations match columns from different source tables by position in the table, not by name. Only the Outer Union set operation keeps duplicate rows by default.

11
Q

How can you test the options of a new transformation in SAS Data Integration Studio?

Select one:
a. The transformation has to be used within a job to interactively test the options with the Test Prompt button in the Options window from the properties of the transformation within the job editor.
b. The options can be tested after finalizing and saving the transformation with the Test Prompt item on the Tools menu.
c. The options can be tested only after adding all options that are assigned to the source code with the Test Prompt button in the Options window of the New Transformation wizard.
d. The options can be tested at any time after adding them to the transformation with the Test Prompt button in the Options window of the New Transformation wizard.

A

d. The options can be tested at any time after adding them to the transformation with the Test Prompt button in the Options window of the New Transformation wizard.

When creating a custom transformation, a user can check how the options will appear in the final transformation at any time by pressing the Test Prompt button in the New Transformation wizard.

12
Q

The New Transformation wizard allows you to extend the capabilities of SAS Data Integration Studio and create new transformations. Which language does it support?

Select one:
a. SAS
b. Java
c. Python
d. C++

A

a. SAS

User-generated transformations require SAS code.

13
Q

Which statement is TRUE regarding Java Plug-in Transformation templates and New Transformation templates?

Select one:
a. Java Plug-in Transformation templates and New Transformation templates cannot be stored in the same Transformation Category on the Transformations tab.
b. Java Plug-in Transformation templates and New Transformation templates have different options in their right-click pop-up menus.
c. Java Plug-in Transformation templates and New Transformation templates all have the same icon.
d. Both Java Plug-in Transformation templates and New Transformation templates are created with a wizard in SAS Data Integration Studio.

A

b. Java Plug-in Transformation templates and New Transformation templates have different options in their right-click pop-up menus.

Java Plug-in Transformation templates and New Transformation templates can be stored in the same Transformation Category on the Transformations tab. An example is the Analysis Category, which comes with the Java Plug-in Transformation template “Forecasting” and the New Transformation template “One-Way Frequency.” They also have different icons. Java transformation icons vary, while all New Transformation templates share the same icon. Finally, only New Transformation templates can be created in SAS Data Integration Studio with a wizard.

14
Q

Refer to the screen capture below, which shows the contents of a Quality Knowledge Base:

What do the icons to the left of the objects represent?

Select one:
a. the type of object in the Quality Knowledge Base
b. the data type associated with the object
c. the locale level associated with the object
d. the type of definition in the Quality Knowledge Base

A

c. the locale level associated with the object

The icons to the left of each item in the list indicate the locale level associated with the object in the QKB.

15
Q

Which two types of items comprise the Quality Knowledge Base (QKB)?

Select one:
a. files and repositories
b. files and definitions
c. files and reference data sources
d. definitions and reference data sources

A

b. files and definitions

In the QKB (Quality Knowledge Base), various definitions are listed for each locale. A definition can be opened to surface its sequence of processing in a flow diagram. Many of the items in these flow diagrams point to the various files of the QKB (schemes, chop tables, phonetics libraries, regex libraries, vocabulary libraries, and grammars).

16
Q

You are working with the data below, which represents peoples’ last names (or surnames). You would like to apply the proper casing to the data using a Case Definition. Which two Quality Knowledge Base (QKB) file components can you use within the Case Definition to accomplish this task?

MacAlister
MacDonald
McCarthy
McDonald
McNeill

Select one or more:
a. Regular Expression library
b. Standardization Scheme
c. Vocabulary
d. Phonetics library

A

a. Regular Expression library
b. Standardization Scheme

In a Case Definition, the user can either use a Standardization Scheme or a Regular Expression library for ensuring the proper casing of words with “tricky” casing in the string.

17
Q

What is the first step for parsing a data string using a Parse definition?

Select one:
a. Chopping
b. Word Categorization
c. Tokenization
d. Preprocessing

A

d. Preprocessing

The template for Parse definitions begins every parse definition with an initial Preprocessing step (which, if specified, involves one or more regex libraries).

18
Q

How do you detach a data job tab in DataFlux Data Management Studio?

Select one:
a. On the tool bar, click the Detach tool.
b. Drag the tab with the mouse cursor from its docked position.
c. From the File menu, select Detach.
d. You cannot detach a primary tab in DataFlux Data Management Studio.

A

a. On the tool bar, click the Detach tool.

The detach tool is used to open a data job in a new window so that it can be compared side-by-side with another job.

19
Q

Consider the partial screen capture from a Profile Report below:

Why is the PRODUCT CODE column not a valid candidate for a primary key?

Select one:
a. There are not enough observations in the table to declare a primary key.
b. The pattern count metric is greater than 1.
c. The data type is “string,” and only numeric data can be considered for primary keys.
d. The data values are not 100% unique across all of the rows.

A

d. The data values are not 100% unique across all of the rows.

In order for a column to be a valid primary key candidate, its values must be 100% unique across the rows, with no null values. Although this column meets the no-null criterion, it is not 100% unique across rows, as the Uniqueness metric (96.55) shows.

20
Q

A Frequency Distribution node can assist in determining how well standardization rules and QKB customizations are addressing data quality issues. By selecting the frequency count of the standardization flag field, what do the Preview window results represent?

Select one:
a. 28.57 percent of the data were affected by the Standardization node.
b. 71.43 percent of the data still needs to be cleansed so that data quality is improved.
c. The Field Layout node will only process the 6 standardized rows.
d. Standardization rules were applied to 15 of the 21 rows.

A

a. 28.57 percent of the data were affected by the Standardization node.

A standardization flag of true represents those rows whose value has been altered from its original value by the standardization rules. A standardization flag of false does not indicate that the data value is bad. All rows are passed along through the data job flow.

21
Q

You are building a data job to prepare data for future analysis and reporting. A sample of the company name data is shown below. Which SAS-supplied Case Definition do you use to ensure the proper casing of your company names?

Kaiser-Permanente
McDonalds
The SAS Institute
O’Neil Deli and Pub
SAS

Select one:
a. Proper
b. Proper (Company)
c. Proper (Name)
d. Proper (Organization)

A

d. Proper (Organization)

Proper (Organization) is the valid name of the Case Definition used to case organization names. Proper and Proper (Company) are not valid definition names, and Proper (Name) is used to properly case the names of individuals.

22
Q

You are creating a data job to apply a data cleansing process to an input data field containing city, state and postal code data. You would like to create individual fields from the components of the data values, with the resulting data being written into individual fields for City, State/Province and Postal Code. Which node would you use to accomplish this result?

Select one:
a. Right Fielding node
b. Parsing node
c. Identification Analysis node
d. Standardization node

A

b. Parsing node

The Parsing node must be used to accomplish this task. Although the Identification Analysis and Right Fielding nodes can recognize the full city/state/postal code field, they cannot break the data into its component parts.

23
Q

A table has these fields:

CUSTOMER_ID
FIRST_NAME
LAST_NAME
COMPANY
ADDRESS
CITY
POSTAL_CODE
PHONE

The table has many duplicate records for customers. You will use a Clustering node to find the duplicates using the conditions shown below:

Note that the name information in the table is parsed into two fields; however, the match code for name information is contained in a single field.

After reading the table with a Data Source node, what is the minimum number of nodes (not counting the Clustering node) necessary to prepare the data for these clustering conditions?

Select one:
a. 1
b. 2
c. 3
d. 4

A

c. 3

For the conditions shown and the field names provided, you would need three nodes:
1. a Match Codes (Parsed) node to generate a match code string for the parsed name fields,
2. a Match Codes node to generate the remaining match code string fields, and
3. a Standardization node to generate the standardized value for the phone field.

24
Q

A sample of data has been clustered and found to contain many multi-row clusters. To construct a “best” record for each multi-row cluster, you need to select information from other records within a cluster.

Which type of rule allows you to perform this task?

Select one:
a. Clustering rule
b. Record rule
c. Business rule
d. Field rule

A

d. Field rule

Field rules are the only mechanism available for selecting information from other records in a cluster.

25
Q

In a data job that performs entity resolution for a single table, which node do you use to select a best record from a group of clustered records?

Select one:
a. Data Sorting node with NODUPKEY option
b. Match Report node with surviving option
c. Master Data Record node
d. Surviving Record Identification node

A

d. Surviving Record Identification node

The Surviving Record Identification node provides properties for selecting the best record from a group of clustered records. There is no NODUPKEY option on the Data Sorting node. There is no surviving option on the Match Report node. There is no Master Data Record node.

26
Q

Which match code fields were generated using the lowest sensitivity?

Select one:
a. “M@27B43743@Y87BB&YR8ß$$$$$$” and “B7W7P@3_~M4P4$$$$$$$$$$$$$$$$$$$$”

b. “43Y$$$$$$$$$$$$$$$$$$$$” and “BW$$$$$$$$$$$$$$$$$$$$$$”
c. “43&Y886437$$$$$B8íYR$$$$$$$$” and “B£W£P£3£~M4P4$$$$$$$$$$$$$$$$$$$$$$”
d. “43Y8B4$$$$$$$$B&Y$$$$$$$$$” and “#YVY@7$$$$$$$$$$$$$$$$$$$”

A

b. “43Y$$$$$$$$$$$$$$$$$$$$” and “BW$$$$$$$$$$$$$$$$$$$$$$”

The match codes in option b were generated using the lowest sensitivity: lower sensitivities produce shorter, less detailed match codes, and these are the shortest codes shown.

27
Q

What are two SAS system options used for interacting with the Quality Knowledge Base (QKB) in SAS code and applications?

Select one or more:
a. DQSETUPLOC
b. DQLOCALE
c. DQQKBLOC
d. DQQKBPATH

A

a. DQSETUPLOC
b. DQLOCALE

The two system options that can be set programmatically in SAS to access the QKB components are DQSETUPLOC and DQLOCALE.
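
For reference, here is a minimal SAS sketch of how those two options are typically set before calling a data quality function. The QKB path, input table, and column names are assumptions, not part of the exam item.

/* Point SAS at the QKB root and load the English (United States) locale. */
options dqsetuploc='C:\SAS\QKB\CI\31'      /* hypothetical QKB location */
        dqlocale=(ENUSA);

data work.cased;                           /* hypothetical input table and columns */
   set work.companies;
   company_std = dqCase(company_name, 'Proper (Organization)', 'ENUSA');
run;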

28
Q

What is the name of the language and geography component of the Quality Knowledge Base that is used to organize the available definitions?

Enter your answer in the space below. Case is ignored.

A

locale

A “locale” is the organizational component of the QKB that is used to group definitions by language and geography (for example, English (United States)).

29
Q

Which node in the Hadoop system holds the metadata for files distributed in HDFS?

Select one:
a. NameNode
b. DataNode
c. ControlNode
d. ClientNode

A

a. NameNode

The NameNode does not contain any physical data. It holds the data about the data, or metadata, for files distributed in HDFS. The DataNodes hold blocks of the actual files.

30
Q

All files in HDFS are stored in:

Select one:
a. serdes
b. maps
c. bags
d. blocks

A

d. blocks

All files in HDFS are stored in blocks. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks. All the blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. Rows of data may be split across different blocks.

31
Q

What is the command to list contents of a folder on HDFS?

Select one:

a. hdfs -ls /path/to/hdfs/folder
b. hdfs ls /path/to/hdfs/folder
c. hdfs dfs -ls /path/to/hdfs/folder
d. hdfs dfs ls /path/to/hdfs/folder

A

c. hdfs dfs -ls /path/to/hdfs/folder

When directing commands to HDFS, the command begins with hdfs dfs. Then this is followed by the traditional Linux command as an option, in this case, -ls. It is completed with the directory of interest.

32
Q

Which command will move all the files from a Linux sub-directory into an HDFS sub-directory?

Select one:
a. hdfs dfs -put /projectA/* /projects/projectA
b. hdfs dfs -put -r /projectA /projects
c. hdfs dfs -put /projects/ProjectA /projects/*
d. hdfs dfs -put /projects/ProjectA /projects -r

A

a. hdfs dfs -put /projectA/* /projects/projectA

Only A has the correct syntax. After -put, you name the file in Linux followed by the location in HDFS you wish to move the file to. To move all Linux files in a directory you can use * as a wildcard to indicate all files.

33
Q

Given the following HiveQL statements, what folder will be created in HDFS storing the data created in hr.salesstaff?

create external table hr.salesstaff
(empid int, jobtitle char(15), location char(16), hiredate date)
partitioned by (year int) location '/user/salesstaff';

load data inpath '/user/2014salesstaff'
into table hr.salesstaff partition (year=2014);

Select one:
a. /user/hr.salesstaff/year=2014
b. /user/salesstaff/year=2014
c. /user/2014salesstaff
d. /user/hr.salesstaff/2014

A

b. /user/salesstaff/year=2014

Each partition will be stored in a sub-folder within the location of the table (in this case, /user/salesstaff as specified on the location clause). The name of each partition sub-folder is in the form partitioncolumnname=value (in this case, year=2014). So the full path is /user/salesstaff/year=2014.

34
Q

Web server logs are written in an HDFS directory. The following lines indicate the format and an example of the comma-separated values for one line in the log file.

IP address, timestamp, request, status, size
192.168.12.41,24/Nov/2015:10:09:58 -0500, "GET /services/config.xml HTTP/1.1",200,816
Which CREATE TABLE statement enables a Hive query to access each of the fields?

Select one:
a. create external table weblogs
(ip string,
dt string,
req string,
status int,
sz string)
fields terminated by ','
location '/data/weblogs';

b. create external table weblogs
(ip string,
dt string,
req string,
status int,
sz string)
row format delimited
fields terminated by ','
location '/data/weblogs';

c. create external table weblogs
(ip string,
rest string)
row format delimited
fields terminated by ','
location '/data/weblogs';

d. create external table weblogs
(ip string,
dt string,
req string,
status int,
sz string)
fields delimited
fields by ','
location '/data/weblogs';

A

b. create external table weblogs
(ip string,
dt string,
req string,
status int,
sz string)
row format delimited
fields terminated by ','
location '/data/weblogs';

The “row format delimited fields terminated by ‘,’” is the required syntax.

35
Q

What type of table is preferred to ensure there is no data loss if a Hive table is dropped accidentally?

Select one:
a. Managed
b. External
c. Internal
d. Partitioned

A

b. External

External tables allow the table metadata to be deleted without affecting the original data.

36
Q

Consider the below two tables in Hive (partial data shown). How do you find out the top 10 customer names that have highest number of orders?

Select one:

a. select name, count(order_id) cnt
from customer c, order o
where c.customer_id = o.customer_id
group by c.customer_id
sort by cnt desc
limit 10

b. select name, max(order_id) cnt
from customer c, order o
where c.customer_id = o.customer_id
group by c.customer_id
sort by cnt desc
limit 10

c. select name, max(order_id) cnt
from customer c, order o
where c.customer_id = o.customer_id
group by c.customer_id
order by cnt desc
limit 10

d. select name, count(order_id) cnt
from customer c, order o
where c.customer_id = o.customer_id
group by c.customer_id
order by cnt desc
limit 10

A

d. select name, count(order_id) cnt
from customer c, order o
where c.customer_id = o.customer_id
group by c.customer_id
order by cnt desc
limit 10

In B and C, the MAX() function incorrectly returns the largest order_id rather than counting orders. In A and B, SORT BY sorts only within each reducer instead of producing a total, job-level ordering. D is correct.

37
Q

id, name, countrycode, district, population

This questions asks you to provide a segment of missing code.

An HDFS directory, /data/word, has population information in files. Here is a sample:

(728,Vereeniging,ZAF,Gauteng,328535)
(729,Wonderboom,ZAF,Gauteng,283289)
Select the code segment that completes the following Pig program to summarize the population by countrycode into a table alias.

t = load '/data/world/' using PigStorage(',') as (id:int,
name:chararray, countrycode:chararray, district:chararray,
population:int);
t1 = foreach t generate countrycode, population;

<insert>
Select one:

a. t2 = group t1 by (countrycode, SUM(t1.population));

b. t2 = group t1 by countrycode;
t3 = foreach t2 generate group, SUM(t1.population);

c. t2 = group t1 all;
t3 = foreach t2 generate SUM(t1.population);

d. t2 = group t1 by countrycode;
t3 = foreach t2 generate countrycode, SUM(t1.population);
</insert>

A

b. t2 = group t1 by countrycode;
t3 = foreach t2 generate group, SUM(t1.population);

B creates groups by countrycode in t2 and then generates each group and the summed population into table alias t3. A is invalid syntax. C is incorrect because the data are never grouped by the countrycode. D is incorrect because generate countrycode causes t3 to include the countrycode each time it appears in the data.

38
Q

Consider the following data and Pig program.

HDFS Data:
Quick brown fox jumps over the fence
Pig Program:

A = LOAD '/hdfs/data' Using PigStorage('\b');
B = FOREACH A GENERATE TOKENIZE($0);
STORE B INTO '/hdfs/output';

What is the output B?

Select one:
a. Program does not compile. Throws error on TOKENIZE function.
b. A single Bag containing fields
{Quick, brown, fox, jumps, over, the, fence}
c. A single Tuple containing fields
{Quick, brown, fox, jumps, over, the, fence}
d. A single Bag containing Tuples
{(Quick), (brown), (fox), (jumps), (over), (the), (fence)}

A

d. A single Bag containing Tuples
{(Quick), (brown), (fox), (jumps), (over), (the), (fence)}

The TOKENIZE keyword will generate a bag of tuples.

39
Q

Given the following Pig program, what is the name of the field that corresponds to $1?

T = LOAD 'data/census_2010.csv' Using PigStorage() AS(county:chararray, population:int);
T0 = FOREACH T GENERATE $1;
STORE T0 INTO 'output/load';
Select one:

a. county
b. unknown
c. population
d. chararray

A

c. population

The $ is a positional reference starting with 0 as the first position. Therefore $1 references the second variable, population.

40
Q

How can you specify that your SAS session connects to the Hadoop server using WebHDFS rather than using JAR files?

Select one:
a. options set=SAS_HADOOP_RESTFUL webhdfs;
b. options set=SAS_HADOOP_RESTFUL 0;
c. options set=SAS_HADOOP_JAR 0;
d. options set=SAS_HADOOP_RESTFUL 1;

A

d. options set=SAS_HADOOP_RESTFUL 1;

You set the SAS_HADOOP_RESTFUL to 1 use WebHDFS or HttpFS to connect to the Hive server (rather than using JAR files). WebHDFS and HttpFS are HTTP REST APIs.

41
Q

Refer to the following SAS program:

filename config 'C:\Users\myuser\hadoop\sample_config.xml';
filename code 'C:\Users\myuser\hadoop\sample_pig.txt';

proc hadoop cfg=config username='myuser' password='mypwd1';
pig code=code
registerjar='C:\Users\myuser\Hadoop\jars\myudf.jar';
run;
Which statement correctly describes the program?

Select one:
a. The first FILENAME statement specifies the file reference for the XML file that enables Pig code to execute on your system.
b. The first FILENAME statement specifies the location of the file that contains the Pig code that is run by PROC HADOOP.
c. The second FILENAME statement specifies the location of the file that contains the Pig code that is run by PROC HADOOP.
d. The REGISTERJAR= option appends the contents of the myudf.jar file to the end of the Pig code file before it is called by PROC HADOOP.

A

c. The second FILENAME statement specifies the location of the file that contains the Pig code that is run by PROC HADOOP.

This statement identifies the Pig code (sample_pig.txt) that is called in PROC HADOOP.

42
Q

Which statement is NOT valid in PROC HADOOP?

Select one:
a. load
b. pig
c. mapreduce
d. hdfs

A

a. load

The Hadoop procedure has statements to support submitting Pig and Mapreduce programs, as well as executing HDFS commands.

43
Q

Refer to the following program:

proc sql;
connect to hadoop (server=myhdp user=myuser);
execute (create external table cars (col1 varchar(20))
location '/user/myuser/cars') by hadoop;
execute (load data inpath '/user/myuser/data/cars_data.txt'
into table cars) by hadoop;
quit;
What is the state of the cars table and underlying HDFS file after the following program executes?

proc sql;
connect to hadoop (server=myhdp user=myuser);
execute (drop table cars) by hadoop;
quit;
Select one:

a. The cars table definition is dropped. The underlying HDFS file is not deleted.
b. The cars table definition is not dropped. The underlying HDFS file is deleted.
c. The cars table definition is dropped. The underlying HDFS file is deleted.
d. The DROP table statement fails because you cannot drop an external table that contains data.

A

a. The cars table definition is dropped. The underlying HDFS file is not deleted.

Defining a table as external decouples the table definition from the underlying data. Dropping the table does not delete the underlying HDFS file.

44
Q

A Hive table was created using the following CREATE TABLE statement:

create table products
(product_id int,
product_name string,
product_type varchar(15),
effective_date date)
row format delimited
fields terminated by '\001';
The following PROC SQL program is submitted to create a SAS data set in the work library. Assume the library HIVELIB connects to the schema in Hive where the products table is stored.

proc sql;
create table work.products
as select * from HIVELIB.products
where product_type = 'Home Furnishings';
What will the attributes be of the variables in the SAS data set?

Select one:
a.
product_id: numeric type, length 8
product_name: character type, length 8
product_type: character type, length 15
effective_date: numeric type, length 8, format date9.

b.
product_id: numeric type, length 8
product_name: character type, length 32767
product_type: character type, length 15
effective_date: numeric type, length 8, format date9.

c.
product_id: numeric type, length 8
product_name: character type, length 1024
product_type: character type, length 15
effective_date: numeric type, length 8, format date9.

d.
product_id: numeric type, length 8
product_name: character type, length 32767
product_type: character type, length 15
effective_date: character type, length 8

A

b.
product_id: numeric type, length 8
product_name: character type, length 32767
product_type: character type, length 15
effective_date: numeric type, length 8, format date9.

String data types in Hive are interpreted by SAS as character variables with a length of 32767. Date columns in Hive are interpreted by SAS as numeric SAS dates with a length of 8 bytes, and the date9. format is assigned by default. SAS uses the n value from a Hive varchar(n) data type to create a SAS character variable of length n. All numeric types in Hive are treated as SAS numerics with a length of 8.
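
A hedged sketch of one way to avoid the default 32767-byte length for Hive STRING columns: the DBMAX_TEXT= LIBNAME option caps the SAS length, assuming that option is available in your SAS/ACCESS Interface to Hadoop release. The connection values are placeholders.

/* Cap the SAS length of Hive STRING columns at 100 bytes for this libref. */
libname hivelib hadoop server='myhdp' user='myuser' password='mypwd'
        dbmax_text=100;

proc contents data=hivelib.products;   /* product_name would now show length 100 */
run;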

45
Q

For Hive tables and HDFS, which statement is true?

Select one:

a. All Hive table data are stored in HDFS.
b. Hive managed (internal) tables are stored in HDFS. Hive external tables are stored outside of HDFS.
c. Hive external tables are distributed in HDFS. Hive managed (internal) tables are localized to the Hive server.
d. All Hive data are stored in a Hive-specific metadata store in each node parallel to HDFS.

A

a. All Hive table data are stored in HDFS.

All Hive tables are stored in HDFS. None of the other descriptions are remotely true.

46
Q

Complete the following statement to delete the mytable table definition from Hive:

proc sql;
connect to hadoop (server=myserver port=10000 user=myuser);
[Answer] (drop table mytable) by hadoop;
disconnect from hadoop;
quit;

Enter your answer in the space above. Case is ignored and standard SAS syntax rules apply.

A

execute

The EXECUTE statement allows HiveQL Data Definition Language (DDL) code to be passed to Hive for execution by Hive. The DDL code is contained in the parentheses (). The EXECUTE statement is ended with ‘by hadoop’ to indicate that the code in the parentheses is being passed to Hadoop.

47
Q

What is a good reason to use a Hive external table rather than a managed (internal) table?

Select one:
a. They can be stored outside of the Hadoop cluster.
b. They allow for storage in additional file formats.
c. They are made available to users not defined in metadata.
d. They can be dropped from Hive while the data remains in HDFS.

A

d. They can be dropped from Hive while the data remains in HDFS.

With managed tables, Hive controls the lifecycle of the data. When the table is dropped, the associated data in HDFS is also deleted. For external tables, Hive does not control the data in HDFS, allowing the table to be dropped without deleting the data in HDFS.
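
A short explicit pass-through sketch of the difference, following the PROC SQL pattern used elsewhere in these cards; the table names, columns, and HDFS path are placeholders.

proc sql;
connect to hadoop (server=myhdp user=myuser);
/* Managed table: dropping it deletes both the Hive definition and its data. */
execute (create table sales_managed (id int, amt double)) by hadoop;
execute (drop table sales_managed) by hadoop;
/* External table: dropping it removes only the Hive definition;
   the files under /user/myuser/sales remain in HDFS. */
execute (create external table sales_ext (id int, amt double)
location '/user/myuser/sales') by hadoop;
execute (drop table sales_ext) by hadoop;
disconnect from hadoop;
quit;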

48
Q

When working with SAS/ACCESS to Hadoop, which statement is true about implicit pass-through?

Select one:
a. It sends user-written HiveQL code to the database.
b. It translates SAS language code into database-specific query code.
c. It generates and executes PROC SQL code in the database.
d. It starts a SAS process in Hadoop to execute SAS code.

A

b. It translates SAS language code into database-specific query code.

A benefit of the SAS/ACCESS Interface to Hadoop is that it lets a SAS programmer use traditional SAS programming methods, such as the DATA step and PROC SQL, and SAS translates that code into HiveQL that is passed to Hadoop for execution.
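
To see the HiveQL that implicit pass-through generates, you can turn on SQL tracing; a minimal sketch with placeholder connection, table, and column names.

options sastrace=',,,d' sastraceloc=saslog nostsuffix;  /* log the generated DBMS SQL */
libname myhdp hadoop server='myhdp' user='myuser' password='mypwd';

proc sql;
select count(*) as n
from myhdp.orders
where order_year = 2015;
quit;
/* The SAS log then shows the HiveQL (SELECT COUNT(*) ... WHERE ...) sent to Hive. */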

49
Q

Refer to the following SAS program:

libname myhdp hadoop
server='server1.demo.sas.com' user='me' password='mypwd';
libname mysas '~/sasdata';

proc copy in=mysas out=myhdp;
select staff;
run;

proc delete data=myhdp.staff;
run;
After running the SAS program, what is the state of the staff table in Hive and the underlying HDFS file?

Select one:
a. Both the staff table definition in Hive and the underlying HDFS file still exist because the PROC DELETE step failed to execute.
b. The staff table definition in Hive and the underlying HDFS file were never created because the PROC COPY step failed to execute.
c. The staff table definition in Hive has been dropped and the underlying HDFS file has been deleted.
d. Both the staff table definition in Hive and underlying HDFS file still exist. The table is no longer accessible in SAS using the myhdp libref.

A

c. The staff table definition in Hive has been dropped and the underlying HDFS file has been deleted.

The syntax of the program is valid, and all statements would be able to execute successfully. After the PROC COPY, a staff table will exist in Hive. It will be a managed table, and therefore, when it is deleted by PROC DELETE, both the table and the underlying data will be deleted.

50
Q

The following program joins a SAS table containing 3 rows to a large Hive table.

libname myhdp hadoop SERVER='myserv' user='me' pw='mypwd';

proc sql;
select o.*
from cust_list c
,myhdp.orders o
where o.ID=c.ID
order by o.ID, Date
;
quit;
What modification will reduce the amount of data retrieved from Hive when performing the join in SAS?

Select one:
a.
Modify the LIBNAME statement to read:
libname myhdp hadoop SERVER='myserv' user='me' pw='mypwd'
MULTI_DATASRC_OPT=IN_CLAUSE;

b.
Add the following statement at the top of the program:
OPTIONS MULTI_DATASRC_OPT=IN_CLAUSE;

c.
Add the following statement at the top of the program:
OPTIONS THREADS;

d.
Modify the LIBNAME statement to read:
libname myhdp hadoop SERVER='myserv' user='me' pw='mypwd'
BULKLOAD=YES;

A

a.
Modify the LIBNAME statement to read:
libname myhdp hadoop SERVER='myserv' user='me' pw='mypwd'
MULTI_DATASRC_OPT=IN_CLAUSE;

The MULTI_DATASRC_OPT=IN_CLAUSE option on the LIBNAME statement reduces the amount of data retrieved from Hadoop in this cross-library join by creating a list of the distinct key column values from the smaller SAS table and using those values to construct an IN clause for the SQL query, which retrieves rows from the larger Hive table in Hadoop. This ensures that only rows that will contribute to the join are selected for retrieval from the Hive table.

51
Q

What is an advantage of using a LIBNAME statement to interact with your Hadoop cluster?

Select one:
a. It enables you to submit user-written HiveQL code to Hive.
b. The GENERATE_PIG_CODE= option enables you to bypass Hive and generate Pig Latin code.
c. It enables some SAS procedures to push processing into Hive.
d. It ensures that Hive will handle all processing.

A

c. It enables some SAS procedures to push processing into Hive.

SAS/ACCESS to Hadoop enables select SAS procedures to push their processing into Hive.
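
For example, a summarization procedure run against a Hive libref can be converted to HiveQL so that only summary rows return to SAS. A sketch assuming in-database processing is enabled for the Hadoop engine in your release; the connection, table, and column names are placeholders.

libname myhdp hadoop server='myhdp' user='myuser' password='mypwd';

options sqlgeneration=dbms;     /* allow eligible procedures to run inside the DBMS */

proc freq data=myhdp.orders;
tables product_type / nocum;    /* the counting is pushed to Hive where possible */
run;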

52
Q

Refer to the log message shown below:

58 proc ds2;
59 data test;
60 dcl double date;
61 method run();
62 set work.one;
63 mm=2;
64 end;
65 enddata;
66 run;
ERROR: Compilation error.
ERROR: Parse encountered type when expecting identifier.
ERROR: Parse failed on line 60: dcl double » date « ;
NOTE: PROC DS2 has set option NOEXEC and will continue to prepare statements.
67 quit;

Which of the following changes will fix the errors shown in the log? Replace line 60 with

Select one:
a. dcl double 'date';
b. dcl double "date";
c. dcl string date;
d. dcl double 'date'n;

A

b. dcl double "date";

DATE is a DS2 keyword (a data type name), so to use it as a variable name it must be written as a delimited identifier enclosed in double quotation marks.

53
Q

Which segment of DS2 code will correctly assign values to an array with four elements?

Select one:
a.
dcl char(1) v(4);
v:=('a','b','c','d');

b.
dcl char(1) v(4);
v:=("a","b","c","d");

c.
dcl char(1) v[4];
v:=('a','b','c','d');

d.
dcl char(1) v[4];
v=('a','b','c','d');

A

c.
dcl char(1) v[4];
v:=('a','b','c','d');

The number of array elements is specified within brackets [ ], and the array assignment uses the := operator. Character literals in DS2 must be enclosed in single quotation marks; double quotation marks delimit identifiers, which is why option b is incorrect.

54
Q

Which DS2 data type is considered non-coercible?

Select one:
a. TIMESTAMP
b. DOUBLE
c. INTEGER
d. VARCHAR

A

a. TIMESTAMP

Non-coercible data types automatically convert only to character data types. Date, time, and binary data types are not coercible.

55
Q

Assume the following statements are part of a DS2 data program:

declare date ExistingDate NewDate;
ExistingDate = date '2015-01-01';
Which statement increments ExistingDate by 30 days and maintains the date variable type?

Select one:
a. NewDate = put(ExistingDate,yymmdd10.) + 30;
b. NewDate = intdt(ExistingDate,30);
c. NewDate = intnx('month',ExistingDate,1);
d. NewDate = put(to_double(ExistingDate) + 30,yymmdd10.);

A

b. NewDate = intdt(ExistingDate,30);

INTDT will increment a date value and maintain the date data type. Using PUT would change the data type.

56
Q

Which transformation determines the type of data in a specified column using definitions in the SAS Quality Knowledge Base?

Select one:
a. Identification Analysis
b. Summarize Rows
c. Parse Data
d. Field Extraction

A

a. Identification Analysis

The Identification Analysis definition determines the type of data in a single column, whereas the field extraction transformation will separate the data into specific tokens in new table columns.

57
Q

Which directive allows you to create fuzzy representations of data values that can be used to join similar values in two tables?

Select one:
a. Cleanse Data in Hadoop
b. Sort and De-duplicate Data in Hadoop
c. Transform Data in Hadoop
d. Query or Join Data in Hadoop

A

a. Cleanse Data in Hadoop

The Cleanse Data directive contains the Generate Match Codes transformation that enables you to create a fuzzy text string representation of a data value based on an algorithm and sensitivity setting. These match codes can be used consistently over time to perform a fuzzy join of data from two or more tables. The Generate Match Codes transformation provides the following capabilities:

generates a similar match code value when two strings are not exact character matches
uses a data type definition and match sensitivity setting to determine the degree of similarity between strings to generate a match code value

58
Q

Which two statements are true regarding LIBNAME access to DBMS tables?

Select one or more:
a. You must be able to write native DBMS SQL code.
b. You can use any SAS programming method, naming DBMS tables as input.
c. All of the code you write will execute in the DBMS instead of SAS.
d. Automatically generated DBMS SQL might inadvertently cause all data to be passed to SAS for processing.

A

b. You can use any SAS programming method, naming DBMS tables as input.
d. Automatically generated DBMS SQL might inadvertently cause all data to be passed to SAS for processing.

LIBNAME access allows you to use DBMS tables as input in any SAS program. The LIBNAME engine generates native DBMS SQL, attempting to execute as much of the processing as possible in the DBMS. However, if the LIBNAME engine cannot convert the SAS process to native DBMS SQL, the entire table may be passed back to SAS for processing.
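
A sketch of that trade-off with a placeholder Hadoop connection (any SAS/ACCESS engine behaves similarly); the table and column names are assumptions.

libname mydb hadoop server='myhdp' user='myuser' password='mypwd';

/* This WHERE clause translates to DBMS SQL, so the filtering runs in the database. */
data work.shipped;
set mydb.orders;
where status = 'SHIPPED';
run;

/* A SAS-only expression (PUT with a SAS format) may not translate, so every row
   could be fetched into SAS and filtered there instead. */
data work.recent;
set mydb.orders;
where put(order_date, year4.) = '2015';
run;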

59
Q

What type of explicit SQL pass-through statement do you use to remove a table from the database?

Select one:
a. SELECT
b. EXECUTE
c. UPDATE
d. DELETE

A

b. EXECUTE

EXECUTE statements are used to submit non-query Data Definition Language (DDL) statements to the database.
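
The same pattern works with any SAS/ACCESS engine, not only Hadoop. A sketch assuming an Oracle connection; the credentials and table name are placeholders.

proc sql;
connect to oracle (user=myuser password=mypwd path=mydb);
execute (drop table old_orders) by oracle;   /* the DDL runs in the database */
disconnect from oracle;
quit;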

60
Q

Which SAS Federation Server view is an actual copy of the data?

Select one:
a. Cached view
b. MDS view
c. FedSQL view
d. DS2 view

A

a. Cached view

When a FedSQL view is cached, a copy of the data is stored in the cache location.

61
Q

When maintaining security definitions on a table, which two rules apply?

Select one or more:
a. Use the DELETE command to remove all rows from a table, but keep the explicit security definitions.
b. Explicit security definitions must be redefined when a table is removed and added back.
c. Use the DROP command to remove all rows from a table, but keep the explicit security definitions.
d. Explicit security definitions are kept when a table is removed and added back.

A

a. Use the DELETE command to remove all rows from a table, but keep the explicit security definitions.
b. Explicit security definitions must be redefined when a table is removed and added back.

The FedSQL DELETE statement removes the rows of data from a table while leaving the table, and any security definitions explicitly set on it, in place. Security definitions that are explicitly set on a table are lost when the table is removed; when the table is added back, permissions are inherited from the upper-level object in the hierarchy.

62
Q

Which SAS LIBNAME statement is correct to connect to a SAS Federation Server Oracle data connection in SAS code?

Select one:
a.
libname mydata fedsrv server="myserver.sas.com" port=24141
user="student" password=Metadata0 dsn=STUDENT

b.
libname mydata fedsrv server="myserver.sas.com" port=24141
user="student" password=Metadata0 schema=STUDENT

c.
libname mydata fedsvr server="myserver.sas.com" port=24141
user="student" password=Metadata0 dsn=STUDENT

d.
libname mydata fedsvr server="myserver.sas.com" port=24141
user="student" password=Metadata0 schema=STUDENT

A

c.
libname mydata fedsvr server="myserver.sas.com" port=24141
user="student" password=Metadata0 dsn=STUDENT

FEDSVR (and not FEDSRV) is the correct engine. Additionally, the SCHEMA option cannot be specified on a LIBNAME statement without the DSN option before it. That is, you must specify, in order, DSN= and then SCHEMA=. If a DSN option is specified without the SCHEMA option, then all schemas in the DBMS are available (based on permissions).

63
Q

SAS Event Stream Processing models can be written in which two languages?

Select one or more:
a. C++
b. R
c. SAS
d. XML

A

a. C++
d. XML

SAS Event Stream Processing models can be specified in C++ and XML. SAS Event Stream Processing does not support R or SAS as the ESP modeling language.

64
Q

Which statement is true regarding SAS Event Stream Processing?

Select one:
a. Sources publish into connectors, and targets subscribe to adapters.
b. Connectors run in process and adapters run independently.
c. Adapters are configured in the ESP model, while connectors are started from the command line.
d. Adapters can be orchestrated; connectors cannot be orchestrated.

A

b. Connectors run in process and adapters run independently.

Connectors are configured in the ESP model, run in the model process space and are subject to connector orchestration.

Adapters are started from the command line, run independently of the ESP model, and are not directly subject to connector orchestration.

Sources publish into source windows, targets subscribe to any window in a model.

65
Q

How do event streams enter a SAS Event Stream Processing project?

Select one:
a. through the Input window only
b. through any ESP window via a publisher connector
c. through the Source window only
d. through any ESP window via a subscriber connector

A

c. through the Source window only

Event streams can only enter a SAS Event Stream Processing project through a source window, either via a publisher connector or via a publisher adapter.

66
Q

SAS Event Stream Processing is multi-threaded. The THREADS= option is available to control resource usage. It is a property of which ESP element?

Select one:
a. Engine
b. Project
c. Continuous Query
d. Window

A

b. Project

The THREADS= option is a Project property.

67
Q

Which statement about SAS Event Stream Processing Studio is correct?

Select one:
a. It is browser based.
b. It is a Java application.
c. It runs on Windows only.
d. It runs on Linux only.

A

a. It is browser based.

SAS Event Stream Processing Studio is a browser application. It runs on any platform that supports browser applications.

68
Q

How are tags used in SAS Business Data Network?

Select one:
a. to flag terms for deletion from the business glossary
b. to group similar terms together in the terms list
c. to flag terms for review by another user
d. to identify terms that are not yet published

A

b. to group similar terms together in the terms list

Tags in SAS Business Data Network are used to group similar terms together in the terms list.

69
Q

In SAS Business Data Network, where can you add a tag to a term?

Select one:
a. Identification tab
b. Associated Items tab
c. Attributes tab
d. Hierarchy tab

A

a. Identification tab

The Identification tab contains the tags pane where a user can associate tags with a term.

70
Q

What are two ways tags are used in SAS Business Data Network?

Select one or more:
a. to subset the list of terms in the term list
b. to explore lineage for the selected tag
c. to identify terms that are flagged for deletion
d. to indicate terms that are not yet published

A

a. to subset the list of terms in the term list
b. to explore lineage for the selected tag

Tags in SAS Business Data Network are used to subset the list of terms displayed in the terms list, as well as explore lineage for the selected tag.

71
Q

What are the four components of the data governance methodology?

Select one:
a. Plan, Design, Collect, Integrate
b. Plan, Define, Collect, Integrate
c. Plan, Collect, Integrate, Analyze
d. Plan, Collect, Design, Analyze

A

c. Plan, Collect, Integrate, Analyze

The four components of the data governance methodology are Plan, Collect, Integrate, and Analyze (PCIA).