2. SAS Data Management Tools and Applications Flashcards

1
Q

What is a SAS/ACCESS LIBNAME?

A

A SAS programmer uses a LIBNAME statement to connect to the DBMS. DBMS tables can be named in a SAS program wherever SAS data sets can be named. SAS implicitly converts the SAS code into native database SQL statements.
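
A minimal sketch, assuming SAS/ACCESS to Oracle; the credentials, path, and table names are hypothetical:

/* Connect to the DBMS with a LIBNAME statement */
libname mydblib oracle user=scott password=tiger path=orapath schema=hr;

/* Reference the DBMS table as if it were a SAS data set */
proc print data=mydblib.employees;
run;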

2
Q

What is SQL pass-through?

A

A SAS programmer uses the SQL procedure in a SAS session to write and submit native database SQL to the database.
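
A minimal pass-through sketch, again assuming SAS/ACCESS to Oracle with hypothetical connection values and table:

proc sql;
   connect to oracle (user=scott password=tiger path=orapath);
   /* The inner query is native Oracle SQL, passed through unchanged */
   select * from connection to oracle
      (select empno, ename from employees where deptno = 10);
   disconnect from oracle;
quit;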

3
Q

What are the 2 types of SQL pass-through components that you can submit to the DBMS?

A

SELECT - produces output to SAS software (SQL procedure pass-through queries)
EXECUTE - performs non-query SQL statements that do not produce output (for example, GRANT, CREATE, or DROP), that is, DDL
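
As a sketch, the EXECUTE component might be used as follows (hypothetical connection values and table):

proc sql;
   connect to oracle (user=scott password=tiger path=orapath);
   /* Non-query DDL runs in the DBMS and returns no output */
   execute (drop table temp_employees) by oracle;
   disconnect from oracle;
quit;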

4
Q

What is the SASTRACE option?

A

SASTRACE is a system option, set with an OPTIONS statement, that causes the SAS log to display details about the native database SQL generated by the SAS/ACCESS engines. Developers can use SASTRACE to evaluate queries to ensure that they produce efficient code and do not needlessly return large amounts of data from the RDBMS.
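
For example, a common setting that writes the generated DBMS SQL to the SAS log:

options sastrace=',,,d' sastraceloc=saslog nostsuffix;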

5
Q

How are the SAS Platform Applications organized?

A

Analytics applications
Reporting applications
Programming applications
Data Management applications

6
Q

What is DataFlux Data Management Studio?

A

Combines data profiling, data cleansing, entity resolution, and monitoring tools for incorporating data quality into an information management process.

7
Q

SAS Data Integration Studio provides role-based security. True or False?

A

False

8
Q

What is a SAS Package?

A

A SAS internal format that can be used to move metadata between SAS metadata repositories and maintain metadata backups. It supports most SAS platform metadata objects including objects relevant to SAS Data Integration Studio, such as jobs, libraries, tables, and external files.

9
Q

How do you import a package in SAS Data Integration Studio?

A

Right-click a folder from the Folders tab
Select Import
Navigate to the SAS Package
Verify the objects to Import
Import

10
Q

What does the status bar in SAS Data Integration Studio display?

A

The status bar displays the name of the currently selected object, the name of the default SAS Application Server if one is selected, the login ID and metadata identity of the current user, and the name of the current SAS Metadata Server.

11
Q

What components are accessed through the Tree View in SAS Data Integration Studio?

A

Folders Tree - displays the metadata folder hierarchy that is shared across the SAS platform and can be customized
Inventory Tree - displays metadata in predefined categories and is not customizable
Transformations Tree - provides access to transformations that can be added to SAS Data Integration Studio jobs
Checkouts Tree - available to users working under change management
Basic Properties pane - displays the basic properties of an object selected in a tree view

12
Q

What is the Job Editor Window in SAS Data Integration Studio?

A

The Job Editor window enables you to create, run, and troubleshoot SAS Data Integration Studio jobs

13
Q

What are the tabs in the Job Editor Window in SAS Data Integration Studio?

A
  • The Diagram tab is used to build and update the process flow for a job.
  • The Code tab is used to review or update code for a job.
  • The Log tab is used to review the log for a submitted job.
  • The Output tab is used to review the output of a submitted job.
  • The Details pane is used to monitor and debug a job in the Job Editor.
14
Q

What is the Options Window in SAS Data Integration Studio?

A

The Options window is used to specify global settings for SAS Data Integration Studio options.

15
Q

What is a library?

A

A library is a collection of one or more files that are referenced as a unit. Libraries:

  • provide access to source tables and target tables
  • can be created in SAS Management Console by an administrator
  • can be created in SAS Data Integration Studio by a Data Integration developer
  • are created with the New Library Wizard.
16
Q

What are 2 ways of defining SAS Metadata Objects for Source Data in SAS Data Integration Studio?

A
  • The Register Tables Wizard can be used to register metadata for existing tables.
  • The External File wizards can be used to register metadata for an external file.
17
Q

What are 3 items that the Register Tables Wizard in SAS Data Integration Studio defines?

A
  • type of table (DBMS type)
  • library
  • metadata location
18
Q

What are 3 items that you define with the External File Wizard in SAS Data Integration Studio?

A
  • file to read
  • parameters
  • column definition
19
Q

What are the dependencies for SAS Table Metadata in SAS Data Integration Studio?

A

The library object is used by the table object; in other words, the table object is dependent on the library object.

20
Q

What are the dependencies for DBMS Table Metadata in SAS Data Integration Studio?

A

The server object is used by the library object, and the library object is used by the table object. Conversely, the table object is dependent on the library object, and the library object is dependent on the server object.

21
Q

What are the dependencies for ODBC Table Metadata in SAS Data Integration Studio?

A

The ODBC data source is used by the server object, the server object is used by the library object, and the library object is used by the table object. Conversely, the table object is dependent on the library object, the library object is dependent on the server object, and the server object is dependent on the ODBC data source.

22
Q

What tasks can be performed to define metadata for a new table using the New Table Wizard in SAS Data Integration Studio?

A
  • import metadata from tables/columns that are already registered in the metadata repository
  • override metadata that was imported (for example, change a column name)
  • define new attributes for the table that is defined (for example, indexes)
23
Q

What is the Common Warehouse Metamodel (CWM)?

A

an industry-standard format that is supported by many software vendors. The CWM format supports relational metadata such as tables, columns, indexes, and keys.

The CWM format can be used to exchange relational metadata between software applications and to import models from third-party data modeling tools into SAS Data Integration Studio.

24
Q

What are SAS Metadata Bridges?

A

Tools for reading each supported vendor-specific CWM format into SAS.

25
Q

What are jobs (processes) in SAS Data Integration Studio?

A

Jobs read from sources, perform necessary data transformations, and load targets. In other words, a job is a metadata object that organizes sources, targets, and transformations into processes that create output

26
Q

What are tan-colored objects in a SAS Data Integration Studio process flow?

A

table and external file objects

27
Q

What are blue-colored objects in a SAS Data Integration Studio process flow?

A

transformations

28
Q

What is a transformation in SAS Data Integration Studio?

A

A transformation allows you to specify how to extract data, transform data, or load data into data stores. Each transformation that you specify in a process flow diagram generates or retrieves SAS code which can be customized.

29
Q

What is the Splitter transformation in SAS Data Integration Studio?

A

The Splitter transformation is found in the Data category of transformations and can be used to create one or more subsets of a source.

30
Q

What is the Join Transformation in SAS Data Integration Studio?

A

The Join transformation is found in the SQL category of transformations and generates PROC SQL code.

It is used to create an SQL query that runs in the context of a SAS Data Integration Studio job and supports all the join types, such as inner join, left join, right join, and full join.

It supports subqueries and pass-through SQL and features a graphical interface for building and configuring the components of the SQL query.
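
A sketch of the kind of PROC SQL code the Join transformation generates; the table and column names are hypothetical:

proc sql;
   create table work.orders_joined as
   select o.order_id, o.order_date, c.customer_name
      from work.orders o
           left join work.customers c
           on o.customer_id = c.customer_id;
quit;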

31
Q

What are the main views in the Join Designer Window of SAS Data Integration Studio?

A
  • Main display area with Diagram tab, Code tab, and Log tab
  • Navigate pane - navigate the current query
  • SQL Clauses pane - add SQL clauses or edit the join type
  • Properties pane - view and edit properties of a selected item
32
Q

What is the File Reader transformation in SAS Data Integration Studio?

A

A transformation used to read external files. This is not needed for SAS files.

33
Q

What is the Extract transformation in SAS Data Integration Studio?

A

Generates PROC SQL code and supports the following items:
* one source table
* SELECT clause
* derived columns
* WHERE clause
* GROUP BY clause
* ORDER BY clause
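
A sketch of the kind of PROC SQL code the Extract transformation generates, exercising each supported item (hypothetical table and columns):

proc sql;
   create table work.shipped_totals as
   select customer_id,
          sum(amount) as total_amount   /* derived column */
      from work.orders                  /* one source table */
      where status = 'SHIPPED'          /* WHERE clause */
      group by customer_id              /* GROUP BY clause */
      order by total_amount desc;       /* ORDER BY clause */
quit;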

34
Q

What is the Summary Statistics transformation in SAS Data Integration Studio?

A

The Summary Statistics transformation generates PROC MEANS code and provides the following items:
* descriptive statistics like SUM, MEAN, and STD
* multiple analysis variables (for example, revenue and profit)
* multiple classification (group by) variables (for example, gender and age group)
* output to a report by default
* output to a table
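
A sketch of the kind of PROC MEANS code the transformation generates (hypothetical table and variables):

proc means data=work.orders sum mean std;
   class gender age_group;   /* classification (group by) variables */
   var revenue profit;       /* analysis variables */
   /* direct the statistics to a table in addition to the default report */
   output out=work.order_stats sum= mean= std= / autoname;
run;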

35
Q

What are the SQL transformations in SAS Data Integration Studio?

A

The transformations in the SQL group generate SQL code and enable you to
* create tables (SAS and DBMS)
* delete rows from a table
* execute SQL statements in a DBMS
* extract rows from a source table
* insert rows into a target table
* join tables
* SQL merge (update or insert)
* perform set operations
* update rows in a target table.

36
Q

What is the Set Operators Transformation in SAS Data Integration Studio?

A

Set operations allow multiple query result sets to be combined into a single result set. The sets are combined vertically, with operations combining rows from the source sets in different ways depending on the operation.

UNION, EXCEPT, INTERSECT
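
A minimal PROC SQL sketch of a set operation (hypothetical tables); EXCEPT or INTERSECT could replace UNION:

proc sql;
   create table work.all_customers as
   select customer_id from work.customers_east
   union                     /* combines the result sets vertically */
   select customer_id from work.customers_west;
quit;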

37
Q

What are Custom Transformations in SAS Data Integration Studio?

A

Custom transformations give data integration developers the flexibility to extend the functional capabilities of SAS Data Integration Studio.

All transformations, including custom ones, generate SAS code to accomplish the following:
* extract data
* transform data
* load data into data stores
* create reports

38
Q

What are the two transformation templates available in SAS Data Integration Studio?

A

Java Plug-in Transformation Templates - Created with the Java programming language

Generated Transformation Templates - Created with the New Transformation Wizard and SAS code

39
Q

What are options and option groups in custom transformations in SAS Data Integration Studio?

A

Options and option groups are tied to macro variables. Users select values on the Options tab of the transformation properties to configure the transformation, and those values are assigned to the corresponding macro variables in the code.

Options are surfaced on the Options tab of the transformation when it is used in a job. Groups become the organizational categories in the left-hand column of the Options tab, and prompts become options.

In essence, they are parameters for custom transformations.

40
Q

What is the SAS Quality Knowledge Base (QKB)?

A

a collection of files and algorithms that store data and logic for defining data management operations such as data cleansing and standardization. The definitions in the QKB that perform the cleansing tasks are geography and language-specific.

41
Q

What is a locale?

A

a combination of geography and language used in the QKB

42
Q

Where are QKB components modified?

A

In SAS Data Management Studio (aka DataFlux)

43
Q

What is a QKB definition?

A

A definition is a set of steps for processing data values. Definitions are organized by type.

44
Q

What are the types of QKB definitions?

A
  • Case - Transforms a text string by changing the case of its characters to uppercase, lowercase, or proper case.
  • Extraction - Extracts parts of the text string and assigns them to corresponding tokens for the specified data type.
  • Gender Analysis - Guesses the gender of the individual in the text string.
  • Identification Analysis - Identifies the text string as referring to a particular predefined category.
  • Language Guess - Guesses the language of a text string.
  • Locale Guess - Guesses the locale of a text string.
  • Match - Generates match codes for text strings, where the match codes denote a fuzzy representation of the character content of the tokens in the text string.
  • Parse - Parses a text string into meaningful tokens.
  • Pattern Analysis - Transforms a text string into a particular pattern.
  • Standardization - Transforms a text string into a standard format.
45
Q

What is DataFlux Data Management Studio?

A

a Windows-based desktop client application that enables you to create and manage processes for ensuring the accuracy and consistency of data during the data management life cycle. Typically, Data Management Studio is thought of as the development environment.

46
Q

What are the components of the SAS data quality offering?

A

DataFlux Data Management Studio
DataFlux Data Management Server
SAS Quality Knowledge Base (QKB)
Reference data source(s)
DataFlux repository

47
Q

What is DataFlux Data Management Server?

A

provides a scalable server environment for executing the processes created in Data Management Studio. Typically, Data Management Server is thought of as the production environment.

48
Q

What are Reference data source(s)?

A

third-party address verification, enrichment, and geo-coding databases available to validate and enhance your data.

aka “data packs”

49
Q

What is the DataFlux repository?

A

storage location for objects created and executed in Data Management Studio. Data Management Server also needs a repository of objects that will be available (and run) on the server.

50
Q

What is Data Management Studio used for?

A

Data Management Studio is used to
* define data connections
* establish links to the QKB and address verification data packs
* create repositories
* create data explorations, profiles, data jobs, and process jobs

51
Q

What are the 3 phases for performing data management tasks
such as data quality, data integration, data migrations, and master data management (MDM).

A

Plan - Define and Discover
Act - Design and Execute
Monitor - Evaluate and Control

52
Q

What Is a DataFlux Repository?

A

A DataFlux repository is the storage location for objects that the user creates in Data Management Studio.

The repository is used to organize work and can surface lineage relationships among data sources and the objects and files that are created.

53
Q

What are the 2 components of a DataFlux Repository?

A

Data storage part of the repository, stored in a database. It can contain
* explorations and reports
* profiles and reports
* business rules
* monitoring results
* custom metrics
* business data information
* master data information

File storage part of the repository, stored as operating system files. It can contain
* data jobs
* process jobs
* match reports
* entity resolution files
* queries
* entity definitions
* other files

54
Q

Where are QKB files registered?

A

On the Administration riser bar in Data Management Studio

55
Q

How are locales organized in Data Management Studio?

A

Locales are organized in a hierarchy according to their language and country. You can expand the QKB to display the available languages, and then expand one of the languages to view the locales that are associated with that language.

56
Q

What Is a Reference Source?

A

Reference data sources are used to verify and enrich data. The reference sources (also known as data packs) are typically databases that Data Management Studio compares user data against. Given enough information to match an address, location, or phone number, the reference data source can add a variety of additional fields to further clarify and enrich your data.

Data Management Studio allows direct use of data packs provided by the United States Postal Service, Canada Post, and Geo+Phone data.

57
Q

How are Reference Sources organized in Data Management Studio?

A

On the Administration riser bar in Data Management Studio

58
Q

What Is a Data Connection?

A

Data connections enable you to access your data in Data Management Studio from many types of data sources.

59
Q

What are the types of Data Connections ?

A
  • ODBC Connection - Displays the Microsoft Windows ODBC Data Source Administrator dialog box.
  • Domain Enabled ODBC Connection - Enables you to link an ODBC connection to an authentication server domain so that credentials for each user are automatically applied when you access the domain.
  • Custom Connection - Enables you to access data sources that are not otherwise supported in the Data Management Studio interface.
  • SAS Data Set Connection - Enables you to connect to a folder that contains one or more SAS data sets.
60
Q

Where are data connections created?

A

Data connections are created from the Data riser bar in Data Management Studio.

61
Q

Describe the two primary areas of the Plan phase.

A

Define - First step of the planning phase where the people, processes, technologies, and data sources are defined. Roadmaps that include articulating acceptable outcomes are built. Finally, cross-functional teams across business units and between business and IT communities are created to define the data management business rules.

Discover - This step of the methodology lets you explore metadata to verify that the right data sources are included in the data management program and create detailed data profiles of identified data sources to understand their strengths and weaknesses.

62
Q

What is a data exploration in Data Management Studio?

A

A data exploration will help you understand the structure of your data sources. Data explorations provide comparisons of structural information across data sources, showing, for example, variations in the spelling of field names.

Data explorations can be used for the following purposes:
* to identify data redundancies
* to extract and organize metadata from multiple sources
* to identify relationships between metadata
* to identify specified business data types (city, address, phone, and so on)

63
Q

What is a data profile in Data Management Studio?

A

A data profile helps in understanding the data values in the various data sources. Data profiles show distributions of field values and include useful metrics such as mean and standard deviation for numeric fields and pattern counts for text fields. Data profiling will also surface occurrences of null values, outliers, and other anomalies.

Data profiles provide the following benefits:
* improve understanding of existing data
* aid in identifying issues early in the data management process, when they are easier and less expensive to manage
* help determine the steps necessary to address issues that were identified
* enable you to assess the quality of your data across time.

64
Q

What is a data collection in Data Management Studio?

A

From reports generated for a data exploration, a data collection can be created as a simple list of fields, potentially drawn from multiple tables across multiple data connections. A data collection:

  • provides a convenient way to group related data fields from across data sources
  • proves to be beneficial in data governance efforts (for example, as a way to record all the fields that contain phone numbers)
  • can be used as an input source for profiles.
65
Q

What is a standardization scheme in Data Management Studio?

A

A standardization scheme takes various spellings or representations of a data value and defines a standard way to consistently write these values.

66
Q

What are the three types of analyses available for exploring data

A

field name matching
field name analysis
sample data analysis

67
Q

What is field name matching?

A

A type of analysis that uses a definition known as a match definition. This algorithm applies fuzzy logic to determine fields that might represent the same type of data, based on the field name. For example, if you had a field named Phone_Number in one table and a field named Phone_No in another table, the definition would identify these fields as potentially being the same, based on a match code generated from the field name

68
Q

What is field name analysis?

A

A type of analysis that uses a definition known as an identification analysis definition. This algorithm uses the actual words contained in the field name, and looks up the words in a vocabulary of words to categorize the words and identify their potential meanings. For example, a field named Fax_Number might be categorized as a PHONE type field.

69
Q

What is sample data analysis?

A

A type of analysis that uses an identification analysis definition but provides the ability to sample the data in the table to determine whether the data is of a specific type. For example, a sampling of 50 data records could reveal that the data in a particular field has 10 digits for each value. The identification analysis definition might categorize the field (based on its data contents) as a PHONE type field.

70
Q

What is a Field Relationship Map?

A

The Field Relationship map provides a visual presentation of the field relationships between all the databases, tables, and fields that are included in the data exploration.

71
Q

What is Field Name Matching?

A

Field name matching analyzes the names of each field from the selected data sources to determine which fields have either an identical name or names that are similar enough to be considered a match.

If Field name matching is selected, an appropriate match definition (from the specified locale) along with a preferred sensitivity must be specified.

72
Q

What is Field Name Analysis?

A

Field name analysis analyzes the names of each field from the selected data sources to determine which identity to assign to the field. An identity assigns a semantic meaning to the type of information that the field contains (for example, address, phone number, and so on).

73
Q

What are the required actions to perform a data profile?

A
  • select the data sources to be profiled
  • select the data tables to be profiled
  • select the columns to be profiled
  • specify the metrics to be calculated against the data
  • specify additional, more advanced options (Primary Key/Foreign Key Analysis, Redundant Data Analysis, and Alerts)
74
Q

What are the two main components of a data profile interface?

A

The Properties tab is used to identify the data sources, tables, and columns to be used in the profile. In addition, you can use the profile to control the default profile metrics that are calculated, set options for calculating the profile metrics, override the metrics for certain columns, and more. Also, you can specify custom metrics, assign business rules, add alerts, and create visualizations as part of the profile.

The Report tab is used to view the output from running the profile. Specifically, you can view the profile metrics for the tables included in the profile, as well as any columns that are included in the profile. In addition, you can view the results of any custom metrics specified, any business rules that were triggered, and any alerts that were triggered. Also, you can interactively visualize the profile metrics.

75
Q

All profile metrics are necessary for all columns that are profiled. True or False?

A

False. Not all profile metrics are necessary for all the columns that are profiled. As part of the planning phase, you should determine which metrics are most important for the types of data fields that you want to profile.

Calculating frequency distribution, pattern frequency distribution, percentiles, and outliers takes additional processing time, so for performance reasons, be judicious in their selection.

76
Q

Aside from data sources defined in data connections, what other input sources can be profiled?

A

text files, SQL queries, filters on tables, and collections.

77
Q

What are the 2 ways to profile a subset of data from a table?

A

The New Filtered Table option is used to create a new filtered table using the selected source table. This option enables you to use an interface to build an expression to be used for filtering data records from the input table. The filtered result set is then profiled. All records from the database are returned to the machine where the profile runs, and the filtering is performed on that machine.

The New SQL Query option is used to create an SQL query with a WHERE clause (using the selected table) to filter the data records. This option opens an SQL Query window where you can enter the SQL query to be used to filter the data. The filtered result set is then profiled. The database does the filtering for the SQL query and returns the filtered result set only to the machine where the profile runs.

78
Q

What are the 2 types of standardization schemes?

A

Phrase schemes are used to standardize entire phrases or strings that consist of more than one word. Some examples of the data types that are typically stored as a phrase include cities, organizations, addresses, and names.

Element schemes are applied to each individual word in a phrase. This can be especially useful if you have the type of data where certain words are repeated frequently (for example, the qualifying extension on business names). Some data types that are typically standardized using element schemes include address, organization, city, and name.

79
Q

Where are standardization schemes built in Data Management Studio?

A

Standardization schemes can easily be built in the results in a profile report. By selecting a column of data that is used in the profile, you can right-click on the column and select Build a Scheme from the menu.

80
Q

What are the 2 primary areas in the Act Phase of the DataFlux Data Management Methodology?

A

Design - After you complete the first two steps, this phase enables you to take the different structures, formats, data sources, and data feeds and create an environment that accommodates the needs of your business. At this step, business and IT users build workflows to enforce business rules for data quality and data integration. They also create data models to house data in consolidated or master data sources.

Execute - After business users establish how the data and rules should be defined, the IT staff can install them within the IT infrastructure and determine the integration method (real-time, batch, or virtual). These business rules can be reused and redeployed across applications, which helps increase data consistency in the enterprise.

81
Q

What are Data Jobs in Data Management Studio?

A

Data jobs are the main way to process data in Data Management Studio.

82
Q

What is a data job node?

A

Data jobs consist of nodes. Each node is designed to accomplish an objective (for example, generate a subset). Most data jobs start with a source node, process data in intermediate nodes, and end with a target node.

83
Q

What are the 3 Field Propagation Options in Data Management Studio?

A

If Target is selected for the Output Fields option, then only the node connected to a target node will initially propagate fields to the target node.

If Source and Target is selected for the Output Fields option, then
1. the source node will automatically propagate fields to the next node in the job flow diagram
2. the node connected to a target node will automatically propagate fields to the target node.

If All is selected for the Output Fields option, then all nodes will automatically propagate all fields to the next node in the job flow diagram.

84
Q

What are the 2 ways to standardize data with the Standardize Node?

A

Standardization Scheme - ensures the standard representation of data values and can be applied to single words (element scheme) or the entire phrase (phrase scheme).
* is a simple lookup table
* values not found remain as is
* e.g. West Arbor Road -> West Arbor Rd

Standardization Definition - is like an algorithm
* is more complex than a standardization scheme
* can involve one or more standardization schemes
* can also parse data and apply regular expression libraries and casing
* e.g. Mister John Q. Smith, Junior -> Name Definition -> Mr. John Q Smith, Jr

Note that both can be used for a given field
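
For comparison, in Base SAS code a standardization definition is applied with the DQSTANDARDIZE function of SAS Data Quality Server; this sketch assumes the ENUSA locale is loaded and uses a hypothetical input table:

data work.names_std;
   set work.names;
   /* apply the Name standardization definition from the ENUSA locale */
   std_name = dqStandardize(full_name, 'Name', 'ENUSA');
run;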

85
Q

What are Identification analysis definitions?

A

Identification analysis definitions specify data and logic that can be used to identify the semantic type of a data string. Used by the Identification Analysis Node and the Right Fielding Node in the Quality Grouping of nodes.

For example, an identification analysis definition might be used to determine whether the data shown represents the name of an individual or an organization.
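
In SAS code, an identification analysis definition can be applied with the DQIDENTIFY function of SAS Data Quality Server; a sketch assuming the ENUSA locale is loaded and a hypothetical table:

data work.contacts_identified;
   set work.contacts;
   /* returns a category such as INDIVIDUAL or ORGANIZATION */
   id_category = dqIdentify(contact_name, 'Individual/Organization', 'ENUSA');
run;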

86
Q

What is the Identification Analysis Node?

A

The purpose of an Identification Analysis node is to analyze the values in the field being analyzed and attempt to determine the identity of that data value. It produces a result set that flags every record with the type of data that is discovered or identified.

87
Q

What is the Right Fielding node?

A

The purpose of a Right Fielding node is to analyze the field to determine the identity of that data value. Then, based on that identity, the value is moved into a new field that is created for that particular identity.

88
Q

What is a branch node?

A

The Branch node is used to split one input into multiple outputs.

89
Q

What are Gender analysis definitions?

A

Gender analysis definitions attempt to identify the gender of a person based on the value of a person’s name. The values returned by the gender analysis definition are M(ale), F(emale), and U(nknown).
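
A sketch using the corresponding DQGENDER function of SAS Data Quality Server, assuming the ENUSA locale is loaded and a hypothetical names table:

data work.names_gendered;
   set work.names;
   gender = dqGender(full_name, 'Gender', 'ENUSA');   /* returns M, F, or U */
run;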

90
Q

What reference data packs are available in Data Management Studio?

A

Data Management Studio allows direct use of data packs provided by the United States Postal Service, Canada Post, and Geo+Phone data.

91
Q

What is geocoding?

A

Geocoding enhances address information with latitude and longitude values. Geocoding can be licensed to return this information for the centroid of the postal code or at the rooftop level.

92
Q

What is a QKB Match Definition?

A

The QKB has a definition type called a Match definition that can transform data into encoded forms called match codes. The match codes can then be compared across records to help identify potential duplicate records.

93
Q

What are Match Codes?

A

Encoded text strings (generated by a Match definition) that represent a data value based on a level of sensitivity. Match codes can be compared across records to identify potentially duplicate data that might be obvious to the human eye, but not necessarily obvious to a computer program.
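
A sketch of generating match codes in SAS code with the DQMATCH function of SAS Data Quality Server, assuming the ENUSA locale is loaded and a hypothetical table; 85 is a typical sensitivity value:

data work.customer_matchcodes;
   set work.customers;
   name_mc = dqMatch(customer_name, 'Name', 85, 'ENUSA');
run;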

94
Q

Describe the steps of the match code generation process.

A
  • Data is parsed into tokens (for example, Given Name and Family Name).
  • Significant tokens are selected for use in the match code.
  • Ambiguities and noise words are removed (for example, “the”).
  • Transformations are made (for example, Johnathon > Jon).
  • Phonetics are applied (for example, HN sounds like N).
  • Based on the sensitivity selection, the following results occur:
    ** Relevant components are determined.
    ** A certain number of characters of the transformed, relevant components are used.
95
Q

What is match code sensitivity?

A

used to determine how much of the data string you want to use in the generation of the match code, as well as the number of transformations that get applied to the original data string.

96
Q

What is clustering?

A

The concept of clustering involves grouping similar data records.

Some key points:
* Clusters can be determined based on a variety of conditions.
* Records can be clustered based on match codes or actual data values.
* A new field that contains the numeric identifier for the cluster is created.
* Additional fields can be added to indicate which of the clustering conditions were met.

97
Q

What is the Clustering Node in Data Management Studio?

A

The Clustering node enables you to match records based on multiple conditions. You have full flexibility to create conditions that support your business needs.

98
Q

What is the Surviving Record Identification Node (SRI) Data Management Studio?

A

The Surviving Record Identification node provides the ability to specify two types of rules to select and build out the best surviving record for multi-row clusters:
* record rules
* field rules

99
Q

What are Field Rules in Data Management Studio?

A

Field rules are used to “borrow” information from other records in the cluster when the surviving record does not contain a data value for that particular field, or if a “better” value exists in one of the other records in the cluster.

100
Q

What are the options in the SRI node for passing on records?

A
  • By default, the SRI node passes only surviving records to the next node in the data flow.
  • Keep duplicate records. A new field can be defined to flag the surviving record with a True versus False for non-surviving records.
  • Keep original duplicate records and have a new distinct record as the surviving record. This is particularly useful if you are using field rules to update values in the surviving record. A new field can be defined to flag the surviving record with a True versus False for non-surviving records.
  • Keep original duplicate records. A new field can be defined to use the primary key field of the surviving record as the surviving record ID value for the non-surviving records in a cluster.
101
Q

What are the 6 component tables of the QKB?

A
  • chop tables - Extract individual words from a text string
  • regular expression libraries - Standardization, categorization, casing, and pattern identification
  • phonetics libraries - Phonetic (sound-alike) reduction of words
  • schemes - Standardize phrases, words, and tokens
  • vocabularies - Categorize words
  • grammars - Identify patterns in word categories
102
Q

Describe the Schemes QKB component File.

A

A lookup table used to transform data values to a standard representation.

Used to standardize phrases, words, and tokens

103
Q

Describe the Chop Tables QKB component File.

A

A collection of character-level rules used to create an ordered word list from a string. For each character represented in the table, you can specify the classification and the operation performed by the algorithm.

Used to extract individual words from a text string

104
Q

Describe the Phonetics Libraries QKB component File.

A

A collection of patterns that produce the same output string for input strings that have similar pronunciations or spellings.

Phonetic (sound-alike) reduction of words

105
Q

Describe the Regex Libraries QKB component File.

A

A collection of patterns that are matched against a text string (from left to right) for character-level cleansing and operations.

Standardization, categorization, casing, and pattern identification

106
Q

Describe the Vocabulary Libraries QKB component File.

A

A collection of words, each associated with one or more categories and likelihoods.

Categorize words

107
Q

Describe the Grammars QKB component File.

A

A collection of rules that represent extracted patterns of words in a given context

Identify patterns in word categories

108
Q

What is a QKB definition?

A

In the context of the QKB, a definition is a collection of metadata that defines an algorithm that can perform a data-cleansing operation. A definition type corresponds to a type of data-cleansing operation. For example, a match definition contains metadata used for creating a match code, and a parse definition contains metadata used for parsing a data string into its individual tokens. Each definition is associated with a data type (that is, the “Name” parse definition belongs to the “Name” data type).

109
Q

What are the types of QKB definitions in Data Management Studio?

A
  • Case
  • Extraction
  • Gender Analysis
  • Identification Analysis
  • Language Guess
  • Locale Guess
  • Match
  • Parse
  • Pattern Analysis
  • Standardization
110
Q

What is DQSETUPLOC?

A

The DQSETUPLOC system option points to the root location of the QKB.

-DQSETUPLOC "D:\ProgramData\SAS\QKB\CI27_MultipleLocales"

111
Q

What is DQLOCALE?

A

The DQLOCALE system option selects the default locale(s) to be used in SAS code. It sets an ordered list of locales for SAS to use for data cleansing processes.

-DQLOCALE (ENUSA ENGBR FRCAN)
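
In a SAS session, both options are often set together with the %DQLOAD autocall macro, which loads the specified locales into memory; the QKB path here is hypothetical:

%dqload(dqlocale=(ENUSA ENGBR), dqsetuploc='D:\ProgramData\SAS\QKB\CI27_MultipleLocales');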