2. SAS Data Management Tools and Applications Flashcards

1
Q

What is a SAS/ACCESS LIBNAME?

A

A SAS programmer uses a LIBNAME statement to connect to the DBMS. DBMS tables can be named in a SAS program wherever SAS data sets can be named. SAS implicitly converts the SAS code into native database SQL statements.
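
A minimal sketch, assuming SAS/ACCESS to Oracle; the credentials, path, and table names are hypothetical:

/* Connect to the DBMS with a LIBNAME statement */
libname mydblib oracle user=scott password=tiger path=orapath schema=hr;

/* Reference the DBMS table as if it were a SAS data set */
proc print data=mydblib.employees;
run;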

2
Q

What is SQL pass-through?

A

A SAS programmer uses the SQL procedure in a SAS session to write and submit native database SQL to the database.
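
A minimal pass-through sketch, again assuming SAS/ACCESS to Oracle with hypothetical connection values and table:

proc sql;
   connect to oracle (user=scott password=tiger path=orapath);
   /* The inner query is native Oracle SQL, passed through unchanged */
   select * from connection to oracle
      (select empno, ename from employees where deptno = 10);
   disconnect from oracle;
quit;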

3
Q

What are the 2 types of SQL pass-through components that you can submit to the DBMS?

A

SELECT - produces output to SAS software (SQL procedure pass-through queries)
EXECUTE - performs non-query SQL statements that do not produce output (for example, GRANT, CREATE, or DROP), that is, DDL
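
As a sketch, the EXECUTE component might be used as follows (hypothetical connection values and table):

proc sql;
   connect to oracle (user=scott password=tiger path=orapath);
   /* Non-query DDL runs in the DBMS and returns no output */
   execute (drop table temp_employees) by oracle;
   disconnect from oracle;
quit;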

4
Q

What is the SASTRACE option?

A

SASTRACE is a system option, set with an OPTIONS statement, that causes the SAS log to display details about the native database SQL generated by the SAS/ACCESS engines. Developers can use SASTRACE to evaluate queries to ensure that they produce efficient code and do not needlessly return large amounts of data from the RDBMS.
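
For example, a common setting that writes the generated DBMS SQL to the SAS log:

options sastrace=',,,d' sastraceloc=saslog nostsuffix;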

5
Q

How are the SAS Platform Applications organized?

A

Analytics applications
Reporting applications
Programming applications
Data Management applications

6
Q

What is DataFlux Data Management Studio?

A

Combines data profiling, data cleansing, entity resolution, and monitoring tools for incorporating data quality into an information management process.

7
Q

SAS Data Integration Studio provides role-based security. True or False?

A

False

8
Q

What is a SAS Package?

A

A SAS internal format that can be used to move metadata between SAS metadata repositories and maintain metadata backups. It supports most SAS platform metadata objects including objects relevant to SAS Data Integration Studio, such as jobs, libraries, tables, and external files.

9
Q

How do you import a package in SAS Data Integration Studio?

A

Right-click a folder from the Folders tab
Select Import
Navigate to the SAS Package
Verify the objects to Import
Import

10
Q

What does the status bar in SAS Data Integration Studio display?

A

The status bar displays the name of the currently selected object, the name of the default SAS Application Server if one is selected, the login ID and metadata identity of the current user, and the name of the current SAS Metadata Server.

11
Q

What components are accessed through the Tree View in SAS Data Integration Studio?

A

Folders Tree - displays the metadata folder hierarchy that is shared across the SAS platform and can be customized
Inventory Tree - displays metadata in predefined categories and is not customizable
Transformations Tree - provides access to transformations that can be added to SAS Data Integration Studio jobs
Checkouts Tree - available to users working under change management
Basic Properties pane - displays the basic properties of an object selected in a tree view

12
Q

What is the Job Editor Window in SAS Data Integration Studio?

A

The Job Editor window enables you to create, run, and troubleshoot SAS Data Integration Studio jobs

13
Q

What are the tabs in the Job Editor Window in SAS Data Integration Studio?

A
  • The Diagram tab is used to build and update the process flow for a job.
  • The Code tab is used to review or update code for a job.
  • The Log tab is used to review the log for a submitted job.
  • The Output tab is used to review the output of a submitted job.
  • The Details pane is used to monitor and debug a job in the Job Editor.
14
Q

What is the Options Window in SAS Data Integration Studio?

A

The Options window is used to specify global settings for SAS Data Integration Studio options.

15
Q

What is a library?

A

A library is a collection of one or more files that are referenced as a unit. Libraries:

  • provide access to source tables and target tables
  • can be created in SAS Management Console by an administrator
  • can be created in SAS Data Integration Studio by a Data Integration developer
  • are created with the New Library Wizard.
16
Q

What are 2 ways of defining SAS Metadata Objects for Source Data in SAS Data Integration Studio?

A
  • The Register Tables Wizard can be used to register metadata for existing tables.
  • The External File wizards can be used to register metadata for an external file.
17
Q

What are 3 items that the Register Tables Wizard in SAS Data Integration Studio defines?

A
  • type of table (DBMS type)
  • library
  • metadata location
18
Q

What are 3 items that you define with the External File Wizard in SAS Data Integration Studio?

A
  • file to read
  • parameters
  • column definition
19
Q

What are the dependencies for SAS Table Metadata in SAS Data Integration Studio?

A

The library object is used by the table object; in other words, the table object is dependent on the library object.

20
Q

What are the dependencies for DBMS Table Metadata in SAS Data Integration Studio?

A

The server object is used by the library object, and the library object is used by the table object. Conversely, the table object is dependent on the library object, and the library object is dependent on the server object.

21
Q

What are the dependencies for ODBC Table Metadata in SAS Data Integration Studio?

A

The ODBC data source is used by the server object, the server object is used by the library object, and the library object is used by the table object. Conversely, the table object is dependent on the library object, the library object is dependent on the server object, and the server object is dependent on the ODBC data source.

22
Q

What tasks can be performed to define metadata for a new table using the New Table Wizard in SAS Data Integration Studio?

A
  • import metadata from tables/columns that are already registered in the metadata repository
  • override metadata that was imported (for example, change a column name)
  • define new attributes for the table that is defined (for example, indexes)
23
Q

What is the Common Warehouse Metamodel (CWM)?

A

an industry-standard format that is supported by many software vendors. The CWM format supports relational metadata such as tables, columns, indexes, and keys.

The CWM format can be used to exchange relational metadata between software applications and to import models from third-party data modeling tools into SAS Data Integration Studio.

24
Q

What are SAS Metadata Bridges?

A

Tools for reading each supported vendor-specific CWM format into SAS.

25
Q

What are jobs (processes) in SAS Data Integration Studio?

A

Jobs read from sources, perform necessary data transformations, and load targets. In other words, a job is a metadata object that organizes sources, targets, and transformations into processes that create output

26
Q

What are tan-colored objects in a SAS Data Integration Studio process flow?

A

table and external file objects

27
Q

What are blue-colored objects in a SAS Data Integration Studio process flow?

A

transformations

28
Q

What is a transformation in SAS Data Integration Studio?

A

A transformation allows you to specify how to extract data, transform data, or load data into data stores. Each transformation that you specify in a process flow diagram generates or retrieves SAS code which can be customized.

29
Q

What is the Splitter transformation in SAS Data Integration Studio?

A

The Splitter transformation is found in the Data category of transformations and can be used to create one or more subsets of a source.

30
Q

What is the Join Transformation in SAS Data Integration Studio?

A

The Join transformation is found in the SQL category of transformations and generates PROC SQL code.

It is used to create an SQL query that runs in the context of a SAS Data Integration Studio job and supports all the join types, such as inner join, left join, right join, and full join.

It supports subqueries and pass-through SQL and features a graphical interface for building and configuring the components of the SQL query.
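
A sketch of the kind of PROC SQL code the Join transformation generates; the table and column names are hypothetical:

proc sql;
   create table work.orders_joined as
   select o.order_id, o.order_date, c.customer_name
      from work.orders o
           left join work.customers c
           on o.customer_id = c.customer_id;
quit;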

31
Q

What are the main views in the Join Designer Window of SAS Data Integration Studio?

A
  • Main display area with Diagram tab, Code tab, and Log tab
  • Navigate pane - navigate the current query
  • SQL Clauses pane - add SQL clauses or edit the join type
  • Properties pane - view and edit properties of a selected item
32
Q

What is the File Reader transformation in SAS Data Integration Studio?

A

A transformation used to read external files. This is not needed for SAS files.

33
Q

What is the Extract transformation in SAS Data Integration Studio?

A

Generates PROC SQL code and supports the following items:
* one source table
* SELECT clause
* derived columns
* WHERE clause
* GROUP BY clause
* ORDER BY clause
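
A sketch of the kind of PROC SQL code the Extract transformation generates, exercising each supported item (hypothetical table and columns):

proc sql;
   create table work.shipped_totals as
   select customer_id,
          sum(amount) as total_amount   /* derived column */
      from work.orders                  /* one source table */
      where status = 'SHIPPED'          /* WHERE clause */
      group by customer_id              /* GROUP BY clause */
      order by total_amount desc;       /* ORDER BY clause */
quit;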

34
Q

What is the Summary Statistics transformation in SAS Data Integration Studio?

A

The Summary Statistics transformation generates PROC MEANS code and provides the following items:
* descriptive statistics like SUM, MEAN, and STD
* multiple analysis variables (for example, revenue and profit)
* multiple classification (group by) variables (for example, gender and age group)
* output to a report by default
* output to a table
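
A sketch of the kind of PROC MEANS code the transformation generates (hypothetical table and variables):

proc means data=work.orders sum mean std;
   class gender age_group;   /* classification (group by) variables */
   var revenue profit;       /* analysis variables */
   /* direct the statistics to a table in addition to the default report */
   output out=work.order_stats sum= mean= std= / autoname;
run;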

35
Q

What are the SQL transformations in SAS Data Integration Studio?

A

The transformations in the SQL group generate SQL code and enable you to
* create tables (SAS and DBMS)
* delete rows from a table
* execute SQL statements in a DBMS
* extract rows from a source table
* insert rows into a target table
* join tables
* SQL merge (update or insert)
* perform set operations
* update rows in a target table.

36
Q

What is the Set Operators Transformation in SAS Data Integration Studio?

A

Set operations allow multiple query result sets to be combined into a single result set. The sets are combined vertically, with operations combining rows from the source sets in different ways depending on the operation.

UNION, EXCEPT, INTERSECT
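
A minimal PROC SQL sketch of a set operation (hypothetical tables); EXCEPT or INTERSECT could replace UNION:

proc sql;
   create table work.all_customers as
   select customer_id from work.customers_east
   union                     /* combines the result sets vertically */
   select customer_id from work.customers_west;
quit;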

37
Q

What are Custom Transformations in SAS Data Integration Studio?

A

Custom transformations give data integration developers the flexibility to extend the functional capabilities of SAS Data Integration Studio.

All transformations, including custom ones, generate SAS code to accomplish the following:
* extract data
* transform data
* load data into data stores
* create reports

38
Q

What are the two transformation templates available in SAS Data Integration Studio?

A

Java Plug-in Transformation Templates - Created with the Java programming language

Generated Transformation Templates - Created with the New Transformation Wizard and SAS code

39
Q

What are options and option groups in custom transformations in SAS Data Integration Studio?

A

Options and option groups are tied to macro variables. Users select values on the Options tab of the transformation properties to configure the transformation, and those values are assigned to the corresponding macro variables in the code.

Options are surfaced on the Options tab of the transformation when it is used in a job. Groups become the organizational categories in the left-hand column of the Options tab, and prompts become options.

In essence, they are parameters for custom transformations.

40
Q

What is the SAS Quality Knowledge Base (QKB)?

A

a collection of files and algorithms that store data and logic for defining data management operations such as data cleansing and standardization. The definitions in the QKB that perform the cleansing tasks are geography and language-specific.

41
Q

What is a locale?

A

a combination of geography and language used in the QKB

42
Q

Where are QKB components modified?

A

In SAS Data Management Studio (aka DataFlux)

43
Q

What is a QKB definition?

A

A definition is a set of steps for processing data values. Definitions are organized by type.

44
Q

What are the types of QKB definitions?

A
  • Case - Transforms a text string by changing the case of its characters to uppercase, lowercase, or proper case.
  • Extraction - Extracts parts of the text string and assigns them to corresponding tokens for the specified data type.
  • Gender Analysis - Guesses the gender of the individual in the text string.
  • Identification Analysis - Identifies the text string as referring to a particular predefined category.
  • Language Guess - Guesses the language of a text string.
  • Locale Guess - Guesses the locale of a text string.
  • Match - Generates match codes for text strings, where the match codes denote a fuzzy representation of the character content of the tokens in the text string.
  • Parse - Parses a text string into meaningful tokens.
  • Pattern Analysis - Transforms a text string into a particular pattern.
  • Standardization - Transforms a text string into a standard format.
45
Q

What is DataFlux Data Management Studio?

A

a Windows-based desktop client application that enables you to create and manage processes for ensuring the accuracy and consistency of data during the data management life cycle. Typically, Data Management Studio is thought of as the development environment.

46
Q

What are the components of the SAS data quality offering?

A

DataFlux Data Management Studio
DataFlux Data Management Server
SAS Quality Knowledge Base (QKB)
Reference data source(s)
DataFlux repository

47
Q

What is DataFlux Data Management Server?

A

provides a scalable server environment for executing the processes created in Data Management Studio. Typically, Data Management Server is thought of as the production environment.

48
Q

What are Reference data source(s)?

A

third-party address verification, enrichment, and geo-coding databases available to validate and enhance your data.

aka “data packs”

49
Q

What is the DataFlux repository?

A

storage location for objects created and executed in Data Management Studio. Data Management Server also needs a repository of objects that will be available (and run) on the server.

50
Q

What is Data Management Studio used for?

A

Data Management Studio is used to
* define data connections
* establish links to the QKB and address verification data packs
* create repositories
* create data explorations, profiles, data jobs, and process jobs

51
Q

What are the 3 phases for performing data management tasks
such as data quality, data integration, data migrations, and master data management (MDM).

A

Plan - Define and Discover
Act - Design and Execute
Monitor - Evaluate and Control

52
Q

What Is a DataFlux Repository?

A

A DataFlux repository is the storage location for objects that the user creates in Data Management Studio.

The repository is used to organize work and can surface lineage relationships among data sources and the objects and files that are created.

53
Q

What are the 2 components of a DataFlux Repository?

A

Data storage part of the repository, stored in a database. It can contain
* explorations and reports
* profiles and reports
* business rules
* monitoring results
* custom metrics
* business data information
* master data information

File storage part of the repository, stored as operating system files. It can contain
* data jobs
* process jobs
* match reports
* entity resolution files
* queries
* entity definitions
* other files

54
Q

Where are QKB files registered?

A

On the Administration riser bar in Data Management Studio

55
Q

How are locales organized in Data Management Studio?

A

Locales are organized in a hierarchy according to their language and country. You can expand the QKB to display the available languages, and then expand one of the languages to view the locales that are associated with that language.

56
Q

What Is a Reference Source?

A

Reference data sources are used to verify and enrich data. The reference sources (also known as data packs) are typically databases that Data Management Studio compares user data against. Given enough information to match an address, location, or phone number, the reference data source can add a variety of additional fields to further clarify and enrich your data.

Data Management Studio allows direct use of data packs provided by the United States Postal Service, Canada Post, and Geo+Phone data.

57
Q

How are Reference Sources organized in Data Management Studio?

A

On the Administration riser bar in Data Management Studio

58
Q

What Is a Data Connection?

A

Data connections enable you to access your data in Data Management Studio from many types of data sources.

59
Q

What are the types of Data Connections ?

A
  • ODBC Connection - Displays the Microsoft Windows ODBC Data Source Administrator dialog box.
  • Domain Enabled ODBC Connection - Enables you to link an ODBC connection to an authentication server domain so that credentials for each user are automatically applied when you access the domain.
  • Custom Connection - Enables you to access data sources that are not otherwise supported in the Data Management Studio interface.
  • SAS Data Set Connection - Enables you to connect to a folder that contains one or more SAS data sets.
60
Q

Where are data connections created?

A

Data connections are created from the Data riser bar in Data Management Studio.

61
Q

Describe the two primary areas of the Plan phase.

A

Define - First step of the planning phase where the people, processes, technologies, and data sources are defined. Roadmaps that include articulating acceptable outcomes are built. Finally, cross-functional teams across business units and between business and IT communities are created to define the data management business rules.

Discover - This step of the methodology lets you explore metadata to verify that the right data sources are included in the data management program and create detailed data profiles of identified data sources to understand their strengths and weaknesses.

62
Q

What is a data exploration in Data Management Studio?

A

A data exploration will help you understand the structure of your data sources. Data explorations provide comparisons of structural information across data sources, showing, for example, variations in the spelling of field names.

Data explorations can be used for the following purposes:
* to identify data redundancies
* to extract and organize metadata from multiple sources
* to identify relationships between metadata
* to identify specified business data types (city, address, phone, and so on)

63
Q

What is a data profile in Data Management Studio?

A

A data profile helps in understanding the data values in the various data sources. Data profiles show distributions of field values and include useful metrics such as mean and standard deviation for numeric fields and pattern counts for text fields. Data profiling will also surface occurrences of null values, outliers, and other anomalies.

Data profiles provide the following benefits:
* improve understanding of existing data
* aid in identifying issues early in the data management process, when they are easier and less expensive to manage
* help determine the steps necessary to address issues that were identified
* enable you to assess the quality of your data across time.

64
Q

What is a data collection in Data Management Studio?

A

From reports generated for a data exploration, a data collection can be created as a simple list of fields, potentially drawn from multiple tables across multiple data connections. A data collection:

  • provides a convenient way to group related data fields from across data sources
  • proves to be beneficial in data governance efforts (for example, as a way to record all the fields that contain phone numbers)
  • can be used as an input source for profiles.
65
Q

What is a standardization scheme in Data Management Studio?

A

A standardization scheme takes various spellings or representations of a data value and defines a standard way to consistently write these values.

66
Q

What are the three types of analyses available for exploring data

A

field name matching
field name analysis
sample data analysis

67
Q

What is field name matching?

A

A type of analysis that uses a definition known as a match definition. This algorithm applies fuzzy logic to determine fields that might represent the same type of data, based on the field name. For example, if you had a field named Phone_Number in one table and a field named Phone_No in another table, the definition would identify these fields as potentially being the same, based on a match code generated from the field name

68
Q

What is field name analysis?

A

A type of analysis that uses a definition known as an identification analysis definition. This algorithm uses the actual words contained in the field name, and looks up the words in a vocabulary of words to categorize the words and identify their potential meanings. For example, a field named Fax_Number might be categorized as a PHONE type field.

69
Q

What is sample data analysis?

A

A type of analysis that uses an identification analysis definition but provides the ability to sample the data in the table to determine whether the data is of a specific type. For example, a sampling of 50 data records could reveal that the data in a particular field has 10 digits for each value. The identification analysis definition might categorize the field (based on its data contents) as a PHONE type field.

70
Q

What is a Field Relationship Map?

A

The Field Relationship map provides a visual presentation of the field relationships between all the databases, tables, and fields that are included in the data exploration.

71
Q

What is Field Name Matching?

A

Field name matching analyzes the names of each field from the selected data sources to determine which fields have either an identical name or names that are similar enough to be considered a match.

If Field name matching is selected, an appropriate match definition (from the specified locale) along with a preferred sensitivity must be specified.

72
Q

What is Field Name Analysis?

A

Field name analysis analyzes the names of each field from the selected data sources to determine which identity to assign to the field. An identity assigns a semantic meaning to the type of information that the field contains (for example, address, phone number, and so on).

73
Q

What are the required actions to perform a data profile?

A
  • select the data sources to be profiled
  • select the data tables to be profiled
  • select the columns to be profiled
  • specify the metrics to be calculated against the data
  • specify additional, more advanced options (Primary Key/Foreign Key Analysis, Redundant Data Analysis, and Alerts)
74
Q

What are the two main components of a data profile interface?

A

The Properties tab is used to identify the data sources, tables, and columns to be used in the profile. In addition, you can use the profile to control the default profile metrics that are calculated, set options for calculating the profile metrics, override the metrics for certain columns, and more. Also, you can specify custom metrics, assign business rules, add alerts, and create visualizations as part of the profile.

The Report tab is used to view the output from running the profile. Specifically, you can view the profile metrics for the tables included in the profile, as well as any columns that are included in the profile. In addition, you can view the results of any custom metrics specified, any business rules that were triggered, and any alerts that were triggered. Also, you can interactively visualize the profile metrics.

75
Q

All profile metrics are necessary for all columns that are profiled. True or False?

A

False. Not all profile metrics are necessary for all the columns that are profiled. As part of the planning phase, you should determine which metrics are most important for the types of data fields that you want to profile.

Calculating frequency distribution, pattern frequency distribution, percentiles, and outliers takes additional processing time, so for performance reasons, be judicious in their selection.

76
Q

Aside from data sources defined in data connections, what other input sources can be profiled?

A

text files, SQL queries, filters on tables, and collections.

77
Q

What are the 2 ways to profile a subset of data from a table?

A

The New Filtered Table option is used to create a new filtered table using the selected source table. This option enables you to use an interface to build an expression to be used for filtering data records from the input table. The filtered result set is then profiled. All records from the database are returned to the machine where the profile runs, and the filtering is performed on that machine.

The New SQL Query option is used to create an SQL query with a WHERE clause (using the selected table) to filter the data records. This option opens an SQL Query window where you can enter the SQL query to be used to filter the data. The filtered result set is then profiled. The database does the filtering for the SQL query and returns the filtered result set only to the machine where the profile runs.

78
Q

What are the 2 types of standardization schemes?

A

Phrase schemes are used to standardize entire phrases or strings that consist of more than one word. Some examples of the data types that are typically stored as a phrase include cities, organizations, addresses, and names.

Element schemes are applied to each individual word in a phrase. This can be especially useful if you have the type of data where certain words are repeated frequently (for example, the qualifying extension on business names). Some data types that are typically standardized using element schemes include address, organization, city, and name.

79
Q

Where are standardization schemes built in Data Management Studio?

A

Standardization schemes can easily be built in the results in a profile report. By selecting a column of data that is used in the profile, you can right-click on the column and select Build a Scheme from the menu.

80
Q

What are the 2 primary areas in the Act Phase of the DataFlux Data Management Methodology?

A

Design - After you complete the first two steps, this phase enables you to take the different structures, formats, data sources, and data feeds and create an environment that accommodates the needs of your business. At this step, business and IT users build workflows to enforce business rules for data quality and data integration. They also create data models to house data in consolidated or master data sources.

Execute - After business users establish how the data and rules should be defined, the IT staff can install them within the IT infrastructure and determine the integration method (real-time, batch, or virtual). These business rules can be reused and redeployed across applications, which helps increase data consistency in the enterprise.

81
Q

What are Data Jobs in Data Management Studio?

A

Data jobs are the main way to process data in Data Management Studio.

82
Q

What is a data job node?

A

Data jobs consist of nodes. Each node is designed to accomplish an objective (for example, generate a subset). Most data jobs start with a source node, process data in intermediate nodes, and end with a target node.

83
Q

What are the 3 Field Propagation Options in Data Management Studio?

A

If Target is selected for the Output Fields option, then only the node connected to a target node will initially propagate fields to the target node.

If Source and Target is selected for the Output Fields option, then
1. the source node will automatically propagate fields to the next node in the job flow diagram
2. the node connected to a target node will automatically propagate fields to the target node.

If All is selected for the Output Fields option, then all nodes will automatically propagate all fields to the next node in the job flow diagram.

84
Q

What are the 2 ways to standardize data with the Standardize Node?

A

Standardization Scheme - ensures the standard representation of data values and can be applied to single words (element scheme) or the entire phrase (phrase scheme).
* is a simple lookup table
* values not found remain as is
* e.g. West Arbor Road -> West Arbor Rd

Standardization Definition - is like an algorithm
* is more complex than a standardization scheme
* can involve one or more standardization schemes
* can also parse data and apply regular expression libraries and casing
* e.g. Mister John Q. Smith, Junior -> Name Definition -> Mr. John Q Smith, Jr

Note that both can be used for a given field
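
For comparison, in Base SAS code a standardization definition is applied with the DQSTANDARDIZE function of SAS Data Quality Server; this sketch assumes the ENUSA locale is loaded and uses a hypothetical input table:

data work.names_std;
   set work.names;
   /* apply the Name standardization definition from the ENUSA locale */
   std_name = dqStandardize(full_name, 'Name', 'ENUSA');
run;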

85
Q

What are Identification analysis definitions?

A

Identification analysis definitions specify data and logic that can be used to identify the semantic type of a data string. Used by the Identification Analysis Node and the Right Fielding Node in the Quality Grouping of nodes.

For example, an identification analysis definition might be used to determine whether the data shown represents the name of an individual or an organization.
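
In SAS code, an identification analysis definition can be applied with the DQIDENTIFY function of SAS Data Quality Server; a sketch assuming the ENUSA locale is loaded and a hypothetical table:

data work.contacts_identified;
   set work.contacts;
   /* returns a category such as INDIVIDUAL or ORGANIZATION */
   id_category = dqIdentify(contact_name, 'Individual/Organization', 'ENUSA');
run;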

86
Q

What is the Identification Analysis Node?

A

The purpose of an Identification Analysis node is to analyze the values in the field being analyzed and attempt to determine the identity of that data value. It produces a result set that flags every record with the type of data that is discovered or identified.

87
Q

What is the Right Fielding node?

A

The purpose of a Right Fielding node is to analyze the field to determine the identity of that data value. Then, based on that identity, the value is moved into a new field that is created for that particular identity.

88
Q

What is a branch node?

A

The Branch node is used to split one input into multiple outputs.

89
Q

What are Gender analysis definitions?

A

Gender analysis definitions attempt to identify the gender of a person based on the value of a person’s name. The values returned by the gender analysis definition are M(ale), F(emale), and U(nknown).
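
A sketch using the corresponding DQGENDER function of SAS Data Quality Server, assuming the ENUSA locale is loaded and a hypothetical names table:

data work.names_gendered;
   set work.names;
   gender = dqGender(full_name, 'Gender', 'ENUSA');   /* returns M, F, or U */
run;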

90
Q

What reference data packs are available in Data Management Studio?

A

Data Management Studio allows direct use of data packs provided by the United States Postal Service, Canada Post, and Geo+Phone data.

91
Q

What is geocoding?

A

Geocoding enhances address information with latitude and longitude values. Geocoding can be licensed to return this information for the centroid of the postal code or at the rooftop level.

92
Q

What is a QKB Match Definition?

A

The QKB has a definition type called a Match definition that can transform data into encoded forms called match codes. The match codes can then be compared across records to help identify potential duplicate records.

93
Q

What are Match Codes?

A

Encoded text strings (generated by a Match definition) that represent a data value based on a level of sensitivity. Match codes can be compared across records to identify potentially duplicate data that might be obvious to the human eye, but not necessarily obvious to a computer program.
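
A sketch of generating match codes in SAS code with the DQMATCH function of SAS Data Quality Server, assuming the ENUSA locale is loaded and a hypothetical table; 85 is a typical sensitivity value:

data work.customer_matchcodes;
   set work.customers;
   name_mc = dqMatch(customer_name, 'Name', 85, 'ENUSA');
run;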

94
Q

Describe the steps of the match code generation process.

A
  • Data is parsed into tokens (for example, Given Name and Family Name).
  • Significant tokens are selected for use in the match code.
  • Ambiguities and noise words are removed (for example, “the”).
  • Transformations are made (for example, Johnathon > Jon).
  • Phonetics are applied (for example, HN sounds like N).
  • Based on the sensitivity selection, the following results occur:
    ** Relevant components are determined.
    ** A certain number of characters of the transformed, relevant components are used.
95
Q

What is match code sensitivity?

A

used to determine how much of the data string you want to use in the generation of the match code, as well as the number of transformations that get applied to the original data string.

96
Q

What is clustering?

A

The concept of clustering involves grouping similar data records.

Some key points:
* Clusters can be determined based on a variety of conditions.
* Records can be clustered based on match codes or actual data values.
* A new field that contains the numeric identifier for the cluster is created.
* Additional fields can be added to indicate which of the clustering conditions were met.

97
Q

What is the Clustering Node in Data Management Studio?

A

The Clustering node enables you to match records based on multiple conditions. You have full flexibility to create conditions that support your business needs.

98
Q

What is the Surviving Record Identification Node (SRI) Data Management Studio?

A

The Surviving Record Identification node provides the ability to specify two types of rules to select and build out the best surviving record for multi-row clusters:
* record rules
* field rules

99
Q

What are Field Rules in Data Management Studio?

A

Field rules are used to “borrow” information from other records in the cluster when the surviving record does not contain a data value for that particular field, or if a “better” value exists in one of the other records in the cluster.

100
Q

What are the options in the SRI node for passing on records?

A
  • By default, the SRI node passes only surviving records to the next node in the data flow.
  • Keep duplicate records. A new field can be defined to flag the surviving record with a True versus False for non-surviving records.
  • Keep original duplicate records and have a new distinct record as the surviving record. This is particularly useful if you are using field rules to update values in the surviving record. A new field can be defined to flag the surviving record with a True versus False for non-surviving records.
  • Keep original duplicate records. A new field can be defined to use the primary key field of the surviving record as the surviving record ID value for the non-surviving records in a cluster.
101
Q

What are the 6 component tables of the QKB?

A
  • chop tables - Extract individual words from a text string
  • regular expression libraries - Standardization, categorization, casing, and pattern identification
  • phonetics libraries - Phonetic (sound-alike) reduction of words
  • schemes - Standardize phrases, words, and tokens
  • vocabularies - Categorize words
  • grammars - Identify patterns in word categories
102
Q

Describe the Schemes QKB component File.

A

A lookup table used to transform data values to a standard representation.

Used to standardize phrases, words, and tokens

103
Q

Describe the Chop Tables QKB component File.

A

A collection of character-level rules used to create an ordered word list from a string. For each character represented in the table, you can specify the classification and the operation performed by the algorithm.

Used to extract individual words from a text string

104
Q

Describe the Phonetics Libraries QKB component File.

A

A collection of patterns that produce the same output string for input strings that have similar pronunciations or spellings.

Phonetic (sound-alike) reduction of words

105
Q

Describe the Regex Libraries QKB component File.

A

A collection of patterns that are matched against a text string (from left to right) for character-level cleansing and operations.

Standardization, categorization, casing, and pattern identification

106
Q

Describe the Vocabulary Libraries QKB component File.

A

A collection of words, each associated with one or more categories and likelihoods.

Categorize words

107
Q

Describe the Grammars QKB component File.

A

A collection of rules that represent extracted patterns of words in a given context

Identify patterns in word categories

108
Q

What is a QKB definition?

A

In the context of the QKB, a definition is a collection of metadata that defines an algorithm that can perform a data-cleansing operation. A definition type corresponds to a type of data-cleansing operation. For example, a match definition contains metadata used for creating a match code, and a parse definition contains metadata used for parsing a data string into its individual tokens. Each definition is associated with a data type (that is, the “Name” parse definition belongs to the “Name” data type).

109
Q

What are the types of QKB definitions in Data Management Studio?

A
  • Case
  • Extraction
  • Gender Analysis
  • Identification Analysis
  • Language Guess
  • Locale Guess
  • Match
  • Parse
  • Pattern Analysis
  • Standardization
110
Q

What is DQSETUPLOC?

A

The DQSETUPLOC system option points to the root location of the QKB.

-DQSETUPLOC "D:\ProgramData\SAS\QKB\CI27_MultipleLocales"

111
Q

What is DQLOCALE?

A

The DQLOCALE system option selects the default locale(s) to be used in SAS code. It sets an ordered list of locales for SAS to use for data cleansing processes.

-DQLOCALE (ENUSA ENGBR FRCAN)
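
In a SAS session, both options are often set together with the %DQLOAD autocall macro, which loads the specified locales into memory; the QKB path here is hypothetical:

%dqload(dqlocale=(ENUSA ENGBR), dqsetuploc='D:\ProgramData\SAS\QKB\CI27_MultipleLocales');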