SAS: Base Programming Flashcards
Lesson 1: Getting Started with SAS
SAS Modes
Interactive
Batch
Noninteractive
Lesson 1: Getting Started with SAS
Interactive Mode
Code runs immediately
Lesson 1: Getting Started with SAS
Batch Mode
SAS runs the program in the background by submitting it to the operating environment for batch execution.
Program runs when the computer has resources available.
Lesson 1: Getting Started with SAS
Noninteractive Mode
Store store SAS programming statements in a file.
When you issue a SAS command referencing the file, SAS opens, execute the code immediately, and stores any log messages and output in files.
Then SAS closes.
Lesson 1: Getting Started with SAS
Types of steps in a SAS program?
Data or Proc
Lesson 1: Getting Started with SAS
Data Step
A DATA step typically takes data, processes it, and creates a SAS data set, which is data in a form that SAS understands.
A DATA step can create new variables.
Lesson 1: Getting Started with SAS
Proc Step
A PROC step typically processes a SAS data set.
PROC steps can generate reports and graphs, manage data, and sort data.
Lesson 1: Getting Started with SAS
Data read by SAS
Oracle files
SAS data sets
Raw data files
Excel files
Other types of files
Lesson 1: Getting Started with SAS
Tasks useful in interactive mode
Debugging programs
Editing programs
Testing programs
Writing programs
Lesson 2: Navigating and Using the SAS Interface
Three main programming windows
Editor window
Log window
Output window
Lesson 2: Navigating and Using the SAS Interface
True statements about SAS output
Can create both listing and HTML output
Manage all SAS output in results window
Listing and HTML output are seperate in the results window
Lesson 2: Navigating and Using the SAS Interface
Editor Window
Also known as Enhanced Editor
Lesson 2: Navigating and Using the SAS Interface
Program Editor Window
Default code editor in operating environments other than Windows
Lesson 2: Navigating and Using the SAS Interface
Log Window
Provides messages about your program
Lesson 2: Navigating and Using the SAS Interface
Output window
SAS continually adds output to this window
Lesson 2: Navigating and Using the SAS Interface
Results vs Output Windows
Results window displays a tree view of output that is created during a SAS session.
Output window is available in all SAS environment.
Lesson 2: Navigating and Using the SAS Interface
Are session result preferences stored in SAS program.
No
Lesson 2: Navigating and Using the SAS Interface
What can be done in Explorer window?
Move, copy and delete SAS files
View contents of SAS libraries
Open and create new SAS files
Lesson 3: Working with SAS Code
Diagnosing and correcting unbalanced quotation marks
Lesson 3: Working with SAS Code
Diagnosing and correcting syntax errors
Lesson 3: Working with SAS Code
Syntax Error
When programming statements don’t conform to the rules of the SAS language at compile time
Lesson 3: Working with SAS Code
Semantic Error
When the language element is correct, but the element might not be valid for a particular usage at compile time
Lesson 3: Working with SAS Code
Execution Time Error
When SAS attempts to execute a program and execution fails at execution time
Lesson 3: Working with SAS Code
Data Error
When data values are invalid at execution time
Lesson 3: Working with SAS Code
Macrorelated Error
When you use the macro facility incorrectly
Lesson 3: Working with SAS Code
How SAS compiles and executes
SAS compiles and executes each DATA or PROC step independently based on step boundaries, such as a RUN statement, a QUIT statement, or the beginning of a new DATA or PROC step.
Lesson 3: Working with SAS Code
Block Comment
Can appear anywhere a single blank can appear
Cancontain semicolons or unmatched quotation marks
Denoted with /* COMMENT */
Lesson 3: Working with SAS Code
Step Boundaries
Run
Quit
Data Statement
Proc Statement
Lesson 3: Working with SAS Code
Long running message at top of window indicating that a step is running
Missing run statement or unbalanced quotation marks
Lesson 3: Working with SAS Code
A log message indicating that an option seems to be invalid
Missing Semicolon
Lesson 3: Working with SAS Code
Log Message indicating that a word wasn’t recognized
Spelling Error
Lesson 4: Working with SAS Libraries and SAS Data Sets
SAS Library
Collection of one or more SAS files that are recognized by SAS and that are referenced and stored as a unit
Lesson 4: Working with SAS Libraries and SAS Data Sets
Libref
Logical name to reference a SAS library
Lesson 4: Working with SAS Libraries and SAS Data Sets
Libraries created at SAS session start
Work: Temporary library
Sasuser: Permanent library
Lesson 4: Working with SAS Libraries and SAS Data Sets
Naming a SAS data set
libref. data-set-name
work. data
library_name.data-set-name
Lesson 4: Working with SAS Libraries and SAS Data Sets
Libname
Lesson 4: Working with SAS Libraries and SAS Data Sets
Assigning Libref
In an interactive SAS session, a libref that you assign with a LIBNAME statement remains assigned until you cancel or change the libref or until you end your SAS session. After you end your SAS session, the contents of a permanent library still exist in their physical location. However, each time you start a new SAS session, you must resubmit a LIBNAME statement to reassign the libref to that permanent library.
Lesson 4: Working with SAS Libraries and SAS Data Sets
Proc Contents
View the contents of a SAS library
PROC CONTENTS DATA=libref._ALL_ NODS;
**RUN; **
Lesson 4: Working with SAS Libraries and SAS Data Sets
_ALL_ in the PROC CONTENTS
Displays a list of all the SAS files that are in the specified SAS library.
Lesson 4: Working with SAS Libraries and SAS Data Sets
NODS in PROC CONTENTS
Default PROC CONTENTS report includes the descriptor portion of each data set in the SAS library.
NODS option to suppress the descriptor portions in the report.
Lesson 4: Working with SAS Libraries and SAS Data Sets
Variable
Column in SAS data set
Lesson 4: Working with SAS Libraries and SAS Data Sets
Observation
Row in SAS data set
Lesson 4: Working with SAS Libraries and SAS Data Sets
Value vs Descriptor
The data portion of a SAS data set contains the data values.
The descriptor portion of a SAS data set contains information about the attributes of the data set and information about each variable.
The information in the descriptor portion is called metadata.
Lesson 4: Working with SAS Libraries and SAS Data Sets
Viewing the Descriptor Portion of SAS Data Sets
You can use a PROC CONTENTS step to view the descriptor portion of a data set.
When you specify a single data set in the PROC CONTENTS step and do not include the NODS option, the step creates a report of the descriptor portion for the specified data set.
PROC CONTENTS DATA=SAS-data-set;
RUN;
Lesson 4: Working with SAS Libraries and SAS Data Sets
Character Variable
Default Lenth of 1 to 32,767 bytes
Missing values are represented by blanks
Lesson 4: Working with SAS Libraries and SAS Data Sets
Numeric Variable
Default length of 8 bytes
Missing values represented by ‘.’
Lesson 4: Working with SAS Libraries and SAS Data Sets
Var Statement
Control which variables SAS includes in the Proc Print report
The VAR statement also controls the order in which SAS displays the variables in the report.
PROC PRINT DATA=SAS-data-set;
VAR variable(s);
RUN;
Lesson 4: Working with SAS Libraries and SAS Data Sets
NOOBS
BY default, SAS includes the observation numbers in the report.
Suppress the Obs column by using the NOOBS option in the PROC PRINT statement.
PROC PRINT DATA=SAS-data-set NOOBS;
VAR variable(s);
RUN;
Lesson 4: Working with SAS Libraries and SAS Data Sets
Viewing the Data Portion of SAS Data Sets
You can use the PRINT procedure to display the data portion of a SAS data set.
Missing values are represented by blanks for character variables and by periods for numeric variables.
PROC PRINT DATA=SAS-data-set;
RUN;
Lesson 4: Working with SAS Libraries and SAS Data Sets
PROC SORT
Sorting Observations in SAS Data Sets
By default, replaces the original data set with the sorted data set. OUT= option to create a new output data set instead of replacing the input data set
By default sorts observations in ascending order of the values for the variable or variables listed in the BY statement.
You can use the DESCENDING option to sort on the variable or variables in descending order instead.
SAS treats missing values as the smallest possible values.
PROC SORT DATA=input-SAS-data-set;
BY BY-variable(s);
RUN;
Lesson 4: Working with SAS Libraries and SAS Data Sets
Defining SAS Libraries for Relational Databases
After a database is associated with a libref, you can use a SAS two-level name to specify any table in the database and then work with that table as you would work with a SAS data set.
You use the CLEAR option in the LIBNAME statement to disassociate a libref that was previously assigned. Disassociating a libref disconnects the database engine from the database and closes any resources that are associated with that libref’s connection.
LIBNAME libref engine-name ;
LIBNAME libref CLEAR;
Lesson 4: Working with SAS Libraries and SAS Data Sets
Sample Code
Assigning a Libref and Viewing the Contents of the Library
libname orion ‘my-file-path’;
proc contents data=orion._all_nods;
run;
Lesson 4: Working with SAS Libraries and SAS Data Sets
Sample Code
Creating Work.NewSalesemps
data work.newsalesemps;
length First_Name $ 12 Last_Name $ 18 Job_Title $ 25;
infile ‘my-file-path\newemployees.csv’ dlm=’,’;
input First_Name $ Last_Name $ Job_Title $ Salary;
run;
Lesson 4: Working with SAS Libraries and SAS Data Sets
Sample Code
Displaying the Descriptor Portion of a Data Set
proc contents data=work.newsalesemps;
run;
Lesson 4: Working with SAS Libraries and SAS Data Sets
Sample Code
Displaying the Data Portion of a Data Set
proc print data=work.newsalesemps noobs;
var Last_Name First_Name Salary;
run;
Lesson 4: Working with SAS Libraries and SAS Data Sets
Sample Code
Sorting a Data Set by Two Variables and Displaying the Output Data Set
proc sort data=work.newsalesemps
out=work.newsalesemps3;
by Job_Title descending Salary;
run;
proc print data=work.newsalesemps3;
run;
Lesson 4: Working with SAS Libraries and SAS Data Sets
Steps/statements that create printed output
Proc Print
Proc Contents
Lesson 4: Working with SAS Libraries and SAS Data Sets
Temporary Data Set
work.data or data
when specify a one level data set name SAS assumes the library is work and the data set is temp
Lesson 5: Creating SAS Data Sets
$w.
writes standard character data.
Lesson 5: Creating SAS Data Sets
w.d
writes standard numeric data
Lesson 5: Creating SAS Data Sets
COMMAw.d
writes numeric values with a comma that separates every three digits and a period
that separates every decimal fraction.
COMMA7.0 = 12,354
Lesson 5: Creating SAS Data Sets
COMMAXw.d
writes numeric values with a period that separates every three digits and a comma
that separates the decimal fraction.
COMMAX7.0 = 12.354
Lesson 5: Creating SAS Data Sets
DOLLARw.d
writes numeric values with a leading dollar sign, a comma that separates every three
digits, and a period that separates the decimal fraction.
DOLLAR7.0 = $12,345
Lesson 5: Creating SAS Data Sets
EUROXw.d
writes numeric values with a leading euro symbol (€), a period that separates every
three digits, and a comma that separates the decimal fraction.
Lesson 5: Creating SAS Data Sets
MMDDYY6.
010160
Lesson 5: Creating SAS Data Sets
MMDDYY8.
01/01/60
Lesson 5: Creating SAS Data Sets
MMDDYY10.
01/01/1960
Lesson 5: Creating SAS Data Sets
DDMMYY6.
311260
Lesson 5: Creating SAS Data Sets
DDMMYY8.
31/12/60
Lesson 5: Creating SAS Data Sets
DDMMYY10.
31/12/1960
Lesson 5: Creating SAS Data Sets
DATE7.
31DEC59
Lesson 5: Creating SAS Data Sets
DATE9.
31DEC1959
Lesson 5: Creating SAS Data Sets
WORDDATE.
January 1, 1960
Lesson 5: Creating SAS Data Sets
WEEKDATE.
Friday, January 1, 1960
Lesson 5: Creating SAS Data Sets
MONYY7.
JAN1960
Lesson 5: Creating SAS Data Sets
YEAR4.
1960
Lesson 5: Creating SAS Data Sets
Creating A SAS Data Set from a SAS Data Set
A basic DATA step that contains only a SET statement creates an exact copy of the input SAS data set.
A DATA step can reference temporary or permanent libraries.
Remember that you use a LIBNAME statement to assign a libref to a SAS library.
DATA output-SAS-data-set-name;
SET input-SAS-data-set-name;
RUN;
Lesson 5: Creating SAS Data Sets
Phases of DATA Step
- Compilation Phase: SAS scans the step for syntax errors. Then, SAS creates the program data vector (PDV) and the descriptor portion of the output data set. The PDV contains two temporary variables: _N_, which tracks the iterations of the DATA step, and _ERROR_, which signals the occurrence of an error that is caused by the data during execution.
- Execution Phase: SAS reads and processes the input data set, and creates the data portion of the output data set.
Lesson 5: Creating SAS Data Sets
Subsetting Observations in the DATA Step
Where Statement
Ccontrols which observations SAS writes to the output data set
WHERE where-expression;
Lesson 5: Creating SAS Data Sets
Subsetting Variables in the DATA Step
Drop + Keep Statements
DROP: Names the variables to exclude from the output data set
KEEP: Names the variables to include in the output data set
DROP variable-list;
KEEP variable-list;
Lesson 5: Creating SAS Data Sets
Assigning Permanent Labels and Formats
You can assign permanent labels and format to the variables in your output data set to control the way that variables names and values appear in reports.
Labels and formats don’t affect the stored names or values; they are stored in the descriptor portion of the data set.
You use a LABEL statement to assign a label to a variable.
To specify that SAS should display variable labels rather than names in a PROC PRINT report, you use the LABEL
option in the PROC PRINT statement.
You use a FORMAT statement to assign a format to a variable.
Lesson 5: Creating SAS Data Sets
Sample Code
Where Variable
data work.subset1;
set orion.sales;
where Country=’AU’ and Job_Title contains ‘Rep’;
run;
Lesson 5: Creating SAS Data Sets
Sample Code
Keep Statement
data work.subset1;
set orion.sales;
where Country=’AU’ and Job_Title contains ‘Rep’;
keep First_Name Last_Name Salary Job_Title Hire_Date;
run;
proc print data=work.subset1 label;
run;
Lesson 5: Creating SAS Data Sets
Sample Code
Label Statement
data work.subset1;
set orion.sales;
where Country=’AU’ and Job_Title contains ‘Rep’;
keep First_Name Last_Name Salary Job_Title Hire_Date;
label Job_Title=’Sales Title’ Hire_Date=’Date Hired’;
run;
Lesson 5: Creating SAS Data Sets
Sample Code
Format Statement
data work.subset1;
set orion.sales;
where Country=’AU’ and Job_Title contains ‘Rep’;
keep First_Name Last_Name Salary Job_Title Hire_Date;
label Job_Title=’Sales Title’ Hire_Date=’Date Hired’;
format Salary comma8. Hire_Date ddmmyy10.;
run;
Lesson 5: Creating SAS Data Sets
Sample Code
Label in Proc Print
data work.subset1;
set orion.sales;
where Country=’AU’ and Job_Title contains ‘Rep’;
keep First_Name Last_Name Salary Job_Title Hire_Date;
label Job_Title=’Sales Title’ Hire_Date=’Date Hired’;
format Salary comma8. Hire_Date ddmmyy10.;
run;
proc print data=work.subset1 label;
run;
Lesson 5: Creating SAS Data Sets
When does Data Step stop executing?
No more observations to read in the input data set.
Lesson 5: Creating SAS Data Sets
SAS Base Date
January 1, 1960
Lesson 5: Creating SAS Data Sets
PDV Variables
_N_, which tracks the iterations of the DATA step
_ERROR_, which signals the occurrence of an error that is caused by the data during execution
If there is an Error = 1
Lesson 5: Creating SAS Data Sets
Operand
varianble name or constant
numeric or character
Lesson 5: Creating SAS Data Sets
operator
arithmetic, logical or comparison
Lesson 5: Creating SAS Data Sets
Created in Compilation Phase
descriptor portion of output data set
Program Data Vector: (PDV)
The _N_ and _ERROR_ automatic variables
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
Accessing and Viewing Excel Data in SAS
You can use the SAS/ACCESS LIBNAME statement to assign a libref to a Microsoft Excel workbook.
Then, SAS treats each worksheet in the workbook as though it is a SAS data set.
You can use the Explorer window or the CONTENTS procedure to view the worksheets, or you can reference a worksheet directly in a DATA or PROC step.
In SAS, Excel worksheet names contain a dollar sign.
To reference an Excel worksheet directly in a DATA or PROC step, you use a SAS name literal because a valid SAS name cannot contain a dollar sign.
LIBNAME libref ‘physical-file-name’;
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
Using an Excel Worksheet as Input in the DATA Step
You can use an Excel worksheet as input data in a DATA step to create a new SAS data set.
You use a SET statement with a SAS name literal to read from an Excel worksheet.
You can also use other DATA step statements such as WHERE, KEEP, DROP, LABEL, and FORMAT statements with input from an Excel worksheet.
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
Importing an Excel Worksheet
You can use a PROC IMPORT step to read an Excel worksheet and create a SAS data set from it.
The Import Wizard is a point-and-click interface that generates PROC IMPORT code for you.
PROC IMPORT OUT= output-data-set
** DATAFILE=’input=excel-workbook’**
** DBMS=EXCEL REPLACE;**
** RANGE=’range-name’;**
RUN;
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
Creating an Excel Workbook in SAS with Data Step
You can use the DATA step to create an Excel worksheet from a SAS data set if you use the SAS/ACCESS LIBNAME statement to assign a libref to the Excel workbook that contains the worksheet.
You use one DATA step for each worksheet that you want to create.
LIBNAME libref ‘physical-file-name’;
DATA output-excel-worksheet;
** SET input-data-set;**
RUN;
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
Creating an Excel Workbook in SAS
PROC COPY
You can use the SAS/ACCESS LIBNAME statement along with the COPY procedure to create an Excel workbook that contains one or more worksheets from one or more SAS data sets.
The COPY procedure creates a worksheet for each SAS data set that is listed in the SELECT statement.
LIBNAME output-libref ‘physical-file-name’;
PROC COPY IN=input-libref OUT=output-libref;
** SELECT input-data-set1 input-data-set2;**
RUN;
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
Creating an Excel Workbook in SAS
PROC EXPORT
The Export Wizard is a point-and-click interface that generates PROC EXPORT code for you.
PROC EXPORT DATA= input-data-set
** OUTFILE=’output-excel-workbook’**
** DBMS=EXCEL REPLACE;**
RUN;
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
Sample Code
Using PROC IMPORT to Create a SAS Data Set from an Excel Worksheet
proc import out=work.subset2a;
datafile=’my-file-path\sales.xls’;
range=”Australia$”n;
getnames=yes;
mixed=no;
scantext=yes;
usedate=yes;
scantime=yes;
run;
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
Sample Code
Using the DATA Step to Create an Excel Workbook from SAS Data Sets
libname orionxls ‘my-file-path\qtr20007a.xls’;
data orionxls.qtr1_2007;
set orion.qtr2_2007;
run;
data orionxls.qwtr2_2007;
set orion.qtr2_2007;
run;
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
Sample Code
Using PROC COPY to Create an Excel Workbook from SAS Data Sets
libname orionxls ‘my-file-path\qtr20007b.xls’;
proc copy in=orion out=orionxls;
select qtr1_2007 qtr2_2007;
run;
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
By default, how does the Data Step store an Excel value that is formatted as a date?
numeric SAS date value with the DATE9. format
Lesson 6: Creating SAS Data Sets from Microsoft Excel Worksheets
What do headings with spaces in Excel worksheets become
labels
Lesson 7: Creating SAS Data Sets from Delimited Raw Data Files
Data Step Processing Diagram
Lesson 7: Creating SAS Data Sets from Delimited Raw Data Files
Raw Data File
A raw data file is an external file whose records contain data values that are organized in fields.
Raw data files are not software-specific.
Lesson 7: Creating SAS Data Sets from Delimited Raw Data Files
Delimited Raw Data File
File in which the
data values are separated by spaces or other special characters.
Lesson 7: Creating SAS Data Sets from Delimited Raw Data Files
Creating a SAS Data Set from a Raw Data File
You can use a DATA step to read input from a raw data file and create a SAS data set from it.
Instead of a SET statement, you use an INFILE statement and an INPUT statement.
The INFILE statement specifies the name and location of the raw data file, and the INPUT statement provides specifications for SAS to read the data values as variables.
You use the DLM= option in the INFILE statement to specify a delimiter if the raw data file uses a character other than a blank space to delimit data values.
Delimited data is also known as list input. Other types of raw data files might contain column input or formatted input.
Raw data files can contain either standard or nonstandard data values.
A raw data file does not contain names for the data fields.
You specify variable names for the values in the INPUT statement, and the names that you use must follow the rules for SAS names.
DATA output-SAS-data-set-name;
INFILE ‘raw-data-file-name’ ;
INPUT specifications;
RUN;
Lesson 7: Creating SAS Data Sets from Delimited Raw Data Files
Specifying Lengths of Variables from Raw Data Files
Default each variable with a length of 8 bytes (SAS truncates character values that are longer than 8 characters)
You can specify multiple variables in one LENGTH statement.
You can either specify different lengths for each variable that you list in the LENGTH statement or you can specify one length for all of the variables listed.
The LENGTH statement must precede the INPUT statement in a DATA step, because SAS determines variable attributes such as length the first time it encounters a variable.
LENGTH variable(s) $ length;
Lesson 7: Creating SAS Data Sets from Delimited Raw Data Files
Working with Nonstandard Data
Delimited Raw Data Files
Nonstandard data is data that SAS cannot read without extra instructions.
To read nonstandard data, you use the colon format modifier and an informat in the INPUTstatement.
Informats are similar to formats.
An informat is the instruction that SAS uses to read data values into a variable.
An informat also provides a length for a character variable.
INPUT variable <$> variable <:informat>;
Lesson 7: Creating SAS Data Sets from Delimited Raw Data Files
Further DATA Step Processing
Delimited Raw Data Files
Can use a DROP statement or a KEEP statement
Can use FORMAT and LABEL statements to add permanent attributes to your output data
CANNOT use a WHERE statement
Lesson 7: Creating SAS Data Sets from Delimited Raw Data Files
Further DATA Step Processing
Reading Instream Data
Instream data is data that you specify within your program
You use a DATALINES statement to read instream data
You use a NULL statement to indicate the end of the input data
You use the INPUT statement for list input to specify that the instream data is delimited with blanks.
You could also use a LENGTH statement before a DATALINES statement to specify lengths for variables within the instream data that are shorter or longer than 8 bytes.
DATA output-SAS-data-set;
INPUT specifications;
DATALINES;
instream data
;
RUN;
Lesson 7: Creating SAS Data Sets from Delimited Raw Data Files
Sample Code
Creating a SAS Data Set from a Raw Data File
data work.subset3;
infile ‘my-file-path\sales.csv’ dlm=’,’;
length First_Name $ 12 Last_Name $ 18 Gender $ 1
Job_Title $ 25 Country $ 2;
input Employee_ID First_Name $ Last_Name $ Gender $
Salary Job_Title $ Country $ Birth_Date :date9.
Hire_Date :mmddyy10.;
keep First_Name Last_Name Salary Job_Title Hire_Date;
label Job_Title=’Sales Title’ Hire_Date=’Date Hired’;
format Salary dollar12. Hire_Date monyy7.;
run;
Lesson 7: Creating SAS Data Sets from Delimited Raw Data Files
Sample Code
Reading Instream Data
data work.subset4;
input Employee_ID First_Name $ Last_Name $;
datalines;
120102 Tom Zhou
120103 Wilson Dawes
120121 Irenie Elvish
;
run;
Lesson 8: Validating and Cleaning Data
Validating Data
Process of identifying invalid values in your data.
For example, you might have character values stored in a numeric variable.
Or you might have invalid date values such as 99NOV1978.
Invalid data values cause data errors when the program runs.
Lesson 8: Validating and Cleaning Data
SAS procedures for detecting invalid data
PROC PRINT
PROC FREQ
PROC MEANS
PROC UNIVARIATE
Lesson 8: Validating and Cleaning Data
Examining Data Errors When Reading Raw Data Files
PROC PRINT
You can use a PROC PRINT step with a VAR statement and a WHERE statement to create a report of observations with invalid values.
For example, you can create a report that includes observations that have missing values for a particular variable.
You use a SAS date constant to convert a calendar date to a SAS date value.
PROC PRINT DATA=SAS-data-set;
** VAR variable(s);**
** WHERE where-expression;**
RUN;
Lesson 8: Validating and Cleaning Data
Examining Data Errors When Reading Raw Data Files
PROC FREQ
You cannot use PROC PRINT to detect values that are not unique.
Instead, you can use PROC FREQ to view a frequency table of the unique values for a variable.
You can use the TABLES statement in a PROC FREQ step to specify which frequency tables to produce.
You can use the NLEVELS option to display a table that provides the number of distinct values for each variable named in the TABLES statement.
You can use the NOPRINT option in the TABLES statement to suppress the individual frequency tables.
PROC FREQ DATA=SAS-data-set ;
** TABLES variable(s) ;**
RUN;
Lesson 8: Validating and Cleaning Data
Validating Data
PROC MEANS
You can use PROC MEANS to create summary reports of descriptive statistics such as SUM, MEAN, and RANGE
Default statistics for all numeric variables: N, MEAN, STD, MIN, and MAX
PROC MEAN step, such as NMISS, which is the number of observations with missing values
Lesson 8: Validating and Cleaning Data
Validating Data
PROC UNIVARIATE
PROC UNIVARIATE produces summary reports of descriptive statistics.
In addition, PROC UNIVARIATE produces tables of extreme observations and missing values.
The NEXTROBS= option in the PROC UNIVARIATE statement specifies the number of extreme observations that PROC UNIVARIATE lists.
Lesson 8: Validating and Cleaning Data
Cleaning Data Programmatically Using the DATA Step
Assignment Statement
You can use an assignment statement to convert values of a variable.
The assignment statement is one of the few SAS statements that don’t begin with a keyword.
An assignment statement evaluates an expression and assigns the resulting value to a new or existing variable.
variable=expression
Lesson 8: Validating and Cleaning Data
Cleaning Data Programmatically Using the DATA Step
Upcase
Converts all letters in an argument to uppercase
UPCASE(argument)
Lesson 8: Validating and Cleaning Data
Cleaning Data Programmatically Using the DATA Step
IF-THEN
By default, an assignment statement in a DATA step applies to every observation in the data set.
To apply different assignment statements to different observations, you can use conditional statements.
The IF-THEN statement executes a SAS statement for observations that meet specific conditions.
IF expression THEN statement;
Lesson 8: Validating and Cleaning Data
Cleaning Data Programmatically Using PROC SORT
NODUPRECS
You can use PROC SORT to remove duplicate observations from your data by specifying the NODUPRECS option in the PROC SORT statement.
NODUPRECS checks only consecutive observations, so it’s a good idea to sort by all variables when you use NODUPRECS.
You can use the NODUPKEY option in the PROC SORT statement to remove observations with duplicate BY values.
You can use the EQUALS option along with the NODUPRECS and NODUPKEY options in the PROC SORT statement to maintain the relative order of the observations within the input data set and the output data set.
Lesson 8: Validating and Cleaning Data
Sample Code
Using PROC PRINT to Validate Data
proc print data=orion.nonsales;
var Employee_ID Gender Salary Job_Title Country
Birth_Date Hire_Date;
where Employee_ID = . or
Gender not in (‘F’,’M’) or
Salary not between 24000 and 500000 or
Job_Title = ‘ ‘ or
Country not in (‘AU’,’US’) or
Birth_Date > Hire_Date or
Hire_Date < ‘01JAN1974’d;
run;
Lesson 8: Validating and Cleaning Data
Sample Code
Using PROC FREQ to Validate Data
proc freq data=orion.nonsales nlevels;
table _all_ / noprint;
run;
Lesson 8: Validating and Cleaning Data
Sample Code
Using PROC MEANS to Validate Data
proc means data=orion.nonsales n nmiss min max;
var Salary;
run;
Lesson 8: Validating and Cleaning Data
Sample Code
Using PROC UNIVARIATE to Validate Data
proc univariate data=orion.nonsales nextrobs=8;
var Salary;
run;
Lesson 8: Validating and Cleaning Data
Sample Code
Cleaning Data Programmatically Using Assignment Statements
data work.clean;
set orion.nonsales;
Country=upcase(Country);
run;
Lesson 8: Validating and Cleaning Data
Sample Code
Cleaning Data Programmatically Using IF-THEN/ELSE Statements
data work.clean;
set orion.nonsales;
Country=upcase(Country);
if Employee_ID=120106 then Salary=26960;
else if Employee_ID=120115 then Salary=26500;
else if Employee_ID=120191 then Salary=24015;
else if Employee_ID=120107 then
Hire_Date=’21JAN1995’d;
else if Employee_ID=12011 then
Hire_Date=’01NOV1978’d;
else if Employee_ID=121011 then
Hire_Date=’01JAN1998’d;
run;
Lesson 8: Validating and Cleaning Data
Sample Code
Removing Duplicates Using PROC SORT
proc sort data=orion.nonsalesdupes out=sorted nodupkey
equals;
by Employee_ID;
run;
Lesson 8: Validating and Cleaning Data
Integrity Contraints
Integrity constraints are data validation rules that you can specify to restrict the data values that can be stored for a variable in a SAS data set.
To eliminate the possibility of invalid data in your data set, you can place integrity constraints on the data set.
Integrity constraints help you preserve the validity and consistency of your data.
When the values associated with an integrity constraint variable are added, updated, or deleted, SAS enforces the integrity constraints.
Lesson 8: Validating and Cleaning Data
General Integrity Constraints
General integrity constraints restrict the values of variables within a single file
Lesson 8: Validating and Cleaning Data
General Integrity Constraints
Check
Limits the data values of variables to a specific set, range, or list ofvalues. Check constraints can also be used to ensure that the data values in one variable within an observation are contingent on the data values of another variable in the same observation.
Lesson 8: Validating and Cleaning Data
General Integrity Constraints
Not Null
Requires that a variable contain a data value. Null (missing) values are not allowed.
Lesson 8: Validating and Cleaning Data
General Integrity Constraints
Unique
Requires that the specified variable or variables contain unique data values. A null data value is allowed but is limited to a single instance.
Lesson 8: Validating and Cleaning Data
General Integrity Constraints
Primary Key
Requires that the specified variable or variables contain unique data values and that null data values are not allowed. Only one primary key can exist in a data file.
Lesson 8: Validating and Cleaning Data
Referential Integrity Constraints
A referential integrity constraint is created when a primary key integrity constraint in one data file is referenced by a foreign key integrity constraint in another data file.
The foreign key constraint links the data values of one or more variables in the foreign key data file, to corresponding variables and values in the primary key data file.
Data values in the foreign key data file must have a matching value in the primary key data file, or they must be null.
When data is updated or deleted in the primary key data file, the modifications are controlled by a referential action that is defined as part of the foreign key constraint. You can define separate referential actions for the update and delete
operations.
Lesson 8: Validating and Cleaning Data
Referential Integrity Constraints
Restrict
Prevents the data values of the primary key variables from being updated or deleted if there is a matching value in one of the foreign key data file’s corresponding foreign key variables.
The restrict type of action is the default action if one is not specified.
Lesson 8: Validating and Cleaning Data
Referential Integrity Constraints
Set Null
Enables the data values of the primary key variables to be updated or deleted, but matching data values in the foreign key data files are changed to null (missing) values.
Lesson 8: Validating and Cleaning Data
Referential Integrity Constraints
Cascade
Enables the data values in the primary key variables to be updated, and additionally updates matching data values in the foreign key data files to the same value.
The cascade type of action is supported only for update operations.
Lesson 9: Manipulating Data
OUTPUT Statement
Use in a conditional statement to write the current observation to a specific data set that is listed in the DATA statement.
Tells SAS to write the current observation to a SAS data set immediately, not at the end of the DATA step.
OUTPUT ;
Lesson 9: Manipulating Data
Creating Multiple Data Sets
OUTPUT Statement
The SAS data set Orion.Employee_Addresses contains addresses for all Orion Star employees.
Suppose you want to create three separate data sets based on the value of the variable Country.
If the value of Country is US, write the observation to the USA data set.
If the value of Country is AU, write the observation to the Australia data set.
If the value of Country is anything else, write the observation to the Other data set.
data usa australia other;
set orion.employee_addresses;
if Country = ‘AU’ then output australia;
else if Country = ‘US’ then output usa;
else output other;
run;
Lesson 9: Manipulating Data
Writing Errors to a Separate Data Set
OUTPUT Statement
Suppose you’re reading a raw data file and want to create two data sets based on the value of the automatic temporary variable _ERROR_.
If the observation contains errors, write the observation to the data set Errors.
Otherwise, write the observation to the data set Correct.
data errors correct;
infile ‘sales.csv’ dlm=’,’;
input Employee_ID First $ Last $
Gender $ Salary Job_Title $ Country $
Birth_Date :date9.
Hire_Date :mmddyy10.;
if _ERROR_ = 1 then output errors;
else output correct;
run;
Lesson 9: Manipulating Data
What does SAS do when encountering a data error in a raw data record
Prints a ruler and the raw data record in the log
Assigns a missing value to the variable that the invalid data affects
Prints a note about the error in the log
Prints the variable values in the corresponding SAS observation in the log
Lesson 9: Manipulating Data
Using PROC Print to Validate Data
Lesson 9: Manipulating Data
Uses of SAS Functions
calculate sample statistics
create SAS date values
convert U.S. ZIP codes to state postal codes
round values
generate random numbers
extract a portion of a character value
convert data from one data type to another
Lesson 9: Manipulating Data
SAS Functions That Compute Sample Statistics
Lesson 9: Manipulating Data
SAS Date Functions
Lesson 9: Manipulating Data
IF/THEN Statement
Use to perform an action conditionally
The IF/THEN statement executes a SAS statement when the condition in the IF clause is true.
IF expression THEN statement;
To assign the value 26960 to the variable Salary when the value of Employee_ID is 120106, add the following IFTHEN statement to your DATA step.
data work.clean;
set orion.nonsales;
if Employee_ID=120106 then Salary=26960;
run;
Lesson 9: Manipulating Data
ELSE Statement
Providing an Alternative Action
Suppose you want to assign a value to Salary based on the other possible values of Employee_ID. One way to do this is to add IF/THEN statements for the other two conditions, as shown below:
if Employee_ID=120106 then Salary=26960;
if Employee_ID=120115 then Salary=26500;
if Employee_ID=120191 then Salary=24015;
However, when the DATA step executes, SAS evaluates each IF statement in order, even if the first condition is true.
This wastes system resources and slows the processing of your program.
The ELSE statement must immediately follow the IFTHEN statement in your program.
An ELSE statement executes only if the previous IF/THEN/ELSE statement is false.
data work.clean;
set orion.nonsales;
if Employee_ID=120106 then Salary=26960;
else if Employee_ID=120115 then Salary=26500;
else if Employee_ID=120191 then Salary=24015;
run;
Lesson 9: Manipulating Data
Length Statement
LENGTH variable(s) $ length;
Variable name, a dollar sign if it is a character variable, and then the length.
You can specify multiple variables in one LENGTH statement.
Remember that SAS determines variable attributes such as length and type the first time it encounters a variable. So, for the LENGTH statement to define the length for the variables in the output data set, it needs to precede any other references to the variable.
data work.subset3;
infile ‘c:\oriondata\sales.csv’;
length First_Name $ 12 Last_Name $ 18 Gender $ 1;
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;
Lesson 9: Manipulating Data
WHERE Statement
WHERE whereexpression;
Specifies a condition for selecting observations and can be any valid SAS expression.
An expression is a sequence of operands and operators that form a set of instructions.
Must follow several rules:
Can include variables, constants, or both
Variables must exist in the input data set
Character constants must be enclosed in quotation marks
Numeric constants cannot be enclosed in quotation marks
data work.subset1;
set orion.sales;
where Country=’AU’ and Job_Title contains ‘Rep’;
run;
Lesson 9: Manipulating Data
Creating Variables
Assignment Statement
variable=expression;
If the expression in an assignment statement contains a missing value, the result of the expression is also a missing value.
Lesson 9: Manipulating Data
Creating Variables
Sum Statement
SUM(argument1,argument2,…)
Returns the sum of the specified arguments.
If an argument contains a missing value, it ignores the missing value and returns the sum of the non-missing values.
Lesson 9: Manipulating Data
Creating Variables
Date Functions
MONTH(SAS-date)
You can use SAS date functions in assignment statements to specify SAS date values as part of your expressions or to create SAS date values.
The MONTH function extracts a month from a SAS date value.
Lesson 9: Manipulating Data
Subsetting Variables
Remember that you can use a DROP statement or KEEP statement in a DATA step to subset variables in the output data set.
When SAS processes a DROP statement, the variables listed in the DROP statement are included in the PDV.
SAS reads values for those variables into the PDV for each observation, but does not write those variables to the output data set.
Lesson 9: Manipulating Data
Creating Variables Conditionally
DO statement within an IF-THEN/ELSE statement
The DO statement, together with the other SAS statements it contains, is known as a DO group.
Each DO group must end with an END statement.
SAS executes the statements withing the first DO group only if the expression in the IF-THEN statement is true.
SAS executes the statements within the second DO group only if the expression in the first IF expression is false and the expression in the ELSE IF expression is true.
Remember that SAS creates the attributes for a variable, including length, based on the first time it encounters that variable.
You use a LENGTH statement to specify the length for a variable explicitly.
IF expression THEN
** DO;**
** executable statements**
** END;**
ELSE IF expression THEN
** DO;**
** executable statements**
END;
Lesson 9: Manipulating Data
Subsetting Observations
Where vs If
The WHERE statement selects observations when they are read from the input data set, so you can’t use the WHERE statement to subset observations based on a variable that you create in the same DATA step.
You can use the subsetting IF statement to subset observations based on the value of a variable you create.
You can specify multiple expressions in a subsetting IF statement.
You cannot use special WHERE operators in IF expressions.
You must use the WHERE statement rather than a subsetting IF statement in a PROC step
You can always use a subsetting IF statement in a DATA step.
You can use a WHERE statement in a DATA step if the expression in the WHERE statement references only variables from the input data set, or from all input data sets if the step uses multiple input data sets.
IF expression;
Lesson 9: Manipulating Data
DELETE statement
You can use a DELETE statement within an IF-THEN statement as an alternative to the subsetting IF statement.
The DELETE statement is especially useful if it is easier to specify a condition for excluding observations than for selecting observations.
IF expression THEN DELETE;
Lesson 9: Manipulating Data
Sample Code
Creating Variables
data work.comp;
set orion.sales;
Bonus=500;
Compensation=sum(Salary,Bonus);
BonusMonth=month(Hire_Date);
run;
Lesson 9: Manipulating Data
Sample Code
Subsetting Variables
data work.comp;
set orion.sales;
drop Gender Salary Job_Title Country
Birth_Date Hire_Date;
Bonus=500;
Compensation=sum(Salary,Bonus);
BonusMonth=month(Hire_Date);
run;
Lesson 9: Manipulating Data
Sample Code
Creating Variables Conditionally
data work.bonus;
set orion.sales;
length Freq $ 12;
if Country=’US’ then
do;
Bonus=500;
Freq=’Once a Year’;
end;
else if Country=’AU’ then
do;
Bonus=500;
Freq=’Twice a Year’;
end;
run;
Lesson 9: Manipulating Data
Sample Code
Subsetting Observations
data work.december;
set orion.sales;
BonusMonth=month(Hire_Date);
if Country=’AU’ and BonusMonth=12;
Bonus=500;
Compensation=sum(Salary,Bonus);
run;
Lesson 9: Manipulating Data
YEAR(SASdate)
Value Extracted: the year
Value Returned: a fourdigit year
QTR(SASdate)
the quarter a number from 1 to 4
MONTH(SASdate)
the month a number from 1 to 12
DAY(SASdate)
the day of the month a number from 1 to 31
WEEKDAY(SASdate)
the day of the week a number from 1 to 7 (1=Sunday, 2=Monday, and so on)
Lesson 9: Manipulating Data
QTR(SASdate)
Value Extracted: the quarter
Value Returned: a number from 1 to 4
Lesson 9: Manipulating Data
MONTH(SASdate)
Value Extracted: the month
Value Returned: a number from 1 to 12
Lesson 9: Manipulating Data
DAY(SASdate)
Value Extracted: the day of the month
Value Returned: a number from 1 to 31
Lesson 9: Manipulating Data
WEEKDAY(SASdate)
Value Extracted: the day of the week
Value Returned: a number from 1 to 7 (1=Sunday, 2=Monday, and so on)
Lesson 9: Manipulating Data
TODAY()
SAS Date Value Created: the current date
Lesson 9: Manipulating Data
MDY(month,day,year)
SAS Date Value Created: a date with numeric month, day, and year
Lesson 10: Combining SAS Data Sets Vertically
When to Use the FORCE Option in PROC APPEND
DATA= data set variables are not in the BASE= data set: Drops the variable not present in the BASE=data set.
DATA= data set variables have a different type than the variables in the BASE= data set: Replaces all values for the variable in the DATA= data set with missing values + keeps the variable type of the variable specified in the BASE= data set.
DATA= data set variables are longer than the variables in the BASE= data set: Truncates values from the DATA= data set to fit them into the length that is specified in the BASE= data set.
Lesson 10: Combining SAS Data Sets Vertically
Comparing PROC APPEND and the SET Statement
Lesson 10: Combining SAS Data Sets Vertically
Overview of Combining SAS Data Sets Vertically
Places the observations from one or more data sets below or above the observations in another data set.
Three methods of combining data vertically: appending, concatenating, and interleaving.
Appending and concatenating data sets both place observations from one data set after the observations from another data set in a single data set.
In both methods, the observations in the combined data set are in the same order that they were in each original data set.
Lesson 10: Combining SAS Data Sets Vertically
Appending Data Sets
Definition
When you append data sets, you can only use two input data sets.
Appending adds all of the observations from the second data set to the end of the first data set.
Lesson 10: Combining SAS Data Sets Vertically
Concatenating Data Sets
Definition
Concatenating copies all of the observations from the first data set and all of the observations from the second data set and writes them to a new data set.
Lesson 10: Combining SAS Data Sets Vertically
Interleaving Data Sets
Definition
Interleaving intersperses the observations from two or more input data sets based on the value of one or more common variables, in a new data set.
Lesson 10: Combining SAS Data Sets Vertically
Appending SAS Data Sets
PROC APPEND adds the observations from one SAS data set to the end of another SAS data set.
BASE= names the data set to which the observations are added, and DATA= names the data set containing observations that are added to the base data set.
Can specify only two data sets.
When SAS processes a PROC APPEND step, SAS does not read the observations in the base data set.
SAS cannot change any variable information in the descriptor portion of the base data set.
The base data set has the same number of variables before and after appending.
When the DATA= data set contains variables that are not in the BASE= data set, you can use the FORCE option to force SAS to append the observations.
The FORCE options causes SAS to drop the extra variables in the DATA= data set, and then to issue a warning message to the log.
PROC APPEND BASE=SAS-data-set
DATA=SAS-data-set ;
RUN;
Lesson 10: Combining SAS Data Sets Vertically
Concatenating SAS Data Sets
When you specify multiple data sets in the SET statement, SAS combines them into a single data set.
In the combined data set, the observations appear in the order in which the data sets are listed in the SET statement.
You can add additional DATA step statements, such as an assignment statement to create new variables in the output data set.
Because the DATA step creates a new data set, the input data sets can contain different variables.
If the data sets specified in the SET statement have a variable with the same name but different types, SAS generates a compile-time error by default.
**DATA SAS-data-set;
SET SAS-data-set1 SAS-data-set2 …;
RUN;**
Lesson 10: Combining SAS Data Sets Vertically
Concatenating SAS Data Sets
RENAME=
You can use the RENAME= option to change the name of one or more variables.
If the RENAME= option is associated with an input data set in the SET statement, it renames one or more variables being read from that data set.
However, the RENAME= option does not rename variables in the input data set. Instead, the RENAME= option tells SAS which slot in the PDV to use when SAS is building observations from the new data set.
You can use multiple RENAME= options in one SET statement if you want to rename variables from multiple data sets.
SAS-data-set (RENAME=(old-name-1 = new-name-1
old-name-2 = new-name-2
…
old-name-n = new-name-n))
Lesson 10: Combining SAS Data Sets Vertically
Interleaving SAS Data Sets
Use a DATA step with a BY statement. The BY statement specifies one or more BY variables. SAS uses the values of the BY variables to arrange the observations in the output data set.
Any BY variable that you specify must be common to all input data sets.
By default, SAS arranges observations in ascending order unless you specify the DESCENDING option.
Before you run a DATA step that contains a BY statement, your input data sets must be sorted on the BY variables.
SAS outputs observations with duplicate BY values in the order in which the data sets are listed in the SET statement.
If duplicate values are in the same data set, SAS outputs those observations in the order in which they appear in the data set.
You could also interleave data sets by concatenating them first, and then sorting the output file.
DATA SAS-data-set;
SET SAS-data-set1 SAS-data-set2 …;
BY BY-variable(s);
RUN;