Midterm 1 Flashcards
The Data Step
The data step manipulates the data.
The input for a data step can be of several types, such as raw data or a SAS data set.
The output from a DATA step can be of several types, such as a SAS data set or a report.
Do all SAS programs contain a DATA step?
No
PROC step
In general, the PROC step analyzes data, produces output, or manages SAS files.
The input for a PROC step is usually a SAS data set.
The output from a PROC step can be of several types, such as a report or an updated SAS data set.
SAS Statements
A SAS statement is a series of items that might include keywords, SAS names, special characters, and operators.
The two types of SAS statements are:
Those that are used in DATA and PROC steps.
Those that are global in scope and can be used anywhere in a SAS program
All SAS statements end with a semicolon.
Global statements
Used anywhere in a SAS program.
Stay in effect until changed or canceled, or until you end your SAS session.
e.g. TITLE, OPTIONS, FOOTNOTE
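A minimal sketch of global statements in use. The title and footnote text are invented for illustration, and sashelp.class is assumed to be available (it is a sample data set that normally ships with SAS).

```sas
/* Global statements sit outside DATA and PROC steps. */
options nodate;                      /* suppress the date in listing output */
title 'Class Roster';                /* stays in effect until changed or canceled */
footnote 'Source: sashelp.class';

proc print data=sashelp.class;
run;

title;                               /* a null TITLE statement cancels the title */
footnote;                            /* a null FOOTNOTE statement cancels the footnote */
```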
SAS data sets
A SAS file stored in a SAS library that SAS creates and processes.
Contains data values that are organized as a table of observations (rows) and variables (columns)
Contains descriptor information such as the data types and lengths of the variables.
SAS libraries
A collection of one or more SAS files, including SAS data sets, that are referenced and stored as a unit.
A logical name (libref) can be assigned to a SAS library using the LIBNAME statement
Libref
A libref can be up to 8 characters long.
must begin with a letter or an underscore.
can contain only letters, digits, or underscores.
e.g. libname project 'C:\workshop\winsas\lwcrb';
Which of the following sentences is true concerning the LIBNAME statement:
A. The LIBNAME statement must go in a DATA step.
B. The LIBNAME statement must end in a semicolon.
C. The LIBNAME statement must be the first statement in a program.
D. The LIBNAME statement must be followed by the RUN statement.
B. The LIBNAME statement must end in a semicolon.
Two-level SAS data set name
A SAS data set can be referenced using a two level SAS data set name: libref.dataset
e.g. proc sort data=work.enroll;
libref is the logical name that is associated with the physical location of the SAS library.
dataset is the data set name, which can be up to 32 characters long, must begin with a letter or an underscore, and can contain only letters, digits, and underscores.
One-level SAS data set name
A data set referenced with a one level name is automatically assigned to the work library by default.
e.g. proc sort data=enroll out=project.enroll;
Temporary SAS Data sets
A temporary SAS data set is one that exists only for the current SAS session or job.
The work library is a temporary data library.
Data sets held in the Work library are deleted at the end of the SAS session.
Permanent SAS data sets
A data set that resides on the external storage medium of your computer and is not deleted when the SAS session terminates.
Any data library referenced with a LIBNAME statement is considered a permanent data library by default.
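A short sketch contrasting a permanent and a temporary data set. It assumes the folder from the libref example exists and is writable; class_copy is a hypothetical data set name, and sashelp.class is assumed to be available.

```sas
/* Assumption: this folder exists; the path is taken from the libref card. */
libname project 'C:\workshop\winsas\lwcrb';

data project.class_copy;     /* permanent: kept after the session ends */
   set sashelp.class;
run;

data class_copy;             /* one-level name: stored as work.class_copy, temporary */
   set sashelp.class;
run;
```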
Variables
Data values are organized into columns called variables.
Variables have attributes, such as the name and type, that enable you to identify them and that define how they can be used.
Variable names
Variable names can be up to 32 characters long
Must begin with a letter or an underscore.
Can contain only letters, digits, or underscores.
Which of the following variable names is valid? A. street# B. zip_code C. 2address D. last name
B. zip_code
Character variables
Character variables are stored with a length of 1 to 32,767 bytes with 1 character equaling 1 byte.
Character variables can contain letters, numeric digits, and other special characters.
Numeric variables
Numeric variables are stored as floating-point numbers with a default byte size of 8.
To be stored as a floating point number, the numeric value can contain numeric digits, plus or minus sign, decimal point, and E for scientific notation.
How many of the following data sets aren't permanent data sets? work.enroll temp.enroll project.enroll enroll
Two (work.enroll and enroll)
How should a date be stored in SAS?
a. character
b. numeric
b. numeric
SAS Dates
A SAS date value is a value that represents the number of days between January 1, 1960, and a specified date.
Dates before January 1, 1960 are negative numbers.
Dates after January 1, 1960, are positive numbers.
To reference a SAS date value in a program, use a SAS date constant.
A SAS date constant is a date (DDMMMYYYY) in quotation marks followed by the letter D.
ex. '12NOV1986'd
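The day-count definition can be checked with a small DATA _NULL_ step that writes date constants to the log as raw numbers. The variable names are chosen only for illustration.

```sas
data _null_;
   day0   = '01JAN1960'd;      /* the reference date */
   before = '31DEC1959'd;      /* one day earlier */
   after  = '02JAN1960'd;      /* one day later */
   put day0= before= after=;   /* writes day0=0 before=-1 after=1 to the log */
run;
```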
What is the numeric SAS date value for December 25, 1959? A. -6 B. -7 C. 6 D. 8
B. -7
Missing Data
Missing data is a value that indicates that no data value is stored for the variable in the current observation.
A missing numeric value is displayed as a single period (.)
A missing character value is displayed as a blank space.
CONTENTS procedure
The contents procedure shows the descriptor portion of a SAS data set.
e.g. proc contents data=project.enroll; run;
The VARNUM option can be used to print the variable list in the order of the variables' positions in the data set.
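As a sketch, the VARNUM option is added to the PROC CONTENTS statement. The example assumes sashelp.class is available; any data set name would work in its place.

```sas
/* List variables by their position in the data set rather than alphabetically. */
proc contents data=sashelp.class varnum;
run;
```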
Which step displays the directory of the project library and suppresses printing the contents of individual data sets?
A. proc contents data=project; run;
B. proc contents data=project.all;
C. proc contents data=project nocontents; run;
D. proc contents data=project._all_ nods; run;
D. proc contents data=project._all_ nods; run;
PRINT Procedure
The print procedure can show the data portion of a SAS data set.
ex. proc print data=project.enroll; run;
Comments
Two ways to add comments:
* comment;
/* comment */
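A brief sketch showing both comment styles in a program, assuming sashelp.class is available.

```sas
* A comment statement starts with an asterisk and ends with a semicolon;

/* A block comment can span
   several lines. */
proc print data=sashelp.class;   /* block comments can also follow a statement */
run;
```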
What is the name of the data set being read?
data work.newprice;
set golf.supplies;
golf.supplies
What is the name of the data set being created?
data work.newprice;
set golf.supplies;
work.newprice
Set statement
The SET statement reads an observation from one or more SAS data sets for further processing in the DATA step.
By default, the SET statement reads all variables and all observations from the input data sets.
The set statement can read temporary or permanent data sets.
Compilation phase
During the compilation phase, SAS does the following:
Checks the syntax of the SAS statements.
Translates the statements into machine code.
Identifies the name, type, and length of each variable.
The following three items are potentially created:
input buffer
program data vector
descriptor information
Input Buffer
The input buffer is a logical area in memory into which SAS reads each record of a raw data file when SAS executes an INPUT statement.
This buffer is created only when the DATA step reads raw data.
When the data step reads a SAS data set, SAS reads the data directly into the program data vector.
Program Data Vector (PDV)
A logical area in memory where SAS builds a data set, one observation at a time.
Along with data set variables and computed variables, the PDV contains the following two automatic variables:
- The _N_ variable, which counts the number of times the DATA step begins to iterate.
- The _ERROR_ variable, which signals the occurrence of an error caused by the data during execution. Its value is either 0 (no error) or 1 (one or more errors occurred).
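One way to watch the automatic variables is to write them to the log on every iteration. This is a sketch that assumes sashelp.class is available; work.check is a hypothetical output name.

```sas
data work.check;
   set sashelp.class;
   /* _N_ and _ERROR_ exist in the PDV but are not written to work.check. */
   putlog 'Iteration ' _n_= _error_=;
run;
```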
Which of the following statements is false concerning the _N_ and _ERROR_ variables?
A. SAS does not write the _N_ and _ERROR_ variables to the output data set.
B. SAS increments the _N_ variable by 1 for each iteration of the DATA step.
C. SAS automatically generates the _N_ and _ERROR_ variables for every DATA step.
D. SAS sets the _ERROR_ variable equal to the total number of errors caused by the data during execution.
D. SAS sets the _ERROR_ variable equal to the total number of errors caused by the data during execution.
Which one of the following is not one of the items in the PDV at compile time?
A. byte size of the variable
B. Initial value of the variable.
C. Name of the variable
D. Type (character or numeric) of the variable
B. Initial value of the variable.
Descriptor information
Information that SAS creates and maintains about each SAS data set, including data set attributes and variable attributes.
e.g. the name of the data set, the date and time that the data set was created, and the names, data types, and lengths of the variables.
Execution Phase
During the execution phase, SAS does the following:
- Initializes the PDV to missing and sets the initial values of _N_ and _ERROR_
- Reads data values into the PDV
- Executes any subsequent programming statements
- Outputs the observation to a SAS data set
- Returns to the top of the DATA step
- Resets the PDV to missing for any variables not read directly from a data set and increments _N_ by 1
- Repeats the process until the end of file is detected.
How many times does SAS iterate through a DATA step with 9 observations?
Nine times
DROP statement
The DROP statement specifies the names of the variables to omit from the output data set.
Use the DROP= data set option after an output data set name to omit variables from that specific output data set.
data work.total(drop=name total test1 test2);
KEEP statement
The KEEP statement specifies the names of the variables to write to the output data set.
Use the KEEP= data set option after an output data set name to specify the variables to write to that specific output data set.
data work.total(keep=name total test1 test2);
FORMAT Statements
The FORMAT statement associates formats with variable values.
ex.
data work.newprice;
set golf.supplies;
saleprice=price*0.75;
format saleprice dollar18.2;
run;
Formats assigned in a DATA step are considered permanent attributes (stored in the descriptor portion).
LABEL Statements
The LABEL statement assigns descriptive labels to variable names.
data work.newprice;
set golf.supplies;
saleprice=price*0.75;
label type='Type of Ball' saleprice='Sale Price';
run;
Label statements assigned in a DATA step are considered permanent attributes (stored in the descriptor portion).
How is the DATA step debugger invoked?
A. adding a /DEBUG option to the DATA statement
B. adding a DEBUG statement after the DATA statement.
C. adding the DEBUG option to the OPTIONS statement
D. adding the DEBUG=YES option to the OPTIONS statement
A. adding a /DEBUG option to the DATA statement
DATA step debugger
The DATA step debugger consists of windows and a group of commands that provide an interactive way to identify logic and data errors in DATA steps.
PUTLOG Statement
The PUTLOG statement can be used to write messages to the SAS log to help identify logic errors.
Implicit Output
By default, at the end of each iteration, every DATA step contains an implicit OUTPUT statement that tells SAS to write observations to the data set or data sets that are being created.
OUTPUT Statement
The OUTPUT statement without arguments causes the current observation to be written to all data sets that are named in the DATA statement.
Multiple output statements can be used in a data step.
Placing an explicit OUTPUT statement in a DATA step overrides the implicit output, and SAS adds an observation to a data set only when an explicit OUTPUT statement is executed.
Creating multiple data sets
The DATA statement can specify multiple output data sets. The OUTPUT statement can specify the data set names.
data work.first work.second;
set work.scores;
test=test1;
output work.first;
test=test2;
output work.second;
drop test1 test2;
run;
Using the OUTPUT statement without arguments causes the current observation to be written to all data sets that are named in the DATA statement.
The drop and keep statements apply to all output data sets.
Which data set(s) does the OUTPUT statement populate?
data work.total work.first work.second;
set work.scores;
total=test1+test2;
output;
run;
A. work.total
B. work.first
C. work.second
D. work.total, work.first, and work.second
D. work.total, work.first, and work.second
OUTPUT statements and if-then statement
Ex. if sex='F' then output female;
else if sex='M' then output male;
Selecting Observations
The FIRSTOBS= and OBS= data set options can be used to control which observations are read from the input data set.
ex. set sashelp.retail (obs=10);
FIRSTOBS= and OBS= are valid for input processing only. They are not valid for output processing.
How many observations are in the output data set? data work.portion; set sashelp.retail (firstobs=5 obs=10); run; A. 5 B. 6 C. 10 D. 14
B. 6
FIRSTOBS=
The FIRSTOBS= data set option specifies a starting point for processing an input data set.
OBS=
The OBS= data set option specifies an ending point for processing an input data set.
The OBS= option specifies the number of the last observation, and not how many observations there are to process.
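Because OBS= names the last observation to read, the number of observations read is OBS minus FIRSTOBS plus 1. A small sketch, assuming sashelp.class is available (work.portion is just an illustrative output name):

```sas
data work.portion;
   /* Reads observations 3, 4, and 5: 5 - 3 + 1 = 3 observations. */
   set sashelp.class (firstobs=3 obs=5);
run;
```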
Which step has invalid syntax?
A. data shoes (firstobs=101 obs=200);
set sashelp.shoes; run;
B. data shoes;
set sashelp.shoes (firstobs=101 obs=200); run;
C. proc print data=sashelp.shoes (firstobs=101 obs=200); run;
A. data shoes (firstobs=101 obs=200);
set sashelp.shoes; run;
Expression
An expression is a sequence of operands and operators that forms a set of instructions that define a condition for selecting observations.
operands are constants (character or numeric), variables (character or numeric), SAS functions
operators are symbols that request a comparison, logical operation, or arithmetic calculation.
Comparison operators
Comparison operators compare a variable with a value or with another variable.
EQ or =: equal to
NE or ^= or ~=: not equal to
GT or >: greater than
GE or >=: greater than or equal to
LT or <: less than
LE or <=: less than or equal to
Which of the following is not a valid expression? A. qtr1<= qtr2 B. address=' ' C. sales gt 6400 D. name ne Mary Ann
D. name ne Mary Ann
Logical operators
Logical operators combine or modify expressions
AND or &: logical and
OR or |: logical or
NOT or ^: logical not
Arithmetic operators
Arithmetic operators indicate that an arithmetic calculation is performed.
If a missing value is an operand for an arithmetic operator, the result is a missing value.
**: exponentiation
*: multiplication
/: division
+: addition
-: subtraction
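The missing-value rule for arithmetic operators can be seen in a short DATA _NULL_ step. The variable names are illustrative only.

```sas
data _null_;
   x = .;            /* numeric missing value */
   y = x + 5;        /* a missing operand makes the result missing */
   put y=;           /* writes y=. to the log */
run;
```

SAS also writes a note to the log stating that missing values were generated.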
Special WHERE operators
The WHERE statement can use special WHERE operators:
BETWEEN-AND: an inclusive range
CONTAINS or ?: a character string
LIKE: a character pattern
SOUNDS LIKE or =*: spelling variation
IS NULL: missing value
IS MISSING: missing value
SAME AND or ALSO: augments an expression
Which names will be selected based on the below expression?
name like 'M__k%'
1. Mark
2. Marcia
3. Mickey
4. Matthew
5. Michael
1. Mark
3. Mickey
data work.newprice;
set golf.supplies;
saleprice=price*0.75;
run;
Which statement must be added to the above program to create an output data set with observations having saleprice greater than $10?
A. if not (saleprice>10) then delete;
B. if saleprice>10;
C. either statement will work.
C. Either statement will work
data work.newprice;
set golf.supplies;
saleprice=price*0.75;
run;
Which statement must be added to the above program to create an output data set with observations having saleprice greater than $10?
A. where saleprice>10;
B. if saleprice>10;
C. either statement will work.
B. if saleprice>10;
WHERE Statement
The WHERE statement causes the DATA step to process only those observations from a data set that meet the condition of the expression.
The expression in the WHERE statement
Can reference variables that are from the input data set.
Cannot reference variables created from an assignment statement or automatic variables (_N_ or _ERROR_).
data subset;
set sales;
difference=actual-predict;
run;
Which WHERE statement will create an error when submitted, if inserted in the above program? A. where actual>predict; B. where difference ge 1000; C. where product in ('CHAIR' , 'SOFA'); D. where state='Texas' and date
B. where difference ge 1000;
Subsetting if statement
The subsetting if statement causes the DATA step to continue processing only those observations in the program data vector that meet the condition of the expression.
data work.newprice;
set golf.supplies;
saleprice=price*0.75;
if saleprice>10;
run;
If the expression is true for the observation, SAS continues to execute the remaining statements in the DATA step, including the implicit OUTPUT statement at the end of the DATA step. The resulting SAS data set (or data sets) contains a subset of the original SAS data set.
If the expression is false, no further statements are processed for that observation, the current observation is not written to the data set, and SAS immediately returns to the beginning of the DATA step.
Which program will create an error when submitted?
A. data subset; set sales; if difference<500; difference=actual-predict; if state='Texas'; run;
B. data subset; set sales; difference=actual-predict; if difference between 500 and 1000; run;
C. data subset; set sales; difference=actual-predict; if product='CHAIR' and difference ge 100; run;
B. data subset; set sales; difference=actual-predict; if difference between 500 and 1000; run;
WHERE statement versus subsetting IF statement
The WHERE statement selects observations before they are brought into the program data vector.
The subsetting IF statement selects observations that were read into the program data vector.
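The difference can be sketched with two small steps. Both assume sashelp.class is available; the output data set names and the height_cm variable are made up for illustration.

```sas
data work.teens_where;
   set sashelp.class;
   where age >= 13;          /* age is in the input data set, so WHERE is allowed */
run;

data work.tall_if;
   set sashelp.class;
   height_cm = height * 2.54;
   if height_cm > 160;       /* height_cm is created in this step, so a subsetting IF is needed */
run;
```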
IF-THEN DELETE Statement
The IF-THEN DELETE statement causes the DATA step to stop processing those observations in the program data vector that meet the condition of the expression.
ex. if saleprice<= 10 then delete;
If the expression is true for the observation, the current observation is not written to a data set, and SAS returns immediately to the beginning of the DATA step for the next iteration.
SORT Procedure
Orders SAS data set observations by the values of one or more character or numeric variables
Either replaces the original data set or creates a new data set.
Produces only an output data set, but no report.
Arranges the data set by the values in ascending order by default.
The DATA= option identifies the input SAS data set.
The OUT= option names the output data set.
Without the OUT= option, the SORT procedure overwrites the original data set.
ex. proc sort data=sashelp.shoes
out=shoes;
by descending region product;
run;
BY statement
The BY statement specifies the sorting variables.
PROC SORT first arranges the data set by the values of the first BY variable
PROC SORT then arranges any observations that have the same value of the first BY variable by the values of the second BY variable.
This sorting continues for every specified BY variable.
By default, the SORT procedure orders the values by ascending order.
The DESCENDING option reverses the sort order for the variable that immediately follows in the statement.
In addition to the SORT procedure, a BY statement can be used in the DATA step and other PROC steps.
The data sets used in the DATA step and other PROC steps must be sorted by the values of the variables that are listed in the BY statement or have an appropriate index.
Will the following program run successfully? proc sort data=sashelp.shoes out=shoes; by descending region ascending product; run;
No
Concatenating
If more than one data set name appears in the SET statement, the resulting output data set is a concatenation of all the data sets that are listed.
SAS reads all observations from the first data set, then all from the second data set, and so on, until all observations from all the data sets are read.
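A self-contained sketch of concatenation. The two input data sets and their values are invented for illustration.

```sas
data work.divisionA;
   input name $ state $;
   datalines;
Ann TX
Bob NC
;
run;

data work.divisionB;
   input name $ state $;
   datalines;
Cara CA
;
run;

data work.company;
   /* All observations from divisionA are read first, then all from divisionB. */
   set work.divisionA work.divisionB;
run;
```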
LENGTH Statement
The LENGTH statement specifies the number of bytes for storing variables.
Ex.
data company;
length name $ 15;
set divisionA divisionB;
run;
Will you get the same results if the LENGTH statement is after the SET statement?
No
RENAME=
The RENAME= data set option changes the names of variables.
In the RENAME= option, each variable to rename is written as old-name=new-name.
The list of variables to rename must be enclosed in parentheses.
Ex. set divisionA (rename=(state=location)) divisionB;
Which of the following statements has the proper syntax for the RENAME= option?
A. set divisionA (rename=name=first, state=location) divisionB (rename=name=first);
B. set divisionA (rename=(name=first state=location)) divisionB (rename=(name=first));
C. set divisionA (rename= (name=first) (state=location)) divisionB (rename= (name=first));
D. set divisionA (rename= (name=first), (state=location)) divisionB (rename = (name=first));
B. set divisionA (rename=(name=first state=location))
divisionB (rename=(name=first));
Interleaving
Use a single SET statement with multiple data sets and a BY statement to interleave the specified data sets.
The observations in the new data set are arranged by the values of the BY variable or variables. Then, within each BY group, they are arranged by the order of the data sets in which they occur.
The data sets that are listed in the SET statement must be sorted by the values of the variables that are listed in the BY statement, or they must have an appropriate index.
data company;
length name $ 15;
set divisionA (rename=(state=location)) divisionB;
by name;
run;
Merging
The MERGE statement joins observations from two or more SAS data sets into single observations.
The BY statement specifies the common variables to match-merge observations. The variables in the BY statement must be common to all data sets.
The data sets listed in the MERGE statement must be sorted in the order of the values of the variables that are listed in the BY statement, or they must have an appropriate index.
Ex. data combine; merge revenue expense; by name; profit=revenue-expense; run;
IN= Option
The IN= option creates a variable that indicates whether the data set contributed data to the current observation.
Within the DATA step, the value of the variable is 1 if the data set contributed to the current observation, and 0 if the data set did not contribute to the current observation.
Ex. data combine1; merge revenue1 (in=rev) expense1 (in=exp); by name; profit=revenue-expense; run;
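A self-contained sketch of how the IN= variables can drive a subsetting IF. The data values are invented, and work.matches is a hypothetical output name; both inputs are already in order by name.

```sas
data work.revenue1;
   input name $ revenue;
   datalines;
Ann 100
Bob 200
;
run;

data work.expense1;
   input name $ expense;
   datalines;
Bob 50
Cara 75
;
run;

data work.matches;
   merge work.revenue1 (in=rev) work.expense1 (in=exp);
   by name;
   if rev and exp;              /* keep only names found in both data sets */
   profit = revenue - expense;
run;
```

With this data only Bob appears in both inputs, so work.matches holds a single observation.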
Which of the following statements is false concerning the IN= option?
A. The IN= variables are included in the SAS data set that is being created.
B. The values of the IN= variables are available to program statements during the DATA step.
C. When a data set contributes an observation for the current BY group, the IN= value is a numeric 1.
D. The IN= data set option is specified in parentheses after a SAS data set name in the SET and MERGE statements.
A. The IN= variables are included in the SAS data set that is being created.
data combine1; merge revenue1 (in=rev) expense1 (in=exp); by name; profit=revenue-expense; run;
Which statement will give all observations from the revenue1 data set regardless of matches or nonmatches?
A. if rev=1;
B. if rev=1 and exp=1;
C. if rev=1 and (exp=1 and exp=0)
D. if (rev=1 and exp=1) and (rev=1 and exp=0)
A. if rev=1;
INFILE Statement
Together with an INPUT statement, the INFILE statement identifies the physical name of the external file to read. The physical name is the name that the operating environment uses to access the file.
EX.
data work.kids;
infile 'kids.dat';
input name $ 1-8 siblings 10
@12 bdate mmddyy10.
@23 allowance comma2.
hobby1 $ hobby2 $ hobby3 $;
run;
INPUT statement
The input statement describes the arrangement of values in the input data record and assigns input values to the corresponding SAS variables.
EX.
data work.kids;
infile 'kids.dat';
input name $ 1-8 siblings 10
@12 bdate mmddyy10.
@23 allowance comma2.
hobby1 $ hobby2 $ hobby3 $;
run;
Which of the following is not an input style for the INPUT statement? A. list input B. column input C. delimited input D. formatted input
C. delimited input
Column input
With column input, the column numbers that contain the value follow a variable in the INPUT statement.
To read with column input, data values:
must be in the same columns in all the input data records.
Must be in standard form.
Column input statement can contain:
variable- names a variable that is assigned input values.
$ : Indicates to store a variable value as a character value rather than as a numeric value.
start-column: Specifies the first column of the input record that contains the value to read.
-end-column: Specifies the last column of the input record that contains the value to read.
Ex. input name $ 1-8 siblings 10
Formatted input
With formatted input, an informat follows a variable name and defines how SAS reads the value of this variable. An informat gives the data type and the field width of an input value.
To read with formatted input, data values
Must be in the same columns in all the input data records.
Can be in standard or nonstandard form.
Formatted input statement can contain the following:
pointer-control: moves the input pointer to a specified column in the input buffer. @n moves the pointer to column n. +n moves the pointer n columns.
variable: names a variable that is assigned input values.
informat: specifies a SAS informat to use to read the variable values.
ex. input @12 bdate mmddyy10.
@23 allowance comma2.
List input
With list input, variable names in the INPUT statement are specified in the same order that the fields appear in the input data records.
To read with list input data values:
-must be separated with a delimiter
-can be in standard or nonstandard form.
You must specify the variables in the order that they appear in the raw data file, left to right. The default length for variables is 8 bytes. A space (blank) is the default delimiter.
pointer control: moves the input pointer to a specified column in the input buffer.
Variable: names a variable that is assigned input values.
$ : Indicates to store a variable value as a character value rather than as a numeric value.
: Reads data values that need additional instructions that informats can provide but are not aligned in columns.
informat: specifies an informat to use to read the variable values.
input hobby1 $
siblings
bdate : mmddyy10.
;
input @45 name $10.
Which input technique is used in the above statement
Formatted input
What third item is created at compile time in addition to the input buffer and the program data vector (PDV)? A. report B. data values C. raw data file D. descriptor information
D. descriptor information
Data Errors
A data error is when the INPUT statement encounters invalid data in a field.
When SAS encounters a data error, these events occur:
A note that describes the error is printed in the SAS log.
The input record being read (the contents of the input buffer) is displayed in the SAS log.
The values in the SAS observation (contents of the PDV) being created are displayed in the SAS log.
A missing value is assigned to the appropriate SAS variable.
Execution continues
DATALINES statement
The DATALINES statement can be used with an INPUT statement to read data directly from the program, rather than data stored in a raw data file.
datalines;
Chloe 2 11/10/1995 $5Running Music Gymnastics
Travis 2 1/30/1998 $2Baseball Nintendo Reading
;
Run;
Which statement is false concerning the DATALINES statement?
A. Multiple DATALINES statements can be used in a DATA step.
B. A null statement (a single semicolon) is needed to indicate the end of the input data.
C. The DATALINES statement is the last statement in the DATA step and immediately precedes the first data line.
A. Multiple DATALINES statements can be used in a DATA step.
Standard Data
Standard data is any data that SAS can read without any special instructions.
input name $ 1-8 siblings 10 bdate $ 12-21 allowance $ 23-24 hobby1 $ 26-35 hobby2 $ 36-45 hobby3 $ 46-55;
How many variables are numeric and, ideally, how many variables should be numeric?
1 numeric variable and 3 variables should be numeric
Nonstandard data
Nonstandard data is any data that SAS cannot read without a special instruction.
Informat
An informat is an instruction that SAS uses to read data values into a variable
SAS uses the informat to determine the following:
-whether the variable is numeric or character
-the length of character variables.
How will the following numbers be read with an informat of 8.2?
12345678 and 1234.567
A. 123456.78 and 1234.567
B. 123456.78 and 1234.56
C. 123456.7 and 1234.567
D. 123456.7 and 1234.56
A. 123456.78 and 1234.567
Which statement is false regarding informats?
A. When you use an informat, the informat contains a period (.) as a part of the name.
B. The $ indicates a character informat, and the absence of a $ indicates a numeric informat.
C. An informat has a default width or specifies a width, which is the number of columns to read in the input data.
D. When a problem occurs with a valid informat, SAS writes a note to the SAS log, assigns a missing value to the variable, and terminates the DATA step.
D. When a problem occurs with a valid informat, SAS writes a note to the SAS log, assigns a missing value to the variable, and terminates the DATA step.
DLM= option
The DLM= option specifies a delimiter to be used for list input. Blank is the default delimiter.
Ex.
infile 'kids4.dat' dlm=',';
Missing data (delimiter)
By default SAS treats two consecutive delimiters as one, not as a missing value between the delimiters.
DSD option
The DSD option can do the following:
Treat two consecutive delimiters as a missing value
Remove quotation marks from strings and treat any delimiter inside the quotation marks as a valid character
Set the default delimiter to a comma.
infile 'kids5.dat' dsd;
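A self-contained sketch of the DSD behaviors, reading in-stream data instead of a file. The data values are invented for illustration.

```sas
data work.kids;
   infile datalines dsd;            /* comma is the default delimiter with DSD */
   input name $ siblings hobby $;
   datalines;
Chloe,2,Running
Travis,,"Lego,art"
;
run;
```

For Travis, the two consecutive commas leave siblings missing, and the quoted value is read as Lego,art with the quotation marks removed.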
MISSOVER option
The MISSOVER option prevents an INPUT statement from reading a new input data record if it does not find values in the current input line for all the variables in the statement. When an INPUT statement reaches the end of the current input data record, any variables without assigned values are set to missing.
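A self-contained sketch of MISSOVER with list input, reading in-stream data. The data values are invented for illustration.

```sas
data work.kids;
   infile datalines missover;       /* do not read the next record to fill missing fields */
   input name $ hobby1 $ hobby2 $;
   datalines;
Chloe Running Music
Travis Baseball
;
run;
```

The second record has no value for hobby2, so hobby2 is set to missing for Travis instead of SAS moving on to another record.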