Programa P4-5 Flashcards
Explain a SET statement
The SET statement is used for READING SAS data sets. It is an executable statement and can be placed under program control.
List examples of SET Applications
Single File Read:
* The SET statement reads a SAS data set one observation at a time, e.g.
* set pdata.demog;
Concatenation:
* Concatenation is achieved by listing the data sets on a single SET statement, e.g.
* set pdata.demog1 pdata.demog2;
* The resulting data set has all observations from the first data set (pdata.demog1) at the top, followed by those from the second (pdata.demog2).
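As a minimal sketch of concatenation, the step below stacks two small data sets. The data set names (work.demog1, work.demog2, work.demog_all) and their contents are made up for illustration and are not the pdata tables mentioned above.

```sas
/* Hypothetical input data sets with the same variables */
data work.demog1;
    input subject $ age;
    datalines;
S01 34
S02 51
;
run;

data work.demog2;
    input subject $ age;
    datalines;
S03 47
S04 29
;
run;

/* Concatenate: all rows of demog1 first, then all rows of demog2 */
data work.demog_all;
    set work.demog1 work.demog2;
run;
```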
Multiple SET statements:
* Combines data sets: the output data set contains the variables from all input data sets. Where a variable is common to several inputs, the value read by a later SET statement overwrites the value read by an earlier one.
* The DATA step ends when an end-of-file marker is reached in any of the input data sets.
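A small sketch of two SET statements in one step may help here. The data sets work.one and work.two are hypothetical, with the common variable id chosen to show the overwrite behaviour.

```sas
/* Hypothetical inputs: three rows and two rows, sharing the variable id */
data work.one;
    input id x;
    datalines;
1 10
2 20
3 30
;
run;

data work.two;
    input id y;
    datalines;
7 100
8 200
;
run;

data work.both;
    set work.one;   /* reads id and x */
    set work.two;   /* reads id and y; id from work.two overwrites id from work.one */
run;
```

Under these assumptions work.both should hold two observations, because the step stops when the second SET statement runs out of rows in work.two on the third iteration.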
When would you use the NOTSORTED Option?
- Adding the NOTSORTED option to the BY statement allows FIRST. and LAST. variables to be used with data that is grouped but not sorted.
- SAS forms BY groups from consecutive observations with the same BY value, so the same value can appear in more than one group and groups need not be in ascending or descending sequence. The required first/last information can still be extracted.
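The following sketch counts the rows in each consecutive run of a BY value using NOTSORTED. The data set work.visits and the variable names are invented for the example.

```sas
/* Hypothetical grouped-but-unsorted data: site B appears in two separate runs */
data work.visits;
    input site $ visit;
    datalines;
B 1
B 2
A 1
B 3
;
run;

data work.site_runs;
    set work.visits;
    by site notsorted;              /* groups are consecutive runs of the same site */
    if first.site then n_rows = 0;  /* reset the counter at the start of each run */
    n_rows + 1;                     /* sum statement: value is retained across rows */
    if last.site;                   /* keep one row per run */
run;
```

With this input the output should contain three rows (B, A, B), since the second run of B is treated as a new group.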
How can you identify the end of a Data Set Using the END= Option?
The END= option is used to detect when the last row has been read from a SAS table. It is specified on the SET statement and names a temporary column in the Logical Program Data Vector that is set to 1 when the last row is read and is 0 otherwise, e.g.
set pdata.demog end=last_observation;
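One common use of END= is to write a single summary row once all data has been read. This sketch assumes the SASHELP.CLASS sample table is available; the output name work.age_total is hypothetical.

```sas
data work.age_total;
    set sashelp.class end=last_observation;
    total_age + age;                      /* running total, retained across rows */
    if last_observation then output;      /* write only once, on the final row */
    keep total_age;
run;
```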
How do you determine the number of observations using the NOBS= option?
- Sometimes it is necessary to determine the number of observations in a data set. This can be achieved by using the NOBS= option on the SET statement.
- Placing a STOP; statement before the SET statement ensures that no data is actually read. The NOBS= value is assigned when the step is compiled, so it is available before the SET statement executes. If a PUT statement is used before the STOP, SAS writes the NOBS= value to the log.
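The pattern described above can be sketched as follows, assuming the SASHELP.CLASS sample table; the variable name n_obs is arbitrary.

```sas
data _null_;
    put n_obs=;                       /* writes the observation count to the log */
    stop;                             /* end the step before any row is read */
    set sashelp.class nobs=n_obs;     /* never executes; NOBS= is filled in at compile time */
run;
```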
What is the use of sampling from SAS data files?
When analysing a large volume of data, it is often useful to take samples of the data in order to reduce processing time and the associated cost.
Define the RAND Function and CALL STREAMINIT Routine
- The RAND function will return a stream of random numbers based on the distribution argument passed to it:
targetvariable = RAND(distribution);
- The CALL STREAMINIT routine is used prior to the RAND function to specify a seed value used for any subsequent RAND functions. The CALL STREAMINIT routine needs to be used just once per DATA step.
CALL STREAMINIT(seed);
TRUE/FALSE
You can generate a stream of random numbers with negative integers.
To generate a reproducible (pseudorandom) stream of random numbers, the seed value must be a positive integer. Any nonpositive seed (or simply not using the CALL STREAMINIT routine) causes SAS to generate a seed from the system clock, so the random numbers generated by the RAND function are not reproducible.
* Using the CALL STREAMINIT routine with a positive integer seed, the same stream of random numbers is generated every time the step runs.
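A minimal sketch of reproducible sampling with RAND and CALL STREAMINIT is given below. The seed 12345, the 50% sampling rate and the output name work.class_sample are arbitrary choices for illustration, and SASHELP.CLASS is assumed to be available.

```sas
data work.class_sample;
    if _n_ = 1 then call streaminit(12345);   /* positive seed, set once per step */
    set sashelp.class;
    if rand('UNIFORM') < 0.5;                 /* keep each row with probability 0.5 */
run;
```

Because the seed is a positive integer, rerunning the step should select the same rows each time.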
TRUE/FALSE
The power of merging lies in Match-Merging, where rows are matched on a key and merged - i.e. with a BY statement present
TRUE
List components of Match-Merging
Match-Merging
* Requires common BY variables on both data sets;
* Only observations with matching BY variables will be paired;
* BY variables must have the same names and types in both data sets. It is desirable
for their lengths to correspond also (though with care it is possible to perform a merge where BY variables have different lengths);
* Data sets must be sorted or indexed by the BY variables;
* FIRST. and LAST. are generated automatically;
* Duplicate BY values can be present in any data set;
* Any number of data sets can be merged;
* The usual data set options apply;
* Observations can be selected according to which data sets contributed to them, using selection statements against IN= variables (called A and B for two data sets, for example).
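As a hedged illustration of selection with IN= variables, the sketch below match-merges two hypothetical data sets (work.a_ds and work.b_ds, already in id order) and keeps only the matching rows. The alternative selections in the comment are common patterns and are not taken from the original list.

```sas
data work.a_ds;
    input id name $;
    datalines;
1 Ann
2 Bob
3 Cy
;
run;

data work.b_ds;
    input id score;
    datalines;
2 80
3 90
4 70
;
run;

data work.matched;
    merge work.a_ds(in=a) work.b_ds(in=b);
    by id;
    if a and b;   /* keep only ids present in both data sets */
    /* other common selections: if a;  if b;  if a and not b;  if b and not a; */
run;
```

With this input work.matched should contain the rows for ids 2 and 3 only.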
Explain uniqueness of data in terms of merging data
Uniqueness:
* Refers to how often values are repeated within a variable or column.
* 100% uniqueness occurs when every value of a variable is different e.g. user IDs in a data set that contains logon security details.
* Low levels of uniqueness are found in e.g. gender variables or yes/no type status flags.
* In between are variables that contain some element of uniqueness, such as people’s
surnames.
Explain cardinality of relationships in terms of merging data
The uniqueness of the data helps in understanding the cardinality of the relationship between data sets, in terms of the key variables that may be used to merge them.
Cardinality Relationship
* One to One: EmpID in an Employee data set to EmpID in a Retired Employees data set.
* One to Many or vice versa: AccNo in an Account Details data set to AccNo in a Transaction data set.
* Many to Many: CustNo in an Orders data set to CustNo in a Deliveries data set.
How is the MSGLEVEL= option used?
The MSGLEVEL= system option is used to specify the amount of detail that is printed in the SAS log when SAS code is executed.
- The default value is N, which restricts the output to notes, warnings and error messages.
- The alternative value is I, which outputs further information specifically relating to merge processes, use of indexes and sort procedures.
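The sketch below switches MSGLEVEL= to I around a merge and then restores the default. The data sets work.m1 and work.m2 are hypothetical and deliberately share the non-BY variable x, which is the kind of situation where additional informational lines would be expected in the log.

```sas
options msglevel=i;   /* request extra informational messages */

data work.m1;
    input id x;
    datalines;
1 10
2 20
;
run;

data work.m2;
    input id x;
    datalines;
1 11
2 22
;
run;

data work.merged;
    merge work.m1 work.m2;   /* x exists in both inputs */
    by id;
run;

options msglevel=n;   /* restore the default level */
```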
True or False ?
Provide the corrected version for any false statements regarding DATA step merging.
a) Common BY variables must have the same name and length, but it is permissible for
them to be of different types;
b) Only observations with matching BY variables will be paired;
c) Data sets must first be sorted or indexed by the BY variables;
d) FIRST. and LAST. are generated automatically;
e) Duplicate BY values can be present in any data set;
f) Match merging allows only two data sets to be merged.
a) FALSE: Common BY variables must have the same name and length, but it is permissible for them to be of different types.
Corrected: Common BY variables must have the same name and type, but it is permissible for them to have different lengths.
f) FALSE: Match merging allows only two data sets to be merged.
Corrected: Match merging allows multiple data sets to be merged.
Describe Data Summarisation?
‘Data Summarisation’ describes the process of collapsing data in order to gain a higher-level view of key factors and generate statistics such as totals, averages, minimum values and maximum values. Procedures such as PROC MEANS and PROC TRANSPOSE can achieve this.
Write the syntax for PROC MEANS
Basic Syntax:
PROC MEANS <option(s)> <statistic-keyword(s)>;
BY variable-list;
CLASS variable-list;
VAR variable-list;
FREQ variable;
ID variable-list;
TYPES requested-combinations-of-class-variables;
OUTPUT <OUT=SAS-data-set> <output-statistic-list>;
RUN;
What are the CLASS and VAR statements used for in PROC MEANS?
- Adding a CLASS statement to the procedure introduces the ability to group the analysis by one or more classification / categorisation variables. Classification variables can be character or numeric, but tend to have discrete values by which to group the results.
- The VAR statement is used to identify one or more analysis (numeric) variables for which a series of default statistics are output.
Why is an output statement added to a PROC MEANS procedure?
- Saving results: a common use of the MEANS procedure is to produce an output SAS data set. Adding an OUTPUT statement to the step creates an output SAS data set.
- It is possible to generate more than one output data set within a single PROC MEANS step by using multiple OUTPUT statements.
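A short sketch combining CLASS, VAR and OUTPUT is shown below. It assumes the SASHELP.CLASS sample table; the output data set name and the names given to the statistics are illustrative only.

```sas
proc means data=sashelp.class noprint;   /* NOPRINT suppresses the printed report */
    class sex;                           /* group the analysis by sex */
    var height weight;                   /* numeric analysis variables */
    output out=work.class_stats
           mean(height weight)=mean_height mean_weight
           max(height)=max_height;
run;
```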
What are the automatic variables generated in a PROC MEANS procedure and can you explain them?
- The _FREQ_ variable gives the number of observations at each ‘level’.
- The _STAT_ variable shows the names of the five default statistics (N, MIN, MAX, MEAN and STD) produced for the output data set when no statistics are requested on the OUTPUT statement.
- The _TYPE_ variable identifies the level of summarisation of each output row: the whole data set, the data set broken down by the values of a single class variable, or the data set broken down by combinations of class variables.
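To see the automatic variables, which SAS names _TYPE_, _FREQ_ and _STAT_, one option is to write a default output data set and print it. This sketch assumes SASHELP.CLASS; the output name work.height_default is hypothetical.

```sas
proc means data=sashelp.class noprint;
    class sex age;
    var height;
    output out=work.height_default;   /* no statistics listed: default statistics, identified by _STAT_ */
run;

proc print data=work.height_default;
    where _type_ = 0;   /* rows summarising the whole data set */
run;
```

With two class variables listed as sex age, _TYPE_ should take the value 0 for the whole data set, 1 for age alone, 2 for sex alone and 3 for the sex by age combinations.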