Lesson 3: Exploring and Validating Data Flashcards
PROC CONTENTS to
to view the DESCRIPTOR portion of the table and to visually scan the table to see the data portion, but tables are often too large for a visual review to be sufficient.
The PRINT procedure creates
a listing of all rows and columns in the data.
The MEANS procedure calculates
a simple summary statistics for numeric columns in the data. The default statistics are frequency count, mean, standard deviation, minimum, and maximum.
PROC MEANS DATA=input-table;
VAR col-name(s);
RUN;
The UNIVARIATE procedure generates
a detailed summary statistics for numeric column. It includes the 5 extremes: lowest and highest extreme values and their observation numbers
PROC UNIVARIATE DATA=input-table;
VAR col-name(s);
RUN;
The FREQ procedure creates
a frequency TABLE for each COLUMN in the input table. The results include the unique values for the column along with their frequency, percent, and cumulative frequency and percent. This procedure can help you find problems in the data, like miscoded values or case inconsistencies.
PROC FREQ DATA=input-table;
TABLES col-name(s);
RUN;
OBS stands for
observations. By default PROC PRINT will list all of the rows but OBS allows to specify the subset / number of rows. It will include ALL of the columns in the data
PROC PRINT DATA=input-table (OBS=n);
RUN
proc print data=pg1.storm_summary (obs=10);
run;
DATA= specifies the table name
OBS= prints only the first 10 observations.
VAR statement
limits the COLUMN and changes their order in the report. Can populate the Column Names by clicking on and dragging them. The columns on the VAR statement have to be numeric
PROC PRINT DATA=input-tables (OBS=n);
VAR col-name(s);
RUN;
OBS is for…. and VAR is for…
OBS is for ROWs
VAR is for COLUMNS
How to FILTER the ROWS that appear in a PROC (print, univariate, etc)?
Use the WHERE statement
PROC procedure-name . . . ;
WHERE expression;
RUN;
What expressions / operators can be used with WHERE statement?
= or EQ
^= or ~= or NE
> or GT
< or LT
>= or GE
<= or LE
The expression evaluates as true or false for each ROW.
Either the symbol or letters can be used
Character values are case sensitive and must be enclosed in DOUBLE or SINGLE quotation marks.
Numeric values are NOT enclosed in quotation marks and must be standard numeric values. In other words, you cannot include special symbols such as commas or dollar signs.
What are the components of the WHERE statement?
Column name
Operator
Value
Character values are case sensitive and must be enclosed in DOUBLE or SINGLE quotation marks.
Numeric values are NOT enclosed in quotation marks and must be standard numeric values. In other words, you cannot include special symbols such as commas or dollar signs.
What is the SAS Date Constant?
A notation for SAS dates
Dates are stored as numeric values, so the expression is evaluated based on a numeric comparison.
If you want to compare a date column to a fixed date, you type the date using a particular notation, and SAS turns the string you provide into the numeric equivalent to evaluate the expression.
This notation is called the SAS date constant, and it has this format: a one- or two-digit day, a three-letter month, and a two- or four-digit year, enclosed in quotation marks, followed by the letter D.
“DDMMMYYYY”D
Examples:
WHERE date > “1jan15”d
WHERE date > “01JAN2015”d
Syntax for a start date on or after January 1, 2010
WHERE StartDate >= “01jan2010”d;
How to combine multiple WHERE expressions?
With the keywords AND or OR
What does IN operator do?
The IN operator to compare to a list of same values.
INSTEAD OF:
where Type=”SUV” or Type=”Truck” or Type=”Wagon”;
USE:
where Type in (“SUV”,”Truck”,”Wagon”);
After typing the column name, you use the keyword IN
In parentheses, list the values separated by commas or spaces
The IN operator works with both NUMERIC and CHARACTER values. Character values are CASE sensitive and must be enclosed in QUOTATION marks.
The values can be separated either with a space or comma
The keyword NOT to reverse the logic of the IN operator
WHERE col-name IN (value-1<…,value-n>);
WHERE col-name NOT IN (value-1<…,value-n>);
What statement allows to format values in columns?
- FORMAT statement in PROC PRINT.
- Can format any number of columns in a single FORMAT statement
- Affects display, does NOT change values
PROC PRINT DATA=input-table;
FORMAT col-name(s) format;
RUN;