Lesson 3: Exploring and Validating Data Flashcards

1
Q

PROC CONTENTS to

A

to view the DESCRIPTOR portion of the table and to visually scan the table to see the data portion, but tables are often too large for a visual review to be sufficient.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

The PRINT procedure creates

A

a listing of all rows and columns in the data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

The MEANS procedure calculates

A

a simple summary statistics for numeric columns in the data. The default statistics are frequency count, mean, standard deviation, minimum, and maximum.

PROC MEANS DATA=input-table;
VAR col-name(s);
RUN;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

The UNIVARIATE procedure generates

A

a detailed summary statistics for numeric column. It includes the 5 extremes: lowest and highest extreme values and their observation numbers

PROC UNIVARIATE DATA=input-table;
VAR col-name(s);
RUN;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

The FREQ procedure creates

A

a frequency TABLE for each COLUMN in the input table. The results include the unique values for the column along with their frequency, percent, and cumulative frequency and percent. This procedure can help you find problems in the data, like miscoded values or case inconsistencies.

PROC FREQ DATA=input-table;
TABLES col-name(s);
RUN;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

OBS stands for

A

observations. By default PROC PRINT will list all of the rows but OBS allows to specify the subset / number of rows. It will include ALL of the columns in the data

PROC PRINT DATA=input-table (OBS=n);
RUN

proc print data=pg1.storm_summary (obs=10);
run;

DATA= specifies the table name
OBS= prints only the first 10 observations.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

VAR statement

A

limits the COLUMN and changes their order in the report. Can populate the Column Names by clicking on and dragging them. The columns on the VAR statement have to be numeric

PROC PRINT DATA=input-tables (OBS=n);
VAR col-name(s);
RUN;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

OBS is for…. and VAR is for…

A

OBS is for ROWs
VAR is for COLUMNS

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How to FILTER the ROWS that appear in a PROC (print, univariate, etc)?

A

Use the WHERE statement

PROC procedure-name . . . ;
WHERE expression;
RUN;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What expressions / operators can be used with WHERE statement?

A

= or EQ
^= or ~= or NE
> or GT
< or LT
>= or GE
<= or LE

The expression evaluates as true or false for each ROW.
Either the symbol or letters can be used
Character values are case sensitive and must be enclosed in DOUBLE or SINGLE quotation marks.
Numeric values are NOT enclosed in quotation marks and must be standard numeric values. In other words, you cannot include special symbols such as commas or dollar signs.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are the components of the WHERE statement?

A

Column name
Operator
Value

Character values are case sensitive and must be enclosed in DOUBLE or SINGLE quotation marks.
Numeric values are NOT enclosed in quotation marks and must be standard numeric values. In other words, you cannot include special symbols such as commas or dollar signs.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is the SAS Date Constant?

A

A notation for SAS dates

Dates are stored as numeric values, so the expression is evaluated based on a numeric comparison.

If you want to compare a date column to a fixed date, you type the date using a particular notation, and SAS turns the string you provide into the numeric equivalent to evaluate the expression.

This notation is called the SAS date constant, and it has this format: a one- or two-digit day, a three-letter month, and a two- or four-digit year, enclosed in quotation marks, followed by the letter D.

“DDMMMYYYY”D

Examples:
WHERE date > “1jan15”d
WHERE date > “01JAN2015”d

Syntax for a start date on or after January 1, 2010
WHERE StartDate >= “01jan2010”d;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How to combine multiple WHERE expressions?

A

With the keywords AND or OR

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What does IN operator do?

A

The IN operator to compare to a list of same values.

INSTEAD OF:
where Type=”SUV” or Type=”Truck” or Type=”Wagon”;

USE:
where Type in (“SUV”,”Truck”,”Wagon”);

After typing the column name, you use the keyword IN
In parentheses, list the values separated by commas or spaces
The IN operator works with both NUMERIC and CHARACTER values. Character values are CASE sensitive and must be enclosed in QUOTATION marks.
The values can be separated either with a space or comma
The keyword NOT to reverse the logic of the IN operator

WHERE col-name IN (value-1<…,value-n>);
WHERE col-name NOT IN (value-1<…,value-n>);

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What statement allows to format values in columns?

A
  • FORMAT statement in PROC PRINT.
  • Can format any number of columns in a single FORMAT statement
  • Affects display, does NOT change values

PROC PRINT DATA=input-table;
FORMAT col-name(s) format;
RUN;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Compound AND expression in WHERE statement does what?

A

Specifies two or more conditions using the AND operator. The keyword AND means that both conditions must be true in order for a row to be printed.

Syntax example for Type equal to TS (tropical storm) and Hem_EW equal to W (west) :

WHERE Type = “TS” AND Hem_EW = “W”;

17
Q

What does date constant do?

A

Filters on dates. The date constant will be converted to the number of days from January 1st, 1960 in order to evaluate our WHERE expression.

Syntax for a start date on or after January 1, 2010
WHERE StartDate >= “01jan2010”d;

18
Q

How to filter only the rows that meet certain numeric value?

A

PROC PRINT DATA=libref. input table;
WHERE col-name operator value;
RUN;

Example:
proc print data=pg1.storm_summary;
where MaxWindMPH >= 156;
run;

No QUOTATION marks for numeric values
The log will list how many observations or rows met the criteria in the WHERE statement.

19
Q

How to filter only the rows that meet ONE character value?

A

PROC PRINT DATA=libref. input table;
WHERE col-name operator “value”;
RUN;

Example for Basin equal to WP (West Pacific):
proc print data=pg1.storm_summary;
where Basin= “WP”;
run;

Character constants require QUOTATION (either single or double)
They are CASE sensitive

20
Q

How to filter only the rows that meet MULTIPLE and SAME character value?

A

Use IN operator. The IN operator to compare to a list of same values.

PROC PRINT DATA=libref. input table;
WHERE col-name operator “value” AND “value”;
RUN;

Example for Basin equal to SI or NI (South Indian or North Indian)

proc print data=pg1.storm_summary;
where where Basin in (“SI” “NI”);
run;

Character constants require QUOTATION (either single or double)
They are CASE sensitive
Values in parenthesis can be separated with either SPACE or COMMA

21
Q

How to filter rows with DATES?

A

Use date constant: “DDMMMYYYY”D

WHERE Date Operator DateConstant

Example for StartDate on or after January 1, 2010

proc print data=pg1.storm_summary;
where StartDate >= “01jan2010”d;
run;

22
Q

How to filter only the rows that meet MULTIPLE and DIFFERENT character value?

A

Use compound AND operator.

Specifies two or more conditions using the AND operator. The keyword AND means that both conditions must be true in order for a row to be printed.

Syntax example for Type equal to TS (tropical storm) and Hem_EW equal to W (west) :

proc print data=pg1.storm_summary;
where Type = “TX” AND Hem_EW = “W”
run;

23
Q

How to filter only the rows that meet alternate values?

A

Use OR operator. With the OR keyword only one of the expressions needs to be true for a row to be included in the output.

Syntax example for MaxWindMPH greater than 156 or MinPressure less than 920

proc print data=pg1.storm_summary;
where MaxWindMPH > 156 OR MinPressure < 920;
run;

Numeric MISSING values are treated as the SMALLEST possible value
Can add GREATER THAN 0 to exclude rows with missing values:

For example:
where MaxWindMPH > 156 OR 0< MinPressure < 920;

CLASS NOTES:
MaxWindMPH isn’t greater than 156, but MinPressure is missing. Remember the expression was for MinPressure to be less than 920. Missing values are technically treated as the smallest possible value. So when MinPressure is missing, the expression MinPressure less than 920 is true. All rows with missing values for MinPressure are included in the report.

We can modify the WHERE statement to exclude the missing values for MinPressure. In the expression for MinPressure add that MinPressure should be greater than 0 as well as less than 920.
where MaxWindMPH>156 or 0<MinPressure<920;

24
Q

How to filter your data by missing values only?

A

You could write an expression where a column is equal to a period for a numeric missing value or a space enclosed in quotation marks for a character missing value.

where Type=. or Type=” “;

25
Q

What is the syntax for MISSING or NOT MISSING operators?

A

WHERE col-name IS MISSING;
WHERE col-name IS NOT MISSING;

Example:
WHERE Age is missing;
WHERE Name is not missing;

Can be used for either numeric or character missing values
Data from DBMS environment can also use NULL operator

26
Q

What does BETWEEN - AND operators are for?

A

The BETWEEN AND operator is handy for numeric and character ranges. The endpoints of the range are inclusive

Syntax:
WHERE col-name BETWEEN value-1 AND value-2;

Example:
where Age between 20 and 39; (between and including 20 and 39)

27
Q

What is the LIKE operator for?

A

The LIKE operator enables us to do pattern matching.

Syntax:
WHERE col-name LIKE “value”;

% is a wildcard for ANY number of characters
Underscore is a wildcard for a SINGLE character

Example:
where City like “New%”;
In this example, % after NEW returns ANY string after New including spaces and letters (New York, New Delhi, Newport, Newcastle, New)

where City like “Sant_ %”;
The underscore represents a SINGLE character and then a space and
% returns both Santa and Santo, a space, and any other string (Santa Clara, Santa Cruz, Santo Domingo, Santo Tomas)

28
Q

What do MACRO variables do?

A

Macro variables store a text string. The SAS macro language is designed to help make your programs reusable and dynamic.

Suppose you have a program with MULTIPLE procedures, and you want to FILTER EACH procedure to select only rows where the type of car is a wagon.

Open code or syntax in a separate window.

proc print data=sashelp.cars;
where Type=”Wagon”;
var Type Make Model MSRP;
run;

proc means data=sashelp.cars;
where Type=”Wagon”;
var MSRP MPG_Highway;
run;

proc freq data=sashelp.cars;
where Type=”Wagon”;
tables Origin Make;
run;

29
Q

What is the syntax for MACRO?

A

%LET macro-variable=value;

All statements in the macro language begin with a % sign
After %LET, specify the name of the macro variable
an equals sign =
the text string you want to store.
DO NOT enclose the string in QUOTATION marks.
When you reference macro variable (in code) use DOUBLE QUOTATION marks
The %LET statement ends with a semicolon.

Example:
%let CarType = Wagon;

30
Q

How to run a MACRO

A
  1. Create a macro variable
  2. Use it in the program
  3. When you reference a macro in the code, precede the name with &.
    & MARCO-VAR. The ampersand triggers SAS to look up the text string stored in the value and replaces it with the PROC before it executes the code.
  4. Now if you want to run the same procedures and change the value of the FILTER (e.g. Wagon with SUV), you only have to change the value in the %LET statement!

Example:
%let CARTYPE=Wagon; (can be replace with SUV)

proc print data=sashelp.cars;
where Type = “& CARTYPE”;
var Type Make Model MSRP;
run;

proc means data=sashelp.cars;
where Type = “& CARTYPE”;
var MSRP MPG_Highway;
run;

proc freq data=sashelp.car;
where Type = “& CARTYPE”;
tables Origin Make;
run;

31
Q

Are MARCO variables permanent?

A

No. Like libraries, macro variables are temporary, so when you exit SAS, they are deleted. If you have macro variable references in your program, you must create the macro variable at the beginning of your program before you reference it.

32
Q

What is an example of a MARCO for DATE?

A

%let Date=01JAN2000;

IMPORTANT: When referencing it in the code use “&Date”d

proc print data=pg1.storm_summary;
where MaxWindMPH>=&WindSpeed and Basin=”&BasinCode” and
StartDate>=”&Date”d;
var Basin Name StartDate EndDate MaxWindMPH;
run;

33
Q

What is the syntax to specify FORMAT

A

<$> format-name <w>.<d></d></w>

<$> - character format
name of the format

<w> - total format width (including decimals and special characters)
period - required delimiter
<d> - for numeric formats, number of decimal places

PROC PRINT DATA=input-table;
FORMAT col-name(s) format;
RUN;
</d></w>

34
Q

What do procedures do?

A

Validate data

35
Q

What is the syntax for PROC SORT to sort data?

A

PROC SORT DATA=input-table
OUT = output-table;
BY <DESCENDING> col-name(s);
RUN;</DESCENDING>

36
Q

Syntax to remove duplicate adjacent (next to each other) ROWS?

A

PROC SORT DATA= input-table
OUT = output-table
NODUPRECS
DUPOUT=output table;
BY ALL;
RUN;

DUPOUT=output table - generates a table of the duplicate rows that were removed

37
Q

Syntax to remove duplicate adjacent (next to each other) COLUMNS?

A

PROC SORT DATA=input table
OUT = output table
NODUPKEY
DUPOUT=output table;
BY DESCENDING col-names;
RUN;