Exam 1 Flashcards
What does $ in an input statement mean
character variable: the $ after a variable name tells sas to read it as text (often used for categorical data)
what does “cards” mean
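cards is an older alias for the datalines statement; you have to use one of them, as the last statement of the data step, when reading internal (in-stream) raw data; it tells sas that the data values follow (to apply infile options such as dlm to in-stream data, use infile datalines)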
what does dlm in the infile statement do
tells sas the delimiter for the data
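example: a minimal sketch of list input on in-stream data, using $, infile datalines, and dlm together. the data set name pets and its variables are made up for illustration.

```sas
/* hypothetical data set and variable names */
data pets;
    infile datalines dlm=',';     /* values are separated by commas */
    input name $ species $ age;   /* $ marks name and species as character */
    datalines;
Rex,dog,4
Tom,cat,7
;
run;
```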
what does NOOBS in proc print do
suppresses the observation # (Obs) column in the output
infile statement
tells sas where it will be reading your data from
what does obs do in proc print ex- (obs=20)
tells sas to print the first 20 observations
proc print data=x (firstobs=11 obs=20)
prints the 11th through 20th observations
what does firstobs do in the infile statement
firstobs=2 starts reading data from the second row of your data file (use it if the first row of the file is just variable names); in general, firstobs=n starts at row n
what does using var statement in proc print do
will only print your selected variables
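example: a short sketch that combines the proc print options above. it assumes the sashelp.class sample data set that ships with sas (19 rows).

```sas
/* prints rows 11 through 15, without the Obs column */
proc print data=sashelp.class (firstobs=11 obs=15) noobs;
    var name age;   /* only these two variables are printed */
run;
```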
what does varnum in proc contents do
puts the variables in creation order, rather than in the default alphabetical order; can make finding the variable easier
proc sgplot data=;
histogram salary /showbins binwidth=5000;
run;
creates a histogram with salary on the x-axis, markings at the mid point of each bin
binwidth specifies the width of each bin; sas will determine the number of bins unless you use the nbins option
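example: the same kind of histogram as a complete step, sketched with the sashelp.class sample data set (an assumption, the card above leaves the data set blank).

```sas
proc sgplot data=sashelp.class;
    /* 10-unit bins, tick marks at the bin midpoints */
    histogram weight / showbins binwidth=10;
run;
```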
proc sgplot data=;
scatter x= y= / group=gender;
run;
creates a scatter plot with the x and y variables on their respective axes, with the markers grouped by the gender variable
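example: a complete grouped scatter plot, sketched with the sashelp.class sample data set (an assumption, its grouping variable is sex rather than gender).

```sas
proc sgplot data=sashelp.class;
    /* one marker style per value of sex */
    scatter x=height y=weight / group=sex;
run;
```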
vbar
vertical bar chart of the frequencies of a categorical variable; looks similar to a histogram. options can be explored in 8.2
dsd
- ignores delimiters enclosed in quote marks
- treats 2 delimiters in a row as a missing value
- does not read quote marks as part of the data
- assumes the dlm is a comma
- prudent to use missover in case there is missing data at the end of the dataline
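example: a minimal sketch of dsd with missover on in-stream data. the data set staff and its variables are made up for illustration.

```sas
data staff;
    infile datalines dsd missover;   /* dsd: comma delimiter, quotes stripped */
    length name $ 20 dept $ 10;      /* room for values longer than 8 characters */
    input name $ dept $ salary;
    datalines;
"Smith, John",,50000
"Lee, Ann",sales
;
run;
```

in the first row the comma inside the quotes stays part of name and the two commas in a row make dept missing; in the second row missover leaves salary missing instead of reading from the next line.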
missover
tells sas that if it runs out of data, don’t go to the next line, assign missing values to any remaining variables in that dataline
truncover
need this when reading in data in column or formatted input and some datalines are shorter than others
-tells sas to read data for the variable until the end of the dataline or the last column specified in the format or column range (whichever comes first)
differences between missover and truncover
both will assign missing values if the dataline ends before the variable’s field starts
-but when the dataline ends in the middle of a variable field, truncover will take as much as there is, whereas missover will assign a missing value to the variable
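example: a sketch of the difference, assuming a temporary external file (in-stream datalines are padded with blanks, so the difference would not show up there). the fileref demo and the data set names are made up.

```sas
filename demo temp;          /* temporary external file */

data _null_;                 /* write two records of different lengths */
    file demo;
    put 'Alexander';
    put 'Al';
run;

data with_missover;
    infile demo missover;
    input name $9.;          /* second record ends inside the 9-column field */
run;

data with_truncover;
    infile demo truncover;
    input name $9.;
run;

proc print data=with_missover; run;
proc print data=with_truncover; run;
```

with missover the second name should come out missing; with truncover it should come out as Al.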
dlm='09'x
specifies that it is a tab delimited file
what does sum in proc print do
will print the sum of the variable specified
what does short in proc contents do
will only output the variable names
what does @ do in the input statement
- uses pointers to read in external raw data
- tells sas the beginning column of the variable
- all values of the variable must be aligned and space delimited to use the pointer
- the default length of the variable is still 8
column range input method
- tells sas the length of the variable
- good to use if the variable is longer than 8 characters
- can make the data take up less space for shorter variables
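example: a minimal sketch that mixes column range input with an @ pointer. the data set emp and its layout are made up; name sits in columns 1-12 and salary starts in column 14.

```sas
data emp;
    /* columns 1-12 give name a length of 12, embedded blank included */
    input name $ 1-12 @14 salary 5.;
    datalines;
Mary Johnson 52000
Al Wu        48500
;
run;
```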
what kind of files can infile read
flat files (.txt, .csv, .dat)
how sas stores date values
stores a date as the number of days since Jan. 1, 1960
how sas stores time values
the number of seconds since midnight
how sas stores datetime values
the number of seconds between midnight on Jan 1, 1960 and the given date and time
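example: a quick way to see the stored numbers is to write date, time, and datetime literals to the log without a format (a sketch, the variable names are arbitrary).

```sas
data _null_;
    d  = '02jan1960'd;             /* date literal: 1 day after Jan 1, 1960 */
    t  = '00:01:00't;              /* time literal: 60 seconds after midnight */
    dt = '01jan1960:00:01:00'dt;   /* datetime literal: 60 seconds after the start */
    put d= t= dt=;                 /* log shows d=1 t=60 dt=60 */
run;
```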
$w. informat
reads in character data
w specifies the width
.d would specify decimal points (just use “.” with no d because it is a character variable)
w.d informat
reads in standard numeric data
w is the total width of the field, counting the digits, the decimal point, and any sign
d is the number of digits after the decimal point (only applied when the raw value has no decimal point of its own)
COMMAw.d informat
reads in numeric values and removes embedded commas, blanks, dollar signs, %, dashes, and right parentheses from the input data
- converts a left parenthesis to a minus sign (ex (500) input varname comma5. turns into -500)
- the COMMAw.d format (as opposed to the informat) writes the number with a comma separating every 3 digits
DOLLARw.d informat
- reads the same way as commaw.d; the DOLLARw.d format writes numbers with a leading $
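example: a small sketch of reading with the comma informat and writing with the comma and dollar formats (the variable names are arbitrary).

```sas
data _null_;
    a = input('(500)', comma5.);    /* informat: left parenthesis becomes a minus sign */
    b = input('$1,234', comma6.);   /* informat: $ and comma are removed */
    put a= b=;                      /* log shows a=-500 b=1234 */
    put b comma8. b dollar8.;       /* formats: written as 1,234 and $1,234 */
run;
```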
DATEw. informat
reads in date values in the form ddmmmyy or ddmmmyyyy
ex 16mar99 use date7.
16mar1999 use date9.
DDMMYYw. informat
reads in date values in the form ddmmyy or ddmmyyyy
ex 160399 use DDMMYY6.
ex 16/03/99 use DDMMYY8.
ex 16031999 use ddmmyy8.
ex 16/03/1999 use ddmmyy10.
* if it were 03/30/1999 you would use mmddyy10.
* the informat name never has four y's, even when the year has four digits; just use two (ddmmyy10., not ddmmyyyy10.)
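example: a sketch showing that the different date informats all produce the same stored number for the same day (the variable names are arbitrary).

```sas
data _null_;
    d1 = input('16mar1999',  date9.);
    d2 = input('16/03/1999', ddmmyy10.);
    d3 = input('03/16/1999', mmddyy10.);
    put d1= d2= d3=;                 /* all three are the same number of days */
    put d1 date9. +1 d1 mmddyy10.;   /* formats display that number as a date */
run;
```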
TIMEw. informat
reads hours, minutes, and seconds in the form hh:mm:ss.ss
ex 10:13 PM use TIME8.
ex 11:23:07.40 use TIME11.2
ex 11:23:09.40 PM use TIME14.2
* count the spaces
DATETIMEw. informat
reads in datetime values as ddmmmyy hh:mm:ss.ss
ex 16mar1997/11:23:07.40 use datetime21.2
where to use informat in the data step
- can use it before the input statement for reading internal raw data
- can use in the input statement
- can use before the infile statement when reading in external raw data
where to use format statements
*can specify in proc print- but this won’t change how the data is stored in the view table
* can use in the data step after informat and before input and it will store the formats
proc format
- this is where you can create user defined formats
* create and store the format in proc format, use in proc print with a format statement
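example: a minimal sketch of a user defined format and its use in proc print. the format name agegrp and its cutoffs are made up; sashelp.class is the sample data set that ships with sas.

```sas
proc format;
    value agegrp low-12  = 'child'
                 13-high = 'teen';
run;

proc print data=sashelp.class;
    format age agegrp.;   /* changes the display only, not the stored values */
run;
```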
by statement and proc sort
- for proc sort, use out= to save the sorted data to a new dataset; without out=, proc sort replaces the original dataset with the sorted version
- sort by the variable first in proc sort, then you can use that variable in a by statement in proc print
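example: a sketch of sorting to a new data set and then printing by group. class_sorted is a made-up name; sashelp.class is the sample data set that ships with sas.

```sas
proc sort data=sashelp.class out=class_sorted;
    by sex;
run;

proc print data=class_sorted;
    by sex;   /* one block of output per value of sex */
run;
```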
using the where statement
use where to subset the data in proc print
can also be used in the data step
to modify a sas dataset
have to use the set statement
subsetting your data
- where chooses the observations you want from an existing dataset
- output the observations that you want to a new dataset
- delete the observations you don’t want so they don’t appear in a new dataset
- keep variables you want
- drop variables you don’t want
if then
can use if then statements with output/delete to subset the data
if then else
most efficient way to write the statement when the conditions are mutually exclusive: once one condition is true, sas skips the remaining else branches
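example: a sketch that puts set, where, if then else, delete, and keep in one data step. the data set teens, the variable size, and the cutoffs are made up; sashelp.class is the sample data set that ships with sas.

```sas
data teens;
    set sashelp.class;             /* start from an existing data set */
    where age >= 13;               /* only read the observations you want */
    length size $ 6;
    if height < 60 then size = 'short';
    else if height < 66 then size = 'medium';
    else size = 'tall';            /* reached only if both tests above fail */
    if sex = 'F' then delete;      /* drop observations you don't want */
    keep name age height size;     /* variables written to the new data set */
run;
```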
renaming vs labeling
- renaming will change the name of the variable and is what you need to reference when you reference variables
- labels are what will show up in the data table, easiest practice is to just remove the labels (label varname=' ';)
proc means
- can use options to get certain statistics (n nmiss std mean q1 q3)
- can specify certain variables you want statistics run for (var statement)
- can sort and then use the sorted variable in a by statement in proc means to get grouped analysis
- if you don’t want to sort, you can use the class statement to do a grouped analysis
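example: a short sketch of a grouped analysis with the class statement, using the sashelp.class sample data set.

```sas
proc means data=sashelp.class n nmiss mean std q1 q3;
    class sex;            /* grouped analysis without sorting first */
    var height weight;    /* only these variables are summarized */
run;
```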
Combining datasets by stacking
- just use set statement and it will stick the datasets on top of each other in the order you specified
- increases the number of observations by combining vertically
- good for when datasets are structured the same but have different observations
- problems with stacking: variables need to have the same type (numeric vs character) in both datasets
- the length of the variable needs to be the same in both datasets; if it isn't, you need to set the dataset with the longer variable first or the values will get truncated (can also use a length statement to define the length before the set statement)
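example: a sketch of stacking two small data sets whose character variable has different lengths. the data sets a, b, and stacked are made up for illustration.

```sas
data a;
    length city $ 3;
    city = 'NYC'; sales = 10;
run;

data b;
    length city $ 11;
    city = 'Los Angeles'; sales = 20;
run;

data stacked;
    length city $ 11;   /* defined before set so 'Los Angeles' is not cut to 'Los' */
    set a b;            /* observations from a, then observations from b */
run;
```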
Combining datasets by merging
- merges horizontally
- good for when you need to combine datasets that have different variables
- have to sort both datasets by the unique identifying variable, make sure no other variable names are shared between the two datasets (values of a shared variable can get overwritten), then merge with a by statement on the sorted variable
- use the in= dataset option to see which observations are a part of which datasets
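example: a minimal sketch of a merge with in= flags. the data sets and variables are made up; both are typed in already sorted by id, otherwise each would need a proc sort first.

```sas
data names;
    input id name $;
    datalines;
1 Ann
2 Bob
3 Cy
;
run;

data scores;
    input id score;
    datalines;
2 88
3 75
4 91
;
run;

data both;
    merge names (in=in_names) scores (in=in_scores);
    by id;
    if in_names and in_scores;   /* keep only ids found in both data sets */
run;
```

here only ids 2 and 3 are in both data sets, so both should end up with two observations.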
proc freq
- creates a frequency table (one-way) or a contingency table (two-way)
* review what each of the four values in a two-way table cell means (frequency, percent, row percent, column percent)
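example: a short sketch of a one-way and a two-way table, using the sashelp.class sample data set.

```sas
proc freq data=sashelp.class;
    tables sex;        /* one-way frequency table */
    tables sex*age;    /* two-way table: sex in the rows, age in the columns */
run;
```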