SAS Flashcards
Code to import data
data work.datasetname;
input age weight height;
label age = 'Patient Age';
cards;
24 130 65
30 150 70
;
run;
printing data
proc print data = work.datasetname; run;
UNIVARIATE procedure does what
for each variable prints summary statistics
extreme observations, stem and leaf, basic stats (mean/median/mode/deviation/stdev/range/IQR), quartiles, t-test, sign, signed rank
USE TO EXPLORE NEW DATASET
proc UNIVARIATE code
proc univariate data = setName plot; /* plot is optional */
var weight;
histogram weight; /* optional */
title1 'Age study';
run;
what does the CORR procedure do
shows simple statistics for each variable (N, mean, standard deviation, sum, minimum, maximum, label)
Correlations between variables
proc CORR code
proc corr data = work.setName;
var age height;
run;
code to create new dataset
data work.newSet;
set oldSet;
run;
create new dataset and add new variable and fill it (code)
data work.cleanedSet; set oldSet; bp = .; IF x = 6 THEN bp = 1; IF x = 1 OR x = 2 OR x = 3 OR x = 4 OR x = 5 THEN bp = 0; run;
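The same recode can be written more compactly with the IN operator. This is only a sketch: oldSet, x, and bp are the placeholder names from the card, and the output dataset name is an assumption.

```sas
/* Sketch only: oldSet, x and bp are placeholder names from the card. */
data work.cleanedSet;
  set oldSet;
  if x = 6 then bp = 1;                      /* level 6 is coded as 1 */
  else if x in (1, 2, 3, 4, 5) then bp = 0;  /* levels 1-5 are coded as 0 */
  else bp = .;                               /* anything else, including missing x, stays missing */
run;
```

Setting bp to missing in the final ELSE has the same effect as initializing bp = . at the top of the step.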
what does cross-tabulation with the FREQ procedure do?
shows two tables: one way freq and cross tabulations of two variables
can see who used what treatment
percentages
frequency tables for variables in analysis
code to show frequencies of dataset proc FREQ
proc freq data = setName;
tables age age*weight;
run;
what to add to FREQ procedure to see how many missing values
tables age age*weight/MISSING;
proc TTEST code
proc ttest data = setName;
class group; /* groups we want to compare */
var height; /* compare groups on this variable */
run;
what does proc TTEST do to missing values
excludes them
proc TTEST equality of variance results
if Folded F p-value > 0.05 assume equal variances, else say variances unequal
what test to use when unequal variances for proc TTEST
Satterthwaite
What test to use when equal variances TTEST
Pooled
proc TTEST if p < 0.05
Reject null and say there is a difference in height between groups
proc TTEST F-test null and alternative
Null is equal variances, alternative is unequal variances
How to use Cochran with proc TTEST
proc ttest data = setName COCHRAN;
When to use cochran
produces p-value for unequal variances
if Folded F p-value < 0.05 (variances unequal)
How to compare our mean height to mean value under the null of 60 (code)
proc ttest data = setName H0=60;
var height;
run;
how to check normality of data
proc univariate data = work.setName plot;
var height;
histogram height;
run;
shows box plot, histogram, etc
how to do before and after TTEST
proc ttest data = setName;
paired before*after;
run;
null is before-after=0
ANOVA code
proc anova data = work.setName;
class food; /* 7 different groups ate different food */
model height = food; /* compare heights of people who ate different food */
run;
ANOVA F-statistic calculation
variance between groups/variance within groups
should be near 1 if null correct
F statistic 6.67 suggests difference in means of groups
When to use Tukey
see which anova means are different.
multiple comparison
Tukey code
proc anova data = work.setName;
class food; /* 7 different groups ate different food */
model height = food; /* compare heights of people who ate different food */
means food/tukey;
run;
how to interpret Tukey output
comparisons significant at 0.05 indicated by ***. those groups different
Why use ANOVA over t-test
faster: one test instead of many. running many t-tests inflates the chance that at least one result is significant purely by chance (type I error)
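A rough illustration of that inflation, assuming the 7 food groups from the ANOVA cards, a 0.05 level for each test, and (as a simplification) independent tests. Seven groups give 21 pairwise comparisons, and the chance of at least one false positive comes out at about 0.66.

```sas
/* Sketch only: the independence of the tests is an assumption made for illustration. */
data _null_;
  groups = 7;                       /* the 7 food groups from the ANOVA cards */
  alpha  = 0.05;                    /* per-test significance level */
  tests  = comb(groups, 2);         /* number of pairwise t-tests */
  fwer   = 1 - (1 - alpha)**tests;  /* chance of at least one false positive */
  put tests= fwer=;                 /* writes both values to the log */
run;
```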
recode to put heights into 4 groups
data work.heightsgrouped;
set work.heights;
group = .; /* initialize variable and handle missingness */
if height >= 10 AND height
Write ANOVA code to test effect of age on group
proc anova data = work.setname; class group; model age = group; run;
what kind of comparisons do we do in anova
one continuous variable (age) to one categorical variable (group)
where do we write continuous and categorical in anova
continuous always on left and categorical on right. class will always be categorical
anova using proc GLM code
proc glm data = work.setname; class group; model age = group; run;
same output but more options for GLM
how to sort data in increasing order (Code)
proc sort data = work.setname;
by group;
run;
how to create box plot for data (code)
proc boxplot data = work.setname;
plot age*group;
run;
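PROC BOXPLOT expects the data to be ordered by the group variable, so the sort card and the box plot card are usually run together. A minimal sketch, using the placeholder names work.setname, age, and group from the cards:

```sas
/* Sketch only: work.setname, age and group are the placeholder names from the cards. */
proc sort data = work.setname;
  by group;            /* order observations by the grouping variable */
run;

proc boxplot data = work.setname;
  plot age*group;      /* analysis variable * group variable */
run;
```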
how to interpret box plot
horizontal line is median
plus is the mean
top and bottom of the box are the 3rd and 1st quartiles
center shaded box made of 50% of data
how do we get all the attributes of our dataset and their characteristics (code)
proc contents data = work.dataset;
run;
what does CONTENTS procedure display
alphabetic list of variables and attributes
variable name, type, length, etc
length of variable important if merging datasets
number is the position in the dataset
how to correlate things together
the CORR procedure
the CORR procedure code
proc corr data = dataset;
var carb ener etohn fat sugaraw sugaref;
with lexpectbirth;
run;
what does CORR procedure output
simple statistics of each variable listed
pearson correlation coefficients
top number is correlation and bottom number is p-value
how to create correlation matrix with all the variables (code)
proc corr data = setname;
var age height weight blah blah3 blah2;
run;
why would we do a correlation matrix
to find out which predictors may be informative when predicting response
understand how predictors are related because that affects how they jointly model the response variable
How to make scatterplot (code)
ods graphics on; proc gplot data = work.set; plot age*height; run; ods graphics off;
reg procedure code
proc reg data = work.set;
model lebirth=ener;
run;
how to structure the model statement of REG procedure
variable left of = is the response and variable(s) right of = are the predictor(s)
what is root MSE
provided by REG procedure
estimate of the standard deviation of the error term, i.e. the typical distance of the response (Y) from the fitted line
what is R-square
percent of the variation that the model explains
__% of total variation is explained by the model
what does it mean when regression intercept 27
When energy is 0, predicted life expectancy at birth is 27
what does it mean when slope is 0.014
for every one-unit increase in energy consumption, life expectancy at birth goes up by 0.014
what is null hypothesis
proc reg data = work.set;
model lebirth=ener;
run;
energy consumption not related to life expectancy
how to transform data to look at squared life expectancy? (code)
data work.nutrition2; set work.nutrition;
le2= lebirth*lebirth;
run;
proc reg data = work.nutrition2; model le2=ener; run;
Why square data
logarithmic trend taken away - more linear now
the model assumes a linear relationship, so the plot should look linear
what if transformed data and root MSE went up and R square didnt improve
not best transformation
when do transformation hope to improve linearity and our explanation of the variation
How to look if multicollinearity may exist (code)
ODS graphics on; PROC CORR DATA=work.nutrition nomiss plots=matrix(histogram); VAR carbs ener etohn fat sugraw sugref; RUN; ODS graphics off;
if put 2 predictor variables in model that are correlated with each other
will get inflated standard errors and it is hard to decipher which variable is giving you which info
how to decide which variable not to use
context of model (study obesity more sense to include BMI than weight), include 1, run model, see fit, try another
bad model if can explain a variable through
linear combo
if had model of solid predictors DF would say
1
otherwise says 0 or B
code for backward selection
in model line at end add / SELECTION = BACKWARD
what is backward selection
put all variables in, consider one and if doesn’t add any info to model remove
keep going backward until finds model that gives most info
removes variable that’s least significant (contributing less)
what is forward selection
start with no variables and consider one that adds most info
moves on until no info added after adding variable
what is stepwise selection
biased. forward with a look back: after adding the next variable (say the 2nd), steps back and asks whether adding it made the 1st less significant
at each step consider removing or adding variable
can make diff decision at each step
what does selection kick off before the start
variables that are linear combinations of other variables
which things are kicked off models using selection first
non-significant p value
default significance for backward selection
0.1
default significance of stepwise selection
0.15
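Putting the selection cards together, a backward-selection run might look like the sketch below. The dataset and variable names are borrowed from the multicollinearity and REG cards and are assumptions about the data; SLSTAY is written out only to make the 0.1 default visible.

```sas
/* Sketch only: dataset and variable names are taken from earlier cards. */
proc reg data = work.nutrition;
  model lebirth = carbs ener etohn fat sugraw sugref
        / selection = backward slstay = 0.10;  /* drop the least significant variable until all remaining have p < 0.10 */
run;
quit;
```

For stepwise selection the matching options would be selection = stepwise with slentry = 0.15 and slstay = 0.15.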
if data miner use ____ selection because want best model
stepwise
what complexity model do we want
simpler
how to see residuals
run model with ods graphics
what should we not see from residuals
trend between residuals and fitted values or residuals and any variable
residual vs. predicted values
want random - no patterns
standardized residuals
expect 95% between +-2
leverage
what'll happen if pull that observation out of model.
points with high leverage = big influence on the line
Q-Q plot
normal quantiles vs. residuals
how do we expect residuals to be distributed
normally
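One way to get the residual plots described in these cards is sketched below. The dataset work.nutrition and the model lebirth = ener come from the REG cards; the output dataset name and the new variable names are made up for illustration.

```sas
/* Sketch only: work.resids, predicted, residual and std_resid are illustrative names. */
ods graphics on;
proc reg data = work.nutrition plots = diagnostics;
  model lebirth = ener;
  output out = work.resids
         p = predicted         /* fitted values */
         r = residual          /* raw residuals */
         student = std_resid;  /* standardized residuals, expect about 95% within +-2 */
run;
quit;
ods graphics off;
```

The diagnostics panel includes residual vs. predicted, leverage, and Q-Q plots. The saved dataset can be plotted against any other variable to look for trends.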
how do we transform age square
data agetransform; set age; age2= age**2; run;
double star is power
why do proc means
exploratory analysis
tells how many people in which category
how to perform exploratory analysis on age categories and a dichotomous variable
proc means data = name; class age; var dichotomous; run;
PROC MEANS DATA=ear; class antibo; var clear1; RUN;
what does output mean (N obs, N, mean)
N Obs: How many people were on that antibiotic
N: How many had first ear problem
Mean: Percentage that had a recovery in the 14 day period
how to look at model of antibiotic one vs antibiotic 2 as reference (compare from 1 to 2)
PROC LOGISTIC DATA=ear DESCENDING;
CLASS antibo;
MODEL clear1 = antibo / LACKFIT;
run;
how to interpret OR 2.247
OR of 1 vs 2
Odds of recovery are 2.247 times greater for antibiotic 1 vs antibiotic 2
how to check if OR significant
make sure the CI doesn't contain 1
look at p values
PROC LOGISTIC DATA=ear DESCENDING;
CLASS antibo;
MODEL clear1 = antibo / LACKFIT;
run;
what if don’t put class variable
SAS looks at it as continuous variable
ALWAYS USE CLASS STATEMENT IN LOG REGRESSION TO MAKE MORE INTERPRETABLE
Odds = (write out standard model example)
odds = e^(-1.864) * (e^b1)^X1 * (e^b2)^X2 * ... (the constant term, then one factor per predictor)
every unit increase in X increases odds of Y being 1 by
e^b
male e^B = 2.454
interpret
if subtract 1 from value get % increase or decrease in odds caused by being male
odds of owning a gun increase by 145%
educ b=-0.056
exp(b) = 0.946
each additional year of education decreases odds by 5.4%
10-year age effect on odds
Exp(B)^10
1.008^10 = 1.083
odds go up
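The arithmetic on these odds cards can be checked in a quick DATA step. A sketch using only the coefficients quoted above; the variable names are made up for illustration.

```sas
/* Sketch only: variable names are illustrative; the numbers come from the cards. */
data _null_;
  or_male  = 2.454;                 /* e^B for male */
  pct_male = (or_male - 1) * 100;   /* percent change in odds for males */
  or_educ  = exp(-0.056);           /* e^B for one year of education */
  pct_educ = (or_educ - 1) * 100;   /* negative value = percent decrease in odds */
  or_age10 = 1.008**10;             /* effect of a 10-year age increase */
  put pct_male= or_educ= pct_educ= or_age10=;
run;
```

The values written to the log should match the cards: about a 145% increase for males, exp(b) of about 0.946 (a 5.4% decrease) for education, and about 1.083 for ten years of age.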
how to format variable (code)
proc format;
value hospformat 1='Hospitalized' 2='Not Hospitalized';
run;
what do formats do
formats are stored in their own catalog (WORK.FORMATS by default) and can then be called in other procedures such as proc freq
how to do chi square test of independence (code)
proc freq data=h1n1; format Hospitalization hospformat. Age ageformat.; tables Hospitalization * Age / chisq; weight Count; run;
what do we put after format
A format name always ends with a dot; the dot tells SAS it is a format
proc freq data=h1n1; format Hospitalization hospformat. Age ageformat.; tables Hospitalization * Age / chisq; weight Count; run;
* means
Star says hospitalization versus age.
/chisq tells sas
specific analysis I want done on frequency table is chi square
what does chisq show
frequency table
chi square value
rule of thumb for doing chi square
need an expected count of at least 5 in each cell
SAS will warn if cells have expected counts less than 5 - USE FISHER'S EXACT instead
if chi-square p-value < 0.05
reject null hypothesis that age is independent of hospitalization
chi square likelihood ratio based on
regression analysis
mantel-haenszel chi square
ordinal test of association
- Good if have ordinal categories
- Looking for association b/w rows and columns assuming there is order for the columns
Phi coefficient
Usually just for 2x2 tables, for which -1 <= ϕ <= 1
Contingency coefficient
C = sqrt(χ²/(N + χ²)), equivalently sqrt(ϕ²/(1 + ϕ²)) since ϕ² = χ²/N
Cramer’s V
measure of association
expected chi square and fishers exact code
ods graphics on;
proc freq data=h1n1;
format Hospitalization hospformat. Age ageformat.;
tables Hospitalization * Age / expected chisq;
weight Count;
exact fisher pchi;
run;
when to use exact chi square
Use exact chi square when the assumption of expected counts > 5 is not met
Fishers exact
don’t have to worry about restrictions chi square has
poisson regression
Different style of regression based on Poisson distribution
Poisson distribution
common distribution for counts
log odds can’t span
0
format questions to be answered as yes or no (code)
proc format;
value qaformat 1='Yes' 2='No' 3='Dunno';
run;
code to check agreement between self questionnaire and interview
proc freq data=cough; format saq qaformat. int qaformat.; tables saq * int /agree; weight count; run;
what does two sided test mean
tests whether kappa equals zero, against the alternative that it differs in either direction
proc freq data=cough; format saq qaformat. int qaformat.; tables saq * int /agree; weight count; run;
what does one sided testing means
null is kappa = 0, alternative is kappa > 0
which kappa to report
ONE SIDED because want positive kappa
don’t look at weighted kappa WANT BASIC KAPPA
don't report exact p value
CI for kappa
shouldn’t include zero
negative kappa value means
no agreement
kappa breakdowns
poor (
code to produce exact test for kappa
proc format;
value physformat 1='Minimal' 2='Moderate' 3='Large' 4='Excessive';
run;
proc freq data=phys;
format phys1 physformat. phys2 physformat.;
tables phys1*phys2 / agree;
weight count;
test agree;
exact agree;
run;
McNemar’s test tests for
Symmetry
shown when doing agreement
is the probability that one physician rates it a 1 and another a 3 the same as the probability that the first rates it a 3 and the other a 1
If the table is not symmetric, it indicates that one physician tends to rate the ectopies as larger than the other physician does (bias)
27 minimal by one physician and only 15 minimal by the other: bias toward saying things are smaller (want 15 and 27 to be closer to each other)
H0 and Ha of McNemar’s
null is symmetric, alternative is asymmetric
How to interpret McNemar’s
if p-value < 0.05, reject the null of symmetry
Parametric test: paired t-test
Nonparametric
Wilcoxon signed rank
How to get nonparametric correlation along with parametric (code)
proc corr data=oc pearson spearman;
var before after;
run;
difference between spearman and pearson
Pearson only looking at linear association
Spearman works off of ranks: ranks the data, then correlates the ranks. picks up monotonic association, including non-linear
code for matched pairs t-test
ODS GRAPHICS ON; PROC TTEST DATA=work.contraceptives; PAIRED before * after; TITLE 'Example of Matched Pairs'; RUN; ODS GRAPHICS OFF;
what does this do
match observations before and after to determine if same
assumes the before-after differences are normally distributed
paired t-test H0 and Ha
H0: before = after
H1: before != after
what if reject null of paired t-test
say the mean before differs from the mean after (mean difference is not 0)
Wilcoxon signed rank used when
nonparametric for matched pairs t-test
wilcoxon signed rank code
PROC UNIVARIATE DATA=oc;
VAR diff;
RUN;
look at Signed rank
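The card assumes a variable diff already exists in the dataset. If it does not, it can be built first. A sketch, assuming oc holds the before and after variables used on the CORR card; the output dataset name is made up.

```sas
/* Sketch only: work.oc_diff is an illustrative name; before and after come from the CORR card. */
data work.oc_diff;
  set oc;
  diff = before - after;   /* paired difference for each subject */
run;

proc univariate data = work.oc_diff;
  var diff;                /* Tests for Location table reports Student's t, Sign and Signed Rank */
run;
```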
when to do signed rank vs t-test
based on normality. often people do non-parametric test so don’t have to assume normality
nonparametric tests usually have less power (harder to prove something), so a result significant in the nonparametric test will usually also be significant in the parametric test
independent samples
parametric: t-test
non-parametric
Mann Whitney U (Wilcoxon Rank Sum)
t-test independent samples code
PROC TTEST DATA=pain;
CLASS Physiotherapy;
VAR pri;
RUN;
class statement tells SAS
where to get 2 independent samples
Physiotherapy classifies subjects into 2 groups
trying to see if pain rating same in 2 groups
Mann Whitney U code
nonparam test for indep samples
PROC NPAR1WAY DATA=pain WILCOXON;
CLASS Physiotherapy;
VAR pri;
RUN;
What to look at when doing Wilcoxon Rank sum (Mann Whitney U)
T-approximation
doing 2-sided test to determine if differences are zero
Kruskal Wallis Test code
PROC NPAR1WAY DATA=pain WILCOXON;
CLASS analg;
VAR pri;
RUN;