R Parameterized Modeling Flashcards
Omit Missing Values From Data Frame
na.omit(DataFrame)
How to count missing values in a column
sum(is.na(DataFrame$column))
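The two cards above can be tried on a small made-up data frame. This is only a sketch: `df`, its columns, and the result names are hypothetical and not part of the original cards.

```r
# Hypothetical toy data frame with missing values in both columns
df <- data.frame(x = c(1, NA, 3, NA), y = c("a", "b", NA, "d"))

# Count missing values in a single column
n_missing_x <- sum(is.na(df$x))

# Count missing values in every column at once
missing_per_column <- colSums(is.na(df))

# Drop every row that has at least one NA in any column
clean <- na.omit(df)
```

Note that na.omit drops a row if any column is missing, so it can remove more rows than a single column's NA count suggests.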
How to perform Best Subset Selection
library(leaps)
regfit.full = regsubsets(Salary~., Hitters)
# By default, this only tests up to the best eight-variable model.
# You can override this by using the nvmax option:
regfit.full = regsubsets(Salary~., data = Hitters, nvmax = 19)
How to interpret output of best subset model
It returns the best model for each number of variables n. For example, the row for the two-variable model marks the two selected columns with an asterisk (view the table with summary(regfit.full)).
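A minimal sketch of reading the results programmatically, assuming the leaps package is installed. It uses the built-in mtcars data in place of Hitters so that it runs without extra data packages; `fit`, `s`, and `best_size` are hypothetical names.

```r
library(leaps)

# Best subset selection on mtcars (10 candidate predictors for mpg)
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
s <- summary(fit)

# Logical matrix: one row per model size, TRUE where a variable is included
# (these are the cells shown as asterisks in the printed summary)
s$which

# Pick the model size with the highest adjusted R^2 and show its coefficients
best_size <- which.max(s$adjr2)
coef(fit, id = best_size)
```

Other criteria stored in the summary, such as s$cp and s$bic, can be used the same way with which.min.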
How to plot the results of best subset selection
plot(regfit.full, scale = "adjr2")
See ?plot.regsubsets for help.
Do forward/backward stepwise selection
Use the method parameter in the regsubsets function:
regfit.fwd = regsubsets(Salary~., data = Hitters, method = "forward")
regfit.bwd = regsubsets(Salary~., data = Hitters, method = "backward")
Create a training/test set
set.seed(1)
train = sample(c(TRUE, FALSE), nrow(Hitters), replace = TRUE)
test = (!train)
regfit.best = regsubsets(Salary~., data = Hitters[train, ])
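The same logical-vector split can be sketched end to end in base R. This example swaps in the built-in mtcars data and a plain lm fit (an assumption made so it runs without extra packages); `fit`, `pred`, and `test_mse` are hypothetical names.

```r
set.seed(1)
n <- nrow(mtcars)

# Randomly flag each row as training (TRUE) or test (FALSE)
train <- sample(c(TRUE, FALSE), n, replace = TRUE)
test <- !train

# Fit on the training rows only
fit <- lm(mpg ~ wt + hp, data = mtcars[train, ])

# Evaluate on the held-out rows
pred <- predict(fit, newdata = mtcars[test, ])
test_mse <- mean((mtcars$mpg[test] - pred)^2)
```

Because each row is assigned independently, the split is only roughly 50/50; the exact sizes depend on the seed.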
Create a matrix of X's, a data structure used by many regression packages
model.matrix(Formula, Dataframe)
This function creates a matrix with an intercept column of 1s and dummy-codes factor variables for you. Many regression functions use it behind the scenes to train and predict values. You must call it yourself when a fitting function has no predict method (like regsubsets).
Remember: Formula = Y~.
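A small base R sketch of what model.matrix produces and how it supports manual prediction. The data frame `toy` and its columns are made up for illustration, and lm stands in for a model whose predictions are rebuilt by hand.

```r
# Hypothetical toy data with one numeric and one factor predictor
toy <- data.frame(
  y = c(10, 12, 15, 11, 18, 20),
  size = c(1, 2, 3, 1, 4, 5),
  group = factor(c("a", "b", "c", "a", "b", "c"))
)

# Design matrix: intercept column of 1s plus dummy columns for the factor
X <- model.matrix(y ~ ., data = toy)
colnames(X)  # "(Intercept)" "size" "groupb" "groupc"

# Manual prediction: design matrix times the coefficient vector
fit <- lm(y ~ ., data = toy)
manual_pred <- drop(X %*% coef(fit))

# Check that the manual predictions match lm's own fitted values
matches <- isTRUE(all.equal(unname(manual_pred), unname(fitted(fit))))
```

For a regsubsets fit, the same multiplication can be done with the coefficients from coef(regfit.best, id = k) and the matching columns of X, selected by name.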
What should you type in the R console to install the “car” package?
install.packages("car")
What should you type to create a matrix "a" comprising all natural numbers from 1 to 10, with 2 rows and 5 columns?
a = matrix(1:10, 2, 5)
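A short sketch of how the values are laid out: matrix() fills column by column unless byrow = TRUE is given. The name `b` is hypothetical and only there for comparison.

```r
# Default: values fill down each column first
a <- matrix(1:10, nrow = 2, ncol = 5)
a[1, ]  # 1 3 5 7 9
a[2, ]  # 2 4 6 8 10

# byrow = TRUE fills across each row instead
b <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)
b[1, ]  # 1 2 3 4 5
```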
PCR in R
library(pls)
set.seed(2)
pcr.fit = pcr(Salary~., data = Hitters, scale = TRUE, validation = "CV")
PLS in R
library(pls)
set.seed(1)
pls.fit = plsr(Salary~., data = Hitters, subset = train, scale = TRUE, validation = "CV")
summary(pls.fit)
PLS vs. PCR
PCR builds its components (the principal components) to capture as much variance in the predictors as possible, without looking at the response. PLS instead searches for components that explain variance in BOTH the predictors and the response. As a result, PLS often needs fewer components than PCR to explain the same amount of variance in the response.
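The difference in how the first component is built can be sketched by hand in base R. This is an illustration under assumptions, not how the pls package is implemented: it uses the built-in mtcars data, four arbitrarily chosen predictors, and hypothetical names (`pc1`, `pls1`, `r2_pcr`, `r2_pls`).

```r
# Standardized predictors and centered response
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# PCR direction: first principal component, computed from X alone
pc1 <- prcomp(X)$x[, 1]

# PLS direction: weights proportional to each predictor's covariance with y
w <- drop(crossprod(X, y))
w <- w / sqrt(sum(w^2))
pls1 <- drop(X %*% w)

# Share of variance in y explained by each single component
r2_pcr <- summary(lm(y ~ pc1))$r.squared
r2_pls <- summary(lm(y ~ pls1))$r.squared
```

Comparing r2_pcr and r2_pls shows how much of the response a single component of each kind explains on this data; the comparison can come out differently on other data sets.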
polynomial regression
fit = lm(wage~poly(age, 4), data=Wage)
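By default, poly() produces orthogonal polynomial terms, so the coefficients are not the coefficients on age, age^2, and so on; raw = TRUE gives those instead. The sketch below uses simulated data in place of the Wage data set (an assumption so it runs without extra packages); `sim`, `fit_orth`, and `fit_raw` are hypothetical names.

```r
set.seed(1)

# Simulated stand-in for the Wage data
age <- runif(200, 18, 80)
wage <- 50 + 3 * age - 0.03 * age^2 + rnorm(200, sd = 10)
sim <- data.frame(age, wage)

# Orthogonal polynomial basis (the default)
fit_orth <- lm(wage ~ poly(age, 4), data = sim)

# Raw polynomial basis: age, age^2, age^3, age^4
fit_raw <- lm(wage ~ poly(age, 4, raw = TRUE), data = sim)

# The coefficients differ, but the fitted curves are expected to agree
# up to floating-point error
max_diff <- max(abs(fitted(fit_orth) - fitted(fit_raw)))
```

The choice of basis changes how the coefficients are interpreted, not the predictions.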