Intro Flashcards
Load libraries into R
library(caret)   # package names are case-sensitive
Edit data in R
fix(DataFrame)   # opens a spreadsheet-style editor
View names in data frame
names()
Load variable names into the environment so you don't have to type the data frame name before each column
attach(DataFrame)
basic linear model function R
lm(y ~ x, data = DataFrame)
display statistics on model
First assign the model output to a variable: lm.output = lm(y ~ x, data = DataFrame)
Then, summary(lm.output)
Show information after fitting a model
names(model.output)
summary(model.output)
Show model coefficients and confidence intervals
coef(model.output) #shows the coeff
confint(model.output) #shows the 95% conf interval for the coefficients
Use model to predict new values
predict()
predict(model.output, DataFrameOfXs, interval = "confidence")
Prediction vs. Confidence Interval
When predicting a new data point, you want the prediction interval. The confidence interval describes where the average response at that x lies; the prediction interval also accounts for the noise in a single observation, so it is wider.
To get the PI: predict(model.output, DataFrameOfXs, interval = "prediction")
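As a quick illustration (a minimal sketch using R's built-in cars dataset), the prediction interval comes out wider than the confidence interval at the same x:
fit = lm(dist ~ speed, data = cars)
predict(fit, data.frame(speed = 15), interval = "confidence")   # interval for the mean response
predict(fit, data.frame(speed = 15), interval = "prediction")   # wider interval for one new point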
Scatter Plot with Regression Line
plot(x, y)
abline(model.output, lwd=3, col= “red”) #adds line to scatterplot
lwd sets the line width
See diagnostic plots of linear regression
plot(model.output)   # the model object contains everything needed, so plot() produces the diagnostics automatically
Since this draws 4 graphs, first split the plotting window into a 2x2 grid:
par(mfrow=c(2,2))
how does predict() work
predict(model.output) returns a vector of predicted Y values for the training data
predict(model.output, newdata) returns predictions for the rows of newdata (as in the cards above)
inspect functions
Type the function name without parentheses (e.g. predict) to print its source
If the function dispatches to methods, use methods(FunctionName), e.g. methods(predict), to list them
Get max of vector
which.max(vector) returns the index of the maximum value
Shorthand formula for regression in R
lm(Yvariable ~ ., data = DataFrame)
Instead of writing x1 + x2 + etc., you can just put a dot, which stands for all the other columns.
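For example (a minimal sketch using the built-in mtcars data):
fit = lm(mpg ~ ., data = mtcars)         # regress mpg on every other column
fit2 = lm(mpg ~ . - cyl, data = mtcars)  # the dot also combines with '-' to drop a predictor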
Function to use when there is Collinearity
Look at the Variance Inflation Factor (VIF), part of the car package.
library(car)
vif(lm.fit) #use on model output
remember: VIF > 5 indicates collinearity
How to see a correlation matrix
cor() requires all columns to be numeric; if a column isn't numeric, drop it with matrix notation, e.g. cor(data.frame[, -9])
Logistic Regression R
logreg = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
The key is family = binomial, using the glm() generalized linear model function.
When doing logistic regression on a factor variable with two levels, how do we handle the dummy coding?
We don't have to do anything; glm() will dummy code it automatically. However, you can retrieve the dummy coding values with contrasts():
attach(Smarket)
contrasts(Direction)
How to use logistic regression output to predict values in your dataset
predict(logreg, type = "response")
type = "response" tells R to output probabilities instead of the default log-odds.
How to convert Logistic regression probabilities into actual predictions.
- First get the probabilities: logreg.probs = predict(logreg, type = "response")
- Create a vector filled with the default class label: logreg.predict = rep("Down", 1250)
- Relabel the entries whose probability exceeds 0.5: logreg.predict[logreg.probs > .5] = "Up"
How to create a confusion matrix
use the table function
table(VectorOfPredictions, VectorOfTrueValues)
table(logreg.predict, Direction)
The two vectors must use the same labels (e.g. "Up"/"Down"), so make sure they are coded the same way.
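A common follow-up (not shown above, but a standard next step) is to compute overall accuracy from the same two vectors:
mean(logreg.predict == Direction)   # fraction of predictions that match the true values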
sub select training data in time series
train = (Year < 2005)               # stores a logical vector
Smarket.2005 = Smarket[!train, ]    # data from 2005 onward (the test set)
Direction.2005 = Direction[!train]
How to train a logistic regression model on a subset of the data
Use the subset argument to glm
logreg.train = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial, subset = train)
Then predict using the predict function
making prediction on test set, logistic regression
predict(logreg.train, Smarket.2005, type = "response")
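Putting the pieces together, a minimal sketch of the full test-set evaluation using the objects defined in the previous cards:
logreg.probs = predict(logreg.train, Smarket.2005, type = "response")
logreg.pred = rep("Down", nrow(Smarket.2005))   # default to the baseline class
logreg.pred[logreg.probs > .5] = "Up"           # relabel where P(Up) > 0.5
table(logreg.pred, Direction.2005)              # test-set confusion matrix
mean(logreg.pred == Direction.2005)             # test-set accuracy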
LDA with R
library(MASS) #library for LDA
lda.fit = lda(Direction~Lag1+Lag2, data=Smarket, subset=train)
lda.pred = predict(lda.fit, Smarket.2005)
names(lda.pred)
lda.class = lda.pred$class #uses 50% probability
lda.probs = lda.pred$posterior #these are the probabilities
How to center and scale variables for distance based learning methods KNN, K Means, etc
scale()
gives each variable a mean of zero and stdev of 1
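A minimal sketch (the column layout is hypothetical; drop any non-numeric columns before scaling):
standardized.X = scale(DataFrame[, -1])   # assumes column 1 is the outcome
apply(standardized.X, 2, mean)            # each column mean is now ~0
apply(standardized.X, 2, sd)              # each column sd is now 1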
create vector
a = c('Hamel', 'Bob', 'Tom') OR a = 1:10 OR a = c(1:10)
Create a matrix
matrix(data, nrow, ncol)
matrix(1:6, 2, 3)
see attributes of an object
attributes()
combine two vectors of equal length into data frame
rbind(x, y) OR
cbind(x, y)   # note: these return a matrix; wrap in data.frame(x, y) if you need a data frame
* Trick question: the vectors don't have to be equal length; the shorter one is recycled (repeated) to match the longer one, as the demo below shows.
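A quick demo of the recycling behavior:
cbind(1:6, 1:2)   # the shorter vector 1:2 is recycled to 1,2,1,2,1,2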
Factor Vector
x = factor(VectorOfValues)
The order of the levels can be set using the levels argument to factor(). This can be important in linear modelling because the first level is used as the baseline level.
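For example, to make "Low" the baseline level instead of the alphabetical default:
x = factor(c("High", "Low", "Low", "High"), levels = c("Low", "High"))
levels(x)   # "Low" "High" -- "Low" is now the baseline in a model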
find missing values
is.na()
returns boolean vector
differences b/w matrices and data frames in R
In matrices, the entire matrix has to be the same class, whereas a data frame can store different classes of objects in each column.
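A quick demo of the difference:
m = cbind(1:3, c("a", "b", "c"))                 # matrix: the numbers get coerced to character
df = data.frame(n = 1:3, s = c("a", "b", "c"))   # data frame: each column keeps its own class
sapply(df, class)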
convert data frame to matrix
data.matrix()
count # of columns and rows
nrow(DataFrame)
ncol(DataFrame)
how to change names of columns in R
names(DataFrame) = c("NewName1", "NewName2", ...)
find the class or data type of each column
sapply(dataframe, class)
connections to files in R
- file() opens a connection to a file (a Python-like file interface)
- gzfile() opens a connection to a gzipped file
- url() opens a connection to a URL
see ?file
Readlines from a file
con <- file("foo.txt")   # hypothetical file name
x <- readLines(con, 10)  # read the first 10 lines
readlines from webpage
con <- url("http://example.com")   # hypothetical URL
x <- readLines(con)
see ?file
Find number of missing values in column
sum(is.na(Dataframe$column))
what is model.matrix()
model.matrix() creates the model matrix that regression models normally build behind the scenes to make predictions: it converts your dataset to a model-friendly format, adding a column of 1's for the intercept and dummy coding all factor variables.
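For example, using the built-in iris data (Species is a factor):
head(model.matrix(~ ., data = iris))   # note the (Intercept) column of 1's and the dummy-coded Species columns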
What do you do when you perform 10 k-fold cross validation and you are left with 10 different error terms for a particular tuning parameter
You take the mean of the statistic in question across all 10 folds, because no single fold's error is reliable by itself.
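A minimal sketch using boot::cv.glm, which does the per-fold averaging for you, on the built-in mtcars data:
library(boot)                        # boot ships with R
fit = glm(mpg ~ wt, data = mtcars)   # glm with the default family is just least squares
cv.err = cv.glm(mtcars, fit, K = 10)
cv.err$delta[1]                      # the mean of the 10 per-fold errors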
What happens after you find the best model using k-fold cross validation or LOOCV?
Once you find the best model, you should fit the model over the ENTIRE dataset as a last step and use that model. You only used CV so that you could find the optimal model that generalizes, but you can go ahead and fit the model over the entire dataset when you are done.
what package has lasso and ridge regression
library(glmnet)
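A minimal sketch (using the built-in mtcars data; glmnet needs a numeric x matrix, not a formula):
library(glmnet)
x = model.matrix(mpg ~ ., data = mtcars)[, -1]   # drop the intercept column
y = mtcars$mpg
ridge = glmnet(x, y, alpha = 0)    # alpha = 0 -> ridge
lasso = glmnet(x, y, alpha = 1)    # alpha = 1 -> lasso
cv.out = cv.glmnet(x, y, alpha = 1)
cv.out$lambda.min                  # lambda chosen by cross-validation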