Intro Flashcards by Hamel Husain

Load libraries into R

library(caret) #package names are case-sensitive

Edit data in R

fix()

View names in data frame

names()

Load column names into the environment so you don't have to type the data frame name before each column

attach(DataFrame)

Basic linear model function in R

lm(y ~ x, data = DataFrame)

Display statistics on a model

First store the model output in a variable: lm.output = lm(y ~ x, data = DataFrame)
Then call summary(lm.output)

Show information after fitting a model

names(model.output)

summary(model.output)

Show model coefficients and confidence intervals

coef(model.output) #shows the coeff

confint(model.output) #shows the 95% conf interval for the coefficients
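
A minimal sketch of the fit, summary, coefficient, and interval steps, assuming R's built-in cars data set (columns speed and dist) as a stand-in for DataFrame. The names coefs and ci are only illustrative.

```r
# Fit a simple linear regression on the built-in cars data (assumed stand-in data)
model.output <- lm(dist ~ speed, data = cars)

print(summary(model.output))   # coefficient table, R-squared, F-statistic
print(names(model.output))     # components stored in the fitted object

coefs <- coef(model.output)    # named vector: (Intercept), speed
ci <- confint(model.output)    # matrix of 95% limits, one row per coefficient
print(coefs)
print(ci)
```

Each estimate in coefs sits between the lower and upper limit in the matching row of ci.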

Use model to predict new values

predict() #lowercase; R is case-sensitive

predict(model.output, DataFrameOfXs, interval = "confidence")

Prediction vs. Confidence Interval

When predicting a single new data point, you want a prediction interval. A confidence interval is about where the average of future values lies.
To get a PI: predict(model.output, DataFrameOfXs, interval = "prediction")
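
A small sketch comparing the two interval types, assuming the built-in cars data set and two made-up speed values. The names new.x, conf.int, and pred.int are hypothetical.

```r
fit <- lm(dist ~ speed, data = cars)
new.x <- data.frame(speed = c(10, 20))   # column name must match the predictor

conf.int <- predict(fit, new.x, interval = "confidence")  # interval for the mean response
pred.int <- predict(fit, new.x, interval = "prediction")  # interval for one new observation

# Both share the same fitted value; only the interval widths differ
ci.width <- conf.int[, "upr"] - conf.int[, "lwr"]
pi.width <- pred.int[, "upr"] - pred.int[, "lwr"]
print(cbind(ci.width, pi.width))
```

The prediction interval is wider because it adds the noise of an individual observation on top of the uncertainty in the fitted mean.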

Scatter Plot with Regression Line

plot(x, y)
abline(model.output, lwd=3, col="red") #adds the fitted line to the scatterplot

lwd sets the line width.

See diagnostic plots of linear regression

plot(model.output) #draws the diagnostic plots automatically, because the model output contains everything needed
There are 4 graphs; to see them all at once, first create a 2x2 grid of tiles:
par(mfrow=c(2,2))

how does predict() work

predict(model.output) will return a vector of predicted Y values
predict(model.output,

inspect functions

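Type the function name without parentheses, e.g. predict, to print its source.

If the body is only a call to UseMethod(), the function is generic; list its methods with methods(functionname), e.g. methods(predict).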

Get the position of the max of a vector

which.max(vector) returns the index of the (first) maximum; max(vector) returns the maximum value itself.
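
A tiny illustration with a made-up vector v that contains a tie:

```r
v <- c(3, 9, 4, 9)
idx <- which.max(v)   # position of the first maximum
top <- max(v)         # the maximum value itself
print(idx)
print(v[idx] == top)  # indexing with which.max recovers the max
```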

Shorthand formula for regression in R

lm.fit = lm(Yvariable ~ ., data = DATAFRAME)

Instead of writing x1 + x2 + etc. you can just put a dot, which stands for every other column in the data frame.
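
A short sketch of the dot shorthand on the built-in mtcars data (an assumed example data set). Subtracting terms from the dot is not mentioned on the card, but it is standard formula syntax.

```r
full.fit  <- lm(mpg ~ ., data = mtcars)               # every other column is a predictor
small.fit <- lm(mpg ~ . - cyl - disp, data = mtcars)  # the dot, minus two columns

print(length(coef(full.fit)))    # intercept plus one coefficient per remaining column
print(names(coef(small.fit)))    # cyl and disp are absent
```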

Function to use when there is Collinearity

Check the Variance Inflation Factor (VIF), available in the car package.
library(car)
vif(lm.fit) #use on model output
Rule of thumb: a VIF above 5 (some use 10) indicates problematic collinearity.
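
To see what the number means, here is a base-R sketch that computes the VIF by hand from its definition, 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others. It assumes three numeric mtcars columns as the predictors and does not need the car package; for a model with only numeric predictors the values should match what vif() reports.

```r
# Assumed predictor set: three numeric columns from the built-in mtcars data
predictors <- mtcars[, c("disp", "hp", "wt")]

manual.vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  # Regress predictor v on the remaining predictors and take the R-squared
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
print(manual.vif)
```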

How to see a correlation matrix

cor(DataFrame) | all columns must be numeric. If a column isn't numeric, drop it with index notation, e.g. cor(DataFrame[, -9]) to exclude column 9.
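
A quick sketch on the built-in iris data, where column 5 (Species) is a factor. The second form, which selects numeric columns with sapply(), is an addition not on the card.

```r
cor.mat <- cor(iris[, -5])                          # drop the non-numeric column by position
cor.mat2 <- cor(iris[, sapply(iris, is.numeric)])   # or keep only the numeric columns
print(round(cor.mat, 2))
```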

Logistic Regression R

logreg = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
The key is family = binomial, passed to glm(), the generalized linear model function.

When doing logistic regression on a factor variable with two levels, how do we handle the dummy coding?

We don't have to do anything; glm() dummy codes the factor automatically. You can retrieve the dummy coding values with contrasts():
attach(Smarket)
contrasts(Direction)

How to use logistic regression output to predict values in your dataset

predict(logreg, type = 'response')

type = 'response' tells R to output probabilities, P(Y = 1 | X), instead of the default log-odds.

How to convert Logistic regression probabilities into actual predictions.

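Create a vector filled with the class that corresponds to a low probability, one entry per observation (Smarket has 1250 rows):
logreg.predict = rep('Down', 1250)
Then relabel the entries whose predicted probability exceeds 0.5, where logreg.probs = predict(logreg, type = 'response'):
logreg.predict[logreg.probs > .5] = 'Up'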

How to create a confusion matrix

Use the table function:
table(VectorOfPredictions, VectorOfTrueValues)
table(logreg.predict, Direction)

The two vectors must use the same labels, such as "Up"/"Down", so convert them to the same values first.
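
The logistic regression cards can be strung together in one runnable sketch. Smarket lives in the ISLR package, so this version assumes the built-in mtcars data instead, with a hypothetical two-level factor gearbox standing in for Direction.

```r
# am (0/1) recoded as a two-level factor; a stand-in for Direction
cars2 <- transform(mtcars, gearbox = factor(am, labels = c("Auto", "Manual")))
print(contrasts(cars2$gearbox))                      # shows which level is coded as 1

logreg <- glm(gearbox ~ wt, data = cars2, family = binomial)
logreg.probs <- predict(logreg, type = "response")   # P(gearbox == "Manual")

logreg.predict <- rep("Auto", nrow(cars2))           # start with the baseline class
logreg.predict[logreg.probs > 0.5] <- "Manual"       # relabel where probability > 0.5

conf.mat <- table(logreg.predict, cars2$gearbox)     # rows: predicted, columns: truth
print(conf.mat)
print(mean(logreg.predict == cars2$gearbox))         # fraction classified correctly
```

Using nrow(cars2) rather than a hard-coded length keeps the prediction vector the right size if the data changes.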

sub select training data in time series

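train = (Year<2005) #Stores a logical vector: TRUE for rows before 2005 (requires attach(Smarket))
Smarket.2005 = Smarket[!train, ] #Test set: the rows NOT in train, i.e. the 2005 data
Direction.2005 = Direction[!train] #True labels for the test set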

How to train a logistic regression model on a subset of the data

Use the subset argument to glm():
logreg.train = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial, subset=train)
Then predict using the predict function.

making prediction on test set, logistic regression

predict(logreg.train, Smarket.2005, type='response') #pass the model fitted on the training subset and the held-out data
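
A sketch of the same train/test pattern on data that ships with R. It assumes the built-in airquality data, a made-up binary response Hot (Temp above 80), and September as the held-out period; none of these names come from the cards.

```r
aq <- transform(airquality, Hot = factor(ifelse(Temp > 80, "Yes", "No")))
train <- aq$Month < 9              # logical vector: TRUE for May through August
aq.test <- aq[!train, ]            # held-out rows: September
Hot.test <- aq$Hot[!train]         # true labels for the held-out rows

fit.train <- glm(Hot ~ Wind, data = aq, family = binomial, subset = train)
test.probs <- predict(fit.train, aq.test, type = "response")   # P(Hot == "Yes")

test.pred <- ifelse(test.probs > 0.5, "Yes", "No")
print(table(test.pred, Hot.test))
print(mean(test.pred == Hot.test)) # test-set accuracy
```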

LDA with R

library(MASS) #library for LDA
lda.fit = lda(Direction~Lag1+Lag2, data=Smarket, subset=train)
lda.pred = predict(lda.fit, Smarket.2005)
names(lda.pred)
lda.class = lda.pred$class #uses 50% probability
lda.probs = lda.pred$posterior #these are the probabilities
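
The same LDA steps as a runnable sketch, assuming the built-in iris data (three classes rather than two) and an arbitrary random split. MASS ships with standard R installations.

```r
library(MASS)                                   # provides lda()
set.seed(1)                                     # arbitrary seed, for reproducibility
train <- sample(nrow(iris), 100)                # row indices used for fitting

lda.fit <- lda(Species ~ Petal.Length + Petal.Width, data = iris, subset = train)
lda.pred <- predict(lda.fit, iris[-train, ])    # list with class, posterior, x
lda.class <- lda.pred$class                     # most probable class per row
lda.probs <- lda.pred$posterior                 # matrix of class probabilities
print(table(lda.class, iris$Species[-train]))
```

With more than two classes, class is simply the level with the highest posterior probability in each row.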

How to center and scale variables for distance based learning methods KNN, K Means, etc

scale() | gives each variable a mean of zero and stdev of 1
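
A quick check of what scale() does, assuming three numeric columns from the built-in mtcars data:

```r
scaled <- scale(mtcars[, c("mpg", "hp", "wt")])  # returns a numeric matrix
print(round(colMeans(scaled), 10))               # column means are 0 up to rounding error
print(apply(scaled, 2, sd))                      # column standard deviations are 1
```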

create vector

```
a = c('Hamel', 'Bob', 'Tom')
a = 1:10
a = c(1:10)   # same result as 1:10
```

Create a matrix

matrix(data, nrow, ncol) | e.g. matrix(1:6, 2, 3) fills a 2x3 matrix column by column

see attributes of an object

attributes()

combine two vectors of equal length into data frame

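data.frame(x, y) builds a data frame with one column per vector. rbind(x,y) OR cbind(x,y) also combine the vectors, but they return a matrix. * Trick question: for rbind and cbind the vectors don't have to be equal length; the shorter vector is recycled to match the longer one (with a warning if the longer length isn't a multiple of the shorter).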
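
A small sketch of the difference, using made-up vectors x and y:

```r
x <- 1:4
y <- c("a", "b", "c", "d")

df <- data.frame(x, y)     # data frame: columns may have different types
m <- cbind(x, y)           # matrix: everything is coerced to one type (character here)
short <- cbind(1:4, 1:2)   # shorter vector recycled: second column is 1, 2, 1, 2

print(class(df))
print(class(m))
print(short)
```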

Factor Vector

x )) The order of the levels can be set using the levels argument to factor(). This can be important in linear modelling because the first level is used as the baseline level.
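
A sketch of how the levels argument changes the baseline, using a made-up vector answers (not the card's original example, which is truncated):

```r
answers <- c("yes", "no", "yes", "yes")

f.default <- factor(answers)                           # levels sorted alphabetically
f.custom <- factor(answers, levels = c("yes", "no"))   # "yes" becomes the baseline level

print(levels(f.default))
print(levels(f.custom))
print(table(f.custom))     # counts are reported in level order
```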

find missing values

is.na() | returns boolean vector

differences b/w matrices and data frames in R

In a matrix, every element has to be the same class, whereas a data frame can store a different class of object in each column.

convert data frame to matrix

data.matrix()
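
A short sketch contrasting data.matrix() with as.matrix() on a made-up data frame with mixed column types. as.matrix() is not on the card; it is included only for comparison.

```r
df <- data.frame(id = 1:3, grp = factor(c("a", "b", "a")))

m1 <- as.matrix(df)     # mixed columns collapse to a single class: character
m2 <- data.matrix(df)   # numeric matrix; factors are replaced by their integer codes

print(m1)
print(m2)
```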

count # of columns and rows

nrow(DataFrame) | ncol(DataFrame)

how to change names of columns in R

names(x) <- c(...) | assign a character vector of new names, one per column

find the class or data type of each column

sapply(dataframe, class)
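
The last few inspection cards in one sketch, assuming the first rows of the built-in iris data as the example data frame:

```r
df <- head(iris)                   # copy a few rows so the built-in data stays untouched
print(c(nrow(df), ncol(df)))       # number of rows, number of columns

names(df)[5] <- "species"          # rename a single column by position
col.classes <- sapply(df, class)   # named character vector: one class per column
print(col.classes)
```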

connections to files in R

1. file() - opens a connection to a file (a Python-like file interface)
2. gzfile() - opens a connection to a gzipped file
3. url() - opens a connection to a web page
See ?file

Readlines from a file

con <- file(filename), then x <- readLines(con, 10) | reads the first 10 lines
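
A self-contained sketch of reading through a connection. It writes a temporary file first so there is something to read; the file and its contents are made up for the demo.

```r
tmp <- tempfile(fileext = ".txt")      # hypothetical file created just for this demo
writeLines(paste("line", 1:20), tmp)

con <- file(tmp, "r")                  # open a read connection
first10 <- readLines(con, 10)          # first 10 lines
next5 <- readLines(con, 5)             # an open connection remembers its position
close(con)                             # close connections you open
unlink(tmp)                            # remove the temporary file

print(first10)
print(next5)
```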

readlines from webpage

con <- url(address), then x <- readLines(con) | see ?file

Find number of missing values in column

sum(is.na(Dataframe$column))
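
A small sketch on a made-up data frame. colSums() over is.na() is an extra that counts every column at once.

```r
df <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", NA, "z"))

print(is.na(df$a))          # logical vector, TRUE where the value is missing
n.missing <- sum(is.na(df$a))   # TRUE counts as 1, so the sum is the number of NAs
per.column <- colSums(is.na(df))
print(n.missing)
print(per.column)
```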

what is model.matrix()

```
# Create a model matrix, which is normally used behind the scenes
# by regression models to predict things. It just converts your dataset
# to a model-friendly format: it adds a column of 1's for the intercept
# and dummy codes all factor variables.
```
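
A quick look at the result, assuming the built-in iris data with one factor predictor and one numeric predictor:

```r
mm <- model.matrix(Sepal.Length ~ Species + Petal.Width, data = iris)
print(colnames(mm))   # intercept column, two Species dummies, then Petal.Width
print(head(mm, 3))
```

Species has three levels, so it becomes two dummy columns; the first level (setosa) is the baseline absorbed by the intercept.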

What do you do when you perform 10-fold cross validation and you are left with 10 different error estimates for a particular tuning parameter value

You take the mean of the statistic in question across all 10 folds. You average the folds together because no single fold is a reliable estimate by itself.

What happens after you find the best model using k-fold cross validation or LOOCV?

Once you find the best model, you should fit the model over the ENTIRE dataset as a last step and use that model. You only used CV so that you could find the optimal model that generalizes, but you can go ahead and fit the model over the entire dataset when you are done.
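
A minimal base-R sketch of the whole loop: average the fold errors for each candidate, pick the best, then refit on everything. It assumes the built-in mtcars data, plain lm() models, and two made-up candidate formulas; packages such as caret automate the same steps.

```r
set.seed(42)                                            # arbitrary seed
k <- 10
folds <- sample(rep(1:k, length.out = nrow(mtcars)))    # random fold label per row

cv.mse <- function(form) {
  fold.mse <- sapply(1:k, function(i) {
    fit <- lm(form, data = mtcars[folds != i, ])        # train on the other k-1 folds
    held.out <- mtcars[folds == i, ]
    mean((held.out$mpg - predict(fit, held.out))^2)     # error on the held-out fold
  })
  mean(fold.mse)                                        # average the k fold errors
}

candidates <- list(mpg ~ wt, mpg ~ wt + hp)             # hypothetical candidate models
scores <- sapply(candidates, cv.mse)
best <- candidates[[which.min(scores)]]
final.fit <- lm(best, data = mtcars)                    # refit the winner on ALL rows
print(scores)
print(coef(final.fit))
```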

what package has lasso and ridge regression

library(glmnet)

Intro Flashcards

(47 cards)