Chp3 Regression Flashcards
Linear Regression Theory
A statistical process for estimating the relationship among variables
What two kinds of variables are in regression?
Response Variable (dependent variable y)
Predictor Variables (independent variables, X)
What is regression used for?
Predicting and forecasting
What does linear regression try to do?
Use past data to predict future outcomes
What are model coefficients/parameters/weights?
The numbers the model learns; each one is multiplied by its input value, and the products are summed (plus the intercept) to generate the predicted response
What is the error between the observed response Yi and the predicted response Yhat called?
Residual
Residual Formula
Yi - Yhat
observed response - predicted response
Residual Sum of Squares
RSS, the sum of the squared residuals
Residual formula (simple linear regression)
Ei = Yi - Yhat_i = Yi - (Bhat0 + Bhat1*Xi)
Which is the best regression line?
The one that minimizes the sum of squared residuals
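Worked example (not from the original cards): a minimal Python sketch of fitting the least-squares line for one predictor and checking that nudging the line in either direction raises the RSS. The data values and the function names fit_simple_ols and rss are made up for illustration.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) that minimize the residual sum of squares for y ~ b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


def rss(x, y, b0, b1):
    """Residual sum of squares: sum of (observed - predicted)^2."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
best_rss = rss(x, y, b0, b1)
shifted_rss = rss(x, y, b0 + 0.1, b1)   # same slope, different intercept
tilted_rss = rss(x, y, b0, b1 - 0.05)   # same intercept, different slope

print(b0, b1)
print(best_rss, shifted_rss, tilted_rss)
```

Both perturbed lines give a larger RSS than the fitted one, which is what "best regression line" means on this card.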
Multiple R-squared
Never decreases as you add more predictors, because each added predictor can only lower (or leave unchanged) the RSS on the training data. Every predictor nudges multiple R-squared up, but not every predictor is a good predictor
Adjusted R-squared
Penalizes the number of predictors, so it only goes up when an added predictor actually improves the fit enough to justify it. As you add predictors, both multiple R-squared and adjusted R-squared rise at first, but at some point adjusted R-squared plateaus and drops; stop adding variables at that point
Adjusted R-squared shows
When adding more predictors makes it worse
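Worked example (not from the original cards): a small Python sketch of the usual adjusted R-squared formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), for n observations and p predictors. The R-squared values below are hypothetical numbers chosen to show the drop.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for a model with p predictors fit on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


n = 30  # hypothetical sample size

# Hypothetical fits: a 4th predictor raises multiple R^2 only slightly.
adj_3 = adjusted_r_squared(0.800, n, p=3)
adj_4 = adjusted_r_squared(0.802, n, p=4)

print(adj_3, adj_4)
```

Multiple R-squared went up (0.800 to 0.802), but adjusted R-squared went down, which signals that the fourth predictor was not worth adding.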
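F statistic
Captures how much better the model as a whole fits than an intercept-only model; the bigger, the stronger the evidence that at least one predictor is useful
Degrees of freedom
How much wiggle room you have in your data set: the number of independent pieces of information left over for estimating error (n - p - 1 for n observations and p predictors)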
In hypothesis testing, what must be true to fail to reject the null hypothesis H0?
P-value > alpha
In hypothesis testing, what must be true to reject the null hypothesis?
P-value <= alpha
What is alpha in hypothesis testing?
The probability of rejecting the null hypothesis given that the null hypothesis is true
What is the p-value in hypothesis testing?
The probability of getting a result at least as extreme as the one you observed, given that the null hypothesis is true
What are the only two outcomes from hypothesis testing?
- Reject H0 in favor of H1
- Do not reject H0
In hypothesis testing, we never accept _
H0 (we only fail to reject it)
If we are looking if a drug has an effect, what is null and alternate hypothesis?
null - drug has no effect
alt - drug has some effect
What are the four questions to evaluate the fit of a regression model?
Is at least one of the predictors useful in predicting the response?
Do all the predictors help explain the response, or is only a subset of the predictors useful?
How well does the model fit the data?
Given a set of predictor values, what response value should we predict and how accurate is our prediction?
What is the hypothesis test to determine if at least one predictor is useful in predicting the response?
H0: all betas are 0
H1: at least one beta is nonzero
What does it mean if the F statistic is close to 1?
The data show no evidence of a relationship between response and predictors; this is what we expect when H0 is true
What is the F statistic if H1 is true?
F-statistic much greater than 1
If at least one predictor is useful, what does that mean about the p value associated with the f statistic?
it is very small
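Formula for reference (not from the original cards): the F statistic for the overall test, written with the RSS defined earlier. TSS (the total sum of squares), n (number of observations), and p (number of predictors) are symbols introduced here.

```latex
F = \frac{(\mathrm{TSS} - \mathrm{RSS}) / p}{\mathrm{RSS} / (n - p - 1)},
\qquad
\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2
```

When the predictors explain nothing, the numerator and denominator estimate the same quantity and F lands near 1; when at least one predictor helps, the numerator grows and F exceeds 1.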
What is hypothesis testing for determining if all the predictors explain the response, or if only a subset of predictors are useful?
For each predictor
H0: Bi = 0 (there is no relationship between predictor and response)
H1: Bi != 0 (There is some relationship between predictor and response)
What p-value makes a predictor useful? Not useful?
Useful: a very low p-value (below alpha)
Not useful: a p-value that is not low means we cannot reject the null hypothesis for that predictor
T value
Tells you how many standard errors your estimated beta value is from 0
T value: the farther from 0
the better (stronger evidence that the predictor is related to the response)
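Formula for reference (not from the original cards): the t value for predictor j, where SE denotes the standard error of the estimated coefficient.

```latex
t_j = \frac{\hat{\beta}_j - 0}{\mathrm{SE}(\hat{\beta}_j)}
```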
In the example, what is the significance of radio and newspaper being correlated in the correlation matrix?
When radio spending goes up, newspaper spending tends to go up too. Newspaper itself is not very correlated with sales, so an apparent newspaper effect may really be radio's effect, and extra newspaper spending is not justified
Given p predictors, how many models could we create?
2^p
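Worked example (not from the original cards): a short Python sketch that lists every subset of predictors to show where 2^p comes from. The names radio and newspaper come from the cards; "tv" is an assumed third predictor added for illustration.

```python
from itertools import combinations


def all_subsets(predictors):
    """Return every subset of the predictors, from the empty (intercept-only) model up."""
    subsets = []
    for k in range(len(predictors) + 1):
        subsets.extend(combinations(predictors, k))
    return subsets


predictors = ["tv", "radio", "newspaper"]  # "tv" is an assumption
models = all_subsets(predictors)

for m in models:
    print(m)
print(len(models))  # 8, i.e. 2**3
```

Each predictor is either in or out of a model, so the count doubles with every predictor added, which is why trying all models quickly becomes impractical.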
Two strategies for feature selection
Forward Selection
Backward Selection
Forward Selection
Start with null model (intercept) and add predictors
Backward Selection
Start with all predictors and remove variables with largest p-values
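Worked example (not from the original cards): a rough Python sketch of forward selection as a greedy loop. The cards do not say which score or stopping rule to use, so both are assumptions here: score is any "higher is better" function (adjusted R-squared would be one choice), and the loop stops when no remaining predictor improves it. toy_score and its GAIN table are hypothetical stand-ins for actually fitting models.

```python
def forward_selection(candidates, score):
    """Greedy forward selection.

    candidates: list of predictor names.
    score: function taking a tuple of predictor names and returning a number
           (higher is better). Assumed to be supplied by the caller.
    """
    chosen = []
    best = score(tuple(chosen))  # start from the null (intercept-only) model
    remaining = list(candidates)
    while remaining:
        # Try adding each remaining predictor and keep the best one.
        trials = [(score(tuple(chosen + [c])), c) for c in remaining]
        top_score, top_candidate = max(trials)
        if top_score <= best:
            break  # no predictor improves the score: stop
        chosen.append(top_candidate)
        remaining.remove(top_candidate)
        best = top_score
    return chosen, best


# Hypothetical score: each predictor contributes a fixed gain, minus a small
# penalty per predictor (mimicking how adjusted R-squared behaves).
GAIN = {"tv": 0.60, "radio": 0.28, "newspaper": 0.001}


def toy_score(subset):
    return sum(GAIN[c] for c in subset) - 0.01 * len(subset)


chosen, best = forward_selection(["tv", "radio", "newspaper"], toy_score)
print(chosen, best)
```

Backward selection would run the same idea in reverse: start from all predictors and drop one at a time.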
Which 3 statistics describe how well the model fits the data?
Residual Standard Error RSE
R^2
RMSE
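What is R^2?
The proportion of the variance in y explained by the model; in linear regression it equals the squared correlation between y and yhat
What do we want R-squared to be?
As close to 1 as possible; we want what we are predicting to be very correlated with the observed responses in the training data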
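Worked example (not from the original cards): a Python sketch computing R^2, RSE, and RMSE from observed and fitted values, and checking that R^2 matches the squared correlation between y and yhat. It assumes the common definitions RSE = sqrt(RSS/(n - p - 1)) and RMSE = sqrt(RSS/n); the data are made up, and the coefficients 0.05 and 1.99 are the least-squares fit for that data.

```python
import math


def fit_metrics(y, y_hat, p):
    """Return (R^2, RSE, RMSE) for observed y, fitted y_hat, and p predictors."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1.0 - rss / tss
    rse = math.sqrt(rss / (n - p - 1))
    rmse = math.sqrt(rss / n)
    return r2, rse, rmse


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


# Made-up data; y_hat uses the least-squares line for this data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [0.05 + 1.99 * xi for xi in x]

r2, rse, rmse = fit_metrics(y, y_hat, p=1)
corr = correlation(y, y_hat)
print(r2, corr ** 2, rse, rmse)
```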
Residual Analysis
Checks how well the model fits the data by examining the residuals
Residuals should be (4 properties)
homoscedastic
Centered on 0 through a range of fitted values
normally distributed
uncorrelated with each other
Homoscedastic
The variance of the residuals stays roughly constant across the range of fitted values
RSE must be
minimized
If residual plot is nonlinear, what type of regression would be best?
quadratic or higher power regression
Significance level (alpha)
Probability of making the wrong decision, given H0 is true
Confidence Interval
A range of values, computed from the sample, that is expected to contain the population parameter of interest
Confidence Level
The long-run proportion of such intervals that would contain the true parameter if the experiment were repeated many times
Confidence level formula
1 - alpha
What does 95% confidence interval mean?
If I ran the experiment 20 times and built an interval each time, then on average about 19 of those intervals would contain my true value (the actual mean, maybe).
Higher confidence means
Larger margin of error, wider confidence intervals
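Worked example (not from the original cards): a Python simulation of the repeated-experiment reading of a 95% confidence interval. It assumes normally distributed data with a known standard deviation, so the interval is mean +/- 1.96 * sigma / sqrt(n); the values of mu, sigma, n, and the seed are arbitrary choices.

```python
import math
import random


def coverage(n_experiments=2000, n=30, mu=10.0, sigma=2.0, z=1.96, seed=0):
    """Fraction of simulated 95% intervals that contain the true mean mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)  # known-sigma interval (an assumption)
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        if mean - half_width <= mu <= mean + half_width:
            hits += 1
    return hits / n_experiments


print(coverage())
```

The printed fraction should land near 0.95: the true mean is fixed, and it is the interval that changes from one experiment to the next.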
Prediction interval is _ than confidence interval
wider
Difference between prediction interval and confidence interval
Prediction interval covers a single new observation, so we have to take into account both the random variation in the data and the uncertainty in knowing the true population parameter
Confidence interval covers only the uncertainty in a population parameter (such as the mean response) estimated from the data you have
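Formulas for reference (not from the original cards): the two intervals at a new point x_0 in simple linear regression. Here s is the RSE, t^* is the critical t value with n - 2 degrees of freedom, and S_xx is the sum of squared deviations of x; these symbols are introduced for this note.

```latex
\text{Confidence interval: } \hat{y}_0 \pm t^{*}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
\qquad
\text{Prediction interval: } \hat{y}_0 \pm t^{*}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
```

The extra 1 under the square root is the variation of an individual observation around the line, which is why the prediction interval is always the wider of the two.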
Multi-collinearity
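When one predictor is (nearly) a linear combination of one or more other predictors, so the predictors are highly correlated with each other
In the presence of multi-collinearity
The coefficient estimates become unstable and cannot be trusted. A weight is read as the effect of changing one predictor while holding the others constant, which can't be true when the predictors move together. With perfect collinearity there is no unique solution at all.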
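Worked example (not from the original cards): a Python sketch of why perfect collinearity breaks the fit. With two centered predictors, the least-squares weights are found by dividing by the determinant S11*S22 - S12^2 of the predictors' sums-of-squares matrix; if one predictor is an exact multiple of the other, that determinant is 0. The data and the function names are made up for illustration.

```python
def centered(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def normal_matrix_det(x1, x2):
    """Determinant of the 2x2 sums-of-squares matrix for two centered predictors."""
    a, b = centered(x1), centered(x2)
    s11 = sum(ai * ai for ai in a)
    s22 = sum(bi * bi for bi in b)
    s12 = sum(ai * bi for ai, bi in zip(a, b))
    return s11 * s22 - s12 ** 2


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_collinear = [2.0 * v for v in x1]        # exact multiple of x1
x2_independent = [2.0, 1.0, 4.0, 3.0, 6.0]  # correlated with x1, but not a multiple

det_collinear = normal_matrix_det(x1, x2_collinear)
det_independent = normal_matrix_det(x1, x2_independent)

print(det_collinear)    # 0.0: no unique set of weights exists
print(det_independent)  # clearly nonzero: weights are well defined
```

When predictors are nearly (not exactly) collinear, the determinant is close to 0 instead, and small changes in the data swing the weights wildly.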