Regression Flashcards
What questions can regression answer?
How do systems work?
ex: how many runs the avg homerun is worth
-effects of economic factors on pres. election
Make Predictions about what will happen in the future?
-height in the future
-price of oil in the future
-housing demand in next 6 months
Simple Linear Regression
-one predictor
-y = response
x = predictor
Equation
y = a0 + a1x1
general linear regression equation
with m predictors
y = response
x = predictor
y = a0 + sum from j=1 to m of a_j * x_j
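The general linear regression equation can be sketched directly in code. This is a minimal illustration; the coefficients and predictor values below are made up.

```python
def predict(a0, coeffs, x):
    """Evaluate y = a0 + sum_j a_j * x_j for a general linear regression."""
    return a0 + sum(a_j * x_j for a_j, x_j in zip(coeffs, x))

# Hypothetical model with m = 2 predictors: y = 1.0 + 2.0*x1 + 0.5*x2
y = predict(1.0, [2.0, 0.5], [3.0, 4.0])  # 1 + 6 + 2 = 9
```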
How do you measure the quality of a regression line’s fit?
the sum of squared errors
-distance between true response and our estimate
simple linear regression prediction error
yi = actual
yhati = prediction
yi - yhati, or yi - (a0 + a1xi1)
Sum of squared errors equation
sum from i = 1 to n (yi-yhati)^2
or
sum from i = 1 to n (yi-(a0+a1xi1))^2
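The sum-of-squared-errors equation above is a one-liner in code; the actual and predicted values here are made-up numbers for illustration.

```python
def sse(ys, yhats):
    """Sum of squared errors between actual and predicted responses."""
    return sum((y - yhat) ** 2 for y, yhat in zip(ys, yhats))

# Made-up data: errors are 1, -1, 2, so SSE = 1 + 1 + 4 = 6
total = sse([3.0, 5.0, 7.0], [2.0, 6.0, 5.0])
```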
What is the best fit regression SLR line?
minimizes sum of squared errors
-defined by a0 and a1
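The best-fit a0 and a1 that minimize the sum of squared errors have a closed form for simple linear regression. A sketch, using made-up points that lie exactly on the line y = 1 + 2x so the fit can be checked:

```python
def fit_slr(xs, ys):
    """Closed-form least-squares fit: a1 = cov(x, y) / var(x), a0 = ybar - a1*xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    a1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    a0 = ybar - a1 * xbar
    return a0, a1

# Points on y = 1 + 2x; the fit should recover a0 = 1, a1 = 2
a0, a1 = fit_slr([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```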
How do we measure the quality of a model's fit?
likelihood
What is likelihood? What is maximum likelihood?
-the probability (density) of the observed data for any given parameter set; we assume the observed data are correct and that we have information about the variance
-parameters that give the highest probability
What is Maximum Likelihood Estimation (MLE)? What are you minimizing to calculate it?
the set of parameters that minimizes the sum of squared errors
zi = observations
yi = model estimates
minimize sum from i = 1 to n (zi-yi)^2
Maximum likelihood in the context of linear regression
LR: y = a0 + sum from j=1 to m of a_j * x_j
sum square errors = sum from i = 1 to n (zi-yi)^2
substitute regression equation for yi in sum of squared errors
minimize sum from i=1 to n of (zi - (a0 + sum from j=1 to m of a_j * x_j))^2
How can you use likelihood to compare two different models?
the likelihood ratio
Akaike Information Criterion equation. What is the penalty term and what does it do?
L*: maximum likelihood value
k: # of parameters we're investigating
AIC = 2k -2ln(L*)
Penalty term - (2k) balances likelihood with simplicity
-helps avoid overfitting
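The AIC equation above is simple to compute once you have the maximum log-likelihood. The parameter count and log-likelihood value below are made up for illustration.

```python
def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L*), where log_likelihood is ln(L*)."""
    return 2 * k - 2 * log_likelihood

# Made-up example: 3 parameters, ln(L*) = -10
value = aic(3, -10.0)  # 6 + 20 = 26
```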
AIC with regression? Do you want AIC to be smaller or larger?
substitute the maximum-likelihood regression equation; the # of parameters is m+1
-we prefer models with smaller AIC; smaller AIC encourages fewer parameters and higher likelihood
corrected AIC
-plain AIC works well if we have infinitely many data points
-this never happens in practice
-so add a correction term
AICc = AIC + 2k(k+1)/(n-k-1)
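The corrected AIC adds the small-sample penalty 2k(k+1)/(n-k-1) to the plain AIC. A sketch with made-up values:

```python
def aicc(aic_value, k, n):
    """Corrected AIC: AICc = AIC + 2k(k+1)/(n - k - 1)."""
    return aic_value + 2 * k * (k + 1) / (n - k - 1)

# Made-up example: AIC = 26, k = 3 parameters, n = 13 data points
value = aicc(26.0, 3, 13)  # 26 + 24/9
```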
Comparing models with AIC
relative likelihood that the higher-AIC model is as good as the lower-AIC one =
e^((AIC1-AIC2)/2), where AIC1 is the smaller AIC
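The AIC comparison above can be sketched in a couple of lines; the two AIC values are made up, chosen so the gap is 2 and the ratio is e^-1.

```python
import math

def relative_likelihood(aic_small, aic_large):
    """e^((AIC1 - AIC2)/2) with AIC1 the smaller AIC: weight that the
    higher-AIC model is as good as the lower-AIC one."""
    return math.exp((aic_small - aic_large) / 2)

# Made-up AICs with a gap of 2: the ratio is e^-1, about 0.37
r = relative_likelihood(26.0, 28.0)
```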
Bayesian Information Criterion (BIC)
L*: maximum likelihood value
k: # of parameters we're investigating
n: number of data points
BIC = k ln(n) - 2 ln(L*)
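The BIC equation is just as direct to compute; the parameter count, sample size, and log-likelihood below are made-up values.

```python
import math

def bic(k, n, log_likelihood):
    """BIC = k ln(n) - 2 ln(L*)."""
    return k * math.log(n) - 2 * log_likelihood

# Made-up example: 3 parameters, 100 data points, ln(L*) = -10
value = bic(3, 100, -10.0)
```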
AIC VS BIC
BIC's penalty term > AIC's penalty term
-BIC encourages models with fewer parameters than AIC does
-only use BIC when there are more data points than parameters
BIC comparison between 2 models on the same dataset…
-if |BIC1 - BIC2| > 10, the smaller-BIC model is very likely to be better
-if between 6 and 10, the smaller-BIC model is likely better
-if between 2 and 6, somewhat likely better
-if between 0 and 2, only slightly likely better
Is there a hard and fast rule for choosing between AIC, BIC, or maximum likelihood?
No, all 3 can give valuable information. Looking at all 3 can help you decide which is best
Regression coefficients for predictions and forecasting
the response increases by the coefficient * the variable
in other words, if the variable = 1, that increases the response by the coefficient amount (descriptive)
if we are forecasting
-same interpretation, but predictive: when the variable = 1, the response is predicted to increase by the coefficient amount
Which of the components of analytics can regression be used for?
Descriptive and predictive analytics
not prescriptive
Causation
one thing causes another thing
correlation
two things tend to happen together or not together
-they don't necessarily cause each other
When is there causation?
-cause is before effect
-idea of causation makes sense
-no outside factors that could cause the relationship
-be careful before claiming causation
Transforming data
-adjust the data so the fit is linear
-quadratic regression
-response transform
-box-cox transformation
variable interaction
ex: a 2-year-old's height at adulthood. if both parents are tall, maybe the kid will be even taller, i.e. their heights interact
-y = a0 + a1x1 + a2x2 + a3(x1x2)
-the interaction term is a new column of data that we can use as a new input x3
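Building the interaction column is just elementwise multiplication of the two predictor columns. A sketch with made-up predictor data:

```python
# Two made-up predictor columns (e.g., two parents' heights)
x1 = [1.0, 2.0, 3.0]
x2 = [4.0, 5.0, 6.0]

# Interaction term: a new column x3 = x1 * x2, used as an extra input
x3 = [a * b for a, b in zip(x1, x2)]
```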
p-value of coefficient
estimate the probability that the coefficient is really 0
-form of hypothesis testing
if p value > 0.05
- can remove from model
-other thresholds can be used
-higher thresholds - more factors can be included
-possibility of including an irrelevant factor
-lower thresholds - fewer factors can be included - possibility of leaving out a relevant factor
p-value warnings
with large amounts of data, p-values get small even when attributes are not at all related to the response
p-values are only probabilities, even when meaningful
-with 100 attributes at p-values of .02 each, each has a 2% chance of not being significant
-so expect about 2 that are not really relevant
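The expectation in the 100-attribute example is just the count times the per-attribute false-positive probability:

```python
# 100 attributes, each with a 2% chance of being a false positive:
# expected number of attributes kept that are not really relevant
expected_false_positives = 100 * 0.02
```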
confidence interval
where the coefficient probably lies and how close it is to 0
T-statistic
the coefficient divided by its standard error
-related to p value
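The t-statistic definition above is a single division; the coefficient and standard-error values here are made up.

```python
def t_statistic(coefficient, standard_error):
    """t = coefficient / standard error of the coefficient."""
    return coefficient / standard_error

# Made-up example: coefficient 2.5 with standard error 0.5
t = t_statistic(2.5, 0.5)  # 5.0
```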
interpreting coefficient
-sometimes you discover that the coefficient, when multiplied by the attribute, still doesn't make much of a difference even if the p-value is very low
ex: estimate household income with age as one of the attributes
-if the coefficient is 1
-even with a low p-value, the attribute really isn't very important; it's unlikely to make even a $100 difference
R squared value (coefficient of determination)
-estimate of how much variability your model accounts for
-ex rsquared = 59%
-accounts for about 59% of the variability in the data
-the remaining 41% is either randomness or other factors
adjusted r squared
rsquared adjusted for # of attributes used
interpreting r squared, what is a good value?
-some things aren’t easily modeled
-things can affect real life systems especially when humans are involved
-r-squared values of .4 or .3 are quite good
what is the null hypothesis?
the hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error.
r squared formula
R^2 = 1 - SSE_residuals/SSE_total
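The R² formula can be sketched directly from its two sums of squares; the data here are made up, with perfect predictions so R² = 1.

```python
def r_squared(ys, yhats):
    """R^2 = 1 - SSE_residuals / SSE_total."""
    ybar = sum(ys) / len(ys)
    sse_resid = sum((y - yhat) ** 2 for y, yhat in zip(ys, yhats))
    sse_total = sum((y - ybar) ** 2 for y in ys)
    return 1 - sse_resid / sse_total

# Made-up data: predictions match the actuals exactly, so R^2 = 1
r2 = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```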