L3: Linear Regression Flashcards
After this week:
- Understand how regression analysis works
- Apply linear models to solve different regression problems
- Critically assess the accuracy of the coefficient estimates and the accuracy of the model
- Produce a precise analysis of the model output
Is there a relationship between any of the advertising streams and sales?
TV and radio both appear to have some relationship with sales, as shown by the linear or non-linear pattern in the data.
Newspaper may have some weak relationship or may require a data transformation.
How strong is the relationship between the different advertising streams and sales?
TV has the strongest relationship, followed by radio and then newspaper.
Which of the media contributes to sales?
At first glance, it appears that only TV and radio contribute to sales.
Is the relationship linear between the advertising streams and sales?
The relationships for TV and radio may be linear. However, the relationship may taper off at higher spending, so it could be logarithmic for TV and perhaps something similar for radio.
If there were synergy between two variables, what would this mean?
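It would suggest that there is an interaction between the two variables: the effect of one variable on the dependent variable depends on the level of the other, and accounting for this interaction helps explain the dependent variable's variability.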
What are the assumptions that are made in a linear regression model? (3)
That the response variable, Y, has a linear relationship with the predictor variable, X
That the errors are independent and normally distributed
That there is constant variability in the residuals
Linearity, Nearly Normal Residuals, Constant Variability
Define the i-th residual by its equation.
Let $e_i$ be the residual of data point i:
$e_i = y_i - \hat{y}_i$
That is, the residual is the difference between the observed and the predicted value of y.
Then define the residual sum of squares.
The residual sum of squares (RSS) is a means of measuring the discrepancy between the predicted and observed values of the dependent variable.
$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2$
where $e_i$ is the residual of the i-th data point.
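As a minimal sketch of the two definitions above, the residuals and the RSS can be computed in plain Python. The function names and the data values are made up purely for illustration.

```python
def residuals(y, y_hat):
    """Return the residual (observed minus predicted) for each data point."""
    return [yi - yhi for yi, yhi in zip(y, y_hat)]


def rss(y, y_hat):
    """Residual sum of squares: the sum of the squared residuals."""
    return sum(e ** 2 for e in residuals(y, y_hat))


# Made-up observed and predicted values, for illustration only.
y_obs = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 7.0]

print(residuals(y_obs, y_pred))  # [0.5, -0.5, 0.0]
print(rss(y_obs, y_pred))        # 0.5
```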
What is the least squares approach?
The least squares approach chooses the coefficients of the linear model by minimising the RSS. In this way we optimise the model so that its predictions deviate as little as possible, in terms of squared error, from the data points.
This yields the best-fitting line for the data in the least squares sense; it does not guarantee that the model is the true one.
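For a single predictor, minimising the RSS has a well-known closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The sketch below assumes that simple one-predictor case; `fit_simple_ols` is a hypothetical helper name and the data are made up.

```python
def fit_simple_ols(x, y):
    """Least squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx           # slope that minimises the RSS
    b0 = y_bar - b1 * x_bar  # intercept that minimises the RSS
    return b0, b1


# Made-up data lying exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

b0, b1 = fit_simple_ols(x, y)
print(b0, b1)  # 1.0 2.0
```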
Which of the variables are significant?
What does this mean?
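The Pr(>|t|) value is the p-value of the t-test: the probability of seeing a t-statistic at least this extreme if the true coefficient were zero. If it is < 0.05, we can reject the null hypothesis of no relationship and conclude that there is evidence of a relationship.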
In this case, the intercept and TV variable appear to be significantly related to sales.
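To make the t-test concrete, the sketch below computes a t-statistic as the estimate divided by its standard error, then a two-sided p-value. Note the assumptions: the p-value uses a normal approximation, which is only reasonable for large samples (regression software uses the t-distribution instead), and the estimate and standard error in the example are hypothetical, not taken from the model output.

```python
import math


def t_statistic(estimate, std_error):
    """Number of standard errors the estimate lies away from zero."""
    return estimate / std_error


def approx_two_sided_p(t):
    """Two-sided p-value under a normal approximation (assumes a large sample)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Hypothetical coefficient estimate and standard error, for illustration only.
t = t_statistic(0.05, 0.01)
p = approx_two_sided_p(t)

print(f"t = {t:.2f}, approximate p = {p:.2g}")
print("significant at the 5% level" if p < 0.05 else "not significant at the 5% level")
```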
What does the Std. Error indicate?
The Std. Error indicates how precisely the model estimates the unknown true value of the coefficient: it measures how much the estimate would vary from sample to sample, so a smaller value means a more precise estimate.
SE(B0) = 0.457843: in the absence of any advertising, the estimated average sales can vary by 457.843 units.
SE(B1) = 0.002691: for each $1,000 increase in television advertising, the estimated average increase in sales can vary by 2.691 units.
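As a sketch of where these numbers come from, the standard formulas for simple linear regression are SE(B1)^2 = s^2 / Sxx and SE(B0)^2 = s^2 (1/n + x̄^2 / Sxx), where s^2 = RSS / (n - 2) estimates the error variance and Sxx is the sum of squared deviations of x. The helper name and the data below are made up; this is not the Advertising data.

```python
import math


def coefficient_standard_errors(x, y):
    """Return (SE(b0), SE(b1)) for the least squares fit of y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Estimate the error variance from the residuals (n - 2 degrees of freedom).
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se_b0 = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))
    se_b1 = math.sqrt(sigma2 / sxx)
    return se_b0, se_b1


# Made-up data, for illustration only.
se_b0, se_b1 = coefficient_standard_errors([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
print(round(se_b0, 4), round(se_b1, 4))
```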
What is the 95% confidence interval of the B1 coefficient?
The 95% confidence interval is approximately
B1 ± 2 SE(B1)
Therefore the interval is:
[B1 - 2 SE(B1), B1 + 2 SE(B1)]
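The interval can be sketched as a small function. The standard error 0.002691 is the SE(B1) quoted earlier in these cards; the coefficient estimate 0.05 is a hypothetical placeholder, since the value of B1 is not given here, and the multiplier of 2 is the usual rough approximation to the exact t-quantile.

```python
def approx_confidence_interval(estimate, std_error, multiplier=2.0):
    """Approximate 95% interval: estimate plus or minus 2 standard errors."""
    margin = multiplier * std_error
    return estimate - margin, estimate + margin


se_b1 = 0.002691        # SE(B1) from the cards above
b1_hypothetical = 0.05  # placeholder estimate, NOT from the model output

lower, upper = approx_confidence_interval(b1_hypothetical, se_b1)
print(f"[{lower:.6f}, {upper:.6f}]")
```

If the resulting interval does not contain zero, that agrees with rejecting the null hypothesis of no relationship at roughly the 5% level.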
What is the Residual Standard Error (RSE)?
It is an estimate of the standard deviation of the error term, and therefore a measure of how well (or how poorly) the linear regression fits the data.
In our previous example, RSE = 3.259, so actual sales in each market deviate from the true regression line by approximately 3,259 units on average. This is about 23% (3,259/14,000) of the mean value of sales (14,000 units).
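For simple linear regression the RSE is computed as sqrt(RSS / (n - 2)). The sketch below shows that formula and reproduces the percentage quoted above from the RSE of 3.259 and the mean sales of 14 (both in thousands of units); the RSS and n passed to `rse` are made up for illustration.

```python
import math


def rse(rss, n):
    """Residual standard error for simple linear regression (n - 2 degrees of freedom)."""
    return math.sqrt(rss / (n - 2))


def percentage_error(rse_value, mean_response):
    """RSE expressed as a percentage of the mean response."""
    return 100.0 * rse_value / mean_response


# Made-up RSS and sample size, for illustration only.
print(rse(18.0, 4))  # 3.0

# Values quoted in the card above, in thousands of units.
print(f"{percentage_error(3.259, 14.0):.1f}%")  # 23.3%
```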
What does R² tell us?
R² tells us the proportion of the variability in Y that can be explained by the independent variable X. It lies between 0 and 1, and values closer to 1 mean that more of the variability is explained.
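As a sketch of the calculation, R squared equals 1 - RSS/TSS, where TSS is the total sum of squares, that is, the sum of squared deviations of y from its mean. The function name and data are made up for illustration.

```python
def r_squared(y, y_hat):
    """Proportion of the variability in y explained by the fitted values."""
    y_bar = sum(y) / len(y)
    rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))  # unexplained variability
    tss = sum((yi - y_bar) ** 2 for yi in y)                 # total variability
    return 1.0 - rss / tss


# Made-up observations and fitted values, for illustration only.
y_obs = [1.0, 3.0, 2.0, 4.0]
y_fit = [1.3, 2.1, 2.9, 3.7]
print(round(r_squared(y_obs, y_fit), 2))  # 0.64
```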
For multiple linear regression, how can we find the best estimates for the regression coefficients?
We minimise the RSS, just as in simple linear regression: the best estimates are the coefficient values that give the smallest RSS.
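One way to carry out this minimisation with several predictors is to solve the normal equations (XᵀX) b = Xᵀy. The sketch below does so with a small Gauss-Jordan elimination in plain Python; it is an illustration under that assumption rather than how statistical software necessarily computes the fit, and the function name and data are made up.

```python
def fit_multiple_ols(X, y):
    """Least squares estimates [b0, b1, ..., bp]; X is a list of predictor rows."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of 1s for the intercept
    p = len(rows[0])
    # Normal equations A b = c, with A = X^T X and c = X^T y.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    M = [A[i] + [c[i]] for i in range(p)]  # augmented matrix
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][p] for i in range(p)]


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]
print([round(b, 6) for b in fit_multiple_ols(X, y)])
```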