unit 3 - ch 13 - multiple linear regression (mr) Flashcards
The multivariate dependent and independent relationship
Y - price of gem
X1 - carat
X2 - cut
X3 - clarity
X4 - Color
The multiple regression equation
Y hat = b0 + b1(x1) + b2(x2) + b3(x3) + ...
MR = Y hat = (y hat equation)
The partial regression coefficients (b1, b2, b3, ...) are that middle part
Dummy variables
Using categorical (nominal) data
Converts categorical data into binary data
Used for _____ (missed in lecture)
Gem - Y - X1 - X2
Non-numeric data = text
0-1 binary code
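A minimal sketch of dummy coding in Python. The color labels are made up for illustration, and `dummy_code` is a hypothetical helper name, not something from the lecture:

```python
# Hypothetical gem colors; "green" is treated as the reference group (0).
colors = ["pink", "green", "green", "pink", "pink"]

def dummy_code(values, present):
    """Return 1 where the attribute is present and 0 where it is absent."""
    return [1 if v == present else 0 for v in values]

is_pink = dummy_code(colors, "pink")
print(is_pink)  # [1, 0, 0, 1, 1]
```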
r
Sign is - or +
Range is -1 to +1
Direction is indicated (by the sign)
X-y relationship is =
multi r
Sign is +
Range is 0 to +1
Direction is not indicated
X-y relationship is >=
Multi-r is a single point-value representing the strength of a simultaneous relationship between the x-variables and Y
(multi) collinearity
Share
Line (slope)
(Multi) collinearity:
When 2 (or more) x-variables are highly correlated with each other
The multi-variate dependents (X and Y)
Independent relationships (X and X)
multi-variate dependent vs independent relationships
The multi-variate dependents (X and Y)
Independent relationships (X and X)
student car broke down on campus
Student X moves car (Y) across campus.
The total distance of the movement of car (Y) is 100% due to the effort of student (X) = simple linear regression
Next day students (X1 and x2) move car (Y)
We can measure the total distance the car (Y) was pushed, but it is harder to find how much the effort of each student (X1 and X2) added to the total movement (multiple regression)
r or multi r formula
The adverse effects of multicollinearity
When 2 or more x-variables are highly correlated
1. Cannot decipher which x-variable is affecting the y-variable (not an issue with SLR)
2. Increases the chance of a Type II error (FTRN when the null is really false)
3. The signs of the partial regression coefficients may flip
As collinearity decreases, each predictor variable's unique portion of the variability within the Y-variable increases
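One way to spot collinearity is to compute r between two x-variables. This is a rough sketch with made-up numbers; `table_width` is a hypothetical second predictor invented to track carat closely:

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up predictor values (not from the lecture data).
carat = [0.5, 0.7, 1.0, 1.2, 1.5]
table_width = [5.1, 5.7, 6.4, 6.8, 7.3]

r_xx = pearson_r(carat, table_width)
print(r_xx)  # close to +1, so these two x-variables are highly correlated
```

When r between two x-variables is this close to +1 or -1, the model has trouble telling which of them is affecting Y.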
Multiple regression excel:
regression table
anova table
collinearity table
r = Strength
a = Significance
c = Collinearity
the strength of the relationship: summary output table (regression table)
Coefficient of determination: the percent of the variation in gem price that is explained by the variation in carat, cut, clarity, color
N= sample size
P = number of predictors
If the general rule regarding sample size is not met adjusted R square is a more accurate indicator of the strength of the multiple regression relationship
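For reference, the standard adjusted R square formula is 1 - (1 - R square)(N - 1)/(N - P - 1). A small sketch follows; the inputs (R square of 0.84, N of 30, P of 4) are assumptions chosen only to show the arithmetic:

```python
def adjusted_r_square(r_square, n, p):
    """Adjusted R square for n observations and p predictors."""
    return 1 - (1 - r_square) * (n - 1) / (n - p - 1)

# Assumed values for illustration only.
adj = adjusted_r_square(0.84, 30, 4)
print(round(adj, 4))  # 0.8144
```

Adjusted R square is always a little lower than R square, and the gap grows as P gets large relative to N.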
judgment call
Is the strength of the relation (missed again) :(
Multiple R, R square, Adjusted RSQ → Strength of the relationship → judgment call
test stat =
= between term/within term
underlying theory of anova test
total variation can be divided into two distinct parts:
1 - between AND
2 - within (error)
and the two components can be compared to determine which is affecting the data to a greater degree
total variation in the y-variable can be divided into distinct components
regression term
residual term (error)
regression term
1 - regression term (Y’s relationship with the X-variable)
Regression term: Y hat - Y bar
residual term
2 - residual term (random factors not in the model)
Residual term: Y - Y hat
full model
FM = Y hat = b + m(x)
total variation
Total Variation: Y - Y bar
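A sketch of how the total variation splits into the regression and residual terms, using a one-predictor model Y hat = b + m(x) fitted by least squares. The x and y values are made up:

```python
# Made-up data for a single predictor.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope (m) and intercept (b).
m = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b = y_bar - m * x_bar
y_hat = [b + m * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)                    # Y - Y bar
ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)           # Y hat - Y bar
ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # Y - Y hat

# Total = regression + residual (up to floating-point rounding).
print(round(ss_total, 6), round(ss_regression, 6), round(ss_residual, 6))
```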
increase of F
increase ms regression / decrease ms residual
decrease of F
decrease ms regression / increase ms residual
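A sketch of the F ratio as MS regression divided by MS residual, where each MS is a sum of squares divided by its degrees of freedom (P for regression, N - P - 1 for residual). The sums of squares below are made-up numbers:

```python
def f_statistic(ss_regression, ss_residual, n, p):
    """F = MS regression / MS residual for n observations and p predictors."""
    ms_regression = ss_regression / p
    ms_residual = ss_residual / (n - p - 1)
    return ms_regression / ms_residual

# Made-up sums of squares; both cases have the same total of 6.0.
f_base = f_statistic(3.6, 2.4, n=5, p=1)    # about 4.5
f_higher = f_statistic(4.8, 1.2, n=5, p=1)  # more regression, less residual
print(f_higher > f_base)  # True
```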
anova table
1 of the 4 facets of the Null states: everything is unrelated
Ho: The model of carat, cut, clarity, and color is unrelated to gem price
H1: The model of carat, cut, clarity, and color is correlated with gem price
If FTRN: the model is not significantly correlated with gem price (is not a good model)
If RTN: the model is significantly correlated with gem price (is a statistically good model)
when to FTRN or RTN
If FTRN: the model is not significantly correlated with gem price (is not a good model): Significance F > alpha
If RTN: the model is significantly correlated with gem price (is a statistically good model): Significance F < alpha
In the anova table, Significance F (not a coefficient's p-value) is compared to alpha to decide RTN or FTRN
unexplained variation
Naked eye appeal - seller’s reputation, seller’s service etc…
16-17%
explained variability
Carat, cut, clarity, color etc.
83-84%
The value of the chance model is not for practical use but for
comparison purposes
significance of the components
Y = b0 + b1 (x1) + b2 (x2) + b3 (x3) + … + e
b0 = y-int
b1 = (partial) regression coefficient (slope for x1)
e = residual
1 of the 4 facets of the Null states: everything is unrelated
Ho: each x-variable is not correlated with gem price
H1: each x-variable is correlated with gem price
If FTRN: the x-variable is not a good predictor variable
If RTN: the x-variable is a good predictor variable
p value vs alpha (FTRN vs RTN)
P value < alpha : reject
P-value > alpha : FTRN
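The decision rule can be sketched as a small function. The default alpha of 0.05 is an assumption, and treating a p-value exactly equal to alpha as FTRN is a choice made here, not something stated in the notes:

```python
def decide(p_value, alpha=0.05):
    """Return "RTN" when p-value < alpha, otherwise "FTRN"."""
    # Assumption: a p-value exactly equal to alpha is treated as FTRN.
    return "RTN" if p_value < alpha else "FTRN"

print(decide(0.01))  # RTN
print(decide(0.20))  # FTRN
```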
0 and 1 variable
The “0” variable:
Reference group
Represents the absence of the qualitative attribute
The “1” variable:
Dummy variable
Represents the presence of the qualitative attribute
look at notes i guess to understand graphs :/
If the gem is pink, rather than green, it demands a premium
Pink gems, on average, cost (…)
If the dummy coefficient were a negative number: pink gems sell at a discount compared to green gems
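A simplified sketch of what the dummy coefficient measures. In a model where the dummy is the only predictor, its coefficient equals the difference between the two group means; with other predictors in the model it is the same comparison holding those predictors constant. The prices below are made up and are not the lecture's figures:

```python
# Made-up gem prices and a dummy column: 1 = pink, 0 = green (reference group).
prices = [1200, 900, 950, 1300, 1250]
is_pink = [1, 0, 0, 1, 1]

def group_mean(values, flags, flag):
    """Mean of the values whose dummy flag matches the given flag."""
    picked = [v for v, f in zip(values, flags) if f == flag]
    return sum(picked) / len(picked)

green_mean = group_mean(prices, is_pink, 0)  # plays the role of the y-intercept
pink_mean = group_mean(prices, is_pink, 1)
dummy_coefficient = pink_mean - green_mean
print(dummy_coefficient)  # 325.0, positive, so pink sells at a premium here
```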