Chpt 15 Flashcards
How many variables and what type of variables are involved in multiple regresson
2 or more independent variables
what is the Multiple regression the study of
how a dependent variable y is related to 2 or more independent variables
What is the multiple regression model
Y = B0+B1X1 to B2x2 +…..Bpxp + E
what is the random variable in teh regression model
E - the error term
what does the error term in multiple regression account for (SSE)
accounts for the variability in y that cannot be explained by the linear effect of the p independent variables
what are the assumptions in multiple regression
the mean or expected value of E is 0
What is the multiple regression equation
E(y) = B0 + B1x1 + Bx2x+………BpXP
What does E(y) stand for
mean or expected value of y
in multiple regression, when we use sample data to estimate the multiple regression equation, what is the fromula
yTriangle hat = b0 +b1x1+b2x2…….bpxp
what does y trainagle hat stand for in mutlple regression
predicted value of the dependent variable
what does yi stand for
observed value of the dependent variable for the ith observation
what does y traingle hat i stand for
predicted value of the dependent variable
when adding more independent variable to a multiple regression, does it mean the regression will be “better off” why
no, it can make things worse, called overfitting
what is multicollinearity
the addition of more independent variables creates more relationships among them
- so not only are the I.V. potentially related to the Dependent variable, they are also potentially related to each other
If you have 4 I.V., how many relationships do you have
4 - with the I.V and D.V and 6 more with the I.vs
so in total there are 10 relationships to consider
do all I.V. help at predicting the D.V?
no, some I.V. are better at predicting the D.V. than others, some contribute nothing
in multicollinearity, what is the ideal situation
that all of the I.Vs to be correlated with the D.V. but NOT with each other
in multiple regression, how is each coefficient interpreted as?
the estimated change in y corresponding to a one unit change in a variable when all other variables are held constant
What are the 6 preps for multiple regression
- generate a list of potential variable; indpednent and dependent
- Collect data on the variables
- check the relationship b/w each I.V and the D.V. using scatter plots and correlations
- (optional) conduct simple linear regression for each i.V./D.V pair
- use the non-redundant I.V.s in teh analysis to find the best fitting model
- use the best fitting model to make predictions about the D.V.
what two problems can happen in multiple regression
- overfitting and 2. multicollinearity
what is overfitting
is caused by adding too many I.V.; they account for more variance but add nothing to the model
What is multicolinearity
happens when some / all of the i.v.s are correlated with each other
In Simple linear regression how do we interpret bi
as an estimate of the change in y for a one-unit change in the I.V.
In multiple linear regression how do we interpret bi
we interpret each regression coefficient as : bi - an estimate of the change in y corresponding to ta one unit change in xi when all other intendent variables are held constant
multiple regression what does bi represent
an estimate of the change in y corresponding to a one-unit change in xi when all other I.Vs are held constant
What is the formula for Coefficient of Determination in simple linear regression
r squared = SSR/SST
what is the formula for Coefficient of Determination with multiple regression
Rsqaured = SSR/SST
What does the Multiple Coefficient of Determination indicate
indicates we are measuring the goodness of fit for the estimated multiple regression equation
- denoted R squared
How can R sqaured (Multiple coefficient of determination) be interpreted as
the proportion of variability in the dependent variable that can be explained by the estimated multiple regression equation - when multiplied by 100, it can be interpreted as the % of the variability in y that can be explained by the estimated regression equation
What would this mean: R squared = .904
90.4% of the variability in travel time y is explained by the estimated multiple regression equation
Does R squared always increase or decrease as intendent variables are added
always increases
Why do many analysts prefer adjusting R squared for the # of independent variables to avoid what
to avoid overestimating the impact of adding an i.v.
What is the formula for the adjusted multiple coefficient of determination
R squared a = 1 - (1-R sqaured)[ (n-1) (n-p-1)
What does “p” represent in the adjusted multiple coefficient of determination
of independent variables
What happens to SSE when you add more I.V.s
causes prediction errors to become smaller, thus reducing the SSE
SSR = SST-SSE
- when SSE becomes smaller, causing R squared - ssr/sst to increase
IF a variable is added to the model, what happens to R squared
becomes larger even if the variable added is not statistically significant
If the value of R squared is small and the model contains a large # of I.V. what can happen to the Adjusted Coeff. of Det.
can take on a negative value
- in such cases, minitab sets the adjusted coeff of det to zero
in multiple regression, what is the variance of E denoted by
Q sqaured
in multiple regression, what is the variance of E expectation
same for all values of the I.vs
What are the assumptions about E in simple or multiple linear regression
- the error term E - is a random variable with an expected value of 0
- The variance of E is the same for all values of the I.Vs
- The values of E are independent
- The Error term - is a normally distributed random variable
What type of graph is the mutliple regression
plane in 3-d space graph
What is the value of E in multiple regression
difference b/w actual y and the expected value of y E(y) when x1 = x* and x2 = X*2
Regression Analysis Terms
Dependent variable - we now use
Graph is called a
we now use response variable
graph is called a response surface
In Simple linear regression what tests did we use to test for significance
t and F test
- both provided the same conclusion
In multiple regression what tests do we use when testing for significance
F test and t test
in multiple regression what is the F test for
used to determine whether a significanct relationship exists b/w d.v and the set of i.vs
in multiple regression what is the F test referred to as
the test for Overall significance
in multiple regression what is the t test used for
used to determine whether EACH of the I.V.s is Significant
- a separate t test is conducted for EACH of the I.V. s
in multiple regression what is the t test referred to as
a test for individual significance
What is the Hypotheses for F test in multiple regression (test for overall significance)
HO: B1 = B2 = …….Bp = 0
Ha: one or more of the parameters is NOT equal to zero
In multiple regression, with the F test, if HO is reject what can we say
gives us sufficient statistical evidence to conclude that one or more of the parameters is NOT equal to zero and
that the overall relationship b/w y and the set of I.V s is significant
In multiple regression, with the F test, if we cannot reject Ho, what can we say
we do not have sufficient evidence to conclude that a significant relationship is present
in multiple regression, what is the formula for mean square
sum of squares / df
In multiple regression, what is the df for Total sum of squares
n-1
in multiple regression, what is the df for sum of squares due to regression (SSR)
p df
p - # of i.v?
in multiple regression, what is teh df for sum of squares due to error
SSE has n-p-1 df
in multiple regression, what is the df for MSR
MSR = SSR/p
in multiple regression, what is the df for MSE
MSE = SSE/n-p-1
In multiple regression, what does MSE provide
provides an unbiased estimate of Q squared (the variance of the error term E)
in multiple regression, if HO: B1 = B2 = Bp = 0, then what can we say about MSR
then MSR also provides an unbiased est of Q squared
and the value of MSR/MSE should be close to 1
in multiple regression, if HO: B1 = B2 = Bp = DOES NOT = 0, then what can we say about MSR
MSR overestimates Q squared
MSR/MSE - becomes larger
How do you determine how large the value of F must be to reject Ho
p-value approach - reject HO if p-value < a
CV approach reject if F > Fa
What is Fa based on and what is the df
based on F distribution
df for numerator = p
df for denominator = n-p-1
What is the standard error also callaed
standard error
what is the formula for the standard error
sqaure root of MSE
what is the test statistic formula for t test in multiple regression
t = bi / sbi
what is the hyp test for t test
Ho: B1 =B2 = 0
Ha: B1 and / or B2 is not equal to zero
what is the hyp test for t test
Ho: B1 =B2 = 0
Ha: B1 and / or B2 is not equal to zero
What does the I.V in regression analysis refer to
any variable being used to predict or explain the value of the D.V.
Hwo to determine whether multicollinearity is high enough to cause problems - what test do you use
rule of thumb test
- a sample correlation coefficent greater than +7 or less than =7 for 2 IVs is a warning of potential problems
- try to avoid including i.vs that are highly correlated (in practice, this is rarely possible)
In multiple regression, what do you do if you believe there is substantial multicollinearity
separating the effects of individual i.vs on the dependent variables is very difficult
What do you use to estimate and predict
- a confidence interval and a prediction interval
what is a confidence interval
mean travel for all trucks that travel 100 miles and make 2 deliveries
what is a prediction interval
of the travel time for one specific truck that travels 100 miles and makes 2 deliveries
With multiple regression, how do you develop the interval estimate for the mean value of y and the prediction interval
similar to simple linear reg, we use mini tab or excel or other software packages
With multiple regression, how do you develop the interval estimate for the mean value of y and the prediction interval
similar to simple linear reg, we use mini tab or excel or other software packages
How do you interpret Bo, B1 and B2 when categorical variable is present in multiple regression
X2 = 0 mechanical
E (y|mechanical) mean or expected value of repair time given a mechanical repair
E(y|mechanical) = B0 + B1x1 + B2(0) = B0 + B1x1 for electrical E(y| electrical) = B0 + B1x1 = B2(1) = B0+B1x1+B2 =(B0+B2) + B1x1
How do you interpret Bo, B1 and B2 when categorical variable is present in multiple regression
X2 = 0 mechanical
E (y|mechanical) mean or expected value of repair time given a mechanical repair
E(y|mechanical) = B0 + B1x1 + B2(0) = B0 + B1x1 for electrical E(y| electrical) = B0 + B1x1 = B2(1) = B0+B1x1+B2 =(B0+B2) + B1x1
how to interpret parameters in Multiple reg.
E(y\mechancial) = when mechanical is given 0
E(y|electircal) = when electrical is given 1
if B2 is positive
If B 2 is negative
If B = 0
Positive - the mean repair time for electrical will be greater than that for mechanical
Negative - the mean repair time for electrical will be less than that for mechanical
0 - no difference in the mean time b/w electrical and mechanical and the type of repair is NOT related to the repair time
how to interpret parameters in Multiple reg.
E(y\mechancial) = when mechanical is given 0
E(y|electircal) = when electrical is given 1
if B2 is positive
If B 2 is negative
If B = 0
Positive - the mean repair time for electrical will be greater than that for mechanical
Negative - the mean repair time for electrical will be less than that for mechanical
0 - no difference in the mean time b/w electrical and mechanical and the type of repair is NOT related to the repair time
why do we use a dummy variable
the use of a dummy variable provides 2 estimated regression equations that can be used to predict the repair time depending on if its mechiancal or electrical
if a categorical variable has k levels then how many dummy variables are required
k-1
- each dummy variable is coded as a 1 or a zero
what is residual analysis useful for in multiple regression
standardized residuals are frequently used in residual plots and in the identification of outliers
what is the formula for the standardized residual for observation i
yi - y triangle hat i / Syi- y triangle hat i
What is Syi- y triangle hat i stand for
the standard deviation of the residual i
what is the formula for Syi - y traingle hat i
S x square root of hi
S- standard error of the estimate
hi - leverage of observation
HOw is the leverage of the observation determined
by how far the values of the I.vs are form their mean
in multiple regression, what can the normal prob plot be used for
to determine whether the distribution of E appears to be normal
- same procedure as in simple linear regression
- use software to compute it
How do you determine if there is an outlier
if the value of the standardized residual is less than -2 or greater than +2
The presence of one or more outliers in a dat set tends to do what to the standard error
tends to increase the standard error of the estimate
when the size of the standardized residual will decrease when
S increases
What do we do to correct when the standardized residual rule fails to identify the outlier
use studentized deleted residuals
What are studentized deleted residuals
may detect outliers that standardized residuals do not detect
what is definition of unstandardized residual?
difference b/w an observed value and the value predicted by the model
definition of Standardized residual
residual / an estimate of its SD
- also called Pearson Residuals, M=0, SD =1
Definition of Studentized Deleted residual
- the deleted residual for a case / by it’s standard error
What is the difference b/w a studentized deleted residual and its associated studentized residual indicate
how much difference eliminating a case makes on its own prediction
What is the difference b/w a studentized deleted residual and its associated studentized residual indicate
how much difference eliminating a case makes on its own prediction
How is each residual obtained in studentized delted residuals
obtained by regressing using all of the data EXCEPT for the point in question
What does Si denote
the standard error of the estimate based on the data set with i th observation removed
IF Si is less than S what can we say
the ith ovservation is an outlier
- the absolure value of the ith studentized residual will be larger than the absolute value of the standardized residual
How can t distribution be used to determine what with regards to studentized deleted residuals
to determine whether the studentized deleted residuals indicate the presence of outliers
if the value of the ith studentized deleted residual is less than T a/2 or greater than t a/2 what can we conclude
it’s an outlier
what does leverage of an observation hi measure
how far the values of the I.V.s are form their mean values
HOw do we compute leverage
use minitab
What is the rule of thumb or hi
hi > CV, we have an influential observations?
What is a problem that can happen using leverage to find influential observations
observations can be identified as having a high leverage and not necessarily be influential
- using leverage can lead to wrong conclusions
What can we use to eliminate issues with using leverage to find influential observations
Cook’s distance measure
What does Cook’s Distance measure use
both leverage of observation i, hi and the residual observation i (ui-Y triangle hat i)
What is the formula for Cooks distance measure
Di - (Yi-y triangle hati)squared / (p+1)s Squared [hi / (1-hi)squared]
IF Di >1 what can we conclude
the ith observation is influential and should be studied further
What is logistic Regression example
estimate the prob that the bank will approve the request for a c/c given a particula set of vlaues for the chosen I.Vs
what does logistic regression require
dependent variable y and one or more i.v s
What is the logistic regression equation
E(y) = e B0+B1x1+B2x2+…Bpxp / 1+e B0+B1x1+B2x2+..Bpxp
In logistic regression, what type of graph do we have
s shaped graph
in logistic regression, what does the values of E(y) representing probability increase dhow
fairly rapidly as x increase father up
What does logistic regression seek to do
- model the prob. of an event occurring depending on the values of the I.Vs, which can be categorical or numerical
- estimate the prob. that an event occurs for a randomly selected observation vs the prob the event does not occur
- predict the effect
What does logistic regression seek to do
- model the prob. of an event occurring depending on the values of the I.Vs, which can be categorical or numerical
- estimate the prob. that an event occurs for a randomly selected observation vs the prob the event does not occur
- predict the effect of a series of variables on a binary response variable (0 or 1)
- classify observations by estimating the prob that an observation is in a particular category (such as approved or not approved)
Model, estimate, predict and classify)
In logistic regression why won’t simple linear regression work
b/c simple linear regression is one quantiative variable predicting another
in logisitc regression why won’t multiple linear regression work
multiple linear regression is simple linear regression with more i.vs
in logisitc regression why won’t nonlinear regression work
still 2 quantative variables but the data is curvilinear
running typical regressions with logistics causes what problems
- binary data does not have a normal distribution (1 or 0), which is a condition needed for most other types of regression
- predicted values of the DV can be beyond 0 and 1 which violates the definition of probability
- probabilities are often not linear such as ‘U” shapes where prob is very low or very high at extremes of x -values
running typical regressions with logistics causes what problems
- binary data does not have a normal distribution (1 or 0), which is a condition needed for most other types of regression
- predicted values of the DV can be beyond 0 and 1 which violates the definition of probability
- probabilities are often not linear such as ‘U” shapes where prob is very low or very high at extremes of x -values `
What are Odds ratio formula
odds = Prob (occurring)/ Prob (not occurring)
= p / (1-p)
What are Odds ratio formula
odds = Prob (occurring)/ Prob (not occurring)
= p / (1-p)
when is the odds ratio used
in logistic regression
What does the odds ratio measure
the impact on the odds of a one-unit increase in only one of the I.Vs
- the odd’s that y = 1 given that one of the IVs has been increased by 1 unit (odds1)/(the odds that y = 1 given no change in the values (odds0)
what is the event of interest in the odd’s ratio
y = 1
in logistic regression we are estimating what
an unknown p for any given linear combination of the I.vs (so the prob of succcess is p and failure is q = 1-p
TEsting for significance in logisitc regression we use what type of test
G test
What is the Hyp test for significane for logistic regression
Ho: B1 = B2 = 0
If the null hypothesis in logistic regression is true, what can we say about the sampling distribution of G
Follows a chi-square distribution
df = # of I.V
in logistic regression if a is > p-value what can we conclude
we reject HO and conclude that the overall model is significant
What does the G test show
overall significance
What does the G test show
overall significance
what is the z test used for in logistic regression
used to determine whether each of the I.V.s is making a significant contribution to the overall model
What is the hyp test for z-test for logistic regression
HO:B1 = 0
IN a z test for logistic regression, if HO is true
the estimated coefficient divided by its standard error follows a normal prob distribution
Adjusted multiple coefficient of determination - Definition
A measure of the goodness of fit of the estimated multiple regression equation that adjusts for the number of independent variables in the model and thus avoids overestimating the impact of adding more independent variables
Categorical independent variable - Definition
An independent variable with categorical data
Cook’s distance measure - Definition
A measure of the influence of an observation based on both the leverage of observation i and the residual for observation i
Dummy variable - Definition
A variable used to model the effect of categorical independent variables. A dummy variable may take only the value zero or one
Multicollinearity - Definition
The term used to describe the correlation among the independent variables
Multiple coefficient of determination - Definition
A measure of the goodness of fit of the estimated multiple regression equation. It can be interpreted as the proportion of the variability in the dependent variable that is explained by the estimated regression equation
Multiple regression analysis - Definition
Regression analysis involving two or more independent variables.
Multiple regression equation - Definition
The mathematical equation relating the expected value or mean value of the dependent variable to the values of the independent variables; that is, E(y) = B0 + B1x1 + B2x2 + . . . + Bpxp.
Multiple regression model - Definition
The mathematical equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term e.
Odds in favor of an event occurring - Definition
The probability the event will occur divided by the probability the event will not occur
Odds Ratio - Definition
The odds that y = 1 given that one of the independent variables increased by one unit (odds1) divided by the odds that y = 1 given no change in the values for the independent variables (odds0); that is, Odds ratio = odds1yodds0