Chapter 15 Flashcards

1
Q

How many variables, and of what type, are involved in multiple regression?

A

one dependent variable and 2 or more independent variables

2
Q

What is multiple regression the study of?

A

how a dependent variable y is related to 2 or more independent variables

3
Q

What is the multiple regression model

A

y = B0 + B1x1 + B2x2 + ... + Bpxp + E

4
Q

What is the random variable in the regression model?

A

E - the error term

5
Q

what does the error term in multiple regression account for (SSE)

A

accounts for the variability in y that cannot be explained by the linear effect of the p independent variables

6
Q

what are the assumptions in multiple regression

A

the mean or expected value of E is 0

7
Q

What is the multiple regression equation

A

E(y) = B0 + B1x1 + B2x2 + ... + Bpxp

8
Q

What does E(y) stand for

A

mean or expected value of y

9
Q

In multiple regression, when we use sample data to estimate the multiple regression equation, what is the formula?

A

ŷ = b0 + b1x1 + b2x2 + ... + bpxp (the estimated multiple regression equation, where ŷ is read "y-hat")
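
A minimal sketch of how the estimated multiple regression equation turns a set of x values into a prediction. The function name `predict` and the coefficient values are hypothetical, chosen only for illustration; they do not come from any data set in this chapter.

```python
def predict(b, x):
    """Return y-hat = b0 + b1*x1 + ... + bp*xp.

    b is [b0, b1, ..., bp] (intercept first); x is [x1, ..., xp].
    """
    if len(b) != len(x) + 1:
        raise ValueError("need one intercept plus one coefficient per variable")
    # Intercept plus the sum of each coefficient times its variable.
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))


# Hypothetical coefficients (assumed values, not estimated from real data).
b = [1.0, 0.5, 2.0]
y_hat = predict(b, [10.0, 3.0])  # 1.0 + 0.5*10 + 2.0*3
print(y_hat)
```

Each b1, ..., bp multiplies exactly one x, which is why a coefficient can be read as the change in the prediction for a one-unit change in that x with the others held fixed.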

10
Q

What does ŷ (y-hat) stand for in multiple regression?

A

predicted value of the dependent variable

11
Q

what does yi stand for

A

observed value of the dependent variable for the ith observation

12
Q

What does ŷi (y-hat i) stand for?

A

predicted value of the dependent variable for the ith observation

13
Q

When more independent variables are added to a multiple regression, is the regression necessarily “better off”? Why or why not?

A

no, it can make things worse, called overfitting

14
Q

what is multicollinearity

A

the addition of more independent variables creates more relationships among them
- so not only are the I.V. potentially related to the Dependent variable, they are also potentially related to each other

15
Q

If you have 4 I.V., how many relationships do you have

A

4 between each I.V. and the D.V., plus 6 more among the pairs of I.V.s

so in total there are 10 relationships to consider
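
The count above can be checked with a short sketch: each of the p I.V.s has one relationship with the D.V., and every pair of I.V.s adds one more. The function name `relationships` is hypothetical.

```python
from math import comb


def relationships(p):
    """Count relationships to consider with p independent variables."""
    iv_dv = p           # one relationship per I.V. with the D.V.
    iv_iv = comb(p, 2)  # one relationship per pair of I.V.s
    return iv_dv, iv_iv, iv_dv + iv_iv


print(relationships(4))  # (4, 6, 10)
```

The I.V.-to-I.V. count grows much faster than p itself, which is one reason multicollinearity becomes harder to avoid as variables are added.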

16
Q

do all I.V. help at predicting the D.V?

A

no, some I.V. are better at predicting the D.V. than others, some contribute nothing

17
Q

in multicollinearity, what is the ideal situation

A

for all of the I.V.s to be correlated with the D.V. but NOT with each other

18
Q

In multiple regression, how is each coefficient interpreted?

A

the estimated change in y corresponding to a one unit change in a variable when all other variables are held constant

19
Q

What are the 6 preparation steps for multiple regression?

A
  1. generate a list of potential variables, independent and dependent
  2. collect data on the variables
  3. check the relationship b/w each I.V. and the D.V. using scatter plots and correlations
  4. (optional) conduct simple linear regression for each I.V./D.V. pair
  5. use the non-redundant I.V.s in the analysis to find the best fitting model
  6. use the best fitting model to make predictions about the D.V.
20
Q

what two problems can happen in multiple regression

A
  1. overfitting and 2. multicollinearity
21
Q

what is overfitting

A

is caused by adding too many I.V.; they account for more variance but add nothing to the model

22
Q

What is multicollinearity

A

happens when some / all of the i.v.s are correlated with each other

23
Q

In Simple linear regression how do we interpret bi

A

as an estimate of the change in y for a one-unit change in the I.V.

24
Q

In multiple linear regression how do we interpret bi

A

we interpret each regression coefficient bi as an estimate of the change in y corresponding to a one-unit change in xi when all other independent variables are held constant

25
multiple regression what does bi represent
an estimate of the change in y corresponding to a one-unit change in xi when all other I.Vs are held constant
26
What is the formula for Coefficient of Determination in simple linear regression
r squared = SSR/SST
27
what is the formula for Coefficient of Determination with multiple regression
R squared = SSR/SST
28
What does the Multiple Coefficient of Determination indicate
indicates we are measuring the goodness of fit for the estimated multiple regression equation - denoted R squared
29
How can R squared (the multiple coefficient of determination) be interpreted
the proportion of variability in the dependent variable that can be explained by the estimated multiple regression equation - when multiplied by 100, it can be interpreted as the % of the variability in y that can be explained by the estimated regression equation
30
What would this mean: R squared = .904
90.4% of the variability in travel time y is explained by the estimated multiple regression equation
31
Does R squared always increase or decrease as independent variables are added
it increases (more precisely, it never decreases)
32
Why do many analysts prefer adjusting R squared for the number of independent variables
to avoid overestimating the impact of adding an i.v.
33
What is the formula for the adjusted multiple coefficient of determination
adjusted R squared = 1 - (1 - R squared) * (n - 1) / (n - p - 1)
34
What does "p" represent in the adjusted multiple coefficient of determination
the number of independent variables
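
A minimal sketch of the adjusted multiple coefficient of determination, 1 - (1 - R squared)(n - 1)/(n - p - 1). The function name `adjusted_r_squared` and the sample size n = 10 are assumptions made for illustration; only R squared = .904 appears on the cards.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R squared for n observations and p independent variables."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# R squared from the cards; n = 10 and p = 2 are assumed values.
print(round(adjusted_r_squared(0.904, n=10, p=2), 4))

# A small R squared with many I.V.s can give a negative adjusted value.
print(adjusted_r_squared(0.10, n=10, p=5) < 0)
```

Because (n - 1)/(n - p - 1) grows with p, an added variable has to raise R squared enough to offset the penalty before the adjusted value improves.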
35
What happens to SSE when you add more I.V.s
adding I.V.s makes the prediction errors smaller, which reduces SSE. Because SSR = SST - SSE, SSR becomes larger as SSE becomes smaller, so R squared = SSR/SST increases
36
IF a variable is added to the model, what happens to R squared
becomes larger even if the variable added is not statistically significant
37
If the value of R squared is small and the model contains a large # of I.V. what can happen to the Adjusted Coeff. of Det.
can take on a negative value; in such cases, Minitab sets the adjusted coefficient of determination to zero
38
in multiple regression, what is the variance of E denoted by
sigma squared (σ²)
39
in multiple regression, what is assumed about the variance of E
it is the same for all values of the I.V.s
40
What are the assumptions about E in simple or multiple linear regression
  1. the error term E is a random variable with an expected value of 0
  2. the variance of E is the same for all values of the I.V.s
  3. the values of E are independent
  4. the error term E is a normally distributed random variable
41
What type of graph does the multiple regression equation produce
with two I.V.s, a plane in 3-D space
42
What is the value of E in multiple regression
the difference b/w the actual y and the expected value of y, E(y), when x1 = x1* and x2 = x2*
43
Regression analysis terms: what term do we now use for the dependent variable, and what is the graph called
we now use "response variable"; the graph is called a response surface
44
In Simple linear regression what tests did we use to test for significance
t and F test | - both provided the same conclusion
45
In multiple regression what tests do we use when testing for significance
F test and t test
46
in multiple regression what is the F test for
used to determine whether a significant relationship exists b/w the D.V. and the set of I.V.s
47
in multiple regression what is the F test referred to as
the test for Overall significance
48
in multiple regression what is the t test used for
used to determine whether EACH of the I.V.s is Significant | - a separate t test is conducted for EACH of the I.V. s
49
in multiple regression what is the t test referred to as
a test for individual significance
50
What is the Hypotheses for F test in multiple regression (test for overall significance)
H0: B1 = B2 = ... = Bp = 0; Ha: one or more of the parameters is NOT equal to zero
51
In multiple regression, with the F test, if H0 is rejected, what can we say
gives us sufficient statistical evidence to conclude that one or more of the parameters is NOT equal to zero and that the overall relationship b/w y and the set of I.V s is significant
52
In multiple regression, with the F test, if we cannot reject Ho, what can we say
we do not have sufficient evidence to conclude that a significant relationship is present
53
in multiple regression, what is the formula for mean square
sum of squares / df
54
In multiple regression, what is the df for Total sum of squares
n-1
55
in multiple regression, what is the df for sum of squares due to regression (SSR)
p df, where p is the number of I.V.s
56
in multiple regression, what is the df for sum of squares due to error
SSE has n-p-1 df
57
in multiple regression, what is the formula for MSR
MSR = SSR/p
58
in multiple regression, what is the formula for MSE
MSE = SSE/(n-p-1)
59
In multiple regression, what does MSE provide
provides an unbiased estimate of sigma squared (the variance of the error term E)
60
in multiple regression, if H0: B1 = B2 = ... = Bp = 0 is true, what can we say about MSR
MSR also provides an unbiased estimate of sigma squared, and the value of MSR/MSE should be close to 1
61
in multiple regression, if H0 is false (at least one of B1, B2, ..., Bp is not 0), what can we say about MSR
MSR overestimates sigma squared, and the value of MSR/MSE becomes larger
62
How do you determine how large the value of F must be to reject Ho
p-value approach: reject H0 if p-value < alpha; critical value approach: reject H0 if F > F alpha
63
What is F alpha based on, and what are its df
it is based on the F distribution with df for the numerator = p and df for the denominator = n-p-1
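
A minimal sketch tying the ANOVA quantities together: mean square = sum of squares / df, F = MSR/MSE, and s = square root of MSE. The function name `anova_summary` and the input numbers are hypothetical; looking up the critical value F alpha or the p-value is left to statistical software.

```python
from math import sqrt


def anova_summary(ssr, sse, n, p):
    """Compute the quantities used in the F test for overall significance."""
    df_regression = p      # df for SSR
    df_error = n - p - 1   # df for SSE
    if df_error <= 0:
        raise ValueError("need n > p + 1 observations")
    sst = ssr + sse        # total sum of squares, with n - 1 df
    msr = ssr / df_regression
    mse = sse / df_error
    return {
        "SST": sst,
        "R2": ssr / sst,
        "MSR": msr,
        "MSE": mse,
        "F": msr / mse,
        "s": sqrt(mse),    # standard error of the estimate
        "df": (df_regression, df_error),
    }


# Hypothetical sums of squares (assumed values, for illustration only).
result = anova_summary(ssr=90.0, sse=10.0, n=13, p=2)
print(result["F"], result["df"])
```

A value of F far above 1 is what leads to rejecting H0; how far is "far enough" depends on the F distribution with the two df values returned here.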
64
What is the square root of MSE called
the standard error of the estimate, denoted s
65
what is the formula for the standard error of the estimate
s = square root of MSE
66
what is the test statistic formula for t test in multiple regression
t = bi / sbi
67
what are the hypotheses for the t test
H0: Bi = 0; Ha: Bi is not equal to zero (tested separately for each parameter Bi)
69
What does the I.V in regression analysis refer to
any variable being used to predict or explain the value of the D.V.
70
How do you determine whether multicollinearity is high enough to cause problems - what test do you use
rule of thumb test: a sample correlation coefficient greater than +0.7 or less than -0.7 for 2 I.V.s is a warning of potential problems; try to avoid including I.V.s that are highly correlated (in practice, this is rarely possible)
71
In multiple regression, what problem arises if there is substantial multicollinearity
separating the effects of individual i.vs on the dependent variables is very difficult
72
What do you use to estimate and predict
a confidence interval and a prediction interval
73
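what is a confidence interval (when estimating with multiple regression)
an interval estimate of the mean value of y for given values of the I.V.s, e.g., the mean travel time for all trucks that travel 100 miles and make 2 deliveries
74
what is a prediction interval
an interval estimate of an individual value of y for given values of the I.V.s, e.g., the travel time for one specific truck that travels 100 miles and makes 2 deliveries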
75
With multiple regression, how do you develop the interval estimate for the mean value of y and the prediction interval
similar to simple linear regression; in practice we use Minitab, Excel, or another software package
77
How do you interpret B0, B1, and B2 when a categorical variable x2 is present in multiple regression, with x2 = 0 for a mechanical repair and x2 = 1 for an electrical repair? E(y|mechanical) is the mean or expected value of repair time given a mechanical repair.
```
E(y|mechanical) = B0 + B1x1 + B2(0) = B0 + B1x1
E(y|electrical) = B0 + B1x1 + B2(1) = B0 + B1x1 + B2 = (B0 + B2) + B1x1
```
79
How do you interpret B2 in multiple regression when x2 = 0 for a mechanical repair and x2 = 1 for an electrical repair, if B2 is positive, negative, or zero
Positive: the mean repair time for electrical repairs is greater than that for mechanical repairs. Negative: the mean repair time for electrical repairs is less than that for mechanical repairs. Zero: there is no difference in mean repair time b/w electrical and mechanical repairs, and the type of repair is NOT related to the repair time
81
why do we use a dummy variable
the use of a dummy variable provides 2 estimated regression equations that can be used to predict the repair time, depending on whether the repair is mechanical or electrical
82
if a categorical variable has k levels then how many dummy variables are required
k-1; each dummy variable is coded as 0 or 1
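
A minimal sketch of coding a categorical variable with k levels as k - 1 dummy variables. The function name `dummy_code` and the choice of the first level as the baseline (all dummies 0) are assumptions for illustration; any level can serve as the baseline.

```python
def dummy_code(values, levels):
    """Code each value as k-1 dummy variables (0 or 1).

    levels lists the k categories; levels[0] is treated as the baseline
    and is represented by all zeros.
    """
    others = levels[1:]  # the k-1 non-baseline levels
    rows = []
    for value in values:
        if value not in levels:
            raise ValueError(f"unknown level: {value!r}")
        rows.append([1 if value == level else 0 for level in others])
    return rows


# Two levels need one dummy: mechanical = 0, electrical = 1.
print(dummy_code(["mechanical", "electrical"], ["mechanical", "electrical"]))

# Three hypothetical levels need two dummies.
print(dummy_code(["A", "B", "C"], ["A", "B", "C"]))
```

Using k dummies instead of k - 1 would make the dummies perfectly collinear with the intercept, which is why one level is left out as the baseline.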
83
what is residual analysis useful for in multiple regression
standardized residuals are frequently used in residual plots and in the identification of outliers
84
what is the formula for the standardized residual for observation i
(yi - ŷi) / s(yi - ŷi), the residual divided by its standard deviation
85
What does s(yi - ŷi) stand for
the standard deviation of residual i
86
what is the formula for s(yi - ŷi)
s * square root of (1 - hi), where s is the standard error of the estimate and hi is the leverage of observation i
87
How is the leverage of an observation determined
by how far the values of the I.V.s are from their means
88
in multiple regression, what can the normal prob plot be used for
to determine whether the distribution of E appears to be normal - same procedure as in simple linear regression - use software to compute it
89
How do you determine if there is an outlier
if the value of the standardized residual is less than -2 or greater than +2
90
The presence of one or more outliers in a data set tends to do what to the standard error
tends to increase the standard error of the estimate
91
The size of the standardized residual will decrease when
s increases
92
What do we do to correct when the standardized residual rule fails to identify the outlier
use studentized deleted residuals
93
What are studentized deleted residuals
may detect outliers that standardized residuals do not detect
94
what is definition of unstandardized residual?
difference b/w an observed value and the value predicted by the model
95
definition of Standardized residual
residual / an estimate of its SD | - also called Pearson Residuals, M=0, SD =1
96
Definition of Studentized Deleted residual
the deleted residual for a case divided by its standard error
97
What does the difference b/w a studentized deleted residual and its associated studentized residual indicate
how much difference eliminating a case makes to its own prediction
99
How is each residual obtained in studentized deleted residuals
obtained by regressing using all of the data EXCEPT for the point in question
100
What does Si denote
the standard error of the estimate based on the data set with i th observation removed
101
IF Si is less than S what can we say
the ith observation is an outlier; the absolute value of the ith studentized deleted residual will be larger than the absolute value of the standardized residual
102
How can the t distribution be used with studentized deleted residuals
to determine whether the studentized deleted residuals indicate the presence of outliers
103
if the value of the ith studentized deleted residual is less than -t(alpha/2) or greater than +t(alpha/2), what can we conclude
it's an outlier
104
what does leverage of an observation hi measure
how far the values of the I.V.s are from their mean values
105
How do we compute leverage
use Minitab (or other statistical software)
106
What is the rule of thumb for hi
if hi > CV (the critical value), observation i may be an influential observation
107
What is a problem that can happen using leverage to find influential observations
observations can be identified as having a high leverage and not necessarily be influential - using leverage can lead to wrong conclusions
108
What can we use to eliminate issues with using leverage to find influential observations
Cook's distance measure
109
What does Cook's Distance measure use
both the leverage of observation i, hi, and the residual for observation i, (yi - ŷi)
110
What is the formula for Cooks distance measure
Di = [(yi - ŷi)^2 / ((p + 1) s^2)] * [hi / (1 - hi)^2]
111
IF Di >1 what can we conclude
the ith observation is influential and should be studied further
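
A minimal sketch of Cook's distance for a single observation, alongside the standardized residual. It assumes the standard deviation of residual i is s * sqrt(1 - hi), the usual textbook form; the function names and the input numbers are hypothetical, and in practice the leverage hi comes from software.

```python
from math import sqrt


def standardized_residual(y, y_hat, s, h):
    """Residual divided by its standard deviation, s * sqrt(1 - h)."""
    return (y - y_hat) / (s * sqrt(1 - h))


def cooks_distance(y, y_hat, s, h, p):
    """Cook's distance D_i for one observation.

    s is the standard error of the estimate, h the leverage of the
    observation, and p the number of independent variables.
    """
    residual_part = (y - y_hat) ** 2 / ((p + 1) * s ** 2)
    leverage_part = h / (1 - h) ** 2
    return residual_part * leverage_part


# Hypothetical observation: residual of 2, s = 1, leverage 0.5, one I.V.
d = cooks_distance(y=10.0, y_hat=8.0, s=1.0, h=0.5, p=1)
print(d, d > 1)  # D_i > 1 flags the observation for further study
```

Both a large residual and a high leverage push D_i up, which is why it can flag influential observations that leverage alone would misjudge.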
112
What is an example of logistic regression
estimating the prob. that a bank will approve a request for a credit card, given a particular set of values for the chosen I.V.s
113
what does logistic regression require
dependent variable y and one or more i.v s
114
What is the logistic regression equation
E(y) = e^(B0 + B1x1 + B2x2 + ... + Bpxp) / (1 + e^(B0 + B1x1 + B2x2 + ... + Bpxp))
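
A minimal sketch of evaluating the logistic regression equation, E(y) = e^z / (1 + e^z) with z = B0 + B1x1 + ... + Bpxp. The function name `logistic_mean` and the coefficient values are hypothetical, chosen only to show the shape of the curve.

```python
from math import exp


def logistic_mean(b, x):
    """Return E(y), the probability that y = 1, for coefficients b and values x.

    b is [B0, B1, ..., Bp] (intercept first); x is [x1, ..., xp].
    """
    z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return exp(z) / (1 + exp(z))


# Hypothetical coefficients (assumed values, not fitted to data).
b = [-3.0, 1.5]
for x1 in range(5):
    # Probabilities stay between 0 and 1 and rise along an S-shaped curve.
    print(x1, round(logistic_mean(b, [x1]), 3))
```

Unlike a linear equation, the result can never fall below 0 or rise above 1, so it can always be read as a probability.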
115
In logistic regression, what type of graph do we have
an S-shaped curve
116
in logistic regression, how do the values of E(y), representing probability, increase
fairly rapidly as x increases further up the curve
118
What does logistic regression seek to do
  1. model the prob. of an event occurring depending on the values of the I.V.s, which can be categorical or numerical
  2. estimate the prob. that an event occurs for a randomly selected observation vs the prob. that the event does not occur
  3. predict the effect of a series of variables on a binary response variable (0 or 1)
  4. classify observations by estimating the prob. that an observation is in a particular category (such as approved or not approved)
(in short: model, estimate, predict, and classify)
119
In logistic regression why won't simple linear regression work
b/c simple linear regression is one quantitative variable predicting another
120
in logistic regression why won't multiple linear regression work
multiple linear regression is simple linear regression with more I.V.s
121
in logistic regression why won't nonlinear regression work
still 2 quantitative variables, but the data is curvilinear
122
running typical regressions on binary (logistic) data causes what problems
  1. binary data (1 or 0) does not have a normal distribution, which is a condition needed for most other types of regression
  2. predicted values of the D.V. can fall below 0 or above 1, which violates the definition of probability
  3. probabilities are often not linear, such as "U" shapes where the prob. is very low or very high at extremes of the x-values
124
What is the formula for the odds
odds = Prob (occurring)/ Prob (not occurring) = p / (1-p)
126
when is the odds ratio used
in logistic regression
127
What does the odds ratio measure
the impact on the odds of a one-unit increase in only one of the I.V.s: the odds that y = 1 given that one of the I.V.s has been increased by 1 unit (odds1) divided by the odds that y = 1 given no change in the values (odds0)
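
A minimal sketch of the odds ratio computed straight from its definition, odds1/odds0, using a logistic model with one I.V. The function names and the coefficients are hypothetical; the comparison with e^(B1) at the end is a property of the logistic equation, not something stated on the cards.

```python
from math import exp


def probability(b0, b1, x):
    """P(y = 1) from the logistic regression equation with one I.V."""
    z = b0 + b1 * x
    return exp(z) / (1 + exp(z))


def odds(p):
    """Odds in favor of an event: p / (1 - p)."""
    return p / (1 - p)


# Hypothetical coefficients (assumed values, for illustration only).
b0, b1 = -1.0, 0.8

odds0 = odds(probability(b0, b1, 2.0))  # no change in x
odds1 = odds(probability(b0, b1, 3.0))  # x increased by one unit
odds_ratio = odds1 / odds0

# In the logistic model the odds equal e^z, so the ratio reduces to e^(B1).
print(round(odds_ratio, 4), round(exp(b1), 4))
```

The ratio comes out the same whichever starting value of x is used, which is what makes it a convenient single-number summary of one I.V.'s effect.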
128
what is the event of interest in the odd's ratio
y = 1
129
in logistic regression we are estimating what
an unknown p for any given linear combination of the I.V.s (so the prob. of success is p and the prob. of failure is q = 1-p)
130
When testing for significance in logistic regression, what type of test do we use
G test
131
What are the hypotheses for the test of overall significance in logistic regression
H0: B1 = B2 = ... = Bp = 0; Ha: one or more of the parameters is not equal to zero
132
If the null hypothesis in logistic regression is true, what can we say about the sampling distribution of G
it follows a chi-square distribution with df = the number of I.V.s
133
in logistic regression, if alpha is greater than the p-value, what can we conclude
we reject HO and conclude that the overall model is significant
134
What does the G test show
overall significance
136
what is the z test used for in logistic regression
used to determine whether each of the I.V.s is making a significant contribution to the overall model
137
What is the hyp test for z-test for logistic regression
H0: Bi = 0; Ha: Bi is not equal to zero (for example, H0: B1 = 0)
138
In a z test for logistic regression, if H0 is true, what can we say
the estimated coefficient divided by its standard error follows a normal prob distribution
139
Adjusted multiple coefficient of determination - Definition
A measure of the goodness of fit of the estimated multiple regression equation that adjusts for the number of independent variables in the model and thus avoids overestimating the impact of adding more independent variables
140
Categorical independent variable - Definition
An independent variable with categorical data
141
Cook’s distance measure - Definition
A measure of the influence of an observation based on both the leverage of observation i and the residual for observation i
142
Dummy variable - Definition
A variable used to model the effect of categorical independent variables. A dummy variable may take only the value zero or one
143
Multicollinearity - Definition
The term used to describe the correlation among the independent variables
144
Multiple coefficient of determination - Definition
A measure of the goodness of fit of the estimated multiple regression equation. It can be interpreted as the proportion of the variability in the dependent variable that is explained by the estimated regression equation
145
Multiple regression analysis - Definition
Regression analysis involving two or more independent variables.
146
Multiple regression equation - Definition
The mathematical equation relating the expected value or mean value of the dependent variable to the values of the independent variables; that is, E(y) = B0 + B1x1 + B2x2 + . . . + Bpxp.
147
Multiple regression model - Definition
The mathematical equation that describes how the dependent variable y is related to the independent variables x1, x2, . . . , xp and an error term e.
148
Odds in favor of an event occurring - Definition
The probability the event will occur divided by the probability the event will not occur
149
Odds Ratio - Definition
The odds that y = 1 given that one of the independent variables increased by one unit (odds1) divided by the odds that y = 1 given no change in the values for the independent variables (odds0); that is, Odds ratio = odds1/odds0