Multiple and nonlinear regression Flashcards
Nonlinear models (6)
1.) logistic curve
-curves in and out of the 1:1 line
-take the inverse of each side of the equation (flip numerator and denominator)
-then isolate a + bx on one side and take the natural log
-by transforming the dependent variable, we reveal
the linear behaviour of the independent variable
2.) multiplicative model, a simple power function
y = ax^b
-take the log of both sides of the equation and use simple linear regression
-does not work with negative or 0 values
-best for a fixed rate of change
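A minimal Python sketch of the log-log transform for the power model; the data and helper names are illustrative, and the least-squares fit is written out by hand rather than taken from a stats library:

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b          # (intercept a, slope b)

def fit_power(x, y):
    """Fit y = a*x^b via ln y = ln a + b*ln x; needs positive x and y."""
    lna, b = fit_line([math.log(v) for v in x],
                      [math.log(v) for v in y])
    return math.exp(lna), b

x = [1, 2, 3, 4, 5]
y = [2.0 * v ** 1.5 for v in x]    # exact power-law data
a, b = fit_power(x, y)
print(round(a, 3), round(b, 3))    # recovers a = 2.0, b = 1.5
```

With noise-free data the transform recovers the parameters exactly, which is a quick sanity check that the linearization is set up correctly.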
3.) exponential decay
y = ae^(-bx)
-the best-fit curve has this general form; note that the curve approaches but never crosses the x-axis (y = 0 is an asymptote)
-take the natural log of both sides (ln y = ln a - bx) in order to treat it like a linear regression; the intercept of the transformed line is ln a, not a itself
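The same semilog idea in a short sketch, assuming synthetic decay data and a hand-rolled least-squares helper:

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
            / sum((xi - mx) ** 2 for xi in x)
    return my - slope * mx, slope

# y = a*e^(-b*x)  ->  ln y = ln a - b*x, so regress ln y on x
x = [0, 1, 2, 3, 4]
y = [5.0 * math.exp(-0.3 * v) for v in x]   # exact decay data
lna, slope = fit_line(x, [math.log(v) for v in y])
a, b = math.exp(lna), -slope
print(round(a, 3), round(b, 3))             # recovers a = 5.0, b = 0.3
```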
4.) logarithmic regression
y = a + b ln x
-used for datasets that experience rapid
rates of change initially but then level out through time
-note that logarithmic functions are undefined for zero or negative
values, so if your dataset contains them you will need to remove
them or transform the data first if
you think the logarithmic function is the right choice
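Logarithmic regression is just a straight-line fit after transforming the predictor to ln x. A sketch with illustrative, noise-free data:

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

x = [1, 2, 4, 8, 16]
y = [1.0 + 2.0 * math.log(v) for v in x]    # rises fast, then levels out
a, b = fit_line([math.log(v) for v in x], y)
print(round(a, 3), round(b, 3))             # recovers a = 1.0, b = 2.0
```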
5.) polynomial function
-polynomial functions involve multiple levels of curvature, and the complexity of the
model can be changed to suit the data
• polynomial models tend to work best in
large datasets
6.) Periodic regression model
y = ȳ + A cos(ωt − φ)
ex: time series data
-do not make it into a linear function
-the model should show seasonality, so a linear function would not properly represent it
• calculate the mean value of y (ȳ)
• calculate the amplitude of y (A) = (max(y) − min(y)) / 2
-like all regression models, we can now compare the observed y values to those predicted by the model to determine an r² value
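The first two steps above (mean level and amplitude) can be sketched directly; the seasonal series is synthetic and the frequency ω is assumed known here (one cycle per 12 months):

```python
import math

t = list(range(12))                          # months 0..11
w = 2 * math.pi / 12                         # one cycle per year (assumed known)
y = [10 + 3 * math.cos(w * ti) for ti in t]  # synthetic seasonal series

ybar = sum(y) / len(y)                       # mean value of y
A = (max(y) - min(y)) / 2                    # amplitude of y
print(round(ybar, 3), round(A, 3))           # 10.0 3.0
```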
• however, determining which type of transformation to apply is the challenge
• a good first step is to read the literature – eg, it is well known that many
stream morphology relationships behave as power functions, so if working with
stream data, you might expect the power transformation to work best
Multiple regression
• linear multiple regression models have the form
ŷ = a + b1x1 + b2x2 + b3x3 + ⋯ + bnxn
• each bnxn pair represents a different independent variable
• we face a significant challenge here in terms of visualizing the relationship
• recall that the scatterplot is the common method of displaying a simple
relationship – 1 independent variable + 1 dependent variable = 2-dimensional
space
• once we add a 2nd independent variable or more, we add more dimensions to
the scatterplot and quickly find that it is impossible to plot higher-order
regression models
Types of multiple regression models
1.) 3-D
-in the 3-D case, the regression model is not represented by a line, but instead by a flat plane
• the value of a corresponds to the point on the plane where x1 = x2 = 0
• the value of b1 represents the change in y for every unit change in x1, while x2 stays constant
• the value of b2 represents the change in y for every unit change in x2, while x1 stays constant
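The plane interpretation can be checked numerically. A sketch using the normal equations with illustrative, exactly planar data; the tiny Gaussian-elimination solver stands in for statistical software:

```python
def solve(A, v):
    """Solve A*x = v by Gauss-Jordan elimination (A small and square)."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 + 3 * u + 4 * v for u, v in zip(x1, x2)]   # exact plane

# normal equations (X^T X) beta = X^T y with columns [1, x1, x2]
X = [[1.0, u, v] for u, v in zip(x1, x2)]
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
a, b1, b2 = solve(XtX, Xty)
print(round(a, 3), round(b1, 3), round(b2, 3))    # 2.0 3.0 4.0
```

Here a is the height of the plane at x1 = x2 = 0, and b1 and b2 are the per-unit changes in y with the other variable held constant, exactly as described above.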
2.) 4-D
• a regression model with 3 independent variables would be plotted in 4-D space
goal of simple and multiple regression
the goal of multiple regression is the same as simple
regression – minimize the error/residuals/unexplained variation
Sensitivity testing
• recall that correlation analysis was strongly affected by the size of sample areas
• a small number of large census tracts will give a different correlation
coefficient than a large number of small tracts, given the same data
• the same situation occurs in multiple regression
• the magnitude and significance of the coefficient of determination (r²) can be
strongly influenced by data aggregation
• therefore, when performing a multiple regression with spatial data, it is highly
advisable to carry out sensitivity testing to determine the influence of sample
size and configuration on the results
• sensitivity testing is a common method of model evaluation
• eg, run the analysis at several scales or aggregations to look for substantial
changes in the models
Assumptions for simple regression vs. assumptions for multiple regression
• recall the assumptions for simple regression:
• the relationship between x and y is linear and the equation for a straight line
represents the model
• the residuals have a mean = 0 and their variance does not vary with x
• the residuals are all independent
• for each value of x, the residuals have a normal distribution centred on the line
of best fit
- multiple regression adds one more:
- there is no multicollinearity among the independent variables
multicollinearity
multicollinearity means that, while each independent variable should be correlated
with the dependent variable, the independent variables should not be correlated
with each other
• for example, say we have a multiple regression model that predicts house price
from house size and whether the house has a garage or not
• we should find that:
• house price correlates with house size
• house price correlates with garage yes/no
• house size does not correlate with garage yes/no
• the effect of multicollinearity is to make insignificant relationships appear more significant,
and to make the empirical parameters (a and b) more sensitive to individual data
points
ex: you cannot use both male and female income as independent variables because they are highly correlated with each other
-so you would choose only one, either male OR female income, as an independent variable
Dealing with multicollinearity: interaction variable
1.) interaction variable
• for example, we might find that 2 independent variables (x, z) are highly correlated
with the dependent variable (y) as well as with each other
• we may not be able to logically choose which of x or z to keep and which to
drop
• instead, we could create a new interaction variable, which would be the
product of x and z, or xz
• in this case, the information contained within x and z is combined,
reflecting the fact that the two independent variables interact with each
other
combine two similar variables in order to keep all of the information by using the product of the two
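A sketch of spotting the problem and building the interaction variable; the data are illustrative and corr() is a hand-rolled Pearson correlation:

```python
def corr(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [1.1, 2.0, 3.1, 3.9, 5.1]          # nearly redundant with x
print(round(corr(x, z), 3))            # close to 1: multicollinearity

# keep the combined information in a single interaction variable xz = x*z
xz = [a * b for a, b in zip(x, z)]
print(xz)
```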
ex of multiple regression model
HPrice = 21.6(LotSize) + 9333.4(Bedrooms) − 1933
• from this model, we know that the house price will increase $21.60 for every 1 ft²
the lot size increases, provided the number of bedrooms stays the same
• likewise, the house price increases $9333.40 for every bedroom included, provided
the lot size stays the same
• it is also important to note here that the house price is a combination of lot size and
the number of bedrooms, and treating each individually would not be appropriate
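A quick sketch evaluating the fitted model from the notes for a hypothetical 5000 ft², 3-bedroom house (the house itself is made up):

```python
def h_price(lot_size_ft2, bedrooms):
    """HPrice = 21.6*LotSize + 9333.4*Bedrooms - 1933 (model from the notes)."""
    return 21.6 * lot_size_ft2 + 9333.4 * bedrooms - 1933

base = h_price(5000, 3)
print(round(base, 2))                        # 134067.2
# one more ft2 of lot, bedrooms held constant: +$21.60
print(round(h_price(5001, 3) - base, 2))     # 21.6
# one more bedroom, lot size held constant: +$9333.40
print(round(h_price(5000, 4) - base, 2))     # 9333.4
```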
misspecification
treating the independent variables as separate, or not including important
independent variables at all, leads to misspecification of the model
• note the differences in the a and b coefficients between these simple regression
models and the previous multiple regression models – multiple regression considers
the combined effects of the independent variables and is not just the "sum of the
parts"
• furthermore, if house price was significantly dependent on a third variable, say the
number of bathrooms, our original multiple regression model would be misspecified
and consequently incorrect
Ways to develop a multiple regression model: Kitchen sink
• usually, multiple regression analysis begins with a large data set, and a single
dependent variable is selected from the variables
1.)Kitchen Sink Method
- this method of developing a regression model simply takes all of the
independent variables and puts them into the analysis
-in this method, as we add independent variables, the coefficient of
determination will continue to increase, suggesting that we are improving
our model
• however, the adjusted r² will only increase if the added variables add
significant power to the model
-gives a table of coefficients
-anything with a p-value of less than 0.05 is significant
-every time you add a variable, r² has to go up
-r² does not tell you how good a model is, just reflects how many variables you have
-an r² value of 0.876 makes you think we have done a good job, because 87.6% of the variation in housing price appears explained, when in reality variables which are not really important are pushing this score up
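The adjusted r² penalty can be sketched with its standard formula, 1 − (1 − r²)(n − 1)/(n − k − 1); the r² values below are made up for illustration:

```python
def adjusted_r2(r2, n, k):
    """n = sample size, k = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30
with_3 = adjusted_r2(0.870, n, 3)   # 3 useful predictors
with_4 = adjusted_r2(0.871, n, 4)   # a 4th, nearly useless predictor added
print(round(with_3, 3), round(with_4, 4))   # 0.855 0.8504
# r2 crept up (0.870 -> 0.871), but adjusted r2 went down,
# flagging that the extra variable adds no real power
```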
-another table given in the multiple
regression analysis describes the coefficients
for each independent variable
and their t and p values; a coefficient is significant when its p-value is below 0.05
What to do with missing or 0 values? (3)
we could simply delete the entire census tract, with the justification that it is not
representative of the rest of the census tracts
• we could fill the missing values with the mean of the variable for the entire data
set
• we could perform a regression analysis to predict the value of the missing value(s)
• in most cases, missing values are treated by deletion – you need to have compelling
evidence that allows you to invent the value of a data point
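A sketch of the mean-fill option from the list above (deletion remains the usual default); the data are illustrative and None marks a missing value:

```python
def mean_fill(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

data = [4.0, None, 6.0, 8.0, None, 2.0]
print(mean_fill(data))   # gaps filled with mean of 4, 6, 8, 2 -> 5.0
```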
tolerance
• we must also consider the role of multicollinearity amongst the
variables
• this can be determined by examining the tolerance values given in
the regression output
• tolerance is the amount of variability in an independent variable
that is not explained by the other independent variables
• as a rule of thumb, a high tolerance (> 0.2) is good because it
shows that the variable is exerting an influence on the
dependent variable that no other independent variable is
• when tolerance is low (< 0.2), that variable is exerting less
unique influence on the dependent variable, and much of its
influence could be the same as that imparted by a different
independent variable
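Tolerance is 1 − r² from regressing one independent variable on the others; with a single other predictor, that r² is just the squared Pearson correlation. A sketch with illustrative, nearly redundant data:

```python
def corr(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.2, 1.9, 3.1, 4.0, 5.2]          # nearly redundant with x1
tolerance = 1 - corr(x1, x2) ** 2       # 1 - r2 of the auxiliary regression
print(round(tolerance, 3))              # well under 0.2: multicollinearity
```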
• because of the issues related to the adjusted r² value, missing data,
insignificant relationships, and multicollinearity, it is advisable to
avoid the kitchen sink method