Multiple and nonlinear regression Flashcards
Nonlinear models (6)
1.) logistic curve
-curves in and out of the 1:1 line
-take the inverse of each side of the equation (flip numerator and denominator)
-then isolate a + bx on one side and take the natural log
-by transforming the dependent variable, we reveal
the linear behaviour of the independent variable
2.) multiplicative model, a simple power function
y = ax^b
-take the log of both sides of the equation and use simple linear regression
-does not work with negative or 0 values
-best for a fixed rate of change
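A minimal Python sketch of the log-log transform for the power model; the data and helper names are illustrative, and the least-squares fit is written out by hand rather than taken from a stats library:

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b          # (intercept a, slope b)

def fit_power(x, y):
    """Fit y = a*x^b via ln y = ln a + b*ln x; needs positive x and y."""
    lna, b = fit_line([math.log(v) for v in x],
                      [math.log(v) for v in y])
    return math.exp(lna), b

x = [1, 2, 3, 4, 5]
y = [2.0 * v ** 1.5 for v in x]    # exact power-law data
a, b = fit_power(x, y)
print(round(a, 3), round(b, 3))    # recovers a = 2.0, b = 1.5
```

With noise-free data the transform recovers the parameters exactly, which is a quick sanity check that the linearization is set up correctly.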
3.) exponential decay
y = ae^(-bx)
-the best-fit curve has this general form; note that the curve approaches but never crosses the x-axis (y = 0 is an asymptote)
-take the natural log of both sides (ln y = ln a - bx) in order to treat it like a linear regression; the intercept of the transformed line is ln a, not a itself
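The same semilog idea in a short sketch, assuming synthetic decay data and a hand-rolled least-squares helper:

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
            / sum((xi - mx) ** 2 for xi in x)
    return my - slope * mx, slope

# y = a*e^(-b*x)  ->  ln y = ln a - b*x, so regress ln y on x
x = [0, 1, 2, 3, 4]
y = [5.0 * math.exp(-0.3 * v) for v in x]   # exact decay data
lna, slope = fit_line(x, [math.log(v) for v in y])
a, b = math.exp(lna), -slope
print(round(a, 3), round(b, 3))             # recovers a = 5.0, b = 0.3
```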
4.) logarithmic regression
y = a + b ln x
-used for datasets that experience rapid
rates of change initially but then level out through time
-note that logarithmic functions are undefined for zero or negative
values, so if your dataset contains them you will need to remove
them or transform the data first if
you think the logarithmic function is the right choice
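Logarithmic regression is just a straight-line fit after transforming the predictor to ln x. A sketch with illustrative, noise-free data:

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

x = [1, 2, 4, 8, 16]
y = [1.0 + 2.0 * math.log(v) for v in x]    # rises fast, then levels out
a, b = fit_line([math.log(v) for v in x], y)
print(round(a, 3), round(b, 3))             # recovers a = 1.0, b = 2.0
```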
5.) polynomial function
-polynomial functions involve multiple levels of curvature, and the complexity of the
model can be changed to suit the data
• polynomial models tend to work best in
large datasets
6.) Periodic regression model
y = ȳ + A cos(ωt − φ)
ex: time series data
-do not make it into a linear function
-the model should show seasonality, so a linear function would not properly represent it
• calculate the mean value of y (ȳ)
• calculate the amplitude of y (A) = (max(y) − min(y)) / 2
-like all regression models, we can now compare the observed y values to those predicted by the model to determine an r² value
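The first two steps above (mean level and amplitude) can be sketched directly; the seasonal series is synthetic and the frequency ω is assumed known here (one cycle per 12 months):

```python
import math

t = list(range(12))                          # months 0..11
w = 2 * math.pi / 12                         # one cycle per year (assumed known)
y = [10 + 3 * math.cos(w * ti) for ti in t]  # synthetic seasonal series

ybar = sum(y) / len(y)                       # mean value of y
A = (max(y) - min(y)) / 2                    # amplitude of y
print(round(ybar, 3), round(A, 3))           # 10.0 3.0
```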
• however, determining which type of transformation to apply is the challenge
• a good first step is to read the literature – eg, it is well known that many
stream morphology relationships behave as power functions, so if working with
stream data, you might expect the power transformation to work best
Multiple regression
• linear multiple regression models have the form
ŷ = a + b1x1 + b2x2 + b3x3 + ⋯ + bnxn
• each bnxn pair represents a different independent variable
• we face a significant challenge here in terms of visualizing the relationship
• recall that the scatterplot is the common method of displaying a simple
relationship – 1 independent variable + 1 dependent variable = 2-dimensional
space
• once we add a 2nd independent variable or more, we add more dimensions to
the scatterplot and quickly find that it is impossible to plot higher-order
regression models
Types of multiple regression models
1.) 3-D
-in the 3-D case, the regression model is not represented by a line, but instead by a flat plane
• the value of a corresponds to the point on the plane where x1 = x2 = 0
• the value of b1 represents the change in y for every unit change in x1, while x2 stays constant
• the value of b2 represents the change in y for every unit change in x2, while x1 stays constant
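The plane interpretation can be checked numerically. A sketch using the normal equations with illustrative, exactly planar data; the tiny Gaussian-elimination solver stands in for statistical software:

```python
def solve(A, v):
    """Solve A*x = v by Gauss-Jordan elimination (A small and square)."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 + 3 * u + 4 * v for u, v in zip(x1, x2)]   # exact plane

# normal equations (X^T X) beta = X^T y with columns [1, x1, x2]
X = [[1.0, u, v] for u, v in zip(x1, x2)]
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
a, b1, b2 = solve(XtX, Xty)
print(round(a, 3), round(b1, 3), round(b2, 3))    # 2.0 3.0 4.0
```

Here a is the height of the plane at x1 = x2 = 0, and b1 and b2 are the per-unit changes in y with the other variable held constant, exactly as described above.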
2.) 4-D
• a regression model with 3 independent variables would be plotted in 4-D space
goal of simple and multiple regression
the goal of multiple regression is the same as simple
regression – minimize the error/residuals/unexplained variation
Sensitivity testing
• recall that correlation analysis was strongly affected by the size of sample areas
• a small number of large census tracts will give a different correlation
coefficient than a large number of small tracts, given the same data
• the same situation occurs in multiple regression
• the magnitude and significance of the coefficient of determination (r²) can be
strongly influenced by data aggregation
• therefore, when performing a multiple regression with spatial data, it is highly
advisable to carry out sensitivity testing to determine the influence of sample
size and configuration on the results
• sensitivity testing is a common method of model evaluation
• eg, run the analysis at several scales or aggregations to look for substantial
changes in the models
Assumptions for simple regression vs. assumptions for multiple regression
• recall the assumptions for simple regression:
• the relationship between x and y is linear and the equation for a straight line
represents the model
• the residuals have a mean = 0 and their variance does not vary with x
• the residuals are all independent
• for each value of x, the residuals have a normal distribution centred on the line
of best fit
- multiple regression adds one more:
- there is no multicollinearity among the independent variables
multicollinearity
multicollinearity means that, while each independent variable should be correlated
with the dependent variable, the independent variables should not be correlated
with each other
• for example, say we have a multiple regression model that predicts house price
from house size and whether the house has a garage or not
• we should find that:
• house price correlates with house size
• house price correlates with garage yes/no
• house size does not correlate with garage yes/no
• the effect of multicollinearity is to make insignificant relationships appear more significant,
and to make the empirical parameters (a and b) more sensitive to individual data
points
ex: you cannot use both male and female income as independent variables because they are highly correlated with each other
-so you would choose only one, either male OR female income, as an independent variable
Dealing with multicollinearity: interaction variable
1.) interaction variable
• for example, we might find that 2 independent variables (x, z) are highly correlated
with the dependent variable (y) as well as with each other
• we may not be able to logically choose which of x or z to keep and which to
drop
• instead, we could create a new interaction variable, which would be the
product of x and z, or xz
• in this case, the information contained within x and z is combined,
reflecting the fact that the two independent variables interact with each
other
combine two similar variables in order to keep all of the information by using the product of the two
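A sketch of spotting the problem and building the interaction variable; the data are illustrative and corr() is a hand-rolled Pearson correlation:

```python
def corr(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [1.1, 2.0, 3.1, 3.9, 5.1]          # nearly redundant with x
print(round(corr(x, z), 3))            # close to 1: multicollinearity

# keep the combined information in a single interaction variable xz = x*z
xz = [a * b for a, b in zip(x, z)]
print(xz)
```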
ex of multiple regression model
HPrice = 21.6(LotSize) + 9333.4(Bedrooms) − 1933
• from this model, we know that the house price will increase $21.60 for every 1 ft²
the lot size increases, provided the number of bedrooms stays the same
• likewise, the house price increases $9333.40 for every bedroom included, provided
the lot size stays the same
• it is also important to note here that the house price is a combination of lot size and
the number of bedrooms, and treating each individually would not be appropriate
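A quick sketch evaluating the fitted model from the notes for a hypothetical 5000 ft², 3-bedroom house (the house itself is made up):

```python
def h_price(lot_size_ft2, bedrooms):
    """HPrice = 21.6*LotSize + 9333.4*Bedrooms - 1933 (model from the notes)."""
    return 21.6 * lot_size_ft2 + 9333.4 * bedrooms - 1933

base = h_price(5000, 3)
print(round(base, 2))                        # 134067.2
# one more ft2 of lot, bedrooms held constant: +$21.60
print(round(h_price(5001, 3) - base, 2))     # 21.6
# one more bedroom, lot size held constant: +$9333.40
print(round(h_price(5000, 4) - base, 2))     # 9333.4
```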
misspecification
treating the independent variables as separate, or not including important
independent variables at all, leads to misspecification of the model
• note the differences in the a and b coefficients between these simple regression
models and the previous multiple regression models – multiple regression considers
the combined effects of the independent variables and is not just the "sum of the
parts"
• furthermore, if house price was significantly dependent on a third variable, say the
number of bathrooms, our original multiple regression model would be misspecified
and consequently incorrect
Ways to develop a multiple regression model: Kitchen sink
• usually, multiple regression analysis begins with a large data set, and a single
dependent variable is selected from the variables
1.)Kitchen Sink Method
- this method of developing a regression model simply takes all of the
independent variables and puts them into the analysis
-in this method, as we add independent variables, the coefficient of
determination will continue to increase, suggesting that we are improving
our model
• however, the adjusted r² will only increase if the added variables add
significant power to the model
-gives a table of coefficients
-anything with a p-value of less than 0.05 is significant
-every time you add a variable, r² has to go up
-r² does not tell you how good a model is, just reflects how many variables you have
-an r² value of 0.876 makes you think we have done a good job, because 87.6% of the variation in housing price appears explained, when in reality variables which are not really important are pushing this score up
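The adjusted r² penalty can be sketched with its standard formula, 1 − (1 − r²)(n − 1)/(n − k − 1); the r² values below are made up for illustration:

```python
def adjusted_r2(r2, n, k):
    """n = sample size, k = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30
with_3 = adjusted_r2(0.870, n, 3)   # 3 useful predictors
with_4 = adjusted_r2(0.871, n, 4)   # a 4th, nearly useless predictor added
print(round(with_3, 3), round(with_4, 4))   # 0.855 0.8504
# r2 crept up (0.870 -> 0.871), but adjusted r2 went down,
# flagging that the extra variable adds no real power
```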
-another table given in the multiple
regression analysis describes the coefficients
for each independent variable
and their t and p values; a coefficient is significant when its p-value is below 0.05
What to do with missing or 0 values? (3)
we could simply delete the entire census tract, with the justification that it is not
representative of the rest of the census tracts
• we could fill the missing values with the mean of the variable for the entire data
set
• we could perform a regression analysis to predict the value of the missing value(s)
• in most cases, missing values are treated by deletion – you need to have compelling
evidence that allows you to invent the value of a data point
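A sketch of the mean-fill option from the list above (deletion remains the usual default); the data are illustrative and None marks a missing value:

```python
def mean_fill(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in values]

data = [4.0, None, 6.0, 8.0, None, 2.0]
print(mean_fill(data))   # gaps filled with mean of 4, 6, 8, 2 -> 5.0
```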
tolerance
• we must also consider the role of multicollinearity amongst the
variables
• this can be determined by examining the tolerance values given in
the regression output
• tolerance is the amount of variability in an independent variable
that is not explained by the other independent variables
• as a rule of thumb, a high tolerance (> 0.2) is good because it
shows that the variable is exerting an influence on the
dependent variable that no other independent variable is
• when tolerance is low (< 0.2), that variable is exerting less
unique influence on the dependent variable, and much of its
influence could be the same as that imparted by a different
independent variable
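Tolerance is 1 − r² from regressing one independent variable on the others; with a single other predictor, that r² is just the squared Pearson correlation. A sketch with illustrative, nearly redundant data:

```python
def corr(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.2, 1.9, 3.1, 4.0, 5.2]          # nearly redundant with x1
tolerance = 1 - corr(x1, x2) ** 2       # 1 - r2 of the auxiliary regression
print(round(tolerance, 3))              # well under 0.2: multicollinearity
```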
• because of the issues related to the adjusted r² value, missing data,
insignificant relationships, and multicollinearity, it is advisable to
avoid the kitchen sink method