Lecture 7 - Intro to Regression Flashcards
What is regression?
- statistically test the relationship between phenomena (variables of interest) and 1 or more explanatory variables
- measures strength and nature of the relationship
- identifies if other key factors are missing (how well the model performs and whether something is missing)
- used (with caution) to infer causal relationships between independent and dependent variables
Regression analysis steps (6)
1 - create list of possible x (independent) vars that may help estimate y (dependent variable)
2 - collect data on y and x vars
3 - check relationships between each x and y using scatterplots and correlations; use results to eliminate vars not strongly related to y
4 - look at possible relationships between x vars to ensure they aren’t redundant (multicollinearity)
5 - use x vars from step 4 in multiple regression analysis to find best-fitting model for data
6 - use best-fitting model to predict y for given x-values by plugging x-values into the model
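Steps 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up, the names x1, x2, x3 are hypothetical, and the two thresholds (0.5 and 0.9) are arbitrary choices, not values from the lecture.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Made-up illustrative data (not from the lecture).
y = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
candidates = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],    # tracks y closely
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],   # nearly 2 * x1, so redundant with x1
    "x3": [5.0, 1.0, 4.0, 2.0, 5.0, 3.0],    # little relation to y
}

# Step 3: keep x vars whose correlation with y is strong (threshold is an assumption).
MIN_R_WITH_Y = 0.5
kept = [name for name, x in candidates.items()
        if abs(pearson_r(x, y)) >= MIN_R_WITH_Y]

# Step 4: flag pairs of kept x vars that are highly correlated with each other.
MAX_R_BETWEEN_X = 0.9
redundant_pairs = [
    (a, b)
    for i, a in enumerate(kept)
    for b in kept[i + 1:]
    if abs(pearson_r(candidates[a], candidates[b])) > MAX_R_BETWEEN_X
]

print("kept after step 3:", kept)
print("redundant pairs found in step 4:", redundant_pairs)
```

Here x3 is dropped in step 3, and x1 and x2 are flagged as a redundant pair in step 4, so only one of them would go forward to step 5.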
GeoDa example (what do we check?)
how crime is affected by income and house values
r2 = 0.55
what does this mean? the variables explain only 55% of the variation in crime
adjusted r2 = 0.53
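adjusted r2 corrects for the number of variables used: plain r2 never decreases when a variable is added, so adjusted r2 comes out lower than r2 when extra variables overlap or add little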
variables - get the constant and the coefficients for income and house value
look at probability (p-value)
inc 0.0000183
hoval 0.0108745
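income is more statistically significant than hoval (smaller p-value); the size of each effect comes from its coefficient, not its p-value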
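The gap between r2 and adjusted r2 can be reproduced with the standard adjustment formula. A minimal sketch: r2 = 0.55 and the two explanatory variables come from the example above, but the sample size n = 49 is an assumption, since the notes do not state it.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# r2 and k come from the example; n = 49 is an ASSUMED sample size.
adj = adjusted_r2(r2=0.55, n=49, k=2)
print(round(adj, 2))  # prints 0.53

# With the same r2, more explanatory variables give a lower adjusted value.
adj_more_vars = adjusted_r2(r2=0.55, n=49, k=5)
print(adj_more_vars < adj)  # prints True
```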
What can we do with regression?
- the ‘why’
examples:
- why is crime higher in particular areas?
- can we define and model characteristics of places with high cancer rates to understand the cause and help reduce them?
- what are factors contributing to higher than expected liquefaction?
- look at allergies and consider triggering factors
- look at the cost of flights based on distance
Ordinary Least Squares (OLS)
- best known RA technique
- analyse linear relationships amongst two or more variables (looking for nature of the relationship)
- simple regression = 1 predictor
- multiple regression = more than 1 predictor
- Global Model = single equation represents entire dataset
Regression vs. Correlation
Correlation: interdependence / co-relationship of variables
- look at two variables to see if they co-vary (one high, other high; one high, other low): do they move in the same direction or not?
- no assumption variables are related
- doesn’t model anything
- doesn’t look at cause and effect
- only index describing linear relationship (random, co-vary, positive/negative)
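As a rough illustration of correlation as a single index, the sketch below computes Pearson's r for made-up data. The variable names are hypothetical. Swapping the two variables gives the same value, which reflects that correlation does not treat either variable as dependent.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Made-up data: one series rises with `base`, the other falls.
base = [1.0, 2.0, 3.0, 4.0, 5.0]
rises = [2.0, 4.0, 6.0, 8.0, 10.0]
falls = [10.0, 8.0, 6.0, 4.0, 2.0]

r_pos = pearson_r(base, rises)      # same direction: close to +1
r_neg = pearson_r(base, falls)      # opposite direction: close to -1
r_swapped = pearson_r(rises, base)  # order of the variables does not matter

print(r_pos, r_neg, r_swapped)
```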
Regression equation
- the dependent variable is the process we try to predict/understand
- independent variables / explanatory variables are used to model/predict the dependent variable
- each independent variable gets a coefficient computed by the regression tool which represents the strength and type of relationship it has to the dependent variable
- lastly, a residual error term is included
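Written out, a regression equation with k explanatory variables has the general form below. The symbols are conventional notation assumed here, not taken from the lecture: y is the dependent variable, each x_i is an explanatory variable with coefficient \beta_i, \beta_0 is the constant (intercept), and \varepsilon is the residual error.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon
```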
Regression scatterplot
- look at a plot comparing two variables
- draw a best-fit line on the plot in which the sum of squared (vertical) distances between points and the line is minimized (ordinary least squares)
- the slope of the line is the amount that the dependent variable will change for each unit of change in the independent variable
Residual error
- distance between the actual point and the best fit line (difference between actual and predicted value)
- sum the squared errors to get the overall residual error (raw errors above and below the line cancel out); the larger it is, the worse the model is
- gives a performance measurement for the model
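The best-fit line and its residual error can be computed by hand for a single predictor. A minimal sketch with made-up data; `fit_line` is a hypothetical helper name. It also shows why the errors are squared before summing: the raw residuals of a least-squares line add up to zero.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

intercept, slope = fit_line(x, y)

# Residual = actual value minus the value predicted by the line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Overall residual error: sum of squared residuals.
sse = sum(r ** 2 for r in residuals)

print("slope:", round(slope, 2))
print("sum of raw residuals:", round(sum(residuals), 6))
print("sum of squared residuals:", round(sse, 3))
```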
Scatterplot slope
- the slope of the line is the amount that the dependent variable will change for each unit of change in the independent variable
- positive slope = direct relationship
- negative slope = inverse relationship
- steeper line = larger change in the dependent variable per unit change in the independent variable
- flat line means small increase for large change (maybe no relationship)
- slope gives the coefficient for that variable
Regression Statistics - P value and r2
- p-value (probability): result of a statistical test on each coefficient
- low p-value (conventionally below 0.05) = coefficient is statistically significant, so the variable is important to the model
- r2 (coefficient of determination) = statistic derived from the regression equation to quantify the model’s performance
- the closer r2 is to 1, the more of the variation in the dependent variable the model explains
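r2 can be computed directly from observed and predicted values as 1 minus the ratio of unexplained to total variation. A small sketch with made-up numbers; `r_squared` is a hypothetical helper name.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (unexplained variation / total variation)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - sse / sst

# Made-up illustrative values.
observed = [2.0, 4.1, 5.9, 8.2, 9.8]
good_fit = [2.06, 4.03, 6.0, 7.97, 9.94]   # predictions close to the observations
poor_fit = [6.0] * 5                       # predicts the mean everywhere

r2_good = r_squared(observed, good_fit)
r2_poor = r_squared(observed, poor_fit)

print(round(r2_good, 3), round(r2_poor, 3))
```

A model that only ever predicts the mean explains none of the variation, so its r2 is 0.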
Regression Statistics - Residuals
- residuals: unexplained portion of the dependent variable
- large residuals: poor model fit
- plot the residuals on a map (observed minus predicted) and look for over-predictions and under-predictions; if they cluster = spatial autocorrelation (PROBLEM)
- regression is a statistical modelling technique that assumes residuals are random; if we see clustering we have to use an alternative method!!
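Clustering of residuals can also be checked numerically. The sketch below uses Moran's I, a common index of spatial autocorrelation that the notes do not mention, so treat it as one possible approach rather than the lecture's method. The residual values are made up, and the six areas are assumed to sit in a row with each area neighbouring the ones beside it.

```python
def morans_i(values, neighbours):
    """Moran's I with binary weights; `neighbours` maps index -> neighbour indices."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    # Sum of z_i * z_j over every neighbouring pair (each direction counted once).
    cross = sum(z[i] * z[j] for i, js in neighbours.items() for j in js)
    s0 = sum(len(js) for js in neighbours.values())  # total weight
    return (n / s0) * cross / sum(zi ** 2 for zi in z)

# Assumed layout: six areas in a row, each neighbouring the areas beside it.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

# Made-up residuals.
clustered = [2.0, 2.0, 2.0, -2.0, -2.0, -2.0]    # similar values sit together
alternating = [2.0, -2.0, 2.0, -2.0, 2.0, -2.0]  # neighbours always differ

i_clustered = morans_i(clustered, chain)
i_alternating = morans_i(alternating, chain)

print(i_clustered, i_alternating)
```

A clearly positive value indicates that similar residuals sit next to each other (clustering); a negative value indicates that neighbours tend to differ.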
Multiple linear regression
- with 2 predictors the fit can be shown as a plane in a 3D plot (can’t plot more than 3 dimensions)
- ‘broken windows theory’ - defacement of property invites other crime
- will there be a positive relationship between vandalism and burglary?
- is there a relationship between drug use and burglary?
- is a person at greater risk for burglary if they live in a rich or poor neighborhood?
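A multiple regression with two predictors can be fitted by solving the normal equations. This is a bare-bones sketch, not how a statistics package does it internally. The data are synthetic, built so that burglary = 1 + 2 * vandalism + 3 * drug_use exactly; the variable names echo the questions above but the numbers are invented.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """Fit y = b0 + b1*x1 + b2*x2 + ... via the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 gives the constant term
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data: (vandalism, drug_use) per area, with burglary built from them.
predictors = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
burglary = [1.0 + 2.0 * v + 3.0 * d for v, d in predictors]

coeffs = ols(predictors, burglary)  # [constant, vandalism coeff, drug_use coeff]

# Predict for a new area by plugging its x-values into the fitted model.
predicted = coeffs[0] + coeffs[1] * 7.0 + coeffs[2] * 7.0

print([round(c, 6) for c in coeffs], round(predicted, 6))
```

Both coefficients come out positive here only because the synthetic data were built that way; with real data the sign and size of each coefficient are what answer the questions above.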
Assumptions of regression
- assumes linear relationship (can’t be curvilinear)
- no outliers
- no non-stationarity
- no multicollinearity
- assumes residuals are normally distributed
- no spatial autocorrelation
Assumptions: Linear relationship
- regression assumes the relationship is linear; if it is non-linear the model will perform poorly
- solution: use a non-linear regression model
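The effect of a curved relationship can be seen by fitting the same data two ways. In this sketch the made-up data follow y = x squared + 1. A straight line in x leaves a large residual error, while regressing on the transformed variable x squared fits exactly. Using a squared term is just one assumed way of handling curvature, not necessarily the model the lecture has in mind.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sum_squared_error(x, y):
    """Fit a line of y on x and return the sum of squared residuals."""
    intercept, slope = fit_line(x, y)
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up curved data: y = x^2 + 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [xi ** 2 + 1.0 for xi in x]

sse_linear = sum_squared_error(x, y)                  # straight line in x
x_squared = [xi ** 2 for xi in x]
sse_transformed = sum_squared_error(x_squared, y)     # line in x^2 captures the curve

print(round(sse_linear, 2), round(sse_transformed, 6))
```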