Lecture 7 - Intro to Regression Flashcards

1
Q

What is regression?

A
  • statistically tests the relationship between a phenomenon of interest (the dependent variable) and one or more explanatory variables
  • measures the strength and nature of the relationship
  • indicates whether key factors are missing (i.e., how well the model performs and whether something is left out)
  • used (with caution) to infer causal relationships between independent and dependent variables
2
Q

Regression analysis steps (6)

A

1 - create a list of possible x (independent) variables that may help estimate y (the dependent variable)
2 - collect data on y and the x variables
3 - check the relationship between each x and y using scatterplots and correlations; use the results to eliminate variables not strongly related to y
4 - look at possible relationships between the x variables to ensure they aren’t redundant (multicollinearity)
5 - use the x variables kept after step 4 in a multiple regression analysis to find the best-fitting model for the data
6 - use the best-fitting model to predict y for given x-values by plugging the x-values into the model (a minimal Python sketch of these steps follows below)

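A minimal Python sketch of this workflow, assuming a pandas DataFrame with a dependent variable y and hypothetical candidate predictors x1, x2, x3 (the file and column names are assumptions):

  import pandas as pd
  import statsmodels.api as sm

  df = pd.read_csv("data.csv")                    # step 2: data on y and candidate x variables
  print(df[["y", "x1", "x2", "x3"]].corr()["y"])  # step 3: how strongly each x relates to y
  print(df[["x1", "x2", "x3"]].corr())            # step 4: redundancy among the x variables

  # step 5: multiple regression with the retained x variables (suppose x3 was dropped)
  X = sm.add_constant(df[["x1", "x2"]])
  model = sm.OLS(df["y"], X).fit()
  print(model.summary())

  # step 6: predict y for new x-values (column order matches the fitted model: const, x1, x2)
  new_X = pd.DataFrame({"const": [1.0], "x1": [10.0], "x2": [5.0]})
  print(model.predict(new_X))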
3
Q

GeoDa example (what do we check?)

A

example: how crime is affected by income and house value

r² = 0.55
what does this mean? the explanatory variables explain only 55% of the variation in crime

adjusted r² = 0.53
adjusted r² corrects for the number of variables used; the more variables we add (and the more they overlap), the lower the adjusted r² relative to r²

variables - get the constant and the coefficients for income and house value

look at the probability (p-value) of each coefficient:
INC 0.0000183
HOVAL 0.0108745
the lower p-value means income is a more significant predictor of crime than house value (see the sketch below)

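A hedged sketch of the same kind of check outside GeoDa, assuming the classic Columbus-style sample data with columns CRIME, INC (income) and HOVAL (house value); the file and column names are assumptions:

  import geopandas as gpd
  import statsmodels.api as sm

  gdf = gpd.read_file("columbus.shp")          # assumed sample dataset
  X = sm.add_constant(gdf[["INC", "HOVAL"]])   # constant + two explanatory variables
  res = sm.OLS(gdf["CRIME"], X).fit()

  print(res.rsquared, res.rsquared_adj)        # r² and adjusted r²
  print(res.params)                            # constant and coefficients
  print(res.pvalues)                           # p-values for INC and HOVAL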
4
Q

What can we do with regression?

A

-the ‘why’

example:

  • why is crime higher in particular areas, and what are the reasons?
  • can we define and model characteristics of places with high cancer rates to understand the cause and help reduce them?
  • what factors contribute to higher-than-expected liquefaction?
  • look at allergies and consider triggering factors
  • look at the cost of flights based on distance
5
Q

Ordinary Least Squares (OLS)

A
  • best-known regression analysis (RA) technique
  • analyses linear relationships among two or more variables (looking for the nature of the relationship)
  • simple regression = 1 predictor
  • multiple regression = more than 1 predictor
  • Global Model = single equation represents entire dataset
6
Q

Regression vs. Correlation

A

Correlation: interdependence / co-relationship of variables

  • look at two variables to see if they co-vary (one high when the other is high, or one high when the other is low) - do they move in the same direction or not?
  • makes no assumption that one variable depends on the other
  • doesn’t model anything
  • doesn’t look at cause and effect
  • only an index describing the linear relationship (random, co-varying, positive/negative)
7
Q

Regression equation

A
  • the dependent variable is the process we try to predict/understand
  • independent variables / explanatory variables are used to model/predict the dependent variable
  • each independent variable gets a coefficient computed by the regression tool which represents the strength and type of relationship it has to the dependent variable
  • the last term is the residual error - the part of the dependent variable the model cannot explain (see the general form below)
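The standard form of the multiple regression equation is:

  y = β0 + β1·x1 + β2·x2 + … + βn·xn + ε

where y is the dependent variable, x1…xn are the independent (explanatory) variables, β0 is the constant (intercept), β1…βn are the coefficients, and ε is the residual error.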
8
Q

Regression scatterplot

A
  • look at a plot comparing two variables
  • draw a best-fit line on the plot in which the sum of squared (vertical) distances between points and the line is minimized (ordinary least squares)
  • the slope of the line is the amount that the dependent variable will change for each unit of change in the independent variable (a fitting sketch follows below)
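A minimal sketch of fitting such a line in Python, with made-up x and y values:

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

  slope, intercept = np.polyfit(x, y, 1)   # degree-1 least-squares fit: the best-fit line
  print(slope, intercept)                  # slope = change in y per unit change in x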
9
Q

Residual error

A
  • distance between the actual point and the best-fit line (difference between actual and predicted value)
  • sum the squared errors to get the overall residual error (residual sum of squares); the larger it is, the worse the model is
  • gives a performance measurement for the model (computed in the sketch below)
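Continuing with the made-up data above (repeated here so the sketch is self-contained):

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
  slope, intercept = np.polyfit(x, y, 1)

  residuals = y - (slope * x + intercept)   # actual minus predicted value at each point
  rss = np.sum(residuals ** 2)              # residual sum of squares: larger = worse fit
  print(rss)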
10
Q

Scatterplot slope

A
  • the slope of the line is the amount that the dependent variable will change for each unit of change in the independent variable
  • positive slope = direct relationship
  • negative slope = inverse relationship
  • a steeper slope means a larger change in the dependent variable per unit change in the independent variable
  • a nearly flat line means only a small change in the dependent variable even for a large change in the independent variable (possibly no relationship)
  • slope gives the coefficient for that variable
11
Q

Regression Statistics - p-value and r²

A
  • p-value (probability) is the result of a statistical test on each coefficient
  • low p-value = the coefficient is important to the model
  • r² (coefficient of determination) = a statistic derived from the regression equation to quantify the model’s performance
  • the closer r² is to 1, the more of the variation in the dependent variable the model explains
12
Q

Regression Statistics - Residuals

A
  • residuals: the unexplained portion of the dependent variable
  • large residuals: poor model fit
  • plot the residuals on a map (predicted minus observed) and look for over-predictions and under-predictions; if they cluster, there is spatial autocorrelation (PROBLEM) - a mapping sketch follows below
  • regression is a statistical modelling technique that assumes the errors are random; if we see clustering we have to use an alternative method
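A hedged sketch of mapping residuals with geopandas, reusing the assumed Columbus-style data from the GeoDa example (file and column names are assumptions):

  import geopandas as gpd
  import statsmodels.api as sm
  import matplotlib.pyplot as plt

  gdf = gpd.read_file("columbus.shp")                 # assumed polygon dataset
  X = sm.add_constant(gdf[["INC", "HOVAL"]])
  res = sm.OLS(gdf["CRIME"], X).fit()

  gdf["resid"] = res.resid                            # residuals (observed minus predicted)
  gdf.plot(column="resid", cmap="RdBu", legend=True)  # look for clusters of over-/under-prediction
  plt.show()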
13
Q

Multiple linear regression

A
  • with two predictors the fit can be shown as a 3D plot (a plane of best fit); more than three dimensions can’t be visualised
  • ‘broken window theory’ - defacement of property invites other crime
  • will there be a positive relationship between vandalism and burglary?
  • is there a relationship between drug use and burglary?
  • is a person at greater risk for burglary if they live in a rich or poor neighborhood?
14
Q

Assumptions of regression

A
  1. assumes linear relationship (can’t be curvilinear)
  2. no outliers
  3. no non-stationarity
  4. no multicollinearity
  5. assumes normal distribution
  6. no spatial autocorrelation
15
Q

Assumptions: Linear relationship

A
  • regression assumes the relationship is linear; if it is non-linear the model will perform poorly
  • solution: use a non-linear regression model
16
Q

Assumptions: no outliers

A
  • influential outliers can pull modelled regression relationships away from the best fit and bias regression coefficients
  • solution: create a scatter plot to examine extreme values and correct or remove outliers. Run regression with and without outliers to see their effects
17
Q

Assumptions: no non-stationarity

A
  • does the model explain the phenomenon equally well for all areas? if the relationship is inconsistent across space, the computed standard errors will be artificially inflated
  • solution: geographically weighted regression (GWR) may be more appropriate
18
Q

Assumptions: no multicollinearity

A
  • if two variables are highly correlated (redundant), this leads to over-counting bias and an unreliable model; the regression can’t distinguish the correlated variables’ contributions, so it can’t tell what coefficients to give them
  • solution: remove/modify variables (a variance-inflation-factor check is sketched below)
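A minimal sketch of checking multicollinearity with variance inflation factors (VIF) in statsmodels; the file and column names are hypothetical:

  import pandas as pd
  import statsmodels.api as sm
  from statsmodels.stats.outliers_influence import variance_inflation_factor

  df = pd.read_csv("data.csv")                 # hypothetical data
  X = sm.add_constant(df[["x1", "x2", "x3"]])  # constant + candidate predictors

  # VIF per predictor (skip the constant); values above roughly 5-10 suggest redundancy
  for i, name in enumerate(X.columns):
      if name != "const":
          print(name, variance_inflation_factor(X.values, i))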
19
Q

Assumptions: normal distribution bias

A
  • the distribution should be normal, like a bell curve; if we see a skewed curve there is something that can affect regression performance
  • solution: model may be non-linear, use different model
20
Q

Assumptions: no spatially autocorrelated residuals

A
  • plot the residuals on a map to check for clustering
  • solution: run the spatial autocorrelation tool (e.g. Moran’s I) on the residuals; if there is significant clustering there could be a missing variable, or use GWR (see the sketch below)
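A hedged sketch of that check with PySAL (libpysal + esda), again assuming the Columbus-style data; file and column names are assumptions:

  import geopandas as gpd
  import statsmodels.api as sm
  from libpysal.weights import Queen
  from esda.moran import Moran

  gdf = gpd.read_file("columbus.shp")        # assumed polygon dataset
  X = sm.add_constant(gdf[["INC", "HOVAL"]])
  res = sm.OLS(gdf["CRIME"], X).fit()

  w = Queen.from_dataframe(gdf)              # queen-contiguity spatial weights
  mi = Moran(res.resid.values, w)            # Moran's I of the regression residuals
  print(mi.I, mi.p_sim)                      # significant clustering -> missing variable, or switch to GWR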
21
Q

Issues with OLS regression

A
  • assumes observations and errors are random/independent
  • spatial data often violates these assumptions, so it is not well suited to OLS
  • standard regression is unlikely to consider spatial effects