Lecture 7 - Intro to Regression Flashcards

Question 1

Q

What is regression?

Answer

A

statistically tests the relationship between a phenomenon (the variable of interest) and 1 or more explanatory variables
measures the strength and nature of the relationship
identifies if other key factors are missing (how well the model performs and whether something is missing)
used (with caution) to infer causal relationships between independent and dependent variables

Question 2

Q

Regression analysis steps (6)

Answer

A

1 - create list of possible x (independent) vars that may help estimate y (dependent variable)
2 - collect data on y and x vars
3 - check relationships between each x and y using scatterplots and correlations, use results to eliminate vars not strongly related to y
4 - look at possible relationships between x vars, ensuring they aren’t redundant (multicollinearity)
5 - use x vars from step 4 in multiple regression analysis to find best-fitting model for data
6 - use best-fitting model to predict y for given x-values by plugging x-values into the model
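
Steps 3 and 4 can be sketched in plain Python. This is only a minimal illustration, not the procedure from the lecture: the data values, the variable names and both cut-offs (0.5 and 0.9) are made-up assumptions.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def screen_predictors(y, candidates, min_r_with_y=0.5, max_r_between_x=0.9):
    """Keep x vars related to y (step 3), then drop redundant ones (step 4).

    Both thresholds are illustrative assumptions, not fixed rules.
    """
    # Step 3: strength of the relationship between each x and y.
    strength = {name: abs(pearson(x, y)) for name, x in candidates.items()}
    related = [name for name in candidates if strength[name] >= min_r_with_y]
    related.sort(key=lambda name: -strength[name])  # strongest first

    # Step 4: skip any x that is nearly a copy of an x already kept.
    kept = []
    for name in related:
        if all(abs(pearson(candidates[name], candidates[k])) < max_r_between_x
               for k in kept):
            kept.append(name)
    return kept

# Hypothetical data for six areas.
crime = [10, 8, 7, 5, 3, 2]
candidates = {
    "income": [1, 2, 3, 4, 5, 6],
    "income_copy": [2, 4, 6, 8, 10, 13],   # almost the same as income
    "house_value": [9, 5, 8, 4, 6, 2],
    "noise": [3, 1, 4, 1, 5, 2],           # barely related to crime
}
kept = screen_predictors(crime, candidates)
print(kept)
```

Here the near-duplicate of income and the weakly related variable are dropped, leaving the x vars that would go forward to step 5.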

Question 3

Q

GeoDa example (what do we check?)

Answer

A

how crime is affected by income and house values

r2 = 0.55
what does this mean? the variables explain only 55% of the variation in crime

adjusted r2 = 0.53
adjusted r2 applies a penalty for each extra variable (added variables tend to overlap in what they explain), so it is lower than r2

variables - get the constant and the coefficients for income and house value

look at probability (p-value)
inc 0.0000183
hoval 0.0108745
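both p-values are below 0.05, so both coefficients are statistically significant; the evidence is stronger for income (smaller p-value) than for hoval. The size of each effect is read from the coefficients, not from the p-values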
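
The gap between r2 and adjusted r2 can be reproduced with the usual adjustment formula. A minimal sketch: the number of observations (49) is an assumption, because the notes do not state the sample size; the 2 predictors are income and house value.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted r2: r2 penalised for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# n_obs=49 is an assumed sample size, not a value given in the notes.
adj = adjusted_r2(0.55, n_obs=49, n_predictors=2)
print(round(adj, 2))
```

With the same r2, adding predictors or using fewer observations widens the gap between the two values.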

Question 4

Q

What can we do with regression?

Answer

A

- answer the ‘why’

examples:

why is crime higher in particular areas?
can we define and model characteristics of places with high cancer rates to understand the cause and help reduce them?
what are factors contributing to higher than expected liquefaction?
look at allergies and consider triggering factors
look at the cost of flights based on distance

Question 5

Q

Ordinary Least Squares (OLS)

Answer

A

best-known regression analysis (RA) technique
analyses linear relationships among two or more variables (looking for the nature of the relationship)
simple regression = 1 predictor
multiple regression = more than 1 predictor
Global Model = single equation represents entire dataset

Question 6

Q

Regression vs. Correlation

Answer

A

Correlation: interdependence / co-relationship of variables

look at two variables to see if they co-vary (one high, other high; one high, other low): do they move in the same direction or not?
makes no assumption that the variables are related
doesn’t model anything
doesn’t look at cause and effect
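gives only an index describing the linear relationship (random, co-varying, positive/negative)

Regression: models how a dependent variable changes with one or more independent variables

produces an equation with a coefficient for each independent variable, which can be used to predict the dependent variable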

Question 7

Q

Regression equation

Answer

A

the dependent variable is the process we try to predict/understand
independent variables / explanatory variables are used to model/predict the dependent variable
each independent variable gets a coefficient computed by the regression tool which represents the strength and type of relationship it has to the dependent variable
lastly, a residual error term is included
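
Written out in the usual notation, the pieces above combine into one equation. The symbols are assumptions, since the notes do not give any: y is the dependent variable, x_1 to x_n are the explanatory variables, \beta_0 is the constant (intercept), \beta_1 to \beta_n are the coefficients and \varepsilon is the residual error.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
```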

Question 8

Q

Regression scatterplot

Answer

A

look at a plot comparing two variables
draw a best-fit line on the plot in which the sum of squared (vertical) distances between points and the line is minimized (ordinary least squares)
the slope of the line is the amount that the dependent variable will change for each unit of change in the independent variable
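
For a single predictor, the best-fit line has a closed-form solution. The sketch below uses made-up flight data (distance and cost are hypothetical values, chosen only to illustrate the calculation).

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept

# Hypothetical data: flight distance (km) and ticket cost.
distance_km = [100, 200, 300, 400, 500]
cost = [60, 95, 140, 170, 215]

slope, intercept = fit_line(distance_km, cost)
print(slope, intercept)
```

The slope here reads as the change in cost for each extra kilometre, and the intercept as the modelled cost at zero distance.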

Question 9

Q

Residual error

Answer

A

distance between the actual point and the best fit line (difference between actual and predicted value)
sum the squared errors to get the overall residual error (squaring stops positive and negative errors cancelling out); the larger it is, the worse the model is
gives a performance measurement for the model
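
A short sketch of the calculation, assuming a hypothetical line y = 2x + 1 and made-up points. It shows why the residuals are usually squared before being totalled: the plain sum can cancel to zero even when every point misses the line.

```python
def residuals(x, y, slope, intercept):
    """Actual minus predicted value for each point."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# Hypothetical points and a hypothetical line y = 2x + 1.
x = [1, 2, 3, 4]
y = [3.5, 4.5, 7.5, 8.5]

errors = residuals(x, y, slope=2, intercept=1)
plain_sum = sum(errors)                       # positives and negatives cancel
sum_of_squares = sum(e ** 2 for e in errors)  # overall residual error
print(errors, plain_sum, sum_of_squares)
```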

Question 10

Q

Scatterplot slope

Answer

A

the slope of the line is the amount that the dependent variable will change for each unit of change in the independent variable
positive slope = direct relationship
negative slope = inverse relationship
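steeper slope = larger change in the dependent variable per unit change in the independent variable
flat line = little or no change in the dependent variable even for a large change in the independent variable (maybe no relationship)
slope gives the coefficient for that variable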

Question 11

Q

Regression Statistics - P value and r2

Answer

A

p-value (probability) = result of a statistical test on each coefficient
low p-value = coefficient is statistically significant (important to the model)
r2 (coefficient of determination) = statistic derived from the regression equation to quantify the model’s performance
the closer r2 is to 1, the more of the variation in the dependent variable the model explains
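
r2 can be computed from observed and predicted values as 1 minus the ratio of unexplained variation to total variation. A minimal sketch with made-up numbers; the predicted values are assumed to come from some already fitted model, and the p-value calculation is not covered here.

```python
def r_squared(observed, predicted):
    """Share of the variation in the observed values explained by the model."""
    mean_obs = sum(observed) / len(observed)
    # Unexplained variation: squared gaps between observed and predicted.
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total variation: squared gaps between observed values and their mean.
    tss = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - rss / tss

# Hypothetical observed values and model predictions.
observed = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(observed, predicted)
print(round(r2, 2))
```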

Question 12

Q

Regression Statistics - Residuals

Answer

A

residuals: unexplained portion of the dependent variable
large residuals: poor model fit
plot the residuals on a map (observed minus predicted) and look for over-predictions and under-predictions to see if there is clustering = spatial autocorrelation (PROBLEM)
regression is a statistical modelling technique that assumes the residuals are random; if we see clustering we have to use an alternative method!!

Question 13

Q

Multiple linear regression

Answer

A

3D plot (a plane through the points) when there are 2 predictors; cannot be drawn once there are more than 3 variables in total
‘broken windows theory’ - defacement of property invites other crime
will there be a positive relationship between vandalism and burglary?
is there a relationship between drug use and burglary?
is a person at greater risk for burglary if they live in a rich or poor neighborhood?
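
With more than one predictor, the coefficients can be found by solving the normal equations. This is a minimal plain-Python sketch, not how any particular package does it. The burglary, vandalism and drug-use numbers are hypothetical and built so that burglary = 5 + 2 × vandalism + 3 × drug use exactly.

```python
def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        known = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - known) / m[i][i]
    return beta

def fit_multiple(rows, y):
    """OLS with several predictors; returns [constant, coef_1, coef_2, ...]."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the constant
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: (vandalism, drug_use) per neighbourhood and burglary counts.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
burglary = [13, 12, 23, 22, 30]

beta = fit_multiple(predictors, burglary)
print([round(b, 3) for b in beta])
```

A positive coefficient for vandalism would be read as a positive relationship between vandalism and burglary, with drug use held constant.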

Question 14

Q

Assumptions of regression

Answer

A

assumes linear relationship (can’t be curvilinear)
no outliers
no non-stationarity
no multicollinearity
assumes normal distribution
no spatial autocorrelation

Question 15

Q

Assumptions: Linear relationship

Answer

A

regression assumes the relationship is linear; if it is non-linear, the model will perform poorly
solution: use a non-linear regression model

Question 16

Q

Assumptions: no outliers

Answer

A

influential outliers can pull modelled regression relationships away from the best fit and bias regression coefficients
solution: create a scatter plot to examine extreme values and correct or remove outliers. Run regression with and without outliers to see their effects
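
Running the regression with and without a suspect point can be sketched as follows. The data are made up: five points lie exactly on y = 2x and the sixth is an extreme value.

```python
def slope_of_best_fit(x, y):
    """Slope of the ordinary least squares line for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data; the last point is the outlier.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 30]

slope_with = slope_of_best_fit(x, y)
slope_without = slope_of_best_fit(x[:-1], y[:-1])
print(slope_with, slope_without)
```

A single extreme value more than doubles the slope here, which is the kind of pull on the coefficients the card describes.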

Question 17

Q

Assumptions: no non-stationarity

Answer

A

does the model explain the phenomenon equally well for all areas? if the relationship is inconsistent across the study area, the computed standard errors will be artificially inflated
solution: Geographically Weighted Regression (GWR) may be more appropriate

Question 18

Q

Assumptions: no multicollinearity

Answer

A

if two variables are highly correlated (redundant), this leads to over-counting bias and an unreliable model: the computer can’t distinguish their contributions, so it won’t know what coefficients to give the correlated variables
solution: remove/modify variables
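
One common way to put a number on redundancy is the variance inflation factor (VIF). The sketch below covers only the two-predictor case, where VIF = 1 / (1 - r²) and r is the correlation between the two predictors. The data are hypothetical, and the idea that a VIF somewhere above roughly 5 to 10 is a warning sign is a rule of thumb, not a value from the notes.

```python
def vif_two_predictors(a, b):
    """VIF for a model with exactly two predictors.

    Note: perfectly correlated predictors give r squared = 1 and a
    division by zero, which mirrors the model being impossible to fit.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    r_squared = cov ** 2 / (var_a * var_b)  # squared correlation of a and b
    return 1 / (1 - r_squared)

# Hypothetical predictors for six areas.
income = [1, 2, 3, 4, 5, 6]
income_copy = [2, 4, 6, 8, 10, 13]  # almost a rescaled copy of income
house_value = [9, 5, 8, 4, 6, 2]

vif_redundant = vif_two_predictors(income, income_copy)
vif_ok = vif_two_predictors(income, house_value)
print(round(vif_redundant, 1), round(vif_ok, 2))
```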

Question 19

Q

Assumptions: normal distribution bias

Answer

A

the residuals should be normally distributed (a bell curve); a skewed curve indicates something that can affect regression performance
solution: the relationship may be non-linear; try a different model
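
A quick numeric check for a skewed curve is the skewness statistic, which is close to 0 for a symmetric distribution. A minimal sketch, assuming the values being checked are the model residuals; both residual lists are made up.

```python
def skewness(values):
    """Population skewness: third central moment over (second moment) ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical residuals.
symmetric_residuals = [-2, -1, 0, 1, 2]
skewed_residuals = [-1, -1, -1, -1, 4]  # long tail on the positive side

skew_symmetric = skewness(symmetric_residuals)
skew_skewed = skewness(skewed_residuals)
print(skew_symmetric, skew_skewed)
```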

Question 20

Q

Assumptions: no spatially autocorrelated residuals

Answer

A

plot residuals on a map to check for clustering
solution: run the spatial autocorrelation tool on the residuals; if there is significant clustering, a variable could be missing, or GWR may be needed
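
Spatial autocorrelation is commonly measured with Moran's I. The sketch below is a bare-bones version, assuming four areas in a row where each area neighbours the next one; the residuals and the neighbour lists are hypothetical, and the significance test a real tool would report is left out.

```python
def morans_i(values, neighbours):
    """Moran's I with binary weights (1 for neighbours, 0 otherwise).

    neighbours maps each area index to the list of its neighbours' indices.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Cross-products of deviations for every neighbouring pair.
    cross = sum(dev[i] * dev[j] for i, js in neighbours.items() for j in js)
    total_weight = sum(len(js) for js in neighbours.values())
    return (n / total_weight) * cross / sum(d * d for d in dev)

# Four hypothetical areas in a row: 0 - 1 - 2 - 3.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = [2, 1, -1, -2]      # under-predictions together, over-predictions together
alternating = [2, -2, 2, -2]    # neighbours always have opposite signs

i_clustered = morans_i(clustered, neighbours)
i_alternating = morans_i(alternating, neighbours)
print(i_clustered, i_alternating)
```

Positive values point to clustering of similar residuals, values near zero to a random pattern, and negative values to neighbouring residuals that tend to differ.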

Question 21

Q

Issues with OLS regression

Answer

A

assumes observations are random and independent
spatial data often breaks this assumption, so it is not well suited to OLS
standard regression is unlikely to capture spatial effects