GEOG364 Final Flashcards
runs count
a one dimensional autocorrelation measure
joins count
a two dimensional autocorrelation measure
spatial autocorrelation generally explained
the correlation of a variable to itself through space
similarity in position vs similarity in attributes
free sampling and example
the outcome is always random and not determined by previous results
example being flipping a coin
non-free sampling and example
when the outcome is affected by the previous result
example being a card being picked from a deck. each card taken affects the probability of the next card
4 factors that can dramatically influence spatial autocorrelation results
a sample size smaller than 30
one category of values occurs in less than 20% of the data
the region is elongated and has few joins
there are a couple of features with many joins and some with very few
name a limitation of joins counts
it does not work for numeric data
numbers can be reclassed as “high/low,” but this throws away much information
what are the two alternatives to joins counts for measuring spatial autocorrelation
moran’s i
geary’s c
in general, what do moran’s i and geary’s c measure?
they compare the differences between neighboring values to the differences among values across the entire study area
in moran’s i or geary’s c what does it mean if the difference between neighboring features is less than between all other features
it would mean that the neighboring features could be considered clustered
which spatial autocorrelation measure uses squared differences between adjacent cases
geary’s c
which spatial autocorrelation measure uses a covariance term
moran’s i
name two similarities between geary’s c and moran’s i FORMULAs
they both divide by total “w” to account for the number of pairs of cases
they both divide by a variance term in order to account for range of data
explain what -1, 0, and 1 would mean in a spatial autocorrelation analysis
it would mean you are using moran’s i
-1 means negative autocorrelation and the data is dispersed
0 means there is no autocorrelation and pattern is random
1 would mean positive autocorrelation and attributes are clustered
explain what 0, 1, and 2 would mean in a spatial autocorrelation analysis
it would mean you are using geary’s c
0 means positive autocorrelation and values are clustered
1 means no autocorrelation with random values and no apparent pattern
2 means negative spatial autocorrelation with dispersed value (high-low)
match the numbers of moran’s i to geary’s c
-1 = 2 = negative spatial autocorrelation
0 = 1 = no autocorrelation
1 = 0 = positive autocorrelation
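a minimal python sketch of both statistics, to show how the two scales line up. it assumes a binary adjacency matrix w and uses made-up function names (morans_i, gearys_c); the four example cells are illustrative only.

```python
def morans_i(values, w):
    """Moran's I: covariance between neighbors, scaled by total weight and variance."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_w = sum(sum(row) for row in w)
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / total_w) * cross / sum(d * d for d in dev)


def gearys_c(values, w):
    """Geary's C: squared differences between neighbors, scaled the same way."""
    n = len(values)
    mean = sum(values) / n
    total_w = sum(sum(row) for row in w)
    sq_diff = sum(w[i][j] * (values[i] - values[j]) ** 2
                  for i in range(n) for j in range(n))
    return ((n - 1) / (2 * total_w)) * sq_diff / sum((v - mean) ** 2 for v in values)


# four cells in a row; each cell is a neighbor of the cell next to it
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

smooth = [1, 2, 3, 4]        # similar values sit next to each other
alternating = [1, 4, 1, 4]   # high and low values alternate

print(morans_i(smooth, w), gearys_c(smooth, w))            # I above 0, C below 1
print(morans_i(alternating, w), gearys_c(alternating, w))  # I below 0, C above 1
```

the smooth row lands on the positive side of both scales and the alternating row on the negative side.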
what does the w represent in a spatial autocorrelation analysis?
the weight given to a measure to set adjacency
for example, what distance/time/cost would make two features neighbors?
what is the alternative method for testing significance when testing geary’s c or moran’s i?
the monte carlo simulation
what does monte carlo simulation do?
it generates a sampling distribution for a given test statistic. the observed test statistic can then be compared against this distribution to assess significance
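a rough python sketch of the idea, assuming a permutation approach: shuffle the values across locations many times, recompute moran’s i each time, and see how often the shuffled result is at least as large as the observed one. the function names, the 999 simulations, and the ten-cell example are assumptions for illustration.

```python
import random


def morans_i(values, w):
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_w = sum(sum(row) for row in w)
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / total_w) * cross / sum(d * d for d in dev)


def monte_carlo_p(values, w, n_sims=999, seed=0):
    """One-sided pseudo p-value for positive spatial autocorrelation."""
    rng = random.Random(seed)
    observed = morans_i(values, w)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)  # random rearrangement breaks any spatial pattern
        if morans_i(shuffled, w) >= observed:
            at_least_as_large += 1
    # the observed arrangement counts as one of the possible outcomes
    return (at_least_as_large + 1) / (n_sims + 1)


# ten cells in a row with steadily increasing values (strongly clustered)
n = 10
w = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
values = list(range(1, n + 1))

p = monte_carlo_p(values, w)
print(p)  # a small value suggests the observed pattern is unlikely to be random
```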
global statistics
value summarizes a characteristic for an entire study region
why is it important to use measures of autocorrelation in a region?
spatial homogeneity does not exist over global regions/entire study area
what do you call it when autocorrelation is low in one area of a region and high in another
spatial heterogeneity
LISA
local indicators of spatial autocorrelation
local versions of geary’s c and moran’s i
what does LISA measure that is different than geary’s c or moran’s i?
LISA measures levels of particular clusters, not overall clustering
what is the preferred test of choice for local clustering measures
moran’s i
name 4 objectives of a regression analysis
to determine whether a relationship exists
to describe the nature of the relationship mathematically
to assess the degree of accuracy with which the model represents the relationship
in the case of multiple regression, to understand the relative importance of individual independent variables
regression VS correlation
correlation provides us with the extent of a relationship between two variables
a regression analysis provides us with the nature of that relationship
y in regression
the dependent variable
x in regression
the independent variable
a and b in regression
the regression coefficients: a is the intercept and b is the slope
e in regression
the random error or residual that the model does not account for
what is the line and what does it show in regression
the line is a statistical model that shows the expected mean value of y for each value x
how do we create a regression line
by applying a least squares criterion
what does the least squares criterion do
it chooses the line that minimizes the sum of the squared differences between the line and the given data points
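a minimal python sketch of a least squares fit for y = a + bx, using the standard closed-form estimates for the slope and intercept. the function name fit_line and the sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept: the line passes through the two means
    return a, b


xs = [1, 2, 3, 4, 5]
ys = [5, 8, 11, 14, 17]  # these points lie exactly on y = 2 + 3x
a, b = fit_line(xs, ys)
print(a, b)
```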
what are the 4 steps to a regression analysis
- specify independent and dependent variables
- use sample data to estimate a and b in the model
- estimate model error and check assumptions
- evaluate the statistical usefulness of the model
can regression describe causality?
NO, it only helps describe the nature of the relationship
what are the 4 assumptions made in a regression analysis
- mean error is 0
- variance of the error is constant across x values
- error is normally distributed
- the errors are independent: no relationship exists between the residual/error and x, or between one residual and another
is regression an extension of correlation or is correlation an extension of regression?
regression is an extension of correlation
what does ANOVA measure
it measures the variance and overall significance of a regression model
how does the size of residuals affect a regression model
smaller residuals mean that the line is a good fit and the model is accurate
what is the range for r squared values
0-1
what does an r squared value of 0 or 1 mean
0 means the line explains none of the variation in y and is no better than using the mean of y
1 means the line fits perfectly and every observed point falls on it
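a short python sketch of r squared, computed as 1 minus the residual sum of squares over the total sum of squares. the function name r_squared and the numbers are illustrative assumptions.

```python
def r_squared(ys, predicted):
    """Share of the variation in y that the fitted values account for."""
    mean_y = sum(ys) / len(ys)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


ys = [2, 4, 6, 8]
perfect_fit = r_squared(ys, [2, 4, 6, 8])   # predictions match every point
mean_only = r_squared(ys, [5, 5, 5, 5])     # predictions are just the mean of y
print(perfect_fit, mean_only)
```

a perfect fit gives 1 and a model that only ever predicts the mean gives 0.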
what may r squared look like in a software output
ESS
what does standard error of estimates show
it estimates the standard deviation of the errors/residuals
how close are the observed values to the line?
how many values fall within 95% of the value of the fitted line
what is a regression model not good for?
estimating a value outside the range of observed value EXTRAPOLATION
what is the difference between multiple regression and simple regression
multiple regression uses multiple independent variables
name an example of a multiple regression
a linear trend surface
multicollinearity
an assumption in a multiple regression analysis
assuming that independent variables do not exhibit high correlation among each other
what is trend surface analysis an example of
how regression analysis can be applied to spatial problems
what does ANOVA stand for
analysis of variance
what is a synonym of ANOVA
statistical analysis, but ANOVA goes over the top
what does ANOVA address?
different types of variance and then relates them to overall variance
how could you apply ANOVA to following regression
predicting plant growth by fertilizer application
you could additionally ask whether different types of fertilizer have varying effects on plant growth
what is a name for two or more categorical predictor variables in ANOVA
factors
in terms of columns and rows what does ANOVA compare?
the variation of values within one column to the overall variation between different columns
name the 4 probability distributions
normal
z
t
f
what probability distribution does ANOVA use
f distribution
what is the f statistic
a measure for the ratio of the first sample variance to the second sample variance
what is a HUGE factor in determining variance values
sample size
the larger the sample size, the _____ the sample variance
smaller
how do we determine degrees of freedom
sample size minus 1
when is ANOVA very useful
when predictor variables are categorical
gender, regions, beer labels
what two groups is variance split into in an ANOVA distribution
within group variance and between group variance
INTRA AND INTER
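a minimal python sketch of a one-way anova f statistic, built as between-group variance divided by within-group variance. the function name f_statistic and the three small groups are made up for illustration.

```python
def f_statistic(groups):
    """One-way ANOVA F: mean square between groups over mean square within groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f = f_statistic(groups)
print(f)  # compare against an f distribution with (2, 6) degrees of freedom
```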
why are interaction effects important
think about the gender and beer goggles example. beer has a very different effect on different genders. but when looking at the beer goggles effect free of gender the results are very different
what is the main advantage of using anova
it allows for individual studies to be replaced by one study that compares more factors
what happens when your test statistics are not continuous, but categorical
use non parametric stats
name 2 parameters
mean and standard deviation
give three examples of non parametric data
raw counts
number of protected plants in a forest that are stable or declining
number of people receiving social security or not
number of crimes in spring VS summer vs Winter
what is another way to describe the chi square test
a goodness of fit test
what are two variables in the chi square formula
expected and observed frequencies
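a short python sketch of the chi square statistic: the sum of (observed - expected) squared, divided by expected, over every category. the function name and the counts are illustrative assumptions.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E across categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# e.g. crime counts in three seasons vs. an even split of the same total
observed = [10, 20, 30]
expected = [20, 20, 20]
statistic = chi_square(observed, expected)
print(statistic)
```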
what is the most popular non parametric test
chi square
what is the arguable 5th scale of measurement?
cyclic
compass directions, months of year
what would the avg direction be between north and south?
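a python sketch of one common way to average cyclic data: treat each direction as a unit vector, add the vectors, and take the angle of the result. the function name circular_mean and the 1e-9 cutoff are assumptions for illustration.

```python
import math


def circular_mean(degrees):
    """Mean compass bearing in degrees, or None when the directions cancel out."""
    sin_sum = sum(math.sin(math.radians(d)) for d in degrees)
    cos_sum = sum(math.cos(math.radians(d)) for d in degrees)
    resultant_length = math.hypot(sin_sum, cos_sum) / len(degrees)
    if resultant_length < 1e-9:
        return None  # the vectors cancel, so no single average direction exists
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360


print(circular_mean([80, 100]))   # close to 90 (east)
print(circular_mean([350, 10]))   # close to north, not the arithmetic mean of 180
print(circular_mean([0, 180]))    # north and south cancel out
```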
main objective of descriptive stats
organization and summary of data
what is the main difference between descriptive and inferential stats
inferential stats provide insights of a population on the basis of SAMPLES and test a hypothesis
the three measures of central tendency
mean
median
mode
the three measures of dispersion
range
iqr
variance and/or stand dev
how do you find range of data
it is the difference between the highest and lowest valued observation
what is the IQR
the difference between the first and third quartile
variance
the average of the squared differences between each value and the mean
what is stand dev
the square root of the variance
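a minimal python sketch of variance and stand dev. the sample flag (divide by n - 1 instead of n) and the function names are assumptions for illustration.

```python
import math


def variance(data, sample=True):
    """Average squared deviation from the mean (n - 1 in the divisor for a sample)."""
    n = len(data)
    mean = sum(data) / n
    sum_of_squares = sum((x - mean) ** 2 for x in data)
    return sum_of_squares / (n - 1) if sample else sum_of_squares / n


def std_dev(data, sample=True):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(data, sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data, sample=False), std_dev(data, sample=False))
print(variance(data), std_dev(data))
```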
what is first order variation
changes in observation in spatial autocorrelation are due to changes in local environment
what is second order variation
variation in spatial autocorrelation is due to relationship with other attributes - not the environment itself
ecological fallacy
assuming that a relationship seen in aggregated (group-level) data also holds for the individuals within those groups
MAUP
modifiable areal unit problem: changing the boundaries, scale, or extent of the areal units can change the results and display of the data
non uniformity of space
coastal areas may have more cases of the flu not because they are near the water, but because they also often have higher population density.
edge effects
entities may only have a neighbor on one side. think of a crime map of mexico along the US border without US data
what is the difference between euclidean and manhattan block distance
we can consider euclidean as the crow flies
manhattan must go around edges
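a quick python sketch of the two distances between points given as (x, y) pairs. the function names are made up for illustration.

```python
import math


def euclidean(p, q):
    """Straight-line distance, as the crow flies."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def manhattan(p, q):
    """Distance along a grid: east-west travel plus north-south travel."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])


a, b = (0, 0), (3, 4)
print(euclidean(a, b), manhattan(a, b))
```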
quantile classification
every class contains the same number of entities
equal interval classification
dividing your data into equal intervals; the difference between the highest and lowest value in each class is the same
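a small python sketch contrasting quantile and equal interval classification on the same values. the function names and the six values are illustrative assumptions, and the quantile split is a simple equal-count cut of the sorted values.

```python
def equal_interval_breaks(values, k):
    """Break points that split the data range into k classes of equal width."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / k
    return [lowest + width * i for i in range(1, k)]


def quantile_classes(values, k):
    """Split the sorted values into k classes holding the same number of entities."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


values = [1, 2, 3, 4, 50, 100]
breaks = equal_interval_breaks(values, 3)
classes = quantile_classes(values, 3)
print(breaks)   # equal widths, but most values fall in the lowest class
print(classes)  # equal counts, but very different values can share a class
```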
advantage and disadvantage of natural breaks
advantage is that it is good for unevenly distributed data
disadvantage is that datasets cannot be easily compared
quantile advantage and disadvantage
advantage is that relative positions (top 20%) can be shown; good for evenly distributed data
disadvantage is that the breaks are unnatural
equal interval advantage and disadvantage
good for mapping continuous data and is easy to understand
disadvantage is that if data is clustered some classes will be heavily clustered
goods and bads of stand dev classification
it is good for normally distributed data and getting an idea of how data compares to mean
disadvantage is that the actual values are not displayed and outliers strongly influence mean
what classification scheme should be used for evenly/unevenly distributed data
for evenly distributed data use equal interval, stand dev, or quantile
for uneven use natural breaks
about how many classes should be used
use between 3 and 7 classes
mean center
simply the average of the x and y coordinates
center of gravity
what is the problem with mean center
outliers affect the hell out of it
what is an example of weighted mean center
rather than simply finding the mean center of national parks, weight each park’s coordinates by the number of visitors it has per year
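a minimal python sketch of the mean center and the weighted mean center. the function name, the four points, and the visitor weights are made up for illustration.

```python
def mean_center(points, weights=None):
    """Average x and y; with weights, each point counts in proportion to its weight."""
    if weights is None:
        weights = [1] * len(points)
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return x, y


parks = [(0, 0), (2, 0), (2, 2), (0, 2)]
visitors = [1, 1, 1, 5]  # the last park gets five times the visitors

print(mean_center(parks))            # plain center of gravity
print(mean_center(parks, visitors))  # pulled toward the busiest park
```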
median center
the coordinate with the shortest total distance to all features in the study
central feature
the FEATURE with the shortest total distance to all other features
median center vs central feature
median center does not need to exist
central feature must exist
median calculates the most accessible location while central feature finds the most accessible entity
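a short python sketch of the central feature: the existing feature whose summed straight-line distance to every other feature is smallest. the function name and the points are illustrative assumptions; the median center is not shown because it usually needs an iterative search over all possible coordinates.

```python
import math


def central_feature(points):
    """Return the existing point with the smallest total distance to all points."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)


features = [(0, 0), (1, 1), (2, 0), (10, 10)]
print(central_feature(features))
```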
what are the three defining parameters of a standard deviational ellipse
the dispersion along the major axis
the dispersion along the minor axis
the angle of rotation
what is the difference between absolute and relative frequencies
relative frequencies are absolute frequencies divided by the total number of observations
all of them will add up to 1
what is the link between observed data and the normal distribution curve
the z score
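a one-function python sketch of the z score: how many stand devs an observed value sits from the mean. the function name and the numbers are made up for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations between the value and the mean."""
    return (value - mean) / std_dev


z = z_score(130, mean=100, std_dev=15)
print(z)
```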
population
total set of elements under examination in a study
sample
group of elements actually studied
census
when an entire population is studied
sampling error
when uncertainty arises from working with a sample rather than a population
sampling bias
when the sample over-represents certain population characteristics and so does not reflect the whole population
central limit theorem
if many samples of the same size are taken, the distribution of the sample means will be approximately normal
the mean should be the same as the population mean
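a small python simulation of the idea, assuming a uniform population between 0 and 1 (population mean 0.5). the sample size of 30, the 2000 samples, and the seed are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)
sample_size = 30
n_samples = 2000

# draw many samples of the same size and keep the mean of each one
sample_means = [
    statistics.mean(rng.random() for _ in range(sample_size))
    for _ in range(n_samples)
]

mean_of_means = statistics.mean(sample_means)
spread_of_means = statistics.stdev(sample_means)
print(mean_of_means)    # close to the population mean of 0.5
print(spread_of_means)  # much narrower than the population spread
```

a histogram of sample_means would look roughly bell shaped even though the population itself is flat.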
what is a type 1 error
the null hypothesis is true, but we reject it
what is a type 2 error
the null hypothesis is false, but we do not reject it
what type of error is it if the alternative hypothesis is true, but we accept the null
type 2
what type of error is it if the alternative hypothesis is false, but we reject the null
type 1
what kind of associations can there be between 2 variables?
experimental and correlational
experimental correlation
we are in charge of one of the variables
correlational correlation
we simply observe both the control and the other
what does pearson’s r measure?
the strength of a linear relationship between two variables
what is the value range for pearson’s r?
-1 to 1
what would the pearson value be if both x and y increase simultaneously
near 1
positive
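a minimal python sketch of pearson’s r: the covariance of x and y divided by the product of their spreads. the function name and the data are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Strength and direction of the linear relationship between x and y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


xs = [1, 2, 3, 4]
rising = pearson_r(xs, [2, 4, 6, 8])    # y increases as x increases
falling = pearson_r(xs, [8, 6, 4, 2])   # y decreases as x increases
print(rising, falling)
```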
what are 2 conditions that should be had if using pearson’s r
the data should not contain extreme outliers
the variance of x and the variance of y should be roughly equal - homoscedasticity
what happens to mean and variation when data is aggregated?
variation is minimized, but mean remains constant
problem with MAUP and data aggregation
if you aggregate data n/s vs e/w the aggregated results will be different
what kind of distance can have multiple shortest routes?
manhattan distance
is adjacency a binary concept?
yes
how do you calculate margin of error
plus or minus 1/(SQRT(N))
what happens to margin of error as the sample size increases
it lowers
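a quick python sketch of the rule of thumb above, showing the margin of error shrinking as n grows. the function name is made up for illustration.

```python
import math


def margin_of_error(n):
    """Rule-of-thumb margin of error: plus or minus 1 / sqrt(n)."""
    return 1 / math.sqrt(n)


for n in (100, 400, 1600):
    print(n, margin_of_error(n))  # quadrupling the sample halves the margin
```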
when will a distribution table be two tailed
when the alternative hypothesis is non-directional (it only states "not equal to")
what is the empirical rule
68% of data lies within 1 stand dev of mean, about 95% within 2, and about 99.7% within 3
define type 1 and 2 errors simply
if ho is true but we reject it: type 1
if ho is false but we do not reject it: type 2
why must we use INVERSE distance weighting
if we used raw distance as the weight, features that are farther away would have a greater effect, but we want them to have less of an effect because they are far away
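a short python sketch of an inverse distance weight, assuming the common form 1 / distance^power. the function name and the power parameter are illustrative assumptions.

```python
def inverse_distance_weight(distance, power=1):
    """Weight that shrinks as distance grows; a higher power makes it shrink faster."""
    return 1 / distance ** power


for d in (1, 2, 4):
    print(d, inverse_distance_weight(d), inverse_distance_weight(d, power=2))
```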
Does a value of .8 mean moran’s I is significant?
no, moran’s i indicates the strength of a correlation and significance must be addressed in an entirely different manner
explain clusters vs clustering
when we say clusters we are referring to a specific cluster of high values (for example counties)
if we speak of clustering we may be discussing the general amount of clusters all throughout Pennsylvania
what is the difference between pearson’s r and spearman’s correlation coefficient?
pearson’s r refers to a parametric test involving two quantitative variables
spearman’s refers to a non parametric test used for ranked (ordinal) data
when is mean not a good measure of center?
when the data is not normally distributed, for example when it is skewed left or right
give an example of a run in coins
having 8 heads in a row would be a run
give an example of a join using coins
a join would be having a head, then a tail
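a small python sketch that counts runs and joins in a string of coin flips. the function names and the flip sequence are made up for illustration; a join here is any pair of adjacent flips, sorted into same or different.

```python
def count_runs(flips):
    """Number of unbroken stretches of the same outcome."""
    if not flips:
        return 0
    return 1 + sum(1 for a, b in zip(flips, flips[1:]) if a != b)


def count_joins(flips):
    """Count adjacent pairs by whether the two outcomes match."""
    joins = {"same": 0, "different": 0}
    for a, b in zip(flips, flips[1:]):
        joins["same" if a == b else "different"] += 1
    return joins


flips = "HHHTTH"
print(count_runs(flips))   # runs: HHH, TT, H
print(count_joins(flips))  # pairs: HH, HH, HT, TT, TH
```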
what determines whether a test will be one or two tailed
the alternative hypothesis
when will you have a two tailed distribution
when the observation in the test statistic does not equal the control
how do you standardize a row
divide the weight in question by the sum of the entire row.
basically it’s getting the percentage
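a minimal python sketch of row standardization on a small weights matrix. the function name and the matrix are illustrative assumptions; rows with no neighbors are left as zeros to avoid dividing by zero.

```python
def row_standardize(w):
    """Divide each weight by its row total so every non-empty row sums to 1."""
    result = []
    for row in w:
        total = sum(row)
        result.append([value / total if total else 0.0 for value in row])
    return result


# feature 0 has two neighbors; features 1 and 2 have one neighbor each
w = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
standardized = row_standardize(w)
print(standardized)
```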
what is factorial ANOVA
this is used when you want to measure the effects two or more independent variables have on a dependent variable