Descriptive Statistics(2) Flashcards
variable
variable: a population characteristic
which takes on different values for
the elements comprising the
population
define:
population
sample
parameter
statistic
population: the total set of elements (objects, persons, regions,
neighbourhoods, rivers, etc.) under examination in a particular study
sample: a subset of the elements in a population, which is used to make
inferences about certain characteristics of the population as a whole
parameter: a quantity that defines a certain characteristic of a population
- If you take the average of a whole POPULATION, that average is a PARAMETER
statistic: a quantity that defines a certain characteristic of a sample
two ways to present descriptive statistics - describe both
tables are great for summarizing large quantities of complex data, but
can be a challenge to read and interpret
-frequency tables
graphs can be used for simpler datasets, and are easily interpreted by
the reader
Difference between bar graph and histogram:
-a bar graph is good for categorical (nominal/ordinal) data
-a bar graph has gaps between the bars because the categories are distinct
-a histogram has continuous data on the horizontal axis, for interval and ratio data
-the boxplot is not used too often because it isn’t nice to look at and doesn’t convey much
natural break points
natural break points: values where the frequency is 0 or low
Frequency Tables-5 important things to note when making one
- use intervals with simple bounds
- respect natural breakpoints
- the intervals must not overlap and must include all observations
- all intervals should be the same width
- select an appropriate number of classes
• this is hardest to determine
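The five rules above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table`, its parameters, and the data are hypothetical and do not come from the notes.

```python
def frequency_table(values, lower, width, n_classes):
    """Count observations in equal-width, non-overlapping classes.

    Class k covers [lower + k*width, lower + (k+1)*width). The last class
    also includes its upper bound so that every observation is counted.
    """
    upper = lower + width * n_classes
    counts = [0] * n_classes
    for v in values:
        if not lower <= v <= upper:
            raise ValueError(f"{v} falls outside the classes")
        k = min(int((v - lower) // width), n_classes - 1)
        counts[k] += 1
    return [(lower + k * width, lower + (k + 1) * width, counts[k])
            for k in range(n_classes)]


# Hypothetical data, chosen only for illustration.
data = [4, 6, 9, 12, 18, 24]
table = frequency_table(data, lower=0, width=5, n_classes=5)
for low, high, count in table:
    print(f"{low:>2} to {high:<2}: {count}")
```

Choosing `lower`, `width` and `n_classes` by hand mirrors the advice in the card: simple bounds, equal widths, and a number of classes picked by judgement.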
a histogram must have _____ breaks
equal
Line Graphs Vs Scatterplots
-a line graph is used for categorical data
-you CANNOT use the line to guess values, only to see the pattern
-a scatterplot is used for interval/ratio (continuous) data
The rose diagram is used for
o directional data, which has its own specific visual description: the rose
diagram
-making the wedge intervals increase toward the outside makes more sense, because the wedges are bigger toward the outside
CENTRAL TENDENCY:
define each one below
midrange
mode
median
mean
o midrange: the midpoint between the largest and smallest values of a variable
in the data set
-the midrange is strongly affected by extreme values
o mode: the value of the most common/frequent value of a variable in the data
set
-what is the mode of a data set with no repeating values? No Mode
-the midrange and mode are crude statistics, and often do not provide an accurate
measure of centrality
o median: the value of a variable that divides the observations in half
o mean: the average value of a variable in the data set
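A minimal sketch of the four measures, using Python's standard library. The helper names `midrange` and `modes` and the data are hypothetical. Returning an empty list when no value repeats matches the "No Mode" answer above.

```python
from collections import Counter
from statistics import mean, median


def midrange(values):
    """Midpoint between the smallest and largest value."""
    return (min(values) + max(values)) / 2


def modes(values):
    """Every value tied for most frequent, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical data, chosen only for illustration.
data = [4, 6, 6, 9, 12, 24]
print("midrange:", midrange(data))
print("mode(s): ", modes(data))
print("median:  ", median(data))
print("mean:    ", mean(data))
```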
Arithmetic Mean:
population mean symbol
sample mean symbol
population mean: μ (mu; looks like a u with an extended vertical line at the front)
sample mean: x̄ (x-bar; an x with a horizontal line over top)
Geometric Mean
notice that the
values are not evenly influenced (weighted differently), so we need a geometric mean
Arithmetic Vs. Geometric Mean
o the arithmetic mean is used when each data point has the same influence or
“weight” as all the other points
o the geometric mean is used when each data point has an associated frequency,
influence, or weight attached to it, such that some data points are more important
than others (this frequency-weighted average is more commonly called the weighted mean)
in geometric mean the f(i) stands for the …
weight of the value
Ranges
ex: $0-10000
we need to start making assumptions – first, assume that the midpoint
of each range is a suitable option (is it always?)
-then we can determine the geometric mean from the midpoints and their frequencies
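The midpoint assumption can be sketched as follows. The notes do not show the formula, so this assumes the mean meant here is the frequency-weighted average, sum(f(i) * midpoint(i)) / sum(f(i)). The function name `grouped_mean` and the income classes are hypothetical.

```python
def grouped_mean(classes):
    """Frequency-weighted mean of class midpoints.

    classes: list of (lower, upper, frequency) tuples.
    Assumption: each class is represented by its midpoint.
    """
    total_frequency = sum(f for _, _, f in classes)
    weighted_sum = sum(((low + high) / 2) * f for low, high, f in classes)
    return weighted_sum / total_frequency


# Hypothetical income classes: (lower bound, upper bound, frequency).
income_classes = [(0, 10000, 5), (10000, 20000, 3), (20000, 30000, 2)]
print(grouped_mean(income_classes))
```

If the values inside a class bunch near one end, the midpoint is a poor stand-in, which is the caveat the card raises with "is it always?".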
Which measure of central tendency is best?
o the centrality statistic should represent the typical value of the data set
o only the mean considers all of the values in the data set; the other statistics
only rely on specific values
o if you change any value in the set, the mean will also change
o usually, the mean is considered the best because of this property, but there
are some exceptions
Times when the mean is not reflective of the typical value:
1.bimodal distributions
-the mean and median do not reflect
the typical value, but the modes do
2.presence of extreme values that will strongly affect the mean
-median and mode are best
3.skewed distributions
-the mode is most typical here,
while the mean is affected by the
stretch in the data set
How to evaluate the dispersion around the mean, median and mode? (3 ways)
1.Range
the range is the difference between the largest and smallest data point
x_max - x_min = 24 - 4 = 20
-the range only considers the 2 extreme values of the data set, which often
are not representative of the whole data set
-also, larger samples tend to have larger ranges, since they are more likely
to contain the rare or unusual members of a population
2.Percentiles & Quartiles
- recall that the median splits the data set in half and is known as the 50th
percentile (or 2nd quartile) – half of the data is above the median and half is
below
-to find the 25th percentile: (n + 1) × P = (6 + 1) × 0.25 = 1.75
-the 25th percentile is the 1.75th value in the ordered data set, which is 5.5
-25% of the data set is below 5.5, 75% of the data set is above 5.5
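A sketch of the (n + 1) × P rule. The notes do not list the six values behind the example, so the data below are hypothetical, picked only to agree with the numbers quoted above (minimum 4, maximum 24, 25th percentile 5.5). Interpolating linearly between neighbouring values is also an assumption.

```python
def percentile(values, p):
    """Value at 1-based position (n + 1) * p in the sorted data.

    Fractional positions are resolved by linear interpolation
    between the two neighbouring values (an assumption).
    """
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    lower_rank = int(position)          # 1-based rank of the lower neighbour
    fraction = position - lower_rank
    low = ordered[lower_rank - 1]
    high = ordered[lower_rank]
    return low + fraction * (high - low)


# Hypothetical data consistent with the worked numbers in the notes.
data = [4, 6, 9, 12, 18, 24]
data_range = max(data) - min(data)
q1 = percentile(data, 0.25)
q3 = percentile(data, 0.75)
print("range:", data_range)
print("25th percentile:", q1)
print("interquartile range:", q3 - q1)
```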
3.Variance
-the variance can be described as the average of the squared differences of each
value from the mean (the sum of squared differences divided by N for a population, or by n - 1 for a sample)
-the standard deviation is the square root of the variance
-strongly affected by extreme values
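A short sketch of the calculation, with the population and sample divisors side by side. The function name `variance` and the data are hypothetical.

```python
from math import sqrt


def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1
    (sample) or n (population)."""
    n = len(values)
    m = sum(values) / n
    sum_of_squares = sum((x - m) ** 2 for x in values)
    return sum_of_squares / (n - 1 if sample else n)


# Hypothetical data, chosen so the population result is a round number.
data = [2, 4, 4, 4, 5, 5, 7, 9]
population_variance = variance(data, sample=False)
print("population variance:", population_variance)
print("population standard deviation:", sqrt(population_variance))
print("sample variance:", variance(data))
```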
Interquartile range
difference between the 1st and 3rd quartiles
_________ is a
better measure of dispersion as it omits those extreme
values
interquartile range
Coefficient of Variation
if you want to compare the standard deviations of 2 data sets, you must
ensure that they have the same mean value (not always applicable)
-when that isn't possible, the coefficient of variation (CV = standard deviation / mean, often written as a %) is used to compare
-a data set with a low CV is less variable than
one with a high CV
a data set with a low CV is _____ variable than
one with a high CV
less
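As a small illustration, the two hypothetical data sets below have the same standard deviation but very different means, so only the CV shows that one is relatively more variable. Using the sample standard deviation here is an assumption.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100


# Hypothetical data: same spread, different means.
small_values = [10, 12, 14]
large_values = [100, 102, 104]
print(round(coefficient_of_variation(small_values), 2))
print(round(coefficient_of_variation(large_values), 2))
```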
Skewness
Positive
none
negative
the skewness of a data set describes how symmetrical the values are around
the mean, or the difference between the mean and median
-positive skewness: the mean is greater than the median; the data set is asymmetrical
-no skewness: the mean is equal to the median; the data set is symmetrical
-negative skewness: the mean is less than the median; the data set is asymmetrical
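The mean-versus-median comparison can be sketched directly. The moment-based coefficient in `moment_skewness` is an addition for illustration and is not given in the notes. All names and data are hypothetical.

```python
from statistics import mean, median


def skew_direction(values):
    """Classify skewness by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "none"


def moment_skewness(values):
    """Third central moment divided by the 1.5 power of the second
    (an assumed formula; the notes only compare mean and median)."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data sets.
right_tail = [1, 2, 2, 3, 10]
symmetric = [1, 2, 3, 4, 5]
left_tail = [-10, -3, -2, -2, -1]
for name, values in [("right_tail", right_tail),
                     ("symmetric", symmetric),
                     ("left_tail", left_tail)]:
    print(name, skew_direction(values), round(moment_skewness(values), 3))
```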
kurtosis
the kurtosis of a data set describes how peaked the data set is
platykurtic: the data set is relatively flat and spread out; kurtosis < 3
mesokurtic: the data set is relatively normally distributed; kurtosis = 3
leptokurtic: the data set is narrow and peaked; kurtosis > 3
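A sketch of one way to compute a kurtosis value on the scale used above, where a normal distribution scores 3. The notes give no formula, so this assumes the moment definition: fourth central moment divided by the squared second central moment. Names and data are hypothetical.

```python
from math import isclose


def kurtosis(values):
    """Fourth central moment over the squared second central moment
    (assumed definition; a normal distribution gives 3)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m4 / m2 ** 2


def classify_kurtosis(k):
    """Map a kurtosis value to the three labels used in the notes."""
    if isclose(k, 3.0):
        return "mesokurtic"
    return "platykurtic" if k < 3 else "leptokurtic"


# Hypothetical data sets.
flat = [1, 2, 3, 4, 5]
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]
print(kurtosis(flat), classify_kurtosis(kurtosis(flat)))
print(kurtosis(peaked), classify_kurtosis(kurtosis(peaked)))
```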
Standardization
o it can be difficult to compare multiple data sets that each have different
means and standard deviations – to do this we have to standardize the data
o standardization translates the data set so that it has a mean of 0 and a
standard deviation of 1 – this allows you to compare multiple standardized
data sets easily
- the standardized values are known as z-scores – each z-score describes how
many standard deviations the value is from the mean
z scores
the standardized values are known as z-scores – each z-score describes how
many standard deviations the value is from the mean
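A minimal sketch of standardization: subtract the mean, then divide by the standard deviation. Using the population standard deviation (`pstdev`) is an assumption, since the notes do not say which one is meant. The data are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    return [(x - m) / s for x in values]


# Hypothetical data with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(data)
print(standardized)
print("mean of z-scores:", mean(standardized))
print("standard deviation of z-scores:", pstdev(standardized))
```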
Types of Data used commonly by geographers
1.Areal Data
data are frequently published for discrete areal units such as provinces,
countries, census tracts, watersheds, and other bounded units
o the location quotient is a frequently used statistic in economic geography and
locational analysis
2.Point Data
o distance is either explicitly or implicitly
included within these measures
o if we collect a datum from a point on Earth’s
surface, that datum can be georeferenced –
subsequent data can then be spatially related to
each other
Location quotient
LQ_i = (A_i / Σ A_i) / (B_i / Σ B_i) = (5/100) / (150/1000) = 0.05 / 0.15 = 0.333
- if LQ > 1, this indicates a relative concentration of the activity in area i,
compared to the region as a whole
- if LQ = 1, the area has a share of the activity in accordance with its share
of the base
- if LQ < 1, the area has less of a share of the activity than is more generally,
or regionally, found
o location quotients can be easily mapped
-mostly used in economic geog
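The worked example can be reproduced with a short function. The numbers 5, 100, 150 and 1000 come from the example in the notes. The function and parameter names are hypothetical, and reading A as the activity and B as the base follows the interpretation bullets.

```python
def location_quotient(activity_in_area, activity_total, base_in_area, base_total):
    """Area's share of the activity divided by its share of the base."""
    activity_share = activity_in_area / activity_total
    base_share = base_in_area / base_total
    return activity_share / base_share


def interpret(lq, tolerance=1e-9):
    """Rough reading of the three cases listed in the notes."""
    if abs(lq - 1) <= tolerance:
        return "share matches the base"
    return "relative concentration" if lq > 1 else "under-represented"


lq = location_quotient(5, 100, 150, 1000)
print(round(lq, 3), interpret(lq))
```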
Lorenz Curve
-the Lorenz curve is a graphical technique for describing the distribution of a
variable among spatial units
-the more the Lorenz curve deviates from a 1:1 line, the more
concentrated an activity is in one unit compared to the other in
the region
Gini coefficient
is a commonly used descriptive statistic used in conjunction
with the Lorenz curve, and is defined as the maximum deviation between the
Lorenz curve and the 1:1 line
-the range of the Gini coefficient is from 0 to 100 %
-another way of thinking about this is that the activity
becomes less similar throughout the units – for this reason
the Gini coefficient is often called an index of dissimilarity
-0 is on the 1:1 line, 100 is complete inequality
- the lower the %, the more evenly distributed the sector is
- the larger the %, the more unevenly distributed it is
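Following the definition used in these notes (the maximum deviation between the Lorenz curve and the 1:1 line), a rough sketch might look like this. How the curve is built is not spelled out in the notes, so ordering the units by their activity-to-base ratio is an assumption, as are the function name and the data.

```python
def max_lorenz_deviation(activity, base):
    """Largest vertical gap, in percentage points, between the Lorenz
    curve and the 1:1 line.

    Assumptions: units are ordered by activity/base ratio (highest
    first) and every unit has a base value greater than zero.
    """
    activity_total = sum(activity)
    base_total = sum(base)
    units = sorted(zip(activity, base),
                   key=lambda pair: pair[0] / pair[1], reverse=True)
    cumulative_activity = 0.0
    cumulative_base = 0.0
    largest_gap = 0.0
    for a, b in units:
        cumulative_activity += 100 * a / activity_total
        cumulative_base += 100 * b / base_total
        largest_gap = max(largest_gap,
                          abs(cumulative_activity - cumulative_base))
    return largest_gap


# Hypothetical three-unit region.
uneven = max_lorenz_deviation(activity=[50, 30, 20], base=[20, 30, 50])
even = max_lorenz_deviation(activity=[10, 20, 30], base=[20, 40, 60])
print(uneven, even)
```

When every unit's share of the activity matches its share of the base, the curve sits on the 1:1 line and the result is 0.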
mean centre and Manhattan median
mean centre: does not consider the built environment
o the mean centre provides a location which minimizes the sum of squared distances
to each point; we can also determine the dispersion of the points by
calculating the standard distance
Manhattan Median:
-for places on a grid system like Manhattan
-the value may not identify a point but an AREA
the Manhattan median is a unique point only when there is an odd number of
observations; when there is an even number of observations it is an area
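Both centres can be sketched with the standard library. Reporting the Manhattan median as a range of x values and a range of y values is one way to show the point-versus-area behaviour described above. The function names and the points are hypothetical.

```python
from statistics import mean, median_high, median_low


def mean_centre(points):
    """Mean of the x coordinates and mean of the y coordinates."""
    xs, ys = zip(*points)
    return mean(xs), mean(ys)


def manhattan_median(points):
    """Median x range and median y range as ((x_low, x_high), (y_low, y_high)).

    With an odd number of points each range collapses to one value (a point).
    With an even number the ranges can span a rectangle (an area).
    """
    xs, ys = zip(*points)
    return ((median_low(xs), median_high(xs)),
            (median_low(ys), median_high(ys)))


# Hypothetical point sets.
odd_points = [(1, 1), (2, 5), (4, 3)]
even_points = [(1, 1), (2, 5), (4, 3), (6, 2)]
print(mean_centre(odd_points))
print(manhattan_median(odd_points))
print(manhattan_median(even_points))
```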
standard distance vs. relative distance
the mean centre provides a location which minimizes the sum of squared distances
to each point; we can also determine the dispersion of the points by
calculating the standard distance
o the relative distance provides a more intuitive measure, and it allows us to
compare different spatial data sets that do not share the same mean centre or
standard distance
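A sketch of both measures. The notes give neither formula, so two assumptions are made: the standard distance is the root mean squared distance of the points from the mean centre (dividing by n), and the relative distance divides that by the radius of a circle with the same area as the study region. Names and data are hypothetical.

```python
from math import pi, sqrt


def standard_distance(points):
    """Root mean squared distance from the mean centre (assumed formula)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)


def relative_distance(points, study_area):
    """Standard distance divided by the radius of a circle whose area
    equals the study area (assumed definition)."""
    radius = sqrt(study_area / pi)
    return standard_distance(points) / radius


# Hypothetical points at the corners of a 2 x 2 square.
corners = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(standard_distance(corners))
print(relative_distance(corners, study_area=4))
```

Because the relative distance is unitless, it can be compared between study regions of different sizes.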
dispersion and clustering
o dispersion v. clustering
o one of the main ways we can describe spatial data is to define whether the
points/locations are clustered or dispersed around a central point
o but clustering/dispersion may also be related to the size of the sample area
standard deviational ellipse
5 steps
the standard deviational ellipse summarizes dispersion in a point pattern
as an ellipse rather than a circle (2 dimensions vs 1 dimension)
- translate the data so that the origin is at the mean centre
- calculate the angle of rotation; this determines the direction of maximum
dispersion and shows the primary and secondary trends of the data
- calculate the standard deviation parallel to the new y-axis
- calculate the standard deviation parallel to the new x-axis
- fit an ellipse with dimensions of sx and sy
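The five steps can be sketched as below. The notes do not give the rotation formula, so this uses the principal-axis angle, theta = 0.5 * atan2(2 * s_xy, s_xx - s_yy), measured counterclockwise from the original x-axis. That is a stand-in and may differ in convention from the formula used in the course. The function name and the points are hypothetical.

```python
from math import atan2, cos, degrees, sin, sqrt


def standard_deviational_ellipse(points):
    """Return (centre, rotation angle in radians, s_major, s_minor)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Step 1: move the origin to the mean centre.
    shifted = [(x - cx, y - cy) for x, y in points]
    s_xx = sum(x * x for x, _ in shifted) / n
    s_yy = sum(y * y for _, y in shifted) / n
    s_xy = sum(x * y for x, y in shifted) / n
    # Step 2: angle of the direction of maximum dispersion (assumed formula).
    theta = 0.5 * atan2(2 * s_xy, s_xx - s_yy)
    # Steps 3 and 4: standard deviations along the rotated axes.
    along = [x * cos(theta) + y * sin(theta) for x, y in shifted]
    across = [-x * sin(theta) + y * cos(theta) for x, y in shifted]
    s_major = sqrt(sum(u * u for u in along) / n)
    s_minor = sqrt(sum(v * v for v in across) / n)
    # Step 5: the ellipse is centred on (cx, cy), rotated by theta,
    # with semi-axes s_major and s_minor.
    return (cx, cy), theta, s_major, s_minor


# Hypothetical points lying on the line y = x.
diagonal = [(0, 0), (1, 1), (2, 2), (3, 3)]
centre, theta, s_major, s_minor = standard_deviational_ellipse(diagonal)
print(centre, degrees(theta), s_major, s_minor)
```

For points on a straight diagonal line the ellipse collapses to a line: all dispersion lies along the 45 degree axis and almost none across it.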