Descriptive Statistics(2) Flashcards by James Hamel

variable

variable: a population characteristic that takes on different values for the elements comprising the population

define:

population

sample

parameter

statistic

population: the total set of elements (objects, persons, regions,
neighbourhoods, rivers, etc.) under examination in a particular study

sample: a subset of the elements in a population, which is used to make
inferences about certain characteristics of the population as a whole

parameter: a quantity that defines a certain characteristic of a population
- e.g. the average of an entire POPULATION is a PARAMETER; the average of a SAMPLE is a STATISTIC

statistic: a quantity that defines a certain characteristic of a sample

Two ways to present descriptive statistics (describe both)

tables are great for summarizing large quantities of complex data, but
can be a challenge to read and interpret
-frequency tables

graphs can be used for simpler datasets, and are easily interpreted by
the reader
Difference between bar graph and histogram:
-a bar graph is good for categorical (nominal/ordinal) data
-a bar graph has gaps between the bars because the categories are distinct
-a histogram has continuous data on the horizontal axis, so it suits interval and ratio data
-the boxplot is not used too often because it isn’t nice to look at and doesn’t convey much

natural break points

natural break points: values where the frequency is zero or low, which make sensible boundaries between classes

Frequency Tables: 5 important things to note when making one

use intervals with simple bounds
respect natural breakpoints
the intervals must not overlap and must include all observations
all intervals should be the same width
select an appropriate number of classes
• this is hardest to determine
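
A minimal Python sketch of these rules, assuming made-up data values and a hypothetical helper named `frequency_table` (neither comes from the cards). It uses simple bounds and equal-width classes that do not overlap, and it raises an error if an observation would be left out of the table.

```python
def frequency_table(values, lower, width, n_classes):
    """Count observations in n_classes equal-width, non-overlapping intervals.

    Each interval is [start, start + width); the last interval also includes
    its upper bound so that every observation is counted exactly once.
    """
    upper = lower + width * n_classes
    counts = [0] * n_classes
    for v in values:
        if not lower <= v <= upper:
            raise ValueError(f"{v} falls outside the table bounds")
        # Index of the class this value belongs to (top bound goes in last class).
        idx = min(int((v - lower) // width), n_classes - 1)
        counts[idx] += 1
    return [((lower + i * width, lower + (i + 1) * width), c)
            for i, c in enumerate(counts)]


# Hypothetical data set, chosen only for illustration.
data = [4, 6, 10, 12, 18, 24]
table = frequency_table(data, lower=0, width=5, n_classes=5)
for (start, end), count in table:
    print(f"{start:>2} to {end:>2}: {count}")
```

The number of classes is left as a parameter because, as the card says, it is the hardest choice to make.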

a histogram must have _____ breaks

equal

Line Graphs Vs Scatterplots

-a line graph is used for categorical data
-you CANNOT use the line to estimate values between points, only to see the pattern
-a scatterplot is used for continuous interval/ratio data

The rose diagram is used for

directional data, which has its own specific visual descriptive: the rose diagram

-making the radial intervals increase towards the outside makes sense because the wedges are wider towards the outside

CENTRAL TENDENCY:

define each one below

midrange

mode

median

mean

o midrange: the midpoint between the largest and smallest values of a variable in the data set
-the midrange is strongly affected by extreme values

o mode: the most common/frequent value of a variable in the data set
-what is the mode of a data set with no repeating values? There is no mode
-the midrange and mode are crude statistics, and often do not provide an accurate measure of centrality

o median: the value of a variable that divides the ordered observations in half

o mean: the average value of a variable in the data set
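
The four measures above can be sketched in Python as follows. The data values and the helper names `midrange` and `modes` are assumptions made for illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from collections import Counter
from statistics import mean, median


def midrange(values):
    """Midpoint between the smallest and largest values."""
    return (min(values) + max(values)) / 2


def modes(values):
    """Most frequent value(s); an empty list when no value repeats (no mode)."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [v for v, c in counts.items() if c == top]


# Hypothetical data set with one repeated value and one extreme value.
data = [4, 6, 6, 10, 12, 24]
print("midrange:", midrange(data))
print("mode(s): ", modes(data))
print("median:  ", median(data))
print("mean:    ", mean(data))
```

In this made-up example the extreme value 24 pulls the midrange and the mean above the median, which illustrates why the midrange is described as strongly affected by extreme values.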

Arithmetic Mean:
population mean symbol

sample mean symbol

μ (the Greek letter mu, which looks like a u with an extended vertical line at the front)

x̄ (x-bar: an x with a horizontal line over top)

Geometric Mean

notice that the values do not all have the same influence (they are weighted differently) – we need a geometric mean

Arithmetic vs. Geometric Mean

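o the arithmetic mean is used when each data point has the same influence or “weight” as all the other points

o the geometric mean (as the term is used in these cards) is used when each data point has an associated frequency, influence, or weight attached to it, such that some data points are more important than others
-this weighted calculation is more commonly called the weighted mean; elsewhere, "geometric mean" usually means the nth root of the product of n values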

in geometric mean the f(i) stands for the …

weight of the value

Ranges

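ex: data reported as ranges such as $0-10000
-we need to start making assumptions: first, assume that the midpoint of each range is a suitable representative value (is it always?)
-then we can determine the geometric mean, using the midpoints as the values and the class frequencies as the weights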
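
A short sketch of the midpoint approach, assuming hypothetical ranges and frequencies. The calculation is the frequency-weighted mean, sum(f(i) × x(i)) / sum(f(i)), where x(i) is the midpoint of each range and f(i) is its weight; this is the calculation the cards refer to as the geometric mean. The function name `grouped_mean` is made up for this example.

```python
def grouped_mean(classes):
    """Weighted mean of class midpoints.

    classes: list of ((lower, upper), frequency) pairs.
    Assumes the midpoint of each range represents the values inside it.
    """
    total_weight = sum(f for _, f in classes)
    weighted_sum = sum(((low + high) / 2) * f for (low, high), f in classes)
    return weighted_sum / total_weight


# Hypothetical ranges and frequencies, for illustration only.
ranges = [
    ((0, 10_000), 3),
    ((10_000, 20_000), 5),
    ((20_000, 30_000), 2),
]
print(grouped_mean(ranges))
```

If the values inside a range are bunched near one end, the midpoint assumption will bias the result, which is the point of the "is it always?" question.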

Which measure of central tendency is best?

o the centrality statistic should represent the typical value of the data set
o only the mean considers all of the values in the data set; the other statistics
only rely on specific values
o if you change any value in the set, the mean will also change
o usually, the mean is considered the best because of this property, but there
are some exceptions

Times when the mean is not reflective of the typical value:

1. Bimodal distributions
-the mean and median do not reflect the typical value, but the modes do

2. Presence of extreme values that will strongly affect the mean
-the median and mode are best

3. Skewed distributions
-the mode is most typical here, while the mean is affected by the stretch in the data set

How to evaluate the dispersion around the mean, median and mode? (3 ways)

1.Range
the range is the difference between the largest and smallest data point
x_max − x_min = 24 − 4 = 20
-the range only considers the 2 extreme values of the data set, which often
are not representative of the whole data set
-also, larger samples tend to have larger ranges, since they are more likely
to contain the rare or unusual members of a population

2.Percentiles & Quartiles
- recall that the median splits the data set in half and is known as the 50th
percentile (or 2nd quartile) – half of the data is above the median and half is
below
-to find the 25th percentile: (n + 1) × P = (6 + 1) × 0.25 = 1.75
-the 25th percentile is the 1.75th value in the ordered data set (three-quarters of the way from the 1st value to the 2nd) = 5.5
-25% of the data set is below 5.5, 75% of the data set is above 5.5

3.Variance
-the variance can be described as the average of the squared differences of each value from the mean (the sum of squared differences divided by N for a population, or by n − 1 for a sample)
-it becomes the standard deviation when you take its square root
-strongly affected by extreme values
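
The three measures can be sketched in Python as below. The data set is hypothetical: it was chosen so that n = 6, the range is 24 − 4 = 20 and the 25th percentile is 5.5, matching the numbers on this card, but the individual values are an assumption. The variance shown uses the sample form (dividing by n − 1).

```python
import math


def data_range(values):
    """Difference between the largest and smallest data point."""
    return max(values) - min(values)


def percentile(values, p):
    """Value at position (n + 1) * p in the ordered data, interpolating linearly."""
    ordered = sorted(values)
    n = len(ordered)
    position = min(max((n + 1) * p, 1), n)  # clamp to valid 1-based positions
    lower = int(position)                    # whole part of the position
    fraction = position - lower              # how far towards the next value
    if lower == n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)


data = [4, 6, 10, 12, 18, 24]  # hypothetical values
print("range:          ", data_range(data))
print("25th percentile:", percentile(data, 0.25))
print("variance:       ", sample_variance(data))
print("std deviation:  ", math.sqrt(sample_variance(data)))
```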

Interquartile range

difference between the 1st and 3rd quartiles
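
As a small illustration, Python's standard `statistics.quantiles` function can return the quartiles directly. With `method="exclusive"` it places each quartile at position (n + 1) × P in the ordered data. The data values here are hypothetical.

```python
from statistics import quantiles

data = [4, 6, 10, 12, 18, 24]  # hypothetical values

# n=4 splits the data into quarters and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

Because the interquartile range only looks at the middle half of the data, replacing 24 with a much larger value in this example would change Q3 only slightly, while the range would grow by the full amount.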

_________ is a
better measure of dispersion as it omits those extreme
values

interquartile range

Coefficient of Variation

if you want to compare the standard deviations of 2 data sets, you must ensure that they have the same mean value (not always applicable)

-when that isn't possible, the coefficient of variation (CV = standard deviation ÷ mean, usually expressed as a percentage) is used to compare them
-a data set with a low CV is less variable than one with a high CV
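
A brief sketch of the comparison, assuming two hypothetical data sets that have the same standard deviation but different means. The function name `coefficient_of_variation` is made up for this example, and the sample standard deviation is used.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100


# Hypothetical data: set_b is set_a shifted up by 100, so the spread is identical.
set_a = [4, 6, 10, 12, 18, 24]
set_b = [104, 106, 110, 112, 118, 124]

print("std dev:", stdev(set_a), stdev(set_b))
print("CV (%): ", coefficient_of_variation(set_a), coefficient_of_variation(set_b))
```

The two standard deviations match, but relative to its mean set_a is far more variable, which is what the CV captures.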

a data set with a low CV is _____ variable than one with a high CV

less

Skewness

Positive

none

negative

the skewness of a data set describes how symmetrical the values are around
the mean, or the difference between the mean and median

positive skewness: the mean is greater than the median – the data set is asymmetrical

no skewness: the mean is equal to the median – the data set is symmetrical

negative skewness: the mean is less than the median – the data set is asymmetrical

kurtosis

the kurtosis of a data set describes how peaked the data set is

platykurtic: the data set is relatively flat and spread out (kurtosis < 3)

mesokurtic: the data set is relatively normally distributed (kurtosis = 3)

leptokurtic: the data set is narrow and peaked (kurtosis > 3)
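
One way to put numbers on skewness and kurtosis is with moment-based formulas, sketched below. This is an assumption about the definitions (the cards do not give formulas): skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where m_k is the k-th central moment. With this form a normal distribution has kurtosis 3, which matches the thresholds above. The data sets are hypothetical.

```python
from statistics import mean, median


def central_moment(values, k):
    """Average of (x - mean)^k over the data set."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]          # hypothetical, evenly spread
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical, one large value

print("symmetric:   ", skewness(symmetric), kurtosis(symmetric))
print("right-skewed:", skewness(right_skewed), kurtosis(right_skewed))
print("mean vs median:", mean(right_skewed), median(right_skewed))
```

In the right-skewed example the mean is greater than the median and the skewness is positive, as the skewness card describes; the evenly spread example has a kurtosis below 3, so it would be classed as platykurtic.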

Standardization

o it can be difficult to compare multiple data sets that each have different
means and standard deviations – to do this we have to standardize the data
o standardization translates the data set so that it has a mean of 0 and a
standard deviation of 1 – this allows you to compare multiple standardized
data sets easily

the standardized values are known as z-scores – each z-score describes how
many standard deviations the value is from the mean

z scores

the standardized values are known as z-scores – each z-score describes how many standard deviations the value is from the mean
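
A minimal sketch of standardization, using z = (x − mean) / standard deviation. The data values are hypothetical, and the population standard deviation (`pstdev`) is an assumption; substitute `stdev` if the data are treated as a sample.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    m = mean(values)
    s = pstdev(values)
    return [(x - m) / s for x in values]


data = [4, 6, 10, 12, 18, 24]  # hypothetical values
z = z_scores(data)

print([round(v, 2) for v in z])
print("mean of z:", round(mean(z), 10), "std dev of z:", round(pstdev(z), 10))
```

Values below the mean get negative z-scores and values above it get positive ones, so z-scores from different data sets can be compared on the same scale.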

Types of data commonly used by geographers

1. Areal Data
-data are frequently published for discrete areal units such as provinces, countries, census tracts, watersheds, and other bounded units
-the location quotient is a frequently used statistic in economic geography and locational analysis

2. Point Data
-distance is either explicitly or implicitly included within these measures
-if we collect a datum from a point on Earth’s surface, that datum can be georeferenced – subsequent data can then be spatially related to each other

Location quotient

```
LQ_i = (A_i / Σ A_i) / (B_i / Σ B_i) = (5 / 100) / (150 / 1000) = 0.05 / 0.15 = 0.333
```
where A_i is the level of the activity in area i and B_i is the level of the base in area i

1. if LQ > 1, this indicates a relative concentration of the activity in area i, compared to the region as a whole
2. if LQ = 1, the area has a share of the activity in accordance with its share of the base
3. if LQ < 1, the area has less of a share of the activity than is more generally, or regionally, found

-location quotients can be easily mapped
-mostly used in economic geography
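
The worked example on this card can be reproduced with a few lines of Python. The function and parameter names are made up for illustration; the numbers (5 of 100 for the activity, 150 of 1000 for the base) are the ones on the card.

```python
def location_quotient(activity_i, activity_total, base_i, base_total):
    """Area's share of the activity divided by its share of the base."""
    return (activity_i / activity_total) / (base_i / base_total)


def interpret(lq, tolerance=1e-9):
    """Put the three cases from the card into words."""
    if abs(lq - 1) <= tolerance:
        return "share of activity matches share of base"
    if lq > 1:
        return "relative concentration of the activity"
    return "less than the regional share of the activity"


lq = location_quotient(5, 100, 150, 1000)
print(round(lq, 3), "->", interpret(lq))
```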

Lorenz Curve

the Lorenz curve is a graphical technique for describing the distribution of a variable among spatial units
-the more the Lorenz curve deviates from a 1:1 line, the more concentrated an activity is in some units compared to the others in the region

Gini coefficient

a commonly used descriptive statistic, used in conjunction with the Lorenz curve, and defined as the maximum deviation between the Lorenz curve and the 1:1 line
-the range of the Gini coefficient is from 0 to 100%
-0 is on the 1:1 line, 100 is complete inequality
-the lower the %, the more evenly distributed the sector is
-the larger the %, the more unevenly distributed it is
-another way of thinking about this is that the activity becomes less similar throughout the units; for this reason the Gini coefficient is often called an index of dissimilarity
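
The sketch below follows the definition on this card: it builds the Lorenz curve from cumulative shares and takes the largest gap from the 1:1 line, in percent. It also computes half the sum of absolute differences in shares, which gives the same number. The data for the three areas are hypothetical, all function names are made up, and the base values are assumed to be greater than zero. Note that many other sources define the Gini coefficient from the area between the curve and the 1:1 line, which gives a different value.

```python
def lorenz_points(activity, base):
    """Cumulative (base share, activity share) points, starting at (0, 0)."""
    total_a, total_b = sum(activity), sum(base)
    shares = [(a / total_a, b / total_b) for a, b in zip(activity, base)]
    # Order units from least to most concentrated so the curve bows one way.
    shares.sort(key=lambda s: s[0] / s[1])
    points = [(0.0, 0.0)]
    cum_a = cum_b = 0.0
    for a_share, b_share in shares:
        cum_a += a_share
        cum_b += b_share
        points.append((cum_b, cum_a))
    return points


def max_deviation_percent(points):
    """Largest vertical gap between the Lorenz curve and the 1:1 line."""
    return max(abs(x - y) for x, y in points) * 100


def dissimilarity_percent(activity, base):
    """Half the sum of absolute differences between the two sets of shares."""
    total_a, total_b = sum(activity), sum(base)
    return 0.5 * sum(abs(a / total_a - b / total_b)
                     for a, b in zip(activity, base)) * 100


activity = [5, 20, 75]     # hypothetical activity levels in three areas
base = [150, 350, 500]     # hypothetical base levels in the same areas

points = lorenz_points(activity, base)
print(max_deviation_percent(points), dissimilarity_percent(activity, base))
```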

mean centre and Manhattan median

Mean centre:
-does not consider the built environment
-the mean centre provides a location which minimizes the sum of squared distances to each point – we can also determine the dispersion of the points by calculating the standard distance

Manhattan median:
-for places on a grid system like Manhattan
-the value may not identify a point but an AREA: the Manhattan median is a unique point only when there is an odd number of observations; when there is an even number of observations it is an area
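
A small sketch of both measures, assuming hypothetical point coordinates. The mean centre is the mean of the x values and the mean of the y values; the Manhattan median is taken here as the median of the x values and the median of the y values, returned as a range on each axis so that the even-numbered case shows up as an area. The function names are made up for this example.

```python
from statistics import mean, median_high, median_low


def mean_centre(points):
    """Mean of the x coordinates and mean of the y coordinates."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))


def manhattan_median(points):
    """Median range on each axis: ((x_low, x_high), (y_low, y_high)).

    With an odd number of points low == high on both axes (a single point);
    with an even number the ranges can describe a rectangle (an area).
    """
    xs, ys = zip(*points)
    return ((median_low(xs), median_high(xs)),
            (median_low(ys), median_high(ys)))


odd_points = [(1, 1), (2, 5), (6, 3)]    # hypothetical locations
even_points = odd_points + [(8, 7)]

print(mean_centre(odd_points))
print(manhattan_median(odd_points))      # a single point
print(manhattan_median(even_points))     # an area
```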

standard distance vs. relative distance

the mean centre provides a location which minimizes the sum of squared distances to each point – we can also determine the dispersion of the points by calculating the standard distance

the relative distance provides a more intuitive measure, and it allows us to compare different spatial data sets that do not share the same mean centre or standard distance
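
A hedged sketch of the two measures. The standard distance is computed as the root mean squared distance of the points from the mean centre. The relative distance formula is an assumption, since the card does not give one: here it is the standard distance divided by the radius of a circle with the same area as the study region. The coordinates and the region area are hypothetical.

```python
import math


def mean_centre(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def standard_distance(points):
    """Root mean squared distance of the points from their mean centre."""
    cx, cy = mean_centre(points)
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return math.sqrt(squared / len(points))


def relative_distance(points, region_area):
    """Standard distance scaled by the radius of a circle of equal area (assumed form)."""
    radius = math.sqrt(region_area / math.pi)
    return standard_distance(points) / radius


points = [(0, 0), (2, 0), (0, 2), (2, 2)]   # hypothetical locations
print(standard_distance(points))
print(relative_distance(points, region_area=4))
```

Because the relative distance is a ratio, it has no units, which is what allows point patterns from study areas of different sizes to be compared.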

dispersion and clustering

one of the main ways we can describe spatial data is to define whether the points/locations are clustered or dispersed around a central point
-but whether a pattern looks clustered or dispersed may also depend on the size of the sample area

standard deviational ellipse (5 steps)

the standard deviational ellipse summarizes dispersion in a point pattern as an ellipse rather than a circle – 2 dimensions vs 1 dimension
1. translate the data such that the origin is at the mean centre
2. calculate the angle of rotation – this determines the direction of maximum dispersion
-this shows you the primary and secondary trend of the data
3. calculate the standard deviation parallel to the new y-axis
4. calculate the standard deviation parallel to the new x-axis
5. fit an ellipse with dimensions of sx and sy
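
The five steps can be sketched as follows. This is one possible formulation, not necessarily the one used in the course: the rotation angle is found with the principal-axis formula θ = ½ · atan2(2Σxy, Σx² − Σy²), measured counter-clockwise from the x-axis, and the standard deviations divide by n. The point coordinates and the function name are hypothetical.

```python
import math


def standard_deviational_ellipse(points):
    n = len(points)

    # Step 1: move the origin to the mean centre.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centred = [(x - cx, y - cy) for x, y in points]

    # Step 2: angle of rotation giving the direction of maximum dispersion.
    sxx = sum(x * x for x, _ in centred)
    syy = sum(y * y for _, y in centred)
    sxy = sum(x * y for x, y in centred)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)

    # Steps 3 and 4: standard deviations along the two rotated axes.
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    along = [x * cos_t + y * sin_t for x, y in centred]
    across = [-x * sin_t + y * cos_t for x, y in centred]
    s_major = math.sqrt(sum(u * u for u in along) / n)
    s_minor = math.sqrt(sum(v * v for v in across) / n)

    # Step 5: the ellipse is described by its centre, rotation and two axes.
    return {"centre": (cx, cy), "theta": theta,
            "s_major": s_major, "s_minor": s_minor}


# Hypothetical points lying on a straight diagonal line.
points = [(1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
ellipse = standard_deviational_ellipse(points)
print(ellipse)
```

For these points all the dispersion lies along the diagonal, so the ellipse collapses to a line: the primary trend is at 45 degrees and there is no secondary spread.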

Descriptive Statistics(2) Flashcards

(33 cards)