Descriptive Statistics(2) Flashcards

1
Q

variable

A

variable: a population characteristic
which takes on different values for
the elements comprising the
population

2
Q

define:

population

sample

parameter

statistic

A

population: the total set of elements (objects, persons, regions,
neighbourhoods, rivers, etc.) under examination in a particular study

sample: a subset of the elements in a population, which is used to make
inferences about certain characteristics of the population as a whole

parameter: a quantity that defines a certain characteristic of a population
- for example, the average taken over an entire POPULATION is a PARAMETER

statistic: a quantity that defines a certain characteristic of a sample

3
Q

two ways to present descriptive statistics: describe both

A

tables are great for summarizing large quantities of complex data, but
can be a challenge to read and interpret
-frequency tables

graphs can be used for simpler datasets, and are easily interpreted by
the reader
Difference between bar graph and histogram:
- a bar graph is good for categorical (nominal/ordinal) data
- the bars in a bar graph are separated by gaps because the categories are distinct
- a histogram has continuous data along the bottom axis, for interval and ratio data
- the boxplot is not used too often because it isn't nice to look at and doesn't convey much

4
Q

natural break points

A

natural break points: values where the frequency is 0 or low, which make sensible class boundaries

5
Q

Frequency Tables: 5 important things to note when making one

A
  1. use intervals with simple bounds
  2. respect natural breakpoints
  3. the intervals must not overlap and must include all observations
  4. all intervals should be the same width
  5. select an appropriate number of classes
    • this is hardest to determine
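A minimal Python sketch of rules 3 and 4 above (non-overlapping, exhaustive, equal-width classes). The function name, the choice of half-open intervals, and the data are illustrative assumptions, not part of the original cards.

```python
def frequency_table(values, lower, width, n_classes):
    """Count observations in equal-width, non-overlapping classes.

    Class k covers [lower + k*width, lower + (k+1)*width). The last class
    also includes its upper bound so that every observation is counted.
    """
    upper = lower + width * n_classes
    counts = [0] * n_classes
    for v in values:
        if not lower <= v <= upper:
            raise ValueError(f"{v} falls outside the table")
        k = min(int((v - lower) // width), n_classes - 1)
        counts[k] += 1
    return [(lower + k * width, lower + (k + 1) * width, counts[k])
            for k in range(n_classes)]


# Hypothetical data, not taken from the cards.
data = [4, 6, 11, 12, 18, 24, 25, 31, 37, 40]
table = frequency_table(data, lower=0, width=10, n_classes=4)
for low, high, count in table:
    print(f"{low:>3} to {high:<3} {count}")
```

Simple bounds (rule 1) and the number of classes (rule 5) are still judgment calls made through the `lower`, `width`, and `n_classes` arguments.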
6
Q

a histogram must have _____ breaks

A

equal

7
Q

Line Graphs Vs Scatterplots

A

- a line graph is used for categorical data
  - you CANNOT use the line to guess values between the points, only to see the pattern
- a scatterplot is used for interval/ratio (continuous) data

8
Q

The rose diagram is used for

A

o directional data has its own specific visual descriptive: the rose
diagram

- making the wedge intervals increase toward the outside makes more sense, because the wedges themselves get bigger toward the outside

9
Q

CENTRAL TENDENCY:

define each one below

midrange

mode

median

mean

A

o midrange: the midpoint between the largest and smallest values of a variable
in the data set
- the midrange is strongly affected by extreme values

o mode: the most common/frequent value of a variable in the data set
- what is the mode of a data set with no repeating values? No mode
- the midrange and mode are crude statistics, and often do not provide an accurate
measure of centrality

o median: the value of a variable that divides the ordered observations in half

o mean: the average value of a variable in the data set
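The four measures can be sketched in Python as follows. The helper names and the data set are assumptions chosen for illustration, and `modes` returns an empty list for the "no mode" case described on the card.

```python
from collections import Counter
from statistics import mean, median


def midrange(values):
    """Midpoint between the smallest and largest value."""
    return (min(values) + max(values)) / 2


def modes(values):
    """Most frequent value(s); an empty list when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [v for v, c in counts.items() if c == top]


# Hypothetical data with one extreme value (40).
data = [4, 6, 6, 10, 15, 24, 40]
print("midrange:", midrange(data))
print("mode(s): ", modes(data))
print("median:  ", median(data))
print("mean:    ", mean(data))
```

With this data the extreme value pulls the midrange well above the median, which is the sensitivity the card warns about.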

10
Q

Arithmetic Mean:
population mean symbol

sample mean symbol

A

μ (the Greek letter mu)

x̄ (x-bar: an x with a horizontal line over it)

11
Q

Geometric Mean

A

notice that the values do not all have the same influence (they are weighted differently), so we need a geometric mean

12
Q

Arithmetric Vs. Geometric Mean

A

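o the arithmetic mean is used when each data point has the same influence or
"weight" as all the other points

o the geometric mean is used when each data point has an associated frequency,
influence, or weight attached to it, such that some data points are more important
than others
- note: in standard usage a mean that uses frequencies or weights is called the weighted mean, while the geometric mean is the n-th root of the product of the n values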
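A short Python sketch of the contrast. It assumes the weighted form sum(f_i * x_i) / sum(f_i), which matches the description of f(i) as a weight but is not written out on the cards. The function names and numbers are illustrative.

```python
def arithmetic_mean(values):
    """Every value has the same influence."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Each value x_i counts in proportion to its weight f_i."""
    return sum(f * x for x, f in zip(values, weights)) / sum(weights)


# Hypothetical values and weights.
values = [10, 20, 30]
weights = [1, 1, 8]
print(arithmetic_mean(values))         # equal influence
print(weighted_mean(values, weights))  # pulled toward the heavily weighted 30
```

When every weight is equal, the two functions return the same result.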

13
Q

in geometric mean the f(i) stands for the …

A

weight of the value

14
Q

Ranges

A

ex: $0-10000
when the data are reported as ranges we need to start making assumptions: first, assume that the midpoint
of each range is a suitable representative value (is it always?)

- then we can determine the geometric mean
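A sketch of the midpoint assumption in Python. The class bounds and frequencies are hypothetical, and the mean is computed as a frequency-weighted mean of the midpoints, which is an assumption about the formula the cards intend.

```python
def grouped_mean(classes):
    """Mean of grouped data, using each class midpoint as its representative value.

    `classes` is a list of (lower, upper, frequency) tuples.
    """
    total = sum(freq for _, _, freq in classes)
    weighted_sum = sum(freq * (lower + upper) / 2 for lower, upper, freq in classes)
    return weighted_sum / total


# Hypothetical income classes: (lower bound, upper bound, frequency).
income_classes = [(0, 10000, 5), (10000, 20000, 3), (20000, 30000, 2)]
print(grouped_mean(income_classes))
```

The result is only as good as the midpoint assumption: if most observations in a class sit near one of its bounds, the midpoint misrepresents them.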

15
Q

Which measure of central tendency is best?

A

o the centrality statistic should represent the typical value of the data set
o only the mean considers all of the values in the data set; the other statistics
only rely on specific values
o if you change any value in the set, the mean will also change
o usually, the mean is considered the best because of this property, but there
are some exceptions

16
Q

Times when the mean is not reflective of the typical value:

A

  1. Bimodal distributions
    - the mean and median do not reflect the typical value, but the modes do
  2. Presence of extreme values that will strongly affect the mean
    - median and mode best
  3. Skewed distributions
    - the mode is most typical here, while the mean is affected by the
    stretch in the data set
17
Q

How to evaluate the dispersion around the mean, median and mode? (3 ways)

A

1. Range
- the range is the difference between the largest and smallest data point
$x_{max} - x_{min} = 24 - 4 = 20$
-the range only considers the 2 extreme values of the data set, which often
are not representative of the whole data set
-also, larger samples tend to have larger ranges, since they are more likely
to contain the rare or unusual members of a population

2. Percentiles & Quartiles
- recall that the median splits the data set in half and is known as the 50th
percentile (or 2nd quartile): half of the data is above the median and half is
below
- to find the position of the 25th percentile: $(n + 1) \times P = (6 + 1) \times 0.25 = 1.75$
- the 25th percentile is the 1.75th value in the ordered data set (three-quarters of the way from the 1st value to the 2nd) = 5.5
- 25% of the data set is below 5.5, 75% of the data set is above 5.5

3. Variance
- the variance can be described as the average of the squared differences of each
value from the mean (the sum of squared differences divided by N for a population, or by n - 1 for a sample)
- it becomes the standard deviation when you take its square root
- strongly affected by extreme values
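The three measures can be sketched in Python as below. The data set is hypothetical: it was chosen only so that n = 6, the minimum is 4, the maximum is 24, and the 25th percentile is 5.5, matching the figures on the card. The function names are illustrative.

```python
import math


def percentile_value(values, p):
    """Value at position (n + 1) * p in the ordered data, interpolating linearly."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p              # 1-based and possibly fractional
    if position <= 1:
        return ordered[0]
    if position >= n:
        return ordered[-1]
    whole = int(position)
    fraction = position - whole
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


def population_variance(values):
    """Average squared difference from the mean (divides by N)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


# Hypothetical data consistent with the card's worked numbers.
data = [4, 6, 10, 15, 20, 24]
data_range = max(data) - min(data)
q1 = percentile_value(data, 0.25)
q3 = percentile_value(data, 0.75)
std_dev = math.sqrt(population_variance(data))
print(data_range, q1, q3, q3 - q1, round(std_dev, 3))
```

For a sample rather than a population, the divisor in `population_variance` would be n - 1 instead of N.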

18
Q

Interquartile range

A

difference between the 3rd and 1st quartiles (Q3 - Q1)

19
Q

_________ is a
better measure of dispersion as it omits those extreme
values

A

interquartile range

20
Q

Coefficient of Variation

A

if you want to compare the standard deviations of 2 data sets, you must
ensure that they have the same mean value (not always applicable)

- when that isn't possible, the coefficient of variation (CV = standard deviation / mean) is used to compare them
- a data set with a low CV is less variable than
one with a high CV
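A minimal Python sketch of that comparison, using the sample standard deviation from the standard library. The two data sets are hypothetical and were picked to have the same standard deviation but different means.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a fraction of the mean."""
    return stdev(values) / mean(values)


# Hypothetical data: both sets have a standard deviation of 2.
set_a = [98, 100, 102]   # mean 100
set_b = [8, 10, 12]      # mean 10
cv_a = coefficient_of_variation(set_a)
cv_b = coefficient_of_variation(set_b)
print(cv_a, cv_b)
```

Here `set_a` has the lower CV, so relative to its mean it is the less variable of the two even though the standard deviations are equal.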

21
Q

a data set with a low CV is _____ variable than

one with a high CV

A

less

22
Q

Skewness

Positive

none

negative

A

the skewness of a data set describes how symmetrical the values are around
the mean, or the difference between the mean and median

- positive skewness: the mean is greater than the median; the data set is asymmetrical
- no skewness: the mean is equal to the median; the data set is symmetrical
- negative skewness: the mean is less than the median; the data set is asymmetrical
23
Q

kurtosis

A

the kurtosis of a data set describes how peaked the data set is

- platykurtic: the data set is relatively flat and spread out; kurtosis < 3
- mesokurtic: the data set is relatively normally distributed; kurtosis = 3
- leptokurtic: the data set is narrow and peaked; kurtosis > 3
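The cards do not give formulas for skewness or kurtosis. The Python sketch below assumes the common moment-based definitions, in which a normal distribution has kurtosis 3, consistent with the thresholds above. The function names and data are illustrative.

```python
def central_moment(values, k):
    """Average of (x - mean) ** k over the data set."""
    m = sum(values) / len(values)
    return sum((x - m) ** k for x in values) / len(values)


def skewness(values):
    """Third standardized moment: positive when the long tail is on the right."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth standardized moment (not excess kurtosis): about 3 for normal data."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical data sets.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]   # mean 3 is greater than median 1
print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed))
```

The symmetric set has zero skewness and a kurtosis below 3, which is the flat case in the list above.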

24
Q

Standardization

A

o it can be difficult to compare multiple data sets that each have different
means and standard deviations – to do this we have to standardize the data
o standardization translates the data set so that it has a mean of 0 and a
standard deviation of 1 – this allows you to compare multiple standardized
data sets easily

  • the standardized values are known as z-scores – each z-score describes how
    many standard deviations the value is from the mean
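A minimal Python sketch of standardization. It assumes the population standard deviation is used as the divisor, since the cards do not say which one applies. The data are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m = mean(values)
    s = pstdev(values)   # population standard deviation (an assumption)
    return [(x - m) / s for x in values]


# Hypothetical data with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(data)
print(z)
```

A z-score of 2.0 for the value 9 says that 9 lies two standard deviations above the mean of this data set.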
25
Q

z scores

A

the standardized values are known as z-scores – each z-score describes how
many standard deviations the value is from the mean

26
Q

Types of data commonly used by geographers

A

  1. Areal Data
    - data are frequently published for discrete areal units such as provinces,
    countries, census tracts, watersheds, and other bounded units
    - the location quotient is a frequently used statistic in economic geography and
    locational analysis
  2. Point Data
    - distance is either explicitly or implicitly
    included within these measures
    - if we collect a datum from a point on Earth's
    surface, that datum can be georeferenced, so
    subsequent data can then be spatially related to
    each other
27
Q

Location quotient

A
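\[ LQ_i = \frac{A_i / \sum A_i}{B_i / \sum B_i} = \frac{5/100}{150/1000} = \frac{0.05}{0.15} = 0.333 \]
where A_i is the level of the activity in area i and B_i is the level of the base in area i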
  1. if LQ > 1, this indicates a relative concentration of the activity in area i,
    compared to the region as a whole
  2. if LQ = 1, the area has a share of the activity in accordance with its share
    of the base
  3. if LQ < 1, the area has less of a share of the activity than is more generally,
    or regionally, found
  - location quotients can be easily mapped
  - mostly used in economic geography
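A small Python sketch of the calculation and its interpretation. It assumes the usual form of the quotient, the area's share of the activity divided by its share of the base, and reuses the numbers from the card. The function and argument names are illustrative.

```python
def location_quotient(activity_i, activity_total, base_i, base_total):
    """Area i's share of the activity divided by its share of the base."""
    return (activity_i / activity_total) / (base_i / base_total)


def interpret(lq, tolerance=1e-9):
    """Map an LQ value onto the three cases listed above."""
    if abs(lq - 1) <= tolerance:
        return "share of the activity matches share of the base"
    if lq > 1:
        return "relative concentration of the activity"
    return "less than the regional share of the activity"


# Numbers from the card: (5/100) / (150/1000).
lq = location_quotient(5, 100, 150, 1000)
print(round(lq, 3), "->", interpret(lq))
```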
28
Q

Lorenz Curve

A

- the Lorenz curve is a graphical technique for describing the distribution of a
variable among spatial units
- the more the Lorenz curve deviates from the 1:1 line, the more
concentrated the activity is in some units compared to the others in
the region

29
Q

Gini coefficient

A

the Gini coefficient is a commonly used descriptive statistic used in conjunction
with the Lorenz curve, and is defined as the maximum deviation between the
Lorenz curve and the 1:1 line

- the range of the Gini coefficient is from 0 to 100%
- another way of thinking about this is that, as the coefficient increases, the activity
becomes less similar throughout the units; for this reason
the Gini coefficient is often called an index of dissimilarity
- 0 is on the 1:1 line (a perfectly even distribution), 100 is complete inequality

  • the lower the %, the more evenly distributed the sector is
  • the larger the %, the more unevenly distributed
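A Python sketch of the definition used on this card, the maximum deviation between the Lorenz curve and the 1:1 line expressed as a percentage. It assumes the curve is built by ranking units by their activity-to-base ratio and accumulating shares, and that every unit has a non-zero base. The names and data are illustrative.

```python
def lorenz_points(activity, base):
    """Cumulative (base share, activity share) points, units ranked by activity/base."""
    total_a = sum(activity)
    total_b = sum(base)
    ranked = sorted(zip(activity, base), key=lambda ab: ab[0] / ab[1])
    points = [(0.0, 0.0)]
    cum_a = cum_b = 0.0
    for a, b in ranked:
        cum_a += a / total_a
        cum_b += b / total_b
        points.append((cum_b, cum_a))
    return points


def max_deviation_percent(points):
    """Largest vertical gap between the curve and the 1:1 line, in percent."""
    return 100 * max(abs(x - y) for x, y in points)


# Hypothetical regions with three spatial units each.
even = max_deviation_percent(lorenz_points([10, 20, 30], [100, 200, 300]))
concentrated = max_deviation_percent(lorenz_points([0, 0, 100], [1, 1, 1]))
print(even, round(concentrated, 1))
```

Because the curve is piecewise linear, checking the gap at its vertices is enough to find the maximum.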
30
Q

mean centre and Manhattan median

A

mean centre:
- does not consider the built environment
- the mean centre provides a location which minimizes the sum of squared distances
to the points; we can also determine the dispersion of the points by
calculating the standard distance

Manhattan median:
- for places on a grid system like Manhattan
- the value may not identify a point but an AREA
- the Manhattan median is a unique point only when there is an odd number of
observations; when there is an even number of observations it is an area
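Both centres can be sketched in Python. The Manhattan median is taken here as the coordinate-wise median, returned as an interval on each axis so that the even-count case shows up as an area. The function names and points are illustrative assumptions.

```python
from statistics import mean, median_high, median_low


def mean_centre(points):
    """Average x and average y of the points."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))


def manhattan_median(points):
    """Median interval on each axis: ((x_low, x_high), (y_low, y_high)).

    With an odd number of points, low equals high on both axes (a single point).
    With an even number, the intervals can describe a rectangle.
    """
    xs, ys = zip(*points)
    return ((median_low(xs), median_high(xs)),
            (median_low(ys), median_high(ys)))


# Hypothetical point patterns.
odd_points = [(0, 0), (2, 1), (4, 5)]
even_points = [(0, 0), (2, 1), (4, 5), (6, 2)]
print(mean_centre(odd_points))
print(manhattan_median(odd_points))
print(manhattan_median(even_points))
```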

31
Q

standard distance vs. relative distance

A

the mean centre provides a location which minimizes the sum of squared distances
to the points; we can also determine the dispersion of the points by
calculating the standard distance

o the relative distance provides a more intuitive measure, and it allows us to
compare different spatial data sets that do not share the same mean centre or
standard distance
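A Python sketch of the standard distance, assuming the usual definition as the root-mean-square distance of the points from the mean centre (dividing by n). The relative distance is not sketched because the cards do not give its formula. The names and points are illustrative.

```python
import math


def standard_distance(points):
    """Root-mean-square distance of the points from their mean centre."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)


# Hypothetical pattern: the four corners of a 2 x 2 square.
square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(standard_distance(square))
```

Each corner is the same distance from the centre (1, 1), so the standard distance equals that distance.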

32
Q

dispersion and clustering

A

o dispersion v. clustering
o one of the main ways we can describe spatial data is to define whether the
points/locations are clustered or dispersed around a central point
o but clustering/dispersion may also be related to the size of the sample area

33
Q

standard deviational ellipse

5 steps

A

the standard deviational ellipse summarizes dispersion in a point pattern
as an ellipse rather than a circle (2 dimensions vs 1 dimension)

  1. translate the data such that the origin is at the mean centre
  2. calculate the angle of rotation: this determines the direction of maximum
    dispersion
    - this shows you the primary and secondary trend of the data
  3. calculate the standard deviation parallel to the new y-axis
  4. calculate the standard deviation parallel to the new x-axis
  5. fit an ellipse with dimensions of sx and sy
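The five steps can be sketched in Python as below. The cards do not give the rotation formula, so this sketch assumes the principal-axis angle, measured counter-clockwise from the x-axis, and divides by n when computing the standard deviations. The function name and points are illustrative.

```python
import math


def standard_deviational_ellipse(points):
    """Return (centre, angle in radians, sd along major axis, sd along minor axis)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Step 1: move the origin to the mean centre.
    shifted = [(x - cx, y - cy) for x, y in points]
    sxx = sum(x * x for x, _ in shifted)
    syy = sum(y * y for _, y in shifted)
    sxy = sum(x * y for x, y in shifted)
    # Step 2: angle of the direction of maximum dispersion (assumed formula).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Steps 3 and 4: standard deviations along the rotated axes.
    sd_major = math.sqrt(sum((x * cos_t + y * sin_t) ** 2 for x, y in shifted) / n)
    sd_minor = math.sqrt(sum((-x * sin_t + y * cos_t) ** 2 for x, y in shifted) / n)
    # Step 5: the ellipse is centred on (cx, cy) with these two semi-axes.
    return (cx, cy), theta, sd_major, sd_minor


# Hypothetical points lying along the line y = x.
centre, theta, sd_major, sd_minor = standard_deviational_ellipse(
    [(0, 0), (1, 1), (2, 2), (3, 3)])
print(centre, math.degrees(theta), sd_major, sd_minor)
```

For points on a straight line the ellipse collapses: all the dispersion lies along the major axis and the minor axis has almost no spread.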