Descriptive Statistics Flashcards
Analysis and summary of the measurements of a set of objects on a single variable
e.g.
water levels in a group of wells
ages of water in a group of aquifers
nitrate concentrations in a group of lakes
analysis of 1D vector data
tabular and graphical summaries
frequency distributions (most basic type)
-count of how often values occur
relative frequency
-each frequency divided by the total number of observations, usually expressed as a %
cumulative frequency
-tallying number of values that occur up to and including each value
-how many observations are above or below a certain value
cumulative relative frequency
-cumulative frequency divided by the total number of observations
-%
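The four kinds of frequency described above can be tabulated together. The following is a minimal sketch using only the standard library; the sample values and the helper name `frequency_table` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: nitrate concentrations (mg/L) in ten lakes
values = [2, 3, 3, 4, 4, 4, 5, 5, 6, 8]

def frequency_table(data):
    """Return rows of (value, freq, rel_freq, cum_freq, cum_rel_freq)."""
    counts = Counter(data)          # frequency: how often each value occurs
    n = len(data)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        freq = counts[value]
        cumulative += freq          # tally up to and including this value
        rows.append((value, freq, freq / n, cumulative, cumulative / n))
    return rows

table = frequency_table(values)
for row in table:
    print(row)
```

The last row always has a cumulative frequency equal to the number of observations and a cumulative relative frequency of 1.0 (100%).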
most commonly displayed as a histogram
data have to be grouped into intervals first, so how many intervals should there be?
usually between 5 and 15 intervals
-goal is to reduce data without masking important features
for samples of fewer than 200 observations, use Sturges' rule: k = 1 + 3.3 log10(n), where k is the number of intervals and n is the number of observations
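As a quick check on interval counts, here is a small sketch of Sturges' rule in its usual form, k = 1 + 3.3 log10(n). The function name `sturges_intervals` is hypothetical, and rounding to the nearest whole number is an assumption (some texts round up instead).

```python
import math

def sturges_intervals(n):
    """Suggested number of histogram intervals, k = 1 + 3.3 * log10(n)."""
    if n < 1:
        raise ValueError("need at least one observation")
    # Rounding to the nearest integer is an assumption of this sketch
    return round(1 + 3.3 * math.log10(n))

for n in (20, 50, 100, 199):
    print(n, sturges_intervals(n))
```

For sample sizes below 200 the result stays well inside the 5 to 15 interval guideline.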
theoretical frequency distributions
frequency distributions based on observations are empirical (finite number of observations)
narrowing the intervals and adding groups gives more information about the shape
theoretical (imagine what the distribution would look like if it were continuous)
use empirical distribution to estimate the properties of the theoretical distribution
(use a sample of a population to estimate the whole population)
shapes of distributions
uniform (straight line across)
U-shaped (reveals polarization: more observations at either end and fewer in the middle)
J-shaped
-e.g. counting the number of defects in quality-controlled products
-most products have close to 0 defects; progressively fewer products have higher numbers of defects
bell-shaped
-normal distribution (very common)
-heights of males, grades, etc
skewed
-varies like a bell shape but is skewed over to one side
bi-modal
- two bumps
-eg heights of a population that includes both male and female
skewed distribution
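named according to the side the tail is on (long tail to the right = right or positively skewed; long tail to the left = left or negatively skewed)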
central tendency
single summary value that suggests a typical or representative observation
tends to describe the value that occurs the most often
assess using mode (value in distribution that occurs the most frequently)
-if data is grouped, it is the interval with the most observations
median: middle value of a set of ordered data (50% of observations lower, 50% higher)
-if grouped, it is the midpoint of the interval in which the cumulative relative frequency reaches 50%
-not sensitive to extremes
mean
-most important measure of central tendency
-takes into account each value
-very sensitive to outliers
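To see how the three measures behave, the sketch below computes the mode, median, and mean with Python's standard `statistics` module, then adds one extreme value. The water-level numbers are invented for illustration.

```python
import statistics

# Hypothetical water levels (m) in seven wells
levels = [12, 15, 15, 16, 18, 20, 23]

mode_value = statistics.mode(levels)      # most frequent value
median_value = statistics.median(levels)  # middle of the ordered data
mean_value = statistics.mean(levels)      # uses every value

# Add a single extreme observation and recompute
with_outlier = levels + [95]
median_outlier = statistics.median(with_outlier)
mean_outlier = statistics.mean(with_outlier)

print(mode_value, median_value, mean_value)
print(median_outlier, mean_outlier)
```

With these numbers the median moves from 16 to 17 while the mean jumps from 17 to 26.75, which is the sensitivity to extremes noted above.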
weighted mean
-provides a central tendency measure of data when the observations vary in their degree of importance
-each observation is multiplied by importance weight
-sensitive to extremes
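A minimal sketch of a weighted mean follows. It assumes the common convention of dividing the weighted sum by the total weight; the lake data and the function name `weighted_mean` are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical: nitrate concentration (mg/L) in three lakes,
# weighted by lake area (km^2) so larger lakes count for more
concentrations = [2.0, 4.0, 10.0]
areas = [1.0, 3.0, 6.0]

area_weighted = weighted_mean(concentrations, areas)
unweighted = sum(concentrations) / len(concentrations)
print(area_weighted, unweighted)
```

Here the largest lake also has the highest concentration, so the weighted mean (about 7.4) sits above the unweighted mean (about 5.3).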
How do operations on observations affect the mean
adding, subtracting, multiplying, or dividing every observation by the same constant applies the same operation to the mean
x= set of observations
c= constant
X= mean
mean of (x + c) = X + c
mean of (x - c) = X - c
mean of (x * c) = X * c
mean of (x / c) = X / c (c not equal to 0)
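The four rules can be checked numerically. This is a small sketch with made-up observations, using `x`, `c`, and `X` as defined on this card.

```python
import statistics

x = [2.0, 4.0, 6.0, 8.0]   # hypothetical set of observations
c = 2.0                    # constant
X = statistics.mean(x)     # mean of the original observations

mean_plus = statistics.mean([xi + c for xi in x])
mean_minus = statistics.mean([xi - c for xi in x])
mean_times = statistics.mean([xi * c for xi in x])
mean_over = statistics.mean([xi / c for xi in x])

# Each transformed mean equals the same operation applied to X
print(mean_plus, X + c)
print(mean_minus, X - c)
print(mean_times, X * c)
print(mean_over, X / c)
```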
transforming data
if data are skewed, we will want to transform to reduce skewness before calculating central tendency
-log transform (take natural log of each value of data)
- many statistical methods assume a normal distribution
- log transform cannot be done on values less than or equal to 0
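A minimal sketch of a natural-log transform that refuses zero or negative values, since the logarithm is undefined for them. The function name `log_transform` and the sample ages are hypothetical.

```python
import math

def log_transform(data):
    """Take the natural log of each value; all values must be greater than 0."""
    if any(v <= 0 for v in data):
        raise ValueError("natural log is only defined for values greater than 0")
    return [math.log(v) for v in data]

# Hypothetical right-skewed data: groundwater ages (years) with one very old sample
ages = [1, 2, 3, 5, 8, 13, 400]
logged = log_transform(ages)

# The extreme value is pulled much closer to the rest on the log scale
print(max(ages) - min(ages))
print(max(logged) - min(logged))
```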
range
difference between highest and lowest observed value
-only takes into account extremes
Mean absolute deviation (MAD)
-greater variation corresponds to larger deviations from the mean
-average of the absolute deviations from the mean (the raw deviations always sum to 0)
variance
- similar to MAD but the squares of the deviations are taken before summing them to get an average squared deviation
-we get better representation of true population with more data
standard deviation
square root of the variance, so it is in the same units as the data
-aka root mean square (RMS) deviation from the mean
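The four dispersion measures (range, MAD, variance, standard deviation) can be computed side by side. This sketch assumes the population form, dividing by n, to match the "average squared deviation" wording; sample formulas that divide by n - 1 give slightly larger values. The function name `dispersion` and the data are hypothetical.

```python
import math

def dispersion(data):
    """Return (range, mean absolute deviation, variance, standard deviation)."""
    n = len(data)
    mean = sum(data) / n
    deviations = [v - mean for v in data]

    value_range = max(data) - min(data)          # extremes only
    mad = sum(abs(d) for d in deviations) / n    # average absolute deviation
    variance = sum(d * d for d in deviations) / n  # average squared deviation (divide by n)
    std_dev = math.sqrt(variance)                # back in the units of the data
    return value_range, mad, variance, std_dev

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(dispersion(data))
```

For this data set the mean is 5, so the function returns a range of 7, a MAD of 1.5, a variance of 4.0, and a standard deviation of 2.0.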
variation in a normal distribution
about 95% of data are within 2 standard deviations of the mean
-about 99.7% (nearly all) of data are within 3 standard deviations of the mean
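These proportions can be recovered from the normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8 or later); the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The values come out at roughly 0.683, 0.954, and 0.997 for 1, 2, and 3 standard deviations.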
standardized z scores
distance an observation is from the mean in standard deviations
z = (value - mean) / standard deviation
-can find the proportion of data that falls above or below a certain z score
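A short sketch of a z score and the proportion of data below or above it. The proportion step assumes the data are normally distributed; the function name `z_score` and the numbers (mean 100, standard deviation 15) are invented for illustration.

```python
from statistics import NormalDist

def z_score(value, mean, std_dev):
    """Distance of a value from the mean, in standard deviations."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be greater than 0")
    return (value - mean) / std_dev

z = z_score(130, 100, 15)

# Assuming a normal distribution, the standard normal CDF gives the
# proportion of observations below this z score
proportion_below = NormalDist().cdf(z)
proportion_above = 1 - proportion_below
print(z, round(proportion_below, 4), round(proportion_above, 4))
```

A z score of 2.0 puts the observation above roughly 97.7% of a normally distributed population.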
how do operations on observations affect standard deviation