Summarizing Data Flashcards
Any characteristic that differs from person to person, such as height, sex, smallpox vaccination status, or physical activity pattern.
The value of a variable is the number or descriptor that applies to a particular person
Variable
Epidemiologic database organized like a spreadsheet with rows and columns
Line listing
Each row representing one person or case of disease
Record or observation
Column contains information about one characteristic of the individual such as race or date of birth
Variable
Categorical variable
Qualitative
Nominal
Ordinal
Continuous
Quantitative
Interval
Ratio
Categories without any numerical ranking such as county of residence
Alive or dead
Ill or well
Nominal scale
Nominal variable with two mutually exclusive categories
Ill or well
Dichotomous
Values that can be ranked but are not necessarily evenly spaced
Stage of cancer
Ordinal-scale variable
Measured on a scale of equally spaced units, but without a true zero point such as date of birth
Interval-scale variable
Interval variable with a true zero point, such as height in centimeters or duration of illness
Ratio-scale variable
Where the distribution has its peak
Clustering at a particular value
Central location
Central tendency of a frequency distribution
How widely dispersed the distribution is on both sides of the peak
Variation, dispersion
Distribution out from a central value
Independent of its central location
Spread
Bell shaped curve
Normal distribution
Measures of central location
Mean
Median
Mode
Midrange
Geometric mean
Third property of a frequency distribution where it may be asymmetrical or symmetric
Shape
Refers to the tail of the distribution, not the hump
Skewness
Long tail to the left
Skewed to the left
Distribution that has a central location to the left and a tail off to the right is said to be
positively skewed
skewed to the right
Common in distributions that begin with 0
e.g., number of servings consumed, number of sexual partners
Skewed to the right
Classic or symmetrical bell-shaped curve
Defined by a mathematical equation
Mean, median and mode coincide at the central peak but the area under the curve helps determine measures of spread such as the standard deviation and confidence interval
Normal distribution
Gaussian distribution
Types of variable that may be summarized in ratio or proportion
Nominal
Ordinal
Interval
Ratio
Types of variable where measures of central location may be employed
Interval
Ratio
Provides a single value that summarizes an entire distribution of data
Measure of central location
e.g., average age of affected persons
Selecting the best measure to use for a given distribution depends largely on two factors:
Shape or skewness of distribution
Intended use of measure
Value that occurs most often in a set of data
Mode
Frequency distribution that has two modes
Bi-modal
In a histogram, the mode is the
Tallest column
Preferred measure of central location for addressing which value is the most popular or the most common
Used almost exclusively as descriptive measure
It is not typically affected by one or two extreme values (outliers)
Mode
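The mode cards above can be illustrated with a minimal Python sketch. The helper name `modes` and the observations are hypothetical and chosen only for illustration; they do not come from the flashcards.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, smallest first."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

# Hypothetical observations, not taken from the flashcards.
one_mode = modes([1, 2, 2, 3, 4])         # a single most common value
two_modes = modes([1, 1, 2, 3, 3])        # bimodal: two values tie
with_outlier = modes([1, 2, 2, 3, 400])   # the outlier does not move the mode

print(one_mode, two_modes, with_outlier)
```

Returning a list rather than a single value keeps the bimodal case visible instead of silently picking one of the tied values.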
Middle value of a set of data that has been put into rank order
Value that divides the data into two halves with one half of the observations being smaller than the median value and the other half being larger
50th percentile of distribution
Median
Middle position =
(n+1)/2
If odd, middle position falls on single observation, median is the value of that observation
If even, middle position falls between two observations, median equals the average of the two values
Good descriptive measure for data that are skewed because it is the central point of distribution
Not generally affected by extremes (outliers)
Median
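The middle-position rule above, (n+1)/2, can be sketched in Python as follows. The function name and the sample values are assumptions made for illustration only.

```python
def median(values):
    """Median via the middle-position rule (n + 1) / 2 on rank-ordered data."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the middle position (n + 1) / 2 falls on one observation.
        position = (n + 1) // 2          # 1-based position
        return ordered[position - 1]
    # Even n: the middle position falls between two observations.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

# Hypothetical observations, not taken from the flashcards.
odd_median = median([13, 7, 9, 15, 11])        # middle of 7, 9, 11, 13, 15
even_median = median([15, 7, 13, 9, 10, 11])   # average of 10 and 11
outlier_median = median([7, 9, 11, 13, 150])   # unchanged by the extreme value

print(odd_median, even_median, outlier_median)
```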
Value that is closest to all other values in a distribution
Add all observed values in the distribution
Divide the sum by the number of observations
Mean
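As a small sketch of the two steps on this card (add the values, then divide by the number of observations), assuming hypothetical ages that are not from the flashcards:

```python
def arithmetic_mean(values):
    """Sum the observations, then divide by how many there are."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)

# Hypothetical ages of five case-patients, for illustration only.
ages = [7, 9, 11, 13, 15]
mean_age = arithmetic_mean(ages)   # 55 / 5

print(mean_age)
```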
When the mean is subtracted from each observation in the data set, the sum of these differences is zero
Also called center of gravity
Point at which the distribution would balance
Not a good measure for data that are severely skewed or have extreme values in one direction or another
Affected by extreme value because the mean uses all of the observations in the distribution
Centering property of the mean
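The centering property, and the mean's sensitivity to extreme values, can be checked numerically. The following is only a sketch on hypothetical data; the variable names are illustrative.

```python
# Hypothetical observations, not taken from the flashcards.
values = [7, 9, 11, 13, 15]
mean = sum(values) / len(values)

# Centering property: deviations from the mean sum to zero.
deviations = [x - mean for x in values]
total_deviation = sum(deviations)

# Replacing the largest value with an extreme one drags the mean upward,
# because the mean uses every observation.
skewed = [7, 9, 11, 13, 150]
skewed_mean = sum(skewed) / len(skewed)   # 190 / 5

print(total_deviation, mean, skewed_mean)
```

With real-valued data the sum of deviations may differ from zero by a tiny floating-point error, so a tolerance check is safer than an exact comparison.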
Halfway point or the midpoint of a set of observations
Calculated as intermediate step in determining other measures
Identify the smallest (minimum) observation and the largest (maximum) observation
Add the minimum + maximum, then divide by two
Midrange
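A minimal sketch of the midrange steps above, using hypothetical values. The second call hints at why the midrange reacts so strongly to a single extreme observation.

```python
def midrange(values):
    """Add the minimum and the maximum, then divide by two."""
    if not values:
        raise ValueError("midrange of an empty data set is undefined")
    return (min(values) + max(values)) / 2

# Hypothetical observations, not taken from the flashcards.
plain_midrange = midrange([7, 9, 11, 13, 15])     # (7 + 15) / 2
outlier_midrange = midrange([7, 9, 11, 13, 150])  # (7 + 150) / 2

print(plain_midrange, outlier_midrange)
```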
Mean or average of a set of data measured on a logarithmic scale
Used when the logarithms of the observations are distributed normally (symmetrically) rather than the observations themselves
Geometric mean
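One way to read "mean on a logarithmic scale" is: take the logarithm of each value, average the logarithms, then transform back. The sketch below assumes natural logarithms and a hypothetical set of doubling values; both choices are illustrative, not from the flashcards.

```python
import math

def geometric_mean(values):
    """Average the logarithms of the values, then exponentiate the result."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs at least one value, all positive")
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))

# Hypothetical values that double at each step, so their logs are evenly spaced.
titers = [8, 16, 32, 64, 128]
geo_mean = geometric_mean(titers)           # approximately 32
arith_mean = sum(titers) / len(titers)      # 248 / 5, pulled up by the large values

print(geo_mean, arith_mean)
```

Python 3.8 and later also ship `statistics.geometric_mean`, which could replace the helper above.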
Uses all data but not as sensitive to outliers as arithmetic mean
Geometric mean
Most sensitive to outliers
Midrange
Describe the dispersion (or variation) of values from that peak in the distribution
Measures of spread
Measures of spread
Range
Interquartile range
Standard deviation
Difference between its largest (maximum) value and its smallest (minimum) value
From the minimum to the maximum
Range
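Both ways of reporting the range (as "minimum to maximum" and as a single difference) can be sketched together. The helper name and data are hypothetical.

```python
def value_range(values):
    """Return (minimum, maximum, maximum - minimum)."""
    if not values:
        raise ValueError("range of an empty data set is undefined")
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest

# Hypothetical observations, not taken from the flashcards.
lowest, highest, width = value_range([7, 9, 11, 13, 15])

print(f"from {lowest} to {highest}, a range of {width}")
```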
Divide the data in a distribution into 100 equal parts
90th percentile has 90% of the observations at or below it
Percentile
Measure of spread most commonly used with median
Central portion of distribution from 25th to 75th percentile
Interquartile range
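A hedged sketch of the interquartile range using Python's standard library. Several conventions exist for locating quartiles, and this example assumes the `"inclusive"` method of `statistics.quantiles`, which may give slightly different cut points than a hand calculation from a textbook. The data are hypothetical.

```python
import statistics

# Hypothetical incubation periods in days, with one extreme value.
days = [2, 3, 4, 5, 6, 7, 8, 9, 30]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(days, n=4, method="inclusive")
iqr = q3 - q1   # width of the central portion, 25th to 75th percentile

print(q1, q2, q3, iqr)
```

Because the interquartile range ignores the outer quarters of the data, the extreme value of 30 days has no effect on it here.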
Measure of spread used most commonly with the arithmetic mean
Subtracting the mean from each observation
The difference between the mean and each observation is squared to eliminate negative numbers
The average of the squared differences is calculated, and the square root is taken to get back to the original units
Variability of data
Standard deviation
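The steps on this card can be followed one by one in code. The flashcard says only that an "average" of the squared differences is taken; the sketch below assumes the common sample convention of dividing by n - 1, and exposes a hypothetical `sample` flag to divide by n instead. The data are illustrative.

```python
import math

def standard_deviation(values, sample=True):
    """Subtract the mean, square, average, then take the square root."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    squared_differences = [(x - mean) ** 2 for x in values]
    # Assumption: divide by n - 1 for a sample, by n for a whole population.
    divisor = n - 1 if sample else n
    return math.sqrt(sum(squared_differences) / divisor)

# Hypothetical observations, not taken from the flashcards.
data = [2, 4, 4, 4, 5, 5, 7, 9]
population_sd = standard_deviation(data, sample=False)  # sqrt(32 / 8)
sample_sd = standard_deviation(data)                    # sqrt(32 / 7)

print(population_sd, sample_sd)
```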
Calculated when the data are more or less normally distributed, i.e., the data fall into a typical bell-shaped curve
Recommended measure of spread
Standard deviation
Variability we might expect in the arithmetic means of repeated samples taken from the same population
Assumes that the data you have is actually a sample from a larger population
Calculates confidence intervals around arithmetic mean
Standard error of mean
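A common way to estimate the standard error of the mean is to divide the sample standard deviation by the square root of the sample size. The flashcards do not state this formula, so treat the sketch below, and its hypothetical data, as an assumption.

```python
import math
import statistics

def standard_error(values):
    """Sample standard deviation divided by the square root of n."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical sample, not taken from the flashcards.
data = [2, 4, 4, 4, 5, 5, 7, 9]
sem = standard_error(data)

print(sem)
```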
Indicates a measurement’s precision
Based on the mean itself and some multiple of the standard error (the variability of means that might be calculated from repeated samples from the same population)
Confidence interval
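As a sketch of "the mean plus or minus some multiple of the standard error": the example below assumes a multiplier of 1.96, the usual normal-distribution value for a 95% interval. For small samples a t-distribution multiplier is often used instead, so the multiplier is left as a parameter. The function name and data are hypothetical.

```python
import math
import statistics

def confidence_interval(values, multiplier=1.96):
    """Return (lower, upper) as mean -/+ multiplier * standard error."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    margin = multiplier * sem
    return mean - margin, mean + margin

# Hypothetical sample, not taken from the flashcards.
data = [2, 4, 4, 4, 5, 5, 7, 9]
lower, upper = confidence_interval(data)

print(lower, upper)
```

A narrower interval indicates a more precise estimate of the mean.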
Regardless of how data are distributed, means (particularly from large samples) tend to be normally distributed
Central Limit Theorem
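The idea can be explored with a small simulation. This is only a sketch under stated assumptions: the population is an exponential distribution with mean 1 (strongly right-skewed), and the sample size of 50, the 2,000 repetitions, and the seed are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

def sample_mean(size):
    """Mean of one sample drawn from a right-skewed exponential population."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(size)])

# Draw many samples and keep only their means.
means = [sample_mean(50) for _ in range(2000)]

center = statistics.fmean(means)   # should sit near the population mean of 1
spread = statistics.stdev(means)   # should sit near 1 / sqrt(50)

print(center, spread)
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the individual observations are heavily skewed.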
Range of values consistent with data from a study
A guide to the variability in the study
Confidence intervals
Distribution where the mean, median and mode would have the same values
Bell shaped curve
Normal distribution
Normal type of distribution
MCL?
MOS?
Arithmetic mean
Standard deviation
Asymmetrical or skewed type of distribution
MCL?
MOS?
Median
Range or interquartile range
Exponential or logarithmic type of distribution
MCL?
MOS?
Geometric mean
Geometric standard deviation