Lecture 2 (DESCRIPTIVE STATISTICS II) Flashcards
MEASURES OF CENTRAL TENDENCY
Yield information about “particular places or locations in a group of numbers”.
MODE
The most frequently occurring value in a data set.
Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio).
Can be used to determine what categories occur most frequently.
BIMODAL: In a tie for the most frequently occurring value, two modes are listed.
MULTIMODAL: Data sets that contain more than two modes.
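As a rough illustration of the mode cards above, the sketch below uses Python's standard-library `statistics.multimode`, which returns every value tied for the highest frequency. The sample lists are made up for illustration and are not from the lecture.

```python
from statistics import multimode

# Illustrative data only (assumed, not from the lecture).
scores = [2, 3, 3, 5, 7]            # one mode
ties = [1, 1, 2, 4, 4]              # two values tie: bimodal
colours = ["red", "blue", "red"]    # nominal data also has a mode

unimodal = multimode(scores)
bimodal = multimode(ties)
nominal = multimode(colours)

print(unimodal)   # [3]
print(bimodal)    # [1, 4]
print(nominal)    # ['red']
```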
MEDIAN
Middle value in an ordered array of numbers.
For an array with an odd number of terms, the median is the middle number.
For an array with an even number of terms, the median is the average of the middle two numbers.
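The odd and even cases can be sketched in a few lines of Python. The function name `median` and the sample values are assumptions chosen for illustration.

```python
def median(values):
    """Middle value of the ordered array; average of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                  # odd number of terms
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2    # even number of terms

print(median([3, 1, 7]))      # 3
print(median([4, 1, 7, 3]))   # 3.5
```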
ARITHMETIC MEAN
Mean is the average of a group of numbers.
Applicable for interval and ratio data.
Not applicable for nominal or ordinal data.
Affected by each value in the data set, including extreme values.
Computed by summing all values in the data set and dividing the sum by the number of values in the data set.
Population mean
μ
Sample mean
x̄ (x-bar)
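A minimal sketch of the computation, which also shows how a single extreme value pulls the mean. The helper name `mean` and the numbers are illustrative assumptions.

```python
def mean(values):
    """Sum all values and divide by the number of values."""
    return sum(values) / len(values)

typical = [10, 12, 14]            # illustrative values
with_extreme = typical + [100]    # same data plus one extreme value

print(mean(typical))        # 12.0
print(mean(with_extreme))   # 34.0
```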
PERCENTILES
Measures of central tendency that divide a group of data into 100 parts.
At least n% of the data lie below the nth percentile, and at most (100-n)% of the data lie above the nth percentile.
How to calculate percentiles
Organise the data into an ascending ordered array.
Calculate the percentile location: i = (P/100) × n, where P is the percentile of interest and n is the number of values.
Determine the percentile’s location and its value.
If i is a whole number, the percentile is the average of the values at positions i and i + 1.
If i is not a whole number, the percentile is the value at the position given by the whole-number part of i, plus 1.
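The steps above can be sketched as follows. This follows the location rule on these cards; statistical packages often use other interpolation conventions and may give slightly different answers. The function name `percentile` and the data are assumptions for illustration.

```python
import math

def percentile(values, p):
    """Value at the p-th percentile using the location rule i = (P/100) * n."""
    if not 0 < p < 100:
        raise ValueError("this sketch only handles 0 < p < 100")
    ordered = sorted(values)          # step 1: ascending ordered array
    n = len(ordered)
    i = p * n / 100                   # step 2: percentile location
    if i == math.floor(i):            # whole number: average positions i and i + 1
        i = int(i)
        return (ordered[i - 1] + ordered[i]) / 2
    # Not a whole number: take 1-based position floor(i) + 1,
    # which is index floor(i) in Python's 0-based indexing.
    return ordered[math.floor(i)]

data = [14, 12, 19, 23, 5, 13, 28, 17]   # illustrative values
print(percentile(data, 30))   # i = 2.4, so position 3 -> 13
print(percentile(data, 25))   # i = 2.0, so average of positions 2 and 3 -> 12.5
```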
QUARTILES
Measures of central tendency that divide a group of data into four subgroups.
Q1: 25% of the data fall below the first quartile.
Q2: 50% of the data set is below the second quartile.
Q3: 75% of the data set is below the third quartile.
MEASURES OF VARIABILITY
Tools that describe the spread or the dispersion of a set of data.
RANGE
The difference between the largest and the smallest values in a set of data.
ADVANTAGE: Easy to compute
DISADVANTAGE: Affected by extreme values
INTERQUARTILE RANGE
Range of values between the first and third quartiles.
Range of the middle half; middle 50%
Useful when researchers are interested in the middle 50% and not the extremes.
Used in the construction of box-and-whisker plots.
IQR = Q3 - Q1
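As a sketch, the quartiles can be taken as the 25th, 50th and 75th percentiles using the location rule i = (P/100) × n from the percentile cards, and the interquartile range follows by subtraction. The helper `percentile` and the data are illustrative assumptions; other software may place the quartiles slightly differently.

```python
def percentile(values, p):
    """Location-rule percentile on the sorted data (assumes 0 < p < 100)."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i.is_integer():                       # average positions i and i + 1
        return (ordered[int(i) - 1] + ordered[int(i)]) / 2
    return ordered[int(i)]                   # 1-based position floor(i) + 1

data = [14, 12, 19, 23, 5, 13, 28, 17]       # illustrative values
q1 = percentile(data, 25)
q2 = percentile(data, 50)
q3 = percentile(data, 75)
iqr = q3 - q1

print(q1, q2, q3)   # 12.5 15.5 21.0
print(iqr)          # 8.5
```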
MEAN ABSOLUTE DEVIATION, VARIANCE, AND STANDARD DEVIATION
These measures are not meaningful unless the data are at least interval-level data.
One way for researchers to look at the spread of the data is to subtract the mean from each data value.
Subtracting the mean from each data value gives the deviation from the mean (X - μ)
An examination of the deviations from the mean can reveal information about the variability of data.
The sum of the deviations from the arithmetic mean is always zero.
ABSOLUTE DEVIATION
An obvious way to force the sum of deviations to have a nonzero total is to take the absolute value of each deviation around the mean.
Allows one to solve for the Mean Absolute Deviation.
MEAN ABSOLUTE DEVIATION
Average of the absolute deviations from the mean.
MAD = (Σ|X - μ|) / N
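A short sketch showing why absolute values are needed: the raw deviations cancel to zero, while the absolute deviations do not. The function name and data are assumptions for illustration.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)

data = [5, 9, 16, 17, 18]                  # illustrative values, mean = 13
mu = sum(data) / len(data)
deviations = [x - mu for x in data]        # -8, -4, 3, 4, 5

print(sum(deviations))                     # 0.0
print(mean_absolute_deviation(data))       # 4.8
```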
POPULATION VARIANCE
Average of the squared deviations from the arithmetic mean: σ^2 = (Σ(X - μ)^2) / N
SUM OF SQUARED DEVIATIONS
Sum of the squared deviations about the mean of a set of values: Σ(X - μ)^2
SAMPLE VARIANCE
Sum of the squared deviations from the sample mean, divided by n - 1 rather than n.
S^2 = (Σ(X - x̄)^2) / (n - 1)
SAMPLE STANDARD DEVIATION
The square root of the sample variance: S = √(S^2)
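The variance cards differ only in the divisor, which a small sketch makes concrete. Function names and data are illustrative assumptions; the standard library's `statistics.pvariance`, `statistics.variance` and `statistics.stdev` compute the same quantities.

```python
import math

def sum_of_squared_deviations(values):
    """Sum of squared deviations about the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

def population_variance(values):
    return sum_of_squared_deviations(values) / len(values)        # divide by N

def sample_variance(values):
    return sum_of_squared_deviations(values) / (len(values) - 1)  # divide by n - 1

def sample_std_dev(values):
    return math.sqrt(sample_variance(values))

data = [5, 9, 16, 17, 18]                  # illustrative values, mean = 13
print(sum_of_squared_deviations(data))     # 130.0
print(population_variance(data))           # 26.0
print(sample_variance(data))               # 32.5
print(round(sample_std_dev(data), 2))      # 5.7
```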
EMPIRICAL RULE
A guideline that states the approximate % of values that fall within a given number of standard deviations of a mean of a set of data that are normally distributed.
Distance from the mean: μ ± 1σ
Percentage of values falling within distance: 68
Distance from the mean: μ ± 2σ
Percentage of values falling within distance: 95
Distance from the mean: μ ± 3σ
Percentage of values falling within distance: 99.7
Applies when data are approximately normally distributed.
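As a sketch, the rule turns a mean and standard deviation into three intervals. The function name and the values μ = 100, σ = 15 are hypothetical, and the percentages are approximations that hold only for roughly normal data.

```python
def empirical_rule_intervals(mu, sigma):
    """Return (approximate %, low, high) for 1, 2 and 3 standard deviations."""
    percentages = {1: 68, 2: 95, 3: 99.7}
    return [(pct, mu - k * sigma, mu + k * sigma)
            for k, pct in percentages.items()]

# Hypothetical normally distributed scores with mean 100 and std dev 15.
intervals = empirical_rule_intervals(100, 15)
for pct, low, high in intervals:
    print(f"about {pct}% of values lie between {low} and {high}")
```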
CHEBYSHEV’S THEOREM
Applies to all distributions; it can be used whenever the shape of the data distribution is unknown or non-normal.
At least 1 - 1/k^2 of the values fall within ±k standard deviations of the mean, regardless of the shape of the distribution.
k is the number of standard deviations, with k > 1.
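A minimal sketch of the bound for a few values of k. The function name `chebyshev_bound` is an assumption; note the result is a guaranteed minimum proportion, not an estimate of the actual proportion.

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

print(chebyshev_bound(2))             # 0.75
print(round(chebyshev_bound(3), 3))   # 0.889
print(chebyshev_bound(2.5))           # 0.84
```

Compare k = 2 with the empirical rule: for normal data about 95% of values lie within two standard deviations, while Chebyshev guarantees only at least 75% for any distribution.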
Z-SCORES
Represents the number of standard deviations a value (x) is above or below the mean of a set of numbers when the data are normally distributed.
Allows the translation of a value’s raw distance from the mean into units of standard deviations.
z = (x - μ) / σ
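A short sketch of the conversion, using a hypothetical mean of 50 and standard deviation of 10. The function name `z_score` is an assumption.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

print(z_score(70, 50, 10))   # 2.0  (two standard deviations above the mean)
print(z_score(35, 50, 10))   # -1.5 (one and a half below the mean)
```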
COEFFICIENT OF VARIATION
Ratio of the standard deviation to the mean, expressed as a percentage.
Measurement of relative dispersion.
CV = (σ/μ) × 100
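The coefficient of variation is useful for comparing spread across data sets with different means. In the sketch below, the two sets of summary figures are hypothetical and the function name is an assumption.

```python
def coefficient_of_variation(sigma, mu):
    """Standard deviation as a percentage of the mean."""
    return 100 * sigma / mu

# Hypothetical summary statistics for two data sets.
cv_a = coefficient_of_variation(sigma=4.6, mu=29)    # smaller std dev, smaller mean
cv_b = coefficient_of_variation(sigma=10, mu=84)     # larger std dev, larger mean

print(round(cv_a, 2))   # 15.86
print(round(cv_b, 2))   # 11.9
```

Although the second data set has the larger standard deviation, its relative dispersion is smaller once the size of the mean is taken into account.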
SYMMETRICAL
The right half is a mirror image of the left half
SKEWNESS
Shows that the distribution lacks symmetry; used to denote that the data are sparse at one end and piled up at the other end.
COEFFICIENT OF SKEWNESS
Compares the mean and median in light of the magnitude of the standard deviation; Md is the median and σ is the standard deviation.
Sk = 3(μ - Md) / σ
If Sk < 0, the distribution is negatively skewed (left).
If Sk = 0, the distribution is symmetric (not skewed).
If Sk > 0, the distribution is positively skewed (right).
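The formula and the three sign cases can be sketched as below. The function names and the summary figures (mean 23, median 26, standard deviation 12.3) are hypothetical.

```python
def coefficient_of_skewness(mu, median, sigma):
    """Three times (mean - median), scaled by the standard deviation."""
    return 3 * (mu - median) / sigma

def describe_skew(sk):
    if sk < 0:
        return "negatively skewed (left)"
    if sk > 0:
        return "positively skewed (right)"
    return "symmetric"

sk = coefficient_of_skewness(mu=23, median=26, sigma=12.3)   # hypothetical values
print(round(sk, 2))        # -0.73
print(describe_skew(sk))   # negatively skewed (left)
```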
Describe the distribution of the mean, median and mode when data are negatively skewed.
Mean is lowest value, median is middle value, mode is highest value.
Describe the distribution of the mean, median and mode when data are symmetric.
Mean, mode and median all have the same value.
Describe the distribution of the mean, median and mode when data are positively skewed.
Mode is lowest, median is middle, mean is highest.
KURTOSIS
Describes the peakedness of a distribution.
LEPTOKURTIC: high and thin
MESOKURTIC: normal in shape
PLATYKURTIC: flat and spread out
BOX AND WHISKER PLOT
Five specific values are used:
Median, Q2
First quartile, Q1
Third quartile, Q3
Minimum value in the data set
Maximum value in the data set
INNER FENCES:
IQR = Q3 - Q1
Lower inner fence = Q1 - 1.5 IQR
Upper inner fence = Q3 + 1.5 IQR
OUTER FENCES:
Lower outer fence = Q1 - 3.0 IQR
Upper outer fence = Q3 + 3.0 IQR
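A minimal sketch of the fence calculations: the lower fences subtract a multiple of the IQR from Q1, and the upper fences add the same multiple to Q3. The function name `fences` and the quartile values are assumptions for illustration.

```python
def fences(q1, q3):
    """Inner (1.5 x IQR) and outer (3.0 x IQR) fences for a box-and-whisker plot."""
    iqr = q3 - q1
    return {
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "lower_outer": q1 - 3.0 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }

result = fences(q1=12.5, q3=21.0)   # hypothetical quartiles, IQR = 8.5
for name, value in result.items():
    print(name, value)
```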