Central Tendency, Variability, & Z-scores Flashcards
What data classes are best for mean?
interval, ratio
What data classes are best for mode?
nominal, ordinal, interval, ratio
What data classes are best for median?
ordinal, interval, ratio
Under what conditions might a median be a better measure of central tendency than the mean?
- when the data is ordinal (mean does not apply)
- interval/ratio data if there are extreme values
It should seem clear how the mean and the median are measures of the central tendency of the data, since the mean is a familiar average and the median is the middle. However, why is the mode also considered a measure of central tendency?
Most data sets peak in the middle (bell shape). The mode is the value with the highest frequency, so it’s usually somewhere in the middle.
For any data set, what is Σ(X − X̄)?
Σ(X − X̄) may be written as:
ΣX − ΣX̄ = ΣX − nX̄ = ΣX − n(ΣX)/n = ΣX − ΣX = 0.
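A quick numerical check of this result, as a minimal sketch. The data values are only illustrative; any list of numbers should give a sum of deviations of zero, up to floating-point rounding.

```python
import math
import statistics

# Illustrative data; any list of numbers should behave the same way.
data = [6, 12, 9, 7, 8, 4, 3, 12, 15]

mean = statistics.mean(data)
deviations = [x - mean for x in data]

# Positive and negative deviations cancel, so the total is zero
# (up to floating-point rounding).
total = sum(deviations)
print(math.isclose(total, 0.0, abs_tol=1e-9))
```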
The following data represent a sample of the time to complete a certain task in minutes and seconds (mm:ss).
6:30, 11:15, 6:22, 11:32, 8:12, 5:02, 9:17, 6:51, 8:44, 7:45, 9:37, 7:28, 4:29, 7:42
compute the mean:
compute the std. dev.:
Since the values are given in minutes and seconds, they first need to be converted either to decimal minutes (e.g. 6:30 = 6 + 30/60 = 6 + 0.5000 = 6.5000 min) or to seconds (e.g. 6:30 = 6*60 + 30 = 360 + 30 = 390 s) so that they can be easily added.
mean: 7:55
std. dev.: 2:04
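The conversion to seconds can be sketched in Python as follows. The helper names to_seconds and to_mmss are made up for this illustration, and the standard deviation is the sample version (n - 1).

```python
import statistics

times = ["6:30", "11:15", "6:22", "11:32", "8:12", "5:02", "9:17",
         "6:51", "8:44", "7:45", "9:37", "7:28", "4:29", "7:42"]

def to_seconds(mmss):
    """Convert an 'mm:ss' string to a whole number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def to_mmss(total_seconds):
    """Round to the nearest second and format as m:ss."""
    minutes, seconds = divmod(round(total_seconds), 60)
    return f"{minutes}:{seconds:02d}"

seconds = [to_seconds(t) for t in times]
mean_time = to_mmss(statistics.mean(seconds))
sd_time = to_mmss(statistics.stdev(seconds))  # sample std. dev. (n - 1)
print(mean_time, sd_time)
```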
For a certain set of data, the mean and standard deviation are computed.
How does X̄ (data treated as sample) compare to μ (data treated as a population)?
How does s (data treated as sample) compare to σ (data treated as a population)?
X̄ is the sample mean, μ is the population mean; they are calculated the same way.
s (sample) divides the sum of squared deviations by n − 1; σ (population) divides by N, so s is slightly larger than σ for the same data.
Given the following sample data set:
6, 12, 9, 7, 8, 4, 3, 12, 15
Compute the mean.
What is the median?
What is the mode?
Compute the variance.
Compute the standard deviation.
mean: 8.44
median: 8
mode: 12
variance: 15.78
standard deviation: 3.97
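These answers can be checked with Python's built-in statistics module. This is a minimal sketch that treats the data as a sample, so the variance divides by n - 1.

```python
import statistics

data = [6, 12, 9, 7, 8, 4, 3, 12, 15]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.variance(data)   # sample variance, divides by n - 1
std_dev = statistics.stdev(data)       # square root of the sample variance

print(round(mean, 2), median, mode, round(variance, 2), round(std_dev, 2))
```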
For the following sample data set:
X frequency
52 5
54 8
57 2
Compute the mean.
Compute the variance.
Compute the standard deviation.
mean: 53.73
variance: 2.638
standard deviation: 1.62
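With a frequency table, each value is weighted by its frequency. A minimal sketch, again treating the data as a sample (n - 1):

```python
import math

# (value, frequency) pairs from the table above
table = [(52, 5), (54, 8), (57, 2)]

n = sum(f for _, f in table)
mean = sum(x * f for x, f in table) / n

# Sample variance: frequency-weighted sum of squared deviations over n - 1
ss = sum(f * (x - mean) ** 2 for x, f in table)
variance = ss / (n - 1)
std_dev = math.sqrt(variance)

print(round(mean, 2), round(variance, 3), round(std_dev, 2))
```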
The following sample data of the number of communications are taken from logs of communication with Distance Education students:
5, 9, 5, 23, 27, 55, 34, 7, 30, 15, 22, 60, 14, 52, 297, 8, 51, 15, 51, 35, 15, 39, 137, 43, 38, 14, 93, 7
Compute the mean.
Compute the standard deviation.
Draw a boxplot with the minimum, Q1, Q2, Q3, and maximum.
Which is a better representation of the central tendency: mean or median? Explain.
mean: 42.89
std. dev.: 57.68
Minimum: 5
Q1: 14
Q2: 28.5
Q3: 51
Maximum: 297
The median is; the extreme values (137 and 297) pull the mean upward.
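The five boxplot values can be sketched as below. The function five_number_summary is a made-up helper, and it assumes the median-of-halves convention for quartiles; other conventions can give slightly different Q1 and Q3.

```python
import statistics

data = [5, 9, 5, 23, 27, 55, 34, 7, 30, 15, 22, 60, 14, 52, 297, 8, 51, 15,
        51, 35, 15, 39, 137, 43, 38, 14, 93, 7]

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves convention."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # values below the median position
    upper = ordered[len(ordered) - half:]   # values above the median position
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

summary = five_number_summary(data)
print(summary)
```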
If the two largest values in the sample data set of the previous problem were omitted,
Compute the mean.
Compute the standard deviation.
Draw a boxplot with the minimum, Q1, Q2, Q3, and maximum.
Which is a better representation of the central tendency: mean or median? Explain.
mean: 29.50
std. dev.: 21.68
minimum: 5
Q1: 14
Q2: 25
Q3: 43
Maximum: 93
Mean may now be a better measure because extreme outliers have been removed.
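To see how much each measure reacts to the two extreme values, a small sketch can compare both versions of the data. The variable names are only illustrative.

```python
import statistics

data = [5, 9, 5, 23, 27, 55, 34, 7, 30, 15, 22, 60, 14, 52, 297, 8, 51, 15,
        51, 35, 15, 39, 137, 43, 38, 14, 93, 7]
trimmed = sorted(data)[:-2]   # drop the two largest values (137 and 297)

for label, values in (("all values", data), ("two largest removed", trimmed)):
    print(label, round(statistics.mean(values), 2), statistics.median(values))

# How far each measure moves when the extreme values are dropped
mean_shift = statistics.mean(data) - statistics.mean(trimmed)
median_shift = statistics.median(data) - statistics.median(trimmed)
print(round(mean_shift, 2), median_shift)
```

The mean moves much more than the median, which is why the median is less sensitive to extreme values.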
Consider the following data set:
21, 34, 18, 26, 30, 35, 24, 29, 25
If this is a population, compute the mean.
If this is a sample, compute the mean.
If this is a population, compute the standard deviation.
If this is a sample, compute the standard deviation.
μ=26.9
X̄= 26.9
σ= 5.34
s= 5.67
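The population and sample versions can be compared side by side. A minimal sketch with the statistics module, where pstdev divides by N and stdev divides by n - 1:

```python
import statistics

data = [21, 34, 18, 26, 30, 35, 24, 29, 25]

mean = statistics.mean(data)      # same formula for mu and X-bar
sigma = statistics.pstdev(data)   # population: divide by N
s = statistics.stdev(data)        # sample: divide by n - 1

print(round(mean, 1), round(sigma, 2), round(s, 2))
```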
If we had a set of ordinal values (not interval/ratio), could you create a boxplot?
Technically yes, because quartiles depend only on the position in the ordered data set. Thus, one could determine the positions in the ordered set for Q1, Q2 (median), and Q3 and the first and last position for the minimum and maximum. However, without interval/ratio data, visualizing this with a boxplot would not make sense.
For example, imagine you ask 9 people what size drink they ordered, small, medium, or large. The ordered data might be: small, small, small, medium, medium, large, large, large, large. Q1 is position 2.5 (small), Q2 is position 5 (medium), and Q3 is position 7.5 (large), minimum is position 1 (small) and maximum is position 9 (large).
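The position lookup in this example can be sketched as below. The helper value_at_position is hypothetical, and the position rule p x (n + 1) is an assumption chosen because it reproduces positions 2.5, 5, and 7.5 for n = 9.

```python
import math

# Ordered responses from the drink-size example
sizes = ["small", "small", "small", "medium", "medium",
         "large", "large", "large", "large"]

def value_at_position(ordered, position):
    """Return the category at a 1-based position.

    A fractional position such as 2.5 falls between two observations;
    if they differ, both categories are returned because ordinal
    values cannot be averaged.
    """
    low = ordered[math.floor(position) - 1]
    high = ordered[math.ceil(position) - 1]
    return low if low == high else (low, high)

n = len(sizes)
# Assumed convention: quartile position = p * (n + 1)
quartiles = {name: value_at_position(sizes, p * (n + 1))
             for name, p in (("Q1", 0.25), ("Q2", 0.50), ("Q3", 0.75))}
print(quartiles)
```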
Typically we consider quantitative data that is symmetric about the mean. If we have a data set that has a few extreme high values, then
a. How is it skewed?
b. Would you use a mean or median? Why?
It is positively skewed (right-skewed)
You would use median since it is less sensitive to extreme values.
- For the MCAT, µ = 500 and σ = 10. What is the probability of an individual getting a score greater than 502.5?
z = (502.5 − 500)/10 = 0.25
p = 1 − 0.5987 = 0.4013
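Instead of a z-table, the tail area can be computed with NormalDist from Python's statistics module. A minimal sketch, assuming scores are exactly normal:

```python
from statistics import NormalDist

mcat = NormalDist(mu=500, sigma=10)

z = (502.5 - mcat.mean) / mcat.stdev
p_above = 1 - mcat.cdf(502.5)   # area in the upper tail

print(z, round(p_above, 4))
```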
- For the MCAT, µ = 500 and σ = 10.
What is the minimum score you would have to obtain to be in the top 5%?
What is the minimum score you would have to obtain to be in the top 2.5%?
95th percentile
500 + (1.645 x 10) = 516.45
97.5th percentile
500 + (1.96 x 10) = 519.6
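The cutoff scores can also be found by inverting the normal CDF. A minimal sketch using NormalDist.inv_cdf, again assuming scores are exactly normal:

```python
from statistics import NormalDist

mcat = NormalDist(mu=500, sigma=10)

top_5 = mcat.inv_cdf(0.95)      # 95th percentile
top_2_5 = mcat.inv_cdf(0.975)   # 97.5th percentile

print(round(top_5, 1), round(top_2_5, 1))
```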
Correlational method
looking for relationships between variables (correlation or regression)
Experimental method
manipulating one variable to determine if this causes changes in another variable
independent variable
what we control/manipulate
dependent variable
what we measure (is influenced)
confounding/extraneous variables
other things impacting the dependent variable that aren’t the independent variable
random assignment
equal chance to end up in group (bigger the better)
helps decrease extraneous variables
experimental vs control groups
experimental: at least 2 different groups
control: group with no treatment (placebo)
Placebo
any treatment that has no active properties
hypothetical constructs
an explanatory variable which is not directly observable
we must find ways to operationalize these
operational definition
how do we assign a number?
population
all the people we want to apply results to (we control/decide)
sample
we find a subset of the pop. that is representative of the whole pop. (random)
random sample
random group from population
descriptive statistics
summarizes data
inferential statistics
trying to infer back to population (generalize)
parameter vs. statistics
statistics for sample; parameter for population
sampling error
the difference between stat. from sample and its parameter
discrete vs. continuous variable
discrete: categories w/ nothing in between
continuous: infinite values between any two values
quantitative vs. categorical data
quantitative: directly measuring something (continuous data)
categorical data: counts of things (discrete variables)
scales of measurements
nominal: no inherent order of different categories (weakest)
ordinal: one group is above other, not evenly spaced (can use median)
interval: there’s equal interval, but no true zero
ratio: there’s equal interval, and there is a true zero
frequency distributions
- real lower limit
- real upper limit
- midpoint
visualizing data
- histogram: frequency distribution turned into a graph. We can see the shape of the distribution and the spread of the data.
- line graph: good for looking at change/time
- scatterplot: tells us about relationships between variables. shows pos. & neg. relationships. strength of relationship based on how linear.
- boxplot: box represents 50% of data.
shapes of distributions
symmetrical
- unimodal: bell-shaped (normal dist.)
- bimodal: clear 2 peaks (one can be higher)
- rectangular: data of equal freq. for all values
asymmetrical
- pos. skewed: skewed to right (not norm. but unimodal)
- neg. skewed: skewed to left (not norm. but unimodal)
central tendency
mean: avg. of all numbers
median: middle number in list
mode: most freq. number
variability
- range
- interquartile range
- variance
- standard dev.
range: x(max)- x(min)
interquartile range:
Q1 position = .25 x (# in data set)
Q2 position = .50 x (# in data set)
Q3 position = .75 x (# in data set)
IQR= Q3-Q1
variance: avg. squared dev. of each number from mean
std. dev.: sqrt var. (takes away squared unit)
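All four measures of variability can be computed on one small data set. This is a sketch with illustrative data; statistics.quantiles uses its default "exclusive" method here, and other quartile conventions can give slightly different values.

```python
import statistics

data = [6, 12, 9, 7, 8, 4, 3, 12, 15]

data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method
iqr = q3 - q1
variance = statistics.variance(data)  # sample variance (n - 1)
std_dev = statistics.stdev(data)      # back in the original units

print(data_range, iqr, round(variance, 2), round(std_dev, 2))
```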
z-scores
raw score to z-score: z=(x-μ)/σ
z-score to raw score: x=μ+zσ
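The two conversions are inverses of each other: z = (x - μ)/σ and x = μ + zσ. A minimal sketch with hypothetical helper names to_z and to_raw, using the MCAT values (mean 500, standard deviation 10) as the example:

```python
def to_z(x, mu, sigma):
    """Raw score to z-score: z = (x - mu) / sigma."""
    return (x - mu) / sigma

def to_raw(z, mu, sigma):
    """z-score back to raw score: x = mu + z * sigma."""
    return mu + z * sigma

z = to_z(502.5, mu=500, sigma=10)
x = to_raw(1.96, mu=500, sigma=10)
print(z, round(x, 1))
```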
standardize a dist.
shape of standard distribution: the shape of the distribution of z-scores will be the same as the shape of the original dist. of raw scores
mean: a z-score dist. always has a mean of zero, so any score above the mean is + and any score below is -
standard deviation: the z-score dist. will always have a standard dev. of 1. The numerical val. of a z-score is exactly the number of standard deviations the score is from the mean
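Standardizing a small data set shows these properties directly. A minimal sketch with illustrative data, treated as a population so that the z-scores have a standard deviation of exactly 1:

```python
import statistics

data = [21, 34, 18, 26, 30, 35, 24, 29, 25]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)   # treating the data as a population

z_scores = [(x - mu) / sigma for x in data]

z_mean = statistics.mean(z_scores)
z_sd = statistics.pstdev(z_scores)
print(abs(z_mean) < 1e-9, abs(z_sd - 1) < 1e-9)
```

Scores above the mean get positive z-scores and scores below it get negative ones, and the order of the scores does not change.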
normal distribution
empirical rule: the following approx. holds
- 68% of obs. fall between μ-σ & μ+σ
- 95% of obs. fall between μ-2σ & μ+2σ
- 99.7% of obs. fall between μ-3σ & μ+3σ
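The percentages in the empirical rule are rounded areas under the normal curve. A minimal sketch that computes them from the standard normal CDF:

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3
coverage = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} sd: {share:.1%}")
```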
un-biased stat
a statistic whose long-run average is equal to the parameter it estimates
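One way to see this is to average a statistic over every possible sample. The sketch below uses a made-up population of six values and all ordered samples of size 3 drawn with replacement. The n - 1 variance averages out to the population variance, while the divide-by-n version comes out too small.

```python
import itertools
import statistics

population = [1, 2, 3, 4, 5, 6]
true_variance = statistics.pvariance(population)

# Every possible ordered sample of size 3, drawn with replacement
samples = list(itertools.product(population, repeat=3))

# Average each estimator over all possible samples
avg_n_minus_1 = statistics.mean(statistics.variance(s) for s in samples)
avg_n = statistics.mean(statistics.pvariance(s) for s in samples)

print(round(true_variance, 4), round(avg_n_minus_1, 4), round(avg_n, 4))
```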