Ch4: Central Tendency and Variability Flashcards
Central tendency
4.1: Central Tendency
- …the descriptive statistic that best represents the center of a data set, the particular value that all the other data seem to be gathering around; the “typical” score
- Usually, at (or near) the highest point in the histogram or the polygon
- Expressed in three different ways: mean, median, mode
The mean expressed by symbolic notation
4.1: Central Tendency
We need only a handful of symbols to express the ideas needed to understand statistics
Several symbols are used to express the mean:
* M: the mean of a sample; appears on the left side of the formula
* X: a single score
* μ: the mean of a population
* Σ: the sum of the single scores (X)
* N: the total number of scores
The numbers based on samples taken from a population are called…
4.1: Central Tendency
- Statistics
- E.g., M is a statistic
The numbers based on whole populations are called
4.1: Central Tendency
- Parameters
- E.g., μ is a parameter
Steps to calculating the mean:
4.1: Central Tendency
- Step 1: add up all the scores in the sample. In statistical notation, this is ΣX
- Step 2: divide the total of all the scores by the number of scores
– The total number of scores in a sample is typically represented by N
– The full equation would be: M = ΣX / N
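The two steps above can be sketched in a few lines of Python. The function name `mean` and the sample scores are illustrative assumptions, not part of the original cards.

```python
def mean(scores):
    """Return M = ΣX / N for a list of scores."""
    total = sum(scores)   # Step 1: add up all the scores (ΣX)
    n = len(scores)       # N: the total number of scores
    return total / n      # Step 2: divide the total by N


# Hypothetical sample of five scores
scores = [2, 4, 6, 8, 10]
m = mean(scores)
print(m)  # 6.0
```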
Median (Mdn)
4.1: Central Tendency
the middle score of all the scores in a sample when the scores are arranged in ascending order; AKA 50th percentile
How to determine median
4.1: Central Tendency
- Step 1: line up all the scores in ascending order
- Step 2: find the middle score.
- With an odd number of scores, there will be an actual middle score.
- With an even number of scores, there will be no actual middle score
In this case, calculate the mean of the two middle scores
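As a minimal sketch of these steps (the function name `median` and the example scores are assumptions made for illustration):

```python
def median(scores):
    """Return the middle score, or the mean of the two middle scores."""
    ordered = sorted(scores)          # Step 1: line up scores in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: there is an actual middle score
        return ordered[mid]
    # even count: take the mean of the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))      # 5
print(median([7, 1, 5, 3]))   # 4.0
```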
Mode
4.1: Central Tendency
- The most common score of all the scores in a sample
- Has no symbol or abbreviation
- Mode can be used with scale data, but is more commonly used with nominal data
- EX: maps based on census data that showed how residents of England and Wales typically commute to work
When there is more than one mode, whether a single score or an interval, we report both, or all, of the most common scores
- 1 mode
- 2 modes
- > 2 modes
4.1: Central Tendency
- When a distribution of scores has one mode, we refer to it as: unimodal
- When a distribution has two modes, we call it: bimodal
- When a distribution has more than two modes, we call it: multimodal
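One way to find every mode and label the distribution is sketched below. It relies on `statistics.multimode` from the Python standard library (Python 3.8+); the helper name `describe_modes` and the data are hypothetical. Note that when every score occurs equally often, `multimode` returns all of them, so the label is only meaningful when some scores actually repeat.

```python
from statistics import multimode


def describe_modes(scores):
    """Return (modes, label) where label is unimodal, bimodal, or multimodal."""
    modes = multimode(scores)   # all most-common values, in first-seen order
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label


print(describe_modes([1, 2, 2, 3]))            # ([2], 'unimodal')
print(describe_modes([1, 1, 2, 2, 3]))         # ([1, 2], 'bimodal')
# The mode also works with nominal data, such as commuting method
print(describe_modes(["car", "bus", "car", "walk"]))  # (['car'], 'unimodal')
```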
How outliers affect measures of central tendency:
4.1: Central Tendency
Mean - greatly affected by outliers: extreme scores that are either very high or very low in comparison with the other scores
Different measures of central tendency can lead to different conclusions, but when a decision needs to be made, the choice is usually between the mean and the median
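A small illustration of why that choice matters, using made-up scores: replacing one score with an extreme value moves the mean a long way while the median stays put. `fmean` and `median` are standard-library functions; the data are hypothetical.

```python
from statistics import fmean, median

typical = [10, 12, 13, 15, 16]         # hypothetical scores, no outlier
with_outlier = [10, 12, 13, 15, 100]   # same scores, but the highest is extreme

print(fmean(typical), median(typical))            # 13.2 13
print(fmean(with_outlier), median(with_outlier))  # 30.0 13
```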
Mode is generally used in three situations:
4.1: Central Tendency
- When one particular score dominates a distribution
- When the distribution is bimodal or multimodal
- When the data are nominal
Variability
4.2: Measures of Variability
a numerical way of describing how much spread there is in a distribution
2 ways to describe/compute variability:
4.2: Measures of Variability
- Computing its range
- Computing the variance and its square root, known as the standard deviation
Range
4.2: Measures of Variability
a measure of variability calculated by subtracting the lowest score (the minimum) from the highest score (the maximum).
Range equation:
4.2: Measures of Variability
range = X(highest) - X(lowest)
Range is a useful first indicator of variability, but… (downsides to range)
4.2: Measures of Variability
…it is influenced only by the highest and lowest scores
* All other scores in between could be clustered near the highest score, huddled near the center, spread out evenly, or follow some unexpected pattern; WE CAN’T KNOW SOLELY BASED ON THE RANGE!
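To make this limitation concrete, the sketch below computes the range for two hypothetical samples that are distributed very differently yet share the same highest and lowest scores. The function name `score_range` is an assumption for illustration.

```python
def score_range(scores):
    """range = X(highest) - X(lowest)"""
    return max(scores) - min(scores)


clustered = [1, 9, 9, 9, 10]    # most scores huddle near the highest score
spread_out = [1, 3, 5, 8, 10]   # scores spread across the whole span

# Both samples have the same range even though their shapes differ
print(score_range(clustered), score_range(spread_out))  # 9 9
```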
Whenever there are outliers, the range will be an exaggerated measure of the variability - WHAT’S THE SOLUTION?
4.2: Measures of Variability
Interquartile range
What does interquartile range indicate?
4.2: Measures of Variability
- A measure of the distance between the first and third quartiles
- Just as the median marks the 50th percentile of a data set, the first quartile marks the 25th percentile, and the third quartile marks the 75th percentile
- Essentially, the first quartile (25th percentile) and third quartile (75th percentile) are the medians of the TWO HALVES of the data: the half below the median, and the half above
WHY use interquartile range?
4.2: Measures of Variability
Because it’s based on the middle 50% of the distribution (between the 25th and 75th percentiles), it’s unlikely to be influenced by outliers; scores below the 25th percentile or above the 75th percentile are not considered
Steps to determining interquartile range
4.2: Measures of Variability
- 1: calculate the median
- 2: look at all of the scores BELOW the median. Then, the median of these scores, the lower half, is the first quartile, often called Q1 for short
- 3: look at all of the scores ABOVE the median. Then, the median of these scores, the upper half of the scores, is the third quartile, often called Q3 for short
- 4: subtract Q1 from Q3
- The interquartile range, often abbreviated as IQR, is the difference between the first and third quartiles
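The four steps can be sketched as follows. This follows the "median of each half" method described above and assumes that, with an odd number of scores, the median itself is left out of both halves; library quantile functions often use other conventions and may give slightly different quartiles. The function name and data are hypothetical.

```python
from statistics import median


def interquartile_range(scores):
    """Return (Q1, Q3, IQR) using the median-of-halves method."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                    # scores below the median
    # Assumption: for an odd count, skip the middle score itself
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = median(lower)                        # Step 2: first quartile
    q3 = median(upper)                        # Step 3: third quartile
    return q1, q3, q3 - q1                    # Step 4: IQR = Q3 - Q1


print(interquartile_range([1, 3, 5, 7, 9, 11, 13]))    # (3, 11, 8)
print(interquartile_range([1, 2, 3, 4, 5, 6, 7, 8]))   # (2.5, 6.5, 4.0)
```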
Variance
4.2: Measures of Variability
- the average of the squared deviations from the mean
- When something varies, it must vary from (or be different from) some standard - standard as in the mean
- Thus, when we compute variance, that number describes how far a distribution varies around the mean
Variance - why can’t we just take the average of the deviations from the mean?
4.2: Measures of Variability
If we do, we get 0
- Remember, the mean is the point at which all scores are perfectly balanced; mathematically, the scores have to balance out - yet we know that there is variability among these scores
- To solve this problem, statisticians SQUARE ALL THE DEVIATIONS, which eliminates the negative signs
- Once we square the deviations, we can take their average and get a measure of variability
- Later, we will “unsquare” those deviations to calculate the SD
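A quick check with made-up scores shows the balancing act: the raw deviations cancel out to zero, while the squared deviations do not.

```python
scores = [2, 4, 6, 8, 10]            # hypothetical sample
m = sum(scores) / len(scores)        # mean = 6.0

deviations = [x - m for x in scores]       # [-4.0, -2.0, 0.0, 2.0, 4.0]
squared = [d ** 2 for d in deviations]     # [16.0, 4.0, 0.0, 4.0, 16.0]

print(sum(deviations))   # 0.0  (positive and negative deviations cancel)
print(sum(squared))      # 40.0 (squaring removes the negative signs)
```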
Steps to calculating variance
4.2: Measures of Variability
1: subtract the mean from every score (X-M)
* AKA deviations from the mean
2: square every deviation from the mean
* AKA squared deviations
3: sum all the squared deviations
* AKA sum of squared deviations, or sum of squares for short
4: divide the sum of squares (the sum of each score’s squared deviation from the mean) by the total number in the sample
* EX: sum of squared deviations = 48.80
* Total # of scores: 5
* 48.80/5 = 9.76
* Thus, variance = 9.76
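The four steps translate directly into code. The sketch below uses hypothetical scores (not the data behind the example above) and divides by N, as these cards do; the standard library's `statistics.pvariance` uses the same divisor, whereas `statistics.variance` divides by N - 1.

```python
from statistics import pvariance


def variance(scores):
    """Average of the squared deviations from the mean (divides by N)."""
    n = len(scores)
    m = sum(scores) / n
    deviations = [x - m for x in scores]        # Step 1: X - M
    squared = [d ** 2 for d in deviations]      # Step 2: square each deviation
    sum_of_squares = sum(squared)               # Step 3: sum of squares
    return sum_of_squares / n                   # Step 4: divide by N


scores = [3, 5, 5, 8, 9]        # hypothetical sample; mean is 6.0
print(variance(scores))         # 4.8
print(pvariance(scores))        # 4.8
```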
Symbols that represent the variance of a sample include:
4.2: Measures of Variability
- SD^2 (standard deviation squared)
- s^2 (standard deviation squared)
- MS (comes from “mean square”, referring to average of the squared deviation)
Most basic formula for SD
SD = square root of SD^2
Full formula for SD
SD = square root of [Σ(X - M)^2 / N]
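Both versions of the formula can be sketched in Python. The first line "unsquares" the variance of 9.76 from the worked example; the function applies the full formula to hypothetical scores and is compared against the standard library's `statistics.pstdev`, which also divides by N.

```python
import math
from statistics import pstdev


def standard_deviation(scores):
    """SD = square root of [Σ(X - M)^2 / N]"""
    n = len(scores)
    m = sum(scores) / n
    variance = sum((x - m) ** 2 for x in scores) / n
    return math.sqrt(variance)


# Basic formula: SD is the square root of the variance
print(round(math.sqrt(9.76), 2))   # 3.12

scores = [3, 5, 5, 8, 9]           # hypothetical sample with variance 4.8
sd = standard_deviation(scores)
print(round(sd, 2))                # 2.19
```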
How can we tell what’s based on a sample vs. population (equation)?
- Sample: the equation uses M (a statistic), not μ (a parameter)
First step to calculating the median
list all scores in ascending order
We can use central tendency as a clue to distribution shape: symmetrical, positive skew, negative skew
- In a symmetrical “bell shaped” curve: mean = median = mode
- Positive skew: mean > median > mode (mean gets pulled by upper tail)
- Negative skew: mean < median < mode (mean gets pulled by lower tail)
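As a rough sketch of this clue, the hypothetical helper below compares the mean with the median. It is only a heuristic under the pattern described above, not a formal test of skewness, and the data are invented.

```python
from statistics import fmean, median


def skew_clue(scores):
    """Guess the direction of skew by comparing the mean with the median."""
    m = fmean(scores)
    mdn = median(scores)
    if m > mdn:
        return "positive skew"     # mean pulled toward the upper tail
    if m < mdn:
        return "negative skew"     # mean pulled toward the lower tail
    return "symmetrical"


print(skew_clue([1, 2, 3, 4, 5]))       # symmetrical
print(skew_clue([1, 2, 2, 3, 12]))      # positive skew
print(skew_clue([1, 10, 11, 11, 12]))   # negative skew
```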
What measure (mean, median, mode) is the best to describe central tendency - 4 key points:
- Usually the mean
- With a small dataset, central tendency is harder to interpret
- If there are extreme outliers, consider calculating the mean with and without them
- If unsure, report all three - except note that mode is the only option if nominal data