Ch4: Central Tendency and Variability Flashcards
Central tendency
4.1: Central Tendency
- …the descriptive statistic that best represents the center of a data set, the particular value that all the other data seem to be gathering around; the “typical” score
- Usually, at (or near) the highest point in the histogram or the polygon
- Expressed in three different ways: mean, median, mode
The mean expressed by symbolic notation
Mean
4.1: Central Tendency
We need only a handful of symbols to express the ideas needed to understand statistics
Several symbols can represent the mean:
* M: on the left side of the formula
* X: a single score
* μ: the mean of a population
* Σ: sum of single scores (X)
* N: total number of scores
The numbers based on samples taken from a population are called…
Mean
4.1: Central Tendency
- Statistics
- E.g., M is a statistic
The numbers based on whole populations are called
Mean
4.1: Central Tendency
- Parameters
- E.g., μ is a parameter
Steps to calculating the mean:
4.1: Central Tendency
- Step 1: add up all the scores in the sample. In statistical notation, this is ΣX
- Step 2: divide the total of all the scores by the number of scores
– The total number of scores in a sample is typically represented by N
– The full equation would be: M = ΣX / N
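The two steps above can be sketched in Python. This is a minimal illustration, not a prescribed method; the function name `mean` and the sample scores are made up for the example.

```python
def mean(scores):
    """Return M = ΣX / N for a non-empty list of scores."""
    total = sum(scores)   # Step 1: add up all the scores (ΣX)
    n = len(scores)       # N: total number of scores
    return total / n      # Step 2: divide the total by N

scores = [2, 4, 6, 8, 10]   # hypothetical sample
m = mean(scores)            # (2 + 4 + 6 + 8 + 10) / 5
print(m)
```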
Median (Mdn)
4.1: Central Tendency
the middle score of all the scores in a sample when the scores are arranged in ascending order; AKA 50th percentile
How to determine median
4.1: Central Tendency
- Step 1: line up all the scores in ascending order
- Step 2: find the middle score.
- With an odd number of scores, there will be an actual middle score.
- With an even number of scores, there will be no actual middle score
In this case, calculate the mean of the two middle scores
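A short sketch of the median steps, assuming the scores arrive as a plain Python list. The function name `median` and the example scores are illustrative only.

```python
def median(scores):
    """Return the middle score, or the mean of the two middle scores."""
    ordered = sorted(scores)   # Step 1: line up scores in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of scores: there is an actual middle score
        return ordered[mid]
    # Even number of scores: take the mean of the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median([5, 1, 3])       # ordered: 1, 3, 5
even_example = median([4, 1, 3, 2])   # ordered: 1, 2, 3, 4
```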
Mode
4.1: Central Tendency
- The most common score of all the scores in a sample
- Has no symbol or abbreviation
- Mode can be used with scale data, but is more commonly used with nominal data
- EX: maps based on census data that showed how residents of England and Wales typically commute to work
WHAT DO WE CALL EACH CASE?
When there is more than one mode, whether a single score or an interval, we report both, or all, of the most common scores
- 1 mode
- 2 modes
- > 2 modes
4.1: Central Tendency
- When a distribution of scores has one mode, we refer to it as: unimodal
- When a distribution has two modes, we call it: bimodal
- When a distribution has more than two modes, we call it: multimodal
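One way to find every mode and name the distribution, sketched with Python's standard `collections.Counter`. The helper names `modes` and `describe` are hypothetical. Note one simplification: if every score occurs exactly once, this sketch treats each score as a mode.

```python
from collections import Counter

def modes(scores):
    """Return all of the most common scores, sorted."""
    counts = Counter(scores)
    highest = max(counts.values())   # frequency of the most common score
    return sorted(value for value, count in counts.items() if count == highest)

def describe(scores):
    """Label a distribution by how many modes it has."""
    labels = {1: "unimodal", 2: "bimodal"}
    return labels.get(len(modes(scores)), "multimodal")

one_mode = modes([1, 2, 2, 3])       # only 2 occurs twice
two_modes = modes([1, 1, 2, 2, 3])   # 1 and 2 both occur twice
```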
How outliers affect measures of central tendency:
4.1: Central Tendency
Mean - greatly affected by outliers: extreme scores that are either very high or very low in comparison with the other scores
Different measures of central tendency can lead to different conclusions, but when a decision needs to be made, the choice is usually between the mean and the median
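To see how much a single outlier can move the mean, here is a small sketch using Python's standard `statistics` module. The scores are invented for illustration, and the median is shown alongside the mean for comparison.

```python
import statistics

base = [10, 12, 13, 14, 16]     # hypothetical scores, no outlier
with_outlier = base + [95]      # same scores plus one extreme high score

mean_before = statistics.mean(base)              # 65 / 5
mean_after = statistics.mean(with_outlier)       # 160 / 6
median_before = statistics.median(base)          # middle score
median_after = statistics.median(with_outlier)   # mean of 13 and 14

# The mean roughly doubles, while the median barely moves
print(mean_before, mean_after, median_before, median_after)
```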
Mode is generally used in three situations:
4.1: Central Tendency
- When one particular score dominates a distribution
- When the distribution is bimodal or multimodal
- When the data are nominal
Variability:
4.2: Measures of Variability
a numerical way of describing how much spread there is in a distribution
2 ways to describe/compute variability:
4.2: Measures of Variability
- Computing its range
- Computing its variance and the square root of the variance, known as the standard deviation
Range
4.2: Measures of Variability
a measure of variability calculated by subtracting the lowest score (the minimum) from the highest score (the maximum).
Range equation:
4.2: Measures of Variability
range = X(highest) - X(lowest)
Range is a useful first indicator of variability, but… (downsides to range)
4.2: Measures of Variability
….is influenced only by the highest and lowest scores
* All other scores in between could be clustered near the highest score, huddled near the center, spread out evenly, or have some unexpected pattern; WE CAN’T KNOW SOLELY BASED ON THE RANGE!
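A minimal sketch of the range equation, with two made-up samples that share the same range even though their scores are arranged very differently. The function name `score_range` is an assumption for the example.

```python
def score_range(scores):
    """range = X(highest) - X(lowest)"""
    return max(scores) - min(scores)

clustered_high = [2, 9, 9, 10, 10]   # most scores sit near the maximum
spread_evenly = [2, 4, 6, 8, 10]     # scores evenly spaced

# Both samples run from 2 to 10, so the range cannot tell them apart
range_clustered = score_range(clustered_high)
range_even = score_range(spread_evenly)
```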
Whenever there are outliers, the range will be an exaggerated measure of the variability - WHAT’S THE SOLUTION?
4.2: Measures of Variability
Interquartile range
What does interquartile range indicate?
4.2: Measures of Variability
- A measure of the distance between the first and third quartiles
- Just as the median marks the 50th percentile of a data set, the first quartile marks the 25th percentile, and the third quartile marks the 75th percentile
- Essentially, the first quartile (25th percentile) and third quartile (75th percentile) are the medians of the TWO HALVES of the data: the half below the median, and the half above
WHY use interquartile range?
4.2: Measures of Variability
Because it’s based on values that come from the middle 50% of the data (between 25%-75%) of the distribution, it’s unlikely to be influenced by outliers (not considering scores < 25% or > 75%)
Steps to determining interquartile range
4.2: Measures of Variability
- 1: calculate the median
- 2: look at all of the scores BELOW the median. Then, the median of these scores, the lower half, is the first quartile, often called Q1 for short
- 3: look at all of the scores ABOVE the median. Then, the median of these scores, the upper half of the scores, is the third quartile, often called Q3 for short
- 4: subtract Q1 from Q3
- The interquartile range, often abbreviated as IQR, is the difference between the first and third quartiles
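The four steps can be sketched as follows. This follows the "median of each half" approach described above and assumes that, with an odd number of scores, the median itself is left out of both halves; other quartile conventions exist and can give slightly different values. The function names are illustrative, and the sketch expects at least two scores.

```python
def median(scores):
    """Middle score, or mean of the two middle scores."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def interquartile_range(scores):
    """IQR = Q3 - Q1, using the medians of the lower and upper halves."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2                        # Step 1: locate the median position
    lower = ordered[:half]               # Step 2: scores below the median
    # Step 3: scores above the median (skip the median itself when n is odd)
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    q1 = median(lower)
    q3 = median(upper)
    return q3 - q1                       # Step 4: subtract Q1 from Q3

iqr_plain = interquartile_range([1, 2, 3, 4, 5, 6, 7, 8])
iqr_outlier = interquartile_range([1, 2, 3, 4, 5, 6, 7, 80])  # 8 replaced by 80
```

In this made-up example the outlier stretches the range from 7 to 79, but the IQR is the same for both samples.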
Variance
4.2: Measures of Variability
- the average of the squared deviations from the mean
- When something varies, it must vary from (or be different from) some standard - standard as in the mean
- Thus, when we compute variance, that number describes how far a distribution varies around the mean
Variance - why can’t we just take the average of the deviations from the mean?
4.2: Measures of Variability
If we do, we get 0, because the positive and negative deviations cancel out
- Remember, the mean is the point at which all scores are perfectly balanced; mathematically, the scores have to balance out - yet we know that there is variability among these scores
- Statisticians solve this problem by SQUARING ALL THE DEVIATIONS, which eliminates the negative signs
- Once we square the deviations, we can take their average and get a measure of variability
- Later, we will “unsquare” those deviations to calculate the SD
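A quick numeric check of this idea, using a made-up sample: the raw deviations from the mean sum to zero, while the squared deviations do not.

```python
scores = [2, 4, 6, 8, 10]                    # hypothetical sample
m = sum(scores) / len(scores)                # M = 6.0

deviations = [x - m for x in scores]         # [-4.0, -2.0, 0.0, 2.0, 4.0]
sum_of_deviations = sum(deviations)          # negatives cancel positives

squared = [d ** 2 for d in deviations]       # [16.0, 4.0, 0.0, 4.0, 16.0]
mean_squared = sum(squared) / len(squared)   # 40 / 5

print(sum_of_deviations, mean_squared)
```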
4 STEPS TO CALCULATE VARIANCE:
4.2: Measures of Variability
1: subtract the mean from every score (X-M)
* AKA deviations from the mean
2: square every deviation from the mean
* AKA squared deviations
3: sum all the squared deviations
* AKA sum of squared deviations, or sum of squares for short
4: divide the sum of squares (the sum of each score’s squared deviation from the mean) by the total number in the sample
* EX: sum of squared deviations = 48.80
* Total # of scores: 5
* 48.80/5 = 9.76
* Thus, variance = 9.76
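The four steps map directly onto a few lines of Python. This is a sketch that divides by N, as the steps above do; the function name `variance` and the sample scores are invented for illustration. The last line simply repeats the arithmetic from the example above.

```python
def variance(scores):
    """Average of the squared deviations from the mean (divides by N)."""
    n = len(scores)
    m = sum(scores) / n
    deviations = [x - m for x in scores]     # Step 1: X - M
    squared = [d ** 2 for d in deviations]   # Step 2: squared deviations
    sum_of_squares = sum(squared)            # Step 3: sum of squares
    return sum_of_squares / n                # Step 4: divide by N

sample = [3, 5, 5, 7, 10]            # hypothetical sample, M = 6
sample_variance = variance(sample)   # sum of squares 28, divided by 5

card_variance = 48.80 / 5            # the example above: 9.76
```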
Symbols that represent the variance of a sample include:
4.2: Measures of Variability
- SD^2 (standard deviation squared)
- s^2 (standard deviation squared)
- MS (comes from “mean square”, referring to average of the squared deviation)
Most basic formula for SD
SD = square root of SD^2
Full formula for SD
SD = square root of [Σ(X - M)^2 / N]
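A minimal sketch of the full formula, dividing by N as written above. The function name `standard_deviation` and the sample are assumptions for the example. For comparison, Python's `statistics.pstdev` also divides by N, while `statistics.stdev` divides by N - 1.

```python
import math

def standard_deviation(scores):
    """SD = square root of [Σ(X - M)^2 / N]"""
    n = len(scores)
    m = sum(scores) / n
    sum_of_squares = sum((x - m) ** 2 for x in scores)
    return math.sqrt(sum_of_squares / n)   # "unsquare" the variance

sample = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical sample: M = 5, SS = 32
sd = standard_deviation(sample)      # variance 32 / 8 = 4, so SD = 2

# The variance from the earlier example, 9.76, gives an SD of about 3.12
card_sd = math.sqrt(9.76)
```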
LECTURE
How can we tell what’s based on a sample vs. population (equation)?
- Sample: uses M, not μ (μ indicates a population)
First step to calculating the median
list all scores in ascending order
We can use central tendency as a clue to distribution shape: symmetrical, positive skew, negative skew
- In a symmetrical “bell shaped” curve: mean = median = mode
- Positive skew: mean > median > mode (mean gets pulled by upper tail)
- Negative skew: mean < median < mode (mean gets pulled by lower tail)
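These orderings can be checked on small made-up samples with Python's standard `statistics` module. The data below were chosen so the pattern holds; real data sets will not always follow it so neatly, so treat the ordering as a clue rather than a rule.

```python
import statistics

def central_tendency(scores):
    """Return (mean, median, mode) for a list of scores."""
    return (
        statistics.mean(scores),
        statistics.median(scores),
        statistics.mode(scores),
    )

positively_skewed = [1, 2, 2, 3, 4, 5, 20]        # long upper tail
negatively_skewed = [1, 16, 17, 18, 19, 19, 20]   # long lower tail

pos_mean, pos_median, pos_mode = central_tendency(positively_skewed)
neg_mean, neg_median, neg_mode = central_tendency(negatively_skewed)

print(pos_mean > pos_median > pos_mode)   # mean pulled toward the upper tail
print(neg_mean < neg_median < neg_mode)   # mean pulled toward the lower tail
```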
What measure (mean, median, mode) is the best to describe central tendency - 4 key points:
- Usually the mean
- Small dataset - harder to interpret
- If there are extreme outliers, consider calculating central tendency both with and without them
- If unsure, report all three - but note that the mode is the only option for nominal data