Lecture 4-Measures of Central Tendency and Dispersion Flashcards

1
Q

Where is the dataset centred?

A

Measures of central tendency

2
Q

How dispersed is the dataset about its centre?

A

Measures of dispersion

3
Q

A measure of central tendency for a dataset

A

a number that indicates the ‘centre’ of the distribution of the dataset.

4
Q

Three main measures of central tendency

A

1. Mean 2. Mode 3. Median. The Tri-Mean (trimmed mean) is a further measure, but it is not one of the main three.

5
Q

The Mean

A

Ideally, we should be able to summarize all the facts about a group of data in one figure
*In fact, this is frequently done by a measure that is at the core of a data set - the average.
*But what is an average?
*In all cases, what we are really aiming at is some notion of a typical value. You may be already familiar with the most popular of average measures - the arithmetic mean.
For grouped data, we compute the mean from the frequency table (frequency distribution)
*Assume that all the datapoints that fall in a class are centred at the class mark
*For each class, find the product of the class mark and the class frequency
*The Mean is found by adding these products and dividing the sum by the total frequency

6
Q

The Mean example

A

Let us denote the number of runs that the batsman makes in any one innings as X. Better yet, let us define X1 as the number he makes in the first innings, X2 as what he makes in the second, X5 as what he makes in the fifth, and so on.
*Generically, we may let Xi stand for the number he scores in the ith innings, where i can be set equal to 1, 2, 5, or any value up to n, the total number of innings.
–For three innings, the arithmetic mean of his scores would be:
(X1 + X2 + X3) / 3
–For five innings it would be:
(X1 + X2 + X3 + X4 + X5) / 5
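
The same pattern extends to any number of innings: add up the scores and divide by how many there are. A minimal Python sketch, assuming a made-up list of scores (the values and the name `arithmetic_mean` are illustrative, not from the lecture):

```python
def arithmetic_mean(scores):
    """Add up the scores and divide by the number of innings."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)

# Hypothetical scores X1..X5 for five innings (not taken from the lecture).
innings = [12, 45, 0, 78, 30]

mean_first_three = arithmetic_mean(innings[:3])  # (X1 + X2 + X3) / 3
mean_all_five = arithmetic_mean(innings)         # (X1 + X2 + X3 + X4 + X5) / 5
print(mean_first_three, mean_all_five)  # 19.0 33.0
```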

7
Q

The mean from grouped data

A

Class    Freq f
10-14    2
15-19    12
20-24    23
25-29    60
30-34    77
35-39    38
40-44    8
Total    220
*Consider how to find a mean value from a grouped data set
*We have 220 datapoints. To find the mean, we want to add them up and divide by 220
*HOWEVER, individual values are lost in a frequency distribution!
*What to do?

8
Q

Mean pt 2

A

Assume that all the datapoints that fall in a class are centred at the class mark
*So, we know 2 datapoints fell within the 10-14 class, but that is all
*We find the mid-point (class mark) of the class: (10 + 14) / 2 = 12
*We assume the values of those 2 datapoints to be 12
*In reality, they may not be this at all! But this is the price we pay with grouped data
So, the first two datapoints are 12
*To begin calculating a mean, we would start with 12+12+…
*In other words, we multiply the frequency of each class by the mid-point
*The Mean = 6560/220 = 29.8
*Note that, if we were able to use the actual datapoints, the value we derive for the Mean may well be different
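
The grouped-mean steps can be sketched in Python as follows. The class limits and frequencies are those of the frequency table on the previous card; the function name `grouped_mean` is just an illustrative choice.

```python
# (lower limit, upper limit, frequency) for each class in the frequency table
classes = [(10, 14, 2), (15, 19, 12), (20, 24, 23), (25, 29, 60),
           (30, 34, 77), (35, 39, 38), (40, 44, 8)]

def grouped_mean(classes):
    """Estimate the mean by treating every datapoint as its class mark."""
    total_frequency = sum(f for _, _, f in classes)
    # class mark (mid-point) times class frequency, summed over all classes
    weighted_sum = sum(((lower + upper) / 2) * f for lower, upper, f in classes)
    return weighted_sum / total_frequency

mean = grouped_mean(classes)
print(round(mean, 1))  # 29.8, i.e. 6560 / 220
```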

9
Q

Define Tri-Mean

A

The Tri-Mean (trimmed mean) is the mean calculated after trimming off a fixed percentage of the smallest and the largest values of the dataset
*If the trimmed mean does not differ considerably from the mean, then we know that the extreme values of the data did not significantly bias the mean calculation
*If they differ, however, then we know that our data set was characterized by untypical, extreme values
*If left unrecognised, this fact could lead us to draw some erroneous conclusions by relying on the Mean alone

10
Q

Tri-mean examples

A

Scores of a student over 20 courses, arranged in ascending order, are as follows:
0, 89, 89, 89, 89, 89, 89, 90, 90, 90,
90, 90, 90, 90, 91, 91, 91, 91, 91, 91
*Mean of all 20 scores = 85.5 (not a typical score)
Quick inspection of the data reveals:
* Scores typically between 89 and 91
*‘0’ is not typical; it is an outlier
*If we trim off the first 5% and the last 5% of the dataset, the mean of the remaining 18 scores = 89.9
*Tri-Mean = 89.9
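
The trimming can be checked with a short Python sketch. The function name `trimmed_mean` and its `percent` parameter are illustrative assumptions; the scores are the 20 listed above.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of the values from each end."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100   # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)

scores = [0] + [89] * 6 + [90] * 7 + [91] * 6   # the 20 course scores

plain_mean = sum(scores) / len(scores)
print(plain_mean)                          # 85.5
print(round(trimmed_mean(scores, 5), 1))   # 89.9 (the 0 and one 91 are dropped)
```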

11
Q

Other types of mean

A

Harmonic mean
*Geometric mean
*Quadratic mean

12
Q

Define the Median

A

The median divides a data set into halves
*The Median is therefore the datapoint which lies at the centre of the dataset when arranged in ascending or descending order
*It is the point below which 50% of the data lies, and above which 50% of the data lies
*For this reason, it is also known as the 50th Percentile

13
Q

Example of the median

A

Suppose we had the following data set and we wanted to find the median point – {22, 27, 8, 5, 13}
*The first thing to do would be to rank the data in ascending order – {5, 8, 13, 22, 27}
*In this manner the median, or middle value, of 13 becomes obvious: half of the population is to the left of the median, and the other half is to the right.
*The median value is therefore the value which cuts the population in half
*In other words, with n = 5 datapoints, the position of the median is (n + 1) / 2 = 3, i.e. the 3rd value

14
Q

Finding the median: odd vs even number of datapoints

A

The method depends on whether the data set contains an odd or an even number of values
*With odd-numbered data sets, the median value is simply the middle value after the data has been ranked in ascending order (in position (n + 1) / 2)
*With even-numbered data sets, however, the median is found by taking the average of the two middle values
*Therefore, depending on the data set, the median does not necessarily have to be a value of the data set
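
The odd and even cases can be sketched in Python. The five-value dataset is the one from the median example; the four-value dataset is a made-up illustration of the even case.

```python
def median(values):
    """Middle value of the ranked data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]            # position (n + 1) / 2, counting from 1
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_case = median([22, 27, 8, 5, 13])   # ranked: 5, 8, 13, 22, 27
even_case = median([22, 8, 5, 13])      # hypothetical; ranked: 5, 8, 13, 22
print(odd_case, even_case)  # 13 10.5
```

In the even case the result, 10.5, is not itself a member of the dataset.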

15
Q

Define Mode

A

The Mode is the most frequently occurring datapoint in a dataset
*It can be read off a Stem-and-Leaf Display. The leaf with the highest frequency points us to the Mode.
*In the case of grouped data presented in a frequency table, we can identify the modal class (the class with the highest frequency) and proceed to estimate the Mode by the class mark of that class
The Mode may not necessarily be affected by a change in one datapoint
*It may not be unique
*If it is unique, the dataset is unimodal
*Otherwise the dataset may be bi-modal, multi-modal or even possess no mode at all
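
A rough Python sketch of finding the mode(s) of ungrouped data. It assumes the convention that a dataset in which every value occurs exactly once has no mode; the function name `modes` and the small datasets are illustrative, not from the lecture.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency (may be empty)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []   # assumed convention: no value repeats, so no mode
    return sorted(v for v, c in counts.items() if c == highest)

unimodal = modes([0] + [89] * 6 + [90] * 7 + [91] * 6)   # 90 occurs 7 times
bimodal = modes([1, 1, 2, 2, 3])
no_mode = modes([5, 8, 13, 22, 27])
print(unimodal, bimodal, no_mode)  # [90] [1, 2] []
```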

16
Q

key points to remember

A

*If we seek a measure in the context of a ‘most frequent value’ – select the Mode
*If we seek a measure in the context of an ‘average value’ – select the Mean
*If we seek a measure in the context of a ‘middle value’ – select a Median
*If Mean, Mode & Median agree, the dataset is said to be symmetric.
*Otherwise it is said to be skewed

17
Q

Symmetry vs Skewness

A

Relationship between mean, median and mode
*Symmetric: Mean = Median = Mode
*Positively skewed: Mode < Median < Mean
*Negatively skewed: Mean < Median < Mode

18
Q

Skewness

A

You must be able to distinguish between a Positive Skew and a Negative Skew.
*You must be able to estimate the Degree of Skewness of a dataset by using either Pearson’s First or Second coefficient:
*First coefficient = (mean – mode) / standard deviation
*Second coefficient = 3(mean – median) / standard deviation
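
As a hedged illustration, both coefficients can be computed with Python's standard `statistics` module. The data are the 20 course scores from the Tri-mean example card; using the population standard deviation (dividing by N) is an assumption made to match the rest of the deck.

```python
import statistics

scores = [0] + [89] * 6 + [90] * 7 + [91] * 6

mean = statistics.mean(scores)      # 85.5
median = statistics.median(scores)  # 90.0
mode = statistics.mode(scores)      # 90
sd = statistics.pstdev(scores)      # population SD (divides by N)

first_coefficient = (mean - mode) / sd
second_coefficient = 3 * (mean - median) / sd

# Both come out negative: the outlier 0 drags the mean below the median
# and mode, which is the pattern of a negative skew.
print(first_coefficient < 0, second_coefficient < 0)  # True True
```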

19
Q

Intro to Measures of Dispersion

A

Consider two hypothetical countries, country A and country B, in which only three individuals live.
*Clearly, if we were to rely solely on the mean measure as an index of well-being in both countries, we would conclude that it is fundamentally similar.
*However, it is clear that there are fundamental differences between these 2 countries

20
Q

The Range

A

Perhaps the most rudimentary measure of dispersion or variability is the range.
*It is calculated as the difference between the highest and the lowest recorded values in a data set. For country A, the range is calculated as $1,000 while for country B it is established as $4,000. The greater level of dispersion in country B tells us, among other things, that the mean value of $4,000 is less reliable as an index of well-being in country B than it is in country A
The Range = Largest Value - Smallest Value
*It is easy to compute
*It depends on only two data points
*It responds to any change(s) in these two data points
Shortcomings
–Two significantly different datasets can have the same range
–Any error in one of the two extreme datapoints will bias the range
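
The first shortcoming can be seen in a small Python sketch. Both datasets below are invented for illustration (they are not the country data): one is tightly clustered and the other evenly spread, yet the range cannot tell them apart.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

# Two hypothetical datasets with very different shapes.
clustered = [1, 5, 5, 5, 5, 5, 9]
spread_out = [1, 2, 4, 5, 6, 8, 9]

print(value_range(clustered), value_range(spread_out))  # 8 8
```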

21
Q

The Inter-Quartile Range

A

The usefulness of the range as a measure of dispersion is limited by the fact that its value can be artificially affected by the presence of a few extreme outliers.
*A possible solution to this dilemma would be to calculate something like a “trimmed” range, similar in spirit to the trimmed mean we met above.
*A frequently employed adaptation of such a procedure is to eliminate the lowest and highest 25% of the values and to consider only the range of the remaining values. The measure so obtained is called the interquartile range (IQR).

22
Q

More on Inter-Quartile Range

A

Remember that the median is sometimes called the 50th percentile? By a similar token, we can speak of the 10th percentile, which is the mark below which 10% of the population is found, and so on.
*The more popular percentile measures are, however, the 25th percentile and the 75th percentile
*The 25th percentile, or the mark below which 25% of the data can be found, is frequently called the first quartile, while the 75th percentile is known as the third quartile.
*The interquartile range is so named because its lower and upper limits are, respectively, the first and third quartiles
*It reflects the spread of the middle 50% of the dataset
*Interquartile Range IQR = Q3 – Q1, where Q1 & Q3 are the first & third quartiles respectively.
*25% of the distribution is to be found below the first quartile (Q1) and 25% above the third quartile (Q3).

23
Q

Quartile Deviation

A

Very often though the Semi Interquartile Range (otherwise called the quartile deviation) is used
*Quartile Deviation QD = ½ (Q3 – Q1)
*The quartile deviation should be used as the dispersion measure of choice when the median is used as the measure of central tendency, that is, when the distribution is skewed
*It is the average of the spreads of the 2nd and 3rd quarters of the data, (Q2 – Q1) and (Q3 – Q2). Hence we can consider using the QD to distinguish between the spread of two different datasets
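
A minimal Python sketch of the IQR and QD, using `statistics.quantiles` from the standard library (Python 3.8+). The dataset is hypothetical, and the quartile positions follow that function's default "exclusive" method; other textbooks and packages use slightly different quartile conventions, so results can differ on other data.

```python
import statistics

# Hypothetical dataset of 11 values, already ranked in ascending order.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 13, 15]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1              # spread of the middle 50% of the data
qd = 0.5 * (q3 - q1)       # quartile deviation (semi-interquartile range)
print(q1, q2, q3, iqr, qd)  # 4.0 8.0 12.0 8.0 4.0
```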

24
Q

Boxplot / Box-and-Whisker Plot

A

It gives us a graphical display of the “5 Number Summary”, that is,
–The minimum value
–The maximum value
–The first quartile
–The second quartile
–The third quartile
*The plot comprises a rectangular box (whose length equals the difference between the first and third quartiles) and two tails or whiskers (one from the minimum value to the first quartile and the other from the third quartile to the maximum value)
*The second quartile is the median and that is highlighted in the plot by a bar in the box.

25
Q

Deviations from the Mean

A

The mean is by far the most popular measure of central tendency and it would seem only reasonable that, when it is used, it should be accompanied by a measure of dispersion that takes the mean explicitly into account.
*An intuitively appealing idea would be to measure the distance of each individual data point from the mean value.
*Recall our Country Example:

Country   Incomes    Mean
A         3  4  5    4
B         1  4  7    4

Country A: Deviations from the Mean
Income   Deviation from Mean Income
3        -1
4        0
5        1

Country B: Deviations from the Mean
Income   Deviation from Mean Income
1        -3
4        0
7        3

26
Q

More on Deviations from the Mean

A

Point by point, the deviations for country B are larger than for country A, confirming once again the greater inequality of income in country B.
*As the data set becomes larger, it becomes more difficult to make point by point comparisons, especially if there is no point by point superiority of one set over another.

27
Q

Mean Absolute Deviation (MAD)

A

The deviations from the mean are also measures of dispersion or spread. It is tempting to get a summary measure by adding them up and dividing by 3 to get an average value of this dispersion.
*For Country A: [(-1) + (0) + (1)] / 3 = 0
*For Country B: [(-3) + (0) + (3)] / 3 = 0
*The deviations sum to zero. This is, in fact, a standard result of summing deviations from the mean.
A solution in this case would be to ignore the sign attached to the deviations and calculate the mean of the absolute values of the deviations from the mean.
*For Country A: [(1) + (0) + (1)] / 3 = 2/3
*For Country B: [(3) + (0) + (3)] / 3 = 6/3 = 2
*These figures are examples of a measure of dispersion known as the Mean Absolute Deviation (MAD).
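
The calculation can be sketched in Python using the incomes from the country example (A: 3, 4, 5 and B: 1, 4, 7). The function name is an illustrative choice.

```python
def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

country_a = [3, 4, 5]
country_b = [1, 4, 7]

# Signed deviations always cancel out, which is why the signs are dropped.
signed_sum_a = sum(x - 4 for x in country_a)   # mean income is 4
print(signed_sum_a)  # 0

mad_a = mean_absolute_deviation(country_a)   # (1 + 0 + 1) / 3 = 2/3
mad_b = mean_absolute_deviation(country_b)   # (3 + 0 + 3) / 3 = 2
```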

28
Q

Lastly on Mean Absolute Deviation

A

Yet another approach is to focus on the deviations from the mean
*Given a datapoint x, its deviation from the mean is given by (Actual Value x – Mean Value of the dataset)
*Some of these deviations will be positive; others negative; some even zero
*Any attempt to find a mean deviation from the mean would always result in a zero answer
*Instead we drop the sign on each deviation, thereby creating absolute deviations, and proceed to find the mean of these absolute deviations; this is called the Mean Absolute Deviation (MAD)
*However, this is mathematically clumsy

29
Q

The Variance

A

The reason for taking absolute values was to lose the negative sign from the deviations from the mean (or else the deviations would sum to 0)
*However, there is another way to deal with negative values
*We speak of the Variance, and its associated measure, the Standard Deviation
*Instead of the absolute value of the deviation from the mean, the variance uses squared deviations.

Country A                       Country B
Deviation  Squared Deviation    Deviation  Squared Deviation
-1         1                    -3         9
0          0                    0          0
1          1                    3          9

Variance, Country A = (1+0+1) / 3 = 2/3
*Variance, Country B = (9+0+9) / 3 = 18/3 = 6

30
Q

Facts about the Variance

A

Variance is never negative
*It eliminates the clumsy absolute deviation that we encountered in the MAD
*It attaches a greater penalty to greater deviations from the mean (each deviation counts in proportion to its square)
*The greater the dispersion the greater the variance
*Unfortunately it is not expressed in the same unit of measure as the datapoints

31
Q

Standard Deviation

A

Getting around the shortcoming of Variance requires that we find its square root
*The square root will have the same unit of measure as the data in the dataset.
*The square root of variance is called standard deviation.
*SD of Country A = √ (2/3) = 0.8165
*SD of Country B = √ (6) = 2.449
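
A short Python sketch of the variance and standard deviation for the two countries. It divides by N, as the cards do (a population variance); the function names are illustrative.

```python
import math

def variance(values):
    """Mean of the squared deviations from the mean (divides by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def standard_deviation(values):
    """Positive square root of the variance; same unit as the data."""
    return math.sqrt(variance(values))

country_a = [3, 4, 5]
country_b = [1, 4, 7]

sd_a = standard_deviation(country_a)
sd_b = standard_deviation(country_b)
print(variance(country_b))                # 6.0
print(round(sd_a, 4), round(sd_b, 3))     # 0.8165 2.449
```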

32
Q

Example of Standard Deviation

A

Country A’s income is distributed with a mean of $4 and a standard deviation of $0.8165
*Country B’s is distributed with a mean of $4 and a standard deviation of $2.449
*Looking at the Means alone tells us that the data sets are similar (and we know they are not)
*Looking at the Means together with, say, the Standard Deviations tells us that the data sets are in fact very different.
*The wider dispersion in country B is an indication that the mean is not as reliable, say, as a measure of economic well being as it is for country A.

33
Q

The Variance and SD with Grouped Data

A

For grouped data the variance is found by a modified approach:
–Assume again that all data points in a class are centred at the class mark
–For each class compute the square of the deviation of the class mark from the mean
–For each class find the product of the squared deviation and the class frequency
–Sum these products over all classes in the frequency table
–Divide the total by N i.e. the sum of the frequencies
–Variance = [Σ fi (xi – Mean)²] / N
–Standard Deviation = Positive Square Root of the Variance

34
Q

Standard Deviation chart

A

Class   Mid x   Freq f   xf     (x-µ)²   f(x-µ)²
10-14   12      2        24     316.84   633.68
15-19   17      12       204    163.84   1966.08
20-24   22      23       506    60.84    1399.32
25-29   27      60       1620   7.84     470.4
30-34   32      77       2464   4.84     372.68
35-39   37      38       1406   51.84    1969.92
40-44   42      8        336    148.84   1190.72
Total           220      6560            8002.8

Mean µ = 6560 / 220 = 29.8
Variance = 8002.8 / 220 = 36.38
Standard deviation = square root of 36.38 = 6.03
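
The grouped-data procedure can be sketched in Python with the class marks and frequencies from the chart. One difference is worth flagging: the sketch uses the unrounded mean (6560 / 220) rather than 29.8, so intermediate figures differ slightly from the chart, but the results agree to two decimal places.

```python
import math

# (class mark x, frequency f) for each class in the frequency table
classes = [(12, 2), (17, 12), (22, 23), (27, 60), (32, 77), (37, 38), (42, 8)]

total_frequency = sum(f for _, f in classes)                 # N = 220
mean = sum(x * f for x, f in classes) / total_frequency      # 6560 / 220

# Variance = [sum of f * (x - mean)^2] / N
variance = sum(f * (x - mean) ** 2 for x, f in classes) / total_frequency
standard_deviation = math.sqrt(variance)

print(round(variance, 2), round(standard_deviation, 2))  # 36.38 6.03
```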

35
Q

Summary

A

Where is the dataset centred?
–This is answered by way of measures of central tendency
–Mean, Tri-Mean, Median, Mode
*How dispersed is the dataset about its centre?
–This is answered by way of measures of dispersion
–Range, IQR, QD, MAD, Variance, SD