Data Representation Flashcards
Representation of Data,
qualitative (categorical) variable
quantitative variable
types of quantitative variables;
type 1
type 2
Unable to take a numerical value
Can take a numerical value
Continuous quantitative variable:
Variable which can take any value in a given range
Discrete quantitative variable:
Variable which has clear steps (gaps) between its possible values
Measures of central tendency,
mean;
description
formula
median;
description
process of finding
mode,
description
process of finding
advantages of the measures of central tendency;
mean
median
mode
disadvantages of the measures of central tendency;
mean
median
mode
The mean [ x̄ or E(X) ] of a data set is the sum of the values in the data set divided by the number of values. If the total number of values is n and sigma notation ( Σ ) is used for the sum of all the values:
x̄ = Σx / n = ( x1 + x2 + … + xn ) / n
The median is defined to be the middle value of the data
After sorting the values numerically: if n is odd, the median is the middle value; if n is even, it is the sum of the 2 middle values divided by 2.
The mode is defined to be the most common, or most frequently occurring, value in a data set
The mode is the value that appears the most. There may be multiple modes if several values are tied for the most occurrences, and there may be no mode if no value occurs more often than the others
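The three definitions above can be checked with a minimal sketch using Python's standard `statistics` module. The data values are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, multimode

data = [3, 7, 7, 2, 9, 7, 4, 2]  # hypothetical values

data_mean = mean(data)        # sum of the values divided by the number of values
data_median = median(data)    # sorts first; n is even here, so the 2 middle values are averaged
data_modes = multimode(data)  # list of every value tied for most frequent

print(data_mean, data_median, data_modes)
```

Note that `multimode` returns every value when all values occur equally often, which corresponds to the "no mode" case in these notes.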
Mean (+):
– Uses all values
– Can be found on calculator easily
Median (+):
– Not influenced by extreme high or low values
Mode (+):
– Good for finding the most popular value
Mean (–):
– Influenced by extreme high or low values
Median (–):
– Difficult to work out if there are a large number of values
Mode (–):
– May not be at all representative of a set of values
Measures of spread,
description
range;
description
formula
inter-quartile range (IQR);
description
process
standard deviation and variance;
description
3 formulas
process of finding from the table
standard formula for variance
coded data;
mean formula
standard deviation formula
variance formula
combined sets of data;
mean
standard deviation
variance
applications;
mean formula
standard deviation formula
variance formula
The range, inter-quartile range, standard deviation and variance tell us about the spread of a data set
Range:
The range is the difference between the largest and smallest value
maximum - minimum = range
Inter-quartile range:
The upper quartile and lower quartile, together with the median, separate a set of data values into 4 quarters
The lower-quartile is the value below which 25%, or a quarter, of the values lie. It can be found by finding the middle value of the data between the minimum value and the median, or the value corresponding with 0.25n
The upper-quartile is the value below which 75%, or three quarters, of the values lie. It can be found by finding the middle value of the data between the median and the maximum value, or the value corresponding with 0.75n
The IQR is the upper-quartile minus the lower-quartile, or the range of the quartiles.
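A minimal sketch of the range and IQR calculations, assuming the "middle of each half" method for the quartiles (other conventions, such as reading off 0.25n and 0.75n, can give slightly different values). The function name `quartiles` and the data are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (LQ, median, UQ) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # values above it (middle value excluded when n is odd)
    return median(lower), median(ordered), median(upper)

data = [4, 7, 9, 11, 12, 15, 18, 20, 22]  # hypothetical values

lq, med, uq = quartiles(data)
data_range = max(data) - min(data)  # maximum - minimum
iqr = uq - lq                       # upper quartile - lower quartile

print(lq, med, uq, data_range, iqr)
```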
Standard deviation [ σ or Sd(X) ], Variance [ σ^2 or Var(X) ]:
The standard deviation of a set of data values is a measure of their spread about the mean.
σ = √[ Σ( x - x̄ )^2 / n ]
σ = √[ ( Σx^2 / n ) - ( x̄ )^2 ]
σ = √{ [ Σx^2 - n( x̄ )^2 ] / n }
- Find the mean
- Write down the difference of each value from the mean, (x - x̄)
- Square the difference, ( x - x̄ )^2
- Average the squares, Σ( x - x̄ )^2 / n
- Take the square root to reverse the effect of squaring
Variance = σ^2 = Σ( x - x̄ )^2 / n
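As a quick check that the three standard deviation formulas agree, here is a minimal sketch on hypothetical data. In each case the square root is taken over the whole expression.

```python
from math import sqrt
from statistics import pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
n = len(data)
x_bar = sum(data) / n            # the mean

# Formula 1: root of the average squared difference from the mean
sd_1 = sqrt(sum((x - x_bar) ** 2 for x in data) / n)
# Formula 2: root of (mean of the squares minus square of the mean)
sd_2 = sqrt(sum(x ** 2 for x in data) / n - x_bar ** 2)
# Formula 3: the same quantity with n factored out
sd_3 = sqrt((sum(x ** 2 for x in data) - n * x_bar ** 2) / n)

variance = sd_1 ** 2

print(sd_1, sd_2, sd_3, variance, pstdev(data))
```

`statistics.pstdev` divides by n, like these notes; `statistics.stdev` divides by n - 1 and would give a different answer.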
Coded mean formula:
x̄ = [ Σ( x ± a ) / n ] ∓ a
Coded standard deviation formula:
σ = √{ [ Σ( x ± a )^2 / n ] - [ Σ( x ± a ) / n ]^2 }
Coded variance formula:
σ^2 = [ Σ( x ± a )^2 / n ] - [ Σ( x ± a ) / n ]^2
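A minimal sketch of coding by subtraction, assuming hypothetical data and a hypothetical coding constant a = 1000. It illustrates that adding or subtracting a constant shifts the mean but leaves the variance and standard deviation unchanged.

```python
from math import sqrt
from statistics import pvariance

data = [1001, 1003, 1004, 1008]  # hypothetical values
a = 1000                         # hypothetical coding constant

coded = [x - a for x in data]    # the coded values (x - a)
n = len(coded)

coded_mean = sum(coded) / n
mean_x = coded_mean + a          # undo the coding to recover the mean of x

# The variance of the coded values equals the variance of x
var_x = sum(c ** 2 for c in coded) / n - coded_mean ** 2
sd_x = sqrt(var_x)

print(mean_x, var_x, sd_x)
```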
Combined mean formula:
combined mean = ( Σx + Σy ) / ( Nx + Ny )
Combined standard deviation formula:
combined σ = √{ [ ( Σx^2 + Σy^2 ) / ( Nx + Ny ) ] - [ ( Σx + Σy ) / ( Nx + Ny ) ]^2 }
Combined variance formula:
combined σ^2 = [ ( Σx^2 + Σy^2 ) / ( Nx + Ny ) ] - [ ( Σx + Σy ) / ( Nx + Ny ) ]^2
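A minimal sketch of combining two data sets from their summary totals, assuming the combined mean is ( Σx + Σy ) / ( Nx + Ny ) and the combined variance is the combined mean of the squares minus the square of the combined mean. The function name `combined_stats` and the two data sets are hypothetical.

```python
from math import sqrt
from statistics import pvariance

def combined_stats(n_x, sum_x, sum_x2, n_y, sum_y, sum_y2):
    """Return (mean, variance, standard deviation) of two sets pooled together."""
    n = n_x + n_y
    mean = (sum_x + sum_y) / n
    variance = (sum_x2 + sum_y2) / n - mean ** 2
    return mean, variance, sqrt(variance)

xs = [2, 4, 6]          # hypothetical first set
ys = [1, 3, 5, 7, 9]    # hypothetical second set

mean_c, var_c, sd_c = combined_stats(
    len(xs), sum(xs), sum(x ** 2 for x in xs),
    len(ys), sum(ys), sum(y ** 2 for y in ys),
)

print(mean_c, var_c, sd_c)
```

Only the totals n, Σx and Σx^2 of each set are needed, which is why these formulas are useful when the raw values are not available.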
Applied mean formula:
x̄ = (1/a) { [ Σ( ax ± b ) / n ] ∓ b }
Applied standard deviation formula:
σ = (1/a) √{ [ Σ( ax ± b )^2 / n ] - [ Σ( ax ± b ) / n ]^2 } (for a > 0)
Applied variance formula:
σ^2 = (1/a^2) { [ Σ( ax ± b )^2 / n ] - [ Σ( ax ± b ) / n ]^2 }
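A minimal sketch of undoing a linear coding y = ax + b, with hypothetical data and hypothetical constants a = 2 and b = -25. The mean is recovered by reversing both steps; the variance only needs dividing by a^2, because adding b does not change the spread.

```python
from math import sqrt
from statistics import pvariance

data = [12.5, 13.0, 14.5, 16.0]  # hypothetical values
a, b = 2, -25                    # hypothetical coding y = a*x + b

coded = [a * x + b for x in data]
n = len(coded)

coded_mean = sum(coded) / n
coded_var = sum(y ** 2 for y in coded) / n - coded_mean ** 2

mean_x = (coded_mean - b) / a    # reverse the shift, then the scaling
var_x = coded_var / a ** 2       # the shift b has no effect on spread
sd_x = sqrt(coded_var) / abs(a)  # abs(a) keeps the result positive if a < 0

print(mean_x, var_x, sd_x)
```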
Stem and leaf diagram,
description
steps for construction
notes for constructing a stem and leaf diagram
back to back stem and leaf diagram description
back to back stem and leaf diagram construction steps
points to consider when comparing
A useful way of organising data as it is collected. It shows the distribution of the data and, because the values are arranged from smallest to largest, makes it easy to locate the quartiles and median
- Rearrange data values in order
- Place first digits of values in order on vertical stem
- Place remaining digits in order horizontally along leaf
a) give the diagram a title
b) show values of median, LQ, UQ, IQR, n, range
c) write a key for the diagram
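The construction steps can be sketched in code as follows. This is only an illustration, assuming non-negative whole-number values with the tens digits as the stem and the units digit as the leaf; the function name `stem_and_leaf` and the data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem and leaf diagram (tens | units)."""
    rows = defaultdict(list)
    for value in sorted(values):          # step 1: put the values in order
        stem, leaf = divmod(value, 10)    # split into first digits and last digit
        rows[stem].append(leaf)
    lines = []
    # Empty stems are kept so the shape of the distribution is not distorted
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

data = [23, 31, 25, 47, 31, 38, 29, 52]  # hypothetical values

print("Key: 2 | 3 means 23")
for line in stem_and_leaf(data):
    print(line)
```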
A back to back stem and leaf diagram is useful to compare 2 different groups of data values
- Rearrange data values in order
- Place first digits of values in order on vertical stem for both of the groups
- Place the remaining digits of the first group in order horizontally along leaf, to the right of the stem
- Place the remaining digits of the second group in order horizontally along leaf, to the left of the stem
Comment on:
- Median
- Range
- Minimum and maximum values
- IQR
Box and whisker diagram,
description
steps for construction
notes for constructing
outlier definition
A box and whisker diagram shows the location of 5 values from a distribution plotted against a scale in order;
smallest value, LQ, median, UQ, largest value
- Rearrange the data values in order
- Find the smallest value
- Find the LQ
- Find the median
- Find the UQ
- Find the largest value
- Rule suitable scale
- Mark points
- Draw box and whisker diagram
- Mark any outliers outside the diagram and label them
a) give the diagram a title
b) show the min, max, med, LQ, UQ, IQR, n, mean, range
An outlier is defined as any value more than 1.5 x IQR above the UQ or more than 1.5 x IQR below the LQ, that is, greater than UQ + 1.5 x IQR or less than LQ - 1.5 x IQR.
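A minimal sketch of an outlier check, assuming the common convention that an outlier lies more than 1.5 x IQR beyond either quartile. The function names and the data are hypothetical, and the quartiles are supplied directly rather than calculated.

```python
def outlier_fences(lq, uq):
    """Return the (lower, upper) limits beyond which a value counts as an outlier."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

def find_outliers(values, lq, uq):
    """Return the values lying outside the fences."""
    low, high = outlier_fences(lq, uq)
    return [v for v in values if v < low or v > high]

data = [4, 7, 9, 11, 12, 15, 18, 20, 45]  # hypothetical values
lq, uq = 8, 19                            # quartiles of the data above (middle of each half)

fences = outlier_fences(lq, uq)
outliers = find_outliers(data, lq, uq)

print(fences, outliers)
```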
Grouping data,
description
description of information collected from group data
When dealing with data that has been measured in some way, like height or weight, nearly all the values will be different. To show how the heights of people are distributed, the most you can do is collect together people with similar heights. This process is called grouping: the values are collected into class intervals, each containing the data between 2 values
Grouping data into class intervals means some information is lost. Looking at the table does not tell you the exact values; only the number of observations in each class is known (the frequency). The values obtained would be measured to the nearest ____; half of that unit or increment is subtracted from the lower value of each class and added to the upper value to give the “class boundaries”
Class boundaries,
common example or rule of thumb
case exceptions (note be)
types of class boundaries;
type 1
example of type 1
type 2
example of type 2
type 3
example of type 3
limitation of type 3
note for age
There is no universally accepted answer for a class boundary but the common value used is 0.5
The class boundaries, for example, for 0 - 9 would be: -0.5 ≤ x < 9.5
Gap:
given –> 160 - 164, 165 - 169, 170 - 174 …
159.5 ≤ x < 164.5 , 164.5 ≤ x < 169.5 , 169.5 ≤ x < 174.5
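A minimal sketch of turning class limits with gaps into class boundaries, assuming every gap is the same size and that half the gap is shared out to each side. The function name `class_boundaries` is hypothetical.

```python
def class_boundaries(limits):
    """Convert (lower, upper) class limits into (lower, upper) class boundaries.

    Assumes at least two classes, listed in order, with equal gaps between them.
    """
    gap = limits[1][0] - limits[0][1]   # e.g. 165 - 164 = 1
    half = gap / 2
    return [(low - half, high + half) for low, high in limits]

classes = [(160, 164), (165, 169), (170, 174)]
boundaries = class_boundaries(classes)

for low, high in boundaries:
    print(f"{low} ≤ x < {high}")
```

When the classes already meet, as in 55 - 60, 60 - 65, the gap is 0 and the limits are returned unchanged.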
No gap:
given –> 55 - 60, 60 - 65, 65 - 70 …
55 ≤ x < 60 , 60 ≤ x < 65 , 65 ≤ x < 70
Open ended:
given –> 17 - 20, 20 - 23, 23 -
The problem with this frequency distribution is that the last class is open-ended, meaning you cannot deduce the correct class boundaries unless you know the individual data values. A reasonable procedure when dealing with this is to take the width of the last interval to be twice that of the previous one
Age is recorded as the number of completed years. For example, the class interval 17 - 20 contains those who passed the test from the day of their 17th birthday up to, but not including, the day of their 20th birthday
Histogram,
when likely used
description or key factors
drawing a histogram
When a grouped frequency distribution contains continuous data one of the most common forms of graphical display is a histogram
a) The bars have no spaces between them
b) The area of each bar is proportional to the frequency
- Give the histogram a title
- Label the y-axis frequency density
- Label the x-axis with its corresponding variable and the unit
Frequency density,
proportionality between block area and frequency
frequency density/ height formula
frequency table
mean formula with frequency
two standard deviation formulas with frequency
The simplest way to make the area of a block proportional to the frequency is to make the area of the block equal to the frequency
frequency density (or height) = frequency / class width
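A minimal sketch of the frequency density calculation for classes of unequal width. The class boundaries and frequencies are hypothetical.

```python
# Each entry: (lower boundary, upper boundary, frequency), all hypothetical
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 8)]

densities = []
for low, high, freq in classes:
    width = high - low
    densities.append(freq / width)   # frequency density = frequency / class width

# Area of each block (height x width) gives back the frequency
areas = [d * (high - low) for d, (low, high, _) in zip(densities, classes)]

print(densities, areas)
```

The last class has the second-largest frequency but the lowest bar, because its width is double that of the others.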
Variable | Class boundaries | Class width | Frequency (f) | Frequency density | Midpoint (x) | xf
with totals labelled Σf and Σxf
x̄ = Σxf / Σf
σ = √[ ( Σfx^2 / Σf ) - ( x̄ )^2 ]
σ = √[ Σf( x - x̄ )^2 / Σf ]
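A minimal sketch of estimating the mean and standard deviation from a grouped frequency table, using the class midpoints as x. The classes and frequencies are hypothetical, and the results are estimates because the exact values are lost when data is grouped.

```python
from math import sqrt

# Each entry: (lower boundary, upper boundary, frequency), all hypothetical
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 8)]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [f for _, _, f in classes]

total_f = sum(freqs)                                       # Σf
total_xf = sum(x * f for x, f in zip(midpoints, freqs))    # Σxf
mean = total_xf / total_f

# Formula 1: mean of the squares minus square of the mean
sd_1 = sqrt(sum(f * x ** 2 for x, f in zip(midpoints, freqs)) / total_f - mean ** 2)
# Formula 2: frequency-weighted average squared difference from the mean
sd_2 = sqrt(sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / total_f)

print(mean, sd_1, sd_2)
```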
Cumulative frequency diagram,
description
finding the median, UQ and LQ from the diagram;
median
LQ
UQ
cumulative frequency table example
note for final value of cumulative frequency
notes for drawing a cumulative frequency diagram
The cumulative frequencies are plotted against the upper class boundaries of the corresponding classes. The diagram tells us how many values are less than a given measurement (continuous data). To fill in the cumulative frequency table, add up the numbers in the frequency column up to and including the required position
Median:
- Calculate half the total frequency (50%)
- Locate this number on the vertical axis
- Draw a horizontal line across to the cumulative frequency graph then down to the x-axis
LQ:
- Calculate one quarter of the total frequency (25%)
- Locate this number on the vertical axis
- Draw a horizontal line across to the cumulative frequency graph then down to the x-axis
UQ:
- Calculate three quarters of the total frequency (75%)
- Locate this number on the vertical axis
- Draw a horizontal line across to the cumulative frequency graph then down to the x-axis
Variable | Frequency (f) | Variable | Cumulative frequency (total number < x)
- | - | < 46.5 | 0
47 - 54 | 4 | < 54.5 | 4
55 - 62 | 7 | < 62.5 | 11
63 - 66 | 8 | < 66.5 | 19
67 - 74 | 7 | < 74.5 | 26
75 - 80 | 8 | < 80.5 | 34
81 - 92 | 4 | < 92.5 | 38
The final value of the cumulative frequency should be the same as the total number of observations, e.g. the total number of people
To then draw the graph, the points are plotted as (x, y), with x being the upper class boundary of the variable and y being the cumulative frequency, e.g. (46.5, 0), (54.5, 4) etc.
Also ensure your graph is big enough to estimate the median, LQ and UQ
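A minimal sketch of building the cumulative frequencies for the example table and reading off the median and quartiles. It assumes the plotted points are joined with straight lines, so reading across and down becomes linear interpolation; a hand-drawn smooth curve would give slightly different estimates. The function name `read_off` is hypothetical.

```python
from itertools import accumulate

upper_boundaries = [46.5, 54.5, 62.5, 66.5, 74.5, 80.5, 92.5]
frequencies = [4, 7, 8, 7, 8, 4]

# Running total of the frequency column, starting from 0 at the first boundary
cumulative = [0] + list(accumulate(frequencies))
total = cumulative[-1]

def read_off(target):
    """Find the x-value where the cumulative frequency polygon reaches target."""
    points = list(zip(upper_boundaries, cumulative))
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target is outside the range of the table")

lq = read_off(total / 4)          # one quarter of the total frequency
med = read_off(total / 2)         # half of the total frequency
uq = read_off(3 * total / 4)      # three quarters of the total frequency

print(cumulative, lq, med, uq)
```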
Choosing how to represent data,
Advantages of diagrams;
diagram 1
diagram 2
diagram 3
diagram 4
Disadvantages of diagrams;
diagram 1
diagram 2
diagram 3
diagram 4
Stem and leaf (+) - Discrete:
Contains all the original data values, so the range, median and quartiles can be easily found from it and the mean and standard deviation can be calculated exactly
Histogram (+) - Continuous:
For large data sets you can make a frequency table and draw the histogram to show the shape of the distribution. The data can also be grouped into classes of any width
Cumulative frequency (+) - Continuous:
It is useful for estimating the number of data values that lie below or above a given value of the variable
Box and whisker (+) - Continuous:
Gives the lowest, highest, median and quartile values directly and is also very useful when comparing several related data sets
Stem and leaf (-) - Discrete:
For large data sets it becomes difficult to draw and can look confusing because it contains so much information
Histogram (-) - Continuous:
Some of the information in the original data set is lost, so the values of the mean, median, quartiles and standard deviation are estimates rather than exact values
Cumulative frequency (-) - Continuous:
The values of the mean, median, quartiles and standard deviation are estimates rather than exact values
Box and whisker (-) - Continuous:
Does not show the mean or standard deviation and gives no indication of the size of the data set
Distribution and skewness,
description
type 1; shape in reference to histogram description formula in reference to box and whisker plot
type 2; shape in reference to histogram description formula in reference to box and whisker plot
type 3; shape in reference to histogram description formula in reference to box and whisker plot
Another important feature of a set of data is its shape when represented as a frequency diagram. There are 3 different shapes that commonly occur when you draw histograms or bar charts for data sets
Symmetrical (zero-skewness):
The histogram is symmetrical, bell shaped
UQ - Median ≈ Median - LQ
Median in the centre of the box and whisker plot
Positive skew (skewed positively):
Histogram is not symmetrical and has a tail stretching towards the higher values, kind of like a negative parabola that is stretched out horizontally on the right as it nears the x-axis
UQ - Median > Median - LQ
Median nearer to the left of the box and whisker plot
Negative skew (skewed negatively):
Histogram is not symmetrical and has a tail stretching towards the lower values, kind of like a positive parabola that starts off as a line running horizontally along the x-axis
UQ - Median < Median - LQ
Median nearer to the right of the box and whisker plot
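The three quartile comparisons can be sketched as a small function. The name `skew_from_quartiles` and the `tolerance` parameter are hypothetical; the tolerance stands in for the "≈" in the symmetrical case, and how close counts as approximately equal is a judgement call.

```python
def skew_from_quartiles(lq, median, uq, tolerance=0.0):
    """Classify skewness by comparing the two halves of the box."""
    upper_gap = uq - median      # UQ - Median
    lower_gap = median - lq      # Median - LQ
    if abs(upper_gap - lower_gap) <= tolerance:
        return "symmetrical"
    if upper_gap > lower_gap:
        return "positive skew"   # median nearer to the left of the box
    return "negative skew"       # median nearer to the right of the box

print(skew_from_quartiles(10, 15, 20))
print(skew_from_quartiles(10, 12, 20))
print(skew_from_quartiles(10, 18, 20))
```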