6) Statistics Flashcards
Average (arithmetic mean), median, range, mode, standard deviation
What do you need to be able to calculate average (arithmetic mean) for a set of numbers?
NON EVENLY SPACED SETS:
> (1) Total fixed # of terms in the set (N)
> (2) SUM of the terms in the set
(don’t need individual terms)
EVENLY SPACED SETS:
> First term and last term
(shortcut: for an evenly spaced set the average is (first + last)/2, so the endpoints alone are enough; if you also know the increment, you can deduce the # of terms and the SUM)
Maximization / minimization average problems
To MAXIMIZE the possible value of one variable, we need to MINIMIZE the other variables
Similarly, to MINIMIZE the possible value of one variable, we need to MAXIMIZE the other variables
You must maximize or minimize in the context of CONSTRAINTS set by the question (e.g., “if each remaining child owns at least 4 goldfish …” means the minimum number of goldfish for each remaining child becomes 4)
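A minimal sketch of the maximize-one-by-minimizing-the-rest idea. The numbers (5 children, 30 goldfish in total, at least 4 each) and the function name are hypothetical, chosen only for illustration:

```python
def max_single_value(total, count, minimum_each):
    """Largest value one item can take when the other (count - 1) items
    are each pushed down to the minimum the constraint allows."""
    return total - (count - 1) * minimum_each


# Hypothetical: 5 children own 30 goldfish in total, each owns at least 4.
# Give the other 4 children the minimum (4 each) to maximize the last child.
print(max_single_value(30, 5, 4))  # 14
```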
Evenly spaced sets: What are they?
How do you find the # of terms in an evenly spaced set?
Evenly spaced sets are sets in which EACH PAIR of consecutive numbers in the set has the SAME DIFFERENCE (+ or -)
# of terms in an evenly spaced set (inclusive of both end points) = (Last term - First term)/(increment) + 1
> make sure “last” and “first” term are INCLUSIVE numbers of the set (and ADJUST the endpoints before using the formula)
e.g., if finding the number of multiples of 3 between 1 and 100 inclusive, we need to adjust the end points to find the HIGHEST # DIVISIBLE BY 3 and the LOWEST # DIVISIBLE BY 3 [3, 99]
Even if the end points in the set you are trying to find are NOT inclusive, you can always adjust them so that the above formula works
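The endpoint adjustment and the counting formula can be sketched in Python as follows. The function names are illustrative, and the sketch assumes a positive increment:

```python
def count_terms(first, last, increment):
    """# of terms in an evenly spaced set, both endpoints inclusive."""
    return (last - first) // increment + 1


def count_multiples(k, low, high):
    """Count multiples of k in [low, high] after adjusting the endpoints."""
    first = -(-low // k) * k  # lowest multiple of k that is >= low
    last = (high // k) * k    # highest multiple of k that is <= high
    if first > last:
        return 0              # no multiple of k falls inside the range
    return count_terms(first, last, k)


# Multiples of 3 between 1 and 100: endpoints adjust to [3, 99]
print(count_multiples(3, 1, 100))  # 33
```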
Evenly spaced sets: How do you calculate the average (arithmetic mean)?
Average = (First term + Last term)/2 = Median
> just need to arrange the numbers in a way that creates an evenly spaced set
> for a set with an ODD number of terms, the average will be equal to a TERM
> for a set with an EVEN number of terms, the average will be the average of the two middle terms
Why does this formula work?
> because it’s the same thing as PAIRING UP the terms (first with last, second with second-to-last, and so on) so that every pair adds up to the same number
e.g., 5, 22, 39, 56, 73, 90, and 107
> common difference is 17
> 107 + 5 = 112
> 22 + 90 = 112
> 39 + 73 = 112
> middle term is 56
> average = (112*3 + 56) / 7
= 392/7 = 56 = median = (107+5)/2
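A quick check of the card's example in Python, comparing the long way (sum / N) with the endpoint shortcut. Variable names are illustrative:

```python
terms = [5, 22, 39, 56, 73, 90, 107]  # common difference is 17

long_way = sum(terms) / len(terms)           # add everything, divide by N
shortcut = (terms[0] + terms[-1]) / 2        # (first + last) / 2
middle_term = sorted(terms)[len(terms) // 2] # median of an odd-sized set

print(long_way, shortcut, middle_term)  # 56.0 56.0 56
```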
Evenly spaced sets: How do you calculate the sum of terms?
Shortcut using averages: Sum = average * N
For evenly spaced sets, we know short cuts for Average and N
Sum = [(first + last)/2] * [(last - first)/increment + 1]
LAST RESORT: Add up individual terms
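A minimal sketch of Sum = average * N for an evenly spaced set, checked against the last-resort method of adding up the individual terms. The function name is illustrative:

```python
def evenly_spaced_sum(first, last, increment):
    """Sum of an evenly spaced set using Sum = average * N."""
    n = (last - first) // increment + 1  # # of terms, endpoints inclusive
    average = (first + last) / 2         # average of an evenly spaced set
    return average * n


print(evenly_spaced_sum(5, 107, 17))  # 392.0
print(sum(range(5, 108, 17)))         # 392, adding the terms one by one
```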
How do you count the # of multiples of EITHER integer A or integer B in a set of consecutive integers?
e.g., determine the # of multiples of 3 OR 4 from 1 to 90, inclusive
= # of multiples of A + # of multiples of B - # multiples of LCM(A,B)
> need to REMOVE duplicated numbers to avoid double counting (aka multiples of A AND B = multiples of LCM of A and B)
LCM!!!!
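The A-or-B count for the 1 to 90 example, sketched in Python. The helper counts multiples with floor division (an assumed shortcut that gives the same count as the endpoint formula for positive integers), and all names are illustrative:

```python
from math import gcd


def count_multiples(k, low, high):
    """# of multiples of k in [low, high], inclusive (positive integers)."""
    return high // k - (low - 1) // k


def count_either(a, b, low, high):
    """# of multiples of a OR b: add both counts, remove the double count."""
    lcm = a * b // gcd(a, b)
    return (count_multiples(a, low, high)
            + count_multiples(b, low, high)
            - count_multiples(lcm, low, high))


# Multiples of 3 OR 4 from 1 to 90: 30 + 22 - 7
print(count_either(3, 4, 1, 90))  # 45
```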
How do you count the # of multiples of EITHER integer A or integer B, BUT NOT BOTH of those integers, in a set of consecutive integers?
e.g., determine the # of integers from 10 to 100, inclusive that are multiples of 2 OR 3, but not of both?
> the tricky part of this type of Q is that we want to remove ALL instances of multiples of LCM(A,B), not just duplicates
= # of multiples of A + # of multiples of B - 2*(# of multiples of LCM(A,B))
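A sketch of the "either but not both" count for the 10 to 100 example, with a brute-force check. The floor-division helper and the function names are assumptions made for illustration:

```python
from math import gcd


def count_multiples(k, low, high):
    """# of multiples of k in [low, high], inclusive (positive integers)."""
    return high // k - (low - 1) // k


def count_either_not_both(a, b, low, high):
    """Multiples of a or b but not both: subtract the LCM multiples twice."""
    lcm = a * b // gcd(a, b)
    return (count_multiples(a, low, high)
            + count_multiples(b, low, high)
            - 2 * count_multiples(lcm, low, high))


# Multiples of 2 OR 3 but not both, from 10 to 100: 46 + 30 - 2*15
print(count_either_not_both(2, 3, 10, 100))  # 46

# Brute-force check: exactly one of the two divisibility tests is true
print(sum(1 for n in range(10, 101) if (n % 2 == 0) != (n % 3 == 0)))  # 46
```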
Weighted average
WA = Sum of weighted terms / total frequency
= [(data point 1 * frequency 1) + (data point 2 * frequency 2) …] / total frequency of data points
= [(data point 1 * % frequency) + (data point 2 * % frequency)…], where total frequency sums to 100%
> same formula as Simple Average, except terms are not necessarily equally occurring
Tip:
> create a table with “N” and “data point”
> the frequency in the DENOMINATOR does not have to be count (N) of terms … it is whatever we want to WEIGHT THE AVERAGE BY —-> pay attention to what the question is asking for (e.g., miles per gallon = total miles / total gallons)
e.g., weight the average by TIME, weight the average by DISTANCE travelled, weight the average by # of items
> if you are given two end points and the weights and total are all unknown, always re-express the total as its components
e.g., 0.2x + 0.15y = 0.18z and we know x+y = z … replace z with x+y
0.2x + 0.15y = 0.18(x + y) —> allows you to identify relationship/ratio between the two components
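Rearranging 0.2x + 0.15y = 0.18(x + y) gives (0.2 - 0.18)x = (0.18 - 0.15)y, so x/y = 0.03/0.02 = 3/2. A minimal Python sketch of that rearrangement, using exact fractions (the function name is illustrative):

```python
from fractions import Fraction


def component_ratio(value_x, value_y, weighted_avg):
    """Ratio x : y of the quantities, from
    value_x*x + value_y*y = weighted_avg*(x + y)."""
    return (weighted_avg - value_y) / (value_x - weighted_avg)


ratio = component_ratio(Fraction(20, 100), Fraction(15, 100), Fraction(18, 100))
print(ratio)  # 3/2, i.e. x : y = 3 : 2
```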
Boundaries of simple versus weighted average of two different data points
The BOUNDARIES for average must be set based on the data points provided
(cannot be outside these boundaries!)
e.g., if 15% of sophomores use laptops and 10% of freshmen use laptops, then the % of the total group must be between 10% and 15%
The SIMPLE AVERAGE of two different data points will be the MIDPOINT between them (which is also their median)
** The WA of two different data points will be CLOSER to the data point with the GREATER # of observations or greater weighted percentage
Implication:
> Therefore, you KNOW which data point has the HIGHER FREQUENCY OF OBSERVATIONS (or greater weighted percentage)
e.g., Tickets to a play cost $10 for children and $25 for adults. If the average revenue per ticket was $18.25, which is greater than the simple average of $17.50, then there must be MORE adult tickets sold than children's tickets sold
Also note:
> when there are only two data points, the sum of the weighted percentages = 100%
> If you think about a teeter-totter with endpoints set by the “boundaries” A and B…
(1) (distance from A to the WA) / (distance from A to B) = % B
(2) (distance from the WA to B) / (distance from A to B) = % A
(we KNOW the ratio of the QUANTITY of the two data points)
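The teeter-totter idea applied to the ticket example ($10 children, $25 adults, WA of $18.25), as a minimal Python sketch. The function name is illustrative, and distances are measured as a share of the full A-to-B span:

```python
from fractions import Fraction


def shares_from_weighted_average(a, b, wa):
    """Return (% A, % B) given boundaries a and b and the weighted average."""
    share_b = (wa - a) / (b - a)  # distance from A to the WA, over the span
    share_a = 1 - share_b         # distance from the WA to B, over the span
    return share_a, share_b


child_share, adult_share = shares_from_weighted_average(
    Fraction(10), Fraction(25), Fraction("18.25")
)
print(child_share, adult_share)   # 9/20 11/20, i.e. 45% children, 55% adults
print(adult_share / child_share)  # 11/9, adults : children
```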
How do you calculate weighted average if the percentages (that represent frequency) do not add up to 100%?
e.g., In a department store, 12% of the customers spend exactly $10 each, 18% spend exactly $20 each, and the rest spend more than $20 each. What is the average amount spent per person for all those who spend $20 or less?
Divide the sum BY the percentages we DO have
> can think of it as having 100 items and then being asked about a subset of them
e.g., (12*10 + 18*20)/(12 + 18) = 480/30 = $16
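A small sketch of the department-store example: divide by the sum of the weights we actually have (12 + 18), not by 100. The function name is illustrative:

```python
def weighted_average(pairs):
    """Weighted average of (value, weight) pairs; the weights may be counts
    or percentages and do not need to add up to 100."""
    total_weight = sum(weight for _, weight in pairs)
    return sum(value * weight for value, weight in pairs) / total_weight


# 12% spend $10 each and 18% spend $20 each; the rest are excluded
print(weighted_average([(10, 12), (20, 18)]))  # 16.0
```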
Percentages and ratios
If you KNOW the percentage of two data points, you KNOW the ratio of the data points
Similarly, if you KNOW the ratio, you KNOW the percentage of the data points
e.g., Data point A has frequency of 20% while data point B has a frequency of 80%
Therefore, A = 20% = 1/5
B = 80% = 4/5
Ratio of A:B = 1:4
A/B = 1/4 = 20%/80%
You can ALSO re-express relationship as A = B/4 and replace A
> For WA calculations, you will notice that “B” cancels out in the numerator and denominator, leaving you with a numerical answer for weighted average!
Using ratios and fractions when solving weighted averages
Given the value of two data points AND the ratio or fraction of the quantity of two data points, we can calculate the weighted average of the two data points
Similarly, given the value of two data points AND the weighted average of the two data points, we can calculate the RATIO or fraction of the quantity of the two data points
(you just need one ratio to know the other ratio since there are only two data points)
Why does this work?
> linear equation with one unknown
e.g., the two data points have values 1000 and 2000, A and B are their quantities, and the ratio of A to B is 1/2 …
A/B = 1/2 —-> A = B/2
WA = (1000A + 2000B)/(A+B) —> replace A with B/2
= (1000B/2 + 2000B)/(B/2 + B) —-> B’s cancel out
= (500 + 2000)/(3/2)
= 5000/3
= 1666.67
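A sketch showing why the B's cancel: any quantities in the ratio 1:2 give the same weighted average. The function name and the scale factors are illustrative:

```python
from fractions import Fraction


def wa_from_ratio(value_a, value_b, parts_a, parts_b):
    """Weighted average of two values whose quantities are parts_a : parts_b."""
    return Fraction(value_a * parts_a + value_b * parts_b, parts_a + parts_b)


# Quantities 1:2, 7:14 and 250:500 all have the ratio 1/2
for scale in (1, 7, 250):
    print(scale, wa_from_ratio(1000, 2000, 1 * scale, 2 * scale))  # 5000/3 each time
```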
** Key characteristic for this problem type:
> two data points only,
> known boundaries
> AND either WA or ratio of quantities
(but if you know WA AND ratio of quantities, and ONE boundary, then you can solve for the remaining boundary’s data point)
What does median mean?
How do you calculate the POSITION of the median value?
Median is the value that is the MIDDLE of the ARRANGED set (numerical order from lowest to highest)
> 50% of the data points fall BELOW the median and 50% of the data points fall ABOVE the median (NOT including the median value itself)
To find the median, we need to find its POSITION in the set:
(A) Sets with an ODD number of terms (N): Round UP N/2 to the nearest whole number
> e.g., 7/2 = 3.5 –> median is located at position 4
Alternatively, median is at the (N+1)/2 position
(B) Sets with an EVEN number of terms (N): ADD 0.5 to N/2
e.g., 6/2 = 3 –> median is located at position 3.5 (average of values at position 3 and 4)
Alternatively, median averages the values at the N/2 and (N+2)/2 positions
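The position rules can be sketched in Python as follows. Positions on the card are 1-based, so the code subtracts 1 to get list indexes; the function name is illustrative:

```python
def median(values):
    """Median via the position rules: (N+1)/2 for odd N,
    average of positions N/2 and (N+2)/2 for even N."""
    ordered = sorted(values)  # the set must be ARRANGED first
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]  # position (N+1)/2
    lower = ordered[n // 2 - 1]           # position N/2
    upper = ordered[(n + 2) // 2 - 1]     # position (N+2)/2
    return (lower + upper) / 2


print(median([107, 5, 39, 22, 90, 73, 56]))  # 56   (N = 7, position 4)
print(median([6, 1, 4, 2, 5, 3]))            # 3.5  (N = 6, positions 3 and 4)
```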
What is the bare minimum you need to be able to calculate the Median of a set?
For sets with an ODD number of terms –> need to know the value of the MIDDLE term
For sets with an EVEN number of terms –> need to know the values of the PAIR OF MIDDLE terms
Therefore, you do NOT need to know all the values of a set (just the middle ones)!
For DS Qs, it is helpful to test different positions of unknown terms to see if that changes the value of the median
Mode - can you have more than one? Can you have none?
e.g., what is the mode of the set {1,2,3,1,2,3}
You can have zero modes, one mode, or more than one mode
Mode = the number that appears most frequently in a data set (need to track FREQUENCY of appearance of each data point)
> if each number appears the SAME NUMBER OF TIMES (1x, 2x, 3x etc.), then the set has NO MODE
e.g., set {1,2,3,1,2,3} has each number occurring 2x, so there is NO MODE
Helpful tip is to create a table with frequency (N) and data point
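A minimal sketch of the frequency-table approach using a counter. It follows the convention above that a set has NO MODE when every number appears the same number of times; treating a set with only one distinct value as having that value as its mode is an assumption of this sketch, and the function name is illustrative:

```python
from collections import Counter


def modes(values):
    """Return the list of modes (possibly empty, possibly more than one)."""
    counts = Counter(values)  # frequency table: data point -> N
    if not counts:
        return []
    highest = max(counts.values())
    # All frequencies equal across several distinct values: no mode
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 3, 1, 2, 3]))  # []      each number occurs 2x, so no mode
print(modes([1, 2, 2, 3]))        # [2]     one mode
print(modes([1, 1, 2, 2, 3]))     # [1, 2]  more than one mode
```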