Data Analysis Flashcards
Quantitative or numerical variables
Result is a number (age, height, etc.)
Categorical or nonnumerical variables
Result is something other than a number (eye color, person voted for, etc.)
Frequency or count
Number of times a value (or category) appears in the data
Relative frequency
Frequency of a value divided by the total number of data values (expressed as a fraction, decimal, or percent)
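A minimal sketch of frequency and relative frequency. The eye-color sample is hypothetical and only there to make the arithmetic concrete.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (eye color).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

# Frequency: number of times each value appears in the data.
frequency = Counter(eye_colors)

# Relative frequency: frequency divided by the total number of data values.
total = len(eye_colors)
relative_frequency = {value: count / total for value, count in frequency.items()}

print(frequency["brown"])           # 3
print(relative_frequency["brown"])  # 0.5
```

The relative frequencies of all values always add up to 1 (or 100%).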
Histograms (4 things)
Show data grouped into intervals (often as relative frequency in percent). Unlike bar graphs, there are NO gaps between bars; a gap indicates no data for that interval. Useful for identifying the shape or spread of the data.
Measures of central tendency
Goal: find the “center” of the data. Mean, median, and mode.
Weighted mean
Multiply each DIFFERENT value by its frequency, add the products, and divide by the sum of the frequencies (the total number of data values, not the number of different values)
Ex: 2, 4, 5, 5, 6, 6, 6, 7, 9
[(2) + (4) + 2(5) + 3(6) + (7) + (9)] / 9 = 50 / 9 ≈ 5.556
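A short sketch of the weighted mean for the example list: each distinct value is multiplied by its frequency, and the sum is divided by the total of the frequencies (9 data values here, not the 6 distinct values). The variable names are illustrative.

```python
from collections import Counter

data = [2, 4, 5, 5, 6, 6, 6, 7, 9]

# Map each distinct value to its frequency, e.g. 5 -> 2 and 6 -> 3.
frequency = Counter(data)

# Numerator: each distinct value times its frequency.
weighted_sum = sum(value * count for value, count in frequency.items())

# Denominator: the sum of the frequencies (the number of data values).
total_count = sum(frequency.values())

weighted_mean = weighted_sum / total_count
print(weighted_sum, total_count)   # 50 9
print(round(weighted_mean, 3))     # 5.556
```

The result matches the ordinary mean of the full list, which is a quick way to check a weighted-mean calculation.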
Which measure of central tendency is least affected by outliers?
The median
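To see why, here is a small sketch comparing the mean and the median before and after one value is replaced by an outlier. The second list is a hypothetical variation of the first, with 9 changed to 90.

```python
from statistics import mean, median

data = [2, 4, 5, 5, 6, 6, 6, 7, 9]
with_outlier = [2, 4, 5, 5, 6, 6, 6, 7, 90]  # hypothetical: 9 replaced by 90

# The mean shifts noticeably; the median stays at the middle value.
print(round(mean(data), 2), median(data))                  # 5.56 6
print(round(mean(with_outlier), 2), median(with_outlier))  # 14.56 6
```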
Measures of position (6)
Least value, greatest value, median, quartiles (3 values that divide the data into 4 groups), percentiles (99 values that divide the data into 100 groups)
How to calculate the 1st and 3rd quartiles
Put the data in an ordered list and find the median. The 1st quartile is the median of the lower half of the data (the values below the overall median); the 3rd quartile is the median of the upper half
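A sketch of the median-of-halves rule. It assumes that when the number of values is odd, the middle value is left out of both halves; other conventions exist and can give slightly different quartiles. The function name `quartiles` is illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)            # quartiles need an ordered list
    half = len(ordered) // 2
    lower = ordered[:half]              # values below the overall median
    upper = ordered[len(ordered) - half:]  # values above the overall median
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([2, 4, 5, 5, 6, 6, 6, 7, 9])
print(q1, q2, q3)  # 4.5 6 6.5
```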
Measures of dispersion (3)
indicate the degree of spread of the data
range, interquartile range, standard deviation
Interquartile range
difference between 3rd quartile and 1st quartile (measures the spread of the middle half of the data; less susceptible to outliers)
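A sketch comparing the range and the interquartile range when one value becomes an outlier. It assumes the median-of-halves rule for quartiles (middle value excluded when the count is odd); the helper name `iqr` and the second list are illustrative.

```python
from statistics import median

def iqr(values):
    """Interquartile range: Q3 - Q1, using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return median(upper) - median(lower)

data = [2, 4, 5, 5, 6, 6, 6, 7, 9]
with_outlier = [2, 4, 5, 5, 6, 6, 6, 7, 90]  # hypothetical: 9 replaced by 90

# The range reacts strongly to the outlier; the IQR does not change.
print(max(data) - min(data), iqr(data))                          # 7 2.0
print(max(with_outlier) - min(with_outlier), iqr(with_outlier))  # 88 2.0
```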
How to find the standard deviation (5 steps)
- find the mean
- find the difference between the mean and each value
- square each difference
- find the average of the squared differences
- take the square root of the average
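The five steps above can be sketched as follows. The data set is hypothetical and chosen so the numbers come out evenly. Because step 4 averages over all n values, this is the population standard deviation.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

# Step 1: find the mean.
mean = sum(data) / len(data)                        # 5.0

# Step 2: find the difference between the mean and each value.
differences = [value - mean for value in data]

# Step 3: square each difference.
squared = [difference ** 2 for difference in differences]

# Step 4: find the average of the squared differences.
average_squared = sum(squared) / len(squared)       # 4.0

# Step 5: take the square root of the average.
standard_deviation = math.sqrt(average_squared)
print(standard_deviation)  # 2.0
```

Python's `statistics.pstdev` follows the same divide-by-n rule, while `statistics.stdev` divides by n - 1 and gives a slightly larger result.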
The mean is X SD away from the mean.
The mean is 0 SD from the mean
Most data fall within X SD of the mean
3 SD
How many elements does set S have?
{1, 2, 3, 2}
3
How many elements does set S have?
{3, 1, 2}
3
T/F: {1, 2, 3, 2} and {3, 1, 2} are the same set
True
In a set, repetitions…
and order…
repetitions are not counted
order does not matter
In a list, repetitions…
and order…
repetitions are counted
order does matter
T/F 1, 2, 3, 2 and 1, 2, 2, 3 are the same list
False
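Python's built-in `set` and `list` types happen to follow the same rules, so the set and list cards above can be checked with a short sketch:

```python
# Sets: repetitions are not counted and order does not matter.
set_a = {1, 2, 3, 2}
set_b = {3, 1, 2}
print(len(set_a))       # 3
print(set_a == set_b)   # True

# Lists: repetitions are counted and order matters.
list_a = [1, 2, 3, 2]
list_b = [1, 2, 2, 3]
print(len(list_a))        # 4
print(list_a == list_b)   # False
```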
A ∪ C =
The union of sets A and C: all elements that are in A, in C, or in both
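A ∩ C = ∅ (the intersection is empty)
Sets A and C are mutually exclusive: they have no elements in common (A ∩ C, the intersection, is the set of elements in both A and C)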
Inclusion-exclusion principle
The number of elements in the union of two sets equals the sum of their individual numbers of elements minus the number of elements in their intersection
(Think chemistry, algebra, physics problem)
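A sketch of the principle with two hypothetical class rosters. Students in both classes would be counted twice by a plain sum, so the overlap is subtracted once.

```python
# Hypothetical rosters; Cy and Dee take both classes.
chemistry = {"Ana", "Ben", "Cy", "Dee"}
physics = {"Cy", "Dee", "Eli"}

overlap = chemistry & physics   # intersection: students in both classes
either = chemistry | physics    # union: students in at least one class

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
by_formula = len(chemistry) + len(physics) - len(overlap)
print(by_formula, len(either))  # 5 5
```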
Multiplication principle
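Two choices are made one after the other, and the number of options for the second choice does not depend on the first. If the first choice has k possibilities and the second has m, there are k × m possibilities in total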
Ex: 5 entrées, 3 desserts = 15 different meal combinations
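The meal example can be checked by listing every pairing. The menu item names below are placeholders.

```python
from itertools import product

# Placeholder menu: 5 entrées and 3 desserts.
entrees = ["entree_1", "entree_2", "entree_3", "entree_4", "entree_5"]
desserts = ["dessert_1", "dessert_2", "dessert_3"]

# Every (entrée, dessert) pair is one possible meal.
meals = list(product(entrees, desserts))
print(len(meals))                    # 15
print(len(entrees) * len(desserts))  # 15
```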
0!
1
permutations of n objects taken k at a time (select and order k objects from a group of n objects)
n! / (n-k)!
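A sketch checking the formula against a direct listing, using a hypothetical case of n = 5 objects taken k = 2 at a time. `math.perm` requires Python 3.8 or later.

```python
import math
from itertools import permutations

n, k = 5, 2

# Formula: n! / (n - k)!
by_formula = math.factorial(n) // math.factorial(n - k)

# Direct listing: every ordered selection of k distinct objects.
by_listing = len(list(permutations("ABCDE", k)))

print(by_formula, by_listing, math.perm(n, k))  # 20 20 20
```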
Permutations vs. combinations
Permutations–order DOES matter (can’t repeat or put back, etc.)
Combinations–order does NOT matter
combinations of n objects taken k at a time (n choose k)
n! / (k! (n-k)!)
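A sketch for a hypothetical case of n = 5 objects taken k = 2 at a time. Note that k! and (n-k)! are both in the denominator. `math.comb` requires Python 3.8 or later.

```python
import math
from itertools import combinations

n, k = 5, 2

# Formula: n! / (k! (n - k)!)
by_formula = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# Direct listing: every unordered selection of k distinct objects.
by_listing = len(list(combinations("ABCDE", k)))

print(by_formula, by_listing, math.comb(n, k))  # 10 10 10

# Each combination can be ordered in k! ways, which gives the permutation count.
print(math.comb(n, k) * math.factorial(k))      # 20
```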
Probability formula
(number of outcomes in the event, i.e., the outcomes that fit the condition) / (total number of possible outcomes), when all outcomes are equally likely
Ex: probability of rolling an even number on a die: 3 (2, 4, 6) / 6 = 1/2
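The die example as a short sketch, assuming a fair six-sided die so that all outcomes are equally likely. `Fraction` keeps the result exact.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]                              # all possible rolls
event = [roll for roll in outcomes if roll % 2 == 0]        # even rolls: 2, 4, 6

# Outcomes in the event divided by total possible outcomes.
probability = Fraction(len(event), len(outcomes))
print(probability)  # 1/2
```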
If event E is certain to occur, then P is…
1
If event E is certain NOT to occur, then P is…
0
If an event is possible but not certain, then P is…
between 0 and 1
The probability that an event will NOT occur is equal to…
1 - P(E), where P(E) is the probability that it will occur (outcomes in the event / total possible outcomes)
P(E or F) =
in general
P(E) + P(F) - P(E and F)
P(E or F) =
if E and F are mutually exclusive
P(E) + P(F) if E and F are mutually exclusive
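A sketch of both "or" rules on a fair six-sided die. The events and the helper `prob` are illustrative choices, not part of the cards.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of rolls) on a fair die."""
    return Fraction(len(event & outcomes), len(outcomes))

even = {2, 4, 6}
greater_than_4 = {5, 6}
odd = {1, 3, 5}

# General rule: subtract the overlap so it is not counted twice.
general = prob(even) + prob(greater_than_4) - prob(even & greater_than_4)
print(general, prob(even | greater_than_4))  # 2/3 2/3

# Mutually exclusive events have no overlap, so the probabilities simply add.
exclusive = prob(even) + prob(odd)
print(exclusive, prob(even | odd))           # 1 1
```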
P(E and F) =
P(E) P(F) if E and F are independent
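A sketch of the product rule using two fair dice, where the first roll does not affect the second. The events and the helper `prob` are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) results.
rolls = list(product(range(1, 7), repeat=2))

def prob(condition):
    """Share of the 36 results that satisfy the condition."""
    return Fraction(sum(1 for roll in rolls if condition(roll)), len(rolls))

p_e = prob(lambda roll: roll[0] % 2 == 0)                     # E: first die is even
p_f = prob(lambda roll: roll[1] > 4)                          # F: second die is above 4
p_both = prob(lambda roll: roll[0] % 2 == 0 and roll[1] > 4)  # E and F

print(p_e, p_f, p_both)  # 1/2 1/3 1/6
```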
What is the link between data distributions and probability distributions?
For a random variable that represents a randomly chosen value from a distribution of data, the probability distribution of the random variable is the same as the relative frequency distribution of the data.
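A sketch of that link: if X is a value picked at random from the list below, the probability of each value is its relative frequency. The data list is a small illustrative example.

```python
from collections import Counter
from fractions import Fraction

data = [2, 4, 5, 5, 6, 6, 6, 7, 9]

# Relative frequency distribution of the data...
counts = Counter(data)
distribution = {value: Fraction(count, len(data)) for value, count in counts.items()}

# ...is also the probability distribution of a randomly chosen value X.
print(distribution[6])             # 1/3, so P(X = 6) = 1/3
print(sum(distribution.values()))  # 1
```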
4 properties of a bell curve
- mean, median, and mode are all nearly equal
- data are grouped fairly symmetrically around the mean
- about 2/3 of the data are within 1 SD of the mean
- almost all of the data are within 2 SD of the mean
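A sketch checking these proportions, assuming the bell curve is an exact normal distribution (real data only approximate it). `statistics.NormalDist` requires Python 3.8 or later; the helper `share_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827  (about 2/3)
# 2 0.9545  (almost all)
# 3 0.9973
```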