unit 1 - chapter 2 - descriptive statistics Flashcards
mean, median and mode
Where’s the middle of the distribution (shows curved to left bar graph)
Mode highest point on graph
Median will be somewhere in the middle
Mean will be pulled/dragged by the outliers
Bell curve (symmetric) means the mean, median and mode are all the same
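A quick sketch of how one outlier drags the mean while the median and mode barely move. The viewing-hours numbers are invented for illustration and are not from the class graph.

```python
from statistics import mean, median, mode

# Hypothetical data: hours watched per day (made up for this sketch)
hours = [1, 2, 2, 2, 3, 3, 4]
with_outlier = hours + [30]   # add one extreme value

# Without the outlier the three measures sit close together
print(mean(hours), median(hours), mode(hours))

# With the outlier only the mean moves far; median and mode barely change
print(mean(with_outlier), median(with_outlier), mode(with_outlier))
```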
levels of data… measurements to use
1 - Nominal… Mode
2 - Ordinal… Median (p50/50th percentile)
3 - Interval… Mean
4 - Ratio… Mean
nominal - mode
Mode for category or value for the graph of top billion dollar content companies?
Disney is the mode it is the top of the chart (the most)
Mode for Netflix original content hours?
Drama and Kids because they are the top of the chart (the most)
Mode for time per day on Netflix
Highest point on the graph is Sunday (2.10 HRS)
It is not Tuesday, Wednesday, or Friday, even though they share the same value (1:30 HRS) and together add up to the most
ordinal - median
Median for movie rating and attendance?
For value it is 300 which is PG13 based off attendances and going half way up
Median for levels of pain and frequency
3.5 level of pain based off of half way up of frequency
Median position = PCT × (n + 1), counted in the ordered data
PCT = the percentile as a decimal (0.50 for the median)
n = sample size
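A minimal sketch of the position rule PCT × (n + 1). The helper names and the pain-level data are hypothetical. When the position falls between two values, this sketch averages toward the upper neighbour, which only makes sense if the levels can be treated as numbers.

```python
def percentile_position(pct, n):
    """1-based position in the ordered data: PCT * (n + 1)."""
    return pct * (n + 1)

def value_at_position(sorted_data, pos):
    """Value at a 1-based position, interpolating between neighbours."""
    lower = int(pos)      # whole part of the position
    frac = pos - lower    # fractional part of the position
    if frac == 0:
        return sorted_data[lower - 1]
    below = sorted_data[lower - 1]
    above = sorted_data[lower]
    return below + frac * (above - below)

# Hypothetical pain levels, ordered from smallest to largest
levels = sorted([3, 1, 4, 2, 5, 2])
pos = percentile_position(0.50, len(levels))   # 0.50 * (6 + 1)
med = value_at_position(levels, pos)
print(pos, med)
```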
interval and ratio - mean
Add everything up and divide by N
x̄ = Σx / n
Mean for blended strawberry
CF is cumulative frequency
N = 52 (CF top number)
x̄ = Σ(x × f) / n
Sum of (units × frequency), divided by the top CF
Check this…..
The range is
Get it from units!!
= 60-55
= 5
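A sketch of the frequency-table mean and the range. The table below is invented: only n = 52 and the units running from 55 to 60 come from the notes, and the individual frequencies are assumptions.

```python
# Hypothetical frequency table: units -> frequency (frequencies are made up)
freq = {55: 5, 56: 8, 57: 12, 58: 14, 59: 9, 60: 4}

n = sum(freq.values())                        # top cumulative frequency (CF)
total = sum(x * f for x, f in freq.items())   # sigma(x * f)
x_bar = total / n                             # mean from the table

# The range comes from the units column, not from the frequencies
data_range = max(freq) - min(freq)

print(n, x_bar, data_range)
```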
mode
components:
quantity:
outliers:
components: no formula
quantity: one or more
outliers: not affected
median
components:
quantity:
outliers:
components: size of dataset
quantity: only one
outliers: not affected
mean
components:
quantity:
outliers:
components: dataset size and data points
quantity: only one
outliers: affected
4 levels of data
- Nominal - variation ratio
- Ordinal - median deviation
- Interval - standard deviation
- Ratio - standard deviation
standard deviation vs variance
sd
(more risk) Sample = s
Population = σ (sigma)
variance
Sample = s²
Population = σ² (sigma squared)
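A short sketch of the four symbols using Python's standard `statistics` module, where `stdev`/`variance` divide by n − 1 (sample) and `pstdev`/`pvariance` divide by n (population). The data values are hypothetical.

```python
from statistics import stdev, pstdev, variance, pvariance

# Hypothetical data set (made up for this sketch)
data = [2, 4, 4, 4, 5, 5, 7, 9]

s = stdev(data)            # sample standard deviation, s
sigma = pstdev(data)       # population standard deviation, sigma
s2 = variance(data)        # sample variance, s squared
sigma2 = pvariance(data)   # population variance, sigma squared

print(s, sigma, s2, sigma2)
```

The sample versions come out slightly larger because they divide by n − 1 instead of n.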
population = parameter
statistic = sample
standard deviation
Different answers: s or σ (s² or σ²)
Easier to solve by hand
Square the deviations in the numerator because Σ(x − x̄) = 0
Downside of s² and σ²
Problem is that variance is in squared units, so its magnitude is greater than that of the data
Answer is squared
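A by-hand sketch of why the deviations get squared: left as they are, they always cancel to zero. The numbers are hypothetical.

```python
# Hypothetical data set (made up for this sketch)
data = [4, 8, 6, 10, 12]
n = len(data)
x_bar = sum(data) / n

# Raw deviations from the mean cancel out
deviations = [x - x_bar for x in data]
print(sum(deviations))

# Squaring removes the signs, giving the sample variance (squared units)
s2 = sum(d ** 2 for d in deviations) / (n - 1)

# The square root brings the answer back to the scale of the data
s = s2 ** 0.5
print(s2, s)
```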
standard deviation
Will not zero out
Is based on the mean
Average distance (ruler)
Same scale as original data
Quiz question: Thus the standard deviation is the..
Standard (benchmark)*
Note: SD is influenced by outliers
standard deviation is used for
Used as a descriptor
Used to normalize data
In business as a measure of volatility, risk, control and outcome assignment
the variance
Historical value
Appears as an element of aggregate variability in statistical tools such as ANOVA, SLR (simple linear regression), and multiple regression
Dropped in bigger formulas
facebook practice problem
Can you calculate the mean for less than $40 million
What facebook said they would do:
Avg duration of video viewed = total time watched / total number of users watching video
What they actually did:
Avg duration of video viewed = total time watched / total number of users watching video for 3 or more seconds
A smaller denominator makes the average bigger
Average duration metrics were inflated 150-200%
Highlight clips increased this number
Tik tok vs youtube videos on facebook
Calculate the mean for less than 40 million dollar contract for Facebook
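A sketch of the denominator effect with invented numbers. None of these figures are Facebook's; they only show how dropping short views from the denominator raises the average.

```python
# All numbers below are hypothetical, chosen only to show the mechanism
total_seconds_watched = 10_000
all_viewers = 1_000          # everyone who started the video
viewers_3s_or_more = 400     # only those who watched 3 or more seconds

# Average as described: divide by everyone who watched
avg_promised = total_seconds_watched / all_viewers

# Average as computed: divide by the smaller group
avg_reported = total_seconds_watched / viewers_3s_or_more

# Percentage by which the reported average exceeds the promised one
inflation_pct = (avg_reported - avg_promised) / avg_promised * 100
print(avg_promised, avg_reported, inflation_pct)
```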
quartiles vs percentiles
Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile.
The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are less than nine.
The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data.
To calculate quartiles and percentiles, the data must be ordered from smallest to largest. Quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths.
To score in the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 90% of test scores are the same or less than your score and 10% of the test scores are the same or greater than your test score.
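A sketch of quartiles, the interquartile range, and a percentile rank using `statistics.quantiles` (Python 3.8+). Its default "exclusive" method places cut points at positions based on n + 1. The exam scores and the `percentile_rank` helper are hypothetical.

```python
from statistics import quantiles

# Hypothetical exam scores, ordered from smallest to largest
scores = sorted([52, 58, 61, 64, 67, 70, 73, 77, 81, 88, 95])

# Three cut points split the ordered data into quarters
q1, q2, q3 = quantiles(scores, n=4)
iqr = q3 - q1   # spread of the middle 50% of the data

def percentile_rank(data, score):
    """Percent of values that are the same as or less than the score."""
    return 100 * sum(1 for x in data if x <= score) / len(data)

rank_88 = percentile_rank(scores, 88)
print(q1, q2, q3, iqr, rank_88)
```

A score of 88 here lands near the 91st percentile even though it is not 91% on the test.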
sample mean vs population mean
The letter used to represent the sample mean is an x with a bar over it (pronounced “x bar”): x̄
The Greek letter μ (pronounced “mew”) represents the population mean. One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random.
- when is the mean = median
- when is the mean > median
- when the distribution is symmetrical
- when the distribution is skewed to the right
mean not mode!
where does the average lie?
why is the standard deviation important?
- provides a numerical measure of the overall amount of variation in a data set, and
- can be used to determine whether a particular data value is close to or far from the mean.
- the standard deviation provides a measure of the overall variation in a data set
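A sketch of using the standard deviation as a ruler: how many standard deviations a value sits from the mean. The function name and the data are hypothetical, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def sds_from_mean(value, data):
    """Signed distance of value from the mean, measured in sample SDs."""
    return (value - mean(data)) / stdev(data)

# Hypothetical data set (made up for this sketch)
data = [10, 12, 14, 16, 18]

close = sds_from_mean(14, data)   # a value right at the mean
far = sds_from_mean(18, data)     # the largest value in the set
print(close, far)
```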
variability in samples
Observational or measurement variability
Natural variability
Induced variability
Sample variability
variability in samples - measurement variability
Measurement variability occurs when there are differences in the instruments used to measure or in the people using those instruments.
If we are gathering data on how long it takes for a ball to drop from a height by having students measure the time of the drop with a stopwatch, we may experience measurement variability if the two stopwatches used were made by different manufacturers.
variability in samples - natural variability
Natural variability arises from the differences that naturally occur because members of a population differ from each other.
For example, if we have two identical corn plants and we expose both plants to the same amount of water and sunlight, they may still grow at different rates simply because they are two different corn plants.
variability in samples - induced variability
Induced variability is the counterpart to natural variability; this occurs because we have artificially induced an element of variation (that, by definition, was not present naturally):
For example, we assign people to two different groups to study memory, and we induce a variable in one group by limiting the amount of sleep they get.
variability in samples - sample variability
Sample variability occurs when multiple random samples are taken from the same population. For example, if I conduct four surveys of 50 people randomly selected from a given population, the differences in outcomes may be affected by sample variability.
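A small simulation sketch of sample variability: four random surveys of 50 from the same population give four different means. The population is invented (normally distributed around 50), and the seed is fixed only so the sketch repeats.

```python
import random
from statistics import mean

random.seed(0)   # fixed seed so reruns give the same draws

# Hypothetical population of 10,000 values (made up for this sketch)
population = [random.gauss(50, 10) for _ in range(10_000)]

# Four surveys, each a random sample of 50 people from that population
survey_means = [mean(random.sample(population, 50)) for _ in range(4)]

for i, m in enumerate(survey_means, start=1):
    print(f"survey {i}: mean = {m:.2f}")
```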