Module 2 - Section 2 Flashcards
What graphs are best for smaller data sets of numerical variables?
Stem plots and dot plots
What graphs are best for large data sets of quantitative data?
histograms
appearance of a dot plot?
y-axis: frequency
x-axis: name of variable and the values that the data will fall between
(sketch: columns of dots stacked above the values on the x-axis)
-one dot above the axis at the position of each data point
-dots stack above a value that occurs more than once (frequency greater than one)
-(i don’t know look at notes if you are confused)
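A rough sketch of how the dots stack up, assuming small single-digit integer values so the columns line up in plain text. The function name dot_plot and the sample data are made up for illustration.

```python
from collections import Counter

def dot_plot(values):
    """Return the rows of a text dot plot, top row first, axis labels last."""
    counts = Counter(values)
    xs = range(min(values), max(values) + 1)
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        # Put a dot in a column only if that value occurs at least `level` times.
        rows.append(" ".join("." if counts[x] >= level else " " for x in xs))
    # Last row: the values along the x-axis.
    rows.append(" ".join(str(x) for x in xs))
    return rows

sample = [1, 2, 2, 3, 3, 3, 5]  # hypothetical data
print("\n".join(dot_plot(sample)))
```

The height of each column is the frequency of that value, and every data point is still visible as its own dot.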
stem
the leading digits of the number in the data
ex: 75 has leading digit or stem 7
100 could have leading digits 100 or 1 (depending on the data)
leaf
the last digit of the number in the data
ex: 75 has leaf 5
a key is required for …
a stemplot
bins
equal-width intervals that group together data values that are close to one another
ex: 70-79 is one bin if 7 is the stem and 0-9 are the leaves
appearance of stemplot
stem | leaves
4 | 0
5 |
6 | 05588
7 | 00000455
8 | 5
9 | 05
Price of Walking shoes
8|5 represents $85
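A minimal sketch of building a stem plot by hand, assuming non-negative whole numbers where the leaf is the last digit and the stem is everything before it. The function name stem_plot is hypothetical; the prices are read back from the walking shoes stem plot.

```python
from collections import defaultdict

def stem_plot(values):
    """Return the lines of a text stem plot (leaf = last digit)."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # 75 -> stem 7, leaf 5
        leaves_by_stem[stem].append(str(leaf))

    lines = []
    # Walk every stem from smallest to largest so empty stems still show up.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

# Shoe prices in dollars, read from the stem plot (8|5 represents $85).
prices = [40, 60, 65, 65, 68, 68, 70, 70, 70, 70, 70, 74, 75, 75, 85, 90, 95]
print("\n".join(stem_plot(prices)))
```

Keeping the empty stem (5) matters: leaving it out would hide the gap between 40 and the rest of the prices.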
back-to-back stem plots
-used for the comparison of the distribution of two groups
leaves | stem | leaves
-still require key
-leaves increase in value as you move away from the stem, so the left-side group reads from right to left
left inclusion
-interval notation as [a,b)
so a on the left is included but not b
-used for histograms along the x-axis to organize bins
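A small sketch of left inclusion when counting values into bins, assuming whole-number data and equal-width bins. The function name bin_counts, its parameters, and the scores are invented for illustration.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in left-inclusive bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)  # index of the bin whose left edge is <= v
        if 0 <= k < n_bins:            # values outside every bin are ignored
            counts[k] += 1
    return counts

scores = [60, 69, 70, 70, 79, 80, 89]  # hypothetical data
# Bins are [60,70), [70,80), [80,90): a score of 70 goes in the second bin.
print(bin_counts(scores, start=60, width=10, n_bins=3))
```

A value sitting exactly on a boundary always belongs to the bin that starts there, never the one that ends there.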
histogram appearance
- bins on x-axis
- frequency or relative frequency on y-axis
- bars with no spaces between (unless there is an empty bin)
For dot plots, stem plots, and histograms, which retain all data values and which do not?
dot plots and stem plots retain all data values; histograms do not
how can we describe the distribution of a plot?
shapes - modes, symmetry or skewness, deviation or outliers
center
spread
mode(s)
number of bumps / humps / peaks
uniform
no modes, square / rectangle appearance
unimodal
a single peak
bimodal
two peaks
ex: heights of adults and children will have two peaks, one for adults and one for children
multimodal
more than two peaks; rarely occurs (except for covid?)
symmetry
when the left and right sides of the graph are approximately mirror images of each other
if you didn’t get this…I am ashamed lol
non symmetric graphs are
skewed
skewed to the right
positively skewed
peaks quickly and then slowly trickles down to the right
as if the tail end of the peak on the right has been pulled to the right
negatively skewed
skewed to the left
the left tail is extended and longer than the right tail (if the peak is essentially symmetric)
(sketch: long tail trailing off to the left, peak toward the right)
Outlier
an observation that deviates from the overall pattern of the graph
numerical summaries
a few important and meaningful numbers that preserve the relevant features of the data set so that you can draw useful conclusions
y
variable of interest
the variable for which we have sample data
n
the sample size / number of observations of the variable y
y₁
the first sample observation of the variable y
yₙ
the nth sample observation of the variable y
center and examples
the value that splits the data in half, or a typical range of values at the center of the graph
median, mean, mode
spread and examples
how much do the data values vary around the center?
the range of values, concentration, are most values close to or far from the center?
range, standard deviation, IQR
$\sum_{i=1}^{n} y_i$
What is this? describe all elements.
n is the upper boundary (the last index)
i is the index, and i=1 is the lower boundary
where the sum runs from the 1st to the nth piece of data
Σ is sigma or summation
This describes adding all of the values of y
used to find the mean
ȳ
y bar is the mean
$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$
aka the sum of all the values in a data set divided by the number of observations
M
median
the value that divides the ordered sample into two equal-sized halves
when n is odd, it is the middle value
when n is even, it is the mean of the two middle values
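The odd / even rule can be sketched directly in a few lines. This is a minimal illustration with a hypothetical function name; Python's built-in statistics.median follows the same rule.

```python
def median(values):
    """Middle value of the ordered sample (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # n odd: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: mean of the two middle values

print(median([3, 1, 2]))      # n = 3, odd
print(median([4, 1, 3, 2]))   # n = 4, even
```

Note that the data must be sorted first; the "middle" of an unordered list means nothing.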
mean vs median
mean is affected by outliers, while the median is resistant to outliers or skewness
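A quick way to see the difference is to compare both measures on the same data with and without one extreme value. The data and the helper name center_summary are made up for illustration.

```python
from statistics import mean, median

def center_summary(values):
    """Return (mean, median) so the two can be compared side by side."""
    return mean(values), median(values)

typical = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]  # same data, but the largest value is extreme

print(center_summary(typical))
print(center_summary(with_outlier))
```

The outlier drags the mean far upward while the median stays where it was, which is what "resistant" means here.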
mode
the value that occurs with the highest frequency in a data set
may be more than one mode
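A short sketch of finding the mode(s) by counting, which also shows how a data set can have more than one. The function name modes and the data are hypothetical.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 3]))  # 2 and 3 both occur twice
print(modes([7, 7, 1]))        # a single mode
```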
center values of symmetric, right skewed and left skewed data sets
symmetric: mean=median=mode
right skewed: mean>median>mode
left skewed: mean<median<mode
range
describes spread
the difference between the maximum and minimum values in a data set
Range = max - min
strongly influenced by outliers
larger range means
larger variability (usually); however, a single outlier can inflate the range and overstate the variability
deviation
yᵢ - ȳ
The deviation of an observation from the mean
positive vs negative deviation
positive means it is above the mean
negative means it is below the mean
the set of all deviations
- all add to 0
- describes the variability
- because they sum to 0, square every deviation before summing to get a useful number for calculations
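A small numeric check of these points, using made-up data and a hypothetical helper named deviations.

```python
def deviations(values):
    """Return each observation's deviation from the mean (y_i - y_bar)."""
    y_bar = sum(values) / len(values)
    return [y - y_bar for y in values]

data = [2, 4, 4, 6, 9]   # hypothetical data with mean 5
devs = deviations(data)

print(devs)                        # negative = below the mean, positive = above
print(sum(devs))                   # the raw deviations cancel out
print(sum(d ** 2 for d in devs))   # the squared deviations do not
```

Because the positive and negative deviations always cancel, the plain sum says nothing about spread; the sum of squares does.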
variance
s² = (Σ (yᵢ-ȳ)²) / (n-1)
where Σ has lower boundary i=1 and upper boundary n
why is variance problematic?
It is measured in squared units which is not very interpretable on its own
standard deviation
s
square root of the variance
most common measure of variability
tells us how closely data is clustered around the mean
measured in the same units as the original data
when would s=0
when all observations have the same value
what happens if s > 0
the standard deviation s increases as the observations become more spread out (greater variability)
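A minimal sketch of the variance and standard deviation formulas with the n-1 divisor, on made-up data. The function names are hypothetical; Python's statistics.variance and statistics.stdev use the same divisor.

```python
import math

def sample_variance(values):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

def sample_sd(values):
    """s = square root of the variance, in the same units as the data."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 6, 9]            # hypothetical data
print(sample_variance(data))       # in squared units
print(sample_sd(data))             # back in the original units
print(sample_sd([5, 5, 5]))        # identical observations give s = 0
```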
when can/should we use standard deviation? why?
we should only use standard deviation and mean together
neither is resistant to outliers, so neither should be used when outliers are present and distorting them
IQR
interquartile range
measure of variability
resistant to outliers, ∴ goes with median
the quartiles divide the data into 4 equal sections
percentile
the pth percentile is the value such that p% of the measurements fall below it and (100-p)% fall above it
what is the median in percentile?
the 50th percentile
can 215 be p?
no, percentiles are always between 0-100
Q₁
the lower quartile is the 25th percentile (separates the bottom 25% from the top 75% of measurements)
the median of the measurements that fall below the overall median
Q₃
the upper quartile is the 75th percentile (separates the top 25% from the bottom 75%)
the median of the measurements that fall above the overall median
what is between Q₁ and Q₃?
the middle 50% of measurements that fall between Q₁ and Q₃
IQR calculation
Q₃-Q₁
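A sketch of the quartiles and IQR using the "median of each half" method from these cards, assuming at least two observations. The function names and data are made up, and software packages may use slightly different percentile conventions and give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]         # measurements below the overall median
    upper = ordered[(n + 1) // 2 :]   # measurements above it (middle value excluded if n is odd)
    return median(lower), median(ordered), median(upper)

def iqr(values):
    """IQR = Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

data = [1, 3, 4, 6, 7, 8, 10]  # hypothetical data
print(quartiles(data))
print(iqr(data))
```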
if IQR is small
data is clustered around the center
if IQR is large
data is scattered far from the center
how do we choose a numerical summary?
- draw a graph
- use mean and standard deviation for reasonably symmetric data
- use median and IQR for skewed data
- If there are multiple modes try to understand why and consider splitting data into two groups
- If using mean and standard deviation with outliers, report them both with the outliers included and with the outliers removed
five-number-summary
minimum, Q₁, median, Q₃, maximum
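A minimal sketch of the five-number summary, applied to the walking shoe prices read from the stem plot card. The function name is hypothetical, and the quartiles are found as the medians of the lower and upper halves, which is one common convention among several.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]         # half below the median
    upper = ordered[(n + 1) // 2 :]   # half above the median
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Shoe prices in dollars, read from the stem plot (8|5 represents $85).
prices = [40, 60, 65, 65, 68, 68, 70, 70, 70, 70, 70, 74, 75, 75, 85, 90, 95]
print(five_number_summary(prices))
```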
boxplot
visual representation of data using the 5 number summary
shows the center, spread, symmetry/skewness at the same time
useful for comparing groups
fences
upper fence = Q₃ + (1.5 x IQR)
lower fence = Q₁ - (1.5 x IQR)
measurements outside the fences are considered outliers
whiskers
lines drawn from each end of the box out to the highest and lowest values that are still within the fences (not outliers)
far outliers
outliers that are farther than 3 IQRs from the quartiles
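The fence, whisker, and outlier rules can be sketched together. This is an illustration only: the function name box_plot_fences is made up, and the quartiles passed in (Q₁ = 66.5, Q₃ = 75) are assumed to come from the median-of-halves method applied to the walking shoe prices.

```python
def box_plot_fences(values, q1, q3):
    """Apply the 1.5 x IQR fence rule and the 3 x IQR far-outlier rule."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    # Far outliers lie more than 3 IQRs beyond the quartiles.
    far = [v for v in outliers if v < q1 - 3 * iqr or v > q3 + 3 * iqr]
    return {
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),  # whiskers stop at the last non-outlier
        "outliers": outliers,
        "far_outliers": far,
    }

# Shoe prices in dollars, read from the stem plot (8|5 represents $85).
prices = [40, 60, 65, 65, 68, 68, 70, 70, 70, 70, 70, 74, 75, 75, 85, 90, 95]
result = box_plot_fences(prices, q1=66.5, q3=75.0)
for name, value in result.items():
    print(name, value)
```

The whiskers end at actual data values, not at the fences themselves; the fences are only cutoffs for deciding what counts as an outlier.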
appearance of boxplots
x     |------[=======|=====]------|
symbols for outliers, whiskers, box for the IQR and a line in the box for the median
box plots that are skewed
symmetrical: median near the center of the box and whiskers of about equal length
skewed right: median to the left of center and a long right whisker
skewed left: median to the right of center and a long left whisker
comparative box plots
draw two box plots in one graph to compare the data in two different categories
time plot
used when interested in how the data behaves over time