section 2.2: considering categorical data Flashcards
what is a contingency table?
a table that summarizes data for two categorical variables
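As a rough illustration of the definition above, a contingency table can be built by counting how often each pair of levels occurs. This is a minimal sketch using only the Python standard library; the variable names and the six observations are made up for the example.

```python
from collections import Counter

# Hypothetical observations: (homeownership, application_type) pairs.
observations = [
    ("rent", "individual"),
    ("rent", "joint"),
    ("own", "individual"),
    ("rent", "individual"),
    ("mortgage", "joint"),
    ("own", "individual"),
]

# Count how often each combination of the two variables occurs.
cell_counts = Counter(observations)

row_levels = sorted({row for row, _ in observations})
col_levels = sorted({col for _, col in observations})

# table[row][col] holds the count for that cell (0 if never observed).
table = {
    row: {col: cell_counts[(row, col)] for col in col_levels}
    for row in row_levels
}

for row in row_levels:
    print(row, table[row])
```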
what is a bar plot?
common way to display a single categorical variable
what is a relative-frequency bar plot?
a bar plot where proportions instead of frequencies are shown
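The difference between the two kinds of bar plot comes down to what the bar heights are. The sketch below, which assumes a small made-up list of category values, computes both the counts (frequency bar plot) and the proportions (relative-frequency bar plot) and prints a crude text version of the latter.

```python
from collections import Counter

# Hypothetical values of a single categorical variable.
homeownership = ["rent", "own", "rent", "mortgage", "rent", "own", "mortgage", "rent"]

# Bar heights for a frequency bar plot.
counts = Counter(homeownership)

# Bar heights for a relative-frequency bar plot: each count divided by the total.
total = sum(counts.values())
proportions = {level: count / total for level, count in counts.items()}

# Crude text rendering: one '#' per 5% of the observations.
for level, p in proportions.items():
    print(f"{level:<10}{'#' * round(p * 20)} {p:.3f}")
```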
how are bar plots different from histograms?
Bar plots are used for displaying distributions of categorical variables, while
histograms are used for numerical variables. The x-axis in a histogram is a
number line, hence the order of the bars cannot be changed, while in a bar plot
the categories can be listed in any order.
what is variance?
the standard deviation squared
what is the equation for variance?
s^2 = Σ(x_i - x̄)^2 / (n - 1), summing over all n observations, where x̄ is the sample mean
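The formula can be checked by hand on a small data set. The sketch below uses made-up numbers, computes the sample variance step by step, and compares it with `statistics.variance` from the Python standard library, which also divides by n - 1.

```python
import statistics

# Hypothetical sample of 8 observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n                                  # x-bar
squared_deviations = [(x - mean) ** 2 for x in data]  # (x - x-bar)^2 for each x
sample_variance = sum(squared_deviations) / (n - 1)   # divide by n - 1, not n

print(sample_variance)
print(statistics.variance(data))  # library result, for comparison
```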
what points make a larger difference in variance?
points that are far away from the mean
Why do we use the squared deviation in the calculation of variance?
To get rid of negatives so that observations equally distant from the mean are weighted equally.
To weight larger deviations more heavily.
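Both reasons can be seen in a small made-up example: the raw deviations cancel out to zero, while the squared deviations are all non-negative and grow quickly for points far from the mean.

```python
# Hypothetical sample with mean 6.
data = [3, 5, 5, 7, 10]

mean = sum(data) / len(data)
deviations = [x - mean for x in data]
squared = [d ** 2 for d in deviations]

# Raw deviations cancel, so their sum says nothing about spread.
print(sum(deviations))

# Squared deviations are non-negative; the point at 10 (deviation 4)
# contributes 16, far more than the points one unit from the mean.
for x, d, sq in zip(data, deviations, squared):
    print(x, d, sq)
```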
what is standard deviation?
the square root of the variance; unlike the variance,
it has the same units as the data
what is the median?
the value that splits the data in half when ordered in ascending order
what is the 50th percentile?
the median
what is the 25th percentile?
the first quartile, Q1
what is the 75th percentile?
the third quartile, Q3
what is interquartile range (IQR)?
the width of the range that contains the middle 50% of the data
what is the equation for IQR?
IQR = Q3 - Q1
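A short sketch of the quartile cards above, using `statistics.quantiles` from the Python standard library on made-up data. Software packages use slightly different conventions for computing quartiles, so the choice of `method="inclusive"` here is an assumption and other tools may give slightly different values for Q1 and Q3.

```python
import statistics

# Hypothetical sample; the function sorts the data internally.
data = [7, 1, 9, 3, 5, 2, 8, 4, 6]

# n=4 splits the data into four parts and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1 =", q1, "median =", median, "Q3 =", q3, "IQR =", iqr)
```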
what does the box in a box plot represent?
represents the middle 50% of the data, and
the thick line in the box is the median
what is the max upper whisker reach of a box plot?
Q3 + 1.5 x IQR
what is the max lower whisker reach of a box plot?
Q1 - 1.5 x IQR
what is an outlier?
observation beyond the maximum reach of the whiskers
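The whisker rules can be sketched as follows on made-up data. The quartile method is again an assumption (conventions differ between tools); the fences themselves follow the 1.5 x IQR rule from the cards above, and the whiskers stop at the most extreme observations that are still inside the fences.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Maximum reach of the whiskers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# The whiskers end at the most extreme observations within the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```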
why is it important to look for outliers?
Identify extreme skew in the distribution.
Identify data collection and entry errors.
Provide insight into interesting features of the data.
what are the robust statistics?
median and IQR
what are the non-robust statistics?
mean, variance (standard deviation)
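One way to see what "robust" means is to replace a single observation with an extreme value and compare the summaries before and after. This is a sketch on made-up data; `summarize` is a hypothetical helper written for this example, and the inclusive quartile method is an assumption.

```python
import statistics

def summarize(values):
    """Return center and spread summaries for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "median": statistics.median(values),
        "iqr": q3 - q1,
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }

original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_extreme = original[:-1] + [900]   # replace the largest value with 900

before = summarize(original)
after = summarize(with_extreme)

# Median and IQR are unchanged; mean and SD move a lot.
for name in before:
    print(name, before[name], "->", after[name])
```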
for skewed distributions it is often more helpful to use ___________ to describe the center and spread
median and IQR
for symmetric distributions it is often more helpful to use __________ to describe the center and spread
the mean and SD
if a distribution is symmetric, the center is defined as _______
the mean
mean ~ median
if a distribution is skewed or has extreme outliers, the center is defined as _______
the median
if a distribution is right-skewed, the mean is
greater than the median
if a distribution is left-skewed, the mean is
less than the median
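The skew cards above can be checked on two small made-up samples, one with a long right tail and its mirror image with a long left tail. This is a rule of thumb for typical single-peaked distributions rather than a guarantee for every data set.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is large.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

# Mirror image of the sample above, giving a long left tail.
left_skewed = [21 - x for x in right_skewed]

for name, sample in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name, "mean =", statistics.mean(sample), "median =", statistics.median(sample))
```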
what is a side by side bar plot?
Displays the same information as a stacked bar plot by placing
bars next to, instead of on top of, each other
what is a standardized stacked bar plot?
a stacked bar plot where every bar has the same total height, so each segment shows a proportion of its group rather than a count
what is a mosaic plot?
visualization technique suitable for contingency tables that resembles a standardized stacked bar plot with the
benefit that we still see the relative group sizes of the primary variable as well.
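The numbers behind these two plots can be sketched from a contingency table. In the made-up table below, the within-group proportions are the segment heights of a standardized stacked bar plot, and the group shares of the grand total are the extra information a mosaic plot encodes as column widths. No plotting library is assumed; the sketch only computes the quantities.

```python
# Hypothetical contingency table: counts[primary_level][secondary_level].
counts = {
    "rent": {"individual": 30, "joint": 10},
    "own": {"individual": 15, "joint": 5},
    "mortgage": {"individual": 24, "joint": 16},
}

group_totals = {level: sum(cells.values()) for level, cells in counts.items()}
grand_total = sum(group_totals.values())

# Segment heights in a standardized stacked bar plot:
# proportions within each group, summing to 1 per bar.
segment_heights = {
    level: {name: n / group_totals[level] for name, n in cells.items()}
    for level, cells in counts.items()
}

# Column widths in a mosaic plot: each group's share of all observations.
column_widths = {level: t / grand_total for level, t in group_totals.items()}

for level in counts:
    print(level, "width =", column_widths[level], "segments =", segment_heights[level])
```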
what are the ways to measure center?
mean (average), median (a histogram can show roughly where the center lies, but it is a plot, not a measure)
what are the ways to measure shape?
modality, skewness
what are the ways to measure spread?
variance (standard deviation), IQR
If you would like to estimate the typical household income for a student, would you be more interested in the mean or median income?
the median, because household income distributions are typically right-skewed, and a few very high incomes pull the mean above what is typical