Module 2: Summarizing Data Flashcards
scatterplots
-useful for visualizing the relationship between 2 numerical variables
-each point is a single case
dot plots
useful for visualizing one numerical variable
what is one way to measure the centre of a distribution of data?
the mean (average)
sample statistic
a value computed from sample data; e.g. the sample mean is a point estimate of the population mean
histograms
-view of data density
-convenient for describing the shape of the data distribution
what do higher bars on histograms represent?
where the data are relatively more common
4 types of modality
unimodal, bimodal, multimodal, and uniform
3 types of skewness
right skewed, left skewed, or symmetric
2 measures of variability
variance and standard deviation
deviation
distance of an observation from the mean
variance
-roughly the average squared deviation from the mean (the sample variance divides by n - 1 rather than n)
-tells you the amount of spread in the data
standard deviation
-square root of the variance and has the same units as the data
-useful for describing how far the data typically fall from the mean
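The deviation, variance, and standard deviation cards above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sample versions (dividing by n - 1); the function names and the data values are made up for the example.

```python
import math
import statistics

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    # A deviation is the distance of one observation from the mean.
    deviations = [x - mean for x in data]
    return sum(d ** 2 for d in deviations) / (n - 1)

def sample_sd(data):
    """Square root of the variance; same units as the data."""
    return math.sqrt(sample_variance(data))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

print(sample_variance(data))
print(sample_sd(data))
# Cross-check against the standard library's sample variance.
print(statistics.variance(data))
```

Dividing by n instead of n - 1 would give the population variance, which `statistics.pvariance` computes.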
box plot
summarizes a data set using 5 statistics while also plotting unusual observations
5 statistics (plus 1 optional one) used for box plots
upper whisker, Q3, median, Q1, lower whisker, mean (optional)
median
value that splits the data in half when ordered in ascending order
what is the median when there is an even number of observations?
the average of the 2 values in the middle
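The two median cards above can be sketched as a short function. This is only an illustration with a hypothetical function name; Python's built-in `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle observation splits the data in half.
        return ordered[mid]
    # Even count: average the two observations in the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))     # odd number of observations
print(median([4, 1, 3, 2]))  # even number of observations
```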
25th percentile =
first quartile Q1
50th percentile =
median
75th percentile =
third quartile Q3
outlier
-observation beyond the max reach of the whiskers
-appears extreme relative to the rest of the data
interquartile range, IQR
distance between the first and third quartiles: IQR = Q3 - Q1
whiskers
-capture data outside of the IQR box, reaching no farther than 1.5 x IQR beyond Q1 and Q3
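The quartile, IQR, whisker, and outlier cards above fit together as in the sketch below. It assumes the common convention that whiskers reach at most 1.5 x IQR beyond the box; the function name and data are hypothetical. Quartile conventions differ between textbooks and software, so Q1 and Q3 may not match a hand calculation exactly; this sketch uses the "inclusive" method of `statistics.quantiles`.

```python
import statistics

def box_plot_summary(data):
    """Return the statistics a box plot draws, plus any outliers."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Whiskers may reach at most 1.5 * IQR beyond the box.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "lower_whisker": min(inside),  # smallest observation within reach
        "q1": q1,
        "median": med,
        "q3": q3,
        "upper_whisker": max(inside),  # largest observation within reach
        "iqr": iqr,
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical observations
summary = box_plot_summary(data)
for name, value in summary.items():
    print(name, value)
```

Here the value 30 lies beyond the upper whisker's maximum reach, so it is reported as an outlier rather than stretching the whisker.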
T or F: median and IQR are more robust to skewness and outliers than mean and SD
true
if the distribution is skewed or has extreme outliers, centre is often defined as _________
the median
if distribution is symmetric, centre is often defined as _______
the mean
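The robustness cards above can be checked with a small experiment: replace one observation with an extreme value and see which statistics move. The data are made up for illustration.

```python
import statistics

original = [1, 2, 3, 4, 5]       # hypothetical, symmetric data
with_outlier = [1, 2, 3, 4, 500]  # same data with one extreme value

for label, data in [("original", original), ("with outlier", with_outlier)]:
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    print(label)
    print("  mean  ", statistics.mean(data))
    print("  SD    ", statistics.stdev(data))
    print("  median", statistics.median(data))
    print("  IQR   ", q3 - q1)
```

The mean and SD shift substantially when the outlier is added, while the median and IQR stay the same, which is why the median is the usual choice of centre for skewed data.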
contingency table
summarizes data for 2 categorical variables
name 2 plots that combine numerical and categorical data to compare numerical data across groups
side-by-side box plots and multiple histograms
bar plot
displays a single categorical variable
relative frequency bar plot
bar plot that shows proportions instead of counts
stacked bar plot
graphical display of contingency table info for counts
side-by-side bar plot
same info as a stacked bar plot, but with the bars placed beside each other instead of on top of each other
frequency
shows the count in each category
*difficult to compare across groups when the groups have unequal sizes
row proportion
shows each count as a proportion of its row total
*easier to compare between rows
column proportion
shows each count as a proportion of its column total; useful for showing proportions within each level of the explanatory variable
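The frequency, row proportion, and column proportion cards above can be sketched with a small contingency table. The group names, categories, and counts below are invented for illustration, and the helper function names are hypothetical.

```python
# Contingency table of counts: rows are groups, columns are responses.
counts = {
    "group A": {"yes": 30, "no": 10},
    "group B": {"yes": 20, "no": 40},
}

def row_proportions(table):
    """Divide each count by its row total."""
    return {
        row: {col: c / sum(cells.values()) for col, c in cells.items()}
        for row, cells in table.items()
    }

def column_proportions(table):
    """Divide each count by its column total."""
    col_totals = {}
    for cells in table.values():
        for col, c in cells.items():
            col_totals[col] = col_totals.get(col, 0) + c
    return {
        row: {col: c / col_totals[col] for col, c in cells.items()}
        for row, cells in table.items()
    }

row_props = row_proportions(counts)
col_props = column_proportions(counts)
print(row_props)
print(col_props)
```

The groups have different sizes (40 and 60), so the raw counts of "yes" are hard to compare directly; the row proportions put both groups on the same scale.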