Week 2 - Summarizing Data Flashcards
Scatter plot
allows us to visualize the nature of the relationship between 2 variables.
linear relationship
when the slope (gradient) of the relationship stays the same throughout, so the points follow a straight line
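A minimal sketch of the "same slope throughout" idea, using made-up points. The data and the helper name `segment_slopes` are illustrative assumptions, not part of the course material.

```python
# Hypothetical data: y rises by 2 for every 1-unit increase in x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

def segment_slopes(xs, ys):
    """Return the slope between each pair of consecutive points."""
    points = list(zip(xs, ys))
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]

slopes = segment_slopes(xs, ys)
# If every segment has the same slope, the points lie on a straight line.
is_linear = len(set(slopes)) == 1
print(slopes, is_linear)
```

Real data is noisy, so in practice the slopes would only be roughly equal rather than identical.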
Histogram
used to understand the distribution of a single numerical variable, i.e. how the data is spread out or arranged
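A rough sketch of what a histogram does under the hood: group values into bins and count how many fall in each. The scores and the bin width of 10 are made-up assumptions for illustration.

```python
from collections import Counter

# Hypothetical exam scores (illustrative data only).
scores = [52, 55, 61, 64, 67, 68, 71, 73, 74, 78, 82, 95]
bin_width = 10

# Assign each score to the lower edge of its bin, e.g. 67 -> 60.
bin_counts = Counter((s // bin_width) * bin_width for s in scores)

# Text histogram: one '#' per observation in each bin.
for lower in sorted(bin_counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * bin_counts[lower]}")
```

The tallest bars (60-69 and 70-79) show where most of the data sits.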
mean
one of the most common measures of central tendency, i.e. of the center or typical value in a set of data.
in a sample the mean is written x̄ ("x-bar")
one pitfall of mean
it's very sensitive to outliers: a single extreme value can pull the mean far away from the typical value
median
another measure of central tendency. the median represents the middle value: it separates the smallest 50% of the data from the largest 50%
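A small sketch comparing the mean and the median when an outlier is added. The salary figures are invented for illustration; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical salaries in thousands (illustrative data only).
salaries = [30, 32, 35, 38, 40]
# The same data with one extreme value added.
with_outlier = salaries + [500]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # 35 35
print(mean_after, median_after)    # 112.5 36.5
```

One outlier more than triples the mean, while the median barely moves.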
standard deviation
this is a measure of dispersion. it tells us how far observations typically are from the mean
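A minimal sketch of the calculation, on made-up data. The card does not say whether the population or the sample version is meant, so both are shown as an assumption.

```python
import math
import statistics

# Hypothetical observations (illustrative data only).
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(data)

# Population standard deviation: square root of the average
# squared distance from the mean (divide by n).
pop_sd = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

# Sample standard deviation: same idea, but divide by n - 1.
sample_sd = statistics.stdev(data)

print(mean, pop_sd, round(sample_sd, 3))
```

`statistics.pstdev(data)` gives the population version directly.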
interquartile range
the difference between the 75th percentile in a set of data and the 25th percentile
this is robust to outliers because it focuses only on the middle 50% of the data
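A short sketch of the interquartile range and its robustness. The data is made up, and the helper name `iqr` is hypothetical. Quartile conventions differ between textbooks and tools; this assumes the "inclusive" method of `statistics.quantiles`.

```python
import statistics

def iqr(values):
    """75th percentile minus 25th percentile (inclusive method)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical data, and the same data with the largest value made extreme.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

print(iqr(data), max(data) - min(data))
print(iqr(with_outlier), max(with_outlier) - min(with_outlier))
```

The full range jumps from 8 to 899, but the IQR is unchanged because the outlier lies outside the middle 50%.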
Box whisker plot
the box represents the interquartile range
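the whiskers typically extend to the most extreme observations that lie within 1.5 times the interquartile range of the box edges; points beyond that are plotted individually as outliers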
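A sketch of how the box and whiskers can be computed, assuming the common convention in which whiskers stop at the most extreme observations within 1.5 × IQR of the box edges. The data is made up, and plotting tools may use other quartile or whisker rules.

```python
import statistics

# Hypothetical data; 30 is far from the rest.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]

# Box edges are the 25th and 75th percentiles.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme points still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, whisker_low, whisker_high, outliers)
```

Here the box runs from 3 to 7, the whiskers reach 1 and 8, and 30 is flagged as an outlier.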
shapes of distributions
one important dimension is the symmetry and skewness of a distribution
different skews
right skewed - the long tail is on the right, and the mean is typically larger than the median
left skewed - the long tail is on the left, and the mean is typically smaller than the median
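A quick check of the mean-versus-median rule on made-up data. The left-skewed list is simply the mirror image of the right-skewed one; both are illustrative assumptions.

```python
import statistics

# Hypothetical right-skewed data: most values are small, one is large.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Mirror image of the same data, giving a long tail on the left.
left_skewed = [21 - x for x in right_skewed]

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)
left_mean = statistics.mean(left_skewed)
left_median = statistics.median(left_skewed)

print(right_mean, right_median)  # 4.75 3.0
print(left_mean, left_median)    # 16.25 18.0
```

The extreme value in the tail drags the mean toward it, while the median stays with the bulk of the data.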
presenting categorical data
this is harder; it's typically summarized using counts or proportions of the different outcomes
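A minimal sketch of summarizing one categorical variable with counts and proportions. The transport-mode data is invented for illustration.

```python
from collections import Counter

# Hypothetical categorical data: preferred transport mode.
modes = ["bus", "car", "car", "bike", "car", "bus", "car", "bike"]

# Count how often each outcome occurs.
counts = Counter(modes)
# Convert each count to a share of all observations.
proportions = {mode: n / len(modes) for mode, n in counts.items()}

for mode in sorted(counts):
    print(f"{mode}: count={counts[mode]}, proportion={proportions[mode]:.2f}")
```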
contingency table
this summarizes data for 2 categorical variables: each cell gives the count of observations with one combination of categories
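A small sketch of building a contingency table by counting pairs of categories. The survey responses and variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical survey responses: (smoking status, exercise habit).
responses = [
    ("smoker", "exercises"), ("smoker", "no exercise"),
    ("non-smoker", "exercises"), ("non-smoker", "exercises"),
    ("non-smoker", "no exercise"), ("smoker", "no exercise"),
]

# Count each combination of the two categorical variables.
counts = Counter(responses)
rows = sorted({r for r, _ in responses})
cols = sorted({c for _, c in responses})
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Print the table: one row per smoking status, one column per exercise habit.
print(f"{'':12}" + "".join(f"{c:>14}" for c in cols))
for r in rows:
    print(f"{r:12}" + "".join(f"{table[r][c]:>14}" for c in cols))
```

With a data-frame library, a cross-tabulation function would typically do the same job in one call.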