2.1 Examining Numerical Data Flashcards
What does a scatterplot provide?
A scatterplot provides a case-by-case view of data for two numerical variables. Each point
represents a single case.
When are scatterplots helpful?
Scatterplots are helpful for quickly spotting associations between variables, whether those
associations are simple trends or more complex relationships.
What does a dot plot provide?
A dot plot provides a view of a single variable.
It’s the most basic of displays. A dot plot is a one-variable scatterplot.
What is the sample mean?
The mean, often called the average, is a common way to measure the center of a distribution
of data. For example, to compute the mean interest rate, we add up all the interest rates and divide by the number
of observations. The sample mean is often labeled x̄.
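The add-up-and-divide recipe above can be sketched in a few lines of Python. The function name `sample_mean` and the interest rates are made up for illustration; they are not data from the source.

```python
def sample_mean(values):
    """Add up all observations and divide by the number of observations."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)

# Hypothetical interest rates (in percent), invented for this example.
rates = [10.0, 12.5, 7.5, 14.0, 6.0]
x_bar = sample_mean(rates)
print(x_bar)
```

Python's standard library offers `statistics.mean`, which should give the same result for data like this.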
What is a population mean?
The average of the entire population, computed the same way as the sample mean.
However, the population mean has a special label: µ. The symbol µ is the Greek letter mu and
represents the average of all observations in the population. Sometimes a subscript, such as _x, is used
to indicate which variable the population mean refers to, e.g. µ_x. Often it is too expensive
to measure the population mean precisely, so we estimate µ using the sample mean, x̄.
What does a histogram provide?
Histograms are useful for larger data sets. Rather than showing the value of each observation, we think of each value as belonging to a bin. Observations that fall on the boundary of a bin (e.g. 10.00%) are allocated to the lower bin. These binned counts are plotted as bars to form what is called a histogram.
Histograms provide a view of the data density.
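As a rough sketch of the binning rule described above, the following Python counts observations per bin and sends a value that sits exactly on a boundary to the lower bin. The function name `bin_counts`, the bin width, the starting point, and the data are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bin_counts(values, width, start=0.0):
    """Count observations per bin.

    Bin k covers (start + k*width, start + (k+1)*width], so a value that
    falls exactly on a boundary is allocated to the lower bin.
    """
    counts = Counter()
    for v in values:
        if v < start:
            raise ValueError("observation lies below the first bin")
        k = math.ceil((v - start) / width) - 1
        k = max(k, 0)  # a value equal to `start` goes in the first bin
        counts[k] += 1
    return counts

# Hypothetical interest rates (in percent); 10.0 and 15.0 sit on boundaries.
rates = [6.0, 7.5, 10.0, 10.5, 12.5, 14.0, 15.0, 18.0]
start, width = 5.0, 5.0
counts = bin_counts(rates, width, start)

# Print a crude text histogram: one '#' per observation in each bin.
for k in sorted(counts):
    low = start + k * width
    print(f"({low:g}, {low + width:g}]  {'#' * counts[k]}")
```

Many plotting libraries use the opposite convention and put boundary values in the upper bin, so it is worth checking which rule a given tool applies.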
What do right skewed and left skewed mean? And what about symmetric?
When data trail off to the right and have a longer right tail, the shape is said to be right skewed. Other ways to describe data that are right skewed: skewed to the right, skewed to the high end, or skewed
to the positive end.
Data sets with the reverse characteristic – a long, thinner tail to the left – are said to be left skewed. We also say that such a distribution has a long left tail. Data sets that show roughly equal trailing off in both directions are called symmetric.
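What does it mean when a distribution has a “long tail”?
The data trail off in one direction. A tail that extends to the right is a long right tail; one that extends to the left is a long left tail.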
What is a mode?
A mode is represented by a prominent peak in the distribution.
Histograms can have one, two, or three prominent peaks. Such distributions
are called unimodal, bimodal, and multimodal, respectively. Any distribution with more than
2 prominent peaks is called multimodal. Notice that a unimodal distribution can have one prominent
peak along with a second, less prominent peak that is not counted, since it differs from its
neighboring bins by only a few observations.
What are the two measures of variability?
The variance and the standard deviation.
The standard deviation roughly describes how far away the typical observation is from the mean.
We call the distance of an observation from its mean its deviation.
If we square deviations and then take an average, the result is equal to the sample variance, denoted by s^2.
We divide by n − 1, rather than dividing by n, when computing a sample’s variance; there’s some
mathematical nuance here, but the end result is that doing this makes this statistic slightly more
reliable and useful.
Notice that squaring the deviations does two things. First, it makes large values relatively
much larger, seen by comparing (−0.67)^2, (−1.65)^2, (14.73)^2, and (−5.49)^2. Second, it gets rid of
any negative signs.
The variance is the average squared distance from the mean. The standard deviation is defined as
the square root of the variance, and it is useful when considering how far the data
are distributed from the mean.
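The steps above (take deviations from the mean, square them, average with n − 1 in the denominator, then take the square root) can be sketched as follows. The function names and the data are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    deviations = [x - mean for x in values]  # distance of each observation from the mean
    return sum(d * d for d in deviations) / (n - 1)

def sample_sd(values):
    """The standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))

# Made-up data with mean 5; the squared deviations sum to 32.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s2 = sample_variance(data)  # 32 / 7
s = sample_sd(data)
print(s2, s)
```

The standard library functions `statistics.variance` and `statistics.stdev` also divide by n − 1 and should agree with this sketch.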
The standard deviation represents the typical deviation of observations from the mean. Usually
about 70% of the data will be within one standard deviation of the mean and about 95% will
be within two standard deviations. However, as seen in Figures 2.8 and 2.9, these percentages
are not strict rules.
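One way to see that these percentages are only rough guides is to compute, for a given data set, the share of observations within k standard deviations of the mean. This is a minimal sketch; the helper name `fraction_within` and the data are assumptions, not taken from the source.

```python
import statistics

def fraction_within(values, k):
    """Share of observations within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    inside = [x for x in values if abs(x - mean) <= k * sd]
    return len(inside) / len(values)

# Made-up data for illustration only.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
within_one = fraction_within(data, 1)
within_two = fraction_within(data, 2)
print(within_one, within_two)
```

For this small data set the shares come out to 75% and 100% rather than 70% and 95%, which illustrates why the figures are approximations.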
The population values for variance and standard deviation have special symbols:
σ^2 for the variance and σ for the standard deviation.
What can you see in a box plot?
A box plot summarizes a data set using five statistics while also plotting unusual observations.
What is the median, and how do you find it?
The median splits the data in half: 50% of the data fall below the median and the other 50% fall above it. If there is an even number of observations, the median is the average of the two observations closest to
the 50th percentile.
When there is an odd number of observations, exactly one observation splits the data into two halves, and that observation is the median (no average needed).
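The odd and even cases can be sketched in Python as below. The function name `median` and the example values are illustrative assumptions.

```python
def median(values):
    """Middle observation for odd n; average of the two middle observations for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # exactly one observation splits the data
    return (ordered[mid - 1] + ordered[mid]) / 2  # average the two middle values

odd_case = median([3, 1, 2])       # sorted: 1, 2, 3
even_case = median([4, 1, 3, 2])   # sorted: 1, 2, 3, 4
print(odd_case, even_case)
```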
What is the interquartile range (IQR)?
The IQR is the length of the box in a box plot. It is computed as
IQR = Q3 - Q1
where Q1 and Q3 are the 25th and 75th percentiles.
It, like the standard deviation, is a measure of variability in data. The more
variable the data, the larger the standard deviation and IQR tend to be. The two boundaries of the
box are called the first quartile (the 25th percentile, i.e. 25% of the data fall below this value) and
the third quartile (the 75th percentile).
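A short sketch of IQR = Q3 - Q1 using Python's standard library follows. Software packages differ in how they interpolate percentiles, so the `method="inclusive"` choice here is an assumption and other tools may report slightly different quartiles for the same data. The data are made up.

```python
import statistics

# Made-up data for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# statistics.quantiles with n=4 returns the three cut points Q1, median, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)
```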
What are whiskers?
Extending out from the box, the whiskers attempt to capture the data outside of the box.
However, their reach beyond the box is never allowed to be more than 1.5 × IQR. They capture everything within
this reach.
What are outliers?
Any observation lying beyond the whiskers is labeled with a dot. The purpose of labeling these
points, instead of extending the whiskers to the minimum and maximum observed values, is to help
identify any observations that appear to be unusually distant from the rest of the data. Unusually
distant observations are called outliers.
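The whisker and outlier rules can be sketched together: the whiskers stop at the most extreme observations that lie within 1.5 × IQR of the box, and anything beyond them is flagged. The function name `box_plot_summary`, the quartile method, and the data are assumptions for illustration; plotting libraries may compute quartiles slightly differently.

```python
import statistics

def box_plot_summary(values, reach=1.5):
    """Return whisker endpoints and outliers using the reach * IQR rule."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - reach * iqr  # farthest a whisker may extend below the box
    upper_limit = q3 + reach * iqr  # farthest a whisker may extend above the box
    inside = [x for x in ordered if lower_limit <= x <= upper_limit]
    outliers = [x for x in ordered if x < lower_limit or x > upper_limit]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],    # smallest observation within reach
        "whisker_high": inside[-1],  # largest observation within reach
        "outliers": outliers,
    }

# Made-up data with one unusually distant observation.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]
summary = box_plot_summary(data)
print(summary)
```

Here the upper whisker stops at 8 rather than stretching to 30, so the value 30 would be drawn as a separate dot.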