Describing and Summarizing Data Flashcards
Histogram’s X & Y Axis
The x-axis represents bins corresponding to ranges of data; the y-axis indicates the frequency of observations falling into each bin.
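As a rough illustration of how bins and frequencies relate, the sketch below groups some made-up test scores into bins of width 10 and counts the observations in each. The function name `histogram_counts` and the sample data are assumptions for this example only.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical data: ten test scores.
scores = [52, 55, 61, 67, 68, 70, 74, 79, 81, 95]
counts = histogram_counts(scores, 10)

# Text version of the histogram: bin range (x-axis), one '#' per observation (y-axis).
for start, freq in counts.items():
    print(f"{start}-{start + 9}: {'#' * freq}")
```

Each key is the lower edge of a bin and each value is that bin's frequency.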
What is an outlier?
An outlier is a value that falls far from the rest of the data.
What does skewness measure?
Skewness measures the degree of a distribution’s asymmetry.
If the right tail is longer, we say it is skewed …
to the right, or “right-tailed.”
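One common way to put a number on asymmetry is the moment coefficient of skewness (third central moment divided by the 1.5 power of the second). The sketch below uses that population formula as an assumption; it is not necessarily the formula a given spreadsheet function uses, and the data are made up.

```python
def skewness(values):
    """Population moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]  # one large value stretches the right tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed))  # positive value: skewed to the right
print(skewness(symmetric))     # zero: no asymmetry
```

A positive result corresponds to a longer right tail, a negative result to a longer left tail.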
“central tendency”
an indication of where the “center” of the data set lies. We usually start by calculating the mean, the most common measurement of central tendency.
MEAN =
“average” of a set of numbers
=AVERAGE(number 1, [number 2], …)
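For comparison outside Excel, here is a minimal sketch of the same calculation using Python's standard `statistics` module. The data are invented, and the second list adds one extreme value to show how strongly a single outlier can pull the mean.

```python
import statistics

values = [10, 12, 14, 16, 18]
mean_value = statistics.mean(values)  # sum of the values divided by their count

# Same data plus one outlier: the mean shifts noticeably.
with_outlier = values + [100]
mean_with_outlier = statistics.mean(with_outlier)

print(mean_value)
print(round(mean_with_outlier, 2))
```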
MODE =
the value that occurs most frequently in a data set
=MODE.SNGL(number 1, [number 2], …)
bimodal distribution
A distribution is called bimodal if it has two clearly defined peaks (two points with very high frequency). The two peaks may have equal frequency and hence be true modes, or one peak may be a mode and the other peak may simply have a very high (but not the highest) frequency.
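A small sketch of finding modes with Python's standard `statistics` module (Python 3.8 or later), using made-up data. Note the assumption: `multimode` only reports values tied for the highest frequency, so it catches the "two true modes" case but not a second peak that is merely very high.

```python
import statistics

data = [3, 5, 5, 5, 7, 8, 9, 9, 9, 10]

single_mode = statistics.mode(data)     # first value reaching the highest frequency
all_modes = statistics.multimode(data)  # every value tied for the highest frequency

print(single_mode)
print(all_modes)
```

Here 5 and 9 each occur three times, so the data set has two modes.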
MEDIAN =
the middle value of the sorted data set (with an even number of values, the average of the two middle values). The median is the 50th percentile of the data set.
=MEDIAN(number 1, [number 2], …)
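The same idea sketched with Python's standard `statistics` module, on invented data. The input does not need to be sorted first.

```python
import statistics

odd_count = [7, 1, 5]       # sorted: 1, 5, 7 -> the middle value is 5
even_count = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7 -> average of 3 and 5

median_odd = statistics.median(odd_count)
median_even = statistics.median(even_count)

print(median_odd)
print(median_even)
```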
PERCENTILE =
The value beneath which a certain percentage of the data lie. For example, someone who scored in the 95th percentile of a test scored equal to or higher than 95% of all people who took that test. We can also say that person scored in the top 5%.
=PERCENTILE.INC(array, k)
array is the range of data for which we want to calculate a given percentile.
k is the percentile value. For example, if we want to know the 95th percentile, k would be 0.95.
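To make the calculation concrete, the sketch below computes a percentile by linear interpolation between sorted values, with k running from 0 to 1 inclusive. It is meant to mirror the inclusive method, but treat that correspondence as an assumption; the function name `percentile_inc` and the data are hypothetical.

```python
def percentile_inc(values, k):
    """Percentile by linear interpolation between sorted values, 0 <= k <= 1."""
    if not values or not 0 <= k <= 1:
        raise ValueError("need non-empty data and k between 0 and 1")
    ordered = sorted(values)
    rank = k * (len(ordered) - 1)             # fractional position in the sorted list
    lower = int(rank)                         # index just at or below that position
    upper = min(lower + 1, len(ordered) - 1)  # next index, capped at the last value
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

scores = [10, 20, 30, 40]
p95 = percentile_inc(scores, 0.95)
p50 = percentile_inc([1, 2, 3, 4, 5], 0.5)  # the median

print(round(p95, 2))
print(p50)
```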
VARIABILITY measures
how widely dispersed the data are
To gain insight into the spread of the distribution we calculate …
the variance. Variance measures how far the data values lie from the mean: it is the average of the squared deviations from the mean.
A small standard deviation means the values are close to the mean; a large one means they are far from it.
The standard deviation is equal to the square root of the variance. If the variance is 9, then the standard deviation must be 3.
To calculate the variance or standard deviation of a sample in Excel, we can use the following functions:
=VAR.S(number 1, [number 2], …)
=STDEV.S(number 1, [number 2], …)
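A minimal sketch of the sample calculations using Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1 as sample statistics do. The three data points are made up so that the variance comes out to 9.

```python
import statistics

sample = [1, 4, 7]  # mean is 4; squared deviations are 9, 0, 9

sample_variance = statistics.variance(sample)  # (9 + 0 + 9) / (3 - 1)
sample_sd = statistics.stdev(sample)           # square root of the sample variance

print(sample_variance)
print(sample_sd)
```

The standard deviation is the square root of the variance, so a variance of 9 gives a standard deviation of 3.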
Coefficient of Variation =
a measure of relative variation, used to compare the amount of variation in two different data sets.
To compare variation in two data sets, we calculate a value called the coefficient of variation (CV).
CV = the standard deviation / the mean.
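The sketch below shows why the ratio is useful, using two invented data sets that have the same standard deviation but very different means. The helper name `coefficient_of_variation` is an assumption for this example, and it uses the sample standard deviation.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

set_a = [1, 4, 7]        # mean 4, standard deviation 3
set_b = [101, 104, 107]  # mean 104, standard deviation 3

cv_a = coefficient_of_variation(set_a)
cv_b = coefficient_of_variation(set_b)

print(round(cv_a, 4))
print(round(cv_b, 4))
```

Both sets spread by the same absolute amount, but relative to its mean, set A varies far more than set B.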
To visualize the relationship between two variables, we typically use
a scatter plot. One variable is plotted on the horizontal axis (x-axis), and the other is plotted on the vertical axis (y-axis).