Describing and Summarizing Data Flashcards
Histogram’s X & Y Axis
The x-axis represents bins corresponding to ranges of data; the y-axis indicates the frequency of observations falling into each bin.
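As a rough illustration of how a histogram's two axes come about, the sketch below counts how many values fall into each bin. The data, the bin edges, and the helper name `histogram_counts` are hypothetical, and the half-open bin convention [lo, hi) is an assumption; other tools may place values on a bin edge differently.

```python
# Hypothetical sample data and bin edges, chosen only for illustration.
data = [2, 3, 5, 7, 8, 11, 12, 13, 18, 19]
bin_edges = [0, 5, 10, 15, 20]  # bins: [0,5), [5,10), [10,15), [15,20)


def histogram_counts(values, edges):
    """Count how many values fall into each half-open bin [lo, hi)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break  # each value belongs to at most one bin
    return counts


counts = histogram_counts(data, bin_edges)

# Text rendering: bins (x-axis) on the left, frequency (y-axis) as bar length.
for lo, hi, c in zip(bin_edges, bin_edges[1:], counts):
    print(f"{lo:>2}-{hi:<2} | {'#' * c} ({c})")
```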
What is an outlier?
An outlier is a value that falls far from the rest of the data.
What does skewness measure?
Skewness measures the degree of a graph’s asymmetry.
If the right tail is longer, we say it is skewed …
to the right, or “right-tailed.”
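One way to put a number on skewness is the moment coefficient sketched below: positive for a longer right tail, negative for a longer left tail, zero for a symmetric data set. This is one common definition, not necessarily the formula a given spreadsheet function uses; the data and the function name `skewness` are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]  # one large value stretches the right tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed))  # positive: skewed to the right
print(skewness(symmetric))     # zero: no asymmetry
```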
“central tendency”
an indication of where the “center” of the data set lies. We usually start by calculating the mean, the most common measurement of central tendency.
MEAN =
“average” of a set of numbers
=AVERAGE(number1, [number2], …)
MODE =
the value that occurs most frequently in a data set
=MODE.SNGL(number1, [number2], …)
bimodal distribution
A distribution is called bimodal if it has two clearly defined peaks (two points with very high frequency). The two peaks may have equal frequency and hence be true modes, or one peak may be a mode and the other peak may simply have a very high (but not the highest) frequency.
MEDIAN =
the middle value of the data set when the values are sorted. The median is the 50th percentile of the data set.
=MEDIAN(number1, [number2], …)
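For comparison, the three measures of central tendency above can be sketched with Python's standard `statistics` module. The data sets are hypothetical, and the comments only point to the roughly corresponding Excel function.

```python
import statistics

scores = [70, 85, 85, 90, 100]  # hypothetical data

mean = statistics.mean(scores)      # roughly =AVERAGE(...)
median = statistics.median(scores)  # roughly =MEDIAN(...)
mode = statistics.mode(scores)      # roughly =MODE.SNGL(...)

# A bimodal data set: two values share the highest frequency.
# multimode returns every value tied for most frequent.
bimodal = [1, 2, 2, 2, 5, 8, 8, 8, 9]
modes = statistics.multimode(bimodal)

print(mean, median, mode, modes)
```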
PERCENTILE =
The value beneath which a certain percentage of the data lie. For example, someone who scored in the 95th percentile of a test scored equal to or higher than 95% of all people who took that test. We can also say that person scored in the top 5%.
=PERCENTILE.INC(array, k)
array is the range of data for which we want to calculate a given percentile.
k is the percentile value. For example, if we want to know the 95th percentile, k would be 0.95.
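The sketch below shows one way to compute a percentile by linear interpolation between sorted values, with k given as a fraction between 0 and 1. It is intended to mirror the inclusive method, but treat that equivalence as an assumption to verify against Excel; the function name `percentile_inc` and the data are hypothetical.

```python
def percentile_inc(values, k):
    """Percentile by linear interpolation over ranks 0..n-1, for 0 <= k <= 1."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= k <= 1:
        raise ValueError("k must be between 0 and 1")
    ordered = sorted(values)
    rank = k * (len(ordered) - 1)  # fractional position in the sorted list
    lower = int(rank)              # index of the value just below the position
    frac = rank - lower            # how far to move toward the next value
    if lower + 1 >= len(ordered):
        return ordered[-1]
    return ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])


scores = [10, 20, 30, 40, 50]  # hypothetical data
print(percentile_inc(scores, 0.5))   # the median
print(percentile_inc(scores, 0.95))  # the 95th percentile
```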
VARIABILITY measures
How widely dispersed the data are.
To gain insight into the spread of the distribution we calculate …
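Variance. Variance measures how far the data points lie from the MEAN: it is based on the average squared deviation from the mean, not just the min and max values.
Small Standard Deviation = the data points are close to the MEAN; Large Standard Deviation = they are far from it.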
The standard deviation is equal to the square root of the variance. If the variance is 9, then the standard deviation must be 3.
To calculate the variance or standard deviation of a sample in Excel, we can use the following functions:
=VAR.S(number1, [number2], …)
=STDEV.S(number1, [number2], …)
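To make the sample calculation concrete, here is a small sketch that computes the sample variance by hand (dividing by n - 1) and checks it against Python's `statistics` module. The three data points are hypothetical, picked so that the variance is 9 and the standard deviation is 3.

```python
import math
import statistics

data = [1, 4, 7]  # hypothetical sample

# By hand: sum of squared deviations from the mean, divided by n - 1.
n = len(data)
mean = sum(data) / n
manual_variance = sum((x - mean) ** 2 for x in data) / (n - 1)
manual_stdev = math.sqrt(manual_variance)

# Library equivalents for a sample (roughly VAR.S and STDEV.S).
variance = statistics.variance(data)
stdev = statistics.stdev(data)

print(manual_variance, manual_stdev, variance, stdev)
```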
Coefficient of Variation =
a measure of relative variation, used to compare the amount of variation in two different data sets.
To compare variation in two data sets, we calculate a value called the coefficient of variation (CV).
CV = the standard deviation / the mean.
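A brief sketch of why the ratio is useful: the two hypothetical data sets below have the same standard deviation, but relative to their means the first varies far more. The function name and the data are illustrative assumptions.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


prices_a = [10, 12, 14]     # hypothetical: mean 12, standard deviation 2
prices_b = [100, 102, 104]  # hypothetical: mean 102, standard deviation 2

cv_a = coefficient_of_variation(prices_a)
cv_b = coefficient_of_variation(prices_b)

print(cv_a, cv_b)  # cv_a is larger: more variation relative to the mean
```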
To visualize the relationship between two variables, we typically use
a scatter plot. One variable is plotted on the horizontal axis (x-axis), and the other is plotted on the vertical axis (y-axis).
Correlation coefficient measures …
the strength of a linear relationship between two variables.
The correlation coefficient tells us the strength of association and its direction.
For example, we can determine if the variables are directly or inversely correlated based on the sign on the coefficient.
Excel: Correlation Coefficient
=CORREL(array1, array2)
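For reference, the Pearson correlation coefficient can be sketched in a few lines. The function name `pearson_r` and the data are hypothetical; the example shows a perfectly direct and a perfectly inverse linear relationship, where the sign gives the direction.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (sx * sy)


x = [1, 2, 3, 4, 5]
direct = [2, 4, 6, 8, 10]   # rises with x
inverse = [10, 8, 6, 4, 2]  # falls as x rises

print(pearson_r(x, direct))   # close to +1
print(pearson_r(x, inverse))  # close to -1
```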
Hidden Variable is …
a variable that is correlated with each of two variables that are not fundamentally related to each other. EXAMPLE: ice cream sales and snow shovel sales are correlated with each other only because both depend on a hidden variable, such as the weather.
The value of the correlation coefficient ranges between
-1 and +1.
A correlation coefficient near zero indicates
a weak or nonexistent linear relationship.
A correlation coefficient near zero does not mean there is no relationship between the two variables; it indicates only that any relationship that does exist is not linear.
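A small sketch of this point: below, y is completely determined by x (y = x squared), yet the correlation coefficient is zero because the relationship is not linear. The function name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]  # a perfect, but nonlinear, relationship

r = pearson_r(x, y)
print(r)  # zero: no linear relationship, despite the exact dependence
```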
When one of the variables is time, the relationship is known as a …
time series
Cross-sectional data provides a snapshot of
data across multiple groups at a given point in time.