Module 1 - Describing and Summarizing Data Flashcards
Use the descriptive statistics tool to calculate the summary statistics for the data set provided below. Enter C1 as the output range, and include a label for the data.
The Input Range is A1:A13. You must check the Labels in first row box since we included a label in cell A1 to ensure that the output table is appropriately labeled, and you must select Summary Statistics to produce the output table.
Calculate the average weight of the outfielders on the 2013 Red Sox roster.
The average weight of the outfielders is AVERAGEIF(B2:B11,“Outfielder”,C2:C11), or equivalently, AVERAGEIF(B2:B11,E2,C2:C11)=192.5 pounds.
The following data set lists the prices for thirty houses in and around Boston, Massachusetts. Create a histogram of the data using the bins provided in column D.
The Input Range is B1:B31 and the Bin Range is D1:D8. You must check the Labels in first row box since we included B1 and D1 to ensure that the histogram’s axes are appropriately labeled.
For the following scenario, determine whether it would be better to analyze cross-sectional or time series data.
We want to compare the daily sales of stores in a mall during a day-long mall-wide event.
- Cross-Sectional
- Time Series
Cross-Sectional
Since we are interested in the sales of different stores on a single day (a single point in time), we should analyze a cross-section of the stores in the mall.
Time Series
See correct answer for explanation.
How would you describe the shape of the distribution shown below of the real estate pricing data?
- Uniform
- Right-tailed
- Left-tailed
- Symmetric
Uniform
A uniform distribution has constant probability across a range of possible outcomes. Thus the bars of the histogram of a uniform distribution will have the same frequency provided the bins over the range of possible outcomes are of equal size. Since the frequencies of the bins in this graph vary, the distribution is not uniform.
Right-tailed
This graph has a tail that extends out the right side. As selling price increases, the frequency of each bin above $600,000 is much less than those below $600,000. Therefore, we infer that this distribution is skewed to the right, or right-tailed.
Left-tailed
This graph is not left-tailed. Although it has a tail, the tail extends out the right side, not the left side. Thus we cannot infer that the distribution is left-tailed.
Symmetric
This graph is not symmetric; it has a tail that extends out to one side.
Calculate the mean, median, and mode of the Boston real estate prices data using the appropriate Excel functions.
The mean is equal to the sum of all of the Boston real estate prices in the sample, divided by the number of prices in the sample. The mean can be calculated using AVERAGE(B2:B31)= $459,330. Alternatively, the mean can be calculated using =SUM(B2:B31)/COUNT(B2:B31). The mean can also be found using the descriptive statistics tool.
The median is the middle value of the sample of Boston real estate prices. The median can be calculated using MEDIAN(B2:B31)= $393,500. The median can also be found using the descriptive statistics tool.
The mode is the real estate price that appears most frequently in the sample of Boston real estate prices. The mode can be calculated using MODE.SNGL(B2:B31)= $365,000. The mode can also be found using the descriptive statistics tool.
If a dataset has multiple modes, the MODE.SNGL function (or the descriptive statistics tool) reports only the first value in the list of data that occurs most frequently. Excel’s MODE.MULT function can be used to identify more than one mode in a dataset. Note that our embedded spreadsheet does not support the MODE.MULT function.
Calculate the 25th percentile for the Boston real estate prices data.
The 25th percentile is PERCENTILE.INC(B2:B31,0.25)=$ 290,750. In this sample, there is no point that lies exactly at the 25th percentile. In this case, the 25th percentile is the point halfway in between the two points closest to having exactly one-fourth of the sample smaller than they are. This is similar to what we do when we want to find the median of a sample that has an even number of cases (we take a point half way between the two cases closest to the middle).
Below are three histograms showing the heights of several members of the Boston Red Sox. Which do you think is more effective in showing the distribution of player heights?
This histogram effectively shows the overall distribution of player heights. The size of each bin is neither too large nor too small to give an accurate representation of this sample of Boston Red Sox players’ heights.
Below are data showing the relationship between temperature in degrees Fahrenheit on a given day and sales of hot cocoa at a coffee shop. Create a scatterplot to show the relationship.
The Input Y Range is B1:B17 and the Input X Range is A1:A17. You must check the Labels in first row box since we included labels in cells A1 and B1 to ensure that the scatter plot’s axes are appropriately labeled.
The correct scatterplot appears as:
What is the correlation between temperature and hot cocoa sales?
The correlation coefficient of temperature and cups of cocoa sold is CORREL(A2:A17,B2:B17)= -0.79.
Note that the fact that the correlation is a negative number is reinforced by viewing the scatter plot of the data; there appears to be a negative trend in the data.
Suppose we wanted to compare the variability of the selling prices of Boston real estate to the variability of Boston real estate lot sizes. Calculate the coefficient of variation of the selling prices.
The coefficient of variation is STDEV.S(B2:B31)/AVERAGE(B2:B31)= 0.63
Calculate the variance of the sample of Boston real estate pricing data.
The variance is VAR.S(B2:B31)= 82,917,725,621 squared dollars.
The variance of the Boston real estate pricing data sample is approximately 82,917,725,621 squared dollars. Calculate the standard deviation of the data sample.
Recall that the standard deviation is the square root of the variance. The standard deviation is SQRT(B1)=$287,954. When provided with the detailed sample data, the standard deviation can also be computed using STDEV.S.
How many houses cost more than $400 thousand and less than or equal to $800 thousand?
- Approximately 2
- Approximately 11
- Approximately 15
- Approximately 25
Approximately 2
By convention, Excel includes in a bin’s range the number represented by the bin label. For example, the first bin (labeled $200,000) includes all houses with values less than or equal to $200,000 and the second bin (labeled $400,000) includes all houses with values greater than $200,000 but less than or equal to $400,000. The only bins with frequency 2 are the fourth bin (labeled $800,000), which indicates that approximately 2 houses cost more than $600,000 and less than or equal to $800,000, and the sixth bin (labeled $1,200,000), which indicates that approximately 2 houses cost more than $1,000,000 and less than or equal to $1,200,000). The number of houses that cost more than $400,000 and less than or equal to $800,000 is indicated by the height of the bars at bins $600,000 and $800,000. How many houses cost more than $400,000 and less than or equal to $800,000?
Approximately 11
The number of houses that cost more than $400,000 and less than or equal to $800,000 is indicated by the height of the bars at bins $600,000 and $800,000. The frequency of the bar above bin $600,000 is approximately 9 and the frequency of the bar above bin $800,000 is approximately 2. Therefore, approximately 9+2=11 houses cost more than $400,000 and less than or equal to $800,000.
Approximately 15
By convention, Excel includes in a bin’s range the number represented by the bin label. For example, the first bin (labeled $200,000) includes all houses with values less than or equal to $200,000 and the second bin (labeled $400,000) includes all houses with values greater than $200,000 but less than or equal to $400,000. Approximately 15 houses cost less than or equal to $400,000. The number of houses that cost more than $400,000 and less than or equal to $800,000 is indicated by the height of the bars at bins $600,000 and $800,000. How many houses cost more than $400,000 and less than or equal to $800,000?
Approximately 25
By convention, Excel includes in a bin’s range the number represented by the bin label. For example, the first bin (labeled $200,000) includes all houses with values less than or equal to $200,000 and the second bin (labeled $400,000) includes all houses with values greater than $200,000 but less than or equal to $400,000. Approximately 25 houses cost less than or equal to $600,000. The number of houses that cost more than $400,000 and less than or equal to $800,000 is indicated by the height of the bars at bins $600,000 and $800,000 How many houses cost more than $400,000 and less than or equal to $800,000?
The following data set contains the heights of several members of the Boston Red Sox. Create a histogram of the data using the bins provided in column C.
The Input Range is B1:B11 and the Bin Range is C1:C4. You must check the Labels in first row box since we included B1 and C1 to ensure that the histogram’s axes are appropriately labeled.
Calculate the 33rd percentile for the oil consumption data.
PERCENTILE.INC(A2:A11,0.33)=2.79.
33%, or approximately one-third, of the countries in our data set consume less than 2.79 million barrels of oil per day. 66%, or approximately two-thirds, of the countries in our data set consume more than this amount.
Below are data showing students’ grades on a statistics quiz and the number of hours they spent studying. Create a scatterplot to show the relationship.
The Input Y Range is B1:B25 and the Input X Range is A1:A25. You must check the Labels in first row box since we included labels in cell A1 and B1 to ensure that the scatter plot’s axes are appropriately labeled.
The correct scatterplot appears as:
What is the correlation between hours studying and quiz grades?
The correlation coefficient of hours studying and quiz grades is CORREL(A2:A25,B2:B25)=0.67.
Note that the fact that the correlation is a positive number is reinforced by viewing the scatterplot of the data; there appears to be a positive trend in the data.
Calculate the average weight of the infielders on the 2013 Red Sox roster.
The average weight of the infielders is AVERAGEIF(B2:B11,“Infielder”,C2:C11), or equivalently, AVERAGEIF(B2:B11,E2,C2:C11)=198.0 pounds.