Module 1 - Describing and Summarizing Data Flashcards

1
Q

Use the descriptive statistics tool to calculate the summary statistics for the data set provided below. Enter C1 as the output range, and include a label for the data.

A

The Input Range is A1:A13. You must check the Labels in first row box since we included a label in cell A1 to ensure that the output table is appropriately labeled, and you must select Summary Statistics to produce the output table.

2
Q

Calculate the average weight of the outfielders on the 2013 Red Sox roster.

A

The average weight of the outfielders is AVERAGEIF(B2:B11,"Outfielder",C2:C11), or equivalently, AVERAGEIF(B2:B11,E2,C2:C11)=192.5 pounds.
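For readers who want to check the logic outside Excel, the conditional average that AVERAGEIF computes can be sketched in Python. The roster below is hypothetical (it is not the 2013 Red Sox data), and average_if is an illustrative helper name rather than a library function; the sketch uses exact text matching for simplicity.

```python
def average_if(criteria_range, criterion, average_range):
    """Mean of the values whose matching criteria entry equals the criterion."""
    selected = [
        value
        for label, value in zip(criteria_range, average_range)
        if label == criterion
    ]
    if not selected:
        raise ValueError("no entries match the criterion")
    return sum(selected) / len(selected)


# Hypothetical positions and weights in pounds, for illustration only.
positions = ["Outfielder", "Infielder", "Outfielder", "Pitcher"]
weights = [190, 205, 200, 215]

outfielder_mean = average_if(positions, "Outfielder", weights)
print(outfielder_mean)
```

Only the rows labeled Outfielder contribute to the result, which is the same filtering that the first and second arguments of AVERAGEIF perform.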

3
Q

The following data set lists the prices for thirty houses in and around Boston, Massachusetts. Create a histogram of the data using the bins provided in column D.

A

The Input Range is B1:B31 and the Bin Range is D1:D8. You must check the Labels in first row box since we included the labels in cells B1 and D1 to ensure that the histogram’s axes are appropriately labeled.

4
Q

For the following scenario, determine whether it would be better to analyze cross-sectional or time series data.

We want to compare the daily sales of stores in a mall during a day-long mall-wide event.

  • Cross-Sectional
  • Time Series
A

Cross-Sectional

Since we are interested in the sales of different stores on a single day (a single point in time), we should analyze a cross-section of the stores in the mall.

Time Series

See correct answer for explanation.

5
Q

How would you describe the shape of the distribution shown below of the real estate pricing data?

  • Uniform
  • Right-tailed
  • Left-tailed
  • Symmetric
A

Uniform

A uniform distribution has constant probability across a range of possible outcomes. Thus the bars of the histogram of a uniform distribution will have the same frequency provided the bins over the range of possible outcomes are of equal size. Since the frequencies of the bins in this graph vary, the distribution is not uniform.

Right-tailed

This graph has a tail that extends out the right side. As selling price increases, the frequency of each bin above $600,000 is much less than those below $600,000. Therefore, we infer that this distribution is skewed to the right, or right-tailed.

Left-tailed

This graph is not left-tailed. Although it has a tail, the tail extends out the right side, not the left side. Thus we cannot infer that the distribution is left-tailed.

Symmetric

This graph is not symmetric; it has a tail that extends out to one side.

6
Q

Calculate the mean, median, and mode of the Boston real estate prices data using the appropriate Excel functions.

A

The mean is equal to the sum of all of the Boston real estate prices in the sample, divided by the number of prices in the sample. The mean can be calculated using AVERAGE(B2:B31)= $459,330. Alternatively, the mean can be calculated using =SUM(B2:B31)/COUNT(B2:B31). The mean can also be found using the descriptive statistics tool.

The median is the middle value of the sample of Boston real estate prices. The median can be calculated using MEDIAN(B2:B31)= $393,500. The median can also be found using the descriptive statistics tool.

The mode is the real estate price that appears most frequently in the sample of Boston real estate prices. The mode can be calculated using MODE.SNGL(B2:B31)= $365,000. The mode can also be found using the descriptive statistics tool.

If a dataset has multiple modes, the MODE.SNGL function (or the descriptive statistics tool) reports only the first value in the list of data that occurs most frequently. Excel’s MODE.MULT function can be used to identify more than one mode in a dataset. Note that our embedded spreadsheet does not support the MODE.MULT function.
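As a cross-check outside Excel, the three measures can be sketched with Python's standard statistics module. The prices below are hypothetical, chosen so that two values tie for most frequent; statistics.mode returns the first such value encountered and statistics.multimode returns all of them, which loosely parallels MODE.SNGL and MODE.MULT.

```python
import statistics

# Hypothetical house prices in dollars, for illustration only.
prices = [365000, 250000, 365000, 410000, 250000, 760000]

mean_price = statistics.mean(prices)       # sum of prices / number of prices
median_price = statistics.median(prices)   # middle of the sorted prices
first_mode = statistics.mode(prices)       # first most-frequent value encountered
all_modes = statistics.multimode(prices)   # every most-frequent value

print(mean_price, median_price, first_mode, all_modes)
```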

7
Q

Calculate the 25th percentile for the Boston real estate prices data.

A

The 25th percentile is PERCENTILE.INC(B2:B31,0.25)=$290,750. In this sample, there is no point that lies exactly at the 25th percentile. In this case, Excel interpolates between the two points closest to having exactly one-fourth of the sample smaller than they are. This is similar to what we do when we want to find the median of a sample that has an even number of cases (we take a point halfway between the two cases closest to the middle).
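The interpolation can be sketched in Python. This is a minimal sketch that assumes the inclusive definition (the percentile sits at fractional position p × (n − 1) in the sorted data, with linear interpolation between neighbors); percentile_inc is an illustrative name and the sample values are hypothetical.

```python
import math


def percentile_inc(values, p):
    """Inclusive percentile with linear interpolation between sorted values."""
    if not values or not 0 <= p <= 1:
        raise ValueError("need at least one value and 0 <= p <= 1")
    data = sorted(values)
    position = p * (len(data) - 1)         # zero-based fractional position
    lower = math.floor(position)
    upper = min(lower + 1, len(data) - 1)  # stay in range when p == 1
    fraction = position - lower
    return data[lower] + fraction * (data[upper] - data[lower])


# Hypothetical data: position 0.75 lies between the first and second values.
print(percentile_inc([1, 2, 3, 4], 0.25))
```

With four values, the 25th percentile falls three-quarters of the way from the smallest value to the next one, so the result lies between two data points rather than on one.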

8
Q

Below are three histograms showing the heights of several members of the Boston Red Sox. Which do you think is most effective in showing the distribution of player heights?

A

This histogram effectively shows the overall distribution of player heights. The size of each bin is neither too large nor too small to give an accurate representation of this sample of Boston Red Sox players’ heights.

9
Q

Below are data showing the relationship between temperature in degrees Fahrenheit on a given day and sales of hot cocoa at a coffee shop. Create a scatterplot to show the relationship.

A

The Input Y Range is B1:B17 and the Input X Range is A1:A17. You must check the Labels in first row box since we included labels in cells A1 and B1 to ensure that the scatter plot’s axes are appropriately labeled.

[Scatterplot of cups of hot cocoa sold (y-axis) against temperature (x-axis); image not included.]

10
Q

What is the correlation between temperature and hot cocoa sales?

A

The correlation coefficient of temperature and cups of cocoa sold is CORREL(A2:A17,B2:B17)= -0.79.

Note that the fact that the correlation is a negative number is reinforced by viewing the scatter plot of the data; there appears to be a negative trend in the data.
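The calculation behind a correlation coefficient can be sketched in Python. This is a minimal sketch of the Pearson correlation; correl is an illustrative name, and the temperature and sales figures are hypothetical rather than the coffee shop data.

```python
import math


def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, scaled by the spread of each variable.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)


# Hypothetical data: cocoa sales fall as temperature rises.
temperatures = [20, 30, 40, 50, 60]
cups_sold = [42, 35, 33, 20, 15]

r = correl(temperatures, cups_sold)
print(round(r, 2))
```

A value near -1 indicates a strong negative linear relationship, and a value near +1 a strong positive one.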

11
Q

Suppose we wanted to compare the variability of the selling prices of Boston real estate to the variability of Boston real estate lot sizes. Calculate the coefficient of variation of the selling prices.

A

The coefficient of variation is STDEV.S(B2:B31)/AVERAGE(B2:B31)=0.63.
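The reason this ratio suits comparisons across different units can be sketched in Python. The values below are hypothetical: the same three prices expressed in dollars and in thousands of dollars give the same coefficient of variation, because the units cancel when the sample standard deviation is divided by the mean.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (a unitless ratio)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical prices, once in dollars and once in thousands of dollars.
prices_dollars = [200_000, 400_000, 600_000]
prices_thousands = [200, 400, 600]

cv_dollars = coefficient_of_variation(prices_dollars)
cv_thousands = coefficient_of_variation(prices_thousands)
print(cv_dollars, cv_thousands)
```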

12
Q

Calculate the variance of the sample of Boston real estate pricing data.

A

The variance is VAR.S(B2:B31)= 82,917,725,621 squared dollars.

13
Q

The variance of the Boston real estate pricing data sample is approximately 82,917,725,621 squared dollars. Calculate the standard deviation of the data sample.

A

Recall that the standard deviation is the square root of the variance. The standard deviation is SQRT(B1)=$287,954. When provided with the detailed sample data, the standard deviation can also be computed using STDEV.S.
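The same arithmetic can be checked in Python, using the variance stated in the question.

```python
import math

variance = 82_917_725_621        # squared dollars, as given in the question
std_dev = math.sqrt(variance)    # back in dollars

print(round(std_dev))            # 287954
```

Taking the square root returns the measure of spread to the original units (dollars), which is why the standard deviation is easier to interpret than the variance.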

14
Q

How many houses cost more than $400 thousand and less than or equal to $800 thousand?

  • Approximately 2
  • Approximately 11
  • Approximately 15
  • Approximately 25
A

Approximately 2

By convention, Excel includes in a bin’s range the number represented by the bin label. For example, the first bin (labeled $200,000) includes all houses with values less than or equal to $200,000 and the second bin (labeled $400,000) includes all houses with values greater than $200,000 but less than or equal to $400,000. The only bins with frequency 2 are the fourth bin (labeled $800,000), which indicates that approximately 2 houses cost more than $600,000 and less than or equal to $800,000, and the sixth bin (labeled $1,200,000), which indicates that approximately 2 houses cost more than $1,000,000 and less than or equal to $1,200,000. The number of houses that cost more than $400,000 and less than or equal to $800,000 is indicated by the height of the bars at bins $600,000 and $800,000. How many houses cost more than $400,000 and less than or equal to $800,000?

Approximately 11

The number of houses that cost more than $400,000 and less than or equal to $800,000 is indicated by the height of the bars at bins $600,000 and $800,000. The frequency of the bar above bin $600,000 is approximately 9 and the frequency of the bar above bin $800,000 is approximately 2. Therefore, approximately 9+2=11 houses cost more than $400,000 and less than or equal to $800,000.

Approximately 15

By convention, Excel includes in a bin’s range the number represented by the bin label. For example, the first bin (labeled $200,000) includes all houses with values less than or equal to $200,000 and the second bin (labeled $400,000) includes all houses with values greater than $200,000 but less than or equal to $400,000. Approximately 15 houses cost less than or equal to $400,000. The number of houses that cost more than $400,000 and less than or equal to $800,000 is indicated by the height of the bars at bins $600,000 and $800,000. How many houses cost more than $400,000 and less than or equal to $800,000?

Approximately 25

By convention, Excel includes in a bin’s range the number represented by the bin label. For example, the first bin (labeled $200,000) includes all houses with values less than or equal to $200,000 and the second bin (labeled $400,000) includes all houses with values greater than $200,000 but less than or equal to $400,000. Approximately 25 houses cost less than or equal to $600,000. The number of houses that cost more than $400,000 and less than or equal to $800,000 is indicated by the height of the bars at bins $600,000 and $800,000. How many houses cost more than $400,000 and less than or equal to $800,000?
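The bin convention described in these explanations can be sketched in Python. This is a minimal sketch with hypothetical prices (in thousands of dollars); bin_counts is an illustrative name, the bin labels are assumed to be sorted in ascending order, and the extra final slot plays the role of an overflow bin for values above the last label.

```python
from bisect import bisect_left


def bin_counts(values, bin_labels):
    """Count values per bin, where each bin includes its own label (upper edge)."""
    counts = [0] * (len(bin_labels) + 1)  # last slot holds values above every label
    for value in values:
        # bisect_left finds the first label >= value, so a value equal to a
        # label is counted in that label's bin, not the next one.
        counts[bisect_left(bin_labels, value)] += 1
    return counts


# Hypothetical prices and bin labels, in thousands of dollars.
prices = [150, 200, 201, 400, 650, 800, 1300]
labels = [200, 400, 600, 800]

counts = bin_counts(prices, labels)
print(counts)

# Houses costing more than 400 and at most 800: the 600 and 800 bins together.
between_400_and_800 = counts[2] + counts[3]
print(between_400_and_800)
```

Adding the counts for the 600 and 800 bins mirrors the step in the explanation above, where the heights of those two bars are summed.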

15
Q

The following data set contains the heights of several members of the Boston Red Sox. Create a histogram of the data using the bins provided in column C.

A

The Input Range is B1:B11 and the Bin Range is C1:C4. You must check the Labels in first row box since we included the labels in cells B1 and C1 to ensure that the histogram’s axes are appropriately labeled.

16
Q

Calculate the 33rd percentile for the oil consumption data.

A

PERCENTILE.INC(A2:A11,0.33)=2.79.

33%, or approximately one-third, of the countries in our data set consume less than 2.79 million barrels of oil per day. 67%, or approximately two-thirds, of the countries in our data set consume more than this amount.

17
Q

Below are data showing students’ grades on a statistics quiz and the number of hours they spent studying. Create a scatterplot to show the relationship.

A

The Input Y Range is B1:B25 and the Input X Range is A1:A25. You must check the Labels in first row box since we included labels in cells A1 and B1 to ensure that the scatter plot’s axes are appropriately labeled.

[Scatterplot of quiz grades (y-axis) against hours spent studying (x-axis); image not included.]

18
Q

What is the correlation between hours studying and quiz grades?

A

The correlation coefficient of hours studying and quiz grades is CORREL(A2:A25,B2:B25)=0.67.

Note that the fact that the correlation is a positive number is reinforced by viewing the scatterplot of the data; there appears to be a positive trend in the data.

19
Q

Calculate the average weight of the infielders on the 2013 Red Sox roster.

A

The average weight of the infielders is AVERAGEIF(B2:B11,"Infielder",C2:C11), or equivalently, AVERAGEIF(B2:B11,E2,C2:C11)=198.0 pounds.

20
Q

Calculate the coefficient of variation of the lot sizes.

A

The coefficient of variation is STDEV.S(C2:C31)/AVERAGE(C2:C31)=0.72.

21
Q

For the following scenario, determine whether it would be better to analyze cross-sectional or time series data.

We want to see if the Red Sox performance changes over the course of the baseball season.

  • Cross-Sectional
  • Time Series
A

Cross-Sectional

See correct answer for explanation.

Time Series

Since we are interested in comparing the Red Sox performance at different points in time during the baseball season, we should analyze time series data.

22
Q

The following data set contains the heights of several members of the Boston Red Sox. Create a histogram of the data using the bins provided in column C.

A

The Input Range is B1:B11 and the Bin Range is C1:C6. You must check the Labels in first row box since we included the labels in cells B1 and C1 to ensure that the histogram’s axes are appropriately labeled.

23
Q

Calculate the 66th percentile for the following data set.

A

PERCENTILE.INC(A2:A13,0.66)=12.26. The 66th percentile is 12.26.

24
Q

The data set below shows annual healthcare expenditures for 192 countries. Create a histogram of the data using the bins provided in column D.

A

From the Data menu, select Data Analysis, then select Histogram. The Input Range is B1:B193 and the Bin Range is D1:D11. You must check the Labels in first row box to ensure that the histogram’s axes are appropriately labeled.

25
Q

Suppose you actually want to calculate the mean annual healthcare expenditures of the 192 countries. Which of the following Excel functions calculates the mean? SELECT ALL THAT APPLY.

  • =MEAN(B2:B193)
  • =AVERAGE(B2:B193)
  • =MEDIAN(B2:B193)
  • =SUM(B2:B193)/192
  • =MODE.SNGL(B2:B193)
A

=MEAN(B2:B193)

=MEAN(B2:B193) is not a function in Excel.

=AVERAGE(B2:B193)

=AVERAGE(B2:B193) calculates the mean of the annual healthcare expenditures. Note that another option is also correct.

=MEDIAN(B2:B193)

=MEDIAN(B2:B193) finds the median, or middle value, of the annual healthcare expenditures.

=SUM(B2:B193)/192

=SUM(B2:B193)/192 calculates the sum of the annual healthcare expenditures and divides that sum by 192, the number of data points. This formula calculates the mean of the annual healthcare expenditures. Note that another option is also correct.

=MODE.SNGL(B2:B193)

=MODE.SNGL(B2:B193) finds the mode, or most common value, of the annual healthcare expenditures.

26
Q

Calculate the standard deviation of the 2012 revenue for Forbes’ 100 top companies.

A

STDEV.S(B2:B101) is approximately $67.32 billion. You can also use the descriptive statistics tool, making sure to link directly to values in order to obtain the correct answer.

27
Q

Consider the four outliers in the 2012 revenue data: companies with revenue of $237 billion, $246 billion, $447 billion, and $453 billion. If we removed these companies from the data set, what would happen to the standard deviation?

  • The standard deviation would remain the same.
  • The standard deviation would increase.
  • The standard deviation would decrease.
  • The answer cannot be determined without further information.
A

The standard deviation would remain the same.

See correct answer for explanation.

The standard deviation would increase.

See correct answer for explanation.

The standard deviation would decrease.

The standard deviation gives more weight to observations that are further from the mean. Therefore, removing the outliers would decrease the standard deviation.

The answer cannot be determined without further information.

See correct answer for explanation.
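The effect of removing outliers can be illustrated with a short Python sketch. The revenue figures below are hypothetical, not the Forbes data; the point is only that dropping values far from the mean shrinks the sample standard deviation.

```python
import statistics

# Hypothetical revenues in billions of dollars; the last two are outliers.
revenues = [50, 55, 60, 65, 70, 75, 80, 240, 450]
without_outliers = [r for r in revenues if r < 200]

sd_all = statistics.stdev(revenues)
sd_trimmed = statistics.stdev(without_outliers)

print(round(sd_all, 1), round(sd_trimmed, 1))
```

Because each deviation from the mean is squared before averaging, the two large values dominate the first result, and the second result is much smaller once they are removed.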

28
Q

The following data set provides the percent of students from the top 100 ranked U.S. MBA programs that are employed upon graduation. Create a histogram to visualize the data. Use the bins provided in column C.

A

From the Data menu, select Data Analysis, then select Histogram. The Input Range is B1:B101 and the Bin Range is C1:C15. You must check the Labels in first row box to ensure that the histogram’s axes are appropriately labeled.

29
Q

A consultant compiled the following data set that shows the number of visits made to the National Museum of American History from 2001 to 2015. The consultant noticed that the number of visits in 2007 and 2008 seemed unusually low compared to the rest of the data set. What should the consultant do about the data points from 2007 and 2008?

  • Delete both data points because they are outliers.
  • Leave both data points in because one should never delete research-based data.
  • Research the data points and then make a decision based on the findings.
  • Change the data point for 2008 to 4,800,000 and research the data point for 2007.
A

Delete both data points because they are outliers.

The consultant should not delete data points simply because they are outliers.

Leave both data points in because one should never delete research-based data.

The consultant cannot assume that research-based data sets are without fault. There may be situations where data should be deleted: because of measurement or entry error; because the data are not representative of the population of interest; or any of many other reasons.

Research the data points and then make a decision based on the findings.

The consultant should delete or change data points only if careful examination of the data and the data sources indicates that the data points are incorrect or irrelevant to the research at hand. The consultant must use his or her experience and knowledge of the research question to make decisions on a case-by-case basis. Doing business analytics effectively requires judgment. In this case, the National Museum of American History underwent renovations which significantly reduced the number of visits to the museum in 2007 and 2008. The data points for 2007 and 2008 are correct and should not be changed. However, the fact that the museum was closed during most of that two-year period should be considered when drawing conclusions from this data set.

Change the data point for 2008 to 4,800,000 and research the data point for 2007.

Although data entry errors may occur, the consultant cannot know this without researching the data points first. In this case, the National Museum of American History underwent renovations which significantly reduced the number of visits to the museum in 2007 and 2008. The data points for 2007 and 2008 are correct and should not be changed. However, the fact that the museum was closed during most of that two-year period should be considered when drawing conclusions from this data set.

30
Q

A jeweler takes stock of the diamonds she has available for mounting in the jewelry she sells. Calculate the average and standard deviation of the total number of carats in the diamonds recorded in the spreadsheet.

A

AVERAGE(A2:A100)=0.98 and STDEV.S(A2:A100)=0.42. You can also use the descriptive statistics tool, making sure to link directly to values in order to obtain the correct answer.

31
Q
A
32
Q

Which of the following histograms has the smallest range? Assume all values in the data set are integers.

A

Option A

The range of this histogram is approximately 7–0=7. This is the smallest range in this set of histograms.

33
Q

The data below show the number of hours 60 fifth-grade students reported reading last week and each student’s gender. Use the AVERAGEIF function to calculate the average number of hours spent reading last week for boys, and the average number of hours spent reading last week for girls.

A

This is a conditional mean, so you can either use AVERAGEIF(B2:B61,"Boys",A2:A61)=4.48 and AVERAGEIF(B2:B61,"Girls",A2:A61)=5.55 or AVERAGEIF(B2:B61,D2,A2:A61)=4.48 and AVERAGEIF(B2:B61,D3,A2:A61)=5.55. You could also just sort the data by gender and compute the averages of each gender, but we want you to learn how to do conditional averages in Excel. As always, it is important that you link to the cells with the data.

34
Q

The following data set provides the average driving distance for 185 members of the Professional Golf Association (PGA) Tour.

Use the descriptive statistics tool to calculate the summary statistics for the average driving distance. Make sure to set the output range to cell D1 so your table is graded accurately.

A

Use the descriptive statistics tool to calculate all of the summary statistics. The Input Range is B1:B186. You must check the Labels in first row box to ensure that the output table is appropriately labeled. You must select Summary Statistics in order to produce the output table.

35
Q

Which of the following formulas would calculate the statistic that is MOST APPROPRIATE for comparing the variability of two data sets with different distributions?

  • Mean/Standard Deviation
  • Standard Deviation/Mean
  • Mean-Median
  • Median-Mean
  • Mean/Variance
  • Variance/Mean
A

Mean/Standard Deviation

This is the inverse of the formula for the coefficient of variation.

Standard Deviation/Mean

This is the formula for the coefficient of variation, the best statistic to compute to compare the variability of two data sets with different distributions. Dividing by the mean provides a measure of the distribution’s variation relative to the mean.

Mean-Median

Although the difference between the mean and the median may provide information about whether a dataset is skewed, it does not provide useful information for comparing variability across different distributions.

Median-Mean

Although the difference between the mean and the median may provide information about whether a dataset is skewed, it does not provide useful information for comparing variability across different distributions.

Mean/Variance

The mean and variance are measured in different units. For example, if the mean is measured in feet, the variance is measured in square feet. The coefficient of variation is calculated using the mean and standard deviation, both of which have the same units.

Variance/Mean

The mean and variance are measured in different units. For example, if the mean is measured in feet, the variance is measured in square feet. The coefficient of variation is calculated using the mean and standard deviation, both of which have the same units.

36
Q

Calculate the coefficient of variation for the average driving distances of the PGA Tour.

A

Coefficient of Variation = Standard Deviation/Mean. Entering =E6/E2 calculates the coefficient of variation, which is approximately 0.03. You must link directly to values in order to obtain the correct answer.

37
Q

The data set below provides information about 125 randomly selected companies from the Standard and Poor’s (S&P) 1500. Calculate the average number of employees for technology companies.

A

This is a conditional mean, so you can either use AVERAGEIF(B2:B126,"Technology",C2:C126) or AVERAGEIF(B2:B126,E2,C2:C126). The average number of employees at technology companies in this data set is approximately 7,318.

38
Q

The following data set provides the 2012 revenue (in billions of dollars) for the top 75 companies as declared by the Fortune 500 rankings. What amount do 60% of the companies earn equal to or less than?

A

PERCENTILE.INC(B2:B76,0.60)=$74.40 billion. You must link directly to values in order to obtain the correct answer.

39
Q

The following data set provides the acceptance rate of the top 100 U.S. MBA programs and the percent of students that are employed upon graduation. Create a scatter plot to illustrate the relationship between the acceptance rate at MBA programs and the percent of students that are employed upon graduation. Place “Percent Employed” on the y-axis and “Acceptance Rate” on the x-axis.

A

From the Insert menu, select Scatter, then select Scatter With Only Markers. The Input Y Range is C1:C101 and the Input X Range is B1:B101. You must check the Labels in first row box to ensure that the scatter plot’s axes are appropriately labeled.

40
Q

Calculate the correlation coefficient between the acceptance rate at the top 100 U.S. MBA programs and the percent of students in those programs who are employed upon graduation.

A

CORREL(B2:B101,C2:C101)=-0.32. The correlation coefficient between the acceptance rate at the top 100 U.S. MBA programs and the percent of students that are employed upon graduation is approximately -0.32. You must link directly to values in order to obtain the correct answer.

41
Q

What can be concluded from the fact that the correlation coefficient between the acceptance rate at the top 100 U.S. MBA programs and the percent of students in those programs who are employed upon graduation is -0.32?

  • On average, as the acceptance rate increases, the percent of students employed upon graduation increases.
  • On average, as the acceptance rate decreases, the percent of students employed upon graduation decreases.
  • On average, as the acceptance rate decreases, the percent of students employed upon graduation increases.
  • On average, as the acceptance rate increases, the percent of students employed upon graduation remains the same.
A

On average, as the acceptance rate increases, the percent of students employed upon graduation increases.

A positive correlation coefficient would indicate that, on average, as acceptance rate increases, the percent of students employed upon graduation increases.

On average, as the acceptance rate decreases, the percent of students employed upon graduation decreases.

A positive correlation coefficient would indicate that, on average, as acceptance rate decreases, the percent of students employed upon graduation decreases.

On average, as the acceptance rate decreases, the percent of students employed upon graduation increases.

-0.32 is negative, which indicates that, on average, as acceptance rate decreases, the percent of students employed upon graduation increases.

On average, as the acceptance rate increases, the percent of students employed upon graduation remains the same.

A correlation coefficient of zero would indicate no relationship.

42
Q

An internet marketing firm compiled a data set of the number of seconds website visitors stay on one of its client’s homepage before abandoning the site. The firm presented the summary statistics for the data set to the client.

The client asked why the mean of the data set is so much larger than the median. Which of the following is most likely true?

  • The distribution of the data is symmetric
  • The distribution of the data is skewed to the left
  • The distribution of the data is skewed to the right
  • The distribution of the data is bimodal
A

The distribution of the data is symmetric

When the distribution of data is symmetric, the mean and median are equal.

The distribution of the data is skewed to the left

When the distribution of data is skewed to the left, the mean is most likely less than the median. The extreme values in the left tail pull the mean towards them.

The distribution of the data is skewed to the right

When the distribution of data is skewed to the right, the mean is most likely greater than the median. The extreme values in the right tail pull the mean towards them.

The distribution of the data is bimodal

When the distribution of data is bimodal, the mean may be less than, equal to, or greater than the median.
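How a right tail pulls the mean above the median can be illustrated with a short Python sketch. The visit durations below are hypothetical, not the firm's data.

```python
import statistics

# Hypothetical seconds on the homepage; one visitor stayed far longer than the rest.
seconds_on_page = [5, 8, 10, 12, 15, 20, 300]

mean_seconds = statistics.mean(seconds_on_page)
median_seconds = statistics.median(seconds_on_page)

print(round(mean_seconds, 1), median_seconds)
```

The single long visit raises the mean well above the median, while the median, which depends only on the middle of the sorted data, is unaffected by how extreme that visit is.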

43
Q

Time Series or Cross-Sectional Data?

TO DETERMINE IF ENROLLMENT IN HIGHER EDUCATION IS INCREASING.

A

Time series

44
Q

Time Series or Cross-Sectional Data?

TO COMPARE THE INSECT POPULATION IN A GEOGRAPHIC REGION BEFORE AND AFTER AN INSECTICIDE WAS APPLIED.

A

Time series

45
Q
A
46
Q

Time Series or Cross-Sectional Data?

TO COMPARE THE CURRENT PRICE OF A GALLON OF GASOLINE ACROSS DIFFERENT GAS STATIONS IN LOS ANGELES, CA.

A

Cross-sectional

47
Q

Time Series or Cross-Sectional Data?

TO SEE IF THERE ARE DIFFERENCES IN THE AVERAGE NUMBER OF CALORIES CONTAINED IN SCHOOL LUNCHES SERVED IN EACH OF THE FIFTY STATES ON DECEMBER 1, 2015.

A

Cross-sectional

48
Q

What is:

Data that provide a measure of an attribute across multiple different subjects (e.g. people, organizations, or countries) at a given moment in time or during a given time period.

A

Cross-sectional data

49
Q

Which of the following is an example of a hidden variable?

  • Quality of life is a hidden variable because it cannot be measured directly but must be inferred from measurable variables such as wealth, success, and environment.
  • A recent study showed a correlation between a country’s chocolate consumption and the number of Nobel prizes won by its scientists. The hidden variable is a strong university system that fosters talented researchers.
  • The correlation between smoking and lung cancer was a hidden variable for a long time because the cigarette lobby paid to keep the relationship hidden.
  • There is a correlation between the number of firefighters who show up at a fire and how much damage the fire causes. The hidden variable is the size of the fire.
A

Quality of life is a hidden variable because it cannot be measured directly but must be inferred from measurable variables such as wealth, success, and environment.

A hidden variable is one that is correlated with each of two variables that are not fundamentally related to each other. In this example, we are not looking at a correlation between two variables, but rather trying to determine a single variable, quality of life.

A recent study showed a correlation between a country’s chocolate consumption and the number of Nobel prizes won by its scientists. The hidden variable is a strong university system that fosters talented researchers.

A hidden variable is one that is correlated with each of two variables that are not fundamentally related to each other. Although a strong university system is probably correlated with the number of Nobel prizes, it is probably not related to the amount of chocolate consumed, and so does not function as a hidden variable between prizes and chocolate.

The correlation between smoking and lung cancer was a hidden variable for a long time because the cigarette lobby paid to keep the relationship hidden.

A hidden variable is one that is correlated with each of two variables that are not fundamentally related to each other; it is not one that is being hidden due to political pressures.

There is a correlation between the number of firefighters who show up at a fire and how much damage the fire causes. The hidden variable is the size of the fire.

A hidden variable is one that is correlated with each of two variables that are not fundamentally related to each other. In this case, the size of the fire leads to a call for more firefighters, and the size of the fire also generally leads to more damage. The number of firefighters does not lead to a greater amount of fire damage.