HBX- BA - 1 Flashcards by Diana cristea

What is a Histogram & how do you create one in excel?

A histogram’s
- x-axis represents bins corresponding to ranges of data
- y-axis indicates the frequency of observations falling into each bin.
The best bin size depends on what we are trying to learn from the data.
Using larger bins can simplify a histogram, but may make it difficult to see trends in the data.
Very small bins can have such low frequencies that make it difficult to discern patterns.
How to create one:
- Add the information in to the excel sheet to create the graph!
- Label the Columns
- Enter the bin choices that you want.
- Data -> Data Analysis -> Histogram
  - Details to fill out!
  - The Input Range is the original oil consumption data (not the data you created above) Should something look like this: A1**:A11
  - The Bin Range- the boxes that you have the bins in. Should something look like this: D1**:D20
  - Make sure to click labels in 1st row!
Full Definition: A common graphical representation of statistical data, used to represent the distribution of values of a single variable in a data set. The full range of the variable’s values is drawn on the x-axis and divided into non-overlapping intervals called bins. A vertical bar is constructed for each bin. The height of the bar corresponding to a bin is equal to the number of data points in the bin (that is the number of data points with a value within the range of the interval). In an Excel histogram, each bin is labeled by the value of the upper boundary of the bin’s range. For example, in a histogram with three bins (each of width 1), labeled 1, 2, and 3, the bin labeled 2 contains all observations greater than 1 and less than or equal to 2. See bin.

How well did you know this?

Not at all

Perfectly

What’s so special about BINS (for histograms)?

BINS (for histograms)
Bins are like the price is right- you can’t go over the #!

Ex: in a histogram with three bins (each of width 1), labeled 1, 2, and 3, the bin labeled 2 contains all observations greater than 1 and less than or equal to 2. See bin.

How well did you know this?

Not at all

Perfectly

What does it mean for a graph to be skewed?

A characteristic of an asymmetric distribution. A skewed distribution is sometimes characterized by the behavior of the distribution’s tails. For example, a right skewed distribution may have a longer or fatter “tail” of observations extending to the right than to the left and a left skewed distribution may have a longer or fatter “tail” to the left than to the right. Although often useful, this approach can be uninformative in assessing certain distributions.

The attached graph (The oil consumption data set) is skewed to the right.

How well did you know this?

Not at all

Perfectly

Select the countries that are in bin 4.

India & Russa

Bin 4 includes all countries that consume more than 3 million barrels and less than or equal to 4 million barrels of oil per day. Russia consumes 3.2 million barrels per day and therefore falls into bin 4.

How well did you know this?

Not at all

Perfectly

Greater than 2 million and less than or equal to 3 million barrels per day

The tallest bar corresponds to bin 3, which means that bin 3 has the highest frequency. Therefore, the range containing the most countries in the data set includes all countries that consume more than 2 million and less than or equal to 3 million barrels of oil per day.

How well did you know this?

Not at all

Perfectly

How well did you know this?

Not at all

Perfectly

Outlier

A data point in a data set that is atypical in that it lies far outside of the range of the other points in the data set. Technically, an outlier is more than 1.5 times the interquartile range greater than the upper quartile or 1.5 times the interquartile range less than the lower quartile. For example, if the lower quartile, Q1, is 500 and the upper quartile, Q3, is 700, then the interquartile range is 700 – 500 = 200. So any observation with a value less than 200 = 500 – (200*1.5) or greater than 1,000 = 700 + (200*1.5) would be considered an outlier. See quartile.

How well did you know this?

Not at all

Perfectly

What are the formulas for Mean, Median, & Mode in Excel?

Mean, Median, & Mode in Excel

Mean =AVERAGE (Then drag and select all points- EX: A1:A5 OR you could manually enter all the #’s with a comma EX: 1.5, .5, 1)
Median =MEDIAN (Then drag and select all points)
Mode =MODE.SNGL (Then drag and select all points)
*EXTRA: The Excel function MODE.MULT finds all of the modes in a data set, generating a vertical array that has one row for each mode in the data set. Suppose a data set has three modes. To find them, instead of entering the MODE.MULT formula in a single cell, highlight at least three vertically-contiguous cells and then input =MODE.MULT(number 1, [number 2], …) into the formula bar. Then, instead of using ENTER to find the result, use CTRL+SHIFT+ENTER to enter the array. The modes will appear in the first rows of the array (filling as many rows as there are modes in the data set) and #N/A will appear in each of the other rows in the array.*
How do we know how many vertically-contiguous cells to highlight when using this function? To determine how many modes are in a data set, enter =COUNT(MODE.MULT(number 1, [number 2], …)) in a single cell. When you click ENTER, the function will return the number of modes in the data set. The result will tell you how many vertical cells you should highlight to create a MODE.MULT array.*
Note that our embedded spreadsheet currently does not support the functionality required by MODE.MULT.*

How well did you know this?

Not at all

Perfectly

us & china

How well did you know this?

Not at all

Perfectly

How well did you know this?

Not at all

Perfectly

What is a conditional mean and what is the excel formula?

A conditional mean is the mean (average) of a subset of data. We apply a condition and calculate the mean for values that meet that condition. For example, in a data set that contains data on both males and females, a conditional mean might be the mean of the data pertaining to only the females in the data set.

Calculate Conditional Mean in Excel

Conditional Mean =AVERAGEIF(range, criteria, [average_range])
- Range- the options we’re choosing our condition from- but ALL of the options
- Criteria- the specifics of what we’re looking for, the “condition”
- Average Range- all the data from the set we want to average (it will only take the ones you want because we’ve told it to!
It will end up looking up something like this =AVERAGEIF(C2:C11,E2,A2:A11)
- Commas between each category

How well did you know this?

Not at all

Perfectly

PERCENTILE- What is it and what’s the excel formula?

The value of a variable for which a certain percentage of the data set falls below. For example, if 87% of students taking the GMAT exam earn scores below 670, the 87th percentile for the GMAT exam is 670 points.

To find a percentile in Excel, we use the following function:

=PERCENTILE.INC(array, k)
- Array is the range of data for which we want to calculate a given percentile.
- K is the percentile value.
For example, if we want to know the 95th percentile, k would be 0.95.*

How well did you know this?

Not at all

Perfectly

What percentile does the mean represent?

What percentile does the median represent?

What percentile does the mode represent?

The answer cannot be determined without further information

Remember that the mean’s location depends upon the distribution of the data set. Recall how the location of the mean differs for a symmetrical distribution and a skewed distribution. Therefore, there is no way to determine the percentile of the mean without more information about the data set.

50%

Remember that half of a distribution’s data points are less than or equal to the median. Therefore, the median is equal to the 50th percentile, because 50% of the data points are equal to or below this value.

The answer cannot be determined without further information

Remember that the mode’s location depends upon the distribution of the data set. Therefore, there is no way to determine the percentile of the mode without more information about the data set.

How well did you know this?

Not at all

Perfectly

Variance - What is it and what’s the excel formula?

The variance measures how far each point is from the mean.
A measure of the spread of a data set’s values around its mean value.

If the true population mean is known:
the variance = the sum of the squares of the differences between each point of the data set and the population mean / (total number of data points)
If the mean is estimated from a sample:
the variance = the sum of the squares of the differences between each point of the data set and the sample mean, / (total number of data points in the sample - 1)
The variance is the square of the standard deviation.
The variance is measured in squared units (e.g., if the data set contains data denominated in dollars, the variance will be in squared dollars).

Variance in Excel

number 1 is the first number, cell reference, or range of cells for which to calculate the specified value.
=VAR.S(number 1, [number 2], …)
- [number 2],… represents additional numbers, cell references, or ranges of cells. The square brackets indicate that the argument is optional.

Standard Deviation in Excel

=STDEV.S(number 1, [number 2], …)
- number 1 is the first number, cell reference, or range of cells for which to calculate the specified value.
- [number 2],… represents additional numbers, cell references, or ranges of cells. The square brackets indicate that the argument is optional.

****** “.S” - that’s listed above after both things - means we are working with a sample

*OPTION- We can also find the standard deviation using the Excel function =SQRT(number) to take the square root of the variance. For example, =SQRT(16)=4.

How well did you know this?

Not at all

Perfectly

The range is the difference between the maximum value and the minimum value. We can see from the histogram that the maximum value in this data set is 10 and the minimum value is 0, so the range equals 10–0=10.

How well did you know this?

Not at all

Perfectly

If the variance of a data set is 9, what is the standard deviation?

The standard deviation is equal to the square root of the variance. If the variance is 9, then the standard deviation must be 3.

Calculate the mean and the standard deviation for the heights of the ten Boston Red Sox players.

Define & state how to create descriptive statistics in excel.

Also known as summary statistics, these are numbers that provide a quick overview of the properties of a data set. Typically, descriptive statistics include the data set’s mean, median, mode, standard deviation, sample size, minimum, maximum, and range.

Descriptive Statistics in Excel! (This makes a table with Mean, Median, Mode, Standard Deviation, etc…)

Data -> Data Analysis -> Descriptive Statistics
- Input Range (Put in all the boxes of data that you want to use- and included the label! Most likely, they’ll be in one column. For example A1:A11, which means A1 through A11)
  - Click LABEL IN FIRST ROW BOX
Output Range (Where you want the left corner of the box to be)
Select/Check SUMMARY STATISTICS so the table is made.

kurtosis

A measure of the flatness or sharpness of a distribution. A flat distribution with thick tails has a low or negative kurtosis; a very sharp distribution, with thin tails and a sharp rise to the peak, has a large, positive kurtosis.

Coefficient of Variation

The standard deviation describes how much the values in a single data set vary, but what if we want to compare the amount of variation in two different data sets?
Coefficient of Variation is a measure of a data set’s variability relative to its mean. The coefficient of variation (CV) is particularly helpful when comparing the variability of two data sets with different means. Calculated as the standard deviation divided by the mean, the CV is typically expressed as a percentage. For example the CV of a data set with mean = 100 hours and standard deviation = 15 hours is 15 hours/100 hours = 15%.

Let’s use the coefficient of variation to compare a few data sets. Which of the following data sets has the highest relative variation?

Mean=7.00; Standard Deviation=1.64
Mean=82.76; Standard Deviation=189.53
Mean=150.82; Standard Deviation=201.15
Mean=172.00; Standard Deviation=25.54

Mean=82.76; Standard Deviation=189.53

Remember that coefficient of variation=Standard DeviationMeancoefficient of variation=Standard DeviationMean. We could formally compute the value: CV=189.5382.76=2.29CV=189.5382.76=2.29, or we could do a rough estimate of the CV for this option by comparing the numerator and denominator. Since the standard deviation is more than twice the mean, this option has a CV greater than two.

Scatter Plot

A graph showing the relationship between two variables. One variable (generally the independent variable) is measured along the x-axis, and the other (generally the dependent variable) is measured along the y-axis. A single marker is placed for each observation in the data set, allowing for easy visualization of the relationship between the variables. **_Create a Scatter Plot in Excel_** * Insert -\> Scatter -\> Scatter With Only Markers * Input Y range (Ex: C1:C11) * Input X range (Ex: B1:B11) * Check the “Labels in First Box”

**Weight increases** The scatter plot indicates that as height increases, weight also increases.

Correlation Coefficient

**A measure of the strength of a linear relationship between two variables.** The correlation coefficient can range from -1 to +1. A correlation coefficient of -1 indicates a perfect negative linear relationship between two variables, whereas a correlation coefficient of +1 indicates a perfect positive linear relationship. A correlation coefficient of 0 indicates that no linear relationship exists between two variables, _though it is possible that a **non-linear** relationship exists between the two variables._ **_How to find the Correlation Coefficient In Excel_** =CORREL(array 1, array 2) * Array 1 is a set of numerical variables or cell references containing data for one variable of interest * Array 2 “ “ for the OTHER variable of interest * These observations must be equal *An example to remember that you shouldn’t ALWAYS believe a # (See the graph!) The correlation of height and weight between the ten Red Sox players is quite high. However, a high correlation does not imply that one variable causes the other. Taller people may tend to weigh more, but gaining weight won’t make you taller! Correlation indicates a linear relationship, but it does not indicate causality.*

**0.8** The relationship between height and weight is positive and is quite strong so 0.8 is a good estimate.

Hidden Variable

A variable that is correlated with two different variables that are not directly related to each other. The two variables may appear to be unrelated, but are mathematically correlated because each of them is correlated with a third, the hidden variable that drives the observed correlation. A hidden variable makes its presence known through its relationship with each of the two variables that are being observed. *There is a positive correlation between cranberry sauce sales and airplane ticket prices. The hidden variable is Thanksgiving! Every southern family has cranberry sauce at Thanksgiving (I don’t know of anyone that eats it at any other time of the year) and airline prices are insanely expensive at this time.*

A hidden variable, such as GDP, may explain variation in oil consumption across various countries, and provide more clarity than looking solely at the number of barrels of oil consumed. Example of a hidden variable Not an example of a hidden variable

Not an example of a hidden variable GDP is likely correlated with oil consumption. To determine whether there is a hidden variable, first identify two variables that are not fundamentally related to each other, and then identify a third “hidden” variable that is correlated with each. In this example, what would the two variables be? One would be oil, but there is no second variable proposed that is fundamentally unrelated to oil.

A researcher finds a positive correlation between the number of traffic lights in a town or city and the number of crimes committed each month in that town. The hidden variable is population. Cities with a greater number of people have more traffic and thus need more traffic lights. These cities also have more people who can commit crimes (and be victims of crimes), and more crimes are committed. * Example of a hidden variable * Not an example of a hidden variable

**Example of a hidden variable** To determine whether there is a hidden variable, first identify two variables that are not fundamentally related to each other, and then identify a third variable that is correlated with each. In this case, the two variables are number of traffic lights and number of crimes. The third variable, population, is related to both. Population is related to traffic lights; higher populations lead to more traffic, which in turn leads to the need for more lights. Population is also related to number of crimes. Even if we hold the crime rate constant, as the population increases, the number of criminals, and thus number of crimes, increase. Traffic lights, however, do not lead to crime or vice-versa.

A retail store owner offers a small discount on the same-day delivery service she offers for her store’s products. In the week following the discount offer, sales via the delivery service jumped by 50%. The hidden variable is weather; it rained throughout that week and more people opted for delivery rather than going to the store. Example of a hidden variable Not an example of a hidden variable

Not an example of a hidden variable To determine whether there is a hidden variable, first identify two variables that are not fundamentally related to each other, and then identify a third variable that is correlated with each. Although the weather is probably correlated with the increase in same-day delivery, it is not related to the discount, and so does not function as a hidden variable between weather and the discount.

Time Series vs. Cross-Sectional Data:

**Time Series:** Time series data contain data about a given subject in temporal order, measured at regular time intervals (e.g. minutes, months, or years). U.S. oil consumption from 2002 through 2012 is an example of a time series. Managers collect and analyze time series to identify trends and predict future outcomes. **Cross-Sectional:** Cross-sectional data contain data that measure an attribute across multiple different subjects (e.g. people, organizations, countries) at a given moment in time or during a given time period. The average oil consumption of ten countries in 2012 is an example of cross-sectional data. Managers use cross-sectional data to compare metrics across multiple groups.

Determine whether it would be better to analyze cross-sectional or time series data. We want to know the current average height and weight of citizens in each country that belongs to the European Union.

Cross-Sectional Since we are interested in the average height and weight of citizens living in different countries in the European Union at a specific point in time (“currently”), we should analyze a cross-section of citizens.

We want to compare the final exam scores of students this semester. * Cross-Sectional * Time series

Cross-Sectional * Since we are interested in final exam scores for a single point in time (this semester), we should analyze cross-sectional data of this year’s results.

We want to know if rates of dementia in the U.S. have decreased. * Cross-Sectional * Time series

Time series * To determine whether rates of dementia have decreased, we must compare dementia rates over time. Therefore, we should analyze time series data.

An internet marketing firm compiled a data set of the number of seconds website visitors stay on one of its client’s homepage before abandoning the site. The firm presented the summary statistics for the data set to the client. The client asked why the mean of the data set is so much larger than the median. Why?

**The distribution of the data is skewed to the right** When the distribution of data is skewed to the right, the mean is most likely greater than the median. The extreme values in the right tail pull the mean towards them.