Notes Flashcards

Remember and understand concepts, and know how to solve problems

1
Q

One of the most useful and commonly used graphical representations of data is

A

A histogram.

2
Q

What does a histogram display?

A

Frequency, or number, of data points (often called observations) that fall within specified bins.

3
Q

What are the advantages of histograms?

A

They allow us to quickly discern trends or patterns in a data set, and they are easy to construct using programs such as Excel.

4
Q

Key concepts of a histogram

A

The horizontal axis shows a series of single values, each of which represents a bin, or range of possible values.
The vertical axis shows the frequency of the observations in each bin.
By convention, Excel includes in the range the number represented by the bin label. For example, bin 1 includes all countries with oil consumption less than or equal to 1 million barrels per day (x<=1); bin 2 includes all countries with oil consumption greater than 1 but less than or equal to 2 million barrels per day (1<x<=2).
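
As a rough cross-check outside Excel, the bin convention above can be sketched in Python. The function name bin_counts, the sample consumption figures, and the "More" overflow label are assumptions made for illustration; they are not part of the original notes.

```python
from bisect import bisect_left
from collections import Counter


def bin_counts(values, upper_edges):
    """Count values per bin, where each bin includes its upper edge (the bin label)."""
    counts = Counter()
    for v in values:
        # Index of the first edge that is >= v, so a value equal to an
        # edge lands in that edge's bin (x <= label).
        i = bisect_left(upper_edges, v)
        label = upper_edges[i] if i < len(upper_edges) else "More"
        counts[label] += 1
    return counts


# Hypothetical oil consumption figures, in millions of barrels per day.
consumption = [0.4, 1.0, 1.2, 2.0, 2.5, 7.3]
counts = bin_counts(consumption, [1, 2, 3])
```

Here 1.0 is counted in bin 1 and 2.0 in bin 2, matching the "less than or equal to" rule.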

5
Q

What impact do the bins have on what a histogram reveals about the underlying data?

A

Using larger bins simplifies our graph, but provides less detail about the distribution. Large bins can prevent us from seeing interesting trends in the data.
Very small bins can create graphs that show such low frequencies that it can also be difficult to discern patterns.

6
Q

What does it mean that the histogram is skewed?

A

It means that the histogram has a tail that extends out to one side. The tail is the part of a graph that appears long or “flattens”, and has bins with lower frequencies.

7
Q

What does skewness measure?

A

The degree of asymmetry of a distribution.

8
Q

Definition of “right-tailed” and “left-tailed”

A

If the right tail is longer, we say the distribution is skewed to the right or “right-tailed.”
If the left tail is longer, we say the distribution is skewed to the left or “left-tailed.”

9
Q

Definition of outlier

A

Data points that fall far from the rest of the data.
Or
A data point is more than a specific distance below the lower quartile or above the upper quartile of a data set.
Or
A data point is less than Q1 - 1.5(IQR) or greater than Q3 + 1.5(IQR).
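
The 1.5(IQR) rule can be sketched in Python as follows. The function name iqr_outliers and the sample data are hypothetical, and the "inclusive" quantile method is an assumption: other quartile conventions can give slightly different cutoffs.

```python
import statistics


def iqr_outliers(data):
    """Return points below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]


# Hypothetical data set with one value far from the rest.
data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
outliers = iqr_outliers(data)
```

With this data, Q1 = 4.25 and Q3 = 8.75, so the cutoffs are -2.5 and 15.5, and only 50 is flagged.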

10
Q

Why do outliers exist?

A
  1. An unusual but valid data point
  2. Data entry error
  3. Outlier was collected in a different manner / at a different time than the rest of the data
11
Q

Three approaches to deal with outliers

A
  1. Leave it as is
  2. Change it to a corrected value
  3. Remove it from the data set (very rarely)
12
Q

The lower quartile

A

Q1, the 25th percentile–by definition, 25% of all observations fall below Q1.

13
Q

The upper quartile

A

Q3, the 75th percentile–by definition, 75% of all observations fall below Q3.

14
Q

The interquartile range (IQR)

A

The distance between the upper and lower quartiles.

IQR = Q3 - Q1

15
Q

The appropriate range

A

1.5(IQR) = 1.5(Q3-Q1), the distance below Q1 or above Q3 beyond which a data point is considered an outlier.

16
Q

What are graphs very useful for providing insight into?

A

A data set’s patterns, trends and outliers.

17
Q

Descriptive statistics (summary statistics)

A

Summarize a data set numerically.
Describe the data with just one or two numbers.
Provide a quick overview of a data set without showing every data point.

18
Q

“Central tendency” of a data set

A

An indication of where the “center” of the data set lies.

19
Q

The most common measurement of central tendency

A

Mean

20
Q

Mean (average)

A

The “average” of a set of numbers.

The sum of all of the data points in a set, divided by n, the number of data points.

21
Q

Median

A

The middle value of a data set.

The same number of data points fall above and below the median.

22
Q

How to find the median?

A

First arrange the values in order of magnitude. If the total number of data points is odd, the median is the value that lies in the middle. If the total number is even, the median is the average of the two middle values.
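
The procedure above can be sketched in Python. The function name median is chosen here for illustration only.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)  # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median([5, 1, 3])      # sorted: 1, 3, 5
even_case = median([4, 1, 3, 2])  # sorted: 1, 2, 3, 4
```

The odd case returns the single middle value, 3; the even case averages 2 and 3 to give 2.5.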

23
Q

Mode

A

The value that occurs most frequently in a data set.

If a data set has more than one value with the highest frequency, that data set has more than one mode.
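
A short Python sketch of the multiple-mode case, using the standard library's statistics.multimode (available from Python 3.8). The sample values are hypothetical.

```python
import statistics

# 2 and 3 each occur twice, so this data set has two modes.
two_modes = statistics.multimode([1, 2, 2, 3, 3])

# 7 occurs most often, so it is the only mode.
one_mode = statistics.multimode([7, 7, 8, 9])
```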

24
Q

Bimodal

A

A distribution with two clearly defined peaks (two points with very high frequency).
The two peaks may have equal frequency and hence be true modes, or one peak may be a mode and the other peak may simply have a very high (but not the highest) frequency.

25
Q

Multimodal

A

Distributions with multiple peaks.

26
Q

Conditional mean

A

The mean of a specific subset of data.

We apply a condition and calculate the mean for values that meet that condition.

27
Q

Calculate a conditional mean in Excel

A

=AVERAGEIF(range, criteria, [average_range])

  • range contains the one or more cells to which we want to apply the criteria or condition.
  • criteria is the condition that is to be applied to the range.
  • [average_range] is the range of cells containing the data we wish to average.
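
For comparison, a conditional mean with the same three arguments can be sketched in Python. The function name average_if, the region labels, and the consumption figures are all hypothetical, and the criteria is passed as a function rather than as an Excel-style string.

```python
def average_if(criteria_range, criteria, average_range=None):
    """Mean of the values whose matching criteria_range entry satisfies criteria."""
    if average_range is None:
        # As with the optional argument above, default to averaging
        # the same cells the condition is applied to.
        average_range = criteria_range
    selected = [v for c, v in zip(criteria_range, average_range) if criteria(c)]
    return sum(selected) / len(selected)


# Hypothetical data: region labels and oil consumption per country.
regions = ["Europe", "Asia", "Europe", "Asia"]
consumption = [2.0, 4.0, 3.0, 6.0]

europe_mean = average_if(regions, lambda r: r == "Europe", consumption)
large_mean = average_if(consumption, lambda x: x > 2.5)
```

If no value meets the condition, this sketch raises ZeroDivisionError, since there is nothing to average.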
28
Q

Percentiles

A

The value beneath which a certain percentage of the data lie.

29
Q

The 25th percentile (the first quartile)

A

The smallest value that is greater than or equal to 25% of the data points.

30
Q

What percentile does the mean represent?

A

The answer cannot be determined without further information.
The mean’s location depends upon the distribution of the data set. Recall how the location of the mean differs for a symmetrical distribution and a skewed distribution. Therefore, there is no way to determine the percentile of the mean without more information about the data set.

31
Q

What percentile does the median represent?

A

50%
Half of a distribution’s data points are less than or equal to the median. Therefore, the median is equal to the 50th percentile, because 50% of the data points are equal to or below this value.

32
Q

What percentile does the mode represent?

A

The answer cannot be determined without further information.
The mode’s location depends upon the distribution of the data set. Therefore, there is no way to determine the percentile of the mode without more information about the data set.

33
Q

Find a percentile in Excel

A

=PERCENTILE.INC(array, k)

  • array is the range of data for which we want to calculate a given percentile.
  • k is the percentile value. For example, if we want to know the 95th percentile, k would be 0.95.
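
PERCENTILE.INC is generally documented as interpolating linearly between ranked values, with the k-th percentile at position k(n-1) in the sorted data. A minimal Python sketch under that assumption follows; the function name percentile_inc is hypothetical and the result has not been checked against every Excel edge case.

```python
def percentile_inc(array, k):
    """Inclusive percentile with linear interpolation; k is between 0 and 1."""
    if not 0 <= k <= 1:
        raise ValueError("k must be between 0 and 1")
    ordered = sorted(array)
    position = k * (len(ordered) - 1)  # fractional index into the sorted data
    lower = int(position)
    fraction = position - lower
    if lower + 1 >= len(ordered):
        return ordered[-1]  # k = 1 gives the maximum
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])


p50 = percentile_inc([10, 20, 30, 40], 0.5)
p25 = percentile_inc([1, 2, 3, 4, 5], 0.25)
```

For [10, 20, 30, 40] the 50th percentile falls halfway between 20 and 30.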
34
Q

One of the simplest measures of variability, or spread, is

A

The range.
Range = Maximum value - Minimum value
=MAX(number 1, [number 2], …)-MIN(number 1, [number 2], …)
=MAX(A2:A11)-MIN(A2:A11)

35
Q

Calculate the variance of a sample in Excel

A

=VAR.S(number 1, [number 2]…)

  • number 1 is the first number, cell reference, or range of cells for which to calculate the specified value.
  • [number 2],… represents additional numbers, cell references, or range of cells. The square brackets indicate that the argument is optional.
36
Q

Calculate the standard deviation of a sample in Excel

A

=STDEV.S(number 1, [number 2]…)
The “S” in VAR.S and STDEV.S indicates sample.
The standard deviation is also the square root of the variance:
=SQRT(number)
where number is the variance.
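
The relationship between the sample variance and the sample standard deviation can be checked with a short Python sketch. The data values are hypothetical; statistics.variance and statistics.stdev both use the sample (n - 1) denominator.

```python
import math
import statistics

# Hypothetical sample of eight observations (mean = 5).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sample_variance = statistics.variance(sample)  # sum of squared deviations / (n - 1)
sample_stdev = statistics.stdev(sample)

# The standard deviation is the square root of the variance.
stdev_from_variance = math.sqrt(sample_variance)
```

Here the squared deviations sum to 32, so the sample variance is 32/7.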

37
Q

Descriptive Statistics in Excel

A

Step 1. From the Data menu, select Data Analysis, then select Descriptive Statistics.
Step 2. Enter the appropriate Input Range:
* The Input Range in column A with its label, A1:A11.
* Make sure to include A1, the cell containing the label, when inputting your range, and check the Label in first row box, as this ensures that output table will be appropriately labeled.
Step 3. Enter the appropriate Output Range.
Step 4. Select Summary Statistics.

38
Q

Standard error (the standard deviation of the mean)

A

Estimates how close the mean of the sample is to the mean of the overall population.
Calculated by dividing the standard deviation of the sample by the square root of the total number of data points.
=STDEV.S(number 1, [number 2], …)/SQRT(COUNT(number 1, [number 2], …))
=STDEV.S(A2:A11)/SQRT(COUNT(A2:A11))
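
The same calculation can be sketched in Python. The function name standard_error and the sample values are assumptions made for illustration.

```python
import math
import statistics


def standard_error(sample):
    """Sample standard deviation divided by the square root of the number of data points."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical sample of eight observations.
se = standard_error([2, 4, 4, 4, 5, 5, 7, 9])
```

For this sample the variance is 32/7 and n = 8, so the standard error is the square root of 4/7, roughly 0.756.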

39
Q

What does kurtosis measure?

A

Flatness or sharpness of a distribution.

A flat distribution has low kurtosis; a very sharp distribution has high kurtosis.

40
Q

The coefficient of variation (CV)

A

The ratio of the standard deviation to the mean.
Coefficient of Variation = Standard Deviation / Mean
Used to compare variation in two data sets, especially when their means differ.
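
A brief Python sketch of why the ratio is useful. The function name coefficient_of_variation and both data sets are hypothetical; the two sets have the same standard deviation but very different means.

```python
import statistics


def coefficient_of_variation(sample):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(sample) / statistics.mean(sample)


small_values = [10, 12, 14]     # standard deviation 2, mean 12
large_values = [100, 102, 104]  # standard deviation 2, mean 102

cv_small = coefficient_of_variation(small_values)
cv_large = coefficient_of_variation(large_values)
```

Although both sets have a standard deviation of 2, the first varies far more relative to its mean, which is what the coefficient of variation captures.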

41
Q

A scatter plot

A
Used to visualize the relationship between two variables.
One variable ("independent variable") is plotted on the horizontal axis (x-axis), and the other ("dependent variable") is plotted on the vertical axis (y-axis).
42
Q

Create a scatter plot

A

Step 1. From the Insert menu, select Scatter, then select Scatter With Only Markers.
Step 2. Enter the appropriate Input Y Range and Input X Range:
* The Input Y Range is … data in column C with its label, C1:C11.
* The Input X Range is … data in column B with its label, B1:B11.
* Make sure to include the cells containing labels when inputting ranges and check the Labels in first row box, as this ensures that scatter plot will be appropriately labeled.

43
Q

What does the correlation coefficient measure?

A
  • The strength of the linear relationship between two variables.
  • The extent to which the data points on the scatter plot create a line, on a scale from -1 to +1.
44
Q

What happens when the correlation coefficient is 0?

A

A relationship between two variables might exist–just not a linear one. The relationship may appear more like a curve.

45
Q

What does the value of the correlation coefficient tell you about the strength and nature of the relationship between two variables?

A

1) Range: Correlation coefficients include all values, and only values, from -1 to 1.
2) Magnitude: Correlations are stronger for coefficients that are closer to -1 or 1; correlations are stronger as the coefficient value moves farther from 0.
3) Directionality: A positive correlation coefficient indicates a positive relationship, meaning that as one variable increases, the other variable increases.
A negative correlation coefficient indicates a negative relationship, meaning that as one variable increases, the other variable decreases.
4) Non-linearity: Correlation coefficients measure only linear relationships; they may not provide insight into other types of relationships.
Two variables with a correlation close to 0 or equal to 0 have little or no linear relationship; they may have no relationship at all or may have another type of relationship.
A non-linear relationship may be visible in a scatterplot.

46
Q

Find the correlation coefficient in Excel

A

=CORREL(array 1, array 2)

  • array 1 is a set of numerical variables or cell references containing data for one variable of interest.
  • array 2 is a set of numerical variables or cell references containing data for the other variable of interest.
  • Note that the number of observations in array 1 must be equal to the number in array 2.
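
To show what the function computes, here is a minimal Python sketch of the Pearson correlation coefficient. The function name correl simply mirrors the Excel name, and the sample arrays are hypothetical; this is an illustration of the formula, not Excel's implementation.

```python
import math


def correl(array1, array2):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(array1) != len(array2):
        raise ValueError("both arrays must contain the same number of observations")
    n = len(array1)
    mean1 = sum(array1) / n
    mean2 = sum(array2) / n
    # Sum of products of deviations, and sums of squared deviations.
    cross = sum((a - mean1) * (b - mean2) for a, b in zip(array1, array2))
    spread1 = sum((a - mean1) ** 2 for a in array1)
    spread2 = sum((b - mean2) ** 2 for b in array2)
    return cross / math.sqrt(spread1 * spread2)


rising = correl([1, 2, 3], [2, 4, 6])   # points lie on an upward line
falling = correl([1, 2, 3], [6, 4, 2])  # points lie on a downward line
```

Points on a perfectly upward line give +1 and points on a perfectly downward line give -1, the two ends of the scale.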
47
Q

A hidden variable

A

A variable that is correlated with each of two variables that are not fundamentally related to each other. That is, there is no reason to think that a change in one variable will lead to a change in the other; in fact, the correlation between the two variables may seem surprising until the hidden variable is considered.
Although there is no direct relationship between these two variables, they are mathematically correlated because each is correlated individually with a third “hidden” variable. Therefore, for a variable to act as a hidden variable, there must be three variables, all of which are mathematically correlated (either directly or indirectly).

For example, consider a correlation between weight gain and grades, driven by the hidden variable, worry. Students couldn’t just eat more food and expect their grades to improve, nor could they make a point of doing poorly in their courses just to lose weight. These two variables are not fundamentally related.

48
Q

“Mediating variable”

A

Variables which are affected by one variable, and then affect another variable in turn.

For example, being worried about grades

  1. may cause a student to study harder, and thus get better grades, but we wouldn’t consider studying to be a hidden variable linking worry and getting better grades. Those two variables ARE fundamentally related, in that the worry is leading to the better grades. If students are more worried, they may study harder and get even better grades.
  2. may cause a student to stress eat and gain weight, but we wouldn’t consider eating to be a hidden variable linking worry and weight gain. Those two variables ARE fundamentally related, in that the worry is leading to the weight gain. If students are more worried, they may gain even more weight.
49
Q

A hidden variable, such as GDP, may explain variation in oil consumption across various countries, and provide more clarity than looking solely at the number of barrels of oil consumed.

A

Not an example of a hidden variable

GDP is likely correlated with oil consumption. To determine whether there is a hidden variable, first identify two variables that are not fundamentally related to each other, and then identify a third “hidden” variable that is correlated with each. In this example, what would the two variables be? One would be oil, but there is no second variable proposed that is fundamentally unrelated to oil.

50
Q

A researcher finds a positive correlation between the number of traffic lights in a town or city and the number of crimes committed each month in that town. The hidden variable is population. Cities with a greater number of people have more traffic and thus need more traffic lights. These cities also have more people who can commit crimes (and be victims of crimes), and more crimes are committed.

A

Example of a hidden variable

In this case, the two variables are number of traffic lights and number of crimes. The third variable, population, is related to both. Population is related to traffic lights; higher populations lead to more traffic, which in turn leads to the need for more lights. Population is also related to number of crimes. Even if we hold the crime rate constant, as the population increases, the number of criminals, and thus the number of crimes, increases. Traffic lights, however, do not lead to crime or vice-versa.

51
Q

Market researchers at a corporation assess the sales and revenue for the corporation’s hot dog subsidiary, but do not pay attention to the fact that many people in their market are vegetarians. The researchers’ lack of understanding about the dietary habits of the market is a hidden variable.

A

Not an example of a hidden variable

Here there are not two variables that are correlated; there is only one: hot dog sales. Although dietary habits may be hidden from the researchers in a conversational sense, it is not a hidden variable in the statistical meaning of the term.

52
Q

A retail store owner offers a small discount on the same-day delivery service she offers for her store’s products. In the week following the discount offer, sales via the delivery service jumped by 50%. The hidden variable is weather; it rained throughout that week and more people opted for delivery rather than going to the store.

A

Not an example of a hidden variable

Although the weather is probably correlated with the increase in same-day delivery, it is not related to the discount, and so does not function as a hidden variable linking the discount and the increase in delivery sales.

53
Q

A student finds that there is a positive correlation between the volume of music and the prevalence of acne. The hidden variable is age; teenagers tend to listen to louder music and have more acne.

A

Example of a hidden variable

In this case, the two variables are acne and music volume. The third variable, age, is related to both. Age is related to acne, with acne decreasing once a person passes adolescence. In addition, age is related to music volume, with younger people tending to listen to louder music. Loud music does not lead to acne or vice-versa.

54
Q

A time series

A

A data set in which one of the variables is time.
Time series data contain data about a given subject in temporal order, measured at regular time intervals (e.g. minutes, months, or years). Managers collect and analyze time series to identify trends and predict future outcomes.

55
Q

Cross-Sectional

A

Cross-sectional data contain data that measure an attribute across multiple different subjects (e.g. people, organizations, countries) at a given moment in time or during a given time period. Managers use cross-sectional data to compare metrics across multiple groups.

56
Q

For each of the following scenarios, determine whether it would be better to analyze cross-sectional or time series data.

A

We want to compare the daily sales of stores in a mall during a day-long mall-wide event. (Cross-Sectional)
Since we are interested in the sales of different stores on a single day (a single point in time), we should analyze a cross-section of the stores in the mall.

We want to see if the Red Sox performance changes over the course of the baseball season. (Time Series)
Since we are interested in comparing the Red Sox performance at different points in time during the baseball season, we should analyze time series data.

We want to know the current average height and weight of citizens in each country that belongs to the European Union. (Cross-Sectional)
Since we are interested in the average height and weight of citizens living in different countries in the European Union at a specific point in time (“currently”), we should analyze a cross-section of citizens.

We want to know if a company’s profits have increased after it started advertising more. (Time series)
To determine whether profits have increased during a period of time, we must compare profits over time. Therefore, we should analyze time series data.

We want to compare the final exam scores of students this semester. (Cross-Sectional)
Since we are interested in final exam scores for a single point in time (this semester), we should analyze cross-sectional data of this semester’s results.

We want to know if rates of dementia in the U.S. have decreased. (Time series)
To determine whether rates of dementia have decreased, we must compare dementia rates over time. Therefore, we should analyze time series data.