ECO 446 Flashcards

1
Q

chapter 17

Sampling

A

Statistical inference: using a sample to draw conclusions about the characteristics of the population from which the sample came.

Population: the entire group of items that interests the researcher.

Sample: the part of the population that we actually observe.

u (mu) is the population mean; x-bar is the sample mean.

2
Q

What is the difference between SAMPLE STATISTICS and POPULATION PARAMETERS?

A

Sample statistics are estimates computed from the sample data. Population parameters describe the entire population, so computing them exactly requires knowledge of the entire population.
In general, we rarely know the true values of our population parameters!

3
Q

Probability Distributions:

A

•Random variable: a variable X whose numerical value is determined by chance, the outcome of a random phenomenon
Discrete: has a countable number of possible values (coin flips, rolls of a die)
Continuous: can take on any value in an interval (height, temperature)

•Probability distribution: a probability distribution P[Xi] for a discrete random variable assigns probabilities to the possible values X1, X2, …

4
Q

The Normal Distribution

A

Real world data often conform to a normal
distribution, and many probability distributions
converge to a normal distribution when they
are cumulated

Central limit theorem: If Z is a standardized
sum of N independent, identically distributed
(discrete or continuous) random variables with
a finite, nonzero standard deviation, then the
probability distribution of Z approaches the
normal distribution as N increases.
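
A small simulation can make the central limit theorem concrete. This is only a sketch: the choice of fair six-sided dice, N = 30, 20,000 repetitions, and the seed are all assumptions made for illustration.

```python
import math
import random
import statistics

def standardized_sum(n, rng):
    """Standardized sum Z of n fair-die rolls (each roll has mean 3.5 and variance 35/12)."""
    total = sum(rng.randint(1, 6) for _ in range(n))
    return (total - 3.5 * n) / math.sqrt(n * 35 / 12)

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [standardized_sum(30, rng) for _ in range(20000)]

# A standard normal variable has mean 0, standard deviation 1,
# and roughly 68% of its values within one standard deviation of 0.
z_mean = statistics.mean(draws)
z_sd = statistics.stdev(draws)
share_within_one = sum(abs(z) <= 1 for z in draws) / len(draws)
print(round(z_mean, 3), round(z_sd, 3), round(share_within_one, 3))
```

A single die roll is far from normal (every face is equally likely), yet the standardized sum of 30 rolls already behaves much like a standard normal variable.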

5
Q

Bias

A

If an estimator is UNBIASED, the expected value of the sample statistic is equal to the population parameter.

If an estimator is BIASED, the cause is often a biased sample. Some examples of sampling bias:
• selection: when a sample systematically excludes or under-represents certain groups
• self-selection: when respondents choose to be in a particular group (examining the physical fitness of joggers)
• survivor: when a sample follows individuals over time, yet only studies those who survive (medical studies, stock market)
• non-response: systematic refusal of some groups to participate in a study

Unbiased: E[x-bar] = u

A sample statistic is an unbiased estimator of a population parameter if the mean of the sampling distribution of this statistic is equal to the value of the population parameter.

However, in practice we rarely know u.
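
The statement E[x-bar] = u can be checked exactly on a tiny example by listing every possible sample. The population values below are hypothetical, chosen only so the arithmetic is easy to follow; the second calculation imitates selection bias by never sampling the smallest item.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 4, 6, 8, 10]  # hypothetical population values
mu = Fraction(sum(population), len(population))

# Every possible sample of size 2 drawn without replacement, each equally likely.
sample_means = [Fraction(sum(s), len(s)) for s in combinations(population, 2)]
expected_x_bar = sum(sample_means) / len(sample_means)

# Selection bias: the smallest item can never be chosen.
biased_means = [Fraction(sum(s), len(s)) for s in combinations(population[1:], 2)]
expected_biased = sum(biased_means) / len(biased_means)

print(mu, expected_x_bar, expected_biased)
```

With random sampling the mean of the sampling distribution equals u; once part of the population is systematically excluded, it no longer does.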

6
Q

chapter 17 exercise from the textbook

  1. Write the meaning of each of the following terms without referring to the book (or your notes), and compare your definition with the version in the text for each.
    a. probability distribution
    b. random variable
    c. standardized random variable
    d. sample
    e. sampling distribution
    f. population mean
    g. sample average
    h. population standard deviation
    i. sample standard deviation
    j. degrees of freedom
    k. confidence interval
A

a. probability distribution
A probability distribution P[Xi] for a discrete random variable X assigns probabilities to the possible values X1, X2, and so on. Probability distributions are scaled so that the total area inside the rectangles is equal to 1.

b. random variable
A random variable X is a variable whose numerical value is determined by chance, the outcome of a random phenomenon.

c. standardized random variable
To standardize a random variable X, we subtract its mean u and then divide by its standard deviation std (or sigma):
Z = (X - u)/std
The standardized variable Z has a mean of 0 and a standard deviation of 1.

The standardized variable Z measures how many standard deviations X is above or below its mean. If X is equal to its mean, Z is equal to 0. If X is one standard deviation above its mean, Z is equal to 1. If X is two standard deviations below its mean, Z is equal to -2

d. sample:
part of the population that we actually observe

e. sampling distribution
The sampling distribution of a statistic, such as x-bar, is the probability distribution or density curve that describes the population of all possible values of this statistic. It can be shown mathematically that if the individual observations are drawn from a normal distribution, then the sampling distribution for x-bar is also normal. Even if the population does not have a normal distribution, the sampling distribution of x-bar will approach a normal distribution as the sample size increases

g. sample average
The sample average (also called the sample mean) is the simple arithmetic average of N observations:
x-bar = (X1 + X2 + ... + XN)/N

h. population standard deviation
The standard deviation of the sampling distribution depends on the value of the population standard deviation sigma, a parameter that is unknown but can be estimated. The most natural estimator of sigma, the standard deviation of the population, is s, the standard deviation of the sample data.

i. sample standard deviation
The sample variance of N observations is the average squared deviation of these observations about the sample average, computed with N - 1 in the denominator:
s^2 = [(X1 - x-bar)^2 + (X2 - x-bar)^2 + ... + (XN - x-bar)^2]/(N - 1)

The sample standard deviation s is the square root of the sample variance: s = sqrt(s^2)

Standard error of x-bar = s/sqrt(N)

In 1908, W. S. Gosset figured out how to handle this increased uncertainty. Gosset was a statistician employed by the Irish brewery Guinness, which encouraged statistical research but not publication. Because of the importance of his findings, he was able to persuade Guinness to allow his work to be published under the pseudonym “Student” and his calculations became known as the Student’s t-distribution

Student’s t-distribution. When the mean of a sample from a normal distribution is standardized by subtracting the mean of its sampling distribution and dividing by the standard deviation of its sampling distribution, the resulting Z variable
Z = (x-bar - u)/(sigma/sqrt(N))
has a normal distribution with a mean of 0 and a standard deviation of 1. Gosset determined the sampling distribution of the variable that is created when the mean of a sample from a normal distribution is standardized by subtracting u and dividing by its standard error:
t = (x-bar - u)/(s/sqrt(N))

j. degrees of freedom
There is a family of t-distributions, each identified by its number of degrees of freedom:

degrees of freedom = number of observations - number of parameters that must be estimated

Here, we calculate s by using N observations and one estimated parameter, x-bar; therefore, there are N - 1 degrees of freedom.

k. confidence interval page 24
Now we are ready to use the t-distribution and the standard error of x-bar to measure the reliability of our estimate of the population mean price of homes in Diamond Bar. If we specify a probability, such as alpha = 0.05, we can use Table B-1 to find the t-value t* such that there is a probability alpha/2 that the value of t will exceed t*, a probability alpha/2 that the value of t will be less than -t*, and a probability 1 - alpha that the value of t will be in the interval -t* to t*.

7
Q
  1. The heights of U.S. females between the age of 25 and 34 are approximately normally distributed with a mean of 66 inches and a standard deviation of 2.5 inches. What fraction of the U.S. female population
    in this age bracket is taller than 70 inches, the height of the average adult U.S. male of this age?
A

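Z = [70 - 66]/2.5 = 1.6
Look up Z = 1.6 in the normal distribution table (p. 547):
P[Z > 1.6] = 0.0548
About 5.5 percent of women in this age bracket are taller than 70 inches.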
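
The table lookup can be cross-checked numerically. The sketch below uses the identity P[Z > z] = 0.5 × erfc(z/sqrt(2)) for a standard normal variable; the function name `normal_upper_tail` is made up for this note.

```python
import math

def normal_upper_tail(z):
    """P[Z > z] for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = (70 - 66) / 2.5          # standardize 70 inches
p = normal_upper_tail(z)     # fraction of women taller than 70 inches
print(z, round(p, 4))
```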

8
Q

A stock’s price-earnings (P/E) ratio is the per-share price of its stock divided by the company’s annual profit per share. The P/E ratio for the stock market as a whole is used by some analysts as a measure of whether stocks are cheap or expensive, in comparison with other historical periods. Here are some annual P/E ratios for the S&P 500:

Year P/E
1980 7.90
1981 8.36
1982 8.62
1983 12.45
1984 9.98
1985 12.32
1986 16.42
1987 18.25
1988 12.48
1989 13.48
1990 15.46
1991 20.88
1992 23.70
1993 22.42
1994 17.15
1995 16.42
1996 19.08
1997 21.88
1998 28.90
1999 31.55

Calculate the mean and standard deviation. Was the 1999 price-earnings ratio of 31.55 more than one standard deviation above the mean P/E for 1980–1999? Was it more than two standard deviations above the mean?

A

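mean: 16.885
standard deviation: 6.43 (dividing by N; about 6.60 if dividing by N - 1)
z-score for 1999: (31.55 - 16.885)/6.43 = 2.28
The 1999 P/E ratio is more than one standard deviation above the mean, and it is also more than two standard deviations above the mean.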
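
These figures can be reproduced with Python's standard `statistics` module. The sketch computes the standard deviation both ways (dividing by N and by N - 1), since the card does not say which divisor it used; the variable names are made up for this note.

```python
import statistics

# Annual S&P 500 P/E ratios for 1980 through 1999, as listed in the question.
pe = [7.90, 8.36, 8.62, 12.45, 9.98, 12.32, 16.42, 18.25, 12.48, 13.48,
      15.46, 20.88, 23.70, 22.42, 17.15, 16.42, 19.08, 21.88, 28.90, 31.55]

mean = statistics.mean(pe)
pop_sd = statistics.pstdev(pe)     # divides by N
sample_sd = statistics.stdev(pe)   # divides by N - 1

# How many standard deviations the 1999 value lies above the mean.
z_pop = (pe[-1] - mean) / pop_sd
z_sample = (pe[-1] - mean) / sample_sd
print(round(mean, 3), round(pop_sd, 2), round(sample_sd, 2), round(z_pop, 2), round(z_sample, 2))
```

Either divisor puts the 1999 ratio more than two standard deviations above the mean.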

9
Q
  1. Which has a higher mean and which has a higher standard deviation:
    a standard six-sided die or a four-sided die with the numbers 1 through 4 printed on the sides? Explain your reasoning, without doing any calculations
A

Because the numbers on each side are equally likely, we can reason directly that a six-sided die has an expected value of 3.5 and a four-sided die has an expected value of 2.5, so the six-sided die has the higher mean. Because its possible outcomes are more spread out, the six-sided die also has the larger standard deviation.
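
The question asks for reasoning without calculations, but the reasoning can be verified afterward. A minimal sketch, assuming fair dice; `die_mean_sd` is a made-up helper name.

```python
import math
from fractions import Fraction

def die_mean_sd(sides):
    """Mean and standard deviation of a fair die with faces 1..sides."""
    faces = range(1, sides + 1)
    p = Fraction(1, sides)                      # each face is equally likely
    mean = sum(p * x for x in faces)
    variance = sum(p * (x - mean) ** 2 for x in faces)
    return mean, math.sqrt(variance)

mean6, sd6 = die_mean_sd(6)
mean4, sd4 = die_mean_sd(4)
print(mean6, round(sd6, 3))
print(mean4, round(sd4, 3))
```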

10
Q
  1. A nationwide test has a mean of 75 and a standard deviation of 10. Convert the following raw scores to standardized Z values: X = 94, 75, and 67. What raw score corresponds to Z = 1.5?
A
mean = 75
std = 10
X = 94
Z = (94 - 75)/10 = 1.9
X = 75
Z = (75 - 75)/10 = 0
X = 67
Z = (67 - 75)/10 = -0.8

None of these three raw scores corresponds to Z = 1.5. Solving Z = (X - 75)/10 = 1.5 for X gives X = 75 + 1.5(10) = 90.
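
Standardizing and its inverse are one-line formulas, Z = (X - mean)/std and X = mean + Z × std. A small sketch with made-up helper names `to_z` and `to_raw`:

```python
def to_z(x, mean, sd):
    """Convert a raw score to a standardized Z value."""
    return (x - mean) / sd

def to_raw(z, mean, sd):
    """Convert a standardized Z value back to a raw score."""
    return mean + z * sd

z_values = [to_z(x, 75, 10) for x in (94, 75, 67)]
raw_for_z = to_raw(1.5, 75, 10)
print(z_values)    # [1.9, 0.0, -0.8]
print(raw_for_z)   # 90.0
```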

11
Q
  1. A woman wrote to Dear Abby, saying that she had been pregnant for 310 days before giving birth. Completed pregnancies are normally distributed with a mean of 266 days and a standard deviation of 16
    days. Use Table B-7 to determine the probability that a completed pregnancy lasts at least 310 days.
A

Table B-5 = B-7

mean (u) = 266
standard deviation (std) = 16
The z-value and normal probability are:

P[X >= 310] = P[(X - u)/std >= (310 - 266)/16] = P[Z >= 2.75] = 0.003

Therefore, p = 1 - 0.9970 = 0.003. There is a 0.3% chance that a completed pregnancy lasts at least 310 days.

Open question: what does the 270 in the textbook's answer for Chapter 17 stand for? It may be a typo.
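
As a cross-check on the table value, the same probability can be computed with `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later). The variable names are made up for this note.

```python
from statistics import NormalDist

# Completed pregnancies: normal with mean 266 days and standard deviation 16 days.
pregnancy = NormalDist(mu=266, sigma=16)

z = (310 - 266) / 16
p_at_least_310 = 1 - pregnancy.cdf(310)
print(z, round(p_at_least_310, 4))
```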

12
Q
  1. Calculate the mean and standard deviation of this probability distribution for housing prices:
    Price X (dollars)   Number of Houses   Probability P[X]
    400,000             15,000             0.30
    500,000             20,000             0.40
    600,000             15,000             0.30
A

E(x) = 400000 × 0.3 + 500000 × 0.4 + 600000 × 0.3 = 500000

V(x) = E(x^2) - [E(x)]^2
= 400000^2 × 0.3 + 500000^2 × 0.4 + 600000^2 × 0.3 - 500000^2
= 6,000,000,000

std = sqrt(V(x)) = 77,459.67
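
The same mean and standard deviation follow from the definition of variance as the probability-weighted squared deviation from the mean. A short sketch using the values from the question:

```python
import math

prices = [400_000, 500_000, 600_000]
probs = [0.30, 0.40, 0.30]

# Expected value: probability-weighted average of the possible prices.
mean = sum(p * x for x, p in zip(prices, probs))

# Variance: probability-weighted average squared deviation from the mean.
variance = sum(p * (x - mean) ** 2 for x, p in zip(prices, probs))
sd = math.sqrt(variance)
print(round(mean), round(variance), round(sd, 2))
```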

13
Q
  1. Explain why you think that high-school seniors who take the Scholastic Aptitude Test (SAT) are not a random sample of all high-school seniors. If we were to compare the 50 states, do you think that a state’s
    average SAT score tends to increase or decrease as the fraction of the state’s seniors who take the SAT increases?
A

SAT takers are self-selected: seniors choose to take the exam mainly because they plan to apply to college, so they tend to be stronger students than seniors as a whole and are not a random sample of all high-school seniors. As the fraction of a state’s seniors who take the SAT increases, the state’s average SAT score tends to decrease, and vice versa, because weaker students join the pool of test takers and pull the average down.

14
Q
  1. American Express and the French tourist office sponsored a survey that found that most visitors to France do not consider the French to be especially unfriendly. The sample consisted of “1,000 Americans
    who have visited France more than once for pleasure over the past two years.” Why is this survey biased?
A

The survey is biased. First, the sample includes only Americans who have visited France more than once for pleasure over the past two years. People who found the French unfriendly on a first visit are less likely to return, so repeat visitors are a self-selected group, and one-time visitors and visitors from other countries are excluded. Second, some of the people in the sample may not have responded to the survey. For these reasons, the sample results may differ systematically from the opinions of the population of all visitors to France.

15
Q
  1. The first American to win the Nobel prize in physics was Albert Michelson (1852–1931), who was given the award in 1907 for developing and using optical precision instruments. His October 12–November
    14, 1882 measurements of the speed of light in air (in kilometers per second) were as follows:

299,883 299,796 299,611 299,781 299,774 299,696 299,748 299,809 299,816 299,682 299,599 299,578 299,820 299,573 299,797 299,723 299,778 299,711 300,051 299,796 299,772 299,748 299,851
Assuming that these measurements were a random sample from a normal distribution, does a 99 percent confidence interval include the value 299,710.5 that is now accepted as the speed of light?

A

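Use Table B-1 (two-tailed).

sample size (N): 23
sample mean (x-bar): 299,756.2
sample standard deviation (s): 107.114
degrees of freedom (N - 1): 22
t-value for a 99% confidence interval from Table B-1 (t*): 2.819

x-bar ± t* × s/sqrt(N) = 299,756.2 ± 2.819 × 107.114/sqrt(23) = 299,756.2 ± 63.0

The 99 percent confidence interval runs from about 299,693.3 to 299,819.2, so it does include 299,710.5.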
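
The interval x-bar ± t × s/sqrt(N) can be computed directly from the 23 measurements. In this sketch the t-value 2.819 is simply copied from the card (Table B-1, 22 degrees of freedom) rather than computed, and the variable names are made up for this note.

```python
import math
import statistics

# Michelson's 23 measurements of the speed of light in air (km per second).
speeds = [299883, 299796, 299611, 299781, 299774, 299696, 299748, 299809,
          299816, 299682, 299599, 299578, 299820, 299573, 299797, 299723,
          299778, 299711, 300051, 299796, 299772, 299748, 299851]

n = len(speeds)
x_bar = statistics.mean(speeds)
s = statistics.stdev(speeds)        # sample standard deviation (divides by N - 1)
t_star = 2.819                      # assumed: table value for 22 df, 99% two-tailed

half_width = t_star * s / math.sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width
includes_accepted = lower <= 299710.5 <= upper
print(round(lower, 1), round(upper, 1), includes_accepted)
```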

16
Q
  1. A Wall Street Journal (July 6, 1987) poll asked 35 economic forecasters to predict the interest rate on three-month Treasury bills in June 1988. These 35 forecasts had a mean of 6.19 and a variance of 0.47. Assuming these to be a random sample, give a 95 percent confidence interval for the mean prediction of all economic forecasters and then explain why each of these interpretations is or is not correct:
    a. There is a 0.95 probability that the actual Treasury bill rate on June 1988 will be in this interval.
    b. Approximately 95 percent of the predictions of all economic forecasters are in this interval.
A

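mean: 6.19
variance: 0.47, so s = sqrt(0.47) = 0.686
standard error of the mean: 0.686/sqrt(35) = 0.116

With 34 degrees of freedom, the t-value for a 95 percent confidence interval is about 2.03, so the interval is 6.19 ± 2.03 × 0.116 = 6.19 ± 0.24, or roughly 5.95 to 6.43.

Both statements are incorrect.

a. The confidence interval is an interval for the mean prediction of all economic forecasters, not for the actual Treasury bill rate in June 1988. The forecasters could all be wrong about the actual rate.
b. The interval describes where the population mean of the predictions is likely to lie, not where 95 percent of the individual predictions fall. Individual predictions are far more spread out than their mean: their standard deviation is about 0.686, compared with a standard error of 0.116 for the mean.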

17
Q
  1. The earlobe test was introduced in a letter to the prestigious New England Journal of Medicine, in which Dr. Sanders Frank reported that 20 of his male patients with creases in their earlobes had many of the risk factors (such as high cholesterol levels, high blood pressure, and heavy cigarette usage) associated with heart disease. For instance, the average cholesterol level for his patients with noticeable earlobe creases was 257 (mg per 100 ml), compared to an average of 215 with a standard deviation of 10 for healthy middle-aged men. If these 20 patients were a random sample from a population with a mean of 215 and a standard deviation of 10, what is the probability their average cholesterol level would be 257 or higher? Explain why these 20 patients may, in fact, not be a random sample.
A

Look up Table B-1 (two-tailed).

If X is N[215, 10], then for a random sample of size N = 20:

P[x-bar >= 257] = P[(x-bar - mean)/(std/sqrt(N)) >= (257 - 215)/(10/sqrt(20))] = P[Z >= 18.8], which is essentially 0

Dr. Frank’s patients may have chosen to become medical patients because they have heart problems. Any trait they happen to share will then seemingly explain the heart disease; however, the standard statistical tests are not valid if these patients are not a random sample from the population of all people with earlobe creases.
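
The key step here is that a sample mean varies much less than a single observation: its standard deviation is std/sqrt(N). A quick numerical sketch of the z-value and its tail probability, using the figures from the question (variable names are made up for this note):

```python
import math

mu, sigma, n = 215, 10, 20

# Standard deviation of the sampling distribution of x-bar.
standard_error = sigma / math.sqrt(n)

z = (257 - mu) / standard_error
p = 0.5 * math.erfc(z / math.sqrt(2))   # P[Z >= z] for a standard normal variable
print(round(standard_error, 3), round(z, 2), p)
```

The probability is so small that, for practical purposes, a random sample of healthy men could not produce an average of 257.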

18
Q

chapter 1: Introduction to Regression Analysis

notes from ppt

A

Three uses of econometrics

  1. Describing reality
  2. Testing hypotheses about economic and business related issues
  3. Forecasting future activities

Steps necessary for any kind of quantitative research

  1. Specify relationship to be tested
  2. Collect the data needed to quantify the results
  3. Quantify models with the data

What is the regression analysis?
Regression analysis:
a statistical technique that attempts to “explain” movements in one variable, the dependent variable, as a function of movements in a set of other variables, called the independent (or explanatory) variables

Y = f (X)
Y is the dependent variable because it is the variable
that we are trying to explain.
X’s are the independent variables
Economic theory tells us that X’s are useful in predicting Y . In other words, Y depends upon the values of X.

Simplest Linear Regression Model
Y = β0  + β1X
β’s are population coefficients (parameters)
β0 is the constant or intercept
β1 is the slope coefficient
What is β1 ?
β1 shows the change in Y for a one-unit
change in X

Linear in parameters (coefficients) vs. Linear in variables
To use regression analysis, our equation must be
linear in coefficients
Linear in variables: if drawing the function in
terms of X and Y generates a straight line
Y = β0 + β1X
Linear in parameters (coefficients): if the coefficients appear in the simplest form—The β’s are not raised to any power, multiplied, or divided by other coefficients, and do not themselves include any sort of function

Regression results cannot prove causality!
For example, if variables A and B are related
statistically, then:
-A might “cause” B.
-B might “cause” A.
-Some third factor might “cause” both.
-The relationship might have happened by
chance.

Use 1: Describing economic reality
Econometrics can quantify and measure
marginal effects and estimate numbers for
theoretical equations.
For example, consumer demand for a product
often can be thought of as a relationship
between the quantity demanded (Q) and its
price (P), the price of a substitute (Ps),
and disposable income (Yd)

Use 2: Testing hypotheses about
economic theory and policy.

Much of economics involves building
theoretical models and testing them against
evidence.
Hypothesis testing is a vital part of that
process.
You could test the hypothesis that the product
in Equation (1.1) is a normal good.
Q = β0 + β1P + β2Ps + β3Yd

Use 3: Forecasting future economic activity
•Economists use econometrics to forecast a
variety of variables (GDP, sales, inflation, etc.)
•Accuracy of forecasts depends in large measure on
the degree to which the past is a good guide to the future
•To the extent econometrics can shed light on the future, leaders will be better equipped to make decisions

Adding the error term
Even if our regression does a great job, we will never be able to perfectly predict our dependent variable; there will be some sources of error.
Stochastic error term: a term added to the regression equation to introduce all the variation in Y that cannot be explained by the X’s
A typical regression equation: Y = β0 + β1X + ε
It has both a deterministic part, E(Y|X) = β0 + β1X, and a stochastic element, ε

Four sources of ε
Remember to think of this ε as a source of variation in Y that is not explained by changes in our X’s.
There are 4 possible sources for this variation:
1. Other important variables are omitted
2. Measurement error occurs with either the dependent or independent variables
3. The underlying process has a different functional form
4. Purely random variation

Extending the regression notation
Yi = β0 + β1X1i + β2X2i + … + βKXKi + εi (with K independent variables)

X1i : the ith observation for the first independent variable
X2i : the ith observation for the second independent
variable
The coefficients do not change from observation to
observation, but the values of X, Y, and ε do.

β-hats represent the estimated coefficients
Yi-hat is the estimated value of Yi; it is the value of Y that you would get if you predicted Y using the coefficients from the estimated regression equation
The closer Yi-hat is to Yi, the better the fit of the regression

Residual: difference between the observed Yi and the value Yi-hat from the estimated regression line: ei = Yi - Yi-hat

Error term: difference between the observed Yi and the true regression equation: errori = Yi - E(Yi|Xi)
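
The residual can be computed from data, while the error term cannot, because the true regression line is never observed. The sketch below assumes the coefficients are estimated by ordinary least squares (these notes do not name the estimation method at this point) and uses made-up data; `ols_fit` is a hypothetical helper, not a library function.

```python
def ols_fit(xs, ys):
    """Ordinary least squares for one independent variable: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0_hat, b1_hat = ols_fit(xs, ys)
fitted = [b0_hat + b1_hat * x for x in xs]                 # Yi-hat
residuals = [y - y_hat for y, y_hat in zip(ys, fitted)]    # ei = Yi - Yi-hat
print(round(b0_hat, 2), round(b1_hat, 2), [round(e, 2) for e in residuals])
```

With an intercept in the equation, least-squares residuals sum to zero (up to rounding), which is a handy check on the arithmetic.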

Understand your data
Three main types of datasets:
1. Cross-section: all observations are from the same
point in time and represent different entities:
Social indicators from different countries
House prices and house characteristics for different houses
Quantities of a product sold in different stores
2. Time series: observations of the same variable in
different time periods
Annual GDP
Daily sales of a product
Income over a lifetime
3. Panel: combination of both time series and cross section: information on a group of individuals over time—will not focus on panel data in this introductory course.

Height-Weight Example of Regression Analysis
Theoretical Model:
Relationship to be tested: weight is a function of height. More specifically, taller people tend to weigh more than shorter people
Yi = f(Xi) = β0 +β1Xi + errori
Yi = the weight in pounds of the ith customer
Xi = the height (in inches above 5 feet) of the ith customer
errori = the value of the unknown stochastic error term for the ith customer
You gather data on 20 individuals and use it to estimate a regression equation:
Each value of i represents an individual in the
sample.
If you select four individuals (Woody, Lesley,
Bruce, and Mary), then you could write out an
equation for each:
Each individual has their own height and weight.
Random events impact people differently.
To account for these random differences each
individual needs their own value of the error term
(εi).
Note that the regression coefficients (the β’s)
don’t vary by individual.
Rather, the β’s apply to the whole sample.

19
Q
  1. Write the meaning of each of the following terms without referring to the book (or your notes), and compare your definition with the version in the text for each:
    a. constant or intercept (p. 7)
    b. cross-sectional (p. 21)
    c. dependent variable (p. 5)
    d. estimated regression equation (p. 14)
    e. expected value (p. 9)
    f. independent (or explanatory) variable (p. 5)
    g. linear (p. 8)
    h. multivariate regression model (p. 12)
    i. regression analysis (p. 5)
    j. residual (p. 15)
    k. slope coefficient (p. 7)
    l. stochastic error term (p. 8)
A

a. constant or intercept

20
Q
  1. Use your own computer’s regression software and the weight (Y) and height (X) data from Table 1.1 to see if you can reproduce the estimates in Equation 1.19. There are two ways to load the data: You can type in the data yourself or you can download datafile HTWT1 (in Stata, EViews, Excel, or ASCII formats) from the text’s website: http://www.pearsonhighered.com/studenmund. Once the data file is loaded, run Y = f(X), and your results should match Equation 1.19. Different programs require different commands to run a regression. For help in how to do this with Stata or EViews, either see the answer to this question in Appendix A or read Appendix 1.7.
A

http://www.pearsonhighered.com/studenmund

21
Q

Homework #3

  1. Not all regression coefficients have positive expected signs. For example, a Sports Illustrated article by Jaime Diaz reported on a study of golfing putts of various lengths on the Professional Golfers’ Association (PGA) Tour. The article included data on the percentage of putts made (Pi) as a function of the length of the putt in feet (Li). Since the longer the putt, the less likely even a professional is to make it, we’d expect Li to have a negative coefficient in an equation explaining Pi. Sure enough, if you estimate an equation on the data in the article, you obtain:
    Pi-hat = 83.6 - 4.1Li (1.22)
    a. Carefully write out the exact meaning of the coefficient of Li.
    b. Suppose someone else took the data from the article and estimated:
    Pi = 83.6 - 4.1Li + ei
    Is this the same result as that of Equation 1.22? If so, what definition do you need to use to convert this equation back to Equation 1.22?
    c. Use Equation 1.22 to determine the percent of the time you’d expect a PGA golfer to make a 10-foot putt. Does this seem realistic? How about a 1-foot putt or a 25-foot putt? Do these seem as realistic?
    d. Your answer to part c should suggest that there’s a problem in applying a linear regression to these data. What is that problem?
A

a. According to the model, Li is the putt’s length in feet, and Pi is the percentage of putts made. The coefficient of Li shows the relationship between the length of the putt in feet (Li) and the predicted percentage of putts made (Pi-hat). Its exact meaning is: when the length of the putt (Li) increases by one foot, the predicted percentage of putts made (Pi-hat) decreases by 4.1 percentage points. Equivalently, each one-foot decrease in length raises Pi-hat by 4.1 percentage points. The negative sign shows the negative relationship between Pi-hat and Li.

b. Suppose someone else took the data from the article and estimated:
Pi = 83.6 - 4.1Li + ei
Is this the same result as that of Equation 1.22? If so, what definition
do you need to use to convert this equation back to Equation 1.22?

Yes, this is the same result as Equation 1.22. Here ei is the residual, not the stochastic error term. By definition, the residual is the difference between the observed value and the estimated value, ei = Pi - Pi-hat, so Pi = Pi-hat + ei. Substituting Pi-hat = 83.6 - 4.1Li gives Pi = 83.6 - 4.1Li + ei, and subtracting ei from both sides converts the equation back to Equation 1.22. The residual is the sample counterpart of the stochastic error term, which is added to the regression equation to introduce all the variation in Pi that cannot be explained by Li. Even if our regression does a great job, we will never be able to perfectly predict our dependent variable.
There are 4 possible sources for this variation:
1. Other important variables are omitted
2. Measurement error occurs with either the dependent or independent variables
3. The underlying process has a different functional form
4. Purely random variation

c. Use Equation 1.22 to determine the percent of the time you’d expect a PGA golfer to make a 10-foot putt. Does this seem realistic? How about a 1-foot putt or a 25-foot putt? Do these seem as realistic?

10-foot putt:
Li = 10
Pi-hat = 83.6 - 4.1 x 10 = 42.6
It seems realistic.

1-foot putt:
Li = 1
Pi-hat = 83.6 - 4.1 x 1 = 79.5
It is possible, but it looks low: professionals miss very few 1-foot putts.

25-foot putt:
Li = 25
Pi-hat = 83.6 - 4.1 x 25 = -18.9
It is unrealistic.

Summary: For the 10-foot putt, the estimated percentage seems realistic. For the 1-foot putt, the estimate is positive and therefore possible, though probably too low. The estimate for the 25-foot putt is negative, and a negative percentage of putts made is impossible in real life.
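
Plugging lengths into Equation 1.22 is easy to script, and doing so also shows where the straight line crosses zero. A minimal sketch; the function name `predicted_putt_pct` is made up for this note.

```python
def predicted_putt_pct(length_ft):
    """Equation 1.22: predicted percent of putts made for a putt of the given length in feet."""
    return 83.6 - 4.1 * length_ft

predictions = {length: round(predicted_putt_pct(length), 1) for length in (1, 10, 25)}

# Length at which the fitted line predicts 0 percent; beyond it, predictions are negative.
zero_crossing = 83.6 / 4.1
print(predictions, round(zero_crossing, 2))
```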

d. Your answer to part c should suggest that there’s a problem in applying linear regression to these data. What is that problem?

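The estimated relationship is a straight line, but the true relationship between putt length and the percentage of putts made is almost surely not linear. A percentage must lie between 0 and 100, yet the linear equation predicts a negative percentage for long putts (beyond about 20 feet) and seems to understate the success rate for very short putts. The negative coefficient itself is not the problem, because longer putts really are harder to make; the problem is applying a linear functional form to a dependent variable that is bounded.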

22
Q

Homework #5

  1. If an equation has more than one independent variable, we have to be careful when we interpret the regression coefficients of that equation. Think, for example, about how you might build an equation to
    explain the amount of money that different states spend per pupil on public education. The more income a state has, the more they probably spend on public schools, but the faster enrollment is growing, the less there would be to spend on each pupil. Thus, a reasonable equation for per-pupil spending would include at least two variables: income and enrollment growth:

Si = β0 + β1Yi + β2Gi + ei (1.24)

where: Si = educational dollars spent per public school student in the ith state
Yi = per capita income in the ith state (in dollars)
Gi = the percent growth of public school enrollment in
the ith state

a. State the economic meaning of the coefficients of Y and G. (Hint: Remember to hold the impact of the other variable constant.)
b. If we were to estimate Equation 1.24, what signs would you expect the coefficients of Y and G to have? Why?
c. Silva and Sonstelie estimated a cross-sectional model of per student spending by state that is very similar to Equation 1.24:

Si-hat = -183 + 0.1422Yi - 5926Gi (1.25)
N = 49
Do these estimated coefficients correspond to your expectations? Explain Equation 1.25 in common sense terms.

d. The authors measured G as a decimal, so if a state had a 10 percent growth in enrollment, then G equaled .10. What would Equation 1.25 have looked like if the authors had measured G in percentage points, so that if a state had 10 percent growth, then G would have
equaled 10? (Hint: Write out the actual numbers for the estimated coefficients.)

A

a. State the economic meaning of the coefficients of Y and G. (Hint: Remember to hold the impact of the other variable constant.)

According to the given information and the regression equation, the dependent variable is Si (educational dollars spent per public school student in the ith state), and the independent variables are Yi (per capita income in the ith state, in dollars) and Gi (the percent growth of public school enrollment in the ith state).

β1 (the coefficient of Yi) is the change in per-pupil spending (Si) associated with a one-dollar increase in per capita income (Yi), holding enrollment growth (Gi) constant. If β1 is positive, a one-unit increase in Yi raises Si by β1; if β1 is negative, a one-unit increase in Yi lowers Si by the absolute value of β1.

β2 (the coefficient of Gi) is the change in per-pupil spending (Si) associated with a one-unit increase in enrollment growth (Gi), holding per capita income (Yi) constant. If β2 is positive, a one-unit increase in Gi raises Si by β2; if β2 is negative, a one-unit increase in Gi lowers Si by the absolute value of β2.

b. If we were to estimate Equation 1.24, what signs would you expect the coefficients of Y and G to have? Why?

The sign of β1 is expected to be positive. Holding Gi constant, the higher a state’s per capita income (Yi), the more education dollars it can spend on each public school student (Si).

The sign of β2 is expected to be negative. Holding Yi constant, per-pupil spending (Si) depends on how fast public school enrollment is growing. If enrollment growth (Gi) increases, spending per student (Si) declines, because the available money has to be spread across more students.

c. Silva and Sonstelie estimated a cross-sectional model of per student spending by state that is very similar to Equation 1.24:

Si-hat = -183 + 0.1422Yi - 5926Gi (1.25)
N = 49
Do these estimated coefficients correspond to your expectations? Explain Equation 1.25 in common sense terms

For the equation 1.25, the estimated coefficients correspond to the theoretical expecteations. But the
The coefficients of Yi is positive. If per capita income in state (Yi) increase one dollar, as the Gi holds constant, the education dollars for each public school student (Si) will increase to 0.1422 dollar. If the percentage growth rate of public school enrollment in state (Gi) increase one dollar, as the Yi holds constant, the education dollars for each public school student (Si) will decrease to 5926 dollar. So, the slope of Yi coefficient is positive, and of Gi is negative.

The intercept (-183) is the estimated value of Si when both Gi and Yi equal zero. Spending per student cannot be negative, and no state has zero income, so the intercept should not be given a literal real-world interpretation.
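The interpretation of the slope coefficients can be checked numerically. The sketch below plugs numbers into Equation 1.25; the function name and the income and growth values for the example state are made up for illustration and do not come from the Silva and Sonstelie data.

```python
# Estimated coefficients taken from Equation 1.25.
INTERCEPT = -183.0
B_INCOME = 0.1422    # change in Si per one-dollar change in Yi
B_GROWTH = -5926.0   # change in Si per one-unit change in Gi (G as a decimal)

def predicted_spending(income, growth):
    """Return Si-hat from Equation 1.25 (hypothetical helper)."""
    return INTERCEPT + B_INCOME * income + B_GROWTH * growth

# Hypothetical state: income of 20,000 dollars and 2 percent enrollment growth.
base = predicted_spending(20000.0, 0.02)

# One more dollar of income, growth held constant: the change equals B_INCOME.
effect_income = predicted_spending(20001.0, 0.02) - base

# One more percentage point of growth (0.01 as a decimal), income held constant:
# the change equals B_GROWTH * 0.01.
effect_growth = predicted_spending(20000.0, 0.03) - base

print(effect_income)
print(effect_growth)
```

Each difference changes one variable while the other stays fixed, which is what "holding constant" means in the answer above.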

d. The authors measured G as a decimal, so if a state had a 10 percent growth in enrollment, then G equaled .10. What would Equation 1.25 have looked like if the authors had measured G in percentage points, so that if a state had 10 percent growth, then G would have equaled 10? (Hint: Write out the actual numbers for the estimated coefficients.)

If the authors had measured G in percentage points, a state with 10 percent growth would have G = 10 instead of G = 0.10. Because G is then 100 times larger, its estimated coefficient must be 100 times smaller (5926/100 = 59.26) for the equation to give the same predictions. The intercept and the coefficient of Y do not change. The regression function would be Si-hat = -183 + 0.1422Yi - 59.26Gi
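A quick way to check a change of units is to confirm that both versions of the equation give the same prediction for the same state. This is a minimal sketch: the two function names and the income value are hypothetical, and the rescaled coefficient is 5926 divided by 100.

```python
def spending_g_decimal(income, g_decimal):
    """Equation 1.25 with G measured as a decimal (0.10 = 10 percent)."""
    return -183.0 + 0.1422 * income - 5926.0 * g_decimal

def spending_g_percent(income, g_percent):
    """Same equation with G in percentage points (10 = 10 percent).

    G is 100 times larger, so its coefficient is 100 times smaller.
    """
    return -183.0 + 0.1422 * income - 59.26 * g_percent

# Hypothetical state with income of 20,000 dollars and 10 percent growth.
with_decimal = spending_g_decimal(20000.0, 0.10)
with_percent = spending_g_percent(20000.0, 10.0)

print(with_decimal, with_percent)
```

Rescaling an independent variable changes the size of its own coefficient but leaves the predictions unchanged.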

23
Q

Homework #7

  1. Let’s return to the wage determination example of Section 1.2. In that example, we built a model of the wage of the ith worker in a particular field as a function of the work experience, education, and gender
    of that worker:
    WAGEi = β0 + β1EXPi + β2EDUi + β3GENDi + ei (1.10)
    where:
    Yi = WAGEi = the wage of the ith worker
    X1i = EXPi = the years of work experience of the ith worker
    X2i = EDUi = the years of education beyond high school of the ith worker
    X3i = GENDi = the gender of the ith worker (1 = male and 0 = female)
    a. What is the real-world meaning of β2? (Hint: If you’re unsure where to start, review Section 1.2.)

b. What is the real-world meaning of β3? (Hint: Remember that GEND is a dummy variable.)

c. Suppose that you wanted to add a variable to this equation to measure whether there might be discrimination against people of color.
How would you define such a variable? Be specific.

d. Suppose that you had the opportunity to add another variable to the equation. Which of the following possibilities would seem best? Explain your answer.
i. the age of the ith worker
ii. the number of jobs in this field
iii. the average wage in this field
iv. the number of “employee of the month” awards won by the ith worker
v. the number of children of the ith worker

A

a. What is the real-world meaning of β2? (Hint: If you’re unsure where to start, review Section 1.2.)

Holding the years of work experience (EXPi or X1i) and gender (GENDi or X3i) constant, β2 is the change in the wage of the ith worker (WAGEi or Yi) associated with one additional year of education beyond high school (EDUi or X2i). The coefficient is expected to be positive, which means that additional years of education beyond high school raise the wage. Employees with more education are expected to be more productive.

b. What is the real-world meaning of β3? (Hint: Remember that GEND is a dummy variable.)

GEND is a dummy variable (1 = male and 0 = female), so β3 is the difference between the expected wage of a male worker and that of a female worker with the same experience and education. If β3 is positive, male workers earn more on average than otherwise comparable female workers, which is consistent with discrimination in the workplace.
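The meaning of β2 and β3 can be seen by comparing two predicted wages that differ in only one variable. The coefficient values below are hypothetical and chosen only for illustration; Equation 1.10 in the text gives no estimates.

```python
# Hypothetical coefficient values (not estimates from any data set).
B0, B_EXP, B_EDU, B_GEND = 10.0, 0.5, 1.2, 2.0

def wage(exp, edu, gend):
    """Expected wage from Equation 1.10, ignoring the error term."""
    return B0 + B_EXP * exp + B_EDU * edu + B_GEND * gend

# Same experience and education, different gender: the gap equals B_GEND.
gender_gap = wage(5, 4, 1) - wage(5, 4, 0)

# One more year of education, same experience and gender: the change equals B_EDU.
education_effect = wage(5, 5, 1) - wage(5, 4, 1)

print(gender_gap, education_effect)
```

The gap between the two groups is the same at every level of experience and education, because the dummy variable only shifts the intercept.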

c. Suppose that you wanted to add a variable to this equation to measure whether there might be discrimination against people of color.
How would you define such a variable? Be specific.

Define a dummy variable COLORi (1 = the ith worker is white; 0 = the ith worker is a person of color).
Holding the other factors constant, a positive coefficient on COLORi would be evidence of discrimination against people of color: white workers would earn more than workers of color with the same job experience, education, and gender.

d. Suppose that you had the opportunity to add another variable to the equation. Which of the following possibilities would seem best? Explain your answer.
i. the age of the ith worker
ii. the number of jobs in this field
iii. the average wage in this field
iv. the number of “employee of the month” awards won by the ith worker
v. the number of children of the ith worker

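For iv, the number of “employee of the month” awards is probably the best choice. It varies from worker to worker and measures job performance, which should raise the wage and is not already captured by experience, education, or gender.

For i, the age of the ith worker is likely to be closely related to work experience (EXPi), which is already in the equation, so it would add little new information.

For ii and iii, the number of jobs in this field and the average wage in this field are the same for every worker in the sample, because all the workers are in one particular field. A variable that does not vary across observations cannot explain differences in individual wages.

For v, the number of children has no clear theoretical link to the wage.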

24
Q

Chapter 2

notes from PPT

A

Ordinary Least Squares
Ordinary Least Squares (OLS from now on) is a regression estimation technique that chooses the coefficient estimates to minimize the sum of squared residuals.

OLS is an estimator that minimizes the sum of ei^2

Why minimize the sum of squared residuals?
OLS is relatively easy to use
•Most other techniques involve iterative non-linear estimation, second order derivatives, stuff like Hessian matrices, etc.
•You could actually do OLS with one variable by hand!
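To show what "by hand" looks like, here is a minimal sketch of OLS with one independent variable, using the standard formulas slope = Σ(Xi - X-bar)(Yi - Y-bar) / Σ(Xi - X-bar)^2 and intercept = Y-bar - slope × X-bar. The data and the function names are made up for illustration.

```python
def ols_one_variable(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of ei^2 for a given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up sample of five observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = ols_one_variable(x, y)
rss_ols = sum_squared_residuals(x, y, b0, b1)

# Any other line, such as one with a slightly steeper slope, has a larger sum.
rss_other = sum_squared_residuals(x, y, b0, b1 + 0.1)

print(b0, b1, rss_ols, rss_other)
```

Comparing the two sums shows the defining property of OLS: no other line gives a smaller sum of squared residuals for this sample.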

Minimizing the sum of ei^2 is intuitive
•Remember that the residuals tell us how close our predicted value is to our actual value: the closer we are, the better
•Why do we square it? So that positive and negative deviations carry the same weight and do not cancel out!
•Squaring the residuals also puts a heavier weight on the outliers, observations that lie far away from the group

•The estimated regression line goes through the means of Y and X (Y-bar and X-bar)
•The sum of the residuals is exactly zero
•OLS can be shown to be the “best” Linear Unbiased Estimator (under somewhat restrictive assumptions)

Total, Explained, and Residual Sum of Squares

TSS = ESS + RSS
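The OLS properties and the decomposition above can be verified on a small sample. This is a sketch with made-up data, where TSS = Σ(Yi - Y-bar)^2, ESS = Σ(Yi-hat - Y-bar)^2, and RSS = Σei^2.

```python
# Made-up sample of five observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates for a one-variable regression.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Property 1: the fitted line passes through the point (X-bar, Y-bar).
value_at_x_bar = intercept + slope * x_bar

# Property 2: the residuals sum to zero (up to rounding error).
residual_sum = sum(residuals)

# Decomposition of the variation in Y.
tss = sum((yi - y_bar) ** 2 for yi in y)
ess = sum((fi - y_bar) ** 2 for fi in fitted)
rss = sum(e ** 2 for e in residuals)

print(value_at_x_bar, y_bar)
print(residual_sum)
print(tss, ess + rss)
```

The decomposition holds exactly only when the equation includes an intercept and is estimated by OLS, as it is here.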

Estimating Multivariate Regression Models with OLS

Evaluating the Quality of a Regression Equation

  1. Is the model supported by theory?
  2. How well does the regression fit the data as a whole?
  3. Is the dataset reasonably large and accurate?
  4. Is OLS the best estimator?
  5. How well do the estimated coefficients correspond to the researcher’s previous expectations?
  6. Are all important variables included?
  7. Has the best functional form been chosen?
  8. Is the regression free of major problems?

Describing the Overall Fit of the Estimated Model
The simplest commonly used measure of fit is R2, or the coefficient of determination. R2 is the ratio of the explained sum of squares to the total sum of squares.

The higher R2 is, the closer the estimated regression equation fits the sample. R2 must lie between 0 and 1.

A major problem with R2 is that adding another independent variable to an equation can never decrease R2.
Recall Equation (2.14): R2 = ESS/TSS = 1 - RSS/TSS
Adding a variable will not change TSS.
Adding a variable will, in most cases, decrease RSS and increase R2.
Even if the added variable is nonsensical, R2 will increase unless the new coefficient is exactly zero.
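The claim that a nonsensical variable cannot lower R2 can be illustrated with a small experiment. The sketch below fits OLS by solving the normal equations in plain Python; the data, the "junk" variable, and the helper names are all made up for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(columns, y):
    """Regress y on an intercept plus the given columns and return R2."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * y[i] for i, r in enumerate(rows)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - rss / tss

# Made-up data: x is related to y, junk is an arbitrary list of numbers.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0, 7.0]
junk = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]

r2_without_junk = r_squared([x], y)
r2_with_junk = r_squared([x, junk], y)

print(r2_without_junk, r2_with_junk)
```

TSS is the same in both regressions, and OLS could always set the coefficient of the extra variable to zero, so RSS cannot rise and R2 cannot fall.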

The inclusion of the post office box variable requires the estimation of a coefficient.
This lessens the degrees of freedom, or the excess of the number of observations (N) over the number of coefficients (including the intercept) estimated (K+1).
The lower the degrees of freedom, the less reliable the estimates are likely to be.
Thus, the increase in the quality of fit needs to be compared to the decrease in the degrees of freedom.
Adjusted R2 was developed for this purpose.

Warnings about R2
•It gives a measure of the proportion of the variance in Y explained by the regression
•It is not a statistical test
•Cannot compare R2 for different models
•May be very high in time series and lower in cross sectional data, but that doesn’t mean that the cross sectional regressions are “bad”

Adjusted R2
•If we increase the number of explanatory
variables, TSS does not change.
•If we add another explanatory variable, ESS will
either increase or stay the same
•Therefore, R2 can never decrease as we add explanatory variables
•Adjusted R2 takes into account degrees of freedom
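As a sketch of how the degrees-of-freedom adjustment works, the function below uses the standard formula adjusted R2 = 1 - (1 - R2)(N - 1)/(N - K - 1), where N is the number of observations and K is the number of explanatory variables. The function name and the R2 values in the example are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k explanatory variables."""
    degrees_of_freedom = n - k - 1
    return 1.0 - (1.0 - r2) * (n - 1) / degrees_of_freedom

# Hypothetical case: adding a second variable raises R2 only slightly,
# from 0.60 to 0.61, in a sample of 5 observations.
before = adjusted_r2(0.60, n=5, k=1)
after = adjusted_r2(0.61, n=5, k=2)

print(before, after)
```

In this example R2 rises but adjusted R2 falls, because the small gain in fit does not make up for the lost degree of freedom.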

25
Q

Do Exercises 3, 5, and 7 from Chapter 2

A

A