1.5: Point Estimates, Confidence Intervals, and Resampling Flashcards
The two branches of statistical inference
hypothesis testing and estimation
hypothesis testing
seeks to determine whether the value of a parameter equals some specific hypothesized value
Estimation
seeks to find the value of the parameter
Estimators
the formulas used to calculate the sample statistics
Estimates
are the particular values derived from these estimators
An unbiased estimator
one whose expected value equals the parameter it is estimating
An efficient unbiased estimator
has the smallest sampling distribution variance for a given sample size
ex:
–> Estimator A is efficient because its estimates are tightly grouped around the true value of μ (smaller standard error).
–> Estimator B is inefficient because its estimates are more spread out from the true value of μ (larger standard error)
A consistent estimator
gets closer to the population parameter’s value as the sample size increases
As the sample size approaches infinity, the standard error will approach zero, and the distribution will fully concentrate over the true population value
ex:
–> Estimator A is consistent because its standard error significantly narrows down when sample size increases.
–> Estimator B is inconsistent. Increasing sample size barely improves the accuracy of the estimate
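To make consistency concrete, here is a minimal NumPy sketch (an illustration, not part of the source notes; the true μ and σ are assumed values) showing the standard error of the sample mean shrinking as n grows:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
mu, sigma = 5.0, 2.0  # assumed true population parameters

# For each sample size, draw 10,000 samples and measure how tightly the
# sample means cluster around mu (the empirical standard error).
for n in (10, 100, 1000):
    sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>4}: standard error of sample mean = {sample_means.std():.4f}")
# The spread shrinks roughly as sigma / sqrt(n): the hallmark of consistency.
```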
A point estimate is unlikely to exactly equal the population parameter due to sampling error
what should we use then?
An interval estimate
A 100(1−α)% confidence interval
is a range that has a 1−α probability of containing the parameter, where α is the significance level
ex: using a 5% significance level creates a 95% confidence interval around the sample mean. We can be 95% confident that the population mean falls somewhere in this interval
A 100(1−α)% confidence interval is calculated by:
Point Estimate ± Reliability Factor × Standard Error
The 100(1−α)% confidence interval for a population mean from a normally distributed population with known variance is:
what does this do?
X̄ ± z_(α/2) × (σ/√n)
This produces a confidence interval whose upper and lower bounds leave a total probability of α that the population mean lies outside the interval
z_(α/2) is used because α/2 is the probability that falls in each tail.
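A minimal sketch of this z-interval in Python (the sample mean, σ, and n below are hypothetical; SciPy's norm.ppf supplies z_(α/2)):

```python
import numpy as np
from scipy.stats import norm

x_bar = 5.0    # sample mean (hypothetical)
sigma = 2.0    # known population standard deviation (hypothetical)
n = 36         # sample size
alpha = 0.05   # 5% significance level -> 95% confidence

z_crit = norm.ppf(1 - alpha / 2)          # z_(alpha/2) ~ 1.96
half_width = z_crit * sigma / np.sqrt(n)  # reliability factor x standard error
print(f"95% CI: [{x_bar - half_width:.3f}, {x_bar + half_width:.3f}]")
```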
When the population variance is unknown, as is often the case, it is appropriate to use the sample standard deviation as a substitute for the population standard deviation.
what is the formula?
X̄ ± z_(α/2) × (s/√n)
the t-distribution
used for confidence intervals when the population variance is unknown
This is valid even when the sample size is small
Since it is more conservative (i.e., the reliability factor is bigger), the confidence interval will be wider
The confidence interval for the population mean can use the t-distribution when the variance is unknown provided the sample is large, or the population is approximately normally distributed.
what is the formula to do so?
X̄ ± t_(α/2) × (s/√n)
degrees of freedom: n - 1
we have to use the t table and find where the level of confidence intersects with the degrees of freedom to obtain the reliability factor
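A similar sketch for the t-interval (the data values are hypothetical; SciPy's t.ppf supplies t_(α/2) with n − 1 degrees of freedom, replacing the table lookup):

```python
import numpy as np
from scipy.stats import t

data = np.array([4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.4, 5.0])  # hypothetical sample
n = data.size
x_bar = data.mean()
s = data.std(ddof=1)                       # sample standard deviation
alpha = 0.05

t_crit = t.ppf(1 - alpha / 2, df=n - 1)    # t_(alpha/2), n - 1 degrees of freedom
half_width = t_crit * s / np.sqrt(n)
print(f"95% CI: [{x_bar - half_width:.3f}, {x_bar + half_width:.3f}]")
```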
which do we use between z and t distributions for:
large sample size
Unknown population variance
t is better
z is acceptable
which do we use between z and t distributions for:
large sample size
known population variance
z
which do we use between z and t distributions for:
small sample size
not a normal distribution
not available
which do we use between z and t distributions for:
small sample size
normal distribution
known population variance
z
which do we use between z and t distributions for:
small sample size
normal distribution
unknown population variance
t
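The five flashcards above can be condensed into one decision rule; the helper function below is a hypothetical sketch of that logic, not a formula from the source:

```python
def reliability_distribution(large_sample: bool,
                             normal_population: bool,
                             variance_known: bool) -> str:
    """Encodes the z-vs-t decision rules from the flashcards above."""
    if large_sample:
        return "z" if variance_known else "t (z acceptable)"
    if not normal_population:
        return "not available"
    return "z" if variance_known else "t"

# Example: small sample, normal population, unknown variance -> "t"
print(reliability_distribution(large_sample=False,
                               normal_population=True,
                               variance_known=False))
```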
A point estimate is most accurately described as:
A
an expected value.
B
an expected value and a standard error.
C
an expected value and a confidence interval.
A
an expected value.
A sampling model that produces an expected value of 5.0% for the equity risk premium is most likely considered to be an unbiased estimator if:
A
the population mean equity risk premium is 5.0%.
B
the standard error of the sample mean decreases as the sample size increases.
C
the standard error of the sample mean couldn’t get any smaller without increasing the sample size.
A
the population mean equity risk premium is 5.0%.
An analyst reports that the equity risk premium is estimated to be 3.0%, with a 95% probability of being between 2% and 4%. The reliability factor is most likely:
A
1%.
B
95%.
C
1.96.
C
1.96.
The reliability factor (RF)
the critical value taken from the z table; it is found where the table's area equals the level of confidence divided by 2 (e.g., 0.95/2 = 0.4750 gives z = 1.96)
Resampling
a process that allows analysts to repeatedly draw samples from the original data set
when is resampling important?
important when the sample size is too small to accurately estimate the population parameter
two techniques for resampling
bootstrap resampling
jackknife resampling.
bootstrap resampling
usually requires computer simulation
Using this method, each observation drawn is replaced before the next draw (i.e., sampling with replacement), so the pool stays the same for every draw.
Each resample is also the same size as the original sample.
Bootstrapping can determine the standard error and confidence intervals for statistics such as the median.
In addition, it produces accurate estimates without relying on any analytical formula
ex:
an analyst may want to estimate the population mean using the mean of one set of sample
The analyst may construct the distribution of the sample mean by creating multiple resamples from this single sample set
These resamples will then form a distribution that can approximate the true sampling distribution
the standard error of the sample mean formula when using bootstrap resampling
s_X̄ = √( (1/(B−1)) × ∑_{b=1}^{B} (θ̂_b − θ̄)² )
B: number of resamples drawn from the original sample
θ̂_b: mean of resample b
θ̄: mean of all resample means
In practice, the greater the number of resamples, the more precise the estimated standard error of the sample mean becomes.
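A minimal bootstrap sketch in Python (the original sample is simulated here, and B = 1,000 is an arbitrary choice) applying the standard-error formula above:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
sample = rng.normal(5.0, 2.0, size=30)  # stands in for the single original sample

B = 1000  # number of bootstrap resamples (arbitrary)
# Each resample is drawn WITH replacement and matches the original sample size.
resample_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(B)
])

# Apply the formula above: sqrt( (1/(B-1)) * sum( (theta_hat_b - theta_bar)^2 ) )
theta_bar = resample_means.mean()
se_boot = np.sqrt(((resample_means - theta_bar) ** 2).sum() / (B - 1))
print(f"bootstrap SE of the mean: {se_boot:.4f}")
print(f"analytical s/sqrt(n):     {sample.std(ddof=1) / np.sqrt(sample.size):.4f}")
```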
jackknife resampling
draws samples by leaving out one observation at a time (without replacement)
commonly used to reduce the bias of an estimator
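A minimal jackknife sketch (hypothetical data; the standard-error formula used at the end is the usual jackknife estimate, which the notes do not spell out):

```python
import numpy as np

sample = np.array([4.2, 5.1, 4.8, 5.6, 4.9, 5.3])  # hypothetical sample, n = 6
n = sample.size

# One run per observation: leave observation i out (no replacement), so each
# jackknife resample has n - 1 observations and there are exactly n runs.
jackknife_means = np.array([np.delete(sample, i).mean() for i in range(n)])

# Standard jackknife SE estimate (assumed here; not given in the notes):
# sqrt( (n-1)/n * sum( (theta_i - theta_bar)^2 ) )
theta_bar = jackknife_means.mean()
se_jack = np.sqrt((n - 1) / n * ((jackknife_means - theta_bar) ** 2).sum())
print(f"jackknife SE of the mean: {se_jack:.4f}")  # equals s/sqrt(n) for the mean
```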
main differences between bootstrap resampling and jackknife resampling:
Results for each run:
- Bootstrap: Different because of random sampling
- Jackknife: Similar due to its computation procedure
Number of repetitions:
- Bootstrap: Flexible depending on circumstances
- Jackknife: Same as the original sample size (e.g., 4 runs for a sample of 4)
Data snooping (or data mining)
refers to overusing the same data
A model is built by searching diligently for any statistically significant patterns.
Researchers tend to focus on the small number of significant patterns they find and rarely publish their many statistically insignificant results.
As noted by economist Ronald Coase, “If you torture the data long enough, it will confess.”
To identify data snooping bias, analysts may split the data into which three separate sets?
Training dataset
Validation dataset
Test dataset
Training dataset
Used to model and fit parameters
Validation dataset
Used to evaluate model fit and tune parameters
Test dataset
Used to evaluate the final model fit
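A minimal sketch of such a three-way split (the 60/20/20 proportions and the dataset are arbitrary illustrations, not prescribed by the source):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.normal(size=(1000, 5))  # hypothetical dataset: 1,000 rows, 5 features

# Shuffle once, then carve out 60% training, 20% validation, 20% test.
idx = rng.permutation(len(data))
train_end, val_end = int(0.6 * len(data)), int(0.8 * len(data))

train = data[idx[:train_end]]              # fit the model's parameters
validation = data[idx[train_end:val_end]]  # evaluate fit and tune parameters
test = data[idx[val_end:]]                 # one final out-of-sample evaluation

print(len(train), len(validation), len(test))  # 600 200 200
```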
with data snooping (data mining), where is a genuine relationship found
it should be found in the out-of-sample test
with data snooping (data mining), when is a model successful?
a model is only successful if it works in the future
intergenerational data mining
what is it, and why is it used?
what bias or drawback comes with it?
using results from previous studies
many researchers use the same data sets
This often leads analysts to study the same anomalies and thus exaggerate their importance
Sample selection bias
occurs if certain assets or time periods are excluded from the data
ex: survivorship bias
sometimes occurs when stock price and accounting data are used
–> For example, many studies have shown the stocks of companies with low price-to-book ratios tend to outperform in future periods. This could be because companies that fail are excluded from the studies
Delisting a company’s stock from an exchange can also cause bias because it is difficult to track subsequent performance.
–> Usually, delisting occurs because of poor performance
survivorship bias
occurs if only funds still in existence are included in the study
This can even occur when studying international indices if economies that do not survive are excluded
why does hedge fund performance have a significant self-selection bias?
because the hedge fund managers voluntarily share information
–> Only managers with positive results are inclined to include results in databases
when are investors also influenced by implicit selection bias?
when there is a threshold that enables self-selection
–> example, the NYSE has higher stock listing requirements than other smaller exchanges
–> The NYSE-listed stock investors may implicitly believe their stocks are of higher quality than those in other exchanges, although the higher listing requirements do not translate into higher expected returns
Backfill bias
another variation of selection bias
When a new fund is added to an index, its past performance may be backfilled into the index’s database
–> This can inflate the index return because new funds are normally added only after they have good performance
Look-ahead bias
occurs if information is used that would not have been available on the test date
For example:
accounting information such as book value will not be available for some time after the end of the period
–> It can arise implicitly if future data is inappropriately used without realizing it.
to mitigate look-ahead bias, what can analysts use?
point-in-time (PIT) data
when they are available
point-in-time (PIT) data
contain information that is available at the time of recording/publication
a time-period bias
occurs when results are specific to the time period studied; longer time periods are generally preferred but may include data from different structural periods
An analyst collects a sample of 12 monthly return datapoints that have been drawn from a larger population. Wanting to reduce the bias of the expected value based on this small sample size, the analyst decides to resample the data using the jackknife method. Which of the following statements regarding this resampling process is most accurate?
A
Each repetition will include 11 observations
B
The process will be completed in 11 repetitions
C
The sample for each of the 12 repetitions will be drawn with replacement
A
Each repetition will include 11 observations
An analyst is conducting a market liquidity study. After studying a stratified sample of dividend-paying stocks, the analyst concludes that the economy is sufficiently liquid and that the stock market may be undervalued. The analyst’s conclusion is most likely affected by:
A
time-period bias.
B
data-mining bias.
C
sample selection bias.
C
sample selection bias.
An analyst randomly samples 100 small-cap stocks and 100 large-cap stocks that have been part of a broad equity index for at least 10 years, concludes that small-cap stocks have outperformed large-cap stocks on a risk-adjusted basis over the past decade, and considers whether this asset class can generate positive excess returns over the next five years. The analyst’s conclusion is most likely affected by:
A
look-ahead bias.
B
time-period bias.
C
survivorship bias.
C
survivorship bias.