1.5: Point Estimates, Confidence Intervals, and Resampling Flashcards
The two branches of statistical inference
hypothesis testing and estimation
hypothesis testing
seeks to determine whether the value of a parameter equals some specific hypothesized value
Estimation
seeks to find the value of the parameter
Estimators
the formulas used to calculate the sample statistics
Estimates
are the particular values derived from these estimators
An unbiased estimator
one whose expected value equals the parameter it is estimating
An efficient unbiased estimator
has the smallest sampling distribution variance for a given sample size
ex:
–> Estimator A is efficient because its estimates are tightly grouped around the true value of μ (smaller standard error).
–> Estimator B is inefficient because its estimates are more spread out from the true value of μ (larger standard error)
A consistent estimator
gets closer to the population parameter’s value as the sample size increases
As the sample size approaches infinity, the standard error will approach zero, and the distribution will fully concentrate over the true population value
ex:
–> Estimator A is consistent because its standard error significantly narrows down when sample size increases.
–> Estimator B is inconsistent. Increasing sample size barely improves the accuracy of the estimate
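To make consistency concrete, here is a minimal NumPy sketch (an illustration, not part of the source notes; the true μ and σ are assumed values) showing the standard error of the sample mean shrinking as n grows:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
mu, sigma = 5.0, 2.0  # assumed true population parameters

# For each sample size, draw 10,000 samples and measure how tightly the
# sample means cluster around mu (the empirical standard error).
for n in (10, 100, 1000):
    sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>4}: standard error of sample mean = {sample_means.std():.4f}")
# The spread shrinks roughly as sigma / sqrt(n): the hallmark of consistency.
```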
A point estimate is unlikely to exactly equal the population parameter due to sampling error
what should we use then?
An interval estimate
A 100(1−α)% confidence interval
is a range that has a 1−α probability of containing the parameter, where α is the significance level
ex: using a 5% significance level creates a 95% confidence interval around the sample mean. We can be 95% confident that the population mean falls somewhere in this interval
A 100(1−α)% confidence interval is calculated by:
Point Estimate ± Reliability Factor × Standard Error
The 100(1−α)% confidence interval for a population mean from a normally distributed population with known variance is:
what does this do?
X̄ ± z_(α/2) × (σ/√n)
This produces a confidence interval whose upper and lower bounds leave a total probability of α that the population mean lies outside the interval
z_(α/2) is used because α/2 is the probability that falls in each tail.
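A minimal sketch of this z-interval in Python (the sample mean, σ, and n below are hypothetical; SciPy's norm.ppf supplies z_(α/2)):

```python
import numpy as np
from scipy.stats import norm

x_bar = 5.0    # sample mean (hypothetical)
sigma = 2.0    # known population standard deviation (hypothetical)
n = 36         # sample size
alpha = 0.05   # 5% significance level -> 95% confidence

z_crit = norm.ppf(1 - alpha / 2)          # z_(alpha/2) ~ 1.96
half_width = z_crit * sigma / np.sqrt(n)  # reliability factor x standard error
print(f"95% CI: [{x_bar - half_width:.3f}, {x_bar + half_width:.3f}]")
```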
When the population variance is unknown, as is often the case, it is appropriate to use the sample standard deviation as a substitute for the population standard deviation.
what is the formula?
X̄ ± z_(α/2) × (s/√n)
the t-distribution
used for confidence intervals when the population variance is unknown
This is valid even when the sample size is small
Since it is more conservative (i.e., the reliability factor is bigger), the confidence interval will be wider
The confidence interval for the population mean can use the t-distribution when the variance is unknown provided the sample is large, or the population is approximately normally distributed.
what is the formula to do so?
X̄ ± t_(α/2) × (s/√n)
degrees of freedom: n - 1
we have to use the t table and find where the level of confidence intersects with the degrees of freedom to obtain the reliability factor
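A similar sketch for the t-interval (the data values are hypothetical; SciPy's t.ppf supplies t_(α/2) with n − 1 degrees of freedom, replacing the table lookup):

```python
import numpy as np
from scipy.stats import t

data = np.array([4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.4, 5.0])  # hypothetical sample
n = data.size
x_bar = data.mean()
s = data.std(ddof=1)                       # sample standard deviation
alpha = 0.05

t_crit = t.ppf(1 - alpha / 2, df=n - 1)    # t_(alpha/2), n - 1 degrees of freedom
half_width = t_crit * s / np.sqrt(n)
print(f"95% CI: [{x_bar - half_width:.3f}, {x_bar + half_width:.3f}]")
```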
which do we use between z and t distributions for:
large sample size
Unknown population variance
t is better
z is acceptable
which do we use between z and t distributions for:
large sample size
known population variance
z
which do we use between z and t distributions for:
small sample size
not a normal distribution
not available
which do we use between z and t distributions for:
small sample size
normal distribution
known population variance
z
which do we use between z and t distributions for:
small sample size
normal distribution
unknown population variance
t
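The five flashcards above can be condensed into one decision rule; the helper function below is a hypothetical sketch of that logic, not a formula from the source:

```python
def reliability_distribution(large_sample: bool,
                             normal_population: bool,
                             variance_known: bool) -> str:
    """Encodes the z-vs-t decision rules from the flashcards above."""
    if large_sample:
        return "z" if variance_known else "t (z acceptable)"
    if not normal_population:
        return "not available"
    return "z" if variance_known else "t"

# Example: small sample, normal population, unknown variance -> "t"
print(reliability_distribution(large_sample=False,
                               normal_population=True,
                               variance_known=False))
```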
A point estimate is most accurately described as:
A
an expected value.
B
an expected value and a standard error.
C
an expected value and a confidence interval.
A
an expected value.
A sampling model that produces an expected value of 5.0% for the equity risk premium is most likely considered to be an unbiased estimator if:
A
the population mean equity risk premium is 5.0%.
B
the standard error of the sample mean decreases as the sample size increases.
C
the standard error of the sample mean couldn’t get any smaller without increasing the sample size.
A
the population mean equity risk premium is 5.0%.
An analyst reports that the equity risk premium is estimated to be 3.0%, with a 95% probability of being between 2% and 4%. The reliability factor is most likely:
A
1%.
B
95%.
C
1.96.
C
1.96.
The reliability factor (RF)
the critical value taken from the z table; it is found where the table's area equals the level of confidence divided by 2 (e.g., 0.95/2 = 0.4750 gives z = 1.96)
Resampling
a process that allows analysts to repeatedly draw samples from the original data set
when is resampling important?
important when the sample size is too small to accurately estimate the population parameter
two techniques for resampling
bootstrap resampling
jackknife resampling.
bootstrap resampling
usually requires computer simulation
Using this method, each observation drawn is replaced before the next draw (i.e., sampling with replacement), so the pool stays the same for every draw.
Each resample is also the same size as the original sample.
Bootstrapping can determine the standard error and confidence intervals for statistics such as the median.
In addition, it produces accurate estimates without relying on any analytical formula
ex:
an analyst may want to estimate the population mean using the mean of one set of sample
The analyst may construct the distribution of the sample mean by creating multiple resamples from this single sample set
These resamples will then form a distribution that can approximate the true sampling distribution
the standard error of the sample mean formula when using bootstrap resampling
s_X̄ = √( (1/(B−1)) × ∑_{b=1}^{B} (θ̂_b − θ̄)² )
B: number of resamples drawn from the original sample
θ̂_b: mean of resample b
θ̄: mean of all resample means
In practice, the greater the number of resamples, the more precise the estimated standard error of the sample mean becomes.
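A minimal bootstrap sketch in Python (the original sample is simulated here, and B = 1,000 is an arbitrary choice) applying the standard-error formula above:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
sample = rng.normal(5.0, 2.0, size=30)  # stands in for the single original sample

B = 1000  # number of bootstrap resamples (arbitrary)
# Each resample is drawn WITH replacement and matches the original sample size.
resample_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(B)
])

# Apply the formula above: sqrt( (1/(B-1)) * sum( (theta_hat_b - theta_bar)^2 ) )
theta_bar = resample_means.mean()
se_boot = np.sqrt(((resample_means - theta_bar) ** 2).sum() / (B - 1))
print(f"bootstrap SE of the mean: {se_boot:.4f}")
print(f"analytical s/sqrt(n):     {sample.std(ddof=1) / np.sqrt(sample.size):.4f}")
```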
jackknife resampling
draws samples by leaving out one observation at a time (without replacement)
commonly used to reduce the bias of an estimator
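A minimal jackknife sketch (hypothetical data; the standard-error formula used at the end is the usual jackknife estimate, which the notes do not spell out):

```python
import numpy as np

sample = np.array([4.2, 5.1, 4.8, 5.6, 4.9, 5.3])  # hypothetical sample, n = 6
n = sample.size

# One run per observation: leave observation i out (no replacement), so each
# jackknife resample has n - 1 observations and there are exactly n runs.
jackknife_means = np.array([np.delete(sample, i).mean() for i in range(n)])

# Standard jackknife SE estimate (assumed here; not given in the notes):
# sqrt( (n-1)/n * sum( (theta_i - theta_bar)^2 ) )
theta_bar = jackknife_means.mean()
se_jack = np.sqrt((n - 1) / n * ((jackknife_means - theta_bar) ** 2).sum())
print(f"jackknife SE of the mean: {se_jack:.4f}")  # equals s/sqrt(n) for the mean
```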
main differences between bootstrap resampling and jackknife resampling:
Results for each run:
- Bootstrap: Different because of random sampling
- Jackknife: Similar due to its computation procedure
Number of repetitions:
- Bootstrap: Flexible depending on circumstances
- Jackknife: Same as the original sample size (e.g., 4 runs for a sample of 4)
Data snooping (or data mining)
refers to overusing the same data
A model is built by searching diligently for any statistically significant patterns.
Researchers tend to focus on the small number of significant patterns they find and rarely publish their many statistically insignificant results.
As noted by economist Ronald Coase, “If you torture the data long enough, it will confess.”
To identify data snooping bias, analysts may split the data into which three separate sets?
Training dataset
Validation dataset
Test dataset
Training dataset
Used to model and fit parameters
Validation dataset
Used to evaluate model fit and tune parameters
Test dataset
Used to evaluate the final model fit
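A minimal sketch of such a three-way split (the 60/20/20 proportions and the dataset are arbitrary illustrations, not prescribed by the source):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.normal(size=(1000, 5))  # hypothetical dataset: 1,000 rows, 5 features

# Shuffle once, then carve out 60% training, 20% validation, 20% test.
idx = rng.permutation(len(data))
train_end, val_end = int(0.6 * len(data)), int(0.8 * len(data))

train = data[idx[:train_end]]              # fit the model's parameters
validation = data[idx[train_end:val_end]]  # evaluate fit and tune parameters
test = data[idx[val_end:]]                 # one final out-of-sample evaluation

print(len(train), len(validation), len(test))  # 600 200 200
```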
with data snooping (data mining), where is a genuine relationship found
it should be found in the out-of-sample test
with data snooping (data mining), when is a model successful?
a model is only successful if it works in the future
intergenerational data mining
what is it, and why is it used?
what bias or drawback comes with it?
using results from previous studies
many researchers use the same data sets
This often leads analysts to study the same anomalies and thus exaggerate their importance
Sample selection bias
occurs if certain assets or time periods are excluded from the data
ex: survivorship bias
sometimes occurs when stock price and accounting data are used
–> For example, many studies have shown the stocks of companies with low price-to-book ratios tend to outperform in future periods. This could be because companies that fail are excluded from the studies
Delisting a company’s stock from an exchange can also cause bias because it is difficult to track subsequent performance.
–> Usually, delisting occurs because of poor performance
survivorship bias
occurs if only funds still in existence are included in the study
This can even occur when studying international indices if economies that do not survive are excluded
why does hedge fund performance have a significant self-selection bias?
because the hedge fund managers voluntarily share information
–> Only managers with positive results are inclined to include results in databases
when are investors also influenced by implicit selection bias?
when there is a threshold that enables self-selection
–> example, the NYSE has higher stock listing requirements than other smaller exchanges
–> The NYSE-listed stock investors may implicitly believe their stocks are of higher quality than those in other exchanges, although the higher listing requirements do not translate into higher expected returns
Backfill bias
another variation of selection bias
When a new fund is added to an index, its past performance may be backfilled into the index’s database
–> This can inflate the index return because new funds are normally added only after they have good performance
Look-ahead bias
occurs if information is used that would not have been available on the test date
For example:
accounting information such as book value will not be available for some time after the end of the period
–> It can arise implicitly if future data is inappropriately used without realizing it.
to mitigate look-ahead bias, what can analysts use?
point-in-time (PIT) data
when they are available
point-in-time (PIT) data
contain information that is available at the time of recording/publication
a time-period bias
occurs when results are specific to the time period studied; longer time periods are generally preferred but may include data from different structural periods
An analyst collects a sample of 12 monthly return datapoints that have been drawn from a larger population. Wanting to reduce the bias of the expected value based on this small sample size, the analyst decides to resample the data using the jackknife method. Which of the following statements regarding this resampling process is most accurate?
A
Each repetition will include 11 observations
B
The process will be completed in 11 repetitions
C
The sample for each of the 12 repetitions will be drawn with replacement
A
Each repetition will include 11 observations
An analyst is conducting a market liquidity study. After studying a stratified sample of dividend-paying stocks, the analyst concludes that the economy is sufficiently liquid and that the stock market may be undervalued. The analyst’s conclusion is most likely affected by:
A
time-period bias.
B
data-mining bias.
C
sample selection bias.
C
sample selection bias.
An analyst randomly samples 100 small-cap stocks and 100 large-cap stocks that have been part of a broad equity index for at least 10 years, concludes that small-cap stocks have outperformed large-cap stocks on a risk-adjusted basis over the past decade, and considers whether this asset class can generate positive excess returns over the next five years. The analyst’s conclusion is most likely affected by:
A
look-ahead bias.
B
time-period bias.
C
survivorship bias.
C
survivorship bias.