QM PREREQ5 Sampling and estimation Flashcards
What is an estimator?
A formula (rule) applied to sample data to estimate a population parameter (e.g., the sample variance as an estimator of the population variance)
What are the desirable properties for estimators?
- Unbiasedness: an unbiased estimator is one whose expected value (the mean of its sampling distribution) equals the parameter it is intended to estimate
For example, xbar = sum of xsubi / n is an unbiased estimator of the population mean
xbar = sum of xsubi / (n-1) would be biased upwards (for a positive mean), because dividing by n-1 instead of n scales every estimate by n/(n-1)
- Efficiency: an unbiased estimator is efficient if no other unbiased estimator has a sampling distribution with smaller variance
The sampling distribution of a more efficient estimator has a taller peak and thinner tails (even though both estimators are unbiased)
- Consistency: a consistent estimator is one for which the probability of estimates close to the value of the population parameter increases as the sample size increases
For example, the standard error of the sample mean is SE = s/sqrt(n)
As n increases the standard error decreases, so xbar concentrates around the population mean; this is why the sample mean is a consistent estimator
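The unbiasedness property can be checked exactly on a toy case. The sketch below assumes a hypothetical population (the six faces of a fair die) and samples of size 3 drawn with replacement; it averages each estimator over every possible sample, which is its expected value. The names `population`, `expected_value`, `ev_mean` and `ev_wrong` are illustrative only.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical population: one fair die
n = 3                             # sample size (assumption for illustration)

def expected_value(estimator):
    """Average an estimator over every equally likely sample of size n."""
    samples = list(product(population, repeat=n))
    return sum(estimator(s) for s in samples) / len(samples)

mu = Fraction(sum(population), len(population))               # true mean
ev_mean = expected_value(lambda s: Fraction(sum(s), n))       # divide by n
ev_wrong = expected_value(lambda s: Fraction(sum(s), n - 1))  # divide by n-1

print(mu, ev_mean, ev_wrong)
```

Dividing by n reproduces the population mean on average, while dividing by n-1 gives the population mean scaled by n/(n-1).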
What is a confidence interval?
A range for which one can assert with a given probability (1-alpha), called the degree of confidence, that it will contain the parameter it is intended to estimate
I.e., lower limit <- xbar -> upper limit
This is a two sided confidence interval
What is a point estimate?
A single value computed from the sample (e.g., xbar) that is used as the estimate of a population parameter
What are the two interpretations of a confidence interval?
- Probabilistic: in repeated sampling, 95% (for example) of such CIs will in the long run include or bracket the population mean
- Practical: 95% confident that a given CI contains the population mean
How do we construct a CI?
Take the point estimate (xbar)
Add or subtract the reliability factor multiplied by the standard error
The reliability factor can be based on a z value or a t value
The standard error is sigma / sqrt(n), or s / sqrt(n) if you only have the sample standard deviation
Reliability factor x standard error is the half-width of the interval; multiplying it by 2 gives the full width of the confidence interval, as the interval is xbar plus or minus that amount
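As a rough sketch of the construction steps, the Python below builds a two-sided interval from a sample. The data in `returns` are made up, the function name `confidence_interval` is hypothetical, and the reliability factor 2.365 is the 95% two-sided t value for 7 degrees of freedom taken from a standard t table.

```python
from math import sqrt
from statistics import mean, stdev

def confidence_interval(sample, reliability_factor):
    """Point estimate +/- reliability factor x standard error."""
    n = len(sample)
    xbar = mean(sample)                       # point estimate
    standard_error = stdev(sample) / sqrt(n)  # s / sqrt(n), sample std dev
    margin = reliability_factor * standard_error
    return xbar - margin, xbar + margin

# Made-up sample of 8 observations (assumption for illustration only)
returns = [0.02, 0.05, -0.01, 0.03, 0.04, 0.01, 0.00, 0.06]

# n = 8, so 7 degrees of freedom; 2.365 is the 95% two-sided t value
lower, upper = confidence_interval(returns, reliability_factor=2.365)
print(lower, upper)
```

The width `upper - lower` equals twice the reliability factor times the standard error.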
What are the most common reliability factors?
90%: 1.65 (1.645 to three decimals)
95%: 1.96
99%: 2.58
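These z-based reliability factors can be reproduced with Python's standard library. This is a minimal sketch; the helper name `z_reliability_factor` is hypothetical, and it assumes a two-sided interval so that alpha is split equally between the two tails.

```python
from statistics import NormalDist

def z_reliability_factor(confidence):
    """Two-sided z value: leaves alpha/2 in each tail of the standard normal."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_reliability_factor(level), 3))
```

The values round to 1.645, 1.96 and 2.576, which match the table above to two decimals.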
Do we use z or t to find our confidence interval if we have a large sample with variance unknown?
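z, because as the sample size increases the t distribution converges to the standard normal (the t value falls towards the z value)
e.g., if n=400 we would just use z
The reading tends to say that over n=30 we would stop using t, but it is only at around n=200 or 300 that the two effectively converge. A “large sample size” is not really 50.
You are never WRONG to use the t value when the variance is unknown: because of the convergence, it gives essentially the same answer as z for large samples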
How do we find t-value in excel?
=T.INV(probability, degrees of freedom)
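T.INV is left-tailed: a probability below 0.5 gives the negative t value and a probability above 0.5 gives the positive one (e.g., use 0.975 for a 95% two-sided interval). =T.INV.2T(alpha, degrees of freedom) gives the positive two-tailed value directly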
Under what conditions would we use the z value?
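When the population variance is known, or when the variance is unknown but the sample is large enough that t and z have effectively converged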
How do we determine the sample size required to obtain a confidence interval of a given width (e.g., within 1% of the mean)?
The confidence interval is xbar +/- ( t x s/sqrt(n) )
Let’s call the plus/minus term E: E = t x s/sqrt(n)
The width of the confidence interval will be 2E
Thus we can rearrange to:
n = [ (t x s) / E]^2
We would not expect standard deviation for the sample to change as n changes, but we would expect standard error to change.
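The rearranged formula can be sketched as a small Python helper. The function name `required_sample_size` and the inputs (s = 0.20, E = 0.01) are hypothetical. Because the t value itself depends on n, the sketch assumes the z value 1.96 as the reliability factor, which is a reasonable approximation when the resulting n is large.

```python
from math import ceil

def required_sample_size(reliability_factor, s, E):
    """n = [(reliability factor x s) / E]^2, rounded up to a whole observation."""
    return ceil((reliability_factor * s / E) ** 2)

# Hypothetical inputs: sample std dev of 20%, desired half-width E of 1%
n = required_sample_size(1.96, 0.20, 0.01)
print(n)  # 1537

# Halving E roughly quadruples the required sample size
n_half = required_sample_size(1.96, 0.20, 0.005)
print(n_half)  # 6147
```

Since E enters the formula squared, tightening the interval is expensive in terms of observations.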
What is a data snooping bias?
The bias of searching a data set for statistical patterns or relationships. This is also known as data mining.
If alpha = 5%, testing 100 different variables will, on average, produce 5 significant relationships by chance alone, even if no true relationship exists
Data snooping is typically not theory-driven, and lacks an economic rationale behind it.
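The arithmetic behind the 100-variable example can be sketched in Python. It assumes the 100 tests are independent and that none of the variables has a true relationship; both are simplifying assumptions for illustration.

```python
alpha = 0.05     # significance level of each test
n_tests = 100    # number of variables tested

# Expected number of spurious "significant" results
expected_false_positives = alpha * n_tests

# Probability that at least one test is significant purely by chance
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(expected_false_positives)
print(round(p_at_least_one, 3))
```

Under these assumptions a spurious finding is close to certain, which is why a significant result found by searching needs out-of-sample confirmation.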
How do we minimise or avoid data snooping bias?
- To combat data snooping bias we must have a clear, well-formulated hypothesis. It must have an economic rationale and accompanying theory behind it.
- We split our data set into a training data set, a validation data set, and test data.
- The training data is used to build and fit a model
- The validation data set is used to tune the model.
- The test data is used as an out-of-sample test to evaluate model fit. If data snooping is present, the model will tend to fit poorly out of sample (the fit will be insignificant)!
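The three-way split can be sketched in Python as follows. The 60/20/20 proportions, the function name `chronological_split`, and the choice to split in time order rather than at random are all assumptions for illustration; the cards do not specify them.

```python
def chronological_split(observations, train_frac=0.6, validation_frac=0.2):
    """Split ordered observations into training, validation and test sets."""
    n = len(observations)
    train_end = round(n * train_frac)
    validation_end = train_end + round(n * validation_frac)
    train = observations[:train_end]                   # build and fit the model
    validation = observations[train_end:validation_end]  # tune the model
    test = observations[validation_end:]               # out-of-sample evaluation
    return train, validation, test

data = list(range(100))  # stand-in for 100 time-ordered observations
train, validation, test = chronological_split(data)
print(len(train), len(validation), len(test))  # 60 20 20
```

The test set is touched only once, after all fitting and tuning is finished, so that it remains a genuine out-of-sample check.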
What is sample selection bias?
Excluding some observations or time periods (basically choosing non-random samples)
e.g., survivorship bias: historical data may only include data for companies that survived
This would overstate the performance.
Another example would be using hedge fund indexes. Since they self-report, only well-performing funds may opt to report.
What is look ahead bias?
Using information that was not available on the observation date.
E.g., models that use price and accounting data from the historical record, when the accounting data may not have been available on the same date.
For example, we can observe the price on Dec 31st, and book value on Dec 31st, but in fact BV may not have been reported until mid February. Linking BV and price on Dec 31st would be look ahead bias.
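One way to avoid this in practice is to key accounting data on the date it was reported rather than the period it describes. The Python sketch below assumes hypothetical records and a hypothetical helper `book_value_known_on`; the figures and dates are made up.

```python
from datetime import date

# Hypothetical records: (fiscal period end, date actually reported, book value)
book_values = [
    (date(2022, 12, 31), date(2023, 2, 15), 100.0),
    (date(2023, 12, 31), date(2024, 2, 14), 110.0),
]

def book_value_known_on(observation_date, records):
    """Return the most recently reported book value available on that date."""
    available = [r for r in records if r[1] <= observation_date]
    if not available:
        return None  # nothing had been reported yet
    return max(available, key=lambda r: r[1])[2]

# On Dec 31st 2023 the year-end figure is not yet public, so the prior one is used
print(book_value_known_on(date(2023, 12, 31), book_values))  # 100.0
print(book_value_known_on(date(2024, 2, 14), book_values))   # 110.0
```

Pairing the Dec 31st price with the 110.0 figure would be look-ahead bias, because that number only became available in mid-February.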
What is time period bias?
Results in one time period may be specific to that time period.
Time period bias is typical of SHORT time series
However, time series that are too long risk including more than one regime or distribution