Reading 11 - Sampling and Estimation Flashcards
Sampling
In a simple random sample, each member of the population has the same probability or likelihood of being included in the sample. For example, assume that our population consists of 10 balls labeled with numbers 1 to 10. Drawing a random sample of 3 balls from this population of 10 balls would require that each ball has an equal chance of being chosen in the sample, and each combination of balls has an identical chance of being the chosen sample as any other combination.
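A minimal sketch of this idea in Python, using the standard library's random module (the ball labels and sample size follow the example above):

```python
import random

# Population: 10 balls labeled 1 to 10 (from the example above)
population = list(range(1, 11))

# random.sample() draws without replacement, giving every ball, and every
# 3-ball combination, the same chance of being selected.
simple_random_sample = random.sample(population, k=3)
print(simple_random_sample)  # e.g. [7, 2, 10]
```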
Systematic sampling
In practice, random samples are generated using random number tables or computer random‐number generators. Systematic sampling is often used to generate approximately random samples. In systematic sampling, every kth member in the population list is selected until the desired sample size is reached.
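A rough illustration of systematic sampling, assuming a hypothetical population list and a sampling interval k set to the population size divided by the desired sample size:

```python
import random

# Hypothetical population list of 1,000 members and a desired sample of 100.
population = list(range(1, 1001))
sample_size = 100
k = len(population) // sample_size          # sampling interval: every k-th member

# Start at a random point within the first interval, then take every k-th member.
start = random.randrange(k)
systematic_sample = population[start::k][:sample_size]
print(len(systematic_sample), systematic_sample[:5])
```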
Sampling error
Sampling error is the error caused by observing a sample instead of the entire population to draw conclusions relating to population parameters. It equals the difference between a sample statistic and the corresponding population parameter.
Sampling error of the mean
Sampling error of the mean = Sample mean − Population mean = x̄ − μ
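For example, if the mean return of a 50‐stock sample is 8.2% while the true population mean return is 8.0%, the sampling error of the mean is 8.2% − 8.0% = 0.2% (hypothetical figures used purely for illustration).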
Sampling distribution & sampling distribution of the mean
A sampling distribution is the probability distribution of a given sample statistic under repeated sampling of the population. Suppose that a random sample of 50 stocks is selected from a population of 10,000 stocks, and the average return on the 50‐stock sample is calculated. If this process were repeated several times with samples of the same size (50), the sample mean (estimate of the population mean) calculated will be different each time due to the different individual stocks making up each sample. The distribution of these sample means is called the sampling distribution of the mean.
Remember that all the samples drawn from the population must be random, and of the same size. Also note that the sampling distribution is different from the distribution of returns of each of the components of the population (each of the 10,000 stocks) and has different parameters.
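A small simulation sketch of this idea, using NumPy and assuming a hypothetical population of 10,000 stock returns (the population parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 10,000 stock returns (parameters are illustrative).
population = rng.normal(loc=0.08, scale=0.20, size=10_000)

# Repeatedly draw random samples of 50 stocks and record each sample mean.
sample_means = [rng.choice(population, size=50, replace=False).mean()
                for _ in range(1_000)]

# The distribution of these 1,000 sample means is the sampling distribution of the mean.
print(np.mean(sample_means), np.std(sample_means))
```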
Stratification
Stratification is the process of grouping members of the population into relatively homogeneous subgroups, or strata, before drawing samples. The strata should be mutually exclusive i.e., each member of the population must be assigned to only one stratum. The strata should also be collectively exhaustive i.e., no population element should be excluded from the sampling process. Once this is accomplished, random sampling is applied within each stratum and the number of observations drawn from each stratum is based on the size of the stratum relative to the population. This often improves the representativeness of the sample by reducing sampling error.
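A rough sketch of proportional stratified sampling, assuming a hypothetical population grouped into sectors (strata) and using only the Python standard library:

```python
import random

# Hypothetical population grouped into mutually exclusive, collectively
# exhaustive strata (here, stocks grouped by sector).
strata = {
    "Financials": [f"FIN{i}" for i in range(400)],
    "Technology": [f"TEC{i}" for i in range(350)],
    "Utilities":  [f"UTL{i}" for i in range(250)],
}
population_size = sum(len(members) for members in strata.values())
total_sample_size = 100

stratified_sample = []
for name, members in strata.items():
    # The number drawn from each stratum is proportional to the stratum's
    # share of the population.
    n_stratum = round(total_sample_size * len(members) / population_size)
    stratified_sample.extend(random.sample(members, n_stratum))

print(len(stratified_sample))  # approximately 100 (rounding may shift it slightly)
```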
Time-series data
Time‐series data consists of observations measured over a period of time, spaced at uniform intervals. The monthly returns on a particular stock over the last 5 years are an example of time‐series data.
Cross-sectional data
Cross‐sectional data refers to data collected by observing many subjects (such as individuals, firms, or countries/regions) at the same point in time. Analysis of cross‐sectional data usually consists of comparing the differences among the subjects. The returns of individual stocks over the last year are an example of cross‐sectional data.
Data sets can have both time‐series and cross‐sectional data in them. Examples of such data sets are:
Longitudinal data, which is data collected over time about multiple characteristics of the same observational unit. The various economic indicators of a particular country (observational unit), such as unemployment levels, inflation, and GDP growth rates (multiple characteristics), tracked over a decade (period of time), are an example of longitudinal data.
Panel data, which refers to data collected over time about a single characteristic of multiple observational units. The unemployment rates (single characteristic) of a number of countries (multiple observational units) over time are an example of panel data.
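A small pandas sketch of the distinction, with made-up numbers purely for illustration:

```python
import pandas as pd

# Longitudinal data: multiple characteristics of ONE observational unit over time
# (hypothetical figures for a single country).
longitudinal = pd.DataFrame(
    {"unemployment_%": [5.1, 4.8, 4.5],
     "inflation_%":    [2.0, 2.3, 1.9],
     "gdp_growth_%":   [2.5, 2.7, 2.2]},
    index=pd.Index([2021, 2022, 2023], name="year"),
)

# Panel data: ONE characteristic of MULTIPLE observational units over time
# (hypothetical unemployment rates for several countries).
panel = pd.DataFrame(
    {"Country A": [5.1, 4.8, 4.5],
     "Country B": [7.2, 6.9, 6.4],
     "Country C": [3.9, 4.1, 4.0]},
    index=pd.Index([2021, 2022, 2023], name="year"),
)

print(longitudinal)
print(panel)
```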
Central limit theorem
The central limit theorem allows us to make accurate statements about the population mean and variance using the sample mean and variance regardless of the distribution of the population, as long as the sample size is adequate. An adequate sample size is generally defined as one with 30 or more observations (n ≥ 30).
The important properties of the central limit theorem are:
1) Given a population with any probability distribution, with mean μ and variance σ², the sampling distribution of the sample mean, x̄, computed from samples of size n, will be approximately normal with mean μ (the population mean) and variance σ²/n (the population variance divided by the sample size), when the sample size is greater than or equal to 30.
2) No matter what the distribution of the population, for a sample whose size is greater than or equal to 30, the sample mean will be approximately normally distributed:
x̄ ~ N(μ, σ²/n)
3) The mean of the population (μ) and the mean of the distribution of sample means (x̄) are equal.
4) The variance of the distribution of sample means equals σ²/n, the population variance divided by the sample size.
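A short simulation sketch of these properties, drawing samples from a deliberately non-normal (lognormal) population with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Deliberately non-normal population (lognormal); parameters chosen for illustration.
population = rng.lognormal(mean=0.0, sigma=0.75, size=100_000)
mu, sigma2 = population.mean(), population.var()

n = 50  # sample size (>= 30, so the CLT applies)
sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)

# The sample means cluster around mu with variance close to sigma2 / n,
# and their histogram looks approximately normal.
print("population mean:", round(mu, 4), " mean of sample means:", round(sample_means.mean(), 4))
print("sigma^2 / n:    ", round(sigma2 / n, 4), " variance of sample means:", round(sample_means.var(), 4))
```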
Standard error
The standard deviation of the distribution of sample means is known as the standard error of the statistic.
When the population variance, σ², is known, the standard error of the sample mean is calculated as:
σx̄ = σ/√n
Practically speaking, population variances are almost never known, so we estimate the standard error of the sample mean using the sample's standard deviation, s:
sx̄ = s/√n
Point estimate
A point estimate involves the use of sample data to calculate a single value (a statistic) that serves as an approximation for an unknown population parameter. For example, the sample mean, x̄, is a point estimate of the population mean, μ. The formula used to calculate a point estimate is known as an estimator.
The estimator for the sample mean is given as:
x̄ = (x₁ + x₂ + … + xₙ)/n
Confidence interval
A confidence interval uses sample data to calculate a range of possible (or probable) values that an unknown population parameter can take, with a given probability of (1 − α). α is called the level of significance, and (1 − α) refers to the degree of confidence that the relevant parameter will lie in the computed interval. For example, a calculated interval between 100 and 150 at the 5% significance level implies that we can be 95% confident that the population parameter lies between 100 and 150.
A (1 – α)% confidence interval has the following structure:
Point estimate ± (reliability factor * standard error)
where:
Point estimate = value of the sample statistic that is used to estimate the population parameter.
Reliability factor = a number based on the assumed distribution of the point estimate and the level of confidence for the interval (1 – α).
Standard error = the standard error of the sample statistic (point estimate).
When choosing between a number of possible estimators for a particular population parameter, we make use of the desirable statistical properties of an estimator to make the best possible selection. The desirable properties of an estimator are:
Unbiasedness
Efficiency
Consistency
Statistical property: Unbiasedness
Unbiasedness: An unbiased estimator is one whose expected value is equal to the parameter being estimated. The expected value of the sample mean equals the population mean [E(x̄) = μ]. Therefore, the sample mean, x̄, is an unbiased estimator of the population mean, μ.
Statistical property: Efficiency
Efficiency: An efficient unbiased estimator is one that has the lowest variance among all unbiased estimators of the same parameter.
Statistical property: Consistency
Consistency: A consistent estimator is one for which the probability of estimates close to the value of the population parameter increases as sample size increases. We have already seen that the standard error of the sampling distribution falls as sample size increases, which implies a higher probability of estimates close to the population mean.
Student’s t‐distribution is a bell‐shaped probability distribution that has the following properties:
- It is symmetrical.
- It is defined by a single parameter, the degrees of freedom (df), where degrees of freedom equal sample size minus one (n‐1).
- It has a lower peak than the normal curve, but fatter tails.
- As the degrees of freedom increase, the shape of the t‐distribution approaches the shape of the standard normal curve.
A random sample size, n, and degrees of freedom
A random sample of size, n, is said to have n‐1 degrees of freedom. Basically, there are n‐1 independent deviations from the mean on which the estimate can be based.
What happens to the t-distribution curve as degrees of freedom increase?
As the degrees of freedom increase, the t‐distribution curve becomes more peaked and its tails become thinner (bringing it closer to a normal curve). As a result, for a given significance level, the confidence interval for a random variable that follows the t‐distribution will become narrower when the degrees of freedom increase. We will be more confident that the population mean will lie within the calculated interval as more data is concentrated towards the middle (as demonstrated by the higher peak) and less data is in the tails (thinner tails).
The t‐distribution is used in the following scenarios:
- It is used to construct confidence intervals for a normally (or approximately normally) distributed population whose variance is unknown when the sample size is small (n < 30).
- It may also be used for a non‐normally distributed population whose variance is unknown if the sample size is large (n ≥ 30). In this case, the central limit theorem is used to assume that the sampling distribution of the sample mean is approximately normal.
The confidence interval for the population mean when the population follows a normal distribution and its variance is known is calculated as follows (NOTE: this requires the population standard deviation):
x̄ ± zα/2 × (σ/√n)
The following reliability factors are used frequently when constructing confidence intervals based on the standard normal distribution:
For a 90% confidence interval we use z0.05 = 1.65
For a 95% confidence interval we use z0.025 = 1.96
For a 99% confidence interval we use z0.005 = 2.58
For example, suppose that, based on the mock exam scores of a sample of 36 SAT candidates, we construct a 99% confidence interval for the population mean score and obtain an interval of 1663 to 1836. This confidence interval can be interpreted in two ways:
- Probabilistic interpretation: After repeatedly taking samples of 36 SAT candidates' scores on the mock exam, and then constructing confidence intervals based on each sample's mean, 99% of the confidence intervals will include the population mean over the long run.
- Practical interpretation: We can be 99% confident that the average population score for the actual SAT exam is between 1663 and 1836.
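A minimal sketch of a z-based confidence interval, assuming a hypothetical sample mean, a known population standard deviation, and a sample size chosen purely for illustration:

```python
import math

# Hypothetical inputs (illustrative only).
x_bar = 0.085      # sample mean return
sigma = 0.20       # known population standard deviation
n = 64             # sample size
z = 1.96           # reliability factor for a 95% confidence interval

standard_error = sigma / math.sqrt(n)
lower = x_bar - z * standard_error
upper = x_bar + z * standard_error
print(f"95% confidence interval: [{lower:.4f}, {upper:.4f}]")
```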
When the variance of a normally distributed population is not known, we use the t‐distribution to construct confidence intervals:
x̄ ± tα/2 × (s/√n)
where the reliability factor, tα/2, has n − 1 degrees of freedom and s is the sample standard deviation.
t-distribution vs. z-distribution
Recall that the critical t‐values, or reliability factors, for constructing the confidence interval depend on the desired level of confidence and on the sample size (through the degrees of freedom). Also recall that the t‐distribution has fatter, thicker tails than the normal distribution. Because relatively more probability lies in the tails, a confidence interval at a given significance level will be wider under the t‐distribution than under the z‐distribution.
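A short comparison of reliability factors, using scipy.stats to show that t critical values exceed the z value and shrink toward it as the degrees of freedom grow (the degrees-of-freedom values are arbitrary):

```python
from scipy import stats

alpha = 0.05  # 95% confidence
z_crit = stats.norm.ppf(1 - alpha / 2)
print(f"z critical value: {z_crit:.3f}")

# t critical values for arbitrary degrees of freedom; they are larger than z
# (wider intervals) and approach z as df increases.
for df in (5, 15, 29, 200):
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    print(f"df = {df:>3}: t critical value = {t_crit:.3f}")
```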
When the population is normally distributed, when do we use z-statistic vs t-statistic?
Use the z‐statistic when the population variance is known.
Use the t‐statistic when the population variance is not known.
When the distribution of the population is nonnormal, the construction of an appropriate confidence interval depends on the size of the sample. When do we use z-statistic vs t-statistic?
- If the population variance is known and the sample size is large (n ≥ 30) we use the z‐statistic. This is because the central limit theorem tells us that the distribution of the sample mean is approximately normal when sample size is large.
- If the population variance is not known and sample size is large, we can use the z‐statistic or the t‐statistic. However, in this scenario the use of the t‐statistic is encouraged because it results in a more conservative measure.
This implies that we cannot construct confidence intervals for nonnormal distributions if sample size is less than 30.
When do you use z-distribution to construct confidence intervals?
When the variance of a normally distributed population is not known and the sample size is large, we may also use the z‐distribution (with the sample standard deviation in place of σ) to construct confidence intervals:
x̄ ± zα/2 × (s/√n)
Criteria for Selecting Appropriate Test Statistic
From our discussion so far, we have understood that there are various factors that affect the width of a confidence interval: Name two.
The choice of test statistic: a t‐statistic gives a wider confidence interval than a z‐statistic.
The degree of confidence: A higher desired level of confidence increases the size of the confidence interval.
From our formula for the confidence interval, it is easy to see that the width of the interval is also a function of the standard error. Explain.
The larger the standard error, the wider the confidence interval. The standard error, in turn, is a function of sample size: a larger sample size results in a smaller standard error and reduces the width of the confidence interval (see the short sketch after this list). Therefore, large sample sizes are desirable as they increase the precision with which we can estimate a population parameter. However, in practice two considerations may work against increasing the sample size:
Increasing the size of the sample may result in drawing observations from a different population.
Increasing the sample size may involve additional expenses that outweigh the benefit of increased accuracy of estimates. Other than the risk of sampling from more than one population, there are a variety of challenges to valid sampling. If the sample is biased in any way, estimates and conclusions drawn from sample data will be erroneous.
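A quick numerical sketch of how the standard error (and hence the confidence-interval width) shrinks as the sample size grows, using an arbitrary sample standard deviation:

```python
import math

s = 0.20  # hypothetical sample standard deviation
for n in (25, 100, 400, 1600):
    standard_error = s / math.sqrt(n)
    # Quadrupling the sample size halves the standard error.
    print(f"n = {n:>4}: standard error = {standard_error:.4f}")
```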
Name the types of biases
Data mining bias
Sample selection bias
Survivorship bias
Look-ahead bias
Time-period bias
Data mining
Data mining is the practice of developing a model by extensively searching through a data set for statistically significant relationships until a pattern “that works” is discovered. In the process of data mining, large numbers of hypotheses about a single data set are tested in a very short time by searching for combinations of variables that might show a correlation.
Given that enough hypotheses are tested, it is virtually certain that some of them will appear to be highly statistically significant, even on a data set with no real correlations at all. Researchers who use data mining techniques can be easily misled by these apparently significant results even though they are merely coincidences.
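A small simulation sketch of this effect: many purely random "factors" are tested against purely random "returns", yet some correlations still appear significant at the 5% level by chance alone (all data here is random noise):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_obs, n_factors = 60, 200

returns = rng.normal(size=n_obs)                # random "returns"
factors = rng.normal(size=(n_factors, n_obs))   # 200 random "factors"

# Test every factor against the returns; with no true relationship at all,
# roughly 5% of the tests will still come out "significant" at the 5% level.
significant = sum(
    1 for f in factors if stats.pearsonr(f, returns)[1] < 0.05
)
print(f"{significant} of {n_factors} random factors appear significant")
```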
Data‐mining bias most commonly occurs when:
Researchers have not formed a hypothesis in advance, and are therefore open to any hypothesis suggested by the data.
When researchers narrow the data used in order to reduce the probability of the sample refuting a specific hypothesis.
Warning signs that data mining bias might exist are:
- Too much digging warning sign, which involves testing numerous variables until one that appears to be significant is discovered.
- No story/no future warning sign, which is indicated by a lack of an economic theory that can explain the empirical results.
The best way to avoid the data‐mining bias is to:
The best way to avoid the data‐mining bias is to test the “apparently statistically significant relationships” on “out‐of‐sample” data to check whether they continue to hold.
Sample-selection bias
Sample-selection bias results from the exclusion of certain assets (such as bonds, stocks, or portfolios) from a study due to the unavailability of data.
Sample selection bias is even more severe in studies of hedge fund returns. This is because hedge funds are not required to publicly disclose their performance data. Only funds that performed well choose to disclose their performance, which leads to an overstatement of hedge fund returns.
Survivorship bias
Some databases use historical information and may suffer from a type of sample selection bias known as survivorship bias. This bias is present in databases that only list companies or funds currently in existence, which means that those that have failed are not included in the database. As a result, the results obtained from the study may not accurately reflect the true picture.