Reading 11 - Sampling and Estimation Flashcards

1
Q

Sampling

A

In a simple random sample, each member of the population has the same probability of being included in the sample. For example, assume that our population consists of 10 balls labeled with the numbers 1 to 10. Drawing a random sample of 3 balls from this population requires that each ball has an equal chance of being chosen, and that each combination of 3 balls has the same chance of being the chosen sample as any other combination.
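
The ball example can be sketched in Python as follows. This is only an illustration: the seed value is an arbitrary assumption, and `random.Random.sample` is used here as one convenient way to draw without replacement.

```python
import random
from math import comb

balls = list(range(1, 11))       # population: balls labeled 1 to 10
rng = random.Random(42)          # fixed seed (arbitrary) so the draw is repeatable
sample = rng.sample(balls, 3)    # draw 3 balls without replacement

# Under simple random sampling every 3-ball combination is equally likely.
n_combinations = comb(10, 3)     # number of distinct 3-ball samples
p_ball_included = 3 / 10         # chance that any given ball ends up in the sample

print(sample, n_combinations, p_ball_included)
```

Each of the 120 possible combinations therefore has a 1/120 chance of being the chosen sample.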

2
Q

Systematic sampling

A

In practice, random samples are generated using random number tables or computer random‐number generators. Systematic sampling is often used to generate approximately random samples. In systematic sampling, every kth member in the population list is selected until the desired sample size is reached.
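
A minimal sketch of the "every kth member" rule is shown below. The function name `systematic_sample` and the `start` offset are assumptions made for illustration; the card does not specify how the starting point is chosen.

```python
def systematic_sample(population, sample_size, start=0):
    """Select every k-th member of the list, beginning at index `start`."""
    k = len(population) // sample_size          # sampling interval
    if k < 1:
        raise ValueError("sample_size exceeds the population size")
    if not 0 <= start < k:
        raise ValueError("start must lie within the first interval")
    return [population[start + i * k] for i in range(sample_size)]


members = list(range(1, 101))                   # hypothetical population list of 100 members
print(systematic_sample(members, 10, start=4))  # every 10th member, starting with the 5th
```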

4
Q

Sampling error

A

Sampling error is the error caused by observing a sample instead of the entire population to draw conclusions relating to population parameters. It equals the difference between a sample statistic and the corresponding population parameter.

5
Q

Sampling error of the mean

A

Sampling error of the mean = Sample mean − Population mean = x̄ − μ

6
Q

Sampling distribution & sampling distribution of the mean

A

A sampling distribution is the probability distribution of a given sample statistic under repeated sampling of the population. Suppose that a random sample of 50 stocks is selected from a population of 10,000 stocks, and the average return on the 50‐stock sample is calculated. If this process were repeated several times with samples of the same size (50), the sample mean (estimate of the population mean) calculated will be different each time due to the different individual stocks making up each sample. The distribution of these sample means is called the sampling distribution of the mean.

Remember that all the samples drawn from the population must be random, and of the same size. Also note that the sampling distribution is different from the distribution of returns of each of the components of the population (each of the 10,000 stocks) and has different parameters.
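
The repeated-sampling idea can be illustrated with a small simulation. Everything numeric here is an assumption for illustration only: the population of 10,000 "returns" is simulated from a normal distribution with an arbitrary mean and standard deviation, and 1,000 repetitions is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: 10,000 simulated stock returns in percent (not real data).
population = [rng.gauss(8.0, 20.0) for _ in range(10_000)]
mu = statistics.fmean(population)               # population mean

# Repeatedly draw random samples of the same size (50) and record each sample mean.
sample_means = []
for _ in range(1_000):
    sample = rng.sample(population, 50)
    sample_means.append(statistics.fmean(sample))

sampling_errors = [m - mu for m in sample_means]  # sample mean minus population mean

print(statistics.pstdev(population))     # spread of the individual returns
print(statistics.pstdev(sample_means))   # much smaller spread of the sample means
```

The two printed values differ because the sampling distribution of the mean has its own parameters, distinct from those of the population of individual returns.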

7
Q

Stratification

A

Stratification is the process of grouping members of the population into relatively homogeneous subgroups, or strata, before drawing samples. The strata should be mutually exclusive i.e., each member of the population must be assigned to only one stratum. The strata should also be collectively exhaustive i.e., no population element should be excluded from the sampling process. Once this is accomplished, random sampling is applied within each stratum and the number of observations drawn from each stratum is based on the size of the stratum relative to the population. This often improves the representativeness of the sample by reducing sampling error.
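
A rough sketch of proportional allocation across strata follows. The function name `stratified_sample`, the stratum labels, and the stratum sizes are all hypothetical. Rounding can make the total differ slightly from the requested size when proportions are not whole numbers.

```python
import random


def stratified_sample(strata, sample_size, seed=0):
    """Draw a simple random sample within each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n_stratum = round(sample_size * len(members) / total)  # proportional allocation
        sample[name] = rng.sample(members, n_stratum)
    return sample


# Hypothetical population of 1,000 stocks split into three mutually exclusive strata.
strata = {
    "large_cap": [f"L{i}" for i in range(500)],
    "mid_cap": [f"M{i}" for i in range(300)],
    "small_cap": [f"S{i}" for i in range(200)],
}
sample = stratified_sample(strata, 50)
print({name: len(picked) for name, picked in sample.items()})
```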

8
Q

Time-series data

A

Time‐series data consists of observations measured over a period of time, spaced at uniform intervals. The monthly returns on a particular stock over the last 5 years are an example of time‐series data.

9
Q

Cross-sectional data

A

Cross‐sectional data refers to data collected by observing many subjects (such as individuals, firms, or countries/regions) at the same point in time. Analysis of cross‐sectional data usually consists of comparing the differences among the subjects. The returns of individual stocks over the last year are an example of cross‐sectional data. 


10
Q

Data sets can have both time-series and cross-sectional data in them. Examples of such data sets are:

A

Longitudinal data, which is data collected over time about multiple characteristics of the same observational unit. For example, economic indicators such as unemployment levels, inflation, and GDP growth rates (multiple characteristics) of a particular country (observational unit) over a decade (period of time) form longitudinal data.

Panel data, which refers to data collected over time about a single characteristic of multiple observational units. For example, the unemployment rate (single characteristic) of a number of countries (multiple observational units) over time forms panel data.

11
Q

Central limit theorem

A

The central limit theorem allows us to make accurate statements about the population mean and variance using the sample mean and variance regardless of the distribution of the population, as long as the sample size is adequate. As a rule of thumb, an adequate sample size is one that has at least 30 observations (n ≥ 30).


12
Q

The important properties of the central limit theorem are:

A

1) Given a population with any probability distribution, with mean μ and finite variance σ², the sampling distribution of the sample mean x̄, computed from samples of size n, will be approximately normal with mean μ (the population mean) and variance σ²/n (the population variance divided by the sample size) when the sample size is greater than or equal to 30.

2) No matter what the distribution of the population, for samples whose size is greater than or equal to 30, the sample mean will be approximately normally distributed:

x̄ ~ N(μ, σ²/n)

3) The mean of the population (μ) and the mean of the distribution of sample means (x̄) are equal.

4) The variance of the distribution of sample means equals σ²/n, or the population variance divided by the sample size.
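
These properties can be checked numerically. The sketch below assumes a deliberately skewed population (exponential with rate 1, so μ = 1 and σ² = 1); the population choice, seed, and number of repetitions are illustrative assumptions, not part of the reading.

```python
import random
import statistics

rng = random.Random(1)
n = 30                 # sample size at the rule-of-thumb threshold
mu, sigma2 = 1.0, 1.0  # mean and variance of an exponential(1) population

# Sampling distribution of the mean: 20,000 sample means, each from a sample of size n.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(20_000)
]

mean_of_means = statistics.fmean(sample_means)      # should be close to mu
var_of_means = statistics.pvariance(sample_means)   # should be close to sigma2 / n

print(mean_of_means, var_of_means, sigma2 / n)
```

Even though the population is far from normal, the simulated sample means cluster around μ with a variance near σ²/n.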

13
Q

Standard error

A

The standard deviation of the sampling distribution of a statistic is known as the standard error of that statistic. For the sample mean, it is the standard deviation of the distribution of sample means.
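
As a small worked sketch, the standard error of the sample mean can be computed either from a known population standard deviation or from a sample. The function names and the input numbers are hypothetical; `statistics.stdev` uses the n − 1 denominator.

```python
import math
import statistics


def standard_error_known(sigma, n):
    """Standard error of the sample mean when the population standard deviation is known."""
    return sigma / math.sqrt(n)


def standard_error_estimated(sample):
    """Standard error of the sample mean estimated from the sample standard deviation."""
    s = statistics.stdev(sample)          # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(len(sample))


print(standard_error_known(20.0, 100))                       # hypothetical sigma = 20, n = 100
print(standard_error_estimated([2, 4, 4, 4, 5, 5, 7, 9]))    # hypothetical sample of 8 values
```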


14
Q

When the population variance, σ², is known, the standard error of the sample mean is calculated as:

A

σ_x̄ = σ / √n

where σ is the population standard deviation and n is the sample size.

15
Q

Practically speaking, population variances are almost never known, so we estimate the standard error of the sample mean using the sample’s standard deviation:

A

s_x̄ = s / √n

where s is the sample standard deviation and n is the sample size.

16
Q

Point estimate

A

A point estimate involves the use of sample data to calculate a single value (a statistic) that serves as an approximation for an unknown population parameter. For example, the sample mean, x̄, is a point estimate of the population mean, μ. The formula used to calculate a point estimate is known as an estimator.

17
Q

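The sample mean, used as an estimator of the population mean, is given as:

A

x̄ = (x₁ + x₂ + … + xₙ) / n

where n is the number of observations in the sample.
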
18
Q

Confidence interval

A

A confidence interval uses sample data to calculate a range of possible (or probable) values that an unknown population parameter can take, with a given probability of (1 – α). α is called the level of significance, and (1 – α) refers to the degree of confidence that the relevant parameter will lie in the computed interval. For example, a calculated interval between 100 and 150 at the 5% significance level implies that we can be 95% confident that the population parameter lies between 100 and 150.

19
Q

A 100(1 – α)% confidence interval has the following structure:

A

Point estimate ± (reliability factor * standard error)

where:


Point estimate = value of the sample statistic that is used to estimate the population parameter.

Reliability factor = a number based on the assumed distribution of the point estimate and the level of confidence for the interval (1 – α).

Standard error = the standard error of the sample statistic (point estimate).
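
The structure above translates directly into a small helper. The function name `confidence_interval` and the example inputs (a sample mean of 25, a reliability factor of 1.96, and a standard error of 2) are hypothetical values chosen only to show the arithmetic.

```python
def confidence_interval(point_estimate, reliability_factor, standard_error):
    """Return (lower, upper) = point estimate -/+ reliability factor x standard error."""
    half_width = reliability_factor * standard_error
    return point_estimate - half_width, point_estimate + half_width


# Hypothetical inputs: sample mean 25, 95% reliability factor 1.96, standard error 2.
lower, upper = confidence_interval(25.0, 1.96, 2.0)
print(lower, upper)   # approximately 21.08 and 28.92
```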

20
Q

When choosing between a number of possible estimators for a particular population parameter, we make use of the desirable statistical properties of an estimator to make the best possible selection. The desirable properties of an estimator are:

A

Unbiasedness

Efficiency

Consistency

21
Q

Statistical property: Unbiasedness

A

Unbiasedness: An unbiased estimator is one whose expected value is equal to the parameter being estimated. The expected value of the sample mean equals the population mean [E(x̄) = μ]. Therefore, the sample mean, x̄, is an unbiased estimator of the population mean, μ.
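
Unbiasedness of the sample mean can be verified exactly on a tiny population by enumerating every possible sample. The population of ten numbered balls and the sample size of 3 are illustrative assumptions.

```python
from itertools import combinations
from statistics import fmean

population = list(range(1, 11))     # illustrative population: the numbers 1 to 10
mu = fmean(population)              # population mean

# Every possible sample of size 3 is equally likely under simple random sampling,
# so the expected value of the sample mean is the plain average over all samples.
sample_means = [fmean(s) for s in combinations(population, 3)]
expected_sample_mean = fmean(sample_means)

print(mu, expected_sample_mean)
```

Individual sample means range widely, but their average over all possible samples matches the population mean.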


22
Q

Statistical property: Efficiency

A

Efficiency: An efficient unbiased estimator is the one that has the lowest variance among all unbiased estimators of the same parameter. 


23
Q

Statistical property: Consistency

A

Consistency: A consistent estimator is one for which the probability of estimates close to the value of the population parameter increases as sample size increases. We have already seen that the standard error of the sampling distribution falls as sample size increases, which implies a higher probability of estimates close to the population mean.

24
Q

Student’s t-distribution is a bell-shaped probability distribution that has the following properties:

A
  • It is symmetrical. 

  • It is defined by a single parameter, the degrees of freedom (df), where degrees of freedom equal sample size minus one (n‐1).

  • It has a lower peak than the normal curve, but fatter tails.

  • As the degrees of freedom increase, the shape of the t‐distribution approaches the shape of the standard normal curve.

25
Q

A random sample size, n, and degrees of freedom

A

A random sample of size, n, is said to have n‐1 degrees of freedom. Basically, there are n‐1 independent deviations from the mean on which the estimate can be based.

26
Q

What happens to the t-distribution curve as degrees of freedom increase?

A

As the degrees of freedom increase, the t‐distribution curve becomes more peaked and its tails become thinner (bringing it closer to a normal curve). As a result, for a given significance level, the confidence interval for a random variable that follows the t‐distribution becomes narrower as the degrees of freedom increase. The degree of confidence itself does not change; the same degree of confidence is achieved with a narrower interval because more of the probability is concentrated towards the middle (the higher peak) and less lies in the tails (the thinner tails).
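
The narrowing can be seen in the reliability factors themselves. The sketch below does not compute the t‐distribution; it hard-codes a few two-tailed 95% critical values as they appear in a standard t-table (an assumption of this example) and compares them with the standard normal value of 1.960.

```python
# Two-tailed 95% critical values t_{0.025}, keyed by degrees of freedom,
# copied from a standard t-table (rounded to three decimals).
T_CRIT_95 = {5: 2.571, 10: 2.228, 30: 2.042, 120: 1.980}
Z_CRIT_95 = 1.960   # standard normal counterpart


def half_width(reliability_factor, standard_error):
    """Half-width of a confidence interval for a given reliability factor."""
    return reliability_factor * standard_error


# Hold the standard error fixed at 1.0 to isolate the effect of degrees of freedom.
half_widths = {df: half_width(t, 1.0) for df, t in T_CRIT_95.items()}
for df, width in half_widths.items():
    print(f"df={df:>3}: half-width {width:.3f} (z-based: {Z_CRIT_95:.3f})")
```

With the standard error held constant, the interval shrinks toward the z-based width as the degrees of freedom grow.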


27
Q

The t-distribution is used in the following scenarios:

A
  • It is used to construct confidence intervals for a normally (or approximately normally) distributed population whose variance is unknown when the sample size is small (n < 30). 

  • It may also be used for a nonnormally distributed population whose variance is unknown if the sample size is large (n ≥ 30). In this case, the central limit theorem is used to assume that the sampling distribution of the sample mean is approximately normal. 

28
Q

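The confidence interval for the population mean when the population follows a normal distribution and its variance is known is calculated as follows:
(NOTE: works if you have population standard deviation)

A

x̄ ± z_{α/2} × (σ / √n)

where x̄ is the sample mean, z_{α/2} is the reliability factor from the standard normal distribution, σ is the population standard deviation, and n is the sample size.
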
29
Q

The following reliability factors are used frequently when constructing confidence intervals based on the standard normal distribution:

A

For a 90% confidence interval we use z_{0.05} = 1.65 (more precisely, 1.645)

For a 95% confidence interval we use z_{0.025} = 1.96

For a 99% confidence interval we use z_{0.005} = 2.58 (more precisely, 2.576)
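
These reliability factors can be reproduced with Python's standard library. The helper name `z_reliability_factor` is an assumption; `statistics.NormalDist.inv_cdf` (Python 3.8+) returns the standard normal quantile.

```python
from statistics import NormalDist


def z_reliability_factor(confidence):
    """Two-tailed standard normal reliability factor z_{alpha/2} for a degree of confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%}: {z_reliability_factor(confidence):.3f}")
```

The unrounded values are roughly 1.645, 1.960, and 2.576, which is why tables often show 1.65, 1.96, and 2.58.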


30
Q

This confidence interval can be interpreted in two ways:

A
  • Probabilistic interpretation: After repeatedly taking samples of 36 SAT candidates’ scores on the mock exam, and then constructing confidence intervals based on each sample’s mean, 99% of the confidence intervals will include the population mean over the long run.

  • Practical interpretation: We can be 99% confident that the average population score for the actual SAT exam is between 1663 and 1836.

31
Q

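When the variance of a normally distributed population is not known, we use the t-distribution to construct confidence intervals:

A

x̄ ± t_{α/2} × (s / √n)

where t_{α/2} is the reliability factor from the t‐distribution with n‐1 degrees of freedom and s is the sample standard deviation.
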
32
Q

t-distribution vs. z-distribution

A

Recall that the critical t‐values or the reliability factor for constructing the confidence interval depends on the level of confidence desired, and on the sample size. Also recall that the t‐distribution has fatter or thicker tails relative to the normal distribution. Since more observations essentially lie in the tails of the distribution, a confidence interval for a given significance level will be broader for the t‐distribution compared to the z‐distribution.

33
Q

When the population is normally distributed, when do we use z-statistic vs t-statistic?

A

Use the z‐statistic when the population variance is known. 


Use the t‐statistic when the population variance is not known.



34
Q

When the distribution of the population is nonnormal, the construction of an appropriate confidence interval depends on the size of the sample. When do we use the z-statistic vs. the t-statistic?

A
  • If the population variance is known and the sample size is large (n ≥ 30) we use the z‐statistic. This is because the central limit theorem tells us that the distribution of the sample mean is approximately normal when sample size is large. 

  • If the population variance is not known and sample size is large, we can use the z‐statistic or the t‐statistic. However, in this scenario the use of the t‐statistic is encouraged because it results in a more conservative measure. 


This implies that we cannot construct confidence intervals for nonnormal distributions if sample size is less than 30. 


35
Q

When do you use z-distribution to construct confidence intervals?

A

When the variance of a normally distributed population is not known, and the sample size is large, we can use the z‐distribution to construct confidence intervals:

x̄ ± z_{α/2} × (s / √n)

where s is the sample standard deviation.


36
Q

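Criteria for Selecting Appropriate Test Statistic

A

When sampling from a:                          Small sample (n < 30)   Large sample (n ≥ 30)
Normal distribution with known variance        z                       z
Normal distribution with unknown variance      t                       t (or z)
Nonnormal distribution with known variance     Not available           z
Nonnormal distribution with unknown variance   Not available           t (or z)
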
37
Q

From our discussion so far, we have understood that there are various factors that affect the width of a confidence interval: Name two.

A

The choice of test statistic: A t‐statistic gives a wider confidence interval than a z‐statistic.

The degree of confidence: A higher desired level of confidence increases the width of the confidence interval.


38
Q

From our formula for the confidence interval, it is easy to see that the width of the interval is also a function of the standard error. 
Explain.

A

The larger the standard error, the wider the confidence interval. The standard error, in turn, is a function of sample size. More specifically, a larger sample size results in a smaller standard error and reduces the width of the confidence interval. Therefore, large sample sizes are desirable as they increase the precision with which we can estimate a population parameter. However, in practice two considerations may work against increasing the sample size:

  • Increasing the size of the sample may result in drawing observations from a different population.

  • Increasing the sample size may involve additional expenses that outweigh the benefit of increased accuracy of estimates.

Other than the risk of sampling from more than one population, there are a variety of challenges to valid sampling. If the sample is biased in any way, estimates and conclusions drawn from sample data will be erroneous.
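
The link between sample size and interval width described in this answer can be sketched numerically. The function name `interval_width` and the inputs (a 95% reliability factor of 1.96 and a standard deviation of 20) are hypothetical.

```python
import math


def interval_width(reliability_factor, sigma, n):
    """Full width of a confidence interval: 2 x reliability factor x sigma / sqrt(n)."""
    return 2 * reliability_factor * sigma / math.sqrt(n)


# Hypothetical inputs: reliability factor 1.96, standard deviation 20.
widths = {n: interval_width(1.96, 20.0, n) for n in (25, 100, 400)}
for n, width in widths.items():
    print(n, round(width, 2))
```

Because the standard error falls with the square root of n, quadrupling the sample size only halves the width, which is one reason the extra cost of a larger sample may not be worthwhile.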


39
Q

Name the types of biases

A

Data mining bias

Sample selection bias

Survivorship bias

Look-ahead bias

Time-period bias

40
Q

Data mining

A

Data mining is the practice of developing a model by extensively searching through a data set for statistically significant relationships until a pattern “that works” is discovered. In the process of data mining, large numbers of hypotheses about a single data set are tested in a very short time by searching for combinations of variables that might show a correlation. 


Given that enough hypotheses are tested, it is virtually certain that some of them will appear to be highly statistically significant, even on a data set with no real correlations at all. Researchers who use data mining techniques can be easily misled by these apparently significant results even though they are merely coincidences. 
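
The point about coincidental significance can be demonstrated with pure noise. In this sketch every series is random, so no real relationship exists; the number of observations, the number of candidate variables, the seed, and the cutoff of 2.0 (roughly the 5% two-tailed critical value for 58 degrees of freedom) are all assumptions of the example.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious_hits(n_obs=60, n_candidates=200, cutoff=2.0, seed=7):
    """Count noise predictors whose correlation with noise returns looks significant."""
    rng = random.Random(seed)
    returns = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_candidates):
        predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson_r(predictor, returns)
        t_stat = r * math.sqrt((n_obs - 2) / (1 - r * r))   # t-test of zero correlation
        if abs(t_stat) > cutoff:
            hits += 1
    return hits


spurious_hits = count_spurious_hits()
print(spurious_hits, "of 200 random variables look significant")
```

At a 5% significance level, roughly 1 in 20 unrelated variables is expected to pass the test by chance alone.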


41
Q

Data-mining bias most commonly occurs when:

A

  • Researchers have not formed a hypothesis in advance, and are therefore open to any hypothesis suggested by the data.

  • Researchers narrow the data used in order to reduce the probability of the sample refuting a specific hypothesis.


42
Q

Warning signs that data mining bias might exist are:

A
  • Too much digging warning sign, which involves testing numerous variables until one that appears to be significant is discovered.

  • No story/no future warning sign, which is indicated by a lack of an economic theory that can explain empirical results.

43
Q

The best way to avoid the data-mining bias is to:

A

The best way to avoid the data‐mining bias is to test the “apparently statistically significant relationships” on “out‐of‐sample” data to check whether they continue to hold. 


44
Q

Sample-selection bias

A

Sample-selection bias results from the exclusion of certain assets (such as bonds, stocks, or portfolios) from a study due to the unavailability of data. 


Sample selection bias is even more severe in studies of hedge fund returns. This is because hedge funds are not required to publicly disclose their performance data. Only funds that performed well choose to disclose their performance, which leads to an overstatement of hedge fund returns. 


45
Q

Survivorship bias

A

Some databases use historical information and may suffer from a type of sample selection bias known as survivorship bias. This bias is present in databases that only list companies or funds currently in existence, which means that those that have failed are not included in the database. As a result, the results obtained from the study may not accurately reflect the true picture. 
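
A toy illustration of the effect is shown below. The five funds and their returns are invented for this example; the `alive` flag marks funds that would still be listed in a database of currently existing funds.

```python
from statistics import fmean

# Hypothetical annual returns; funds C and E have since failed.
funds = [
    {"name": "A", "ret": 0.12, "alive": True},
    {"name": "B", "ret": 0.08, "alive": True},
    {"name": "C", "ret": -0.30, "alive": False},
    {"name": "D", "ret": 0.05, "alive": True},
    {"name": "E", "ret": -0.45, "alive": False},
]

all_mean = fmean([f["ret"] for f in funds])                      # full history
survivor_mean = fmean([f["ret"] for f in funds if f["alive"]])   # what the database shows

print(round(all_mean, 4), round(survivor_mean, 4))
```

Averaging only the survivors turns a loss across all funds into an apparent gain, which is the overstatement survivorship bias produces.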


46
Q

Look-Ahead Bias

A

Look‐ahead bias arises when a study uses information that was not available on the test date. For example, consider a test of a trading rule based on the price‐to‐book value ratio of stocks. Stock prices are usually easily available, but year-end book values are not available until the first quarter of the next year (when financial statements are released). A test that pairs year-end prices with year-end book values therefore assumes information that investors did not yet have on the test date.


47
Q

Time-Period Bias

A

Time‐period bias arises if a test is based on a certain time period, which may make the results obtained from the study time‐period specific. If the selected time period is relatively short, results will reflect relationships that held only during that particular period. On the other hand, if the time period is too long, the study might fail to uncover any structural changes that occurred during the period. 


48
Q

Reading 11 – Sampling and estimation – Learning outcomes

A

The candidate should be able to:

  • define simple random sampling and a sampling distribution;
  • explain sampling error;
  • distinguish between simple random and stratified random sampling;
  • distinguish between time-series and cross-sectional data;
  • explain the central limit theorem and its importance;
  • calculate and interpret the standard error of the sample mean;
  • identify and describe desirable properties of an estimator;
  • distinguish between a point estimate and a confidence interval estimate of a population parameter;
  • describe properties of Student’s t-distribution and calculate and interpret its degrees of freedom;
  • calculate and interpret a confidence interval for a population mean, given a normal distribution with 1) a known population variance, 2) an unknown population variance, or 3) an unknown variance and a large sample size;

  • describe the issues regarding selection of the appropriate sample size, data-mining bias, sample selection bias, survivorship bias, look-ahead bias, and time-period bias.

49
Q

Sampling plan

A

The set of rules used to select a sample.

50
Q

Simple random sample

A

A subset of a larger population created in such a way that each element of the population has an equal probability of being selected to the subset.

51
Q

Systematic Sampling

A

A procedure of selecting every kth member until reaching a sample of the desired size. The sample that results from this procedure should be approximately random.

52
Q

Sampling error

A

The difference between the observed value of a statistic and the quantity it is intended to estimate.

53
Q

Sampling distribution

A

The distribution of all distinct possible values that a statistic can assume when computed from samples of the same size randomly drawn from the same population.

54
Q

Stratified Random Sampling

A

In stratified random sampling, the population is divided into subpopulations (strata) based on one or more classification criteria. Simple random samples are then drawn from each stratum in sizes proportional to the relative size of each stratum in the population. These samples are then pooled to form a stratified random sample.

55
Q

Indexing

A

An investment strategy in which an investor constructs a portfolio to mirror the performance of a specified index.

56
Q

Pure bond indexing

A

Bond indexing is one area in which stratified sampling is frequently applied. Indexing is an investment strategy in which an investor constructs a portfolio to mirror the performance of a specified index. In pure bond indexing, also called the full-replication approach, the investor attempts to fully replicate an index by owning all the bonds in the index in proportion to their market value weights. Many bond indexes consist of thousands of issues, however, so pure bond indexing is difficult to implement. In addition, transaction costs would be high because many bonds do not have liquid markets.

57
Q

How does stratified sampling help with bond indexing?

A

Because the major risk factors of fixed-income portfolios are well known and quantifiable, stratified sampling offers a more effective approach. In this approach, we divide the population of index bonds into groups of similar duration (interest rate sensitivity), cash flow distribution, sector, credit quality, and call exposure. We refer to each group as a stratum or cell (a term frequently used in this context). Then, we choose a sample from each stratum proportional to the relative market weighting of the stratum in the index to be replicated.

58
Q

Monetary policy

A

Actions taken by a nation’s central bank to affect aggregate output and prices through changes in bank reserves, reserve requirements, or its target interest rate.

59
Q

Sharpe ratio

A

The average return in excess of the risk-free rate divided by the standard deviation of return; a measure of the average excess return earned per unit of standard deviation of return.

60
Q

The reader may also encounter two types of datasets that have both time-series and cross-sectional aspects:

A

Panel data consist of observations through time on a single characteristic of multiple observational units. For example, the annual inflation rate of the Eurozone countries over a five-year period would represent panel data.

Longitudinal data consist of observations on characteristic(s) of the same observational unit through time. Observations on a set of financial ratios for a single company over a 10-year period would be an example of longitudinal data. Both panel and longitudinal data may be represented by arrays (matrixes) in which successive rows represent the observations for successive time periods.

61
Q

The Central Limit Theorem

A

Given a population described by any probability distribution having mean μ and finite variance σ², the sampling distribution of the sample mean X̄ computed from samples of size n from this population will be approximately normal with mean μ (the population mean) and variance σ²/n (the population variance divided by n) when the sample size n is large.

62
Q

The estimate of s is given by the square root of the sample variance, s², calculated as follows:

A

s² = Σ (Xᵢ − X̄)² / (n − 1), with the sum taken over i = 1, …, n

63
Q

Lower bound

A

The lowest possible value of an option.

64
Q

Random variables mean and variance equations

A

If a is the lower limit of a uniform random variable and b is the upper limit, then the random variable’s mean is given by (a + b)/2 and its variance is given by (b − a)²/12. The reading on common probability distributions fully describes continuous uniform random variables.
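
As a quick worked sketch of these two formulas, with hypothetical limits a = 0 and b = 12 and function names chosen for this example:

```python
def uniform_mean(a, b):
    """Mean of a continuous uniform random variable on [a, b]."""
    return (a + b) / 2


def uniform_variance(a, b):
    """Variance of a continuous uniform random variable on [a, b]."""
    return (b - a) ** 2 / 12


# Hypothetical limits: a = 0, b = 12.
print(uniform_mean(0, 12), uniform_variance(0, 12))
```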

65
Q

To summarize, according to the central limit theorem, when we sample from any distribution, the distribution of the sample mean will have the following properties as long as our sample size is large:

A

  • The distribution of the sample mean X̄ will be approximately normal.

  • The mean of the distribution of X̄ will be equal to the mean of the population from which the samples are drawn.

  • The variance of the distribution of X̄ will be equal to the variance of the population divided by the sample size.
66
Q

Estimator

A

An estimation formula; the formulas used to compute the sample mean and other sample statistics are examples of estimators.

67
Q

Estimate

A

The particular value calculated from sample observations using an estimator.

68
Q

Point Estimate

A

A single numerical estimate of an unknown quantity, such as a population parameter.

69
Q

Definition of Unbiasedness

A

An unbiased estimator is one whose expected value (the mean of its sampling distribution) equals the parameter it is intended to estimate.

70
Q

Definition of Efficiency

A

An unbiased estimator is efficient if no other unbiased estimator of the same parameter has a sampling distribution with smaller variance.

71
Q

Definition of Consistency

A

A consistent estimator is one for which the probability of estimates close to the value of the population parameter increases as sample size increases.

72
Q

Confidence Interval

A

A confidence interval is a range for which one can assert with a given probability 1 − α, called the degree of confidence, that it will contain the parameter it is intended to estimate. This interval is often referred to as the 100(1 − α)% confidence interval for the parameter.

73
Q

Degree of confidence

A

The probability that a confidence interval includes the unknown population parameter.

74
Q

It is also possible to define two types of one-sided confidence intervals for a population parameter. Explain.

A

A lower one-sided confidence interval establishes a lower limit only. Associated with such an interval is an assertion that with a specified degree of confidence the population parameter equals or exceeds the lower limit. An upper one-sided confidence interval establishes an upper limit only; the related assertion is that the population parameter is less than or equal to that upper limit, with a specified degree of confidence. Investment researchers rarely present one-sided confidence intervals, however.

75
Q

Construction of Confidence Intervals

A

A 100(1 − α)% confidence interval for a parameter has the following structure. 


Point estimate ± Reliability factor × Standard error 

where

Point estimate = a point estimate of the parameter (a value of a sample statistic)

Reliability factor = a number based on the assumed distribution of the point estimate and the degree of confidence (1 − α) for the confidence interval

Standard error = the standard error of the sample statistic providing the point estimate

76
Q

Degrees of freedom (df)

A

The number of independent observations used.

77
Q

Basis of Computing Reliability Factors: Normal distribution with known variance

A

Statistic for small sample size: z

Statistic for large sample size: z

78
Q

Basis of Computing Reliability Factors: Normal distribution with unknown variance

A

Statistic for small sample size: t

Statistic for large sample size: t (or z)

79
Q

Basis of Computing Reliability Factors: Nonnormal distribution with known variance

A

Statistic for small sample size: Not available

Statistic for large sample size: z

80
Q

Basis of Computing Reliability Factors: Nonnormal distribution with unknown variance

A

Statistic for small sample size: Not available

Statistic for large sample size: t (or z)
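
The four cases above can be collapsed into one small decision function. The name `reliability_statistic` and the default threshold argument are assumptions for this sketch; it returns the preferred statistic and uses `None` where the cards say "Not available".

```python
def reliability_statistic(population_normal, variance_known, n, large_sample=30):
    """Return 'z', 't', or None (not available) for building a confidence interval."""
    large = n >= large_sample
    if not population_normal and not large:
        return None              # nonnormal population with a small sample
    if variance_known:
        return "z"
    return "t"                   # with a large sample, z is an acceptable alternative


print(reliability_statistic(True, True, 10))     # normal, known variance, small sample
print(reliability_statistic(True, False, 10))    # normal, unknown variance, small sample
print(reliability_statistic(False, True, 10))    # nonnormal, known variance, small sample
print(reliability_statistic(False, False, 50))   # nonnormal, unknown variance, large sample
```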

81
Q

Data mining

A

The practice of determining a model by extensive searching through a dataset for statistically significant patterns. Also called data snooping.

82
Q

Out-of-sample test

A

A test of a strategy or model using a sample outside the time period on which the strategy or model was developed.

83
Q

Intergenerational data mining

A

A form of data mining that applies information developed by previous researchers using a dataset to guide current research using the same or a related dataset.

84
Q

McQueen and Thorley presented two signs that can warn analysts about the potential existence of data mining:

A

Too much digging/too little confidence

No story/no future

85
Q

McQueen and Thorley presented two signs that can warn analysts about the potential existence of data mining: Too much digging/ too little confidence.

A

Too much digging/too little confidence. The testing of many variables by the researcher is the “too much digging” warning sign of a data-mining problem. Unfortunately, many researchers do not disclose the number of variables examined in developing a model. Although the number of variables examined may not be reported, we should look closely for verbal hints that the researcher searched over many variables. The use of terms such as “we noticed (or noted) that” or “someone noticed (or noted) that,” with respect to a pattern in a dataset, should raise suspicions that the researchers were trying out variables based on their own or others’ observations of the data.

86
Q

McQueen and Thorley presented two signs that can warn analysts about the potential existence of data mining: No story/no future.

A

No story/no future. The absence of an explicit economic rationale for a variable or trading strategy is the “no story” warning sign of a data-mining problem. Without a plausible economic rationale or story for why a variable should work, the variable is unlikely to have predictive power. In a demonstration exercise using an extensive search of variables in an international financial database, Leinweber (1997) found that butter production in a particular country remote from the United States explained 75 percent of the variation in US stock returns as represented by the S&P 500. Such a pattern, with no plausible economic rationale, is highly likely to be a random pattern particular to a specific time period. What if we do have a plausible economic explanation for a significant variable? McQueen and Thorley caution that a plausible economic rationale is a necessary but not a sufficient condition for a trading strategy to have value. As we mentioned earlier, if the strategy is publicized, market prices may adjust to reflect the new information as traders seek to exploit it; as a result, the strategy may no longer work.

87
Q

Sample selection bias

A

Bias introduced by systematically excluding some members of the population according to a particular attribute—for example, the bias introduced when data availability leads to certain observations being excluded from the analysis.

88
Q

Survivorship bias

A

The bias resulting from a test design that fails to account for companies that have gone bankrupt, merged, or are otherwise no longer reported in a database.
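
A small numeric illustration of the effect, using made-up returns for five hypothetical funds (none of these figures come from real data): averaging only the funds still in the database overstates the average across all funds that existed.

```python
import statistics

# Made-up annual returns; "survived" marks funds still reported in the database.
funds = [
    {"name": "A", "annual_return": 0.08, "survived": True},
    {"name": "B", "annual_return": 0.05, "survived": True},
    {"name": "C", "annual_return": -0.12, "survived": False},
    {"name": "D", "annual_return": -0.20, "survived": False},
    {"name": "E", "annual_return": 0.10, "survived": True},
]

all_returns = [f["annual_return"] for f in funds]
survivor_returns = [f["annual_return"] for f in funds if f["survived"]]

all_mean = statistics.fmean(all_returns)
survivors_mean = statistics.fmean(survivor_returns)

print(f"Mean return, all funds:      {all_mean:.4f}")
print(f"Mean return, survivors only: {survivors_mean:.4f}")
```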

89
Q

Look-ahead bias

A

A bias caused by using information that was unavailable on the test date.

90
Q

Time-period bias

A

The possibility that when we use a time-series sample, our statistical conclusion may be sensitive to the starting and ending dates of the sample.

91
Q

How do you draw a valid inference from a sample?

A

To draw valid inferences from a sample, the sample should be random.

92
Q

How are observations selected in simple random sampling and in stratified random sampling?

A

In simple random sampling, each observation has an equal chance of being selected. In stratified random sampling, the population is divided into subpopulations, called strata or cells, based on one or more classification criteria; simple random samples are then drawn from each stratum.
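
The two-step procedure can be sketched as follows. The function `stratified_sample`, the credit-rating strata, and the choice to draw the same fraction from every stratum (proportional allocation) are assumptions made for illustration; the card itself only requires a simple random sample from each stratum.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw a simple random sample of the given fraction from each stratum."""
    rng = random.Random(seed)
    # Step 1: divide the population into strata by the classification criterion.
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)
    # Step 2: draw a simple random sample (without replacement) from each stratum.
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical bond population: 60 bonds rated "AAA" and 40 rated "BB".
bonds = [("AAA", i) for i in range(60)] + [("BB", i) for i in range(40)]
sample = stratified_sample(bonds, stratum_of=lambda bond: bond[0], fraction=0.1)
print(sample)
```

Every stratum contributes to the sample, which is what guarantees that subdivisions of interest are represented.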

93
Q

Why is stratified random sampling better than simple random sampling?

A

Stratified random sampling ensures that population subdivisions of interest are represented in the sample. Stratified random sampling also produces more-precise parameter estimates than simple random sampling.

94
Q

Time-series data vs. Cross-sectional data

A

Time-series data are a collection of observations at equally spaced intervals of time. Cross-sectional data are observations that represent individuals, groups, geographical regions, or companies at a single point in time.

95
Q

Central limit theorem

A

The central limit theorem states that for large sample sizes, for any underlying distribution for a random variable, the sampling distribution of the sample mean for that variable will be approximately normal, with mean equal to the population mean for that random variable and variance equal to the population variance of the variable divided by sample size.
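
A quick simulation can make the theorem concrete. This is a sketch under assumptions chosen for illustration: the underlying population is exponential with rate 1 (strongly skewed, with mean 1 and variance 1), the sample size is 50, and the helper name `sample_means` is made up.

```python
import random
import statistics


def sample_means(n, trials, seed=0):
    """Return the means of `trials` independent samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


n = 50
means = sample_means(n=n, trials=5000)
mean_of_means = statistics.fmean(means)
variance_of_means = statistics.variance(means)

# Theory: mean equals the population mean (1), variance equals 1 / n.
print(f"Mean of sample means:     {mean_of_means:.4f} (theory: 1.0000)")
print(f"Variance of sample means: {variance_of_means:.4f} (theory: {1 / n:.4f})")
```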

96
Q

Based on the central limit theorem, when the sample size is large, what can we do?

A

Based on the central limit theorem, when the sample size is large, we can compute confidence intervals for the population mean based on the normal distribution regardless of the distribution of the underlying population. In general, a sample size of 30 or larger can be considered large.

97
Q

Estimator

A

An estimator is a formula for estimating a parameter. An estimate is a particular value that we calculate from a sample by using an estimator.

98
Q

Because an estimator or statistic is a random variable, how can it be described?

A

Because an estimator or statistic is a random variable, it is described by some probability distribution. We refer to the distribution of an estimator as its sampling distribution. The standard deviation of the sampling distribution of the sample mean is called the standard error of the sample mean.

99
Q

What are the desirable properties of an estimator?

A

The desirable properties of an estimator are unbiasedness (the expected value of the estimator equals the population parameter), efficiency (no other unbiased estimator of the same parameter has a smaller variance), and consistency (the probability of accurate estimates increases as sample size increases).

100
Q

What are the two types of estimates of a parameter?

A

The two types of estimates of a parameter are point estimates and interval estimates. A point estimate is a single number that we use to estimate a parameter. An interval estimate is a range of values that brackets the population parameter with some probability.

101
Q

Confidence interval

A

A confidence interval is an interval for which we can assert with a given probability 1 − α, called the degree of confidence, that it will contain the parameter it is intended to estimate. This measure is often referred to as the 100(1 − α)% confidence interval for the parameter.

A 100(1 − α)% confidence interval for a parameter has the following structure: Point estimate ± Reliability factor × Standard error, where the reliability factor is a number based on the assumed distribution of the point estimate and the degree of confidence (1 − α) for the confidence interval and where standard error is the standard error of the sample statistic providing the point estimate.

102
Q

A 100(1 − α)% confidence interval for population mean μ when sampling from a normal distribution with known variance σ² is given by:

A

A 100(1 − α)% confidence interval for population mean μ when sampling from a normal distribution with known variance σ² is given by x̅ ± zα/2 × (σ/√n), where zα/2 is the point of the standard normal distribution such that α/2 remains in the right tail.
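
A minimal sketch of this interval, using the standard library's `statistics.NormalDist` to obtain the reliability factor. The function name `z_confidence_interval` and the input figures (sample mean 5.0, σ = 2.0, n = 100) are made up for illustration.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean with known population standard deviation."""
    alpha = 1 - confidence
    # Reliability factor: the point leaving alpha/2 in the right tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the sample mean: sigma / sqrt(n).
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


low, high = z_confidence_interval(sample_mean=5.0, sigma=2.0, n=100)
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
```

The interval follows the general structure Point estimate ± Reliability factor × Standard error.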

103
Q

Student’s t-distribution

A

Student’s t-distribution is a family of symmetrical distributions defined by a single parameter, degrees of freedom.

104
Q

Degrees of freedom

A

A random sample of size n is said to have n − 1 degrees of freedom for estimating the population variance, in the sense that there are only n − 1 independent deviations from the mean on which to base the estimate.
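
The "only n − 1 independent deviations" idea can be checked directly: deviations from the sample mean always sum to zero, so the last one is determined by the others. The four observations below are made up for illustration.

```python
import statistics

sample = [4.0, 7.0, 1.0, 8.0]  # made-up observations
n = len(sample)
mean = statistics.fmean(sample)

# Deviations from the sample mean sum to zero by construction.
deviations = [x - mean for x in sample]
total_deviation = sum(deviations)

# The last deviation is implied by the first n - 1 of them.
implied_last = -sum(deviations[:-1])

# Sample variance divides the sum of squared deviations by n - 1.
sample_variance = sum(d * d for d in deviations) / (n - 1)

print(deviations, total_deviation, implied_last, sample_variance)
```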

105
Q

The degrees of freedom number for use with the t-distribution

A

The degrees of freedom number for use with the t-distribution is also n − 1.

106
Q

t-distribution description

A

The t-distribution has fatter tails than the standard normal distribution but converges to the standard normal distribution as degrees of freedom go to infinity.

107
Q

Compare the standard normal distribution and Student’s t-distribution

A

Basically, only one standard normal distribution exists, but many t-distributions exist—one for every different number of degrees of freedom. The normal distribution and the t-distribution for a large number of degrees of freedom are practically the same. The lower the degrees of freedom, the flatter the t-distribution becomes. The t-distribution has less mass (lower probabilities) in the center of the distribution and more mass (higher probabilities) out in both tails. Therefore, the confidence intervals based on t-values will be wider than those based on the normal distribution. Stated differently, the probability of being within a given number of standard deviations (such as within ±1 standard deviation or ±2 standard deviations) is lower for the t-distribution than for the normal distribution.

108
Q

A 100(1 − α)% confidence interval for the population mean μ when sampling from a normal distribution with unknown variance (a t-distribution confidence interval) is given by:

A

A 100(1 − α)% confidence interval for the population mean μ when sampling from a normal distribution with unknown variance (a t-distribution confidence interval) is given by x̅ ± tα/2 × (s/√n), where tα/2 is the point of the t-distribution with n − 1 degrees of freedom such that α/2 remains in the right tail and s is the sample standard deviation. This confidence interval can also be used, because of the central limit theorem, when dealing with a large sample from a population with unknown variance that may not be normal.
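
A minimal sketch of the t-distribution interval, with the standard error computed as s/√n. Python's standard library has no t-distribution, so the reliability factor is passed in as a table value: 2.064 is the usual t-table entry for 24 degrees of freedom with 2.5% in the right tail. The function name and the summary figures (mean 10, s = 5, n = 25) are made up for illustration.

```python
import math


def t_confidence_interval(sample_mean, s, n, t_critical):
    """Confidence interval for the mean with unknown population variance.

    t_critical must be looked up for n - 1 degrees of freedom.
    """
    standard_error = s / math.sqrt(n)
    half_width = t_critical * standard_error
    return sample_mean - half_width, sample_mean + half_width


# n = 25 gives 24 degrees of freedom; 2.064 is the 95% two-sided table value.
low, high = t_confidence_interval(sample_mean=10.0, s=5.0, n=25, t_critical=2.064)
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
```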

109
Q

We may use the confidence interval _____ as an alternative to the t-distribution confidence interval for the population mean when using a large sample from a population with unknown variance.

A

We may use the confidence interval x̅ ± zα/2 × (s/√n) as an alternative to the t-distribution confidence interval for the population mean when using a large sample from a population with unknown variance. The confidence interval based on the z-statistic is less conservative (narrower) than the corresponding confidence interval based on a t-distribution.
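
The difference in width can be seen by comparing the two half-widths for the same sample. This is a sketch with made-up inputs (s = 5, n = 31); 2.042 is the usual t-table value for 30 degrees of freedom with 2.5% in the right tail, and the z value comes from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

s, n = 5.0, 31                      # made-up sample standard deviation and size
standard_error = s / math.sqrt(n)

z_factor = NormalDist().inv_cdf(0.975)  # about 1.96
t_factor = 2.042                        # table value, 30 degrees of freedom

z_half_width = z_factor * standard_error
t_half_width = t_factor * standard_error

print(f"z half-width: {z_half_width:.4f}")
print(f"t half-width: {t_half_width:.4f}")
```

The gap shrinks as n grows, because the t reliability factor converges to the z reliability factor.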

110
Q

Three issues in the selection of sample size

A

Three issues in the selection of sample size are the need for precision, the risk of sampling from more than one population, and the expenses of different sample sizes.

111
Q

Sample data in investments can have a variety of problems. Name and describe the biases.

A

Sample data in investments can have a variety of problems. Survivorship bias occurs if companies are excluded from the analysis because they have gone out of business or because of reasons related to poor performance. Data-mining bias comes from finding models by repeatedly searching through databases for patterns. Look-ahead bias exists if the model uses data not available to market participants at the time the market participants act in the model. Finally, time-period bias is present if the time period used makes the results time-period specific or if the time period used includes a point of structural change.
