Reading 11 - Sampling and Estimation Flashcards
Sampling
In a simple random sample, each member of the population has the same probability or likelihood of being included in the sample. For example, assume that our population consists of 10 balls labeled with numbers 1 to 10. Drawing a random sample of 3 balls from this population of 10 balls would require that each ball has an equal chance of being chosen in the sample, and each combination of balls has an identical chance of being the chosen sample as any other combination.
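A minimal sketch of this idea in Python, using the standard library's random module (the ball labels and sample size follow the example above):

```python
import random

# Population: 10 balls labeled 1 to 10 (from the example above)
population = list(range(1, 11))

# random.sample() draws without replacement, giving every ball, and every
# 3-ball combination, the same chance of being selected.
simple_random_sample = random.sample(population, k=3)
print(simple_random_sample)  # e.g. [7, 2, 10]
```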
Systematic sampling
In practice, random samples are generated using random number tables or computer random‐number generators. Systematic sampling is often used to generate approximately random samples. In systematic sampling, every kth member in the population list is selected until the desired sample size is reached.
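A rough illustration of systematic sampling, assuming a hypothetical population list and a sampling interval k set to the population size divided by the desired sample size:

```python
import random

# Hypothetical population list of 1,000 members and a desired sample of 100.
population = list(range(1, 1001))
sample_size = 100
k = len(population) // sample_size          # sampling interval: every k-th member

# Start at a random point within the first interval, then take every k-th member.
start = random.randrange(k)
systematic_sample = population[start::k][:sample_size]
print(len(systematic_sample), systematic_sample[:5])
```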
Sampling error
Sampling error is the error caused by observing a sample instead of the entire population to draw conclusions relating to population parameters. It equals the difference between a sample statistic and the corresponding population parameter.
Sampling error of the mean
Sampling error of the mean = Sample mean − Population mean = x̄ − μ
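For example, if the mean return of a 50‐stock sample is 8.2% while the true population mean return is 8.0%, the sampling error of the mean is 8.2% − 8.0% = 0.2% (hypothetical figures used purely for illustration).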
Sampling distribution & sampling distribution of the mean
A sampling distribution is the probability distribution of a given sample statistic under repeated sampling of the population. Suppose that a random sample of 50 stocks is selected from a population of 10,000 stocks, and the average return on the 50‐stock sample is calculated. If this process were repeated several times with samples of the same size (50), the sample mean (estimate of the population mean) calculated will be different each time due to the different individual stocks making up each sample. The distribution of these sample means is called the sampling distribution of the mean.
Remember that all the samples drawn from the population must be random, and of the same size. Also note that the sampling distribution is different from the distribution of returns of each of the components of the population (each of the 10,000 stocks) and has different parameters.
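A small simulation sketch of this idea, using NumPy and assuming a hypothetical population of 10,000 stock returns (the population parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 10,000 stock returns (parameters are illustrative).
population = rng.normal(loc=0.08, scale=0.20, size=10_000)

# Repeatedly draw random samples of 50 stocks and record each sample mean.
sample_means = [rng.choice(population, size=50, replace=False).mean()
                for _ in range(1_000)]

# The distribution of these 1,000 sample means is the sampling distribution of the mean.
print(np.mean(sample_means), np.std(sample_means))
```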
Stratification
Stratification is the process of grouping members of the population into relatively homogeneous subgroups, or strata, before drawing samples. The strata should be mutually exclusive i.e., each member of the population must be assigned to only one stratum. The strata should also be collectively exhaustive i.e., no population element should be excluded from the sampling process. Once this is accomplished, random sampling is applied within each stratum and the number of observations drawn from each stratum is based on the size of the stratum relative to the population. This often improves the representativeness of the sample by reducing sampling error.
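A rough sketch of proportional stratified sampling, assuming a hypothetical population grouped into sectors (strata) and using only the Python standard library:

```python
import random

# Hypothetical population grouped into mutually exclusive, collectively
# exhaustive strata (here, stocks grouped by sector).
strata = {
    "Financials": [f"FIN{i}" for i in range(400)],
    "Technology": [f"TEC{i}" for i in range(350)],
    "Utilities":  [f"UTL{i}" for i in range(250)],
}
population_size = sum(len(members) for members in strata.values())
total_sample_size = 100

stratified_sample = []
for name, members in strata.items():
    # The number drawn from each stratum is proportional to the stratum's
    # share of the population.
    n_stratum = round(total_sample_size * len(members) / population_size)
    stratified_sample.extend(random.sample(members, n_stratum))

print(len(stratified_sample))  # approximately 100 (rounding may shift it slightly)
```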
Time-series data
Time‐series data consists of observations measured over a period of time, spaced at uniform intervals. The monthly returns on a particular stock over the last 5 years are an example of time‐series data.
Cross-sectional data
Cross‐sectional data refers to data collected by observing many subjects (such as individuals, firms, or countries/regions) at the same point in time. Analysis of cross‐sectional data usually consists of comparing the differences among the subjects. The returns of individual stocks over the last year are an example of cross‐sectional data.
Data sets can have both time‐series and cross‐sectional data in them. Examples of such data sets are:
Longitudinal data, which is data collected over time about multiple characteristics of the same observational unit. The various economic indicators of a particular country (observational unit), such as unemployment levels, inflation, and GDP growth rates (multiple characteristics), tracked over a decade (period of time), are an example of longitudinal data.
Panel data, which refers to data collected over time about a single characteristic of multiple observational units. The unemployment rates (single characteristic) of a number of countries (multiple observational units) over time are an example of panel data.
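A small pandas sketch of the distinction, with made-up numbers purely for illustration:

```python
import pandas as pd

# Longitudinal data: multiple characteristics of ONE observational unit over time
# (hypothetical figures for a single country).
longitudinal = pd.DataFrame(
    {"unemployment_%": [5.1, 4.8, 4.5],
     "inflation_%":    [2.0, 2.3, 1.9],
     "gdp_growth_%":   [2.5, 2.7, 2.2]},
    index=pd.Index([2021, 2022, 2023], name="year"),
)

# Panel data: ONE characteristic of MULTIPLE observational units over time
# (hypothetical unemployment rates for several countries).
panel = pd.DataFrame(
    {"Country A": [5.1, 4.8, 4.5],
     "Country B": [7.2, 6.9, 6.4],
     "Country C": [3.9, 4.1, 4.0]},
    index=pd.Index([2021, 2022, 2023], name="year"),
)

print(longitudinal)
print(panel)
```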
Central limit theorem
The central limit theorem allows us to make accurate statements about the population mean and variance using the sample mean and variance regardless of the distribution of the population, as long as the sample size is adequate. An adequate sample size is generally defined as one with 30 or more observations (n ≥ 30).
The important properties of the central limit theorem are:
1) Given a population with any probability distribution, with mean μ and variance σ², the sampling distribution of the sample mean, x̄, computed from samples of size n, will be approximately normal with mean μ (the population mean) and variance σ²/n (the population variance divided by the sample size), when the sample size is greater than or equal to 30.
2) No matter what the distribution of the population, for a sample whose size is greater than or equal to 30, the sample mean will be approximately normally distributed:
x̄ ~ N(μ, σ²/n)
3) The mean of the population (μ) and the mean of the distribution of sample means (x̄) are equal.
4) The variance of the distribution of sample means equals σ²/n, the population variance divided by the sample size.
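A short simulation sketch of these properties, drawing samples from a deliberately non-normal (lognormal) population with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Deliberately non-normal population (lognormal); parameters chosen for illustration.
population = rng.lognormal(mean=0.0, sigma=0.75, size=100_000)
mu, sigma2 = population.mean(), population.var()

n = 50  # sample size (>= 30, so the CLT applies)
sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)

# The sample means cluster around mu with variance close to sigma2 / n,
# and their histogram looks approximately normal.
print("population mean:", round(mu, 4), " mean of sample means:", round(sample_means.mean(), 4))
print("sigma^2 / n:    ", round(sigma2 / n, 4), " variance of sample means:", round(sample_means.var(), 4))
```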
Standard error
The standard deviation of the distribution of sample means is known as the standard error of the statistic.
When the population variance, σ², is known, the standard error of the sample mean is calculated as:
σx̄ = σ/√n
Practically speaking, population variances are almost never known, so we estimate the standard error of the sample mean using the sample's standard deviation, s:
sx̄ = s/√n
Point estimate
A point estimate involves the use of sample data to calculate a single value (a statistic) that serves as an approximation for an unknown population parameter. For example, the sample mean, x̄, is a point estimate of the population mean, μ. The formula used to calculate a point estimate is known as an estimator.
The estimator for the sample mean is given as:
x̄ = (x₁ + x₂ + … + xₙ)/n
Confidence interval
A confidence interval uses sample data to calculate a range of possible (or probable) values that an unknown population parameter can take, with a given probability of (1 − α). α is called the level of significance, and (1 − α) refers to the degree of confidence that the relevant parameter will lie in the computed interval. For example, a calculated interval between 100 and 150 at the 5% significance level implies that we can be 95% confident that the population parameter lies between 100 and 150.
A (1 – α)% confidence interval has the following structure:
Point estimate ± (reliability factor * standard error)
where:
Point estimate = value of the sample statistic that is used to estimate the population parameter.
Reliability factor = a number based on the assumed distribution of the point estimate and the level of confidence for the interval (1 – α).
Standard error = the standard error of the sample statistic (point estimate).
When choosing between a number of possible estimators for a particular population parameter, we make use of the desirable statistical properties of an estimator to make the best possible selection. The desirable properties of an estimator are:
Unbiasedness
Efficiency
Consistency
Statistical property: Unbiasedness
Unbiasedness: An unbiased estimator is one whose expected value is equal to the parameter being estimated. The expected value of the sample mean equals the population mean [E(x̄) = μ]. Therefore, the sample mean, x̄, is an unbiased estimator of the population mean, μ.
Statistical property: Efficiency
Efficiency: An efficient unbiased estimator is one that has the lowest variance among all unbiased estimators of the same parameter.
Statistical property: Consistency
Consistency: A consistent estimator is one for which the probability of estimates close to the value of the population parameter increases as sample size increases. We have already seen that the standard error of the sampling distribution falls as sample size increases, which implies a higher probability of estimates close to the population mean.
Student’s t‐distribution is a bell‐shaped probability distribution that has the following properties:
- It is symmetrical.
- It is defined by a single parameter, the degrees of freedom (df), where degrees of freedom equal sample size minus one (n‐1).
- It has a lower peak than the normal curve, but fatter tails.
- As the degrees of freedom increase, the shape of the t‐distribution approaches the shape of the standard normal curve.
A random sample size, n, and degrees of freedom
A random sample of size, n, is said to have n‐1 degrees of freedom. Basically, there are n‐1 independent deviations from the mean on which the estimate can be based.
What happens to the t-distribution curve as degrees of freedom increase?
As the degrees of freedom increase, the t‐distribution curve becomes more peaked and its tails become thinner (bringing it closer to a normal curve). As a result, for a given significance level, the confidence interval for a random variable that follows the t‐distribution will become narrower when the degrees of freedom increase. We will be more confident that the population mean will lie within the calculated interval as more data is concentrated towards the middle (as demonstrated by the higher peak) and less data is in the tails (thinner tails).
The t‐distribution is used in the following scenarios:
- It is used to construct confidence intervals for a normally (or approximately normally) distributed population whose variance is unknown when the sample size is small (n < 30).
- It may also be used for a non‐normally distributed population whose variance is unknown if the sample size is large (n ≥ 30). In this case, the central limit theorem is used to assume that the sampling distribution of the sample mean is approximately normal.
The confidence interval for the population mean when the population follows a normal distribution and its variance is known is calculated as follows (NOTE: this requires the population standard deviation):
x̄ ± zα/2 × (σ/√n)
The following reliability factors are used frequently when constructing confidence intervals based on the standard normal distribution:
For a 90% confidence interval we use z0.05 = 1.65
For a 95% confidence interval we use z0.025 = 1.96
For a 99% confidence interval we use z0.005 = 2.58
For example, suppose that, based on the mock exam scores of a sample of 36 SAT candidates, we construct a 99% confidence interval for the population mean score and obtain an interval of 1663 to 1836. This confidence interval can be interpreted in two ways:
- Probabilistic interpretation: After repeatedly taking samples of 36 SAT candidates' scores on the mock exam, and then constructing confidence intervals based on each sample's mean, 99% of the confidence intervals will include the population mean over the long run.
- Practical interpretation: We can be 99% confident that the average population score for the actual SAT exam is between 1663 and 1836.
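A minimal sketch of a z-based confidence interval, assuming a hypothetical sample mean, a known population standard deviation, and a sample size chosen purely for illustration:

```python
import math

# Hypothetical inputs (illustrative only).
x_bar = 0.085      # sample mean return
sigma = 0.20       # known population standard deviation
n = 64             # sample size
z = 1.96           # reliability factor for a 95% confidence interval

standard_error = sigma / math.sqrt(n)
lower = x_bar - z * standard_error
upper = x_bar + z * standard_error
print(f"95% confidence interval: [{lower:.4f}, {upper:.4f}]")
```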
When the variance of a normally distributed population is not known, we use the t‐distribution to construct confidence intervals:
x̄ ± tα/2 × (s/√n)
where the reliability factor, tα/2, has n − 1 degrees of freedom and s is the sample standard deviation.
t-distribution vs. z-distribution
Recall that the critical t‐values, or reliability factors, for constructing the confidence interval depend on the desired level of confidence and on the sample size (through the degrees of freedom). Also recall that the t‐distribution has fatter, thicker tails than the normal distribution. Because relatively more probability lies in the tails, a confidence interval at a given significance level will be wider under the t‐distribution than under the z‐distribution.
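A short comparison of reliability factors, using scipy.stats to show that t critical values exceed the z value and shrink toward it as the degrees of freedom grow (the degrees-of-freedom values are arbitrary):

```python
from scipy import stats

alpha = 0.05  # 95% confidence
z_crit = stats.norm.ppf(1 - alpha / 2)
print(f"z critical value: {z_crit:.3f}")

# t critical values for arbitrary degrees of freedom; they are larger than z
# (wider intervals) and approach z as df increases.
for df in (5, 15, 29, 200):
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    print(f"df = {df:>3}: t critical value = {t_crit:.3f}")
```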
When the population is normally distributed, when do we use z-statistic vs t-statistic?
Use the z‐statistic when the population variance is known.
Use the t‐statistic when the population variance is not known.
When the distribution of the population is nonnormal, the construction of an appropriate confidence interval depends on the size of the sample. When do we use z-statistic vs t-statistic?
- If the population variance is known and the sample size is large (n ≥ 30) we use the z‐statistic. This is because the central limit theorem tells us that the distribution of the sample mean is approximately normal when sample size is large.
- If the population variance is not known and sample size is large, we can use the z‐statistic or the t‐statistic. However, in this scenario the use of the t‐statistic is encouraged because it results in a more conservative measure.
This implies that we cannot construct confidence intervals for nonnormal distributions if sample size is less than 30.
When do you use z-distribution to construct confidence intervals?
When the variance of a normally distributed population is not known and the sample size is large, we may also use the z‐distribution (with the sample standard deviation in place of σ) to construct confidence intervals:
x̄ ± zα/2 × (s/√n)
Criteria for Selecting Appropriate Test Statistic
From our discussion so far, we have understood that there are various factors that affect the width of a confidence interval: Name two.
The choice of test statistic: a t‐statistic gives a wider confidence interval than a z‐statistic.
The degree of confidence: A higher desired level of confidence increases the size of the confidence interval.
From our formula for the confidence interval, it is easy to see that the width of the interval is also a function of the standard error. Explain.
The larger the standard error, the wider the confidence interval. The standard error, in turn, is a function of sample size: a larger sample size results in a smaller standard error and reduces the width of the confidence interval (see the short sketch after this list). Therefore, large sample sizes are desirable as they increase the precision with which we can estimate a population parameter. However, in practice two considerations may work against increasing the sample size:
Increasing the size of the sample may result in drawing observations from a different population.
Increasing the sample size may involve additional expenses that outweigh the benefit of increased accuracy of estimates. Other than the risk of sampling from more than one population, there are a variety of challenges to valid sampling. If the sample is biased in any way, estimates and conclusions drawn from sample data will be erroneous.
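A quick numerical sketch of how the standard error (and hence the confidence-interval width) shrinks as the sample size grows, using an arbitrary sample standard deviation:

```python
import math

s = 0.20  # hypothetical sample standard deviation
for n in (25, 100, 400, 1600):
    standard_error = s / math.sqrt(n)
    # Quadrupling the sample size halves the standard error.
    print(f"n = {n:>4}: standard error = {standard_error:.4f}")
```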
Name the types of biases
Data mining bias
Sample selection bias
Survivorship bias
Look-ahead bias
Time-period bias
Data mining
Data mining is the practice of developing a model by extensively searching through a data set for statistically significant relationships until a pattern “that works” is discovered. In the process of data mining, large numbers of hypotheses about a single data set are tested in a very short time by searching for combinations of variables that might show a correlation.
Given that enough hypotheses are tested, it is virtually certain that some of them will appear to be highly statistically significant, even on a data set with no real correlations at all. Researchers who use data mining techniques can be easily misled by these apparently significant results even though they are merely coincidences.
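A small simulation sketch of this effect: many purely random "factors" are tested against purely random "returns", yet some correlations still appear significant at the 5% level by chance alone (all data here is random noise):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_obs, n_factors = 60, 200

returns = rng.normal(size=n_obs)                # random "returns"
factors = rng.normal(size=(n_factors, n_obs))   # 200 random "factors"

# Test every factor against the returns; with no true relationship at all,
# roughly 5% of the tests will still come out "significant" at the 5% level.
significant = sum(
    1 for f in factors if stats.pearsonr(f, returns)[1] < 0.05
)
print(f"{significant} of {n_factors} random factors appear significant")
```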
Data‐mining bias most commonly occurs when:
Researchers have not formed a hypothesis in advance, and are therefore open to any hypothesis suggested by the data.
When researchers narrow the data used in order to reduce the probability of the sample refuting a specific hypothesis.
Warning signs that data mining bias might exist are:
- Too much digging warning sign, which involves testing numerous variables until one that appears to be significant is discovered.
- No story/no future warning sign, which is indicated by a lack of an economic theory that can explain the empirical results.
The best way to avoid the data‐mining bias is to:
The best way to avoid the data‐mining bias is to test the “apparently statistically significant relationships” on “out‐of‐sample” data to check whether they continue to hold.
Sample-selection bias
Sample-selection bias results from the exclusion of certain assets (such as bonds, stocks, or portfolios) from a study due to the unavailability of data.
Sample selection bias is even more severe in studies of hedge fund returns. This is because hedge funds are not required to publicly disclose their performance data. Only funds that performed well choose to disclose their performance, which leads to an overstatement of hedge fund returns.
Survivorship bias
Some databases use historical information and may suffer from a type of sample selection bias known as survivorship bias. This bias is present in databases that only list companies or funds currently in existence, which means that those that have failed are not included in the database. As a result, the results obtained from the study may not accurately reflect the true picture.