Chapter 2: Statistics Revisited Flashcards
- What is inferential statistics?
- Why is the Normal Distribution so important?
- What is an i.i.d random sample?
- How does sample size impact the confidence interval?
- What is a paired t-test?
- What is the OLS estimator all about?
What are descriptive and inferential statistics?
Descriptive statistics summarizes the data, either numerically or graphically, to describe the sample (e.g., mean and standard deviation). It is computed from all of the observed data.
Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. It is computed from a sample.
These inferences may take the form of:
- estimates of numerical characteristics (estimation)
- answers to yes/no questions (hypothesis testing),
- forecasting of future observations (forecasting),
- descriptions of association (correlation), or
- modeling of relationships (regression).
Data mining is sometimes referred to as exploratory statistics, since it generates new hypotheses.
What are random variables?
X is a random variable if it represents a random draw from some population; it is associated with a probability distribution.
- a discrete random variable can take on only selected values (e.g., Binomial or Poisson distributed)
- a continuous random variable can take on any value in a real interval (e.g., uniform, Normal, or Chi-square distributions); a person's height, for instance, is continuous
For example, a Normal distribution with mean μ and variance σ^2 is written N(μ, σ^2) and has the pdf
f(x) = (1 / (σ sqrt(2π))) e^(-(x - μ)^2 / (2σ^2))
The Standard Normal
Any random variable can be "standardized" by subtracting the mean, μ, and dividing by the standard deviation, σ, so that Z = (X - μ) / σ has
E(Z) = 0, Var(Z) = 1.
Thus, the standard normal, N(0, 1), has probability density function (pdf):
f(z) = (1 / sqrt(2π)) e^(-z^2 / 2)
Statistical Estimation
Population with parameters → (every member of the population has the same chance of being selected) → Random sample
Random sample → (estimation) → Population parameters
Expected Value of X: Population Mean E(X)
- The expected value is a probability-weighted average of X
- E(X) is the mean or expected value of the distribution of X, denoted μx
- Let f(xi) be the (discrete) probability that X = xi, then
- μx = E(X) = Σ (i = 1 to n) xi f(xi)
- Law of large numbers: the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
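The law of large numbers can be sketched with a quick simulation; the fair six-sided die (E(X) = 3.5), the seed, and the sample sizes are illustrative choices, not from the text.

```python
# Law of large numbers: running averages of i.i.d. draws approach E(X).
# Minimal sketch with a fair six-sided die, whose expected value is 3.5.
import random

random.seed(42)

def running_mean(n):
    """Average of n die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    # The average drifts toward 3.5 as n grows.
    print(n, running_mean(n))
```

With more trials the average settles near 3.5, exactly as the statement above predicts.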
Sampling Distribution of the Mean
- We can say something about the distribution of sample statistics (such as the sample mean)
- The sample mean is a random variable, and consequently it has its own distribution and variance
- The distribution of sample means for different samples of a population is centered on the population mean
- The mean of the sample means is equal to the population mean
- If the population is normally distributed or when the sample size is large, sample means are distributed normally
Examples of Estimators
- Suppose we want to estimate the population mean
- Suppose we use the formula for E(X), but substitute 1/n for f(xi) as the probability weight, since each point has an equal chance of being included in the sample; then we can calculate the sample mean:
  x̄ = (1/n) Σ (i = 1 to n) xi
- X̄ denotes the random variable for the arithmetic mean of the sample, while x̄ is the mean of a particular realization of a sample.

Estimators should be Unbiased
An estimator (e.g., the arithmetic sample mean) is a statistic (a function of the observable sample data) that is used to estimate an unknown population parameter (e.g., the expected value)
Standard Error of the Mean: Standard Deviation of Sample Means
The standard deviation of the sample means is equal to the standard deviation of the population divided by the square root of the sample size.
Ο / sqrt(n)
Rule: Var[aX + b] = a² Var[X]; applying it to X̄ = (1/n) Σ Xi gives Var[X̄] = σ²/n.
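A short simulation can check that the standard deviation of sample means matches σ/sqrt(n); the population parameters (μ = 215, σ = 20), the sample size, the seed, and the number of replications are assumptions chosen to mirror the confidence-interval example later in this chapter.

```python
# Standard error of the mean: the standard deviation of many sample means
# should be close to sigma / sqrt(n).
import math
import random

random.seed(1)
sigma, n = 20.0, 100

def sample_mean(n, mu=215.0, sigma=20.0):
    """Mean of one simulated sample of size n from N(mu, sigma^2)."""
    return sum(random.gauss(mu, sigma) for _ in range(n)) / n

means = [sample_mean(n) for _ in range(2_000)]
avg = sum(means) / len(means)
sd_of_means = math.sqrt(sum((m - avg) ** 2 for m in means) / (len(means) - 1))

print(sd_of_means)           # empirical standard error, close to 2.0
print(sigma / math.sqrt(n))  # theoretical: 20 / sqrt(100) = 2.0
```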
Random Samples and Sampling
- For a random variable X, repeated draws from the same population can be labeled X1, X2, ..., Xn
- If every combination of n sample points has an equal chance of being selected, this is a random sample
- A random sample is a set of independent, identically distributed (i.i.d.) random variables
Central Limit Theorem
- The central limit theorem states that the standardized average of any population of i.i.d. random variables Xi with mean μ and variance σ² is asymptotically ~ N(0, 1):
  Z = (X̄ - μ) / (σ / sqrt(n))
- Asymptotic normality implies that P(Z < z) → Φ(z) as n → ∞. In other words:
- Let X1, ..., Xn be n i.i.d. random variables with mean μ and standard deviation σ.
- If n is sufficiently large, the sample mean X̄ is approximately Normal with mean μ and standard deviation σ/sqrt(n)
- i.e., the mean of the sample means is equal to the population mean
- i.e., the standard deviation of the sample means is equal to the standard deviation of the population divided by the square root of the sample size
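A minimal sketch of the CLT: means of samples drawn from a clearly non-normal population (uniform on [0, 1], with μ = 0.5 and σ² = 1/12) land approximately on the predicted mean and variance. The sample size, replication count, and seed are illustrative.

```python
# CLT sketch: sample means from a non-normal population are approximately
# N(mu, sigma^2 / n). Uniform(0, 1): mu = 0.5, sigma^2 = 1/12.
import random

random.seed(7)
n, reps = 50, 5_000
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

mean_of_means = sum(means) / reps
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / (reps - 1)

print(mean_of_means)  # close to mu = 0.5
print(var_of_means)   # close to (1/12)/50 ≈ 0.00167
```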

Statistical Estimation
- Population with mean μ = ? →
- A simple random sample of n elements is selected from the population →
- The sample data provide a value for the sample mean x̄ →
- The value of x̄ is used to make inferences about the value of μ.
Studentβs t-Distribution
- When the population standard deviation is not known, or when the sample size is small, the Studentβs t-distribution should be used
- This distribution is similar to the Normal distribution, but more spread out for small samples
- The formula for standardizing the distribution of sample means to the t-distribution is similar, except that the sample standard deviation s is used:
  t = (x̄ - μ) / (s / sqrt(n))
Statistical Estimation (Types)
- Point estimate
- sample mean
- sample proportion
- Interval estimate
- confidence interval
- Point estimate is always within the interval estimate
Confidence Interval (CI)
Confidence intervals provide a range of values that we believe, with a given level of confidence, contains a population parameter. CI for the population mean:
Pr(X̄ - 1.96 σ/sqrt(n) ≤ μ ≤ X̄ + 1.96 σ/sqrt(n)) = 0.95
The two endpoints are the lower bound and the upper bound.
95% of intervals constructed this way contain μ.
Example: CI Based on the Standard Normal
Suppose a sample of n = 100 persons has mean = 215 and standard deviation = 20.
95% CI = X̄ ± 1.96 s / sqrt(n)
- Lower Limit: 215 - 1.96 · 20/10
- Upper Limit: 215 + 1.96 · 20/10
- CI = (211, 219)
βWe are 95% confident that the interval 211-219 contains πβ
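The interval above can be reproduced in a few lines; the numbers are those of the example (n = 100, mean 215, s = 20) with the normal multiplier 1.96.

```python
# 95% confidence interval for the mean, normal approximation.
import math

n, xbar, s = 100, 215.0, 20.0
z = 1.96  # multiplier for a 95% interval

half_width = z * s / math.sqrt(n)
lower, upper = xbar - half_width, xbar + half_width
print(lower, upper)  # (211.08, 218.92), roughly (211, 219)
```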
Effect of Sample Size
Suppose we had only 10 observations. What happens to the confidence interval?
X̄ ± 1.96 s / sqrt(n)
- For n = 100: 215 ± 1.96(20)/sqrt(100) = (211, 219)
- For n = 10: 215 ± 1.96(20)/sqrt(10) ≈ (203, 227)
- Larger sample size = smaller interval
Suppose we use a 90% interval
What happens to the confidence interval?
X̄ ± 1.645 s / sqrt(n)
90%: 215 ± 1.645(20)/sqrt(100) ≈ (212, 218)
Lower confidence level = smaller interval (a 99% interval would use 2.58 as the multiplier, and the interval would be larger)
Effect of Standard Deviation
Suppose we had a SD of 40 (instead of 20). What happens to the confidence interval?
X̄ ± 1.96 s / sqrt(n)
215 ± 1.96(40)/sqrt(100) ≈ (207, 223)
More variation = larger interval
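All three effects act through the half-width z · s / sqrt(n); a small sketch using the example's numbers makes the comparisons explicit.

```python
# Interval half-width as a function of the multiplier z, spread s, and size n.
import math

def half_width(z, s, n):
    """Half-width of a z-based confidence interval for the mean."""
    return z * s / math.sqrt(n)

print(half_width(1.96, 20, 100))   # baseline: 3.92
print(half_width(1.96, 20, 10))    # smaller n -> wider (≈ 12.40)
print(half_width(1.645, 20, 100))  # lower confidence -> narrower (3.29)
print(half_width(1.96, 40, 100))   # larger s -> wider (7.84)
```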
Statistical Inference
- Formulate hypothesis
- Collect data to test the hypothesis ← systematic error (bias) can enter here
- Fail to reject the hypothesis, or
- Reject the hypothesis
Random error (chance) can be controlled by statistical significance or by confidence intervals
Hypothesis Testing
- State null and alternative hypothesis (Ho and Ha)
- Ho usually a statement of no effect or no difference between groups
- Choose the α level (related to the confidence level)
- Probability of falsely rejecting Ho (Type I error), typically 0.05 or 0.01
- Calculate test statistic, find p-value (p)
- Measures how far data are from what you expect under null hypothesis
- State conclusion:
- If p ≤ α, reject Ho
- If p > α, insufficient evidence to reject Ho
Possible Results of Tests
- Null true and Reject Null = Type I (alpha error)
- Null true and Fail to reject Null = Correct
- Null false and Reject Null = Correct
- Null false and Fail to reject Null = Type II error (β)
Hypothesis Testing
Hypothesis: A statement about parameters of a population or of a model (μ = 200?)
Test: Does the data agree with the hypothesis? (sample mean 220)
Simple random sample from a normal population (or n large enough for CLT)
Ho: μ = μo
Ha: μ ≠ μo, pick α
t-Tests
- The one-sample t-test uses the mean of a single sample to check whether the mean of the underlying population differs from a given target value. It assumes that the sample data come from a normally distributed population, or that the sample size is large enough for the central limit theorem to hold.
- The two-sample t-test uses the means of two independent samples to compare the means of two populations. It assumes that the samples come from normally distributed populations, or that the sample sizes are large enough for the central limit theorem to hold. The classical t-test additionally assumes that both samples come from populations with equal variance. The Welch test (also called Satterthwaite's t-test) is a variant that does not assume equal variances.
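Assuming SciPy is available, both tests can be run as follows; the simulated data, seeds, and population parameters are made up for illustration (they echo figures used elsewhere in this chapter).

```python
# One-sample and two-sample t-tests with scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One-sample: does the population mean differ from 200?
x = rng.normal(loc=215, scale=20, size=100)
t1, p1 = stats.ttest_1samp(x, popmean=200)

# Two-sample, Welch variant: equal variances are NOT assumed.
a = rng.normal(loc=6299, scale=3412, size=100)
b = rng.normal(loc=7034, scale=2467, size=100)
t2, p2 = stats.ttest_ind(a, b, equal_var=False)

print(p1)  # tiny: the sample mean is far from 200
print(p2)
```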
CI and 2-Sided Tests
- A level-α two-sided test rejects Ho: μ = μo exactly when the value μo falls outside a level 1 - α confidence interval for μ.
- Calculate the 1 - α level confidence interval, then:
- if μo falls within the interval, do not reject the null hypothesis (|t| < t_α/2)
- otherwise |t| ≥ t_α/2, so reject the null hypothesis.
Definition of a p-Value
The p-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one obtained (e.g., t = 3.1 or larger). The smaller the p-value, the more implausible the null hypothesis seems.
The p-value is compared against the significance level α: reject Ho when p ≤ α.
Unpaired Samples - p-value
2 independent samples:
Does the amount of credit card debt differ between households in rural areas compared to households in urban areas?
- Population 1: All rural households, mean μ1
- Population 2: All urban households, mean μ2
- Null hypothesis: Ho: μ1 = μ2
- Alternative hypothesis: Ha: μ1 ≠ μ2
Population 1 (all rural households, μ1):
- Take a random sample of size n1 and compute the sample mean x̄1
Population 2 (all urban households, μ2):
- Take a random sample of size n2 and compute the sample mean x̄2
Are the sample means consistent with Ho?
Summary Rural:
- x1 = 6299
- s1 = 3412
Summary Urban:
- x2 = 7034
- s2 = 2467
Difference in means = €735. Since s1 ≠ s2, we have heteroscedasticity (unequal variances), so Welch's variant of the t-test is appropriate.
How likely is it to get a difference of €735 or greater if Ho is true? This probability is the p-value.
If it is small, then reject Ho.
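Given only the summary statistics, SciPy's `ttest_ind_from_stats` can run Welch's test directly. The sample sizes are not stated in the text, so n1 = n2 = 100 below is a hypothetical assumption; with different sample sizes the conclusion may change.

```python
# Welch's t-test from summary statistics (sample sizes are assumed).
from scipy import stats

t, p = stats.ttest_ind_from_stats(
    mean1=6299, std1=3412, nobs1=100,  # rural summary
    mean2=7034, std2=2467, nobs2=100,  # urban summary
    equal_var=False,  # heteroscedasticity -> Welch/Satterthwaite
)
print(t, p)  # t ≈ -1.75, p ≈ 0.08: not significant at the 5% level
```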
Selected Statistical Tests
- Parametric tests
  - F-test
    - Compares the variances of two samples for equality
    - Often used as a pre-test for the t-test
  - The family of t-tests
    - Compares two sample means or tests a single mean
  - ANOVA
    - Tests the equivalence of multiple means for several i.i.d., normally distributed samples
- Non-parametric tests
  - Wilcoxon signed-rank test
    - Compares two means for 2 paired i.i.d. samples when normality cannot be assumed
  - Mann-Whitney U test
    - Used for 2 independent samples
  - Kruskal-Wallis test
    - Tests the equivalence of multiple means for several i.i.d., non-normally distributed samples
- Tests of the probability distribution
  - Kolmogorov-Smirnov and Chi-square tests
    - Used to determine whether two underlying probability distributions differ, or whether an underlying probability distribution differs from a hypothesized distribution
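A quick sketch of the one-sample Kolmogorov-Smirnov test with SciPy; the simulated samples and seed are illustrative assumptions.

```python
# Kolmogorov-Smirnov test: does a sample plausibly come from a
# hypothesized distribution?
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_data = rng.normal(loc=0, scale=1, size=500)
uniform_data = rng.uniform(0, 1, size=500)

# Compare each sample against the standard normal CDF.
_, p_normal = stats.kstest(normal_data, "norm")
_, p_uniform = stats.kstest(uniform_data, "norm")

print(p_normal)   # typically large: consistent with N(0, 1)
print(p_uniform)  # tiny: uniform data are not standard normal
```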