Chapter 2: Statistics Revisited Flashcards

1
Q
  • What is inferential statistics?
  • Why is the Normal Distribution so important?
  • What is an i.i.d. random sample?
  • How does sample size impact the confidence interval?
  • What is a paired t-test?
  • What is the OLS estimator all about?
A
2
Q

What is descriptive statistics and inferential statistics?

A

Descriptive statistics summarize the data, either numerically or graphically, to describe the sample (e.g., mean and standard deviation). They are computed from all the data at hand.

Inferential statistics model patterns in the data, accounting for randomness, and draw inferences about the larger population from a sample.

These inferences may take the form of:

  • estimates of numerical characteristics (estimation)
  • answers to yes/no questions (hypothesis testing),
  • forecasting of future observations (forecasting),
  • descriptions of association (correlation), or
  • modeling of relationships (regression).

Data Mining is sometimes referred to as exploratory statistics generating new hypotheses.

3
Q

What are random variables?

A

X is a random variable if it represents a random draw from some population and is associated with a probability distribution.

  • a discrete random variable can take on only selected values (e.g., Binomial or Poisson distributed)
  • a continuous random variable can take on any value in a real interval (e.g., Uniform, Normal, or Chi-square distributed); examples: a person's height, or an angle between 0 and 180 degrees

For example, a Normal distribution with mean μ and variance σ² is written as N(μ, σ²) and has the pdf

f(x) = 1 / (σ * sqrt(2π)) * exp(-(x - μ)² / (2σ²))
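
As a small sketch of the Normal pdf, the function below writes the density out term by term and cross-checks it against Python's standard library. The function name `normal_pdf` and the parameter values (mean 170, standard deviation 10) are illustrative assumptions, not part of the original card.

```python
import math
from statistics import NormalDist

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x, written out term by term."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Cross-check against the standard library at a few points
# (mu = 170 and sigma = 10 are made-up illustration values).
for x in (150.0, 170.0, 185.0):
    assert math.isclose(normal_pdf(x, 170.0, 10.0), NormalDist(170.0, 10.0).pdf(x))
```

The density is highest at x = μ and falls off symmetrically on both sides.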

4
Q

The Standard Normal

A

Any random variable X can be "standardized" by subtracting the mean, μ, and dividing by the standard deviation, σ: Z = (X - μ) / σ, so that
E(Z) = 0, Var(Z) = 1.

Thus, the standard normal, N(0, 1), has probability density function (pdf):

φ(z) = 1 / sqrt(2π) * exp(-z² / 2)
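
A minimal sketch of standardization: a probability computed on the original scale matches the one computed on the standard normal scale after subtracting the mean and dividing by the standard deviation. The population (mean 170, standard deviation 10) and the helper name `standardize` are assumptions for illustration only.

```python
from statistics import NormalDist

def standardize(x: float, mu: float, sigma: float) -> float:
    """Return the z-score of x: subtract the mean, divide by the sd."""
    return (x - mu) / sigma

# Hypothetical population, chosen only for illustration.
heights = NormalDist(mu=170.0, sigma=10.0)

z = standardize(185.0, heights.mean, heights.stdev)
p_direct = heights.cdf(185.0)        # P(X < 185) on the original scale
p_standard = NormalDist().cdf(z)     # P(Z < z) on the N(0, 1) scale
print(z, round(p_direct, 4), round(p_standard, 4))
```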

5
Q

Statistical Estimation

A

Population with parameters --(every member of the population has the same chance of being selected)--> Random sample

Random sample --(estimation)--> Population

6
Q

Expected Value of X: Population Mean E(X)

A
  • The expected value is a probability-weighted average of X
  • E(X) is the mean or expected value of the distribution of X, denoted by μ_X
  • Let f(x_i) be the (discrete) probability that X = x_i, then
    • μ_X = E(X) = Σ (i = 1 to n) x_i * f(x_i)
  • Law of large numbers: the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
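
The expected value and the law of large numbers can be sketched with a fair six-sided die. The die, the seed, and the trial counts are assumptions chosen for illustration.

```python
import random

# A fair die: each value has probability 1/6 (illustrative assumption).
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# Probability-weighted average: E(X) = sum of x_i * f(x_i).
expected = sum(x * p for x, p in zip(values, probs))

rng = random.Random(42)  # fixed seed so the run is repeatable

def running_average(n_trials: int) -> float:
    """Average of n_trials simulated die rolls."""
    return sum(rng.choice(values) for _ in range(n_trials)) / n_trials

# The simulated average should drift toward E(X) as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(running_average(n), 3), "vs expected", round(expected, 3))
```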
7
Q

Sampling Distribution of the Mean

A
  • We can say something about the distribution of sample statistics (such as the sample mean)
  • The sample mean is a random variable, and consequently it has its own distribution and variance
  • The distribution of sample means for different samples of a population is centered on the population mean
  • The mean of the sample means is equal to the population mean
  • If the population is normally distributed, sample means are normally distributed; when the sample size is large, they are approximately normally distributed
8
Q

Examples of Estimators

A
  • Suppose we want to estimate the population mean
  • Suppose we use the formula for E(X), but substitute 1/n for f(x_i) as the probability weight since each point has an equal chance of being included in the sample, then we can calculate the sample mean: x̄ = (1/n) * Σ (i = 1 to n) x_i
  • X̄ describes the random variable for the arithmetic mean of the sample, while x̄ is the mean of a particular realization of a sample.
9
Q

Estimators should be Unbiased

A

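An estimator (e.g., the arithmetic sample mean) is a statistic (a function of the observable sample data) that is used to estimate an unknown population parameter (e.g., the expected value). It is unbiased if its expected value equals the parameter being estimated; for the sample mean, E(X̄) = μ.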

10
Q

Standard Error of the Mean: Standard Deviation of Sample Means

A

The standard deviation of the sample means is equal to the standard deviation of the population divided by the square root of the sample size:

SE(X̄) = σ / sqrt(n)

Rule: Var[aX + b] = a² Var[X]

Applied to X̄ = (1/n) * Σ X_i with independent draws, the rule gives Var[X̄] = σ² / n.
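
The standard error formula can be checked with a short simulation: draw many samples, compute each sample mean, and compare the spread of those means with σ / sqrt(n). The population values (μ = 215, σ = 20), the sample size, and the number of repetitions are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
MU = 215.0               # assumed population mean
SIGMA = 20.0             # assumed population standard deviation
N = 25                   # observations per sample
REPS = 20_000            # number of simulated samples

# One sample mean per simulated sample.
sample_means = [
    statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(REPS)
]

empirical_se = statistics.stdev(sample_means)  # spread of the sample means
theoretical_se = SIGMA / N ** 0.5              # sigma / sqrt(n)
print(round(empirical_se, 2), theoretical_se)
```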

11
Q

Random Samples and Sampling

A
  • For a random variable 𝑋, repeated draws from the same population can be labeled as 𝑋1, 𝑋2, . . . , 𝑋𝑛
  • If every combination of 𝑛 sample points has an equal chance of being selected, this is a random sample
  • A random sample is a set of independent, identically distributed (i.i.d.) random variables
12
Q

Central Limit Theorem

A
  • The central limit theorem states that the standardized average of any population of i.i.d. random variables X_i with mean μ_X and variance σ² is asymptotically ~N(0, 1), or Z = (X̄ - μ_X) / (σ / sqrt(n)) ~ N(0, 1) as n -> ∞
  • Asymptotic normality implies that P(Z < z) -> Φ(z) as n -> ∞, or P(Z < z) ≈ Φ(z)
  • In other words:
  • Let X_1, ..., X_n be n i.i.d. random variables with mean μ and standard deviation σ.
  • If n is sufficiently large, the sample mean X̄ is approximately
    • Normal with mean μ and standard deviation σ / sqrt(n)
      • i.e., the mean of the sample means is equal to the population mean
      • i.e., the standard deviation of the sample means is equal to the standard deviation of the population divided by the square root of the sample size
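
A rough simulation sketch of the central limit theorem: sample means from a clearly non-normal population are standardized, and the share falling below z = 1 is compared with Φ(1). The Exponential(1) population (mean 1, standard deviation 1), the seed, and the sizes are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(1)   # fixed seed so the run is repeatable
N = 100                  # observations per sample
REPS = 5_000             # number of simulated samples
MU, SIGMA = 1.0, 1.0     # mean and sd of the Exponential(1) distribution

def standardized_mean() -> float:
    """Draw one sample, return (sample mean - mu) / (sigma / sqrt(n))."""
    sample_mean = sum(rng.expovariate(1.0) for _ in range(N)) / N
    return (sample_mean - MU) / (SIGMA / math.sqrt(N))

z_values = [standardized_mean() for _ in range(REPS)]

empirical = sum(z < 1.0 for z in z_values) / REPS   # estimate of P(Z < 1)
theoretical = NormalDist().cdf(1.0)                 # Phi(1)
print(round(empirical, 3), round(theoretical, 3))
```

The two numbers should be close even though individual exponential draws are strongly skewed.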
13
Q

Statistical Estimation

A
  • Population with mean: μ = ? –>
  • A simple random sample of n elements is selected from the population. –>
  • The sample data provide a value for the sample mean x̄ –>
  • The value of x̄ is used to make inferences about the value of μ.
14
Q

Student's t-Distribution

A
  • When the population standard deviation is not known, or when the sample size is small, the Student's t-distribution should be used
  • This distribution is similar to the Normal distribution, but more spread out for small samples
  • The formula for standardizing the distribution of sample means to the t-distribution is similar, except that the sample standard deviation s is used: t = (x̄ - μ) / (s / sqrt(n))
15
Q

Student t-Distribution

A
16
Q

Statistical Estimation (Types)

A
  • Point estimate
    • sample mean
    • sample proportion
  • Interval estimate
    • confidence interval for the mean
    • confidence interval for the proportion
  • The point estimate always lies within the interval estimate
17
Q

Confidence Interval (CI)

A

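A confidence interval provides a range of values that we believe, with a given level of confidence, contains a population parameter.

CI for the population mean:

Pr(X̄ - 1.96 SD <= μ <= X̄ + 1.96 SD) = 0.95

Here SD is the standard deviation of the sample mean (the standard error), and the two limits are the lower bound and upper bound.

There is a 95% chance that an interval constructed this way contains μ.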

18
Q

Example: Standard Normal Distribution

A

Suppose a sample of n = 100 persons has mean = 215 and standard deviation = 20.

95% CI = x̄ ± 1.96 * s / sqrt(n)

  • Lower limit: 215 - 1.96 * 20 / 10 = 211.08
  • Upper limit: 215 + 1.96 * 20 / 10 = 218.92
  • CI ≈ (211, 219)

"We are 95% confident that the interval 211-219 contains μ"
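
The interval in this example can be reproduced with a short sketch. The helper name `mean_confidence_interval` is an assumption; the multiplier comes from the standard normal quantile function rather than being hard-coded as 1.96.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Normal-based CI for a mean: x_bar +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for 95%
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Values from the card: n = 100, mean = 215, standard deviation = 20.
lower, upper = mean_confidence_interval(215.0, 20.0, 100)
print(f"({lower:.2f}, {upper:.2f})")  # (211.08, 218.92)
```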

19
Q

Effect of Sample Size

A

Suppose we had only 10 observations. What happens to the confidence interval?

x̄ ± 1.96 * s / sqrt(n)

  • For n = 100: 215 ± 1.96(20) / sqrt(100) = (211, 219)
  • For n = 10: 215 ± 1.96(20) / sqrt(10) = (203, 227)
  • Larger sample size = smaller interval
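
A small sketch of how the margin of error shrinks with the square root of the sample size. The function name `margin_of_error` and the extra case n = 400 are assumptions added for illustration.

```python
import math

def margin_of_error(sd: float, n: int, z: float = 1.96) -> float:
    """Half-width of the interval: z * s / sqrt(n)."""
    return z * sd / math.sqrt(n)

# s = 20 as on the card; n = 400 is an extra illustrative case.
for n in (10, 100, 400):
    print(n, round(margin_of_error(20.0, n), 2))
```

Because of the square root, quadrupling the sample size (100 to 400) only halves the margin.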
20
Q

Suppose we use a 90% interval
What happens to the confidence interval?

A

x̄ ± 1.645 * s / sqrt(n)

90%: 215 ± 1.645(20) / sqrt(100) = (212, 218)

Lower confidence level = smaller interval (a 99% interval would use 2.58 as the multiplier, and the interval would be larger)
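
The multipliers 1.645, 1.96, and 2.58 are quantiles of the standard normal distribution. A minimal sketch, with `z_multiplier` as an assumed helper name:

```python
from statistics import NormalDist

def z_multiplier(level: float) -> float:
    """Two-sided critical value: the (0.5 + level / 2) quantile of N(0, 1)."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: {z_multiplier(level):.3f}")
# 90%: 1.645
# 95%: 1.960
# 99%: 2.576
```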

21
Q

Effect of Standard Deviation

A

Suppose we had an SD of 40 (instead of 20). What happens to the confidence interval?

x̄ ± 1.96 * s / sqrt(n)

215 ± 1.96(40) / sqrt(100) = (207, 223)

More variation = larger interval

22
Q

Statistical Inference

A
  1. Formulate hypothesis
  2. Collect data to test hypothesis (systematic error can enter at this step)
    1. Accept hypothesis
    2. Reject hypothesis

Random error (chance) can be controlled by statistical significance or by confidence interval

23
Q

Hypothesis Testing

A
  • State null and alternative hypothesis (H0 and Ha)
    • H0 is usually a statement of no effect or no difference between groups
  • Choose α level (related to confidence level)
    • Probability of falsely rejecting H0 (Type I error), typically 0.05 or 0.01
  • Calculate test statistic, find p-value (p)
    • Measures how far the data are from what you expect under the null hypothesis
  • State conclusion:
    • p ≤ α: reject H0
    • p > α: insufficient evidence to reject H0
24
Q

Possible Results of Tests

A
  • Null true and reject null = Type I error (α)
  • Null true and fail to reject null = Correct
  • Null false and reject null = Correct
  • Null false and fail to reject null = Type II error (β)
25
Q

Hypothesis Testing

A

Hypothesis: A statement about parameters of a population or of a model (e.g., μ = 200?)

Test: Do the data agree with the hypothesis? (e.g., sample mean 220)

Assumption: simple random sample from a normal population (or n large enough for the CLT)

H0: μ = μ0
Ha: μ ≠ μ0, pick α
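
A hedged sketch of a two-sided z-test for this setup. The card gives only μ0 = 200 and a sample mean of 220; the standard deviation (50) and sample size (25) below are made-up assumptions so that the example can run, and the function name is illustrative.

```python
import math
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, reject) for H0: mu = mu0 vs Ha: mu != mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_value, p_value <= alpha

# mu0 = 200 and mean = 220 come from the card;
# sigma = 50 and n = 25 are hypothetical values.
z, p, reject = two_sided_z_test(220.0, 200.0, sigma=50.0, n=25)
print(round(z, 2), round(p, 4), reject)
```

With these assumed inputs the statistic is two standard errors above μ0, so H0 is rejected at α = 0.05 but would not be at α = 0.01.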

26
Q

Z-Test

A
  • The one-sample t-test uses the mean of a sample to test whether the mean of a population differs from a specified target value. It assumes that the sample data come from a normally distributed population, or that the sample size is large enough for the central limit theorem to apply.
  • The two-sample t-test uses the means of two independent samples to test how the means of two populations relate to each other. It assumes that the sample data come from normally distributed populations, or that the sample sizes are large enough for the central limit theorem to apply. The classic t-test assumes that both samples come from populations with equal variance. The Welch test (also called Satterthwaite's t-test) is a variant that does not assume equal variances.
27
Q

CI and 2-Sided Tests

A
  • A level α 2-sided test rejects H0: μ = μ0 exactly when the value μ0 falls outside a level 1 - α confidence interval for μ.
  • Calculate the 1 - α level confidence interval, then
    • if μ0 falls within the interval, do not reject the null hypothesis (|t| < t_(α/2))
    • otherwise, |t| ≥ t_(α/2) => reject the null hypothesis.
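
The equivalence between a two-sided test and a confidence interval can be sketched in a few lines. This example swaps in the normal critical value for t_(α/2), which is an assumption that is reasonable only for large n; the function names and the sample values (mean 215, sd 20, n = 100) are illustrative.

```python
import math
from statistics import NormalDist

def ci_contains(mu0, sample_mean, sd, n, alpha=0.05):
    """True if mu0 lies strictly inside the 1 - alpha interval."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z_crit * sd / math.sqrt(n)
    return sample_mean - margin < mu0 < sample_mean + margin

def z_test_rejects(mu0, sample_mean, sd, n, alpha=0.05):
    """True if the two-sided test rejects H0: mu = mu0."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z = (sample_mean - mu0) / (sd / math.sqrt(n))
    return abs(z) >= z_crit

# For every candidate mu0 the two decisions are mirror images.
for mu0 in (205.0, 211.0, 212.0, 215.0, 219.0, 225.0):
    assert z_test_rejects(mu0, 215.0, 20.0, 100) != ci_contains(mu0, 215.0, 20.0, 100)
```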
28
Q

Definition of a p-Value

A

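The p-value is the probability of obtaining a test statistic at least as extreme as the one observed (e.g., t = 3.1 or larger), given that the null hypothesis is true. The smaller the p-value, the more unlikely the null hypothesis seems.

The p-value is sometimes called the observed significance level. It is not the same as the significance level α: α is chosen before the test, and the p-value is then compared with it.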

29
Q

Unpaired Samples - p-value

A

2 independent samples:

Does the amount of credit card debt differ between households in rural areas compared to households in urban areas?

  • Population 1: All rural households, with mean m1
  • Population 2: All urban households, with mean m2
  • Null hypothesis: H0: m1 = m2
  • Alternative hypothesis: Ha: m1 ≠ m2

Take a random sample of size n1 from population 1 and a random sample of size n2 from population 2, and compute the sample mean of each.

Are the sample means consistent with H0?

Summary Rural:

  • x̄1 = 6299
  • s1 = 3412

Summary Urban:

  • x̄2 = 7034
  • s2 = 2467

Difference in means = €735. The sample standard deviations differ, so we have heteroscedasticity (unequal variances).

How likely is it to get a difference of €735 or greater if H0 is true? => This probability is the p-value.

If it is small, then reject H0.
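
A rough sketch of a Welch-type comparison computed from summary statistics. The card does not state the sample sizes, so n1 = n2 = 100 below is purely a hypothetical assumption, as is the function name. The p-value uses a normal approximation, because the Python standard library has no t-distribution; a statistics package would use the t-distribution with the Welch degrees of freedom instead.

```python
import math
from statistics import NormalDist

def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic, Welch-Satterthwaite df, and a normal-approximation p-value."""
    v1 = sd1 ** 2 / n1          # squared standard error, sample 1
    v2 = sd2 ** 2 / n2          # squared standard error, sample 2
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    # Normal approximation to the two-sided p-value (adequate for large df).
    p_approx = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, df, p_approx

# Means and standard deviations from the card; sample sizes are hypothetical.
t_stat, df, p_approx = welch_from_summary(6299.0, 3412.0, 100, 7034.0, 2467.0, 100)
print(round(t_stat, 3), round(df, 1), round(p_approx, 3))
```

The conclusion depends heavily on the assumed sample sizes: larger samples shrink the standard error and therefore the p-value.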

30
Q

Selected Statistical Tests

A
  • Parametric Tests
    • F-test
      • Tests the equality of the variances of two samples
      • Often used as a pre-test for the t-test
    • The family of t-tests
      • Compares two sample means or tests a single mean
    • ANOVA
      • Equality of multiple means in case of several i.i.d. samples (normally distributed)
  • Non-parametric Tests
    • Wilcoxon signed-rank test
      • Compares the central tendency of 2 paired samples when normality cannot be assumed
    • Mann-Whitney U test
      • The counterpart for 2 independent samples
    • Kruskal-Wallis test
      • Compares the central tendency of several independent samples when normality cannot be assumed
  • Tests of the Probability Distribution
    • Kolmogorov-Smirnov and Chi-square test
      • used to determine whether two underlying probability distributions differ, or whether an underlying probability distribution differs from a hypothesized distribution