Week 2: Normal Distribution, Inference, and Confidence Intervals Flashcards

1
Q

What is a normal distribution?

A

A normal (or Gaussian) distribution is a continuous, bell-shaped, symmetric distribution where the mean, median, and mode coincide at the centre.

2
Q

What are the main properties of a normal curve?

A
  • Bell-shaped and symmetric
  • Mean = median = mode
  • Continuous: the variable can take any value
  • Asymptotic: Tails approach but never touch the x-axis
  • Total area under the curve = 1 (area in each half of the distribution is 0.5)
3
Q

Describe the Empirical Rule for a normal distribution.

A
  • 68% of data lie within 1 standard deviation (SD) of the mean
  • 95% within 2 SDs
  • 99.7% within 3 SDs
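The three percentages can be checked numerically. This is a minimal sketch using `NormalDist` from Python's standard library; the helper name `area_within` is only illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1
std_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve between -k and +k SDs."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

The rule's 68%, 95% and 99.7% are these areas rounded.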
4
Q

What is a standard normal distribution?

A

It is the normal distribution with a mean of 0 and a standard deviation of 1, also called the z distribution. Its units are z-scores (also called standard units or standard scores).

5
Q

Define z-score and its purpose.

A

A z-score measures how many standard deviations a data point is from the mean, and it can be negative or positive. Real-world data may be normally distributed but rarely have mean = 0 and SD = 1, so we standardise them to the z distribution. This allows comparison across different distributions by putting values on a common scale, and it lets us use z-score tables to find areas (i.e., probabilities) under the standard normal curve.

6
Q

What is the z-score formula?

A

z = (x − μ) / σ
Where x is a value from the dataset, μ is the mean, and σ is the standard deviation
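
As a rough illustration, the formula can be written as a small function and used to compare values measured on different scales. The two tests and all their numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical scores on two tests with different means and SDs
z_test_a = z_score(70, mu=60, sigma=5)     # 2.0 SDs above the mean
z_test_b = z_score(80, mu=75, sigma=10)    # 0.5 SDs above the mean
print(z_test_a, z_test_b)  # 2.0 0.5
```

Although 80 is the higher raw score, the score of 70 is further above its own mean once both are put on the z scale.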

7
Q

What do positive and negative z-scores represent?

A

A positive z-score means the data point is above the mean, while a negative z-score indicates it is below the mean.

8
Q

What does a z-score table provide?

A

It shows the probability (area under the curve) for a given z-score, useful for finding the proportion of data above, below, or between values in a normal distribution.

9
Q

Explain inferential statistics.

A

Inferential statistics make educated guesses about a population parameter based on a sample statistic, allowing conclusions beyond the immediate data.

10
Q

Differentiate between parameter and statistic.

A
  • Parameter: A characteristic of a population (e.g., population mean μ).
  • Statistic: A characteristic of a sample, used to estimate a parameter (e.g., sample mean).
11
Q

What is sampling error?

A

Sampling error is the difference between a sample statistic and the actual population parameter, often due to chance variation in sampling.

12
Q

Define confidence interval (CI) and confidence level.

A

A CI is an interval within which the population parameter is expected to fall, expressed with a specific level of confidence (e.g., 95%).
The confidence level is the long-run proportion of such intervals, calculated from repeated samples, that would contain the parameter (e.g., 95 out of every 100 intervals for a 95% level).

13
Q

How do you calculate a confidence interval?

A

For a mean, use x̅ ± z × SE where x̅ is the sample mean, z is the z-score for the confidence level, and SE is the standard error.

14
Q

What is the Central Limit Theorem (CLT)?

A

The CLT states that the sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the shape of the population distribution. As a rule of thumb, for sample sizes over about 30 the sample mean x̅ is approximately normally distributed.
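
A small simulation can make this concrete. The sketch below assumes a strongly right-skewed exponential population (mean 1, SD 1); the population, the seed, the sample sizes and the number of repeats are all illustrative choices, not part of the card.

```python
import math
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so repeated runs give the same numbers

POP_MEAN = 1.0   # an exponential population with rate 1 has mean 1 and SD 1
POP_SD = 1.0
REPEATS = 2000   # number of samples drawn for each sample size

def sample_mean(n):
    """Mean of n draws from the right-skewed exponential population."""
    return sum(random.expovariate(1.0) for _ in range(n)) / n

results = {}
for n in (2, 30, 200):
    means = [sample_mean(n) for _ in range(REPEATS)]
    se = POP_SD / math.sqrt(n)  # theoretical standard error
    # Share of sample means lying within 1.96 SEs of the population mean
    share = sum(abs(m - POP_MEAN) <= 1.96 * se for m in means) / REPEATS
    results[n] = (mean(means), stdev(means), share)
    print(f"n={n:>3}  mean={results[n][0]:.3f}  "
          f"SD={results[n][1]:.3f}  within 1.96 SE={share:.3f}")
```

The sample means stay centred on the population mean and their spread shrinks as n grows. For the larger samples, the share within 1.96 SEs should come out close to the 95% a normal distribution predicts.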

15
Q

Why is the Central Limit Theorem important?

A

It enables the use of normal probabilities for sample means, even if the population is not normally distributed, which makes confidence interval construction and hypothesis testing possible. The mean of the sampling distribution of the mean is an unbiased estimate of the population mean.

16
Q

What is a standard error (SE)?

A

SE is the standard deviation of the sampling distribution of the sample mean, indicating how much a sample mean typically varies from the population mean. SEs express the closeness of a statistic to a parameter and are used to construct confidence intervals. For example, 95% of the time a sample mean should lie within 1.96 SEs of the true mean. We can work out these intervals because we can calculate both the mean and the SE from the sample (the SE is derived from the sample SD).
Note: Averages are less variable than individual observations (SE is smaller than SD by a factor of √n). Sampling variability decreases with larger sample sizes.

17
Q

How does sample size affect the standard error?

A

As the sample size increases, the standard error decreases, meaning the sample mean is likely closer to the population mean.
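
A quick numerical sketch of this relationship, assuming SE = SD / √n and a hypothetical SD of 10:

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 10.0  # hypothetical standard deviation
for n in (25, 100, 400):
    print(n, standard_error(sd, n))
# 25 2.0
# 100 1.0
# 400 0.5
```

Because of the square root, the sample size has to be quadrupled to halve the SE.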

18
Q

What is the z-value for significance set at p < 0.05?

A

z = ±1.96 for a two-tailed test: P(Z > 1.96) = 0.025, so the two tails together hold 0.05 (the same value is used for a 95% CI).
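
The value 1.96 can be recovered without a table. This sketch uses `NormalDist.inv_cdf` from the standard library and assumes a two-tailed split of 0.05, i.e. 0.025 in each tail.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

alpha = 0.05
# Leave alpha / 2 in the upper tail, so look up the 97.5th percentile
z_crit = std_normal.inv_cdf(1 - alpha / 2)
upper_tail = 1 - std_normal.cdf(1.96)

print(round(z_crit, 2))      # 1.96
print(round(upper_tail, 3))  # 0.025
```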

19
Q

What factors affect the width of a confidence interval?

A
  • Confidence level: Higher confidence increases width.
  • Standard error: Higher SE widens the CI.
  • Sample size: Larger samples narrow the CI.
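
These effects can be seen by computing the full width, 2 × z × SE, for a few cases. The SD of 10 and the sample sizes are hypothetical, and `ci_width` is only an illustrative helper.

```python
import math
from statistics import NormalDist

def ci_width(sd, n, confidence):
    """Full width of a z-based confidence interval for a mean."""
    # Two-sided interval: put (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sd / math.sqrt(n)

sd = 10.0  # hypothetical standard deviation
for n, level in ((100, 0.90), (100, 0.95), (100, 0.99), (400, 0.95)):
    print(f"n={n}, level={level:.0%}: width={ci_width(sd, n, level):.2f}")
```

With n fixed at 100, the width grows as the confidence level rises from 90% to 99%. At the 95% level, moving from n = 100 to n = 400 halves the width.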
20
Q

Describe point estimate vs. interval estimate.

A

Point estimate: A single best guess for a parameter (e.g., sample mean).
Interval estimate: A range within which the parameter likely falls, adding context to the point estimate with a margin of error.

21
Q

How do you interpret a 95% confidence interval?

A

A 95% CI means that if we took many samples, 95% of the intervals calculated would contain the true population mean.

22
Q

How does skewness affect normality assumptions?

A

Skewed data violate normality; logarithmic or square root transformations may help approximate normality for statistical analysis.
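
As a rough sketch of the idea, the code below draws hypothetical right-skewed data from a log-normal distribution and compares a simple skewness measure before and after taking logs. The data, the seed and the `skewness` helper are assumptions made for illustration only.

```python
import math
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

def skewness(values):
    """Mean cubed z-score: about 0 for symmetric data, positive for right skew."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

# Hypothetical right-skewed data
raw = [random.lognormvariate(0, 1) for _ in range(5000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_logged:.2f}")
```

Here the log of a log-normal variable is exactly normal, so the transformation works unusually well. Real data will usually only move closer to symmetry.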

23
Q

Explain the sampling distribution of a proportion.

A

For large samples, the sampling distribution of the sample proportion is approximately normal, centred at the population proportion, with SE = √(p(1 − p)/n), where p is the population proportion and n is the sample size.
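
A minimal sketch of the SE calculation for a proportion, with a hypothetical p of 0.40 and n of 600:

```python
import math

def proportion_se(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    return math.sqrt(p * (1 - p) / n)

se = proportion_se(0.40, 600)  # hypothetical proportion and sample size
print(f"SE = {se:.3f}")  # SE = 0.020
```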

24
Q

Imagine a z distribution and a z-score of 1.282. Find the following areas under the curve:
i. Area between the mean and z = 1.282
ii. Area below the mean
iii. Area below z = 1.282
iv. Area above 1.282

A

Round z to 1.28 and look up the cell for row 1.2 and column 0.08 in the probability table, which gives 0.8997. Now we can calculate the areas.
i. 0.8997 - 0.5 = 0.3997
ii. 0.5
iii. 0.8997
iv. 1 - 0.8997 = 0.1003
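
The same four areas can be reproduced without a table. This sketch uses `NormalDist.cdf` from the standard library, with z rounded to 1.28 as in the table lookup.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1
z = 1.28  # a printed table only resolves z to two decimal places

below_z = std_normal.cdf(z)
below_mean = std_normal.cdf(0)

print(f"i.   mean to z:  {below_z - below_mean:.4f}")  # 0.3997
print(f"ii.  below mean: {below_mean:.4f}")            # 0.5000
print(f"iii. below z:    {below_z:.4f}")               # 0.8997
print(f"iv.  above z:    {1 - below_z:.4f}")           # 0.1003
```

Using z = 1.282 without rounding gives an area below z of about 0.90, so the table answers are slightly approximate.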

25
Q

What is the purpose of the standard normal distribution?

A

It is a mathematical construct, but we can use it to calculate the probability of a data point occurring by using z-scores (if the data are normally distributed).
If the data are approximately normally distributed, we can determine the percentile rank of any data point if we know the mean and the SD.
If the data are not normally distributed, we can often transform them mathematically so that they become approximately normal.

Note: Logarithm and square root transformations are the most common in skewed distributions.

26
Q

Example: A student gets 56 for an exam. The class mean is 52 with a SD of 3.7. How does the student stand in relation to the rest of the class?

A

z = (x − μ) / σ
(56 − 52) / 3.7 = 1.08
The student’s score is 1.08 SDs above the class mean.
Using the lookup table, a z-score of 1.08 means that 85.99% of scores were below that data point (i.e., below 56).
The student is in the top 14% of the class.
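
The example can be checked in a few lines. This sketch uses the figures from the card and `NormalDist` from the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

score, class_mean, class_sd = 56, 52, 3.7

z = (score - class_mean) / class_sd
# Table-style lookup: round z to two decimals before finding the area below it
below_table = NormalDist().cdf(round(z, 2))
# Same idea without rounding, using the class distribution directly
below_exact = NormalDist(class_mean, class_sd).cdf(score)

print(f"z = {z:.2f}")                                # z = 1.08
print(f"below, z rounded: {below_table:.4f}")        # 0.8599
print(f"below, z unrounded: {below_exact:.4f}")
print(f"top share: {1 - below_table:.0%}")           # top share: 14%
```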

27
Q

What do these symbols represent?
x̅, s, s², μ, σ, σ²

A

Sample statistic:
x̅ - mean
s - SD
s² - variance
Population parameter:
μ - mean
σ - SD
σ² - variance

28
Q

What is the formula for confidence intervals?

A

x̅ ± z × SE

29
Q

Example: You take a sample of the population of women in the paid UK labour force (n=949) and calculate their average income.
x̅ = £271.90, σ = £416.14. Go through the steps to estimate the ‘true’ mean.

A

SE = 416.14 / √949 = 13.51
1.96 SEs = 1.96 × 13.51 = £26.48
We can be 95% confident that the ‘true’ mean is within £26.48 of the sample mean (£271.90, 95% CI £245.42 to £298.38).
If we were to draw several independent samples from the same population, and calculate the 95% CI for each, on average 19 of every 20 such CIs would contain the true population mean.
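
The arithmetic in this example can be reproduced as follows. This is a sketch using the figures from the card; the variable names are illustrative.

```python
import math

n = 949
sample_mean = 271.90   # pounds
sd = 416.14            # pounds

se = sd / math.sqrt(n)   # standard error of the mean
margin = 1.96 * se       # half-width of the 95% CI
lower, upper = sample_mean - margin, sample_mean + margin

print(f"SE = {se:.2f}")                           # SE = 13.51
print(f"margin = {margin:.2f}")                   # margin = 26.48
print(f"95% CI = {lower:.2f} to {upper:.2f}")     # 95% CI = 245.42 to 298.38
```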

30
Q

What’s the formula to calculate SE?

A

SE = σ / √n
σ = SD
n = sample size

31
Q

What would happen if we took many repeated random samples from the population and calculated a 95% CI for mean cholesterol from each?

A

In the long run, 95% of these CIs will contain the population mean cholesterol.
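
A simulation sketch of this long-run behaviour is given below. The population values (mean 5.0, SD 1.0, in arbitrary cholesterol units), the sample size, the seed and the number of repeats are all hypothetical.

```python
import math
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 5.0, 1.0   # hypothetical population values
N, REPEATS = 50, 2000         # sample size and number of repeated samples

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    m = mean(sample)
    se = stdev(sample) / math.sqrt(N)  # SE estimated from the sample SD
    # Count the interval if it contains the true population mean
    if m - 1.96 * se <= POP_MEAN <= m + 1.96 * se:
        covered += 1

coverage = covered / REPEATS
print(f"{coverage:.1%} of the intervals contain the population mean")
```

The proportion should land close to 95%. It can fall slightly short because the SE is estimated from each sample while the multiplier stays at 1.96.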