Statistics Theory L10 = Assumptions Of T-Distributions Flashcards

1
Q

Assumptions of t-distributions? (5)

A
  • Random sampling.
  • Independent observations.
  • Normality of the population distribution.
  • Unknown population variance.
  • Degrees of freedom (df) are based on the sample size.
2
Q

Aspects of the performance of t-distributions when assumptions are not met (in real-world situations)? (2)

A
  • Robustness of the 2-sample t-tools.
  • Resistance of the 2-sample t-tools.
3
Q

Robustness of the 2-sample t-tools?

A

= a statistic is robust to departures from particular assumptions if it remains valid even when those assumptions are not met.

4
Q

Valid?

A

= if the uncertainty measure is approximately equal to the stated rate (eg, 95% CIs capture the true value 95% of the time).

5
Q

2 departures that t-distributions have to be robust to?

A
  • Departures from normality.
  • Departures from independence.
6
Q

Departures from normality?

A

Handled by the Central Limit Theorem: for large n (sample size), the sampling distribution of the mean is approximately normal even if the population distribution is not.
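
A minimal simulation sketch of this idea (hypothetical, not from the course): sample means drawn from a heavily skewed exponential population still centre on the population mean with the spread σ/√n that the theory predicts.

```python
import random
import statistics

# Illustrative sketch: draw repeated samples from a skewed (exponential)
# population and look at the distribution of the sample means. By the CLT,
# for large n the sample means cluster symmetrically around the population
# mean (here 1.0), even though the population itself is right-skewed.
random.seed(42)  # fixed seed so the sketch is reproducible

n = 40       # size of each sample
reps = 2000  # number of repeated samples
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

centre = statistics.mean(sample_means)   # near the population mean, 1.0
spread = statistics.stdev(sample_means)  # near sigma/sqrt(n) = 1/sqrt(40) ~ 0.158
```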

7
Q

The Central Limit Theorem is based on statistical theory and says 4 things?

A

If you have:

  • 2 populations/groups with similar SDs (σ1 = σ2), similar distribution shapes & n1 = n2: validity is affected by long tails & little affected by skewness.
  • 2 populations/groups with similar SDs (σ1 = σ2), similar distribution shapes but n1 ≠ n2: validity is moderately affected by long tails & substantially affected by skewness.
  • If skewness differs considerably between the groups, results from the t-tools will be severely misleading.
  • Equal sample sizes = little effect of differing σ’s; unequal sample sizes = larger effect of differing σ’s, so intervals capture more or fewer true values than the nominal rate.
8
Q

Departures from independence?

A

= are due to either the cluster effect or the serial effect or both.

9
Q

Cluster effect?

A

= when observations that naturally occur in groupings tend to be more similar to each other than to observations in other groups.

10
Q

Eg of Cluster effect?

A

Siblings, family.

11
Q

Serial effect?

A

= when observations made closer together in time or space tend to be more similar.

  • Similar to autocorrelation.
12
Q

Consequences of both effects? (3)

A
  • We overestimate n: evidence for a difference appears stronger than it really is.
  • SEs & 95% CIs are too narrow.
  • t-ratios are too large & p is too small.
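
These consequences can be sketched with a small simulation (hypothetical set-up, not from the course): when observations share a cluster-level component, the naive SE of the mean understates the true variability of the sample mean.

```python
import math
import random
import statistics

# Sketch of the cluster effect: observations within a cluster share a common
# component, so they are not independent. The naive SE of the mean
# (s / sqrt(n)) then understates the real variability of the sample mean,
# which is why t-ratios come out too large and p-values too small.
random.seed(1)

def clustered_sample(n_clusters=10, per_cluster=10):
    data = []
    for _ in range(n_clusters):
        cluster_mean = random.gauss(0, 1)  # component shared within the cluster
        data.extend(cluster_mean + random.gauss(0, 1) for _ in range(per_cluster))
    return data

reps = 500
means, naive_ses = [], []
for _ in range(reps):
    data = clustered_sample()
    means.append(statistics.mean(data))
    naive_ses.append(statistics.stdev(data) / math.sqrt(len(data)))

true_spread = statistics.stdev(means)      # actual variability of the mean
avg_naive_se = statistics.mean(naive_ses)  # what the t-tools would report
# true_spread comes out clearly larger than avg_naive_se: the SE is too narrow.
```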
13
Q

Resistance of 2-sample t-tools?

A

= deals with outliers & resistant statistics.

14
Q

Outlier?

A

= observation that is far from the average of the group.

15
Q

Outlier attributes? (2)

A
  • Produces long tails (t-test is unreliable).
  • Could be caused by contamination.
16
Q

Resistant statistic?

A

= value that doesn’t change much when a small part of the data changes, perhaps drastically.

17
Q

Eg of a resistant statistic?

A

The median (unlike the mean, it barely changes when a few extreme values change).

18
Q

NB for these values? (3)

A
  • T-tools based on averages are not resistant to these sorts of extreme values.
  • We want the averages to be good representations of the groups.
  • We don’t want 1-2 values to drive the outcome.
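
A tiny numerical sketch of the point above (made-up data): one wild value drags the mean a long way but leaves the median untouched.

```python
import statistics

# Resistance in miniature: a single outlier shifts the mean drastically but
# barely moves the median, which is why t-tools (built on averages) are not
# resistant to outliers.
clean = [1, 2, 3, 4, 5]
contaminated = [1, 2, 3, 4, 100]  # one outlier replaces the 5

mean_shift = statistics.mean(contaminated) - statistics.mean(clean)        # 19
median_shift = statistics.median(contaminated) - statistics.median(clean)  # 0
```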
19
Q

Strategies for the 2-sample problems (i.e., robustness & resistance)? (4)

A
  • Consider serial or cluster effects. Groups? Repeated measures? Spatial/temporal dependence?
  • Use plots, compare samples, i.e., evaluate the suitability of the t-tools.
  • Consider transformations.
  • Consider alternative methods that don’t depend on normality.
20
Q

How do we deal with outliers? (2)

A

(1) Data recording/entry error?

(2) Is the outlier genuinely weird?

21
Q

Data recording/entry error? (2)

A
  • If on paper: cross out & correct (don’t erase or Tipp-Ex).
  • If electronic: make a new data file & correct the mistakes there.
22
Q

Is the outlier genuinely weird? (2)

A
  • Use resistant analyses.

OR

  • Provide a defensible reason to leave in or drop the outlier.
23
Q

Eg of how to deal with outliers?

A

Agent Orange example.

24
Q

Agent Orange example? (6)

A

(1) Identify outliers visually. Remove outliers one-at-a-time & replot after each removal.

(2) Compute relevant statistical measures (p-values in this case) with & without the outliers: full dataset, case 646 removed, & cases 645 and 646 removed:

  • Full dataset: p = 0.40.
  • 1 case removed: p = 0.48.
  • 2 cases removed: p = 0.54.

Therefore, no change in conclusion with or without the outliers.

(3) If conclusion is not affected by the outliers, report analysis with full data set.

(4) But examine outliers carefully to see what else can be learned.

(5) If analyses give different answers? Examine the suspects & consider what to do with those cases.

(6) Do they belong to the statistical population?

  • Yes: report without suspects, report reason for dropping suspects.
  • No: use resistant analyses, report both analyses (don’t use t-tests).
25
Q

Transformation of data?

A

= taking the natural log (ln) of data.

26
Q

When is the transformation of data useful? (2)

A

Useful if:

  • The ratio of the max to the min value in the data is >10.
  • Both groups are skewed & the group with the larger average also has the larger spread (i.e., small mean = small spread; large mean = large spread).
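
The first rule of thumb above can be captured in a small helper (hypothetical, not from the course); it assumes all values are positive, as logs require.

```python
# Hypothetical helper encoding the card's rule of thumb: a log transformation
# is worth considering when max/min > 10. Assumes positive values.
def log_transform_worthwhile(data):
    return min(data) > 0 and max(data) / min(data) > 10

rainfall = [1.2, 4.7, 36.6, 160.0, 489.1]   # made-up, heavily skewed values
symmetric = [9.8, 10.1, 10.4, 10.0, 9.9]    # made-up, tightly clustered values

# rainfall triggers the rule (ratio ~ 408 > 10); symmetric does not (~1.06).
```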
27
Q

Why use ln for a graph?

A

We use the log to make skewed data more symmetric / nicely distributed.

  • Afterward, carry on using the t-tools (on the transformed data).
28
Q

Transformation of data: Randomised experiments.

What’s the goal of transformation of data in Randomised experiments?

A

To estimate a treatment effect δ in Y2 = Y1 + δ.

29
Q

Transformation of data: Randomised experiments attributes? (3)

A
  • Multiplicative treatment effect.
  • If we need to log-transform our data, we end up with log(Y2) = log(Y1) + δ.
  • Interpreted as: Treatment 1 gives outcome Y1; Treatment 2 gives outcome Y1·e^δ.
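
A one-line numerical check of this back-transformation logic (the values are made up): an additive effect δ on the log scale is a multiplicative effect e^δ on the original scale.

```python
import math

# Additive on the log scale == multiplicative on the original scale.
y1 = 50.0     # hypothetical outcome under treatment 1
delta = 1.14  # hypothetical additive effect on the log scale

y2 = math.exp(math.log(y1) + delta)  # outcome under treatment 2
ratio = y2 / y1                      # equals e^delta up to float rounding
```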
30
Q

Transformation of data: Randomised experiments.

What would the interpretation be?

A

Suppose z = log(Y); the estimated response of an experimental unit to treatment 2 will be exp(ẑ2 - ẑ1) times as large as its response to treatment 1 (a multiplicative change instead of a difference).

31
Q

Transformation of data: Randomised experiments NB?

A

When talking about the outcome, you need to back-transform & interpret the estimate on the original scale.

32
Q

Eg of Transformation of data: Randomised experiments?

A

Cloud seeding & log transformation example.

33
Q

Cloud seeding & log transformation example? (4)

A

[1] Log-transform rainfall; inspect data (first 5 rows).

[2] Use the 2-sample t-tools on log(rainfall)

(i) Difference calculated with log-transformed data (Ŷ2 - Ŷ1):

Ŷ2 - Ŷ1 = 5.13 - 3.99 = 1.14.

(ii) Pooled SD (sp):

sp = √{[(26-1)(1.60²) + (26-1)(1.64²)] / (26+26-2)}

= 1.6208.

(iii) SE of the estimates:

SE (Ŷ2 - Ŷ1) = 1.6208 √[(1/26) + (1/26)]

= 0.4495.

(iv) t-statistic & p-value:

t = 1.14 / 0.4495 = 2.5444.

From R: 1 - pt (2.5444, 50) = 0.007 (1-sided p-value).

[3] Back-transform estimate & CI:

(i) qt(0.975, 50) = 2.0085.

(ii) 95% CI:
1.14 ± (2.0085)(0.4495)
= (0.2409, 2.0467).

(iii) exp (1.14) = 3.14.

(iv) exp (0.2409, 2.0467)
= (exp (0.2409), exp (2.0467))
= (1.27, 7.74).

[4] State the conclusions on the original scale:

There is convincing evidence that seeding increased rainfall (1-sided p = 0.007). Volume of rainfall produced by a seeded cloud is estimated to be 3.14 (95% CI: 1.27, 7.74) times as large as an unseeded cloud. Scope of inference?
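
The arithmetic in steps [2]-[3] can be re-run as a sketch. Note it uses only the rounded summary statistics shown on the card (means 5.13 & 3.99, SDs 1.60 & 1.64, n = 26 per group), so results differ from the card's figures in the second or third decimal, where the card carries unrounded intermediates.

```python
import math

# Re-running the card's two-sample t arithmetic on log(rainfall), using its
# rounded summary statistics.
n1 = n2 = 26
diff = 5.13 - 3.99  # difference of log-scale means

# Pooled SD, SE of the difference, and t-statistic.
sp = math.sqrt(((n1 - 1) * 1.60**2 + (n2 - 1) * 1.64**2) / (n1 + n2 - 2))
se = sp * math.sqrt(1 / n1 + 1 / n2)
t = diff / se

# 95% CI on the log scale, then back-transform to a ratio of rainfall volumes.
t_crit = 2.0085  # qt(0.975, 50), taken from the card
ci_log = (diff - t_crit * se, diff + t_crit * se)
ratio = math.exp(diff)
ci_ratio = (math.exp(ci_log[0]), math.exp(ci_log[1]))
# Roughly: sp ~ 1.62, t ~ 2.54, ratio ~ 3.13, CI ~ (1.27, 7.71).
```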

34
Q

Transformation of data: Observational study attributes? (6)

A
  • Interpret by using the ratio of population medians.
  • mean(logY2) - mean(logY1) ≠ log(mean(Y2)) - log(mean(Y1)).
  • Cannot back-transform a log estimate & get the correct mean.

so,

  • We have to use the median, as its value is preserved when switching between the “natural scale” & the “log scale”.
  • If z = log(Y): ẑ2 - ẑ1 estimates log[median(Y2) / median(Y1)], and exp(ẑ2 - ẑ1) estimates median(Y2) / median(Y1).
  • Interpreted in terms of medians rather than means.
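
A small sketch of why the median survives the log transformation while the mean does not (made-up data): log is monotone, so the middle value stays the middle value.

```python
import math
import statistics

# Medians commute with log; means do not.
y = [1.0, 2.0, 4.0, 8.0, 16.0]  # made-up positive, skewed data
log_y = [math.log(v) for v in y]

median_commutes = math.isclose(statistics.median(log_y),
                               math.log(statistics.median(y)))  # True
mean_commutes = math.isclose(statistics.mean(log_y),
                             math.log(statistics.mean(y)))      # False
```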
35
Q

Transformation of data: Observational study.

What would the interpretation be?

A

Median for population 2 is exp (ẑ2 - ẑ1) times as large as the median for population 1.

36
Q

Other transformations besides log/natural log? (4)

A
  • √Y
  • 1/y
  • arcsin (√Y)
  • log(y / (1 - y))
37
Q

√Y transformation?

A

= used for counts.

38
Q

1/y transformation?

A

= used for time-to-event data, to transform it to a rate or speed.

39
Q

arcsin (√Y) transformation?

A

= useful for proportions (as they are bounded & don’t follow a nice distribution).

40
Q

log(y / (1 - y)) transformation?

A

= also useful for proportions (the logit transformation).
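
The four alternative transformations above can be collected as small helpers (hypothetical names, not from the course); each assumes its input lies in the valid domain.

```python
import math

# Hypothetical helpers for the four alternative transformations on the cards.
def sqrt_transform(y):        # counts (y >= 0)
    return math.sqrt(y)

def reciprocal_transform(y):  # time-to-event -> rate or speed (y != 0)
    return 1 / y

def arcsine_sqrt(y):          # proportions (0 <= y <= 1)
    return math.asin(math.sqrt(y))

def logit(y):                 # proportions (0 < y < 1)
    return math.log(y / (1 - y))

# e.g. logit(0.5) == 0.0, reciprocal_transform(2.0) == 0.5
```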