Statistics Flashcards

1
Q

Population mean (Formula)

A
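The formula on this card seems to have been an image that did not survive. A standard form, assuming a population of N units with values y_1, … , y_N (this notation is an assumption, not taken from the card):

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} y_i
```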
2
Q

Population variance (Formula)

A
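The formula on this card seems to have been an image. One common form, assuming a population of N units with values y_1, … , y_N and population mean μ (assumed notation):

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2
```

Finite-population sampling texts often use S^2 with divisor N - 1 instead; which version the card showed cannot be recovered from the text.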
3
Q

Sample mean (Formula)

A
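The formula itself seems to have been an image. A standard form, assuming a sample of n units with values y_1, … , y_n (assumed notation):

```latex
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
```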

The sample mean estimator is unbiased

4
Q

Sample variance - unbiased (Formula)

A
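The formula on this card seems to have been an image. The usual unbiased form divides the sum of squared deviations by n - 1. A minimal sketch, with made-up data and hypothetical function names:

```python
import statistics


def sample_mean(xs):
    """Arithmetic mean of the sample values."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Unbiased sample variance: squared deviations divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (n - 1)


data = [2.0, 4.0, 4.0, 5.0, 7.0]  # illustrative values only
s2 = sample_variance(data)
```

The standard library's `statistics.variance` uses the same n - 1 divisor, so it can serve as a cross-check.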
5
Q

Generalisability (Def)

A

Results from statistical inference are generalisable when estimates obtained from a sample reflect the corresponding parameter of the target population

6
Q

Sampling distribution (Def)

A

If we take several samples from a population, the sample estimates will differ due to sampling variation. The sampling distribution is the distribution of these sample estimates

7
Q

Standard error (Def)

A

The standard deviation of the sampling distribution measures the sampling variation and is called the standard error
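As an illustration for one specific estimator: the standard error of the sample mean is usually estimated by s / sqrt(n), where s is the sample standard deviation. This sketch assumes simple random sampling from a large population; the data and function name are made up.

```python
import math
import statistics


def standard_error_of_mean(xs):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(xs) / math.sqrt(len(xs))


data = [2.0, 4.0, 4.0, 5.0, 7.0]  # illustrative values only
se = standard_error_of_mean(data)
```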

8
Q

Sampling methods - non-probability (Types)

A
  • Convenience
  • Systematic
  • Purposive
  • Quota
9
Q

Sampling methods - probability (Types)

A
  • Random
  • Cluster
  • Stratified
10
Q

Convenience sampling

A

A type of non-probability sampling.

Sampling based on how convenient the subjects are to find.

Pro:

  • Affordable, easy and quick
  • Works ok if the population is homogeneous

Con:
  • Not representative if the population is heterogeneous

11
Q

Purposive sampling

A

A type of non-probability sampling.

The researcher relies on their own knowledge when choosing members of the population.

Pro:
  • Beneficial when we want to access a subset of the population

Con:

  • Requires domain knowledge
  • Might not be representative
12
Q

Quota sampling

A

A type of non-probability sampling.

Tailors the sample to be in proportion to some known characteristic of the population.

Pro:

  • Affordable, easy and quick
  • Accounts for differences in groups (strata)

Con:

  • Selection bias if units within each quota are chosen by convenience
  • Needs prior knowledge to know the strata
13
Q

Systematic sampling

A

A type of non-probability sampling.

Sampling at regular intervals: one unit every k = N/n, where N is the population size and n is the sample size

Pro:
  • Spreads the sample across the whole population (i.e. more representative)

Con:

  • Needs knowledge of the whole population
  • The order of the units can cause systematic bias
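A minimal sketch of the procedure, taking the interval as k = N // n. The random starting point and the function name are assumptions of this sketch; the card does not say how the first unit is chosen.

```python
import random


def systematic_sample(units, n, rng=None):
    """Take n units at a fixed interval k = N // n after a random start."""
    rng = rng or random.Random()
    N = len(units)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the population size")
    k = N // n                 # sampling interval
    start = rng.randrange(k)   # assumed: random start in [0, k)
    return [units[start + j * k] for j in range(n)]


population = list(range(100))  # illustrative frame of 100 unit labels
sample = systematic_sample(population, 10, random.Random(0))
```

If the list is ordered in a pattern with period k, every sampled unit falls at the same point of the cycle, which is the systematic bias mentioned above.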
14
Q

Bias of an estimator (Formula)

A
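The formula on this card seems to have been an image. The standard definition, writing θ-hat for the estimator and θ for the true parameter value (assumed notation):

```latex
\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta
```

An estimator is unbiased when this quantity is zero.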
15
Q

Precision of an estimator (Formula)

A
16
Q

Mean Squared Error (Formula)

A
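The formula on this card seems to have been an image. The standard definition and its decomposition, with θ-hat the estimator and θ the true parameter value (assumed notation):

```latex
\operatorname{MSE}(\hat{\theta})
  = E\big[(\hat{\theta} - \theta)^2\big]
  = \operatorname{Var}(\hat{\theta}) + \big(E[\hat{\theta}] - \theta\big)^2
```

The second term is the squared bias, so for an unbiased estimator the MSE equals the variance.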
17
Q

Population total (Formula)

A
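The formula on this card seems to have been an image. A standard form, assuming a population of N units with values y_1, … , y_N and mean μ (the symbol τ is an assumption):

```latex
\tau = \sum_{i=1}^{N} y_i = N \mu
```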
18
Q

Variance of the sample mean, for finite populations (Formula)

A
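The formula on this card seems to have been an image. The usual textbook result, assuming simple random sampling without replacement of n units from N, with S^2 defined using divisor N - 1 (assumed notation):

```latex
\operatorname{Var}(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{S^2}{n},
\qquad
S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \mu)^2
```

The factor (1 - n/N) is the finite population correction; it tends to 1 as N grows relative to n.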
19
Q

Stratified sampling (Def)

A

If the population of interest is heterogeneous with respect to the characteristic (parameter) of interest, one sampling procedure that can increase PRECISION is Stratified sampling.

Def: Partitioning a population into non-overlapping groups and sampling within each group. Each group is called a STRATUM.

If the sampling is done randomly, it’s called Stratified random sampling

20
Q

Stratified sampling (Principles)

A

(1) Strata should be non-overlapping
(2) Strata should form a partition of the total pop
(3) Units within a stratum should be more similar to each other than to units in other strata, with respect to the characteristic of interest
(4) We should aim for homogeneity within strata, relative to the pop
(5) Success depends on the choice of characteristic used to partition the pop

21
Q

Sampling fraction of the stratum (Formula)

A
22
Q

Proportionate stratification (Def)

A

Each stratum is represented in the sample in proportion to its share of the population, i.e. n_i / n = W_i, where

W_i = N_i / N

is the proportion of the population within stratum i

23
Q

Strata estimator of the population mean (Formula)

A

The estimator is unbiased.

W_i = N_i / N

24
Q

Variance of the Strata estimator of the population mean (Formula)

A

Where:

W_i = N_i / N
f_i = n_i / N_i
S_i is the population standard dev WITHIN the stratum
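The formulas for the strata estimator and its variance seem to have been images. A sketch of the usual textbook forms under stratified random sampling, mean = Σ W_i * y-bar_i and variance = Σ W_i^2 * (1 - f_i) * S_i^2 / n_i, using the symbols defined above. Function names and numbers are made up for illustration.

```python
def stratified_mean(N_sizes, sample_means):
    """Strata estimator of the population mean: sum of W_i * ybar_i."""
    N = sum(N_sizes)
    return sum((N_i / N) * ybar_i for N_i, ybar_i in zip(N_sizes, sample_means))


def stratified_mean_variance(N_sizes, n_sizes, S):
    """Variance of the strata estimator: sum of W_i^2 (1 - f_i) S_i^2 / n_i."""
    N = sum(N_sizes)
    total = 0.0
    for N_i, n_i, S_i in zip(N_sizes, n_sizes, S):
        W_i = N_i / N      # stratum weight
        f_i = n_i / N_i    # sampling fraction of the stratum
        total += W_i ** 2 * (1 - f_i) * S_i ** 2 / n_i
    return total


# Two illustrative strata
N_sizes = [600, 400]         # stratum population sizes N_i
n_sizes = [60, 40]           # stratum sample sizes n_i
sample_means = [10.0, 20.0]  # stratum sample means
S = [2.0, 5.0]               # within-stratum standard deviations S_i

est = stratified_mean(N_sizes, sample_means)
var = stratified_mean_variance(N_sizes, n_sizes, S)
```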

25
Q

Cost of sampling

A

Where c_i is the cost of each unit in stratum i, and n_i is the sample size of stratum i

26
Q

Strat sampling - Proportional allocation (Formula)

A

Choosing each stratum’s sample size in proportion to the stratum’s share of the population size

27
Q

Strat sampling - Optimal (Neyman) allocation (Formula)

A

Note: requires you to know the pop. standard deviation of the stratum upfront!
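The allocation formulas seem to have been images. A sketch of the usual textbook forms: proportional allocation n_i = n * N_i / N, and Neyman allocation n_i = n * N_i * S_i / Σ N_j * S_j (which assumes the same cost per unit in every stratum). Function names and numbers are made up; results are left unrounded.

```python
def proportional_allocation(n, N_sizes):
    """n_i = n * N_i / N for each stratum."""
    N = sum(N_sizes)
    return [n * N_i / N for N_i in N_sizes]


def neyman_allocation(n, N_sizes, S):
    """n_i = n * N_i * S_i / sum_j(N_j * S_j), assuming equal unit costs."""
    weights = [N_i * S_i for N_i, S_i in zip(N_sizes, S)]
    total = sum(weights)
    return [n * w / total for w in weights]


N_sizes = [600, 400]  # illustrative stratum sizes
S = [2.0, 5.0]        # illustrative within-stratum standard deviations

prop = proportional_allocation(100, N_sizes)
neyman = neyman_allocation(100, N_sizes, S)
```

In this example the Neyman allocation shifts sample towards the smaller but more variable second stratum. In practice the n_i would be rounded to whole units.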

28
Q

Stratified sampling (When and Pros/Cons)

A

Use when:

  • The target population is heterogeneous
  • Subgroups (strata) can be defined
  • We want a sample to be representative of these groups

Pros:

  • Works well if a pop. contains a wide variety of characteristics that may be used to group units
  • Gives smaller standard errors and greater PRECISION vs. SRS (the more heterogeneity between strata, the greater precision)

Cons:
  • Need prior knowledge to form strata

29
Q

Cluster sampling (Def)

A

If it is not feasible to access all units, then Cluster sampling may be used.

Def: Cluster sampling is a sampling procedure in which the sampling units are clusters (groups of population units) rather than individual units.

Assumption: units within a cluster are heterogeneous, while the clusters are similar to one another

30
Q

Cluster sampling (Principles)

A

(1) Divide the pop. into natural clusters based on some rule (e.g. geographic area)
(2) Treat each cluster as a sampling unit
(3) Sample clusters based on a sampling method
(4) Collect all info from all sampling units within a cluster

31
Q

Cluster sampling (Pros and cons)

A

Pros:
  • Do not need to access all units of a population (e.g. if we have no info or it’s too expensive)

Cons:
  • Clusters may not reflect the true diversity of the population

32
Q

Constructing clusters - Equal size

A

If we sample n clusters from N equally sized clusters and y-bar_i is the sample mean in cluster i, the mean of the cluster means, (1/n) * Σ y-bar_i, is an unbiased estimator of the pop. mean y-bar

33
Q

Constructing clusters - unequal size

A
N = no. of clusters
M_i = units within cluster i
M = pop. size
M-bar = M/N, avg. number of units in clusters
n = no. of clusters sampled
y-bar*N = avg. value of y across clusters
34
Q

Summarising numerical data (3 Key aspects)

A

The distribution of a dataset can be summarised by:

  • Location: the centre of the distribution (e.g. mean, median)
  • Spread: variation/range of the data
  • Shape: shape of the distribution
35
Q

Median (Defs)

A

If there’s an even number of observations, it’s the average of the two middle values.

  • Range: max value minus min value (not robust, i.e. influenced by outliers)
  • LQ: at depth 1/4 * (n+1)
  • UQ: at depth 3/4 * (n+1)
  • IQR: UQ-LQ (robust)
36
Q

Common plots

A
  • Dotplot (discrete data)
  • Histogram (cont data)
  • Bar chart (categorical data)
  • Stem and leaf diagram (cont data)
  • Boxplot (cont data)
  • Scatterplot (compare 2 cont vars)
37
Q

Boxplot - Formula for outliers

A

Lower fence = LQ - 1.5 * IQR

UPPER fence = UQ + 1.5 * IQR
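The quartile depths and fences can be sketched as below. Linear interpolation for fractional depths is an assumption of this sketch, since the cards do not say how a non-integer depth is handled. The data and function names are made up.

```python
def value_at_depth(sorted_xs, depth):
    """Value at a 1-based, possibly fractional depth (linear interpolation)."""
    lo = int(depth)
    frac = depth - lo
    if lo < 1:
        return sorted_xs[0]
    if lo >= len(sorted_xs):
        return sorted_xs[-1]
    return sorted_xs[lo - 1] + frac * (sorted_xs[lo] - sorted_xs[lo - 1])


def boxplot_fences(xs):
    """Return (LQ, UQ, lower fence, upper fence) using depths p * (n + 1)."""
    s = sorted(xs)
    n = len(s)
    lq = value_at_depth(s, (n + 1) / 4)
    uq = value_at_depth(s, 3 * (n + 1) / 4)
    iqr = uq - lq
    return lq, uq, lq - 1.5 * iqr, uq + 1.5 * iqr


data = [1, 3, 4, 6, 7, 8, 9, 10, 12, 13, 40]  # illustrative values only
lq, uq, lower, upper = boxplot_fences(data)
outliers = [x for x in data if x < lower or x > upper]
```

Points outside the fences are the ones a boxplot draws individually as outliers.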

38
Q

Parameter vs. Statistic

A

Parameter: single number summarising the variable of interest in the population

Statistic: same as above, but within a sample. It’s a FUNCTION of the data in the sample

39
Q

P-value (Def)

A

It’s the probability of obtaining a value for your TEST STATISTIC that is at least as extreme as the observed value, assuming H_0 is true

40
Q

Interpretation of Confidence Intervals (95%)

A

Under repeated sampling and recalculation, 95% of CIs would contain the true population value

41
Q

Pivotal function (Def)

A

A Pivotal function is a function of the data, X, and a parameter of interest, θ, which when regarded as a r.v. calculated at θ_T (true value of θ), has a probability distribution whose form does not depend on any unknown parameter. We denote it by PIV(θ_T, X)

42
Q

Pivotal function for t-test

A
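The formula on this card seems to have been an image. The standard pivotal function for the mean μ of a normal sample X_1, … , X_n, with sample mean X-bar and sample standard deviation S (assumed notation):

```latex
\mathrm{PIV}(\mu, X) = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}
```

Its distribution is Student's t with n - 1 degrees of freedom whatever the values of μ and σ, which is what makes it pivotal.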
43
Q

Hypothesis testing framework

A

(1) Specify H_0 and H_1
(2) Define a test statistic (TS)
(3) Compute the observed value of the TS from the data

We reject H_0 in favour of H_1 when |t| (or TS) is too large to be consistent with H_0. For example, adopting a significance level of α=0.05, we reject H_0 if:

|t| > t_0.975(n-1)

This is a one-sample t-test
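A minimal sketch of these steps for a two-sided one-sample t-test of H_0: μ = 5.0, using only the standard library. The data are made up, and the critical value t_0.975(9) ≈ 2.262 is taken from standard t tables rather than computed.

```python
import math
import statistics


def one_sample_t(xs, mu0):
    """Observed t statistic: (sample mean - mu0) / (s / sqrt(n))."""
    n = len(xs)
    return (statistics.mean(xs) - mu0) / (statistics.stdev(xs) / math.sqrt(n))


data = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.2, 5.9, 5.5]  # illustrative
T_CRIT = 2.262  # t_0.975(n - 1) for n = 10, from tables

t_obs = one_sample_t(data, mu0=5.0)
reject_h0 = abs(t_obs) > T_CRIT
```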

44
Q

One-sample t-test (Assumptions)

A
  • Our data x_1, … , x_n have arisen from a normal distribution
  • Our data points are independent from one another
45
Q

Paired t-test

A

If the data come from the same individuals (e.g. 2 measurements), the independence assumption does not hold. If appropriate, we can take the difference of the measurements, reducing it to a One-sample test.

In this case, the differences are assumed to be independent of each other

46
Q

Non-parametric tests (Explanation)

A

Non-parametric tests make fewer assumptions (e.g. normality is not required), so they can be used instead of e.g. a t-test. For example, the Wilcoxon signed-rank test performs inference on the median

47
Q

Two-sample t-test (Assumptions)

A

(1) All values are independent
(2) The distribution of the variable of interest is the same in both populations, apart from possibly the mean (i.e. same variance)
(3) The values are normally distributed

48
Q

Two-sample t-test - Pivotal function (Formula)

A
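The formula on this card seems to have been an image. The standard equal-variance form, with sample sizes n_1 and n_2, sample means X-bar_1 and X-bar_2, and pooled standard deviation S_p (assumed notation):

```latex
\mathrm{PIV}(\mu_1 - \mu_2, X) =
\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}
     {S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
\sim t_{n_1 + n_2 - 2}
```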
49
Q

Two-sample t-test - Pooled variance and CI (Formula)

A

We can get a better estimate of the common σ by pooling the info from the two samples
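The formula seems to have been an image. A sketch of the usual pooled variance, s_p^2 = ((n_1 - 1) * s_1^2 + (n_2 - 1) * s_2^2) / (n_1 + n_2 - 2), and the resulting t statistic for H_0: μ_1 = μ_2. Data and function names are made up.

```python
import math
import statistics


def pooled_variance(x1, x2):
    """Weighted average of the two sample variances (weights n_i - 1)."""
    n1, n2 = len(x1), len(x2)
    s1_sq = statistics.variance(x1)
    s2_sq = statistics.variance(x2)
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


def two_sample_t(x1, x2):
    """Equal-variance two-sample t statistic for H_0: mu_1 = mu_2."""
    n1, n2 = len(x1), len(x2)
    sp = math.sqrt(pooled_variance(x1, x2))
    diff = statistics.mean(x1) - statistics.mean(x2)
    return diff / (sp * math.sqrt(1 / n1 + 1 / n2))


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative samples
x2 = [2.0, 4.0, 6.0, 8.0]

sp_sq = pooled_variance(x1, x2)
t_obs = two_sample_t(x1, x2)
```

A 95% CI for μ_1 - μ_2 would then be the difference in sample means plus or minus t_0.975(n_1 + n_2 - 2) times the same denominator.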

50
Q

Proportion test - CLT derivation

A

Suppose X ~ Bi(n, θ). Then X is the number of successes in n independent trials, each trial having probability θ of success.

We can use the Normal approx to the Binomial distribution (CLT):

X ~ N(nθ, nθ*(1-θ)), approximately

This approximates a discrete dist. with a continuous one, which introduces some error. To compensate, we use a continuity correction.

The CLT tells us that for sufficiently large n (e.g. n>=20, nθ>=5 and nθ*(1-θ)>=5) the test statistic of interest is:
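The test statistic seems to have been an image. A standard form for testing H_0: θ = θ_0 (the symbol θ_0 is an assumption), shown without and with the continuity correction; which version the card used cannot be recovered:

```latex
Z = \frac{X - n\theta_0}{\sqrt{n\theta_0(1 - \theta_0)}}
\;\overset{\text{approx.}}{\sim}\; N(0, 1),
\qquad
Z_{cc} = \frac{|X - n\theta_0| - \tfrac{1}{2}}{\sqrt{n\theta_0(1 - \theta_0)}}
```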

51
Q

Point estimators (Def and desirable properties)

A

A point estimate is a particular numeric value of the function t(x), obtained from a particular set of data x = (x_1, x_2, … , x_n).

Properties:

(1) Range of t(X) should be the same as the range of θ
(2) t(X) should be UNBIASED
(3) t(X) should be CONSISTENT
(4) MVUE - Minimum variance unbiased estimator

Note: the lower the variance, the more efficient the estimator

52
Q

Maximum likelihood estimation (Motivation)

A

Selecting the value of the parameter θ of a chosen probability distribution under which our given set of observations has maximum probability.

Given observed data and an assumed probability model, we want to find the estimates of the population parameters that maximise the likelihood of the observed data under that model

53
Q

MLE (Def)

A

The MLE, θ-hat, is the value of θ (the parameter of the assumed probability distribution) that maximises the Likelihood function L(θ; x)

54
Q

Likelihood function - Discrete (Formula)

A
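The formula on this card seems to have been an image. The standard form, assuming independent observations x_1, … , x_n from a discrete distribution with parameter θ:

```latex
L(\theta; x) = \prod_{i=1}^{n} P(X_i = x_i; \theta)
```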
55
Q

Log-Likelihood function (Formula)

A

It’s often easier algebraically to work with the natural log of the L function: since ln(x) is a strictly increasing function, l(θ) reaches its maximum at the same value of θ as L(θ; x)
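The formula itself seems to have been an image. The standard form, assuming independent observations with probability mass or density function f(x_i; θ) (assumed notation):

```latex
l(\theta) = \ln L(\theta; x) = \sum_{i=1}^{n} \ln f(x_i; \theta)
```

The product in L becomes a sum in l, which is what makes differentiation easier.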

56
Q

Likelihood function - Continuous (Formula)

A

For continuous distributions, the PDF of X_i evaluated at x_i does not directly represent the probability of the data. However, it is approximately PROPORTIONAL to the probability that X_i lies in a small interval around x_i, so it’s reasonable to take the likelihood of the parameter θ to be:
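The formula seems to have been an image. The standard form, assuming independent observations with PDF f(x_i; θ) (assumed notation):

```latex
L(\theta; x) = \prod_{i=1}^{n} f(x_i; \theta)
```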

57
Q

MLE (Steps)

A

(1) Evaluate the Likelihood function L(θ; x)
(2) Obtain the Log-Likelihood function l(θ)
(3) Differentiate wrt θ and set l’(θ) = 0
(4) Solve for θ
(5) Verify it’s a maximum, i.e. l’’(θ-hat) < 0
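As a worked illustration of the steps (an example chosen here, not taken from the card), take x_1, … , x_n independent Bernoulli(θ) observations and write s for the number of successes:

```latex
\begin{aligned}
L(\theta; x) &= \theta^{s} (1 - \theta)^{n - s}, \qquad s = \sum_{i=1}^{n} x_i \\
l(\theta) &= s \ln\theta + (n - s) \ln(1 - \theta) \\
l'(\theta) &= \frac{s}{\theta} - \frac{n - s}{1 - \theta} = 0
  \;\Longrightarrow\; \hat{\theta} = \frac{s}{n} \\
l''(\theta) &= -\frac{s}{\theta^{2}} - \frac{n - s}{(1 - \theta)^{2}} < 0
\end{aligned}
```

The second derivative is negative for 0 < s < n, so θ-hat = s/n is a maximum.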

58
Q

Newton-Raphson method (Def)

A

When it’s not possible to find the root of l’(θ) = 0 in closed form, we can use the Newton-Raphson method to find the MLE numerically
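A minimal sketch of the iteration θ_new = θ - l’(θ) / l’’(θ). It is shown on a Poisson sample, where the MLE has a closed form (the sample mean), purely so that the numerical answer can be checked. Function names, data and the starting value are made up.

```python
def newton_raphson(score, score_deriv, theta0, tol=1e-10, max_iter=100):
    """Find a root of score(theta) = 0 by repeated Newton-Raphson steps."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / score_deriv(theta)
        theta -= step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")


counts = [2, 3, 4, 3, 3]  # illustrative Poisson counts
n, total = len(counts), sum(counts)


def score(lam):
    """l'(lambda) for a Poisson sample: sum(x) / lambda - n."""
    return total / lam - n


def score_deriv(lam):
    """l''(lambda) for a Poisson sample: -sum(x) / lambda**2."""
    return -total / lam ** 2


lam_hat = newton_raphson(score, score_deriv, theta0=1.0)
```

Convergence depends on the starting value; a poor start can make the iteration diverge, so a rough initial estimate is usually worth computing first.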

59
Q

Efficiency (Def)

A

An UNBIASED estimator is said to be efficient if it has the minimum possible variance; the efficiency of an unbiased estimator is the ratio of the minimum possible variance to the variance of the estimator