Statistics Flashcards
Population mean (Formula)
Population variance (Formula)
Sample mean (Formula)
The sample mean estimator is unbiased
Sample variance - unbiased (Formula)
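The formula sides of these cards aren't shown in the text; the standard definitions they presumably refer to are:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
```

The n-1 divisor (Bessel's correction) is what makes s^2 unbiased, i.e. E[s^2] = sigma^2.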
Generalisability (Def)
Results from statistical inference are generalisable when estimates obtained from a sample are reflective of the target population’s parameter
Sampling distribution (Def)
If we take several samples from a population, the sample estimates will differ due to sampling variation. The sampling distribution is the distribution of the sampling estimates
Standard error (Def)
The standard deviation of the sampling distribution is a measure of the sampling variation; it is called the standard error
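For the sample mean, the standard error takes the familiar form (sigma the population standard deviation, estimated by s when unknown):

```latex
\mathrm{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}
```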
Sampling methods - non-probability (Types)
- Convenience
- Systematic
- Purposive
- Quota
Sampling methods - probability (Types)
- Random
- Cluster
- Stratified
Convenience sampling
A type of non-probability sampling.
Sampling based on how convenient the subjects are to find.
Pro:
- Affordable, easy and quick
- Works ok if the population is homogeneous
Con:
- Not representative if the population is heterogeneous
Purposive sampling
A type of non-probability sampling.
The researcher relies on their own knowledge when choosing members of the population.
Pro:
- Beneficial when we want to access a subset of the population
Con:
- Requires domain knowledge
- Might not be representative
Quota sampling
A type of non-probability sampling.
Tailors the sample to be in proportion to some known characteristic of the population.
Pro:
- Affordable, easy and quick
- Accounts for differences in groups (strata)
Con:
- Selection bias if convenience sampling is used to fill the quotas
- Needs prior knowledge to know the strata
Systematic sampling
A type of non-probability sampling.
Sampling at regular intervals, taking one unit every k = N/n (N the population size, n the sample size)
Pro:
- Can extend the sampling procedure to the whole population (i.e. more representative)
Con:
- Needs knowledge of the whole population
- The order of the units can cause systematic bias
Bias of an estimator (Formula)
Precision of an estimator (Formula)
Mean Squared Error (Formula)
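The formula sides of these three cards are presumably the standard ones (theta_T the true parameter value; precision is usually measured via the estimator's variance, or its reciprocal):

```latex
\mathrm{Bias}(t) = E[t(X)] - \theta_T
\qquad
\mathrm{Var}(t) = E\big[(t(X) - E[t(X)])^2\big]
```

```latex
\mathrm{MSE}(t) = E\big[(t(X) - \theta_T)^2\big] = \mathrm{Var}(t) + \mathrm{Bias}(t)^2
```

The decomposition shows that a small amount of bias can be worth accepting if it buys a large reduction in variance.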
Population total (Formula)
Variance of the sample mean, for finite populations (Formula)
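These two cards presumably refer to the standard results (S^2 the population variance with divisor N-1):

```latex
\tau = \sum_{i=1}^{N} x_i = N\mu, \qquad \hat{\tau} = N\bar{x}
```

```latex
\mathrm{Var}(\bar{x}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right)
```

The factor (1 - n/N) is the finite population correction: it vanishes as the sample approaches the whole population, since then there is no sampling variation left.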
Stratified sampling (Def)
If the population of interest is heterogeneous with respect to the characteristic (parameter) of interest, one sampling procedure that can increase PRECISION is Stratified sampling.
Def: Partitioning a population into non-overlapping groups and sampling within each group. Each group is called a STRATUM.
If the sampling is done randomly, it’s called Stratified random sampling
Stratified sampling (Principles)
(1) Strata should be non-overlapping
(2) Strata should form a partition of the total pop
(3) Units within a stratum should be more similar to each other than to units in other strata, with respect to the characteristic of interest
(4) We should aim for homogeneity within strata, relative to the pop
(5) Success depends on the choice of characteristic used to partition the pop
Sampling fraction of the stratum (Formula)
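Presumably the standard definition, consistent with the f_i used in the variance card below:

```latex
f_i = \frac{n_i}{N_i}
```

where n_i is the sample size taken from stratum i and N_i is the stratum size.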
Proportionate stratification (Def)
Each stratum is represented in the sample in proportion to (see pic)
W_i = N_i / N
is the proportion of the Pop. within stratum i
Strata estimator of the population mean (Formula)
The estimator is unbiased.
W_i = N_i / N
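The estimator itself is presumably the weighted average of the stratum sample means (x-bar_i the sample mean in stratum i, k the number of strata):

```latex
\bar{x}_{st} = \sum_{i=1}^{k} W_i\, \bar{x}_i, \qquad W_i = \frac{N_i}{N}
```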
Variance of the Strata estimator of the population mean (Formula)
Where:
W_i = N_i / N
f_i = n_i / N_i
S_i is the population standard dev WITHIN the stratum
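With the quantities defined on the card, the standard form of this variance is presumably:

```latex
\mathrm{Var}(\bar{x}_{st}) = \sum_{i=1}^{k} W_i^2\, (1 - f_i)\, \frac{S_i^2}{n_i}
```

Each stratum contributes its own finite population correction (1 - f_i).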
Cost of sampling
Where c_i is the cost of each unit in stratum i, and n_i is the sample size of stratum i
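The cost function is presumably linear in the stratum sample sizes (some texts, e.g. Cochran's formulation, add a fixed overhead term c_0; that term is an assumption here):

```latex
C = c_0 + \sum_{i=1}^{k} c_i n_i
```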
Strat sampling - Proportional allocation (Formula)
Choosing size of each stratum’s sample size based on the stratum’s proportion with respect to the population size
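Presumably the standard rule:

```latex
n_i = n\, \frac{N_i}{N} = n W_i
```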
Strat sampling - Optimal (Neyman) allocation (Formula)
Note: requires you to know the pop. standard deviation of the stratum upfront!
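A minimal Python sketch of both allocation rules, to make the contrast concrete. The stratum sizes and standard deviations below are made-up numbers, and the Neyman formula used is the standard n_i = n * N_i * S_i / sum_j(N_j * S_j):

```python
def proportional_allocation(n, N_strata):
    """n_i = n * N_i / N: allocate in proportion to stratum size."""
    N = sum(N_strata)
    return [n * Ni / N for Ni in N_strata]

def neyman_allocation(n, N_strata, S_strata):
    """n_i = n * N_i * S_i / sum_j N_j S_j: favours large, variable strata."""
    total = sum(Ni * Si for Ni, Si in zip(N_strata, S_strata))
    return [n * Ni * Si / total for Ni, Si in zip(N_strata, S_strata)]

# Hypothetical population: two strata of sizes 600 and 400,
# with within-stratum standard deviations 1 and 4.
N_strata = [600, 400]
S_strata = [1.0, 4.0]
print(proportional_allocation(100, N_strata))  # [60.0, 40.0]
print(neyman_allocation(100, N_strata, S_strata))
```

Neyman allocation shifts most of the sample into the smaller but much more variable stratum, which is exactly why it needs the S_i upfront.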
Stratified sampling (When and Pros/Cons)
Use when:
- The target population is heterogeneous
- Subgroups (strata) can be defined
- We want a sample to be representative of these groups
Pros:
- Works well if a pop. contains a wide variety of characteristics that may be used to group units
- Gives smaller standard errors and greater PRECISION vs. SRS (the more heterogeneity between strata, the greater precision)
Cons:
- Need prior knowledge to form strata
Cluster sampling (Def)
If it is not feasible to access all units, then Cluster sampling may be used.
Def: Cluster sampling is a sampling procedure where we sample units within a population using a sampling method where sampling units are clusters.
Assumption: units within a cluster are heterogeneous (each cluster mirrors the population), while clusters are homogeneous between themselves (similar to one another)
Cluster sampling (Principles)
(1) Divide the pop. in natural clusters based on some rule (e.g. geographic area)
(2) Treat each cluster as a sampling unit
(3) Sample clusters based on a sampling method
(4) Collect all info from all sampling units within a cluster
Cluster sampling (Pros and cons)
Pros:
- Do not need to access all units of a population (e.g. if we have no info or it’s too expensive)
Cons:
- Clusters may not reflect the true diversity of the population
Constructing clusters - Equal size
If we sample n clusters from N equally sized clusters and y-bar_i is the sample mean in cluster i, the mean of cluster means (see pic) is an unbiased estimator of the pop mean y-bar
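The "(see pic)" estimator is presumably the mean of the cluster sample means:

```latex
\bar{\bar{y}} = \frac{1}{n}\sum_{i=1}^{n} \bar{y}_i
```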
Constructing clusters - unequal size
N = no. of clusters
M_i = units within cluster i
M = pop. size
M-bar = M/N, avg. number of units in clusters
n = no. of clusters sampled
y-bar*N = avg. value of y across clusters
Summarising numerical data (3 Key aspects)
The distribution of a dataset can be summarised by:
- Location: the centre of the distribution (e.g. mean, median)
- Spread: variation/range of the data
- Shape: shape of the distribution
Median (Defs)
The middle value of the ordered observations (at depth (n+1)/2). If there’s an even number of observations, it’s the average of the two middle values.
- Range: max value minus min value (not robust, i.e influenced by outliers)
- LQ: at depth 1/4 * (n+1)
- UQ: at depth 3/4 * (n+1)
- IQR: UQ-LQ (robust)
Common plots
- Dotplot (discrete data)
- Histogram (cont data)
- Bar chart (categorical data)
- Stem and leaf diagram (cont data)
- Boxplot (cont data)
- Scatterplot (compare 2 cont vars)
Boxplot - Formula for outliers
Lower fence = LQ - 1.5 * IQR
UPPER fence = UQ + 1.5 * IQR
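A small Python sketch of the fence rule, using the depth-based quartile convention from the cards (depth 1/4*(n+1) and 3/4*(n+1), with linear interpolation); the data values are made up:

```python
def boxplot_fences(data):
    """Tukey's rule: points outside [LQ - 1.5*IQR, UQ + 1.5*IQR] are outliers."""
    xs = sorted(data)
    n = len(xs)

    def at_depth(d):
        # d is a 1-indexed depth; interpolate between adjacent order statistics
        lo = int(d) - 1
        frac = d - int(d)
        if lo + 1 >= n:
            return xs[-1]
        return xs[lo] + frac * (xs[lo + 1] - xs[lo])

    lq = at_depth(0.25 * (n + 1))
    uq = at_depth(0.75 * (n + 1))
    iqr = uq - lq
    lower, upper = lq - 1.5 * iqr, uq + 1.5 * iqr
    outliers = [x for x in xs if x < lower or x > upper]
    return lower, upper, outliers

print(boxplot_fences([1, 2, 3, 4, 5, 6, 7, 100]))  # (-4.5, 13.5, [100])
```

Note that other quartile conventions (e.g. NumPy's default) give slightly different fences on small samples.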
Parameter vs. Statistic
Parameter: single number summarising the variable of interest in the population
Statistic: same as above, but within a sample. It’s a FUNCTION of the data in the sample
P-value (Def)
It’s the probability of obtaining a value for your TEST STATISTIC that is at least as extreme as the observed value, assuming H_0 is true
Interpretation of Confidence Intervals (95%)
Under repeated sampling and recalculation, 95% of CIs would contain the true population value
Pivotal function (Def)
A Pivotal function is a function of the data, X, and a parameter of interest, θ, which when regarded as a r.v. calculated at θ_T (true value of θ), has a probability distribution whose form does not depend on any unknown parameter. We denote it by PIV(θ_T, X)
Pivotal function for t-test
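The formula side of this card is presumably the usual one-sample t pivot (s the sample standard deviation):

```latex
PIV(\mu, X) = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}
```

Its distribution is free of the unknown sigma, which is what makes it pivotal.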
Hypothesis testing framework
(1) Specify H_0 and H_1
(2) Define a test statistic (TS)
(3) Compute the observed value of the TS from the data
We reject H_0 in favour of H_1 when |t| (or TS) is too large to be consistent with H_0. For example, adopting a significance level of α=0.05, we reject H_0 if:
|t| > t_0.975(n-1)
This is a one-sample t-test
One-sample t-test (Assumptions)
- Our data x_1, … , x_n have arisen from a normal distribution
- Our data points are independent from one another
Paired t-test
If the data come from the same individuals (e.g. 2 measurements), the independence assumption does not stand. If appropriate, we can take the difference of the measurements, reducing it to a One-sample test.
In this case, the differences are assumed to be independent of each other
Non-parametric tests (Explanation)
Non-parametric tests make fewer assumptions (e.g. normality is not required), so they can be used instead of e.g. t-test. For example, the Wilcoxon signed ranks test performs inference on the median
Two-sample t-test (Assumptions)
(1) All values are independent
(2) The distribution of the variable of interest is the same for both populations, apart from possibly the mean (i.e. same variance)
(3) they are distributed normally
Two-sample t-test - Pivotal function (Formula)
Two-sample t-test - Pooled variance and CI (Formula)
We can get a better estimate of σ-hat by pooling the info from the two samples
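The formula sides of these two cards are presumably the standard pooled-variance results:

```latex
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
```

```latex
t = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{s_p\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}
\qquad
\text{CI: } (\bar{x} - \bar{y}) \pm t_{1-\alpha/2}(n_1 + n_2 - 2)\, s_p\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}
```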
Proportion test - CLT derivation
Suppose X ~ Bi(n, θ). Then X is the number of successes in n independent trials, each trial having probability θ of success.
We can use the Normal approx to the Binomial distribution (CLT):
X ~ N(nθ, nθ*(1-θ))
This requires a discrete dist. to be approximated with a continuous one, introducing some inconsistencies. For this we use a continuity correction.
The CLT tells us that for sufficiently large n (e.g. n>=20, nθ>=5 and nθ*(1-θ)>=5) the test statistic of interest is:
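The statistic the card points to is presumably the approximate z-statistic (θ-hat = x/n; with continuity correction, x is replaced by x ± 0.5):

```latex
z = \frac{\hat{\theta} - \theta_0}{\sqrt{\theta_0(1 - \theta_0)/n}} \;\overset{approx}{\sim}\; N(0, 1) \quad \text{under } H_0
```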
Point estimators (Def and desirable properties)
A point estimate is a particular numeric value of the function t(x), obtained from a particular set of data x = (x_1, x_2, … , x_n).
Properties:
(1) Range of t(X) should be the same as the range of θ
(2) t(X) should be UNBIASED
(3) t(X) should be CONSISTENT
(4) MVUE - Minimum variance unbiased estimator
Note: the lower the variance, the more efficient the estimator
Maximum likelihood estimation (Motivation)
Selecting the value of θ (parameter) for a chosen probability distribution, for which our given set of observations has a maximum probability.
Given observed data and an assumed probability model, we want to find estimates for the population parameters that maximise the likelihood that our distribution fits the data
MLE (Def)
The MLE θ-hat is the value of the population parameter θ (for any probability distribution) which maximises the Likelihood function L(θ; x)
Likelihood function - Discrete (Formula)
Log-Likelihood function (Formula)
It’s often easier algebraically to work with the natural log of the L function: since ln(x) is a monotonic function, l(θ) reaches its maximum at the same value of θ as L(θ; x)
Likelihood function - Continuous (Formula)
For continuous distributions, the PDF of X_i evaluated at x_i does not directly represent the probability of the data. However, it is approximately PROPORTIONAL to the probability that X_i lies in a small interval around x_i, so it’s reasonable to take the likelihood of the parameter θ to be:
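The likelihood formulas on these cards are presumably the standard ones, assuming independent observations (p the pmf in the discrete case, f the pdf in the continuous case):

```latex
L(\theta; x) = \prod_{i=1}^{n} p(x_i; \theta) \quad \text{(discrete)}
\qquad
L(\theta; x) = \prod_{i=1}^{n} f(x_i; \theta) \quad \text{(continuous)}
```

```latex
l(\theta) = \ln L(\theta; x) = \sum_{i=1}^{n} \ln f(x_i; \theta)
```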
MLE (Steps)
(1) Evaluate the Likelihood function L(θ; x)
(2) Obtain the Log-Likelihood function l(θ)
(3) Differentiate wrt θ and set l’(θ) = 0
(4) Solve for θ
(5) Verify it’s a maximum, i.e. check l’’(θ-hat) < 0
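As a worked instance of these steps, take the exponential distribution Exp(λ) as a hypothetical example:

```latex
L(\lambda; x) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_i x_i}
\qquad
l(\lambda) = n\ln\lambda - \lambda\sum_{i=1}^{n} x_i
```

```latex
l'(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
\;\Rightarrow\;
\hat{\lambda} = \frac{1}{\bar{x}},
\qquad
l''(\lambda) = -\frac{n}{\lambda^2} < 0 \quad \text{(maximum confirmed)}
```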
Newton-Raphson method (Def)
When it’s not possible to find the root of l’(θ) = 0 in closed form, we can use the method to find the MLE numerically
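A minimal sketch of the iteration θ ← θ − l’(θ)/l’’(θ). To make it checkable, it is applied to a case with a known closed-form answer (the exponential MLE above, λ-hat = 1/x-bar); the data values are hypothetical:

```python
def newton_raphson_mle(score, hessian, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - l'(theta)/l''(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Exp(lam): l'(lam) = n/lam - sum(x), l''(lam) = -n/lam^2, MLE = 1/mean(x)
data = [0.5, 1.2, 0.8, 2.1, 0.4]  # hypothetical observations
n, s = len(data), sum(data)
lam_hat = newton_raphson_mle(
    score=lambda lam: n / lam - s,
    hessian=lambda lam: -n / lam**2,
    theta0=0.5,
)
print(abs(lam_hat - n / s) < 1e-8)  # True
```

In practice the starting value matters: for this score function the iteration diverges if θ0 is too far from the root, which is why a rough moment-based estimate is a common starting point.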
Efficiency (Def)
An UNBIASED estimator is said to be efficient if it has the minimum possible variance; the efficiency of an unbiased estimator is the ratio of the minimum possible variance to the variance of the estimator