Statistical tests Flashcards

1
Q

K-S test or Kolmogorov-Smirnov test

A

Nonparametric test for the equality of continuous, one-dimensional probability distributions for one sample or two samples.

The null distribution of this statistic is calculated under the null hypothesis that the samples are drawn from the same distribution (in the two-sample case) or that the sample is drawn from the reference distribution (in the one-sample case).

The Kolmogorov–Smirnov test can be modified to serve as a goodness of fit test. In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution. This is equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and it is known that using these to define the specific reference distribution changes the null distribution of the test statistic. Various studies have found that, even in this corrected form, the test is less powerful for testing normality than the Shapiro–Wilk test or Anderson–Darling test.
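
For illustration, a minimal sketch using SciPy’s kstest and ks_2samp (the data here are invented):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)          # sample for the one-sample test
    y = rng.uniform(-2, 2, size=150)  # second sample for the two-sample test

    # One-sample K-S test against a fully specified standard normal reference.
    stat, p = stats.kstest(x, "norm")

    # Two-sample K-S test: are x and y drawn from the same distribution?
    stat2, p2 = stats.ks_2samp(x, y)
    print(f"one-sample: D={stat:.3f}, p={p:.3f}")
    print(f"two-sample: D={stat2:.3f}, p={p2:.3f}")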

2
Q

Mann-Whitney U test

A

Also known as the Mann–Whitney–Wilcoxon (MWW) test, the Wilcoxon rank-sum test, or the Wilcoxon–Mann–Whitney test.

This is a non-parametric statistical hypothesis test for assessing whether one of two samples of independent observations tends to have larger values than the other. It is one of the most well-known non-parametric significance tests.

A very general formulation is to assume that:

(1) All the observations from both groups are independent of each other,
(2) The responses are ordinal (i.e. one can at least say, of any two observations, which is the greater),
(3) Under the null hypothesis the distributions of both groups are equal, so that the probability of an observation from one population (X) exceeding an observation from the second population (Y) equals the probability of an observation from Y exceeding an observation from X, that is, there is a symmetry between populations with respect to probability of random drawing of a larger observation.
(4) Under the alternative hypothesis the probability of an observation from one population (X) exceeding an observation from the second population (Y) (after exclusion of ties) is not equal to 0.5. The alternative may also be stated in terms of a one-sided test, for example: P(X > Y) + 0.5 P(X = Y) > 0.5.

Additional facts:

- Related to Kendall’s tau: equivalent if one variable is binary.
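
For illustration, a minimal SciPy sketch (invented data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    a = rng.normal(0.0, 1.0, size=40)
    b = rng.normal(0.5, 1.0, size=35)  # group shifted toward larger values

    # Two-sided test of whether one group tends to have larger values.
    u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    print(f"U={u:.1f}, p={p:.4f}")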

3
Q

Non-parametric tests

A

Non-parametric (or distribution-free) inferential statistical methods are mathematical procedures for statistical hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed.

Examples include:

Kolmogorov–Smirnov test

Mann–Whitney U or Wilcoxon rank sum test

Siegel–Tukey test

sign test

Wilcoxon signed-rank test

Anderson–Darling test

Kuiper’s test

Logrank Test

McNemar’s test

median test

Pitman’s permutation test

Wald–Wolfowitz runs test

4
Q

Parametric tests

A

Parametric statistics is a branch of statistics that assumes that the data has come from a type of probability distribution and makes inferences about the parameters of the distribution.

Parametric methods make more assumptions than non-parametric methods. If those extra assumptions are correct, parametric methods can produce more accurate and precise estimates. They are said to have more statistical power.

Examples of parametric tests:

t-tests

z-tests

analysis of variance (ANOVA)

5
Q

What test(s) to use when you have two samples of data independently collected and you want to compare whether one has values greater than the other?

A

(1) Mann-Whitney U test (also called Mann–Whitney–Wilcoxon (MWW), Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test)
- More robust to outliers
- Good if data are ordinal but not interval-scaled
- Under normality, about 95% as efficient as the t-test (asymptotic relative efficiency 3/π ≈ 0.955)
- Non-parametric; usually the better choice, except perhaps for small sample sizes, where power may matter
(2) Independent-samples Student’s t-test
- Assumes normality (parametric)

6
Q

Wilcoxon signed-rank test

A

Non-parametric statistical test to compare two related samples, matched samples, or repeated measurements on a single sample to determine if their population mean ranks differ – a paired difference test.

Assumes:

(1) Data are paired and come from the same population.
(2) Each pair is chosen randomly and independently.
(3) The data are measured on an interval scale (ordinal is not sufficient because we take differences), but need not be normal.

Related methods: sign test, t-test (paired Student’s t-test, t-test for matched pairs, t-test for dependent samples), paired z-test
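
A minimal sketch with SciPy’s wilcoxon (paired data invented for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    before = rng.normal(10.0, 2.0, size=30)         # first measurement
    after = before + rng.normal(0.5, 1.0, size=30)  # repeated measurement

    # Paired test on the differences; zero differences are discarded by default.
    stat, p = stats.wilcoxon(before, after)
    print(f"W={stat:.1f}, p={p:.4f}")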

7
Q

Paired difference tests

(what it is, methods, and popular uses)

A

In statistics, a paired difference test is a type of location test that is used when comparing two sets of measurements to assess whether their population means differ. A paired difference test uses additional information about the sample that is not present in an ordinary unpaired testing situation, either to increase the statistical power, or to reduce the effects of confounders.

Methods include:

  • paired t-test (when the standard deviation is not known)
  • paired z-test (when the standard deviation is known)
  • Wilcoxon signed-rank test (for non-normal distributions; assumes the distribution of differences is symmetric)
  • sign test (non-parametric, does not assume symmetry, less powerful)

Popular uses include:

  • before and after a treatment – “repeated measures” tests (increases power)
  • reduce confounding by introducing artificial pairs that match on some level…
8
Q

Sign test

A

Non-parametric test to test the hypothesis that there is “no difference in medians” between the continuous distributions of two random variables X and Y, in the situation when we can draw paired samples from X and Y.

Because it is non-parametric it has very general applicability but may lack the statistical power of other tests such as the paired-samples t-test or the Wilcoxon signed-rank test.

Method:

Let p = Pr(X > Y), and then test the null hypothesis H0: p = 0.50. In other words, the null hypothesis states that given a random pair of measurements (xi, yi), then xi and yi are equally likely to be larger than the other.

Then let W be the number of pairs for which yi − xi > 0. Assuming that H0 is true, W follows a binomial distribution, W ~ B(m, 0.5), where m is the number of pairs with a nonzero difference (tied pairs are discarded). The “W” is for Frank Wilcoxon, who developed the test and, later, the more powerful Wilcoxon signed-rank test.
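
A sketch of this method in Python (assumes SciPy >= 1.7 for binomtest; data invented):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.normal(0.0, 1.0, size=25)
    y = x + rng.normal(0.3, 1.0, size=25)  # paired measurements

    d = y - x
    d = d[d != 0]             # discard ties
    w = int(np.sum(d > 0))    # W = number of pairs with yi - xi > 0

    # Under H0, W ~ Binomial(m, 0.5) with m non-tied pairs.
    res = stats.binomtest(w, n=len(d), p=0.5)
    print(f"W={w} of {len(d)} pairs, p={res.pvalue:.4f}")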

9
Q

Median test

A

*Mostly thought to be obsolete due to low power

Instead use: Wilcoxon–Mann–Whitney U two-sample test

It is a nonparametric test that tests the null hypothesis that the medians of the populations from which two samples are drawn are identical.

Difference between this and the Mann-Whitney U test:

The relevant difference between the two tests is that the median test only considers the position of each observation relative to the overall median, whereas the Wilcoxon–Mann–Whitney test takes the ranks of each observation into account. Thus the latter test is usually the more powerful of the two.
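
SciPy implements the median test directly; a minimal sketch (invented data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    g1 = rng.normal(0.0, 1.0, size=50)
    g2 = rng.normal(0.7, 1.0, size=50)

    # Classifies observations as above/below the grand median, then applies
    # a chi-squared test to the resulting contingency table.
    stat, p, grand_median, table = stats.median_test(g1, g2)
    print(f"chi2={stat:.2f}, p={p:.4f}, grand median={grand_median:.2f}")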

10
Q

Methods to compare means

A

See http://en.wikipedia.org/wiki/Comparing_means

11
Q

Kuiper’s test

A

Kuiper’s test is used in statistics to test whether a given distribution, or family of distributions, is contradicted by evidence from a sample of data.

Properties: invariant to cyclic transformations, and as sensitive in the tails as near the median

Uses: cyclic variations by time of year, day of week, or time of day; in general, any circular probability distributions

Related to: Kolmogorov–Smirnov test, Anderson–Darling test

More

Kuiper’s test[1] is closely related to the more well-known Kolmogorov–Smirnov test (or K-S test as it is often called). As with the K-S test, the discrepancy statistics D+ and D− represent the absolute sizes of the most positive and most negative differences between the two cumulative distribution functions that are being compared. The trick with Kuiper’s test is to use the quantity D+ + D− as the test statistic. This small change makes Kuiper’s test as sensitive in the tails as at the median and also makes it invariant under cyclic transformations of the independent variable. The Anderson–Darling test is another test that provides equal sensitivity at the tails as the median, but it does not provide the cyclic invariance.

This invariance under cyclic transformations makes Kuiper’s test invaluable when testing for cyclic variations by time of year or day of the week or time of day, and more generally for testing the fit of, and differences between, circular probability distributions.

12
Q

Jarque–Bera test

A

The Jarque–Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution. The statistic is JB = (n/6)(S² + (K − 3)²/4), where S is the sample skewness and K the sample kurtosis; under the null of normality, JB is asymptotically chi-squared with 2 degrees of freedom.
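
A one-line check with SciPy (heavy-tailed data invented so the test should reject):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    x = rng.standard_t(df=3, size=500)  # heavy tails: kurtosis exceeds 3

    # JB is asymptotically chi-squared with 2 degrees of freedom under H0.
    stat, p = stats.jarque_bera(x)
    print(f"JB={stat:.2f}, p={p:.4f}")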

13
Q

Cramér–von Mises criterion

A

The Cramér–von Mises criterion is a criterion used for judging the goodness of fit of a cumulative distribution function F* compared to a given empirical distribution function Fn, or for comparing two empirical distributions. It is also used as a part of other algorithms, such as minimum distance estimation.

In one-sample applications F* is the theoretical distribution and Fn is the empirically observed distribution; the criterion is ω² = ∫ [Fn(x) − F*(x)]² dF*(x). Alternatively, the two distributions can both be empirically estimated ones; this is called the two-sample case.

Related alternative tests: Kolmogorov–Smirnov test, Watson test (almost same)

14
Q

Siegel–Tukey test

A

The Siegel–Tukey test is a non-parametric test which may be applied to data measured at least on an ordinal scale. It tests for differences in scale between two groups.

The test is used to determine if one of two groups of data tends to have more widely dispersed values than the other. In other words, the test determines whether one of the two groups tends to move, sometimes to the right, sometimes to the left, but away from the center (of the ordinal scale).

Published in 1960 by Sidney Siegel and John Tukey.

More:

The principle is based on the following idea:

Suppose there are two groups A and B with n observations for the first group and m observations for the second (so there are N = n + m total observations). If all N observations are arranged in ascending order, it can be expected that the values of the two groups will be mixed or sorted randomly, if there are no differences between the two groups (following the null hypothesis H0). This would mean that among the ranks of extreme (high and low) scores, there would be similar values from Group A and Group B.

If, say, Group A were more inclined to extreme values (the alternative hypothesis H1), then there will be a higher proportion of observations from group A with low or high values, and a reduced proportion of values at the center.

Hypothesis H0: σ²A = σ²B and MeA = MeB (where σ² and Me are the variance and the median, respectively)
Hypothesis H1: σ²A > σ²B

15
Q

Statistical hypothesis tests

A

Statistical hypothesis tests answer the question: “Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?”[2] That probability is known as the p-value.

Statistical hypothesis testing is a key technique of frequentist statistical inference. The Bayesian approach to hypothesis testing is to base decisions on the posterior probability.

16
Q

Wald–Wolfowitz runs test

A

The runs test (also called Wald–Wolfowitz test) is a non-parametric statistical test that checks a randomness hypothesis for a two-valued data sequence. More precisely, it can be used to test the hypothesis that the elements of the sequence are mutually independent.

A “run” of a sequence is a maximal non-empty segment of the sequence consisting of adjacent equal elements. For example, the sequence “++++−−−+++−−++++++−−−−” consists of six runs, three of which consist of +’s and the others of −’s. The runs test is based on the null hypothesis that the two elements + and − are independently drawn from the same distribution.

Under the null hypothesis, the number of runs in a sequence of length N is a random variable whose conditional distribution given the observation of N+ positive values and N− negative values (N = N+ + N−) is approximately normal.

The mean and variance do not depend on the “fairness” of the process generating the elements of the sequence, that is, that +’s and −’s have equal probabilities, but only on the assumption that the elements are independent and identically distributed. If the number of runs is significantly higher or lower than expected, the hypothesis of statistical independence of the elements may be rejected.

Runs tests can be used to test:

the randomness of a distribution, by taking the data in the given order and marking with + the data greater than the median and with − the data less than the median (numbers equalling the median are omitted);
whether a function fits well to a data set, by marking the data exceeding the function value with + and the other data with −. For this use, the runs test, which takes into account the signs but not the distances, is complementary to the chi-square test, which takes into account the distances but not the signs.
The Kolmogorov–Smirnov test is more powerful if it can be applied.
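
There is no runs test in SciPy itself, so here is a sketch of the normal approximation, using the standard Wald–Wolfowitz expressions for the mean and variance of the number of runs:

    import numpy as np
    from scipy import stats

    seq = np.array(list("++++---+++--++++++----"))  # the example sequence

    runs = 1 + int(np.sum(seq[1:] != seq[:-1]))  # count run boundaries
    n_pos = int(np.sum(seq == "+"))
    n_neg = int(np.sum(seq == "-"))
    n = n_pos + n_neg

    # Mean and variance of the number of runs under independence.
    mu = 2 * n_pos * n_neg / n + 1
    var = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n ** 2 * (n - 1))

    z = (runs - mu) / np.sqrt(var)
    p = 2 * stats.norm.sf(abs(z))  # two-sided normal approximation
    print(f"runs={runs}, z={z:.2f}, p={p:.4f}")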

17
Q

Kendall’s W

A

Kendall’s W (Kendall’s coefficient of concordance) is a non-parametric statistic. It is a normalization of the statistic of the Friedman test, and can be used for assessing agreement among raters. Kendall’s W ranges from 0 (no agreement) to 1 (complete agreement).

More:

Suppose, for instance, that a number of people have been asked to rank a list of political concerns, from most important to least important. Kendall’s W can be calculated from these data. If the test statistic W is 1, then all the survey respondents have been unanimous, and each respondent has assigned the same order to the list of concerns. If W is 0, then there is no overall trend of agreement among the respondents, and their responses may be regarded as essentially random. Intermediate values of W indicate a greater or lesser degree of unanimity among the various responses.

While tests using the standard Pearson correlation coefficient assume normally distributed values and compare two sequences of outcomes at a time, Kendall’s W makes no assumptions regarding the nature of the probability distribution and can handle any number of distinct outcomes.

W is linearly related to the mean value of the Spearman’s rank correlation coefficients between all pairs of the rankings over which it is calculated.
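
There is no standard SciPy function for W, but it is short to compute from the definition W = 12S / (m²(n³ − n)), where m raters rank n items and S is the sum of squared deviations of the items’ rank totals from their mean (a sketch, ignoring the tie correction):

    import numpy as np
    from scipy.stats import rankdata

    # Rows = raters, columns = items; entries are each rater's rankings.
    ratings = np.array([
        [1, 2, 3, 4, 5],
        [2, 1, 3, 5, 4],
        [1, 3, 2, 4, 5],
    ])
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)  # rank within each rater

    R = ranks.sum(axis=0)                   # total rank per item
    S = np.sum((R - m * (n + 1) / 2) ** 2)  # deviation from the mean total
    W = 12 * S / (m ** 2 * (n ** 3 - n))
    print(f"W = {W:.3f}")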

18
Q

Friedman test

A

The Friedman test is a non-parametric statistical test used to detect differences in treatments across multiple test attempts. The procedure involves ranking each row (or block) together, then considering the values of ranks by columns. Applicable to complete block designs, it is thus a special case of the Durbin test.

Classic examples of use are:

n wine judges each rate k different wines. Are any wines ranked consistently higher or lower than the others?
n wines are each rated by k different judges. Are the judges’ ratings consistent with each other?
n welders each use k welding torches, and the ensuing welds were rated on quality. Do any of the torches produce consistently better or worse welds?

The Friedman test is used for one-way repeated measures analysis of variance by ranks. In its use of ranks it is similar to the Kruskal-Wallis one-way analysis of variance by ranks.

When using this kind of design for a binary response, one instead uses the Cochran’s Q test.
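
A minimal SciPy sketch of the welder example (ratings invented), where each argument holds one treatment’s values across the same blocks:

    from scipy import stats

    # Quality ratings of welds from three torches across eight welders.
    torch_a = [9.0, 9.5, 5.0, 7.5, 9.5, 7.5, 8.0, 7.0]
    torch_b = [7.0, 6.5, 7.0, 7.5, 5.0, 8.0, 6.0, 6.5]
    torch_c = [6.0, 8.0, 4.0, 6.0, 7.0, 6.5, 6.0, 4.0]

    stat, p = stats.friedmanchisquare(torch_a, torch_b, torch_c)
    print(f"chi2={stat:.2f}, p={p:.4f}")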

19
Q

Durbin test

A

In the analysis of designed experiments, the Friedman test is the most common non-parametric test for complete block designs. The Durbin test is a nonparametric test for balanced incomplete designs that reduces to the Friedman test in the case of a complete block design.

More:

In a randomized block design, k treatments are applied to b blocks. For some experiments, it may not be realistic to run all treatments in all blocks, so one may need to run an incomplete block design. In this case, it is strongly recommended to run a balanced incomplete design. A balanced incomplete block design has the following properties:

Every block contains k experimental units.
Every treatment appears in r blocks.

Every treatment appears with every other treatment an equal number of times.

The Durbin test is based on the following assumptions:

The b blocks are mutually independent. That means the results within one block do not affect the results within other blocks.
The data can be meaningfully ranked (i.e., the data have at least an ordinal scale).

Cochran’s Q test is applied for the special case of a binary response variable (i.e., one that can have only one of two possible outcomes)

20
Q

Cochran’s Q test

A

In statistics, in the analysis of two-way randomized block designs where the response variable can take only two possible outcomes (coded as 0 and 1), Cochran’s Q test is a non-parametric statistical test to verify if k treatments have identical effects.[1][2] It is named for William Gemmell Cochran. Cochran’s Q test should not be confused with Cochran’s C test, which is a variance outlier test.

Cochran’s Q test assumes that there are k > 2 experimental treatments and that the observations are arranged in b blocks.

Cochran’s Q test is

H0: The treatments are equally effective.
Ha: There is a difference in effectiveness among treatments.
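
A sketch assuming statsmodels’ cochrans_q helper (binary outcomes invented):

    import numpy as np
    from statsmodels.stats.contingency_tables import cochrans_q

    # Rows = blocks (subjects), columns = treatments; entries are 0/1 outcomes.
    x = np.array([
        [1, 1, 0],
        [1, 1, 0],
        [1, 0, 0],
        [1, 1, 1],
        [0, 1, 0],
        [1, 1, 1],
        [1, 0, 0],
        [1, 1, 0],
    ])
    res = cochrans_q(x)
    print(f"Q={res.statistic:.2f}, p={res.pvalue:.4f}")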

21
Q

Analysis of variance

A

Analysis of variance (ANOVA) is a collection of statistical models, and their associated procedures, in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are all equal, and therefore generalizes the t-test to more than two groups. Doing multiple two-sample t-tests would result in an increased chance of committing a type I error. For this reason, ANOVAs are useful in comparing two, three, or more means.

The analysis of variance can be presented in terms of a linear model, which makes the following assumptions about the probability distribution of the responses:

Independence of observations – this is an assumption of the model that simplifies the statistical analysis.
Normality – the distributions of the residuals are normal.
Equality (or “homogeneity”) of variances, called homoscedasticity — the variance of data in groups should be the same.
The separate assumptions of the textbook model imply that the errors are independently, identically, and normally distributed for fixed-effects models.
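
In its simplest (one-way) form, ANOVA is a single SciPy call; a sketch with invented group data:

    from scipy import stats

    g1 = [6.9, 5.4, 5.8, 4.6, 4.0]
    g2 = [8.3, 6.8, 7.8, 9.2, 6.5]
    g3 = [8.0, 10.5, 8.1, 6.9, 9.3]

    # One-way ANOVA: H0 is that all group means are equal.
    f, p = stats.f_oneway(g1, g2, g3)
    print(f"F={f:.2f}, p={p:.4f}")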

More Anova-like tests:

ANOVA on ranks
ANOVA-simultaneous component analysis
AMOVA
ANCOVA
ANORVA
MANOVA
Mixed-design analysis of variance
Two-way analysis of variance
One-way analysis of variance

More, see: http://en.wikipedia.org/wiki/ANOVA

22
Q

Kendall tau rank correlation coefficient

A

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall’s tau (τ) coefficient, is a statistic used to measure the association between two measured quantities. A tau test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.

Specifically, it is a measure of rank correlation, i.e., the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938,[1] though Gustav Fechner had proposed a similar measure in the context of time series in 1897.

Let (x1, y1), (x2, y2), …, (xn, yn) be a set of observations of the joint random variables X and Y respectively, such that all the values of (xi) and (yi) are unique. Any pair of observations (xi, yi) and (xj, yj) are said to be concordant if the ranks for both elements agree: that is, if both xi > xj and yi > yj or if both xi < xj and yi < yj. They are said to be discordant if xi > xj and yi < yj or if xi < xj and yi > yj. If xi = xj or yi = yj, the pair is neither concordant nor discordant.

If X and Y are independent, then we would expect the coefficient to be approximately zero.
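
In the no-ties case, tau = (number of concordant pairs − number of discordant pairs) / (n(n − 1)/2). A minimal SciPy sketch (rankings invented):

    from scipy import stats

    # Two rankings of the same eight items.
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2, 1, 4, 3, 6, 5, 8, 7]

    tau, p = stats.kendalltau(x, y)
    print(f"tau={tau:.3f}, p={p:.4f}")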

23
Q

Goodman and Kruskal’s gamma

A

In statistics, Goodman and Kruskal’s gamma measures the strength of association of the cross tabulated data when both variables are measured at the ordinal level. It makes no adjustment for either table size or ties. Values range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.

24
Q

Cohen’s kappa

A

Cohen’s kappa coefficient is a statistical measure of inter-rater agreement or inter-annotator agreement[1] for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation since κ takes into account the agreement occurring by chance. Some researchers[2] have expressed concern over κ’s tendency to take the observed categories’ frequencies as givens, which can have the effect of underestimating agreement for a category that is also commonly used; for this reason, κ is considered an overly conservative measure of agreement.

Others[3] contest the assertion that kappa “takes into account” chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess, a very unrealistic scenario.

Cohen’s kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories.

The equation for κ is:

κ = (Pr(a) − Pr(e)) / (1 − Pr(e))

where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as defined by Pr(e)), κ = 0.

Kappa assumes its theoretical maximum value of 1 only when both observers distribute codes the same, that is, when corresponding row and column sums are identical. Anything less is less than perfect agreement. Still, the maximum value kappa could achieve given unequal distributions helps interpret the value of kappa actually obtained. (For the equation for maximum κ, see http://en.wikipedia.org/wiki/Cohen%27s_kappa)
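
A quick sketch using scikit-learn’s cohen_kappa_score (labels invented):

    from sklearn.metrics import cohen_kappa_score

    # Categories assigned by two raters to the same twelve items.
    rater1 = ["a", "a", "b", "b", "c", "a", "b", "c", "c", "a", "b", "b"]
    rater2 = ["a", "a", "b", "c", "c", "a", "b", "b", "c", "a", "b", "b"]

    kappa = cohen_kappa_score(rater1, rater2)
    print(f"kappa = {kappa:.3f}")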

25
Q

Cochran’s Q test

A

In statistics, in the analysis of two-way randomized block designs where the response variable can take only two possible outcomes (coded as 0 and 1), Cochran’s Q test is a non-parametric statistical test to verify if k treatments have identical effects. Cochran’s Q test should not be confused with Cochran’s C test, which is a variance outlier test.

Cochran’s Q test is

H0: The treatments are equally effective.
Ha: There is a difference in effectiveness among treatments.
The Cochran’s Q test statistic is

T = k(k − 1) Σj (X•j − N/k)² / Σi Xi•(k − Xi•)

where

k is the number of treatments
X•j is the column total for the jth treatment
b is the number of blocks
Xi• is the row total for the ith block
N is the grand total

Cochran’s Q test is based on the following assumptions:

A large sample approximation; in particular, it assumes that b is “large”.
The blocks were randomly selected from the population of all possible blocks.
The outcomes of the treatments can be coded as binary responses (i.e., a “0” or “1”) in a way that is common to all treatments within each block.

Related/alternatives:

When using this kind of design for a response that is not binary but rather ordinal or continuous, one instead uses the Friedman test or Durbin tests.

The case where there are exactly two treatments is equivalent to McNemar’s test, which is itself equivalent to a two-tailed sign test.

26
Q

Logrank test

A

In statistics, the logrank test is a hypothesis test to compare the survival distributions of two samples. It is a nonparametric test and appropriate to use when the data are right-skewed and censored (technically, the censoring must be non-informative). It is widely used in clinical trials to establish the efficacy of a new treatment compared to a control treatment when the measurement is the time to event (such as the time from initial treatment to a heart attack). The test is sometimes called the Mantel–Cox test, named after Nathan Mantel and David Cox. The logrank test can also be viewed as a time-stratified Cochran–Mantel–Haenszel test.

The logrank test statistic compares estimates of the hazard functions of the two groups at each observed event time. It is constructed by computing the observed and expected number of events in one of the groups at each observed event time and then adding these to obtain an overall summary across all time points where there is an event.
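
A sketch assuming the lifelines package (durations and censoring flags invented; event_observed is 0 for censored subjects):

    import numpy as np
    from lifelines.statistics import logrank_test

    t_ctrl = np.array([6, 7, 9, 10, 11, 13, 15, 16, 19, 20])
    e_ctrl = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])
    t_trt = np.array([8, 10, 12, 14, 16, 18, 20, 22, 24, 26])
    e_trt = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

    res = logrank_test(t_ctrl, t_trt,
                       event_observed_A=e_ctrl, event_observed_B=e_trt)
    print(f"chi2={res.test_statistic:.2f}, p={res.p_value:.4f}")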

Related

  • The logrank statistic can be derived as the score test for the Cox proportional hazards model comparing two groups. It is therefore asymptotically equivalent to the likelihood ratio test statistic from that model.
  • The logrank statistic is asymptotically equivalent to the likelihood ratio test statistic for any family of distributions with a proportional hazards alternative, for example when the data from the two samples have exponential distributions.
  • The logrank statistic can be used when observations are censored. If censored observations are not present in the data then the Wilcoxon rank sum test is appropriate.
  • The logrank statistic gives all calculations the same weight, regardless of the time at which an event occurs. The Peto logrank statistic gives more weight to earlier events when there are a large number of observations.
27
Q

Score test

A

Rao’s score test, or the score test (often known as the Lagrange multiplier test in econometrics[1]), is a statistical test of a simple null hypothesis that a parameter of interest θ is equal to some particular value θ0. It is the most powerful test when the true value of θ is close to θ0. The main advantage of the score test is that it does not require an estimate of the information under the alternative hypothesis or unconstrained maximum likelihood. This makes testing feasible when the unconstrained maximum likelihood estimate is a boundary point in the parameter space.

Related:

The likelihood ratio test, the Wald test, and the Score test are asymptotically equivalent tests of hypotheses. When testing nested models, the statistics for each test converge to a Chi-squared distribution with degrees of freedom equal to the difference in degrees of freedom in the two models.

In many situations, the score statistic reduces to another commonly used statistic.

  • When the data follow a normal distribution, the score statistic is the same as the t statistic.
  • When the data consists of binary observations, the score statistic is the same as the chi-squared statistic in the Pearson’s chi-squared test.
  • When the data consists of failure time data in two groups, the score statistic for the Cox partial likelihood is the same as the log-rank statistic in the log-rank test. Hence the log-rank test for difference in survival between two groups is most powerful when the proportional hazards assumption holds.
28
Q

Wald test

A

The Wald test is a parametric statistical test named after the statistician Abraham Wald, with a great variety of uses. Whenever a relationship within or between data items can be expressed as a statistical model with parameters to be estimated from a sample, the Wald test can be used to test the true value of the parameter based on the sample estimate.

For example, suppose an economist, who has data on social class and shoe size, wonders whether social class is associated with shoe size. Say θ is the average increase in shoe size for upper-class people compared to middle-class people: then the Wald test can be used to test whether θ is 0 (in which case social class has no association with shoe size) or non-zero (shoe size varies between social classes). Here, θ, the hypothetical difference in shoe sizes between upper and middle-class people in the whole population, is a parameter. An estimate of θ might be the difference in shoe size between upper and middle-class people in the sample. In the Wald test, the economist uses the estimate and an estimate of its variability to draw conclusions about the unobserved true θ. Or, for a medical example, suppose smoking multiplies the risk of lung cancer by some number R: then the Wald test can be used to test whether R = 1 (i.e. there is no effect of smoking) or is greater (or less) than 1 (i.e. smoking alters risk).

A Wald test can be used in a great variety of different models including models for dichotomous variables and models for continuous variables.

Under the Wald statistical test, the maximum likelihood estimate θ̂ of the parameter(s) of interest θ is compared with the proposed value θ0, with the assumption that the difference between the two will be approximately normally distributed. Typically the square of the difference is compared to a chi-squared distribution.
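
A hand-rolled sketch for the simplest case, testing a mean against zero (data invented; the squared z-statistic is referred to a chi-squared distribution with 1 degree of freedom):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    x = rng.normal(0.4, 1.0, size=100)

    theta_hat = x.mean()                  # estimate of the parameter
    se = x.std(ddof=1) / np.sqrt(len(x))  # estimated standard error

    # Wald statistic for H0: theta = 0.
    w = (theta_hat - 0.0) ** 2 / se ** 2
    p = stats.chi2.sf(w, df=1)
    print(f"W={w:.2f}, p={p:.4f}")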

Alternative tests and related

The likelihood-ratio test can also be used to test whether an effect exists or not. Usually the Wald test and the likelihood ratio test give very similar conclusions (as they are asymptotically equivalent), but very rarely, they disagree enough to lead to different conclusions: the researcher finds him/herself asking, or being asked, why the p-value is significant when the confidence interval includes 0, or why the p-value is not significant when the confidence interval excludes 0. In this situation, first remember that statistical significance is always somewhat arbitrary, as it depends on an arbitrarily chosen significance level.

There are several reasons to prefer the likelihood ratio test to the Wald test.[3][4][5] One is that the Wald test can give different answers to the same question, depending on how the question is phrased.[6] For example, asking whether R = 1 is the same as asking whether log R = 0; but the Wald statistic for R = 1 is not the same as the Wald statistic for log R = 0 (because there is in general no neat relationship between the standard errors of R and log R). Likelihood ratio tests will give exactly the same answer whether we work with R, log R or any other monotonic transformation of R. The other reason is that the Wald test uses two approximations (that we know the standard error, and that the distribution is chi-squared), whereas the likelihood ratio test uses one approximation (that the distribution is chi-squared).

Yet another alternative is the score test, which has the advantage that it can be formulated in situations where the variability is difficult to estimate; e.g. the Cochran–Mantel–Haenszel test is a score test.

29
Q

Chow Test

A

The Chow test is a statistical and econometric test of whether the coefficients in two linear regressions on different data sets are equal. The Chow test was invented by economist Gregory Chow. In econometrics, the Chow test is most commonly used in time series analysis to test for the presence of a structural break. In program evaluation, the Chow test is often used to determine whether the independent variables have different impacts on different subgroups of the population.

Writing each regression as y = a + b·x1 + c·x2 + ε, the null hypothesis of the Chow test asserts that a1 = a2, b1 = b2, and c1 = c2 across the two data sets, and there is the assumption that the model errors are independent and identically distributed from a normal distribution with unknown variance.

30
Q

Uniformly most powerful test

A

In statistical hypothesis testing, a uniformly most powerful (UMP) test is a hypothesis test which has the greatest power 1 − β among all possible tests of a given size α. For example, according to the Neyman–Pearson lemma, the likelihood-ratio test is UMP for testing simple (point) hypotheses.

31
Q

Likelihood-ratio test

A

In statistics, a likelihood ratio test is a statistical test used to compare the fit of two models, one of which (the null model) is a special case of the other (the alternative model). The test is based on the likelihood ratio, which expresses how many times more likely the data are under one model than the other. This likelihood ratio, or equivalently its logarithm, can then be used to compute a p-value, or compared to a critical value to decide whether to reject the null model in favour of the alternative model. When the logarithm of the likelihood ratio is used, the statistic is known as a log-likelihood ratio statistic, and the probability distribution of this test statistic, assuming that the null model is true, can be approximated using Wilks’ theorem.

In the case of distinguishing between two models, each of which has no unknown parameters, use of the likelihood ratio test can be justified by the Neyman–Pearson lemma, which demonstrates that such a test has the highest power among all competitors.

The likelihood ratio, often denoted by Λ (the capital Greek letter lambda), is the ratio of the likelihood function varying the parameters over two different sets in the numerator and denominator. A likelihood-ratio test is a statistical test for making a decision between two hypotheses based on the value of this ratio.

It is central to the Neyman–Pearson approach to statistical hypothesis testing, and, like statistical hypothesis testing generally, is both widely used and much criticized.

More: http://en.wikipedia.org/wiki/Likelihood_ratio_test

A statistical model is often a parametrized family of probability density functions or probability mass functions f(x|θ). A simple-vs-simple hypothesis test has completely specified models under both the null and alternative hypotheses, which for convenience are written in terms of fixed values of a notional parameter θ:

H0: θ = θ0

H1: θ = θ1

The likelihood ratio is Λ = L(θ0 | x) / L(θ1 | x).

If Λ > c, do not reject H0; if Λ < c, reject H0 (the threshold c is chosen to give the desired significance level).
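
A sketch of the common nested-model case, where Wilks’ theorem gives a chi-squared reference distribution (exponential data invented; H0 fixes the rate at 1):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x = rng.exponential(scale=1 / 1.5, size=80)  # true rate is 1.5

    def loglik(lam, x):
        # Log-likelihood of an exponential sample with rate lam.
        return len(x) * np.log(lam) - lam * x.sum()

    lam0 = 1.0              # H0: lambda = 1
    lam_hat = 1 / x.mean()  # unrestricted MLE

    # Wilks: 2*(ll(MLE) - ll(lam0)) is asymptotically chi-squared, 1 df.
    lr = 2 * (loglik(lam_hat, x) - loglik(lam0, x))
    p = stats.chi2.sf(lr, df=1)
    print(f"LR={lr:.2f}, p={p:.4f}")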

32
Q

McNemar’s test

A

In statistics, McNemar’s test is a normal approximation used on nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (“marginal homogeneity”). It is named after Quinn McNemar, who introduced it in 1947.[1] An application of the test in genetics is the transmission disequilibrium test for detecting genetic linkage.

The test is applied to a 2 × 2 contingency table, which tabulates the outcomes of two tests on a sample of n subjects

The null hypothesis of marginal homogeneity states that the two marginal probabilities for each outcome are the same.
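
A sketch assuming statsmodels’ mcnemar (table invented; only the discordant cells drive the test):

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    # 2 x 2 table of paired outcomes: rows = test 1 (+/-), cols = test 2 (+/-).
    table = np.array([
        [59, 6],
        [16, 80],
    ])
    # exact=True uses the binomial distribution on the discordant pairs (6, 16).
    res = mcnemar(table, exact=True)
    print(f"statistic={res.statistic:.1f}, p={res.pvalue:.4f}")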

Related tests

  • The binomial sign test gives an exact test for McNemar’s test.
  • Cochran’s Q test for two “treatments” is equivalent to McNemar’s test.
  • Liddell’s exact test is an exact alternative to McNemar’s test.[9][10]
  • The Stuart–Maxwell test is a different generalization of the McNemar test, used for testing marginal homogeneity in a square table with more than two rows/columns.[11]
  • Bhapkar’s test (1966) is a more powerful alternative to the Stuart–Maxwell test.
33
Q

Pearson’s chi-squared test

A

Pearson’s chi-squared test (χ2) is the best-known of several chi-squared tests (Yates, likelihood ratio, portmanteau test in time series, etc.) – statistical procedures whose results are evaluated by reference to the chi-squared distribution. Its properties were first investigated by Karl Pearson in 1900.[1] In contexts where it is important to make a distinction between the test statistic and its distribution, names similar to Pearson X-squared test or statistic are used. It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution. The events considered must be mutually exclusive and have total probability 1. A common case for this is where the events each cover an outcome of a categorical variable. A simple example is the hypothesis that an ordinary six-sided die is “fair”, i. e., all six outcomes are equally likely to occur

Pearson’s chi-squared test is used to assess two types of comparison: tests of goodness of fit and tests of independence.

  • A test of goodness of fit establishes whether or not an observed frequency distribution differs from a theoretical distribution.
  • A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each other (e.g. polling responses from people of different nationalities to see if one’s nationality affects the response).

The first step is to calculate the chi-squared test statistic, X², which resembles a normalized sum of squared deviations between observed and theoretical frequencies, X² = Σi (Oi − Ei)²/Ei. The second step is to determine the degrees of freedom, d, of that statistic, which is essentially the number of frequencies reduced by the number of parameters of the fitted distribution. In the third step, X² is compared to the critical value of no significance from the chi-squared distribution, which in many cases gives a good approximation of the distribution of X². A test that does not rely on this approximation is Fisher’s exact test; it is substantially more accurate in obtaining a significance level, especially with few observations.
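
Both uses are one-liners in SciPy; a sketch with invented counts:

    import numpy as np
    from scipy import stats

    # Goodness of fit: is a six-sided die fair? 120 rolls, 20 expected per face.
    observed = np.array([18, 24, 16, 21, 25, 16])
    stat, p = stats.chisquare(observed, f_exp=np.full(6, 20.0))
    print(f"goodness of fit: chi2={stat:.2f}, p={p:.4f}")

    # Independence: rows = nationality, columns = polling response.
    table = np.array([[30, 10], [25, 35]])
    stat2, p2, dof, expected = stats.chi2_contingency(table)
    print(f"independence: chi2={stat2:.2f}, dof={dof}, p={p2:.4f}")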

The chi-squared test, when used with the standard approximation that a chi-squared distribution is applicable, has the following assumptions:

(1) Simple random sample – The sample data is a random sampling from a fixed distribution or population where each member of the population has an equal probability of selection. Variants of the test have been developed for complex samples, such as where the data is weighted.
(2) Sample size (whole table) – A sample with a sufficiently large size is assumed. If a chi squared test is conducted on a sample with a smaller size, then the chi squared test will yield an inaccurate inference. The researcher, by using chi squared test on small samples, might end up committing a Type II error.
(3) Expected cell count – Adequate expected cell counts. Some require 5 or more, and others require 10 or more. A common rule is 5 or more in all cells of a 2-by-2 table, and 5 or more in 80% of cells in larger tables, but no cells with zero expected count. When this assumption is not met, Yates’s Correction is applied.
(4) Independence – The observations are always assumed to be independent of each other. This means chi-squared cannot be used to test correlated data (like matched pairs or panel data). In those cases you might want to turn to McNemar’s test.

34
Q

Yates’s correction for continuity

A

In statistics, Yates’ correction for continuity (or Yates’ chi-squared test) is used in certain situations when testing for independence in a contingency table. In some cases, Yates’ correction may adjust too far, and so its current use is limited.

35
Q

Portmanteau test

A

A portmanteau test is a type of statistical hypothesis test in which the null hypothesis is well specified, but the alternative hypothesis is more loosely specified. Tests constructed in this context can have the property of being at least moderately powerful against a wide range of departures from the null hypothesis. Thus, in applied statistics, a portmanteau test provides a reasonable way of proceeding as a general check of a model’s match to a dataset where there are many different ways in which the model may depart from the underlying data generating process. Use of such tests avoids having to be very specific about the particular type of departure being tested.

Usage Example

In time series analysis, two well-known versions of a portmanteau test are available for testing for autocorrelation in the residuals of a model: each tests whether any of a group of autocorrelations of the residual time series are different from zero. The better-known is the Ljung–Box test, an improved version of the Box–Pierce test, which was devised at essentially the same time; a seemingly trivial simplification (omitted in the improved test) was found to have a deleterious effect. This portmanteau test is useful in working with ARIMA models.

In the context of regression analysis, including regression analysis with time series structures, a portmanteau test has been devised which allows a general test to be made for the possibility that a range of types of nonlinear transformations of combinations of the explanatory variables should have been included in addition to a selected model structure.

36
Q

Ljung–Box test

A

The Ljung–Box test is a statistical test of whether any of a group of autocorrelations of a time series are different from zero. Instead of testing randomness at each distinct lag, it tests the “overall” randomness based on a number of lags, and is therefore a portmanteau test.

This test is sometimes known as the Ljung–Box Q test, and it is closely connected to the Box–Pierce test (which is named after George E. P. Box and David A. Pierce). In fact, the Ljung–Box test statistic was described explicitly in the paper that led to the use of the Box-Pierce statistic,[1][2] and from which that statistic takes its name. The Box-Pierce test statistic is a simplified version of the Ljung–Box statistic for which subsequent simulation studies have shown poor performance.

The Ljung–Box test is widely applied in econometrics and other applications of time series analysis.

The Ljung–Box test can be defined as follows.

H0: The data are independently distributed (i.e. the correlations in the population from which the sample is taken are 0, so that any observed correlations in the data result from randomness of the sampling process).
Ha: The data are not independently distributed.

The Ljung–Box test is commonly used in autoregressive integrated moving average (ARIMA) modeling. Note that it is applied to the residuals of a fitted ARIMA model, not the original series, and in such applications the hypothesis actually being tested is that the residuals from the ARIMA model have no autocorrelation. When testing ARIMA models, no adjustment to the test statistic or to the critical region of the test are made in relation to the structure of the ARIMA model.
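
A sketch with statsmodels’ acorr_ljungbox (white noise as a stand-in for residuals):

    import numpy as np
    from statsmodels.stats.diagnostic import acorr_ljungbox

    rng = np.random.default_rng(8)
    residuals = rng.normal(size=200)  # stand-in for fitted ARIMA residuals

    # Tests the "overall" autocorrelation up to each requested lag.
    result = acorr_ljungbox(residuals, lags=[5, 10], return_df=True)
    print(result)  # columns: lb_stat, lb_pvalue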

37
Q

Tukey’s range test

A

Tukey’s test, also known as the Tukey range test, Tukey method, Tukey’s honest significance test, Tukey’s HSD (honestly significant difference) test,[1] or the Tukey–Kramer method, is a single-step multiple comparison procedure and statistical test. It is used in conjunction with an ANOVA to find means that are significantly different from each other. Named after John Tukey, it compares all possible pairs of means, and is based on a studentized range distribution (q) (this distribution is similar to the distribution of t from the t-test).[2] The Tukey HSD tests should not be confused with the Tukey Mean Difference tests (also known as the Bland-Altman Test).

Tukey’s test compares the means of every treatment to the means of every other treatment; that is, it applies simultaneously to the set of all pairwise comparisons

and identifies any difference between two means that is greater than the expected standard error. The confidence coefficient for the set, when all sample sizes are equal, is exactly 1 − α. For unequal sample sizes, the confidence coefficient is greater than 1 − α. In other words, the Tukey method is conservative when there are unequal sample sizes.

Assumptions of Tukey’s test

(1) The observations being tested are independent
(2) There is equal within-group variance across the groups associated with each mean in the test (homogeneity of variance).

Tukey’s test is based on a formula very similar to that of the t-test. In fact, Tukey’s test is essentially a t-test, except that it corrects for experiment-wise error rate (when there are multiple comparisons being made, the probability of making a type I error increases — Tukey’s test corrects for that, and is thus more suitable for multiple comparisons than doing a number of t-tests would be).
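
A sketch with statsmodels’ pairwise_tukeyhsd (group data invented):

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    values = np.array([6.9, 5.4, 5.8, 4.6, 4.0,
                       8.3, 6.8, 7.8, 9.2, 6.5,
                       8.0, 10.5, 8.1, 6.9, 9.3])
    groups = np.repeat(["g1", "g2", "g3"], 5)

    # All pairwise mean comparisons at a family-wise error rate of alpha.
    print(pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05))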

Related:

If only pairwise comparisons are to be made, the Tukey–Kramer method will result in a narrower confidence limit (which is preferable and more powerful) than Scheffé’s method. In the general case when many or all contrasts might be of interest, Scheffé’s method tends to give narrower confidence limits and is therefore the preferred method.

- Newman–Keuls method

38
Q

Newman–Keuls method

A

The Newman–Keuls (or Student–Newman–Keuls) method is a stepwise multiple comparison procedure used, like Tukey’s range test, to identify sample means that differ significantly from each other. It is also based on the studentized range distribution, but applies progressively smaller critical values depending on how many means separate the pair being compared, making it more powerful but less conservative than Tukey’s HSD.
39
Q

G-test

A

In statistics, G-tests are likelihood-ratio or maximum likelihood statistical significance tests that are increasingly being used in situations where chi-squared tests were previously recommended.

The general formula for G is

G = 2 Σi Oi ln(Oi / Ei)

where Oi is the observed frequency in a cell, Ei is the expected frequency under the null hypothesis, ln denotes the natural logarithm (log to the base e), and the sum is taken over all non-empty cells.
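
SciPy exposes G through the power-divergence family; a sketch with invented counts:

    import numpy as np
    from scipy import stats

    observed = np.array([18, 24, 16, 21, 25, 16])
    expected = np.full(6, 20.0)

    # lambda_="log-likelihood" makes the power-divergence statistic equal G.
    g, p = stats.power_divergence(observed, f_exp=expected,
                                  lambda_="log-likelihood")
    print(f"G={g:.2f}, p={p:.4f}")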

G-tests are coming into increasing use; they have been recommended at least since the 1981 edition of the popular statistics textbook by Sokal and Rohlf.

Related

Given the null hypothesis that the observed frequencies result from random sampling from a distribution with the given expected frequencies, the distribution of G is approximately a chi-squared distribution, with the same number of degrees of freedom as in the corresponding chi-squared test.

For very small samples the multinomial test for goodness of fit, and Fisher’s exact test for contingency tables, or even Bayesian hypothesis selection are preferable to the G-test.

The commonly used chi-squared tests for goodness of fit to a distribution and for independence in contingency tables are in fact approximations of the log-likelihood ratio on which the G-tests are based.

The G-test quantity is proportional to the Kullback–Leibler divergence of the empirical distribution from the theoretical distribution.

For analysis of contingency tables the value of G can also be expressed in terms of mutual information.

An application of the G-test is known as the McDonald–Kreitman test in statistical genetics. Dunning[6] introduced the test to the computational linguistics community where it is now widely used.

40
Q

Fisher’s exact test

A

Fisher’s exact test is a statistical significance test used in the analysis of contingency tables. Although in practice it is employed when sample sizes are small, it is valid for all sample sizes. It is named after its inventor, R. A. Fisher, and is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity, as with many statistical tests. Fisher is said to have devised the test following a comment from Dr Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her cup; see lady tasting tea.

The test is useful for categorical data that result from classifying objects in two different ways; it is used to examine the significance of the association (contingency) between the two kinds of classification. So in Fisher’s original example, one criterion of classification could be whether milk or tea was put in the cup first; the other could be whether Dr Bristol thinks that the milk or tea was put in first. We want to know whether these two classifications are associated – that is, whether Dr Bristol really can tell whether milk or tea was poured in first. Most uses of the Fisher test involve, like this example, a 2 × 2 contingency table. The p-value from the test is computed as if the margins of the table are fixed, i.e. as if, in the tea-tasting example, Dr Bristol knows the number of cups with each treatment (milk or tea first) and will therefore provide guesses with the correct number in each category. As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.

With large samples, a chi-squared test can be used in this situation. However, the significance value it provides is only an approximation, because the sampling distribution of the test statistic that is calculated is only approximately equal to the theoretical chi-squared distribution. The approximation is inadequate when sample sizes are small, or the data are very unequally distributed among the cells of the table, resulting in the cell counts predicted on the null hypothesis (the “expected values”) being low. The usual rule of thumb for deciding whether the chi-squared approximation is good enough is that the chi-squared test is not suitable when the expected values in any of the cells of a contingency table are below 5, or below 10 when there is only one degree of freedom (this rule is now known to be overly conservative[4]). In fact, for small, sparse, or unbalanced data, the exact and asymptotic p-values can be quite different and may lead to opposite conclusions concerning the hypothesis of interest.[5][6] In contrast the Fisher test is, as its name states, exact as long as the experimental procedure keeps the row and column totals fixed, and it can therefore be used regardless of the sample characteristics. It becomes difficult to calculate with large samples or well-balanced tables, but fortunately these are exactly the conditions where the chi-squared test is appropriate.

For hand calculations, the test is only feasible in the case of a 2 × 2 contingency table. However the principle of the test can be extended to the general case of an m × n table, and some statistical packages provide a calculation (sometimes using a Monte Carlo method to obtain an approximation) for the more general case.
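
For the 2 × 2 case, SciPy computes the exact p-value directly; a sketch in the spirit of the tea-tasting example (counts invented):

    import numpy as np
    from scipy import stats

    # Rows = truth (milk first / tea first), cols = Dr Bristol's guess.
    table = np.array([
        [3, 1],
        [1, 3],
    ])
    oddsratio, p = stats.fisher_exact(table, alternative="two-sided")
    print(f"odds ratio={oddsratio:.2f}, p={p:.4f}")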

Despite the fact that Fisher’s test gives exact p-values, some authors have argued that it is conservative, i.e. that its actual rejection rate is below the nominal significance level. The apparent contradiction stems from the combination of a discrete statistic with fixed significance levels. To be more precise, consider the following proposal for a significance test at the 5% level: reject the null hypothesis for each table to which Fisher’s test assigns a p-value equal to or smaller than 5%. Because the set of all tables is discrete, there may not be a table for which equality is achieved. If αe is the largest p-value smaller than 5% which can actually occur for some table, then the proposed test effectively tests at the αe level. For small sample sizes, αe might be significantly lower than 5%. While this effect occurs for any discrete statistic (not just in contingency tables, or for Fisher’s test), it has been argued that the problem is compounded by the fact that Fisher’s test conditions on the marginals. To avoid the problem, many authors discourage the use of fixed significance levels when dealing with discrete problems.

Another early discussion revolved around the necessity to condition on the marginals. Fisher’s test gives exact p-values both for fixed and for random marginals. Other tests, most prominently Barnard’s, require random marginals. Some authors (including, later, Barnard himself[13]) have criticized Barnard’s test based on this property. They argue that the marginal totals are an (almost[14]) ancillary statistic, containing (almost) no information about the tested property.

Related:

An alternative exact test, Barnard’s exact test, has been developed and proponents of it suggest that this method is more powerful, particularly in 2 × 2 tables. Another alternative is to use maximum likelihood estimates to calculate a p-value from the exact binomial or multinomial distributions and accept or reject based on the p-value.

source: wikipedia

41
Q

Barnard’s test

A

In statistics, Barnard’s test is an exact test of the null hypothesis of independence of rows and columns in a contingency table. It is an alternative to Fisher’s exact test but is more time-consuming to compute.

42
Q

Chi-squared test

A

A chi-squared test, also referred to as a chi-square test or χ² test, is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-squared distribution as closely as desired by making the sample size large enough.

Some examples of chi-squared tests where the chi-squared distribution is only approximately valid:

  • Pearson’s chi-squared test, also known as the chi-squared goodness-of-fit test or chi-squared test for independence. When the chi-squared test is mentioned without any modifiers or without other precluding context, this test is usually meant (for an exact test used in its place, see Fisher’s exact test).
  • Yates’s correction for continuity, also known as Yates’ chi-squared test.
  • Cochran–Mantel–Haenszel chi-squared test.
  • McNemar’s test, used in certain 2 × 2 tables with pairing
  • Linear-by-linear association chi-squared test
  • The portmanteau test in time-series analysis, testing for the presence of autocorrelation
  • Likelihood-ratio tests in general statistical modelling, for testing whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one).

One case where the distribution of the test statistic is an exact chi-squared distribution is the test that the variance of a normally distributed population has a given value based on a sample variance. Such a test is uncommon in practice because values of variances to test against are seldom known exactly.

43
Q

Deviance

A

In statistics, deviance is a quality of fit statistic for a model that is often used for statistical hypothesis testing.

The deviance for a model M0, based on a dataset y, is defined as

D(y) = −2 [log p(y | θ̂0) − log p(y | θ̂s)]

Here θ̂0 denotes the fitted values of the parameters in the model M0, while θ̂s denotes the fitted parameters for the “full model” (or “saturated model”): both sets of fitted values are implicitly functions of the observations y. Here the full model is a model with a parameter for every observation so that the data are fitted exactly. This expression is simply −2 times the log-likelihood ratio of the reduced model compared to the full model. The deviance is used to compare two models, in particular in the case of generalized linear models, where it has a similar role to residual variance from ANOVA in linear models (RSS).

Suppose in the framework of the GLM, we have two nested models, M1 and M2. In particular, suppose that M1 contains the parameters in M2 plus k additional parameters. Then, under the null hypothesis that M2 is the true model, the difference between the deviances for the two models follows an approximate chi-squared distribution with k degrees of freedom.

44
Q

Permutation tests

A

A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. In other words, the method by which treatments are allocated to subjects in an experimental design is mirrored in the analysis of that design. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance levels; see also exchangeability. Confidence intervals can then be derived from the tests. The theory has evolved from the works of R.A. Fisher and E.J.G. Pitman in the 1930s.

To illustrate the basic idea of a permutation test, suppose we have two groups A and B whose sample means are x̄_A and x̄_B, and that we want to test, at the 5% significance level, whether they come from the same distribution. Let n_A and n_B be the sample sizes corresponding to each group. The permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis H0 that the two groups have identical probability distributions.

The test proceeds as follows. First, the difference in means between the two samples is calculated: this is the observed value of the test statistic, T(obs). Then the observations of groups A and B are pooled.

Next, the difference in sample means is calculated and recorded for every possible way of dividing these pooled values into two groups of size n_A and n_B (i.e., for every permutation of the group labels A and B). The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that group label does not matter.

The one-sided p-value of the test is calculated as the proportion of sampled permutations where the difference in means was greater than or equal to T(obs). The two-sided p-value of the test is calculated as the proportion of sampled permutations where the absolute difference was greater than or equal to ABS(T(obs)).

If the only purpose of the test is to reject or not reject the null hypothesis, we can as an alternative sort the recorded differences, and then observe whether T(obs) is contained within the middle 95% of them. If it is not, we reject the hypothesis of identical probability distributions at the 5% significance level.
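
A minimal sketch of this procedure in Python, enumerating every split of the pooled values (practical only for small samples; the two groups are hypothetical data):

    from itertools import combinations
    import numpy as np

    a = np.array([19.8, 23.4, 27.5, 20.1, 25.0])   # group A (hypothetical)
    b = np.array([18.0, 21.2, 19.1, 22.3])         # group B (hypothetical)
    pooled = np.concatenate([a, b])
    n_a = len(a)

    t_obs = a.mean() - b.mean()

    # Every way of choosing which pooled indices form "group A"
    diffs = []
    for idx in combinations(range(len(pooled)), n_a):
        mask = np.zeros(len(pooled), dtype=bool)
        mask[list(idx)] = True
        diffs.append(pooled[mask].mean() - pooled[~mask].mean())
    diffs = np.array(diffs)

    p_one_sided = np.mean(diffs >= t_obs)
    p_two_sided = np.mean(np.abs(diffs) >= abs(t_obs))
    print(t_obs, p_one_sided, p_two_sided)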

Relation to parametric tests
Permutation tests are a subset of non-parametric statistics. The basic premise is to use only the assumption that it is possible that all of the treatment groups are equivalent, and that every member of them is the same before sampling began (i.e. the slot that they fill is not differentiable from other slots before the slots are filled). From this, one can calculate a statistic and then see to what extent this statistic is special by seeing how likely it would be if the treatment assignments had been jumbled.

In contrast to permutation tests, the reference distributions for many popular “classical” statistical tests, such as the t-test, F-test, z-test and χ2 test, are obtained from theoretical probability distributions. Fisher’s exact test is an example of a commonly used permutation test for evaluating the association between two dichotomous variables. When sample sizes are large, Pearson’s chi-squared test will give accurate results. For small samples, the chi-squared reference distribution cannot be assumed to give a correct description of the probability distribution of the test statistic, and in this situation the use of Fisher’s exact test becomes more appropriate. A rule of thumb is that the expected count in each cell of the table should be greater than 5 before Pearson’s chi-squared test is used.

Permutation tests exist in many situations where parametric tests do not (e.g., when deriving an optimal test when losses are proportional to the size of an error rather than its square). All simple and many relatively complex parametric tests have a corresponding permutation test version that is defined by using the same test statistic as the parametric test, but obtains the p-value from the sample-specific permutation distribution of that statistic, rather than from the theoretical distribution derived from the parametric assumption. For example, it is possible in this manner to construct a permutation t-test, a permutation chi-squared test of association, a permutation version of Aly’s test for comparing variances and so on.

The major downsides to permutation tests are that they:

  • Can be computationally intensive and may require “custom” code for difficult-to-calculate statistics, which must be rewritten for every case.
  • Are primarily used to provide a p-value; inverting the test to get confidence regions/intervals requires even more computation.
Advantages
Permutation tests exist for any test statistic, regardless of whether or not its distribution is known. Thus one is always free to choose the statistic which best discriminates between hypothesis and alternative and which minimizes losses.

Permutation tests can be used for analyzing unbalanced designs [7] and for combining dependent tests on mixtures of categorical, ordinal, and metric data (Pesarin, 2001).

Before the 1980s, the burden of creating the reference distribution was overwhelming except for data sets with small sample sizes.

Since the 1980s, the confluence of relatively inexpensive fast computers and the development of new sophisticated path algorithms applicable in special situations has made the application of permutation test methods practical for a wide range of problems. It has also driven the addition of exact-test options in the main statistical software packages and the appearance of specialized software for performing a wide range of uni- and multi-variable exact tests and computing test-based “exact” confidence intervals.

Limitations
An important assumption behind a permutation test is that the observations are exchangeable under the null hypothesis. An important consequence of this assumption is that tests of difference in location (like a permutation t-test) require equal variance. In this respect, the permutation t-test shares the same weakness as the classical Student’s t-test (the Behrens–Fisher problem). A third alternative in this situation is to use a bootstrap-based test. Good (2000) explains the difference between permutation tests and bootstrap tests the following way: “Permutations test hypotheses concerning distributions; bootstraps test hypotheses concerning parameters. As a result, the bootstrap entails less-stringent assumptions.” Of course, bootstrap tests are not exact.

Monte Carlo testing
An asymptotically equivalent permutation test can be created when there are too many possible orderings of the data to allow complete enumeration in a convenient manner. This is done by generating the reference distribution by Monte Carlo sampling, which takes a small (relative to the total number of permutations) random sample of the possible replicates. The realization that this could be applied to any permutation test on any dataset was an important breakthrough in the area of applied statistics. The earliest known reference to this approach is Dwass (1957).[8] This type of permutation test is known under various names: approximate permutation test, Monte Carlo permutation tests or random permutation tests.[9]

After N random permutations, it is possible to obtain a confidence interval for the p-value based on the binomial distribution. For example, if after N random permutations the p-value is estimated to be p̂ = 0.05, then a 99% confidence interval for the true p (the one that would result from trying all possible permutations) is p̂ ± 2.576 · sqrt(p̂(1 − p̂)/N).

On the other hand, the purpose of estimating the p-value is most often to decide whether p ≤ α, where α is the threshold at which the null hypothesis will be rejected (typically α = 0.05). In the example above, the confidence interval only tells us that there is roughly a 50% chance that the p-value is smaller than 0.05, i.e. it is completely unclear whether the null hypothesis should be rejected at a level α = 0.05.

If it is only important to know whether p ≤ α for a given α, it is logical to continue simulating until the statement can be established to be true or false with a very low probability of error. Given a bound ε on the admissible probability of error (the probability of finding that p ≤ α when in fact p > α, or vice versa), the question of how many permutations to generate can be seen as the question of when to stop generating permutations, based on the outcomes of the simulations so far, in order to guarantee that the conclusion (which is either p ≤ α or p > α) is correct with probability at least as large as 1 − ε. (ε will typically be chosen to be extremely small, e.g. 1/1000.) Stopping rules to achieve this have been developed[10] which can be incorporated with minimal additional computational cost. In fact, depending on the true underlying p-value, it will often be found that the number of simulations required is remarkably small (e.g. as low as 5 and often not larger than 100) before a decision can be reached with virtual certainty.
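
A sketch of the Monte Carlo variant, estimating the p-value from random permutations and attaching the 99% binomial-based interval described above (data simulated for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    a = rng.normal(0.5, 1.0, size=30)     # hypothetical group A
    b = rng.normal(0.0, 1.0, size=30)     # hypothetical group B
    pooled = np.concatenate([a, b])
    n_a = len(a)
    t_obs = a.mean() - b.mean()

    n_perm = 10_000
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if perm[:n_a].mean() - perm[n_a:].mean() >= t_obs:
            count += 1
    p_hat = count / n_perm

    # 99% normal-approximation confidence interval for the true p
    half_width = 2.576 * np.sqrt(p_hat * (1 - p_hat) / n_perm)
    print(p_hat, (p_hat - half_width, p_hat + half_width))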

45
Q

More stuff

A

http://en.wikipedia.org/wiki/Category:Non-parametric_statistics

Hoeffding’s independence test

Nemenyi test

Multinomial test

Anderson–Darling test

Mantel test

Location tests,

Given a type of problem, pick the appropriate test used.

Examples of tests

Clean up cards

Separate into different types of data (categorical vs. other?) Or have a card that lists different types of tests

Separate tests into uses

There’s a great table of stuff in both: http://en.wikipedia.org/wiki/Statistical_tests

http://en.wikipedia.org/wiki/Comparing_means

Neyman–Pearson lemma

Logistic regression#Introduction

Make simpler cards and make connections between things clearer. Boil down to most important stuff and split up information.

http://en.wikipedia.org/wiki/Pitman_permutation_test
http://en.wikipedia.org/wiki/Non-parametric_statistics

46
Q

Van der Waerden test

A

Named for the Dutch mathematician Bartel Leendert van der Waerden, the Van der Waerden test is a statistical test that k population distribution functions are equal. The Van der Waerden test converts the ranks from a standard Kruskal-Wallis one-way analysis of variance to quantiles of the standard normal distribution. These are called normal scores, and the test is computed from these normal scores.
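
A sketch of the rank-to-normal-score conversion, assuming the usual Van der Waerden definition of the normal scores as Φ−1(R/(N + 1)), where R is an observation’s rank and N the total sample size (data hypothetical):

    import numpy as np
    from scipy import stats

    # Pooled observations from the k groups (hypothetical data)
    x = np.array([12.1, 14.3, 9.8, 15.0, 11.2, 13.7])
    N = len(x)
    ranks = stats.rankdata(x)                    # ranks 1..N (ties averaged)
    normal_scores = stats.norm.ppf(ranks / (N + 1))
    print(normal_scores)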

The k population version of the test is an extension of the test for two populations published by Van der Waerden.

Analysis of Variance (ANOVA) is a data analysis technique for examining the significance of the factors (independent variables) in a multi-factor model. The one factor model can be thought of as a generalization of the two sample t-test. That is, the two sample t-test is a test of the hypothesis that two population means are equal. The one factor ANOVA tests the hypothesis that k population means are equal. The standard ANOVA assumes that the errors (i.e., residuals) are normally distributed. If this normality assumption is not valid, an alternative is to use a non-parametric test.

Related methods

The most common non-parametric test for the one-factor model is the Kruskal-Wallis test. The Kruskal-Wallis test is based on the ranks of the data. The advantage of the Van der Waerden test is that it provides the high efficiency of the standard ANOVA analysis when the normality assumptions are in fact satisfied, but it also provides the robustness of the Kruskal-Wallis test when the normality assumptions are not satisfied.

47
Q

Student’s t-test

A

A t-test is any statistical hypothesis test in which the test statistic follows a Student’s t distribution if the null hypothesis is supported. It is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student’s t distribution.

Among the most frequently used t-tests are:

  • A one-sample location test of whether the mean of a normally distributed population has a value specified in a null hypothesis.
  • A two sample location test of the null hypothesis that the means of two normally distributed populations are equal. All such tests are usually called Student’s t-tests, though strictly speaking that name should only be used if the variances of the two populations are also assumed to be equal; the form of the test used when this assumption is dropped is sometimes called Welch’s t-test. These tests are often referred to as “unpaired” or “independent samples” t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping.[5]
  • A test of the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero. For example, suppose we measure the size of a cancer patient’s tumor before and after a treatment. If the treatment is effective, we expect the tumor size for many of the patients to be smaller following the treatment. This is often referred to as the “paired” or “repeated measures” t-test:[5][6] see paired difference test.
  • A test of whether the slope of a regression line differs significantly from 0.

Assumptions

Most t-test statistics have the form T = Z/s, where Z and s are functions of the data. Typically, Z is designed to be sensitive to the alternative hypothesis (i.e. its magnitude tends to be larger when the alternative hypothesis is true), whereas s is a scaling parameter that allows the distribution of T to be determined.

As an example, in the one-sample t-test Z = X̄ / (σ/√n), where X̄ is the sample mean of the data, n is the sample size, and σ is the population standard deviation of the data; s in the one-sample t-test is σ̂/σ, where σ̂ is the sample standard deviation.

The assumptions underlying a t-test are that

(1) Z follows a standard normal distribution under the null hypothesis
(2) s^2 follows a χ2 distribution with p degrees of freedom under the null hypothesis, where p is a positive constant
(3) Z and s are independent.

In a specific type of t-test, these conditions are consequences of the population being studied, and of the way in which the data are sampled. For example, in the t-test comparing the means of two independent samples, the following assumptions should be met:

(1) Each of the two populations being compared should follow a normal distribution. This can be tested using a normality test, such as the Shapiro-Wilk or Kolmogorov–Smirnov test, or it can be assessed graphically using a normal quantile plot.
(2) If using Student’s original definition of the t-test, the two populations being compared should have the same variance (testable using F test, Levene’s test, Bartlett’s test, or the Brown–Forsythe test; or assessable graphically using a Q-Q plot). If the sample sizes in the two groups being compared are equal, Student’s original t-test is highly robust to the presence of unequal variances.[7] Welch’s t-test is insensitive to equality of the variances regardless of whether the sample sizes are similar.
(3) The data used to carry out the test should be sampled independently from the two populations being compared. This is in general not testable from the data, but if the data are known to be dependently sampled (i.e. if they were sampled in clusters), then the classical t-tests discussed here may give misleading results.
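
The assumptions above can be checked and the test run in a few lines; a minimal sketch with scipy (the two samples are simulated for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.normal(10.0, 2.0, size=40)    # hypothetical sample 1
    y = rng.normal(11.0, 2.0, size=35)    # hypothetical sample 2

    print(stats.shapiro(x), stats.shapiro(y))   # normality checks
    print(stats.levene(x, y))                   # equal-variance check

    # equal_var=True gives Student's form; False gives Welch's form
    t_student, p_student = stats.ttest_ind(x, y, equal_var=True)
    t_welch, p_welch = stats.ttest_ind(x, y, equal_var=False)
    print(p_student, p_welch)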

Types:

Two-sample t-tests for a difference in mean involve independent samples, paired samples and overlapping samples. Paired t-tests are a form of blocking, and have greater power than unpaired tests when the paired units are similar with respect to “noise factors” that are independent of membership in the two groups being compared.[8] In a different context, paired t-tests can be used to reduce the effects of confounding factors in an observational study.

Related:

The t-test provides an exact test for the equality of the means of two normal populations with unknown, but equal, variances. (Welch’s t-test is a nearly exact test for the case where the data are normal but the variances may differ.) For moderately large samples and a one-tailed test, the t-test is relatively robust to moderate violations of the normality assumption.

For exactness, the t-test and Z-test require normality of the sample means, and the t-test additionally requires that the sample variance follows a scaled χ2 distribution, and that the sample mean and sample variance be statistically independent. Normality of the individual data values is not required if these conditions are met. By the central limit theorem, sample means of moderately large samples are often well-approximated by a normal distribution even if the data are not normally distributed. For non-normal data, the distribution of the sample variance may deviate substantially from a χ2 distribution. However, if the sample size is large, Slutsky’s theorem implies that the distribution of the sample variance has little effect on the distribution of the test statistic. If the data are substantially non-normal and the sample size is small, the t-test can give misleading results. See Location test for Gaussian scale mixture distributions for some theory related to one particular family of non-normal distributions.

When the normality assumption does not hold, a non-parametric alternative to the t-test can often have better statistical power. For example, for two independent samples, when the data distributions are asymmetric (that is, the distributions are skewed) or the distributions have heavy tails, the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) can have three to four times higher power than the t-test. The nonparametric counterpart to the paired samples t-test is the Wilcoxon signed-rank test for paired samples. For a discussion on choosing between the t-test and nonparametric alternatives, see Sawilowsky.

One-way analysis of variance generalizes the two-sample t-test when the data belong to more than two groups.

48
Q

Welch’s t test

A

In statistics, Welch’s t-test is an adaptation of Student’s t-test intended for use with two samples having possibly unequal variances.[1] As such, it is an approximate solution to the Behrens–Fisher problem.

Once t and ν (the degrees of freedom) have been computed, these statistics can be used with the t-distribution to test the null hypothesis that the two population means are equal (using a two-tailed test), or the null hypothesis that one of the population means is greater than or equal to the other (using a one-tailed test). In particular, the test will yield a p-value which might or might not give evidence sufficient to reject the null hypothesis.

Welch’s t-test defines the statistic t by the following formula, where X̄_i, s_i², and N_i are the i-th sample mean, sample variance, and sample size:

t = (X̄_1 − X̄_2) / sqrt(s_1²/N_1 + s_2²/N_2)

Unlike in Student’s t-test, the denominator is not based on a pooled variance estimate.

The degrees of freedom ν associated with this variance estimate are approximated using the Welch–Satterthwaite equation:

ν ≈ (s_1²/N_1 + s_2²/N_2)² / [ (s_1²/N_1)²/(N_1 − 1) + (s_2²/N_2)²/(N_2 − 1) ]
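
A minimal sketch computing t and the Welch–Satterthwaite degrees of freedom directly, then checking the result against scipy (the two samples are hypothetical):

    import numpy as np
    from scipy import stats

    x1 = np.array([27.5, 21.0, 19.0, 23.6, 17.0, 17.9, 16.9, 20.1])
    x2 = np.array([27.1, 22.0, 20.8, 23.4, 23.4, 23.5, 25.8])

    m1, m2 = x1.mean(), x2.mean()
    v1, v2 = x1.var(ddof=1), x2.var(ddof=1)
    n1, n2 = len(x1), len(x2)

    t = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    nu = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    p = 2 * stats.t.sf(abs(t), df=nu)
    print(t, nu, p)
    print(stats.ttest_ind(x1, x2, equal_var=False))  # should agree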

49
Q

Hotelling’s T-squared statistic

A

Hotelling’s T-squared statistic is a generalization of Student’s t statistic that is used in multivariate hypothesis testing.

50
Q

F-test

A

An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. Exact F-tests mainly arise when the models have been fitted to the data using least squares. The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher initially developed the statistic as the variance ratio in the 1920s.

Examples of F-tests include:

(1) The hypothesis that the means of several normally distributed populations, all having the same standard deviation, are equal. This is perhaps the best-known F-test, and plays an important role in the analysis of variance (ANOVA).
(2) The hypothesis that a proposed regression model fits the data well. See Lack-of-fit sum of squares.
(3) The hypothesis that a data set in a regression analysis follows the simpler of two proposed linear models that are nested within each other.
(4) Scheffé’s method for multiple comparisons adjustment in linear models.

F-test of the equality of two variances
Main article: F-test of equality of variances
This F-test is sensitive to non-normality.[2][3] In the analysis of variance (ANOVA), alternative tests include Levene’s test, Bartlett’s test, and the Brown–Forsythe test. However, when any of these tests are conducted to test the underlying assumption of homoscedasticity (i.e. homogeneity of variance), as a preliminary step to testing for mean effects, there is an increase in the experiment-wise Type I error rate.[4]
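
A minimal sketch of the two-sample variance-ratio F-test itself, assuming normal data (the samples are hypothetical):

    import numpy as np
    from scipy import stats

    x = np.array([3.1, 2.8, 3.4, 3.0, 2.9, 3.6, 3.3])
    y = np.array([2.2, 3.9, 4.1, 1.8, 2.5, 4.4])

    f = x.var(ddof=1) / y.var(ddof=1)      # ratio of sample variances
    dfn, dfd = len(x) - 1, len(y) - 1
    # Two-sided p-value: double the smaller tail probability
    p = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
    print(f, p)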

51
Q

Z-test

A

A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Because of the central limit theorem, many test statistics are approximately normally distributed for large samples. For each significance level, the Z-test has a single critical value (for example, 1.96 for 5% two-tailed), which makes it more convenient than the Student’s t-test, which has separate critical values for each sample size. Therefore, many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known. If the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large (n < 30), the Student’s t-test may be more appropriate.

If T is a statistic that is approximately normally distributed under the null hypothesis, the next step in performing a Z-test is to estimate the expected value θ of T under the null hypothesis, and then obtain an estimate s of the standard deviation of T. We then calculate the standard score Z = (T − θ) / s, from which one-tailed and two-tailed p-values can be calculated as Φ(−|Z|) and 2Φ(−|Z|), respectively, where Φ is the standard normal cumulative distribution function.

Requirements and assumptions:

For the Z-test to be applicable, certain conditions must be met.

Nuisance parameters should be known, or estimated with high accuracy (an example of a nuisance parameter would be the standard deviation in a one-sample location test). Z-tests focus on a single parameter, and treat all other unknown parameters as being fixed at their true values. In practice, due to Slutsky’s theorem, “plugging in” consistent estimates of nuisance parameters can be justified. However if the sample size is not large enough for these estimates to be reasonably accurate, the Z-test may not perform well.
The test statistic should follow a normal distribution. Generally, one appeals to the central limit theorem to justify assuming that a test statistic varies normally. There is a great deal of statistical research on the question of when a test statistic varies approximately normally. If the variation of the test statistic is strongly non-normal, a Z-test should not be used.
If estimates of nuisance parameters are plugged in as discussed above, it is important to use estimates appropriate for the way the data were sampled. In the special case of Z-tests for the one or two sample location problem, the usual sample standard deviation is only appropriate if the data were collected as an independent sample.

In some situations, it is possible to devise a test that properly accounts for the variation in plug-in estimates of nuisance parameters. In the case of one and two sample location problems, a t-test does this.

Examples: (location tests)

The term Z-test is often used to refer specifically to the one-sample location test comparing the mean of a set of measurements to a given constant. If the observed data X1, …, Xn are (i) uncorrelated, (ii) have a common mean μ, and (iii) have a common variance σ2, then the sample average X̄ has mean μ and variance σ2 / n. If our null hypothesis is that the mean value of the population is a given number μ0, we can use X̄ − μ0 as a test statistic, rejecting the null hypothesis if X̄ − μ0 is large.

To calculate the standardized statistic Z = (X̄ − μ0) / s, we need either to know or to have an approximate value for σ2, from which we can calculate s2 = σ2 / n. In some applications, σ2 is known, but this is uncommon. If the sample size is moderate or large, we can substitute the sample variance for σ2, giving a plug-in test. The resulting test will not be an exact Z-test, since the uncertainty in the sample variance is not accounted for; however, it will be a good approximation unless the sample size is small. A t-test can be used to account for the uncertainty in the sample variance when the sample size is small and the data are exactly normal. There is no universal constant at which the sample size is generally considered large enough to justify use of the plug-in test. Typical rules of thumb range from 20 to 50 samples. For larger sample sizes, the t-test procedure gives almost identical p-values as the Z-test procedure.

Other location tests that can be performed as Z-tests are the two-sample location test and the paired difference test.

Examples: (Z-tests other than location tests)

Location tests are the most familiar Z-tests. Another class of Z-tests arises in maximum likelihood estimation of the parameters in a parametric statistical model. Maximum likelihood estimates are approximately normal under certain conditions, and their asymptotic variance can be calculated in terms of the Fisher information. The maximum likelihood estimate divided by its standard error can be used as a test statistic for the null hypothesis that the population value of the parameter equals zero. More generally, if hat_θ is the maximum likelihood estimate of a parameter θ, and θ_0 is the value of θ under the null hypothesis, then

Z = (hat_θ − θ_0) / SE(hat_θ)

can be used as a Z-test statistic.
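
A sketch of this Wald-type Z statistic from a maximum likelihood fit; the logistic regression and the simulated data are illustrative assumptions (statsmodels computes the same estimate/standard-error ratio internally):

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(3)
    n = 500
    x = rng.normal(size=n)
    prob = 1 / (1 + np.exp(-(0.2 + 0.8 * x)))
    y = rng.binomial(1, prob)

    X = sm.add_constant(x)
    fit = sm.Logit(y, X).fit(disp=0)
    z = fit.params / fit.bse              # Z statistics for H0: parameter = 0
    p = 2 * stats.norm.sf(np.abs(z))      # two-sided p-values
    print(z, p)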

When using a Z-test for maximum likelihood estimates, it is important to be aware that the normal approximation may be poor if the sample size is not sufficiently large. Although there is no simple, universal rule stating how large the sample size must be to use a Z-test, simulation can give a good idea as to whether a Z-test is appropriate in a given situation.

Z-tests are employed whenever it can be argued that a test statistic follows a normal distribution under the null hypothesis of interest. Many non-parametric test statistics, such as U statistics, are approximately normal for large enough sample sizes, and hence are often performed as Z-tests.

Example Usage:

Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 points, and 12 points, respectively. Our interest is in the scores of 55 students in a particular school who received a mean score of 96. We can ask whether this mean score is significantly lower than the regional mean — that is, are the students in this school comparable to a simple random sample of 55 students from the region as a whole, or are their scores surprisingly low?

We begin by calculating the standard error of the mean: σ/√n = 12/√55 ≈ 1.62.

Next we calculate the z-score, which is the distance from the sample mean to the population mean in units of the standard error: z = (96 − 100)/1.62 ≈ −2.47.

In this example, we treat the population mean and variance as known, which would be appropriate either if all students in the region were tested, or if a large random sample were used to estimate the population mean and variance with minimal estimation error.

The classroom mean score is 96, which is −2.47 standard error units from the population mean of 100. Looking up the z-score in a table of the standard normal distribution, we find that the probability of observing a standard normal value below -2.47 is approximately 0.5 - 0.4932 = 0.0068. This is the one-sided p-value for the null hypothesis that the 55 students are comparable to a simple random sample from the population of all test-takers. The two-sided p-value is approximately 0.014 (twice the one-sided p-value).

Another way of stating things is that with probability 1 − 0.014 = 0.986, a simple random sample of 55 students would have a mean test score within 4 units of the population mean. We could also say that with 98.6% confidence we reject the null hypothesis that the 55 test takers are comparable to a simple random sample from the population of test-takers.

The Z-test tells us that the 55 students of interest have an unusually low mean test score compared to most simple random samples of similar size from the population of test-takers. A deficiency of this analysis is that it does not consider whether the effect size of 4 points is meaningful. If instead of a classroom, we considered a subregion containing 900 students whose mean score was 99, nearly the same z-score and p-value would be observed. This shows that if the sample size is large enough, very small differences from the null value can be highly statistically significant. See statistical hypothesis testing for further discussion of this issue.
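
The arithmetic of this example can be checked in a few lines (a sketch; scipy’s normal CDF stands in for the printed table):

    import numpy as np
    from scipy import stats

    mu, sigma = 100, 12        # regional mean and standard deviation
    n, sample_mean = 55, 96    # school sample

    se = sigma / np.sqrt(n)              # ~1.618
    z = (sample_mean - mu) / se          # ~ -2.47
    p_one_sided = stats.norm.cdf(z)      # ~0.0067
    p_two_sided = 2 * stats.norm.cdf(-abs(z))
    print(se, z, p_one_sided, p_two_sided)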

52
Q

Omnibus test

A

Omnibus tests are a kind of statistical test. They test whether the explained variance in a set of data is significantly greater than the unexplained variance, overall. One example is the F-test in the analysis of variance. There can be legitimate significant effects within a model even if the omnibus test is not significant. For instance, in a model with two independent variables, if only one variable exerts a significant effect on the dependent variable and the other does not, then the omnibus test may be non-significant. This fact does not affect the conclusions that may be drawn from the one significant variable. In order to test effects within an omnibus test, researchers often use contrasts.

In addition, “omnibus test” is a general name that refers to an overall or global test; in most cases an omnibus test goes by another name, such as F-test or chi-squared test.

An omnibus test is applied to an overall hypothesis about a set of parameters of the same type, for example: hypotheses regarding equality vs. inequality of k expectations (means) in the analysis of variance (ANOVA);

or regarding equality of k standard deviations when testing homogeneity of variances in ANOVA, or regarding coefficients in multiple linear regression or in logistic regression.

Omnibus tests commonly refer to one of the following statistical tests:

  • The ANOVA F-test, testing significance among all factor-level means and/or the equality of their variances in the analysis of variance procedure;
  • The omnibus multivariate F-test in ANOVA with repeated measures;
  • The F-test for equality/inequality of the regression coefficients in multiple regression;
  • The chi-squared test for exploring significant differences between blocks of independent explanatory variables or their coefficients in a logistic regression.

These omnibus tests are usually conducted whenever one tends to test an overall hypothesis on a quadratic statistic (like a sum of squares, variance, or covariance) or on a ratio of quadratic statistics (like the overall F-test in the analysis of variance, the F-test in the analysis of covariance, the F-test in linear regression, or the chi-squared test in logistic regression).

While significance is founded on the omnibus test, it doesn’t specify exactly where the difference occurs; that is, it does not identify which parameter differs significantly from which other, but it does statistically determine that there is a difference, so at least two of the tested parameters are statistically different. If significance is met, none of these tests will tell specifically which mean differs from the others (in ANOVA), which coefficient differs from the others (in regression), etc.
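
As an illustration, a one-way ANOVA omnibus F-test in Python reports only the overall F and p; locating which means differ requires follow-up contrasts (the three groups are hypothetical):

    import numpy as np
    from scipy import stats

    g1 = np.array([6.1, 5.8, 6.4, 6.0])
    g2 = np.array([6.9, 7.2, 6.8, 7.5])
    g3 = np.array([6.2, 6.0, 6.5, 6.3])

    f, p = stats.f_oneway(g1, g2, g3)
    print(f, p)   # a significant F says only that at least two means differ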