Statistical tests Flashcards
K-S test or Kolmogorov-Smirnov test
Nonparametric test for the equality of continuous, one-dimensional probability distributions for one sample or two samples.
The null distribution of this statistic is calculated under the null hypothesis that the samples are drawn from the same distribution (in the two-sample case) or that the sample is drawn from the reference distribution (in the one-sample case).
The Kolmogorov–Smirnov test can be modified to serve as a goodness-of-fit test. In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution. This is equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and using these to define the specific reference distribution changes the null distribution of the test statistic; the corrected version is known as the Lilliefors test. Various studies have found that, even in this corrected form, the test is less powerful for testing normality than the Shapiro–Wilk test or the Anderson–Darling test.
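A minimal sketch using SciPy's kstest and ks_2samp (real SciPy functions; the data below are made up). Per the caveat above, testing against "norm" with parameters estimated from the same sample would need the Lilliefors correction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)            # made-up sample
y = rng.normal(loc=0.5, size=200)   # made-up second sample, shifted

# One-sample: is x drawn from a standard normal?
stat, p = stats.kstest(x, "norm")

# Two-sample: are x and y drawn from the same distribution?
stat2, p2 = stats.ks_2samp(x, y)
```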
Mann-Whitney U test
Also known as the Mann–Whitney–Wilcoxon (MWW) test, Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test
This is a non-parametric statistical hypothesis test for assessing whether one of two samples of independent observations tends to have larger values than the other. It is one of the most well-known non-parametric significance tests.
A very general formulation is to assume that:
(1) All the observations from both groups are independent of each other,
(2) The responses are ordinal (i.e. one can at least say, of any two observations, which is the greater),
(3) Under the null hypothesis the distributions of both groups are equal, so that the probability of an observation from one population (X) exceeding an observation from the second population (Y) equals the probability of an observation from Y exceeding an observation from X; that is, the two populations are symmetric with respect to the probability of yielding the larger observation.
(4) Under the alternative hypothesis the probability of an observation from one population (X) exceeding an observation from the second population (Y) (after exclusion of ties) is not equal to 0.5. The alternative may also be stated in terms of a one-sided test, for example: P(X > Y) + 0.5 P(X = Y) > 0.5.
Additional facts:
- Related to Kendall's tau: the two are equivalent when one of the variables is binary.
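A minimal sketch using SciPy's mannwhitneyu (a real SciPy function; the data are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=50)   # made-up group A
b = rng.normal(0.8, 1.0, size=50)   # made-up group B, shifted upward

# Two-sided test: does one group tend to have larger values?
u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
```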
Non-parametric tests
Non-parametric (or distribution-free) inferential statistical methods are mathematical procedures for statistical hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed.
Examples include:
Kolmogorov–Smirnov test
Mann–Whitney U or Wilcoxon rank sum test
Siegel–Tukey test
sign test
Wilcoxon signed-rank test
Anderson–Darling test
Kuiper’s test
Logrank Test
McNemar’s test
median test
Pitman’s permutation test
Wald–Wolfowitz runs test
Parametric tests
Parametric statistics is a branch of statistics that assumes that the data has come from a type of probability distribution and makes inferences about the parameters of the distribution.
Parametric methods make more assumptions than non-parametric methods. If those extra assumptions are correct, parametric methods can produce more accurate and precise estimates. They are said to have more statistical power.
Examples of parametric tests:
t-tests
What test(s) should you use when you have two independently collected samples and want to assess whether one tends to have larger values than the other?
(1) Mann-Whitney U test (also called Mann–Whitney–Wilcoxon (MWW) test, Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test)
- More robust to outliers
- Good if data is ordinal, but not interval scaled
- Under normality, about 95% as efficient as the t-test (asymptotic relative efficiency 3/π ≈ 0.955)
- Non-parametric; usually the better choice, except perhaps for small samples, where power matters most
(2) independent samples Student’s t-test
- Assumes normality (parametric)
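A minimal sketch running both options on the same made-up data, using SciPy's mannwhitneyu and ttest_ind:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, size=30)   # made-up sample A
b = rng.normal(0.5, 1.0, size=30)   # made-up sample B

u, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")  # non-parametric
t, p_t = stats.ttest_ind(a, b, equal_var=False)              # Welch's t-test
```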
Wilcoxon signed-rank test
Non-parametric statistical test to compare two related samples, matched samples, or repeated measurements on a single sample to determine if their population mean ranks differ – a paired difference test.
Assumes:
(1) Data are paired and come from the same population.
(2) Each pair is chosen randomly and independently.
(3) The data are measured on an interval scale (ordinal is not sufficient because we take differences), but need not be normal.
Related methods: sign test, paired Student's t-test (also called t-test for matched pairs or t-test for dependent samples), paired z-test
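A minimal sketch using SciPy's wilcoxon on made-up paired measurements:

```python
from scipy import stats

before = [112, 109, 121, 130, 98, 105, 118]   # made-up paired data
after = [108, 104, 119, 126, 99, 100, 114]

# Tests whether the differences are symmetric about zero
w, p = stats.wilcoxon(before, after)
```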
Paired difference tests
(what it is, methods, and popular uses)
In statistics, a paired difference test is a type of location test that is used when comparing two sets of measurements to assess whether their population means differ. A paired difference test uses additional information about the sample that is not present in an ordinary unpaired testing situation, either to increase the statistical power, or to reduce the effects of confounders.
Methods include:
- paired t-test (when the population standard deviation is unknown)
- paired Z-test (when the population standard deviation is known)
- Wilcoxon signed-rank test (handles non-normal distributions, but assumes the distribution of the differences is symmetric)
- sign test (non-parametric, does not assume symmetry, less powerful)
Popular uses include:
- before and after a treatment – “repeated measures” tests (increases power)
- reducing confounding by forming matched pairs of subjects that are similar on potential confounders
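A minimal sketch of the "before and after" case using SciPy's ttest_rel (made-up data):

```python
from scipy import stats

before = [140, 152, 133, 160, 148, 155, 142, 150]   # made-up repeated measures
after = [135, 150, 130, 155, 149, 148, 140, 146]

t, p = stats.ttest_rel(before, after)   # paired t-test: population stdev unknown
```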
Sign test
Non-parametric test of the hypothesis that there is "no difference in medians" between the continuous distributions of two random variables X and Y, in the situation where we can draw paired samples from X and Y.
Because it is non-parametric it has very general applicability but may lack the statistical power of other tests such as the paired-samples t-test or the Wilcoxon signed-rank test.
Method:
Let p = Pr(X > Y), and then test the null hypothesis H0: p = 0.50. In other words, the null hypothesis states that for a random pair of measurements (xi, yi), xi and yi are each equally likely to be the larger.
Then let W be the number of pairs for which yi − xi > 0. Assuming H0 is true, W follows a binomial distribution, W ~ B(m, 0.5), where m is the number of untied pairs. The "W" is for Frank Wilcoxon, who developed the test and, later, the more powerful Wilcoxon signed-rank test.
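A minimal sketch of this method using SciPy's binomtest (SciPy >= 1.7; the paired data are made up):

```python
from scipy.stats import binomtest

x = [1.8, 2.1, 0.9, 2.7, 1.5, 2.2, 1.1, 2.5]   # made-up paired samples
y = [2.0, 2.6, 1.0, 2.5, 1.9, 2.9, 1.4, 2.8]

diffs = [b - a for a, b in zip(x, y) if b != a]  # drop tied pairs
w = sum(d > 0 for d in diffs)                    # pairs with y > x
m = len(diffs)

print(binomtest(w, m, 0.5).pvalue)   # W ~ Binomial(m, 0.5) under H0
```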
Median test
*Mostly thought to be obsolete due to low power
Instead use: Wilcoxon–Mann–Whitney U two-sample test
It is a nonparametric test that tests the null hypothesis that the medians of the populations from which two samples are drawn are identical.
Difference between this and the Mann-Whitney U test
The relevant difference between the two tests is that the median test only considers the position of each observation relative to the overall median, whereas the Wilcoxon–Mann–Whitney test takes the ranks of each observation into account. Thus the latter test is usually the more powerful of the two.
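A minimal sketch comparing the two on made-up data; median_test and mannwhitneyu are real SciPy functions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=40)   # made-up sample A
b = rng.normal(0.6, 1.0, size=40)   # made-up sample B

stat, p, grand_median, table = stats.median_test(a, b)
u, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")  # usually more powerful
```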
Methods to compare means
See http://en.wikipedia.org/wiki/Comparing_means
Kuiper’s test
Kuiper’s test is used in statistics to test whether a given distribution, or family of distributions, is contradicted by evidence from a sample of data.
Properties: invariant under cyclic transformations, and as sensitive in the tails as near the median
Uses: cyclic variations by time of year, day of week, or time of day; in general, any circular probability distribution
Related to: Kolmogorov–Smirnov test, Anderson–Darling test
More
Kuiper’s test[1] is closely related to the better-known Kolmogorov–Smirnov test (or K-S test as it is often called). As with the K-S test, the discrepancy statistics D+ and D− represent the absolute sizes of the most positive and most negative differences between the two cumulative distribution functions that are being compared. The trick with Kuiper’s test is to use the quantity D+ + D− as the test statistic. This small change makes Kuiper’s test as sensitive in the tails as at the median and also makes it invariant under cyclic transformations of the independent variable. The Anderson–Darling test is another test that provides equal sensitivity in the tails and at the median, but it does not provide the cyclic invariance.
This invariance under cyclic transformations makes Kuiper’s test invaluable when testing for cyclic variations by time of year or day of the week or time of day, and more generally for testing the fit of, and differences between, circular probability distributions.
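SciPy does not ship a Kuiper test, but the statistic V = D+ + D− is easy to compute; here is a minimal one-sample sketch against an assumed reference CDF (the p-value would additionally require Kuiper's asymptotic null distribution, omitted here; astropy.stats is one library with a full implementation):

```python
import numpy as np
from scipy.stats import norm

def kuiper_statistic(x, cdf=norm.cdf):
    """Kuiper's V = D+ + D- of a sample against a continuous reference CDF."""
    x = np.sort(np.asarray(x))
    n = len(x)
    u = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - u)   # largest gap with ECDF above
    d_minus = np.max(u - np.arange(0, n) / n)      # largest gap with ECDF below
    return d_plus + d_minus
```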
Jarque–Bera test
The Jarque–Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution.
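A minimal sketch using SciPy's jarque_bera (a real SciPy function; made-up data):

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(4).normal(size=500)   # made-up sample
jb, p = stats.jarque_bera(x)   # small p => skewness/kurtosis depart from normal
```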
Cramér–von Mises criterion
Cramér–von Mises criterion is a criterion used for judging the goodness of fit of a cumulative distribution function compared to a given empirical distribution function , or for comparing two empirical distributions. It is also used as a part of other algorithms, such as minimum distance estimation.
In one-sample applications, the criterion compares the empirical distribution function of the sample with a specified theoretical distribution. Alternatively, the two distributions can both be empirically estimated; this is called the two-sample case.
Related alternative tests: Kolmogorov–Smirnov test, Watson's test (a closely related variant)
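A minimal sketch of both cases using SciPy (cramervonmises needs SciPy >= 1.6, cramervonmises_2samp needs SciPy >= 1.8; data are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(size=100)            # made-up sample
y = rng.normal(loc=0.3, size=100)   # made-up second sample

res = stats.cramervonmises(x, "norm")     # one-sample vs. theoretical CDF
res2 = stats.cramervonmises_2samp(x, y)   # two-sample case
print(res.statistic, res.pvalue, res2.pvalue)
```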
Siegel–Tukey test
The Siegel–Tukey test is a non-parametric test which may be applied to data measured at least on an ordinal scale. It tests for differences in scale between two groups.
The test is used to determine whether one of two groups of data tends to have more widely dispersed values than the other. In other words, it determines whether one group’s values tend to scatter away from the center (of the ordinal scale), towards either extreme.
Introduced by Sidney Siegel and John Tukey in 1960.
More:
The principle is based on the following idea:
Suppose there are two groups A and B, with n observations in the first group and m observations in the second (so there are N = n + m total observations). If all N observations are arranged in ascending order and there is no difference between the two groups (the null hypothesis H0), the values of the two groups can be expected to be randomly intermixed. This would mean that among the ranks of extreme (high and low) scores, there would be similar numbers of values from Group A and Group B.
If, say, Group A were more inclined to extreme values (the alternative hypothesis H1), then there will be a higher proportion of observations from group A with low or high values, and a reduced proportion of values at the center.
Hypothesis H0: σ²A = σ²B and MeA = MeB (where σ² and Me are the variance and the median, respectively)
Hypothesis H1: σ²A > σ²B
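A minimal sketch of the idea (assumes no ties in the pooled data, for simplicity): assign ranks alternately from the extremes inward, then apply the Mann-Whitney U test to those ranks. The function below is an illustrative implementation, not a library API:

```python
import numpy as np
from scipy import stats

def siegel_tukey(x, y):
    """Assign alternating extreme ranks, then run Mann-Whitney U on them."""
    data = np.concatenate([x, y])
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    order = np.argsort(data)            # indices in ascending order
    n = len(data)
    st_rank = np.empty(n)
    lo, hi = 0, n - 1
    rank, pick_low, run = 1, True, 1    # first run takes 1 value from the low end
    while lo <= hi:
        for _ in range(run):
            if lo > hi:
                break
            if pick_low:
                st_rank[order[lo]] = rank
                lo += 1
            else:
                st_rank[order[hi]] = rank
                hi -= 1
            rank += 1
        pick_low, run = not pick_low, 2  # then 2 from the high end, 2 low, 2 high...
    return stats.mannwhitneyu(st_rank[labels == 0], st_rank[labels == 1],
                              alternative="two-sided")
```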
Statistical hypothesis tests
Statistical hypothesis tests answer the question: “Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?”[2] That probability is known as the p-value.
Statistical hypothesis testing is a key technique of frequentist statistical inference. The Bayesian approach to hypothesis testing is to base decisions on the posterior probability.
Wald–Wolfowitz runs test
The runs test (also called Wald–Wolfowitz test) is a non-parametric statistical test that checks a randomness hypothesis for a two-valued data sequence. More precisely, it can be used to test the hypothesis that the elements of the sequence are mutually independent.
A “run” of a sequence is a maximal non-empty segment of the sequence consisting of adjacent equal elements. For example, the sequence “++++−−−+++−−++++++−−−−” consists of six runs, three of which consist of +’s and the others of −’s. The runs test is based on the null hypothesis that the two elements + and − are independently drawn from the same distribution.
Under the null hypothesis, the number of runs in a sequence of length N is a random variable whose conditional distribution given the observation of N+ positive values and N− negative values (N = N+ + N−) is approximately normal.
The mean and variance do not depend on the “fairness” of the process generating the elements of the sequence, that is, that +’s and −’s have equal probabilities, but only on the assumption that the elements are independent and identically distributed. If the number of runs is significantly higher or lower than expected, the hypothesis of statistical independence of the elements may be rejected.
Runs tests can be used to test:
the randomness of a distribution, by taking the data in the given order and marking with + the data greater than the median, and with – the data less than the median; (Numbers equalling the median are omitted.)
whether a function fits well to a data set, by marking the data exceeding the function value with + and the other data with −. For this use, the runs test, which takes into account the signs but not the distances, is complementary to the chi square test, which takes into account the distances but not the signs.
The Kolmogorov–Smirnov test is more powerful if it can be applied.
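A minimal sketch of the normal approximation described above, using the standard mean and variance formulas for the run count (an illustrative implementation; the data is the example sequence from the text):

```python
from math import erfc, sqrt

def runs_test(seq):
    """Wald-Wolfowitz runs test on a two-valued sequence; normal approximation."""
    vals = sorted(set(seq))
    assert len(vals) == 2, "sequence must be two-valued"
    n1 = sum(1 for s in seq if s == vals[0])
    n2 = len(seq) - n1
    n = n1 + n2
    runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    mu = 2 * n1 * n2 / n + 1                              # expected number of runs
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mu) / sqrt(var)
    return z, erfc(abs(z) / sqrt(2))                      # two-sided p-value

# The six-run sequence from the text
print(runs_test("++++---+++--++++++----"))
```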
Kendall’s W
Kendall’s W (Kendall’s coefficient of concordance) is a non-parametric statistic. It is a normalization of the statistic of the Friedman test, and can be used for assessing agreement among raters. Kendall’s W ranges from 0 (no agreement) to 1 (complete agreement).
More:
Suppose, for instance, that a number of people have been asked to rank a list of political concerns, from most important to least important. Kendall’s W can be calculated from these data. If the test statistic W is 1, then all the survey respondents have been unanimous, and each respondent has assigned the same order to the list of concerns. If W is 0, then there is no overall trend of agreement among the respondents, and their responses may be regarded as essentially random. Intermediate values of W indicate a greater or lesser degree of unanimity among the various responses.
While tests using the standard Pearson correlation coefficient assume normally distributed values and compare two sequences of outcomes at a time, Kendall’s W makes no assumptions regarding the nature of the probability distribution and can handle any number of distinct outcomes.
W is linearly related to the mean value of the Spearman’s rank correlation coefficients between all pairs of the rankings over which it is calculated.
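A minimal sketch of W from its standard formula W = 12S / (m²(n³ − n)), where S is the sum of squared deviations of the item rank sums (an illustrative implementation without the tie correction; the ratings are made up):

```python
import numpy as np
from scipy import stats

def kendalls_w(ratings):
    """Kendall's W for an (m raters x n items) score matrix (no tie correction)."""
    m, n = ratings.shape
    ranks = np.array([stats.rankdata(row) for row in ratings])  # rank items per rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three raters ranking four items (made-up data)
ratings = np.array([[1, 2, 3, 4],
                    [1, 3, 2, 4],
                    [2, 1, 3, 4]])
print(kendalls_w(ratings))   # ~0.78 => fairly strong agreement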
Friedman test
The Friedman test is a non-parametric statistical test used to detect differences in treatments across multiple test attempts. The procedure involves ranking each row (or block) together, then considering the values of ranks by columns. Applicable to complete block designs, it is thus a special case of the Durbin test.
Classic examples of use are:
n wine judges each rate k different wines. Are any wines ranked consistently higher or lower than the others?
n wines are each rated by k different judges. Are the judges’ ratings consistent with each other?
n welders each use k welding torches, and the ensuing welds were rated on quality. Do any of the torches produce consistently better or worse welds?
The Friedman test is used for one-way repeated measures analysis of variance by ranks. In its use of ranks it is similar to the Kruskal-Wallis one-way analysis of variance by ranks.
When using this kind of design for a binary response, one instead uses the Cochran’s Q test.
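A minimal sketch of the wine-judging example using SciPy's friedmanchisquare (a real SciPy function; one argument per treatment, with one measurement per block; the scores are made up):

```python
from scipy import stats

# k = 3 wines, each rated by n = 4 judges (made-up scores; one list per wine)
wine_a = [8.5, 7.0, 9.0, 6.5]
wine_b = [7.5, 6.5, 8.0, 6.0]
wine_c = [6.0, 7.5, 7.0, 5.5]

stat, p = stats.friedmanchisquare(wine_a, wine_b, wine_c)
```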
Durbin test
In the analysis of designed experiments, the Friedman test is the most common non-parametric test for complete block designs. The Durbin test is a nonparametric test for balanced incomplete designs that reduces to the Friedman test in the case of a complete block design.
More:
In a randomized block design, k treatments are applied to b blocks. For some experiments, it may not be realistic to run all treatments in all blocks, so one may need to run an incomplete block design. In this case, it is strongly recommended to run a balanced incomplete block design, which has the following properties:
Every block contains k experimental units.
Every treatment appears in r blocks.
Every treatment appears with every other treatment an equal number of times.
The Durbin test is based on the following assumptions:
The b blocks are mutually independent. That means the results within one block do not affect the results within other blocks.
The data can be meaningfully ranked (i.e., the data have at least an ordinal scale).
Cochran’s Q test is applied in the special case of a binary response variable (i.e., one that can have only one of two possible outcomes).
Cochran’s Q test
In statistics, in the analysis of two-way randomized block designs where the response variable can take only two possible outcomes (coded as 0 and 1), Cochran’s Q test is a non-parametric statistical test to verify if k treatments have identical effects.[1][2] It is named for William Gemmell Cochran. Cochran’s Q test should not be confused with Cochran’s C test, which is a variance outlier test.
Cochran’s Q test assumes that there are k > 2 experimental treatments and that the observations are arranged in b blocks.
The hypotheses of Cochran’s Q test are:
H0: The treatments are equally effective.
Ha: There is a difference in effectiveness among treatments.
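A minimal sketch from the standard formula Q = k(k−1) Σ_j (C_j − N/k)² / Σ_i R_i(k − R_i), with column totals C_j, row totals R_i, and grand total N; under H0, Q is approximately chi-squared with k−1 degrees of freedom (an illustrative implementation; the 0/1 matrix is made up):

```python
import numpy as np
from scipy.stats import chi2

def cochrans_q(x):
    """Cochran's Q for a (b blocks x k treatments) matrix of 0/1 outcomes."""
    b, k = x.shape
    col = x.sum(axis=0)                      # successes per treatment
    row = x.sum(axis=1)                      # successes per block
    n = x.sum()
    q = k * (k - 1) * ((col - n / k) ** 2).sum() / (row * (k - row)).sum()
    return q, chi2.sf(q, k - 1)              # chi-squared with k-1 df under H0

# 6 blocks (subjects) x 3 treatments, binary success/failure (made up)
x = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [1, 0, 0],
              [1, 1, 0]])
print(cochrans_q(x))
```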
Analysis of variance
Analysis of variance (ANOVA) is a collection of statistical models, and their associated procedures, in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are all equal, and therefore generalizes the t-test to more than two groups. Doing multiple two-sample t-tests would result in an increased chance of committing a type I error. For this reason, ANOVAs are useful in comparing two, three, or more means.
The analysis of variance can be presented in terms of a linear model, which makes the following assumptions about the probability distribution of the responses:
Independence of observations – this is an assumption of the model that simplifies the statistical analysis.
Normality – the distributions of the residuals are normal.
Equality (or “homogeneity”) of variances, called homoscedasticity — the variance of data in groups should be the same.
The separate assumptions of the textbook model imply that the errors are independently, identically, and normally distributed for fixed-effects models.
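A minimal sketch of the simplest case, a one-way ANOVA across three groups, using SciPy's f_oneway (a real SciPy function; the measurements are made up):

```python
from scipy import stats

# Three groups of made-up measurements
g1 = [5.1, 4.9, 5.3, 5.0]
g2 = [5.8, 6.1, 5.9, 6.0]
g3 = [5.2, 5.5, 5.4, 5.1]

f, p = stats.f_oneway(g1, g2, g3)   # H0: all group means are equal
```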
More Anova-like tests:
ANOVA on ranks
ANOVA-simultaneous component analysis
AMOVA
ANCOVA
ANORVA
MANOVA
Mixed-design analysis of variance
Two-way analysis of variance
One-way analysis of variance
More, see: http://en.wikipedia.org/wiki/ANOVA
Kendall tau rank correlation coefficient
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall’s tau (τ) coefficient, is a statistic used to measure the association between two measured quantities. A tau test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.
Specifically, it is a measure of rank correlation, i.e., the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938,[1] though Gustav Fechner had proposed a similar measure in the context of time series in 1897.
Let (x1, y1), (x2, y2), …, (xn, yn) be a set of observations of the joint random variables X and Y respectively, such that all the values of (xi) and (yi) are unique. Any pair of observations (xi, yi) and (xj, yj) are said to be concordant if the ranks for both elements agree: that is, if both xi > xj and yi > yj, or if both xi < xj and yi < yj. They are said to be discordant if xi > xj and yi < yj, or if xi < xj and yi > yj. If xi = xj or yi = yj, the pair is neither concordant nor discordant.
If X and Y are independent, then we would expect the coefficient to be approximately zero.
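A minimal sketch using SciPy's kendalltau (a real SciPy function; the paired values are made up):

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]   # made-up paired observations
y = [2, 1, 4, 3, 6, 5]

tau, p = stats.kendalltau(x, y)   # tau near 0 suggests independence
```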
Goodman and Kruskal’s gamma
In statistics, Goodman and Kruskal’s gamma measures the strength of association of the cross tabulated data when both variables are measured at the ordinal level. It makes no adjustment for either table size or ties. Values range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.
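A minimal sketch computing gamma directly from its definition, gamma = (C − D) / (C + D), where C and D count concordant and discordant pairs and ties are skipped (an illustrative O(n²) implementation; the scores are made up):

```python
def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D) over concordant/discordant pairs; ties skipped."""
    c = d = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1      # concordant pair
            elif s < 0:
                d += 1      # discordant pair
    return (c - d) / (c + d)

# Made-up ordinal scores for two variables
print(goodman_kruskal_gamma([1, 2, 2, 3, 4], [1, 1, 2, 3, 3]))
```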
Cohen’s kappa
Cohen’s kappa coefficient is a statistical measure of inter-rater agreement or inter-annotator agreement[1] for qualitative (categorical) items. It is generally thought to be a more robust measure than a simple percent-agreement calculation, since κ takes into account the agreement occurring by chance. Some researchers[2] have expressed concern over κ’s tendency to take the observed categories’ frequencies as givens, which can have the effect of underestimating agreement for a category that is also commonly used; for this reason, κ is considered an overly conservative measure of agreement.
Others[3] contest the assertion that kappa “takes into account” chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess—a very unrealistic scenario.
Cohen’s kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories.
The equation for κ is:
κ = (Pr(a) − Pr(e)) / (1 − Pr(e))
where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as defined by Pr(e)), κ = 0.
Kappa assumes its theoretical maximum value of 1 only when both observers distribute codes the same, that is, when corresponding row and column sums are identical. Anything less is less than perfect agreement. Still, the maximum value kappa could achieve given unequal distributions helps interpret the value of kappa actually obtained. The equation for the maximum is κmax = (Pr(max) − Pr(e)) / (1 − Pr(e)), where Pr(max) is the sum over categories of the smaller of the two raters’ marginal proportions. (See http://en.wikipedia.org/wiki/Cohen%27s_kappa)
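A minimal sketch computing κ directly from the formula above (an illustrative implementation; the rater labels are made up; sklearn.metrics.cohen_kappa_score is an existing library alternative):

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' category labels over the same N items."""
    cats = sorted(set(r1) | set(r2))
    idx = {c: i for i, c in enumerate(cats)}
    m = np.zeros((len(cats), len(cats)))          # confusion matrix
    for a, b in zip(r1, r2):
        m[idx[a], idx[b]] += 1
    n = len(r1)
    pr_a = np.trace(m) / n                                   # observed agreement
    pr_e = (m.sum(axis=0) * m.sum(axis=1)).sum() / n ** 2    # chance agreement
    return (pr_a - pr_e) / (1 - pr_e)

rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
rater2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(cohens_kappa(rater1, rater2))   # 0.5 for this made-up data
```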