Statistical tests Flashcards
K-S test or Kolmogorov-Smirnov test
Nonparametric test for the equality of continuous, one-dimensional probability distributions for one sample or two samples.
The null distribution of this statistic is calculated under the null hypothesis that the samples are drawn from the same distribution (in the two-sample case) or that the sample is drawn from the reference distribution (in the one-sample case).
The Kolmogorov–Smirnov test can be modified to serve as a goodness-of-fit test. In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution. This is equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and using these to define the specific reference distribution changes the null distribution of the test statistic; the corrected version is known as the Lilliefors test. Various studies have found that, even in this corrected form, the test is less powerful for testing normality than the Shapiro–Wilk test or the Anderson–Darling test.
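A minimal sketch using SciPy's kstest and ks_2samp (real SciPy functions; the data below are made up). Per the caveat above, testing against "norm" with parameters estimated from the same sample would need the Lilliefors correction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)            # made-up sample
y = rng.normal(loc=0.5, size=200)   # made-up second sample, shifted

# One-sample: is x drawn from a standard normal?
stat, p = stats.kstest(x, "norm")

# Two-sample: are x and y drawn from the same distribution?
stat2, p2 = stats.ks_2samp(x, y)
```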
Mann-Whitney U test
Also known as the Mann–Whitney–Wilcoxon (MWW) test, Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test
This is a non-parametric statistical hypothesis test for assessing whether one of two samples of independent observations tends to have larger values than the other. It is one of the most well-known non-parametric significance tests.
A very general formulation is to assume that:
(1) All the observations from both groups are independent of each other,
(2) The responses are ordinal (i.e. one can at least say, of any two observations, which is the greater),
(3) Under the null hypothesis the distributions of both groups are equal, so that the probability of an observation from one population (X) exceeding an observation from the second population (Y) equals the probability of an observation from Y exceeding an observation from X; that is, the two populations are symmetric with respect to the probability of yielding the larger observation.
(4) Under the alternative hypothesis the probability of an observation from one population (X) exceeding an observation from the second population (Y) (after exclusion of ties) is not equal to 0.5. The alternative may also be stated in terms of a one-sided test, for example: P(X > Y) + 0.5 P(X = Y) > 0.5.
Additional facts:
- Related to Kendall's tau: the two are equivalent when one of the variables is binary.
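A minimal sketch using SciPy's mannwhitneyu (a real SciPy function; the data are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=50)   # made-up group A
b = rng.normal(0.8, 1.0, size=50)   # made-up group B, shifted upward

# Two-sided test: does one group tend to have larger values?
u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
```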
Non-parametric tests
Non-parametric (or distribution-free) inferential statistical methods are mathematical procedures for statistical hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed.
Examples include:
Kolmogorov–Smirnov test
Mann–Whitney U or Wilcoxon rank sum test
Siegel–Tukey test
sign test
Wilcoxon signed-rank test
Anderson–Darling test
Kuiper’s test
Logrank Test
McNemar’s test
median test
Pitman’s permutation test
Wald–Wolfowitz runs test
Parametric tests
Parametric statistics is a branch of statistics that assumes that the data has come from a type of probability distribution and makes inferences about the parameters of the distribution.
Parametric methods make more assumptions than non-parametric methods. If those extra assumptions are correct, parametric methods can produce more accurate and precise estimates. They are said to have more statistical power.
Examples of parametric tests:
t-tests
What test(s) should you use when you have two independently collected samples and want to assess whether one tends to have larger values than the other?
(1) Mann-Whitney U test (also called Mann–Whitney–Wilcoxon (MWW) test, Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test)
- More robust to outliers
- Good if data is ordinal, but not interval scaled
- Under normality, about 95% as efficient as the t-test (asymptotic relative efficiency 3/π ≈ 0.955)
- Non-parametric; usually the better choice, except perhaps for small samples, where power matters most
(2) independent samples Student’s t-test
- Assumes normality (parametric)
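A minimal sketch running both options on the same made-up data, using SciPy's mannwhitneyu and ttest_ind:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, size=30)   # made-up sample A
b = rng.normal(0.5, 1.0, size=30)   # made-up sample B

u, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")  # non-parametric
t, p_t = stats.ttest_ind(a, b, equal_var=False)              # Welch's t-test
```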
Wilcoxon signed-rank test
Non-parametric statistical test to compare two related samples, matched samples, or repeated measurements on a single sample to determine if their population mean ranks differ – a paired difference test.
Assumes:
(1) Data are paired and come from the same population.
(2) Each pair is chosen randomly and independently.
(3) The data are measured on an interval scale (ordinal is not sufficient because we take differences), but need not be normal.
Related methods: sign test, paired Student's t-test (also called t-test for matched pairs or t-test for dependent samples), paired z-test
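A minimal sketch using SciPy's wilcoxon on made-up paired measurements:

```python
from scipy import stats

before = [112, 109, 121, 130, 98, 105, 118]   # made-up paired data
after = [108, 104, 119, 126, 99, 100, 114]

# Tests whether the differences are symmetric about zero
w, p = stats.wilcoxon(before, after)
```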
Paired difference tests
(what it is, methods, and popular uses)
In statistics, a paired difference test is a type of location test that is used when comparing two sets of measurements to assess whether their population means differ. A paired difference test uses additional information about the sample that is not present in an ordinary unpaired testing situation, either to increase the statistical power, or to reduce the effects of confounders.
Methods include:
- paired t-test (when the population standard deviation is unknown)
- paired Z-test (when the population standard deviation is known)
- Wilcoxon signed-rank test (handles non-normal distributions, but assumes the distribution of the differences is symmetric)
- sign test (non-parametric, does not assume symmetry, less powerful)
Popular uses include:
- before and after a treatment – “repeated measures” tests (increases power)
- reducing confounding by forming matched pairs of subjects that are similar on potential confounders
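A minimal sketch of the "before and after" case using SciPy's ttest_rel (made-up data):

```python
from scipy import stats

before = [140, 152, 133, 160, 148, 155, 142, 150]   # made-up repeated measures
after = [135, 150, 130, 155, 149, 148, 140, 146]

t, p = stats.ttest_rel(before, after)   # paired t-test: population stdev unknown
```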
Sign test
Non-parametric test of the hypothesis that there is "no difference in medians" between the continuous distributions of two random variables X and Y, in the situation where we can draw paired samples from X and Y.
Because it is non-parametric it has very general applicability but may lack the statistical power of other tests such as the paired-samples t-test or the Wilcoxon signed-rank test.
Method:
Let p = Pr(X > Y), and then test the null hypothesis H0: p = 0.50. In other words, the null hypothesis states that for a random pair of measurements (xi, yi), xi and yi are each equally likely to be the larger.
Then let W be the number of pairs for which yi − xi > 0. Assuming H0 is true, W follows a binomial distribution, W ~ B(m, 0.5), where m is the number of untied pairs. The "W" is for Frank Wilcoxon, who developed the test and, later, the more powerful Wilcoxon signed-rank test.
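A minimal sketch of this method using SciPy's binomtest (SciPy >= 1.7; the paired data are made up):

```python
from scipy.stats import binomtest

x = [1.8, 2.1, 0.9, 2.7, 1.5, 2.2, 1.1, 2.5]   # made-up paired samples
y = [2.0, 2.6, 1.0, 2.5, 1.9, 2.9, 1.4, 2.8]

diffs = [b - a for a, b in zip(x, y) if b != a]  # drop tied pairs
w = sum(d > 0 for d in diffs)                    # pairs with y > x
m = len(diffs)

print(binomtest(w, m, 0.5).pvalue)   # W ~ Binomial(m, 0.5) under H0
```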
Median test
*Mostly thought to be obsolete due to low power
Instead use: Wilcoxon–Mann–Whitney U two-sample test
It is a nonparametric test that tests the null hypothesis that the medians of the populations from which two samples are drawn are identical.
Difference between this and the Mann-Whitney U test
The relevant difference between the two tests is that the median test only considers the position of each observation relative to the overall median, whereas the Wilcoxon–Mann–Whitney test takes the ranks of each observation into account. Thus the latter test is usually the more powerful of the two.
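A minimal sketch comparing the two on made-up data; median_test and mannwhitneyu are real SciPy functions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=40)   # made-up sample A
b = rng.normal(0.6, 1.0, size=40)   # made-up sample B

stat, p, grand_median, table = stats.median_test(a, b)
u, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")  # usually more powerful
```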
Methods to compare means
See http://en.wikipedia.org/wiki/Comparing_means
Kuiper’s test
Kuiper’s test is used in statistics to test whether a given distribution, or family of distributions, is contradicted by evidence from a sample of data.
Properties: invariant under cyclic transformations, and as sensitive in the tails as near the median
Uses: cyclic variations by time of year, day of week, or time of day; in general, any circular probability distribution
Related to: Kolmogorov–Smirnov test, Anderson–Darling test
More
Kuiper’s test[1] is closely related to the better-known Kolmogorov–Smirnov test (or K-S test as it is often called). As with the K-S test, the discrepancy statistics D+ and D− represent the absolute sizes of the most positive and most negative differences between the two cumulative distribution functions that are being compared. The trick with Kuiper’s test is to use the quantity D+ + D− as the test statistic. This small change makes Kuiper’s test as sensitive in the tails as at the median and also makes it invariant under cyclic transformations of the independent variable. The Anderson–Darling test is another test that provides equal sensitivity in the tails and at the median, but it does not provide the cyclic invariance.
This invariance under cyclic transformations makes Kuiper’s test invaluable when testing for cyclic variations by time of year or day of the week or time of day, and more generally for testing the fit of, and differences between, circular probability distributions.
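SciPy does not ship a Kuiper test, but the statistic V = D+ + D− is easy to compute; here is a minimal one-sample sketch against an assumed reference CDF (the p-value would additionally require Kuiper's asymptotic null distribution, omitted here; astropy.stats is one library with a full implementation):

```python
import numpy as np
from scipy.stats import norm

def kuiper_statistic(x, cdf=norm.cdf):
    """Kuiper's V = D+ + D- of a sample against a continuous reference CDF."""
    x = np.sort(np.asarray(x))
    n = len(x)
    u = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - u)   # largest gap with ECDF above
    d_minus = np.max(u - np.arange(0, n) / n)      # largest gap with ECDF below
    return d_plus + d_minus
```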
Jarque–Bera test
The Jarque–Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution.
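A minimal sketch using SciPy's jarque_bera (a real SciPy function; made-up data):

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(4).normal(size=500)   # made-up sample
jb, p = stats.jarque_bera(x)   # small p => skewness/kurtosis depart from normal
```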
Cramér–von Mises criterion
Cramér–von Mises criterion is a criterion used for judging the goodness of fit of a cumulative distribution function compared to a given empirical distribution function , or for comparing two empirical distributions. It is also used as a part of other algorithms, such as minimum distance estimation.
In one-sample applications, the criterion compares the empirical distribution function of the sample with a specified theoretical distribution. Alternatively, the two distributions can both be empirically estimated; this is called the two-sample case.
Related alternative tests: Kolmogorov–Smirnov test, Watson's test (a closely related variant)
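A minimal sketch of both cases using SciPy (cramervonmises needs SciPy >= 1.6, cramervonmises_2samp needs SciPy >= 1.8; data are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(size=100)            # made-up sample
y = rng.normal(loc=0.3, size=100)   # made-up second sample

res = stats.cramervonmises(x, "norm")     # one-sample vs. theoretical CDF
res2 = stats.cramervonmises_2samp(x, y)   # two-sample case
print(res.statistic, res.pvalue, res2.pvalue)
```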
Siegel–Tukey test
The Siegel–Tukey test is a non-parametric test which may be applied to data measured at least on an ordinal scale. It tests for differences in scale between two groups.
The test is used to determine whether one of two groups of data tends to have more widely dispersed values than the other. In other words, it determines whether one group’s values tend to scatter away from the center (of the ordinal scale), towards either extreme.
Introduced by Sidney Siegel and John Tukey in 1960.
More:
The principle is based on the following idea:
Suppose there are two groups A and B, with n observations in the first group and m observations in the second (so there are N = n + m total observations). If all N observations are arranged in ascending order and there is no difference between the two groups (the null hypothesis H0), the values of the two groups can be expected to be randomly intermixed. This would mean that among the ranks of extreme (high and low) scores, there would be similar numbers of values from Group A and Group B.
If, say, Group A were more inclined to extreme values (the alternative hypothesis H1), then there will be a higher proportion of observations from group A with low or high values, and a reduced proportion of values at the center.
Hypothesis H0: σ²A = σ²B and MeA = MeB (where σ² and Me are the variance and the median, respectively)
Hypothesis H1: σ²A > σ²B
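A minimal sketch of the idea (assumes no ties in the pooled data, for simplicity): assign ranks alternately from the extremes inward, then apply the Mann-Whitney U test to those ranks. The function below is an illustrative implementation, not a library API:

```python
import numpy as np
from scipy import stats

def siegel_tukey(x, y):
    """Assign alternating extreme ranks, then run Mann-Whitney U on them."""
    data = np.concatenate([x, y])
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    order = np.argsort(data)            # indices in ascending order
    n = len(data)
    st_rank = np.empty(n)
    lo, hi = 0, n - 1
    rank, pick_low, run = 1, True, 1    # first run takes 1 value from the low end
    while lo <= hi:
        for _ in range(run):
            if lo > hi:
                break
            if pick_low:
                st_rank[order[lo]] = rank
                lo += 1
            else:
                st_rank[order[hi]] = rank
                hi -= 1
            rank += 1
        pick_low, run = not pick_low, 2  # then 2 from the high end, 2 low, 2 high...
    return stats.mannwhitneyu(st_rank[labels == 0], st_rank[labels == 1],
                              alternative="two-sided")
```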
Statistical hypothesis tests
Statistical hypothesis tests answer the question: “Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?”[2] That probability is known as the p-value.
Statistical hypothesis testing is a key technique of frequentist statistical inference. The Bayesian approach to hypothesis testing is to base decisions on the posterior probability.
Wald–Wolfowitz runs test
The runs test (also called Wald–Wolfowitz test) is a non-parametric statistical test that checks a randomness hypothesis for a two-valued data sequence. More precisely, it can be used to test the hypothesis that the elements of the sequence are mutually independent.
A “run” of a sequence is a maximal non-empty segment of the sequence consisting of adjacent equal elements. For example, the sequence “++++−−−+++−−++++++−−−−” consists of six runs, three of which consist of +’s and the others of −’s. The runs test is based on the null hypothesis that the two elements + and − are independently drawn from the same distribution.
Under the null hypothesis, the number of runs in a sequence of length N is a random variable whose conditional distribution given the observation of N+ positive values and N− negative values (N = N+ + N−) is approximately normal.
The mean and variance do not depend on the “fairness” of the process generating the elements of the sequence, that is, that +’s and −’s have equal probabilities, but only on the assumption that the elements are independent and identically distributed. If the number of runs is significantly higher or lower than expected, the hypothesis of statistical independence of the elements may be rejected.
Runs tests can be used to test:
the randomness of a distribution, by taking the data in the given order and marking with + the data greater than the median, and with – the data less than the median; (Numbers equalling the median are omitted.)
whether a function fits well to a data set, by marking the data exceeding the function value with + and the other data with −. For this use, the runs test, which takes into account the signs but not the distances, is complementary to the chi square test, which takes into account the distances but not the signs.
The Kolmogorov–Smirnov test is more powerful if it can be applied.
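A minimal sketch of the normal approximation described above, using the standard mean and variance formulas for the run count (an illustrative implementation; the data is the example sequence from the text):

```python
from math import erfc, sqrt

def runs_test(seq):
    """Wald-Wolfowitz runs test on a two-valued sequence; normal approximation."""
    vals = sorted(set(seq))
    assert len(vals) == 2, "sequence must be two-valued"
    n1 = sum(1 for s in seq if s == vals[0])
    n2 = len(seq) - n1
    n = n1 + n2
    runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    mu = 2 * n1 * n2 / n + 1                              # expected number of runs
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mu) / sqrt(var)
    return z, erfc(abs(z) / sqrt(2))                      # two-sided p-value

# The six-run sequence from the text
print(runs_test("++++---+++--++++++----"))
```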
Kendall’s W
Kendall’s W (Kendall’s coefficient of concordance) is a non-parametric statistic. It is a normalization of the statistic of the Friedman test, and can be used for assessing agreement among raters. Kendall’s W ranges from 0 (no agreement) to 1 (complete agreement).
More:
Suppose, for instance, that a number of people have been asked to rank a list of political concerns, from most important to least important. Kendall’s W can be calculated from these data. If the test statistic W is 1, then all the survey respondents have been unanimous, and each respondent has assigned the same order to the list of concerns. If W is 0, then there is no overall trend of agreement among the respondents, and their responses may be regarded as essentially random. Intermediate values of W indicate a greater or lesser degree of unanimity among the various responses.
While tests using the standard Pearson correlation coefficient assume normally distributed values and compare two sequences of outcomes at a time, Kendall’s W makes no assumptions regarding the nature of the probability distribution and can handle any number of distinct outcomes.
W is linearly related to the mean value of the Spearman’s rank correlation coefficients between all pairs of the rankings over which it is calculated.
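A minimal sketch of W from its standard formula W = 12S / (m²(n³ − n)), where S is the sum of squared deviations of the item rank sums (an illustrative implementation without the tie correction; the ratings are made up):

```python
import numpy as np
from scipy import stats

def kendalls_w(ratings):
    """Kendall's W for an (m raters x n items) score matrix (no tie correction)."""
    m, n = ratings.shape
    ranks = np.array([stats.rankdata(row) for row in ratings])  # rank items per rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three raters ranking four items (made-up data)
ratings = np.array([[1, 2, 3, 4],
                    [1, 3, 2, 4],
                    [2, 1, 3, 4]])
print(kendalls_w(ratings))   # ~0.78 => fairly strong agreement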
Friedman test
The Friedman test is a non-parametric statistical test used to detect differences in treatments across multiple test attempts. The procedure involves ranking each row (or block) together, then considering the values of ranks by columns. Applicable to complete block designs, it is thus a special case of the Durbin test.
Classic examples of use are:
n wine judges each rate k different wines. Are any wines ranked consistently higher or lower than the others?
n wines are each rated by k different judges. Are the judges’ ratings consistent with each other?
n welders each use k welding torches, and the ensuing welds were rated on quality. Do any of the torches produce consistently better or worse welds?
The Friedman test is used for one-way repeated measures analysis of variance by ranks. In its use of ranks it is similar to the Kruskal-Wallis one-way analysis of variance by ranks.
When using this kind of design for a binary response, one instead uses the Cochran’s Q test.
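A minimal sketch of the wine-judging example using SciPy's friedmanchisquare (a real SciPy function; one argument per treatment, with one measurement per block; the scores are made up):

```python
from scipy import stats

# k = 3 wines, each rated by n = 4 judges (made-up scores; one list per wine)
wine_a = [8.5, 7.0, 9.0, 6.5]
wine_b = [7.5, 6.5, 8.0, 6.0]
wine_c = [6.0, 7.5, 7.0, 5.5]

stat, p = stats.friedmanchisquare(wine_a, wine_b, wine_c)
```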
Durbin test
In the analysis of designed experiments, the Friedman test is the most common non-parametric test for complete block designs. The Durbin test is a nonparametric test for balanced incomplete designs that reduces to the Friedman test in the case of a complete block design.
More:
In a randomized block design, k treatments are applied to b blocks. For some experiments, it may not be realistic to run all treatments in all blocks, so one may need to run an incomplete block design. In this case, it is strongly recommended to run a balanced incomplete block design, which has the following properties:
Every block contains k experimental units.
Every treatment appears in r blocks.
Every treatment appears with every other treatment an equal number of times.
The Durbin test is based on the following assumptions:
The b blocks are mutually independent. That means the results within one block do not affect the results within other blocks.
The data can be meaningfully ranked (i.e., the data have at least an ordinal scale).
Cochran’s Q test is applied in the special case of a binary response variable (i.e., one that can have only one of two possible outcomes).
Cochran’s Q test
In statistics, in the analysis of two-way randomized block designs where the response variable can take only two possible outcomes (coded as 0 and 1), Cochran’s Q test is a non-parametric statistical test to verify if k treatments have identical effects.[1][2] It is named for William Gemmell Cochran. Cochran’s Q test should not be confused with Cochran’s C test, which is a variance outlier test.
Cochran’s Q test assumes that there are k > 2 experimental treatments and that the observations are arranged in b blocks.
The hypotheses of Cochran’s Q test are:
H0: The treatments are equally effective.
Ha: There is a difference in effectiveness among treatments.
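A minimal sketch from the standard formula Q = k(k−1) Σ_j (C_j − N/k)² / Σ_i R_i(k − R_i), with column totals C_j, row totals R_i, and grand total N; under H0, Q is approximately chi-squared with k−1 degrees of freedom (an illustrative implementation; the 0/1 matrix is made up):

```python
import numpy as np
from scipy.stats import chi2

def cochrans_q(x):
    """Cochran's Q for a (b blocks x k treatments) matrix of 0/1 outcomes."""
    b, k = x.shape
    col = x.sum(axis=0)                      # successes per treatment
    row = x.sum(axis=1)                      # successes per block
    n = x.sum()
    q = k * (k - 1) * ((col - n / k) ** 2).sum() / (row * (k - row)).sum()
    return q, chi2.sf(q, k - 1)              # chi-squared with k-1 df under H0

# 6 blocks (subjects) x 3 treatments, binary success/failure (made up)
x = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [1, 0, 0],
              [1, 1, 0]])
print(cochrans_q(x))
```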
Analysis of variance
Analysis of variance (ANOVA) is a collection of statistical models, and their associated procedures, in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are all equal, and therefore generalizes the t-test to more than two groups. Doing multiple two-sample t-tests would result in an increased chance of committing a type I error. For this reason, ANOVAs are useful in comparing two, three, or more means.
The analysis of variance can be presented in terms of a linear model, which makes the following assumptions about the probability distribution of the responses:
Independence of observations – this is an assumption of the model that simplifies the statistical analysis.
Normality – the distributions of the residuals are normal.
Equality (or “homogeneity”) of variances, called homoscedasticity — the variance of data in groups should be the same.
The separate assumptions of the textbook model imply that the errors are independently, identically, and normally distributed for fixed-effects models.
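A minimal sketch of the simplest case, a one-way ANOVA across three groups, using SciPy's f_oneway (a real SciPy function; the measurements are made up):

```python
from scipy import stats

# Three groups of made-up measurements
g1 = [5.1, 4.9, 5.3, 5.0]
g2 = [5.8, 6.1, 5.9, 6.0]
g3 = [5.2, 5.5, 5.4, 5.1]

f, p = stats.f_oneway(g1, g2, g3)   # H0: all group means are equal
```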
More Anova-like tests:
ANOVA on ranks
ANOVA-simultaneous component analysis
AMOVA
ANCOVA
ANORVA
MANOVA
Mixed-design analysis of variance
Two-way analysis of variance
One-way analysis of variance
More, see: http://en.wikipedia.org/wiki/ANOVA
Kendall tau rank correlation coefficient
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall’s tau (τ) coefficient, is a statistic used to measure the association between two measured quantities. A tau test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.
Specifically, it is a measure of rank correlation, i.e., the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938,[1] though Gustav Fechner had proposed a similar measure in the context of time series in 1897.
Let (x1, y1), (x2, y2), …, (xn, yn) be a set of observations of the joint random variables X and Y respectively, such that all the values of (xi) and (yi) are unique. Any pair of observations (xi, yi) and (xj, yj) are said to be concordant if the ranks for both elements agree: that is, if both xi > xj and yi > yj, or if both xi < xj and yi < yj. They are said to be discordant if xi > xj and yi < yj, or if xi < xj and yi > yj. If xi = xj or yi = yj, the pair is neither concordant nor discordant.
If X and Y are independent, then we would expect the coefficient to be approximately zero.
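A minimal sketch using SciPy's kendalltau (a real SciPy function; the paired values are made up):

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]   # made-up paired observations
y = [2, 1, 4, 3, 6, 5]

tau, p = stats.kendalltau(x, y)   # tau near 0 suggests independence
```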
Goodman and Kruskal’s gamma
In statistics, Goodman and Kruskal’s gamma measures the strength of association of the cross tabulated data when both variables are measured at the ordinal level. It makes no adjustment for either table size or ties. Values range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.
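A minimal sketch computing gamma directly from its definition, gamma = (C − D) / (C + D), where C and D count concordant and discordant pairs and ties are skipped (an illustrative O(n²) implementation; the scores are made up):

```python
def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D) over concordant/discordant pairs; ties skipped."""
    c = d = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1      # concordant pair
            elif s < 0:
                d += 1      # discordant pair
    return (c - d) / (c + d)

# Made-up ordinal scores for two variables
print(goodman_kruskal_gamma([1, 2, 2, 3, 4], [1, 1, 2, 3, 3]))
```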
Cohen’s kappa
Cohen’s kappa coefficient is a statistical measure of inter-rater agreement or inter-annotator agreement[1] for qualitative (categorical) items. It is generally thought to be a more robust measure than a simple percent-agreement calculation, since κ takes into account the agreement occurring by chance. Some researchers[2] have expressed concern over κ’s tendency to take the observed categories’ frequencies as givens, which can have the effect of underestimating agreement for a category that is also commonly used; for this reason, κ is considered an overly conservative measure of agreement.
Others[3] contest the assertion that kappa “takes into account” chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess—a very unrealistic scenario.
Cohen’s kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories.
The equation for κ is:
κ = (Pr(a) − Pr(e)) / (1 − Pr(e))
where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as defined by Pr(e)), κ = 0.
Kappa assumes its theoretical maximum value of 1 only when both observers distribute codes the same, that is, when corresponding row and column sums are identical. Anything less is less than perfect agreement. Still, the maximum value kappa could achieve given unequal distributions helps interpret the value of kappa actually obtained. The equation for the maximum is κmax = (Pr(max) − Pr(e)) / (1 − Pr(e)), where Pr(max) is the sum over categories of the smaller of the two raters’ marginal proportions. (See http://en.wikipedia.org/wiki/Cohen%27s_kappa)
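A minimal sketch computing κ directly from the formula above (an illustrative implementation; the rater labels are made up; sklearn.metrics.cohen_kappa_score is an existing library alternative):

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' category labels over the same N items."""
    cats = sorted(set(r1) | set(r2))
    idx = {c: i for i, c in enumerate(cats)}
    m = np.zeros((len(cats), len(cats)))          # confusion matrix
    for a, b in zip(r1, r2):
        m[idx[a], idx[b]] += 1
    n = len(r1)
    pr_a = np.trace(m) / n                                   # observed agreement
    pr_e = (m.sum(axis=0) * m.sum(axis=1)).sum() / n ** 2    # chance agreement
    return (pr_a - pr_e) / (1 - pr_e)

rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
rater2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(cohens_kappa(rater1, rater2))   # 0.5 for this made-up data
```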