Assumptions (Field, Ch. 5) Flashcards

1
Q

Bootstrap

A

a technique from which the sampling distribution of a statistic is estimated by taking repeated samples (with replacement) from the data set (in effect, treating the data as a population from which smaller samples are taken). The statistic of interest (e.g., the mean, or b coefficient) is calculated for each sample, from which the sampling distribution of the statistic is estimated. The standard error of the statistic is estimated as the standard deviation of the sampling distribution created from the bootstrap samples. From this, confidence intervals and significance tests can be computed.
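
A minimal sketch in Python of the idea (the scores and the number of resamples are illustrative assumptions, not from the text): resample the data with replacement, compute the mean for each resample, then take the SD of those means as the bootstrap standard error and the percentiles as a confidence interval.

```python
import numpy as np

rng = np.random.default_rng(42)
scores = np.array([2, 2, 2, 3, 3, 4, 4, 4, 5, 6, 7, 9])  # illustrative data

# Resample the data with replacement many times and compute the mean each time.
n_boot = 2000
boot_means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                       for _ in range(n_boot)])

se = boot_means.std(ddof=1)                  # bootstrap standard error of the mean
ci = np.percentile(boot_means, [2.5, 97.5])  # 95% percentile confidence interval
print(f"mean = {scores.mean():.2f}, bootstrap SE = {se:.2f}, 95% CI = {ci}")
```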

2
Q

Heterogeneity of variance

A

the opposite of homogeneity of variance. This term means that the variance of one variable varies (i.e., is different) across levels of another variable.

3
Q

Heteroscedasticity

A

the opposite of homoscedasticity. This occurs when the residuals at each level of the predictor variable(s) have unequal variances. Put another way, at each point along any predictor variable, the spread of residuals is different.

4
Q

Homogeneity of variance

A

the assumption that the variance of one variable is stable (i.e., relatively similar) at all levels of another variable.

5
Q

Homoscedasticity

A

an assumption in regression analysis that the residuals at each level of the predictor variable(s) have similar variances. Put another way, at each point along any predictor variable, the spread of residuals should be fairly constant.
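
A minimal sketch (with made-up predictor and outcome data) of the usual visual check: plot residuals from a simple regression against the fitted values. A roughly even spread is consistent with homoscedasticity; a funnel shape suggests heteroscedasticity.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 1 + 0.3 * x)   # error spread grows with x (heteroscedastic)

# Fit a simple regression and inspect the spread of residuals across fitted values.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

plt.scatter(b0 + b1 * x, residuals, s=10)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("A funnel shape indicates unequal spread (heteroscedasticity)")
plt.show()
```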

6
Q

Independence

A

the assumption that one data point does not influence another. When data come from people, it basically means that the behaviour of one person does not influence the behaviour of another.

7
Q

Kolmogorov-Smirnov test

A

one way to test whether the data are normal: a test of whether a distribution of scores is significantly different from a normal distribution. A significant value indicates a deviation from normality, but this test is notoriously affected by large samples, in which even small deviations from normality yield significant results.
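
A minimal sketch using SciPy (the data are made up): `kstest` compares the standardized scores to a standard normal distribution. Note the p-value is only approximate when the mean and SD are estimated from the same data (the Lilliefors issue).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
scores = rng.normal(loc=50, scale=10, size=100)   # illustrative data

# Compare the standardized scores to a standard normal distribution.
z = (scores - scores.mean()) / scores.std(ddof=1)
d, p = stats.kstest(z, "norm")
print(f"D = {d:.3f}, p = {p:.3f}")   # a significant p suggests deviation from normality
```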

8
Q

Levene’s test

A

this tests the hypothesis that the variances in different groups are equal (i.e., the difference between the variances is zero). It is essentially a one-way ANOVA on the deviations (i.e., the absolute value of the difference between each score and the mean of its group). A significant result indicates that the variances are significantly different and, therefore, that the assumption of homogeneity of variance has been violated. When sample sizes are large, even small differences in group variances can produce a significant Levene’s test.
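
A minimal sketch (made-up groups) showing both SciPy's built-in Levene's test and the "one-way ANOVA on absolute deviations from the group mean" mechanism described above; centred on the mean, the two give the same statistic and p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
g1 = rng.normal(10, 2, 30)   # illustrative groups with different spreads
g2 = rng.normal(10, 5, 30)
g3 = rng.normal(10, 2, 30)

# Built-in Levene's test (centre on the mean to match the classic version).
w, p = stats.levene(g1, g2, g3, center="mean")

# The same thing by hand: one-way ANOVA on absolute deviations from each group's mean.
f, p_anova = stats.f_oneway(*[np.abs(g - g.mean()) for g in (g1, g2, g3)])

print(f"Levene: W = {w:.3f}, p = {p:.4f}")
print(f"ANOVA on |deviations|: F = {f:.3f}, p = {p_anova:.4f}")   # matches Levene
```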

9
Q

Mixed normal distribution

A

a normal-looking distribution that is contaminated by a small proportion of scores from a different distribution. These distributions are not normal and have too many scores in the tails (i.e., at the extremes). The effect of these heavy tails is to inflate the estimate of the population variance. This, in turn, makes significance tests lack power.
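
A minimal simulation sketch (all numbers are illustrative): contaminating a normal distribution with a small proportion of scores from a much wider distribution noticeably inflates the variance estimate.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10_000

clean = rng.normal(0, 1, n)               # pure normal scores
contaminant = rng.normal(0, 5, n)         # a much wider "contaminating" distribution
is_contaminated = rng.random(n) < 0.10    # 10% of scores come from the contaminant
mixed = np.where(is_contaminated, contaminant, clean)

print(f"variance of clean scores: {clean.var(ddof=1):.2f}")   # about 1.0
print(f"variance of mixed scores: {mixed.var(ddof=1):.2f}")   # noticeably inflated (about 3.4)
```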

10
Q

Outlier

A

an observation or observations very different from most others. Outliers bias statistics (e.g., the mean) and their standard errors and confidence intervals.

11
Q

P-P plot

A

Short for a probability-probability plot. A graph plotting the cumulative probability of a variable against the cumulative probability of a particular distribution (often a normal distribution). Like a Q-Q plot, if values fall on the diagonal of the plot then the variable shares the same distribution as the one specified. Deviations from the diagonal show deviations from the distribution of interest.
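
A minimal sketch of a P-P plot against a normal distribution, built by hand (the data are made up): observed cumulative probabilities are plotted against the cumulative probabilities expected under a fitted normal, with the diagonal as the reference line.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)
scores = rng.normal(100, 15, 80)   # illustrative data

x = np.sort(scores)
emp_p = (np.arange(1, x.size + 1) - 0.5) / x.size              # empirical cumulative probabilities
theo_p = stats.norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))  # cumulative probabilities under a fitted normal

plt.plot(theo_p, emp_p, "o", markersize=3)
plt.plot([0, 1], [0, 1], "k--")    # points on the diagonal => same distribution
plt.xlabel("Theoretical cumulative probability")
plt.ylabel("Observed cumulative probability")
plt.title("P-P plot against a normal distribution")
plt.show()
```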

12
Q

Parametric test

A

a test that requires data from one of the large catalogue of distributions that statisticians have described. Normally this term is used for parametric tests based on the normal distribution, which require four basic assumptions that must be met for the test to be accurate: a normally distributed sampling distribution (see normal distribution), homogeneity of variance, interval or ratio data, and independence.

13
Q

Q-Q plot

A

short for a quantile-quantile plot. A graph plotting the quantiles of a variable against the quantiles of a particular distribution (often a normal distribution). Like a P-P plot, if values fall on the diagonal of the plot then the variable shares the same distribution as the one specified. Deviations from the diagonal show deviations from the distribution of interest.
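
A minimal sketch using SciPy's `probplot`, which draws a normal Q-Q plot (the data are made up and deliberately skewed so the points bend away from the line):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(9)
scores = rng.exponential(scale=2.0, size=100)   # skewed illustrative data

# Plot the sample quantiles against the quantiles of a normal distribution.
stats.probplot(scores, dist="norm", plot=plt)
plt.title("Q-Q plot: skewed data bend away from the diagonal")
plt.show()
```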

14
Q

Robust test

A

a term applied to a family of procedures to estimate statistics that are reliable even when the normal assumptions of the statistic are not met.

15
Q

Transformation

A

the process of applying a mathematical function to all observations in a data set, usually to correct some distributional abnormality such as skew or kurtosis.
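
A minimal sketch (illustrative data) of a common example: taking the log of positively skewed scores to reduce the skew.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
reaction_times = rng.lognormal(mean=0.0, sigma=0.8, size=500)   # positively skewed scores

log_rt = np.log(reaction_times)   # apply the same function to every observation

print(f"skew before: {stats.skew(reaction_times):.2f}")   # clearly positive
print(f"skew after:  {stats.skew(log_rt):.2f}")           # close to zero
```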

16
Q

Trimmed mean

A

a statistic used in many robust tests. It is a mean calculated after a certain percentage of the distribution has been removed at the extremes. For example, a 20% trimmed mean is a mean calculated after the top and bottom 20% of ordered scores have been removed. Imagine we had 20 scores representing the annual income of students (in thousands, rounded to the nearest thousand): 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 35. The mean income is 5 (£5,000), but this value is biased by an outlier. A 10% trimmed mean removes 10% of scores from the top and bottom of the ordered scores before the mean is calculated. With 20 scores, removing 10% means removing the top 2 and bottom 2 scores, which gives us: 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, the mean of which is 3.44. The mean depends on a symmetrical distribution to be accurate, but a trimmed mean produces accurate results even when the distribution is not symmetrical. There are more complex robust methods, such as the bootstrap.
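
The worked example from this card, reproduced as a minimal sketch with SciPy's `trim_mean` (10% cut from each end of the ordered scores):

```python
import numpy as np
from scipy import stats

incomes = np.array([2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 35])

print(f"mean: {incomes.mean():.2f}")                             # 5.00, dragged up by the outlier
print(f"10% trimmed mean: {stats.trim_mean(incomes, 0.1):.2f}")  # 3.44
```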

17
Q

Weighted least squares

A

a method of regression in which the parameters of the model are estimated using the method of least squares but observations are weighted by some other variable. Often they are weighted by the inverse of their variance to combat heteroscedasticity.
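
A minimal sketch with statsmodels (simulated data; using the known error SD as the weighting variable is an assumption for illustration): observations are weighted by the inverse of their error variance.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
error_sd = 0.5 * x                        # error spread grows with x (heteroscedastic)
y = 3 + 2 * x + rng.normal(0, error_sd)

X = sm.add_constant(x)
wls = sm.WLS(y, X, weights=1.0 / error_sd**2).fit()   # weight = 1 / variance
print(wls.params)                         # estimated intercept and slope
```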

18
Q

What are the tests that I can use to see if my data is normal?

A

The Kolmogorov-Smirnov (K-S) test, the Shapiro-Wilk (S-W) test and the chi-square test, all of which can be run in free statistical software.
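
A minimal sketch (made-up data) showing the Shapiro-Wilk test alongside the K-S test in SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
scores = rng.normal(0, 1, 50)   # illustrative data

w, p_sw = stats.shapiro(scores)                     # Shapiro-Wilk
z = (scores - scores.mean()) / scores.std(ddof=1)
d, p_ks = stats.kstest(z, "norm")                   # Kolmogorov-Smirnov
print(f"S-W: W = {w:.3f}, p = {p_sw:.3f};  K-S: D = {d:.3f}, p = {p_ks:.3f}")
```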

19
Q

What are the assumptions of parametric data?

A

Normally distributed data (either the sampling distribution or the errors in the model must be normal)

Homogeneity of variance

Interval data

Independence

20
Q

As the sample gets larger, we can be more confident that…

A

The sampling distribution will be normally distributed (roughly, with samples of 30 or more). This is important for the first assumption of parametric tests (i.e., all the tests we use).
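
A minimal simulation sketch (illustrative skewed population): the sampling distribution of the mean looks increasingly normal as the sample size grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
population = rng.exponential(scale=2.0, size=100_000)   # a clearly skewed population

for n in (5, 30, 100):
    # Draw many samples of size n and keep each sample's mean.
    sample_means = rng.choice(population, size=(5_000, n)).mean(axis=1)
    print(f"n = {n:3d}: skew of sampling distribution = {stats.skew(sample_means):.2f}")
```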