Statistics Flashcards
Assumptions in ANOVA
The Analysis of Variance (ANOVA) test is used to explore differences between or among the means of two or more groups, as well as to examine the individual and interacting effects of multiple independent variables. The first assumption of an ANOVA test is homogeneity of variance, which expects the populations being compared to have equal variances; this matters because unequal variances can lead a researcher to erroneous conclusions about whether true differences exist between groups. The second assumption is normal distribution of the errors/residuals within each group/condition. The third assumption is that the observations in the groups being examined are independent of one another.
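As a minimal sketch (hypothetical group scores, using scipy as one possible tool, not a prescribed procedure), the first two assumptions can be checked with Levene's test for homogeneity of variance and a Shapiro-Wilk test on the pooled residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(50, 10, 30)   # hypothetical scores for group 1
group2 = rng.normal(55, 10, 30)   # hypothetical scores for group 2
group3 = rng.normal(60, 10, 30)   # hypothetical scores for group 3

# Homogeneity of variance: a non-significant Levene statistic supports equal variances.
levene_stat, levene_p = stats.levene(group1, group2, group3)

# Normality of residuals: subtract each group's mean, then test the pooled residuals.
residuals = np.concatenate([g - g.mean() for g in (group1, group2, group3)])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"Levene p = {levene_p:.3f}, Shapiro-Wilk p = {shapiro_p:.3f}")
```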
Interaction in ANOVA
Exploring interaction effects within a test like a factorial ANOVA examines whether the effect of one independent variable on the dependent variable depends on the level of another independent variable. A significant interaction takes precedence over significant main effects (overall differences on a single variable) in the interpretation of the results. An example of an interaction effect that might be investigated is whether the effect of a CBT treatment condition that differs significantly between group one and group two is independent of, or interacts with, gender. If an interaction effect is significant, results might show that the treatment effect was much greater for cisgender women than transgender women, possibly due to the lesser applicability of the treatment modules to the unique characteristics of transgender women's lives.
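A minimal sketch of testing such an interaction, assuming simulated data and the made-up column names treatment, gender, and outcome, using a statsmodels factorial ANOVA:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 40  # hypothetical participants per cell
df = pd.DataFrame({
    "treatment": np.repeat(["CBT", "control"], 2 * n),
    "gender": np.tile(np.repeat(["cis", "trans"], n), 2),
})
# Simulate a larger treatment effect for one gender group (i.e., an interaction).
effect = np.where((df.treatment == "CBT") & (df.gender == "cis"), 8,
         np.where((df.treatment == "CBT") & (df.gender == "trans"), 3, 0))
df["outcome"] = 50 + effect + rng.normal(0, 5, len(df))

# The C(treatment):C(gender) row of the ANOVA table is the interaction term.
model = ols("outcome ~ C(treatment) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```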
Simple/Main Effects
When an interaction is significant, follow-up tests examining simple effects allow one to examine the interaction further. A simple effect is defined as the effect of one IV within a single level of the second IV. For example, a simple effect might examine the effect of the CBT treatment condition (versus control) among transgender women only. On the other hand, a main effect examines the effect of one IV on the DV averaged across the levels of the second IV. For instance, we might expect a main effect of treatment condition generally because one group receives an intervention and another does not. An interaction would be implied if this improvement under the treatment condition varied depending on participants' gender.
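One way to probe a simple effect, sketched here with simulated data and assumed column names, is to test the treatment difference separately within each gender level:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "treatment": np.repeat(["CBT", "control"], 60),
    "gender": np.tile(["cis", "trans"], 60),
    "outcome": rng.normal(50, 8, 120),   # hypothetical outcome scores
})

# Simple effect of treatment within each level of gender, tested separately.
for level in df["gender"].unique():
    sub = df[df["gender"] == level]
    t, p = stats.ttest_ind(sub.loc[sub.treatment == "CBT", "outcome"],
                           sub.loc[sub.treatment == "control", "outcome"])
    print(f"Simple effect of treatment for {level}: t = {t:.2f}, p = {p:.3f}")
```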
Assumptions in Linear Regression
A linear regression test examines the prediction of a Y (criterion) variable from one or more X (predictor) variables; both are typically measured on an interval or ratio scale, although nominal predictors can be included if they are dummy coded. The first assumption of a linear regression is that the predictor and criterion have a linear relationship, best described by a straight line of best fit; this can be examined through scatterplots. The second assumption is that the residuals are normally distributed, i.e., unimodal, symmetrical, and mesokurtic; this can be examined through histograms and normality statistics. The third assumption is homoscedasticity, meaning the variance of the residuals is consistent across all levels of the predictor; this can be examined by plotting the residuals. The fourth assumption is independence of observations and residuals, ensuring no correlation exists between errors.
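A minimal sketch of checking the residual-based assumptions on simulated data, using a statsmodels fit with Shapiro-Wilk and Breusch-Pagan tests as one possible set of diagnostics:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 2, 100)   # linear relationship plus random error

X = sm.add_constant(x)          # adds the intercept column
fit = sm.OLS(y, X).fit()
resid = fit.resid

# Normality of residuals (assumption 2): a non-significant Shapiro-Wilk p supports normality.
shapiro_stat, shapiro_p = stats.shapiro(resid)
print("Shapiro-Wilk p:", round(shapiro_p, 3))

# Homoscedasticity (assumption 3): Breusch-Pagan tests whether residual variance
# depends on the predictors; a non-significant p supports constant variance.
bp_stat, bp_p, f_stat, f_p = het_breuschpagan(resid, X)
print("Breusch-Pagan p:", round(bp_p, 3))
```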
Central Limit Theorem/Sampling Distribution of the Mean
The sampling distribution of the mean is the distribution of values we would expect to obtain for the mean if we drew an infinite number of samples from the population in question and calculated the mean of each sample. The central limit theorem describes how the sampling distribution of the mean (i.e., the distribution of sample means) approaches normal as sample size increases, even if the parent population is not normally distributed. The rate at which the sampling distribution of the mean approaches normal as n increases is a function of the shape of the parent population. If the parent population is itself normal, then the sampling distribution of the mean will be normal regardless of n. Also, when sample sizes are large enough (>30), the sampling distribution will be approximately normal even if the population does not have a normal distribution. If sample sizes are smaller, the sampling distribution of the mean will more closely reflect the shape of the parent population; if the parent population is unimodal and roughly symmetric, the sampling distribution should still be approximately normal.
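A small simulation can illustrate the theorem; the sketch below draws samples from a deliberately skewed (exponential) parent population and shows the skewness of the sample means shrinking toward 0 (normality) as n grows:

```python
import numpy as np

rng = np.random.default_rng(4)
parent = rng.exponential(scale=2.0, size=100_000)   # a clearly non-normal parent population

for n in (2, 5, 30, 100):
    # Draw 5,000 samples of size n and compute the mean of each one.
    means = np.array([rng.choice(parent, size=n).mean() for _ in range(5_000)])
    # Skewness of the sampling distribution shrinks as n increases.
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print(f"n = {n:4d}: mean of sample means = {means.mean():.2f}, skewness = {skew:.2f}")
```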
Box plot
The boxplot was developed by John Tukey as an alternative to the histogram or stem-and-leaf display for examining the dispersion of data. It is designed to highlight outliers around the median of the data by using the 1st and 3rd quartiles, which bracket the middle 50% of scores (i.e., the interquartile range, IQR). A box is drawn from the 1st to the 3rd quartile, with a vertical line inside the box representing the median. The whiskers extend from the edges of the box to the most extreme scores that fall within 1.5 x IQR of Q1 and Q3; any point beyond the whiskers is considered an outlier. Examining the boxplot allows us to tell whether the distribution is symmetric by checking whether the median lies in the center of the box, and skewness can be judged by comparing the lengths of the whiskers to one another.
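A minimal sketch (hypothetical scores) of the quantities a Tukey boxplot is built from:

```python
import numpy as np

scores = np.array([12, 15, 16, 18, 19, 20, 21, 22, 23, 25, 27, 48])  # hypothetical scores
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Whiskers reach the most extreme scores within [{lower_fence}, {upper_fence}]")
print("Outliers:", outliers)   # here 48 falls beyond the upper fence
```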
Homoscedasticity vs. Heteroscedasticity
Homoscedasticity (or homogeneity of variance) is an assumption of ANOVA and linear regression that populations or error terms have the same variance. Homoscedasticity can be examined through a residual scatterplot, which should show a roughly uniform band of points around the regression line. Heteroscedasticity (or heterogeneity of variance) is the opposite, in which populations or error terms have different variances. Heteroscedasticity is a larger issue in research because the residuals systematically change based on the level of the IV. We expect errors in prediction, but we want those errors to be random in nature and therefore uniformly distributed around the regression line. In a graphical plot of residuals, heteroscedasticity is implied when the residuals cone or fan out as the fitted values increase.
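As an illustration only, the sketch below simulates homoscedastic and heteroscedastic residuals and compares their spread in the lower versus upper half of the predictor; the fanning pattern shows up as a much larger spread at high predictor values:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 500)
resid_homo = rng.normal(0, 2, 500)        # constant error variance
resid_hetero = rng.normal(0, 0.5 * x)     # error variance grows with x (fans out)

for label, resid in (("homoscedastic", resid_homo), ("heteroscedastic", resid_hetero)):
    low = resid[x < 5].std()
    high = resid[x >= 5].std()
    print(f"{label}: SD of residuals for low x = {low:.2f}, high x = {high:.2f}")
```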
Confidence intervals
Confidence intervals provide a range of values that likely contains an unknown population parameter, typically at a confidence level of 95%. The upper and lower bounds of a confidence interval change from sample to sample. In other words, if we used the same sampling method to select different samples and computed an interval estimate each time, we would expect roughly 95% of those intervals to contain the true population parameter. Some misinterpret a CI to mean there is a 95% probability that the true population parameter falls within the particular interval calculated. CIs are statements about the probability that intervals constructed this way encompass the target population parameter, not statements about the parameter itself, because the population parameter does not vary. Since the population parameter is fixed, variation in the CIs is due to sampling error.
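A small simulation can make the coverage interpretation concrete; the sketch below (with assumed population values) computes many 95% t-based intervals and counts how often they contain the fixed population mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
true_mean, true_sd, n, reps = 100.0, 15.0, 25, 2_000
covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, true_sd, n)
    se = sample.std(ddof=1) / np.sqrt(n)          # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=n - 1)          # critical t for a 95% interval
    lower = sample.mean() - t_crit * se
    upper = sample.mean() + t_crit * se
    covered += lower <= true_mean <= upper

print(f"Proportion of intervals containing the true mean: {covered / reps:.3f}")  # ~0.95
```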
Family-wise Error Rate/Post-Hoc
When making comparisons between group means, a family of conclusions is made. This is commonly encountered when conducting post-hoc tests on an analysis like an ANOVA, in which one might consider how specific groups differ on an outcome. The family-wise error rate is the probability that this set of comparisons will result in at least one Type I error. While the error rate of each comparison on its own stays the same, the error rate across the family of comparisons is inflated (i.e., Type I error accumulates); this is why researchers like Howell (2013) emphasize making post-hoc test decisions based on theory and practice rather than searching for every possible comparison, because multiple comparisons increase the family-wise error rate. A common way to control the family-wise error rate is the Bonferroni correction, which uses a more conservative per-comparison alpha level (alpha divided by the number of comparisons being made).
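A minimal sketch of the inflation and the Bonferroni adjustment, under the simplifying assumption of independent comparisons so that FWER is approximately 1 - (1 - alpha)^c:

```python
alpha = 0.05
for c in (1, 3, 5, 10):
    fwer = 1 - (1 - alpha) ** c          # chance of at least one Type I error in the family
    bonferroni_alpha = alpha / c         # more conservative per-comparison alpha
    print(f"{c:2d} comparisons: FWER = {fwer:.3f}, Bonferroni per-test alpha = {bonferroni_alpha:.4f}")
```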
Effect size vs Statistical Significance
The increasingly common practice of reporting effect sizes provides greater information about statistical results beyond significance testing. Significance testing on its own examines whether results, whether trivial or major, are unlikely to be due to chance, based on a chosen alpha value (e.g., p < .05). Statistical significance can be influenced by sample size, with significance more likely to be found due to the greater power afforded by larger samples. Effect sizes (e.g., the d-family or r-family), by contrast, speak to "importance", or the degree of meaningfulness of the difference or relationship being examined. Effect sizes can be standardized so they can be compared across research (e.g., Cohen's d is expressed in standard deviation units for interpretation), with larger effect sizes implying greater practical significance of a finding. A test using 15,000 individuals may reach statistical significance given the large number of participants, but the effect found might be meaningless depending on the effect size. Reporting effect sizes is necessary so researchers can interpret the meaningfulness of a significant finding for real-world applications.
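A minimal sketch (simulated data) of how a tiny Cohen's d, a d-family effect size, can coexist with a significant p-value when samples are very large:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 15_000
a = rng.normal(100.0, 15.0, n)
b = rng.normal(100.6, 15.0, n)   # a very small true difference

t, p = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd   # difference in standard deviation units

print(f"p = {p:.4f}, Cohen's d = {cohens_d:.3f}")  # likely significant, yet d is only ~0.04
```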
Measures of central tendency/Mean, Median, Mode
Mean, median, and mode are measures of central tendency. The mode is the most common score, can be used with nominal/ordinal data, and is unaffected by extreme values; however, it may not represent the entire data set, can be unstable from sample to sample, and a distribution may have more than one mode. The median is the score at the 50th percentile; its value may not actually occur in the data, it is unaffected by extreme values, and the median location = (n + 1)/2. The mean is the average score; its value may not actually occur in the data, it can be manipulated algebraically, it is influenced by extreme values, and it is usually a better estimate of the population mean. Under the normal distribution, the mean, median, and mode are all equal. Skew will pull the mean toward the longer tail of the distribution.
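A minimal sketch computing all three measures on a small, hypothetical right-skewed set of scores:

```python
from statistics import mean, median, multimode

scores = [2, 3, 3, 4, 4, 4, 5, 6, 9, 15]    # hypothetical, right-skewed
print("mode(s):", multimode(scores))         # [4], the most common score
print("median :", median(scores))            # 4.0, the 50th percentile
print("mean   :", mean(scores))              # 5.5, pulled toward the long upper tail
```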
Multicollinearity
Collinearity is when the predictors within an equation or model are correlated with one another. This might occur if the predictors measure the same construct, if the predictors are naturally correlated (e.g., weight and BMI), or because of sampling error. High multicollinearity is problematic because it inflates the standard errors of the regression coefficients, which can lead one to conclude that there is no relationship between a predictor and the criterion. Additionally, multicollinearity can lead to faulty conclusions about R², as the unique predictive contribution of the variables is reduced. Predictor variables that overlap with one another do little to explain the intricacies of the prediction. Tolerance quantifies the degree of overlap between predictors as well as instability in the model. The variance inflation factor (VIF), calculated as 1 divided by tolerance, indicates how much a predictor's overlap inflates its standard error. Researchers want high tolerance (at least above .10) and low VIF (below 10). To correct multicollinearity: (1) eliminate redundant X variables from the analysis, (2) combine X variables through factor analysis, or (3) for some multicollinearity, use a centering transformation and express the variable as a deviation from its mean.
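A minimal sketch (simulated predictors, using statsmodels' variance_inflation_factor as one possible tool) of how redundancy between predictors shows up as low tolerance and high VIF:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # x3 is nearly redundant with x1

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["x1", "x2", "x3"], start=1):   # skip the constant column
    vif = variance_inflation_factor(X, i)
    # x1 and x3 overlap heavily, so both show high VIF and low tolerance; x2 does not.
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```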
Residuals
Residuals are the differences between observed scores and the scores predicted by the regression line, and they represent the error in prediction. Residuals reveal the amount of variability in the DV that is "left over" after accounting for the variability explained by the predictors in the analysis. In regression, one is making a prediction that an IV is associated with the DV; a residual is a numeric value for how much that prediction was wrong. The smaller the residuals, the more accurate the predictions in the regression, which indicates the IVs are related to, or predictive of, the DV. Scatterplots with a line of best fit can help visualize residuals in a data set. An assumption in regression is homoscedasticity, which assumes that the variance of the residuals is consistent across levels of the predictor.
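A minimal sketch computing residuals as observed minus predicted values from a simple least-squares line, using hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # hypothetical observed scores

slope, intercept = np.polyfit(x, y, deg=1)       # line of best fit
predicted = intercept + slope * x
residuals = y - predicted                        # error in prediction for each case

print("residuals:", np.round(residuals, 2))
print("sum of squared residuals:", round(float(np.sum(residuals ** 2)), 3))
```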
Measures of Dispersion/Variability & Variance
Measures of dispersion/variability indicate the degree to which individual observations are clustered around or deviate from the average value, since the average value may reflect either a tight or a wide range of values. Dispersion can occur around the mean, median, or mode. Measures of dispersion are as follows (see the sketch after this list):
Range – The distance between the highest and lowest scores (e.g., for scores running from 1 to 10, the range is 9). It is heavily dependent on extreme scores, making it difficult to ascertain anything about the middle scores or the overall variability of the distribution.
IQR – This method attempts to mitigate the effect of extreme scores on the range by using the middle 50% of scores (Q3 − Q1) while discarding the upper (above Q3) and lower (below Q1) 25% of the distribution. This can be an issue if too much meaningful data is discarded. Winsorizing scores is similar to this process but replaces the most extreme 10% of scores with the nearest remaining score.
Standard Deviation – Defined as the positive square root of the variance for a sample; it represents, roughly, the average deviation of each score from the mean. Regarding dispersion, the standard deviation in a normal distribution allows us to examine how distant a score is from the mean, giving us a picture of the overall variation in a data set.
Variance – Calculated by summing the squared deviations from the mean and dividing by the number of scores (N − 1 for a sample, N for a population). This also allows one to see how scores are dispersed around the mean. The square root of the variance is usually taken because the variance itself is in squared units and has little direct interpretability.
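A minimal sketch computing these dispersion measures for a small hypothetical sample:

```python
import numpy as np

scores = np.array([4, 7, 8, 9, 10, 12, 13, 15, 21])   # hypothetical scores

data_range = scores.max() - scores.min()               # highest minus lowest
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1                                          # middle 50% of scores
sample_variance = scores.var(ddof=1)                   # sum of squared deviations / (N - 1)
sample_sd = np.sqrt(sample_variance)                   # back in the original units

print(f"range = {data_range}, IQR = {iqr}, variance = {sample_variance:.2f}, SD = {sample_sd:.2f}")
```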
Normal distribution (contaminated/mixed)
The normal distribution refers to how recorded values are distributed overall and tends to represent natural phenomena (e.g., IQ). Under the normal distribution, the mean, median, and mode are equal, the distribution is unimodal, it has no skewness, it is symmetric, and it is mesokurtic (i.e., the bell curve). In a normal distribution, 68% of observations fall within ±1 SD of the mean. Normal distributions allow one to standardize scores as z-scores, which have a mean of 0 and an SD of 1. A contaminated (or mixed) distribution was described by Tukey (1960) as occurring when there are two normal distributions sampled with mixed probabilities, leading to heavy tails because a wider distribution contaminates the primary distribution. Some data points may be outliers or come from a distribution with a different mean or variance.
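A minimal sketch of a contaminated normal in Tukey's sense, under the assumed mixture of 95% N(0, 1) and 5% from a much wider normal; the heavy tails show up as large excess kurtosis:

```python
import numpy as np

rng = np.random.default_rng(9)
n, contamination = 100_000, 0.05
wide = rng.random(n) < contamination                  # ~5% drawn from the wide distribution
values = np.where(wide, rng.normal(0, 10, n), rng.normal(0, 1, n))

# Excess kurtosis is ~0 for a normal distribution; the mixture is clearly heavier-tailed.
kurtosis = ((values - values.mean()) ** 4).mean() / values.std() ** 4 - 3
print(f"excess kurtosis of the mixture: {kurtosis:.2f}")
```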