Steury PreReqs Flashcards
Full list
- Hypothesis testing and why we use it
- What a normal distribution is and why we use it
- What a p value is and how it is used
- Null hypothesis how is it used
- Regression, what's it used for, how do we do it
- Sum of squared error
- Sum of squares due to regression
- R squared
- Confidence interval
- t test
- ANOVA, what it is, how to do it
F-statistic
Analysis of variance (ANOVA) can determine whether the means of three or more groups are different. ANOVA uses F-tests to statistically test the equality of means.
Variance is the square of the standard deviation. For us humans, standard deviations are easier to understand than variances because they’re in the same units as the data rather than squared units.
However, many analyses actually use variances in the calculations. F-statistics are based on the ratio of mean squares. The term “mean squares” may sound confusing, but it is simply an estimate of population variance that accounts for the degrees of freedom (DF) used to calculate that estimate.
F-statistic = (variation between the sample means) / (variation within the samples)
F-Statistic Continued
The F-statistic is the test statistic for F-tests. In general, an F-statistic is a ratio of two quantities that are expected to be roughly equal under the null hypothesis, which produces an F-statistic of approximately 1.
The F-statistic incorporates both measures of variability discussed above. To see how these measures work together to produce low and high F-values, compare the spread of the group means to the spread of the observations within each group.
A low F-value arises when the group means are close together (low variability) relative to the variability within each group. A high F-value arises when the variability of the group means is large relative to the within-group variability. In order to reject the null hypothesis that the group means are equal, we need a high F-value.
For our plastic strength example, we’ll use the Factor Adj MS for the numerator (14.540) and the Error Adj MS for the denominator (4.402), which gives us an F-value of 3.30.
Is our F-value high enough? A single F-value is hard to interpret on its own. We need to place our F-value into a larger context before we can interpret it. To do that, we’ll use the F-distribution to calculate probabilities.
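A rough sketch in Python of the F-value above and its p-value from the F-distribution (the degrees of freedom below are hypothetical, since only the mean squares are given in the example; SciPy is assumed):

```python
from scipy import stats

# Adjusted mean squares from the plastic strength example above.
ms_factor = 14.540   # variation between the sample means (numerator)
ms_error = 4.402     # variation within the samples (denominator)

f_value = ms_factor / ms_error
print(f"F = {f_value:.2f}")  # F = 3.30

# Hypothetical degrees of freedom for illustration only,
# e.g. 4 groups (numerator df = 3) and 36 error df.
df_num, df_den = 3, 36

# p-value: probability of an F-value at least this large if the group means are equal.
p_value = stats.f.sf(f_value, df_num, df_den)
print(f"p = {p_value:.4f}")
```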
t-test
The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two groups, and it is especially appropriate as the analysis for the posttest-only two-group randomized experimental design.
What does it mean to say that the averages for two groups are statistically different? Consider three situations in which the difference between the means is identical but the spread of scores within each group is not: one with moderate variability of scores within each group, one with high variability, and one with low variability. The three situations tell very different stories. Clearly, we would conclude that the two groups appear most different, or distinct, in the low-variability case. Why? Because there is relatively little overlap between the two bell-shaped curves. In the high-variability case the group difference appears least striking, because the two bell-shaped distributions overlap so much. This leads us to a very important conclusion: when we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores. The t-test does just this.
The formula for the t-test is a ratio. The top part of the ratio is just the difference between the two means or averages. The bottom part is a measure of the variability or dispersion of the scores. This formula is essentially another example of the signal-to-noise metaphor in research: the difference between the means is the signal that, in this case, we think our program or treatment introduced into the data; the bottom part of the formula is a measure of variability that is essentially noise that may make it harder to see the group difference.
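A minimal sketch of this signal-to-noise ratio in Python (the two groups' scores are made up; NumPy and SciPy are assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two groups (e.g., treatment vs. control).
group_a = np.array([23, 25, 28, 30, 27, 26, 24, 29])
group_b = np.array([20, 22, 25, 21, 23, 24, 22, 26])

# Signal: difference between the two group means.
signal = group_a.mean() - group_b.mean()

# Noise: standard error of the difference (pooled-variance form).
n_a, n_b = len(group_a), len(group_b)
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
noise = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_manual = signal / noise

# The same test in one call; returns the t statistic and the two-sided p-value.
t_scipy, p_value = stats.ttest_ind(group_a, group_b)

print(t_manual, t_scipy, p_value)  # t_manual and t_scipy should match
```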
ANOVA, what it is, how to do it
An ANOVA test is a way to find out whether survey or experiment results from three or more groups are significantly different. In other words, it helps you figure out whether you need to reject the null hypothesis or accept the alternate hypothesis. Basically, you're testing groups to see if there's a difference between them. Examples of when you might want to test different groups:
- A group of psychiatric patients are trying three different therapies: counseling, medication, and biofeedback. You want to see if one therapy is better than the others.
- A manufacturer has two different processes to make light bulbs. They want to know if one process is better than the other.
- Students from different colleges take the same exam. You want to see if one college outperforms the others.
One way/two way: One-way or two-way refers to the number of independent variables (IVs) in your analysis of variance test. A one-way ANOVA has one independent variable (with two or more levels), and a two-way ANOVA has two independent variables (each of which can have multiple levels). For example, a one-way analysis of variance could have one IV (brand of cereal), while a two-way analysis of variance has two IVs (brand of cereal, calories).
Groups/levels: Groups or levels are the different groups within the same independent variable. In the example above, your levels for “brand of cereal” might be Lucky Charms, Raisin Bran, and Cornflakes, for a total of three levels. Your levels for “calories” might be sweetened and unsweetened, for a total of two levels.
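A minimal sketch of a one-way ANOVA for the three-therapies example above (the scores are made up; SciPy's f_oneway is assumed to be available):

```python
from scipy import stats

# Hypothetical improvement scores for three therapy groups.
counseling = [14, 16, 13, 18, 15, 17]
medication = [20, 22, 19, 23, 21, 24]
biofeedback = [15, 14, 17, 16, 18, 15]

# One-way ANOVA: one IV (therapy) with three levels.
f_stat, p_value = stats.f_oneway(counseling, medication, biofeedback)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests at least one group mean differs from the others.
```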
Regression, what’s it used for, how do we do it
Regression analysis is used in stats to find trends in data. Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. It will also give you a slew of statistics (including a p-value and a correlation coefficient) to tell you how accurate your model is.
Linear regression is one of the most widely used statistical techniques; it is a way to model a relationship between two sets of variables. The result is a linear regression equation that can be used to make predictions about data.
Multiple regression analysis is used to see if there is a statistically significant relationship between sets of variables. It's used to find trends in those sets of data. Multiple regression analysis is almost the same as simple linear regression; the only difference is in the number of predictors (“x” variables) used in the regression.
- Simple regression analysis uses a single x variable for each dependent “y” variable. For example: (x1, Y1).
- Multiple regression uses multiple “x” variables for each dependent variable. For example: ((x1)1, (x2)1, (x3)1, Y1).
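A minimal sketch of a simple linear regression in Python (hypothetical x and y data; SciPy's linregress is assumed), returning the regression equation plus the p-value and correlation coefficient mentioned above:

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours studied (x) and exam score (y).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([52, 55, 61, 60, 68, 70, 74, 78])

result = stats.linregress(x, y)

# Fitted regression equation: y = intercept + slope * x
print(f"y = {result.intercept:.2f} + {result.slope:.2f} * x")
print(f"r = {result.rvalue:.3f}, R^2 = {result.rvalue**2:.3f}, p = {result.pvalue:.4f}")

# Prediction for a new x value (9 hours studied).
print(result.intercept + result.slope * 9)
```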
R squared
Regression gives you an R squared value. This number tells you how good your model is, or how much of the variance in y is explainable by x. The values range from 0 to 1, with 0 being a terrible model and 1 being a perfect model.
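Written as a formula, for an ordinary least-squares fit with an intercept (SST is the total sum of squares, SSR the sum of squares regression, and SSE the sum of squares error, all defined in later cards):

```latex
R^2 = 1 - \frac{\text{SSE}}{\text{SST}} = \frac{\text{SSR}}{\text{SST}}
```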
What a normal distribution is and why we use it
A normal distribution, sometimes called the bell curve, is a distribution that occurs naturally in many situations. The empirical rule tells you what percentage of your data falls within a certain number of standard deviations from the mean:
• 68% of the data falls within one standard deviation of the mean.
• 95% of the data falls within two standard deviations of the mean.
• 99.7% of the data falls within three standard deviations of the mean.
Properties:
- The mean, mode, and median are all equal.
- The curve is symmetric about the center (i.e. around the mean, μ).
- Exactly half of the values are to the left of center and exactly half are to the right.
- The total area under the curve is 1.
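A small sketch verifying the 68/95/99.7 rule numerically (SciPy assumed):

```python
from scipy import stats

# Probability mass of a standard normal within k standard deviations of the mean.
for k in (1, 2, 3):
    prob = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within {k} SD: {prob:.4f}")
# Prints approximately 0.6827, 0.9545, 0.9973 -- the 68/95/99.7 rule.
```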
What a p value is and how it is used
The p-value, or probability value, is, for a given statistical model, the probability that, when the null hypothesis is true, a statistical summary (such as the difference between two sample means) would be equal to or more extreme than the actual observed result.
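As a minimal illustration (with a made-up test statistic), the two-sided p-value for an observed z statistic can be read off the null distribution with SciPy:

```python
from scipy import stats

# Hypothetical observed z statistic from some test.
z_observed = 2.1

# Two-sided p-value: probability of a result at least this extreme under the null hypothesis.
p_value = 2 * stats.norm.sf(abs(z_observed))
print(p_value)  # about 0.036
```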
Null hypothesis how is it used
The null hypothesis is the hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error. It is used as the default assumption in a statistical test: we ask how surprising the observed data would be if the null hypothesis were true, and reject it only when that evidence is strong (for example, when the p-value is small).
Confidence interval
A confidence interval is a range of values, computed from sample data, that we are fairly sure contains the true population value; the confidence level (e.g., 95%) describes how often intervals constructed this way would capture that true value.
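A minimal sketch computing a 95% confidence interval for a mean (hypothetical measurements; uses the t distribution from SciPy):

```python
import numpy as np
from scipy import stats

# Hypothetical sample measurements.
sample = np.array([4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0])

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% CI: mean +/- t_critical * standard error, with n - 1 degrees of freedom.
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```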
Total sum of squares
In statistical data analysis, the total sum of squares (TSS or SST) is a quantity that appears as part of a standard way of presenting results of such analyses. It is defined as the sum, over all observations, of the squared differences of each observation from the overall mean.
Sum of squares regression
It is the sum of the squared differences between the predicted values and the mean of the dependent variable. Another common notation is ESS, or explained sum of squares.
Sum of squares error
The error is the difference between the observed value and the predicted value; the sum of squares error adds up the squares of these differences across all observations. It is also known as RSS, or residual sum of squares. Residual as in: remaining or unexplained.
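A small sketch (hypothetical data, NumPy assumed) showing how the three sums of squares relate to one another and to R squared for a least-squares fit:

```python
import numpy as np

# Hypothetical data and a least-squares straight-line fit.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([3.0, 4.5, 5.1, 6.8, 8.2, 9.1])

slope, intercept = np.polyfit(x, y, 1)   # ordinary least squares, degree-1 polynomial
y_pred = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)        # total sum of squares (TSS / SST)
ssr = np.sum((y_pred - y.mean()) ** 2)   # sum of squares regression (ESS)
sse = np.sum((y - y_pred) ** 2)          # sum of squares error (RSS)

# For an OLS fit with an intercept, SST = SSR + SSE and R^2 = SSR/SST = 1 - SSE/SST.
r_squared = ssr / sst
print(f"SST={sst:.3f}  SSR={ssr:.3f}  SSE={sse:.3f}  R^2={r_squared:.3f}")
```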
Chi square
The Chi Square statistic is commonly used for testing relationships between categorical variables. The null hypothesis of the Chi-Square test is that no relationship exists between the categorical variables in the population; they are independent. An example research question that could be answered using a Chi-Square analysis would be: Is there a significant relationship between voter intent and political party membership?
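A minimal sketch of a chi-square test of independence for the voter-intent example (the counts are made up; SciPy's chi2_contingency is assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = party membership, columns = voter intent.
observed = np.array([
    [45, 30, 25],   # Party A: will vote / undecided / won't vote
    [26, 39, 35],   # Party B
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
# A small p-value suggests voter intent and party membership are not independent.
```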