Final Exam Flashcards
ANOVA (what does it do?)
Analysis of variance: tests differences among the means of multiple groups
Compares the variation among subjects within groups (the error mean square, MSerror) to the variation among the group means (the group mean square, MSgroups)
Test Stat for ANOVA
What is it under Ho? Ha?
Test Stat : the F-ratio. (F=MSgroup/MSerror)
Under Ho: F-ratio should be about 1, except by chance
- MSgroups = MSerror
Under Ha: F-ratio will exceed 1.
- MSgroups > MSerror
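As a quick sketch, SciPy's `f_oneway` computes the F-ratio and its p-value directly; the three groups below are invented for illustration:

```python
# Invented data; scipy.stats.f_oneway returns the ANOVA F-ratio and p-value.
from scipy.stats import f_oneway

group1 = [4.2, 5.1, 4.8, 5.5]
group2 = [6.0, 6.3, 5.9, 6.8]
group3 = [5.0, 4.7, 5.2, 4.9]

result = f_oneway(group1, group2, group3)
print(result.statistic)  # F-ratio = MSgroups / MSerror; well above 1 here
print(result.pvalue)
```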
Assumptions of ANOVA
- Normal Distribution in each of the k populations
- Random Sampling
- Variance is the same in all k populations
SStotal equation
Total sum of squares
SStotal = SSerror + SSgroups
SSerror equation (check notes sheet)
Error sum of squares
the sum over all groups of (the standard deviation of group i, squared) × (the number of observations in group i, minus 1)
SSerror = sum of (si^squared)(ni - 1)
Grand mean
(Y-bar) The mean of all the data from all groups combined
Y-bar = (sum of all data points from all groups) / (total number of data points, N)
Finding F-Ratio
- Find the grand mean
- Calculate SSgroups and SSerror
- Calculate SStotal
- Calculate MSgroups and MSerror
- Calculate F-ratio
- Use F-distribution table to find our critical value with our numerator df, denominator df, and alpha level.
- Compare the F-ratio to the critical value and determine the p-value
- Reject / Fail to Reject Ho
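The steps above can be sketched end to end in Python (data invented; `scipy.stats.f` supplies the F-distribution for the critical value and p-value):

```python
# By-hand one-way ANOVA on made-up data, following the steps on this card.
import numpy as np
from scipy.stats import f as f_dist

groups = [np.array([4.2, 5.1, 4.8, 5.5]),
          np.array([6.0, 6.3, 5.9, 6.8]),
          np.array([5.0, 4.7, 5.2, 4.9])]

all_data = np.concatenate(groups)
N, k = len(all_data), len(groups)

grand_mean = all_data.mean()                                            # step 1
ss_groups = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # step 2
ss_error = sum(g.var(ddof=1) * (len(g) - 1) for g in groups)
ss_total = ss_groups + ss_error                                         # step 3

ms_groups = ss_groups / (k - 1)                                         # step 4
ms_error = ss_error / (N - k)
F = ms_groups / ms_error                                                # step 5

alpha = 0.05
f_crit = f_dist.ppf(1 - alpha, k - 1, N - k)   # critical value, df = (k-1, N-k)
p_value = f_dist.sf(F, k - 1, N - k)           # P(F >= observed) under Ho
print(F, f_crit, p_value)
```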
MSgroups
Group mean square: the observed amount of variation among the group sample means (among-group variation)
MSgroups = SSgroups / dfgroups
dfgroups = k - 1
k=number of groups
MSerror
error mean square: variance among subjects that belong to the same group (within)
MSerror = SSerror / dferror
dferror = N - k
N = total number of data points in all groups k = number of groups
Robustness of ANOVA
Robust to deviations from the normality assumption, especially when sample sizes are large.
Robust to deviations from the equal-variance assumption when the sample sizes of the groups are similar.
*Kruskal-Wallis test
nonparametric method based on ranks, or analysis of variance based on ranks.
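SciPy provides the Kruskal-Wallis test as `scipy.stats.kruskal`; a minimal sketch on invented groups:

```python
# Invented data; scipy.stats.kruskal runs the rank-based Kruskal-Wallis test.
from scipy.stats import kruskal

group1 = [4.2, 5.1, 4.8, 5.5]
group2 = [6.0, 6.3, 5.9, 6.8]
group3 = [5.0, 4.7, 5.2, 4.9]

stat, p = kruskal(group1, group2, group3)
print(stat, p)
```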
Planned Comparison vs. Unplanned Comparison
Planned: a comparison between means planned during the design of the study, identified before the data are examined
Unplanned: one of multiple comparisons, such as between all pairs of means, carried out to help determine where differences among the means lie. (Data dredging)
Tukey-Kramer Test
- Used to test all pairs of means to find out which groups stand apart from the others
- Type of unplanned comparison
Assumptions of Tukey-Kramer Test
- Normal Distribution
- Random Sampling
- Equal variance
*Not as robust as ANOVA
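A hedged sketch: SciPy ships a Tukey-Kramer-style all-pairs procedure as `scipy.stats.tukey_hsd` (available in SciPy 1.8+); the three groups are invented:

```python
# Invented data; scipy.stats.tukey_hsd compares all pairs of group means.
from scipy.stats import tukey_hsd

a = [4.2, 5.1, 4.8, 5.5]
b = [6.0, 6.3, 5.9, 6.8]
c = [5.0, 4.7, 5.2, 4.9]

res = tukey_hsd(a, b, c)
print(res.pvalue)  # 3x3 matrix of pairwise p-values
```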
fixed effects vs. random effects
Fixed: Testing an explanatory variable using ANOVA on fixed groups - studies on predetermined groups and of direct interest.
Random: Testing an explanatory variable using ANOVA applied to random groups. groups are randomly sampled from a population of possible groups.
ANOVA on Random Groups
- Planned and Unplanned comparisons are not used
- Instead we use variance components: the amount of the variance in the data that is among random groups (sigma-A^squared) and the amount that is within groups (sigma^squared)
Repeatability
The fraction of the summed variance that is present among groups
Repeatability = s-A^squared / (s-A^squared + MSerror)
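A small arithmetic sketch with invented mean squares, assuming a balanced design with n observations per group, where s-A^squared is estimated as (MSgroups - MSerror) / n:

```python
# Invented MS values; balanced design with n observations per group.
ms_groups, ms_error, n = 2.34, 0.17, 4

s2_A = (ms_groups - ms_error) / n          # among-group variance component
repeatability = s2_A / (s2_A + ms_error)   # fraction of variance among groups
print(repeatability)
```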
k (ANOVA)
number of groups
In ANOVA, if Ho is false
We expect to see MSgroups be greater than MSerror, so the F-ratio is greater than 1.
In ANOVA if Ho is true
Then the F-ratio will be about 1, except by chance.
MSgroups and MSerror should be about even
What does the ANOVA table include?
- Source of variation (groups, error, total)
- Sum of squares (g, e, tot)
- df (g, e, tot)
- mean squares (g, e)
- F ratio
- p value
Yij
The jth individual in the ith group
Group mean
(Y-bar sub i) The mean of the observations in group i
SSgroups equation (check notes sheet)
sum of squares groups:
The sum over all groups of (the number of observations in group i) × (group i mean minus grand mean)^squared
SSgroups = sum of ni(Y-bar-i - Y-bar)^squared
N (ANOVA)
total number of data points in all groups
ni (ANOVA)
number of observations in group i
df (groups)
k-1
k=number of groups
df (error)
N-k
N= total number of data points
k=number of groups
F crit (ANOVA)
F(alpha)(1 tail), numerator df = k - 1, denominator df = N - k
R^squared (ANOVA)
the group portion of variation expressed as a fraction of the total
SSgroups/SStotal = R^squared
When R^squared is close to 0, group means are all very similar, most of the variation is within groups.
When R^squared is close to 1, most of the variation is explained by the explanatory variable.
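A tiny arithmetic sketch (sums of squares invented):

```python
# Invented sums of squares; R^2 is the among-group fraction of total variation.
ss_groups, ss_error = 4.69, 1.52
ss_total = ss_groups + ss_error
r_squared = ss_groups / ss_total
print(r_squared)
```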
p-value
the probability of obtaining a test statistic as large as or larger than (as extreme as or more extreme than) the observed value, assuming Ho is true
What quantity would you use to describe the fraction of the variation in expression levels explained by group differences?
R squared
Regression
the method used to predict values of one numerical variable from values of another.
Linear Regression
Draws a straight line through the data to predict the response variable from the explanatory variable (One type of study design)
Slope of regression line
indicates the rate of change
What does linear regression do?
Measures aspects of the linear relationship between two numerical variables
Difference between regression and correlation
Regression - fits a line through the data to predict one variable from another and to measure how steeply one variable changes with changes in the other.
correlation - measures strength of association between two variables, reflecting the amount of scatter in the data.
Assumptions of Linear Regression
- The relationship between the two variables is linear
- At each value of X, the Y-measurements are a random sample from a population of possible Y-values
- The distribution of Y-values at each value of X is normal
- The variance of Y-values is the same at all values of X
“Best Line”
Has the smallest deviations in Y (vertical axis, response var) between the data points and the regression line
Least squares regression line
the line for which the sum of all the squared deviations in Y is the smallest.
Regression Line Equation
Y = a + bX
Y - response variable
a - the Y-intercept
b - slope of the regression line
if b is (+), then larger values of X predict larger values of Y
if b is (-), then larger values of X predict smaller values of Y.
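The least-squares estimates can be sketched by hand (invented data): b = sum((X - X-bar)(Y - Y-bar)) / sum((X - X-bar)^2), then a = Y-bar - b × X-bar:

```python
# Invented data; least-squares slope and intercept computed by hand.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * 3.5   # predicted mean Y at X = 3.5
print(a, b, y_hat)
```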
Slope of a linear regression
rate of change in Y per unit of X
a and b
alpha and beta
(Linear regression)
a and b are sample estimates
alpha and beta are population parameters
Predictions
points on the line that correspond to specific values of X.
the predicted value of Y from a regression line estimates the mean value of Y for all individuals having a given value of X.
~ Y-hat
How to find predictions
Plug the X value into the equation to find the Y-hat
Residuals
Measure the scatter of points above and below the least-squares regression line. Crucial for evaluating the fit of the line to the data.
MSresiduals
Quantifies the spread of the scatter of points above and below the line.
Confidence Bands
measure the precision of the predicted mean Y for each value of X
Prediction intervals
measure the precision of the predicted single Y-value for each X.
Extrapolation
Attempting to predict the Y value for X values beyond the range of the data
Normal Quantile Plot
Compares each observation in the sample with its quantile expected from the standard normal distribution. Points should fall roughly along a straight line if the data come from a normal distribution.
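The quantile pairs behind such a plot can be computed without drawing anything; `scipy.stats.probplot` returns them along with a straight-line fit (sample data are generated from a seeded normal distribution):

```python
# probplot returns the (theoretical quantile, ordered sample) pairs a normal
# quantile plot would draw, plus a least-squares line through them.
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=1.0, size=50)

(theoretical_q, ordered_sample), (slope, intercept, r) = probplot(sample, dist="norm")
print(r)  # correlation of the points with the line; near 1 for normal data
```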
Alternatives when Assumptions are Violated
- Ignore the violation of assumptions: works well when comparing means with the normality assumption violated, especially when sample sizes are large and the violations are not too drastic
- Transform the data: often effective
- Use a non-parametric method
- Use a permutation test: Uses a computer to generate a null distribution for a test stat.
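The permutation-test idea can be sketched for a difference in two means (data invented, seed arbitrary): shuffle the pooled values many times to build the null distribution of the test statistic, then count how often a shuffled statistic is as extreme as the observed one.

```python
# One-sided permutation test for a difference in means on invented data.
import numpy as np

rng = np.random.default_rng(1)
x = np.array([4.2, 5.1, 4.8, 5.5])
y = np.array([6.0, 6.3, 5.9, 6.8])

observed = y.mean() - x.mean()
pooled = np.concatenate([x, y])

n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)                       # shuffle group labels
    diff = perm[len(x):].mean() - perm[:len(x)].mean()   # stat under Ho
    if diff >= observed:                                 # as or more extreme
        count += 1

p_value = count / n_perm
print(p_value)
```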
Shapiro-Wilk Test
evaluates the goodness of fit of a normal distribution to a set of data randomly sampled from a population
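SciPy implements this as `scipy.stats.shapiro`; a minimal sketch on a seeded normal sample:

```python
# scipy.stats.shapiro on a sample drawn from a normal distribution.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.0, scale=1.0, size=50)

stat, p = shapiro(sample)
print(stat, p)  # a small p-value would be evidence against normality
```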
Robust
A statistical procedure is robust if the answer it gives is not sensitive to violations of the assumptions of the method
Transformation
changes each measurement by the same mathematical formula.
Log Transformation
- Used for ratios or products of variables
- used when freq dist is skewed to the right
- used when the group with the larger mean also has the larger standard dev
- used when the data span several orders of mag
Arcsine Transformation
Used almost exclusively on data that are proportions
Square root Transformation
Used on count data.
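The three transformations above, applied to made-up values of the kind each is suited for:

```python
# Invented values illustrating the three transformations.
import numpy as np

skewed = np.array([1.0, 10.0, 100.0, 1000.0])        # spans orders of magnitude
counts = np.array([0, 1, 4, 9, 16])                  # count data
proportions = np.array([0.1, 0.25, 0.5, 0.75, 0.9])  # proportion data

log_t = np.log(skewed)                       # log transformation
sqrt_t = np.sqrt(counts)                     # square-root transformation
arcsine_t = np.arcsin(np.sqrt(proportions))  # arcsine transformation
print(log_t, sqrt_t, arcsine_t)
```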