Regression + Correlation Flashcards
Regression
Describes a non-deterministic relationship between two variables, at least one of which is continuous. Allows easy visual analysis of linear/non-linear relationships between the dependent and independent variables
Regression line
y = a + bx (y changes with x)
Straight line of best fit, where a = y-intercept and b = slope of the line. This dependence of the mean of the y variable on the x variable is known as the regression of y on x.
Easiest way to assess trend between variables
Scatter plot
Sum of squares (least squares)
Estimate a line; a vertical line is then drawn up/down from the fitted line to each individual point. Each difference is squared (to remove the negative signs) and the squares are added. The line giving the smallest sum is the line of best fit
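The least-squares calculation above can be sketched in Python; the data here are invented purely for illustration:

```python
# Sketch of a least-squares fit by hand (illustrative data, not from the notes).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# The slope b that minimises the sum of squared vertical distances
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar  # intercept: the fitted line passes through (xbar, ybar)

# Sum of squared residuals for the fitted line -- the smallest achievable value
ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
print(a, b, ss)
```

Any other line through the same points would give a larger sum of squared residuals than `ss`.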
Correlation
Test of the relationship between two variables
r = 0 means no linear relationship between the variables
Correlation coefficient
r. Varies from +1 to -1; these are the two extremes of correlation
+1 = an increase in one variable leads to a linear increase in the other variable
-1 = an increase in one variable leads to a linear decrease in the other variable
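A minimal sketch of computing r (the Pearson correlation coefficient) from scratch, on made-up data showing the two extremes:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfectly linear increase -> 1
print(pearson_r([1, 2, 3], [6, 4, 2]))  # perfectly linear decrease -> -1
```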
Assumptions for regression analysis
The sample is representative of the population for the inference prediction.
The error is a random variable with a mean of zero conditional on the explanatory variables.
The independent variables are measured with no error. (Note: If this is not so, modeling may be done instead using errors-in-variables model techniques).
The independent variables (predictors) are linearly independent, i.e. it is not possible to express any predictor as a linear combination of the others.
The errors are uncorrelated
We also assume that the spread of FEV1 about this mean is measured by a standard deviation, σ, about the line and that this does not change with height.
The variance of the error is constant across observations
Regression of y on x
This dependence of the mean of the y variable on the x variable: μ(x) = α + βx
What does β measure
Measures the rate at which the mean of the y variable changes as the x variable changes.
β = 0
Mean of the y variable does not change with the x variable; hence there is no association between the y and x variables.
Outcomes to extract from regression analysis
- the estimated slope and intercept, given under Coef;
- the standard error of the slope, given under SE Coef;
- the P-value for the test of the hypothesis β=0;
- the standard deviation about the line, given as S.
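These four quantities could be computed by hand as below. The heights and FEV1 values are invented for illustration (FEV1 against height is the example the notes use); the names `Coef`, `SE Coef` and `S` in the comments mirror the output labels listed above:

```python
import math

# Invented FEV1 (litres) vs height (cm) data, for illustration only.
height = [160, 165, 170, 175, 180, 185]
fev1 = [3.0, 3.2, 3.6, 3.7, 4.1, 4.2]
n = len(height)
hbar = sum(height) / n
fbar = sum(fev1) / n
sxx = sum((h - hbar) ** 2 for h in height)

b = sum((h - hbar) * (f - fbar) for h, f in zip(height, fev1)) / sxx  # slope (Coef)
a = fbar - b * hbar                                                   # intercept (Coef)

resid_ss = sum((f - (a + b * h)) ** 2 for h, f in zip(height, fev1))
s = math.sqrt(resid_ss / (n - 2))   # S: standard deviation about the line
se_b = s / math.sqrt(sxx)           # SE Coef: standard error of the slope
t = b / se_b                        # compare to a t distribution with n-2 df
                                    # to get the P-value for testing beta = 0
print(b, a, s, se_b, t)
```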
a = y intercept
Mean value of y when x = 0, hence it may be negative; needed to position the line correctly for a given slope
Making predictions from regression
Natural variability needs to be taken into account, hence wide limits are often placed on the prediction made for an individual
The estimate for height h is α + βh
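The point prediction α + βh is a one-liner; the coefficient values below are illustrative placeholders, not figures from the notes:

```python
# Point prediction from a fitted line (alpha and beta assumed already fitted;
# these particular values are invented for illustration).
alpha, beta = -5.04, 0.0503

def predict_fev1(h):
    """Estimated mean FEV1 for a person of height h: alpha + beta * h."""
    return alpha + beta * h

print(predict_fev1(175))
```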
Intervals for prediction
Confidence interval - calculate the uncertainty in the sample mean to construct an interval within which we are 95% sure μ (the population mean) will lie
NB as the sample size increases, the interval will narrow
Prediction interval
Gives the range in which we can expect the next data point sampled to lie. Collect a sample of data and calculate a prediction interval, then sample one more value from the population. If you do this many times, you would expect that next value to lie within the prediction interval in 95% of the samples.
Crucially, the prediction interval tells you about the distribution of values, not the uncertainty in determining the population mean. Prediction intervals account for both the uncertainty in the mean and the scatter of the data, hence they are wider than confidence intervals
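The two intervals can be compared side by side at a single x value. Using the standard simple-regression formulae, the prediction interval's standard error has an extra "+1" term for the scatter of individual values, which is why it is always wider. The data and t multiplier below are illustrative (2.776 is the 95% two-sided t value for n - 2 = 4 degrees of freedom):

```python
import math

# Invented data for illustration; same structure as a FEV1-vs-height example.
x = [160, 165, 170, 175, 180, 185]
y = [3.0, 3.2, 3.6, 3.7, 4.1, 4.2]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 175
t = 2.776                 # 95% t multiplier for n - 2 = 4 degrees of freedom
fit = a + b * x0          # point estimate of the mean at x0

# CI standard error: uncertainty in the mean only.
se_mean = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
# PI standard error: the extra "1" adds the scatter of individual values.
se_pred = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)

ci = (fit - t * se_mean, fit + t * se_mean)
pi = (fit - t * se_pred, fit + t * se_pred)
print(ci, pi)  # the prediction interval is always the wider of the two
```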