Linear regression and correlation Flashcards by Stu Gibson

Fitting a line to scatterplot?

Method of least squares. The vertical differences between the points and the line are both positive and negative, so they would cancel if simply added; they are squared first (similar to the standard deviation). Choose the line with the smallest sum of squared differences.
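
A minimal sketch of the least-squares fit in Python, using the standard closed-form solution for one x variable. The function names and the six data points are made up for illustration and are not from the course data.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line y = a + b*x with the smallest sum of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


def sum_of_squares(xs, ys, a, b):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustrative data (height in cm, FEV1 in litres)
heights = [160, 165, 170, 175, 180, 185]
fev1 = [3.0, 3.2, 3.9, 3.8, 4.3, 4.6]

a, b = fit_line(heights, fev1)
print(a, b, sum_of_squares(heights, fev1, a, b))
```

Nudging a or b away from the fitted values and calling sum_of_squares again gives a larger total, which is the sense in which this line is "best".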

In regression, does σ of FEV1 (Y) depend on height (X axis)?

No. The SD (i.e. the spread of FEV1 about the mean FEV1 at that height) is assumed to be constant for all heights.

In regression, is it true that the FEV1 of a given student depends on his height?

No. Only the mean FEV1 of the population depends linearly on the other variable. The y value on the line is a mean; the x value is an actual value. Consider the FEV1s for the population of students with a given height and assume that the mean of these varies linearly with height.

Three parameters to estimate in regression?

σ (SD about the line), α (intercept) and β (slope). For a single variable we estimate the population mean μ and σ; here μ = α + βx, where α and β are population parameters and cannot be known. Instead, use a sample, fit the line y = a + bx and use a and b as sample estimates. σ about the line is also estimated, using s. The intercept a is on the same scale as y, so it has the same units. The slope b is in units of y per unit of x.
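
A minimal sketch of how s is usually computed once a and b are known: the sum of squared deviations from the line is divided by n - 2 (two parameters have already been estimated) and the square root taken. The n - 2 divisor is the standard convention and is assumed here; the function name is hypothetical.

```python
import math


def residual_sd(xs, ys, a, b):
    """Estimate s, the SD of the y values about the line y = a + b*x."""
    n = len(xs)
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(rss / (n - 2))  # n - 2 because a and b were estimated
```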

b and β?

Just as the SE shows how good an estimate m is of μ, an SE shows how good an estimate b is of β. The SE of b is more complicated to calculate.
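
For reference, the usual textbook form of the SE of the slope in simple linear regression, written with s as the SD about the line. The symbols x_i (the individual x values) and x̄ (their mean) are notation assumed here, not taken from the cards.

```latex
\[
  \mathrm{SE}(b) = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
\]
```

The SE gets smaller when s is small, when the sample is large, or when the x values are widely spread.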

Hypothesis testing in regression?

Has a limited role; more about how good estimate of β is. The one hypothesis of interest is whether β = 0. This is because if β=0 then the mean of Y does not change with the x variable (i.e. no association).
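
A sketch of the test statistic for β = 0, assuming the usual approach: t = b / SE(b), referred to a t distribution with n - 2 degrees of freedom. Only the statistic is computed here; turning it into a P value needs t tables or a statistics package. The function name is hypothetical.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (b, se_b, t) for the least-squares regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(rss / (n - 2))   # SD about the line
    se_b = s / math.sqrt(sxx)      # standard error of the slope
    return b, se_b, b / se_b
```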

Regression in Minitab?

Skip everything above “the regression equation is”. The equation is printed with “FEV1” on the left when it should read mean FEV1. Then look at the Predictor rows: Constant (the intercept) and Ht (x). The SE of the Ht slope is listed under SE Coef, and the P value for Ht is given too. The next bit of output is S, the same as our s, i.e. the estimate of the SD (spread of FEV1 about the fitted line).

In summary, get estimated slope and intercept (under Coef), the SE of the slope, under SE Coef, the P value for β=0, and the SD about the line (s).

How can intercept be negative?

Because the intercept is the mean FEV1 when height = 0, which is far outside the range of the data and obviously makes no physical sense. For this reason the intercept matters only for positioning the line.

What does a P value of 0.005 mean for the hypothesis β=0?

Means that there is very strong evidence that FEV1 mean does depend on the height. Does not mean that FEV1 can be determined once height is known, just that mean FEV1 changes with height.

“Naming” regression?

Regression of y on x (seeing how mean y varies with x).

What does s account for in regression?

The variation left in FEV1 values AFTER height has been accounted for

Clinical applications of regression?

Some important variables are very difficult or invasive to measure, so predicting them from other, related variables would be helpful. In practice, the natural variability often gives wide limits.

Single value estimate of FEV1 for height of 180cm?

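The population relation is mean FEV1 = α + βh, but we must use the estimates a and b, so the estimated mean FEV1 for his height is a + 180b = -9.19 + 180 × 0.07, quoted as 4.20 L. Note that 0.07 is a rounded slope: the quoted 4.20 L implies a slope of about 0.074, whereas with the slope rounded to 0.07 the sum comes to 3.41 L.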
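
The arithmetic in this card can be checked with a couple of lines of Python. The intercept and slope are the figures quoted in the card; the function name is hypothetical. With the slope taken as exactly 0.07 the result is about 3.41 L rather than 4.20 L, which suggests the 0.07 shown is a rounded value.

```python
def estimated_mean_fev1(height_cm, intercept=-9.19, slope=0.07):
    """Estimated mean FEV1 (litres) at a given height: a + b*h."""
    return intercept + slope * height_cm


rounded = estimated_mean_fev1(180)           # uses the rounded slope 0.07
implied_slope = (4.20 - (-9.19)) / 180       # slope needed to give 4.20 L
print(round(rounded, 2), round(implied_slope, 4))
```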

Intervals in regression?

Previously, uncertainty in m as an estimate of μ was acknowledged by computing an interval in which μ is expected to lie. These intervals narrow as the sample from which the interval is calculated gets larger (as the estimate of μ becomes more precise). The same holds in regression: the mean at height h is estimated by a + bh, so uncertainty in a and b gives uncertainty in that estimate (4.20 L). The calculation is more complicated, but the intervals still get smaller as sample size increases.

Why are confidence intervals not appropriate for predicting an individual's value in regression?

Because when estimating the FEV1 of one person of a given height, making the sample as large as possible will not make the interval indefinitely smaller: the variation between individuals of that height still remains. Therefore the formula must not allow sample size to reduce the interval to 0.

What interval is used instead of confidence intervals?

Instead, use prediction intervals. Confidence intervals can also be calculated (e.g. the mean FEV1 for students with height 180 cm is 3.8 to 4.6 L with 95% confidence), but the prediction interval indicates that 95% of students with height 180 cm have an FEV1 between 2.9 and 5.5 L. Confidence intervals will be curved because our estimate of mean FEV1 is better near the “centre” of the data, and prediction intervals will be near-straight because they rely mainly on intrinsic variability, which is constant. The small amount of curve in the prediction intervals comes from incorporating the CIs.
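
For reference, the standard textbook forms of the two intervals at height h. The notation is assumed here rather than taken from the cards: n is the sample size, x̄ the mean of the x values, and t the appropriate percentage point of the t distribution with n - 2 degrees of freedom.

```latex
\[
  \text{Confidence interval for the mean:}\quad
  (a + bh) \pm t_{n-2}\, s \sqrt{\frac{1}{n} + \frac{(h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
\]
\[
  \text{Prediction interval for an individual:}\quad
  (a + bh) \pm t_{n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
\]
```

The extra 1 under the square root is the intrinsic variability; the remaining terms are the uncertainty in the estimated mean, which grows as h moves away from x̄ and produces the curvature.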

How do confidence intervals and prediction intervals change with sample size?

Confidence intervals will get much smaller; prediction intervals will change slightly.
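
A small sketch of why the two intervals behave so differently as n grows, using the standard formulas for the standard error of the estimated mean and of a predicted individual value. All numbers (s, the mean and variance of height, the target height) are hypothetical, and the sum of squares of x is assumed to grow in proportion to n.

```python
import math


def interval_standard_errors(s, n, h, mean_x, sxx):
    """Return (SE for the mean at h, SE for an individual at h)."""
    uncertainty_in_mean = 1 / n + (h - mean_x) ** 2 / sxx
    se_mean = s * math.sqrt(uncertainty_in_mean)
    se_pred = s * math.sqrt(1 + uncertainty_in_mean)  # adds intrinsic variability
    return se_mean, se_pred


# Hypothetical values: SD about the line, mean and variance of height, target height
s, mean_x, var_x, h = 0.6, 175.0, 50.0, 180.0

for n in (20, 200, 2000):
    se_mean, se_pred = interval_standard_errors(s, n, h, mean_x, sxx=n * var_x)
    print(n, round(se_mean, 3), round(se_pred, 3))
```

The first column of standard errors heads towards 0, while the second can never fall below s.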

Pitfalls in using regression?

The fitted line often merely describes the relationship between two variables in the sample itself; it is therefore unwise to use it for other kinds of superficially similar data, e.g. children, older males or female students. Also be wary of outliers, as they can alter the estimates.

Why cannot reverse regression?

The equation is MEAN FEV1 = a+bh and so cannot simply swap h and FEV1. The regressions of each on the other are different and it is important to choose which is appropriate.
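
A short sketch showing that the two regressions really are different lines. If the line could simply be reversed, the slope of y on x would equal 1 divided by the slope of x on y; in fact the product of the two slopes equals r squared, which is below 1 unless the points lie exactly on a line. The data are made up for illustration.

```python
def slope(xs, ys):
    """Least-squares slope for the regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical illustrative data (height in cm, FEV1 in litres)
heights = [160, 165, 170, 175, 180, 185]
fev1 = [3.0, 3.2, 3.9, 3.8, 4.3, 4.6]

b_fev1_on_height = slope(heights, fev1)
b_height_on_fev1 = slope(fev1, heights)

# These two numbers would be equal if the regression could simply be reversed
print(b_fev1_on_height, 1 / b_height_on_fev1)
```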

Assumptions in regression?

That the mean of the y variable at a given value of the x variable changes linearly with x, that the spread of the data about this line is constant and thus does not change as x changes, and that the DEVIATIONS from this line follow a normal distribution (important if calculating confidence or prediction intervals or hypothesis test).

Assessing the linearity assumption in regression?

Draw scatterplot; see by eye if is plausible.

Assessing the spread about the line in regression being constant?

Use residuals (the vertical distance of a point from the fitted line): positive if the point is above the line, negative if below. If the fitted line truly reflects the structure of the data, then the residuals are a sample from a distribution with population mean = 0 and they all have the same SD. Check for a common SD by plotting the residuals against the height of the individual; their spread should change little with height.
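
A minimal sketch of computing residuals from a least-squares fit, so they can then be plotted against x. The function name is hypothetical and no plotting library is assumed.

```python
def residuals(xs, ys):
    """Vertical distances of each point from the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Positive when the point is above the line, negative when below
    return [y - (a + b * x) for x, y in zip(xs, ys)]
```

For a least-squares line the residuals always sum to zero, so it is their spread across x, not their average, that carries the information about the constant-SD assumption.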

Assessing the normality of deviations from the line?

As residuals = deviations from the line, this means checking that the residuals are from a common normal distribution. The best way of doing this is using a normal probability plot for the residuals.

Are any assumptions made regarding the x variable in regression?

No! Can even be discrete.

The correlation co-efficient?

We use the product-moment correlation, also called the Pearson correlation, given the symbol r (ρ, rho, for the population).

Properties of r?

Always between -1 and 1. If the points were exactly on a straight line then r would be either -1 or 1. A value of 0 means no LINEAR relation, but there could still be a non-linear one (circular etc.). Can be computed for data comprising pairs of continuous variables.
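
A minimal sketch of the product-moment correlation, with small made-up data sets illustrating the properties listed: exactly 1 or -1 for points on a line, and 0 for a perfectly symmetric curved relation. The function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Product-moment (Pearson) correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


r_up = pearson_r([1, 2, 3], [2, 4, 6])                    # points on a rising line
r_down = pearson_r([1, 2, 3], [6, 4, 2])                  # points on a falling line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # y = x squared
```

The last case has an exact relation between y and x, yet no linear one, which is why r = 0 should not be read as "unrelated".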

What does a negative/positive value of r mean?

Negative means that the y variable tends to decrease as the x variable increases. Positive means that the two tend to increase or decrease together.

Does the sign of r affect strength?

No - just the direction.

Hypothesis testing for r?

Often done, e.g. r = 0.35 (P = 0.07), but if just testing that the population correlation ρ = 0 (i.e. that there is no linear relation between the variables) then this is exactly the same test as β = 0 in a regression of y on x (the P value will be the same). One of the variables needs to be continuous for this; if both are continuous then confidence intervals for ρ can be calculated, but they are not very useful.
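
A sketch checking the equivalence numerically. It assumes the usual forms of the two test statistics, t = b / SE(b) for the slope and t = r√(n - 2) / √(1 - r²) for the correlation, both on n - 2 degrees of freedom. The function names and the four data points are made up.

```python
import math


def _sums(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return n, sxx, syy, sxy


def t_from_slope(xs, ys):
    """t statistic for beta = 0 in the regression of y on x."""
    n, sxx, syy, sxy = _sums(xs, ys)
    b = sxy / sxx
    rss = syy - b * sxy                 # residual sum of squares about the line
    s = math.sqrt(rss / (n - 2))
    return b / (s / math.sqrt(sxx))


def t_from_r(xs, ys):
    """t statistic for rho = 0 computed from the correlation coefficient."""
    n, sxx, syy, sxy = _sums(xs, ys)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


xs, ys = [1, 2, 3, 4], [1, 3, 2, 4]    # hypothetical data
print(t_from_slope(xs, ys), t_from_r(xs, ys))
```

The two statistics agree, so the P values agree as well.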

Real problem with hypothesis testing in correlation?

Even with a very weak relationship (r = 0.3, for example), a large enough sample may give a significant P value when testing ρ = 0. This is evidence against the two variables being unrelated, but it does not tell you that they are closely related. In general, values of r between 0 and 0.6 or so are pretty hard to interpret. Regression is usually more comprehensive.

Linear regression and correlation Flashcards

(30 cards)