Lecture 9 - Correlation & Regression Flashcards
In a bivariate relationship, what are x and y?
x = explanatory variable; y = response variable
Give an example of a response variable that cannot be put on the y-axis because it is dichotomous (yes/no).
x = smoking; y = lung cancer
*Lung cancer can't go on the y-axis because it is either yes or no.
A scatter plot can only be used when both variables are _______
numerical
How can you describe the overall pattern of a scatterplot ?
by the form, direction, and strength of the relationship between the data
Describe the form and direction of the relationship
- Linear relationships, where the points roughly follow a straight line, are especially important.
- Relationships can be negative (A) or positive (B) in direction.
- Curvilinear relationships (C) and clusters (D) are other things to watch for.
- An important kind of deviation is an OUTLIER (E): an individual value that falls outside the overall pattern.
The strength of a relationship is determined by ??
how close the points in the scatter plot lie to a simple form such as a line
We use the _____ to evaluate the variability for univariate data; where n is the sample size
variance
For bivariate data, we use the _______ (the joint variability of x and y), where n is the number of x,y pairs.
covariance
The covariance is limited as ?
a tool for measuring and describing relationships because its composite units (units of x times units of y) are difficult to interpret
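A minimal sketch of why covariance units are awkward, assuming the usual sample formula with n - 1 in the denominator (the cards do not state the formula). The data and variable names are made up for illustration.

```python
# Sample covariance: sum((x - mean_x) * (y - mean_y)) / (n - 1).
# Assumption: n - 1 denominator; the data below are hypothetical.

def sample_covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

length_cm = [1, 2, 3, 4, 5]   # hypothetical x values, in cm
score = [2, 4, 5, 4, 5]       # hypothetical y values

cov_cm = sample_covariance(length_cm, score)

# The same lengths expressed in mm: nothing about the relationship changed,
# but the covariance is ten times larger.
length_mm = [10 * v for v in length_cm]
cov_mm = sample_covariance(length_mm, score)

print(cov_cm, cov_mm)
```

The value changes with the unit of measurement, which is why it is hard to judge strength from covariance alone.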
Standardized values of variance have ____ units.
no (they are the same whether x is measured in cm or mg)
What do standardized values express?
Deviations from the mean in units of the standard deviation (s), which avoids the problem of interpreting unit-dependent covariance values.
What is the correlation coefficient ?
The strength of a relationship is quantified by the standardized covariance of the two continuous measures, which is termed the correlation coefficient ρ (rho), or more formally the Pearson correlation coefficient.
The sample correlation coefficient is r
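A small sketch of r as a standardized covariance: each value is converted to a z-score, and the products of the paired z-scores are averaged over n - 1. The data are hypothetical, and the n - 1 denominator is an assumption, since the cards do not give the formula.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (n - 1 denominator).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Standardized values (z-scores) have no units.
    z_x = [(x - mean_x) / s_x for x in xs]
    z_y = [(y - mean_y) / s_y for y in ys]
    # Covariance of the standardized values.
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_rescaled = pearson_r([10 * x for x in xs], ys)  # x in different units

print(round(r, 4), round(r_rescaled, 4))
```

Rescaling x leaves r unchanged, unlike the covariance.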
What does the correlation coefficient measure?
It measures both the strength and direction of a linear relationship between 2 continuous variables.
What is the square of the correlation coefficient?
r^2 is referred to as the coefficient of determination
When are two variables x and y positively associated ?
when above-average values of one variable tend to go with above-average values of the other; in this case r will be positive
When are two variables x and y negatively associated ?
when above-average values of one variable tend to go with below-average values of the other; in this case r will be negative
For the correlation coefficient, what is the null and alternative hypothesis ?
Null hypothesis: ρ = 0 (there is no linear relationship)
Alternative hypothesis: ρ ≠ 0 (e.g., height and weight are linearly related)
**We can evaluate the correlation, using alpha = 0.05 with degrees of freedom n-2
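A hedged sketch of how this test is often carried out. The cards do not give the test statistic; the usual choice is t = r * sqrt(n - 2) / sqrt(1 - r^2), compared with a critical t value for n - 2 degrees of freedom. The data are hypothetical, and the critical value 3.182 (two-tailed, alpha = 0.05, df = 3) is read from a standard t table.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]   # hypothetical data, n = 5 pairs
ys = [2, 4, 5, 4, 5]

n = len(xs)
df = n - 2                       # degrees of freedom for the test
r = pearson_r(xs, ys)
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

T_CRIT = 3.182                   # table value: two-tailed, alpha = 0.05, df = 3
reject_null = abs(t_stat) > T_CRIT

print(round(r, 4), round(t_stat, 4), reject_null)
```

With only five pairs, even r of about 0.77 does not reach significance here, so the null hypothesis ρ = 0 is not rejected.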
Correlation makes no distinction between ______ and ______ variables.
explanatory and response
Correlation requires both variables to be _____
numerical
Correlation does not depend on the scale of ______ used
measurement
r > 0 indicates ?
a positive relationship between the variables
r < 0 indicates ?
a negative relationship between the variables
Correlation is always between ?
-1 and 1
Correlation measures the strength of linear relations and doesn’t apply to ____ relations
curved
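A quick illustration with made-up data: y is completely determined by x (y = x squared), yet r comes out as 0 because the relationship is curved rather than linear.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [-2, -1, 0, 1, 2]          # hypothetical x values, symmetric around 0
ys = [x ** 2 for x in xs]       # perfect curved (quadratic) relationship

r_curved = pearson_r(xs, ys)
print(r_curved)
```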
Like the mean and SD, the correlation is strongly affected by ______
outliers
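A sketch of how much one outlier can move r, using made-up numbers: five points on a perfect line, then the same points plus a single value far off the pattern.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data lying exactly on the line y = x.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
r_clean = pearson_r(xs, ys)

# Add one outlier at (6, 0), far below the line.
r_outlier = pearson_r(xs + [6], ys + [0])

print(round(r_clean, 3), round(r_outlier, 3))
```

In this toy case a single point drops r from 1 to roughly 0.14.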
The correlation coefficient test can also be run as a ____-tailed test
one
Correlation coefficient test: degrees of freedom?
n-2
Compare regression to correlation
With correlation, we capture the degree to which two variables co-vary; there is no question about causal relationships
With regression, we introduce the concept of dependent and independent variables
What is the task in regression?
to find some meaningful way of describing the essence of a dependent relationship
Regression extends the concept of ______ by summarizing the relationship between 2 variables with a straight line that best describes their relationship. This line is termed the regression line.
correlation
The dependent variable is always presented as the __-variable and is always plotted on the _____ axis of the regression plot
y, vertical
The independent variable __ is plotted on the _______ axis
x, horizontal
To describe the regression line we need to know two values, what are they?
1) The slope “b”.
- This value has clear practical interpretation, such as the average increased response per increase in dose.
2) The y-intercept “a”.
- This may have a practical interpretation (such as HR at rest); in some cases it may not, e.g. the weight of an individual with zero height.
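A minimal sketch of how the two values can be obtained, assuming the ordinary least-squares formulas b = Sxy / Sxx and a = mean(y) - b * mean(x). The cards do not state which formulas the lecture uses, and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx            # slope: average change in y per unit of x
    a = mean_y - b * mean_x    # intercept: predicted y when x = 0
    return a, b

xs = [1, 2, 3, 4, 5]   # hypothetical x values (e.g., dose)
ys = [2, 4, 5, 4, 5]   # hypothetical y values (e.g., response)

a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))
```

Here the slope of about 0.6 reads as an average increase of 0.6 in the response per unit increase in x.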
What is y-hat (ŷ)?
The predicted value.
The fit of the predicted value can be summarized by ?
the difference between the observed value y and the value ŷ (y-hat) predicted by the regression line.
The difference is called the residual value!!
y - ŷ = observed - predicted
What is the formula for SSE?
Sum of squares of the error
SSE = Σ (y - ŷ)²
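A short sketch of residuals and SSE on made-up data. The line y-hat = 2.2 + 0.6x is assumed here; it is the least-squares line for this particular toy data set, not a value from the lecture.

```python
xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]

a, b = 2.2, 0.6        # assumed intercept and slope for this toy data

predicted = [a + b * x for x in xs]                    # y-hat for each x
residuals = [y - p for y, p in zip(ys, predicted)]     # observed - predicted
sse = sum(res ** 2 for res in residuals)               # sum of squared errors

print([round(res, 2) for res in residuals], round(sse, 2))
```

Positive residuals are points above the line and negative residuals are points below it; squaring keeps them from cancelling out.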
When you find the y intercept, what are you testing for? (slide 50)
solving for b
testing to see if it’s a negative or positive relationship
Having obtained the regression line, we need to establish that its predictive value has not arisen by _____ alone.
chance
If b = 0, what does that mean?
there is no linear relationship between x and y at all (the regression line is flat, so x does not help predict y)
For regression, what is the null and alternative hypothesis ?
*this test is similar to ANOVA
Null: β = 0
Alternative: β ≠ 0
*We still use F test, just like ANOVA
We find SST which is ?
SSR + SSE
df = ?
(SSR) Explained: df = 1
(SSE) Error: df = n-2
(SST) Total: df = n-1
How do we calculate r^2?
r^2 = SSR (explained) / SST (total)
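The pieces above can be put together in a rough sketch: fit the line, split SST into SSR and SSE, form the F statistic with df = 1 and n - 2, and compute r^2. The data are hypothetical, the least-squares formulas are assumed, and the critical value 10.13 (alpha = 0.05, df = 1 and 3) is read from a standard F table.

```python
xs = [1, 2, 3, 4, 5]   # hypothetical data, n = 5 pairs
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept (assumed formulas).
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x
predicted = [a + b * x for x in xs]

ssr = sum((p - mean_y) ** 2 for p in predicted)          # explained
sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))   # error
sst = sum((y - mean_y) ** 2 for y in ys)                 # total = SSR + SSE

df_explained, df_error, df_total = 1, n - 2, n - 1

# F = mean square explained / mean square error, as in ANOVA.
f_stat = (ssr / df_explained) / (sse / df_error)
r_squared = ssr / sst

F_CRIT = 10.13   # table value: alpha = 0.05, df = (1, 3)
reject_null = f_stat > F_CRIT

print(round(ssr, 2), round(sse, 2), round(sst, 2))
print(round(f_stat, 2), round(r_squared, 2), reject_null)
```

For this toy data the line explains about 60% of the variation in y, but with so few points the F test does not reject β = 0.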