Khan Academy: Exploring bivariate numerical data Flashcards
Are correlation coefficient and slope the same?
No, The correlation coefficient only tells you how closely your data fit on a line, so two datasets with the same correlation coefficient can have very different slopes. To find the slope of the line, you’ll need to perform a regression analysis.
By convention, a good scatter plot uses a reasonable ____ on both axes and puts the explanatory variable on the ____.
scale, x-axis
How can we calculate the equation of a regression line using x and y ‘s std and mean?
Which one is right?
1- residual= actual-expected
2-residual=expected-actual
1
The graph displays a residual plot that was constructed after running a least-squares regression on a set of bivariate numerical data (x,y)(x,y)left parenthesis, x, comma, y, right parenthesis.
Image
Is the slope of the least-squares regression line 0?
No, If a point in a residual plot lies on the horizontal axis, then we know that the point will be on the least squares regression line (LSRL). In this respect, the line y=0 does represent the LSRL in a residual plot. However, it is not the actual LSRL, and therefore we cannot conclude that the slope of the LSRL is zero.
What does r^2 tell us?
R-squared tells us what percentage of the prediction error in the y variable is eliminated when we use least-squares regression on the x variable instead of no regression method where we use the average of ys ( no use of x) to create a line that estimates a constant value (avg(y)) for all the values of y
Ref
In Other Words: In statistics, the coefficient of determination, denoted R² or r² and pronounced “R squared”, is the proportion of the variation in the dependent variable that is predictable from the independent variable.
r^2
is also called the____
coefficient of determination.
Many formal definitions say that r^2 tells us what percent of the variability in the y variable is accounted for by the regression on the x variable. what does it mean?
Variance is the average dispersity of data, so if we want to find sum of the variation of variable y, we would write: sum((y-avg(y)^2)= total variability of y
now when we use regression on x to estimate y, the squared error is: sum((y- pred)^2)= error= variation in y NOT described by the regression
SE= squared error
so, SE regression/SE w/o regression describes the percentage of variation not described by the regression.
Therefore, 1- SE regression/SE w/o regression describes what percentage of the variation in y, IS described by regressing on x. this value equals r^2 (correlation coefficient ^2) but the proof is out of the scope of the lesson.
coefficient of determination=correlation coefficient ^2
True/False?
True
standard deviation of the residuals is defined as root mean squared error
True/False?
True
The SD of error (residuals) shows the average amount of error. on average, how much the model disagrees with the actual data.
Ref
Association does not necessarily imply causation. True/False
True
Even though there is a moderately strong positive relationship, we cannot say that caffeine is the cause of the longer study times.
Another variable—like a student needing to pass this course to graduate—might be the cause of both increased study time and caffeine consumption.
Reverse causation is also possible. Instead of caffeine causing longer study times, maybe it’s the need to study longer that caused increased caffeine intake.
Ref
Is the model good because the residuals are close to y=0?
Without a context, it is impossible to assess whether the scale of a set of residuals is acceptable or not. For example, a few micrograms of error may have unacceptable consequences in a medicine, while several liters of error in filling a swimming pool might be tolerable.
Why do we create residual plots?
They give us a sense of how good a fit is, if the points are randomly scattered around the axis and have no trend, it’s a good fit, but if it has an upward trend for example, then a line is probably not a good idea
Or if there are many points far away from the x axis in the residual plot, then the line is not a good fit
What’s a high-leverage point in a distributions?
|External
It’s a point that’s far lower or far higher than the mean