Khan Academy: Exploring bivariate numerical data Flashcards

1
Q

Are correlation coefficient and slope the same?

A

No. The correlation coefficient only tells you how closely your data fit a line, so two datasets with the same correlation coefficient can have very different slopes. To find the slope of the line, you'll need to perform a regression analysis.
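
A minimal sketch of this point, using made-up data: multiplying every y-value by 10 leaves r unchanged but makes the slope 10 times steeper. The helper names `correlation` and `slope` are illustrative, not from the lesson.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation coefficient r, computed from z-scores."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

def slope(xs, ys):
    """Slope of the least-squares line: r * sd(y) / sd(x)."""
    return correlation(xs, ys) * stdev(ys) / stdev(xs)

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys_shallow = [1.1, 1.9, 3.2, 3.9, 5.1]
ys_steep = [10 * y for y in ys_shallow]  # same shape, stretched vertically

r_shallow = correlation(xs, ys_shallow)
r_steep = correlation(xs, ys_steep)
slope_shallow = slope(xs, ys_shallow)
slope_steep = slope(xs, ys_steep)

print(round(r_shallow, 4), round(r_steep, 4))          # same r
print(round(slope_shallow, 4), round(slope_steep, 4))  # different slopes
```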

2
Q

By convention, a good scatter plot uses a reasonable ____ on both axes and puts the explanatory variable on the ____.

A

scale, x-axis

3
Q

How can we calculate the equation of a regression line using the means and standard deviations of x and y?

A

m = r * sd(y) / sd(x)
b = mean(y) - m * mean(x)
r = [1/(n-1)] * sum(z-score(x) * z-score(y))

pred = m*x + b
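
The formulas above can be sketched in a few lines of Python. The data and the function name `regression_line` are made up for illustration; `stdev` is the sample standard deviation (n-1 in the denominator), which matches the 1/(n-1) factor in the formula for r.

```python
from statistics import mean, stdev

def regression_line(xs, ys):
    """Return (m, b, r) for the least-squares line pred = m*x + b."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = stdev(xs), stdev(ys)
    # r = [1/(n-1)] * sum(z-score(x) * z-score(y))
    r = sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
            for x, y in zip(xs, ys)) / (n - 1)
    m = r * sd_y / sd_x        # slope
    b = mean_y - m * mean_x    # intercept: the line passes through (mean x, mean y)
    return m, b, r

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b, r = regression_line(xs, ys)

def predict(x):
    return m * x + b

print(round(m, 3), round(b, 3), round(r, 3))
```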

4
Q

Which one is right?
1- residual = actual - expected
2- residual = expected - actual

A

1 (residual = actual - expected)

5
Q

The graph displays a residual plot that was constructed after running a least-squares regression on a set of bivariate numerical data (x, y).
[Image: residual plot]
Is the slope of the least-squares regression line 0?

A

No. If a point in a residual plot lies on the horizontal axis, its residual is 0, so the point lies on the least-squares regression line (LSRL). In this respect, the line y=0 represents the LSRL in a residual plot. However, it is not the actual LSRL, so we cannot conclude that the slope of the LSRL is zero.

6
Q

What does r^2 tell us?

A

R-squared tells us what percentage of the prediction error in the y variable is eliminated when we use least-squares regression on the x variable, compared with using no regression at all, that is, ignoring x and predicting the constant value avg(y) for every point.

In Other Words: In statistics, the coefficient of determination, denoted R² or r² and pronounced “R squared”, is the proportion of the variation in the dependent variable that is predictable from the independent variable.

7
Q

r^2 is also called the ____.

A

coefficient of determination.

8
Q

Many formal definitions say that r^2 tells us what percent of the variability in the y variable is accounted for by the regression on the x variable. What does that mean?

A

Variance measures how spread out the data are. The total variation in y is the sum of squared distances from the mean: sum((y - avg(y))^2) = total variability of y. This is also the squared error we get without regression, when we predict avg(y) for every point.

When we use regression on x to estimate y, the squared error is: sum((y - pred)^2) = variation in y NOT described by the regression.
SE = squared error
So SE with regression / SE without regression is the proportion of the variation in y not described by the regression.

Therefore, 1 - SE with regression / SE without regression is the proportion of the variation in y that IS described by regressing on x. This value equals r^2 (the correlation coefficient squared), but the proof is outside the scope of the lesson.
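
A quick numerical check of this identity, as a sketch on made-up data. The variable names are illustrative, and the slope is computed with the equivalent least-squares form sum((x - avg(x)) * (y - avg(y))) / sum((x - avg(x))^2) rather than via r, so that the comparison with r^2 is not circular.

```python
from statistics import mean, stdev

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
mean_x, mean_y = mean(xs), mean(ys)

# Least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
m = sxy / sxx
b = mean_y - m * mean_x

# Squared error without regression: predict avg(y) for every point.
se_mean = sum((y - mean_y) ** 2 for y in ys)
# Squared error with regression: predict m*x + b.
se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - se_line / se_mean

# Correlation coefficient from z-scores, for comparison.
r = sum(((x - mean_x) / stdev(xs)) * ((y - mean_y) / stdev(ys))
        for x, y in zip(xs, ys)) / (n - 1)

print(round(r_squared, 4), round(r ** 2, 4))  # the two values agree
```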

9
Q

coefficient of determination=correlation coefficient ^2
True/False?

A

True

10
Q

The standard deviation of the residuals is also called the root mean squared error.
True/False?

A

True

The SD of the errors (residuals) shows the typical size of the error: roughly how much, on average, the model disagrees with the actual data.
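
A sketch of the calculation on made-up data. The card does not say which divisor to use, so it is a parameter here: `lost_df=2` (dividing by n-2) is the convention regression software uses for a line with a slope and an intercept, and it is an assumption of this example, not something the card states. The function name `residual_sd` is hypothetical.

```python
from math import sqrt
from statistics import mean

def residual_sd(xs, ys, lost_df=2):
    """Standard deviation of the residuals of the least-squares line.

    lost_df is the number subtracted from n in the divisor (assumed 2 here).
    """
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    # residual = actual - predicted
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    return sqrt(sum(e ** 2 for e in residuals) / (n - lost_df))

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s = residual_sd(xs, ys)
print(round(s, 4))
```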

11
Q

Association does not necessarily imply causation. True/False

A

True

For example, even if there is a moderately strong positive relationship between caffeine consumption and study time, we cannot say that caffeine is the cause of the longer study times.
Another variable (like a student needing to pass this course to graduate) might be the cause of both increased study time and caffeine consumption.
Reverse causation is also possible. Instead of caffeine causing longer study times, maybe it's the need to study longer that caused increased caffeine intake.

12
Q

Is the model good because the residuals are close to y=0?

A

Without a context, it is impossible to assess whether the scale of a set of residuals is acceptable or not. For example, a few micrograms of error may have unacceptable consequences in a medicine, while several liters of error in filling a swimming pool might be tolerable.

13
Q

Why do we create residual plots?

A

They give us a sense of how good a fit is. If the points are randomly scattered around the horizontal axis with no trend, the line is a good fit. If the residuals show a pattern, such as an upward trend, a line is probably not a good model.
If many points lie far from the horizontal axis in the residual plot, the line is also not a good fit.
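
As an illustration, here is a sketch that fits a line to made-up data that is actually curved (y = x^2). The residuals are not randomly scattered: they are positive at both ends and negative in the middle, the kind of pattern that suggests a line is the wrong model.

```python
from statistics import mean

# Made-up, deliberately curved data: y = x^2.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

mean_x, mean_y = mean(xs), mean(ys)
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - m * mean_x

# residual = actual - predicted; these are the heights in a residual plot.
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
signs = ["+" if e > 0 else "-" for e in residuals]

for x, e in zip(xs, residuals):
    print(x, round(e, 3))
print(signs)  # a U-shaped pattern rather than random scatter
```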

14
Q

What's a high-leverage point in a distribution?

A

It's a point whose x-value is far lower or far higher than the mean of the x-values.
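
The card gives no formula, but one common way to quantify this, shown here as a sketch, is the leverage value used for simple linear regression: h = 1/n + (x - avg(x))^2 / sum((x - avg(x))^2). Points whose x is far from the mean get a larger h. The data and the function name `leverages` are made up for illustration.

```python
from statistics import mean

def leverages(xs):
    """Leverage of each point in a simple linear regression on xs."""
    n = len(xs)
    mean_x = mean(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

# Made-up x-values; 20 sits far from the rest.
xs = [1, 2, 3, 4, 20]
h = leverages(xs)
most_extreme = xs[h.index(max(h))]

for x, hv in zip(xs, h):
    print(x, round(hv, 3))
```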

15
Q

What does the "S" value in a computer regression table mean?

A

It's the standard deviation of the residuals, also known as the root mean squared error.
It's a measure of how much the regression model disagrees with the data.
