Topic 5: Linear Model Flashcards

Question 1

Q

Define bivariate data and the variables involved

Answer

A

Bivariate data consists of paired observations (xi, yi), with i = 1, 2, 3, …, n

x: independent variable
y: dependent variable

Question 2

Q

What does a scatter plot summarize?

Answer

A

A scatter plot is a graphical summary of the relationship between 2 variables, plotting each (x, y) pair as a point on the same 2D plane to create a cloud of data points.

Question 3

Q

Describe linear association

Answer

A

Linear association describes how tightly the points cluster around a line, and in which direction the line slopes.

Strong association: tightly clustered
Weak association: not tightly clustered
Positive/negative association: the line slopes upward/downward

Question 4

Q

How many numerical summaries does a scatter plot take into account?

Answer

A

5 numerical summaries:
- mean & SD of x
- mean & SD of y
- correlation coefficient (r)

Question 5

Q

Describe the center and spread of the data cloud in a scatter plot

Answer

A

Center: point of averages (mean of x, mean of y)
Horizontal spread: measured by SD of x (most points lie within 2 SDs of the mean of x)
Vertical spread: measured by SD of y (most points lie within 2 SDs of the mean of y)

Question 6

Q

Describe correlation coefficient and its features

Answer

A

The correlation coefficient (r) is a numerical summary that measures how tightly the points cluster around a line, showing the sign and strength of the linear association.

Pure number with no unit
Lies between -1 and 1
(+) r: upward slope
(-) r: downward slope
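r = +/-1: perfect correlation (all points lie on a line); the closer r is to -1 or +1, the more tightly the points cluster
r = 0: no linear association; the points don't cluster around a line
Symmetry: r is not affected by interchanging the variables
Scaling: r is unchanged by shifting either variable (adding a constant) or scaling it (multiplying by a positive constant)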

Question 7

Q

How do you calculate the population/sample correlation coefficient?

Answer

A

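r is the mean of the products of the variables in standard units

Convert each variable to standard units (z scores): (data point - mean)/SD, using the population SD for the population r or the sample SD for the sample r
Multiply the 2 z scores of the x & y variables for each pair
Mean of the products = r (for the sample r, divide the sum of the products by n - 1 instead of n)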
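
As a rough illustration of the steps above, here is a minimal Python sketch of the population version. The data values and the helper names `pop_sd` and `correlation` are made up for this example and are not part of the course material.

```python
from math import sqrt

def pop_sd(values):
    """Population SD: root-mean-square of the gaps from the mean."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(xs, ys):
    """Population r: mean of the products of the z scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x, sd_y = pop_sd(xs), pop_sd(ys)
    zx = [(x - mean_x) / sd_x for x in xs]  # x in standard units
    zy = [(y - mean_y) / sd_y for y in ys]  # y in standard units
    return sum(a * b for a, b in zip(zx, zy)) / n

# Illustrative data only (hypothetical values).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 3))  # 0.775
```

Swapping the arguments, `correlation(ys, xs)`, gives the same value, which is the symmetry property of r.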

Question 8

Q

What is the SD line?

Answer

A

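The SD line connects the point of averages to the point 1 SD away from the mean in both the x & y directions: (mean of x + SDx, mean of y + SDy) when r is positive, or (mean of x + SDx, mean of y - SDy) when r is negative.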

Question 9

Q

What are some features and limitations of the SD line?

Answer

A

Features: it goes through the point of averages and captures the exact linear relationship if there is one

Limitations: it does not use r, so it
- cannot distinguish between clouds with different clustering
- can over- or underestimate when used for prediction

Question 10

Q

What are some warnings regarding correlation coefficient?

Answer

A

Outliers can overly influence r
r cannot detect nonlinear association
The same r value can come from very different data sets
Correlations based on rates or averages can inflate r
Association doesn’t mean causation
Small SDs can make r look bigger

Question 11

Q

What is the regression line and its equation?

Answer

A

The regression line takes into account all 5 numerical summaries.
It connects the point of averages to (mean of x + SDx, mean of y + r * SDy).
y = intercept + slope * x, where slope = r * SDy/SDx and intercept = mean of y - slope * mean of x
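
The sketch below shows one way to get the slope and intercept from the 5 numerical summaries, assuming population SDs. The data and the helper names `mean` and `pop_sd` are invented for illustration.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pop_sd(values):
    m = mean(values)
    return sqrt(mean([(v - m) ** 2 for v in values]))

# Illustrative data only (hypothetical values).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# The 5 numerical summaries.
mean_x, mean_y = mean(xs), mean(ys)
sd_x, sd_y = pop_sd(xs), pop_sd(ys)
r = mean([((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)])

# The line through the point of averages and (mean_x + sd_x, mean_y + r * sd_y).
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

def predict(x):
    """Regression prediction of y for a given x."""
    return intercept + slope * x

print(round(slope, 3), round(intercept, 3))  # 0.6 2.2
```

Moving 1 SD to the right of the mean of x raises the prediction by only r SDs of y, which is what separates the regression line from the SD line.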

Question 12

Q

What is the graph of averages?

Answer

A

The graph of averages plots the average y for each x value.
If the points give a straight line, it is the regression line.
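
A small sketch of how the points on a graph of averages could be computed, assuming x takes only a few distinct values. The function name `graph_of_averages` and the data are hypothetical.

```python
from collections import defaultdict

def graph_of_averages(xs, ys):
    """Return {x value: average of the y values observed at that x}."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: sum(vals) / len(vals) for x, vals in sorted(groups.items())}

# Illustrative data only (hypothetical values).
xs = [1, 1, 2, 2, 3, 3]
ys = [2, 4, 3, 5, 6, 6]
averages = graph_of_averages(xs, ys)
print(averages)  # {1: 3.0, 2: 4.0, 3: 6.0}
```

With continuous x values, the same idea would need the x axis cut into narrow strips first, which this sketch does not do.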

Question 13

Q

What are some ways of predicting a y value when given an x value?

Answer

A

Baseline prediction: y prediction = average of y values over all x values
Prediction in a strip: y prediction = average of all y values associated with the given x
Based on regression line: use the line equation to predict y
Predicting percentile marks: x given in a certain percentile –> predict y percentile

Question 14

Q

What steps are taken to predict y value based on a given x percentile?

Answer

A

Convert the x percentile into a z score in the x direction: Zx (qnorm)
Multiply by r to get the predicted z score in the y direction: Zy = r * Zx
Turn Zy back into a percentile in the y direction (pnorm)
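
The three steps can be sketched in Python, with `NormalDist().inv_cdf` and `NormalDist().cdf` from the standard library standing in for the qnorm and pnorm functions named on the card. The function name `predict_y_percentile` is hypothetical, and the sketch assumes both variables are roughly normal.

```python
from statistics import NormalDist

def predict_y_percentile(x_percentile, r):
    """Predict the y percentile for a given x percentile (0 < x_percentile < 100)."""
    standard_normal = NormalDist()  # mean 0, SD 1
    zx = standard_normal.inv_cdf(x_percentile / 100)  # percentile -> z score (like qnorm)
    zy = r * zx                                       # regression in standard units
    return 100 * standard_normal.cdf(zy)              # z score -> percentile (like pnorm)

# Example with a hypothetical r of 0.5: x at the 90th percentile.
print(round(predict_y_percentile(90, 0.5)))  # 74
```

Unless r = +/-1, the predicted percentile is pulled toward the 50th, because Zy is smaller in size than Zx.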

Question 15

Q

What are residuals?

Answer

A

Residuals are vertical distances of data points above or below the regression line, representing errors between actual value and prediction.

Question 16

Q

Describe population RMS error and its equation

Answer

A

Root-mean-square size of the residuals, like “SD for the line”

RMS error pop = RMS (gaps from the regression line)
RMS in baseline prediction: RMS error = SDy
RMS error pop = sqrt(1 - r^2) * SDy
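
As a quick check on the shortcut formula, the sketch below computes the RMS error directly from the residuals and compares it with sqrt(1 - r^2) * SDy. It assumes population SDs; the data and helper names are made up for illustration.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pop_sd(values):
    m = mean(values)
    return sqrt(mean([(v - m) ** 2 for v in values]))

# Illustrative data only (hypothetical values).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x, mean_y = mean(xs), mean(ys)
sd_x, sd_y = pop_sd(xs), pop_sd(ys)
r = mean([((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)])

slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

# Residual = actual y minus the regression prediction.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

rms_direct = sqrt(mean([e ** 2 for e in residuals]))  # RMS of the residuals
rms_shortcut = sqrt(1 - r ** 2) * sd_y                # shortcut formula

print(round(rms_direct, 4), round(rms_shortcut, 4))   # both 0.6928
```

The residuals themselves average to zero here, which is why the RMS, rather than the plain average, is used to measure their size.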

Question 17

Q

What is the RMS error when r=+/-1 and when r=0?

Answer

A

r=+/-1: RMS error = 0
r=0: RMS error = SDy

Question 18

Q

What is a residual plot and what do you look for in this plot?

Answer

A

Residual plot: residuals vs x
We look for random scatter around zero with no pattern in the residual plot (if random, a linear model is appropriate)

Question 19

Q

What are vertical strips for?

Answer

A

Vertical strips on a scatter plot are used to compare the spread of y at different x values

If within the vertical strips there is equal spread in the y direction –> homoscedastic data –> RMS error can be used as a measure of spread for individual strips

If within the vertical strips there is unequal spread in the y direction –> heteroscedastic data –> RMS error CANNOT be used

Question 20

Q

How can the normal approximation be used in vertical strips?

Answer

A

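Mean = mean of y + Zx * r * SDy (the regression prediction for the strip)
SD = RMS error
Calculate the z score for the threshold using this mean and SD
Use pnorm() to turn the z score into a proportion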
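
A minimal sketch of this calculation in Python, with `statistics.NormalDist` standing in for pnorm(). The function name `prob_y_below` and all the summary values in the example are hypothetical, and the sketch assumes homoscedastic, roughly normal data within the strip.

```python
from math import sqrt
from statistics import NormalDist

def prob_y_below(threshold, x, mean_x, sd_x, mean_y, sd_y, r):
    """Approximate proportion of y values below `threshold` in the strip at x."""
    zx = (x - mean_x) / sd_x
    strip_mean = mean_y + zx * r * sd_y   # regression prediction for the strip
    strip_sd = sqrt(1 - r ** 2) * sd_y    # RMS error (0 when r = +/-1, where cdf is undefined)
    return NormalDist(strip_mean, strip_sd).cdf(threshold)

# Hypothetical summaries: x has mean 50 and SD 10, y has mean 60 and SD 15, r = 0.6.
# For x = 60 the strip mean is 69 and the strip SD is about 12.
p = prob_y_below(81, x=60, mean_x=50, sd_x=10, mean_y=60, sd_y=15, r=0.6)
print(round(p, 2))  # 0.84
```

The proportion above the threshold is 1 minus the returned value.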

(20 cards)