Topic 5: Linear Model Flashcards
Define bivariate data and the variables involved
Bivariate data consists of paired observations (xi, yi), with i = 1, 2, 3, …, n
x: independent variable
y: dependent variable
What does a scatter plot summarize?
A scatter plot is a graphical summary of the relationship between 2 variables: each pair (xi, yi) is plotted as a point on the same 2D plane, creating a cloud of data points.
Describe linear association
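Linear association describes how tightly the points cluster around a line (strength) and which way that line slopes (direction).
Strong association: tightly clustered; weak association: not tightly clustered
Positive association: y tends to increase with x; negative association: y tends to decrease as x increases
How many numerical summaries does a scatter plot take into account?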
5 numerical summaries:
- mean & SD of x
- mean & SD of y
- correlation coefficient (r)
Describe the center and spread of data cloud in scatter plot
- Center: point of averages (mean of x, mean of y)
- Horizontal spread: measured by SD of x (most points lie within 2 SDs of the mean of x)
- Vertical spread: measured by SD of y (most points lie within 2 SDs of the mean of y)
Describe correlation coefficient and its features
The correlation coefficient (r) is a numerical summary measuring how tightly the points cluster around a line, showing the sign and strength of the linear association.
- Pure number with no unit
- Lies between -1 and 1
- (+) r: upward slope
- (-) r: downward slope
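- r = +/-1: perfect correlation (all points lie on a line); the closer r is to -1 or +1, the more tightly clustered the points
- r = 0: no linear association (the points do not cluster around any line)
- Symmetry: r is not affected by interchanging the variables
- Scaling: shift & scale invariant (adding a constant to a variable, or multiplying it by a positive constant, leaves r unchanged; multiplying by a negative constant flips the sign of r)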
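How to calculate the population/sample correlation coefficient?
Mean of the product of the variables in standard units
- convert each variable to standard units (z scores): (data point - mean)/SD, using the population SD for the population r or the sample SD for the sample r
- take the product of the 2 z scores (x & y variables) for each data point
- mean of the products = r (for the sample r with sample SDs, divide the sum of the products by n - 1 instead of n)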
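The steps above can be sketched in Python. This is a minimal illustration, not part of the original cards: the function name `correlation` and the small data set are made up, and it computes the population r (population SDs via `statistics.pstdev`). The last lines also check the symmetry and shift & scale features.

```python
from statistics import fmean, pstdev

def correlation(xs, ys):
    """Population r: mean of the products of the z scores."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)      # population SDs (divide by n)
    zx = [(x - mean_x) / sd_x for x in xs]   # x in standard units
    zy = [(y - mean_y) / sd_y for y in ys]   # y in standard units
    return fmean([a * b for a, b in zip(zx, zy)])

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_swapped = correlation(y, x)                          # symmetry
r_rescaled = correlation([10 * v + 3 for v in x], y)   # shift & scale
print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
# prints 0.7746 0.7746 0.7746
```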
What is SD line?
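The SD line connects the point of averages (mean of x, mean of y) to the point 1 SD away from the mean in both the x & y directions: (mean of x + SDx, mean of y + SDy) for a positive association, or (mean of x + SDx, mean of y - SDy) for a negative one.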
What are some features and limitations of SD line?
Features: it goes through the point of averages and captures the exact relationship if there is perfect correlation (r = +/-1)
Limitations: it does not use r
- cannot distinguish between clouds with different amounts of clustering
- can over- or underestimate y when used for prediction
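To make the limitation concrete, here is a small Python sketch comparing the SD line with the line whose slope is scaled by r (the regression line). The data set and the helper name `predict` are made up for illustration; population SDs are assumed.

```python
from statistics import fmean, pstdev

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

mean_x, mean_y = fmean(x), fmean(y)
sd_x, sd_y = pstdev(x), pstdev(y)

# r as the mean of the products of the z scores
r = fmean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y)])

sd_slope = (1 if r >= 0 else -1) * sd_y / sd_x   # SD line: uses only the sign of r
reg_slope = r * sd_y / sd_x                      # regression line: scaled by r

def predict(slope, x_new):
    """Height at x_new of the line through the point of averages with this slope."""
    return mean_y + slope * (x_new - mean_x)

print(round(predict(sd_slope, 5), 2), round(predict(reg_slope, 5), 2))
```

For this data the SD line predicts about 5.55 at x = 5, the r-scaled line predicts about 5.2, and the observed y there is 5, so the SD line overshoots for x values above the mean.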
What are some warnings regarding correlation coefficient?
- Outliers can overly influence r
- r cannot detect nonlinear association
- The same r value can come from very different data sets
- Using rates or averages (instead of individual data points) can inflate r
- Association doesn’t mean causation
- Small SDs can make r look bigger
What is regression line and equation?
The regression line takes into account all 5 numerical summaries.
It connects the point of averages to (mean of x + SDx, mean of y + r * SDy)
y = intercept + slope * x, where slope = r * SDy/SDx and intercept = mean of y - slope * mean of x
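A minimal Python sketch of the regression equation built from the 5 numerical summaries. The function name `regression_line` and the data are assumptions for illustration, and population SDs are used (the slope comes out the same with sample SDs as long as r is computed consistently).

```python
from statistics import fmean, pstdev

def regression_line(xs, ys):
    """Return (intercept, slope) built from the 5 numerical summaries."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    r = fmean([((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
               for x, y in zip(xs, ys)])
    slope = r * sd_y / sd_x                 # rise of r * SDy per run of SDx
    intercept = mean_y - slope * mean_x     # line passes through the point of averages
    return intercept, slope

# Made-up data, for illustration only.
intercept, slope = regression_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(intercept, 2), round(slope, 2))   # prints 2.2 0.6
```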
What is graph of averages?
Graph of averages plots average y for each x.
If the points give a straight line, it is the regression line.
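The graph of averages can be sketched as a grouping step: collect the y values at each distinct x and average them. The function name `graph_of_averages` and the data are made up for illustration.

```python
from collections import defaultdict
from statistics import fmean

def graph_of_averages(xs, ys):
    """Map each distinct x to the average of the y values observed at that x."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: fmean(group) for x, group in sorted(groups.items())}

# Made-up data, for illustration only.
xs = [1, 1, 2, 2, 3, 3]
ys = [1, 3, 3, 5, 5, 7]
averages = graph_of_averages(xs, ys)
print(averages)   # prints {1: 2.0, 2: 4.0, 3: 6.0}
```

Here the averages fall exactly on the straight line y = 2x, which is also the regression line for this data set (slope 2, intercept 0).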
What are some ways of predicting a y value when given an x value?
- Baseline prediction: y prediction = average of y values over all x values
- Prediction in a strip: y prediction = average of all y values associated with the given x
- Based on regression line: use the line equation to predict y
- Predicting percentile marks: x given in a certain percentile –> predict y percentile
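The first three prediction methods can be compared side by side in a short Python sketch. The data and the variable names are made up for illustration, and a strip is taken to be all points with exactly the given x.

```python
from statistics import fmean, pstdev

xs = [1, 1, 2, 2, 3, 3]   # made-up data
ys = [1, 3, 2, 4, 6, 8]
x_given = 3

# Baseline: ignore x and predict the overall average of y.
baseline = fmean(ys)

# Strip: average only the y values whose x equals the given x.
strip = fmean([y for x, y in zip(xs, ys) if x == x_given])

# Regression line: slope r * SDy/SDx, through the point of averages.
mean_x, mean_y = fmean(xs), fmean(ys)
sd_x, sd_y = pstdev(xs), pstdev(ys)
r = fmean([((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)])
regression = mean_y + (r * sd_y / sd_x) * (x_given - mean_x)

print(baseline, strip, round(regression, 2))   # prints 4.0 7.0 6.5
```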
What steps are taken to predict y value based on a given x percentile?
- convert the x percentile into a z score in the x direction: Zx (qnorm)
- Zy = r * Zx
- convert Zy back into a percentile in the y direction (pnorm)
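The three steps can be sketched in Python, where `NormalDist().inv_cdf` and `NormalDist().cdf` from the standard library play the roles of R's qnorm and pnorm. The function name `predict_y_percentile` is made up, and the sketch assumes both variables are roughly normal.

```python
from statistics import NormalDist

def predict_y_percentile(x_percentile, r):
    """Predict the y percentile (0-100) from an x percentile (0-100)."""
    std_normal = NormalDist()                     # mean 0, SD 1
    zx = std_normal.inv_cdf(x_percentile / 100)   # like qnorm()
    zy = r * zx                                   # regression in standard units
    return 100 * std_normal.cdf(zy)               # like pnorm()

# Someone at the 90th percentile in x, with r = 0.5.
print(round(predict_y_percentile(90, 0.5), 1))
```

With r = 0.5, the 90th percentile in x maps to roughly the 74th percentile in y: the prediction is pulled toward the 50th percentile, and with r = 0 it would be exactly the 50th.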
What are residuals?
Residuals are vertical distances of data points above or below the regression line, representing errors between actual value and prediction.
Describe population RMS error and its equation
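Root mean square (RMS) of the residuals, not their plain average (which is 0 for the regression line); like an “SD for the line”
RMS error pop = RMS of the gaps between actual and predicted values = sqrt(mean of residuals^2)
RMS error in baseline prediction (gaps from the mean of y): RMS error = SDy
RMS error in regression prediction: RMS error pop = sqrt(1 - r^2) * SDy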
What is RMS error of r=+/-1 and r=0?
r=+/-1: RMS error = 0
r=0: RMS error = SDy
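As a check on the formula, the sketch below computes the RMS error twice: directly from the residuals, and with the shortcut sqrt(1 - r^2) * SDy. The data is made up, and population SDs are assumed.

```python
from math import sqrt
from statistics import fmean, pstdev

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

mean_x, mean_y = fmean(x), fmean(y)
sd_x, sd_y = pstdev(x), pstdev(y)
r = fmean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y)])

slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

# Residual = actual y minus the y predicted by the regression line.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

rms_direct = sqrt(fmean([e ** 2 for e in residuals]))   # RMS of the residuals
rms_formula = sqrt(1 - r ** 2) * sd_y                   # shortcut formula

print(round(rms_direct, 4), round(rms_formula, 4))   # prints 0.6928 0.6928
```

The residuals themselves average to 0 here, which is why the RMS, rather than the plain average, is used to measure their size.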
What is residual plot and what do you look for in this plot?
Residual plot: residuals (vertical axis) vs x (horizontal axis)
We look for randomness in the residual plot, i.e. no visible pattern (if random, the linear model is appropriate)
What are vertical strips for?
Vertical strips on a scatter plot are used to compare the spread of y at different x values.
If the spread in the y direction is equal across vertical strips –> homoscedastic data –> RMS error can be used as the measure of spread for individual strips
If the spread in the y direction is unequal across vertical strips –> heteroscedastic data –> RMS error CANNOT be used for individual strips
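One rough way to check this numerically is to compute the SD of y inside each strip and compare. In the sketch below the function name `spread_by_strip` and the data are made up, and each distinct x value is treated as its own strip.

```python
from collections import defaultdict
from statistics import pstdev

def spread_by_strip(xs, ys):
    """SD of the y values inside each vertical strip (one strip per distinct x)."""
    strips = defaultdict(list)
    for x, y in zip(xs, ys):
        strips[x].append(y)
    return {x: pstdev(group) for x, group in sorted(strips.items())}

# Made-up data in which the vertical spread grows with x.
xs = [1, 1, 1, 2, 2, 2, 3, 3, 3]
ys = [4, 5, 6, 3, 5, 7, 1, 5, 9]
spreads = spread_by_strip(xs, ys)
print({x: round(sd, 2) for x, sd in spreads.items()})
```

The strip SDs here roughly double from one strip to the next, so this data would count as heteroscedastic and a single RMS error would not describe every strip.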
How can normal approximation be used in vertical strips?
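Mean = mean of y + r * Zx * SDy, where Zx is the z score of the given x
SD = RMS error = sqrt(1 - r^2) * SDy
Compute the z score for the threshold: z = (threshold - Mean)/SD
Use pnorm(z) to get the proportion of the strip below the threshold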
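Putting the four steps together, a minimal Python sketch with `NormalDist().cdf` standing in for pnorm(). The function name `proportion_below` and the summary values are hypothetical, and the sketch assumes homoscedastic, roughly normal data.

```python
from math import sqrt
from statistics import NormalDist

def proportion_below(threshold, x, mean_x, sd_x, mean_y, sd_y, r):
    """Approximate share of y values below the threshold in the strip at x."""
    zx = (x - mean_x) / sd_x
    strip_mean = mean_y + r * zx * sd_y       # regression prediction for the strip
    strip_sd = sqrt(1 - r ** 2) * sd_y        # RMS error
    z = (threshold - strip_mean) / strip_sd   # z score of the threshold
    return NormalDist().cdf(z)                # like pnorm(z)

# Hypothetical summaries: both variables have mean 70 and SD 3, r = 0.5.
p = proportion_below(threshold=71.5, x=73, mean_x=70, sd_x=3,
                     mean_y=70, sd_y=3, r=0.5)
print(round(p, 3))   # prints 0.5
```

Here x = 73 is 1 SD above average, so the strip mean is 70 + 0.5 * 1 * 3 = 71.5, and a threshold equal to the strip mean splits the strip in half. The proportion above a threshold is 1 minus the returned value.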