Topic 5: Linear Model Flashcards
Define bivariate data and the variables involved
Bivariate data consists of pairs of values (xi, yi), with i = 1, 2, 3, …, n
x: independent variable
y: dependent variable
What does a scatter plot summarize?
A scatter plot is a graphical summary of the relationship between 2 variables, plotted as (x, y) pairs on the same 2D plane to form a cloud of data points.
Describe linear association
Linear association describes how tightly the points cluster around a line (strength) and whether that line slopes up or down (direction).
Strong association: tightly clustered (positive if y tends to rise with x, negative if y tends to fall)
Weak association: not tightly clustered
How many numerical summaries does a scatter plot take into account?
5 numerical summaries:
- mean & SD of x
- mean & SD of y
- correlation coefficient (r)
Describe the center and spread of the data cloud in a scatter plot
- Center: point of averages (mean of x, mean of y)
- Horizontal spread: measured by SD of x (most points lie within 2 SDs of the mean of x)
- Vertical spread: measured by SD of y (most points lie within 2 SDs of the mean of y)
Describe the correlation coefficient and its features
The correlation coefficient (r) is a numerical summary of how tightly the points cluster around a line; it shows the sign and strength of the linear association.
- Pure number with no unit
- Lies between -1 and 1
- (+) r: upward slope
- (-) r: downward slope
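- r = +/-1: perfect correlation; the closer r is to -1 or +1, the more tightly the points cluster around the line
- r = 0: no linear association (the points do not cluster around any line)
- Symmetry: r is not affected by interchanging the variables
- Scaling: r is shift and scale invariant (multiplying one variable by a negative number flips the sign of r)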
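The symmetry and scaling features can be checked numerically. The sketch below is only an illustration: the data are made up and `corr` is a hypothetical helper written for this card, not a library function.

```python
from statistics import mean, pstdev

def corr(x, y):
    """Correlation as the mean product of z scores (population SDs)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = corr(x, y)
r_yx = corr(y, x)                      # symmetry: swap the variables

x_rescaled = [10 + 3 * v for v in x]   # shift by 10, scale by +3
r_rescaled = corr(x_rescaled, y)       # unchanged

x_flipped = [-2 * v for v in x]        # scale by a negative number
r_flipped = corr(x_flipped, y)         # same size, opposite sign

print(r_xy, r_yx, r_rescaled, r_flipped)
```

Up to floating-point rounding, the first three values agree and the last has the opposite sign.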
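How to calculate the population/sample correlation coefficient?
r is the mean of the product of the variables in standard units
- convert each variable to z scores: (data point - mean)/SD, using the population SD for the population r and the sample SD for the sample r
- multiply the 2 z scores of the x & y variables for each data point
- mean of the products = r (with sample SDs, divide the sum of the products by n - 1 instead of n)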
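The three steps above can be sketched in Python for the population version. The data are made up for illustration and `standard_units` is a hypothetical helper name.

```python
from statistics import mean, pstdev

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def standard_units(values):
    """Step 1: convert each value to a z score using the population SD."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

zx = standard_units(x)
zy = standard_units(y)

# Step 2: product of the two z scores for each data point
products = [a * b for a, b in zip(zx, zy)]

# Step 3: the mean of the products is r
r = mean(products)
print(round(r, 4))  # 0.7746
```

For the sample version, one would swap `pstdev` for `statistics.stdev` and divide the sum of the products by n - 1.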
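What is the SD line?
The SD line connects the point of averages to the point 1 SD away from the mean in both the x and y directions: (mean of x + SDx, mean of y + SDy) when r is positive, (mean of x + SDx, mean of y - SDy) when r is negative.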
What are some features and limitations of the SD line?
Features: it goes through the point of averages and captures the exact relationship when the correlation is perfect (r = +/-1)
Limitations: it does not use the size of r
- cannot distinguish between clouds with different degrees of clustering
- can over- or underestimate y when used for prediction
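To see the over/underestimation, the SD line can be compared with the line of slope r * SDy/SDx through the same point of averages (the regression line). This is a rough sketch on made-up data; the function names are hypothetical.

```python
from statistics import mean, pstdev

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x, mean_y = mean(x), mean(y)
sd_x, sd_y = pstdev(x), pstdev(y)
r = mean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y)])

sign = 1 if r >= 0 else -1
sd_slope = sign * sd_y / sd_x    # SD line: uses only the sign of r
reg_slope = r * sd_y / sd_x      # regression line: uses the size of r

def sd_line(x_value):
    return mean_y + sd_slope * (x_value - mean_x)

def regression_line(x_value):
    return mean_y + reg_slope * (x_value - mean_x)

for x_value in (1, 5):
    print(x_value, round(sd_line(x_value), 3), round(regression_line(x_value), 3))
```

With a positive r below 1, the SD line is steeper, so it sits above the regression line to the right of the mean of x and below it to the left.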
What are some warnings regarding the correlation coefficient?
- Outliers can overly influence r
- r cannot detect nonlinear association
- The same r value can come from very different data sets
- Correlations based on rates or averages can inflate r
- Association doesn’t mean causation
- Small SDs can make the cloud look more tightly clustered, so r looks bigger than it is
What is the regression line and its equation?
The regression line takes into account all 5 numerical summaries.
It connects the point of averages to (mean of x + SDx, mean of y + r * SDy)
y = intercept + slope * x, where slope = r * SDy/SDx and intercept = mean of y - slope * mean of x
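As a small sketch, the slope and intercept can be recovered from the two points that define the line. The data are made up and `predict` is a hypothetical helper.

```python
from statistics import mean, pstdev

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x, mean_y = mean(x), mean(y)
sd_x, sd_y = pstdev(x), pstdev(y)
r = mean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y)])

# Two points on the regression line
p1 = (mean_x, mean_y)                      # point of averages
p2 = (mean_x + sd_x, mean_y + r * sd_y)    # 1 SD right, r SDs up

slope = (p2[1] - p1[1]) / (p2[0] - p1[0])  # same as r * sd_y / sd_x
intercept = mean_y - slope * mean_x

def predict(x_value):
    """Predicted y from the regression equation."""
    return intercept + slope * x_value

print(round(slope, 3), round(intercept, 3))  # 0.6 2.2
```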
What is the graph of averages?
The graph of averages plots the average y for each x.
If the points fall on a straight line, that line is the regression line.
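A minimal sketch of a graph of averages, using made-up data chosen so that the averages fall exactly on a straight line. All names here are illustrative.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Made-up data with repeated x values
x = [1, 1, 2, 2, 3, 3]
y = [1, 3, 2, 4, 5, 3]

# Graph of averages: average y for each distinct x
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
graph_of_averages = {xi: mean(ys) for xi, ys in sorted(groups.items())}

# Regression line from the 5 numerical summaries
mean_x, mean_y = mean(x), mean(y)
sd_x, sd_y = pstdev(x), pstdev(y)
r = mean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y)])
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

for xi, avg_y in graph_of_averages.items():
    print(xi, avg_y, round(intercept + slope * xi, 3))
```

Here the averages (2, 3, 4) lie on a straight line, and the regression line passes through each of them.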
What are some ways to predict a y value when given an x value?
- Baseline prediction: y prediction = average of the y values over all x values
- Prediction in a strip: y prediction = average of all y values associated with the given x
- Based on the regression line: use the line equation to predict y
- Predicting percentiles: x is given as a percentile -> predict the percentile of y
What steps are taken to predict the y percentile from a given x percentile?
- convert the x percentile to a z score in the x direction: Zx (qnorm)
- Zy = r * Zx
- convert Zy back into a percentile in the y direction (pnorm)
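These steps can be sketched in Python, where `NormalDist().inv_cdf` plays the role of qnorm and `NormalDist().cdf` the role of pnorm. The sketch assumes both variables are roughly normal; the function name and the values r = 0.5 and the 90th percentile are made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

def predict_y_percentile(x_percentile, r):
    """Predict the percentile of y from the percentile of x (as fractions in (0, 1))."""
    zx = standard_normal.inv_cdf(x_percentile)  # step 1: percentile -> Zx
    zy = r * zx                                 # step 2: Zy = r * Zx
    return standard_normal.cdf(zy)              # step 3: Zy -> percentile

# Hypothetical example: x at the 90th percentile, r = 0.5
y_percentile = predict_y_percentile(0.90, 0.5)
print(round(100 * y_percentile, 1))
```

The predicted percentile for y is pulled toward the 50th whenever r is between -1 and 1; this is the regression effect.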
What are residuals?
Residuals are the vertical distances of data points above or below the regression line, i.e. the prediction errors: residual = actual y - predicted y.
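A short sketch of computing residuals, again on made-up data with illustrative names. A positive residual means the point lies above the regression line, a negative one below.

```python
from statistics import mean, pstdev

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x, mean_y = mean(x), mean(y)
sd_x, sd_y = pstdev(x), pstdev(y)
r = mean([((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y)])
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

# residual = actual y - predicted y
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

print([round(e, 2) for e in residuals])
print(round(sum(residuals), 10))  # residuals around the regression line sum to about 0
```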