Correlation Flashcards
What is correlation in statistics?
Correlation quantifies the extent of association between two continuous variables.
How is correlation different from regression?
Correlation measures the strength of a relationship between two variables, while regression explains one variable in terms of another with an equation.
What is Pearson’s correlation coefficient (𝑟)?
A measure of linear correlation between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).
What does a correlation coefficient of 0 mean?
It means there is no linear relationship between the two variables.
What is a perfect correlation?
A perfect correlation occurs when all data points fall exactly on a straight line, with 𝑟 = ±1.
What does concurvity refer to?
It describes a non-linear association between two continuous variables.
In R, how can you compute the correlation coefficient between two variables, LLL and TotalHeight?
diffx <- hgt$LLL - mean(hgt$LLL)                  # deviations of LLL from its mean
diffy <- hgt$TotalHeight - mean(hgt$TotalHeight)  # deviations of TotalHeight from its mean
r <- sum(diffx * diffy) / sqrt(sum(diffx^2) * sum(diffy^2))  # Pearson's r
print(r)
What is a simpler way to compute correlation in R?
cor(x = hgt$TotalHeight, y = hgt$LLL)
What is covariance?
A measure of how two variables vary together; it forms the numerator of the correlation formula, before scaling by the two standard deviations.
How is correlation different from covariance?
Correlation standardizes covariance to a range of -1 to +1, making it comparable across different units.
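A minimal R sketch of this relationship, reusing the hgt data frame (columns LLL and TotalHeight) from the earlier cards: dividing the covariance by the two standard deviations gives the correlation.
cv <- cov(hgt$LLL, hgt$TotalHeight)             # covariance, in the product of the two units
r  <- cv / (sd(hgt$LLL) * sd(hgt$TotalHeight))  # standardised to lie between -1 and +1
all.equal(r, cor(hgt$LLL, hgt$TotalHeight))     # should be TRUE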
What are the null (𝐻0) and alternative (𝐻1) hypotheses for testing correlation?
H0: ρ = 0 (no association between the variables)
H1: ρ ≠ 0 (there is an association)
What test statistic is used to test correlation?
A 𝑡-test with 𝑛−2 degrees of freedom.
How do you compute a two-tailed 𝑝-value for correlation in R?
2 * pt(q = abs(t_stat), df = n - 2, lower.tail = FALSE)
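The line above assumes objects t_stat and n already exist; a minimal sketch of how they might be computed from the hgt data used earlier, using the standard formula t = r√(n−2)/√(1−r²):
n      <- nrow(hgt)                                        # number of paired observations
r      <- cor(hgt$TotalHeight, hgt$LLL)                    # Pearson correlation
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)                  # test statistic with n-2 df
2 * pt(q = abs(t_stat), df = n - 2, lower.tail = FALSE)    # two-tailed p-value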
What function in R performs a correlation test?
cor.test(x = hgt$TotalHeight, y = hgt$LLL)
What is the alternative hypothesis in a correlation test?
H1: ρ ≠ 0 (the true correlation is not zero, meaning an association exists).
How do we interpret a very small 𝑝-value in a correlation test?
It suggests strong evidence against the null hypothesis, meaning there is likely an association between the variables.
What does the confidence interval in a correlation test represent?
It provides a range within which the true population correlation coefficient (𝜌) is likely to lie.
Why is correlation not the same as causation?
Correlation only shows an association, but a causal link requires further evidence, such as controlled experiments.
What is a “spurious” or “nonsense” correlation?
A correlation between two variables that occurs due to chance or a hidden third variable rather than a causal relationship.
What are three possible explanations for a correlation?
Chance (random coincidence)
A third variable affecting both
Genuine causation
What is Anscombe’s quartet?
A set of four datasets that have the same correlation coefficient but different distributions, illustrating the limitations of correlation.
What is the correlation coefficient (𝑟) for each pair in Anscombe’s quartet?
r ≈ 0.816 for all four x–y pairs, despite vastly different data patterns.
What does Anscombe's quartet demonstrate about correlation?
That correlation alone does not capture the true nature of relationships between variables; visualization is essential.
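R ships the quartet as the built-in data frame anscombe (columns x1–x4 and y1–y4), so the claim is easy to check:
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
# all four correlations are approximately 0.816, yet plots of the pairs look completely different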
What does the datasauRus dataset illustrate?
That datasets with different structures can have identical correlation values, emphasizing the need for visualization.
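A sketch of the same check, assuming the datasauRus package is installed; its datasaurus_dozen data frame has columns dataset, x and y:
library(datasauRus)
by(datasaurus_dozen, datasaurus_dozen$dataset, function(d) cor(d$x, d$y))
# every dataset (including the dinosaur) has essentially the same correlation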
What are the three main reasons to study relationships between variables?
Description – To describe patterns in data.
Explanation – To understand causal relationships.
Prediction – To estimate unknown values.
What is simple linear regression?
A statistical method to model the relationship between one explanatory variable and one response variable.
What is the general form of a simple linear regression equation?
Y = β0 + β1X + ε
Y is the response variable,
X is the explanatory variable,
β0 is the intercept,
β1 is the slope,
ε is the error term.
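To make the pieces concrete, here is a minimal simulation sketch of this model; the parameter values and variable names are chosen purely for illustration:
set.seed(1)
n     <- 100
beta0 <- 2                      # true intercept
beta1 <- 0.5                    # true slope
x     <- runif(n, 0, 10)        # explanatory variable
eps   <- rnorm(n, 0, 1)         # error term, N(0, sigma^2)
y     <- beta0 + beta1 * x + eps
coef(lm(y ~ x))                 # estimates should be close to 2 and 0.5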
What are the assumptions of simple linear regression?
Linearity – The relationship between 𝑋 and 𝑌 is linear.
Independence – Observations are independent.
Homoscedasticity – Constant variance of errors.
Normality – Errors are normally distributed.
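A minimal sketch of how these assumptions are usually checked, assuming a fitted model object such as simple.lm <- lm(y ~ x, data = exData) (the example model used in later cards):
plot(simple.lm, which = 1)   # residuals vs fitted values: checks linearity and constant variance
plot(simple.lm, which = 2)   # normal Q-Q plot of the residuals: checks the normality assumption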
What is the dependent (response) variable in a regression model?
The variable we aim to explain or predict (denoted as 𝑌).
What is the independent (predictor) variable in a regression model?
The variable used to explain or predict the response variable (denoted as 𝑋)
What does the intercept (𝛽0) represent in a regression model?
It is the expected value of 𝑌 when 𝑋=0, or where the regression line crosses the y-axis.
When might the intercept (𝛽0) not be meaningful?
When the explanatory variable (𝑋) cannot realistically take a value of zero (e.g., age in a salary regression).
What does the slope (𝛽1) represent in a regression model?
It describes the expected change in 𝑌 for a one-unit increase in 𝑋.
What does the sign of the slope (𝛽1) indicate?
𝛽1>0 → Positive relationship
𝛽1=0 → No relationship
𝛽1<0 → Negative relationship
What is the error term (𝜖𝑖) in a regression model?
It represents the difference between the observed and predicted values of 𝑌, accounting for variability not explained by 𝑋.
What assumption is made about the error term in simple linear regression?
Errors (εi) are assumed to be normally distributed with mean zero: εi ~ N(0, σ²).
What is the least squares (LS) criterion in regression?
It finds the line that minimizes the sum of squared residuals (SSR), ensuring the best fit for the data.
Why is the least squares method preferred?
It ensures that the fitted line has the smallest possible sum of squared differences between observed and predicted values, leading to optimal parameter estimates.
What does the least squares criterion minimize in regression?
The sum of squared residuals (SSR), ensuring the best-fitting line.
What are residuals in regression?
The vertical differences between observed data points and the fitted regression line.
What is the formula for estimating the intercept (𝛽0) in simple linear regression?
β̂0 = ȳ − β̂1x̄, where ȳ and x̄ are the sample means and β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² is the estimated slope.
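A minimal sketch checking these formulas by hand against lm(), using the hypothetical exData data frame (columns x and y) that appears in the lm() card below:
b1 <- sum((exData$x - mean(exData$x)) * (exData$y - mean(exData$y))) /
      sum((exData$x - mean(exData$x))^2)        # least-squares slope
b0 <- mean(exData$y) - b1 * mean(exData$x)      # least-squares intercept
coef(lm(y ~ x, data = exData))                  # should match b0 and b1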
Why is least squares estimation preferred?
It provides optimal estimates and coincides with maximum likelihood estimation under normality assumptions.
What R function is used to fit a simple linear regression model?
lm(), which stands for linear model.
What is the basic syntax for fitting a linear model in R?
simple.lm <- lm(y ~ x, data=exData)
How can you retrieve the coefficients (𝛽0,𝛽1) from a fitted linear model in R?
By printing the model object (simple.lm) or by calling coef(simple.lm).
Given the fitted model:
ŷi = 12.426 + 1.902xi
what is the predicted value when x = 50?
ŷ = 12.426 + (1.902 × 50) = 107.526
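The same prediction can be obtained with predict(), assuming the fitted object simple.lm from the earlier card is the model with these coefficients:
predict(simple.lm, newdata = data.frame(x = 50))
# equivalent to computing 12.426 + 1.902 * 50 = 107.526 by hand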
What function in R gives predicted values for the fitted regression model?
fitted()
Why is it risky to use a regression model to make predictions outside the range of observed 𝑥 values?
The relationship may not remain linear beyond the observed data, leading to inaccurate predictions.
What does the Residual Standard Error (RSE) in an R regression summary represent?
The estimated standard deviation of residuals, which measures the spread of observed values around the fitted regression line.
What does the Adjusted R-squared value account for in regression?
It adjusts for the number of predictors, providing a more accurate measure of model fit when multiple explanatory variables are present.
What does the Multiple R-squared value in an R regression summary indicate?
The proportion of variance in the response variable explained by the explanatory variable(s).
How is the F-statistic in an R regression summary interpreted?
It tests whether at least one predictor variable is significantly associated with the response variable.
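All of these quantities are reported by summary(); a minimal sketch of where to find them programmatically, assuming the fitted object simple.lm:
s <- summary(simple.lm)
s$sigma          # residual standard error (RSE)
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
s$fstatistic     # F-statistic with its degrees of freedom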
What is the null hypothesis for testing the significance of a regression coefficient?
H0: βi =0, meaning the predictor has no effect on the response variable.
What does a very small p-value (e.g., < 0.001) for a regression coefficient indicate?
Strong evidence against 𝐻0, suggesting the predictor is statistically significant.
In the R summary output, what does the Significance Codes section indicate?
It categorizes p-values using ***, **, *, and . to show different levels of statistical significance.
What does a Residuals section in an R regression summary show?
The distribution of residuals (errors), including minimum, 1st quartile, median, 3rd quartile, and maximum values.
Given the regression equation:
Predicted WeightA = 102.18 + 1.72 × SST
what does the slope coefficient 1.72 mean?
For each 1-unit increase in SST, the predicted WeightA increases by 1.72 units on average.
What does the Residual Standard Error = 2.093 in an R summary output mean?
On average, the actual WeightA values deviate by about 2.093 units from the predicted values.
What is goodness of fit in regression?
It refers to how well the model explains the variability in the response variable.
How is R² interpreted in regression?
R² ranges from 0 to 1:
R² = 1 → Perfect fit
R² = 0 → Model explains no variability
R² = 0.7954 → Model explains 79.54% of the response variable’s variability.
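A minimal sketch of computing R² directly from a fitted model (assuming simple.lm and the exData data frame from the earlier cards):
ss_res <- sum(resid(simple.lm)^2)               # unexplained (residual) sum of squares
ss_tot <- sum((exData$y - mean(exData$y))^2)    # total sum of squares
1 - ss_res / ss_tot                             # equals summary(simple.lm)$r.squared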
What does the ANOVA table show in regression analysis?
It decomposes total variability into:
Model Sum of Squares (SSModel) → Explained variation
Residual Sum of Squares (SSRes) → Unexplained variation
F-statistic → Measures model significance
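In R this table is produced by anova() applied to the fitted model (assuming simple.lm from the earlier cards):
anova(simple.lm)   # one row for the predictor (model SS) and one for Residuals, with the F-statistic and p-value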
How is the F-statistic in ANOVA related to the t-statistic for a predictor?
F=t²
Example: If t = 11.99, then F = (11.99)² = 143.9.
What does a high F-statistic and a small p-value indicate?
It suggests that at least one predictor variable significantly explains variation in the response.
What does Residual Standard Error (RSE) represent in the regression summary?
It is the standard deviation of residuals, measuring how much actual values deviate from predictions.
What are the key takeaways from regression analysis?
Regression models explain relationships between variables.
R² measures how much variation is explained.
ANOVA tests overall model significance.
F-statistic determines if predictors improve the model.
RSE shows the spread of residuals.
How is Mean Squared Error (MSE) related to RSE?
MSE = RSE² (equivalently, RSE = √MSE).
Example: If RSE = 2.093, then MSE = 2.093² ≈ 4.38.
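A minimal sketch of the relationship, assuming the fitted object simple.lm; for simple linear regression the residual degrees of freedom are n − 2:
rse <- sqrt(sum(resid(simple.lm)^2) / df.residual(simple.lm))   # residual standard error
mse <- rse^2                                                    # residual mean square (MSE)
c(RSE = rse, MSE = mse)                                         # RSE matches summary(simple.lm)$sigma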