Correlation Flashcards
What is correlation in statistics?
Correlation quantifies the extent of association between two continuous variables.
How is correlation different from regression?
Correlation measures the strength of a relationship between two variables, while regression explains one variable in terms of another with an equation.
What is Pearson’s correlation coefficient (𝑟)?
A measure of linear correlation between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).
What does a correlation coefficient of 0 mean?
It means there is no linear relationship between the two variables.
What is a perfect correlation?
A perfect correlation occurs when all data points fall exactly on a straight line, with 𝑟 = ±1.
What does concurvity refer to?
It describes a non-linear association between two continuous variables.
In R, how can you compute the correlation coefficient between two variables, LLL and TotalHeight?
diffx <- hgt$LLL - mean(hgt$LLL)                  # deviations of LLL from its mean
diffy <- hgt$TotalHeight - mean(hgt$TotalHeight)  # deviations of TotalHeight from its mean
r <- sum(diffx * diffy) / sqrt(sum(diffx^2) * sum(diffy^2))  # Pearson's r
print(r)
What is a simpler way to compute correlation in R?
cor(x = hgt$TotalHeight, y = hgt$LLL)
What is covariance?
A measure of how two variables vary together; it forms the numerator of the correlation formula, before scaling by the two standard deviations.
How is correlation different from covariance?
Correlation standardizes covariance to a range of -1 to +1, making it comparable across different units.
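A minimal R sketch of this relationship, reusing the hgt data frame (columns LLL and TotalHeight) from the earlier cards: dividing the covariance by the two standard deviations gives the correlation.
cv <- cov(hgt$LLL, hgt$TotalHeight)             # covariance, in the product of the two units
r  <- cv / (sd(hgt$LLL) * sd(hgt$TotalHeight))  # standardised to lie between -1 and +1
all.equal(r, cor(hgt$LLL, hgt$TotalHeight))     # should be TRUE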
What are the null (𝐻0) and alternative (𝐻1) hypotheses for testing correlation?
H0: ρ = 0 (no association between the variables)
H1: ρ ≠ 0 (there is an association)
What test statistic is used to test correlation?
A 𝑡-test with 𝑛−2 degrees of freedom.
How do you compute a two-tailed 𝑝-value for correlation in R?
2 * pt(q = abs(t_stat), df = n - 2, lower.tail = FALSE)
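The line above assumes objects t_stat and n already exist; a minimal sketch of how they might be computed from the hgt data used earlier, using the standard formula t = r√(n−2)/√(1−r²):
n      <- nrow(hgt)                                        # number of paired observations
r      <- cor(hgt$TotalHeight, hgt$LLL)                    # Pearson correlation
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)                  # test statistic with n-2 df
2 * pt(q = abs(t_stat), df = n - 2, lower.tail = FALSE)    # two-tailed p-value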
What function in R performs a correlation test?
cor.test(x = hgt$TotalHeight, y = hgt$LLL)
What is the alternative hypothesis in a correlation test?
H1: ρ ≠ 0 (the true correlation is not zero, meaning an association exists).
How do we interpret a very small 𝑝-value in a correlation test?
It suggests strong evidence against the null hypothesis, meaning there is likely an association between the variables.
What does the confidence interval in a correlation test represent?
It provides a range within which the true population correlation coefficient (𝜌) is likely to lie.
Why is correlation not the same as causation?
Correlation only shows an association, but a causal link requires further evidence, such as controlled experiments.
What is a “spurious” or “nonsense” correlation?
A correlation between two variables that occurs due to chance or a hidden third variable rather than a causal relationship.
What are three possible explanations for a correlation?
Chance (random coincidence)
A third variable affecting both
Genuine causation
What is Anscombe’s quartet?
A set of four datasets that have the same correlation coefficient but different distributions, illustrating the limitations of correlation.
What is the correlation coefficient (𝑟) for each pair in Anscombe’s quartet?
r ≈ 0.816 for all four x–y pairs, despite vastly different data patterns.
What does Anscombe's quartet demonstrate about correlation?
That correlation alone does not capture the true nature of relationships between variables; visualization is essential.
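R ships the quartet as the built-in data frame anscombe (columns x1–x4 and y1–y4), so the claim is easy to check:
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
# all four correlations are approximately 0.816, yet plots of the pairs look completely different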
What does the datasauRus dataset illustrate?
That datasets with different structures can have identical correlation values, emphasizing the need for visualization.
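A sketch of the same check, assuming the datasauRus package is installed; its datasaurus_dozen data frame has columns dataset, x and y:
library(datasauRus)
by(datasaurus_dozen, datasaurus_dozen$dataset, function(d) cor(d$x, d$y))
# every dataset (including the dinosaur) has essentially the same correlation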
What are the three main reasons to study relationships between variables?
Description – To describe patterns in data.
Explanation – To understand causal relationships.
Prediction – To estimate unknown values.
What is simple linear regression?
A statistical method to model the relationship between one explanatory variable and one response variable.
What is the general form of a simple linear regression equation?
Y = β0 + β1X + ε
Y is the response variable,
X is the explanatory variable,
β0 is the intercept,
β1 is the slope,
ε is the error term.
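To make the pieces concrete, here is a minimal simulation sketch of this model; the parameter values and variable names are chosen purely for illustration:
set.seed(1)
n     <- 100
beta0 <- 2                      # true intercept
beta1 <- 0.5                    # true slope
x     <- runif(n, 0, 10)        # explanatory variable
eps   <- rnorm(n, 0, 1)         # error term, N(0, sigma^2)
y     <- beta0 + beta1 * x + eps
coef(lm(y ~ x))                 # estimates should be close to 2 and 0.5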
What are the assumptions of simple linear regression?
Linearity – The relationship between 𝑋 and 𝑌 is linear.
Independence – Observations are independent.
Homoscedasticity – Constant variance of errors.
Normality – Errors are normally distributed.
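A minimal sketch of how these assumptions are usually checked, assuming a fitted model object such as simple.lm <- lm(y ~ x, data = exData) (the example model used in later cards):
plot(simple.lm, which = 1)   # residuals vs fitted values: checks linearity and constant variance
plot(simple.lm, which = 2)   # normal Q-Q plot of the residuals: checks the normality assumption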
What is the dependent (response) variable in a regression model?
The variable we aim to explain or predict (denoted as 𝑌).
What is the independent (predictor) variable in a regression model?
The variable used to explain or predict the response variable (denoted as 𝑋)
What does the intercept (𝛽0) represent in a regression model?
It is the expected value of 𝑌 when 𝑋=0, or where the regression line crosses the y-axis.
When might the intercept (𝛽0) not be meaningful?
When the explanatory variable (𝑋) cannot realistically take a value of zero (e.g., age in a salary regression).
What does the slope (𝛽1) represent in a regression model?
It describes the expected change in 𝑌 for a one-unit increase in 𝑋.
What does the sign of the slope (𝛽1) indicate?
𝛽1>0 → Positive relationship
𝛽1=0 → No relationship
𝛽1<0 → Negative relationship
What is the error term (𝜖𝑖) in a regression model?
It represents the difference between the observed and predicted values of 𝑌, accounting for variability not explained by 𝑋.
What assumption is made about the error term in simple linear regression?
Errors (εi) are assumed to be normally distributed with mean zero: εi ~ N(0, σ²).
What is the least squares (LS) criterion in regression?
It finds the line that minimizes the sum of squared residuals (SSR), ensuring the best fit for the data.
Why is the least squares method preferred?
It ensures that the fitted line has the smallest possible sum of squared differences between observed and predicted values, leading to optimal parameter estimates.
What does the least squares criterion minimize in regression?
The sum of squared residuals (SSR), ensuring the best-fitting line.
What are residuals in regression?
The vertical differences between observed data points and the fitted regression line.
What is the formula for estimating the intercept (𝛽0) in simple linear regression?
β̂0 = ȳ − β̂1x̄, where ȳ and x̄ are the sample means and β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² is the estimated slope.
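A minimal sketch checking these formulas by hand against lm(), using the hypothetical exData data frame (columns x and y) that appears in the lm() card below:
b1 <- sum((exData$x - mean(exData$x)) * (exData$y - mean(exData$y))) /
      sum((exData$x - mean(exData$x))^2)        # least-squares slope
b0 <- mean(exData$y) - b1 * mean(exData$x)      # least-squares intercept
coef(lm(y ~ x, data = exData))                  # should match b0 and b1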
Why is least squares estimation preferred?
It provides optimal estimates and coincides with maximum likelihood estimation under normality assumptions.
What R function is used to fit a simple linear regression model?
lm(), which stands for linear model.
What is the basic syntax for fitting a linear model in R?
simple.lm <- lm(y ~ x, data=exData)
How can you retrieve the coefficients (𝛽0,𝛽1) from a fitted linear model in R?
By printing the model object (simple.lm) or by calling coef(simple.lm).
Given the fitted model:
ŷi = 12.426 + 1.902xi
what is the predicted value when x = 50?
ŷ = 12.426 + (1.902 × 50) = 107.526
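The same prediction can be obtained with predict(), assuming the fitted object simple.lm from the earlier card is the model with these coefficients:
predict(simple.lm, newdata = data.frame(x = 50))
# equivalent to computing 12.426 + 1.902 * 50 = 107.526 by hand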
What function in R gives predicted values for the fitted regression model?
fitted()
Why is it risky to use a regression model to make predictions outside the range of observed 𝑥 values?
The relationship may not remain linear beyond the observed data, leading to inaccurate predictions.
What does the Residual Standard Error (RSE) in an R regression summary represent?
The estimated standard deviation of residuals, which measures the spread of observed values around the fitted regression line.
What does the Adjusted R-squared value account for in regression?
It adjusts for the number of predictors, providing a more accurate measure of model fit when multiple explanatory variables are present.
What does the Multiple R-squared value in an R regression summary indicate?
The proportion of variance in the response variable explained by the explanatory variable(s).
How is the F-statistic in an R regression summary interpreted?
It tests whether at least one predictor variable is significantly associated with the response variable.
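All of these quantities are reported by summary(); a minimal sketch of where to find them programmatically, assuming the fitted object simple.lm:
s <- summary(simple.lm)
s$sigma          # residual standard error (RSE)
s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
s$fstatistic     # F-statistic with its degrees of freedom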
What is the null hypothesis for testing the significance of a regression coefficient?
H0: βi =0, meaning the predictor has no effect on the response variable.
What does a very small p-value (e.g., < 0.001) for a regression coefficient indicate?
Strong evidence against 𝐻0, suggesting the predictor is statistically significant.
In the R summary output, what does the Significance Codes section indicate?
It categorizes p-values using ***, **, *, and . to show different levels of statistical significance.
What does a Residuals section in an R regression summary show?
The distribution of residuals (errors), including minimum, 1st quartile, median, 3rd quartile, and maximum values.
Given the regression equation:
Predicted WeightA = 102.18 + 1.72 × SST
what does the slope coefficient 1.72 mean?
For each 1-unit increase in SST, the predicted WeightA increases by 1.72 units on average.
What does the Residual Standard Error = 2.093 in an R summary output mean?
On average, the actual WeightA values deviate by about 2.093 units from the predicted values.
What is goodness of fit in regression?
It refers to how well the model explains the variability in the response variable.
How is R² interpreted in regression?
R² ranges from 0 to 1:
R² = 1 → Perfect fit
R² = 0 → Model explains no variability
R² = 0.7954 → Model explains 79.54% of the response variable’s variability.
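A minimal sketch of computing R² directly from a fitted model (assuming simple.lm and the exData data frame from the earlier cards):
ss_res <- sum(resid(simple.lm)^2)               # unexplained (residual) sum of squares
ss_tot <- sum((exData$y - mean(exData$y))^2)    # total sum of squares
1 - ss_res / ss_tot                             # equals summary(simple.lm)$r.squared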
What does the ANOVA table show in regression analysis?
It decomposes total variability into:
Model Sum of Squares (SSModel) → Explained variation
Residual Sum of Squares (SSRes) → Unexplained variation
F-statistic → Measures model significance
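In R this table is produced by anova() applied to the fitted model (assuming simple.lm from the earlier cards):
anova(simple.lm)   # one row for the predictor (model SS) and one for Residuals, with the F-statistic and p-value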
How is the F-statistic in ANOVA related to the t-statistic for a predictor?
F=t²
Example: If t = 11.99, then F = (11.99)² = 143.9.
What does a high F-statistic and a small p-value indicate?
It suggests that at least one predictor variable significantly explains variation in the response.
What does Residual Standard Error (RSE) represent in the regression summary?
It is the standard deviation of residuals, measuring how much actual values deviate from predictions.
What are the key takeaways from regression analysis?
Regression models explain relationships between variables.
R² measures how much variation is explained.
ANOVA tests overall model significance.
F-statistic determines if predictors improve the model.
RSE shows the spread of residuals.
How is Mean Squared Error (MSE) related to RSE?
MSE = RSE² (equivalently, RSE = √MSE).
Example: If RSE = 2.093, then MSE = 2.093² ≈ 4.38.
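A minimal sketch of the relationship, assuming the fitted object simple.lm; for simple linear regression the residual degrees of freedom are n − 2:
rse <- sqrt(sum(resid(simple.lm)^2) / df.residual(simple.lm))   # residual standard error
mse <- rse^2                                                    # residual mean square (MSE)
c(RSE = rse, MSE = mse)                                         # RSE matches summary(simple.lm)$sigma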