Correlation Flashcards

1
Q

What is correlation in statistics?

A

Correlation quantifies the extent of association between two continuous variables.

2
Q

How is correlation different from regression?

A

Correlation measures the strength of a relationship between two variables, while regression explains one variable in terms of another with an equation.

3
Q

What is Pearson’s correlation coefficient (𝑟)?

A

A measure of linear correlation between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).

4
Q

What does a correlation coefficient of 0 mean?

A

It means there is no linear relationship between the two variables.

5
Q

What is a perfect correlation?

A

A perfect correlation occurs when all data points fall exactly on a straight line, with 𝑟 = ±1.

6
Q

What does concurvity refer to?

A

It describes a non-linear association between two continuous variables.

7
Q

In R, how can you compute the correlation coefficient between two variables, LLL and TotalHeight?

A

diffx <- hgt$LLL - mean(hgt$LLL)                  # deviations from the mean of LLL

diffy <- hgt$TotalHeight - mean(hgt$TotalHeight)  # deviations from the mean of TotalHeight

r <- sum(diffx * diffy) / sqrt(sum(diffx^2) * sum(diffy^2))  # Pearson's r
print(r)

8
Q

What is a simpler way to compute correlation in R?

A

cor(x = hgt$TotalHeight, y = hgt$LLL)
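As a quick sketch (with made-up numbers standing in for hgt$TotalHeight and hgt$LLL, since the hgt data are not shown here), the manual formula from the previous card and cor() agree:

```r
# Hypothetical toy data in place of hgt$LLL and hgt$TotalHeight
x <- c(12.1, 14.3, 15.0, 16.8, 18.2, 20.5)
y <- c(1.9, 2.4, 2.6, 3.1, 3.3, 3.9)

# Manual computation, as in the previous card
diffx <- x - mean(x)
diffy <- y - mean(y)
r_manual <- sum(diffx * diffy) / sqrt(sum(diffx^2) * sum(diffy^2))

# Built-in equivalent
r_builtin <- cor(x = x, y = y)

all.equal(r_manual, r_builtin)  # TRUE
```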

9
Q

What is covariance?

A

A measure of how two variables vary together; it forms the numerator of the correlation coefficient before standardization.

10
Q

How is correlation different from covariance?

A

Correlation standardizes covariance to a range of -1 to +1, making it comparable across different units.
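A one-line sketch with toy data showing the standardization: dividing the covariance by both standard deviations gives the correlation.

```r
x <- c(2, 4, 6, 8, 10)
y <- c(1.2, 2.1, 2.8, 4.4, 5.1)

# correlation = covariance standardized by both standard deviations
r_from_cov <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_from_cov, cor(x, y))  # TRUE
```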

11
Q

What are the null (𝐻0) and alternative (𝐻1) hypotheses for testing correlation?

A

H0: ρ = 0 (no association between the variables)

H1: ρ ≠ 0 (there is an association)

12
Q

What test statistic is used to test correlation?

A

A t-test with n − 2 degrees of freedom: t = r√(n − 2) / √(1 − r²).

13
Q

How do you compute a two-tailed 𝑝-value for correlation in R?

A

2 * pt(q = abs(t_stat), df = n - 2, lower.tail = FALSE)

14
Q

What function in R performs a correlation test?

A

cor.test(x = hgt$TotalHeight, y = hgt$LLL)
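A sketch tying the last few cards together, on simulated data (the hgt data are not shown here): the manual t statistic and two-tailed p-value reproduce the p-value reported by cor.test().

```r
set.seed(1)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)  # hypothetical related variables
n <- length(x)

r <- cor(x, y)
# t statistic with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(q = abs(t_stat), df = n - 2, lower.tail = FALSE)

ct <- cor.test(x = x, y = y)
all.equal(p_manual, ct$p.value)  # TRUE (same test)
```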

15
Q

What is the alternative hypothesis in a correlation test?

A

H1: ρ ≠ 0 (the true correlation is not zero, meaning an association exists).

16
Q

How do we interpret a very small 𝑝-value in a correlation test?

A

It suggests strong evidence against the null hypothesis, meaning there is likely an association between the variables.

17
Q

What does the confidence interval in a correlation test represent?

A

It provides a range within which the true population correlation coefficient (𝜌) is likely to lie.

18
Q

Why is correlation not the same as causation?

A

Correlation only shows an association, but a causal link requires further evidence, such as controlled experiments.

19
Q

What is a “spurious” or “nonsense” correlation?

A

A correlation between two variables that occurs due to chance or a hidden third variable rather than a causal relationship.

20
Q

What are three possible explanations for a correlation?

A

Chance (random coincidence)

A third variable affecting both

Genuine causation

21
Q

What is Anscombe’s quartet?

A

A set of four datasets that have the same correlation coefficient but different distributions, illustrating the limitations of correlation.

22
Q

What is the correlation coefficient (𝑟) for each pair in Anscombe’s quartet?

A

r=0.816 for all pairs, despite vastly different data patterns.
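Anscombe's quartet ships with base R as the anscombe data frame, so this is easy to verify:

```r
# anscombe (columns x1..x4, y1..y4) is in R's built-in datasets package
data(anscombe)
r_vals <- sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
round(r_vals, 3)  # all four round to 0.816
```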

23
Q

What does Anscombe’s quartet demonstrate about correlation?

A

That correlation alone does not capture the true nature of relationships between variables; visualization is essential.

24
Q

What does the datasauRus dataset illustrate?

A

That datasets with different structures can have identical correlation values, emphasizing the need for visualization.

25
Q

What are the three main reasons to study relationships between variables?

A

Description – To describe patterns in data.

Explanation – To understand causal relationships.

Prediction – To estimate unknown values.

26
Q

What is simple linear regression?

A

A statistical method to model the relationship between one explanatory variable and one response variable.

27
Q

What is the general form of a simple linear regression equation?

A

Y = β0 + β1X + ε

where:

Y is the response variable,

X is the explanatory variable,

β0 is the intercept,

β1 is the slope,

ε is the error term.
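A minimal simulation sketch (hypothetical parameter values) showing data generated from this model and the parameters recovered with lm():

```r
set.seed(42)
n <- 200
beta0 <- 2; beta1 <- 0.5; sigma <- 1   # hypothetical true parameters
x <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = sigma)  # error term
y <- beta0 + beta1 * x + eps

fit <- lm(y ~ x)
coef(fit)  # estimates close to the true (2, 0.5)
```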

28
Q

What are the assumptions of simple linear regression?

A

Linearity – The relationship between 𝑋 and 𝑌 is linear.

Independence – Observations are independent.

Homoscedasticity – Constant variance of errors.

Normality – Errors are normally distributed.

29
Q

What is the dependent (response) variable in a regression model?

A

The variable we aim to explain or predict (denoted as 𝑌).

30
Q

What is the independent (predictor) variable in a regression model?

A

The variable used to explain or predict the response variable (denoted as 𝑋)

31
Q

What does the intercept (𝛽0) represent in a regression model?

A

It is the expected value of 𝑌 when 𝑋=0, or where the regression line crosses the y-axis.

32
Q

When might the intercept (𝛽0) not be meaningful?

A

When the explanatory variable (𝑋) cannot realistically take a value of zero (e.g., age in a salary regression).

33
Q

What does the slope (𝛽1) represent in a regression model?

A

It describes the expected change in 𝑌 for a one-unit increase in 𝑋.

34
Q

What does the sign of the slope (𝛽1) indicate?

A

𝛽1>0 → Positive relationship

𝛽1=0 → No relationship

𝛽1<0 → Negative relationship

35
Q

What is the error term (𝜖𝑖) in a regression model?

A

It represents the difference between the observed and predicted values of 𝑌, accounting for variability not explained by 𝑋.

36
Q

What assumption is made about the error term in simple linear regression?

A

Errors (εᵢ) are assumed to be normally distributed with mean zero: εᵢ ~ N(0, σ²)

37
Q

What is the least squares (LS) criterion in regression?

A

It finds the line that minimizes the sum of squared residuals (SSR), ensuring the best fit for the data.

38
Q

Why is the least squares method preferred?

A

It ensures that the fitted line has the smallest possible sum of squared differences between observed and predicted values, leading to optimal parameter estimates.

39
Q

What does the least squares criterion minimize in regression?

A

The sum of squared residuals (SSR), ensuring the best-fitting line.

40
Q

What are residuals in regression?

A

The vertical differences between observed data points and the fitted regression line.

41
Q

What is the formula for estimating the intercept (𝛽0) in simple linear regression?

A

β̂0 = ȳ − β̂1x̄
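A sketch with made-up data checking the closed-form least squares estimates against lm():

```r
set.seed(7)
x <- rnorm(25, mean = 10, sd = 2)
y <- 3 + 1.5 * x + rnorm(25)  # hypothetical data

# closed-form least squares estimates
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0_hat, b1_hat))  # TRUE
```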

42
Q

Why is least squares estimation preferred?

A

It provides optimal estimates and coincides with maximum likelihood estimation under normality assumptions.

43
Q

What R function is used to fit a simple linear regression model?

A

lm(), which stands for linear model.

44
Q

What is the basic syntax for fitting a linear model in R?

A

simple.lm <- lm(y ~ x, data=exData)

45
Q

How can you retrieve the coefficients (𝛽0,𝛽1) from a fitted linear model in R?

A

By printing the model object (simple.lm), or with coef(simple.lm).

46
Q

Given the fitted model:
ŷᵢ = 12.426 + 1.902xᵢ

what is the predicted value when 𝑥=50?

A

ŷ = 12.426 + (1.902 × 50) = 107.526
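The arithmetic, spelled out (coefficients taken from the card):

```r
# coefficients from the card's fitted model
b0 <- 12.426
b1 <- 1.902
y_hat <- b0 + b1 * 50
y_hat  # 107.526
```

With a fitted model object, predict() performs this same substitution for you.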

47
Q

What function in R gives predicted values for the fitted regression model?

A

predict(), e.g. predict(simple.lm).

48
Q

Why is it risky to use a regression model to make predictions outside the range of observed 𝑥 values?

A

The relationship may not remain linear beyond the observed data, so such extrapolation can give inaccurate predictions.

49
Q

What does the Residual Standard Error (RSE) in an R regression summary represent?

A

The estimated standard deviation of residuals, which measures the spread of observed values around the fitted regression line.

50
Q

What does the Adjusted R-squared value account for in regression?

A

It adjusts for the number of predictors, providing a more accurate measure of model fit when multiple explanatory variables are present.

51
Q

What does the Multiple R-squared value in an R regression summary indicate?

A

The proportion of variance in the response variable explained by the explanatory variable(s).
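A sketch on simulated data showing where Multiple R-squared comes from: one minus the ratio of unexplained to total variation.

```r
set.seed(3)
x <- rnorm(40)
y <- 2 * x + rnorm(40)  # hypothetical data
fit <- lm(y ~ x)

ss_res <- sum(resid(fit)^2)     # unexplained variation
ss_tot <- sum((y - mean(y))^2)  # total variation
r2_manual <- 1 - ss_res / ss_tot
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```

In simple linear regression, this value also equals cor(x, y)².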

52
Q

How is the F-statistic in an R regression summary interpreted?

A

It tests whether at least one predictor variable is significantly associated with the response variable.

53
Q

What is the null hypothesis for testing the significance of a regression coefficient?

A

H0: βi =0, meaning the predictor has no effect on the response variable.

54
Q

What does a very small p-value (e.g., < 0.001) for a regression coefficient indicate?

A

Strong evidence against 𝐻0, suggesting the predictor is statistically significant.

55
Q

In the R summary output, what does the Significance Codes section indicate?

A

It categorizes p-values using ***, **, *, and . to show different levels of statistical significance.

56
Q

What does a Residuals section in an R regression summary show?

A

The distribution of residuals (errors), including minimum, 1st quartile, median, 3rd quartile, and maximum values.

57
Q

Given the regression equation:
predicted WeightA = 102.18 + 1.72 × SST
what does the slope coefficient 1.72 mean?

A

For each 1-unit increase in SST, the predicted WeightA increases by 1.72 units on average.

58
Q

What does the Residual Standard Error = 2.093 in an R summary output mean?

A

On average, the actual WeightA values deviate by about 2.093 units from the predicted values.

59
Q

What is goodness of fit in regression?

A

It refers to how well the model explains the variability in the response variable.

60
Q

How is R² interpreted in regression?

A

R² ranges from 0 to 1:

R² = 1 → Perfect fit

R² = 0 → Model explains no variability

R² = 0.7954 → Model explains 79.54% of the response variable’s variability.

61
Q

What does the ANOVA table show in regression analysis?

A

It decomposes total variability into:

Model Sum of Squares (SSModel) → Explained variation

Residual Sum of Squares (SSRes) → Unexplained variation

F-statistic → Measures model significance

62
Q

How is the F-statistic in ANOVA related to the t-statistic for a predictor?

A

F=t²

Example: If t = 11.99, then F = (11.99)² ≈ 143.8.
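This identity is easy to verify on simulated data (variable names here are made up):

```r
set.seed(9)
x <- rnorm(50)
y <- 1 + 0.8 * x + rnorm(50)  # hypothetical data
fit <- lm(y ~ x)
s <- summary(fit)

t_slope <- s$coefficients["x", "t value"]
F_stat  <- s$fstatistic["value"]
all.equal(unname(F_stat), t_slope^2)  # TRUE for a single predictor
```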

63
Q

What does a high F-statistic and a small p-value indicate?

A

It suggests that at least one predictor variable significantly explains variation in the response.

64
Q

What does Residual Standard Error (RSE) represent in the regression summary?

A

It is the standard deviation of residuals, measuring how much actual values deviate from predictions.

65
Q

What are the key takeaways from regression analysis?

A

Regression models explain relationships between variables.

R² measures how much variation is explained.

ANOVA tests overall model significance.

F-statistic determines if predictors improve the model.

RSE shows the spread of residuals.

66
Q

How is Mean Squared Error (MSE) related to RSE?

A

MSE = RSE²
Example: If RSE = 2.093, then MSE = (2.093)² ≈ 4.38.
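A check on simulated data (made-up variables): squaring the residual standard error reported by summary() gives the residual mean square, i.e. the sum of squared residuals divided by n − 2.

```r
set.seed(5)
x <- rnorm(30)
y <- x + rnorm(30)  # hypothetical data
fit <- lm(y ~ x)
n <- length(y)

rse <- summary(fit)$sigma           # residual standard error
mse <- sum(resid(fit)^2) / (n - 2)  # residual mean square
all.equal(mse, rse^2)  # TRUE
```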