Correlation and Regression Flashcards

1
Q

Pearson Bivariate Correlation Coefficient

Define and Calculate

A

A number that measures the strength and direction of the linear relationship between two continuous variables.
Calculation: divide the covariance of the two variables by the product of their standard deviations.
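
As a rough illustration of that calculation, the sketch below computes r as covariance divided by the product of the standard deviations. The data (hours studied and test scores) are made up, and the use of sample (n - 1) formulas is an assumption; the cards do not say which version is intended.

```python
import math

def pearson_r(xs, ys):
    """Pearson r = cov(x, y) / (sd_x * sd_y), using sample (n - 1) formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance of x and y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Sample standard deviations
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson_r(hours, scores)
print(round(r, 3))  # close to +1: a strong positive linear relationship
```

Because the (n - 1) terms cancel, the population formulas would give the same r.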

2
Q

Pearson Bivariate Correlation Coefficient

range and interpretation

A

The coefficient ranges from -1 to +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.

3
Q

Plotting Pearson

A

Linearity Assessment Plot: A plot used to assess the linearity of the relationship between two variables.
Helps determine if the relationship between variables is linear or nonlinear with a trend line.
Scatter plot is commonly used for this purpose.

The more tightly the points cluster around the trend line, the stronger the correlation.

4
Q

Z-scores

Definition

A

Z scores are a way to standardize data points by showing how many standard deviations they are from the mean.

A z score of +2 indicates that a data point is two standard deviations above the mean, while a z score of -1 indicates that a data point is one standard deviation below the mean.

5
Q

Standardizing with Z Scores

Meaning, procedure and purpose

A

Standardizing allows fair comparison of data on a common scale.
Meaning: It converts data into z scores, which have a mean of 0 and a standard deviation of 1.
Procedure: Subtract the mean from each value and divide by the standard deviation.
Purpose: Makes data measured on different scales comparable by putting them on the same scale.

6
Q

Why Standardize and Meaning

A

Standardizing allows fair comparison of data on a common scale.
It transforms data to have a mean of 0 and a standard deviation of 1.

7
Q

Calculate Z-scores

A

Calculation:
z = (X − μ) / σ

X is the raw score, μ is the mean of the distribution, and σ is the standard deviation.
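
A minimal sketch of the formula applied to a whole data set, assuming made-up test scores and the population standard deviation (swap in `statistics.stdev` if the data are a sample):

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, then divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population SD (an assumption for this sketch)
    return [(x - mu) / sigma for x in values]

raw = [60, 70, 80, 90, 100]  # hypothetical test scores
z = z_scores(raw)
print([round(v, 2) for v in z])
```

After standardizing, the z scores have a mean of 0 and a standard deviation of 1, and the score equal to the mean (80) maps to z = 0.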

8
Q

Standard Deviation

Define

A

A measure of the amount of variation or dispersion in a set of values

It’s a measure of how spread out numbers are in a set

9
Q

Interpretation of Standard Deviation

A

A low standard deviation indicates that the data points tend to be close to the mean, while a high standard deviation indicates that the data points are spread out over a wider range of values.

10
Q

Why is it good for data to cluster around the mean?

Reliability, comparability and predictability

A

Reliability: There are fewer extreme values or outliers that could skew the interpretation of the data.
Predictability: It is easier to predict future outcomes or estimate probabilities, because there is less uncertainty or variability in the data.
Comparability: When data points are spread out, it can be challenging to compare different groups or datasets, but when they are close to the mean, comparisons become more straightforward.

11
Q

Calculate Standard Deviation

A

Take the square root of the variance.

12
Q

Variance

Define

A

A measure of how spread out or dispersed the values in a data set are from the mean.

13
Q

Calculate the Variance

A

Take the average of the squared differences between each data point and the mean (dividing by n for a population, or by n − 1 for a sample).
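
The calculation can be sketched in a few lines. The data set is made up, and the `sample` flag is a hypothetical convenience for switching between the n and n - 1 denominators:

```python
import math

def variance(values, sample=False):
    """Average squared deviation from the mean (n - 1 denominator if sample=True)."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    return sum(squared_diffs) / (n - 1 if sample else n)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with a mean of 5
var = variance(data)             # squared differences sum to 32; 32 / 8 = 4.0
sd = math.sqrt(var)              # standard deviation = square root of variance = 2.0
print(var, sd)
```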

14
Q

Interpretation of the variance

A

A larger variance indicates greater variability or dispersion in the data set, while a smaller variance suggests that the data points are closer to the mean.

e.g. If the variance of a set of test scores is 25, the average squared difference between a score and the mean is 25 (in squared units). Taking the square root gives a standard deviation of 5, which is back in the original units.

15
Q

Using Z-Scores to Identify and Deal with Outliers

A

Standardization: Transforming data into z-scores with a mean of 0 and standard deviation of 1.
Thresholds: Outliers are defined as z-scores beyond a chosen threshold (e.g., |z| > 2, or the stricter |z| > 3).
Comparability: Allows fair comparison of outliers across datasets.
Data Cleaning: Outliers identified using z-scores can be examined for errors or significance.
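
The steps above can be sketched as follows. The reaction times are invented, the default threshold of 2 follows the example in this card, and `flag_outliers` is a hypothetical helper name:

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population SD (an assumption for this sketch)
    return [x for x in values if abs((x - mu) / sigma) > threshold]

# Hypothetical reaction times in milliseconds; 900 looks suspicious.
reaction_times = [310, 295, 320, 305, 300, 315, 290, 900]
outliers = flag_outliers(reaction_times)
print(outliers)  # flagged values should be inspected, not deleted automatically
```

Note that an extreme value inflates the mean and standard deviation it is judged against, so in small samples a genuine outlier can fail to cross the threshold.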

16
Q

Correlation Coefficient (r)

A

Measures the strength and direction of the linear relationship between two variables.
Range: Between -1 and +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.

17
Q

Interpretation of correlation coefficient

A

Magnitude: It shows how strong the relationship is. If |r| is closer to 1, the relationship is stronger. If it’s closer to 0, the relationship is weaker.

e.g. Correlation Coefficient (-0.42): Indicates a moderate negative linear relationship between the variables being studied.

18
Q

Significance Level (p-value)

A

The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.

It helps judge whether an observed effect is likely to be genuine or could plausibly be due to random chance.

19
Q

Interpretation of p-value

A

A lower p-value suggests stronger evidence against the null hypothesis: a result this extreme would be unlikely if the null hypothesis were true.

Typically, a significance level of 0.05 (5%) is used. If the p-value is less than this threshold, the result is considered statistically significant.

20
Q

Cohen’s Rules of Thumb for Magnitude of Correlation

A

Definition: Guidelines for interpreting the strength of correlation coefficients.
Small: Magnitude around 0.10.
Moderate: Magnitude around 0.30.
Large: Magnitude around 0.50.

-0.42 = moderate negative correlation

21
Q

Correlation Coefficient (0.15)

A

Since 0.15 is closer to 0.10, it would be considered a small correlation according to Cohen’s rules of thumb

22
Q

Correlation Coefficient (-0.60)

A

Since the magnitude of -0.60 (that is, 0.60) is greater than 0.50, it would be considered a large negative correlation.

23
Q

Correlation Coefficient (0.35)

A

Since 0.35 falls between 0.30 and 0.50, it would likely be considered a moderate positive correlation

24
Q

Correlation Coefficient (-0.25)

A

Since -0.25 falls between -0.10 and -0.30, it would likely be considered a small negative correlation.
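
The worked examples in the cards above can be collected into one small function. Treating 0.10, 0.30 and 0.50 as lower bounds of the small, moderate and large bands is an assumption made here to match those examples; Cohen's values are rules of thumb, not hard cut-offs, and `cohen_label` is a hypothetical name.

```python
def cohen_label(r):
    """Label a correlation by magnitude (Cohen's rules of thumb) and direction."""
    size = abs(r)  # strength depends on magnitude, not sign
    if size >= 0.50:
        strength = "large"
    elif size >= 0.30:
        strength = "moderate"
    elif size >= 0.10:
        strength = "small"
    else:
        strength = "negligible"
    if r == 0:
        return "no linear relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

for value in (0.15, -0.60, 0.35, -0.25, -0.42):
    print(value, "->", cohen_label(value))
```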

25
Q

Simple Regression Model

A

A statistical technique used to model the relationship between one predictor variable and one outcome variable.
To understand how changes in the predictor variable are associated with changes in the outcome variable.

26
Q

Simple Regression

A

Examines how changes in one variable (predictor) are associated with changes in another variable (outcome).
Used to predict the value of the outcome variable based on the value of the predictor variable.
Provides an equation that describes the relationship and allows for making predictions.
Treats one variable as the predictor and the other as the outcome, but this direction alone does not establish causality; causal claims require appropriate conditions (e.g., random assignment in an experiment).

27
Q

Regression Equation

A

Y = bX + c + e

28
Q

Regression Equation Explained

A

Y represents the outcome variable (e.g., test scores).
X represents the predictor variable (e.g., study time).
b is the regression coefficient, which indicates the change in Y for a one-unit change in X.
c is the intercept, representing the predicted value of Y when X is zero.
e is the error term, representing the difference between the observed Y and the predicted Y based on the regression equation.
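
To make the pieces concrete, the sketch below estimates b and c by ordinary least squares, the usual fitting method (the cards do not name one, so this is an assumption), and then computes the residuals as estimates of e. The study-time data are invented.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates of slope b and intercept c in Y = bX + c + e."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # change in Y per one-unit change in X
    c = mean_y - b * mean_x    # predicted Y when X is zero
    return b, c

# Hypothetical data: study time (hours) and test scores.
study_time = [1, 2, 3, 4, 5]
test_scores = [52, 55, 61, 64, 70]
b, c = fit_simple_regression(study_time, test_scores)

# Residuals: observed Y minus predicted Y for each data point.
residuals = [y - (b * x + c) for x, y in zip(study_time, test_scores)]
print(round(b, 2), round(c, 2), [round(e, 2) for e in residuals])
```

Here b comes out at about 4.5 points per extra hour and c at about 46.9, and the residuals sum to zero, as they always do for a least-squares line with an intercept.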

29
Q

Regression Model in an example: attention span = b(screen time) + c + e

What are b, c and e?

A

The outcome variable is “attention span”.
The predictor variable is “screen time”.
b is the regression coefficient representing the effect of screen time on attention span.
c is the intercept representing the value of attention span when screen time is zero.
e is the error term representing unexplained variability in attention span.

30
Q

R-squared

A

It shows how well the predictor variable(s) explain the outcome variable in a regression model.

31
Q

Range of R-squared

What does an R-squared value of 0.40 mean?

A

Between 0 and 1, where 0 means the predictor explains none of the variation in the outcome and 1 means it explains all of it.

An R-squared of 0.40 means that 40% of the variation in the outcome is explained by the predictor.

32
Q

R-squared Relation to Correlation Coefficient

A

In simple regression (one predictor), R-squared equals the square of the correlation coefficient.

Example: If the correlation coefficient is -0.42, the R-squared would be
(−0.42)^2 = 0.1764, i.e. 17.64% of the variance in the outcome is explained.
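
A quick numerical check of this relationship, using made-up screen-time and attention-span values: the sketch computes r directly, computes R-squared separately as 1 minus the ratio of residual to total sum of squares from a least-squares line, and compares the two.

```python
def r_and_r_squared(xs, ys):
    """Return (r, R-squared), computing R-squared from the fitted line's residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)   # total sum of squares
    r = sxy / (sxx * syy) ** 0.5
    b = sxy / sxx
    c = mean_y - b * mean_x
    ss_res = sum((y - (b * x + c)) ** 2 for x, y in zip(xs, ys))
    return r, 1 - ss_res / syy

# Hypothetical data, invented for illustration.
screen_time = [1, 2, 3, 4, 5, 6]
attention = [9, 7, 8, 5, 6, 3]
r, r_squared = r_and_r_squared(screen_time, attention)
print(round(r, 3), round(r ** 2, 3), round(r_squared, 3))
```

The sign of r is lost when squaring, so R-squared alone does not say whether the relationship is positive or negative.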

33
Q

Multiple regression

A

A statistical technique used to examine the relationship between one dependent variable and two or more independent variables.
Assumptions include linearity, independence of errors, homoscedasticity, and normality of the residuals.

34
Q

How do you interpret the regression coefficients in multiple regression?

A

Regression coefficients represent the change in the dependent variable for a one-unit change in the predictor variable, holding other variables constant.

35
Q

How do you assess the overall fit of a multiple regression model?

A

The overall fit can be assessed using measures such as R-squared, which indicates the proportion of variance explained by the model, and adjusted R-squared, which also penalizes the model for the number of predictors it uses.
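
As a small illustration, the standard adjustment formula is 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The input values below are hypothetical.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Hypothetical model: R-squared of 0.40 from 50 observations and 3 predictors.
adj = adjusted_r_squared(0.40, n=50, p=3)
print(round(adj, 3))  # slightly below 0.40
```

Adding predictors never lowers plain R-squared, so the adjusted version is the fairer basis for comparing models with different numbers of predictors.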

36
Q

Importance of Checking Scatterplot before Reporting Correlation Coefficient

A

To visually inspect the data and check assumptions before interpreting the correlation coefficient.
Helps detect nonlinearity and outliers.
Ensures accurate interpretation of the correlation.

Before reporting a correlation coefficient between two variables, examining a scatterplot allows you to identify any nonlinear relationships or outlier points that may affect the interpretation of the correlation.

37
Q

Third Variable Problem in Correlation

A

The presence of a third variable that influences both variables being correlated, leading to a spurious or misleading correlation.

38
Q

How to mitigate Third Variable Problem in Correlation

A

Control for or consider potential third variables to accurately interpret the correlation between two variables. Use techniques like partial correlation or regression analysis to account for the influence of third variables.

39
Q

Partial Correlation

A

A statistical technique used to assess the relationship between two variables while controlling for the effects of one or more additional variables.

Calculates the correlation coefficient between two variables after statistically removing the influence of one or more covariates.
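
For a single covariate Z, the first-order partial correlation can be computed from the three pairwise correlations. The sketch below uses that standard formula; the correlation values and the variable labels are invented for illustration.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation between X and Y after removing the linear influence of Z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical example: X = ice cream sales, Y = drownings, Z = temperature.
r_partial = partial_corr(r_xy=0.60, r_xz=0.80, r_yz=0.70)
print(round(r_partial, 3))  # much smaller than the raw 0.60
```

In this made-up case most of the raw correlation disappears once Z is controlled for, which is the pattern expected when a third variable drives both X and Y.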

40
Q

Regression Analysis

A

A statistical method that examines the relationship between one dependent variable and one or more independent variables.

Identifies how changes in the independent variables are associated with changes in the dependent variable.

Fits a regression model to the data, estimating the coefficients that represent the strength and direction of the relationships between variables.

41
Q

Difference between Positive and Negative Correlation

A

Positive: Both variables increase together (e.g., height and weight).
Negative: One variable increases, the other decreases (e.g., alcohol consumption and memory recall).

42
Q

Checking for Influence of Outliers on Correlation Coefficient

A

1) Plot the data using a scatterplot to visually identify outliers.
2) Calculate the correlation coefficient with and without outliers to observe changes in its magnitude.
3) Conduct sensitivity analyses by removing outliers and re-calculating the correlation coefficient.

If the correlation coefficient changes substantially after removing outliers, it suggests that outliers may have influenced the correlation.
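
Step 2 can be sketched as below. The data are invented, with one extreme point added on purpose, and which points count as outliers is assumed to have been decided already (for example from the scatterplot).

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data; the last point (20, 20) is an extreme outlier.
x = [1, 2, 3, 4, 5, 6, 20]
y = [3, 1, 4, 2, 5, 3, 20]

r_all = pearson_r(x, y)                # with the outlier
r_trimmed = pearson_r(x[:-1], y[:-1])  # without the outlier
print(round(r_all, 2), round(r_trimmed, 2))
```

Here a single point turns a fairly weak correlation into one that looks very strong, so both values are worth reporting.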

43
Q

Cohen’s Criteria for Small, Medium, and Large Correlation Coefficients

A

Small: 0.1
Medium: 0.3
Large: 0.5

44
Q

Interpretation of a confidence interval for a correlation coefficient

A

If the confidence interval includes zero, the correlation coefficient is not statistically significant at the specified confidence level.
If the confidence interval does not include zero, the correlation coefficient is statistically significant at the specified confidence level.
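When we say “includes zero,” we mean the interval spans zero (the lower bound is negative and the upper bound is positive). An interval that does not include zero falls entirely on one side of zero.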

45
Q

Confidence Interval

A

When we calculate a correlation coefficient between two variables, we also calculate something called a confidence interval.
The confidence interval tells us a range of values within which we are reasonably confident the true correlation lies.

46
Q

A Correlation coefficient of 0.50 with a 95% confidence interval of (0.30, 0.70)

A

Correlation coefficient of 0.50 with a 95% confidence interval of (0.30, 0.70) indicates that we are 95% confident that the true correlation lies between 0.30 and 0.70.
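
One common way to obtain such an interval is the Fisher z-transformation; the cards do not say which method is used, so this is an assumption. The sample size below (n = 50) is hypothetical, and 1.96 is the usual normal critical value for 95% confidence.

```python
import math

def correlation_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for r via the Fisher z-transformation."""
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    lower = math.tanh(z - z_crit * se)  # transform the bounds back to the r scale
    upper = math.tanh(z + z_crit * se)
    return lower, upper

lower, upper = correlation_ci(r=0.50, n=50)
includes_zero = lower <= 0 <= upper
print(round(lower, 2), round(upper, 2), includes_zero)
```

The resulting interval is not symmetric around r, and it narrows as the sample size grows.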

47
Q

X and Y

Components of Linear Regression Model

A

Dependent Variable (Y): The variable being predicted or explained by the independent variables.
Independent Variables (X): The variables used to predict or explain changes in the dependent variable.

48
Q

Regression Coefficients (β)

Components of Linear Regression Model

A

The parameters representing the relationship between each independent variable and the dependent variable.

49
Q

Intercept (β₀)

Components of Linear Regression Model

A

The constant term in the regression equation, representing the value of the dependent variable when all independent variables are zero.


50
Q

Residuals (ε)

Components of Linear Regression Model

A

The differences between the observed and predicted values of the dependent variable.

51
Q

Error Term (ε)

Components of Linear Regression Model

A

Represents the variability in the dependent variable that cannot be explained by the independent variables.

52
Q

Correlation Matrix

A

A table that shows the correlation coefficients between multiple variables in a dataset.

Square matrix where each row and column represents a variable.
The cells contain correlation coefficients, showing the strength and direction of relationships between variables.

53
Q

Interpretation of Correlation Matrix

A

Values range from -1 to 1.
Positive values indicate a positive correlation (variables move in the same direction).
Negative values indicate a negative correlation (variables move in opposite directions).
Values closer to 1 or -1 represent stronger correlations, while values closer to 0 represent weaker correlations.
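
A minimal sketch of building and reading such a matrix, using invented sleep, stress and mood scores. `correlation_matrix` is a hypothetical helper that returns a nested dictionary rather than a real matrix type.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient from sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def correlation_matrix(data):
    """Correlate every variable with every other; data maps names to value lists."""
    names = list(data)
    return {a: {b: pearson_r(data[a], data[b]) for b in names} for a in names}

# Hypothetical data for five people.
data = {
    "sleep": [6, 7, 8, 5, 9],
    "stress": [7, 5, 4, 8, 2],
    "mood": [5, 6, 8, 4, 9],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

The diagonal is always 1 (each variable correlates perfectly with itself) and the matrix is symmetric, so only the values on one side of the diagonal need to be read.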

54
Q

Coefficients

Slope (b) and intercept (c)

A

Slope (b): Represents the change in the dependent variable for a one-unit change in the independent variable.
Intercept (c): Represents the predicted value of the dependent variable when all independent variables are zero.