Correlation and Regression Flashcards
Pearson Bivariate Correlation Coefficient
Define and Calculate
A number that shows how two things are related in a straight line (the strength and direction of the linear relationship between two continuous variables).
Calculate: divide the covariance of the two variables by the product of their standard deviations.
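The calculation above can be sketched in Python; this is a minimal illustration using population covariance and standard deviations, with made-up data:

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # perfectly linear: r ≈ 1.0
print(pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # perfectly inverse: r ≈ -1.0
```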
Pearson Bivariate Correlation Coefficient
range and interpretation
The coefficient ranges from -1 to +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
Plotting Pearson
Scatter plot: the plot commonly used to assess the linearity of the relationship between two variables.
Adding a trend line helps determine whether the relationship is linear or nonlinear.
The more tightly the points cluster around the trend line, the stronger the correlation.
Z-scores
Definition
Z scores are a way to standardize data points by showing how many standard deviations they are from the mean.
a z score of +2 indicates that a data point is two standard deviations above the mean, while a z score of -1 indicates that a data point is one standard deviation below the mean.
Standardizing with Z Scores
Meaning, procedure, and purpose
Standardizing allows fair comparison of data on a common scale.
Meaning: It transforms data to have a mean of 0 and a standard deviation of 1, converting raw values into z-scores for fair comparison.
Procedure: Subtract the mean and divide by the standard deviation.
Purpose: Makes different data comparable by putting them on the same scale.
Why Standardize and Meaning
Standardizing allows fair comparison of data on a common scale.
It transforms data to have a mean of 0 and a standard deviation of 1.
Calculate Z-scores
Calculation:
(X- μ) / σ
X is the raw score, μ is the mean of the distribution, and σ is the standard deviation.
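The formula (X − μ) / σ can be sketched directly; the data and function name here are made up for illustration, and `statistics.pstdev` gives the population standard deviation:

```python
import statistics

def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

data = [50, 60, 70, 80, 90]
mu = statistics.mean(data)       # 70
sigma = statistics.pstdev(data)  # population SD, about 14.14
print(z_score(90, mu, sigma))    # about +1.41: 90 is ~1.4 SDs above the mean
```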
Standard Deviation
Define
A measure of the amount of variation or dispersion in a set of values
It’s a measure of how spread out numbers are in a set
Interpretation of Standard Deviation
A low standard deviation indicates that the data points tend to be close to the mean, while a high standard deviation indicates that the data points are spread out over a wider range of values.
Why is it good for data to cluster around the mean?
Reliability, comparability, and predictability
Reliability: There are fewer extreme values or outliers that could skew the interpretation of the data.
Predictability: it makes it easier to predict future outcomes or estimate probabilities. This is because there is less uncertainty or variability in the data.
Comparability: When data points are spread out, it can be challenging to compare different groups or datasets, but when they are close to the mean, comparisons become more straightforward.
Calculate Standard Deviation
the square root of the variance
Variance
Define
A measure of how spread out or dispersed the values in a data set are from the mean.
Calculate the Variance
taking the average of the squared differences between each data point and the mean
interpretation of the variance
A larger variance indicates greater variability or dispersion in the data set, while a smaller variance suggests that the data points are closer to the mean.
e.g. If the variance of a set of test scores is 25, the average squared deviation from the mean is 25; the standard deviation is therefore √25 = 5 points.
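Variance and standard deviation can be computed in a few lines; this sketch uses population variance and invented scores:

```python
import math

def variance(data):
    """Population variance: the average squared deviation from the mean."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

scores = [2, 4, 4, 4, 5, 5, 7, 9]
var = variance(scores)
sd = math.sqrt(var)   # standard deviation is the square root of the variance
print(var, sd)        # 4.0 2.0
```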
Using Z-Scores to Identify and Deal with Outliers
Standardization: Transforming data into z-scores with a mean of 0 and standard deviation of 1.
Thresholds: Outliers defined as z-scores beyond a certain threshold (e.g., z > 2 or z < -2).
Comparability: Allows fair comparison of outliers across datasets.
Data Cleaning: Outliers identified using z-scores can be examined for errors or significance.
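The threshold rule above can be sketched as a small filter; the data and the |z| > 2 cutoff are illustrative assumptions:

```python
import statistics

def flag_outliers(data, threshold=2.0):
    """Return values whose |z-score| exceeds the threshold."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [x for x in data if abs((x - mu) / sigma) > threshold]

readings = [10, 12, 11, 13, 12, 11, 40]
print(flag_outliers(readings))  # only the extreme value 40 crosses |z| > 2
```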
Correlation Coefficient (r)
Measures the strength and direction of the linear relationship between two variables.
Range: Between -1 and +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
Interpretation of correlation coefficient
Magnitude: It shows how strong the relationship is. If |r| is closer to 1, the relationship is stronger. If it’s closer to 0, the relationship is weaker.
e.g. Correlation Coefficient (-0.42): Indicates a moderate negative linear relationship between the variables being studied.
Significance Level (p-value)
It indicates the probability that the observed result (or more extreme) occurred by random chance, assuming the null hypothesis is true.
Helps judge whether an observed effect is likely genuine or due to random chance.
Interpretation of p-value
lower p-value suggests stronger evidence against the null hypothesis, indicating that the observed result is unlikely to be due to chance
Typically, a significance level of 0.05 (or 5%) is used. If the p-value is less than this threshold, the result is considered statistically significant
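One way to estimate a p-value for a correlation without distribution tables is a permutation test: shuffle one variable many times and count how often a pairing as extreme as the observed one arises by chance. This sketch (function names and data invented) illustrates the idea:

```python
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: fraction of random pairings with |r| at least as extreme."""
    rng = random.Random(seed)
    r_obs = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson_r(x, y_perm)) >= r_obs - 1e-12:
            hits += 1
    return hits / n_perm

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 3, 3, 5, 6, 6, 8, 9]      # strong upward trend
print(permutation_p_value(x, y))  # well below 0.05 -> statistically significant
```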
Cohen’s Rules of Thumb for Magnitude of Correlation
Definition: Guidelines for interpreting the strength of correlation coefficients.
Small: Magnitude around 0.10.
Moderate: Magnitude around 0.30.
Large: Magnitude around 0.50.
-0.42 = moderate negative correlation
Correlation Coefficient (0.15)
Since 0.15 is closer to 0.10, it would be considered a small correlation according to Cohen’s rules of thumb
Correlation Coefficient (-0.60)
Since the magnitude of -0.60 (i.e., 0.60) exceeds 0.50, it would be considered a large negative correlation.
Correlation Coefficient (0.35)
Since 0.35 falls between 0.30 and 0.50, it would likely be considered a moderate positive correlation
Correlation Coefficient (-0.25)
Since -0.25 falls between -0.10 and -0.30, it would likely be considered a small negative correlation.
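The worked examples above follow a simple rule that can be coded up; the function name and the exact boundary handling (>= at each cutoff) are my assumptions, since Cohen's values are rough benchmarks rather than hard thresholds:

```python
def cohen_label(r):
    """Rough size label per Cohen's benchmarks (0.10 small, 0.30 moderate, 0.50 large)."""
    m = abs(r)
    if m >= 0.50:
        size = "large"
    elif m >= 0.30:
        size = "moderate"
    elif m >= 0.10:
        size = "small"
    else:
        size = "negligible"
    return f"{size} {'negative' if r < 0 else 'positive'}"

for r in (-0.42, 0.15, -0.60, 0.35, -0.25):
    print(r, cohen_label(r))
```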
Simple Regression Model
A statistical technique used to model the relationship between one predictor variable and one outcome variable.
To understand how changes in the predictor variable are associated with changes in the outcome variable.
Simple Regression
Examines how changes in one variable (predictor) are associated with changes in another variable (outcome).
Used to predict the value of the outcome variable based on the value of the predictor variable.
Provides an equation that describes the relationship and allows for making predictions.
Can suggest causality if appropriate conditions are met, as it implies a directional relationship between variables
Regression Equation
Y = bX + c + e
Regression Equation Explained
Y represents the outcome variable (e.g., test scores).
X represents the predictor variable (e.g., study time).
b is the regression coefficient, which indicates the change in Y for a one-unit change in X.
c is the intercept, representing the value of Y when X is zero
e is the error term, representing the difference between the observed Y and the predicted Y based on the regression equation.
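The coefficients b and c in Y = bX + c + e can be estimated by least squares; this sketch uses the standard closed-form formulas, with hypothetical study-time data:

```python
def fit_simple_regression(x, y):
    """Least-squares estimates of slope b and intercept c in Y = bX + c + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (t - my) for a, t in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    c = my - b * mx
    return b, c

study_time = [1, 2, 3, 4, 5]       # hypothetical hours studied
test_score = [52, 55, 61, 64, 68]  # hypothetical scores
b, c = fit_simple_regression(study_time, test_score)
print(round(b, 2), round(c, 2))    # each extra hour predicts ~4.1 more points
```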
Regression Model in an example: attention span = b(screen time) + c + e
what is b, c and e
The outcome variable is “attention span”.
The predictor variable is “screen time”.
b is the regression coefficient representing the effect of screen time on attention span.
c is the intercept representing the value of attention span when screen time is zero.
e is the error term representing unexplained variability in attention span.
R-squared
It shows how well the predictor variable(s) explain the outcome variable in a regression model.
Range of R-squared
Between 0 and 1, where 0 means the predictor explains none of the outcome variation and 1 means it explains all of it.
What does an R-squared value of 0.40 mean?
40% of the variation in the outcome is explained by the predictor(s).
R-squared Relation to Correlation Coefficient
R-squared equals the square of the correlation coefficient.
Example: If the correlation coefficient is -0.42, the R-squared would be
(−0.42)^2 = 0.1764
17.64%
Multiple regression
Statistical technique used to examine the relationship between one dependent variable and two or more independent variables
Assumptions include linearity, independence of errors, homoscedasticity (constant error variance), and normality of residuals.
How do you interpret the regression coefficients in multiple regression?
Regression coefficients represent the change in the dependent variable for a one-unit change in the predictor variable, holding other variables constant.
How do you assess the overall fit of a multiple regression model
The overall fit can be assessed using measures such as R-squared and adjusted R-squared, which indicate the proportion of variance explained by the model.
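Multiple regression coefficients can be estimated by solving the normal equations (XᵀX)b = Xᵀy; this is a sketch in plain Python (function names are mine, and the data is fabricated to fit y = 2 + 3·x1 + 0.5·x2 exactly so the recovered coefficients are easy to check):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_multiple_regression(X_rows, y):
    """OLS for y = b0 + b1*x1 + b2*x2 + ... via the normal equations (X'X)b = X'y."""
    rows = [[1.0] + list(r) for r in X_rows]  # prepend an intercept column
    p = len(rows[0])
    XtX = [[sum(row[i] * row[j] for row in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [2 + 3 * x1 + 0.5 * x2 for x1, x2 in X]  # fabricated, exactly linear
print([round(b, 3) for b in fit_multiple_regression(X, y)])  # [2.0, 3.0, 0.5]
```

Each returned coefficient is the change in y for a one-unit change in that predictor, holding the others constant.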
Importance of Checking Scatterplot before Reporting Correlation Coefficient
To visually inspect data and validate assumptions before interpreting correlation coefficient.
Allows visual inspection of assumptions.
Helps detect linearity and outliers.
Ensures accurate interpretation of correlation.
Before reporting a correlation coefficient between two variables, examining a scatterplot allows you to identify any nonlinear relationships or outlier points that may affect the interpretation of the correlation.
Third Variable Problem in Correlation
The presence of a third variable that influences both variables being correlated, leading to a spurious or misleading correlation.
How to mitigate Third Variable Problem in Correlation
Control for or consider potential third variables to accurately interpret the correlation between two variables. Use techniques like partial correlation or regression analysis to account for the influence of third variables.
Partial Correlation
A statistical technique used to assess the relationship between two variables while controlling for the effects of one or more additional variables.
Calculates the correlation coefficient between two variables after statistically removing the influence of one or more covariates.
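The standard first-order formula, r_xy·z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)), can be sketched directly; the data below is fabricated so that z drives both x and y, making the raw correlation spurious:

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_r(x, y, z):
    """Correlation of x and y after removing the linear influence of z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

z = [1, 2, 3, 4, 5]
x = [1.1, 1.8, 3.0, 4.2, 4.9]   # roughly z plus noise
y = [2.2, 3.9, 5.8, 7.9, 10.2]  # roughly 2z plus noise
print(round(pearson_r(x, y), 2))     # high raw correlation, ~0.99
print(abs(round(partial_r(x, y, z), 2)))  # near zero once z is controlled for
```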
Regression Analysis
A statistical method that examines the relationship between one dependent variable and one or more independent variables.
Identifies how changes in the independent variables are associated with changes in the dependent variable.
Fits a regression model to the data, estimating the coefficients that represent the strength and direction of the relationships between variables.
Difference between Positive and Negative Correlation
Positive: Both variables increase together (e.g., height and weight).
Negative: One variable increases, the other decreases (e.g., alcohol consumption and memory recall).
Checking for Influence of Outliers on Correlation Coefficient
1) Plot the data using a scatterplot to visually identify outliers.
2) Calculate the correlation coefficient with and without outliers to observe changes in its magnitude.
3) Conduct sensitivity analyses by removing outliers and re-calculating the correlation coefficient.
If the correlation coefficient changes substantially after removing outliers, it suggests that outliers may have influenced the correlation.
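Step 2 above (computing r with and without the outlier) can be sketched directly; the data is invented, with one bad point appended to otherwise perfectly linear data:

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(round(pearson_r(x, y), 2))                # 1.0 without the outlier
print(round(pearson_r(x + [6], y + [-20]), 2))  # ~-0.44: one bad point flips the sign
```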
Cohen’s Criteria for Small, Medium, and Large Correlation Coefficients
Small: 0.1
Medium: 0.3
Large: 0.5
Interpretation of a Confidence Interval for a Correlation Coefficient
If the confidence interval includes zero, the correlation coefficient is not statistically significant at the specified confidence level.
If the confidence interval does not include zero, the correlation coefficient is statistically significant at the specified confidence level.
When we say the interval “includes zero,” we mean zero lies inside the interval; when it excludes zero, the entire interval falls on one side of zero.
Confidence Interval
When we calculate a correlation coefficient between two variables, we also calculate something called a confidence interval.
The confidence interval tells us a range of values within which we are reasonably confident the true correlation lies.
A Correlation coefficient of 0.50 with a 95% confidence interval of (0.30, 0.70)
Correlation coefficient of 0.50 with a 95% confidence interval of (0.30, 0.70) indicates that we are 95% confident that the true correlation lies between 0.30 and 0.70.
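One common way to compute such an interval is the Fisher z-transformation; this sketch assumes a 95% level (critical value 1.96) and a hypothetical sample size of 100:

```python
import math

def fisher_ci_95(r, n):
    """Approximate 95% CI for a correlation via the Fisher z-transformation."""
    z = math.atanh(r)              # map r onto an approximately normal scale
    se = 1 / math.sqrt(n - 3)      # standard error on the z-scale
    lo, hi = z - 1.96 * se, z + 1.96 * se
    return math.tanh(lo), math.tanh(hi)  # map the bounds back to the r-scale

lo, hi = fisher_ci_95(0.50, n=100)
print(round(lo, 2), round(hi, 2))  # roughly 0.34 0.63 for this sample size
```

Because the interval excludes zero, this correlation would be statistically significant at the 5% level.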
X and Y
Components of Linear Regression Model
Dependent Variable (Y): The variable being predicted or explained by the independent variables.
Independent Variables (X): The variables used to predict or explain changes in the dependent variable.
Regression Coefficients (β)
Components of Linear Regression Model
The parameters representing the relationship between each independent variable and the dependent variable.
Intercept (β₀)
Components of Linear Regression Model
The constant term in the regression equation, representing the value of the dependent variable when all independent variables are zero.
Residuals (ε)
Components of Linear Regression Model
The differences between the observed and predicted values of the dependent variable.
Error Term (ε)
Components of Linear Regression Model
Represents the variability in the dependent variable that cannot be explained by the independent variables.
Correlation Matrix
A table that shows the correlation coefficients between multiple variables in a dataset.
Square matrix where each row and column represents a variable.
The cells contain correlation coefficients, showing the strength and direction of relationships between variables.
Interpretation of Correlation Matrix
Values range from -1 to 1.
Positive values indicate a positive correlation (variables move in the same direction).
Negative values indicate a negative correlation (variables move in opposite directions).
Values closer to 1 or -1 represent stronger correlations, while values closer to 0 represent weaker correlations.
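A correlation matrix is just the pairwise Pearson correlation applied to every pair of variables; a minimal sketch with fabricated columns (B doubles A, C moves opposite to A):

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def correlation_matrix(columns):
    """Square, symmetric matrix of pairwise Pearson correlations."""
    return [[pearson_r(a, b) for b in columns] for a in columns]

cols = [
    [1, 2, 3, 4],  # variable A
    [2, 4, 6, 8],  # variable B: A doubled, so r = 1
    [8, 6, 4, 2],  # variable C: moves opposite to A, so r = -1
]
for row in correlation_matrix(cols):
    print([round(v, 2) for v in row])  # diagonal is always 1.0
```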
Coefficients
Slope (b) and intercept (c)
Slope (b): Represents the change in the dependent variable for a one-unit change in the independent variable.
Intercept (c): Represents the predicted value of the dependent variable when all independent variables are zero.