Correlation and Regression Flashcards
Pearson Bivariate Correlation Coefficient
Define and Calculate
Define: A number that shows how two things are related in a straight line (the strength and direction of the linear relationship between two continuous variables).
Calculate: Divide the covariance of the two variables by the product of their standard deviations.
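The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pearson_r` and the study-hours data are made up for the example, and the population formulas (dividing by n) are an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and standard deviations (dividing by n).
    # Dividing by n - 1 everywhere gives the same r because the factor cancels.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

hours = [1, 2, 3, 4, 5]        # hypothetical hours studied
scores = [52, 55, 61, 64, 70]  # hypothetical test scores
r = pearson_r(hours, scores)
print(round(r, 3))  # close to +1: a strong positive linear relationship
```

If either variable has no variation, its standard deviation is 0 and r is undefined (the sketch would raise a division error).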
Pearson Bivariate Correlation Coefficient
Range and Interpretation
The coefficient ranges from -1 to +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
Plotting Pearson
Linearity Assessment Plot: A plot used to assess the linearity of the relationship between two variables.
Helps determine if the relationship between variables is linear or nonlinear with a trend line.
Scatter plot is commonly used for this purpose.
The more tightly the points cluster around the trend line, the stronger the correlation.
Z-scores
Definition
Z scores are a way to standardize data points by showing how many standard deviations they are from the mean.
A z score of +2 indicates that a data point is two standard deviations above the mean, while a z score of -1 indicates that a data point is one standard deviation below the mean.
Standardizing with Z Scores
Meaning, Procedure and Purpose
Standardizing allows fair comparison of data on a common scale.
Meaning: Standardizing converts data into z scores, which have a mean of 0 and a standard deviation of 1.
Procedure: Subtract the mean and divide by the standard deviation.
Purpose: Makes different data comparable by putting them on the same scale.
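A short sketch of why the common scale matters. The two exams, their means and standard deviations, and the helper name `z_score` are all hypothetical, chosen only to illustrate the comparison.

```python
def z_score(x, mean, sd):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - mean) / sd

# Hypothetical exams marked on different scales.
z_a = z_score(80, mean=70, sd=10)  # exam A: raw score 80
z_b = z_score(65, mean=50, sd=5)   # exam B: raw score 65

print(z_a, z_b)  # 1.0 3.0
```

The raw score on exam A is higher, but the exam B score sits three standard deviations above its mean, so it is the stronger result relative to its group.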
Why Standardize and Meaning
Standardizing allows fair comparison of data on a common scale.
It transforms data to have a mean of 0 and a standard deviation of 1.
Calculate Z-scores
Calculation:
z = (X - μ) / σ
X is the raw score, μ is the mean of the distribution, and σ is the standard deviation.
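As a rough check of the formula, the sketch below standardizes a whole data set and confirms that the z scores end up with mean 0 and standard deviation 1. The data values and the name `standardize` are illustrative, and the population standard deviation (`statistics.pstdev`) is assumed.

```python
import statistics

def standardize(values):
    """Return the z score (X - mu) / sigma for every value."""
    mu = statistics.fmean(values)      # mean of the distribution
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values: mean 5, standard deviation 2
z = standardize(data)

print(z)                     # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(statistics.fmean(z))   # 0.0
print(statistics.pstdev(z))  # 1.0
```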
Standard Deviation
Define
A measure of the amount of variation or dispersion in a set of values
It’s a measure of how spread out numbers are in a set
Interpretation of Standard Deviation
A low standard deviation indicates that the data points tend to be close to the mean, while a high standard deviation indicates that the data points are spread out over a wider range of values.
Why is it good for data to cluster around the mean?
Reliability, Comparability and Predictability
Reliability: There are fewer extreme values or outliers that could skew the interpretation of the data.
Predictability: It is easier to predict future outcomes or estimate probabilities because there is less uncertainty or variability in the data.
Comparability: When data points are spread out, it can be challenging to compare different groups or datasets, but when they are close to the mean, comparisons become more straightforward.
Calculate Standard Deviation
Take the square root of the variance.
Variance
Define
A measure of how spread out or dispersed the values in a data set are from the mean.
Calculate the Variance
Take the average of the squared differences between each data point and the mean.
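A minimal sketch of the variance and standard deviation calculations. It assumes the population version (dividing by the number of data points, as "average" suggests); sample statistics divide by n - 1 instead. The scores are made up.

```python
import math

def variance(values):
    """Average of the squared differences from the mean (population variance)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))

scores = [65, 65, 75, 75]  # hypothetical test scores, mean 70

print(variance(scores))            # 25.0 (squared points)
print(standard_deviation(scores))  # 5.0 (points)
```

The standard deviation is in the same units as the data, which is why it is usually easier to interpret than the variance.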
Interpretation of the Variance
A larger variance indicates greater variability or dispersion in the data set, while a smaller variance suggests that the data points are closer to the mean.
e.g. If the variance of a set of test scores is 25, the average squared difference between a score and the mean is 25 (in squared units). The standard deviation is √25 = 5, so scores typically fall about 5 points from the mean.
Using Z-Scores to Identify and Deal with Outliers
Standardization: Transforming data into z-scores with a mean of 0 and standard deviation of 1.
Thresholds: Outliers are defined as z-scores beyond a chosen threshold (e.g., |z| > 2, meaning z > 2 or z < -2).
Comparability: Allows fair comparison of outliers across datasets.
Data Cleaning: Outliers identified using z-scores can be examined for errors or significance.
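The steps above can be sketched as a small filter. The threshold of 2 comes from the example on this card; the function name `find_outliers`, the use of the population standard deviation, and the data are assumptions for illustration.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score lies beyond +/- threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [x for x in values if abs((x - mu) / sigma) > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 40]  # hypothetical measurements
print(find_outliers(data))  # [40]
```

A flagged value is a candidate for inspection, not automatically an error. Note that an extreme value also inflates the mean and standard deviation used to compute its own z-score, which can hide outliers in small samples.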
Correlation Coefficient (r)
Measures the strength and direction of the linear relationship between two variables.
Range: Between -1 and +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
Interpretation of Correlation Coefficient
Magnitude: It shows how strong the relationship is. If |r| is closer to 1, the relationship is stronger. If it’s closer to 0, the relationship is weaker.
e.g. Correlation Coefficient (-0.42): Indicates a moderate negative linear relationship between the variables being studied.
Significance Level (p-value)
The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
It helps judge whether an observed effect is likely to be genuine or could plausibly be due to random variation.
Interpretation of p-value
A lower p-value suggests stronger evidence against the null hypothesis: a result this extreme would be unlikely if the null hypothesis were true.
Typically, a significance level of 0.05 (or 5%) is used. If the p-value is less than this threshold, the result is considered statistically significant.
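One way to see what a p-value for a correlation means is a permutation test: shuffle one variable many times (which simulates the null hypothesis of no relationship) and count how often the shuffled data give a correlation at least as extreme as the observed one. This is a swapped-in illustration, not the method these cards describe; statistical software usually derives the p-value from a t distribution instead. The function names, data, seed, and number of permutations are all assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for r by shuffling ys."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

xs = list(range(1, 11))                                          # hypothetical predictor
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2]  # hypothetical outcome
p = permutation_p_value(xs, ys)
print(p < 0.05)  # True: statistically significant at the 0.05 level
```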
Cohen’s Rules of Thumb for Magnitude of Correlation
Definition: Guidelines for interpreting the strength of correlation coefficients.
Small: Magnitude around 0.10.
Moderate: Magnitude around 0.30.
Large: Magnitude around 0.50.
e.g. r = -0.42: moderate negative correlation (|r| falls between 0.30 and 0.50)
Correlation Coefficient (0.15)
Since 0.15 is closer to 0.10, it would be considered a small correlation according to Cohen’s rules of thumb
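Cohen's guidelines can be turned into a small lookup. Treating 0.10, 0.30 and 0.50 as lower cutoffs, and calling anything below 0.10 "negligible", are assumptions made for this sketch; the guidelines themselves only give approximate magnitudes.

```python
def cohen_strength(r):
    """Label the magnitude of a correlation using assumed cutoffs of 0.10 / 0.30 / 0.50."""
    magnitude = abs(r)  # strength ignores the sign
    if magnitude >= 0.50:
        return "large"
    if magnitude >= 0.30:
        return "moderate"
    if magnitude >= 0.10:
        return "small"
    return "negligible"  # label assumed for values below 0.10

for r in (-0.42, 0.15):
    direction = "negative" if r < 0 else "positive"
    print(r, cohen_strength(r), direction)
# -0.42 moderate negative
# 0.15 small positive
```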