Exploratory Data Analysis 6.4 Correlations Flashcards
What is correlation?
Correlation is a statistical measure that describes the relationship between two variables. It tells us whether and how strongly the two variables are related to each other.
What is the correlation coefficient?
A single number that ranges from -1 to 1, summarizing the strength and direction of a correlation.
What does a correlation close to -1 indicate?
A strong negative relationship.
What does a correlation of 0 indicate?
No linear relationship. The variables may still be related in a nonlinear way.
What does a correlation close to 1 indicate?
A strong positive relationship.
What is Pearson’s correlation?
Pearson’s correlation (r) measures how strongly two variables are linearly related and whether the relationship is positive or negative. It tells you if an increase in one variable is associated with an increase or decrease in another.
Value Range: -1 to 1
Positive Correlation (r > 0): Both variables increase together
Example: Height & Weight (taller people tend to weigh more)
Negative Correlation (r < 0): One increases, the other decreases
Example: Study Time & Video Game Time (more studying, less gaming)
No Correlation (r ≈ 0): No linear relationship
Example: Shoe size & IQ
What is Spearman’s correlation?
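- Unlike Pearson’s correlation, which checks for a straight-line relationship, Spearman’s correlation works on the ranks of the values, so it measures whether one variable consistently increases (or decreases) as the other increases.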
- It’s useful when data isn’t normally distributed or has outliers because it focuses on order rather than exact values.
When should Spearman’s correlation be used?
Use Spearman’s correlation when:
- The relationship between two variables is not linear but is still monotonic (one consistently rises or falls as the other rises).
- Your data has outliers that could affect Pearson’s correlation.
- Your data is not normally distributed (e.g., skewed or ranked).
- You’re working with ordinal (ranked) data, like survey ratings or finishing positions (e.g., 1st, 2nd, 3rd place).
What is a correlation plot?
- A tool in Exploratory Data Analysis that displays the correlations between every pair of numeric variables in a dataset.
- A correlation plot is a visual representation that shows the relationship between multiple variables. It uses colors or values to display how strongly pairs of variables are correlated with each other.
What does correlation not imply?
Causation. Correlation does not mean that one variable causes the other.
What is an example of a misleading correlation?
The flow of water in a stream and the amount of water in a puddle may be correlated, but neither causes the other: both could be driven by a third variable, rainfall.
What is Anscombe’s Quartet?
- A set of four datasets that have nearly identical summary statistics (such as means, variances, and correlation) but look very different when plotted, illustrating how correlation alone can be misleading.
- It shows that data visualization is essential for understanding data.