Relationships between variables Flashcards
What is a relationship between 2 continuous variables called?
Bivariate
Why do you investigate bivariate relationships?
- To see whether the 2 variables are associated – SCATTER PLOT
- To enable the value of one variable to be predicted from any known value of the other – REGRESSION
- To look for agreement between two variables – e.g. 2 different methods used to measure the same thing
Scatter plots
- Graph describes relationship between 2 variables
- Independent variable on X axis (causes a change)
- Dependent variable on Y axis (outcome variable)
Correlation assumptions
- All values must be independent – e.g. you can't correlate repeated measurements over time
- Sample must be random from the population – e.g. you can't select specific individuals for inclusion
What data distribution do you need for correlations
- You can calculate a coefficient for any 2 continuous variables
- Pearson correlation – both variables should be normally distributed
- If not, transform the data first
- Spearman's rank correlation – used where variable distributions cannot be normalised by transformation
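As a minimal sketch of the two coefficients (using scipy.stats; the paired values below are invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired measurements (values invented for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r, p_pearson = pearsonr(x, y)      # assumes both variables ~ normal
rho, p_spearman = spearmanr(x, y)  # rank-based, no normality assumption

print(f"Pearson r = {r:.3f} (p = {p_pearson:.4f})")
print(f"Spearman rho = {rho:.3f} (p = {p_spearman:.4f})")
```

Because y increases monotonically with x here, Spearman's rho is exactly 1 even though the relationship need not be perfectly linear.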
Hypothesis testing for correlations
- Pearson's correlation – null hypothesis: NO LINEAR association between the 2 variables
- Spearman's rank correlation – null hypothesis: no association between the 2 variables
- BEWARE MULTIPLE CORRELATIONS – data dredging is running many correlation tests: at a 0.05 threshold, roughly 1 in every 20 tests will show an association by chance alone, so you need to adjust for this – e.g. by lowering the p-value threshold
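One common way to lower the threshold is the Bonferroni correction: divide the significance level by the number of tests. A sketch with hypothetical p-values (not real data):

```python
# Bonferroni correction: with m tests, divide the threshold alpha by m
alpha = 0.05
p_values = [0.001, 0.020, 0.049, 0.300]  # hypothetical raw p-values
m = len(p_values)

adjusted_alpha = alpha / m  # 0.05 / 4 = 0.0125
significant = [p for p in p_values if p < adjusted_alpha]
print(f"threshold after correction: {adjusted_alpha}")
print(f"significant p-values: {significant}")
```

Note that 0.020 and 0.049 would pass an unadjusted 0.05 threshold but fail after correction.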
What is a residual value?
The difference between the actual value and the fitted value on the line
The line is fitted to minimise the residuals
Linear regression
The equation of a straight line is y = a + bx. In this model:
- y is our response variable (weight)
- x is our predictor variable (age)
- a and b are model parameters: b is the slope of the line, a is the intercept of the line on the y-axis (where x = 0)
For each of our data points, there is an additional term for this equation to complete our model:
y = a + bx + e
Here, the 'e' term is known as the 'residual error'.
Linear regression model: y = a + bx + e. Values for a and b are calculated to minimise the sum of the squared residuals (minimise Σe²)
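As a sketch of this least-squares fit (using numpy; the age/weight values are invented for illustration):

```python
import numpy as np

# Hypothetical age (x) and weight (y) data
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([5.1, 7.0, 8.9, 11.1, 12.9])

# np.polyfit with degree 1 chooses b and a to minimise the sum of
# squared residuals, sum(e_i^2)
b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

print(f"y = {a:.2f} + {b:.2f}x")
print(f"sum of squared residuals: {np.sum(residuals**2):.4f}")
```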
Linear regression assumptions
- Bivariate relationship between predictor and response variable is linear
- The RESIDUALS are independent of each other and normally distributed
Linear regression – hypothesis testing
- Null hypothesis: b = 0
* Zero slope, so no linear relationship
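scipy's linregress reports a p-value for exactly this null hypothesis. A sketch with simulated data (true slope 0.5 plus noise, so H0 should be rejected):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=20)  # true slope 0.5 plus noise

result = linregress(x, y)
print(f"slope b = {result.slope:.3f}, intercept a = {result.intercept:.3f}")
print(f"p-value for H0: b = 0 -> {result.pvalue:.2e}")
```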
Correlation vs regression
- Correlation: summarises the strength and direction of the relationship between 2 variables as a single value – the correlation coefficient
- Regression: a model – uses one variable as the predictor (x) and the other as the response (y), and finds an equation that best describes the relationship between the 2 variables
- Correlation doesn't allow prediction of one variable from the other
- Regression allows one variable to be predicted from the other
- Correlation null hypothesis: no linear relationship between the variables
- Regression null hypothesis: the coefficients associated with the variables = 0
Correlation and causation
Correlation and regression show a link BUT don’t explain reason for the link
Adjusting a correlation for one other variable
For example, you have 3 variables: age, number of medicines, and a measure of comorbidity – one variable could be influencing the relationship between the others – so what can you do?
Partial correlation coefficient
• Estimated correlation between 2 variables, holding the 3rd variable constant
• e.g. partial correlation between age and number of medicines, adjusting for a measure of comorbidity
• If the correlation remains after adjustment for the 3rd variable, this indicates that the association is independent of the third variable
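A sketch of the standard partial-correlation formula, applied to simulated data in which the number of medicines depends on age only through comorbidity (all values invented), so the adjusted correlation should shrink towards zero:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y, adjusting for z (standard formula)."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical data: age, comorbidity score, number of medicines
rng = np.random.default_rng(1)
age = rng.uniform(40, 90, 100)
comorbidity = 0.1 * age + rng.normal(0, 1, 100)
medicines = 2.0 * comorbidity + rng.normal(0, 1, 100)

raw = np.corrcoef(age, medicines)[0, 1]
adjusted = partial_corr(age, medicines, comorbidity)
print(f"raw r(age, medicines) = {raw:.3f}")
print(f"partial r, adjusted for comorbidity = {adjusted:.3f}")
```

Here the raw correlation is strong but the partial correlation is near zero, indicating the age–medicines association is explained by comorbidity rather than being independent of it.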
Investigating relationships between multiple variables – some approaches
Hypothesis testing
• Null hypothesis – multiple regression
• Bayesian approach – model selection based on prior probabilities
Data reduction
• e.g. PCA
Hypothesis-free, e.g.
• Data mining – extracting and discovering patterns in large data sets
• Artificial intelligence, machine learning
• Network mapping
Multiple regression analysis
Regression model
• One continuous dependent outcome variable described by multiple predictor variables
What can it do?
• Find relationship between variables without prior expectation
• Identify independent relationships adjusted for confounders
• Develop a prognostic tool for predicting a dependent variable of interest
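A minimal sketch of multiple regression as a least-squares fit of one continuous outcome on two predictors (using numpy; data and true coefficients are invented):

```python
import numpy as np

# Hypothetical outcome modelled from two predictors (invented data)
rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 0.1, n)

# Design matrix with an intercept column; least-squares solution
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"intercept = {coef[0]:.2f}, b1 = {coef[1]:.2f}, b2 = {coef[2]:.2f}")
```

In practice a statistics package would also report standard errors and p-values for each coefficient; this only recovers the point estimates.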
Linear regression is used to predict a continuous dependent variable from a given set of independent variables
Logistic regression is used to predict a categorical dependent variable from a given set of independent variables
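A numpy-only sketch of logistic regression fitted by gradient descent on a binary outcome (this is illustrative, not how statistical packages implement it; the data and true parameters are invented):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated binary outcome with true intercept 1.0 and slope 2.0
rng = np.random.default_rng(2)
x = rng.normal(0, 1, 200)
y = (rng.random(200) < sigmoid(1.0 + 2.0 * x)).astype(float)

a, b = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    p = sigmoid(a + b * x)
    # gradient of the mean negative log-likelihood
    a -= lr * np.mean(p - y)
    b -= lr * np.mean((p - y) * x)

print(f"estimated intercept a = {a:.2f}, slope b = {b:.2f}")
```

The fitted model predicts the probability of the categorical outcome; the sign and p-value of each coefficient (from a proper package) are interpreted analogously to linear regression.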