Relationships between variables Flashcards
What is a relationship between 2 continuous variables called?
Bivariate
Why do you investigate bivariate relationships?
- Are the 2 variables associated? – SCATTER PLOT
- Enable the value of one variable to be predicted from any known value of the other – REGRESSION
- Look for agreement between two variables – e.g. 2 different methods used to measure the same thing
Scatter plots
- Graph describing the relationship between 2 variables
- Independent variable on the X axis (causes a change)
- Dependent variable on the Y axis (outcome variable)
Correlation assumptions
- All values must be independent – e.g. can't correlate repeated measurements over time
- Sample must be random from the population – e.g. can't select specific individuals for inclusion
What data distribution do you need for correlations?
- You can calculate a correlation coefficient for any 2 continuous variables
- Pearson correlation – both variables should be normally distributed
- If not, transform the data first
- Spearman's rank correlation – use where the variable distributions cannot be normalised by transformation (see the sketch below)
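A minimal sketch of computing both coefficients with scipy (the data and variable names are hypothetical, simulated purely for illustration):

```python
# Pearson for normally distributed variables, Spearman's rank otherwise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
age = rng.normal(50, 10, 100)                # hypothetical variable 1
weight = 0.5 * age + rng.normal(0, 5, 100)   # hypothetical variable 2

r, p = stats.pearsonr(age, weight)           # assumes normality
rho, p_s = stats.spearmanr(age, weight)      # rank-based, no normality needed
print(f"Pearson r = {r:.2f} (p = {p:.3g}); Spearman rho = {rho:.2f} (p = {p_s:.3g})")
```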
Hypothesis testing for correlations
- Pearson's correlation – null hypothesis: NO LINEAR association between the 2 variables
- Spearman's rank correlation – null hypothesis: no association between the 2 variables
- BEWARE MULTIPLE CORRELATIONS – data dredging is running many correlation tests: at p < 0.05, roughly 1 in every 20 tests will show an association by chance alone, so you need to adjust for this, e.g. by lowering the significance threshold (as in the sketch below)
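One common adjustment is the Bonferroni correction: divide the significance threshold by the number of tests. A sketch with made-up p-values:

```python
# With 20 tests at alpha = 0.05, ~1 will look "significant" by chance,
# so each test is judged against alpha / (number of tests).
p_values = [0.001, 0.04, 0.03, 0.20, 0.008]   # hypothetical raw p-values
alpha = 0.05
adjusted_alpha = alpha / len(p_values)        # Bonferroni threshold

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"test {i}: p = {p} -> {verdict} at adjusted alpha = {adjusted_alpha:.3f}")
```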
What is residual value?
The difference between the actual value and the fitted value on the line
The line is fitted to minimise the residuals (formally, the sum of their squares)
Linear regression
The equation of a straight line is y = a + bx. In this model:
- y is the response variable (e.g. weight)
- x is the predictor variable (e.g. age)
- a and b are model parameters: b is the slope of the line, a is the intercept on the y-axis (the value of y where x = 0)
For each of our data points there is an additional term that completes the model:
y = a + bx + e
Here, the 'e' term is known as the residual error.
Values for a and b are calculated to minimise the total residual error – the sum of squared residuals, ∑e² (least squares).
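A minimal least-squares sketch using simulated age/weight data (the true slope and intercept here are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
age = rng.uniform(1, 10, 50)
weight = 4.0 + 2.5 * age + rng.normal(0, 1.5, 50)

fit = stats.linregress(age, weight)   # chooses a and b to minimise sum of e^2
print(f"intercept a = {fit.intercept:.2f}, slope b = {fit.slope:.2f}")

residuals = weight - (fit.intercept + fit.slope * age)   # e = actual - fitted
```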
Linear regression assumptions
- Bivariate relationship between predictor and response variable is linear
- The RESIDUALS are independent of each other and normally distributed
Linear regression – hypothesis testing
- Null hypothesis: b = 0
* No slope, so no relationship (see the sketch below, which also checks the residual assumptions)
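A sketch combining the assumption check and the hypothesis test, using statsmodels on simulated data (variable names and effect sizes are hypothetical):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 80)
y = 1.0 + 0.8 * x + rng.normal(0, 1, 80)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(f"p-value for H0: b = 0 is {model.pvalues[1]:.3g}")
# Shapiro-Wilk on the residuals: a small p suggests non-normal residuals
print(f"residual normality p = {stats.shapiro(model.resid).pvalue:.3g}")
```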
Correlation vs regression
- Correlation: summarises the strength and direction of the relationship between 2 variables as a single value, the correlation coefficient (r)
- Regression: fits a model – uses one variable as the predictor (x) and the other as the response (y), and finds an equation that best describes the relationship between the 2 variables
- Correlation: doesn't allow prediction of one variable from the other
- Regression: allows one variable to be predicted from the other
- Correlation: null hypothesis – no linear relationship between the variables
- Regression: null hypothesis – the coefficients associated with the variables are 0
Correlation and causation
Correlation and regression show a link BUT they don't explain the reason for the link
Adjusting a correlation for one other variable
For example, you have 3 variables: age, number of medicines, and a measure of comorbidity – one variable could be influencing the apparent association between the other two – so what can you do?
Partial correlation coefficient
• Estimated correlation between 2 variables, holding the 3rd variable constant
• e.g. partial correlation between age and number of medicines, adjusting for a measure of comorbidity (see the sketch below)
• If the correlation remains after adjustment for the 3rd variable, this indicates that the association is independent of the third variable
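One way to see what a partial correlation does is to compute it by hand: regress each variable on the third, then correlate the residuals. A sketch with simulated data (all effect sizes are invented; the p-value is approximate because it ignores the degree of freedom lost to the adjustment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
comorbidity = rng.normal(0, 1, 200)                         # 3rd variable (z)
age = 50 + 8 * comorbidity + rng.normal(0, 5, 200)          # x, partly driven by z
medicines = 3 + 1.5 * comorbidity + rng.normal(0, 1, 200)   # y, partly driven by z

def residualise(v, z):
    """Remove the linear effect of z from v via least squares."""
    fit = stats.linregress(z, v)
    return v - (fit.intercept + fit.slope * z)

r_partial, p = stats.pearsonr(residualise(age, comorbidity),
                              residualise(medicines, comorbidity))
print(f"partial r (age, medicines | comorbidity) = {r_partial:.2f}, p = {p:.3g}")
```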
Investigating relationships between multiple variables – some approaches
Hypothesis testing
• Null hypothesis testing – e.g. multiple regression
• Bayesian approach – model selection based on prior probabilities
Data reduction
• e.g. PCA
Hypothesis-free approaches, e.g.
• Data mining – extracting and discovering patterns in large data sets
• Artificial intelligence, machine learning
• Network mapping
Multiple regression analysis
Regression model
• One continuous dependent outcome variable described by multiple predictor variables
What can it do?
• Find relationships between variables without prior expectation
• Identify independent relationships adjusted for confounders
• Develop a prognostic tool for predicting a dependent variable of interest
Linear regression is used to predict a continuous dependent variable from a given set of independent variables
Logistic regression is used to predict a categorical dependent variable from a given set of independent variables (both are sketched below)
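A sketch of both models with statsmodels on simulated data (the coefficients and the number of predictors are arbitrary choices):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(200, 2)))   # intercept + 2 predictors

# Continuous outcome -> multiple linear regression (OLS)
y_cont = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0, 1, 200)
print(sm.OLS(y_cont, X).fit().params)

# Binary outcome -> logistic regression
logits = X @ np.array([-0.2, 0.8, -0.5])
y_bin = (rng.random(200) < 1 / (1 + np.exp(-logits))).astype(int)
print(sm.Logit(y_bin, X).fit(disp=0).params)
```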
Risk vs odds
RISK
Absolute risk – probability of an event occurring in a population
Calculated as: number of people with the event / total number of people
Relative risk (risk ratio) – probability of the event occurring in one group compared with another
Calculated as: absolute risk in group 1 / absolute risk in group 2
Easier to explain and understand
ODDS
Chance of an event occurring vs not occurring in a population
Calculated as: number of people with the event / number of people without the event
Odds ratio – compares the odds of an event between 2 groups
Calculated as: odds in group 1 / odds in group 2
Needed for more complex statistical analysis
e.g. fitting statistical models to investigate how covariates and predictors influence the chance of an event occurring (a worked example follows)
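A worked sketch from a hypothetical 2×2 table, showing that risk and odds give different ratios even on the same data:

```python
#                 event   no event
# exposed group     20        80
# control group     10        90
risk_exposed = 20 / (20 + 80)        # absolute risk = events / total
risk_control = 10 / (10 + 90)
relative_risk = risk_exposed / risk_control   # 0.20 / 0.10 = 2.0

odds_exposed = 20 / 80               # odds = events / non-events
odds_control = 10 / 90
odds_ratio = odds_exposed / odds_control      # 0.25 / 0.111... = 2.25

print(f"RR = {relative_risk:.2f}, OR = {odds_ratio:.2f}")
```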
Derivation vs validation
derivation = builds the model from the available dataset, so it reflects that dataset and its quirks
validation = checks that the model works/is generalisable
Internal validity – split one dataset into derivation and validation cohorts; this reduces power (the smaller sample size makes a true effect harder to detect) and doesn't provide external validity
External validity – check the applicability of the model in a different dataset/cohort (see the split sketch below)
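A sketch of an internal validation split with scikit-learn (the dataset is simulated and the 70/30 split is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(0, 1, 300)) > 0

# Derivation cohort fits the model; validation cohort checks it
X_deriv, X_valid, y_deriv, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_deriv, y_deriv)
print(f"validation accuracy = {model.score(X_valid, y_valid):.2f}")
```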
sensitivity
ability of a test to correctly identify patients with the disease
true positives / all with the disease (TP / (TP + FN))
specificity
ability of a test to correctly identify people without the disease
true negatives / all without the disease (TN / (TN + FP))
negative predictive value (NPV)
true negatives / all negative predictions (TN / (TN + FN))
positive predictive value (PPV)
true positives / all positive predictions (TP / (TP + FP))
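A sketch computing all four measures from hypothetical confusion-matrix counts (TP, FP, FN, TN are made-up numbers):

```python
TP, FP, FN, TN = 80, 15, 20, 85

sensitivity = TP / (TP + FN)   # of those WITH the disease, fraction detected
specificity = TN / (TN + FP)   # of those WITHOUT the disease, fraction cleared
ppv = TP / (TP + FP)           # of positive predictions, fraction correct
npv = TN / (TN + FN)           # of negative predictions, fraction correct

print(f"sens = {sensitivity:.2f}, spec = {specificity:.2f}, "
      f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```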
ROC curve interpretation
The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity); the closer the curve is to the top left-hand corner, the better the score predicts the outcome. An area under the curve (AUC) of 0.724 means there is a 72.4% chance that the prediction score will distinguish between a patient likely to die and one likely to survive (an AUC of 0.5 is 50:50, i.e. useless; 1 is perfect; 0.7–0.8 is generally considered acceptable).
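A sketch of computing the curve and its AUC with scikit-learn (the outcome and risk score are simulated, so the AUC will not match the 0.724 above):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(6)
died = rng.integers(0, 2, 200)            # true outcome (0 = survived, 1 = died)
score = died + rng.normal(0, 1.2, 200)    # hypothetical risk score

fpr, tpr, thresholds = roc_curve(died, score)
print(f"AUC = {roc_auc_score(died, score):.3f}")   # 0.5 useless, 1 perfect
```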
It also shows the ROC curve (the closer this is to the top left hand corner the better the score predicts the outcome). Area under the curve is 0.724, which means our score has a 72.4% chance that the prediction score will be able to distinguish between a patient likely to die and one likely to survive (a score of 0.5 is 50:50 ie useless, a score of 1 is perfect; 0.7-0.8 is generally considered acceptable)