Relationships between variables Flashcards
What is a relationship between 2 continuous variables called?
Bivariate
Why do you investigate bivariate relationships?
- Are the 2 variables associated? – SCATTER PLOT
- Enable the value of one variable to be predicted from any known value of the other – REGRESSION
- Look for agreement between two variables – e.g. 2 different methods used to measure the same thing
Scatter plots
- Graph describing the relationship between 2 variables
- Independent variable on the X axis (causes a change)
- Dependent variable on the Y axis (outcome variable)
Correlation assumptions
- All values must be independent – e.g. can't correlate repeated measurements over time
- Sample must be random from the population – e.g. can't select specific individuals for inclusion
What data distribution do you need for correlations?
- You can calculate a correlation coefficient for any 2 continuous variables
- Pearson correlation – both variables should be normally distributed
- If not, transform the data first
- Spearman's rank correlation – use where the variable distributions cannot be normalised by transformation (see the sketch below)
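A minimal sketch of computing both coefficients with scipy (the data and variable names are hypothetical, simulated purely for illustration):

```python
# Pearson for normally distributed variables, Spearman's rank otherwise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
age = rng.normal(50, 10, 100)                # hypothetical variable 1
weight = 0.5 * age + rng.normal(0, 5, 100)   # hypothetical variable 2

r, p = stats.pearsonr(age, weight)           # assumes normality
rho, p_s = stats.spearmanr(age, weight)      # rank-based, no normality needed
print(f"Pearson r = {r:.2f} (p = {p:.3g}); Spearman rho = {rho:.2f} (p = {p_s:.3g})")
```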
Hypothesis testing for correlations
- Pearson's correlation – null hypothesis: NO LINEAR association between the 2 variables
- Spearman's rank correlation – null hypothesis: no association between the 2 variables
- BEWARE MULTIPLE CORRELATIONS – data dredging is running many correlation tests: at p < 0.05, roughly 1 in every 20 tests will show an association by chance alone, so you need to adjust for this, e.g. by lowering the significance threshold (as in the sketch below)
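One common adjustment is the Bonferroni correction: divide the significance threshold by the number of tests. A sketch with made-up p-values:

```python
# With 20 tests at alpha = 0.05, ~1 will look "significant" by chance,
# so each test is judged against alpha / (number of tests).
p_values = [0.001, 0.04, 0.03, 0.20, 0.008]   # hypothetical raw p-values
alpha = 0.05
adjusted_alpha = alpha / len(p_values)        # Bonferroni threshold

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"test {i}: p = {p} -> {verdict} at adjusted alpha = {adjusted_alpha:.3f}")
```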
What is residual value?
The difference between the actual value and the fitted value on the line
The line is fitted to minimise the residuals (formally, the sum of their squares)
Linear regression
The equation of a straight line is y = a + bx. In this model:
- y is the response variable (e.g. weight)
- x is the predictor variable (e.g. age)
- a and b are model parameters: b is the slope of the line, a is the intercept on the y-axis (the value of y where x = 0)
For each of our data points there is an additional term that completes the model:
y = a + bx + e
Here, the 'e' term is known as the residual error.
Values for a and b are calculated to minimise the total residual error – the sum of squared residuals, ∑e² (least squares).
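A minimal least-squares sketch using simulated age/weight data (the true slope and intercept here are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
age = rng.uniform(1, 10, 50)
weight = 4.0 + 2.5 * age + rng.normal(0, 1.5, 50)

fit = stats.linregress(age, weight)   # chooses a and b to minimise sum of e^2
print(f"intercept a = {fit.intercept:.2f}, slope b = {fit.slope:.2f}")

residuals = weight - (fit.intercept + fit.slope * age)   # e = actual - fitted
```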
Linear regression assumptions
- Bivariate relationship between predictor and response variable is linear
- The RESIDUALS are independent of each other and normally distributed
Linear regression – hypothesis testing
- Null hypothesis: b = 0
* No slope, so no relationship (see the sketch below, which also checks the residual assumptions)
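A sketch combining the assumption check and the hypothesis test, using statsmodels on simulated data (variable names and effect sizes are hypothetical):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 80)
y = 1.0 + 0.8 * x + rng.normal(0, 1, 80)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(f"p-value for H0: b = 0 is {model.pvalues[1]:.3g}")
# Shapiro-Wilk on the residuals: a small p suggests non-normal residuals
print(f"residual normality p = {stats.shapiro(model.resid).pvalue:.3g}")
```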
Correlation vs regression
- Correlation: summarises the strength and direction of the relationship between 2 variables as a single value, the correlation coefficient (r)
- Regression: fits a model – uses one variable as the predictor (x) and the other as the response (y), and finds an equation that best describes the relationship between the 2 variables
- Correlation: doesn't allow prediction of one variable from the other
- Regression: allows one variable to be predicted from the other
- Correlation: null hypothesis – no linear relationship between the variables
- Regression: null hypothesis – the coefficients associated with the variables are 0
Correlation and causation
Correlation and regression show a link BUT they don't explain the reason for the link
Adjusting a correlation for one other variable
For example, you have 3 variables: age, number of medicines, and a measure of comorbidity – one variable could be influencing the apparent association between the other two – so what can you do?
Partial correlation coefficient
• Estimated correlation between 2 variables, holding the 3rd variable constant
• e.g. partial correlation between age and number of medicines, adjusting for a measure of comorbidity (see the sketch below)
• If the correlation remains after adjustment for the 3rd variable, this indicates that the association is independent of the third variable
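One way to see what a partial correlation does is to compute it by hand: regress each variable on the third, then correlate the residuals. A sketch with simulated data (all effect sizes are invented; the p-value is approximate because it ignores the degree of freedom lost to the adjustment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
comorbidity = rng.normal(0, 1, 200)                         # 3rd variable (z)
age = 50 + 8 * comorbidity + rng.normal(0, 5, 200)          # x, partly driven by z
medicines = 3 + 1.5 * comorbidity + rng.normal(0, 1, 200)   # y, partly driven by z

def residualise(v, z):
    """Remove the linear effect of z from v via least squares."""
    fit = stats.linregress(z, v)
    return v - (fit.intercept + fit.slope * z)

r_partial, p = stats.pearsonr(residualise(age, comorbidity),
                              residualise(medicines, comorbidity))
print(f"partial r (age, medicines | comorbidity) = {r_partial:.2f}, p = {p:.3g}")
```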
Investigating relationships between multiple variables – some approaches
Hypothesis testing
• Null hypothesis testing – e.g. multiple regression
• Bayesian approach – model selection based on prior probabilities
Data reduction
• e.g. PCA
Hypothesis-free approaches, e.g.
• Data mining – extracting and discovering patterns in large data sets
• Artificial intelligence, machine learning
• Network mapping
Multiple regression analysis
Regression model
• One continuous dependent outcome variable described by multiple predictor variables
What can it do?
• Find relationships between variables without prior expectation
• Identify independent relationships adjusted for confounders
• Develop a prognostic tool for predicting a dependent variable of interest
Linear regression is used to predict a continuous dependent variable from a given set of independent variables
Logistic regression is used to predict a categorical dependent variable from a given set of independent variables (both are sketched below)
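A sketch of both models with statsmodels on simulated data (the coefficients and the number of predictors are arbitrary choices):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(200, 2)))   # intercept + 2 predictors

# Continuous outcome -> multiple linear regression (OLS)
y_cont = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0, 1, 200)
print(sm.OLS(y_cont, X).fit().params)

# Binary outcome -> logistic regression
logits = X @ np.array([-0.2, 0.8, -0.5])
y_bin = (rng.random(200) < 1 / (1 + np.exp(-logits))).astype(int)
print(sm.Logit(y_bin, X).fit(disp=0).params)
```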
Risk vs odds
RISK
Absolute risk – probability of an event occurring in a population
Calculated as: number of people with the event / total number of people
Relative risk (risk ratio) – probability of the event occurring in one group compared with another
Calculated as: absolute risk in group 1 / absolute risk in group 2
Easier to explain and understand
ODDS
Chance of an event occurring vs not occurring in a population
Calculated as: number of people with the event / number of people without the event
Odds ratio – compares the odds of an event between 2 groups
Calculated as: odds in group 1 / odds in group 2
Needed for more complex statistical analysis
e.g. fitting statistical models to investigate how covariates and predictors influence the chance of an event occurring (a worked example follows)
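A worked sketch from a hypothetical 2×2 table, showing that risk and odds give different ratios even on the same data:

```python
#                 event   no event
# exposed group     20        80
# control group     10        90
risk_exposed = 20 / (20 + 80)        # absolute risk = events / total
risk_control = 10 / (10 + 90)
relative_risk = risk_exposed / risk_control   # 0.20 / 0.10 = 2.0

odds_exposed = 20 / 80               # odds = events / non-events
odds_control = 10 / 90
odds_ratio = odds_exposed / odds_control      # 0.25 / 0.111... = 2.25

print(f"RR = {relative_risk:.2f}, OR = {odds_ratio:.2f}")
```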
Derivation vs validation
derivation = builds the model from the available dataset, so it reflects that dataset and its quirks
validation = checks that the model works/is generalisable
Internal validity – split one dataset into derivation and validation cohorts; this reduces power (the smaller sample size makes a true effect harder to detect) and doesn't provide external validity
External validity – check the applicability of the model in a different dataset/cohort (see the split sketch below)
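A sketch of an internal validation split with scikit-learn (the dataset is simulated and the 70/30 split is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + rng.normal(0, 1, 300)) > 0

# Derivation cohort fits the model; validation cohort checks it
X_deriv, X_valid, y_deriv, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_deriv, y_deriv)
print(f"validation accuracy = {model.score(X_valid, y_valid):.2f}")
```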
sensitivity
ability of a test to correctly identify patients with the disease
true positives / all with the disease (TP / (TP + FN))
specificity
ability of a test to correctly identify people without the disease
true negatives / all without the disease (TN / (TN + FP))
negative predictive value (NPV)
true negatives / all negative predictions (TN / (TN + FN))
positive predictive value (PPV)
true positives / all positive predictions (TP / (TP + FP))
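A sketch computing all four measures from hypothetical confusion-matrix counts (TP, FP, FN, TN are made-up numbers):

```python
TP, FP, FN, TN = 80, 15, 20, 85

sensitivity = TP / (TP + FN)   # of those WITH the disease, fraction detected
specificity = TN / (TN + FP)   # of those WITHOUT the disease, fraction cleared
ppv = TP / (TP + FP)           # of positive predictions, fraction correct
npv = TN / (TN + FN)           # of negative predictions, fraction correct

print(f"sens = {sensitivity:.2f}, spec = {specificity:.2f}, "
      f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```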
ROC curve interpretation
The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity); the closer the curve is to the top left-hand corner, the better the score predicts the outcome. An area under the curve (AUC) of 0.724 means there is a 72.4% chance that the prediction score will distinguish between a patient likely to die and one likely to survive (an AUC of 0.5 is 50:50, i.e. useless; 1 is perfect; 0.7–0.8 is generally considered acceptable).
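A sketch of computing the curve and its AUC with scikit-learn (the outcome and risk score are simulated, so the AUC will not match the 0.724 above):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(6)
died = rng.integers(0, 2, 200)            # true outcome (0 = survived, 1 = died)
score = died + rng.normal(0, 1.2, 200)    # hypothetical risk score

fpr, tpr, thresholds = roc_curve(died, score)
print(f"AUC = {roc_auc_score(died, score):.3f}")   # 0.5 useless, 1 perfect
```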
It also shows the ROC curve (the closer this is to the top left hand corner the better the score predicts the outcome). Area under the curve is 0.724, which means our score has a 72.4% chance that the prediction score will be able to distinguish between a patient likely to die and one likely to survive (a score of 0.5 is 50:50 ie useless, a score of 1 is perfect; 0.7-0.8 is generally considered acceptable)