WEEK 4 Flashcards

1
Q

What is relative risk?
What does an RR of 1 indicate?

A

Calculated as the ratio of “the risk of developing the disease among patients in the exposed group” to “the risk of developing the disease among the patients in the unexposed group”.
The outcome variable must be binary (have only two categories), but the exposure variable could have two or more categories.
TIP: An easy way to remember RR: Relative Risk is about the risk of getting the disease.

If the Relative Risk is:

= 1: Patients in both groups have the same risk.

> 1: Patients in the exposed group are at increased risk compared with those in the unexposed group.

< 1: Patients in the exposed group are at lower risk than those in the unexposed group,
i.e. the exposure is “protective”.
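
The calculation can be sketched in a few lines of Python; the 2×2 cell counts and the a/b/c/d labels below are made up for illustration, not taken from the card:

```python
# Relative Risk from a 2x2 table (illustrative counts).
# a = exposed, diseased     b = exposed, not diseased
# c = unexposed, diseased   d = unexposed, not diseased
a, b, c, d = 30, 70, 10, 90

risk_exposed = a / (a + b)       # 30/100 = 0.30
risk_unexposed = c / (c + d)     # 10/100 = 0.10
rr = risk_exposed / risk_unexposed

print(round(rr, 2))  # 3.0 -> the exposed group has 3 times the risk
```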

2
Q

What is an odds ratio?

A

Odds Ratio (OR)
The Odds Ratio (OR) is generally appropriate for case-control studies.
(i.e. Start with cases and controls, and ask both groups about numerous exposures.)
The formula for the Odds Ratio is:
(Odds of being exposed for cases: a / c) /
(Odds of being exposed for controls: b / d)
where a = exposed cases, c = unexposed cases, b = exposed controls, and d = unexposed controls.
i.e. The ratio of the “odds of exposure of cases” to the “odds of exposure of controls”.
TIP: An easy way to remember OR: the Odds Ratio is about the odds of exposure.
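
A minimal Python sketch of the calculation; the cell counts are illustrative, not from the card:

```python
# Odds Ratio from a case-control 2x2 table (illustrative counts).
# a = exposed cases      b = exposed controls
# c = unexposed cases    d = unexposed controls
a, b, c, d = 40, 20, 60, 80

odds_cases = a / c         # odds of exposure among cases
odds_controls = b / d      # odds of exposure among controls
odds_ratio = odds_cases / odds_controls   # equivalently (a*d)/(b*c)

print(round(odds_ratio, 4))  # 2.6667
```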

3
Q

Interpreting the Odds Ratio in words

A

The correct way to word the Odds Ratio is:
The odds of exposure in cases are ___ times those in controls.
i.e. Phrase it as the odds of exposure in cases vs non-cases, as this technically matches the study design better.
However, because the Relative Risk is worded in terms of outcome (i.e. outcome in exposed vs unexposed groups), the Odds Ratio is often worded this way as well because it’s more intuitive.

4
Q

odds ratio results

A

If the OR is:

= 1: The odds of exposure are the same in cases and controls.

> 1: The odds of exposure are higher in cases than in controls.

< 1: The odds of exposure are lower in cases than in controls.
(i.e. More exposure in controls than cases)

5
Q

Hypothesis test for quantifying risk: Odds ratio

A

95% CI and p-value for OR
We can evaluate the significance of OR by calculating 95% confidence intervals and the p-value by performing a hypothesis test.

Similar to RR, the sampling distribution of OR is not normally distributed; however, the natural logarithm of OR (ln(OR)) is approximately normally distributed.

Below are all the formulas you will need:

6
Q

Steps for calculating 95% CI for OR

A

If the confidence interval excludes the value of 1 (a value of OR different from 1 is an indication of higher/lower odds), we say that the odds of exposure in the cases and controls are significantly different.
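
The steps can be sketched in Python. The standard error formula used here, SE(ln OR) = √(1/a + 1/b + 1/c + 1/d), is the standard textbook (Woolf) formula, assumed since the card does not show its own; the counts are illustrative:

```python
import math

# 95% CI for an OR, computed on the log scale and back-transformed.
a, b, c, d = 40, 20, 60, 80             # illustrative 2x2 cell counts
odds_ratio = (a * d) / (b * c)

ln_or = math.log(odds_ratio)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(OR), Woolf formula

lower = math.exp(ln_or - 1.96 * se)
upper = math.exp(ln_or + 1.96 * se)
print(round(lower, 2), round(upper, 2))  # 1.42 5.02
```

Because the interval (1.42, 5.02) excludes 1, the odds of exposure in cases and controls would be declared significantly different.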

7
Q

Calculating the p-value
Odds Ratio
Null and alternative hypothesis

A

We can also calculate the p-value by performing a hypothesis test.
The null and alternative hypotheses are as follows:
Null hypothesis: The odds of exposure between cases and controls in the study population are the same, that is, population OR = 1.
Alternative hypothesis: The odds of exposure between cases and controls in the study population are different, that is, the population OR ≠ 1.
The test statistic for testing these hypotheses is the Z-statistic, which follows the standard normal distribution and is given by:

Z-statistic = ln(OR)/SE, where SE is the standard error of ln(OR).

The p-value is obtained using Table A1: Normal distribution probability table.

Remember: Table A1 gives the area in one tail.
We need two tails to obtain the p-value.
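
A Python sketch of the test; the counts are illustrative, and SE is the standard √(1/a + 1/b + 1/c + 1/d) formula for ln(OR):

```python
import math

# Z-test for H0: population OR = 1, performed on the log scale.
a, b, c, d = 40, 20, 60, 80                 # illustrative 2x2 counts
odds_ratio = (a * d) / (b * c)

z = math.log(odds_ratio) / math.sqrt(1/a + 1/b + 1/c + 1/d)

# Two-tailed p-value from the standard normal distribution:
# erfc(|z| / sqrt(2)) equals twice the one-tail area.
p = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 2), round(p, 4))  # 3.04 0.0024 -> reject H0 at the 5% level
```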

8
Q

Calculating the p-value for OR

A

We can also calculate the p-value by performing a hypothesis test.
The null and alternative hypotheses are as follows:
Null hypothesis: The odds of exposure between cases and controls in the study population are the same, that is, population OR = 1.
Alternative hypothesis: The odds of exposure between cases and controls in the study population are different, that is, the population OR ≠ 1.
The test statistic for testing these hypotheses is the Z-statistic, which follows the standard normal distribution and is given by:

Z-statistic = ln(OR)/SE, where SE is the standard error of ln(OR).

The p-value is obtained using Table A1: Normal distribution probability table.

Remember: Table A1 gives the area in one tail.
We need two tails to obtain the p-value.

9
Q

Relative Risk (RR)

A

Calculated as the ratio of “the risk of developing the disease among patients in the exposed group” to “the risk of developing the disease among the patients in the unexposed group”.
The outcome variable must be binary (have only two categories), but the exposure variable could have two or more categories.
TIP: An easy way to remember RR: Relative Risk is about the risk of getting the disease.

10
Q

RR of 1

A

If the Relative Risk is:

= 1: Patients in both groups have the same risk.

> 1: Patients in the exposed group are at increased risk compared with those in the unexposed group.

< 1: Patients in the exposed group are at lower risk than those in the unexposed group,
i.e. the exposure is “protective”.

11
Q

Calculating the 95% CI for RR

A

Interpretation of 95% CI:
If the confidence interval excludes the value of 1 (a value of RR different from 1 is an indication of higher/lower risk), we say that the risks of developing the disease in the exposed and unexposed groups are significantly different.
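
In Python, with the standard error of ln(RR) given by the standard textbook formula √(1/a − 1/(a+b) + 1/c − 1/(c+d)), assumed here since the card does not show its own (the counts are illustrative):

```python
import math

# 95% CI for RR, computed on the log scale and back-transformed.
# a = exposed diseased, b = exposed healthy,
# c = unexposed diseased, d = unexposed healthy (illustrative counts).
a, b, c, d = 30, 70, 10, 90

rr = (a / (a + b)) / (c / (c + d))                   # 3.0
se = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))    # SE of ln(RR)

lower = math.exp(math.log(rr) - 1.96 * se)
upper = math.exp(math.log(rr) + 1.96 * se)
print(round(lower, 2), round(upper, 2))  # 1.55 5.8 -> excludes 1, significant
```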

12
Q

Hypothesis test & p-value (RR)

A

We can also calculate the p-value by performing a hypothesis test. The null and alternative hypotheses are as follows:
Null hypothesis: The risk of disease in the exposed and unexposed groups in the study population is the same, that is, population RR = 1.
Alternative hypothesis: The risk of disease in the exposed and unexposed groups in the study population is different, that is, the population RR ≠ 1.
The test statistic for testing these hypotheses is the Z-statistic, which follows the standard normal distribution and is given by:

Z-statistic = ln(RR)/SE, where SE is the standard error of ln(RR).

The p-value is obtained using the normal distribution probability table.

Remember: Table A1 gives the area in one tail.
We need two tails to obtain the p-value.

13
Q

Chi-square analysis.

A

Chi-square (χ²) compares two categorical variables to see whether the variation in the data is due to chance or due to the variables being tested.
It is a statistical test commonly used to compare the observed frequencies with those we would expect to occur if the null hypothesis were true.
Thus the “expected frequencies” are the frequencies that would occur if the frequency of the event were the same in each group.
Chi-square tests are commonly used to evaluate contingency tables.

14
Q

Contingency tables

A

When data has been grouped into categories, we often arrange the counts (frequencies) in a tabular format known as a contingency table or two-way table.
In the simplest case, two dichotomous random variables are involved; the rows of the table represent the categories of one variable (e.g., exposure), and the columns represent the categories of the other variable (e.g., outcome).
The entries in the table are the frequencies (also known as observed frequencies) that correspond to a particular combination of categories.
In a contingency table, the outcome is usually presented in the columns and the exposure in the rows.

15
Q

Hypothesis testing steps for the χ² test

A

Step 1: Establish study design

A common research question could be “is there any association between the two categorical variables (outcome and exposure)?”

Step 2: Set up hypotheses and determine level of significance

This time, the hypotheses are about an “association” between the two categorical variables.
H0: There is no association between the two variables.
Ha: There is an association between the two variables
Significance level (α): 0.05 unless otherwise specified.

Step 3: Select the appropriate test statistic

Step 4: Compute statistic

Step 5: Conclusions

The χ2 test of independence is used to test whether the distribution of the outcome variable is similar across the comparison groups.
Either “do not reject” (if p > 0.05) or “reject” (if p < 0.05) the null hypothesis.
The test assesses whether there is a statistically significant difference in the distribution of the outcome across exposure groups.
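
The whole procedure can be sketched in plain Python (the observed counts are made up for illustration; for a 2×2 table there is 1 degree of freedom):

```python
import math

# Pearson chi-square test for a 2x2 contingency table
# (illustrative observed counts).
observed = [[30, 70],
            [20, 80]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under H0 = row total x column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# Validity check from the cards: no more than 20% of cells may have
# an expected frequency below 5. Here every expected count is >= 5.
assert all(e >= 5 for row in expected for e in row)

# For a 2x2 table df = 1, and the upper-tail chi-square probability
# with 1 df equals erfc(sqrt(chi2 / 2)).
p = math.erfc(math.sqrt(chi2 / 2))
print(round(chi2, 2), round(p, 2))  # 2.67 0.1 -> p > 0.05, do not reject H0
```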

16
Q

A chi-squared test is not valid if

A

A chi-squared test is not valid if more than 20% of the cells have expected frequency smaller than 5.

In the example above, “0%” of cells have an expected frequency smaller than 5.

17
Q

Collapsing rows and/or columns

A

For a cross-table with multiple rows and/or columns, if more than 20% of the cells have an expected frequency smaller than 5, two or more adjacent rows and/or adjacent columns are collapsed (added together).
We recalculate the expected cell frequencies for the new table (for new cells only), and collapsing continues until 20% or less of the cells have expected frequencies less than 5.
Note that if a cross-table with multiple rows and/or columns is collapsed down to a two-by-two table but more than 20% of the cells still have an expected frequency less than 5, then the continuity-corrected chi-squared or Fisher’s exact test is used to analyse the data in the reduced 2-by-2 table.

18
Q

When to apply the Yates (continuity) corrected chi-square

A

If more than 20% of the cells in a 2-by-2 table have expected frequency less than 5, the sample size is not large enough to assume that the chi-square statistic follows the chi-square distribution.

Hence the Pearson chi-square test discussed in the above section is not valid.

The continuity corrected chi-square test gives a more accurate p-value than the Pearson chi-square if more than 20% of the cells have expected frequency less than 5.
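
A Python sketch of the corrected statistic (the counts are made up, chosen so that most expected frequencies fall below 5, where the plain Pearson test is not valid):

```python
import math

# Yates continuity-corrected chi-square for a small 2x2 table
# (illustrative observed counts).
observed = [[8, 2],
            [1, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Yates correction: subtract 0.5 from each |O - E| before squaring
# (clamped at zero in case |O - E| is already under 0.5).
chi2_yates = sum(max(abs(o - e) - 0.5, 0) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

p = math.erfc(math.sqrt(chi2_yates / 2))   # df = 1 for a 2x2 table
print(round(chi2_yates, 2))  # 3.81
```

In this made-up table three of the four expected counts are below 5 (75% of cells), so the corrected statistic, or Fisher's exact test, is the appropriate choice.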

19
Q

What is a regression model?

A

A regression model describes the relationship between the outcome variable and one or more risk factors.
The outcome variable must be continuous; however, the risk factors may be numerical or categorical.
The regression model can be either simple or multiple (multivariable).
In simple linear regression, there is only one risk factor.
Multiple linear regression is a regression in which there are at least two risk factors.
The risk factor is also called the covariate/independent variable (x).
The outcome is also called the response/dependent variable (y).

20
Q

Correlation and regression analysis

A

The relationship between outcome and a covariate can be either linear or non-linear.

The strength of the linear association between two variables can be quantified by the correlation coefficient.

The value of the correlation coefficient could be positive, negative, or zero.

A scatter-plot can be a helpful tool for determining the strength of the relationship between two variables, as well as for checking the linearity of the association.

21
Q

Correlation- definition

A

Measures the numerical strength and direction of linear relationships.
This measure (the correlation coefficient) is denoted by the letter “r”.
A perfect positive correlation, where all data points lie exactly on the line, is r = 1.
A perfect negative correlation, where all data points lie exactly on the line, is r = -1.

22
Q

How to calculate the correlation coefficient

A

Draw a scatterplot.
Draw a line of best fit (trend line) that best describes the direction of the data.
Examine how closely the data points cluster around the line.
Approximate the value of the correlation coefficient.
Interpret.
The following figure has a line of r-values to help you assign numbers to word descriptors.

23
Q

Hypothesis and calculation for correlation

A
24
Q

Interpret these results

Given a sample correlation coefficient of 0.79 and 20 patients in the sample, we can calculate the value of the test statistic and then compute the p-value using a t-distribution table.
Using the t-distribution table with 18 (20 - 2 = 18) degrees of freedom and a t-score of 5.49, we have a p-value of less than 0.001.

A

This means that less than 1 out of 1000 similar studies would have obtained a value of a correlation coefficient like ours (0.79) by chance or sampling variability, hence, we exclude random variation (chance) as an explanation for the correlation coefficient of 0.79.
Thus the correlation coefficient between a husband’s and wife’s SBPs is different from zero and is statistically significant.
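
Recomputing the card's example with the standard test statistic t = r√((n − 2)/(1 − r²)):

```python
import math

# t-statistic for H0: population correlation coefficient = 0,
# using the card's example of r = 0.79 for n = 20 couples.
r, n = 0.79, 20
t = r * math.sqrt((n - 2) / (1 - r ** 2))
df = n - 2

print(df, round(t, 2))  # 18 5.47
# The card quotes t = 5.49; the small gap reflects rounding of r.
# Either way, for df = 18 a t-score this large gives a two-tailed
# p-value below 0.001 in the t-distribution table.
```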

25
Q

Simple linear regression equation

A

Y = β0 + β1x
Expected Birth Weight = β0 + β1 × Oestriol level
This equation is known as the simple linear regression equation.
In the model, β1 is the beta (β) coefficient.
The baseline effect (baseline weight), often denoted by “β0”, is the value of the outcome when the covariate equals zero, and is often known as the constant.
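
The least-squares estimates behind such an equation can be computed by hand; a minimal Python sketch with made-up data (not the Oestriol dataset):

```python
# Least-squares fit of the simple linear regression y = b0 + b1*x
# (tiny made-up dataset chosen to lie exactly on a line).
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx            # slope: the beta coefficient
b0 = y_bar - b1 * x_bar   # intercept: the constant (value of y at x = 0)
print(b0, b1)  # 1.0 2.0 -> fitted line: y = 1 + 2x
```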

26
Q

Beta coefficient

A

The beta coefficient represents the amount of change in outcome variable for every unit change in the covariate, that is, the effect of the covariate on the outcome.

The value of β1 can be positive or negative; small or large.
A positive beta means that the outcome increases as the covariate increases.
A negative beta is an indication of an outcome decreasing as the covariate increases and vice versa.
A beta coefficient of nearly zero means that the covariate has very little effect on the outcome.
A large beta coefficient is an indication of a strong effect of the covariate on the outcome.

27
Q

Hypothesis Test & 95% CI for Beta Coefficient

A

The formula for the test statistic for hypothesis test is as follows:

t-score = Sample Statistic/SE = β1/SE

Here, β1 is the beta coefficient and SE is the standard error of beta coefficient.

The t-score follows the t-distribution with n - 2 degrees of freedom.
Thus, the p-value can be obtained from the t-distribution table.

The 95% confidence interval can be calculated using the following formula:

Statistic ± Margin of Error (where Margin of Error = multiplier × SE)

The multiplier value is obtained from the t-distribution table with n − 2 degrees of freedom.
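
The whole calculation as a Python sketch (made-up data; the multiplier 3.182 is the two-tailed 5% t-value for n − 2 = 3 degrees of freedom, read from a t-table):

```python
import math

# t-test and 95% CI for the slope b1 of a simple linear regression.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]          # tiny made-up dataset

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Residual standard error, with n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t = b1 / se_b1                          # t-score = b1 / SE
multiplier = 3.182                      # t-table, df = 3, 95% two-tailed
lower = b1 - multiplier * se_b1
upper = b1 + multiplier * se_b1
print(round(t, 2), round(lower, 2), round(upper, 2))  # 2.12 -0.3 1.5
```

Since |t| = 2.12 is below the critical value 3.182 (equivalently, the CI includes 0), the slope is not significant at the 5% level in this toy dataset.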

28
Q

Assumptions
Assumptions for performing linear regression analysis are as follows:

A
  1. The outcome variable follows the normal distribution (for example, the birth weight follows the normal distribution). This assumption can be evaluated by presenting the data on a histogram or on a box-plot.
  2. The relationship between the outcome and the covariate is linear, that is, there is a steady upward or downward trend in the data.
  3. Constant variance of the outcome across different values of the covariate, that is, the variability in the outcome remains the same across different values of the covariate.

29
Q

Model prediction

A

Here are the main points summarized:

  1. Equation from Linear Regression Output:
    • Linear regression provides an equation to predict outcomes based on predictor variables.
  2. Prediction Using the Equation:
    • The equation derived from the regression output can be used to predict outcomes for given values of the predictor variable.
  3. Example Equation:
    • Example equation from the Oestriol - birthweight dataset: Birth Weight = 2.183 + 0.05957 × Oestriol Level.
  4. Application of the Equation:
    • The equation can predict the birth weight of a baby for a woman near full term if her Oestriol Level is known.
  5. Example Prediction:
    • Example: If Oestriol Level is 40 mg/24 hr, the predicted birth weight is 4.57 kg.
  6. Residuals:
    • The observed birth weight may differ from the predicted value, resulting in residuals or prediction errors.
  7. Assessment of Residuals:
    • Residuals in a linear regression model are expected to be random and follow a normal distribution.
    • Randomness can be assessed by plotting residuals against predicted values.
    • Normality can be evaluated through box plots or histograms of residuals.

These points summarize the process of using linear regression output to predict outcomes and assess model fit through residuals.
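
Points 4 to 6 can be reproduced with nothing but the fitted equation from the card; for example, in Python:

```python
# Prediction and residual from the card's fitted equation
# Birth Weight = 2.183 + 0.05957 x Oestriol Level.
def predicted_birth_weight(oestriol_level):
    return 2.183 + 0.05957 * oestriol_level

bw = predicted_birth_weight(40)    # Oestriol level of 40 mg/24 hr
print(round(bw, 2))                # 4.57 (kg), the card's example

observed = 4.30                    # hypothetical observed birth weight
residual = observed - bw           # residual = observed - predicted
print(round(residual, 2))          # -0.27
```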

30
Q

Model evaluation steps_ regression

A

There are 3 ways to evaluate the model:

Coefficient of determination (R²)
Residual plot
Normal probability plot

31
Q

Coefficient of determination vs. correlation coefficient

A

For a simple linear regression model, it is the square of the correlation coefficient, r.
The correlation coefficient, r, may take any value in the range -1 to +1.
The coefficient of determination, R², must lie between 0 and +1.
The coefficient of determination can be interpreted as the proportion of the variability among the observed values of the outcome that is explained by the covariates.

R² indicates how well the regression line represents the data; if all the data points lay exactly on the trend line, the line would explain all of the variation.

The larger the value of R², the better the prediction performance of the regression model.

“Adjusted R²” allows comparison of regressions with different numbers of covariates. It is generated by a statistics package, and can be interpreted in the same way as R².

32
Q

Calculating R2

A

As stated above, the easiest way to calculate R² is to square r (the correlation coefficient).

For example, if r = 0.7, then R² = 0.49.
Convert that into a percentage to get 49%.

We can say that 49% of the variation in the dependent variable is accounted for by variation in the independent variable. The rest of the variation, 51%, is unexplained.
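
The arithmetic in Python:

```python
# Coefficient of determination from the correlation coefficient.
r = 0.7
r_squared = r ** 2                 # R^2 = r squared

print(round(r_squared, 2))         # 0.49
print(f"{r_squared:.0%} of the variation explained, "
      f"{1 - r_squared:.0%} unexplained")
```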

32
Q

A difference between observed and predicted values is known as the —- or —–.

A

A difference between observed and predicted values is known as the residual or prediction error.

For a linear regression model, the residuals are random and follow a normal distribution.

The randomness can be assessed by plotting the residuals (vertical axis) and predicted values (horizontal axis) on a scatter-plot, and the normality can be evaluated by presenting the residuals on a box-plot or histogram.
Residual = Observed (actual) - Predicted (from equation)

33
Q

Residual plot

A

A residual is the difference between observed outcome and model-predicted outcome.
A residual plot serves the following two purposes:
It can help us detect outlying observations in the sample.
If the residuals do not exhibit a random scatter but instead follow a distinct trend or funnel shape, this suggests that the relationship between the outcome and the covariate is not linear.

34
Q

Interpret this residual plot

A

For the residual plot of Oestriol - birthweight data:

The dots on the plot are random.
There is no trend to the residuals.
This means that there are no major outliers and the linearity assumption is met.

35
Q

Regression analysis with categorical Variable

A

We have discussed a linear regression model with continuous covariates; however, in medical research, we often run regression analysis with categorical covariates.
A categorical variable with two categories is called a binary variable.
A binary variable is usually coded by 0 and 1, where 0 is for the baseline (unexposed or reference) category and 1 is for the non-baseline (exposed) category.
The beta coefficient is the difference between the mean outcome for the non-baseline and baseline categories, that is, β = mean of exposed − mean of unexposed.

36
Q

Multiple Linear Regression Model

A

In multiple regression, we estimate the effect of one covariate after allowing (adjusting) for the effect of other covariates.

The first step of analysing data using multiple regression analysis is to plot each covariate and outcome on a scatter plot.
If the scatter plot shows a linear association between outcome and covariate, the covariate is identified for entry into the multiple regression model.

37
Q

Which variables should be selected in a regression analysis?

A

In medical studies, we often collect many variables from each patient; however, not all these variables have significant effect on the outcome variable.
Variables that do not have significant association with the outcome variable are called noise variables.
We prefer to include the variables in the model that are truly independent predictors.
Most statistical packages use the automatic variable selection method (stepwise forward or backward) for identifying significant covariates in the model.
These methods automatically determine which variables to add or drop from the model.
Stepwise procedures may also run the risk of selecting noise variables in the model and are considered useful only for exploratory purposes.

38
Q

Backward elimination

A

Backward elimination starts with all variables and drops them one at a time, removing the worst variable at each step according to some criterion, such as the p-value. A variable with a p-value smaller than 5% is usually considered to be significant. In summary, the backward elimination method works as in the following steps:

1. Add all potential covariates into the regression model and run the regression; identify the insignificant covariate (that is, p-value > 0.05) with the highest p-value and drop that covariate.
2. Run the model again with the remaining covariates; identify the insignificant covariate (that is, p-value > 0.05) with the highest p-value and drop that covariate.
3. Repeat Steps 1 and 2 until all the covariates in the model are significant.

39
Q

Logistic regression

A

Logistic regression is similar to multiple linear regression, although the response variable is binary (Yes / No) rather than continuous.

The exponentiated coefficients from a logistic regression are Odds Ratios.

The table below summarises the similarities and differences between linear and logistic regression.