Advanced Analysis and Hypothesis Tests Flashcards

1
Q

What is a t-distribution?

A

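Similar to the standard Normal distribution, but it is a family of curves whose shape depends on the degrees of freedom. With few degrees of freedom the tails are heavier; as the degrees of freedom increase, the curve approaches the standard Normal.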

2
Q

What is hypothesis testing?

A

Using data to “weigh up the evidence” and then using that evidence to decide whether to reject a pre-defined statement (the null hypothesis)

3
Q

What are the five steps of hypothesis testing?

A
  1. State the null hypothesis
  2. Calculate the appropriate test statistic
  3. Obtain a P value for the test statistic
  4. Make the decision whether to reject the null hypothesis based on P value
  5. State the conclusion in terms of the original research question
4
Q

What is the null hypothesis?

A
  • A statement about the value of a population parameter or the difference between groups
  • usually the negation of the research hypothesis
  • usually “the effect/association of interest is zero”
5
Q

What is the alternative hypothesis?

A
  • Opposite of the null hypothesis
  • Usually related to the research question

6
Q

How do we calculate the test statistic?

A

Test statistic = (observed value - hypothesised value) / standard error

7
Q

What is the relationship between the test statistic and the null hypothesis?

A

The larger the absolute value of the test statistic (whether positive or negative), the more evidence there is against the null hypothesis. The value of the test statistic is used to decide whether to reject the null hypothesis.

8
Q

What is the goal of estimation?

A

We want to estimate the population parameter based on the sample statistic.
• The sample must therefore be representative of the population

9
Q

How is estimation different from hypothesis testing?

A

Hypothesis testing is concerned with using the data to ‘weigh up the evidence’ and decide whether or not to reject a pre-specified statement (the null hypothesis), whereas estimation gives us a ‘best estimate’ for the population value along with a range of likely values (a confidence interval).

10
Q

What is the definition of a population parameter?

A

A measurable characteristic of the population (e.g. mean = μ, proportion = π, standard deviation = σ). Values obtained from a sample are estimates of the population parameters.

11
Q

What are sample statistics?

A

Sample statistics are estimates of results that would have been obtained had the whole population been studied

12
Q

What are the two different kinds of estimation?

A

Point estimation and interval estimation

13
Q

What is a confidence interval?

A

A range of values within which we are confident (e.g. 95% confident) that the true population value lies. It quantifies uncertainty and indicates the precision of our sample statistic.

14
Q

What is a point estimate?

A

A single value calculated from the sample as the best estimate of a population parameter, for example the sample mean. It is just one value and does not take into account that this value would change from sample to sample.

15
Q

When does the width of the CI increase?

A

When:

  • the sample size is small
  • there is a lot of variability in the data
  • the level of confidence increases (e.g. from 95% to 99%)
16
Q

When do we use the t-distribution?

A

When the sample size is small (say, under 30) and the population standard deviation is unknown, so it has to be estimated from the sample

17
Q

What is the formula for the t statistic (which follows the t-distribution)?

A
t = (x̄ – μ) / (s/√n)
x̄ is the sample mean
μ is the hypothesised population mean
s is the sample standard deviation
n is the sample size
The statistic has n – 1 degrees of freedom
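As a rough illustration of the formula above, the following Python sketch computes the t statistic for a single sample. The function name one_sample_t and the data are hypothetical and are not part of the original notes.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for a sample against a hypothesised mean mu0."""
    n = len(sample)
    mean = statistics.mean(sample)    # x-bar, the sample mean
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
    se = s / math.sqrt(n)             # standard error of the mean
    return (mean - mu0) / se, n - 1

# Hypothetical data: four measurements tested against a hypothesised mean of 3
t, df = one_sample_t([2, 4, 6, 8], mu0=3)
print(f"t = {t:.3f} on {df} degrees of freedom")
```

The resulting t would then be compared with the t-distribution on n - 1 degrees of freedom to obtain a P value.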
18
Q

What impacts the width of a CI?

A
  • Precision of the estimate (s.e.)
  • Level of confidence (multiplier)

19
Q

What are poor and high precision and how do they relate to the concept of a CI?

A

• Poor precision (large SE): wide interval
• High precision (small SE): narrow interval
• As sample size increases, the standard error (SE) decreases, which leads to greater precision and narrower intervals

20
Q

The larger the confidence, the….

A

…wider the interval

21
Q

The narrower the interval, the…

A

…lower the confidence

22
Q

Can you use a CI for a proportion?

A

Binomial proportions do not follow the Normal distribution, but:
• If the sample size is greater than 30 and 0.1 < p < 0.9, we can use our standard formula for the 95% confidence interval: p ± 1.96 × SE(p)
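A minimal sketch of this interval in Python, assuming the usual standard error of a proportion, SE(p) = √(p(1 − p)/n), which the card itself does not spell out. The function name proportion_ci and the numbers are hypothetical.

```python
import math

def proportion_ci(events, n, z=1.96):
    """Return (p, lower, upper) for a Normal-approximation CI; z = 1.96 gives 95%."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)   # assumed standard error of a proportion
    return p, p - z * se, p + z * se

# Hypothetical data: 40 events among 100 participants
p, lower, upper = proportion_ci(events=40, n=100)
print(f"p = {p:.2f}, 95% CI {lower:.3f} to {upper:.3f}")
```

Raising z (for example 2.58 for 99% confidence) or reducing n widens the interval, in line with the cards on CI width.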

23
Q

What is the chi-squared test?

A

The chi-squared test of association (for categorical data) compares two attributes in a sample of data to determine whether there is any relationship between them

24
Q

What would be the null hypothesis in the context of using the chi-squared test?

A

H0: there is no association between the classifications of the two attributes under investigation

25
Q

What is the chi-squared test based on?

A

The difference between the observed and expected frequencies

26
Q

What happens within the chi-squared test?

A

o Under the null hypothesis, the test statistic follows the chi-squared distribution
o The value of the test statistic is then compared with the appropriate chi-squared distribution (first proposed by Pearson)
o The greater the differences between the observed and expected frequencies, the larger the chi-squared statistic and the more evidence that the two variables are associated

27
Q

How do you calculate the expected frequencies in a 2x2 table for a chi-squared test?

A

Expected freq. = (relevant row total × relevant column total)/ total sample size

28
Q

How do you calculate the chi-squared statistic for a 2x2 table?

A

The chi-squared value is obtained by calculating:
(observed - expected)² / expected
for each of the four cells in the contingency table and then summing them.
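The calculation can be sketched in Python as follows. The function name chi_squared_statistic and the 2x2 counts are hypothetical; expected frequencies are taken as (row total × column total) / overall total.

```python
def chi_squared_statistic(table):
    """Return (chi-squared statistic, expected frequencies) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected frequency for each cell = row total x column total / overall total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    # Sum (observed - expected)^2 / expected over every cell
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return chi2, expected

# Hypothetical 2x2 table: rows are exposure groups, columns are outcome yes/no
observed = [[20, 30], [30, 20]]
chi2, expected = chi_squared_statistic(observed)
print(chi2, expected)
```

For a 2x2 table the statistic has 1 degree of freedom, for which the tabulated 5% critical value is 3.84.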

29
Q

How would you then either reject the null hypothesis or fail to reject the null hypothesis using the chi-squared test?

A

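Compare the chi-squared test statistic with the tabulated values of the chi-squared distribution corresponding to given P values for the appropriate degrees of freedom (1 for a 2x2 table). If the test statistic exceeds the tabulated value at the chosen significance level (equivalently, if the P value is below that level), reject the null hypothesis; otherwise, fail to reject it. The larger the test statistic, the smaller the P value and the more evidence against the null.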

30
Q

What is Yates’ correction for a 2x2 table?

A
  • When the number of events/sample is low, a continuity correction is usually made by subtracting 0.5 from the absolute difference |observed - expected| in each cell before squaring. This correction is referred to as Yates’ continuity correction
  • It is intended for use with ‘small’ samples, i.e. total sample size <40 or small expected numbers (cell frequency <5)
    o The correction reduces the value of chi-squared and prevents overestimation of statistical significance for small data sets
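A small Python sketch of the corrected statistic, assuming the simple form in which 0.5 is subtracted from |observed − expected| in every cell before squaring. The function name yates_chi_squared and the counts are hypothetical.

```python
def yates_chi_squared(table):
    """Chi-squared statistic for a 2x2 table with a 0.5 continuity correction in each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            # Shrink the absolute difference by 0.5 before squaring
            chi2 += (abs(observed - expected) - 0.5) ** 2 / expected
    return chi2

# Hypothetical 2x2 table of counts
corrected = yates_chi_squared([[20, 30], [30, 20]])
print(round(corrected, 2))
```

For these hypothetical counts every cell has an expected frequency of 25 and a difference of 5, so the uncorrected statistic would be 4 × 25/25 = 4.0, while the corrected one is smaller.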
31
Q

What is Fisher’s exact test, and when is it used?

A

o Fisher’s exact test compares two proportions and is needed when the numbers in the 2 x 2 table are very small (i.e. an expected frequency of less than 5)
o For the chi-squared test to be valid, most cells should have an expected frequency of more than 5 and the total sample size should be approximately 40 or more

32
Q

Can the chi-squared statistic be used for larger contingency tables?

A

Yes!

  • Larger tables are called r x c tables, where r denotes the number of rows in the table and c the number of columns.
  • The calculation for the expected frequencies then becomes: Expected number = (column total x row total) / overall total
33
Q

What is the chi-squared test for linear trend, and when is it appropriate?

A

o Appropriate for ordered categorical (ordinal) exposure variables (e.g. lifetime partners, age group, cholesterol levels).
o Not appropriate for variables in which there is no natural order, e.g. marital status, ethnic group, country of residence.
o The χ² test for trend is a more sensitive test that assesses whether there is an increasing (or decreasing) trend in the proportions over the exposure categories.

34
Q

What does the chi-squared test presume of its observations?

A

That they are independent

35
Q

What test do you use for categorical variables/observations which are NOT independent?

A

McNemar’s test. This is appropriate for paired data, such as matched pairs in a case-control study, before-and-after measurements, or comparisons between 2 observers (e.g. 2 radiographers using x-rays to diagnose TB)

36
Q

What are some examples of continuous data?

A

weight, age, blood pressure, antibody levels

37
Q

What do you need to check for continuous data?

A

The shape of the frequency distribution - this indicates what summary measures should be used on the data

38
Q

What are some examples of how continuous data is displayed?

A

Histogram, scatter-plot, line plot, box plot

39
Q

What are some recommendations for how continuous data can be summarized?

A
  • For normally distributed data: Mean and SD
  • For non-normal data: Median and interquartile range (25th - 75th percentile)

40
Q

When is it appropriate to use Student’s T-Test?

A

For the comparison of means

41
Q

When is it appropriate to use a one-tailed t-test?

A

o Imagine you have developed a new drug that you believe is an improvement over an existing drug. So you opt for a one-tailed test. Therefore, you fail to test for the possibility that the new drug is less effective than the existing drug. The consequences in this example are extreme, but they illustrate a danger of inappropriate use of a one-tailed test.

o Imagine you have a new drug which is cheaper than the existing drug and, you believe, no less effective. You do not care if it is more effective. You only wish to show that it is not less effective. In this scenario, a one-tailed test would be appropriate (the consequences of not testing the effect in the other direction are negligible and ethical)

42
Q

What are paired t-tests based on?

A

o A paired t-test is based on differences within each subject
o Each subject acts as their own control
o Measurements on the same subject are not independent
o Measurements on different subjects are independent

43
Q

What are the underlying assumptions of t-tests?

A

o The sample means of the groups being compared should follow normal distributions. Fortunately, it can be proved (central limit theorem) that this will be approximately true if you have enough data.
o The data used should either be sampled independently or fully paired (for a paired test).
o In the original formulation of Student’s t-test, the variances of the populations being compared should be equal. However, modern statistical software allows for unequal variances (in R, the default option for t.test is “var.equal=FALSE”, which allows for unequal variances).

44
Q

What if you are comparing more than two means? Which test would you use?

A

ANOVA (analysis of variance)

45
Q

What is one-way ANOVA used for?

A

o One-way ANOVA is used to compare the mean of a numerical outcome variable in the groups defined by an exposure level with two or more categories.
o It is called one-way as the exposure groups are classified by just one variable.

46
Q

What is the definition of precision in the context of diagnostics?

A

How close diagnostic test results are to each other

47
Q

What is the definition of sensitivity in the context of diagnostics?

A

The proportion of people with the disease or condition that test positive

48
Q

What is the definition of specificity in the context of diagnostics?

A

The proportion of people without the disease or condition that test negative

49
Q

What is the formula to calculate sensitivity?

A

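A/(A+C), where A = true positives and C = false negatives in the 2x2 table of test result against disease status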

50
Q

What is the formula to calculate specificity?

A

D/(B+D), where D = true negatives and B = false positives

51
Q

What is the positive predictive value, and how is it calculated?

A

Proportion of people testing positive who have the condition. It is calculated as A/(A+B)

52
Q

What is the negative predictive value and how is it calculated?

A

Proportion of people testing negative who do not have the disease. It is calculated as D/(C+D)
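The four measures can be sketched together in Python. This assumes the conventional 2x2 layout, which is not spelled out in the cards: A = true positives, B = false positives, C = false negatives, D = true negatives. The function name diagnostic_accuracy and the counts are hypothetical.

```python
def diagnostic_accuracy(a, b, c, d):
    """a = true positives, b = false positives, c = false negatives, d = true negatives."""
    return {
        "sensitivity": a / (a + c),   # diseased people who test positive
        "specificity": d / (b + d),   # non-diseased people who test negative
        "ppv": a / (a + b),           # positive tests that are truly diseased
        "npv": d / (c + d),           # negative tests that are truly non-diseased
    }

# Hypothetical study: 100 people with the disease, 900 without
measures = diagnostic_accuracy(a=90, b=40, c=10, d=860)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are each computed within one disease-status column, whereas PPV and NPV mix the two columns, so the predictive values change if the ratio of diseased to non-diseased people changes.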

53
Q

What is the crucial difference to remember between sens/spec and predictive values?

A

Sensitivity and specificity depend on the test itself - whereas NPV and PPV depend on the prevalence of a condition or disease among the population

54
Q

What are the four main barriers to the development and use of diagnostics in LMICs?

A

1) Lack of investment and innovation
2) Limited access to diagnostic tests
3) Lack of regulatory control and quality standards for evaluation
4) Infrastructure and human resource capacity

55
Q

What is a reference standard?

A

The best test we have available to estimate an individual’s disease status

56
Q

What is the index test?

A

A new or improved test which is tested against the reference standard

57
Q

What is economic evaluation in the context of test diagnostics?

A

“… the comparative analysis of alternative courses of action in terms of both their costs and consequences.”

58
Q

What does correlation do?

A

Measures the strength of linear association between two continuous variables (exposure and outcome)

59
Q

What are the four components of the Pearson correlation coefficient?

A
  • True value in the population (⍴)
  • Estimated in sample by r
  • Can take values between -1 and 1
  • It is only valid within the range of values in the sample
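As an illustration, the sample estimate r can be computed from first principles. This sketch assumes the standard product-moment formula, which the cards do not state; the function name pearson_r and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical exposure (x) and outcome (y) values
r = pearson_r([1, 2, 3, 4], [1, 3, 2, 4])
print(round(r, 2))
```

Points lying exactly on a rising straight line give r = 1, and points on a falling straight line give r = -1.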
60
Q

What is the r score if there is no correlation?

A

r=0

61
Q

What is the r score of an imperfect positive correlation

A

0 < r < 1

62
Q

What is the r score of a perfect positive correlation?

A

r=1

63
Q

What is the r score of an imperfect negative correlation?

A

-1 < r < 0

64
Q

What is the r score of a perfect negative correlation?

A

r= -1

65
Q

What does r =-1 indicate?

A

A perfect negative linear relationship; as the value of one variable increases, the value of another decreases

66
Q

What does r=1 indicate?

A

A perfect positive linear relationship. As the value of one variable increases, the value of the other increases

67
Q

What does r=0 indicate?

A

There is no linear relationship between the 2 continuous variables

68
Q

What are arbitrary labels for strength of positive correlation

A

0 - 0.19 very weak
0.2 - 0.39 weak
0.4 - 0.59 moderate
0.6 - 0.79 strong
0.8 - 1.0 very strong
69
Q

How would you word a hypothesis test for a correlation coefficient?

A

H0 : ⍴ = 0 (no linear relationship in the population)

H1 : ⍴ ≠ 0 (linear relationship exists in the population)

70
Q

What is association NOT?

A

Causation!

71
Q

What does correlation NOT imply?

A

Causation!

72
Q

What does correlation measure?

A

Strength of linear association! (between 2 continuous variables - outcome and exposure)

73
Q

When is correlation inappropriate?

A

For non-linear relationships, more than one observation from each individual, and for data with a lot of outliers (can have a powerful effect on the correlation coefficient, esp with a small sample)

74
Q

What is simple linear regression?

A

o Simple linear regression describes the relationship between two continuous variables.
o Simple linear regression gives the equation of the straight line that best describes the linear association between two continuous variables.
o It enables the prediction of one variable using information from another variable.

75
Q

What is the dependent variable in simple linear regression?

A

The dependent variable is the variable to be predicted (i.e. the particular outcome of interest). It is denoted as Y

76
Q

What is the independent variable in simple linear regression?

A

The independent variable or explanatory variable is the variable used for predicting the particular outcome. It is denoted as X

77
Q

How is simple linear regression explained in terms of x and y?

A

Regression of Y on X

78
Q

In simple linear regression, on which axis is the exposure variable (the independent variable) plotted?

A

The horizontal axis (x)

79
Q

In simple linear regression, on which axis is the outcome variable (the dependent variable) plotted?

A

The vertical axis (y)

80
Q

What does the linear regression give us?

A

The equation of the straight line that best describes the linear association between the outcome (y) and the exposure (x)

81
Q

In the context of simple linear regression, how would you word the interpretation of a Ho (where the Ho is that there is no linear relationship) using the test statistic obtained?

A

If the test statistic is large (small P value): there is evidence against the null hypothesis that there is no linear relationship in the population

82
Q

In simple linear regression, how do you make the intercept meaningful?

A

By centering the exposure variable - which is when you subtract the mean so that the new exposure variable has a mean of 0

83
Q

What is the equation for the regression line?

A

Y = B0 + B1X

84
Q

What do the components of the equation of the regression line stand for?

A

B0 is the intercept (the value of Y when X = 0)
B1 is the slope of the line (the increase in Y for every unit increase in X)
Y is the dependent variable (the variable of interest), and X is the independent variable

85
Q

What are residuals in the context of linear regression?

A

The difference between the observed value and the predicted value (as calculated from the regression equation) - basically between the point value and the best fit line

Residual = Observed (Y) - Predicted (Y’)

The method of least squares chooses the line that minimizes the sum of squared residuals

86
Q

What does examining residuals help you do in the context of simple linear regression?

A

To test the quality of the fit of the model (the best fit line)

87
Q

In addition to residuals, what is another method you can use to test the quality of the fit of the model?

A

Look at the coefficient of determination (R squared). This is interpreted as the percentage of variance in the dependent variable (Y) that can be explained by the independent variable (X).

88
Q

What does the R squared equal?

A

The regression sum of squares divided by the total sum of squares
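The regression cards above can be tied together in a short Python sketch: a least-squares line, its residuals, and R squared as regression sum of squares over total sum of squares. It assumes the standard closed-form least-squares estimates for one exposure variable; the function name fit_line and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x. Returns (b0, b1, residuals, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                  # slope: change in y per unit increase in x
    b0 = mean_y - b1 * mean_x       # intercept: predicted y when x = 0
    predicted = [b0 + b1 * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]   # observed - predicted
    ss_regression = sum((pi - mean_y) ** 2 for pi in predicted)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return b0, b1, residuals, ss_regression / ss_total

# Hypothetical exposure (x) and outcome (y) values
b0, b1, residuals, r_squared = fit_line([1, 2, 3, 4], [1, 3, 2, 4])
print(f"Y = {b0:.2f} + {b1:.2f}X, R squared = {r_squared:.2f}")
```

With a single exposure variable, R squared equals the square of the Pearson correlation coefficient, and the residuals of a least-squares line with an intercept sum to zero.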

89
Q

What is an adjusted R square?

A

It takes into account the number of explanatory variables (Xs) and the sample size

90
Q

What are the three assumptions underpinning linear regression?

A
  • There should be a linear relationship between the dependent variable and the independent variable
  • The residuals should be normally distributed
  • The variance of the dependent variable (Y) values should be the same for all values of the independent variable (X)
91
Q

How do you check the assumptions in simple linear regression?

A

o Linearity should be assessed prior to carrying out linear regression
o After the regression model has been fitted to the data it is essential to check that the assumptions of linear regression have not been violated
o If any of the assumptions have been violated then inference on the basis of the regression model is likely to be invalid

92
Q

What is multiple linear regression?

A
  • To examine the dependency of a numerical outcome variable on several exposure variables
  • Independent variables can be continuous, binary, categorical or ordinal
  • It can be used for prediction and adjustment for confounding
93
Q

What is the equation of the multiple linear regression model?

A

Y = B0 + B1X1 + B2X2

The intercept B0 is the value of the outcome Y when both exposure variables X1 and X2 are zero.

94
Q

What is FEV1 in the context of multiple linear regression?

A

It is the outcome variable (Y): FEV1 is the forced expiratory volume in one second, a measure of lung function

95
Q

What kind of data is Y (the dependent variable, the one of interest) in linear regression?

A

Continuous

96
Q

What kind of data is Y (the dependent variable, the one of interest) in logistic regression?

A

Binary

97
Q

What kind of data is Y (the dependent variable, the one of interest) in poisson regression?

A

count/rate

98
Q

What kind of data is Y (the dependent variable, the one of interest) in survival analysis?

A

time to event

99
Q

How is logistic regression different to linear regression?

A

In linear regression, the outcome variable (Y’) is quantitative, but in logistic regression, it is qualitative

100
Q

Summarize linear regression in terms of y, a, b and explain

A

Y’ = a + bX. Change in Y due to a 1 unit increase in X = b

101
Q

Summarize logistic regression in terms of y, a, b and explain

A

Log odds = a + bX

Change in log odds due to a one unit increase in X = b

102
Q

What does Logit transformation do?

A

Transforms the probability (p, or risk) to log odds
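A minimal sketch of the transformation and its inverse in Python, assuming the usual definition logit(p) = log(p / (1 − p)) with natural logarithms. The function names and the coefficient value are hypothetical.

```python
import math

def logit(p):
    """Transform a probability (risk) into log odds."""
    return math.log(p / (1 - p))

def odds_from_log_odds(log_odds):
    """Transform log odds back to odds with the exponential function."""
    return math.exp(log_odds)

# A risk of 0.8 corresponds to odds of 0.8 / 0.2 = 4
log_odds = logit(0.8)
odds = odds_from_log_odds(log_odds)

# Hypothetical logistic regression coefficient b on the log odds scale
b = 0.693
odds_ratio = math.exp(b)   # multiplicative change in odds per unit increase in X
print(round(log_odds, 3), round(odds, 3), round(odds_ratio, 2))
```

Exponentiating a coefficient in this way is what turns a change in log odds into an odds ratio, which is easier to interpret.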

103
Q

Log odds isn’t intuitive, so we…

A

“transform” back to odds using exponential function

104
Q

What kinds of studies are associated with logistic regression?

A

Case control (for confounding), and cohort studies

105
Q

What kind of advanced analysis might be associated with RCTs?

A

linear regression

106
Q

What is the definition of prevalence?

A

The frequency of an event of interest - for example a disease, condition, or characteristic - in a population

107
Q

What is the definition of point prevalence?

A

The frequency of an event of interest - for example disease, condition, or characteristic - in a population at ONE POINT in time

108
Q

What is the definition of period prevalence?

A

The frequency of an event of interest - for example disease, condition, or characteristic - at any point during a period of time in the recent past

109
Q

What is the definition of incidence?

A

The measure of occurrence of new cases over time

110
Q

For rare events, odds are….

A

…approximately equal to risks

111
Q

When do we use poisson regression?

A

For modelling data where a rate ratio is the outcome, and for count data

112
Q

What is the kind of data in poisson regression?

A

count data!

113
Q

What Is count data?

A

Data generated by a process that results in only non-negative integers

114
Q

What are some examples of count data?

A

the number of particles found in a unit of space (eg number of malaria parasites in a blood smear), number of daily births in a ward, number of crimes on a block, number of radioactive particles from a particular source

115
Q

What are three common attributes of count data?

A

They are typically skewed
They are discrete
They only take non-negative values

116
Q

Why is the poisson distribution used for count data?

A

Because count data are typically skewed, the normal distribution is usually not appropriate

117
Q

What kind of distribution is the poisson distribution?

A

theoretical

118
Q

When is the Poisson distribution appropriate?

A
When events occur:
  • randomly
  • independently
  • at a constant underlying rate over time
119
Q

How is the poisson distribution described?

A

By its rate: the mean number of occurrences of an event per unit time

120
Q

What is the unique property of the poisson distribution?

A

the mean and the variance are equal!

121
Q

What are some examples of count data which are NOT Poisson?

A

infectious diseases occurring in clusters

physical events, such as parasitic eggs, which tend to group together

122
Q

What kinds of events is the poisson distribution suitable for modelling?

A

rare events

123
Q

What is the poisson regression formula?

A

rate = number of events (r)/ total person-time (T)

124
Q

What are the two main assumptions of poisson distribution?

A

• Events are independent (assessed based on knowledge of the study design and data collection process)
• Equidispersion: mean = variance (can be checked in the data)

125
Q

For poisson regression, the parameter estimates are interpreted in the same fashion as which other regression?

A

logistic regression (the model is fit on a log-scale)

126
Q

In the context of poisson regression, what is over-dispersion?

A

The variance is larger than the mean

127
Q

In the context of poisson regression, what is under-dispersion?

A

The variance is smaller than the mean
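A crude way to look at dispersion is to compare the sample variance of the counts with their mean. The Python sketch below does only that, on hypothetical data with a hypothetical function name; in a fitted Poisson regression, dispersion would normally be judged after accounting for the explanatory variables, which this sketch does not do.

```python
import statistics

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; roughly 1 is consistent with equidispersion."""
    return statistics.variance(counts) / statistics.mean(counts)

# Hypothetical daily event counts
counts = [0, 1, 2, 1, 0, 3, 2, 1, 4, 1]
ratio = dispersion_ratio(counts)
print(round(ratio, 2))
```

A ratio well above 1 points towards over-dispersion and a ratio well below 1 towards under-dispersion.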

128
Q

When can the poisson distribution be used for modelling rates?

A

If the events occur:

  • independently
  • at a constant underlying rate
129
Q

In the context of the Poisson distribution, what is normally a problem?

A

Over dispersion

130
Q

What kind of outcome is linear regression used for?

A

continuous outcome (quantitative)

131
Q

What kind of outcome is logistic regression used for?

A

binary outcome (qualitative)

132
Q

What is poisson regression used for?

A

rates or events during an exposure period

133
Q

How does survival analysis differ from poisson?

A

In Poisson regression, the data are assumed to have an underlying rate which is constant over time, but this may not always be reasonable to presume. That is where survival analysis comes in.

134
Q

What are the two measures of disease occurrence that allow the rate of occurrence to change over time?

A
  • The hazard function, h(t)
    This is the instantaneous rate of the event occurring at time t, among those who have survived up to time t
  • The survivor function, S(t)
    This is the probability that an individual will survive (i.e. has not experienced the event of interest) beyond time t
135
Q

In the survivor function, what does the Y axis indicate?

A

% alive

136
Q

In the survivor function, what does the X axis indicate?

A

time

137
Q

In the context of survival analysis, what is censoring?

A

When a participant is censored, they were not observed to experience the event during their follow-up, so their exact survival time is unknown

138
Q

What is right censoring?

A

When an individual hasn’t had the event during the study, but could still go on past the study (eg those still alive at the end of the study). They could also be lost to follow up!

139
Q

What is left censoring?

A

When the event has already happened before entry into the study, so the exact event time is unknown

140
Q

By what is survival data defined?

A

time when the event occurs, event indicator (an indicator of whether the event has occurred or not)

141
Q

What do vertical tick marks indicate on a K-M curve?

A

Censoring

142
Q

When does the curve drop on a K-M curve?

A

When there is an event
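A minimal sketch of how a K-M curve is built, assuming the standard product-limit calculation: at each event time the running survival probability is multiplied by (1 − events / number at risk), while censored participants leave the risk set without causing a drop. The function name kaplan_meier and the data are hypothetical.

```python
def kaplan_meier(times, events):
    """times: follow-up time per participant; events: 1 = event observed, 0 = censored.
    Returns a list of (time, survival probability) at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / at_risk   # the curve drops only at event times
            curve.append((t, survival))
        at_risk -= n_leaving                     # events and censored both leave the risk set
    return curve

# Hypothetical follow-up: events at times 1, 3 and 4; censoring at times 2 and 5
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

The censored participant at time 2 produces no step, but does reduce the number at risk for the later events, which is how censoring is taken into account.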

143
Q

Why can’t we use a t-test on mean time-to-event, or linear regression, to compare groups?

A

It ignores censoring!

144
Q

What does a log rank test do?

A

Evaluates whether or not the K-M survival curves for 2 or more groups are statistically significantly different

145
Q

What are the limitations of K-M curves?

A

They are mainly descriptive

Cannot control for all covariates - just subgroup analyses

Cannot accommodate time-dependent variables

146
Q

What is Cox’s proportional hazards regression?

A
  • a regression model for survival data (TIME TO EVENT DATA)
  • It provides an estimate of the hazard ratio and its CI
  • It simultaneously explores the effects of several variables on survival
147
Q

Which other ratio is the hazard ratio interpreted like?

A

The risk ratio (relative risk)

148
Q

What are the assumptions associated with Cox’s proportional hazard regression?

A

• We assume that the ratio of the hazards remains constant (or proportional) over time, even if the underlying hazards change
• This can be checked by plotting the log(-log()) transformed survivor estimate for each of the groups

149
Q

What does the cox regression model assume?

A

That the hazards are proportional (the hazard ratio is constant over time) and that all censoring is independent of the outcome

150
Q

What test does the K-M survival curve use?

A

the log rank test to compare survival between two groups

151
Q

What is the outcome of survival analysis?

A

time to an event

152
Q

What are the assumptions of K-M survival analysis?

A

1) Survival probabilities are the same for participants who joined the study late and those who joined early, i.e. the factors that affect survival are assumed not to change over the course of the study.
2) Events occur at the times specified.
3) Censoring does not depend on the outcome of interest: censoring is INDEPENDENT of outcome.
4) Censoring is similar in all groups

153
Q

What needs to be performed on a K-M analysis to make any inferences?

A

The log-rank test

154
Q

Why is presenting an adjusted OR score important?

A

It’s particularly useful for helping us understand how a predictor variable affects the odds of an event occurring, after adjusting for the effect of other predictor variables

155
Q

What are the assumptions of a Cox regression?

A

The hazards are proportional
The hazard ratio is constant over time
Any censoring must be independent of outcome

156
Q

What kind of events is Poisson regression used for modelling?

A

Rare events

157
Q

What are the assumptions of poisson regression?

A

there is a constant underlying rate which is fixed over time
The data of the response variable is count data
The mean and the variance are equal (v unique!)
The distribution of counts follows a poisson distribution
Observations are independent

158
Q

What kind of data is Cox regression used for?

A

time-to-event

159
Q

What is the full name of Cox’s regression?

A

proportional hazards regression.

160
Q

What kind of data is used in Poisson regression?

A

Count data

161
Q

What kind of response variable data is used in traditional linear regression?

A

continuous data

162
Q

Regression is….

A

…..a statistical method that can be used to determine the relationship between one or more predictor variables and a response variable.

163
Q

The response variable is….

A

….the dependent variable!

164
Q

What are the assumptions of logistic regression?

A
  1. Response variable (dependent variable) is binary (categorical)
  2. Observations are independent
  3. There are no extreme outliers
  4. There is a Linear Relationship Between Explanatory Variables and the Logit of the Response Variable
  5. Sample size is sufficiently large
165
Q

What are the assumptions of linear regression?

A
  1. Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.
  2. Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.
  3. Homoscedasticity: The residuals have constant variance at every level of x.
  4. Normality: The residuals of the model are normally distributed.
166
Q

What are the limitations of cox’s regression?

A
  • Possible to miss important variables
167
Q

What are the limitations of logistic regression?

A
  1. The main limitation of logistic regression is the assumption of linearity between the logit (log odds) of the dependent variable and the independent variables. In the real world, data rarely follow such a tidy relationship.
  2. If the number of observations is smaller than the number of features, logistic regression should not be used, as it may overfit.
  3. Logistic regression can only be used to predict discrete outcomes. The dependent variable is therefore restricted to a discrete set of values, which rules out the prediction of continuous data.
168
Q

What are the limitations of linear regression?

A

The main limitation is the assumption of linearity between the dependent variable and the independent variables
Very sensitive to outliers

169
Q

What are the limitations of poisson regression?

A

Heterogeneity in the data — there is more than one process that is generating the data. For example, the data might be collected on more than one group of people, unknowingly
Overdispersion — when the variance of the fitted model is larger than what is expected by the assumptions (the mean and the variance are equal)

170
Q

What are the limitations of k-m survival analysis?

A

1) We need to perform the log-rank test to make any kind of inference.
2) Kaplan-Meier results can easily be biased, because it is a univariate approach that cannot adjust for other variables.
3) Removing censored data changes the shape of the curve, which biases the estimates.
4) Statistical tests and conclusions can become misleading if continuous variables are dichotomized.
5) Dichotomizing means using a statistical measure such as the median to create groups, which discards information and may lead to problems in the analysis.

171
Q

What is poisson distribution?

A

a probability distribution that is used to model the probability that a certain number of events occur during a fixed time interval.

172
Q

What does Cox regression do that K-M cannot?

A

Carry out a multivariable analysis (the effects of several covariates at once)