Advanced Analysis and Hypothesis Tests Flashcards
What is a t-distribution?
Similar to the standard Normal distribution, but it is a family of curves whose shape depends on the degrees of freedom.
What is hypothesis testing?
Using data to “weigh up the evidence” and using that evidence to decide whether to reject a pre-defined statement (the null hypothesis)
What are the five steps of hypothesis testing?
- State the null hypothesis
- Calculate the appropriate test statistic
- Obtain a P value for the test statistic
- Make the decision whether to reject the null hypothesis based on P value
- State the conclusion in terms of the original research question
What is the null hypothesis?
- A statement about the value of a population parameter or the difference between groups
- Usually the negation of the research hypothesis
- Usually “the effect/association of interest is zero”
What is the alternative hypothesis?
- Opposite of the null hypothesis
- Usually related to the research question
How do we calculate the test statistic?
Test statistic = (observed value - hypothesised value) / standard error
What is the relationship between the test statistic and the null hypothesis?
The larger the absolute value of the test statistic (positive or negative), the more evidence there is against the null hypothesis. The value of the test statistic is used to decide whether to reject the null hypothesis.
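The five steps and the test-statistic formula can be sketched in a few lines. This is only an illustration: the numbers are hypothetical, the standard Normal distribution is assumed for the P value, and the 5% significance level is an assumed convention rather than something the cards specify.

```python
import math

# Step 1: null hypothesis, H0: population mean = 100 (hypothetical values throughout)
observed_mean = 103.0      # sample statistic
hypothesised_mean = 100.0  # value stated by the null hypothesis
standard_error = 1.5       # standard error of the sample mean

# Step 2: test statistic = (observed - hypothesised) / standard error
z = (observed_mean - hypothesised_mean) / standard_error

# Step 3: two-sided P value, assuming a standard Normal reference distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

# Step 4: decision at an assumed 5% significance level
reject_null = p_value < 0.05

# Step 5: state the conclusion in terms of the research question
print(f"z = {z:.2f}, P = {p_value:.4f}, reject H0: {reject_null}")
```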
What is the goal of estimation?
We want to estimate the population parameter based on the sample statistic.
• The sample must therefore be representative of the population
How is estimation different from hypothesis testing?
Hypothesis testing is concerned with using the data to ‘weigh up the evidence’ and decide whether or not to reject a pre-specified statement (the null hypothesis), whereas estimation gives us a ‘best estimate’ for the population value along with a range of likely values (a confidence interval)
What is the definition of a population parameter?
A measurable characteristic of the population (e.g. mean = μ, proportion = π, standard deviation = σ). Values obtained from a sample are estimates of the population parameters.
What are sample statistics?
Sample statistics are estimates of results that would have been obtained had the whole population been studied
What are the two different kinds of estimation?
Point estimation and interval estimation
What is a confidence interval?
A range of values within which we are confident that the true population value lies. It quantifies uncertainty and indicates the precision of our sample statistic
What is a point estimate?
A single value, such as a sample mean, used as the best estimate of the population parameter. It doesn’t take into account that this value would change from sample to sample
When does the width of the CI increase?
When:
- the sample size is small
- there is a lot of variability in the data
- the level of confidence (e.g. 99%) increases
When do we use the t-distribution?
When the sample size is small (say, under 30) and the population standard deviation has to be estimated from the sample
What is the formula for the t statistic?
t = (x̄ - μ) / (s/√n), where:
- x̄ is the sample mean
- μ is the population mean
- s is the sample standard deviation
- n is the size of the given sample
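As a rough worked example of this formula, the sketch below computes t for a small made-up sample. The data and the hypothesised mean are assumptions for illustration; the result would be compared with a t-distribution on n - 1 degrees of freedom.

```python
import math
import statistics

sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]  # hypothetical measurements
mu = 5.0                                  # hypothesised population mean

n = len(sample)
x_bar = statistics.mean(sample)   # sample mean
s = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)

t = (x_bar - mu) / (s / math.sqrt(n))
df = n - 1                        # degrees of freedom for the reference t-distribution

print(f"t = {t:.3f} on {df} degrees of freedom")
```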
What impacts the width of a CI?
- Precision of the estimate (s.e.)
- Level of confidence (multiplier)
What are poor and high precision and how do they relate to the concept of a CI?
• Poor precision (large SE): wide interval
• High precision (small SE): narrow interval
• As sample size increases, standard error (SE) decreases, which leads to greater precision and narrower intervals
The larger the confidence, the….
….wider the interval
The narrower the interval, the…
…lower the confidence
Can you use a CI for a proportion?
Binomial proportions are not from the normal distribution but:
• If the sample size is greater than 30 and 0.1 < p < 0.9, we can use our standard formula for the confidence interval: p ± 1.96 × SE(p)
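A minimal sketch of this interval follows. It assumes the usual Normal-approximation standard error for a proportion, SE(p) = √(p(1 - p)/n), which the card does not state, and uses made-up counts. The 2.576 multiplier for 99% confidence is included to show that higher confidence gives a wider interval.

```python
import math

def proportion_ci(events, n, multiplier=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)  # assumed standard error formula
    return p - multiplier * se, p + multiplier * se

# Hypothetical data: 40 events in a sample of 100
lower_95, upper_95 = proportion_ci(40, 100)
lower_99, upper_99 = proportion_ci(40, 100, multiplier=2.576)

print(f"95% CI: {lower_95:.3f} to {upper_95:.3f}")
print(f"99% CI: {lower_99:.3f} to {upper_99:.3f}")
```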
What is the chi-squared test?
The chi-squared test of association (for categorical data) compares two attributes in a sample of data to determine whether there is any relationship between them
What would be the null hypothesis in the context of using the chi-squared test?
H0: there is no association between the classification of the two attributes under investigation
What is the chi-squared test based on?
The difference between the observed and expected frequencies
What happens within the chi-squared test?
- Under the null hypothesis the test statistic follows the chi-squared distribution
- The value of the test statistic is then compared with the appropriate chi-squared distribution (first proposed by Pearson)
- The greater the differences between the observed and expected frequencies, the larger the chi-squared statistic and the more evidence that the two variables are associated
How do you calculate the expected frequencies in a 2x2 table for a chi-squared test?
Expected freq. = (relevant row total × relevant column total)/ total sample size
How do you calculate the chi-squared statistic for a 2x2 table?
The chi-squared value is obtained by calculating (observed - expected)² / expected for each of the four cells in the contingency table and then summing them.
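The expected-frequency and chi-squared calculations can be sketched together as below. The counts are hypothetical. The same loop also works for larger r x c tables, since expected = row total × column total / overall total in every case.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical 2x2 table: rows = exposure groups, columns = outcome yes/no
observed = [[30, 20],
            [10, 40]]
chi2 = chi_squared(observed)
print(f"chi-squared = {chi2:.2f}")
```

For a 2x2 table (1 degree of freedom) the 5% critical value is 3.84, so a statistic this large would be evidence against the null hypothesis of no association.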
How would you then either reject the null hypothesis or fail to reject the null hypothesis using the chi-squared test?
Compare the chi-squared test statistic with the tabulated critical values of the chi-squared distribution for the appropriate degrees of freedom ((rows - 1) × (columns - 1), so 1 for a 2x2 table). The larger the test statistic, the smaller the P value and the more evidence against the null hypothesis. If the statistic exceeds the critical value for the chosen significance level, you reject the null hypothesis; otherwise you fail to reject it.
What is Yates’ correction for a 2x2 table?
- When the number of events or the sample size is low, a continuity correction is usually made by subtracting 0.5 from the absolute difference between the observed and expected frequency in each cell before squaring. This is referred to as Yates’ continuity correction
- It is intended for use with ‘small’ samples, i.e. total sample size <40 or small expected numbers (cell frequency <5)
- The correction reduces the value of chi-squared and prevents overestimation of statistical significance for small data sets
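To see the effect of the correction, the sketch below computes the statistic for a small hypothetical 2x2 table with and without subtracting 0.5 from each |observed - expected|. The function name and counts are illustrative only.

```python
def chi_squared_2x2(table, yates=False):
    """Chi-squared statistic for a 2x2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            difference = abs(observed - expected)
            if yates:
                difference -= 0.5  # continuity correction
            statistic += difference ** 2 / expected
    return statistic

small_table = [[8, 4],
               [3, 9]]  # hypothetical counts, total sample size 24
uncorrected = chi_squared_2x2(small_table)
corrected = chi_squared_2x2(small_table, yates=True)
print(f"uncorrected = {uncorrected:.2f}, corrected = {corrected:.2f}")
```

In this example the uncorrected statistic is above the 5% critical value of 3.84 and the corrected one is below it, which illustrates how the correction guards against overstating significance in small samples.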
What is Fisher’s exact test, and when is it used?
- Fisher’s exact test is used to compare two proportions when the numbers in the 2 x 2 table are very small (i.e. an expected frequency of less than 5)
- For the chi-squared test to be valid, most cells should have an expected frequency of more than 5 and the total sample size should be approximately 40 or more
Can the chi-squared statistic be used for larger contingency tables?
Yes!
- Larger tables are called r x c tables, where r denotes the number of rows in the table and c the number of columns.
- the calculation for the expected frequencies then becomes: Expected number = column total x row total/overall total
What is the chi-squared test for linear trend, and when is it appropriate?
o Appropriate for ordered categorical (ordinal) exposure variables (e.g. lifetime partners, age- group, cholesterol levels).
o Not appropriate for variables in which there is no natural order e.g. marital status, ethnic group, country of residence.
o The χ² test for trend is a more sensitive test that assesses whether there is an increasing (or decreasing) trend in the proportions over the exposure categories.
What does the chi-squared test presume of its observations?
That they are independent
What test do you use for categorical variables/observations which are NOT independent?
McNemar’s test. This is appropriate for paired data, such as matched pairs in a case-control study, before-and-after measurements, or comparisons between 2 observers (e.g. 2 radiographers using X-rays to diagnose TB)
What are some examples of continuous data?
weight, age, blood pressure, antibody levels
What do you need to check for continuous data?
The shape of the frequency distribution - this indicates what summary measures should be used on the data
What are some examples of how continuous data is displayed?
Histogram, scatter-plot, line plot, box plot
What are some recommendations for how continuous data can be summarized?
- For normally distributed data: Mean and SD
- For non-normal data: Median and interquartile range (25th -75th percentile)
When is it appropriate to use Student’s T-Test?
For the comparison of means
When is it appropriate to use a one-tailed t-test?
- Rarely, and with caution. Imagine you have developed a new drug that you believe is an improvement over an existing drug, so you opt for a one-tailed test. You then fail to test for the possibility that the new drug is less effective than the existing drug. The consequences in this example are extreme, but they illustrate the danger of inappropriate use of a one-tailed test.
- Imagine instead that you have a new drug which is cheaper than the existing drug and, you believe, no less effective. You do not care if it is more effective; you only wish to show that it is not less effective. In this scenario, a one-tailed test would be appropriate (the consequences of not testing for an effect in the other direction are negligible and ethically acceptable)
What are paired t-tests based on?
o A paired t-test is based on differences within each subject
o Each subject acts as their own control
o Measurements on the same subject are not independent
o Measurements on different subjects are independent
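A paired t statistic can be sketched by reducing each subject to a single within-subject difference and then testing whether the mean difference is zero. The before and after values below are hypothetical.

```python
import math
import statistics

# Hypothetical before/after measurements on the same five subjects
before = [140, 152, 138, 147, 160]
after = [135, 150, 130, 146, 151]

# Each subject acts as their own control: analyse the within-subject differences
differences = [b - a for b, a in zip(before, after)]

n = len(differences)
mean_diff = statistics.mean(differences)
se_diff = statistics.stdev(differences) / math.sqrt(n)

t = mean_diff / se_diff  # compared with a t-distribution on n - 1 degrees of freedom
print(f"mean difference = {mean_diff}, t = {t:.3f}, df = {n - 1}")
```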
What are the underlying assumptions of t-tests?
- The sample means being compared should follow normal distributions. Fortunately, it can be proved (the central limit theorem) that this will be approximately true if you have enough data.
- The data used should either be sampled independently or fully paired (for a paired test).
- In the original formulation of Student’s t-test, the variances of the populations being compared should be equal. However, modern statistical software allows for unequal variances (in R, the default option for t.test is “var.equal=FALSE”, which allows for unequal variances).
What if you are comparing more than two means? Which test would you use?
ANOVA (analysis of variance)
What is one-way ANOVA used for?
o One-way ANOVA is used to compare the mean of a numerical outcome variable in the groups defined by an exposure level with two or more categories.
o It is called one-way as the exposure groups are classified by just one variable.
What is the definition of precision in the context of diagnostics?
How close diagnostic test results are to each other
What is the definition of sensitivity in the context of diagnostics?
The proportion of people with the disease or condition that test positive
What is the definition of specificity in the context of diagnostics?
The proportion of people without the disease or condition that test negative
What is the formula to calculate sensitivity?
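A/(A+C), taking the cells of the 2x2 table as A = true positives, B = false positives, C = false negatives, D = true negatives
What is the formula to calculate specificity?
D/(B+D)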
What is the positive predictive value, and how is it calculated?
Proportion of people testing positive who have the condition. It is calculated as A/(A+B)
What is the negative predictive value and how is it calculated?
Proportion of people testing negative who do not have the disease. It is calculated as D/(C+D)
What is the crucial difference to remember between sens/spec and predictive values?
Sensitivity and specificity depend on the test itself, whereas NPV and PPV also depend on the prevalence of the condition or disease in the population
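The sketch below computes all four measures from a 2x2 table and shows the predictive values shifting with prevalence while sensitivity and specificity stay fixed. It assumes the conventional layout A = true positives, B = false positives, C = false negatives, D = true negatives, which the cards do not spell out. The test characteristics and prevalences are made up.

```python
def diagnostic_measures(a, b, c, d):
    """a = true positives, b = false positives, c = false negatives, d = true negatives."""
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
    }

def expected_counts(prevalence, sensitivity, specificity, n=10_000):
    """Expected 2x2 cell counts for a hypothetical population of size n."""
    diseased = n * prevalence
    healthy = n - diseased
    a = diseased * sensitivity   # true positives
    c = diseased - a             # false negatives
    d = healthy * specificity    # true negatives
    b = healthy - d              # false positives
    return a, b, c, d

# Same hypothetical test (90% sensitive, 95% specific) at two prevalences
high = diagnostic_measures(*expected_counts(0.20, 0.90, 0.95))
low = diagnostic_measures(*expected_counts(0.01, 0.90, 0.95))

print(f"PPV at 20% prevalence: {high['ppv']:.3f}")
print(f"PPV at 1% prevalence:  {low['ppv']:.3f}")
```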
What are the four main barriers to the development and use of diagnostics in LMICs?
1) Lack of investment and innovation
2) Limited access to diagnostic tests
3) Lack of regulatory control and quality standards for evaluation
4) Infrastructure and human resource capacity
What is a reference standard?
The best test we have available to estimate an individual’s disease status
What is the index test?
A new or improved test which is tested against the reference standard
What is economic evaluation in the context of test diagnostics?
“… the comparative analysis of alternative courses of action in terms of both their costs and consequences.”
What does correlation do?
Measures the strength of linear association between two continuous variables (exposure and outcome)
What are the four components of the Pearson correlation coefficient?
- True value in the population (⍴)
- Estimated in sample by r
- Can take values between -1 and 1
- It is only valid within the range of values in the sample
What is the r score if there is no correlation?
r=0
What is the r score of an imperfect positive correlation
0 < r < 1
What is the r score of a perfect positive correlation?
r=1
What is the r score of an imperfect negative correlation?
-1 < r < 0
What is the r score of a perfect negative correlation?
r= -1
What does r =-1 indicate?
A perfect negative linear relationship; as the value of one variable increases, the value of another decreases
What does r=1 indicate?
A perfect positive linear relationship. As the value of one variable increases, the value of the other increases
What does r=0 indicate?
There is no linear relationship between the 2 continuous variables
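The values of r described above can be reproduced with a small sketch of the Pearson correlation coefficient. The data are invented, and the formula used (sum of cross-products divided by the square root of the product of the sums of squares) is the standard one rather than something given on the cards.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
perfect_positive = pearson_r(x, [2, 4, 6, 8, 10])    # points lie exactly on a rising line
perfect_negative = pearson_r(x, [10, 8, 6, 4, 2])    # points lie exactly on a falling line
imperfect_positive = pearson_r(x, [2, 1, 4, 3, 5])   # rising trend with scatter

print(perfect_positive, perfect_negative, imperfect_positive)
```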
What are arbitrary labels for strength of positive correlation?
- 0 - 0.19 very weak
- 0.2 - 0.39 weak
- 0.4 - 0.59 moderate
- 0.6 - 0.79 strong
- 0.8 - 1.0 very strong
How would you word a hypothesis test for a correlation coefficient?
H0 : ⍴ = 0 (no linear relationship in the population)
H1 : ⍴ ≠ 0 (linear relationship exists in the population)
What is association NOT?
Causation!
What does correlation NOT imply?
Causation!
What does correlation measure?
Strength of linear association! (between 2 continuous variables - outcome and exposure)
When is correlation inappropriate?
For non-linear relationships, for data with more than one observation from each individual, and for data with a lot of outliers (which can have a powerful effect on the correlation coefficient, especially with a small sample)
What is simple linear regression?
- Simple linear regression describes the relationship between two continuous variables.
- It gives the equation of the straight line that best describes the linear association between them.
- It enables the prediction of one variable using information from another variable.
What is the dependent variable in simple linear regression?
The dependent variable is the variable to be predicted (i.e. the particular outcome of interest). It is denoted as Y
What is the independent variable in simple linear regression?
The independent variable or explanatory variable is the variable used for predicting the particular outcome. It is denoted as X
How is simple linear regression explained in terms of x and y?
Regression of Y on X
In simple linear regression, on which axis is the exposure variable (the independent variable) plotted?
The horizontal axis (x)
In simple linear regression, on which axis is the outcome variable (the dependent variable) plotted?
The vertical axis (y)
What does the linear regression give us?
The equation of the straight line that best describes the linear association between the outcome (y) and the exposure (x)
In the context of simple linear regression, how would you word the interpretation of a Ho (where the Ho is that there is no linear relationship) using the test statistic obtained?
There is evidence against the null hypothesis that there is no linear relationship in the population
In simple linear regression, how do you make the intercept meaningful?
By centering the exposure variable - which is when you subtract the mean so that the new exposure variable has a mean of 0
What is the equation for the regression line?
Y = B0 + B1X
What do the components of the equation of the regression line stand for?
B0 is the intercept (the value of Y when X = 0)
B1 is the slope of the line (the change in Y for every unit increase in X)
Y is the dependent variable (the variable of interest), and X is the independent variable
What are residuals in the context of linear regression?
The difference between the observed value and the predicted value (as calculated from the regression equation), i.e. the vertical distance between the data point and the best-fit line
Residual = Observed (Y) - Predicted (Y’)
The method of least squares chooses the line that minimizes the sum of squared residuals
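A minimal least-squares sketch, using the standard closed-form estimates for one exposure variable (slope = Sxy / Sxx, intercept = mean of Y minus slope × mean of X). The data are hypothetical, and the names b0 and b1 stand for the intercept and slope of the regression line.

```python
def fit_line(x, y):
    """Least-squares intercept (b0) and slope (b1) for simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

x = [1, 2, 3, 4, 5]                 # hypothetical exposure values
y = [2.1, 3.9, 6.2, 7.8, 10.0]      # hypothetical outcome values

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # observed minus predicted

print(f"Y = {b0:.2f} + {b1:.2f}X")
print("residuals:", [round(r, 2) for r in residuals])
```

With an intercept in the model the residuals sum to zero, which is a quick sanity check on the fit.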
What does examining residuals help you do in the context of simple linear regression?
To test the quality of the fit of the model (the best fit line)
In addition to residuals, what is another method you can use to test the quality of the fit of the model?
To look at the coefficient of determination (R squared). This is interpreted as the percentage of variance in the dependent variable (Y) that can be explained by the independent variable (X).
What does the R squared equal?
The regression sum of squares divided by the total sum of squares
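As an illustration, the sketch below fits a least-squares line to hypothetical data and computes R squared as the regression sum of squares over the total sum of squares. The closed-form slope and intercept are the standard ones for a single exposure variable.

```python
x = [1, 2, 3, 4, 5]                 # hypothetical exposure values
y = [2.1, 3.9, 6.2, 7.8, 10.0]      # hypothetical outcome values

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * xi for xi in x]

regression_ss = sum((pi - mean_y) ** 2 for pi in predicted)           # explained variation
residual_ss = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))     # unexplained variation
total_ss = sum((yi - mean_y) ** 2 for yi in y)                        # total variation

r_squared = regression_ss / total_ss
print(f"R squared = {r_squared:.4f}")
```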
What is an adjusted R square?
It takes into account the number of explanatory variables (Xs) and the sample size
What are the three assumptions underpinning linear regression?
- There should be a linear relationship between the dependent variable and the independent variable
- The residuals should be normally distributed
- The variance of the dependent variable (Y) values should be the same for all values of the independent variable (X)
How do you check the assumptions in simple linear regression?
o Linearity should be assessed prior to carrying out linear regression
o After the regression model has been fitted to the data it is essential to check that the assumptions of linear regression have not been violated
o If any of the assumptions have been violated then inference on the basis of the regression model is likely to be invalid
What is multiple linear regression?
- To examine the dependency of a numerical outcome variable on several exposure variables
- Independent variables can be continuous, binary, categorical or ordinal
- It can be used for prediction and adjustment for confounding
What is the equation of the multiple linear regression model?
Y = B0 + B1X1 + B2X2
The intercept B0 is the value of the outcome Y when both exposure variables X1 and X2 are zero.
What is FEV1 in the context of multiple linear regression?
It is the outcome variable, Y (FEV1 is the forced expiratory volume in one second, a continuous measure)
What kind of data is Y (the dependent variable, the one of interest) in linear regression?
Continuous
What kind of data is Y (the dependent variable, the one of interest) in logistic regression?
Binary
What kind of data is Y (the dependent variable, the one of interest) in poisson regression?
count/rate
What kind of data is Y (the dependent variable, the one of interest) in survival analysis?
time to event
How is logistic regression different to linear regression?
In linear regression, the outcome variable (Y’) is quantitative, but in logistic regression, it is qualitative
Summarize linear regression in terms of y, a, b and explain
Y’ = a + bX. The change in Y due to a 1 unit increase in X is b
Summarize logistic regression in terms of y, a, b and explain
Log odds = a + bX
The change in log odds due to a 1 unit increase in X is b
What does Logit transformation do?
Transforms the probability (p, or risk) to log odds
Log odds isn’t intuitive, so we…
“transform” back to odds using exponential function
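The logit transformation and its back-transformation can be sketched as follows. The coefficient value is hypothetical; exponentiating a logistic regression coefficient b gives the odds ratio for a one unit increase in X.

```python
import math

def logit(p):
    """Transform a probability (risk) to log odds."""
    return math.log(p / (1 - p))

def odds_from_log_odds(log_odds):
    """Back-transform log odds to odds with the exponential function."""
    return math.exp(log_odds)

def probability_from_log_odds(log_odds):
    """Back-transform log odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))

p = 0.2
log_odds = logit(p)
print(f"p = {p}, odds = {odds_from_log_odds(log_odds):.2f}, log odds = {log_odds:.3f}")

b = 0.7  # hypothetical logistic regression coefficient
odds_ratio = math.exp(b)
print(f"odds ratio per unit increase in X = {odds_ratio:.2f}")
```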
What kinds of studies are associated with logistic regression?
Case control (for confounding), and cohort studies
What kind of advanced analysis might be associated with RCTs?
linear regression
What is the definition of prevalence?
The frequency of an event of interest - for example a disease, condition, or characteristic - in a population
What is the definition of point prevalence?
The frequency of an event of interest - for example disease, condition, or characteristic - in a population at ONE POINT in time
What is the definition of period prevalence?
The frequency of an event of interest - for example disease, condition, or characteristic - at any point during a period of time in the recent past
What is the definition of incidence?
The measure of occurrence of new cases over time
For rare events, odds are….
…approximately equal to risks
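A quick numerical check of this statement, using odds = risk / (1 - risk) with a few illustrative risks:

```python
def odds(risk):
    """Convert a risk (probability) to odds."""
    return risk / (1 - risk)

# As the event becomes rarer, the odds get closer to the risk
for risk in (0.5, 0.1, 0.01, 0.001):
    print(f"risk = {risk:<6} odds = {odds(risk):.4f}")
```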
When do we use poisson regression?
For modelling count data and rates, where the effect measure of interest is a rate ratio
What is the kind of data in poisson regression?
count data!
What Is count data?
Data generated by a process that results in only non-negative integers
What are some examples of count data?
the number of particles found in a unit of space (eg number of malaria parasites in a blood smear), number of daily births in a ward, number of crimes on a block, number of radioactive particles from a particular source
What are three common attributes of count data?
They are typically skewed
They are discrete
They only take non-negative values
Why is the poisson distribution used for count data?
Because it is typically skewed, the normal distribution is usually not appropriate
What kind of distribution is the poisson distribution?
theoretical
When is the Poisson distribution appropriate?
When events occur:
- randomly
- independently
- at a constant underlying rate over time
How is the poisson distribution described?
By a single parameter: the rate, or mean number of occurrences of an event per unit time
What is the unique property of the poisson distribution?
the mean and the variance are equal!
What are some examples of count data which are NOT Poisson?
infectious diseases occurring in clusters
physical events, such as parasitic eggs, which tend to group together
What kinds of events is the poisson distribution suitable for modelling?
rare events
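What is the Poisson regression formula?
rate = number of events (r) / total person-time (T). Poisson regression models the log of this rate as a linear function of the exposure: log(rate) = a + bX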
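The rate formula and the resulting rate ratio can be sketched with made-up numbers. This is a hand calculation for two groups, not a fitted regression; with a single binary exposure, a Poisson regression coefficient corresponds to the log of a rate ratio like the one computed here.

```python
import math

# Hypothetical cohort data
exposed_events, exposed_person_years = 30, 1500.0
unexposed_events, unexposed_person_years = 20, 2000.0

# rate = number of events / total person-time
rate_exposed = exposed_events / exposed_person_years
rate_unexposed = unexposed_events / unexposed_person_years

rate_ratio = rate_exposed / rate_unexposed
log_rate_ratio = math.log(rate_ratio)  # the scale on which the model is fitted

print(f"rates per 1000 person-years: {1000 * rate_exposed:.1f} vs {1000 * rate_unexposed:.1f}")
print(f"rate ratio = {rate_ratio:.2f}, log rate ratio = {log_rate_ratio:.3f}")
```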
What are the two main assumptions of poisson distribution?
Events are independent (assessed based on knowledge of the study design and data collection process)
Equidispersion: mean = variance (can check the data)
For poisson regression, the parameter estimates are interpreted in the same fashion as which other regression?
logistic regression (the model is fit on a log-scale)
In the context of poisson regression, what is over-dispersion?
The variance is larger than the mean
In the context of poisson regression, what is under-dispersion?
The variance is smaller than the mean
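One simple way to look for over- or under-dispersion is to compare the sample variance of the counts with their mean. The sketch below uses hypothetical counts and a crude variance-to-mean ratio. It is a descriptive check only, not a formal test.

```python
import statistics

# Hypothetical daily event counts; the single large value inflates the variance
counts = [0, 1, 1, 2, 0, 3, 1, 0, 2, 10]

mean = statistics.mean(counts)
variance = statistics.variance(counts)  # sample variance (n - 1 denominator)
dispersion_ratio = variance / mean      # about 1 under equidispersion

if dispersion_ratio > 1:
    label = "over-dispersed (variance larger than mean)"
elif dispersion_ratio < 1:
    label = "under-dispersed (variance smaller than mean)"
else:
    label = "equidispersed (variance equal to mean)"

print(f"mean = {mean}, variance = {variance:.2f}, ratio = {dispersion_ratio:.2f}: {label}")
```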
When can the poisson distribution be used for modelling rates?
If the events occur:
- independently
- at a constant underlying rate
In the context of the Poisson distribution, what is normally a problem?
Over dispersion
What kind of outcome is linear regression used for?
continuous outcome (quantitative)
What kind of outcome is logistic regression used for?
binary outcome (qualitative)
What is poisson regression used for?
rates or events during an exposure period
How does survival analysis differ from poisson?
In Poisson regression, the data are assumed to have an underlying rate which is constant over time, but this may not always be a reasonable assumption. That is where survival analysis comes in.
What are the two measures of disease occurrence that allow the rate of occurrence to change over time?
- The hazard function, h(t): the instantaneous rate of the event occurring at time t
- The survivor function, S(t): the probability that an individual will survive (i.e. has not experienced the event of interest) up to and including time t
In the survivor function, what does the Y axis indicate?
The proportion (%) of individuals still alive, i.e. who have not yet experienced the event
In the survivor function, what does the X axis indicate?
time
In the context of survival analysis, what is censoring?
when a participant is censored, they did not experience the event during the study period, so the exact survival time is unknown
What is right censoring?
When an individual hasn’t had the event during the study, but could still go on past the study (eg those still alive at the end of the study). They could also be lost to follow up!
What is left censoring?
When an event happens before entry into the study
By what is survival data defined?
time when the event occurs, event indicator (an indicator of whether the event has occurred or not)
What do vertical tick marks indicate on a K-M curve?
Censoring
When does the curve drop on a K-M curve?
When there is an event
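The behaviour of the curve can be illustrated with a small sketch of the Kaplan-Meier (product-limit) estimator. At each event time the survival estimate is multiplied by (1 - events / number at risk). Censored participants leave the risk set without causing a drop. The follow-up data and the function name are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival estimate)] at each distinct event time.

    times  : follow-up time for each participant
    events : 1 if the event occurred at that time, 0 if the participant was censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        if n_events:  # the curve only drops when there is an event
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical follow-up data (months); 0 marks a censored observation
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.3f}")
```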
Why can’t we use a t-test on the mean time to event, or linear regression, to compare groups?
They ignore censoring!
What does a log rank test do?
Evaluates whether the difference between the K-M survival curves for 2 or more groups is statistically significant
What are the limitations of K-M curves?
They are mainly descriptive
Cannot control for all covariates - just subgroup analyses
Cannot accommodate time-dependent variables
What is Cox’s proportional hazards regression?
- a regression model for survival data (TIME TO EVENT DATA)
- It provides an estimate of the hazard ratio and its CI
- It simultaneously explores the effects of several variables on survival
Which other ratio is the hazard ratio interpreted like?
The risk ratio (relative risk)
What are the assumptions associated with Cox’s proportional hazard regression?
We assume that the ratio of the hazards remains constant (or proportional) over time, even if the underlying hazards change
This can be checked by plotting the log(-log(S(t))) transformed survivor estimate for each of the groups; the curves should be roughly parallel
What does the cox regression model assume?
That the hazards are proportional (the hazard ratio is constant over time) and that all censoring is independent of the outcome
What test does the K-M survival curve use?
the log rank test to compare survival between two groups
What is the outcome of survival analysis?
time to an event
What are the assumptions of K-M survival analysis?
1) Survival probabilities are the same for participants who joined the study late and those who joined early; factors that can affect survival are assumed not to change over the study period.
2) Events occur at the times specified.
3) Censoring is INDEPENDENT of the outcome of interest.
4) Censoring is similar in all groups
What needs to be performed on a K-M analysis to make any inferences?
The log-rank test
Why is presenting an adjusted OR score important?
It’s particularly useful for helping us understand how a predictor variable affects the odds of an event occurring, after adjusting for the effect of other predictor variables
What are the assumptions of a Cox regression?
The hazards are proportional
The hazard ratio is constant over time (the underlying hazard rate itself may change)
Any censoring must be independent of outcome
What kind of events is Poisson regression used for modelling?
Rare events
What are the assumptions of poisson regression?
there is a constant underlying rate which is fixed over time
The data of the response variable is count data
The mean and the variance are equal (v unique!)
The distribution of counts follows a poisson distribution
Observations are independent
What kind of data is Cox regression used for?
time-to-event
What is the full name of Cox’s regression?
proportional hazards regression.
What kind of data is used in Poisson regression?
Count data
What kind of response variable data is used in traditional linear regression?
continuous data
Regression is….
…..a statistical method that can be used to determine the relationship between one or more predictor variables and a response variable.
The response variable is….
….the dependent variable!
What are the assumptions of logistic regression?
- Response variable (dependent variable) is binary (categorical)
- Observations are independent
- There are no extreme outliers
- There is a Linear Relationship Between Explanatory Variables and the Logit of the Response Variable
- Sample size is sufficiently large
What are the assumptions of linear regression?
- Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.
- Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.
- Homoscedasticity: The residuals have constant variance at every level of x.
- Normality: The residuals of the model are normally distributed.
What are the limitations of cox’s regression?
- Possible to miss important variables
What are the limitations of logistic regression?
- The main limitation of logistic regression is the assumption of linearity between the logit (log odds) of the dependent variable and the independent variables. In the real world, this relationship is rarely exactly linear.
- If the number of observations is smaller than the number of features, logistic regression should not be used, otherwise it may overfit.
- Logistic regression can only be used to predict discrete outcomes, so the dependent variable is restricted to a discrete set of values. It cannot be used to predict a continuous outcome.
What are the limitations of linear regression?
The main limitation is the assumption of linearity between the dependent variable and the independent variables
Very sensitive to outliers
What are the limitations of poisson regression?
Heterogeneity in the data: there is more than one process generating the data. For example, the data might unknowingly be collected on more than one group of people
Overdispersion: the variance is larger than expected under the model's assumptions (that the mean and the variance are equal)
What are the limitations of k-m survival analysis?
1) We need to perform the log-rank test to make any kind of inference.
2) Kaplan-Meier results can easily be biased, because it is a univariate approach that cannot adjust for other variables.
3) Removing censored data changes the shape of the curve, which biases the fit.
4) Statistical tests and observations become misleading if continuous variables are dichotomized.
5) Dichotomizing means using a statistical measure such as the median to split a continuous variable into groups, which may lead to problems in the analysis.
What is the Poisson distribution?
a probability distribution that is used to model the probability that a certain number of events occur during a fixed time interval.
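The Poisson probabilities themselves are easy to sketch. For a rate λ per interval, P(k events) = e^(-λ) λ^k / k!. The rate below is hypothetical, and the sums are truncated at 50 terms, which is more than enough for such a small rate. The calculation also confirms numerically that the mean and variance are equal.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events in an interval with the given mean rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

rate = 2.0  # hypothetical mean number of events per interval
probabilities = [poisson_pmf(k, rate) for k in range(50)]

mean = sum(k * p for k, p in enumerate(probabilities))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probabilities))

for k in range(5):
    print(f"P({k} events) = {probabilities[k]:.4f}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```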
What does Cox regression do that K-M cannot?
Carry out a multivariable analysis (adjusting for several covariates at once)