Advanced Analysis and Hypothesis Tests Flashcards

Question

What is the chi-squared test based on?

Answer 1

The difference between the observed and expected frequencies

Answer 2

- Under the null hypothesis, the test statistic follows the chi-squared distribution.
- The value of the test statistic is then compared with the appropriate chi-squared distribution (first proposed by Pearson).
- The greater the differences between the observed and expected frequencies, the larger the chi-squared statistic and the stronger the evidence that the two variables are associated.

Answer 3

Expected freq. = (relevant row total × relevant column total)/ total sample size

Answer 4

The chi-squared value is obtained by calculating (observed - expected)^2 / expected for each of the four cells in the contingency table and then summing them.
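The two calculations above (expected frequencies, then the sum of (observed - expected)^2 / expected) can be sketched in a few lines. This is only an illustrative sketch: the function names and the example counts are made up, and the same code works for larger r x c tables.

```python
def expected_frequencies(table):
    """Expected count per cell = row total * column total / overall total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2 x 2 table: rows = exposure groups, columns = outcome yes/no
observed = [[20, 30], [30, 20]]
expected = expected_frequencies(observed)
statistic = chi_squared(observed)
print(expected, statistic)
```

For a 2 x 2 table the statistic would then be compared with the chi-squared distribution on 1 degree of freedom.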

Answer 5

Compare the chi-squared test statistic with the tabulated values of the chi-squared distribution corresponding to given two-tailed p-values for the appropriate degrees of freedom. The further the test statistic exceeds the tabulated value, the smaller the p-value and the stronger the evidence against the null hypothesis (you would reject it).

Answer 6

- When the number of events or the sample size is low, a continuity correction is usually made by subtracting 0.5 from the absolute difference |observed - expected| in each cell before squaring. This is referred to as Yates' continuity correction.
- It is intended for use with 'small' samples, i.e. total sample size < 40 or small expected numbers (cell frequency < 5).
- The correction reduces the value of chi-squared and prevents overestimation of statistical significance for small data sets.
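To see how the continuity correction lowers the statistic, the sketch below computes chi-squared for one small table with and without subtracting 0.5 from each |observed - expected|. The function name, the `yates` flag and the counts are illustrative assumptions. Flooring the corrected difference at zero is a common safeguard but is not something stated on the card.

```python
def chi_squared_2x2(table, yates=False):
    """Chi-squared for a 2 x 2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Subtract 0.5 from the absolute difference (floored at 0)
                diff = max(diff - 0.5, 0.0)
            total += diff ** 2 / expected
    return total


small_table = [[8, 2], [3, 7]]  # hypothetical counts, total n = 20
uncorrected = chi_squared_2x2(small_table)
corrected = chi_squared_2x2(small_table, yates=True)
print(round(uncorrected, 3), round(corrected, 3))
```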

Answer 7

- Fisher's exact test is needed to compare two proportions when the numbers in the 2 x 2 table are very small (i.e. an expected frequency of less than 5).
- For the chi-squared test to be valid, most cells should have an expected frequency of more than 5 and the total sample size should be approximately 40 or more.

Answer 8

Yes!
- Larger tables are called r x c tables, where r denotes the number of rows in the table and c the number of columns.
- The calculation for the expected frequencies then becomes: Expected number = (column total x row total) / overall total

Answer 9

- Appropriate for ordered categorical (ordinal) exposure variables (e.g. lifetime partners, age group, cholesterol levels).
- Not appropriate for variables in which there is no natural order, e.g. marital status, ethnic group, country of residence.
- The χ² test for trend is a more sensitive test that assesses whether there is an increasing (or decreasing) trend in the proportions over the exposure categories.

Answer 10

That they are independent

Answer 11

McNemar's test - this is appropriate for paired data, such as matched pairs in a case-control study, before-and-after measurements, or comparisons between 2 observers (e.g. 2 radiographers using x-rays to diagnose TB)
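As a rough illustration of why McNemar's test suits paired data, the sketch below uses only the discordant pairs (pairs where the two ratings disagree). It computes the usual uncorrected statistic (b - c)^2 / (b + c). The card does not give this formula, so it is a supplied assumption, and the counts are invented.

```python
def mcnemar_statistic(b, c):
    """Uncorrected McNemar statistic from the two discordant-pair counts."""
    if b + c == 0:
        raise ValueError("no discordant pairs, statistic is undefined")
    return (b - c) ** 2 / (b + c)


# Hypothetical example: two observers reading the same x-rays.
# b = observer 1 positive and observer 2 negative; c = the reverse.
statistic = mcnemar_statistic(15, 5)
print(statistic)
```

The result would be compared with the chi-squared distribution on 1 degree of freedom.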

Answer 12

weight, age, blood pressure, antibody levels

Answer 13

The shape of the frequency distribution - this indicates what summary measures should be used on the data

Answer 14

Histogram, scatter-plot, line plot, box plot

Answer 15

- For normally distributed data: mean and SD
- For non-normal data: median and interquartile range (25th - 75th percentile)

Answer 16

For the comparison of means

Answer 17

- Imagine you have developed a new drug that you believe is an improvement over an existing drug, so you opt for a one-tailed test. You therefore fail to test for the possibility that the new drug is less effective than the existing drug. The consequences in this example are extreme, but they illustrate a danger of inappropriate use of a one-tailed test.
- Imagine you have a new drug which is cheaper than the existing drug and, you believe, no less effective. You do not care if it is more effective. You only wish to show that it is not less effective. In this scenario, a one-tailed test would be appropriate (the consequences of not testing the effect in the other direction are negligible and ethical).

Answer 18

- A paired t-test is based on differences within each subject
- Each subject acts as their own control
- Measurements on the same subject are not independent
- Measurements on different subjects are independent
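Because the paired t-test works on within-subject differences, the statistic can be sketched as the mean difference divided by its standard error. The measurements below are invented, and the variable names are only illustrative.

```python
import math
import statistics

# Hypothetical before/after measurements on the same five subjects
before = [140, 152, 138, 145, 160]
after = [135, 150, 130, 146, 151]

# One difference per subject: each subject acts as their own control
diffs = [a - b for a, b in zip(after, before)]

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)  # sample SD of the differences
standard_error = sd_diff / math.sqrt(len(diffs))
t_stat = mean_diff / standard_error
degrees_of_freedom = len(diffs) - 1
print(mean_diff, round(t_stat, 3), degrees_of_freedom)
```

The statistic would then be compared with the t distribution on n - 1 degrees of freedom.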

Answer 19

- The sample means being compared should follow normal distributions. Fortunately, by the central limit theorem this will be approximately true if you have enough data.
- The data should either be sampled independently or be fully paired (for a paired test).
- In the original formulation of Student's t-test, the variances of the populations being compared should be equal. However, modern statistical software allows for unequal variances (in R, the default option for t.test is "var.equal=FALSE", which allows for unequal variances).

Answer 20

ANOVA (analysis of variance)

Answer 21

o One-way ANOVA is used to compare the mean of a numerical outcome variable in the groups defined by an exposure level with two or more categories. o It is called one-way as the exposure groups are classified by just one variable.

Answer 22

How close diagnostic test results are to each other

Answer 23

The proportion of people with the disease or condition that test positive

Answer 24

The proportion of people without the disease or condition that test negative

Answer 25

Proportion of people testing positive who have the condition. With A = true positives and B = false positives, it is calculated as A/(A+B)

Answer 26

Proportion of people testing negative who do not have the disease. With C = false negatives and D = true negatives, it is calculated as D/(C+D)

Answer 27

Sensitivity and specificity depend on the test itself, whereas NPV and PPV also depend on the prevalence of the condition or disease in the population
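The four measures, and the way PPV shifts with prevalence, can be sketched as follows. The 2 x 2 layout (a = true positives, b = false positives, c = false negatives, d = true negatives) is an assumption made for this example, as are the function names and all the counts.

```python
def diagnostic_measures(a, b, c, d):
    """Assumed layout: a = TP, b = FP, c = FN, d = TN."""
    return {
        "sensitivity": a / (a + c),  # diseased who test positive
        "specificity": d / (b + d),  # non-diseased who test negative
        "ppv": a / (a + b),          # test positives who are diseased
        "npv": d / (c + d),          # test negatives who are not diseased
    }


def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """PPV for a given prevalence, via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


measures = diagnostic_measures(90, 50, 10, 850)
ppv_rare = ppv_at_prevalence(0.9, 0.9, 0.01)    # 1% prevalence
ppv_common = ppv_at_prevalence(0.9, 0.9, 0.20)  # 20% prevalence
print(measures, round(ppv_rare, 3), round(ppv_common, 3))
```

With the same sensitivity and specificity, the PPV is much lower when the condition is rare.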

Answer 28

1) Lack of investment and innovation 2) Limited access to diagnostic tests 3) Lack of regulatory control and quality standards for evaluation 4) Infrastructure and human resource capacity

Answer 29

The best test we have available to estimate an individual's disease status

Answer 30

A new or improved test which is tested against the reference standard

Answer 31

“... the comparative analysis of alternative courses of action in terms of both their costs and consequences.”

Answer 32

Measures the strength of linear association between two continuous variables (exposure and outcome)

Answer 33

- True value in the population (⍴) - Estimated in sample by r - Can take values between -1 and 1 - It is only valid within the range of values in the sample
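A minimal sketch of how the sample estimate r could be computed from paired observations, assuming the standard Pearson formula (covariation of x and y divided by the square root of the product of their sums of squares). The function name and the data are illustrative.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4]
r_perfect_positive = pearson_r(x, [2, 4, 6, 8])
r_perfect_negative = pearson_r(x, [8, 6, 4, 2])
print(r_perfect_positive, r_perfect_negative)
```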

Answer 34

A perfect negative linear relationship; as the value of one variable increases, the value of another decreases

Answer 35

A perfect positive linear relationship. As the value of one variable increases the value of the other increases

Answer 36

There is no linear relationship between the 2 continuous variables

Answer 37

0 - 0.19: very weak
0.2 - 0.39: weak
0.4 - 0.59: moderate
0.6 - 0.79: strong
0.8 - 1.0: very strong

Answer 38

H0: ⍴ = 0 (no linear relationship in the population)
H1: ⍴ ≠ 0 (a linear relationship exists in the population)

Answer 39

Causation!

Answer 40

Causation!

Answer 41

Strength of linear association! (between 2 continuous variables - outcome and exposure)

Answer 42

For non-linear relationships, for more than one observation from each individual, and for data with a lot of outliers (outliers can have a powerful effect on the correlation coefficient, especially with a small sample)

Answer 43

o Simple linear regression describes the relationship between two continuous variables. o Simple linear regression gives the equation of the straight line that best describes the linear association between two continuous variables. o It enables the prediction of one variable using information from another variable.

Answer 44

The dependent variable is the variable to be predicted (i.e. the particular outcome of interest). It is denoted as Y

Answer 45

The independent variable or explanatory variable is the variable used for predicting the particular outcome. It is denoted as X

Answer 46

Regression of Y on X

Answer 47

The horizontal axis (x)

Answer 48

The vertical axis (y)

Answer 49

The equation of the straight line that best describes the linear association between the outcome (y) and the exposure (x)

Answer 50

There is evidence against the null hypothesis that there is no linear relationship in the population

Answer 51

By centering the exposure variable - which is when you subtract the mean so that the new exposure variable has a mean of 0

Answer 52

B0 is the intercept (the value of Y when X = 0). B1 is the slope of the line (the increase in Y for every unit increase in X). Y is the dependent variable (the variable of interest), and X is the independent variable.

Answer 53

The difference between the observed value and the predicted value (as calculated from the regression equation), i.e. the vertical distance between the data point and the best-fit line. Residual = Observed (Y) - Predicted (Y'). The method of least squares minimizes the sum of squared residuals.
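The least squares idea can be sketched for simple linear regression: estimate the slope and intercept, then compute each residual as observed minus predicted. The closed-form estimates used here are the standard ones, and the function name and data are invented for illustration.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx             # slope
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)

predicted = [b0 + b1 * xi for xi in x]
# Residual = observed Y - predicted Y'
residuals = [yi - pi for yi, pi in zip(y, predicted)]
sum_squared_residuals = sum(r ** 2 for r in residuals)
print(b0, b1, residuals, sum_squared_residuals)
```

With an intercept in the model, the residuals from a least squares fit sum to zero (up to rounding).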

Answer 54

To test the quality of the fit of the model (the best fit line)

Answer 55

To look at the coefficient of determination (the R squared). This is interpreted as the percentage of variance in the dependent variable (Y) that can be explained by the independent variable (X).

Answer 56

The regression sum of squares divided by the total sum of squares
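As a small worked sketch of this ratio, the code below fits a least squares line and divides the regression sum of squares by the total sum of squares. The data and all names are illustrative assumptions.

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept
b1 = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
    (a - mean_x) ** 2 for a in x
)
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * a for a in x]

# Total SS: variation of observed Y around its mean
total_ss = sum((b - mean_y) ** 2 for b in y)
# Regression SS: variation of predicted Y around the mean of Y
regression_ss = sum((p - mean_y) ** 2 for p in predicted)

r_squared = regression_ss / total_ss
print(round(r_squared, 3))
```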

Answer 57

It takes into account the number of explanatory variables (Xs) and the sample size

Answer 58

- There should be a linear relationship between the dependent variable and the independent variable - The residuals should be normally distributed - The variance of the dependent variable (Y) values should be the same for all values of the independent variable (X)

Answer 59

o Linearity should be assessed prior to carrying out linear regression o After the regression model has been fitted to the data it is essential to check that the assumptions of linear regression have not been violated o If any of the assumptions have been violated then inference on the basis of the regression model is likely to be invalid

Answer 60

- To examine the dependency of a numerical outcome variable on several exposure variables - Independent variables can be continuous, binary, categorical or ordinal - It can be used for prediction and adjustment for confounding

Answer 61

Y = B0 + B1X1 + B2X2. The intercept B0 is the value of the outcome Y when both exposure variables X1 and X2 are zero.

Answer 62

It is the value of the outcome variable

Answer 63

Continuous

Answer 64

count/rate

Answer 65

time to event

Answer 66

In linear regression, the outcome variable (Y') is quantitative, but in logistic regression, it is qualitative

Answer 67

Y' = a + bX. Change in Y due to a 1 unit increase in X = b

Answer 68

Log odds = a + bX. Change in log odds due to a one unit increase in X = b

Answer 69

Transforms the probability (p, or risk) to log odds
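A short sketch of the transformation, assuming the usual logit definition log(p / (1 - p)). It also shows the reverse step: exponentiating a coefficient on the log odds scale gives an odds ratio. The function names and the coefficient value are hypothetical.

```python
import math


def logit(p):
    """Probability -> log odds."""
    return math.log(p / (1 - p))


def inverse_logit(log_odds):
    """Log odds -> probability."""
    return 1 / (1 + math.exp(-log_odds))


p = 0.2
log_odds = logit(p)           # log(0.2 / 0.8)
odds = math.exp(log_odds)     # back to odds: 0.25
p_back = inverse_logit(log_odds)

# Hypothetical logistic regression coefficient b on the log odds scale
b = math.log(2)
odds_ratio = math.exp(b)      # odds ratio per one unit increase in X
print(round(odds, 3), round(p_back, 3), round(odds_ratio, 3))
```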

Answer 70

"transform" back to odds using exponential function

Answer 71

Case control (for confounding), and cohort studies

Answer 72

linear regression

Answer 73

The frequency of an event of interest - for example a disease, condition, or characteristic - in a population

Answer 74

The frequency of an event of interest - for example disease, condition, or characteristic - in a population at ONE POINT in time

Answer 75

The frequency of an event of interest - for example disease, condition, or characteristic - at any point during a period of time in the recent past

Answer 76

The measure of occurrence of new cases over time

Answer 77

...approximately equal to risks

Answer 78

For modelling data where a rate ratio is the outcome, and for count data

Answer 79

count data!

Answer 80

Data generated by a process that results in only non-negative integers

Answer 81

the number of particles found in a unit of space (eg number of malaria parasites in a blood smear), number of daily births in a ward, number of crimes on a block, number of radioactive particles from a particular source

Answer 82

- They are typically skewed
- They are discrete
- They only take non-negative values

Answer 83

Because it is typically skewed, the normal distribution is usually not appropriate

Answer 84

theoretical

Answer 85

- randomly
- independently
- at a constant underlying rate over time

Answer 86

the rate, or mean number of occurrences, of an event per unit time

Answer 87

the mean and the variance are equal!
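This property can be checked numerically from the Poisson probability mass function, P(K = k) = exp(-λ) λ^k / k!. The sketch below is an illustration with an arbitrary rate of 3. It truncates the sum at k = 59, beyond which the probabilities are negligible for this rate.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 3.0
ks = range(60)  # truncation point chosen for illustration

mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(round(mean, 6), round(variance, 6))
```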

Answer 88

- infectious diseases occurring in clusters
- physical events, such as parasitic eggs, which tend to group together

Answer 89

rare events

Answer 90

rate = number of events (r)/ total person-time (T)

Answer 91

- Events are independent (assessed based on knowledge of the study design and data collection process)
- Equidispersion: mean = variance (can be checked in the data)
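One simple way to check equidispersion in the data is to compare the sample variance of the counts with their sample mean. The sketch below uses invented counts. The "dispersion ratio" name is just a label for this example, and a ratio well above 1 would suggest overdispersion.

```python
import statistics

# Hypothetical daily event counts
counts = [2, 3, 1, 4, 2, 0, 3, 1]

mean_count = statistics.mean(counts)
var_count = statistics.variance(counts)  # sample variance (n - 1 denominator)

# Ratio near 1 is consistent with equidispersion (mean = variance)
dispersion_ratio = var_count / mean_count
print(mean_count, round(var_count, 3), round(dispersion_ratio, 3))
```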

Answer 92

logistic regression (the model is fit on a log-scale)

Answer 93

The variance is larger than the mean

Answer 94

The variance is smaller than the mean

Answer 95

If the events occur:
- independently
- at a constant underlying rate

Answer 96

Over dispersion

Answer 97

continuous outcome (quantitative)

Answer 98

binary outcome (qualitative)

Answer 99

rates or events during an exposure period

Answer 100

In Poisson regression, the data are assumed to have an underlying rate that is constant over time, but this may not always be a reasonable assumption. That is where survival analysis comes in.

Answer 101

- The hazard function, h(t): the instantaneous rate of the event occurring at time t, given survival up to time t
- The survivor function, S(t): the probability that an individual will survive (i.e. has not experienced the event of interest) up to and including time t

Answer 102

when a participant is censored, they did not experience the event during the study period, so the exact survival time is unknown

Answer 103

When an individual hasn't had the event during the study, but could still go on past the study (eg those still alive at the end of the study). They could also be lost to follow up!

Answer 104

When an event happens before entry into the study

Answer 105

time when the event occurs, event indicator (an indicator of whether the event has occurred or not)
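With just these two pieces of information per participant, a Kaplan-Meier (product-limit) survival estimate can be sketched. At each event time the running survival probability is multiplied by (1 - events / number at risk). The function name and data are illustrative. The code assumes the common convention that a participant censored at the same time as an event is still counted as at risk at that time.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with at least one event.

    times  : follow-up time for each participant
    events : 1 if the event occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Events and censored participants both leave the risk set
        n_at_risk -= n_leaving
    return curve


times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]  # 0 = censored
curve = kaplan_meier(times, events)
print(curve)
```

Censored participants never reduce the survival estimate directly. They only shrink the number at risk for later event times.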

Answer 106

When there is an event

Answer 107

It ignores censoring!

Answer 108

Evaluates whether the K-M survival curves for 2 or more groups differ statistically significantly

Answer 109

- They are mainly descriptive
- Cannot control for all covariates, only subgroup analyses
- Cannot accommodate time-dependent variables

Answer 110

- A regression model for survival data (TIME TO EVENT DATA)
- It provides an estimate of the hazard ratio and its CI
- It simultaneously explores the effects of several variables on survival

Answer 111

The risk ratio (relative risk)

Answer 112

- We assume that the ratio of the hazards remains constant (or proportional) over time, even if the underlying hazards change
- This can also be checked by plotting the log(-log()) transformed survivor estimate for each of the groups

Answer 113

That the hazards are proportional (the hazard ratio is constant over time) and that all censoring is independent of the outcome

Answer 114

the log rank test to compare survival between two groups

Answer 115

time to an event

Answer 116

1) Survival probabilities are the same for participants who joined the study late and those who joined early; factors that can affect survival are assumed not to change over the study.
2) Events occur at the times specified.
3) Censoring is INDEPENDENT of the outcome of interest.
4) Censoring is similar in all groups.

Answer 117

The log-rank test

Answer 118

It’s particularly useful for helping us understand how a predictor variable affects the odds of an event occurring, after adjusting for the effect of other predictor variables

Answer 119

- The hazards are proportional, i.e. the hazard ratio is constant over time
- Any censoring must be independent of outcome

Answer 120

Rare events

Answer 121

- There is a constant underlying rate which is fixed over time
- The response variable is count data
- The mean and the variance are equal (v unique!)
- The distribution of counts follows a Poisson distribution
- Observations are independent

Answer 122

time-to-event

Answer 123

proportional hazards regression.

Answer 124

Count data

Answer 125

continuous data

Answer 126

.....a statistical method that can be used to determine the relationship between one or more predictor variables and a response variable.

Answer 127

....the dependent variable!

Answer 128

1. Response variable (dependent variable) is binary (categorical) 2. Observations are independent 3. There are no extreme outliers 4. There is a Linear Relationship Between Explanatory Variables and the Logit of the Response Variable 5. Sample size is sufficiently large

Answer 129

1. Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y. 2. Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data. 3. Homoscedasticity: The residuals have constant variance at every level of x. 4. Normality: The residuals of the model are normally distributed.

Answer 130

- Possible to miss important variables

Answer 131

1. The main limitation of logistic regression is the assumption of linearity between the log odds (logit) of the dependent variable and the independent variables. In the real world, data are rarely linearly separable.
2. If the number of observations is smaller than the number of features, logistic regression should not be used, as it is likely to overfit.
3. Logistic regression can only predict discrete outcomes, so the dependent variable is restricted to a discrete set of values and continuous outcomes cannot be modelled.

Answer 132

- The main limitation is the assumption of linearity between the dependent variable and the independent variables
- Very sensitive to outliers

Answer 133

- Heterogeneity in the data: there is more than one process generating the data. For example, the data might unknowingly be collected on more than one group of people.
- Overdispersion: the observed variance is larger than the model assumes (the model assumes the mean and the variance are equal).

Answer 134

1) A log-rank test is needed to make any formal inference when comparing curves.
2) Kaplan-Meier is a univariate approach, so its results can easily be biased by factors it does not adjust for.
3) Removing censored observations changes the shape of the curve and biases the estimates.
4) Continuous variables have to be dichotomized to form groups (for example, by splitting at the median), which can make statistical tests and conclusions misleading.

Answer 135

a probability distribution that is used to model the probability that a certain number of events occur during a fixed time interval.

Answer 136

Create a multivariate analysis

(172 cards)