706 exam Flashcards

Question 1

Q

What is the precision method of determining a sample size?

Answer

A

Choosing a sample size that meets a requirement on the precision of estimates (as measured by the width of the confidence interval).

Question 2

Q

What is the equation for SE?

Question 3

Q

How do you calculate the sample mean of a binary variable?

Answer

A

The sample mean of a binary variable Y estimates the probability that Y=1. For binary variables the sample mean is a proportion and the proportion estimates the probability.

Question 4

Q

If you code ethnicity as 1, 2 and 3, can you then include this coded variable as a predictor in a regression model?

Answer

A

No. Doing this forces a structure on the model that is unlikely to be true: it implies that the 1 v 3 effect is twice the 1 v 2 effect. Use dummy (indicator) variables for the categories instead.
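
A minimal sketch of indicator (dummy) coding in Python, to show what replaces the single 1/2/3 column. The function name `dummy_code`, the example data and the choice of category 1 as the reference group are illustrative assumptions, not part of the course material.

```python
def dummy_code(values, reference=1):
    """Turn a coded categorical variable into 0/1 indicator columns.

    One column is created per category other than the reference, so each
    category gets its own effect relative to the reference group.
    """
    levels = sorted(set(values) - {reference})
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


ethnicity = [1, 2, 3, 2, 1, 3]  # hypothetical coded data
dummies = dummy_code(ethnicity)
print(dummies)
```

Each indicator then gets its own coefficient, so the 1 v 3 effect is no longer forced to be twice the 1 v 2 effect.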

Question 5

Q

What is logistic regression and what is the statistical framework?

Answer

A

The outcome in logistic regression is binary, and the data are counts of occurrence. The binomial model provides the statistical framework.

Question 6

Q

What are the problems associated with missing data?

Answer

A

Loss of statistical power
Distortion of analyses
Bias

Question 7

Q

How do you calculate a confidence interval?

Answer

A

q +/- 1.96 × SE(q), which gives a 95% confidence interval for the estimate q.
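
As a small worked sketch of this formula in Python (the function name and the numbers are made up for illustration):

```python
def confidence_interval_95(estimate, se):
    """Return (lower, upper) for estimate +/- 1.96 * SE."""
    margin = 1.96 * se
    return estimate - margin, estimate + margin


# Hypothetical estimate of 10.0 with a standard error of 2.0
lower, upper = confidence_interval_95(10.0, 2.0)
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```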

Question 8

Q

What is a t-statistic?

Answer

A

The difference between the estimate and its null value, expressed in units of standard error. The greater the magnitude of t, the greater the evidence against the null hypothesis.

Question 9

Q

How do you calculate relative risk?

Answer

A

Probability of an event occurring in group A divided by the probability in group B.
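
A short Python sketch of that division, using invented counts (30 events out of 100 in group A, 10 out of 100 in group B):

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Risk in group A divided by risk in group B."""
    risk_a = events_a / total_a
    risk_b = events_b / total_b
    return risk_a / risk_b


rr = relative_risk(30, 100, 10, 100)  # hypothetical counts
print(round(rr, 2))
```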

Question 10

Q

What is a chi-square test?

Answer

A

The chi-square test is used to assess whether two categorical variables are unrelated to each other. The “chi-square statistic” χ² is a measure of the discrepancy between expected and observed cell values. If χ² is large, it indicates a big discrepancy between what we observed and what we would have expected under the hypothesis of independence.
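
A sketch of how the discrepancy can be computed, assuming the standard Pearson form: the sum over cells of (observed − expected)² / expected, with expected = row total × column total / grand total. The function name and the 2 × 2 table are illustrative. The degrees of freedom use the (r − 1)(c − 1) rule from the later card.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


observed_counts = [[20, 30], [30, 20]]  # hypothetical 2 x 2 table
stat, df = chi_square(observed_counts)
print(stat, df)
```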

Question 11

Q

How do you calculate the odds ratio from a coefficient in a regression analysis?

Question 12

Q

How do you get the probability of two things happening simultaneously?

Question 13

Q

How does p relate to x in a logistic model?

Answer

A

p is always between 0 and 1

Question 14

Q

What is the central limit theorem?

Answer

A

The sampling distribution of the sample mean tends to a normal distribution as n gets large, regardless of the underlying distribution.
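
A small simulation sketch of the idea, assuming a skewed underlying distribution (exponential with mean 1 and SD 1). The function name, the seed and the sample sizes are arbitrary choices for illustration.

```python
import random
import statistics


def sample_means(n, repeats=2000, seed=0):
    """Draw `repeats` samples of size n from an exponential(1) population
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(repeats)
    ]


means_small = sample_means(5)
means_large = sample_means(50)

# The means centre on the population mean (1.0) and their spread
# shrinks roughly like 1 / sqrt(n) as n grows.
print(statistics.fmean(means_large), statistics.stdev(means_large))
```

Plotting a histogram of `means_large` would show a roughly bell-shaped curve even though the individual observations are strongly skewed.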

Question 15

Q

What does an MSE of 17 tell you for a model?

Answer

A

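If 17 is the root MSE (the residual standard deviation), then for a given combination of factors about 95% of the actual values will lie within roughly +/- 34 units (2 × 17) of the mean value. If 17 is the MSE itself, the residual standard deviation is √17 ≈ 4.1 and the range is roughly +/- 8 units.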

Question 16

Q

What are the key assumptions of ordinary regression?

Answer

A

The n observations are independent of each other
The effects add together
The residuals are normally distributed with constant variance. You can check normality with a Q-Q plot

Question 17

Q

What judgements do you make when doing a regression model?

Answer

A

Modelling requires judgements about how to include variables: should continuous variables be categorized, should dummy variables be used for ordinal scales, which variables should be included in the model, should those that are not statistically significant be dropped, and should interaction terms be included.

Question 18

Q

How do you increase the power of a study?

Answer

A

Sample size
The size of the effect
Significance level
The endpoint being studied
The statistical test being used, i.e. generally, if the assumptions of a parametric statistical test hold, the parametric test will be more powerful than a non-parametric one. A parametric test is a test about the parameters of the normal distribution, so it is based on the assumption that the underlying distribution is normal.
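
To see how the first three factors move power, here is a sketch for one specific setting: a two-sided z-test comparing the means of two equal-sized groups with a known common SD. That setting, the function name and the numbers are assumptions for illustration; other designs need other formulas.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means."""
    z = NormalDist()
    se = sd * sqrt(2 / n_per_group)      # SE of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)    # e.g. about 1.96 for alpha = 0.05
    shift = abs(effect) / se             # true effect in SE units
    # Probability that the test statistic lands in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for n in (16, 32, 64):
    print(n, round(power_two_sample(effect=5, sd=10, n_per_group=n), 3))
```

Power rises with a larger sample, a larger effect, or a less strict significance level.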

Question 19

Q

What is a parametric test?

Answer

A

A parametric test is a test about the parameters of an assumed distribution, usually the normal distribution, so it relies on the assumption that the underlying distribution is normal.

Question 20

Q

What is the Hosmer-Lemeshow test?

Answer

A

A goodness-of-fit test for logistic regression. It is calculated by comparing predicted and actual counts within groups of observations.

Question 21

Q

You calculate a high p-value of 0.66 for your Hosmer-Lemeshow test. What does this tell you?

Answer

A

There is no evidence of lack of fit, so the regression model is taken to be a good fit.

Question 22

Q

Is the null hypothesis defined in terms of population or sample quantities?

Answer

A

Population quantities.

Question 23

Q

What is a z score and how do you calculate one?

Answer

A

A z-score expresses a value as the number of standard deviations it lies from its mean. As a result, z-scores have a distribution with a mean of 0 and a standard deviation of 1.

z = (estimate − null value) / SE
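
A brief Python sketch of the calculation. The two-sided p-value step is an addition that assumes the statistic is referred to the standard normal distribution; the numbers are invented.

```python
from statistics import NormalDist


def z_score(estimate, null_value, se):
    """Distance of the estimate from the null value in SE units."""
    return (estimate - null_value) / se


def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


z = z_score(estimate=5.0, null_value=0.0, se=2.0)  # hypothetical values
p = two_sided_p(z)
print(z, round(p, 4))
```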

Question 24

Q

Is the RR a good approximation for OR?

Answer

A

Only if the disease (outcome) is rare.
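
A quick numerical sketch of why rarity matters, comparing RR and OR for a rare and a common outcome. The probabilities are made up for illustration.

```python
def risk_and_odds_ratio(p_exposed, p_unexposed):
    """Return (relative risk, odds ratio) for two event probabilities."""
    rr = p_exposed / p_unexposed
    odds_exposed = p_exposed / (1 - p_exposed)
    odds_unexposed = p_unexposed / (1 - p_unexposed)
    return rr, odds_exposed / odds_unexposed


rare = risk_and_odds_ratio(0.02, 0.01)    # rare outcome
common = risk_and_odds_ratio(0.40, 0.20)  # common outcome
print("rare:   RR=%.2f OR=%.2f" % rare)
print("common: RR=%.2f OR=%.2f" % common)
```

When the outcome is rare, 1 − p is close to 1 in both groups, so odds and risks nearly coincide; for a common outcome the OR is further from 1 than the RR.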

Question 25

Q

What is the formula to calculate a chi-square statistic?

Question 26

Q

What is the distinction between the precision and power methods for sample size determination?

Answer

A

The precision method fixes the sample size so as to achieve a certain degree of precision of an estimate, with precision measured by the confidence interval. It focuses on the precision of estimation of an effect.

Power, on the other hand, focuses on hypothesis testing, and trying to avoid saying “no statistically significant difference” when in fact there is a difference. Power is the probability of saying an effect exists when it actually does exist.
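
As an illustration of the precision approach, the sketch below assumes the goal is a 95% CI for a mean with a chosen half-width, and that a value for the SD is available in advance. Under those assumptions n = (1.96 × SD / half-width)², rounded up. The function name and numbers are hypothetical.

```python
from math import ceil


def n_for_precision(sd, half_width, z=1.96):
    """Smallest n for which z * sd / sqrt(n) is at most half_width."""
    return ceil((z * sd / half_width) ** 2)


# Assumed SD of 10; we want the 95% CI to be the estimate +/- 2
n = n_for_precision(sd=10, half_width=2)
print(n)
```

Halving the half-width roughly quadruples the required sample size.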

Question 27

Q

Why is a regression model an artificial construct?

Answer

A

A regression model is a mathematical formula relating an outcome variable Y to a set of other variables X1, X2, X3, … The formula is a human construct that is almost certainly a simplification of reality.

Question 28

Q

What are frequentist statistics?

Answer

A

Frequentist statistics regards probabilities as “long run relative frequencies”. Its tools (such as p-values and CIs) are based on this notion of probability as a measure of frequency of occurrence.

Question 29

Q

What is standard error?

Answer

A

SE is a measure of sampling variability. It is an intrinsic feature of any statistic that is calculated in repeatedly drawn samples. In itself it is not an “error”; there is nothing wrong about it. Its use as an “error” arises when SE is used in CI calculation, if a CI is considered as a measure of the degree of likely error in estimation.

Question 30

Q

What is a confidence interval?

Answer

A

An estimate of the interval μ ± 1.96σ/√n, within which there is a 95% chance that the sample mean ȳ will lie. But as the CI is an estimate of this interval, we do not know whether the 95% probability is correct.

However, if we repeated the study over and over again, calculating a 95% confidence interval each time, we would expect that about 95 of 100 such intervals would cover the true mean μ.
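
The repeated-study interpretation can be checked with a small simulation. This sketch assumes a normal population with a known true mean, and builds each interval as sample mean +/- 1.96 × (sample SD / √n). All names and parameter values are illustrative.

```python
import random
from math import sqrt
from statistics import fmean, stdev


def coverage(true_mean=50.0, sd=10.0, n=30, repeats=2000, seed=1):
    """Fraction of simulated 95% CIs that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        centre = fmean(sample)
        se = stdev(sample) / sqrt(n)
        if centre - 1.96 * se <= true_mean <= centre + 1.96 * se:
            hits += 1
    return hits / repeats


print(coverage())
```

The printed fraction should be close to 0.95. It tends to fall slightly below that for small n, because 1.96 is used with an estimated SD.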

Question 31

Q

What is the _cons value in a regression model?

Answer

A

The _cons term is the “intercept” of the model. It gives the mean value of your y variable when all the other variables are zero. It is often unhelpful. The p-value associated with it is a test that the intercept is zero, which is usually not a sensible question to ask.

Question 32

Q

How do you add an interaction in a regression model and why would you want to do this?

Answer

A

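By default the model is additive. To include an interaction you add the product x1 × x2 to the model as an extra term. You do this if you think the effects interact in some way, i.e. the effect of one variable depends on the value of the other.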
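
A sketch of what the product term does, using made-up coefficients rather than a fitted model. The names `predict`, `effect_of_x1` and the coefficient labels are hypothetical.

```python
def predict(x1, x2, b0, b1, b2, b12=0.0):
    """Linear predictor with an optional interaction term b12 * x1 * x2."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2


def effect_of_x1(x2, **coef):
    """Change in the prediction when x1 goes from 0 to 1, at a given x2."""
    return predict(1, x2, **coef) - predict(0, x2, **coef)


additive = dict(b0=1.0, b1=2.0, b2=3.0)              # no interaction
interacting = dict(b0=1.0, b1=2.0, b2=3.0, b12=1.5)  # with interaction

print(effect_of_x1(0, **additive), effect_of_x1(1, **additive))
print(effect_of_x1(0, **interacting), effect_of_x1(1, **interacting))
```

In the additive model the effect of x1 is the same at every value of x2; with the product term it changes with x2.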

Question 33

Q

What is the point of making a model?

Answer

A

Models provide a framework to estimate simultaneously the effects of any number of variables on an outcome.

Question 34

Q

Why is a confidence interval associated with an estimate q often of the form q +/- 1.96 SE(q)?

Answer

A

Because of the central limit theorem it is often safe to assume that the sampling distribution of estimates is approximately normal, and that the approximation improves with larger n. Because of this, the distribution of q is centred around the true value with standard deviation equal to the standard error SE(q). This makes q +/- 1.96 SE(q) an estimate of the interval containing 95% of the distribution, where 1.96 is the critical value of the z distribution that leaves 2.5% probability in each tail.

Question 35

Q

What is the difference between the standard deviation and SEM?

Answer

A

The standard deviation (SD) measures the amount of variability, or dispersion, of the individual data values around the mean, while the standard error of the mean (SEM) measures how far the sample mean of the data is likely to be from the true population mean.
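
A short sketch contrasting the two, using the usual relation SEM = SD / √n and a small invented data set:

```python
from math import sqrt
from statistics import stdev


def sd_and_sem(values):
    """Return the sample SD and the standard error of the mean."""
    sd = stdev(values)               # spread of individual values
    sem = sd / sqrt(len(values))     # uncertainty in the sample mean
    return sd, sem


data = [4, 8, 6, 5, 3, 7, 9, 6]  # hypothetical measurements
sd, sem = sd_and_sem(data)
print(round(sd, 3), round(sem, 3))
```

The SD stays roughly the same as more data are collected, whereas the SEM shrinks as n grows.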

Question 36

Q

What is a p value?

Answer

A

It is the probability of obtaining data at least as extreme as those observed, if the null hypothesis is true.

Question 37

Q

When do you multiply probabilities?

Answer

A

Only for independent events: P(A and B) = P(A) × P(B).

For example, diastolic BP and systolic BP in the same person are not independent, so their probabilities cannot simply be multiplied.

Question 38

Q

When can you add probabilities?

Answer

A

If the two events are mutually exclusive.

If A and B can occur together then: P(A or B) = P(A) + P(B) - P(A and B)
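
A tiny check of the general addition rule, using one roll of a fair die as a made-up example (A = even number, B = greater than 3):

```python
from fractions import Fraction

outcomes = range(1, 7)  # one roll of a fair die


def prob(event):
    """Probability of an event as the fraction of equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def is_even(o):
    return o % 2 == 0


def above_three(o):
    return o > 3


p_or_direct = prob(lambda o: is_even(o) or above_three(o))
p_or_rule = (
    prob(is_even)
    + prob(above_three)
    - prob(lambda o: is_even(o) and above_three(o))
)
print(p_or_direct, p_or_rule)
```

Simply adding P(A) and P(B) here would give 1, which double-counts the outcomes 4 and 6.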

Question 39

Q

What does logistic regression do?

Answer

A

Models the probability of your binary variable in terms of the other variables listed.

Question 40

Q

Why would you use the post-estimation command test after a regression model?

Answer

A

The post-estimation command test can be used to test whether non-significant or questionably significant variables can be removed from the model. This is presumably to build a model that contains only significant predictor variables.

Question 41

Q

Is there a way that you can know the discrepancy between the sample mean and the population mean?

Answer

A

No. The population mean is unknown, so the actual discrepancy cannot be known. The SEM is only a measure of sampling variability.

Question 42

Q

How do you get the standard error on a proportion?

Question 43

Q

How do you calculate degrees of freedom for a chi-square test?

Answer

A

The table has (r − 1)(c − 1) degrees of freedom, where r is the number of rows and c is the number of columns.

Question 44

Q

What are the assumptions associated with the ordinary multiple linear regression model?

Answer

A

First, that the model, with its linear structure, is a true representation of the mean value of Y, and that the direction of effect is that Y depends on X.
The tests of significance are based on the assumption that the residuals are normally distributed about the line (plane).
The variance of the residuals does not depend on the combination of X variables; this is called homoskedasticity (the word and its correct spelling are not expected in an answer).
The observations in the data are also assumed to be “independent”, that is, they constitute a random sample of people (if “person” is the basic unit). This may be violated by repeated measurements on the same person, or if individuals are somehow related.

Question 45

Q

What are predictor variables?

Answer

A

X variables that predict Y

Question 46

Q

How do you calculate the probability from an odds ratio in regression analysis?

Answer

A

p = e^x / (1 + e^x), where x is the linear predictor (the log odds) from the model.
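
A minimal Python sketch of this conversion, where x is taken to be the log odds given by the fitted model (the function name is illustrative):

```python
from math import exp


def inverse_logit(x):
    """Convert log odds x to a probability: e^x / (1 + e^x)."""
    return exp(x) / (1 + exp(x))


for log_odds in (-2.0, 0.0, 2.0):
    print(log_odds, round(inverse_logit(log_odds), 3))
```

Whatever the value of x, the result lies between 0 and 1, and x = 0 corresponds to p = 0.5.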