Biostats Final Exam Flashcards

1
Q

Idea of the Chi-Square test for Goodness of Fit: when to use it

A
  • used when the data are categorical
  • measures how different the observed data are from what we would expect if Ho were true

not symmetric –> p value always area to the RIGHT of test statistic

2
Q

Chi-Square Statistic

A

(χ²)

  • The chi-square statistic compares observed and expected counts
  • Observed counts: the actual number of observations of each type (ex: number of babies born on Monday)
  • Expected counts: the number of observations that we would expect to see of each type if the null hypothesis were true (ex: the number you would expect to find; usually has to be calculated)

Large values of chi-square represent strong deviations from the expected distribution under Ho, and will tend to be statistically significant (large test stat –> small p-value –> probably a significant result)
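
A minimal Python sketch (using hypothetical counts of births by day of the week) of how the statistic is assembled from observed and expected counts:

```python
# Hypothetical observed counts of births for each day of the week (k = 7)
observed = [45, 38, 41, 36, 40, 25, 25]
n = sum(observed)                                # total sample size
expected = [n / len(observed)] * len(observed)   # uniform Ho: n/k per level

# chi-square = sum over levels of (observed - expected)^2 / expected
components = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(components)
print(chi_square)
```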

3
Q

Chi-Square Distributions

A

the chi-square distributions are a family of distributions that take only positive values, are skewed to the right, and are described by a specific number of degrees of freedom

  • p values always area to right of test statistic
  • highly right-skewed
  • not defined for negative values
  • looks different for every degree of freedom
4
Q

Goodness of Fit Hypothesis

A

The chi-square test can be used for one categorical variable (1 SRS) with any number of levels (k). The null hypothesis can be that all population proportions are equal (uniform hypothesis) or that they are equal to some specific values (as long as the sum of all the population proportions in Ho = 1)

  • Ho: p1 = p2 = p3 = p4 = p5 = p6 = p7 = 1/7
  • Ho: pA=1/4, pB=1/2, pC=1/4

For 1 SRS of size n with k levels of a categorical variable:

  • When testing Ho: p1 = p2 = ... = pk (a uniform distribution), the expected counts are all equal to n/k
  • When testing Ho: p1 = p1,Ho and p2 = p2,Ho ... and pk = pk,Ho, the expected count in each level i is: expected count(i) = n x pi,Ho
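
A small sketch of both cases, using a hypothetical sample size:

```python
# Uniform Ho (e.g., p1 = ... = p7 = 1/7): every expected count is n/k
n, k = 210, 7                            # hypothetical SRS of 210, 7 levels
uniform_expected = n / k                 # 30 expected per level

# Specified Ho (e.g., pA = 1/4, pB = 1/2, pC = 1/4):
# expected count(i) = n * pi,Ho for each level i
props = {"A": 0.25, "B": 0.50, "C": 0.25}
expected = {level: n * p for level, p in props.items()}
print(uniform_expected, expected)   # 30.0 {'A': 52.5, 'B': 105.0, 'C': 52.5}
```
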
5
Q

Conditions for the goodness of fit test

A

The chi-square test for goodness of fit is used when we have a single SRS from a population and the variable is categorical with k mutually exclusive levels

We can safely use the chi-square test when:

  • all expected counts have values > or = 1.0
  • no more than 20% of the k expected counts have values < 5
6
Q

Chi-square test for goodness of fit: overview, degrees of freedom, p value

A

The chi-square statistic for goodness of fit with k proportions measures how much observed counts differ from expected counts. When Ho is true, it follows the chi-square distribution with k-1 degrees of freedom
- The p value is the area in the right tail under the chi-squared distribution with df = k - 1
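
A minimal sketch using SciPy on hypothetical counts; scipy.stats.chisquare defaults to a uniform Ho and applies df = k - 1 automatically:

```python
from scipy import stats

observed = [45, 38, 41, 36, 40, 25, 25]   # hypothetical counts, k = 7
result = stats.chisquare(observed)        # uniform Ho by default
# result.pvalue is the right-tail area under chi-square(df = k - 1 = 6)
print(result.statistic, result.pvalue)
```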

7
Q

Interpreting the chi-squared statistic

A

The individual values summed in the chi-square statistic are the chi-square components (or contributions). When the test is statistically significant:

  • The largest components indicate which condition(s) are most different from what we would expect under Ho. Compare the observed and expected counts to interpret the findings.
  • You can also compare the actual proportions quantitatively in a graph.
8
Q

Chi-squared test – Lack of significance: avoid a logical fallacy

A
  • A non-significant P value is not conclusive: Ho could be true or not.

This is particularly relevant in the chi-squared goodness of fit test, where we are often interested in Ho, that the data fit a particular model.

  • A significant p-value suggests that the data do not follow that model
  • But finding a non-significant P-value is NOT a validation of the null hypothesis and does NOT suggest that the data do follow the hypothesized model. It only shows that the data are not inconsistent with the model.
9
Q

Two-way tables

A

An experiment has a two-way or block design if two categorical factors are studied with several levels of each factor.
- Compare TWO categorical variables.

Two way tables organize data about two categorical variables with any number of levels/treatments obtained from a two-way, or block, design.

When you see a two-way table, you should think of a chi-squared test (most likely, a test of independence)

10
Q

Marginal distribution

A

The marginal distributions (in the “margins” of the table) summarize each factor independently (the total of one column/row divided by the total over all participants)
- With two factors, there are two marginal distributions.

11
Q

Conditional distribution

A

The cells of the two-way table represent the intersection of a given level of one factor with a given level of the other factor. This can be used to compute the conditional distribution (one individual cell value within the table divided by the total for that row or column).

12
Q

Two-way table: Hypotheses

A

A two-way table has r rows and c columns.

Ho: There is no association between the row and column variables in the table.

Ha: There is an association/relationship between the two variables.

We will compare actual counts from the sample data with expected counts given the null hypothesis of no relationship.

13
Q

Two-way table: Expected counts

A

The expected count in any cell of a two-way table when Ho is true is:

expected count = (row total x column total) / table total
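
A minimal sketch applying this formula to every cell of a hypothetical table:

```python
# Hypothetical 2 x 3 two-way table of observed counts
table = [[30, 20, 10],
         [20, 30, 40]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
table_total = sum(row_totals)

# expected count = (row total x column total) / table total for each cell
expected = [[r * c / table_total for c in col_totals] for r in row_totals]
print(expected)
```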

14
Q

Conditions for the Chi-Squared Test of independence

A

The chi-square test for two-way tables looks for evidence of association between two categorical variables (factors) in sample data. The samples can be drawn either:

  • By randomly selecting SRSs from different populations (or from a population subjected to different treatments) - ex: girls vaccinated for HPV or not among 8th graders and 12th graders
  • Or by taking one SRS and classifying the individuals according to two categorical variables (factors) - ex: obesity and ethnicity among high school students

We can safely use the chi-square test of independence when:

  • very few (no more than 1 in 5) expected counts are <5
  • all expected counts are > or = 1.0
15
Q

The chi-square test for two-way tables: hypotheses, degrees of freedom, p-value

A

Ho: rows and column variables are independent (there is no association between the row and column variables)
Ha: row and column variables are dependent (there is an association)

The χ² statistic is summed over all r x c cells in the table: χ² = sum of (observed count - expected count)^2 / expected count – formula sheet

When Ho is true, the chi-squared statistic follows the chi-squared distribution with (r-1)(c-1) degrees of freedom.

P-value: P(chi-squared variable > or = the calculated value)
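
A sketch with SciPy's chi2_contingency on a hypothetical table; it returns the statistic, p-value, df = (r-1)(c-1), and the expected counts in one call:

```python
from scipy.stats import chi2_contingency

table = [[30, 20, 10],      # hypothetical r = 2 by c = 3 table
         [20, 30, 40]]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, p, dof)          # dof = (r-1)(c-1) = 1 * 2 = 2
print(expected)              # expected counts under independence
```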

16
Q

Chi-square test for 2-way tables: Interpreting the chi-squared statistic

A

When the chi-squared test is statistically significant:
- the largest components indicate which condition(s) are most different from what Ho predicts. You can also compare the observed and expected counts or compare the computed proportions in a graph.

  • Reject null –> conclude there is a relationship between the two categorical variables
  • largest component: probably most related
17
Q

ANOVA test: brief description

A

analysis of variance test (compares the means of 3+ groups)

18
Q

Comparing Several Means

A
  • When comparing >2 populations, the question is not only whether each population mean, mu i, is different from others, but also whether they are significantly different when taken as a group.
  • extension of a 2-sample design
  • compare 3+ groups using mean (average)

ANOVA

19
Q

Handling Multiple Comparisons Statistically

A
  • The first step in examining multiple populations statistically is to test for an overall statistical significance as evidence of any difference among the parameters we want to compare –> ANOVA F TEST
  • AFTER: If the overall test showed statistical significance, then a detailed follow-up analysis can examine all pair-wise parameter comparisons to define which parameters differ from which and by how much —> more complex methods

Essentially:

(1) Compare all groups together - Are the means all equal, or is at least one different? (ex: did none of the drugs work, or did at least one?)
(2) If we reject Ho, find out which groups are most different.

20
Q

Sample variance + standard deviation

A

*** you may be given s and have to find s^2, or vice versa

Sample variance: s^2

Sample standard deviation: s

21
Q

ANOVA test: hypotheses

A

Ho: always a statement about the population (mu, NOT x bar)

examples:
Ho: mu1 = mu2 = mu3
Ha: at least one mu(i) is different (Ho is not true)

22
Q

Factor

A

a variable that can take one of several levels, used to differentiate one group from another

23
Q

An experiment has a one-way or completely randomized design if:

A

several levels of one factor are being studied and the individuals are randomly assigned to its levels

  • Ex: one way = 4 levels of nematode quantity in seedling growth experiment
  • Ex: two way = 2 seed species and 4 levels of nematodes
24
Q

One-way ANOVA

A

used for completely randomized, one-way designs

- need a quantitative response variable

25
Q

The ANOVA F Test

A

The analysis of variance F test compares the variation due to specific sources (levels of the factor) with the variation among individuals who should be similar (individuals in the same sample).

Ho: All means (mu i) are equal
Ha: NOT ALL means (mu i) are equal (Ho is not true).

The analysis of variance F statistic for comparing several means is:
F = variation among the sample means (variation between groups) / variation among individuals in the set of samples (variation within) = MSG/MSE

F test: looks at variation between groups vs. variation within groups
- how do the means vary? how do the observations within each sample vary?

k: # of levels (# of groups being compared)

26
Q

ANOVA F Test: variability + F value trends

A
  • Variability of means smaller than variability within samples –> F tends to be small
  • Variability of means larger than variability within samples –> F tends to be large

Small test statistic (F) corresponds to high p-value
- Large F –> small p-value

27
Q

ANOVA F Test assumptions

A
  1. The k samples must be independent SRSs. The individuals in each sample are completely unrelated (randomization + no overlap between groups)
  2. Each population represented by the k samples must be Normally distributed. However, the test is robust to deviations from Normality (skew, mild outliers) for large-enough samples, thanks to the central limit theorem
  3. The ANOVA F-test requires that all k populations have the same standard deviation sigma.
    - There are inference tests for this, but they tend to be sensitive to deviations from the Normality assumption or require equal sample sizes.
    - A simple and conservative approach: The ANOVA F test is approximately correct when the largest sample standard deviation is no more than ~ twice as large as the smallest sample standard deviation.

Equal sample sizes make the ANOVA more robust to deviations from the equal sigma rule (if sample sizes equal, test more robust)

Summary:
1. Independent random sample
2. Normal populations or large enough sample size
3. Equal population standard deviation (largest sd/smallest sd <2. If >2, don’t do ANOVA test)
+ Have to check normality for each group

Have to satisfy all three assumptions to be able to do ANOVA test

28
Q

ANOVA test calculations

A
  • We have k independent SRSs, from k populations or treatments. The (i)th population has a Normal distribution with unknown mean mu(i). All k populations have the same standard deviation sigma, unknown.
k = number of groups being compared 
N = total sample size (n for all groups added together)
k-1 = numerator degrees of freedom
N-k = denominator df
Ho: mu1 = mu2 = ... = muk

F (test statistic) = MSG/MSE = (SSG/(k-1))/(SSE/(N-k))
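
A sketch of these calculations on hypothetical samples, checked against SciPy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

# k = 3 hypothetical independent samples
groups = [np.array([4.1, 5.0, 6.2, 5.5]),
          np.array([6.8, 7.1, 5.9, 6.5]),
          np.array([5.0, 4.4, 4.9, 5.3])]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

SSG = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between groups
SSE = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within groups

MSG = SSG / (k - 1)   # numerator df = k - 1
MSE = SSE / (N - k)   # denominator df = N - k
F = MSG / MSE
print(F, stats.f_oneway(*groups).statistic)   # the two values should match
```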

29
Q

Characteristics of F distribution

A
  • highly right skewed (not symmetrical)
  • can only take on positive values (0, infinity)
  • need two degrees of freedom (numerator and denominator degrees of freedom)

The F distribution is asymmetrical and has two distinct degrees of freedom. This was discovered by Fisher, hence the label “F”

30
Q

MSG and MSE

A

MSG: the mean square for groups; a variance of the means weighted for sample size. It measures the variability between the groups

MSE: the mean square for error or pooled sample variance (s subscript p ^2) is the average sample variance weighted for sample sizes. It measures the variability within each of the groups.
- MSE: mean square error, or pooled sample variance

x bar = overall average when you combine all the data, ignoring which group each observation is in

31
Q

If given standard deviation

A

square it to find variance!!!!

s^2 = variance of group 1

32
Q

SSE

A

sum of squares of error

ANOVA Test

33
Q

Using the F table for an ANOVA test

A

calculate the value of F for our sample data and then look up the corresponding area under the curve in the F table

34
Q

Two-Sample t Test and ANOVA test

A

If comparing two groups: could do ANOVA (making the constant standard deviation assumption) OR a two-sample t test (will get the same conclusion)

A two-sample t test assuming equal variance with a two-sided Ha and an ANOVA comparing only 2 groups will give the same p-value

One-way ANOVA:

  • Ho: mu1 = mu2
  • Ha: mu1 not equal to mu2
  • F statistic

t Test assuming equal variance:

  • Ho: mu1 = mu2
  • Ha: mu1 not equal to mu2
  • t statistic

The t test is more flexible: You may choose a one-sided alternative or you may want to run a t test assuming unequal variance if you are not sure that the 2 populations have the same standard deviation sigma.
- fewer assumptions in T test (more flexible, can have directionality)
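
A sketch demonstrating the equivalence on hypothetical data: for 2 groups, F = t^2 and the two-sided p-values agree exactly.

```python
from scipy import stats

a = [5.1, 4.8, 5.6, 5.3, 4.9]    # hypothetical sample 1
b = [5.9, 6.2, 5.7, 6.4, 6.0]    # hypothetical sample 2

t_res = stats.ttest_ind(a, b, equal_var=True)   # pooled two-sample t test
f_res = stats.f_oneway(a, b)                    # one-way ANOVA with 2 groups

print(t_res.pvalue, f_res.pvalue)               # identical p-values
print(t_res.statistic ** 2, f_res.statistic)    # F = t^2
```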

35
Q

Interpreting Results from an ANOVA

A
  • Ho states that all mu are equal, while Ha states that Ho is not true –> A significant P-value leads you to reject Ho indicating that not all mu are equal.

What do you do next? Which means are different from which?

  • You can gain insight by looking at summary graphs (box plot, mean +/- standard deviation) or finding the confidence interval for each mean mu
  • There are also some tests of statistical significance designed specifically for multiple tests
36
Q

If you get significant result in ANOVA, what can you do?

A
  1. Two-sample t tests to look at which groups are most different

2. graphs (histograms, box plots)

37
Q

ANOVA: Pooled Variance and Confidence Intervals

A

(pooled together all groups)

MSE, the mean square for error or pooled sample variance (s subscript p ^2), estimates the common variance sigma^2 of the k populations.

Thus, we can easily calculate separate level C confidence intervals for each population mean mu.

x bar(i) +/- t* square root of (MSE/ni)
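
A sketch of one such interval using hypothetical ANOVA output (note that t* here uses the degrees of freedom of MSE, which is N - k):

```python
import math
from scipy import stats

MSE = 1.6              # hypothetical pooled sample variance from the ANOVA
N, k = 30, 3           # hypothetical total sample size and number of groups
xbar_i, n_i = 5.4, 10  # hypothetical mean and size of one group

t_star = stats.t.ppf(0.975, df=N - k)       # 95% CI critical value
half_width = t_star * math.sqrt(MSE / n_i)
print(xbar_i - half_width, xbar_i + half_width)
```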

38
Q

Looking at a scatterplot

A
  • Most scatterplots are created from sample data

Is the observed relationship statistically significant (not entirely explained by chance events due to random sampling)?

What is the population mean response mu y as a function of the explanatory variable x?

mu y = B0 + B1x

How does changing x relate to/affect y?

39
Q

The regression model

A

The least-squares regression line y hat = b0 + b1x is a mathematical model of the relationship between two quantitative variables:

“sample data = fit + residual”

The regression line is the fit. For each data point in the sample, the residual is the difference: y - y hat

Weak relationship ==> residuals (vertical distances) larger

For any fixed x, the responses y follow a Normal distribution with standard deviation sigma

At the population level, the model becomes: yi = (B0 + B1xi) + epsilon(i), with residuals epsilon(i) independent and Normally distributed N(0, sigma)

The population mean response mu y is: mu y = B0 + B1x

y hat: unbiased estimate for the mean response mu y

b0: unbiased estimate for intercept B0
b1: unbiased estimate for slope B1

40
Q

Regression Model: assumptions

A
  • mean of 0 (residuals)
  • normally distributed (residuals)
  • standard deviation (spread of residuals) constant for every x
  • independence is key
  • key assumption is constant standard deviation (sigma) - sigma needs to be estimated

Regression assumes equal variance of Y (sigma is the same for all values of x)

41
Q

Regression standard error

A

Regression standard error (s) for n sample data points is computed from the residuals (yi - yhat):

s = square root of [ (sum of residuals^2) / (n-2) ] = square root of [ (sum of (yi - y hat i)^2) / (n-2) ] —> not on formula sheet - you will never have to calculate this, only identify it

s is an unbiased estimate of the regression standard deviation sigma

42
Q

Regression Model: Confidence Interval for the Slope B1

A

Estimating the regression parameter B1 for the slope is a case of one-sample inference with sigma unknown. Hence, we rely on t distributions.

The standard error of the slope b1 is: SEb1 = s / square root of (sum of (x - xbar)^2), where s is the regression standard error — equation on formula sheet

Thus, a level C confidence interval for the slope B1 is: estimate +/- t* SEestimate, i.e., b1 +/- t* SEb1

t* is the critical value for the t(df = n-2) density curve with C% between -t* and +t*

43
Q

Testing the hypothesis of no relationship

A

To test for a significant relationship, we ask if the parameter for the slope B1 is equal to zero, using a one-sample t test.

The standard error of the slope b1 is: SEb1 = s / square root of (sum of (x - xbar)^2) —- given on formula sheet

We test the hypotheses Ho: B1 = 0 versus a one or two-sided Ha

We compute t = b1 / SEb1, which has the t(n-2) distribution, to find the P-value of the test (GIVEN)
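
A sketch on hypothetical paired data using scipy.stats.linregress, whose reported p-value is exactly this two-sided slope test; it also returns SEb1, so the confidence interval from the previous card drops out too:

```python
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical paired data
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]

res = stats.linregress(x, y)
t = res.slope / res.stderr            # t = b1 / SEb1, df = n - 2
print(t, res.pvalue)                  # two-sided test of Ho: B1 = 0

# Level C = 95% confidence interval for the slope: b1 +/- t* SEb1
t_star = stats.t.ppf(0.975, df=len(x) - 2)
print(res.slope - t_star * res.stderr, res.slope + t_star * res.stderr)
```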

44
Q

Inference for Prediction

A

One use of regression is for prediction within the range of the data: y hat = b0 + b1x. But this prediction depends on the particular sample drawn. We need statistical inference to generalize our conclusions.

To estimate an individual response y for a given value of x, we use a prediction interval (PI) for one single new observation

Prediction = predicting for one new observation

If we randomly sampled many times, there would be many different values of y obtained for a particular x, varying Normally (with standard deviation sigma) around the mean response mu y.

45
Q

Confidence Interval for mu y

A

We may also want to predict the population mean value of y (mu y) for any value of x within the range of data tested.

Using inference, we calculate a level C confidence interval for the population mean mu y of all responses y when x takes the value x*: This interval is centered on y hat, the unbiased estimate of mu y.

The true value of the population mean mu y at any given value of x will indeed be within our confidence interval in C% of all intervals computed from many different random samples.

interval for mean response y when we have a subset of a population

A level C prediction interval for a single observation on y when x takes the value x* is:
y hat +/- t* SEyhat where SEyhat = standard error for a prediction (all these formulas on formula sheet)

A level C confidence interval for the mean response mu y at a given value x* of x is:
y hat +/- t* SEmu hat

Use t* for a t distribution with df = n-2 (this is the df for both)

Note that the standard error for a single prediction is larger than the standard error for the mean response.

Note: SEyhat is always > SEmu hat (the prediction interval is always wider)
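
A sketch with statsmodels on hypothetical data; summary_frame labels the confidence interval for mu y as mean_ci_* and the wider prediction interval as obs_ci_*:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical data
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

fit = sm.OLS(y, sm.add_constant(x)).fit()
new = np.column_stack([[1.0], [3.5]])          # [intercept, x*] for x* = 3.5

frame = fit.get_prediction(new).summary_frame(alpha=0.05)   # level C = 95%
# mean_ci_* = confidence interval for mu y; obs_ci_* = prediction interval
print(frame[["mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```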

46
Q

Conditions for inference: assumptions for linear regressions

A
  • the observations are independent
  • there is some linear relationship (check with scatterplot, correlation coefficient)
  • the standard deviation of y, sigma, is the same for all values of x
  • the response y varies Normally around its mean

(constant standard deviation)

47
Q

Using Residual Plots to check for regression validity

A

The residuals (y - y hat) give useful information about the contribution of individual data points to the overall pattern of scatter.

Residuals: vertical distances between line and point

We view the residuals in a residual plot.
- If residuals are scattered randomly around 0 with uniform variation, it indicates that the data fit a linear model, have Normally distributed residuals for each value of x, and constant standard deviation sigma.

Residuals are randomly scattered: good
Curved pattern of residuals: relationship is not linear
Change in variability across plot: sigma not equal for all values of x

48
Q

Rank tests vs. Normal tests

A

For strongly skewed data, we prefer the median to the mean for describing the center of the data.

Hypotheses for rank tests rely on the median and data ranks.

49
Q

Ranks

A

To rank sample observations, first arrange them in order from smallest to largest. The rank of each observation is its position in this ordered list, starting with rank 1 for the smallest observation.

First we will assume there are no ties. Then we will examine how to deal with them.

50
Q

Equivalent ranks test for one-sample t test, two-sample t test, and one-way ANOVA F test

A

One-sample t test: use Wilcoxon signed rank test (not on our exam?)

Two-sample t test: Wilcoxon rank sum test (Mann-Whitney test) —- two independent samples

One-way ANOVA F test: Kruskal-Wallis test —- several independent samples
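
A sketch of the two rank tests on hypothetical independent samples:

```python
from scipy import stats

g1 = [3.2, 4.1, 2.8, 5.0, 3.7]   # hypothetical independent samples
g2 = [4.9, 6.3, 5.8, 7.2, 5.1]
g3 = [2.9, 3.3, 2.5, 3.8, 3.1]

# Two independent samples: Wilcoxon rank sum / Mann-Whitney test
print(stats.mannwhitneyu(g1, g2, alternative="two-sided"))

# Several independent samples: Kruskal-Wallis test
print(stats.kruskal(g1, g2, g3))
```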

51
Q

Conditions for Wilcoxon test

A
  • data come from a randomized comparative experiment

- continuous distribution

52
Q

Multivariable Linear Regression

A

Response variable: continuous variable

Explanatory variables: continuous or categorical variables

Cholesterol ~ treatment + gender + race + age

53
Q

Fisher’s Exact Test

A

Response variable: categorical

Explanatory variable: categorical

  • Test that also uses this situation: Chi-squared test
  • Fisher’s exact test is used when the expected count assumption is not met
54
Q

Logistic Regression

A

Response variables: categorical

Explanatory variable(s): continuous or categorical variables

We can put in multiple explanatory variables so that, similarly to multivariable linear regression, we can control for other variables like gender, race, age, etc.

Cholesterol {0,1} ~ treatment + gender + race + age

55
Q

Histograms

A

A histogram is a graph in which the horizontal scale represents the classes of data values and the vertical scale represents frequencies. The heights of the bars correspond to the frequency values and the bars are drawn adjacent to each other.

  • tells us shape and distribution
  • This is a summary graph for a single variable.
  • Histograms are useful to understand the pattern of variability in the data, especially for large data sets.
56
Q

Standard deviation vs. variance

A

SD: a measure of variation of values about the mean (s)

Variance: a measure of variation equal to the square of the standard deviation (s^2)

57
Q

Find suspected outliers

A

IQR = Q3-Q1

Suspected low outlier: any value < Q1 - 1.5 IQR
Suspected high outlier: any value > Q3 + 1.5 IQR
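
A sketch on hypothetical data (note that numpy's percentile interpolation may differ slightly from the quartile convention used in class):

```python
import numpy as np

data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 20])   # hypothetical; 20 looks odd
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                   # IQR = Q3 - Q1

low_cut = q1 - 1.5 * iqr      # below this: suspected low outlier
high_cut = q3 + 1.5 * iqr     # above this: suspected high outlier
print(data[(data < low_cut) | (data > high_cut)])
```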

58
Q

linear correlation coefficient

A

(r)

  • measures the strength of the linear association between paired x and y quantitative values in a sample. r is a sample statistic representing the population correlation coefficient, rho.
  • r is NOT resistant to outliers
59
Q

simple linear regression

A

the regression equation expresses an association between x and y

  • x is the independent/predictor/explanatory variable
  • y is the dependent/response variable
60
Q

least squares regression line

A

line of best fit
- the unique line such that the sum of the vertical distances between the data points and the line is zero, and the sum of the squared vertical distances is the smallest possible

y hat = b0 + b1x

61
Q

residuals

A

y - y hat

vertical distance from each point to the least-squares regression line

62
Q

simple random sample (SRS)

A
  • randomly selected individuals
  • each individual in the population has the same probability of being in the sample
  • all possible samples of size n have the same chance of being drawn
63
Q

case control studies

A

start with 2 random samples of individuals with different outcomes, and look for exposure factors in the subjects’ past (looking into the past of people with the disease)

64
Q

cohort study

A

enlists individuals of a common demographic and keeps track of them over a long period of time (to see if they develop a disease/condition)

65
Q

cross-sectional studies

A

measure the exposure and outcome at the same time (ex: survey)

66
Q

confidence interval

A

a range of values with an associated probability, or confidence level C

  • this probability (value between 0 and 1) quantifies the chance that the interval contains the unknown population parameter
  • you have confidence C that mu falls within the interval computed
67
Q

null hypothesis

A

a very specific statement about a parameter of the population

68
Q

alternative hypothesis

A

a more general statement that complements yet is mutually exclusive with the null hypothesis

69
Q

p value

A

the probability, if Ho was true, of obtaining a sample statistic at least as extreme (in the direction of Ha) as the one obtained

  • the area under the curve
  • a value between 0 and 1
70
Q

significance level (a)

A

the largest p-value tolerated for rejecting Ho (how much evidence against Ho we require)

  • this value is decided arbitrarily before conducting the test
  • default is 5%
  • when p is less than or equal to alpha, we reject Ho