basic data Flashcards

part 2 stats

1
Q

types of data

A

Nominal Data
This describes categorical data without an order. Examples include blood groups (O, A, B, AB), eye colour and marital status.

Ordinal Data
Ordinal data are also categorical, but in this case categories have an order and can be ranked. Examples include stages of breast cancer. Importantly, the “distances” between the different groups can be variable. For example, Likert responses may have the options “strongly agree”, “agree”, “neither agree nor disagree”, “disagree” and “strongly disagree”. Clearly these can be ordered, so it is an example of ordinal data, but the difference in agreement between “agree” and “strongly agree” may not be the same as that between “agree” and “neither agree nor disagree”.

Binary data
Binary, or dichotomous, data have only two possible outcomes. Common examples are Yes/No or True/False responses, but they could also include other common epidemiological outcomes, such as “survived” and “not survived”.

Numeric data
Numeric data can be discrete or continuous. Discrete data have fixed values. Examples include shoe size or number of people. Continuous data can take any value, frequently within a given range. Examples include weight and length (where the range would be from zero to, theoretically, infinity).

2
Q

types of data scales

A

(i) A nominal scale uses numbers purely as labels, and there is no intrinsic order to the values, for example, ethnic group. A nominal variable is used for mutually exclusive, but not ordered, categories. For example, a study might compare five different countries. You can code the five countries with numbers, but the numerical order is arbitrary.

(ii) Ordinal scales are qualitative and ordered, but without any mathematical relationship between the points, for example, social class. An ordinal variable is one where the order matters but not the difference between values.

(iii) Interval scales are ordered, and the intervals between consecutive points on the scale are equal. That is, interval scales are where the difference between two values is meaningful (e.g. temperature in centigrade or Fahrenheit).

(iv) Ratio scales are interval scales but with a true zero, e.g. weight. That is, ratio scales have all the properties of interval scales, and also have a clear definition of zero (e.g. height or weight).

https://www.fph.org.uk/media/1223/june-2011-final.pdf

3
Q

how can you measure the spread of data

A
  1. range
  2. IQR (interquartile range)
  3. variance / SD
  4. coefficient of variation (SD / mean)
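
A minimal sketch of the four measures above using only the Python standard library. The function name `spread_summary` and the data values are made up for illustration; the IQR depends on the quartile method used (here the default of `statistics.quantiles`).

```python
import statistics

def spread_summary(values):
    """Return range, IQR, variance, SD and coefficient of variation."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    sd = statistics.stdev(values)                  # sample SD (n - 1 denominator)
    mean = statistics.mean(values)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),   # sample variance = SD squared
        "sd": sd,
        "cv": sd / mean,                           # coefficient of variation = SD / mean
    }

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example values
print(spread_summary(data))
```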
4
Q

what is variance and standard deviation?

A

in formula sheet

5
Q

what is standard error of the mean?

A

standard deviation of the sampling distribution of the mean

in formula sheet

95% of sample means will fall within 1.96 SEM of the population mean, so the population mean lies within 1.96 SEM of the sample mean 95% of the time
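
A small sketch of the idea, assuming the usual formula SEM = SD / √n (the card only says "in formula sheet"). The function name `sem_and_ci` and the sample values are hypothetical.

```python
import math
import statistics

def sem_and_ci(sample, z=1.96):
    """Return (mean, SEM, lower, upper) for an approximate 95% CI."""
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(len(sample))  # SD / sqrt(n)
    return mean, sem, mean - z * sem, mean + z * sem

sample = [10, 12, 14, 16, 18]  # made-up example values
mean, sem, lower, upper = sem_and_ci(sample)
print(f"mean={mean}, SEM={sem:.3f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

With a small n like this, a t multiplier would normally replace 1.96; the sketch keeps 1.96 to match the card.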

6
Q

what is the normal distribution

A

symmetrical around the mean (which equals the median and mode), bell shaped

7
Q

what is the p value

A

the probability of getting the observed value, or one that is more extreme, if the null hypothesis were correct.

8
Q

unpaired z test

A

parametric test for a difference between 2 independent groups when n is large

1) The data must be normally distributed.
2) All data points must be independent.
3) For each sample the variances must be equal.

  • A z-score of 1.96 is equivalent to a two-tailed p-value of 0.05; therefore, a z-score >1.96 can be considered statistically significant at the 5% level
  • for proportions, the SE is calculated by
    - se ≈ √( p(1−p)/n1 + p(1−p)/n2 )

where p = average (pooled) proportion for the two groups
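
A hedged sketch of the z test for two proportions, with the whole sum under the square root. It assumes p is the pooled proportion (all events / all participants); the function name and the counts are invented for illustration.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-tailed p) for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                       # pooled proportion (assumption)
    se = math.sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))      # two-tailed normal p-value
    return z, p_value

z, p_value = two_proportion_z(60, 100, 40, 100)     # made-up counts
print(f"z={z:.2f}, p={p_value:.4f}")
```

Here |z| is above 1.96, so the difference would be significant at the 5% level.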

9
Q

paired z test

A

parametric test for paired data when n is large

1) The data must be normally distributed.
2) All data points must be independent.
3) For each sample the variances must be equal.

z = (d − D) / √(σ2 / n)

where d = mean of the differences between the samples,
D = hypothesised mean of the differences (usually this is zero),
n = the sample size and
σ2 = the population variance of the differences.
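
A minimal sketch using the symbols defined above, assuming the standard form z = (d − D) / √(σ2 / n). The function name `paired_z` and the before/after values are hypothetical.

```python
import math
import statistics

def paired_z(before, after, sigma2, D=0.0):
    """z statistic for paired data with known variance of the differences."""
    diffs = [a - b for a, b in zip(after, before)]  # one difference per pair
    d_bar = statistics.mean(diffs)                  # d: mean of the differences
    n = len(diffs)                                  # n: number of pairs
    return (d_bar - D) / math.sqrt(sigma2 / n)

# Made-up paired measurements, with sigma2 assumed known and equal to 1
print(paired_z(before=[1, 2, 3, 4], after=[2, 4, 4, 6], sigma2=1.0))
```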

10
Q

unpaired t test

A

used if n is small (normally <30)

parametric test

1) The data must be normally distributed.
2) All data points must be independent.
3) For each sample the variances must be equal.

11
Q

ANOVA

A

parametric test to compare the mean of an outcome between 2+ groups of one exposure

can do 2-way or multi-way ANOVA if there is more than one exposure

assumptions:
- outcome normally distributed,
- SD the same in each exposure group

12
Q

linear regression

A

assumptions: normal distribution, linear relationship

can use Pearson's correlation coefficient (parametric)

13
Q

what is Bayes' theorem

A

P(A | B) = P(A ∩ B) / P(B)

P(A | B) = P(B | A) x P(A) / P(B)
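
A short worked example of the second form. The probabilities are invented: A could be "has disease" and B "tests positive", and P(B) is built from the law of total probability.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) x P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

p_a = 0.01              # P(A), made-up prior
p_b_given_a = 0.90      # P(B | A), made-up
p_b_given_not_a = 0.05  # P(B | not A), made-up

# P(B) = P(B | A) P(A) + P(B | not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

print(round(bayes_posterior(p_b_given_a, p_a, p_b), 4))
```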

14
Q

chi squared test

A

test for independence
needs a large sample size (expected count of 5 or more in each cell)

to test whether the rows and columns of an r x c table are independent or whether there is an association

H0: variable 1 and variable 2 are independent.
H1: not independent.
for a 2 x 2 table (1 df), chi squared > 3.84 gives p < 0.05

how to calculate:
1. create 2 x 2 table
2. calculate the expected value for each cell: (row sum * column sum) / table sum
3. use the chi squared formula to work out the statistic
4. if it is > 3.84, reject H0: they are associated.

use Fisher's exact test if n is small
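
The calculation steps can be sketched as below, assuming the usual statistic sum of (observed − expected)² / expected with no continuity correction. The function name and the table counts are made up.

```python
def chi_squared(table):
    """Chi squared statistic for an r x c table of observed counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total  # (row sum * column sum) / table sum
            stat += (observed - expected) ** 2 / expected
    return stat

table = [[20, 30],
         [30, 20]]  # made-up 2 x 2 counts
stat = chi_squared(table)
print(stat, "reject H0" if stat > 3.84 else "fail to reject H0")
```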

15
Q

chi squared test for trend

A

used for ordered categorical exposure variables. It tests the null hypothesis that there is no linear increase in the log odds per exposure group.

(e.g. menarche and small/medium/large fold test)

16
Q

McNemar's test

A
  • used when have paired data
  • to see if the outcome and exposure are independent

look at the discordant pairs!

o Assumption 1: You have one categorical dependent variable with two categories (i.e. a dichotomous variable) and one categorical independent variable with two related groups.
o Assumption 2: The two groups of your dependent variable are mutually exclusive.
o Assumption 3: The cases are a random sample from the population of interest.

for 1 df (X2 distribution): X2 > 3.84 gives p < 0.05

create 2 x 2 table of pairs (r and s are the discordant pairs)

        +    -
   +    a    r
   -    s    b

in formula sheet
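
A minimal sketch using only the discordant cells r and s from the table, assuming the common form X2 = (r − s)² / (r + s) without continuity correction (the formula sheet may use a corrected version). The counts are made up.

```python
def mcnemar_statistic(r, s):
    """McNemar X2 from the two discordant pair counts (1 df)."""
    return (r - s) ** 2 / (r + s)

r, s = 15, 5  # made-up discordant pair counts
x2 = mcnemar_statistic(r, s)
print(x2, "p < 0.05" if x2 > 3.84 else "p >= 0.05")
```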

17
Q

Direct standardisation

A

way to adjust for age if you have age specific rates for the study population

procedure:
1. identify a standard population
2. multiply the age specific rate from the study population by the standard population number for that stratum
3. sum all of these up
4. sum (ASR from study pop x standard pop number) / total standard population = age standardised rate

check whether the pattern of change of rates across the strata is the same

if there are 2 populations, can calculate the comparative mortality ratio: just divide one standardised rate by the other
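
The procedure can be sketched as follows. The rates and population sizes are invented, with one entry per age stratum in the same order in each list.

```python
def direct_standardised_rate(study_rates, standard_pop):
    """Apply study age specific rates to the standard population."""
    expected = sum(rate * pop for rate, pop in zip(study_rates, standard_pop))
    return expected / sum(standard_pop)

standard_pop = [8000, 2000]  # made-up standard population: young, old
rates_a = [0.001, 0.010]     # made-up age specific rates, population A
rates_b = [0.002, 0.010]     # made-up age specific rates, population B

rate_a = direct_standardised_rate(rates_a, standard_pop)
rate_b = direct_standardised_rate(rates_b, standard_pop)
print(rate_a, rate_b, "CMR =", rate_a / rate_b)
```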

18
Q

Indirect standardisation

A

for when you DON'T have age specific rates for the study population

procedure:
1. identify a standard population
2. apply standard population age specific rates to the study population to get EXPECTED number of deaths
3. SMR = observed / expected

how much more/less likely someone is to (die) compared to someone of the same age/sex in the standard population

(if SMR = 1, the same)

don't compare different SMRs, as they may have different underlying population structures
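
A short sketch of the SMR calculation. All numbers are made up, with one entry per age stratum.

```python
def smr(observed_deaths, standard_rates, study_pop):
    """SMR = observed deaths / expected deaths."""
    # Expected deaths: standard age specific rates applied to the study population
    expected = sum(rate * pop for rate, pop in zip(standard_rates, study_pop))
    return observed_deaths / expected

standard_rates = [0.001, 0.010]  # made-up rates in the standard population
study_pop = [5000, 1000]         # made-up study population by age stratum
print(smr(30, standard_rates, study_pop))
```

SMRs are often multiplied by 100, in which case 100 rather than 1 means "the same".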

19
Q

Wilcoxon signed rank

A

non parametric

similar to paired t test

null: median of differences between paired observations = 0

W > critical value: fail to reject the null

opposite to most other tests, where a value bigger than the critical value would mean p is lower than the threshold value

20
Q

Wilcoxon rank sum/ Mann-Whitney U

A

non parametric
similar to unpaired t test
H0: difference between the medians is 0

opposite to most other tests: here a statistic smaller than the critical value is significant, whereas elsewhere a value bigger than the critical value would mean p is lower than the threshold value

21
Q

bootstrapping

A

take repeated samples from the original sample, with replacement

if you do this 1000s of times you can create a CI
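
A minimal sketch of a percentile bootstrap CI for the mean. The percentile method is one of several ways to form the interval, and the function name, seed and data are made up for illustration.

```python
import random
import statistics

def bootstrap_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = sorted(
        statistics.mean(rng.choices(sample, k=len(sample)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

sample = [2, 4, 4, 5, 5, 7, 9, 10]  # made-up example values
print(bootstrap_ci(sample))
```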

22
Q

systematic review

A

the application of scientific strategies that limit bias by the systematic assembly, critical appraisal and synthesis of all relevant studies on a specific topic.

23
Q

Likelihood ratio (+ve)

A

sensitivity / (1 - specificity)

P(test positive | have disease) / P(test positive | don't have disease)
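
A one-line sketch of the positive likelihood ratio. The sensitivity and specificity values are made up.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1 - specificity)

# Made-up test: 90% sensitive, 80% specific
print(round(positive_likelihood_ratio(0.90, 0.80), 2))
```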

24
Q

post test probability

A

Post-test probability = post-test odds / (post test odds+1)

Post-test odds = pre-test odds * LR

Pre-test odds = pre-test probability / (1 - pre-test probability) (for population screening the pre-test probability is the PREVALENCE OF DISEASE)
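
The three steps can be chained as in this sketch. The prevalence and likelihood ratio are made-up values.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the LR, convert back."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (post_test_odds + 1)

# Made-up example: prevalence 10%, LR+ of 4.5
print(round(post_test_probability(0.10, 4.5), 3))
```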

25
Q

ROC curve

axis and uses

A

x: 1 - specificity (false positive rate)

y: sensitivity (true positive rate)

Uses:
- to set a cut-off value for a test result (for continuous diagnostic variables)
- to compare the performance of different tests measuring the same outcome (test validation)

Area under ROC (AUROC): larger = better test
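
A rough sketch of how the curve and its area can be built from a continuous test result: each distinct value is tried as a cut-off ("positive if score >= cut-off") and the area is taken with the trapezium rule. The function names, scores and labels are made up.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per cut-off."""
    positives = sum(labels)               # labels: 1 = diseased, 0 = not diseased
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auroc(points):
    """Area under the ROC curve by the trapezium rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]  # made-up test results
labels = [1, 1, 0, 1, 0, 0]               # made-up true disease status
print(round(auroc(roc_points(scores, labels)), 3))
```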

26
Q

What type of regression analysis should be used to assess the difference in survival time

A

Cox regression

27
Q

Kruskal-Wallis

A

It is a non-parametric test
It is a rank-based test
It is used to test whether two or more independent groups differ.
It is the non-parametric version of one-way independent ANOVA