Definitions from scratch Flashcards

1
Q

Two types of variable

A

Metric variables

Categorical variables

2
Q

Categorical variables can be

A

Nominal: relates to named things i.e. it is NOT numeric
It is categorical because we allocate each bit of data to a category e.g. male or female

Ordinal

3
Q

Nominal data

A

= categorical nominal variable

Nominal: relates to named things
Category: each data point is placed in a category

Properties of nominal data:

  • They do not have units of measurement
  • The ordering of categories is arbitrary i.e. does not matter

Example:
Males: 45
Females: 72

4
Q

Properties of nominal data

A

Categorical nominal variable

  • They do not have units
  • The ordering of categories is arbitrary

Example:
Males: 43
Females: 52

OR

Females: 52
Males: 43

5
Q

Ordinal data

A

=categorical ordinal data

Ordinal data is categorical but it can be ordered in a meaningful way i.e. smallest to largest

e.g. Glasgow coma scale
If person A has a GCS of 5 and person B has a GCS of 10, we can conclude that person A’s consciousness is lower, BUT we CANNOT conclude by how much, i.e. we CANNOT say half as much

The difference between adjacent scores is not constant

The seemingly numeric values are NOT numbers, but labels

6
Q

Properties of ordinal data

A

Categorical ordinal data

  • Does not have units
  • CAN be ordered in a meaningful way
  • Nearly always integers
  • Assessed rather than measured

NOTE: they do not have a numeric value. They seemingly have numeric values, but these are actually labels, e.g. a GCS of 3 is saying that the person fits into a category called GCS 3

7
Q

What you shouldn’t do with ordinal data

A

YOU SHOULD NOT TREAT THEM AS NUMBERS

i.e. for ordinal data you should not add, divide, or average it

Ordinal data = number labels

8
Q

Metric variables can be

A

Discrete: values occur in discrete intervals e.g. 1, 2, 3, 4, 5

  • comes from counting i.e. number of operations
  • difference between each count is constant (in comparison to ordinal data)
  • 4 operations is twice as many as 2 operations

Continuous:

9
Q

Properties of discrete data

A

Discrete metric data

  • Has units
  • Discrete variables come from counting, meaning the values are real numbers (not labels) and are integers
10
Q

Continuous data

A

Continuous metric data

  • Values form a continuum
  • Real numbers
  • Has units
11
Q

Frequency table

A

Used to illustrate descriptive statistics

12
Q

Frequency distribution

A

Illustrates the number of events in each category

13
Q

Relative frequency

A

= percentages
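
As a minimal sketch of turning counts into relative frequencies, the snippet below reuses the male/female counts from the nominal data card. The function name `relative_frequencies` is an assumption made for illustration.

```python
def relative_frequencies(counts):
    """Convert category counts into percentages of the total."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


# Counts taken from the nominal data example (Males: 45, Females: 72).
counts = {"Males": 45, "Females": 72}
percentages = relative_frequencies(counts)

for category, pct in percentages.items():
    print(f"{category}: {pct:.1f}%")
```

The percentages always sum to 100, which is a quick check that no category was missed.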

14
Q

Contingency table

A

Cross tabulations

Illustrate association between two variables in a single population

Has two columns for the given variable in the row
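
A cross tabulation can be sketched by counting how often each pair of categories occurs. The records below (sex and smoking status) are hypothetical data invented purely for illustration.

```python
from collections import Counter

# Hypothetical records: (sex, smoking status) for each person in one sample.
records = [
    ("Male", "Smoker"), ("Male", "Non-smoker"), ("Female", "Smoker"),
    ("Female", "Non-smoker"), ("Female", "Non-smoker"), ("Male", "Smoker"),
]

# Each cell of the contingency table is the count of one (row, column) pair.
table = Counter(records)
rows = sorted({sex for sex, _ in records})
cols = sorted({status for _, status in records})

print("".ljust(8) + "".join(c.ljust(12) for c in cols))
for r in rows:
    print(r.ljust(8) + "".join(str(table[(r, c)]).ljust(12) for c in cols))
```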

15
Q

Ranking data

A

Allows assessment of non-parametric data

Order data into size

Starting with the largest value, give this a rank of 1
Rank the next value as 2

Equal values are tied, and each takes the average of the ranks the tied values would otherwise occupy, e.g. 7 8 5 5 5 3 1

8: 1
7: 2
5: =4
5: =4
5: =4 (ranks 3, 4, 5; average = 4)
3: 6
1: 7
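
The ranking rule above can be sketched in a few lines. This follows the card's convention (largest value gets rank 1, ties share the average rank); the function name `rank_descending` is an assumption for illustration.

```python
def rank_descending(values):
    """Rank values so the largest is rank 1; tied values share the average rank."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Advance j past every value tied with ordered[i].
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # The tied values occupy positions i+1 .. j; average those positions.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return [ranks[v] for v in values]


data = [7, 8, 5, 5, 5, 3, 1]
ranks = rank_descending(data)
for value, rank in zip(data, ranks):
    print(value, rank)
```

For the example data, the three 5s each receive rank 4, matching the worked example.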

16
Q

Ogive

A

Pronounced ojive

Cumulative frequency curve with continuous metric data

Curved (no step) chart

17
Q

Measures of shape (skew)

A

Skewness:
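-some skewness coefficients (e.g. the quartile-based Bowley coefficient) are defined from -1 to +1, with 0 indicating symmetry; the moment-based coefficient is unbounded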

Kurtosis:

18
Q

Left skew

A

= negative skew

Lots of large values

Negative –> peak is further away from y-axis

19
Q

Right skew

A

=positive skew

Lots of small values

“Right skew, close to you”

20
Q

Distributions

A

Symmetric: classic one humped distribution

Bimodal: two peaks

Multimodal: multiple peaks

21
Q

Kurtosis

A

Measure of distribution

Distributions with large kurtosis exhibit tail data exceeding the tails of the normal distribution (e.g., five or more standard deviations from the mean).

Skewness differentiates extreme values in one versus the other tail, kurtosis measures extreme values in either tail.

If you hold the variance the same and increase the kurtosis, more of the distribution sits in the tails (typically alongside a sharper central peak); kurtosis reflects tail weight, not overall spread

22
Q

Kurtosis value of normal distribution

A

=3

(excess kurtosis value = 0 i.e. the excess subtracts 3 from the calculation)

(uniform distribution = 1.8 i.e. excess kurtosis = -1.2)

23
Q

Mode

A

Useful in categorical data

Useless in continuous data when no two values likely to be the same

24
Q

Median

A

CAN BE USED FOR ORDINAL AND CONTINUOUS DATA

Discards a lot of information

Not as affected by skew vs mean

Not as affected by outliers vs mean

Therefore median is a stable measure

25
Q

Mean

A

Uses all the data - each value is included

Therefore subjected to effect from outliers and skew

Cannot be performed on ordinal data
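
To illustrate why the mean is more affected by outliers than the median, here is a small sketch using Python's standard `statistics` module. The data values are hypothetical and chosen only to show the effect.

```python
from statistics import mean, median

# Hypothetical data: the second list adds a single extreme value.
values = [2, 3, 3, 4, 5]
with_outlier = values + [50]

print("mean:  ", mean(values), "->", round(mean(with_outlier), 2))
print("median:", median(values), "->", median(with_outlier))
```

The single outlier more than triples the mean, while the median barely moves.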

26
Q

Percentiles

A

Values that divide a data set into 100 equal-sized groups

To find the position of a percentile in the ordered data, multiply the percentage (as a decimal) by (n+1)
Where n is equal to number of data points
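
The (n+1) rule can be sketched as below. The function names are assumptions, the data are hypothetical, and interpolating linearly when the position falls between two data points is one common convention rather than something the card specifies.

```python
def percentile_position(percent, n):
    """Position (1-based) of a percentile in ordered data, using p * (n + 1)."""
    return (percent / 100) * (n + 1)


def percentile(values, percent):
    """Value at the given percentile; interpolates between neighbours (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    # Clamp so extreme percentiles stay within the data.
    pos = min(max(percentile_position(percent, n), 1), n)
    lower = int(pos)
    frac = pos - lower
    if lower == n:
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


# Hypothetical data with n = 11, so the 25th percentile sits at position 0.25 * 12 = 3.
data = [3, 5, 7, 8, 9, 11, 13, 15, 17, 20, 24]
print(percentile_position(25, len(data)), percentile(data, 25))
print(percentile_position(50, len(data)), percentile(data, 50))
```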

27
Q

Properties of using range

A

Lowest to highest value

Not affected by skew

Sensitive to outliers, which can make the range misleading

28
Q

Interquartile range

A

Removes 25% from each end

Reduces effect of outliers

Affected by skewed distributions

29
Q

Limitations of interquartile range

A

Discards 50% of the data!

30
Q

Definition of standard deviation

A

Average distance of the data values from their collective mean

Uses each data point, i.e. uses all the data (unlike IQR)
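
A minimal sketch of the calculation follows. Strictly, the standard deviation is the square root of the average squared distance from the mean; the version below divides by n - 1 (the sample form), which is an assumption about which variant is wanted. The data are hypothetical.

```python
from math import sqrt
from statistics import stdev


def sample_sd(values):
    """Sample standard deviation: root of summed squared deviations over (n - 1)."""
    n = len(values)
    m = sum(values) / n
    squared_deviations = sum((x - m) ** 2 for x in values)
    return sqrt(squared_deviations / (n - 1))


# Hypothetical data; every value contributes to the result.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(sample_sd(data), 3))
print(round(stdev(data), 3))  # standard-library equivalent, for comparison
```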

31
Q

Measuring spread with ordinal data

A

Range and IQR

Not standard deviation, which can only be used on continuous data

32
Q

Why not use median and SD

A

If you have continuous data, SD is the measure of choice.

However, if you use a median value, that suggests you have skewed data. Therefore, you shouldn’t use SD.

33
Q

Mean +/- 1 SD

A

68% of values in range

*for normally distributed data

34
Q

Mean +/- 2 SD

A

95% of values in range

*for normally distributed data

35
Q

Mean +/- 3 SD

A

99.7% of values in range

*for normally distributed data
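
The three proportions can be checked from the normal distribution itself. This sketch uses the standard identity that the proportion within k SDs of the mean is erf(k / sqrt(2)); the function name is an assumption.

```python
from math import erf, sqrt


def proportion_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"mean +/- {k} SD: {100 * proportion_within(k):.1f}%")
```

The rounded figures 68%, 95% and 99.7% are the ones usually memorised.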

36
Q

Testing for normal distribution

A

Shapiro-Wilk test

  • if less than 2000 values
  • provides p-value with null hypothesis that the data are normally distributed

Kolmogorov-Smirnov test

  • > 2000 values
  • provides p-value with null hypothesis that the data are normally distributed
37
Q

Transforming data

A

Make it normally distributed

log (to the base 10) = most common

square-root

1 over value
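
As a rough sketch, the three transformations can be applied like this. The values are hypothetical; note that log and reciprocal transforms need strictly positive data, and the square root needs non-negative data.

```python
from math import log10, sqrt

# Hypothetical right-skewed values (all strictly positive).
values = [1, 10, 100, 1000]

log_transformed = [log10(v) for v in values]
sqrt_transformed = [sqrt(v) for v in values]
reciprocal_transformed = [1 / v for v in values]

print(log_transformed)
print(sqrt_transformed)
print(reciprocal_transformed)
```

The log transform pulls in the large values most strongly, which is why it is a common first choice for right-skewed data.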

38
Q

Incidence rate

A

Is actually the crude incidence rate

Number of new cases of a disease or event in a defined population over a given time period

= number of new cases / number at risk
(same time period)

39
Q

Incidence rate ratio

A

ratio of two incidence rates
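
A small sketch of the incidence rate and the incidence rate ratio, using hypothetical figures for two groups followed over the same time period. The function names are assumptions for illustration.

```python
def incidence_rate(new_cases, number_at_risk):
    """Crude incidence rate: new cases divided by the number at risk."""
    return new_cases / number_at_risk


def incidence_rate_ratio(rate_a, rate_b):
    """Ratio of two incidence rates measured over the same time period."""
    return rate_a / rate_b


# Hypothetical figures: 30 new cases among 6000 at risk vs 10 among 5000.
rate_a = incidence_rate(30, 6000)
rate_b = incidence_rate(10, 5000)

print(f"Group A: {1000 * rate_a:.1f} per 1000")
print(f"Group B: {1000 * rate_b:.1f} per 1000")
print(f"Incidence rate ratio: {incidence_rate_ratio(rate_a, rate_b):.2f}")
```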

40
Q

Prevalence

A

Number of cases in a given population at a given point in time

41
Q

Crude mortality rate

A

= number of deaths over a period of time (usually 1 year) divided by the population at the mid-point of that period

MULTIPLY by 1000

gives crude mortality per 1000 per year
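
The calculation can be sketched as below. The death count and mid-year population are hypothetical figures, and the function name is an assumption.

```python
def crude_mortality_per_1000(deaths, mid_period_population):
    """Deaths in the period per 1000 of the mid-period population."""
    return 1000 * deaths / mid_period_population


# Hypothetical: 850 deaths in one year, mid-year population of 100,000.
rate = crude_mortality_per_1000(850, 100_000)
print(f"{rate} per 1000 per year")
```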

42
Q

Case fatality rate

A

Number of deaths from a disease in a given time period divided by the total number with the disease over that time period

43
Q

Standardised mortality rate

A

Crude mortality rate divided by overall standardised mortality rate

44
Q

Properties of a confounding variable:

A

A confounding variable:

  • is associated (causally or not) with the exposure
  • is causally related to the outcome
  • must not be part of the exposure-outcome pathway
45
Q

Positive confounding

A

Leads to effect of exposure being inflated

46
Q

Negative confounding

A

Leads to effect of exposure being reduced

47
Q

Confounding by indication

A

Occurs when the clinical indication for selecting a particular treatment (eg, severity of the illness) also affects the outcome.

Indication for exposure leads to disease outcome

Not exposure itself

48
Q

Residual confounding

A

Residual confounding is the distortion that remains after controlling for confounding in the design and/or analysis of a study. There are three causes of residual confounding:

  1. There were additional confounding factors that were not considered, or there was no attempt to adjust for them, because data on these factors was not collected.
  2. Control of confounding was not tight enough. For example, a study of the association between physical activity and heart disease might control for confounding by age by a) restricting the study population to subjects between the ages of 30-80 or b) matching subjects by age within 20 year categories. In either event there might be persistent differences in age among the groups being compared. Residual differences in confounding might also occur in a randomized clinical trial if the sample size was small. In a stratified analysis or in a regression analysis there could be residual confounding because data on the confounding variable was not precise enough, e.g., age was simply classified as “young” or “old”.
  3. There were many errors in the classification of subjects with respect to confounding variables.

49
Q

Controlling for confounding at design stage

A
  1. Restriction
    - exclude all those with the confounding variable
    - limits generalisation of evidence e.g. if you exclude all smokers then results are unlikely to generalise to the wider population
  2. Matching
    - choice of method in case-control studies
    e.g. frequency matching (same proportions)
    e.g. propensity score matching
  3. Randomisation
    - choice of method in RCTs
    - controls for known and unknown confounding
50
Q

Controlling for confounding at analysis

A
  1. Stratification
    - divides into strata, with and without exposure
    - essentially restriction but after the event
  2. Adjustment
    - regression
51
Q

Reverse causality

A

The exposure-disease process is reversed; in other words, the outcome causes the exposure.

e.g. lower employment status is thought to cause depression.

It may well be that depression causes lower employment status.

52
Q

Descriptive cross-sectional studies

A

Do not infer any causality and usually measure only one variable (e.g. prevalence), but can measure multiple

  • Generally not subject to confounding if only measure prevalence
  • If measures multiple things, will need to adjust for potential confounding
53
Q

Analytical cross-sectional studies

A

Attempts to assess potential links between two or more variables at a given time point

  • Does not infer causality
  • Need to be adjusted for confounding variables
54
Q

Cross-sectional studies

A
  • Take one set of measurements from each participant at a SINGLE point in time
  • Used to investigate associations between variables but NOT causality or direction
  • Not useful if condition is rare
  • If used to assess opinions or attitudes, referred to as surveys
55
Q

Cohort studies

A

Pros

  • Main purpose is to identify if exposures or risk factors cause a certain disease
  • Several outcomes can be studied for single exposure
  • Temporal relationship can be established, which therefore adds to the evidence for causality
  • Suited for rare exposures
  • Less subject to bias and confounding than case-control

Cons

  • Sampling bias
  • Not suited for rare diseases
  • Long follow-up: leads to attrition and bias
  • Recall bias in retrospective studies
  • Data quality in retrospective studies
56
Q

Problems with case-control studies

A
  • Recall bias
  • Suitable cases can be difficult to find
  • Difficult to match patients for each variable
  • Sampling bias of cases +/- controls
  • Definition of a case e.g. GOLD 1 COPD is unlikely to clarify much
57
Q

Ecological studies

A

Make large-scale comparisons between two groups of people

58
Q

Statistical inference

A

Data from the sample will inform conclusions about the target population

Using sample statistics, we are inferring about the population

59
Q

Sample statistics

A

Variable measured in a sample
=sample statistics

This is used to inform inferences regarding population parameters

60
Q

Sample error

A

Deviation from true value of a parameter in sampled population

-usually unknown

61
Q

Rules of probability

A

Chance of an event occurring lies between 0 and 1

1 is absolute certainty of an event e.g. everyone will die some day

0 is an impossible outcome e.g. rolling an 8 on a die labelled 1-6

If an event is equally likely to happen as to not happen, the probability would be 0.5

If p is the probability of an event happening, the probability of the event not happening is 1 - p

62
Q

Proportional frequency

A

Used to calculate probability in clinical settings when outcomes do not all have an equal chance of occurring
i.e. any clinical setting

Proportional frequency states that the probability of an event occurring is equal to the proportion of times that outcome would have occurred if we repeated the experiment a large number of times

63
Q

Techniques for randomisation

A

Simple randomisation

Block randomisation
-ensures at any given point there are roughly equal numbers in each group

Stratification
-ensures balanced strata of variables across each group
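
Block randomisation can be sketched as follows. This is a minimal illustration, not a trial-ready allocation tool: the function name, the arm labels "A" and "B", the block size of 4 and the `seed` parameter are all assumptions.

```python
import random


def block_randomise(n_participants, block_size=4, arms=("A", "B"), seed=None):
    """Allocate participants in shuffled blocks so arms stay balanced throughout."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    allocations = []
    while len(allocations) < n_participants:
        # Each block contains every arm equally often, in random order.
        block = list(arms) * per_arm
        rng.shuffle(block)
        allocations.extend(block)
    return allocations[:n_participants]


allocations = block_randomise(20, block_size=4, seed=1)
print(allocations)
print({arm: allocations.count(arm) for arm in ("A", "B")})
```

Because every completed block is balanced, the group sizes never differ by more than half a block at any point during recruitment.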

64
Q

Reducing placebo or response bias

A

Blinding of participant

65
Q

Problems with cross-over trials

A
  • Participants may have undergone change between treatment 1 and treatment 2
  • Does not work for treatments that require a long time to take effect
  • Does not work in self-resolving or acute illness that responds to therapy immediately
  • “Carry over” effect despite washout periods
66
Q

Hawthorne effect

A

Change in behaviour after knowledge of being observed

Some trials do not recruit controls if data collection will not differ as Hawthorne effect changes outcomes

67
Q

Intention to treat

A

The process of analysing the data as if participants are still in the original group allocation despite loss or changeover of participants

  • Maintains baseline characteristics
  • Prevents attrition bias
  • Reflects real-world practice
  • Keeps sample size and power the same

Cons

  • requires imputation
  • can sometimes underestimate effect size
68
Q

Per protocol analysis

A

Analysis performed as per treatments received by participant

  • protocol deviation has taken place
  • balanced baseline characteristics lost
  • attrition bias now in action
  • subject to confounding
  • loss of power
  • likely to overestimate effect size e.g. those most unwell are least likely to tolerate the side effects of new drug, hence only moderate disease is analysed in treatment group vs full-spectrum of severity in control group
69
Q

Cluster sampling

A

Overcomes need for sampling frame

Common sampling technique in randomised-controlled trials

Units represent GP surgeries, hospitals, schools, clinics etc.
Sampling units are a likely place to find spectrum of participants
However, not a sampling frame - do not include everyone eligible and hence sampling and then selection bias will be introduced

Example: 75 GP surgeries identified as eligible sampling units
Randomly select 25 as your cluster sample
-People who aren’t registered with a GP have no chance of being included, hence not equal sampling probability

70
Q

Probability density function

A

Used when calculating probabilities for continuous variables

The area under the pdf between two values gives the probability that a continuous random variable will lie between those values

THIS IS BECAUSE: a continuous variable has an infinite number of possible outcomes, hence the probability of any single given outcome = 0

71
Q

Absolute risk

A

Probability of an outcome occurring in a population with exposure

72
Q

Relative risk

A

Risk exposed divided by risk unexposed

Same as risk ratio (decimal)

73
Q

Risk ratios

A

Risk exposed / risk unexposed

Decimal

Can over-inflate risk

74
Q

Relative risk reduction

A

risk reduction / risk in unexposed
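
The risk measures on the last few cards can be tied together in one short sketch. The event counts are hypothetical, the function names are assumptions, and "risk reduction" is taken here to mean risk in the unexposed minus risk in the exposed.

```python
def absolute_risk(events, total):
    """Probability of the outcome in a group: events divided by group size."""
    return events / total


def relative_risk(risk_exposed, risk_unexposed):
    """Risk ratio: risk in the exposed divided by risk in the unexposed."""
    return risk_exposed / risk_unexposed


def relative_risk_reduction(risk_exposed, risk_unexposed):
    """Risk reduction (unexposed minus exposed) divided by risk in the unexposed."""
    return (risk_unexposed - risk_exposed) / risk_unexposed


# Hypothetical counts: 10 events in 100 exposed, 20 events in 100 unexposed.
risk_exposed = absolute_risk(10, 100)
risk_unexposed = absolute_risk(20, 100)

print(f"Relative risk: {relative_risk(risk_exposed, risk_unexposed):.2f}")
print(f"Relative risk reduction: {relative_risk_reduction(risk_exposed, risk_unexposed):.2f}")
```

With these figures the relative risk reduction equals 1 minus the relative risk, which holds in general.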