Week 1 Flashcards

1
Q

Capitalization
What is P, p
X, x
N, n

A

In general, capital letters refer to population attributes (i.e., parameters); and lower-case letters refer to sample attributes (i.e., statistics). For example,

P refers to a population proportion; and p, to a sample proportion.
X refers to a set of population elements; and x, to a set of sample elements.
N refers to population size; and n, to sample size.

2
Q

μ refers to
x̄ refers to
σ refers to
s refers to
σ² refers to

A

Greek vs. Roman Letters
Like capital letters, Greek letters refer to population attributes. Their sample counterparts, however, are usually Roman letters. For example,

μ refers to a population mean; and x̄, to a sample mean.
σ refers to the standard deviation of a population; and s, to the standard deviation of a sample.
σ² refers to the variance of a population.

3
Q

Population parameters
μ
σ
σ²
P
Q
ρ
N

A

μ refers to a population mean.
σ refers to the standard deviation of a population.
σ² refers to the variance of a population.
P refers to the proportion of population elements that have a particular attribute.
Q refers to the proportion of population elements that do not have a particular attribute, so Q = 1 – P.
ρ is the population correlation coefficient, based on all of the elements from a population.
N is the number of elements in a population.

4
Q

Sample Statistics
x̄
s
s²
p
q
r
n

A

x̄ refers to a sample mean.
s refers to the standard deviation of a sample.
s² refers to the variance of a sample.
p refers to the proportion of sample elements that have a particular attribute.
q refers to the proportion of sample elements that do not have a particular attribute, so q = 1 – p.
r is the sample correlation coefficient, based on all of the elements from a sample.
n is the number of elements in a sample.

5
Q

Simple Linear Regression

A

β₀ is the intercept constant in a population regression line.
β₁ is the regression coefficient (i.e., slope) in a population regression line.
R² refers to the coefficient of determination.
b₀ is the intercept constant in a sample regression line.
b₁ refers to the regression coefficient in a sample regression line (i.e., the slope).
s_b₁ refers to the standard error of the slope of a regression line.
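
As a rough illustration of how the sample quantities on this card relate to each other, the sketch below fits a least-squares line to five made-up (x, y) pairs and returns the sample intercept, the sample slope, the coefficient of determination and the standard error of the slope. The function name `fit_line`, the variable names and the data are assumptions for illustration only; the formulas are the standard ordinary least-squares ones.

```python
import math


def fit_line(x, y):
    """Fit y = b0 + b1*x by least squares; return (b0, b1, r2, sb1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of x and sum of cross-products around the means
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # sample slope
    b0 = y_bar - b1 * x_bar   # sample intercept
    # Residual and total sums of squares
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - sse / sst        # coefficient of determination
    # Standard error of the slope, using n - 2 degrees of freedom
    sb1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    return b0, b1, r2, sb1


# Made-up example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1, r2, sb1 = fit_line(x, y)
print(b0, b1, r2, sb1)
```

The population values (the Greek-letter intercept and slope) are not computed here, because they describe the whole population; the sample values are estimates of them.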

6
Q

Probability

A

P(A) refers to the probability that event A will occur.
P(A|B) refers to the conditional probability that event A occurs, given that event B has occurred.
P(A’) refers to the probability of the complement of event A.
P(A ∩ B) refers to the probability of the intersection of events A and B.
P(A ∪ B) refers to the probability of the union of events A and B.
E(X) refers to the expected value of random variable X.
b(x; n, P) refers to binomial probability.
b*(x; n, P) refers to negative binomial probability.
g(x; P) refers to geometric probability.
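
The notation above can be tried out on a small example. The sketch below uses one roll of a fair six-sided die, with two made-up events A (an even number) and B (a number greater than 3). The helper names `prob`, `binomial` and `geometric` are hypothetical, and the geometric formula assumes that x counts the trial on which the first success occurs.

```python
from fractions import Fraction
from math import comb

outcomes = set(range(1, 7))   # sample space: one roll of a fair die
A = {2, 4, 6}                 # event A: the roll is even
B = {4, 5, 6}                 # event B: the roll is greater than 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))


p_a = prob(A)                         # P(A)
p_not_a = 1 - p_a                     # P(A'), the complement of A
p_a_and_b = prob(A & B)               # P(A ∩ B)
p_a_or_b = prob(A | B)                # P(A ∪ B)
p_a_given_b = p_a_and_b / prob(B)     # P(A|B)
expected_value = sum(Fraction(x, 6) for x in outcomes)   # E(X)


def binomial(x, n, p):
    """b(x; n, P): probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def geometric(x, p):
    """g(x; P): probability that the first success is on trial x (assumed convention)."""
    return (1 - p) ** (x - 1) * p


print(p_a, p_not_a, p_a_and_b, p_a_or_b, p_a_given_b, expected_value)
print(binomial(2, 4, Fraction(1, 2)), geometric(3, Fraction(1, 2)))
```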

7
Q

Describe the role biostatistics plays in the discipline of public health and medicine

A

Statistics is all about converting data into useful information. This process therefore includes collecting data, summarising data and interpreting data.
Goal: To study the population of interest.
The ‘Big Picture’ in statistics is to make inferences about a population from a given representative sample.
This summary outlines the fundamental steps in conducting research using biostatistics:

Step 1: Begin with defining the population, but since studying everyone is often impractical, select a representative sample. Choices made here, like hypothesis and sample selection, greatly influence data analysis later.

Step 2: Perform exploratory data analysis, using descriptive statistics to summarize the data through graphs, tables, and numerical measures.

Step 3: Assess how the sample differs from the population to ensure validity. Consider randomness and bias in the sample selection process.

Step 4: In the final step of inference, combine descriptive statistics and probability to draw conclusions about the entire population.

8
Q

Demonstrate understanding of the key ethical considerations relevant to study design, data collection, analysis and interpretation.

A

Ethics committees ensure that research methodology is sound and likely to yield meaningful results, in addition to protecting participant safety. Studies should be designed carefully to respect participants’ time and minimise risks.

Questions that might arise in ethical review include:

How are data being collected?
Are there special requirements for informed consent or confidentiality for any participants whose information is being used in the study?
How will results be disseminated to the public?
These can be addressed in the design stage by ensuring:

the research question is appropriate,
a valid and rigorous methodology is chosen that is capable of answering the research question,
any participants are made aware of how their information will be used (e.g. will it be identifiable or will analysis only involve anonymised, aggregated data?),
samples are chosen for statistical validity and are representative of the population of interest,
all results will be accurately reported, not just those that support the researchers’ hypothesis,
funding sources and collaborator details will be disclosed as appropriate.

9
Q

What is data?

A

Data refers to pieces of information about individuals which are organised into variables.

Data is obtained by collecting information from a group of patients participating in a study.

A set of data which can be identified during an experiment, research project or scenario is a dataset.

We can organise our data into observations (representing individuals) and variables:

Observations

Information from a patient or experiment is called an observation.
An observation may represent single or multiple pieces of information about a patient.
Examples include: age, gender, height, weight, blood pressure level and cholesterol level of a single patient.

Variables

A variable is any characteristic that can have different values.
A variable may have different values when observed at different times for the same patient.
A variable may have different values for different patients.
If we consider the blood pressure of patients in a study, it is likely that the blood pressure for each patient will be different, thus blood pressure is a variable.

10
Q

What is a random variable?

A

Random variables
A variable is any observation that is different from person to person but can be predicted, e.g. gender, age, height.
A random variable is a variable that arises due to chance and cannot be predicted, e.g. blood sugar level of an individual.
A variable is also a random variable if it was obtained from a random sample.

Random variables are a subset of variables.

11
Q

Types of variables

A

Categorical vs Numerical variables

Variables can be classified as either categorical (qualitative) or numerical (quantitative).

Numerical variables represent a measurable quantity, where observations are counts or measurements. For example: the number of people that have graduated from Monash University – a measurable attribute.
Categorical variables represent non-numerical labels, where observations fall into separate distinct categories. For example: the blood type of an individual is either A, B, AB, or O – not a measurable attribute.

12
Q

Discrete vs Continuous variables

A

A numerical (quantitative) variable can be classified as discrete or continuous.

Discrete: measurements where the possible values are clearly separated from each other or can only take values equal to whole numbers.
For example: the number of people involved in a car crash.
Continuous: measurements that can take fraction/decimal values. There is an infinite possibility of values an observation can take.
For example: blood pressure of an individual.

13
Q

Ordinal vs Nominal variables

A

A categorical (qualitative) variable can be classified into ordinal or nominal.

Ordinal: Where observations can be ordered/ranked according to some criteria. For example: an ICU patient may classify their degree of pain as: no pain, mild pain, moderate pain and severe pain.
Nominal: When observations are classified into separate categories that have no logical ranking the data is said to be nominal. For example: blood types (A, B, AB, or O) or gender (Male, Female, Non-binary, etc).
Hint: If there are only two categories (binary data) then the data is always classified as nominal.
Note: It is common for categorical variables to be coded as 0 or 1, e.g. Yes = 1, No = 0.

15
Q

Population

A

Population
A population is the largest collection of entities for which we have an interest at a particular time.
The term target population is often used in scientific literature to define the population of interest.
Example: the incidence of ovarian cancer in women aged 25 to 55 years in Australia → the population in the example is all women in this age group in Australia.

16
Q

Sample

A

Sample
Defined simply as a representative part of a population and consists of one or more observations from this population.
The sample is a section of the population that represents the population as much as possible.
Consider the diabetic population in Victoria. If we collect fasting blood glucose levels of only a fraction of these patients (e.g. 5 patients), we have only a part of our population and thus we have a sample.

17
Q

What is the mean?

A

Mean
Obtained by adding all the values in a sample and dividing by the number of values that are added.

Let n be the number of observations in a sample, x₁, x₂, …, xₙ the values of the observations and x̄ the sample mean. Then x̄ = (x₁ + x₂ + … + xₙ) / n.
The mean is affected by extreme values in a dataset because it uses information from all patients, so it is most appropriate for symmetric data.
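
A minimal sketch of the calculation, using made-up patient weights in kilograms, shows how a single extreme value pulls the mean upwards. The function name `mean` and the data are assumptions for illustration.

```python
def mean(values):
    """Add all the values and divide by how many there are."""
    return sum(values) / len(values)


weights = [60, 62, 65, 68, 70]      # made-up sample, roughly symmetric
with_outlier = weights + [150]      # the same sample plus one extreme value

mean_without = mean(weights)        # 65.0
mean_with = mean(with_outlier)      # about 79.2, pulled up by the outlier
print(mean_without, mean_with)
```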

18
Q

Median

A

A value which divides the data set into two equal parts.
If the number of values is odd, the median will be the middle value after all values have been arranged in ascending order of their magnitude.
If the number of values is even, the median is taken to be the average of the two middle values after all values have been arranged in ascending order of their magnitude.
The median is not affected by the extreme values in a data set because it depends on only the middle observation(s), hence commonly used for asymmetric data.
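
The odd and even cases can be sketched as follows. The function name `median` and the numbers are made up for illustration; the last call shows that replacing the largest value with an extreme one leaves the median unchanged.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average of two


odd_case = median([3, 1, 7, 5, 9])        # 5
even_case = median([3, 1, 7, 5])          # 4.0
extreme_case = median([3, 1, 7, 5, 900])  # still 5
print(odd_case, even_case, extreme_case)
```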

19
Q

Mode

A

The value which occurs most frequently in a data set.
If all the values are different there is no mode.
A distribution with one mode is known as uni-modal, with two modes is bimodal and similarly having more than two modes is called multi-modal.
The mode can be useful for both categorical data and numerical data.
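
One possible way to find the mode or modes, assuming a non-empty list of values, is to count each value and keep those with the highest count. The function name `modes` and the example data are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list if all values differ."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                  # every value occurs once: no mode
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 2, 2, 3]))           # one mode (uni-modal)
print(modes([1, 1, 2, 2, 3]))        # two modes (bimodal)
print(modes([1, 2, 3]))              # no mode
print(modes(["A", "O", "O", "B"]))   # works for categorical data too
```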

20
Q

Symmetrical Data

A

For symmetric data the mean, median, mode are approximately the same.
If we are comparing summary statistics (averages) for multiple groups of patients where all groups are symmetrical, the mean should be reported for each group.
For example: the weight of cardiac surgery patients presented in the figure below is approximately symmetric. Hence the mean, median and mode of weight are approximately the same.

21
Q

Positively skewed (asymmetrical data)

A

For positively skewed data the mean ≥ median ≥ mode.
If we are comparing summary statistics (averages) for multiple groups of patients where one or more groups is/are positively skewed, the median should be reported for each group.
For example: Postoperative length of hospital stay (LOS) of cardiac surgery patients presented in the figure below is positively skewed. Hence the mean LOS is higher than the median LOS and the median LOS is higher than the mode LOS.
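
The ordering of the three averages can be checked on a small made-up set of length-of-stay values (in days) with one long stay in the right tail. The data are invented for illustration and are not taken from the cardiac surgery example; the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Made-up lengths of stay in days; the single long stay creates a right tail
los_days = [4, 5, 5, 5, 6, 6, 7, 8, 10, 30]

mean_los = mean(los_days)       # pulled upwards by the long stay
median_los = median(los_days)   # middle of the sorted values
mode_los = mode(los_days)       # most frequent value

print(mean_los, median_los, mode_los)
```

With these values the mean is the largest of the three and the mode the smallest, which matches the pattern described for positively skewed data.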

22
Q

Negatively skewed (asymmetrical data)

A

For negatively skewed data the mean ≤ median ≤ mode.
If we are comparing summary statistics (averages) for multiple groups of patients where one or more groups is/are negatively skewed, the median should be reported for each group.
For example: Age of cardiac surgery patients presented in the figure below is negatively skewed. Hence the patients’ mean age is lower than the median age and the median age is lower than the modal age.

23
Q

Q1, Q2, Q3

A

For a set of observations, the nth percentile, say Q, is the value such that n percent of the smallest observations are less than Q and (100 − n) percent of the largest observations are greater than Q.

The 25th percentile is the value where 25% of observations fall below it and 75% of observations are above it.
The 75th percentile is the value where 75% of observations fall below it and 25% of observations are above it.
The median is the 50th percentile, the value in the middle (median) or the half way point.
The 25th, 50th and 75th percentiles are often referred to as the first, second and third quartiles respectively. The quartiles are respectively denoted by Q1 , Q2 and Q3.
Excel formula:
=QUARTILE.INC(array,quartile)
This includes the minimum & maximum values when calculating quartiles.
This is the formula to use.
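
For readers working outside Excel, a rough Python counterpart is sketched below. It assumes that the "inclusive" method of `statistics.quantiles` (Python 3.8 or later), which treats the minimum and maximum as the 0th and 100th percentiles and interpolates linearly between data points, corresponds to QUARTILE.INC; the data are made up.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8]   # made-up data

# n=4 splits the data into quarters and returns the three cut points
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

print(q1, q2, q3)   # 2.75 4.5 6.25
```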

24
Q

lower dispersion =

A

A lower dispersion shows a higher precision and vice versa. The three well known methods of measuring dispersion are the range, inter-quartile range and standard deviation.

25
Q

what is range?

A

The range of a set of observations is the difference between the largest and the smallest value in the dataset.

It is the simplest and least commonly used measure of dispersion.

Range = Largest Observation - Smallest Observation
In Excel, you can calculate the range by subtracting the minimum value [=MIN(range)] from the maximum value [=MAX(range)].
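
The same calculation as a short Python sketch, with a hypothetical function name `value_range` and made-up systolic blood pressure readings:

```python
def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


readings = [120, 135, 110, 150, 128]   # made-up values
bp_range = value_range(readings)       # 150 - 110 = 40
print(bp_range)
```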

26
Q

Inter-quartile Range

A

IQR reflects the variability among the middle 50 percent of the observations in a data set.
It is the difference between the third (75th percentile) and the first (25th percentile) quartiles.

Inter-quartile Range (IQR) = Third Quartile ( Q3 ) - First Quartile ( Q1 )
The advantage of the IQR is that it calculates the spread among the middle 50% of the observations only and thus is not affected by the extreme values. Like median, the IQR is commonly used for asymmetric data.

In Excel, you can calculate the IQR by subtracting Q1 [=QUARTILE.INC(range,1)] from Q3 [=QUARTILE.INC(range,3)].
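
A small sketch of the IQR in Python, assuming the "inclusive" method of `statistics.quantiles` as the counterpart of QUARTILE.INC. The function name `iqr` and the data are made up; the second data set replaces the largest value with an extreme one to show that the IQR does not change.

```python
from statistics import quantiles


def iqr(values):
    """Third quartile minus first quartile (inclusive method)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


plain = [1, 2, 3, 4, 5, 6, 7, 8]
extreme = [1, 2, 3, 4, 5, 6, 7, 80]   # same data, one extreme value

iqr_plain = iqr(plain)       # 3.5
iqr_extreme = iqr(extreme)   # 3.5, unaffected by the extreme value
print(iqr_plain, iqr_extreme)
```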

27
Q

Function of graph

A

Graphs are good for comparisons, identifying relationships, describing distributions and visualising compositions.

28
Q

Frequency Tables

A

A frequency distribution table gives the number of observations at different values of the variable; frequency is the number of times each observation occurs (repeats) in a data set.
If two values in a sample share the same highest frequency, the data is known as bimodal.
Data with more than two modes is known as multimodal.
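
A frequency table can be built by counting how often each value occurs. The sketch below uses a made-up list of blood types and Python's `collections.Counter`; the variable names are assumptions for illustration.

```python
from collections import Counter

blood_types = ["A", "O", "O", "B", "A", "O", "AB", "A"]   # made-up sample

frequency = Counter(blood_types)

# Print the table, most frequent values first
for value, count in frequency.most_common():
    print(value, count)

# Values that share the highest frequency are the modes
highest = max(frequency.values())
table_modes = [value for value, count in frequency.items() if count == highest]
print(table_modes)   # two values tie here, so this sample is bimodal
```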

29
Q

Pie charts are used to:

A
  • Summarise the data
  • Illustrate categorical, composition data
  • Present visual representation of data that might otherwise be provided in a small table

What you need:
Each category to be independent of one another (mutually exclusive)
Categorical data of either nominal or ordinal nature
Fewer than 6 categories (6 or more categories in a single pie chart can be difficult to interpret)
Other graphs and methods:
Bar graphs are similar to pie charts, but they are more flexible in that they can present a greater number of categories over multiple time points.
The difficulty with pie charts is in directly comparing data segments. Bar charts allow quick visual comparison of data segments because the bars sit next to each other.

Percentages (%) for each category in descending order.

Main points:

Pie charts are useful in displaying categorical data and proportions of each category.
The percentages presented in a pie chart are considered the summary statistics.
Pie charts should only be used with fewer than 6 categories.

30
Q

Bar graph

A

Summarise the data.
Illustrate categorical, composition and comparative data.
Present visual representation of data that might otherwise be provided in a frequency table.
What you need:
Each category to be independent of one another (mutually exclusive).
Categorical data of either nominal or ordinal nature.
Discrete (but not continuous) data sets can be used for bar charts.
Other graphs and methods:
Bar charts are easier to follow than the frequency table.
A quick look at the bar chart gives an idea about the distribution of data for which it is created.
A bar chart is superior to a pie chart because it can display a larger number of categories.
In a bar chart, bars are separated from each other because the categories are discontinuous. This contrasts with histograms, which have no separation between bars because the x-axis variable is continuous.

31
Q

Single bar chart

A

Components of a single bar chart:

Percentages for each category in descending order.
The largest (and second largest) percentages and why these are significant.
The smallest percentage and why this is significant.
Summary conclusion – bringing it all together.

32
Q

Histogram

A

Histograms are used to:
Summarise the data.
Illustrate numerical data.
Help visualise and interpret the distribution of a data set.
What you need:
Numerical data that is only continuous in nature.
Each observation or data point is independent of the others.
Differences between bar charts and histograms:
In a histogram the bars are joined to each other and the variable plotted on the horizontal axis is continuous.
In a bar diagram the bars are separated and the variable plotted on the horizontal axis is categorical or discrete.
Histograms display the distribution of a data set, whereas bar charts are used to compare variables.

33
Q

Bimodal distribution

A

A bimodal distribution has two separate peaks.
This usually indicates two distinct groups within the data.

34
Q

Histogram main points

A
  • A histogram is a block diagram whose blocks are proportional in area to the frequency in each class or group; the blocks are joined to each other because the data is continuous.
  • In a histogram the bars are joined to each other and the variable plotted on the horizontal axis is continuous.
  • A histogram might present as symmetrical (bell-shaped) or skewed (positive or negative).
35
Q

Box plots are used to:

A

Present the distribution of numerical data, discrete and continuous (comparisons and distribution).
Compare two different data sets in parallel box plots (comparison).
Identify outliers and compare distributions for multiple groups.

36
Q

Box plots provide the following five-number summary statistics for a data set:

A

Minimum value.
Maximum value.
25th percentile or the first quartile ⇒ the value below which the smallest 25% of observations fall and 75% of observations are above.
50th percentile or the second quartile or median ⇒ the value in the middle or the half way point.
75th percentile or the third quartile ⇒ the value where 75% of observations fall below and 25% of observations are above.
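
The five values can be computed directly from the data. The sketch below assumes the "inclusive" quartile method of `statistics.quantiles` (Python 3.8 or later); the function name `five_number_summary` and the data are made up for illustration.

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile and maximum."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": med,
        "q3": q3,
        "max": max(values),
    }


summary = five_number_summary([2, 4, 4, 5, 7, 9, 12])   # made-up data
print(summary)
```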

37
Q

Interpret a Box Plot

A

Symmetrical: a distribution with quartiles and upper and lower whiskers of similar lengths.
Asymmetrical – Positively (right) skewed
If most of the observations are concentrated on the low end of the scale, the distribution is skewed right.
The distribution is skewed if one whisker is dragged out more than the other whisker.
The right whisker of the box plot is longer than the left whisker.

38
Q

Comparing IQRs (dispersion)

A

The IQR calculates the spread among the middle 50% of the observations.
A small spread or dispersion will have a small IQR value and the box plot will look narrow.
A large spread or dispersion will have a large IQR value and the box plot will look wide.
Box plots are commonly used to compare two or more groups of patients.
When interpreting multiple box plots, compare the median values, IQR (spread), outliers and skewness of the data sets.

39
Q

Scatter plot

A

Scatterplots are used to:
Summarise the data.
Show a relationship between two sets of data.
Distribution and relationship
What you need:
Two quantitative variables x and y.
The X variable may be numerical or categorical; the Y variable must be continuous.
Graph to plot your data with an x (horizontal) and y (vertical) axis.
Assumptions: Observations in our data set are independent
Why? If individuals are measured at multiple time points or related to other observations then results we find can be misleading or incorrect.

40
Q

Interpret a Scatterplot

A

The overall pattern.
Deviations from the pattern known as outliers.
Positive relationship

Increasing values of the explanatory variable (X-axis) are associated with increasing values of response variable (Y-axis).
Negative relationship

Increasing values of the explanatory variable (X-axis) are associated with decreasing values of response variable (Y-axis).
Neither

Not all relationships can be classified easily. Some plots do not display an observable positive or negative trend.

Shape:
Linear: data points seem to generally fall along a straight line.
Non-linear or curvilinear: data points seem to follow some curve, which may be wavy, funnel shaped or exponential.
Clusters: data points in a scatter plot form distinct groups.

41
Q

What is an error bar?

A

An error bar illustrates the uncertainty or variation of a data point within a graph, and is represented by a line through a point on the graph.
Error bars provide information on the spread of the data around the mean. For example, if the error bar is wide then the data is more variable around the mean.
They relay information regarding the reliability of the mean value for the data set. If the error bar is narrow then the mean value represents the data accurately.
They illustrate the possible range of differences between groups according to their SD, SEM or confidence interval.
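
The three quantities an error bar may be based on can be sketched as follows. The function name `error_bars` and the data are made up; the SD is the sample standard deviation, and the 95% confidence interval uses the normal-approximation multiplier 1.96, which is an assumption here (for small samples a larger t-based multiplier would normally be used).

```python
import math
from statistics import mean, stdev


def error_bars(values):
    """Return the mean, SD, SEM and an approximate 95% confidence interval."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)              # sample standard deviation (n - 1 divisor)
    sem = sd / math.sqrt(n)         # standard error of the mean
    half_width = 1.96 * sem         # normal approximation (assumed)
    return m, sd, sem, (m - half_width, m + half_width)


m, sd, sem, ci = error_bars([4, 6, 8, 10, 12])   # made-up data
print(m, sd, sem, ci)
```

For the same data the SEM bar is always narrower than the SD bar, which is one reason a graph should state which of the three its error bars show.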

42
Q

Caution with two overlapping error bars

A

Error bars are useful for illustrating the amount of difference that exists between two or more groups; however, they cannot be used to conclude whether a difference is statistically significant.
If the sample sizes of the two groups are not equal and the error bars do not overlap, the p value may be < 0.05 or > 0.05.
If the sample sizes of the two groups are equal and the error bars do not overlap, the p value is < 0.05 for only 95% confidence interval error bars (and not SD or SEM error bars).
If the sample sizes of the two groups are equal and the error bars do overlap, the p value is > 0.05 for only SD error bars (and not SEM or 95% confidence interval error bars).

43
Q

What is survival analysis?

A

Survival analysis answers the question - did the person experience the event (Yes / No), and if Yes: when did they experience the event.
There are two possible outcomes:
Event = Yes, and time before they did
Event = No, and time followed up
Those who do not experience the event are regarded as censored.
The participant may be lost to follow-up.
Or the participant may have withdrawn from the study before its completion.
Or the participant may have completed the follow-up period of the study, and not experienced an event.
In all of these cases, we know the “no event experienced” status of the participant while they were in the study - but we don’t know if they experienced an event in the future.

44
Q

What is a hazard ratio?

A

Hazard ratio (HR) measures the effect of an intervention on an outcome over time.
HR differs from relative risk as the probability of an event occurring changes with time, whereas relative risk has a set definitive value.
HR is used to determine how long it takes for a particular event to occur – thus it is commonly reported as a time-to-event analysis or survival analysis.
Outcomes of interest can be positive (e.g. time until recovery), or negative (e.g. time until death).

45
Q

Hazard ratio interpretation

A

HR compares, between the intervention and control groups, the probability that an individual experiences an event, such as death, at a particular time.
Hazard Ratio = Hazard in the intervention group ÷ Hazard in the control group
HR = 1. At a single time point, both the intervention and control group have the same rate/probability.

HR = 2. At a single time point, the intervention group has twice the probability of an event occurring compared to the control group.

HR = 0.5. At a single time point, the intervention group has half the probability of an event occurring compared to the control group.
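
The formula and the three reference values can be illustrated with a minimal sketch. The function names `hazard_ratio` and `interpret` and the hazard values are made up; in practice hazards are estimated from time-to-event data rather than supplied directly.

```python
def hazard_ratio(hazard_intervention, hazard_control):
    """Hazard in the intervention group divided by hazard in the control group."""
    return hazard_intervention / hazard_control


def interpret(hr):
    """Describe a hazard ratio relative to the control group."""
    if hr == 1:
        return "same hazard in both groups"
    if hr > 1:
        return "higher hazard in the intervention group"
    return "lower hazard in the intervention group"


hr_double = hazard_ratio(0.10, 0.05)   # 2.0: twice the hazard
hr_half = hazard_ratio(0.02, 0.04)     # 0.5: half the hazard
hr_equal = hazard_ratio(0.03, 0.03)    # 1.0: no difference

print(hr_double, interpret(hr_double))
print(hr_half, interpret(hr_half))
print(hr_equal, interpret(hr_equal))
```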