Week 1 Flashcards
Capitalization
What is P, p
X, x
N, n
In general, capital letters refer to population attributes (i.e., parameters); and lower-case letters refer to sample attributes (i.e., statistics). For example,
P refers to a population proportion; and p, to a sample proportion.
X refers to a set of population elements; and x, to a set of sample elements.
N refers to population size; and n, to sample size.
μ refers to
x̄ refers to
σ refers to
s refers to
σ² refers to
Greek vs. Roman Letters
Like capital letters, Greek letters refer to population attributes. Their sample counterparts, however, are usually Roman letters. For example,
μ refers to a population mean; and x̄, to a sample mean.
σ refers to the standard deviation of a population; and s, to the standard deviation of a sample.
σ² refers to the variance of a population.
Population parameters
μ
σ
σ²
P
Q
ρ
N
μ refers to a population mean.
σ refers to the standard deviation of a population.
σ² refers to the variance of a population.
P refers to the proportion of population elements that have a particular attribute.
Q refers to the proportion of population elements that do not have a particular attribute, so Q = 1 – P.
ρ is the population correlation coefficient, based on all of the elements from a population.
N is the number of elements in a population.
Sample Statistics
x̄
s
s²
p
q
r
n
x̄ refers to a sample mean.
s refers to the standard deviation of a sample.
s² refers to the variance of a sample.
p refers to the proportion of sample elements that have a particular attribute.
q refers to the proportion of sample elements that do not have a particular attribute, so q = 1 – p.
r is the sample correlation coefficient, based on all of the elements from a sample.
n is the number of elements in a sample.
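As an illustration of the notation above, the following minimal Python sketch computes a few population parameters and their sample counterparts. The data values, the variable names and the chosen attribute (a value of at least 5) are made up for the example. The sample variance uses the conventional n - 1 divisor, which these notes do not state explicitly.

```python
import statistics

# Hypothetical population of N = 8 values (made up for illustration).
population = [2, 4, 4, 4, 5, 5, 7, 9]
# Hypothetical sample of n = 4 values taken from that population.
sample = [2, 4, 5, 9]

# Population parameters (Greek or capital letters).
N = len(population)
mu = statistics.mean(population)            # population mean
sigma2 = statistics.pvariance(population)   # population variance (divides by N)
sigma = statistics.pstdev(population)       # population standard deviation
P = sum(v >= 5 for v in population) / N     # proportion with the attribute
Q = 1 - P                                   # proportion without the attribute

# Sample statistics (Roman or lower-case letters).
n = len(sample)
x_bar = statistics.mean(sample)             # sample mean
s2 = statistics.variance(sample)            # sample variance (divides by n - 1)
s = statistics.stdev(sample)                # sample standard deviation
p = sum(v >= 5 for v in sample) / n         # sample proportion with the attribute
q = 1 - p                                   # sample proportion without the attribute

print(f"N={N}, mu={mu}, sigma2={sigma2}, sigma={sigma}, P={P}, Q={Q}")
print(f"n={n}, x_bar={x_bar}, s2={s2:.3f}, s={s:.3f}, p={p}, q={q}")
```

In this example the sample mean happens to equal the population mean, while the sample variance and the population variance differ.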
Simple Linear Regression
β₀ is the intercept constant in a population regression line.
β₁ is the regression coefficient (i.e., slope) in a population regression line.
R² refers to the coefficient of determination.
b₀ is the intercept constant in a sample regression line.
b₁ is the regression coefficient (i.e., slope) in a sample regression line.
s_b₁ refers to the standard error of the slope of a regression line.
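To make the regression symbols concrete, here is a minimal sketch that computes b₀, b₁, R² and the standard error of the slope by ordinary least squares. The function name fit_line and the five data points are hypothetical, chosen only so the arithmetic is easy to follow. The formulas are the standard least-squares ones, which these notes do not spell out.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x (illustrative sketch)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                 # sample slope
    b0 = y_bar - b1 * x_bar        # sample intercept
    # Sum of squared residuals around the fitted line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - sse / syy             # coefficient of determination
    # Standard error of the slope, using n - 2 degrees of freedom.
    s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    return {"b0": b0, "b1": b1, "R2": r2, "s_b1": s_b1}

# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
fit = fit_line(xs, ys)
print(fit)  # b0 ≈ 2.2, b1 ≈ 0.6, R2 ≈ 0.6, s_b1 ≈ 0.283
```

For simple linear regression, R² equals the square of the sample correlation coefficient r, so here r ≈ 0.775.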
Probability
P(A) refers to the probability that event A will occur.
P(A|B) refers to the conditional probability that event A occurs, given that event B has occurred.
P(A’) refers to the probability of the complement of event A.
P(A ∩ B) refers to the probability of the intersection of events A and B.
P(A ∪ B) refers to the probability of the union of events A and B.
E(X) refers to the expected value of random variable X.
b(x; n, P) refers to binomial probability.
b*(x; n, P) refers to negative binomial probability.
g(x; P) refers to geometric probability.
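The probability notation above can be tried out on a small example. The sketch below uses a fair six-sided die for the event probabilities and then evaluates the three named distributions. The events A and B, the function names and the parameter values are assumptions made for illustration. For the negative binomial, the sketch assumes x counts the number of trials needed to reach r successes, and for the geometric, that x counts the trials up to and including the first success; other conventions exist.

```python
from fractions import Fraction
from math import comb

# Sample space for one roll of a fair die (illustrative assumption).
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # event A: the roll is even
B = {4, 5, 6}   # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_A = prob(A)                       # P(A)
p_not_A = prob(outcomes - A)        # P(A'), the complement of A
p_A_and_B = prob(A & B)             # P(A ∩ B)
p_A_or_B = prob(A | B)              # P(A ∪ B)
p_A_given_B = p_A_and_B / prob(B)   # P(A|B) = P(A ∩ B) / P(B)
expected = sum(Fraction(x, 6) for x in outcomes)  # E(X) for the roll X

def binomial(x, n, P):
    """b(x; n, P): probability of exactly x successes in n trials."""
    return comb(n, x) * P**x * (1 - P) ** (n - x)

def negative_binomial(x, r, P):
    """b*(x; r, P): probability that the r-th success occurs on trial x."""
    return comb(x - 1, r - 1) * P**r * (1 - P) ** (x - r)

def geometric(x, P):
    """g(x; P): probability that the first success occurs on trial x."""
    return P * (1 - P) ** (x - 1)

print(p_A, p_not_A, p_A_and_B, p_A_or_B, p_A_given_B, expected)
print(binomial(2, 4, 0.5))           # 0.375
print(negative_binomial(3, 2, 0.5))  # 0.25
print(geometric(3, 0.5))             # 0.125
```

Under these conventions the geometric probability is the negative binomial with r = 1.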
Describe the role biostatistics plays in the discipline of public health and medicine
Statistics is all about converting data into useful information. This process includes collecting, summarising and interpreting data.
Goal: To study the population of interest.
The ‘Big Picture’ in statistics is to make inferences about a population from a given representative sample.
This summary outlines the fundamental steps in conducting research using biostatistics:
Step 1: Begin with defining the population, but since studying everyone is often impractical, select a representative sample. Choices made here, like hypothesis and sample selection, greatly influence data analysis later.
Step 2: Perform exploratory data analysis, using descriptive statistics to summarize the data through graphs, tables, and numerical measures.
Step 3: Assess how the sample differs from the population to ensure validity. Consider randomness and bias in the sample selection process.
Step 4: In the final step of inference, combine descriptive statistics and probability to draw conclusions about the entire population.
Demonstrate understanding of the key ethical considerations relevant to study design, data collection, analysis and interpretation.
Ethics committees help ensure that research methodology is sound and likely to yield meaningful results, in addition to protecting participant safety. Studies should be designed carefully to respect participants’ time and minimise risks.
Questions that might arise in ethical review include:
How are data being collected?
Are there special requirements for informed consent or confidentiality for any participants whose information is being used in the study?
How will results be disseminated to the public?
These can be addressed in the design stage by ensuring:
the research question is appropriate,
a valid and rigorous methodology is chosen that is capable of answering the research question,
any participants are made aware of how their information will be used (e.g. will it be identifiable or will analysis only involve anonymised, aggregated data?),
samples are chosen for statistical validity and are representative of the population of interest,
all results will be accurately reported, not just those that support the researchers’ hypothesis,
funding sources and collaborator details will be disclosed as appropriate.
What is data?
Data refers to pieces of information about individuals which are organised into variables.
Data is obtained by collecting information from a group of patients participating in a study.
A dataset is a set of data collected during an experiment, research project or scenario.
We can organise our data into observations (representing individuals) and variables:
Observations
Information from a patient or experiment is called an observation.
An observation may represent single or multiple pieces of information about a patient.
Examples include: age, gender, height, weight, blood pressure level and cholesterol level of a single patient.
Variables
A variable is any observation that can have different values.
A variable may have different values when observed at different times for the same patient.
A variable may have different values for different patients.
If we consider the blood pressure of patients in a study, it is likely that the blood pressure for each patient will be different, thus blood pressure is a variable.
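One way to picture observations and variables is as rows and columns of a small table. The sketch below is only an illustration: the three patients, their values and the helper name variable_values are made up.

```python
# Each dictionary is one observation (one hypothetical patient);
# each key is a variable recorded for that patient.
dataset = [
    {"age": 54, "gender": "F", "systolic_bp": 128},
    {"age": 61, "gender": "M", "systolic_bp": 141},
    {"age": 47, "gender": "F", "systolic_bp": 119},
]

def variable_values(data, name):
    """Collect the values of one variable across all observations."""
    return [observation[name] for observation in data]

systolic = variable_values(dataset, "systolic_bp")
# Blood pressure takes different values for different patients,
# which is what makes it a variable.
differs_between_patients = len(set(systolic)) > 1

print(systolic)                  # [128, 141, 119]
print(differs_between_patients)  # True
```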
What is a random variable?
Random variables
A variable is any observation that is different from person to person but can be predicted, e.g. gender, age, height.
A random variable is a variable that arises due to chance and cannot be predicted, e.g. the blood sugar level of an individual.
A variable is also a random variable if it was obtained from a random sample.
Random variables are a subset of variables.
Types of variables
Categorical vs Numerical variables
Variables can be classified as either categorical (qualitative) or numerical (quantitative).
Numerical variables represent a measurable quantity, where observations are counts or measurements. For example: the number of people who have graduated from Monash University – a measurable attribute.
Categorical variables represent non-numerical labels, where observations fall into separate distinct categories. For example: the blood type of an individual is either A, B, AB, or O – not a measurable attribute.
Discrete vs Continuous variables
A numerical (quantitative) variable can be classified as discrete or continuous.
Discrete: measurements where the possible values are clearly separated from each other or can only take values equal to whole numbers.
For example: the number of people involved in a car crash.
Continuous: measurements that can take fractional or decimal values. An observation can take any of infinitely many possible values.
For example: blood pressure of an individual.
Ordinal vs Nominal variables
A categorical (qualitative) variable can be classified into ordinal or nominal.
Ordinal: Where observations can be ordered/ranked according to some criteria. For example: an ICU patient may classify their degree of pain as: no pain, mild pain, moderate pain and severe pain.
Nominal: Where observations are classified into separate categories that have no logical ranking. For example: blood types (A, B, AB, or O) or gender (Male, Female, Non-binary, etc.).
Hint: If there are only two categories (binary data) then the data is always classified as nominal.
Note: It is common for binary categorical variables to be coded as 0 or 1, for example Yes = 1 and No = 0.
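The difference between ordinal and nominal data, and the 0/1 coding mentioned in the note, can be sketched in a few lines of Python. The pain scale comes from the example above; the responses, the answers and the helper names are hypothetical.

```python
# Ordinal categories listed from lowest to highest rank.
PAIN_LEVELS = ["no pain", "mild pain", "moderate pain", "severe pain"]

def pain_rank(level):
    """Position of a pain level on the ordinal scale (0 = lowest)."""
    return PAIN_LEVELS.index(level)

# Hypothetical responses from four patients.
responses = ["moderate pain", "no pain", "severe pain", "mild pain"]
# Ordinal data can be sorted meaningfully by rank.
ordered = sorted(responses, key=pain_rank)

# Binary categorical data coded as 0 or 1.
BINARY_CODES = {"Yes": 1, "No": 0}
answers = ["Yes", "No", "Yes"]
coded = [BINARY_CODES[answer] for answer in answers]

print(ordered)  # ['no pain', 'mild pain', 'moderate pain', 'severe pain']
print(coded)    # [1, 0, 1]
```

Nominal categories such as blood type have no such ranking, so sorting them would only give an arbitrary (for example alphabetical) order.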
What is a population?
Population
A population is the largest collection of entities for which we have an interest at a particular time.
The term target population is often used in scientific literature to define the population of interest.
Example: the incidence of ovarian cancer in women aged 25 to 55 years in Australia → the population in this example is all women in this age group in Australia.
What is a sample?
Sample
A sample is defined simply as a representative part of a population and consists of one or more observations from that population.
The sample is a section of the population that represents the population as much as possible.
Consider the diabetic population in Victoria. If we collect fasting blood glucose levels of only a fraction of these patients (e.g. 5 patients), we have only a part of our population and thus we have a sample.
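Drawing a sample from a population can be sketched with Python's random module. The population below is simulated, not real patient data: the 1,000 values, their distribution and the units (mmol/L) are assumptions made purely for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated "population" of fasting blood glucose levels (mmol/L).
# These numbers are generated, not measured.
population = [rng.gauss(8.0, 2.0) for _ in range(1000)]

# Simple random sample of 5 patients, drawn without replacement.
sample = rng.sample(population, 5)

mu = statistics.mean(population)   # population mean
x_bar = statistics.mean(sample)    # sample mean, an estimate of mu

print(f"N = {len(population)}, n = {len(sample)}")
print(f"population mean = {mu:.2f}, sample mean = {x_bar:.2f}")
```

With only 5 observations the sample mean can sit some distance from the population mean; larger random samples tend to land closer.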
What is the mean?
Mean
The mean is obtained by adding all the values in a sample and dividing by the number of values added.
Let n be the number of observations in a sample, x₁, x₂, …, xₙ the observed values and x̄ the sample mean. Then x̄ = (x₁ + x₂ + … + xₙ) / n.
The mean is affected by extreme values in a dataset because it uses information from every observation; it is most appropriate for symmetric data.
Median
A value which divides the data set into two equal parts.
If the number of values is odd, the median will be the middle value after all values have been arranged in ascending order of their magnitude.
If the number of values is even, the median is taken to be the average of the two middle values after all values have been arranged in ascending order of their magnitude.
The median is not affected by extreme values in a data set because it depends only on the middle observation(s); hence it is commonly used for asymmetric data.
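The definitions of the mean and median above translate directly into code. The sketch below implements both as described and shows how one extreme value moves the mean but not the median. The function names and the data are made up for illustration.

```python
def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted data (average of the two middle
    values when the number of values is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]   # one extreme value

print(mean(symmetric), median(symmetric))        # 3.0 3
print(mean(with_outlier), median(with_outlier))  # 22.0 3
print(median([1, 2, 3, 4]))                      # 2.5
```

Python's built-in statistics module provides statistics.mean and statistics.median, which can be used instead of hand-written versions.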