Lecture 1 UCC Flashcards

(44 cards)

1
Q

What is descriptive statistics?
State the three forms that descriptive stats can be in

A

Descriptive Statistics provides information regarding the overview of the general features of a given dataset.

They may be in the form of tables, graphs or numerical summary measures (e.g. proportions, means, etc.)

Examples of numerical summary measures in statistics, which summarize and describe important features of data, include:

  • Mean: The average of all data points.
  • Median: The middle value in a sorted dataset.
  • Mode: The most frequently occurring value(s).
  • Range: The difference between the maximum and minimum values.
  • Variance: The average squared deviation from the mean.
  • Standard Deviation: The square root of the variance; measures how spread out the values are from the mean.
  • Interquartile Range (IQR): The range between the first (Q1) and third quartile (Q3), representing the middle 50% of the data.
  • Percentiles: Values that divide the data into 100 equal parts.
  • Quartiles: Values that divide the data into four equal parts (Q1, Q2, Q3).
  • Z-Score: Represents how many standard deviations a data point is from the mean. A z-score of 0 means the data point is exactly at the mean.
    • A positive z-score means the data point is above the mean.
    • A negative z-score means the data point is below the mean.
    • The magnitude of the z-score indicates how far the data point is from the mean. For example, a z-score of +2 means the data point is 2 standard deviations above the mean.

Uses of z-scores:
1. Comparing values: because z-scores are standardized, they let you compare data points drawn from different distributions.
2. Identifying outliers: data points with very high or very low z-scores (e.g., above +3 or below −3) are often considered outliers.

In a standard normal distribution (mean = 0, standard deviation = 1), z-scores help determine the probability of a data point occurring within the distribution. For instance:
• About 68% of data points fall within a z-score range of -1 to +1.
• About 95% fall within -2 to +2.
• About 99.7% fall within -3 to +3.

  • Skewness: Indicates the asymmetry of the data distribution.
  • Kurtosis: Measures the “tailedness” of the distribution.

These numerical summaries provide insights into the distribution, central location, spread, and overall structure of the data.
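As a quick illustration of the z-score arithmetic above, here is a minimal Python sketch (the dataset is made up for illustration):

```python
from statistics import mean, stdev

def z_scores(data):
    """Return the z-score of each point: (x - mean) / standard deviation."""
    m, s = mean(data), stdev(data)  # sample standard deviation
    return [(x - m) / s for x in data]

data = [2, 4, 4, 4, 5, 5, 7, 9]     # this set has mean 5
zs = z_scores(data)
# the point equal to the mean gets z = 0; points above the mean get z > 0,
# points below the mean get z < 0
```

Running this, the value 5 (which equals the mean) gets a z-score of 0, while 2 gets a negative and 9 a positive z-score.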

2
Q

What is inferential statistics?

A

Inferential statistics: how you use data to draw conclusions, test hypotheses, and make inferences about a population.

3
Q

State the Population,target population,sample needed in the example below
Example: a study for 200 pregnant women in cape coast.

A

Population: women in Cape Coast
Target population: pregnant women
Sample: 200

4
Q

What is discrete and continuous data?
What is nominal data?
What is ordinal data?
Give three examples each


A

Qualitative:
Ordinal data: the categories are ordered in some way. It is used in disease state (mild, moderate, severe anaemia); other examples are degree of pain, and wealth index (rich, poor, poorest).
Nominal data: the categories are not ordered but rather have names. Examples: blood group (A, B, AB and O),
marital status (married/widowed/single), sex (male/female),
political party affiliation,
HIV status (positive, negative).

Qualitative variables (or categorical variables): these are variables that describe data that fits into categories, i.e. non-numerical. E.g. marital status of women (single, married, separated, widowed), educational level (none, primary, middle/JSS/JHS, secondary, tertiary)

Types of Qualitative variables
Nominal: two or more categories with no order. E.g. religion (Christianity, Islam, traditionalists, other); ethnicity (Akan, Ewe, Ga Adamgbe, Dagaaba, other)

Binary/dichotomous: only two categories. E.g. gender (male, female); HIV couples' status (discordant, concordant); febrile status (afebrile, febrile); abortion (spontaneous, induced); type of delivery (normal, caesarean section)

Ordinal- two or more categories with order. Eg pain (mild, moderate, severe); educational level (none, primary, middle/JSS/JHS, secondary, tertiary), Household wealth (poorest, more poor, less poor, least poor); Assisted delivery (TBA, midwife, doctor), Likert scales (strongly disagree, disagree, neutral, agree, strongly agree)

Quantitative (numerical) data:
Numerical data is discrete or continuous. Examples of discrete data: days in the week or months in the year, number of people in a class, number of cars.

Discrete: These are variables with a well-defined finite set of possible values. They are usually counts e.g number of deaths, number of smokers etc

Continuous variables can assume any value within a specified relevant interval.

e.g. height, weight, and temperature of a patient

Continuous: the weight of an individual, because your weight can change every time you check it at intervals.

To know which is which, rule out the other options by asking if a decimal value makes sense. If not, the variable is discrete.
Example: you can't have the number of children as 4.5.

Continuous variables: These are variables with an infinite number of possible values. This implies any value is possible for the variable. E.g height, weight, temperature, systolic blood pressure, haemoglobin level

Types of continuous variables
Interval variables
ratio variables

Interval variables: these are variables in which the difference between two values is meaningful, e.g. temperature measured in degrees Celsius or Fahrenheit. In addition, their zero is arbitrary, not a true zero.
So 0 °C does not mean the quantity is absent; the zero is just a point on the scale, not an absence of temperature.

But for ratio variables, the zero is meaningful. For distance, zero marks the start, telling you the person hasn't moved anywhere, and that absence is meaningful.

Examples of zero being meaningless are:
A pH of zero does not mean “no acidity” (just means 1 molar of H+)
Temperature of zero on the Celsius or Fahrenheit scale does not mean “no thermal energy”

Ratio variables: Ratio variables are interval variables, but with the added condition that 0 (zero) of the measurement indicates that there is none of that variable. E.g height, distance, weight etc.

NB: Quantitative variables can be converted to qualitative variables. E.g age, parity
Ranked (ordinal) data lie between nominal and interval: you know the order but not the precise differences.

🔑 First, the Key Difference:

Feature | Interval scale | Ratio scale
--- | --- | ---
Has meaningful order | ✅ | ✅
Has equal intervals | ✅ (1°C to 2°C = 9°C to 10°C) | ✅
Has a true zero | ❌ (zero is arbitrary) | ✅ (zero means “none”)
You can make ratios | ❌ (10°C is not twice as hot as 5°C) | ✅ (10 kg is twice as heavy as 5 kg)

⚠️ Why Zero on an Interval Scale Is “Meaningless” (Not a True Zero):

Let’s explore your two examples:

🔬 Example 1: pH scale
• A pH of 0 means a very high concentration of hydrogen ions (1 mol/L).
• But pH doesn’t start from “no acidity” — it’s logarithmic, and there are negative pH values too (like −1, −2 in strong acids).
• So zero here is not a true absence of acidity; it’s just a point on the scale.
➤ pH is not a ratio scale.

🌡️ Example 2: Temperature
• 0°C or 0°F does not mean “no heat” or “no thermal energy.”
• 0°C is just the freezing point of water — an arbitrary reference.
• You can have negative temperatures in both scales, so zero isn’t an “absence” of heat.
• That’s why saying “20°C is twice as hot as 10°C” is scientifically incorrect.
➤ Celsius and Fahrenheit are interval scales, not ratio scales.

✅ What Would Be a True Zero?

Here’s how ratio scales behave differently:

💧 Example: Weight (kg)
• 0 kg = complete absence of mass.
• 10 kg is truly twice as heavy as 5 kg.
• You can’t have negative weight.
➤ So weight is a ratio scale.

🔥 Example: Temperature in Kelvin
• 0 K = absolute zero = no thermal energy.
• So 200 K is twice as hot (in terms of energy) as 100 K.
➤ Kelvin is a ratio scale, unlike Celsius.

🔁 Summary

Measurement | Zero means… | Scale type
--- | --- | ---
0 °C | freezing point of water (arbitrary) | interval
0 pH | high acidity, not “no acidity” | interval
0 kg | no mass | ratio
0 K (Kelvin) | no thermal energy | ratio


• In theory: age is continuous, since it can be measured to any degree of precision.
• In practice: age is often treated as discrete for convenience when we round to whole numbers.

Here are two MCQs to test your understanding of discrete and continuous variables:

A researcher is studying the growth of bacteria in a lab over time. They record the size of the bacterial colony every 12 hours. Which statement best describes the variable “colony size”?

A) It is discrete because the bacteria are counted.

B) It is continuous because the size of the colony can be measured in precise units like millimeters or micrometers.

C) It is discrete because the colony size is measured at regular intervals.

D) It is continuous because the number of bacteria in the colony grows over time.

Answer: B) It is continuous because the size of the colony can be measured in precise units like millimeters or micrometers.
(The size of the bacterial colony is measured in a continuous manner, as it can take any value within a range.)

A hospital tracks the number of days each patient stays in the intensive care unit (ICU). Which of the following best describes the variable “length of stay in the ICU”?

A) It is continuous because the total time can be measured in hours and minutes.

B) It is discrete because the number of days is counted as whole numbers.

C) It is continuous because it can vary significantly between patients.

D) It is discrete because the stay in the ICU is divided into distinct time blocks.

Answer: B) It is discrete because the number of days is counted as whole numbers.
(The length of stay is often recorded as whole days, making it a discrete variable in this context.)

These questions focus on understanding the nature of measurement and how variables are typically recorded in real-world scenarios.

5
Q

What are numerical summary measures and measures of central tendency?

A

Know the kind of data you have before you apply a particular descriptive technique to it.

Numerical summary measures are used to make concise quantitative statements that characterize the whole distribution of quantitative values.

Measures of Central Tendency
These are measures used to investigate the central characteristic of the data or the point which the observations (or values) tend to cluster

6
Q

What is arithmetic mean?

Question:
The forced expiratory volume (FEV) in 1 second for 13 study participants is as follows: 2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05, 2.25, 2.68, 3.00, 4.02, 2.85, 3.38. Calculate the arithmetic mean FEV.

A

Arithmetic mean: It is the sum of all observations in a set of data divided by the total number of observations

The arithmetic mean is extremely sensitive to unusual values.
Examples
The forced expiratory volume (FEV) in 1 second for 13 study participants are as follows: 2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05, 2.25, 2.68, 3.00, 4.02, 2.85, 3.38. Calculate the arithmetic mean for FEV values
The arithmetic mean is given by
(2.30+2.15+3.50+2.60+2.75+2.82+4.05+2.25+2.68+3.00+4.02+2.85+3.38)/13 = 38.35/13 = 2.95 litres
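This calculation can be verified with a couple of lines of Python:

```python
# FEV values (litres) for the 13 participants from the example
fev = [2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05, 2.25,
       2.68, 3.00, 4.02, 2.85, 3.38]

# arithmetic mean = sum of observations / number of observations
mean_fev = sum(fev) / len(fev)   # 38.35 / 13
print(round(mean_fev, 2))        # 2.95
```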


Definition: The arithmetic mean is what most people commonly refer to as the “average.” You add up all the numbers and then divide by how many numbers there are.

Formula:

Arithmetic mean = (sum of all values) / (number of values)

Example:
Consider these numbers: 4, 7, 10.

  1. Sum: (4 + 7 + 10 = 21)
  2. Count: There are 3 numbers.
  3. Arithmetic Mean: {21/3} = 7

So, the arithmetic mean is 7.

Definition: The harmonic mean is used for rates or ratios. You take the reciprocal (1 divided by the number) of each value, find their average, and then take the reciprocal of that average.

Formula:

Harmonic mean = (number of values) / (sum of the reciprocals of all values)

Example:
Consider these values: 4, 6.

  1. Reciprocals: 1/4 = 0.25, 1/6 ≈ 0.167
  2. Sum of reciprocals: 0.25 + 0.167 = 0.417
  3. Harmonic mean: 2/0.417 ≈ 4.8
    The numerator is 2 because there are 2 values. If there were 3 values, it would be 3 divided by the sum of the reciprocals.

So, the harmonic mean is approximately 4.8.
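Python's `statistics.harmonic_mean` implements this directly:

```python
from statistics import harmonic_mean

# 2 / (1/4 + 1/6) = 2 / (5/12) = 24/5
hm = harmonic_mean([4, 6])
print(hm)  # approximately 4.8
```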

Definition: The geometric mean is used for sets of numbers that are multiplied together or when dealing with percentages. You multiply all the numbers together, then take the nth root, where n is the number of values.

Geometric mean is the nth root of the product of n observations.
Formula:

Geometric mean = ⁿ√(product of all values)

Example:
Consider these numbers: 2, 8.

  1. Product: 2 × 8 = 16
  2. Geometric mean: √16 = 4
    It's the square root because there are 2 values; with 3 values it would be the cube root.

So, the geometric mean is 4.
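Likewise, Python's `statistics.geometric_mean` (Python 3.8+) implements this:

```python
from statistics import geometric_mean

# square root (2nd root) of 2 * 8 = 16
gm = geometric_mean([2, 8])
print(round(gm, 6))  # 4.0
```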

  • Arithmetic mean: average of numbers (e.g., 7 for 4, 7, 10).
  • Harmonic mean: useful for rates (e.g., 4.8 for 4, 6).
  • Geometric mean: average for multiplicative datasets (e.g., 4 for 2, 8).

Mean for rates:

In biostatistics, both geometric and harmonic means are used, but for different purposes, depending on the type of data you're analysing.

Here’s how to distinguish their uses in biostatistics:

✅ Geometric Mean in Biostatistics

Used when you’re dealing with:
• Skewed biological data (e.g., bacterial counts, viral loads, enzyme activity, concentration levels)
• Growth rates (e.g., population doubling, pharmacokinetics)
• Data that increases or decreases multiplicatively (log-normal distribution)

🔍 Why use it?

The geometric mean is less sensitive to extreme values and better represents the central tendency of positively skewed data.

📌 Common use cases:
• Geometric Mean Titers (GMT) in immunology
• Drug concentration averages over time
• Mean fold changes in gene expression

✅ Harmonic Mean in Biostatistics

Used when you’re averaging:
• Rates: e.g., heart rate (beats per minute), respiratory rate
• Ratios: especially when the denominator varies
• Per unit time events, like patients per nurse per hour

🔍 Why use it?

The harmonic mean gives a better average when the data involves ratios or rates, and when values are inversely proportional to the quantity you’re analyzing.

🔁 Quick Comparison:

Situation | Use this mean | Example
--- | --- | ---
Skewed biological data or fold changes | Geometric mean | Antibody levels, gene expression ratios
Averaging rates or ratios | Harmonic mean | Average patient wait time, average heart rate over unequal intervals

🔬 In short:
• Use geometric mean when you’re dealing with multiplicative data (especially if it’s log-normally distributed)
• Use harmonic mean when you’re dealing with rates or ratios and need a weighted average


7
Q

How is the calculation for the mean of ungrouped data different from the mean of grouped data?

For a grouped frequency distribution table with classes 20-40, 40-60, 60-80 and respective frequencies 2, 4, 6,
what is the mean?

A

Mean for ungrouped data is different from mean for grouped data.
The mean of ungrouped data is the sum of all the numbers divided by how many numbers there are.
For grouped data, you construct a table of ranges (categories), e.g. 0-5, 6-10, 11-25, and find the frequency of each category: how many people fell within 0-5, how many within 6-10, and so on.

The calculation for grouped data is Σfx / Σf, where f is the frequency of each class and x is the class midpoint.

So for a grouped frequency distribution table with 20-40,40-60,60-80 and 2,4,6 as their respective frequencies
The mean is:
20-40: midpoint is 60/2 = 30
40-60: midpoint is 100/2 = 50
60-80: midpoint is 140/2 = 70

So Σfx = (30×2) + (50×4) + (70×6)
= 60 + 200 + 420
= 680
Σf = 2 + 4 + 6 = 12
So Σfx/Σf = 680/12
= 56.666…
So the mean for this grouped data is 56.67
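The Σfx/Σf calculation can be checked in Python (a minimal sketch using the worked example's classes and frequencies):

```python
# class intervals and frequencies from the worked example
classes = [(20, 40), (40, 60), (60, 80)]
freqs = [2, 4, 6]

# x = midpoint of each class: (lower + upper) / 2
midpoints = [(lo + hi) / 2 for lo, hi in classes]          # 30, 50, 70

# grouped mean = sum(f * x) / sum(f)
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)
print(round(grouped_mean, 2))  # 56.67
```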

8
Q

What is the median under measures of central tendency

What is the median of these values: 2, 4, 1, 2, 3, 3, 1
Find the median of the following values: 6, 5, 2, 4, 3, 2

A

Median: it is the middle value of the set of n observations after they have been arranged in order of magnitude.
For a set of n (odd) observations, the position of the median is given by (n+1)/2.
The median is not sensitive to unusual values.
Examples
What is the median of these values: 2, 4, 1, 2, 3, 3, 1? Sorted: 1, 1, 2, 2, 3, 3, 4. The median is 2.

Find the median of the following values: 6, 5, 2, 4, 3, 2
2,2,3,4,5,6
n/2 = 6/2, so the 3rd position
(n/2) + 1, so the 4th position
So (3+4)/2 = 3.5. The median is 3.5.
The median lies between the values at the 3rd and 4th positions; in this case it just so happens that those values (3 and 4) match their position numbers.

For an odd number of values: (n+1)/2 gives the position of the median after you've arranged the values in order of magnitude.
For an even number of values: n/2 finds the first middle position and (n/2)+1 the second. If x is the first middle value and y the second, then (x+y)/2 is the median. Alternatively, just pick the two middle numbers and average them; that gives the median directly without the position step, which works fine for small datasets. For large datasets, use n/2 and (n/2)+1 to locate the two middle positions, then average the values at those positions.

❓Does the formula (n + 1) ÷ 2 work for even numbers?

No, not exactly.
That formula is only reliable for odd numbers because it gives you the position of the middle number.

Let's see why it breaks with even numbers, using an example.


🔹 Example: 8 values

Let’s take this set:
2, 3, 4, 5, 6, 7, 8, 9

That’s 8 numbers → even

Now try both methods:

🔸 Method 1 (Correct for Even Numbers):

We find:
• n = 8
  • Middle positions = 8/2 = 4 and (8/2) + 1 = 5
→ So: take the average of the 4th and 5th numbers.

From the list:
4th = 5
5th = 6
→ Median = (5 + 6) / 2 = 5.5 ✅

🔸 Method 2: Use (n + 1)/2 = (8 + 1)/2 = 4.5

What does this give us?

It gives position 4.5 — which doesn’t exist in your data.
So it’s telling you the median lies between the 4th and 5th values (which is correct!).

But here’s the catch:

If you use 4.5 as your final median, that’s wrong.

But if you use 4.5 to tell you to average the values at positions 4 and 5, that’s good reasoning.

✅ Conclusion:

Formula | Use it for | Why
--- | --- | ---
(n + 1)/2 | only when n is odd | it gives you the exact position of the middle number
average of values at n/2 and n/2 + 1 | when n is even | there are two middle numbers, so there is no single middle

So calculating (n + 1)/2 = 4.5 was right, but using 4.5 as the final median would be wrong. It just points to the middle between two values, which should be averaged.

🧠 Quick Tip:
• If n is odd → (n + 1)/2 = median position
• If n is even → median = average of values at n/2 and (n/2 + 1)

Or go straight to picking the two middle numbers and averaging them; that gives the median without the position step. But make sure you use the position formula for odd-sized sets.
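Python's `statistics.median` applies exactly these odd/even rules, which makes it handy for checking hand calculations:

```python
from statistics import median

odd = [2, 4, 1, 2, 3, 3, 1]   # 7 values: the middle of the sorted list
even = [6, 5, 2, 4, 3, 2]     # 6 values: average of the two middle values

print(median(odd))   # 2
print(median(even))  # 3.5
```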

9
Q

How is calculation for median in odd numbers different from even set of numbers

A

Median:
The calculation for an odd number of observations differs from that for an even number. Note you can sort in descending or ascending order.

Odd number of observations: arrange in ascending or descending order. The median position is (n+1)/2; the number you get is the position that holds the median.
So for 1,1,2,2,3,3,4, n is 7. (7+1)/2 = 4, so the 4th number is the median; the median is 2.

Mode:most frequent value

From an ungrouped frequency, the mode is the value with the highest frequency

10
Q

What are measures of dispersion under numerical summary measures
What is range?
Is range sensitive to unusual values? Why?
Find the range of the following observations:
5, 4, 6, 2, 2, 3, 3, 8
2, 3, 3, 1, 3, 5, 40

A

Measures of Dispersion
These are measures used to describe the spread (or variability) of the data.
Range: it is the difference between the maximum and minimum values of a dataset.
It is most useful when the minimum and maximum values are also reported.
Since range depends on only the minimum and maximum values, it is sensitive to unusual values.

First sort the values in ascending order, otherwise you'll miss a larger value hiding in the list.
1. 6 − 2 = 4, so range is 4
2. 40 − 1 = 39, so range is 39

Number 1 I had wrong because I didn't arrange the values, so I missed the 8.
So 1. 8 − 2 = 6. The range is 6.

Number 2 is correct.

Please note that we don't report the range as the difference found; we report it as a range.
So 2 to 8 is the range, not 6.
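A tiny Python sketch of the point above, reporting the minimum and maximum rather than just their difference:

```python
def data_range(values):
    """Return (minimum, maximum); report both rather than only the difference."""
    s = sorted(values)   # sorting makes the extremes obvious
    return s[0], s[-1]

lo, hi = data_range([5, 4, 6, 2, 2, 3, 3, 8])
print(lo, hi, hi - lo)  # 2 8 6
```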

11
Q

What is the interquartile range?
What is the difference between the IQR and the median?

A

Interquartile range: it is the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile).
It contains the middle fifty percent of values in a dataset. Take note of the word middle.

The interquartile range contains middle 50% of the observations in a dataset

It is not sensitive to unusual values or skewed data.


The interquartile range (IQR) and the median (Q2) both relate to measures of central tendency and dispersion but serve different purposes. Here’s a detailed comparison:

  1. Interquartile Range (IQR):
    • Definition: The IQR measures the spread of the middle 50% of the data. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1):
      IQR = Q3 − Q1
    • Purpose: It provides an indication of the variability or dispersion of the central 50% of the data, excluding the lower 25% and the upper 25%. It helps to identify the spread of data and is useful for detecting outliers.
  2. Median (Q2):
    • Definition: The median, or the second quartile (Q2), is the middle value of the dataset when it is ordered. It divides the dataset into two equal halves.
    • Purpose: It represents the 50th percentile and indicates the central value of the dataset. It shows where the center of the data lies, but it does not measure the spread or variability.

Key Differences:

  • Measurement Focus:
    • The IQR measures the spread of the central 50% of the data.
    • The Median (Q2) measures the center of the data distribution (it doesn't measure the spread of the data; it just tells you which value sits at the center).
  • Calculation:
    • The IQR is derived from Q1 and Q3, focusing on the range between these quartiles.
    • The Median (Q2) is a single value representing the middle of the dataset.
  • Utility:
    • The IQR helps in understanding the distribution and identifying outliers.
    • The Median helps in understanding the central tendency of the data.

In summary, while both IQR and the median relate to the distribution of data, they serve different purposes: IQR measures the range within which the central 50% of data falls, while the median indicates the middle point of the data.

12
Q

How do you get the 25th quartile?
Don’t use this one
I’ve found an easier one

A

Interquartile range: a quartile is something divided into 4. 100 divided into 4 gives 25, 50, 75 and 100.
IQR = Q3 − Q1
1. Arrange the values you have in ascending order.
2. Find the median position.
3. Find the median position of the median 😂: if the median position is 5th, find the median of that, i.e. (5+1)/2.
The position you get, and the number it correlates with, is Q1, the lower quartile.

K is the percentile you’re interested in

13
Q

How do you get the 25th percentile from values 2236789

What’s the interpretation of the lower and upper quartiles?(applicable to the middle quartile)

A

Interpretation: if you're using weight and got, say, 20 kg for the lower quartile, the interpretation is that 25% of people had a weight below 20 kg, or 75% of people had a weight above 20 kg.

If it's the upper quartile and you got, say, 75 kg, you'd say 75% of people had a weight below 75 kg, or 25% of people had a weight above 75 kg, because the upper quartile is Q3, which is also the 75th percentile.

How to get 25th percentile:

nk/100
k is the percentile
n is number of observations

So first arrange the observations in ascending order

Then 7 × 25/100 = 1.75.
This result isn't a whole number (integer), so we round it up to the next highest integer:
2, i.e. the 2nd position (yes, the position idea applies here too).
Then pick the number in the 2nd position, which is 2. So the 25th percentile is 2.

Note that if you got 1.10, you would still round it to 2.
1.01, still round it to 2.

If the number you got was an integer, let’s say nk/100 had given us 3 or the 3rd position, we will add it to the next number in that position which is 6 and then we’ll find the average. So 3rd position which gives us 3 + next position which is gives us 6.
9/2. So we would’ve had 4.5. Then we round it to 5. So 5 is your answer (not 5th position).
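The nk/100 rule can be sketched as a small Python function (a sketch of the method described above; positions are 1-based, and the integer case returns the unrounded average of the two values):

```python
import math

def percentile_nk(values, k):
    """kth percentile using the nk/100 rule (1-based positions)."""
    data = sorted(values)          # always arrange in ascending order first
    pos = len(data) * k / 100
    if pos != int(pos):
        # not an integer: round up and take the value at that position
        return data[math.ceil(pos) - 1]
    # integer: average the values at this position and the next
    pos = int(pos)
    return (data[pos - 1] + data[pos]) / 2

print(percentile_nk([2, 2, 3, 6, 7, 8, 9], 25))  # 1.75 -> 2nd position -> 2
```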

14
Q

Find the interquartile range of the observations below:
1. 2, 4, 4, 1, 1, 5, 4, 4, 4
2. 6, 6, 2, 1, 3, 5, 6, 7, 8
A

Sorted: 1 1 2 4 4 4 4 4 5
positions 1 through 9 (so 5 is in the 9th position)

Step 1: rearrange in ascending order.
Step 2: find the median of the number set above.
So (9+1)/2 = 5, and the number in the 5th position is the median, so 4 is the median.

Now split positions 1-5 into two, and likewise 5-9, to find the lower quartile. So you find the median of the median position.
So (5+1)/2 = 3, and the value corresponding to the 3rd position is 2, so 2 is the lower quartile.

The third quartile (75th percentile) position is 3(n+1)/4 with n = 9, which gives 7.5, rounded up to 8. (You never use decimals: always round up to the nearest higher whole number. Round the result to the integer immediately larger than it to get the location of the kth percentile; for example, 3.12 rounds to 4, not 3.)
So the number corresponding to the 8th position is 4.
So IQR = Q3 − Q1.
So the IQR is 4-2 (don't subtract; it is a range, not a value, so leave the final answer as 4-2).

Or

Start by rearranging the numbers in ascending order.
Q1 = (n+1)/4; the answer is the position, which you then relate to your dataset.
Q2 is the same as finding the median of a regular set.
Q3 = 3(n+1)/4; the number this position corresponds to is the 75th percentile.

Using this on the same dataset gives the same answer: Q1 = (9+1)/4 = 2.5, rounded up to 3, and the 3rd position correlates to 2, so 2 is the lower quartile.
Then Q3 = 3(9+1)/4 = 7.5, rounded up to 8, so the number correlating to the 8th position is 4. So Q3 is 4, hence the IQR is 4-2.

This is not even funny.
I’ve found another way😂. It’s easier I guess.

112444445

Q1
9 × 25/100 = 2.25, rounded up to 3,
so the 3rd position, which is 2.

Then Q3 is 9 × 75/100 = 6.75, rounded up to 7, so the 7th position, which is 4.
So the IQR is 4-2.

15
Q

What term encompasses the methods of collecting, summarizing, analysing and drawing conclusions from the data?
What term is given to a set of recorded observations on one or more variables?
What is the term given to a characteristic (personal aspect) or attribute (how one feels, behaves, or thinks) of an individual or an organization that can be measured or observed and varies among individuals or organizations?

A

What is Statistics?
●It encompasses the methods of collecting, summarizing, analysing and drawing conclusions from the data.

●Data: is a set of recorded observations on one or more variables.
Eg. Data on demography of patients with particular illness.

● A variable is a characteristic (personal aspect) or attribute (how one feels, behaves, or thinks) of an individual or an organization that:
(a) can be measured or observed

(b) varies among individuals or organizations

16
Q

Calculate the interquartile range for these values:
662135678

2,3,4,5,7,8,12

(This is my final understanding of this thing)

A

662135678
Step 1: rearrange in ascending order.
Step 2: it's an odd number of observations, so separate them into two equal halves; the middle number is the median.
1235 6 6678
Q2, the median, is 6.
Step 3: find the mean of the two middle numbers of the first half to get Q1. (2+3)/2 gives 2.5, rounded to 3, so Q1 is 2.5 or 3 (use 2.5 or 3 depending on the possible answers).
Q1 is 2.5 or 3, not the number at the 3rd position.
Step 4: find the mean of the two middle numbers of the second half to get Q3.
(6+7)/2 = 6.5, rounded to 7.
Q3 is 6.5, or rounded to 7, not the number at the 7th position.
IQR = 7 − 3, using the quartile values themselves, not the numbers at the 7th or 3rd positions.
So 7 − 3 = 4, and the IQR is 4.

The above is for an odd number of values.
Don't use the formulas (n+1)/4 or 3(n+1)/4 for odd-numbered sets with this method.

Another example for odd numbers is 2, 3, 4, 5, 7, 8, 12.
Q2 is 5.
Q1 is 3 (because 3 is the middle number between 2 and 4).
Q3 is 8 (because 8 is between 7 and 12).

For an even number of values, rearrange in ascending order too.

1, 2, 3, 5, 6, 6, 6, 7
To find the median, separate the list into two equal halves:
1 2 3 | 5 6 | 6 6 7
There are two numbers in the middle instead of one, so find their mean: (5 + 6)/2 = 5.5, or rounded to 6, for Q2.

The next step is to find Q1.
This time there is no single median value to set aside, so you divide the numbers into two equal halves with all values included:
1 2 3 5 | 6 6 6 7
Q1 is the mean of the two middle numbers of the first half: (2 + 3)/2 = 2.5, so Q1 is 2.5.
Q3 is the mean of the two middle numbers of the second half:
(6 + 6)/2 = 6,
so Q3 is 6.

IQR = 6 - 2.5 = 3.5
On rounding: don't round at any intermediate step, and don't round your final answer either.
Whether a rounded value is expected depends on the answer options given in the question.
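The quartile steps above can be sketched in Python. This is a minimal sketch of the median-exclusive halving method the card teaches; note it is not the only quartile convention - statistical software often uses interpolation and may give slightly different values.

```python
def quartiles(data):
    """Quartiles via the halving method from the notes: sort the data,
    take the median as Q2, then take the medians of the lower and upper
    halves (the median itself is excluded from the halves when n is odd)."""
    xs = sorted(data)
    n = len(xs)

    def median(vals):
        m = len(vals)
        mid = m // 2
        if m % 2:                                # odd: single middle value
            return vals[mid]
        return (vals[mid - 1] + vals[mid]) / 2   # even: mean of middle two

    q2 = median(xs)
    if n % 2:                                    # odd n: drop the median itself
        lower, upper = xs[:n // 2], xs[n // 2 + 1:]
    else:                                        # even n: halves keep all values
        lower, upper = xs[:n // 2], xs[n // 2:]
    return median(lower), q2, median(upper)
```

For the card's first dataset this returns (2.5, 6, 6.5), giving IQR = 4; for the even-n example it returns (2.5, 5.5, 6), giving IQR = 3.5.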

17
Q

What is the shape of a normal distribution curve?
How many modes does a normal distribution curve have?
Is it symmetrical?

What does a low SD mean?
What about a high one?
What percentage of data falls within +1 and -1 SD?
What percentage of data falls within +2 and -2 SD?
What percentage of data falls within +3 and -3 SD?

A

In a normal distribution curve, the mode, median and mean are equal and lie in the same location.

It is a bell-shaped curve, unimodal (has one mode, or highest frequency) and symmetrical.

The normal distribution is determined by two parameters: the mean and the variance. The fact that the mean and the variance of the normal distribution are the natural parameters for the normal distribution explains why they are sometimes preferred as measures of location and scale.

Low SD means values are close to the mean.

High SD means values are spread out over a large range.

•68% of the data falls within +1 and -1 SD

•95% of the data falls within +2 and -2 SD

•99.7% of the data falls within +3 and -3 SD
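The 68/95/99.7 figures above can be checked against the standard normal distribution. A small sketch using the error function, which gives P(-k ≤ Z ≤ k) for a standard normal variable Z:

```python
from math import erf, sqrt

def within_k_sd(k):
    """Probability that a standard normal value lies within k standard
    deviations of the mean: P(|Z| <= k) = erf(k / sqrt(2))."""
    return erf(k / sqrt(2))

# within_k_sd(1) ~ 0.6827, within_k_sd(2) ~ 0.9545, within_k_sd(3) ~ 0.9973,
# matching the 68-95-99.7 rule.
```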

18
Q

What is a skewed distribution?
What are the two types?
Most datasets are of which type of skewed distribution?
Which type of distribution has the mean greater than the median and the median greater than the mode?

Which type has the mean less than the median and the median less than the mode (i.e. the mode is greater than the median, which is greater than the mean)?

Question 1:
Which of the following correctly describes a positively skewed distribution?

A) The tail points to the left, with most data points concentrated on the right.

B) The tail points to the right, with most data points concentrated on the left.

C) The data is evenly distributed around the mean.

D) The tail points to the right, with most data points concentrated on the right.

Question 2:
In a negatively skewed distribution:

A) The majority of data points are concentrated on the right, and the tail extends to the left.

B) The majority of data points are concentrated on the left, and the tail extends to the right.

C) The distribution has no tails and is perfectly symmetrical.

D) The distribution’s mean and median are equal.

Question 3:
Which of the following best represents the difference between a positively skewed and a negatively skewed distribution?

A) Positively skewed distributions have their tails pointing left, while negatively skewed distributions have their tails pointing right.

B) Positively skewed distributions have their tails pointing right, while negatively skewed distributions have their tails pointing left.

C) Both types of distributions have their tails pointing to the center.

D) Both types of distributions are symmetric, with tails of equal length on both sides.


Question 4:
In which type of skewed distribution would the mean typically be greater than the median?

A) Negatively skewed

B) Positively skewed

C) Symmetrical distribution

D) Bimodal distribution

A

Skewness is a measure of the symmetry of a distribution; a skewed distribution is one that is not symmetrical.

Positively skewed - the tail points to the right and most data points are at the left.
Negatively skewed - the tail points to the left and most data points are at the right.

Most real-world datasets are positively skewed.

In a skewed distribution the median, mode and mean shift apart:
they are no longer equal to each other.
Positively skewed - the mean is greater than the median, and the median is greater than the mode.
Negatively skewed - the mode is greater than the median, and the median is greater than the mean
(equivalently, the mean is less than the median, and the median is less than the mode).

The mean always sits nearest the tail. So in a positively skewed distribution the tail is at the right, the mean is at the right, the median sits behind it, and the mode is the first value.

But in a negatively skewed distribution, the tail is to the left, so the mean is at the left, followed by the median, and the mode is the last value.

  1. Positively Skewed Distribution:
    • Tail Direction: The tail extends to the right.
    • Data Concentration: Most data points are concentrated on the left (lower values).
    • Impact on Mean: The mean is sensitive to all values in the dataset, including extreme values in the tail. In a positively skewed distribution, the high values in the right tail increase the sum of the data points, pulling the mean to the right (towards the higher values). As a result, the mean becomes greater than the median.
  2. Negatively Skewed Distribution:
    • Tail Direction: The tail extends to the left.
    • Data Concentration: Most data points are concentrated on the right (higher values).
    • Impact on Mean: Similarly, the mean in a negatively skewed distribution is affected by all data points, including the low values in the left tail. These low values decrease the sum of the data points, pulling the mean to the left (towards the lower values). Consequently, the mean becomes less than the median.

Summary:

•	In a positively skewed distribution, the extreme high values in the right tail pull the mean to the right, making it greater than the median.
•	In a negatively skewed distribution, the extreme low values in the left tail pull the mean to the left, making it less than the median.

Here’s an MCQ to test your understanding of skewed distributions:

Question 1:
Which of the following correctly describes a positively skewed distribution?

A) The tail points to the left, with most data points concentrated on the right.

B) The tail points to the right, with most data points concentrated on the left.

C) The data is evenly distributed around the mean.

D) The tail points to the right, with most data points concentrated on the right.

Answer:
B) The tail points to the right, with most data points concentrated on the left.

Question 2:
In a negatively skewed distribution:

A) The majority of data points are concentrated on the right, and the tail extends to the left.

B) The majority of data points are concentrated on the left, and the tail extends to the right.

C) The distribution has no tails and is perfectly symmetrical.

D) The distribution’s mean and median are equal.

Answer:
A) The majority of data points are concentrated on the right, and the tail extends to the left.

Question 3:
Which of the following best represents the difference between a positively skewed and a negatively skewed distribution?

A) Positively skewed distributions have their tails pointing left, while negatively skewed distributions have their tails pointing right.

B) Positively skewed distributions have their tails pointing right, while negatively skewed distributions have their tails pointing left.

C) Both types of distributions have their tails pointing to the center.

D) Both types of distributions are symmetric, with tails of equal length on both sides.

Answer:
B) Positively skewed distributions have their tails pointing right, while negatively skewed distributions have their tails pointing left.

Question 4:
In which type of skewed distribution would the mean typically be greater than the median?

A) Negatively skewed

B) Positively skewed

C) Symmetrical distribution

D) Bimodal distribution

Answer:
B) Positively skewed

19
Q

When the data is normally distributed, what measures do you report?
What about when it's skewed?

A

When the data are normally distributed, report the arithmetic mean and standard deviation. (These two are grossly affected by outliers, so if the data are skewed they become inaccurate; that is why they are only used for normally distributed data.)
However, when the data are skewed, report the median and interquartile range.
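As a rough numeric check of skewness before deciding which summaries to report, one option (not from the lecture, added as an illustration) is Pearson's second skewness coefficient, 3 × (mean − median)/SD. A minimal sketch:

```python
from math import sqrt

def pearson_skew(data):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD.
    Positive when mean > median (positive/right skew), negative when
    mean < median (negative/left skew), near zero when symmetric."""
    n = len(data)
    mean = sum(data) / n
    xs = sorted(data)
    mid = n // 2
    median = xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2
    sd = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample SD
    return 3 * (mean - median) / sd
```

A clearly right-skewed sample such as [1, 2, 2, 3, 20] gives a positive value, while a symmetric sample like [1, 2, 3] gives 0.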

20
Q

Here’s an MCQ to test your understanding of what a box plot is:

Question:
Which of the following best describes a box plot?

A) A graphical representation that shows the frequency distribution of a data set.

B) A type of plot that illustrates the relationship between two continuous variables using a scatter of points.

C) A graph that depicts differences in quantitative measures between two or more subgroups, summarizing data using quartiles and highlighting potential outliers.

D) A plot that displays the cumulative frequency of a data set.

A

Answer:
C) A graph that depicts differences in quantitative measures between two or more subgroups, summarizing data using quartiles and highlighting potential outliers.

Box plots: These are graphs usually used to depict the differences in quantitative measures between two or more subgroups. A box plot also provides a summary of the data.

You use the IQR or box plots to:
- [ ] provide a summary of quantitative continuous variables
- [ ] find or compare the medians of two datasets.

21
Q

Explain the parts of the box plots

A

Vertically:
The top of the upper whisker (not the box) marks the maximum value in the dataset.
The bottom of the lower whisker marks the minimum value.
For the box itself, the line dividing the box is Q2 (the median). The edge of the box nearer the upper whisker is Q3, and the edge nearer the lower whisker is Q1.
A way to remember this: going up the box is ascending, so the bottom is the minimum and the top whisker is the maximum.

Horizontally, a box plot is still ascending: from minimum to maximum, left to right.
Vertically, it is ascending from minimum to maximum, bottom to top.

22
Q

How do you calculate outlier cut-offs?
If there are outliers in the data, don't plot them as whisker ends on the boxplot.
Any number above the upper cut-off is an outlier.
Any number below the lower cut-off is an outlier.

A

To check for outliers at the lower end: Q1 - 1.5 × IQR
To check at the upper end: Q3 + 1.5 × IQR

(Q1 - 1.5 × IQR, Q3 + 1.5 × IQR)
(Remember BODMAS in these calculations: multiply 1.5 × IQR before subtracting or adding. UCC doesn't let you use calculators.)
The answer becomes your range for plotting your box-and-whisker plot.
For example, say you get (1, 12).
This is your plotting range. So if your minimum value is 2, it is within 1-12, so it's not an outlier.
If your maximum is 11, it is within 1-12, so it's not an outlier.
Now suppose your maximum is 15 and your values are 2, 3, 4, 9, 10, 15.
You won't use 15, because 15 is bigger than 12, so it's an outlier. You'll use the next number, 10, as your maximum, and plot 10 instead of 15. Then you'll mark 15 separately as an outlier (the way outliers appear on boxplots).

Do this outlier calculation after getting your minimum and maximum values from the data.
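The fence calculation above can be sketched in Python. The quartile values in the usage test (Q1 = 4, Q3 = 7) are hypothetical, chosen so that the card's example value 15 falls outside the upper fence:

```python
def outlier_fences(q1, q3):
    """Tukey fences: values outside (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
    are treated as outliers. Multiply first (BODMAS), then add/subtract."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def split_outliers(data, q1, q3):
    """Separate values that fall inside the fences (plotted on the
    whiskers) from those outside (marked separately as outliers)."""
    lo, hi = outlier_fences(q1, q3)
    inside = [x for x in data if lo <= x <= hi]
    outliers = [x for x in data if x < lo or x > hi]
    return inside, outliers
```

With data 2, 3, 4, 9, 10, 15 and hypothetical quartiles Q1 = 4, Q3 = 7, the fences are (-0.5, 11.5); 15 is flagged as an outlier and the upper whisker ends at 10, matching the card's example.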

23
Q

You checked the weights of people and got 80 as Q3 and 50 as Q1.
The IQR is 30.
Interpret the IQR, Q1 and Q3 values.

A

Q3 is also called the 75th percentile. So if you checked people's weights and Q3 was 80, it means 75% of the people had a weight of 80 kg or below.

For Q1 of 50, we can say 25% of the people had a weight of 50 kg or below, and the rest weighed above 50 kg.

IQR = 80 - 50 = 30 kg.
So the middle 50% of the people's weights are spread over a range of 30 kg, from 50 kg to 80 kg.
In other words, 50% of the people have their weights within a 30 kg band,
between 50 kg and 80 kg.

Given:

  • Q1 (First Quartile) = 50
  • Q3 (Third Quartile) = 80
  • IQR (Interquartile Range) = 30

Here’s how to interpret these values:

  1. Quartile Values:
    • Q1 (50): This value represents the 25th percentile of the data. It means that 25% of the people in your dataset have weights less than or equal to 50 kg.
    • Q3 (80): This value represents the 75th percentile of the data. It means that 75% of the people in your dataset have weights less than or equal to 80 kg.
  2. Interquartile Range (IQR):
    • IQR (30): This is the difference between Q3 and Q1, calculated as IQR = Q3 - Q1 = 80 - 50 = 30.
      The IQR measures the spread of the middle 50% of the data. In this case, the weights of the middle 50% of people fall within a range of 30 kg.
  • Q1 = 50 kg means 25% of the people weigh 50 kg or less.
  • Q3 = 80 kg means 75% of the people weigh 80 kg or less.
  • The middle 50% of the people have weights ranging from 50 kg to 80 kg, with a spread of 30 kg.

This range gives you an idea of the typical weight distribution and helps identify where most of the data points lie.

Breaking down the same interpretation once more:

  • Q1 (50 kg): the first quartile is the 25th percentile. 25% of the people in the dataset have a weight less than or equal to 50 kg, which also means 75% of the people have a weight greater than 50 kg.
  • Q3 (80 kg): the third quartile is the 75th percentile. 75% of the people have a weight less than or equal to 80 kg, which also means 25% of the people have a weight greater than 80 kg.
  • IQR = Q3 - Q1 = 80 - 50 = 30: the IQR measures the spread of the middle 50% of the data. The middle 50% of the people have weights between 50 kg and 80 kg, and an IQR of 30 indicates a moderate spread in the middle half of the weights.

The IQR is useful for identifying the range within which the central bulk of the data lies, and it helps in understanding the spread or dispersion of the data around the median.

No, there is no Q4 in the context of quartiles. Here's why:

  1. Q1 (First Quartile): the 25th percentile. It marks the point below which 25% of the data falls.
  2. Q2 (Second Quartile): the 50th percentile, also known as the median. It divides the data into two equal halves, with 50% of the data below and 50% above.
  3. Q3 (Third Quartile): the 75th percentile. It marks the point below which 75% of the data falls.

The quartiles split the data into four equal parts, but when discussing specific quartiles we refer only to Q1, Q2 and Q3, because they are the three cut points:
  • Q1: 25th percentile
  • Q2: 50th percentile (median)
  • Q3: 75th percentile
The portion of the data between Q3 (75th percentile) and the maximum value isn't labelled Q4; it is just referred to as the top 25% of the data. Q1, Q2 and Q3 are the three main quartiles, and Q3 is the last of them.
24
Q

Statistics is the collection, organization, analysis and interpretation of numerical data.

Applying this to the biological and health sciences is biostatistics.

Outcome - dependent variable
Independent variable - the exposure or predictor (the variable used to explain or predict the outcome)

Ways to get descriptive stats:
Using numerical summary measures - a concise quantitative statement used to describe the features of a dataset
Using graphs
Using tables
(This is from the first slide on descriptive stats)

Categorical data - summarize with percentages
Numerical data - summarize with the mean

You have to know the kind of data before applying any technique so you can apply the right technique

Types of mean:
Geometric mean
Harmonic mean
Arithmetic mean

Median: (n + 1)/2 gives the POSITION of the median value, not the value itself.

Example 1 (odd n): 2, 4, 1, 2, 3, 3, 1
Example 2 (even n): 6, 5, 2, 4, 3, 2

Sorted: 1, 1, 2, 2, 3, 3, 4
(7 + 1)/2
= 8/2
= 4
So the median is at the 4th position.
The value at the 4th position is 2, so the median is 2.

Sorted: 2, 2, 3, 4, 5, 6
(6 + 1)/2
= 7/2
= 3.5, i.e. the 3.5th position
So the median is the mean of the values at the 3rd and 4th positions: (3 + 4)/2 = 3.5.

Range: write it as min-max.
Don't subtract.

1, 2, 3, 3, 3, 5, 40

Instead of doing 40 - 1 to get 39,
you'll just state the range as 1-40,
because this is more informative than just giving the result of the subtraction.
IQR - likewise, don't subtract; give it as a range, as done for the range. For example, if Q3 is 3 and Q1 is 2, you won't report 3 - 2 = 1.
You'll just give it as 2-3.

The IQR runs from Q1 to Q3.

What is an integer?
A negative or positive whole number, including zero.

For percentile positions: round UP to the next whole number if the position is not an integer.
For example, 3.1 becomes the 4th position,
and 5.3 becomes the 6th position,
and you take the value at that position.

If you get an integer:
You'll find the mean of the value at that position and the value at the next position.
The position comes from nk/100, where n is the number of observations
and k is the percentile.

So if nk/100 gives the 6th position,
you'll compute (value at 6th position + value at 7th position)/2.

Or if nk/100 gives the 3rd position,
you'll compute (value at 3rd position + value at 4th position)/2.

Know when to apply the different calculations. He said he won't bring calculations.
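The two percentile-position rules above (round up when nk/100 is not an integer; average two neighbouring values when it is) can be sketched as:

```python
import math

def percentile(data, k):
    """k-th percentile using the nk/100 position rule from the notes:
    if nk/100 is not an integer, round UP and take that position's value;
    if it is an integer m, average the values at positions m and m + 1.
    Positions are 1-based, as in the notes."""
    xs = sorted(data)
    pos = len(xs) * k / 100
    if pos != int(pos):                  # not an integer: round up
        return xs[math.ceil(pos) - 1]
    m = int(pos)                         # integer: average positions m and m+1
    return (xs[m - 1] + xs[m]) / 2
```

For the data 1 through 10: the 25th percentile gives position 2.5, rounded up to the 3rd value (3); the 50th percentile gives the integer position 5, so the result is the mean of the 5th and 6th values, (5 + 6)/2 = 5.5, which matches the median rule for an even n.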

Variance together with standard deviation.
Standard deviation is square root of variance.
How to calculate variance

When given data, check the distribution first (whether it's skewed or not) before you calculate the mean.
If it's not skewed, calculate the mean.

You use the IQR or box plots to:
- [ ] provide a summary of quantitative continuous variables
- [ ] find or compare the medians of two datasets.

The height of the box is the IQR

CLOSED CIRCLES ARE SHADED
THE ONES THAT AREN'T SHADED ARE OPEN CIRCLES

Question:
The following values are normally distributed. From the boxplot, the median is 5.
What is the mean?
Ans: the mean is 5,
because for a normal distribution,
mean = median = mode

25
Variance gives you a measure of spread in squared units. Standard deviation gives you a measure of spread in the same units as the data, making it more interpretable.
Variance tells you how much the data points vary, in squared units. Standard deviation tells you that, on average, the data points are a certain distance from the mean (in this example, 2.83 units from a mean of 6).
The standard deviation is the square root of the variance. It brings the value back to the same units as the data, making it easier to interpret. It tells us the average distance of the data points from the mean.
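A minimal sketch of these two measures. The dataset 2, 4, 6, 8, 10 is hypothetical, chosen because it reproduces the card's "2.83 units from a mean of 6" when dividing by n (population form) rather than n − 1:

```python
from math import sqrt

def variance(data, sample=True):
    """Average squared deviation from the mean, in squared units.
    Divides by n - 1 for a sample (degrees of freedom) or by n
    for a whole population."""
    n = len(data)
    m = sum(data) / n
    ss = sum((x - m) ** 2 for x in data)
    return ss / (n - 1) if sample else ss / n

def std_dev(data, sample=True):
    """Square root of the variance: spread in the same units as the data."""
    return sqrt(variance(data, sample))
```

For [2, 4, 6, 8, 10] the mean is 6, the population variance is 8 and the population SD is about 2.83; the sample variance (dividing by n − 1 = 4) is 10.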
26
GEM 301: How to use Stata. Stata has four windows and two behind-the-scenes windows. State them.
Results window: displays the output of typed Stata commands.
Review window: displays the recent or current commands for quick access. Commands can be modified here.
Variables window: displays the data variables.
Command window: where Stata commands are typed, one command at a time - you can't execute multiple commands at once. The clear command erases the data in memory.
Stata is case sensitive. So if your variable is saved as Education and you type education in a command, you'll get an error message (in red). You must also type all Stata commands in lowercase. Stata accepts abbreviations.
Behind the scenes:
Do-file: a file containing a set of Stata commands which can be executed together, so you can type all commands and run them at the same time.
Log file: a file containing all the past and current results displayed by Stata.
Type browse in the command window to see the data values of the variables.
27
GEM 301: The geometric mean is the same as the arithmetic mean on a log scale. The geometric mean is for skewed data.
Stata's mean command computes the arithmetic mean on the original (raw) data, not the log-transformed data. If you want to compute a geometric mean in Stata (i.e., a mean on the log scale), you have to manually log-transform the data first.
The geometric mean is better for skewed data, especially when the data values multiply or grow rapidly (like bacterial counts, income, or viral load).
One correction to a common misunderstanding: you don't exponentiate each log number. Instead, you:
1. Take the log of each number
2. Take the average (mean) of those log numbers
3. Exponentiate only that final average
Step-by-step example with the numbers 2, 8 and 32:
Step 1: Take the log (natural log or log base 10 - here, log base 10): log(2) = 0.301, log(8) = 0.903, log(32) = 1.505
Step 2: Take the mean of the logs: (0.301 + 0.903 + 1.505)/3 = 0.903
Step 3: Exponentiate that mean (undo the log): 10^0.903 = 8
So the geometric mean = 8.
Summary: take the log of each number, take the mean of those logs, then take the exponent (antilog) of that mean - that's the geometric mean.
Why not exponentiate each one? Because that would just give you back the original numbers and defeat the purpose of averaging the log values, which is where the handling of skewness happens.
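The three steps can be sketched in a few lines. Natural logs are used here, but base 10 gives the same result, since the final exponentiation undoes whichever log was taken:

```python
from math import exp, log

def geometric_mean(data):
    """Log each value, average the logs, then exponentiate that single
    average (not each log individually) - the antilog of the mean log."""
    mean_log = sum(log(x) for x in data) / len(data)
    return exp(mean_log)
```

geometric_mean([2, 8, 32]) reproduces the card's answer of 8.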
28
A distribution can be bimodal (two modes) or multimodal (more than two). You can get more than one mode.
In biostatistics, knowing whether a distribution is unimodal, bimodal or multimodal helps you understand whether:
• There are subgroups in your population (e.g., men vs women, diseased vs non-diseased)
• There might be different underlying processes affecting the data

Why the mode is important in biostatistics:
1. The mode identifies the most common or typical outcome. In clinical or survey data, sometimes you care not about the average but about what's most common. Example: blood types A, A, B, O, A, AB, B, O, A - the mode is A, the most common blood type; an arithmetic mean is meaningless here. Key idea: when data is categorical, the mode is the only measure of central tendency you can use.
2. The mode reveals subgroups or clusters (bimodal or multimodal distributions). In continuous data, a bimodal distribution might mean your population isn't homogeneous. Example: cholesterol levels with two peaks, one around 170 mg/dL and one around 240 mg/dL, could indicate one healthy group and one group with hypercholesterolaemia. Key idea: multiple modes may reveal hidden patterns or distinct clinical subgroups.
3. The mode is useful when the mean and median are misleading (skewed data or outliers). The mean gets distorted by outliers, while the mode stays stable because it reflects frequency, not magnitude. Example: in a rural area most people might earn $200/month (the mode), but a few rich people push the mean to $800/month. The mode shows what's typical, not what's distorted. Key idea: the mode is robust when outliers or skew make the mean and median misleading.
4. The mode is helpful for survey and clinical questionnaire data. In Likert-scale data (Strongly Disagree to Strongly Agree), the mode tells you which option most people chose and whether participants generally lean toward positive or negative responses. Example: if 60% of respondents select "Agree", the mode is "Agree" - a quick summary of public opinion or patient satisfaction.
5. The mode is used in pattern recognition and image analysis in biomedical research. In medical imaging, modes can represent the most common pixel intensity, helping with image segmentation or tissue identification.

Summary - why the mode matters in biostatistics:
• Works for categorical data (mean and median don't apply to non-numbers)
• Detects subgroups (bimodal/multimodal patterns show distinct populations)
• Unaffected by outliers, unlike the mean
• Describes the most frequent outcome (clinically meaningful, e.g. the most common disease symptom)
• Useful in survey analysis (shows the popular opinion or behaviour)
29
For skewed data, the median is often preferred because it is not affected by extreme values (outliers), unlike the mean.

One clarification: it is not quite right to say "observations cluster in the middle" for skewed data. In skewed data, most observations cluster on one side (not necessarily in the middle), and a few extreme values pull the mean away.

Refined version: in skewed data, most values cluster on one side and outliers pull the mean, so the median is a better measure of central tendency.

Example - monthly incomes from a small group: $200, $300, $250, $350, $400, $6000
• Mean = (200 + 300 + 250 + 350 + 400 + 6000)/6 = $1250
• Median = middle of the ordered list (200, 250, 300, 350, 400, 6000) = (300 + 350)/2 = $325
See how the mean is distorted by that $6000 outlier? The median ($325) gives a better sense of a "typical" person's income.

Summary:
• Mean: pulled by outliers and extreme values.
• Median: resistant to outliers; better for the typical value.
• Mode: shows the most frequent value; useful for categorical or multimodal data.
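The income example can be verified with a few lines of Python:

```python
def mean(xs):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the sorted list (mean of the middle two if n is even)."""
    ys = sorted(xs)
    n = len(ys)
    mid = n // 2
    return ys[mid] if n % 2 else (ys[mid - 1] + ys[mid]) / 2

incomes = [200, 300, 250, 350, 400, 6000]
# The single $6000 outlier drags the mean far above the typical income,
# while the median stays in the middle of the bulk of the data.
```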
30
How do you assess normality after a log transformation?
Why assess normality on the log scale? In biostatistics, many statistical tests (like t-tests and linear regression) assume normally distributed data. But real-world data is often skewed, especially things like income, viral load, bacterial counts, biomarker levels and reaction times - these are often positively skewed (long right tail). To fix this, we take the log of the data, which can make the distribution more symmetric (closer to normal).
How to assess normality on the log scale:
Step 1: Apply a log transformation, using the natural log or log base 10. In Stata: generate log_var = log(original_var)
Step 2: Check normality on the transformed variable (not the original). In Stata you can use:
• A histogram: histogram log_var, normal
• A Q-Q plot: qnorm log_var
• The Shapiro-Wilk test: swilk log_var
If p > 0.05, the data is not significantly different from normal, so you can assume normality. If p < 0.05, the data is still non-normal.
Summary: apply log() to the variable; use a histogram, qnorm or swilk to assess normality of the log-transformed data; if the log-transformed data is normal, proceed with statistical tests that assume normality.
Bonus tip: if the log transformation doesn't help, you can try a square-root transformation, a Box-Cox transformation, or nonparametric tests that don't assume normality (e.g. the Mann-Whitney U test instead of a t-test).
31
When you "add the log," you are doing a log transformation: you replace each original value with its logarithm (base 10 or natural log).
Note: an exam question that says "find the appropriate mean" is different from one that says "find the appropriate numerical summary measure."
When skewed data is log-transformed, it often becomes approximately normal.

When data is skewed, using the arithmetic mean directly can be misleading, so you often transform the data (e.g. with a log transformation) to reduce the skew before summarizing it with a mean or applying tests that assume normality.
Why? The arithmetic mean is sensitive to extreme values (outliers). In skewed data, those outliers pull the mean away from the "typical" value. So you transform the data (e.g. with a log or square root) to reduce skew, stabilize variance and make the data look more like a normal distribution - then the mean becomes more representative and valid to use.

Example - viral loads in patients (copies/mL): 500, 700, 850, 900, 1,200, 65,000
• Arithmetic mean ≈ 11,525 (very misleading because of the outlier)
• Log-transform the data: log(500), log(700), ..., log(65000) gives a more symmetric distribution. Now you can compute the mean of the log values, then exponentiate that mean to get the geometric mean.

Final takeaway: when data is skewed, don't jump straight to the arithmetic mean. Transform it first, check whether the distribution improves (becomes roughly normal), then apply the mean or statistical tests that require normality.
32
What are standard deviation and variance as numerical summary measures (measures of dispersion)? What is a degree of freedom?
Variance and Standard Deviation
• The variance quantifies the amount of spread (or variability) of observations around their mean.
• For n observations, the variance is calculated by subtracting the mean of the observations from each observation, squaring the deviations, adding the squared deviations up, and then dividing the result by (n - 1).
• The standard deviation is simply the square root of the variance. The SD is used to measure how much deviation exists in the distribution.

n - 1 is the degrees of freedom.
Degrees of freedom: how many values you can pick freely before the rest are "stuck" following a rule - that is, how many numbers you're allowed to choose freely before a rule or formula forces the rest to follow.
Analogy: you and your friends are picking numbers that must add up to a certain total. Everyone can choose freely except the last person, who has to pick the number that makes the total work. Only the earlier people had freedom to choose.
Degrees of freedom = total number of values − number of constraints (rules).

Example with the mean: say you have 5 test scores and you know the mean is 60. You can pick the 1st, 2nd, 3rd and 4th scores freely, but the 5th score is not free - it has to make the mean come out to 60. So only 4 scores were free to vary. You could have picked 2, 3, 10 and 2, but the fifth number must be whatever makes the mean 60.
So the degrees of freedom is how many numbers you can pick before the rest are stuck following a rule.

Why not n − 2? You subtract one degree of freedom per parameter you estimate:
• For the sample standard deviation you estimate 1 parameter (the mean), so df = n − 1.
• In simple linear regression you estimate 2 parameters, the intercept (a) and slope (b), so df = n − 2.
• In more complex models, you subtract as many degrees as the number of parameters estimated.
Can df be more than n − 1? No. In general, df = n − (number of estimated parameters).
That's why, here: degrees of freedom = 5 (scores) − 1 (mean rule) = 4.
33
Define the following and give examples: 5. Chi-square 6. Regression 7. Correlation 8. Non-parametric methods.
Here’s an in-depth explanation of the following statistical concepts: Chi-square, Regression, Correlation, and Non-parametric methods — including definitions, assumptions, applications, why they matter, and clear examples. ⸻ 🧠 5. Chi-Square Test (χ²) 🔍 What It Is: The chi-square test is a statistical method used to determine whether there is a significant association between categorical variables or whether observed frequencies differ from expected ones. There are two common types: • Chi-square test of independence (association between two categorical variables) • Chi-square goodness-of-fit test (how well observed data fit a specific distribution) ⸻ 📌 Assumptions: • The data are counts or frequencies (not percentages or means). • Categories are mutually exclusive and exhaustive. • Expected frequency in each cell ≥ 5 (for validity). • Observations must be independent (one subject can’t appear in more than one cell). ⸻ 🎯 Applications: • Testing whether gender is related to voting preference. • Checking if a die is fair (goodness-of-fit). • Evaluating whether treatment success rates differ between clinics. ⸻ ❓Why It Matters: Chi-square tests are essential when dealing with categorical data, which is very common in fields like public health, psychology, biology, and market research. They help you assess if differences between groups are statistically significant or just due to random variation. ⸻ 📈 6. Regression Analysis 🔍 What It Is: Regression is used to model the relationship between one dependent variable (Y) and one or more independent variables (X). It helps you predict values and understand how changes in X affect Y. • Simple linear regression: 1 predictor (X), 1 outcome (Y) • Multiple linear regression: multiple predictors (X₁, X₂, …) The basic form: Y = a + bX + ε (where a is the intercept, b is the slope, and ε is the error) ⸻ 📌 Assumptions (Linear Regression): • Linearity: The relationship between X and Y is linear. • Independence of errors. 
• Homoscedasticity: Constant variance of errors. • Normality of residuals. • No multicollinearity (in multiple regression). ⸻ 🎯 Applications: • Predicting a person’s weight based on their height. • Estimating house prices based on number of rooms, location, etc. • Forecasting sales based on advertising spend. ⸻ ❓Why It Matters: Regression tells you how variables are related, helps with prediction, and is fundamental in many fields like economics, medicine, social sciences, and machine learning. It’s also the basis for causal inference when properly applied. ⸻ 🔗 7. Correlation 🔍 What It Is: Correlation measures the strength and direction of a linear relationship between two continuous variables. • Value ranges from -1 to +1: • +1: perfect positive relationship • -1: perfect negative relationship • 0: no linear relationship Common measure: Pearson’s correlation coefficient (r) ⸻ 📌 Assumptions (Pearson’s r): • Both variables are quantitative and normally distributed. • Linear relationship between variables. • No significant outliers. • Homoscedasticity (equal spread of data). ⸻ 🎯 Applications: • Relationship between hours studied and exam score. • Association between blood pressure and cholesterol. • Exploring risk factor relationships in epidemiology. ⸻ ❓Why It Matters: Correlation helps in exploratory data analysis by identifying whether variables are linked. It’s often the first step before deeper modeling (e.g., regression), and it guides researchers in spotting patterns or relationships worth investigating further. ✳️ Note: Correlation ≠ Causation! Just because two things move together doesn’t mean one causes the other. ⸻ 📊 8. Non-Parametric Methods 🔍 What They Are: Non-parametric tests are statistical tests that don’t assume a specific distribution (like normality) of the data. They’re used when data violates parametric assumptions or when dealing with ordinal or ranked data. 
Common types: • Mann–Whitney U test (independent samples) • Wilcoxon signed-rank test (paired samples) • Kruskal–Wallis test (multiple groups) • Spearman’s rank correlation (non-parametric version of Pearson’s) ⸻ 📌 Assumptions: • Do not require normal distribution. • Often work with medians and ranks rather than means. • Require independent observations (most tests). • For ordinal data or when data is skewed or has outliers. ⸻ 🎯 Applications: • Comparing pain levels between two treatments using a Likert scale. • Analyzing customer satisfaction rankings. • Evaluating median income between groups when data is skewed. ⸻ ❓Why They Matter: Non-parametric methods are robust and flexible, making them ideal for: • Small sample sizes • Ordinal/ranked data • Data that violates normality They allow you to still draw valid conclusions when parametric methods aren’t appropriate. ⸻ 🧠 SUMMARY TABLE
• Chi-Square: tests association between categories; assumes categorical data with expected frequency ≥ 5; important for testing independence or goodness-of-fit.
• Regression: predicts or explains an outcome (Y); assumes linearity, homoscedasticity, normal errors; quantifies relationships and predicts.
• Correlation: measures strength of association; assumes linearity and no significant outliers; detects linear patterns in data.
• Non-parametric: compares medians, ranks, or distributions; makes few or no assumptions about the distribution; a robust alternative to parametric tests.
⸻ ✅ 1. What Is the Chi-Square Goodness-of-Fit Test? 🔍 Simple Meaning: It answers the question: “Does my observed data match what I expected?” Imagine you expect something to follow a certain pattern or distribution. The goodness-of-fit test checks whether the data you actually observe fits that expected pattern or deviates significantly.
⸻ 📊 Public Health Example: Suppose a health organization expects 25% of the population to be in each of 4 BMI categories:
• Underweight: 25% expected
• Normal weight: 25% expected
• Overweight: 25% expected
• Obese: 25% expected
You collect actual data from 400 people and see:
• Underweight: 80 observed
• Normal weight: 100 observed
• Overweight: 90 observed
• Obese: 130 observed
The chi-square goodness-of-fit test checks: Are these differences due to chance, or is the actual distribution truly different from what we expected? ⸻ ✅ 2. What Is Homoscedasticity? 🔍 Simple Meaning: It means that the spread (variance) of the errors is equal across all levels of your independent variable in regression. Imagine predicting blood pressure based on age. Homoscedasticity means that for all age groups, the spread of blood pressure values around the regression line should be roughly the same. ⸻ ❌ When It Fails (Heteroscedasticity): If the spread increases or decreases as age increases, that’s heteroscedasticity — and it can make your regression results unreliable. ⸻ 📊 Public Health Example: In a study of income vs. medical expenses, lower-income people might have very consistent expenses (low variability), but higher-income individuals might have wildly different spending habits (high variability). That’s heteroscedasticity, and it can bias your conclusions. ⸻ ✅ 3. What Is Multicollinearity? 🔍 Simple Meaning: In multiple regression, multicollinearity happens when your independent variables are highly correlated with each other, which confuses the model. It’s like asking: “Which of these variables is driving the effect?” But you can’t tell because they’re too similar. ⸻ 📊 Public Health Example: You’re predicting heart disease risk using: • Blood pressure • Age • Cholesterol • Weight • BMI But BMI and weight are strongly related — they both contain the same information. This overlap makes it hard to know which one is actually affecting the outcome.
⸻ ❗Why It’s Bad: Multicollinearity makes: • Your regression coefficients unstable • Interpretation unclear • Model’s predictive power weaker ⸻ ✅ 4. Spearman’s Rank Correlation 🔍 Simple Meaning: It’s like Pearson’s correlation, but for ranked or non-normal data. Use it when: • Data is ordinal (e.g., Likert scale: 1–5) • Variables aren’t normally distributed • Relationship is monotonic (as one increases, the other tends to increase or decrease, not necessarily in a straight line) ⸻ 💡 How It Works: You rank each value in both variables, then check how closely their ranks agree. ⸻ 📊 Public Health Example: Say you’re studying the relationship between: • Perceived stress levels (rated 1–5) • Sleep quality (rated 1–5) Since both are ordinal, use Spearman’s rank. You might find: Higher stress → lower sleep quality, but not in a perfectly linear way. Spearman’s rank still captures that trend. ⸻ 🧠 Public Health Examples for All 4 Concepts:
• Chi-square (goodness-of-fit): Is the distribution of vaccination types across regions what we expect?
• Regression: Predicting number of ER visits based on age, income, and smoking status.
• Correlation (Pearson/Spearman): Relationship between exercise frequency and BMI (Pearson) OR stress level and mood (Spearman).
• Non-parametric tests: Comparing median waiting times in clinics with small sample sizes or ranked satisfaction scores.
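The goodness-of-fit calculation described in this card can be sketched in a few lines of Python (standard library only). The observed and expected counts are the hypothetical BMI figures from the example; the critical value quoted in the comment is the standard 5% chi-square cutoff for 3 degrees of freedom.

```python
# Chi-square goodness-of-fit for the BMI example (hypothetical data):
# 400 people, expected 25% (i.e. 100) in each of 4 categories.
observed = [80, 100, 90, 130]
expected = [100, 100, 100, 100]

# Chi-square statistic: sum over categories of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)  # 14.0

# With 4 - 1 = 3 degrees of freedom, the 5% critical value is about 7.81,
# so 14.0 > 7.81: the observed distribution differs significantly from expected.
```

In practice a library routine such as scipy.stats.chisquare computes the same statistic along with a p-value.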
34
What are the desirable properties of a confidence interval? How do sample size and the level of confidence (e.g., 90%, 95%, 99%) affect the width of a confidence interval?
A confidence interval is a statistical tool used to estimate the range in which a population parameter (like a mean) is likely to fall. Unlike a point estimate, a confidence interval gives a range of possible values for the true population parameter and reflects the uncertainty in the estimate. For example, if the confidence interval is from 45 to 55, this range is calculated using a confidence level, such as 95%, and this tells us how sure we are that the true value lies within that range. In this case, a 95% confidence level means that if the sampling was repeated many times, about 95 out of 100 of those intervals would contain the true value. The confidence level is the percentage, while the confidence interval is the range of values in which you expect the population parameter to be found. The desirable properties of a confidence interval are: 1. It gives the correct level of confidence. A 95% confidence level means that if many samples were taken and the interval was calculated each time, about 95 out of 100 intervals would contain the true population mean. This property is called correct coverage. 2. It should be precise (not too wide). A narrower interval gives a more exact idea of where the true value lies. For example, an interval from 5.0 to 7.0 is more helpful than one from 4.0 to 8.0, as it narrows down the possibilities. However, narrower intervals are only useful if they give a reasonable level of confidence. 3. It shows the balance between precision and confidence. If we want more confidence (like 99%), we need a wider interval to be more certain the true value is included. On the other hand, if we accept a lower confidence level (like 90%), we get a narrower interval, but we’re less sure it contains the true value. So, higher confidence = wider interval, and lower confidence = narrower interval. 4. It is based on correct statistical assumptions.
If we know the population standard deviation, we use the normal distribution. If we don’t know it, we use the t-distribution, which gives slightly wider intervals to account for extra uncertainty. Using the correct formula helps make the interval more reliable. How Sample Size Affects the Width of the Confidence Interval: The sample size (n) has a big effect on the width of a confidence interval. A larger sample size makes the standard error smaller, which leads to a narrower confidence interval. This means we get a more precise estimate of the true population value than we would with a smaller sample, which reduces uncertainty. For example: • With a sample size of 25, an interval might be [4.00, 7.92]. The width of the confidence interval is 3.92. • Increasing the sample size to 100, the interval becomes [4.98, 6.94]. The width of the confidence interval is 1.96. So, when the sample size increases by a factor of 4, the interval becomes about half as wide. More data gives us more accuracy. Why this works: the width of a confidence interval is largely determined by the standard error, SE = σ/√n. If the sample size n increases 4 times, then √n increases by a factor of √4 = 2. This means the standard error is cut in half, and since the confidence interval is based on the standard error, the width of the interval also decreases by about half (assuming everything else stays the same, like variability and confidence level).
How the Confidence Level Affects the Width of the Confidence Interval: The confidence level also changes the width of the interval. A higher confidence level (like 99%) gives a wider interval because we want to be more certain the true value is included. A lower confidence level (like 90%) gives a narrower interval, but we are less confident that it includes the true value. A wider confidence interval cannot be more precise, because precision means how narrow and specific an estimate is: a wider interval covers more possible values, so it is less precise, while a narrower interval is more precise because it gives a smaller range. For example: • A 99% interval might be [3.38, 8.54], which is wide but very confident. • A 90% interval might be [4.31, 7.50], which is narrow but less confident.
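Both effects (sample size and confidence level) can be verified numerically. This is a minimal sketch assuming a known population standard deviation of 5 (a hypothetical value chosen so the widths match the n = 25 and n = 100 examples above) and the usual z-multipliers 1.645, 1.96, and 2.576 for 90%, 95%, and 99% confidence.

```python
import math

def ci_width(n, sigma, z):
    """Width of a z-based confidence interval: 2 * z * sigma / sqrt(n)."""
    return 2 * z * sigma / math.sqrt(n)

sigma = 5.0  # hypothetical known population standard deviation

# Effect of sample size (95% confidence, z = 1.96):
print(round(ci_width(25, sigma, 1.96), 2))   # 3.92
print(round(ci_width(100, sigma, 1.96), 2))  # 1.96  (4x the sample, half the width)

# Effect of confidence level (fixed n = 25):
print(round(ci_width(25, sigma, 1.645), 2))  # 90% confidence: narrower interval
print(round(ci_width(25, sigma, 2.576), 2))  # 99% confidence: wider interval
```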
35
Flashcard-style questions (front side only) on independent vs mutually exclusive events and basic probability concepts. ⸻ 🎓 PROBABILITY FLASHCARDS (Questions First): 1. What does it mean when two events are independent? 2. What is the probability formula for independent events occurring together? 3. What does it mean when two events are mutually exclusive? 4. What is the probability of two mutually exclusive events both happening? 5. What is the probability formula for the union of mutually exclusive events? 6. Can mutually exclusive events also be independent? 7. What is the probability formula for dependent events? 8. If Event A has no effect on Event B, what type of relationship is that? 9. If P(A ∩ B) = P(A) × P(B), what type of events are A and B? 10. If P(A ∩ B) = 0, what type of events are A and B? 11. Give a real-life example of independent events. 12. Give a real-life example of mutually exclusive events. 13. What does “intersection” of two events mean in probability? 14. What does “union” of two events mean in probability? 15. What does it mean when P(A ∩ B) ≠ P(A) × P(B)?
Answers (back of the flashcards) for the first 15 questions — all related to independent events, mutually exclusive events, and basic probability concepts: ⸻ 🎓 PROBABILITY FLASHCARDS (ANSWERS) 1. 🔹 Two events are independent if the outcome of one does not affect the outcome of the other. 📌 Example: Tossing a coin and rolling a die. ⸻ 2. 🔹 For independent events A and B, the probability they both occur is: P(A ∩ B) = P(A) × P(B) ⸻ 3. 🔹 Two events are mutually exclusive if they cannot happen at the same time. 📌 Example: Getting heads and tails on the same coin toss. ⸻ 4. 🔹 If two events are mutually exclusive, the probability they both occur is: P(A ∩ B) = 0 ⸻ 5. 🔹 For mutually exclusive events A and B, the probability that A or B occurs is: P(A ∪ B) = P(A) + P(B) ⸻ 6. 🔹 No — mutually exclusive events cannot be independent, because if one happens, the other definitely can’t. ⸻ 7. 🔹 For dependent events A and B, the probability they both happen is: P(A ∩ B) = P(A) × P(B|A) ⸻ 8. 🔹 If Event A doesn’t affect Event B, they are independent. ⸻ 9. 🔹 If P(A ∩ B) = P(A) × P(B), then A and B are independent. ⸻ 10. 🔹 If P(A ∩ B) = 0, A and B are mutually exclusive. ⸻ 11. 🔹 Example of independent events: Tossing a coin and getting a 6 on a die — one does not affect the other. ⸻ 12. 🔹 Example of mutually exclusive events: Rolling a die and getting either a 3 or a 5 — you can’t get both on one roll. ⸻ 13. 🔹 The intersection (A ∩ B) means both A and B happen at the same time. ⸻ 14. 🔹 The union (A ∪ B) means either A happens, or B happens, or both. ⸻ 15. 🔹 If P(A ∩ B) ≠ P(A) × P(B), A and B are not independent — they are dependent. ⸻ ✅ Key Distinction: Independent events mean two events can occur at the same time, but one doesn’t influence the other.
Mutually exclusive events can’t occur at the same time — if one happens, the other definitely does not. ⸻ 🔁 Let’s use some examples to seal it in: 🔹 Independent Events: • A = It rains today. • B = You pass your statistics test. Even if both happen on the same day, the rain doesn’t affect your score — they are independent. So: P(A ∩ B) = P(A) × P(B) ⸻ 🔹 Mutually Exclusive Events: • A = You get a distinction in your exam. • B = You fail the same exam. You can’t get a distinction and fail the same exam. If one happens, the other definitely cannot. So: P(A ∩ B) = 0 and P(A ∪ B) = P(A) + P(B) ⸻ 💡 Key Insight:
• Independent events: can happen together (✅), but do not affect each other.
• Mutually exclusive events: cannot happen together (❌); one prevents the other.
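Both definitions can be checked exactly by enumerating a sample space. A small sketch using exact fractions, with hypothetical die events chosen so that one pair is independent and another pair is mutually exclusive:

```python
from fractions import Fraction

space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die

def prob(event):
    """Exact probability of an event (a subset of the sample space)."""
    return Fraction(len(event & space), len(space))

# Independent pair: A = even number, B = number <= 4 (they overlap on {2, 4})
A, B = {2, 4, 6}, {1, 2, 3, 4}
print(prob(A & B) == prob(A) * prob(B))  # True:  independent
print(prob(A & B) == 0)                  # False: not mutually exclusive

# Mutually exclusive pair: C = roll a 3, D = roll a 5 (cannot both happen)
C, D = {3}, {5}
print(prob(C & D) == 0)                  # True:  mutually exclusive
print(prob(C & D) == prob(C) * prob(D))  # False: hence not independent
```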
36
Additional flashcard-style questions to cover the rest of the key probability topics. ⸻ 🧠 FLASHCARDS — NEW TOPICS: 🔹 Conditional Probability 16. What is conditional probability? 17. What is the formula for conditional probability P(A|B)? 18. What does it mean if P(A|B) = P(A)? 19. How does conditional probability differ from regular probability? ⸻ 🔹 Complement Rule 20. What is the probability of the complement of an event A? 21. What is the formula for the complement rule? 22. If P(A) = 0.7, what is P(not A)? 23. Why is the complement rule useful? ⸻ 🔹 Compound Events 24. What is a compound event in probability? 25. What is the formula for “at least one” events? 26. How do you calculate the probability of “at least one” success in repeated trials? ⸻ 🔹 Law of Total Probability (Intro-level) 27. What does the law of total probability help you find? 28. What type of events does the law of total probability work with? 29. When would you use the law of total probability in real life? ⸻ 🔹 Bayes’ Theorem (Intro-level) 30. What does Bayes’ Theorem allow you to calculate? 31. What is the basic idea behind Bayes’ Theorem? 32. What real-life situations use Bayes’ Theorem? ⸻ 🔹 Review & Application 33. What’s the difference between dependent and conditional probability? 34. How do you decide which formula to use in a probability problem? 35. Why is it important to define the sample space first?
Answers (back of the flashcards) for questions 16–35: ⸻ 🎓 PROBABILITY FLASHCARDS (ANSWERS 16–35) 16. 🔹 If P(A ∪ B) = P(A) + P(B) – P(A ∩ B), the formula works for any events — independent, dependent, or overlapping. ⸻ 17. 🔹 When events are mutually exclusive, P(A ∩ B) = 0, so: P(A ∪ B) = P(A) + P(B) ⸻ 18. 🔹 Conditional probability is the probability that event B happens given that A has already happened. Notation: P(B|A) ⸻ 19. 🔹 Formula: P(B|A) = P(A ∩ B) / P(A) (As long as P(A) ≠ 0) ⸻ 20. 🔹 If events A and B are independent, then: P(B|A) = P(B) because A happening doesn’t affect B. ⸻ 21. 🔹 If two events are mutually exclusive, then knowing A happened tells you B definitely didn’t, so: P(B|A) = 0 ⸻ 22. 🔹 In a Venn diagram, mutually exclusive events have no overlapping region. ⸻ 23. 🔹 On a Venn diagram, independent events may overlap, but that overlap doesn’t change the individual probabilities. ⸻ 24. 🔹 Additive rule: Used for the probability that A or B happens: P(A ∪ B) = P(A) + P(B) – P(A ∩ B) ⸻ 25. 🔹 Multiplicative rule: Used for the probability that A and B happen: P(A ∩ B) = P(A) × P(B|A) (Or just P(A) × P(B) if independent) ⸻ 26. 🔹 Use additive rule when you want the probability of either event happening. ⸻ 27. 🔹 Use multiplicative rule when you want the probability of both events happening. ⸻ 28. 🔹 Independent but not mutually exclusive: Rolling a die: Event A = even number, Event B = number ≤ 4. They can overlap (both contain 2 and 4), and P(A ∩ B) = 2/6 = 1/3 = P(A) × P(B) = 1/2 × 2/3, so knowing one doesn’t change the chance of the other. ⸻ 29. 🔹 Mutually exclusive but not independent: Event A = draw a red card; Event B = draw a black card. You can’t get both — if A happens, B can’t, so not independent. ⸻ 30. 🔹 To test if events are independent, check if: P(A ∩ B) = P(A) × P(B) If not, they’re dependent. ⸻ 31. 🔹 To test if events are mutually exclusive, check if: P(A ∩ B) = 0 ⸻ 32.
🔹 If A and B can occur at the same time, they are not mutually exclusive. ⸻ 33. 🔹 If the occurrence of A affects B, then A and B are not independent. ⸻ 34. 🔹 If P(B|A) = P(B), then A and B are independent. ⸻ 35. 🔹 If two events can’t both happen but they don’t affect each other, this is impossible — because mutually exclusive events (with nonzero probabilities) are always dependent.
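The multiplicative rule for dependent events and the complement rule can be made concrete with a standard card-drawing illustration (a common textbook example, not one from the flashcards above), computed exactly:

```python
from fractions import Fraction

# Draw two cards without replacement: A = first is an ace, B = second is an ace.
p_a = Fraction(4, 52)           # P(A)
p_b_given_a = Fraction(3, 51)   # P(B|A): one ace is already gone
p_both = p_a * p_b_given_a      # multiplicative rule: P(A ∩ B) = P(A) × P(B|A)
print(p_both)                   # 1/221

# Complement rule: P(at least one ace) = 1 - P(no ace on either draw)
p_none = Fraction(48, 52) * Fraction(47, 51)
print(1 - p_none)               # 33/221
```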
37
A clear and structured summary, organized into key topics for easy understanding and reference: ⸻ 🧠 PROBABILITY DISTRIBUTIONS – KEY CONCEPTS 📌 1. Probability Distributions • Describe how probabilities are distributed across the values a random variable can take. • Help us make inferences about population parameters using tools like: • Confidence intervals • Hypothesis testing ⸻ 🎲 2. Random Variables • A random variable (X) is a numerical outcome of a random process. • The actual value it takes in an experiment is denoted as x (lowercase). • An event like [X = x] means the variable takes a specific value and has a certain probability. ⸻ 🔢 3. Types of Random Variables A. Discrete Random Variables • Take on separate, distinct values (e.g., 0, 1, 2…). • Can be: • Finite: e.g., Number of heads in 5 coin tosses. • Infinite: e.g., Poisson distribution (counts events like lightbulb failures). • Have positive probabilities for specific values. • Use Probability Mass Functions (PMFs). B. Absolutely Continuous Random Variables • Take on values over a continuous range (e.g., any number between 0 and 1). • Examples: • Uniform distribution: All values equally likely in an interval like [0,1]. • Normal distribution: Bell-shaped, spans all real numbers. • Use Probability Density Functions (PDFs). • P(X = x) = 0 for any single value — we only get meaningful probabilities over intervals, like P(0 < X < 1). ⸻ 🧮 4. Common Distributions A. Discrete Examples • Binomial distribution: • Counts number of successes in n trials. • Symmetric only when p = 0.5 (like a fair coin). • Poisson distribution: • Models rare events happening at a known average rate. • Examples: Number of emails per hour, lightbulb failures. B. Continuous Examples • Uniform distribution: • All values in the interval are equally likely. • Normal (Gaussian) distribution: • Bell-shaped curve, symmetric around the mean.
• Others: t-distribution, Cauchy, Beta (some types), all can be symmetric. ⸻ 🪞 5. Symmetry & Skewness • Symmetric distribution: • Has a mirror image shape on both sides of a central line (like the normal curve). • Skewed distribution: • Positively skewed: Long tail on the right. • Negatively skewed: Long tail on the left. ⸻ 📌 6. Why All This Matters • Understanding types and characteristics of distributions helps in: • Choosing the right statistical test. • Calculating probabilities. • Making informed predictions and inferences about populations. ⸻ 🌟 Why These Concepts Matter in Real Life and Public Health ⸻ 🎲 1. Probability Distributions Help Us Understand Uncertainty • In public health, outcomes are often uncertain — e.g., “How many people will get infected this month?” or “What’s the chance someone responds to a vaccine?” • A probability distribution helps us model that uncertainty. • Example: If we say the number of daily COVID cases follows a Poisson distribution, we can estimate how likely it is to get more than 100 cases tomorrow — and prepare resources accordingly. ⸻ 🎯 2. Random Variables Let Us Model Real-World Events Mathematically • Think of a random variable like a “slot” that holds the possible outcomes of a random event. • Example: X = number of malaria cases in a community. • Once we treat it as a variable, we can: • Calculate probabilities • Summarize the data using measures like mean and variance • Make predictions ⸻ 🔢 3. Discrete vs.
Continuous Variables Help Us Choose the Right Tools • Knowing what type of variable you have (discrete or continuous) tells you: • Which formulas to use • What kind of graph to draw • Which statistical test is valid Examples: • If you’re counting number of new HIV cases per month → use discrete models like binomial or Poisson. • If you’re measuring body temperature or blood pressure → use continuous models like the normal distribution. ⸻ 📊 4. Normal Distribution = Foundation for Many Statistical Tests • The normal distribution is key because: • It’s used in t-tests, Z-tests, ANOVA, etc. • Many biological measurements (height, blood pressure, BMI) naturally follow a bell-shaped (normal) curve. • If data is normal, we can make powerful assumptions and predict behavior accurately using just the mean and standard deviation. ⸻ 🚨 5. Skewness Affects How We Interpret Data • If a distribution is skewed, using the mean can be misleading. • Example: Income in a population is often right-skewed — a few people earn a lot more than the rest. • The mean may be high, but the median better reflects what most people earn. • In public health, understanding skewness helps you avoid bad policy decisions. ⸻ ✅ 6. Choosing the Right Distribution Improves Decision-Making • For data on rare events (e.g., sudden disease outbreaks), using a Poisson distribution helps anticipate future occurrences. • For yes/no outcomes (e.g., did the patient survive?), the binomial distribution is your go-to. • If you’re not sure which model fits, you might need non-parametric methods — less powerful, but more flexible. ⸻ 🧪 7. They’re the Foundation for Confidence Intervals & Hypothesis Tests • Everything from drug effectiveness trials to disease surveillance relies on the mathematical rules these distributions follow. • They let us ask: • Is what we’re seeing just chance, or is it statistically significant? • How sure are we about the true rate or risk in the population? 
⸻ 🧠 Summary: Why It All Matters
• Random variable: converts real-world outcomes into numbers we can analyze.
• Discrete/Continuous: helps you choose the correct formula or test.
• Probability distribution: describes the likelihood of different outcomes.
• Normal distribution: key for many statistical tests and predictions.
• Skewness: tells you whether the mean is a good summary or not.
• Matching real-world examples to distributions: helps you model reality accurately and make better decisions.
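The two discrete distributions discussed above (binomial and Poisson) can be computed directly from their PMF formulas. A standard-library sketch; the coin-toss and failure-rate numbers are hypothetical:

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): number of successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): rare events at average rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Number of heads in 5 fair coin tosses (discrete, finite support 0..5):
print(round(binomial_pmf(2, 5, 0.5), 4))  # 0.3125

# Chance of zero failures in a month if failures average 3 per month
# (discrete but infinite support 0, 1, 2, ...):
print(round(poisson_pmf(0, 3.0), 4))      # 0.0498
```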
38
A flashcard set with only the questions, based on the concepts in the passage: ⸻ 📊 PROBABILITY & DISTRIBUTIONS — FLASHCARDS (Questions Only) 1. What are probability distributions used for in statistics? 2. What are parameters in the context of probability distributions? 3. What is the difference between a statistic and a parameter? 4. What is a random variable? 5. How are random variables usually denoted in probability theory? 6. What is a discrete random variable? 7. Give an example of an infinite discrete distribution. 8. What type of events does the Poisson distribution model? 9. What is a binomial distribution? 10. What values can a binomial distribution take? 11. What is a continuous random variable? 12. What is the difference between a probability mass function (PMF) and a probability density function (PDF)? 13. Why is P(X = x) = 0 for continuous variables? 14. What is a uniform distribution? 15. What is a normal distribution also called? 16. What is symmetry in a probability distribution? 17. Which common distributions are symmetric? 18. When is a binomial distribution symmetric? 19. What is a positively skewed distribution? 20. What is a negatively skewed distribution?
Flashcards with answers, based directly on the concepts from the passage: probability distributions, random variables, discrete vs continuous distributions, Poisson, binomial, normal distributions, and symmetry/skewness. ⸻ 🔢 PROBABILITY & DISTRIBUTIONS — FLASHCARDS 1. Q: What are probability distributions used for in statistics? A: They describe the probability of different outcomes of a random variable and are used to make inferences about population parameters. ⸻ 2. Q: What are parameters in the context of probability distributions? A: Parameters are characteristics of probability distributions, such as the mean or standard deviation. ⸻ 3. Q: What is the difference between a statistic and a parameter? A: A statistic is a measure calculated from a sample (e.g. sample mean), while a parameter is a fixed value that describes a population (e.g. population mean). ⸻ 4. Q: What is a random variable? A: A random variable is a variable that takes on numerical values determined by the outcome of a random phenomenon. ⸻ 5. Q: How are random variables usually denoted in probability theory? A: With capital letters (e.g. X), while specific values they take are written in lowercase (e.g. x). ⸻ 6. Q: What is a discrete random variable? A: A variable that takes on countable distinct values, such as 0, 1, 2, etc. ⸻ 7. Q: Give an example of an infinite discrete distribution. A: The Poisson distribution, which assigns positive probabilities to all non-negative integers. ⸻ 8. Q: What type of events does the Poisson distribution model? A: Rare or infrequent events, such as the number of lightbulb failures over time. ⸻ 9. Q: What is a binomial distribution? A: A discrete distribution that models the number of successes in n independent trials with the same probability of success p. ⸻ 10. Q: What values can a binomial distribution take?
A: 0, 1, 2, …, up to n (total number of trials). ⸻ 11. Q: What is a continuous random variable? A: A variable that can take on any value within a given range, such as all real numbers between 0 and 1. ⸻ 12. Q: What is the difference between a probability mass function (PMF) and a probability density function (PDF)? A: PMFs are for discrete random variables (assign probabilities to exact values), while PDFs are for continuous variables (assign probabilities over intervals, not exact values). ⸻ 13. Q: Why is P(X = x) = 0 for continuous variables? A: Because the probability at any exact point on a continuous scale is zero; we consider the probability over intervals, like P(0 < X < 1). ⸻ 14. Q: What is a uniform distribution? A: A distribution where all outcomes in a range (e.g., 0 to 1) are equally likely. ⸻ 15. Q: What is a normal distribution also called? A: The Gaussian distribution. ⸻ 16. Q: What is symmetry in a probability distribution? A: A distribution is symmetric if both sides around a central value (like the mean) are mirror images. ⸻ 17. Q: Which common distributions are symmetric? A: Normal, Student’s t, Cauchy, uniform, and certain beta distributions. ⸻ 18. Q: When is a binomial distribution symmetric? A: Only when the probability of success p = 0.5 (like a fair coin toss). ⸻ 19. Q: What is a positively skewed distribution? A: Most values are concentrated on the left, with a long tail stretching to the right. ⸻ 20. Q: What is a negatively skewed distribution? A: Most values are concentrated on the right, with a long tail stretching to the left.
39
📉 SKEWNESS & DISTRIBUTIONS — FLASHCARDS

1. Q: Which two distributions are common examples of positively skewed distributions?
A: The negative exponential distribution and the chi-square distribution.

2. Q: How does the value of p affect the skewness of a binomial distribution?
A: The probability of success (p) determines whether the binomial distribution is symmetric, positively skewed, or negatively skewed.

3. Q: When is a binomial distribution positively skewed?
A: When p < 1/2 (less than 0.5).

4. Q: When is a binomial distribution symmetric?
A: When p = 1/2 (exactly 0.5).

5. Q: When is a binomial distribution negatively skewed?
A: When p > 1/2 (greater than 0.5).

6. Q: Can beta distributions be symmetric, positively skewed, or negatively skewed? What does this depend on?
A: Yes. Beta distributions can take any of these shapes depending on the values of their two shape parameters (α and β).

7. Q: What determines the shape of a beta distribution?
A: The values of its parameters α (alpha) and β (beta).

8. Q: Why is it important to know whether a distribution is symmetric or skewed?
A: Because the shape affects how we interpret data, calculate probabilities, and choose the right statistical tests or models.
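The binomial skewness rules above follow from the standard skewness formula for Binomial(n, p), which is (1 - 2p) / sqrt(n·p·(1 - p)): the sign of (1 - 2p) gives the direction of the skew. A small Python sketch (the n and p values are hypothetical) that checks each case:

```python
from math import sqrt

def binomial_skewness(n: int, p: float) -> float:
    # Standard skewness formula for Binomial(n, p): sign follows (1 - 2p).
    return (1 - 2 * p) / sqrt(n * p * (1 - p))

# p < 0.5 -> positive skew, p = 0.5 -> symmetric, p > 0.5 -> negative skew
print(binomial_skewness(20, 0.2) > 0)   # True  (positively skewed)
print(binomial_skewness(20, 0.5) == 0)  # True  (symmetric)
print(binomial_skewness(20, 0.8) < 0)   # True  (negatively skewed)
```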
40
Estimation and Hypothesis Testing
a. Estimation
Definition: Using sample data to estimate a population value (such as a mean or proportion).
Types:
• Point estimate (a single value)
• Interval estimate (a range, such as a confidence interval)
Assumptions: Random sampling, reliable measurement.
Limitations: Point estimates can be misleading without confidence intervals.
Advantages: Gives an idea of population characteristics.
Public Health Example: From a sample, you estimate that 30–40% of children are underweight (95% confidence).
Note: A confidence interval is an interval estimate and is also used in hypothesis testing.

b. Hypothesis Testing
Definition: A statistical method for testing an assumption about a population.
Terms:
• Null hypothesis (H₀): no effect
• Alternative hypothesis (H₁): there is an effect
When used: To test new treatments, policies, and trends.
Assumptions: Depend on the test (e.g., normality, independence, equal variances).
When not to use: With biased data, very small samples, or when assumptions are violated.
Limitations: p-values are easily misinterpreted; false positives and false negatives occur.
Advantages: The basis of scientific decision-making.
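The interval-estimate idea above can be made concrete with the normal-approximation (Wald) confidence interval for a proportion. A minimal Python sketch, where the survey counts (105 underweight children out of 300 sampled) are hypothetical and chosen to reproduce the "roughly 30–40%" example:

```python
from math import sqrt

def proportion_ci(successes: int, n: int, z: float = 1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.
    z = 1.96 gives a 95% interval; assumes a reasonably large random sample."""
    p_hat = successes / n                  # point estimate
    se = sqrt(p_hat * (1 - p_hat) / n)     # standard error of the proportion
    return p_hat - z * se, p_hat + z * se  # interval estimate

# Hypothetical survey: 105 of 300 sampled children are underweight.
low, high = proportion_ci(105, 300)
print(f"95% CI: {low:.3f} to {high:.3f}")  # 95% CI: 0.296 to 0.404
```

Note how the point estimate (35%) alone hides the uncertainty that the interval makes explicit, which is exactly the limitation named above.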
41
1. Parametric and Non-parametric Methods

Parametric methods:
• Definition: statistical tests that assume the data come from a population with a specific distribution (usually normal).
• When to use: large sample size; normally distributed data; interval or ratio scale.
• Examples: t-test, ANOVA, Pearson correlation, linear regression.
• Assumptions: normality, equal variances, independent observations.
• Limitations: less flexible with skewed or small datasets.
• Advantages: more powerful when assumptions are met.
• Public health example: compare mean blood pressure between male and female adults using a t-test.

Non-parametric methods:
• Definition: tests that do not assume a specific distribution.
• When to use: small samples; non-normal data; ordinal or ranked data.
• Examples: Mann–Whitney U, Wilcoxon, Kruskal–Wallis, Spearman correlation.
• Assumptions: fewer; often based on medians and ranks.
• Limitations: less powerful (a larger effect is needed to reach significance).
• Advantages: flexible; works with ordinal or non-normal data.
• Public health example: compare ordinal pain scores between two treatment groups using the Mann–Whitney U test.

2. Methods Used in Inferential Statistics and When to Apply Them
• Confidence intervals: estimate a range for a population value; use when estimating means or proportions. Example: 95% CI for HIV prevalence in women.
• Hypothesis testing (t-test, ANOVA): test a claim about a population by comparing means. Example: does a new diet reduce cholesterol?
• Chi-square test: association between categorical variables, comparing observed vs expected frequencies. Example: handwashing vs presence of diarrhea.
• Regression (linear/logistic): predict outcomes and adjust for other factors. Example: predict diabetes from BMI and age.
• Non-parametric tests: when the assumptions of parametric tests are not met (skewed or ordinal data). Example: pain scores between two groups.
• Correlation (Pearson/Spearman): assess the strength and direction of relationships; Pearson for continuous data, Spearman for ordinal. Example: sugar intake vs blood pressure levels.

3. Methods Used in Descriptive Statistics and When to Apply Them
• Measures of central tendency (mean, median, mode): summarize the center of the data. Example: mean age of tuberculosis patients.
• Measures of spread (range, variance, SD, IQR): show how the data are spread. Example: SD of hemoglobin levels.
• Frequency tables: counts of occurrences in categories; for categorical data. Example: frequency of malaria by region.
• Graphs and charts (pie, bar, histogram, boxplot): visualize data. Example: pie chart of vaccination coverage.
• Percentiles and quartiles: position within a distribution; useful for assessing health status. Example: children in the lowest height percentile.

4. Elementary Probability and Probability Distributions (Normal Distribution)
• Probability: the likelihood of an event (0–1); used to predict risk and plan interventions; assumes clearly defined outcomes. Example: the chance a smoker develops lung cancer.
• Probability distribution: describes how the values of a variable are distributed; used for modeling outcomes; assumptions depend on the type (normal, binomial, etc.). Example: the distribution of height or weight.
• Normal distribution: a symmetric, bell-shaped curve for many natural biological variables; assumes continuous, symmetric data. Example: birth weights in a large sample.
• Binomial distribution: for yes/no outcomes over a fixed number of trials; assumes binary outcomes. Example: the probability that 3 or more of 5 children are anemic.

5. Estimation and Testing of Hypotheses
• Point estimate: a single-value estimate of a population parameter; requires a random sample. Example: 38% of women are anemic.
• Confidence interval: a range that likely contains the population parameter; reflects the uncertainty of the estimate; requires a random, representative sample. Example: 95% CI for the average hemoglobin level.
• Null hypothesis (H₀): assumes no difference or effect; used in all hypothesis testing. Example: "There is no difference between drug A and drug B."
• Alternative hypothesis (H₁): what we suspect is true. Example: "Drug A is more effective."
• p-value: the probability of observing the result (or one more extreme) if H₀ is true; p < 0.05 conventionally suggests rejecting H₀.
• Type I error (α): a false positive, i.e. rejecting H₀ when it is true. Example: concluding a vaccine works when it doesn't.
• Type II error (β): a false negative, i.e. failing to reject H₀ when H₁ is true. Example: missing a real vaccine effect.

6. Chi-square, Regression, and Non-parametric Methods
• Chi-square test: compares observed vs expected frequencies for categorical data (e.g. yes/no). Assumptions: independent observations, expected counts ≥ 5. Limitation: does not show the strength or direction of an association. Advantage: easy to use for categorical data. Example: link between education and contraceptive use.
• Regression: models the relationship between variables; used for prediction and adjusting for confounders. Assumptions: linearity, independence, and (for linear regression) normality. Limitation: outliers affect results. Advantage: adjusts for many variables at once. Example: predict HIV risk from age and behavior.
• Non-parametric tests: compare medians or ranks without assuming normality; for skewed, ordinal, or small datasets. Limitation: less statistical power. Advantage: robust and flexible. Example: compare ordinal patient satisfaction scores across clinics.

Summary:
• Parametric: for normal, continuous data; powerful and accurate, but unsuitable for skewed data. Example: compare average BMI.
• Non-parametric: for skewed, ordinal, or small datasets; flexible but less powerful. Example: pain scores.
• Inferential statistics: to generalize from samples; informs decisions but can mislead if the sample is biased. Example: estimating an anemia rate.
• Descriptive statistics: to summarize data; a quick overview with no predictions. Example: median age.
• Probability: to model risk; helps with risk estimation but is easily misunderstood. Example: the chance of infection.
• Estimation and hypothesis testing: for inference and testing; generalizes findings but risks false results. Example: testing a vaccine effect.
• Chi-square: association between categories (expected counts ≥ 5); easy to use but gives no measure of strength. Example: sanitation vs diarrhea.
• Regression: prediction and control for variables; predictive power, but affected by outliers. Example: predicting diabetes risk.
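The chi-square test described above can be worked through by hand for a 2×2 table: each expected count is (row total × column total) / grand total, and the statistic sums (O − E)²/E over the cells. A minimal Python sketch with hypothetical handwashing-vs-diarrhea counts; for one degree of freedom the p-value follows from the standard normal via erfc, because a 1-df chi-square variable is a squared standard normal:

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    (a, b), (c, d) = table
    total = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row[i] * col[j] / total  # E = row total * col total / grand total
            stat += (obs - expected) ** 2 / expected
    # With 1 degree of freedom, chi-square is the square of a standard normal,
    # so the p-value is P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: handwashing (rows) vs diarrhea status (columns)
table = [[10, 90],   # washes hands:       10 with diarrhea, 90 without
         [30, 70]]   # does not wash hands: 30 with diarrhea, 70 without
stat, p = chi_square_2x2(table)
print(round(stat, 2), p < 0.05)  # 12.5 True
```

A small p-value here suggests an association, but (as noted above) the chi-square test says nothing about its strength or direction.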
42
✅ What Are Non-Parametric Tests?
Definition: Non-parametric tests are statistical methods that do not require assumptions about the distribution of the data, such as normality. They are based on ranks, medians, or signs rather than means and standard deviations.

🧠 When to Use Non-Parametric Tests
Use them when:
• The data are ordinal, ranked, or categorical
• The sample size is small
• The data are skewed or not normally distributed
• The assumptions of parametric tests (e.g., equal variances, normality) cannot be met
• There are outliers that distort the means

🚫 When Not to Use Them
Avoid them when:
• The data are normally distributed and meet parametric assumptions (parametric tests are more powerful)
• You need precise measures of effect (non-parametric tests work on ranks, not actual values)
• You are analyzing interval or ratio data whose statistical assumptions are met

🧾 Assumptions of Non-Parametric Tests
Though minimal, still consider:
• Independence of observations
• Random sampling
• Data at least at the ordinal level (can be ranked)
• For paired tests: observations are matched pairs or repeated measures

✅ Advantages
• Usable with small sample sizes
• Robust to outliers and non-normal data
• Ideal for ordinal or ranked data
• Fewer assumptions than parametric tests
• Often easy to interpret

❌ Limitations
• Less powerful than parametric tests (a larger effect is needed to detect significance)
• Do not provide estimates such as mean differences or confidence intervals
• Harder to adjust for multiple covariates (unlike regression)
• Often rank-based, so some information is lost

🔎 Common Non-Parametric Tests
• Mann–Whitney U test: compares two independent groups (alternative to the independent-samples t-test); for ordinal or skewed continuous data. Example: compare satisfaction scores between two hospitals.
• Wilcoxon signed-rank test: compares two related samples (alternative to the paired t-test); for paired ordinal or skewed data. Example: pre- and post-treatment depression scores.
• Kruskal–Wallis test: compares more than two independent groups (alternative to one-way ANOVA); for ordinal or skewed data. Example: compare pain scores across 3 clinics.
• Friedman test: compares more than two related samples (alternative to repeated-measures ANOVA); for repeated ordinal data. Example: track stress levels in healthcare workers at 3 time points.
• Spearman's rank correlation: measures association between two variables (alternative to Pearson correlation); for ordinal or skewed continuous data. Example: link between education level and health literacy.
• Chi-square test: association between categorical variables. Example: compare smoking status by gender.
• Fisher's exact test: categorical association with small samples (alternative to chi-square when n is small). Example: HIV status vs contraceptive use in a small village.

🏥 Easy Public Health Examples
1. Mann–Whitney U test
• Scenario: you are evaluating two different TB clinics.
• Goal: compare patient satisfaction scores (1–10 scale, not normally distributed).
• Why this test: the data are ordinal, independent, and not normally distributed.
2. Wilcoxon signed-rank test
• Scenario: you train nurses in vaccine counseling and measure their confidence before and after.
• Why this test: same people, before-and-after data, ordinal scores.
3. Kruskal–Wallis test
• Scenario: compare pain levels after childbirth across 3 hospitals using a 5-point pain scale.
• Why this test: 3 independent groups, ordinal outcome.
4. Friedman test
• Scenario: you measure the stress level of healthcare workers during a disease outbreak at 3 different points: Week 1, Week 4, and Week 8.
• Why this test: repeated measures from the same people over time.
5. Spearman's rank correlation
• Scenario: you want to know whether education level (ranked) is associated with a health knowledge score.
• Why this test: both variables are ordinal and may not be normally distributed.

🧠 Final Tip: Choosing the Right Non-Parametric Test
Ask:
• Are the groups related or independent?
• How many groups are there?
• Is the data ranked, ordinal, or skewed?
Example decision tree:
• Comparing 2 independent groups → Mann–Whitney U
• Comparing 2 paired groups → Wilcoxon signed-rank
• Comparing >2 independent groups → Kruskal–Wallis
• Comparing >2 paired measures → Friedman
• Association between two ordinal variables → Spearman's
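The Mann–Whitney U statistic used for two independent groups can be computed by hand from pooled ranks. A minimal Python sketch with hypothetical clinic satisfaction scores; it returns only the U statistic (no p-value), whereas in practice a library routine such as scipy.stats.mannwhitneyu would also supply the p-value:

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic: pool and rank all values (average rank for
    ties), then U1 = R1 - n1(n1+1)/2 and report U = min(U1, U2)."""
    pooled = sorted(list(x) + list(y))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1                            # pooled[i:j] are tied values
        rank_of[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    r1 = sum(rank_of[v] for v in x)           # rank sum of the first sample
    u1 = r1 - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return min(u1, u2)

# Hypothetical satisfaction scores (1-10) from two independent clinics
clinic_a = [8, 7, 6, 9, 8]
clinic_b = [5, 4, 6, 3, 7]
print(mann_whitney_u(clinic_a, clinic_b))  # 2.0
```

A small U (relative to n1·n2) indicates that one group's ranks sit mostly above the other's, which is what the test then assesses for significance.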
43
✅ Elementary Probability and Probability Distributions (Including the Normal Distribution)

🔹 A. Probability
Definition: Probability is a numerical measure (between 0 and 1) of how likely an event is to happen.
• 0 = impossible
• 1 = certain
• 0.5 = a 50% chance (like flipping a fair coin)

🎯 Basic Rules of Probability
• Addition rule (for mutually exclusive events): P(A or B) = P(A) + P(B). Example: the probability of being either HIV positive or having TB (if there is no overlap).
• Multiplication rule (for independent events): P(A and B) = P(A) × P(B). Example: the probability that a person is both vaccinated and uses mosquito nets.

✅ When to Use Probability in Public Health
• To predict the likelihood of an outcome
• For risk assessment
• To guide decision-making (e.g., testing, screening, vaccination policies)

⚠️ Assumptions for Valid Probability Use
• Clearly defined and mutually exclusive outcomes
• Random sampling
• Independence (for some models)

🧪 Examples in Public Health
• Vaccine effectiveness: if a vaccine is 90% effective, then P(no disease given vaccine) = 0.90.
• Screening test performance: the probability of a true positive (sensitivity) and a true negative (specificity).
• Risk prediction: the probability of a smoker developing lung cancer over 20 years.

❌ Limitations of Probability
• Assumes events are random and measurable
• Does not account for unknown or complex dependencies
• Easily misunderstood (e.g., people confuse probability with certainty)

🔹 B. Probability Distributions
Definition: A probability distribution describes how likely each possible outcome of a variable is.

📊 Two Main Types of Distributions
• Discrete distribution: for countable outcomes (e.g., 0, 1, 2). Example: the number of new COVID cases per day.
• Continuous distribution: for outcomes on a continuous scale. Example: blood pressure, BMI, weight.

🔹 An Important Discrete Distribution: the Binomial
Use: when each trial has only two outcomes (success/failure).
Examples: tossing a coin; an HIV test result (positive or negative).
Public health example: if the probability of a baby being born with low birth weight is 0.1, what is the chance that 3 out of 5 babies are born with LBW? The binomial distribution applies.

🔹 C. The Normal Distribution (A Key Continuous Distribution)
Definition: The normal distribution is a bell-shaped, symmetric curve in which most values lie near the mean and fewer values lie far from it.

📊 Properties of the Normal Curve
• Mean = median = mode (the center of the curve)
• Symmetrical around the mean
• About 68% of values lie within ±1 SD of the mean, 95% within ±2 SDs, and 99.7% within ±3 SDs

✅ When to Use the Normal Distribution
• The data are continuous and symmetric
• The sample size is large enough (the Central Limit Theorem allows this even for non-normal data)
• When performing parametric tests (such as t-tests and ANOVA)

📍 Examples in Public Health
• Birth weight: most babies weigh around 3 kg; few weigh under 2 kg or over 5 kg.
• Hemoglobin levels: centered around 12–13 g/dL with symmetric variation.
• Cholesterol levels: distributed around a population average.

⚠️ Assumptions of the Normal Distribution
• The variable is continuous
• The distribution is symmetric and bell-shaped
• No strong outliers or skewness

❌ When Not to Use the Normal Distribution
• Heavily skewed data (e.g., income, length of hospital stay)
• Ordinal or categorical data
• Small samples that have not been checked for normality

✅ Advantages of the Normal Distribution
• Supports powerful statistical tests
• Widely applicable because of the Central Limit Theorem
• Provides a clear, predictable shape for modeling variation

❌ Limitations
• Real-life data are often not perfectly normal
• Outliers or skewness affect normality
• Not suitable for small samples unless normality is tested

🔍 How Probability and the Normal Distribution Connect to Hypothesis Testing
When testing a hypothesis (e.g., with a t-test), we often assume the variable is normally distributed and calculate the probability (p-value) of observing our result.
Example:
• You compare average hemoglobin in two groups.
• Under the null hypothesis, any difference is due to chance alone.
• You use the normal (or t) distribution to estimate how likely your result is.
• If that probability is very small (p < 0.05), you reject the null.

🏥 Real-World Public Health Summary
• Probability: the risk that a TB contact will develop the disease.
• Binomial distribution: modeling positive/negative test outcomes.
• Normal distribution: modeling population blood pressure or BMI.
• Probability in screening: the probability that a test result is correct (sensitivity/specificity).
• Probability in hypothesis testing: p-values reflect the probability of the observed data under the null hypothesis.
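The 68/95/99.7 percentages quoted for the normal curve come straight from the standard normal CDF, which Python's standard library exposes through the error function: P(−k < Z < k) = erf(k/√2). A short sketch that reproduces the rule:

```python
from math import erf, sqrt

def normal_within(k: float) -> float:
    """P(mean - k*SD < X < mean + k*SD) for any normal distribution,
    i.e. P(-k < Z < k) for a standard normal Z, via the error function."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} SD: {normal_within(k):.4f}")
# within ±1 SD: 0.6827
# within ±2 SD: 0.9545
# within ±3 SD: 0.9973
```

The same function gives tail probabilities for z-scores, which is how p-values under the normal assumption are obtained.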
44
Don’t forget to go through your biostat PBL scenarios and what was taught there