Lecture 1 UCC Flashcards

1
Q

What is descriptive statistics?
State the three forms that descriptive stats can be in

A

Descriptive statistics provide an overview of the general features of a given dataset.

They may take the form of tables, graphs, or numerical summary measures (e.g. proportions, means, etc.)

Examples of numerical summary measures in statistics, which summarize and describe important features of data, include:

  • Mean: The average of all data points.
  • Median: The middle value in a sorted dataset.
  • Mode: The most frequently occurring value(s).
  • Range: The difference between the maximum and minimum values.
  • Variance: The average squared deviation from the mean.
  • Standard Deviation: The square root of the variance; measures how spread out the values are from the mean.
  • Interquartile Range (IQR): The range between the first (Q1) and third quartile (Q3), representing the middle 50% of the data.
  • Percentiles: Values that divide the data into 100 equal parts.
  • Quartiles: Values that divide the data into four equal parts (Q1, Q2, Q3).
  • Z-Score: Represents how many standard deviations a data point is from the mean. A z-score of 0 means the data point is exactly at the mean.
    • A positive z-score means the data point is above the mean.
    • A negative z-score means the data point is below the mean.
    • The magnitude of the z-score indicates how far the data point is from the mean. For example, a z-score of +2 means the data point is 2 standard deviations above the mean.

Uses of Z-Score:
2. Identifying Outliers: Data points with very high or very low z-scores (e.g., above 3 or below -3) are often considered outliers.

In a standard normal distribution (mean = 0, standard deviation = 1), z-scores help determine the probability of a data point occurring within the distribution. For instance:
• About 68% of data points fall within a z-score range of -1 to +1.
• About 95% fall within -2 to +2.
• About 99.7% fall within -3 to +3.

  • Skewness: Indicates the asymmetry of the data distribution.
  • Kurtosis: Measures the “tailedness” of the distribution.

These numerical summaries provide insights into the distribution, central location, spread, and overall structure of the data.
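As a rough illustration of the summary measures listed above, here is a minimal Python sketch. The function name `describe` and the sample values are made up for this example, and it assumes the population form of the variance (the average squared deviation, as defined above); a sample variance would divide by n - 1 instead.

```python
import statistics

def describe(values):
    """Return common numerical summary measures for a list of numbers."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD: square root of the variance
    return {
        "mean": mean,
        "median": statistics.median(values),
        "modes": statistics.multimode(values),  # may hold more than one value
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),  # average squared deviation
        "sd": sd,
        # z-score: how many SDs each value lies from the mean
        "z_scores": [(x - mean) / sd for x in values],
    }

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(summary["mean"], summary["sd"], summary["z_scores"][-1])
```

With these values the mean is 5 and the SD is 2, so the value 9 sits two standard deviations above the mean.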

2
Q

What is inferential statistics?

A

Inferential statistics is how you use data to draw conclusions, test hypotheses, and make inferences.

3
Q

State the population, target population, and sample in the example below.
Example: a study of 200 pregnant women in Cape Coast.

A

Population: women in Cape Coast
Target population: pregnant women
Sample: the 200 women studied

4
Q

What is discrete and continuous data?
What is nominal data?
What is ordinal data?
Give three examples each

Here are two more challenging MCQs to test your understanding of discrete and continuous variables:

A researcher is studying the growth of bacteria in a lab over time. They record the size of the bacterial colony every 12 hours. Which statement best describes the variable “colony size”?

A) It is discrete because the bacteria are counted.

B) It is continuous because the size of the colony can be measured in precise units like millimeters or micrometers.

C) It is discrete because the colony size is measured at regular intervals.

D) It is continuous because the number of bacteria in the colony grows over time.

A hospital tracks the number of days each patient stays in the intensive care unit (ICU). Which of the following best describes the variable “length of stay in the ICU”?

A) It is continuous because the total time can be measured in hours and minutes.

B) It is discrete because the number of days is counted as whole numbers.

C) It is continuous because it can vary significantly between patients.

D) It is discrete because the stay in the ICU is divided into distinct time blocks.

A

Qualitative data:
Ordinal: the categories are ordered in some way. Examples: disease state (mild, moderate, or severe anaemia), degree of pain, and wealth index (rich, poor, poorest).
Nominal: the categories are not ordered but simply have names. Examples: blood group (A, B, AB, and O), marital status (married/widowed/single), sex (male/female), political party affiliation, educational level, HIV status (positive, negative).

Quantitative or numerical data:
Numerical data are either discrete or continuous. Examples of discrete data: number of days in the week or months in the year, number of people in a class, number of cars.

Continuous variables can assume any value within a specified relevant interval.

e.g. height, weight, and temperature of a patient

Weight is continuous because it can take a slightly different value every time you check it.

To know which is which, ask whether you can get a decimal value from the variable. If not, it's discrete.
Example: you can't get the number of children as 44.5.

In theory: age is continuous since it can be measured to any degree of precision.
In practice: age is often treated as discrete for convenience when we round to whole numbers.

Here are two more challenging MCQs to test your understanding of discrete and continuous variables:

A researcher is studying the growth of bacteria in a lab over time. They record the size of the bacterial colony every 12 hours. Which statement best describes the variable “colony size”?

A) It is discrete because the bacteria are counted.

B) It is continuous because the size of the colony can be measured in precise units like millimeters or micrometers.

C) It is discrete because the colony size is measured at regular intervals.

D) It is continuous because the number of bacteria in the colony grows over time.

Answer: B) It is continuous because the size of the colony can be measured in precise units like millimeters or micrometers.
(The size of the bacterial colony is measured in a continuous manner, as it can take any value within a range.)

A hospital tracks the number of days each patient stays in the intensive care unit (ICU). Which of the following best describes the variable “length of stay in the ICU”?

A) It is continuous because the total time can be measured in hours and minutes.

B) It is discrete because the number of days is counted as whole numbers.

C) It is continuous because it can vary significantly between patients.

D) It is discrete because the stay in the ICU is divided into distinct time blocks.

Answer: B) It is discrete because the number of days is counted as whole numbers.
(The length of stay is often recorded as whole days, making it a discrete variable in this context.)

These questions focus on understanding the nature of measurement and how variables are typically recorded in real-world scenarios.

5
Q

What are numerical summary measures and measures of central tendency

A

Numerical summary measures are used to make concise quantitative statements that characterize the whole distribution of quantitative values.

Measures of Central Tendency
These are measures used to investigate the central characteristic of the data, that is, the point around which the observations (or values) tend to cluster.

6
Q

What is arithmetic mean?

The arithmetic mean is extremely sensitive to unusual values.
question:
The forced expiratory volume (FEV) in 1 second for 13 study participants are as follows: 2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05, 2.25, 2.68, 3.00, 4.02, 2.85, 3.38. Calculate the arithmetic mean for FEV

A

Arithmetic mean: It is the sum of all observations in a set of data divided by the total number of observations

The arithmetic mean is extremely sensitive to unusual values.
Examples
The forced expiratory volume (FEV) in 1 second for 13 study participants is as follows: 2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05, 2.25, 2.68, 3.00, 4.02, 2.85, 3.38. Calculate the arithmetic mean of the FEV values.
The arithmetic mean is given by
(2.30+2.15+3.50+2.60+2.75+2.82+4.05+2.25+2.68+3.00+4.02+2.85+3.38)/13 = 38.35/13 = 2.95 litres

Each mean is broken down below with a simple explanation and example:

Definition: The arithmetic mean is what most people commonly refer to as the “average.” You add up all the numbers and then divide by how many numbers there are.

Formula:
\[
\text{Arithmetic Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}
\]

Example:
Consider these numbers: 4, 7, 10.

  1. Sum: 4 + 7 + 10 = 21
  2. Count: There are 3 numbers.
  3. Arithmetic Mean: 21 / 3 = 7

So, the arithmetic mean is 7.

Definition: The harmonic mean is used for rates or ratios. You take the reciprocal (1 divided by the number) of each value, find their average, and then take the reciprocal of that average.

Formula:
\[
\text{Harmonic Mean} = \frac{\text{Number of values}}{\text{Sum of the reciprocals of all values}}
\]

Example:
Consider these values: 4, 6.

  1. Reciprocals: 1/4 = 0.25, 1/6 ≈ 0.167
  2. Sum of Reciprocals: 0.25 + 0.167 = 0.417
  3. Harmonic Mean: 2 / 0.417 ≈ 4.8
    The numerator is 2 because there are 2 values. If there were 3 values, it would be 3 divided by the sum of the reciprocals.

So, the harmonic mean is approximately 4.8.

Definition: The geometric mean is used for sets of numbers that are multiplied together or when dealing with percentages. You multiply all the numbers together, then take the nth root, where n is the number of values.

Formula:
\[
\text{Geometric Mean} = \sqrt[n]{\text{Product of all values}}
\]

Example:
Consider these numbers: 2, 8.

  1. Product: 2 × 8 = 16
  2. Geometric Mean: \sqrt{16} = 4
    It's a square root because there are 2 values.
    If there were 3 values, it would be a cube root.

So, the geometric mean is 4.

  • Arithmetic Mean: Average of numbers (e.g., 7 for 4, 7, 10).
  • Harmonic Mean: Useful for rates (e.g., 4.8 for 4, 6).
  • Geometric Mean: Average for multiplicative datasets (e.g., 4 for 2, 8).
7
Q

How is the calculation of the mean of ungrouped data different from that of grouped data?

For a grouped frequency distribution table with classes 20-40, 40-60, 60-80 and respective frequencies 2, 4, 6,
what is the mean?

A

The mean of ungrouped data is calculated differently from the mean of grouped data.
The mean of ungrouped data is the sum of all the values divided by the number of values.
For grouped data, you construct a table of ranges or categories, for example 0-5, 6-10, 11-25. Then you find the frequency of each category:
how many people fell within 0-5, how many within 6-10, and so on.

The mean of grouped data is Σfx divided by Σf, where f is the frequency of each class and x is the class midpoint.

So for a grouped frequency distribution table with classes 20-40, 40-60, 60-80 and respective frequencies 2, 4, 6,
the mean is:
20-40 midpoint is 60/2 = 30
40-60 midpoint is 100/2 = 50
60-80 midpoint is 140/2 = 70

So Σfx = (30×2) + (50×4) + (70×6)
= 60 + 200 + 420
= 680
Σf = 2 + 4 + 6 = 12
So Σfx/Σf is 680/12
= 56.67
So the mean for this grouped data is 56.67
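
The grouped-mean steps above can be sketched in Python as follows. The function name `grouped_mean` and the way classes are passed in (as lower/upper limit pairs) are assumptions made for this illustration.

```python
def grouped_mean(class_limits, frequencies):
    """Mean of grouped data: sum(f * x) / sum(f), with x the class midpoint."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    sum_f = sum(frequencies)
    return sum_fx / sum_f

# Classes 20-40, 40-60, 60-80 with frequencies 2, 4, 6, as in the card.
result = grouped_mean([(20, 40), (40, 60), (60, 80)], [2, 4, 6])
print(round(result, 2))
```

Because the midpoint stands in for every value in its class, the result is an estimate of the true mean rather than an exact figure.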

8
Q

What is the median under measures of central tendency

What is the median of these values: 2, 4, 1, 2, 3, 3, 1
Find the median of the following values: 6, 5, 2, 4, 3, 2

A

Median: it is the middle value of a set of n observations after they have been arranged in order of magnitude.
For a set of n observations, the position of the median is found using the formula (n+1)/2.
Medians are not sensitive to unusual values.
Examples
What is the median of these values: 2, 4, 1, 2, 3, 3, 1? Sorted: 1, 1, 2, 2, 3, 3, 4. The position is (7+1)/2 = 4, so the median is 2.

Find the median of the following values: 6, 5, 2, 4, 3, 2
Sorted: 2, 2, 3, 4, 5, 6
n/2 is 6/2, the 3rd position (value 3)
(n/2)+1 is the 4th position (value 4)
So (3+4)/2 is 3.5. Median is 3.5

For an odd number of values: use (n+1)/2 to get the position after you've arranged the values in order of magnitude.
For an even number of values: use n/2 to find the first middle value and (n/2)+1 to find the second middle value. If x is the first middle value and y is the second, then (x+y)/2 is the median.
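
The odd and even rules above can be written as a small Python sketch. The function name `median_by_position` is made up for this illustration; it follows the position formulas in the card rather than calling a library routine.

```python
def median_by_position(values):
    """Median using the (n+1)/2 rule for odd n and the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2          # 1-based position of the middle value
        return ordered[position - 1]
    x = ordered[n // 2 - 1]              # value at position n/2
    y = ordered[n // 2]                  # value at position (n/2) + 1
    return (x + y) / 2

odd_example = median_by_position([2, 4, 1, 2, 3, 3, 1])
even_example = median_by_position([6, 5, 2, 4, 3, 2])
print(odd_example, even_example)
```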

9
Q

How is the calculation of the median for an odd number of values different from that for an even number?

A

Median:
The calculation for an odd number of observations is different from that for an even number. Note that you can sort the values in ascending or descending order.

Odd number of observations: arrange in ascending or descending order, then compute (n+1)/2.
The number you get is the position of the median.
So for 1, 1, 2, 2, 3, 3, 4, n is 7. (7+1)/2 is 4, so the 4th value is the median, which is 2.

Even number of observations: arrange in order, then take the mean of the values at positions n/2 and (n/2)+1.

Mode:most frequent value

From an ungrouped frequency, the mode is the value with the highest frequency

10
Q

What are measures of dispersion under numerical summary measures
What is range?
Is range sensitive to unusual values? Why?
Find the range of the following observations:
5, 4, 6, 2, 2, 3, 3, 8
2, 3, 3, 1, 3, 5, 40

A

Measures of Dispersion
These are measures used to describe the spread (or variability) of the data.
Range: It is the difference between the maximum and minimum values of a dataset.
It is more useful when the minimum and maximum values are reported.
Since the range depends only on the minimum and maximum values, it is sensitive to unusual values.

First sort the values in ascending order, otherwise you may miss a bigger value hidden in the list.
1. Sorted: 2, 2, 3, 3, 4, 5, 6, 8. 8-2 is 6, so the range is 6
2. Sorted: 1, 2, 3, 3, 3, 5, 40. 40-1 is 39, so the range is 39

Note on number 1: a first attempt without sorting gave 6-2 = 4, which is wrong because it missed the 8.
In number 2, the single unusual value 40 stretches the range to 39.
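
A minimal Python sketch of the range calculation, using the two datasets from this card. The function name `value_range` is an assumption for illustration; using `max` and `min` avoids the sorting mistake described above.

```python
def value_range(values):
    """Range = maximum value minus minimum value (no manual sorting needed)."""
    return max(values) - min(values)

first = value_range([5, 4, 6, 2, 2, 3, 3, 8])
second = value_range([2, 3, 3, 1, 3, 5, 40])
# Dropping the unusual value 40 shows how sensitive the range is.
second_without_40 = value_range([2, 3, 3, 1, 3, 5])
print(first, second, second_without_40)
```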

11
Q

What is the interquartile range?
What is the difference between the IQR and Q2?

A

Interquartile Range: It is the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile)

The interquartile range contains middle 50% of the observations in a dataset

It is not sensitive to unusual values

The following steps are used to find the 25th and 75th percentiles:

The interquartile range (IQR) and the median (Q2) both relate to measures of central tendency and dispersion but serve different purposes. Here’s a detailed comparison:

  1. Interquartile Range (IQR):
    • Definition: The IQR measures the spread of the middle 50% of the data. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1):
      \[
      \text{IQR} = Q3 - Q1
      \]
    • Purpose: It provides an indication of the variability or dispersion of the central 50% of the data, excluding the lower 25% and the upper 25%. It helps to identify the spread of data and is useful for detecting outliers.
  2. Median (Q2):
    • Definition: The median, or the second quartile (Q2), is the middle value of the dataset when it is ordered. It divides the dataset into two equal halves:
      \[
      Q2 = \text{median}
      \]
    • Purpose: It represents the 50th percentile and indicates the central value of the dataset. It shows where the center of the data lies, but it does not measure the spread or variability.

Key Differences:

  • Measurement Focus:
    • The IQR measures the spread of the central 50% of the data.
    • The Median (Q2) measures the center of the data distribution.
  • Calculation:
    • The IQR is derived from Q1 and Q3, focusing on the range between these quartiles.
    • The Median (Q2) is a single value representing the middle of the dataset.
  • Utility:
    • The IQR helps in understanding the distribution and identifying outliers.
    • The Median helps in understanding the central tendency of the data.

In summary, while both IQR and the median relate to the distribution of data, they serve different purposes: IQR measures the range within which the central 50% of data falls, while the median indicates the middle point of the data.

12
Q

How do you get the 25th percentile (first quartile)?
Don’t use this one
I’ve found an easier one

A

Interquartile range: a quartile comes from dividing something into 4. Dividing 100 into 4 gives 25, 50, 75 and 100.
IQR = Q3 - Q1
1. Arrange the values you have in ascending order.
2. Find the median position.
3. Find the median position of the median position 😂.
So if the median position is the 5th, find the median of this: (5+1)/2 = 3.
4. The value at the position you get is Q1, the lower quartile.

K is the percentile you’re interested in

13
Q

How do you get the 75th percentile (third quartile)?

A
14
Q

Find the interquartile range of the observations below
  1. 2, 4, 4, 1, 1, 5, 4, 4, 4
  2. 6, 6, 2, 1, 3, 5, 6, 7, 8

A

112444445
|||||||||
123456789

These are the sorted values with their positions, so 5 is at the 9th position.

Step 1: rearrange in ascending order.
Step 2: find the median position of the set above.
So (9+1)/2 = 5
The value at the fifth position is the median, so 4 is the median.

Now split positions 1-5 and positions 5-9 to find the quartiles. For the lower quartile you find the median of the median position.
So (5+1)/2 = 3
The value at the third position is 2, so 2 is the lower quartile.

The third quartile (75th percentile) position is 3(n+1)/4 with n = 9, which gives 7.5, rounded up to 8. (Positions are never left as decimals: always round up to the next whole number to get the location of the kth percentile. For example, 3.12 is rounded to 4 and not 3.)
The value at the 8th position is 4.
So IQR = Q3 - Q1
So IQR is 4-2 = 2. (Lecture note: leave the final answer in the form 4-2 when the IQR is to be reported as a range rather than a single value.)

Or

Start by rearranging the numbers in ascending order.
Q1 position = (n+1)/4; look up the value at that position in your dataset.
Q2 is the same as finding the median of a regular set.
Q3 position = 3(n+1)/4; the value at this position is the 75th percentile.

Using this on the dataset gives the same answer. Q1 position = (9+1)/4 = 2.5, which is rounded up to 3. The value at the third position is 2, so 2 is the lower quartile.
Then Q3 position = 3(9+1)/4
= 7.5, which is rounded up to 8, and the value at the 8th position is 4. So Q3 is 4, hence IQR is 4-2.
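
The position formulas above can be sketched in Python like this. The function name `percentile_by_position` is hypothetical, `k` is the percentile of interest (25 or 75), and clamping the position to the size of the dataset is an added assumption to keep very small datasets from running off the end.

```python
import math

def percentile_by_position(values, k):
    """kth percentile: value at position k(n+1)/100, with the position rounded up."""
    ordered = sorted(values)
    n = len(ordered)
    position = math.ceil(k * (n + 1) / 100)  # e.g. 2.5 -> 3, 7.5 -> 8
    position = min(max(position, 1), n)      # assumption: stay within 1..n
    return ordered[position - 1]

data = [2, 4, 4, 1, 1, 5, 4, 4, 4]
q1 = percentile_by_position(data, 25)
q3 = percentile_by_position(data, 75)
print(q1, q3, q3 - q1)
```

For k = 25 this is the same as (n+1)/4, and for k = 75 it is the same as 3(n+1)/4.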

15
Q

What term encompasses the methods of collecting, summarizing, analysing and drawing conclusions from the data?
What term is given to a set of recorded observations on one or more variables?
What is the term given to a characteristic (personal aspect ) or attribute (feel, behave, or think) of an individual or
an organization that can be measured or observed and varies among individuals or organizations ?

A

What is Statistics?
●It encompasses the methods of collecting, summarizing, analysing and drawing conclusions from the data.

●Data: is a set of recorded observations on one or more variables.
Eg. Data on demography of patients with particular illness.

● A variable is a characteristic (personal aspect) or attribute (feel, behave, or think) of an individual or
an organization that:
(a) can be measured or observed

(b) varies among individuals or organizations

16
Q

Calculate the IQR for these values:
6, 6, 2, 1, 3, 5, 6, 7, 8

2, 3, 4, 5, 7, 8, 12

(This is my final understanding of this thing)

A

6, 6, 2, 1, 3, 5, 6, 7, 8
Step 1: rearrange in ascending order: 1, 2, 3, 5, 6, 6, 6, 7, 8
Step 2: it's an odd number of observations, so set aside the middle value (the median) and split the rest into two equal halves
1 2 3 5 | 6 | 6 6 7 8
Q2 or median is 6
Step 3: find the mean of the two middle numbers of the first half to get Q1. (2+3)/2 gives 2.5, so Q1 is 2.5 (or 3 if rounded; which to use depends on the possible answers).
Q1 is the value 2.5 (or 3) itself, not the number at the 3rd position.
Step 4: find the mean of the two middle numbers of the second half to get Q3.
(6+7)/2 is 6.5 (or 7 if rounded).
Again, Q3 is the value 6.5 (or 7) itself, not the number at the 7th position.
IQR = 6.5 - 2.5 = 4 (the rounded values give the same result: 7 - 3 = 4)
Just the values, not positions.
So IQR is 4

The above is for an odd number of values.
Don't use the formulas (n+1)/4 or 3(n+1)/4 with this method for an odd number of values.

Another example for an odd number of values is 2, 3, 4, 5, 7, 8, 12
Q2 is 5
Q1 is 3 (because 3 is the middle number of the lower half 2, 3, 4)
Q3 is 8 (because 8 is the middle number of the upper half 7, 8, 12)

For an even number of values, rearrange in ascending order too.

1, 2, 3, 5, 6, 6, 6, 7
Find the median by separating the values into two equal halves:
1 2 3 | 5 6 | 6 6 7
There are two numbers in the middle instead of one, so you find their mean. (5+6)/2 gives 5.5 (or 6 if rounded) for Q2.

The next step is to find Q1.
For Q1, this time you don't set the median aside.
You divide the numbers into two equal halves without removing the middle two numbers. So it'll be
1 2 3 5 | 6 6 6 7
Find the mean of the two middle numbers in the first half: (2+3)/2, which is 2.5, so Q1 is 2.5
For Q3, find the mean of the two middle numbers in the second half.
So (6+6)/2, which is 6
So Q3 is 6

IQR is 6-2.5, which is 3.5
On rounding: as a rule, don't round at any point, and don't round your final answer either.
Ultimately it depends on the answer options given in the question.
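
The "split into halves" method described in this card can be sketched in Python as below. The function names are made up for illustration, and no rounding is applied at any step.

```python
def _middle(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3, IQR) using the median of each half of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Odd n: the median itself is left out of both halves.
    # Even n: the data split cleanly into two halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    q1, q2, q3 = _middle(lower), _middle(ordered), _middle(upper)
    return q1, q2, q3, q3 - q1

odd_nine = quartiles_by_halves([6, 6, 2, 1, 3, 5, 6, 7, 8])
odd_seven = quartiles_by_halves([2, 3, 4, 5, 7, 8, 12])
even_eight = quartiles_by_halves([1, 2, 3, 5, 6, 6, 6, 7])
print(odd_nine, odd_seven, even_eight)
```

Other quartile conventions exist (statistical software often interpolates between positions), so results from a package may differ slightly from this hand method.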

17
Q

What is the shape of a normal distribution curve?
How many modes does a normal distribution curve have?
Is it symmetrical?
What does a low SD mean
What about a high one
What percentage of data falls within +1 and -1 SD?
What percentage of data falls within +2 and -2 SD?
What percentage of data falls within +3 and -3 SD?

A

In a normal distribution curve, the mode, median and mean are equal and lie in the same location.

It is a bell-shaped curve, unimodal (has one mode, or highest frequency) and symmetrical.

SD is used to measure how much deviation exists in the distribution.

Low SD means values are close to the mean.

High SD means values are spread out over a large range.

•68% of the data falls within +1 and -1 SD

•95% of data falls within +2 and -2 SD

•99.7% of data falls within +3 and -3 SD
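
The three percentages can be checked against the standard normal distribution with a short Python sketch. It uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the helper name `share_within` is an assumption for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The exact figures are slightly above 68% and 95%, which is why the rule is usually quoted with "about".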

18
Q

What is a skewed distribution?
What are the two types?
Most data are of what type of skewed distribution?
Which type of distribution has the mean greater than the median and the median greater than the mode?

Which type of distribution has the mean less than the median and the median less than the mode (that is, the mode is greater than the median and the median is greater than the mean)?

Question 1:
Which of the following correctly describes a positively skewed distribution?

A) The tail points to the left, with most data points concentrated on the right.

B) The tail points to the right, with most data points concentrated on the left.

C) The data is evenly distributed around the mean.

D) The tail points to the right, with most data points concentrated on the right.

Question 2:
In a negatively skewed distribution:

A) The majority of data points are concentrated on the right, and the tail extends to the left.

B) The majority of data points are concentrated on the left, and the tail extends to the right.

C) The distribution has no tails and is perfectly symmetrical.

D) The distribution’s mean and median are equal.

Question 3:
Which of the following best represents the difference between a positively skewed and a negatively skewed distribution?

A) Positively skewed distributions have their tails pointing left, while negatively skewed distributions have their tails pointing right.

B) Positively skewed distributions have their tails pointing right, while negatively skewed distributions have their tails pointing left.

C) Both types of distributions have their tails pointing to the center.

D) Both types of distributions are symmetric, with tails of equal length on both sides.


Question 4:
In which type of skewed distribution would the mean typically be greater than the median?

A) Negatively skewed

B) Positively skewed

C) Symmetrical distribution

D) Bimodal distribution

A

Skewed distributions: skewness is a measure of the asymmetry of a distribution.

Positively skewed: the tail points to the right and most data points are at the left.
Negatively skewed: the tail points to the left and most data points are at the right.

Most data are positively skewed.

Skewed distribution:
The mean, median and mode shift apart;
they are no longer equal to each other.
Positively skewed: mean is greater than median and median is greater than mode.
Negatively skewed: mode is greater than median and median is greater than mean,
or equivalently the mean is less than the median and the median is less than the mode.

The mean is always nearest the tail. So in a positively skewed distribution, the tail is at the right, so the mean is furthest right, the median is behind it, and the mode is the lowest of the three.

But in a negatively skewed distribution, the tail is to the left, so the mean is furthest left, followed by the median, and the mode is the highest of the three.

  1. Positively Skewed Distribution:
    • Tail Direction: The tail extends to the right.
    • Data Concentration: Most data points are concentrated on the left (lower values).
    • Impact on Mean: The mean is sensitive to all values in the dataset, including extreme values in the tail. In a positively skewed distribution, the high values in the right tail increase the sum of the data points, thus pulling the mean to the right (towards the higher values). As a result, the mean becomes greater than the median.
  2. Negatively Skewed Distribution:
    • Tail Direction: The tail extends to the left.
    • Data Concentration: Most data points are concentrated on the right (higher values).
    • Impact on Mean: Similarly, the mean in a negatively skewed distribution is affected by all data points, including the low values in the left tail. These low values decrease the sum of the data points, pulling the mean to the left (towards the lower values). Consequently, the mean becomes less than the median.

Summary:

•	In a positively skewed distribution, the extreme high values in the right tail pull the mean to the right, making it greater than the median.
•	In a negatively skewed distribution, the extreme low values in the left tail pull the mean to the left, making it less than the median.
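
To see the mean being pulled towards the tail, here is a small Python sketch. The function name `skew_direction` and both datasets are hypothetical, and comparing the mean with the median is only a rule of thumb for the direction of skew, not a formal skewness statistic.

```python
import statistics

def skew_direction(values):
    """Rule-of-thumb skew direction from comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"   # mean pulled right by high values in the tail
    if mean < median:
        return "negative"   # mean pulled left by low values in the tail
    return "symmetric"

right_tailed = [1, 1, 1, 2, 2, 3, 4, 5, 10]     # made-up data, long right tail
left_tailed = [1, 6, 7, 8, 9, 9, 10, 10, 10]    # made-up data, long left tail

print(skew_direction(right_tailed), skew_direction(left_tailed))
```

In the first dataset the mode (1) is below the median (2), which is below the mean (about 3.2); in the second the order is reversed.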

Here’s an MCQ to test your understanding of skewed distributions:

Question 1:
Which of the following correctly describes a positively skewed distribution?

A) The tail points to the left, with most data points concentrated on the right.

B) The tail points to the right, with most data points concentrated on the left.

C) The data is evenly distributed around the mean.

D) The tail points to the right, with most data points concentrated on the right.

Answer:
B) The tail points to the right, with most data points concentrated on the left.

Question 2:
In a negatively skewed distribution:

A) The majority of data points are concentrated on the right, and the tail extends to the left.

B) The majority of data points are concentrated on the left, and the tail extends to the right.

C) The distribution has no tails and is perfectly symmetrical.

D) The distribution’s mean and median are equal.

Answer:
A) The majority of data points are concentrated on the right, and the tail extends to the left.

Question 3:
Which of the following best represents the difference between a positively skewed and a negatively skewed distribution?

A) Positively skewed distributions have their tails pointing left, while negatively skewed distributions have their tails pointing right.

B) Positively skewed distributions have their tails pointing right, while negatively skewed distributions have their tails pointing left.

C) Both types of distributions have their tails pointing to the center.

D) Both types of distributions are symmetric, with tails of equal length on both sides.

Answer:
B) Positively skewed distributions have their tails pointing right, while negatively skewed distributions have their tails pointing left.

Question 4:
In which type of skewed distribution would the mean typically be greater than the median?

A) Negatively skewed

B) Positively skewed

C) Symmetrical distribution

D) Bimodal distribution

Answer:
B) Positively skewed

19
Q

When the data is normally distributed, what measures do you report?
What about when it’s skewed

A

When the data are normally distributed, report the arithmetic mean and standard deviation. (These two are strongly affected by outliers, so with skewed data they would be distorted. That's why they're only used for normally distributed data.)
However, when the data are skewed, report the median and interquartile range.

20
Q

Here’s an MCQ to test your understanding of what a box plot is:

Question:
Which of the following best describes a box plot?

A) A graphical representation that shows the frequency distribution of a data set.

B) A type of plot that illustrates the relationship between two continuous variables using a scatter of points.

C) A graph that depicts differences in quantitative measures between two or more subgroups, summarizing data using quartiles and highlighting potential outliers.

D) A plot that displays the cumulative frequency of a data set.

A

Answer:
C) A graph that depicts differences in quantitative measures between two or more subgroups, summarizing data using quartiles and highlighting potential outliers.

Box plots: these are graphs usually used to depict the differences in quantitative measures between two or more subgroups. They also provide a summary of the data.

You use the IQR or box plots to:
  • provide a summary of quantitative continuous variables
  • find or compare the medians of two datasets.

21
Q

Explain the parts of the box plots

A

Vertically:
The end of the upper whisker (not the box) is the maximum value in the dataset.
The end of the lower whisker is the minimum value.
For the box, the line dividing the box is Q2. The edge of the box nearer the upper whisker is Q3 and the edge nearer the lower whisker is Q1.
A way to remember this: going up the plot is ascending, so the lower whisker is the minimum and the upper whisker is the maximum.

When the box plot is drawn horizontally, it is still ascending: from min to max, left to right.
When it is drawn vertically, it is ascending from min to max, bottom to top.

22
Q

How do you check for outliers?
If there are outliers in the data, don't use them as the whisker ends on the box plot.
Any number above the upper limit is an outlier.
Any number below the lower limit is an outlier.

A

Lower limit for outliers: Q1 - 1.5 × IQR
Upper limit for outliers: Q3 + 1.5 × IQR

(Q1 - 1.5 × IQR, Q3 + 1.5 × IQR)
(Remember BODMAS in these calculations: multiply before you add or subtract. UCC doesn't let you use calculators.)
The result is the allowed range for the whiskers of your box-and-whisker plot.
For example, suppose you get (1, 12).
This is your range for plotting. If your minimum is 2, then 2 is within 1-12, so it's not an outlier.
If your maximum is 11, then 11 is within 1-12, so it's not an outlier.
Now say your maximum is 15 and your values are 2, 3, 4, 9, 10, 15.
You won't use 15 as the maximum because 15 is an outlier: it's bigger than 12. So you'll use the next value, 10, as your max and draw the whisker to 10, not 15. Then you'll mark 15 separately as an outlier (the way outliers appear on box plots).

So you do this outlier check after getting your min and max values from the data.
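
The outlier check can be sketched in Python as follows. The function names are made up for this illustration, and the quartiles Q1 = 5 and Q3 = 8 are hypothetical values chosen only so the limits come out close to the (1, 12) example; they are not computed from the listed data.

```python
def outlier_limits(q1, q3):
    """Return (lower limit, upper limit) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def whiskers_and_outliers(values, q1, q3):
    """Whisker ends are the smallest and largest values inside the limits."""
    lower, upper = outlier_limits(q1, q3)
    inside = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return min(inside), max(inside), outliers

limits = outlier_limits(5, 8)                                  # hypothetical Q1, Q3
plot_min, plot_max, outliers = whiskers_and_outliers([2, 3, 4, 9, 10, 15], 5, 8)
print(limits, plot_min, plot_max, outliers)
```

Here 15 lies above the upper limit, so the upper whisker stops at 10 and 15 is marked on its own.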

23
Q

You checked the weights of a group of people and got 80 as Q3 and 50 as Q1.
IQR is 30.
Interpret the interquartile range, Q1 and Q3 values.

A

Q3 is also called the 75th percentile. If you checked the weights of people and Q3 was 80, it means 75% of the people weighed 80 kg or less.

For Q1, if it was 50, we can say 25% of the people weighed 50 kg or less and the rest weighed more than 50 kg.

The IQR is 80-50, so 30 kg.
So the middle 50% of the weights are spread over a range of 30 kg, from 50 kg to 80 kg.
In other words, 50% of the people weigh between 50 kg and 80 kg.

Given:

  • Q1 (First Quartile) = 50
  • Q3 (Third Quartile) = 80
  • IQR (Interquartile Range) = 30

Here’s how to interpret these values:

  1. Quartile Values:
    • Q1 (50): This value represents the 25th percentile of the data. It means that 25% of the people in your dataset have weights less than or equal to 50 kg.
    • Q3 (80): This value represents the 75th percentile of the data. It means that 75% of the people in your dataset have weights less than or equal to 80 kg.
  2. Interquartile Range (IQR):
    • IQR (30): This is the difference between Q3 and Q1, calculated as:
      \[
      \text{IQR} = Q3 - Q1 = 80 - 50 = 30
      \]
      The IQR measures the spread of the middle 50% of the data. In this case, the weights of the middle 50% of people fall within a range of 30 kg.
  • Q1 = 50 kg means 25% of the people weigh 50 kg or less.
  • Q3 = 80 kg means 75% of the people weigh 80 kg or less.
  • The middle 50% of the people have weights ranging from 50 kg to 80 kg, with a spread of 30 kg.

This range gives you an idea of the typical weight distribution and helps identify where most of the data points lie.

Let’s break down the interpretation of the interquartile range (IQR), Q1, and Q3 values based on the provided data:

  • Q1 (First Quartile) = 50 kg
  • Interpretation:
    • The first quartile (Q1) represents the 25th percentile of the data.
    • This means that 25% of the people in the dataset have a weight less than or equal to 50 kg.
    • It also means that 75% of the people have a weight greater than 50 kg.
  • Q3 (Third Quartile) = 80 kg
  • Interpretation:
    • The third quartile (Q3) represents the 75th percentile of the data.
    • This means that 75% of the people in the dataset have a weight less than or equal to 80 kg.
    • It also means that 25% of the people have a weight greater than 80 kg.
  • IQR (Interquartile Range) = 30 kg
  • Calculation:
    • IQR is calculated as Q3 - Q1, so 80 - 50 = 30.
  • Interpretation:
    • The IQR measures the spread of the middle 50% of the data.
    • In this case, the middle 50% of the people have weights that range between 50 kg and 80 kg.
    • An IQR of 30 kg means the middle half of the weights spans 30 kg. Whether that counts as a wide or narrow spread depends on the context.
  • Q1 = 50 kg: 25% of people weigh 50 kg or less.
  • Q3 = 80 kg: 75% of people weigh 80 kg or less.
  • IQR = 30 kg: The middle 50% of people’s weights range from 50 kg to 80 kg, so half of the people’s weights fall within a 30 kg range.

The IQR is useful for identifying the range within which the central bulk of the data lies, and it helps in understanding the spread or dispersion of the data around the median.

No, there is no Q4 in the context of quartiles. Here’s why:

  1. Q1 (First Quartile): This is the 25th percentile. It marks the point below which 25% of the data falls.
  2. Q2 (Second Quartile): This is the 50th percentile, also known as the median. It divides the data into two equal halves, with 50% of the data below and 50% above.
  3. Q3 (Third Quartile): This is the 75th percentile. It marks the point below which 75% of the data falls.
  • The quartiles split the data into four equal parts. However, when discussing specific quartiles, we typically refer to Q1, Q2, and Q3 because they represent key percentiles:
    • Q1: 25th percentile
    • Q2: 50th percentile (Median)
    • Q3: 75th percentile
  • The fourth quartile is the remaining portion of the data that falls between Q3 (75th percentile) and the maximum value. However, this isn’t typically labeled as Q4; instead, it’s just referred to as the top 25% of the data.
  • Q1, Q2, and Q3 are the three main quartiles.
  • While the data can be divided into four parts, we don’t refer to a Q4; we consider Q3 as the last main quartile, and the remaining part of the data beyond Q3 is just the upper 25%.
24
Q

Statistics is the collection, organization, analysis and interpretation of numerical data.

Applying this to biological and health data is biostatistics (biostat)

Outcome- dependent variable
Independent variable-???

Ways to get descriptive stats:
Using numerical summary measures - concise quantitative statements used to describe the features of a dataset
Using graphs
Using tables
This is from first slide on descriptive stat

Categorical data- percentage
Numerical- mean

You have to know the kind of data before applying any technique so you can apply the right technique

Types of mean:
Geometric mean
Harmonic mean
Arithmetic mean

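Median: (N+1)/2 gives the position of the median, not the value itself

Unsorted data:
2,4,1,2,3,3,1
6,5,2,4,3,2

1,1,2,2,3,3,4 (first set sorted, N = 7)
(7+1)/2
8/2
4
4th position
The value at the 4th position is 2, so the median is 2.

2,2,3,4,5,6 (second set sorted, N = 6)
(6+1)/2
7/2
3.5th position
So the median is the mean of the 3rd and 4th values: (3+4)/2 = 3.5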
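
The position rule for the median can be checked with a short Python sketch. The function name `median_by_position` is made up for illustration, and the two datasets are the ones listed in this card.

```python
def median_by_position(values):
    """Return (position, median) using the (N+1)/2 position rule."""
    data = sorted(values)          # always sort first
    n = len(data)
    position = (n + 1) / 2         # 1-based position of the median
    if position.is_integer():
        # Odd N: the median is the value sitting at that position
        return position, data[int(position) - 1]
    # Even N: position ends in .5, so average the two neighbouring values
    lower = data[int(position) - 1]
    upper = data[int(position)]
    return position, (lower + upper) / 2


print(median_by_position([2, 4, 1, 2, 3, 3, 1]))  # (4.0, 2)
print(median_by_position([6, 5, 2, 4, 3, 2]))     # (3.5, 3.5)
```

In the second dataset the position (3.5) and the median value (3.5) happen to be the same number, which is a coincidence of this data.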

Range: write it as min-max
Don’t subtract.

1,2,3,3,3,5,40

Instead of doing 40-1 to get 39,
you’ll just state the range as 1-40
because this is more informative than just giving the difference.
IQR: don’t subtract either. Give it as a range, as done for the range. For example, if Q3 is 3 and Q1 is 2, you won’t do 3-2 = 1.
You’ll just give it as 2-3.

IQR is written from Q1 to Q3

What is an integer?
A negative or positive whole number, including zero.

To find the position of a percentile, calculate nk/100
n is the number of values and k is the percentile

If the result is not an integer:
You’ll round up to the next whole number.
For example, 3.1 becomes the 4th position
and 5.3 becomes the 6th position.
The value at that position is the percentile.

If you get an integer:
You’ll find the mean of the value at that position and the next one.
For example, you got the 6th position after doing nk/100.
So you’ll do (value at 6th position + value at 7th position) / 2

Or if you got the 3rd position after doing nk/100,
you’ll do (value at 3rd position + value at 4th position) / 2
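
The nk/100 rule can be sketched in Python as follows. This is a minimal illustration of the rule as written in these notes (other textbooks and software use slightly different percentile rules). The function name `percentile_value` is made up, k is assumed to be a whole number between 1 and 99, and the datasets are ones used elsewhere in this card.

```python
def percentile_value(values, k):
    """kth percentile using the nk/100 position rule from the notes."""
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99 for this rule")
    data = sorted(values)
    n = len(data)
    whole, remainder = divmod(n * k, 100)   # nk/100 split into whole part and leftover
    if remainder != 0:
        # Not an integer: round up to the next position and take that value
        return data[whole]                  # index `whole` is 1-based position whole + 1
    # Integer: mean of the value at that position and the next one
    return (data[whole - 1] + data[whole]) / 2


data = [1, 2, 3, 3, 3, 5, 40]               # n = 7
print(percentile_value(data, 25))           # 7*25/100 = 1.75 -> 2nd position -> 2
print(percentile_value(data, 75))           # 7*75/100 = 5.25 -> 6th position -> 5
print(percentile_value([6, 5, 2, 4, 3, 2], 50))  # 6*50/100 = 3 -> mean of 3rd and 4th -> 3.5
```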

Know when to apply the different calculations. He said he won’t bring calculations.

Variance together with standard deviation.
Standard deviation is square root of variance.
How to calculate variance

When given data, you have to check the distribution first, whether it’s skewed or not, before you calculate the mean.
If it’s not skewed, you’ll calculate the mean.
If it is skewed, the median is usually the better summary.

You use the IQR or box plots to:
- [ ] Provide a summary of quantitative continuous variables
- [ ] Find or compare the medians of two data sets.

The height of the box is the IQR

CLOSED CIRCLES ARE SHADED
THE ONES THAT ARENT SHADED ARE OPEN CIRCLES

Question:
The following values are normally distributed. From the boxplot, the median is 5.
What is the mean?
Ans: the mean is 5,
because for a normal distribution,
mean = median = mode

A
25
Q
A

Variance gives you a measure of spread in squared units.
• Standard deviation gives you a measure of spread in the same units as the data, making it more interpretable.

Variance tells you how much the data points vary in squared units.
• Standard deviation tells you that, roughly speaking, the data points are typically about 2.83 units away from the mean (in this case, 2.83 units from 6).

The standard deviation is the square root of the variance. It brings the value back to the same units as the data, making it easier to interpret.
• It tells us roughly how far a typical data point lies from the mean.
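
The link between variance and standard deviation can be sketched in Python. The dataset this card is based on is not shown, so the values 2, 4, 6, 8, 10 below are an assumption: they were chosen because they give a mean of 6 and a standard deviation of about 2.83 when dividing by n (the population formula). The function name `population_variance` is also made up for illustration.

```python
import math


def population_variance(values):
    """Average squared deviation from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


data = [2, 4, 6, 8, 10]             # hypothetical data, mean = 6
variance = population_variance(data)
sd = math.sqrt(variance)            # back to the same units as the data

print(variance)       # 8.0 (squared units)
print(round(sd, 2))   # 2.83
```

If the sample formula is used instead (dividing by n - 1), the same data gives a variance of 10 and a standard deviation of about 3.16.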