Week 2 Flashcards
Why is probability important in statistics?
We want a random sample which represents our population.
This means no inherent biases in the sampling technique.
Variations in the sample data cause uncertainty in the statistical analysis results.
This is because no two random samples are exactly alike.
E.g. one random sample may be representative of the population of interest, while another may be off, purely due to chance.
Inferential statistics is concerned with measuring this degree of uncertainty, which can be quantified using the concept of probability.
This allows us to draw conclusions about our population using the random sample.
We use probability to quantify how much we expect random samples to vary.
Calculation of probability
The definition of probability depends on a process or experiment that occurs repeatedly under identical conditions, the number of times the experiment is repeated, and the number of times an outcome of interest occurs.
Let us assume that an experiment is repeated n times, and out of those n times an event of interest occurred m times. Then the probability of the event is simply the relative frequency of the event. Thus the relative frequency definition of probability is given by: Relative frequency = m/n
E.g. 1000 smokers observed (n = 1000), of whom 20 developed lung cancer (m = 20), giving an estimated probability of 20/1000 = 0.02.
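The relative-frequency calculation above can be sketched in a few lines of Python (the smoker figures are the flashcard's own example):

```python
# Relative-frequency definition of probability: m occurrences in n repetitions.
n = 1000  # number of repetitions (smokers observed)
m = 20    # number of times the event of interest occurred (lung cancer cases)

probability = m / n
print(probability)  # 0.02
```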
What is a parameter?
In a study you are aiming to ensure that your sample result is as close to the parameter as possible.
This allows the results of your study to be generalised to the entire (relevant) population.
A parameter is an unknown characteristic of interest in the true population.
General examples include the true mean, true proportion, and true standard deviation.
It is often difficult to calculate for a large population due to financial and time constraints.
An example: Consider the BMI of Australians aged 30 to 60 years. If all the Australians within this age group are taken into account to calculate the mean and standard deviation of BMI, these are called the true (population) mean and standard deviation respectively.
What is the difference between parameters and statistics ?
A statistic and a parameter are very similar in the sense that they are both descriptions of groups. For example, “50% of cat owners prefer X brand cat food.” The difference between a statistic and a parameter is that:
Statistics describe a sample and are denoted by Latin (roman) letters
Parameters describe an entire population and are denoted by Greek letters
Normal distribution
The normal distribution is appropriate only for continuous variables.
The normal distribution has two parameters: mean (μ) and standard deviation (σ). These parameters respectively describe the central value and spread in the data.
The shape of the distribution of sample observations depends on the shape of the distribution of the observations in the sampled population. In general, as the sample size increases, the distribution of the sample observations approaches the population distribution.
We can predict the shape of the data in the population by the shape of the observations in a large sample.
Some features of the normal distribution
Normal distributions are symmetric around their mean.
The mean, median, and mode of a normal distribution are equal.
The area under the normal curve is equal to 1.0.
Normal distributions are denser in the centre and less dense in the tails.
Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
68% of the area of a normal distribution is within one standard deviation of the mean.
Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.
The total area under the curve above the horizontal axis is one square unit because the area under the curve is the cumulative sum of ____________frequencies.
Relative
Normal distributions - the Z distribution
For a normal distribution of data, the observations are symmetrically clustered around the central value of the distribution, i.e., around the population mean.
If the data follows the normal distribution with true mean and true standard deviation then:
68% of observations fall within 1 standard deviation (SD) of the mean (μ ± σ).
95% of observations fall within 2 SDs of the mean (μ ± 2σ).
99.7% of observations fall within 3 SDs of the mean (μ ± 3σ).
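The 68-95-99.7 rule above can be checked numerically using the standard normal cumulative distribution function, which the Python standard library supports via `math.erf`:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative probability P(X <= x) for a normal distribution."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Probability of falling within k standard deviations of the mean:
# P(mu - k*sigma <= X <= mu + k*sigma)
for k in (1, 2, 3):
    p = normal_cdf(k) - normal_cdf(-k)
    print(f"within {k} SD: {p:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

The computed areas (0.6827, 0.9545, 0.9973) are the precise values that the flashcards round to 68%, 95%, and 99.7%.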
Intervals
We can split up the individual intervals into their respective percentages. The ranges are called reference ranges for the normal distribution.
Equivalently for a large sample (for large samples the sample mean approaches the population mean), the same 68-95-99.7% rule applies.
Moving outward from the mean on either side, the successive one-SD bands contain 34%, 13.5%, 2.35%, and 0.15% of the observations.
How do we calculate the areas for the ranges that are not within the reference ranges?
First what are our reference ranges?
‐3SD, ‐2SD, ‐1SD, Mean, +1SD, +2SD, +3SD.
Answer: z-score
What is a Z-score?
Z-score indicates how many standard deviations a data point is from the mean and helps us with this problem.
The Z-score tells us:
- where a data point lies compared with the rest of the data set in relation to the mean
- allows comparisons of data points across different normal distributions
For example, we can compare the scores obtained by a student in two exams whose scores are normally distributed.
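The exam comparison can be sketched as follows; the class means, SDs, and scores below are hypothetical figures, not from the flashcards:

```python
def z_score(x, mean, sd):
    """How many SDs the observed value x lies from the mean."""
    return (x - mean) / sd

# Hypothetical example: one student, two normally distributed exams.
# Exam A: class mean 70, SD 10; the student scored 85.
# Exam B: class mean 50, SD 5;  the student scored 62.
z_a = z_score(85, 70, 10)
z_b = z_score(62, 50, 5)
print(z_a, z_b)  # 1.5 2.4
```

Although the raw score on Exam A is higher, the Z-scores show the student performed relatively better on Exam B (2.4 SDs above its class mean versus 1.5).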
How to calculate a Z-score?
A simple transformation of the variable can be useful to calculate the probability. This transformation requires the knowledge of the true (population) mean and true standard deviation for the variable.
The transformation is achieved by subtracting the true mean from the observed value and then dividing this difference by the true standard deviation.
The whole expression is denoted by Z and is known as Z-score or standard score. This has a mean = 0, and a standard deviation = 1. This is known as the “standard normal distribution”.
Z-score = (Observed Value – True Mean)/(True Standard Deviation)
If the true mean and SD are unknown…
If the true mean and SD are unknown, they can be replaced by the sample mean and SD respectively when the sample size is large.
For any individual, the z-score tells us how many standard deviations the raw score for that individual deviates from the mean and in what direction. A positive z-score indicates the individual is above average and a negative z-score indicates the individual is below average.
Z = – 1.5 means that a BMI of 25 is 1.5 standard deviations below the true mean.
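As a sketch, the BMI figure above is reproducible under an assumed true mean of 29.5 kg/m² and true SD of 3 kg/m² (these population values are illustrative assumptions, not given in the flashcards):

```python
# Hypothetical population values chosen so that BMI 25 gives Z = -1.5.
true_mean = 29.5  # assumed true mean BMI (kg/m^2), not from the flashcards
true_sd = 3.0     # assumed true SD (kg/m^2), not from the flashcards
bmi = 25.0

z = (bmi - true_mean) / true_sd
print(z)  # -1.5, i.e. 1.5 SDs below the true mean
```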
Area under the normal distribution curve
The two main methods for calculating the probabilities for various z-scores are:
Normal distribution table
Statistical packages
In order to use the normal probability table, we must calculate the z-score.
How to read a normal distribution table?
Consider the absolute value (always positive) of the Z-score and break it into two parts:
(a) the whole number and the tenth and
(b) the hundredth.
In this example, the absolute value of the Z-score is 1.50.
The whole number and the tenth is 1.5
and the hundredth is 0.00
The whole number and the tenth (1.5) are looked up along the first column and the hundredth (0.00) is looked up across the first row in the table.
The value in the intersection of the row and column is the probability from the absolute value of the Z-score to infinity (i.e. above the Z score). Thus from the table, the probability above 1.5 is 0.0668.
Note: Due to the symmetry of the normal distribution, the probability below -1.5 is the same as the probability above +1.5.
We are interested in the probability of a Z-score greater than -1.5.
The easiest way to calculate this is: 1 - 0.0668 (i.e. entire area - [area below -1.5]).
Thus the required probability is 0.9332 (1 - 0.0668).
Hence the probability that a randomly selected adult Australian is overweight (BMI > 25 kg/m²) is 93.32%.
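The table-lookup steps above can also be checked with the standard normal CDF (via `math.erf`), confirming both the tail value 0.0668 and the complement 0.9332:

```python
import math

def normal_cdf(x):
    """P(Z <= x) for the standard normal distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p_above_pos = 1 - normal_cdf(1.5)    # table value for Z = 1.5 (area above)
p_above_neg = 1 - normal_cdf(-1.5)   # = 1 - 0.0668 by symmetry

print(round(p_above_pos, 4))  # 0.0668
print(round(p_above_neg, 4))  # 0.9332
```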
Sampling distribution - Definition
A sampling distribution is the distribution of a summary statistic (e.g. the sample mean) calculated over repeated random samples of the same size.
Medical research often involves acquiring data from a sample of individuals and using the information gathered from the sample to make inferences about a broader group of individuals.
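A minimal simulation makes the idea concrete: repeatedly draw samples from a hypothetical population, compute the sample mean each time, and examine how those means are distributed. The population mean and SD below are illustrative assumptions:

```python
import random
import statistics

random.seed(42)

# Hypothetical population: BMI-like values with mean 27 and SD 4.
population_mean, population_sd = 27.0, 4.0

# Draw 2000 random samples of size 50 and record each sample mean.
sample_means = []
for _ in range(2000):
    sample = [random.gauss(population_mean, population_sd) for _ in range(50)]
    sample_means.append(statistics.mean(sample))

# The sample means cluster around the population mean, and their spread
# is roughly sigma / sqrt(n), i.e. 4 / sqrt(50) ~ 0.57 (the standard error).
print(round(statistics.mean(sample_means), 2))   # close to 27.0
print(round(statistics.stdev(sample_means), 2))  # close to 0.57
```

This simulated distribution of sample means is the sampling distribution of the mean; its narrowing spread with larger sample sizes is what lets a sample statistic stand in for the population parameter.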