Week 3 Flashcards
Descriptive statistics vs Inferential statistics
Descriptive statistics are used to describe the population, through numerical calculations, graphs, or tables.
Inferential statistics makes inferences and predictions about a population based on a sample of data taken from the population in question.
Summary Statistics
Summary statistics summarise and provide information about your sample data: they tell you about the values in your data set, including where the average lies and whether your data are skewed.
Summary statistics fall into three main categories:
Measures of location (also called central tendency).
Measures of spread.
Graphs/charts.
Your aim therefore is to present sample data using graphs and descriptive measures to summarise points and characteristics in the sample.
Inferential Statistics
Inferential statistics is used to make inferences about the characteristics of a population based on sample data.
The goal is to go beyond the data at hand and make inferences about population parameters.
In order to use inferential statistics, it is assumed that either random selection or random assignment was carried out (i.e., some form of randomisation is assumed).
There are two things to consider for inferential statistics: hypothesis testing and confidence intervals. These are covered in more detail later in this Module.
Retain or Reject the Null Hypothesis
A hypothesis test is performed under the assumption that the null hypothesis is true; we then try to disprove it based on the available data.
The sample we have taken from the population either has true mean μ = μ0 or it has a different mean.
Consider the sampling distribution for the sample mean under the null hypothesis, i.e., the population mean μ is equal to μ0.
Cut off point: 95% area
If the sample mean falls within the middle 95% of the area, then we say that it is close to the population mean under the null, and any difference between the sample mean and the null-hypothesised population mean is due to sampling variability or chance.
If the sample mean falls either in the lower 2.5% area or in the upper 2.5% area, then the sample mean is so far out that a result this extreme would rarely occur by chance alone when the null is true. So we conclude that the sample data do not support the null hypothesis, and we go with the alternative hypothesis.
For a lower sided alternative hypothesis (one-sided), the lower 5% of the tail area is the rejection region and upper 95% of the area is the retention region.
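The retention and rejection regions above can be sketched numerically. This is an illustrative example (not from the notes) that assumes the `scipy` library is available; the cut-off points are the quantiles of the standard normal sampling distribution.

```python
# Sketch: finding the cut-off points (critical values) for the retention and
# rejection regions described above. Assumes scipy is installed.
from scipy.stats import norm

# Two-sided test at the 5% level: middle 95% is the retention region,
# with 2.5% in each tail as the rejection region.
lower, upper = norm.ppf(0.025), norm.ppf(0.975)
print(f"two-sided cut-offs: {lower:.2f}, {upper:.2f}")   # -1.96, 1.96

# Lower one-sided test at the 5% level: bottom 5% is the rejection region.
print(f"one-sided cut-off: {norm.ppf(0.05):.2f}")        # -1.64
```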
Calculating P value when the Population SD known
The p-value or the tail area of the observed sample mean can be obtained easily by making a transformation on the sample mean assuming that the null hypothesis is true.
Z-score = (sample mean − null-hypothesised population mean) / SE
SE = σ/√n.
The normal distribution (Table A1) is then used to find the p-value or tail area.
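As a worked sketch of the calculation above (with invented numbers, and assuming the `scipy` library in place of Table A1):

```python
# Sketch of the z-test p-value calculation with the population SD known.
# The numbers (xbar, mu0, sigma, n) are invented for illustration.
from math import sqrt
from scipy.stats import norm

xbar, mu0 = 52.0, 50.0   # sample mean and null-hypothesised population mean
sigma, n = 8.0, 64       # known population SD and sample size

se = sigma / sqrt(n)               # SE = sigma / sqrt(n)
z = (xbar - mu0) / se              # Z = (sample mean - population mean) / SE
p_two_sided = 2 * norm.sf(abs(z))  # two-sided tail area from the normal distribution
print(f"z = {z:.2f}, p = {p_two_sided:.4f}")  # z = 2.00, p = 0.0455
```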
Calculating P value when the Population SD unknown
If the population standard deviation is unknown, the recommended transformation for the sample mean is:
T-score = (sample mean − null-hypothesised population mean) / SE
The t-score has (n-1) degrees of freedom.
SE = SD/√n where SD is the sample standard deviation.
The t-distribution (Table A3) is then used to calculate the required tail area or the p-value.
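The same calculation with an unknown population SD can be sketched as follows (invented numbers, with `scipy` standing in for Table A3):

```python
# Sketch of the t-test p-value calculation with the population SD unknown.
# The numbers (xbar, mu0, sd, n) are invented for illustration.
from math import sqrt
from scipy.stats import t

xbar, mu0 = 52.0, 50.0  # sample mean and null-hypothesised population mean
sd, n = 8.0, 25         # sample standard deviation and sample size

se = sd / sqrt(n)              # SE = SD / sqrt(n)
tscore = (xbar - mu0) / se     # T = (sample mean - mu0) / SE
df = n - 1                     # (n - 1) degrees of freedom
p = 2 * t.sf(abs(tscore), df)  # two-sided tail area from the t-distribution
print(f"t = {tscore:.2f}, df = {df}, p = {p:.4f}")
```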
P value interpretation
If the p-value is greater than 0.05, we retain the null hypothesis.
If the p-value is less than or equal to 0.05, reject the null hypothesis and go with the alternative hypothesis.
If 0.001 ≤ p-value < 0.01, then the results are highly statistically significant.
If p-value < 0.001, then the results are very highly statistically significant.
If 0.05 ≤ p-value < 0.10, then the results are marginally statistically significant.
A general guideline about wording you may see in scientific journals:
“The results are statistically significant” – when the p-value < significance level
“The results are not statistically significant” – when the p-value > significance level
Which of the following statements regarding a p-value equal to 0.001 is/are CORRECT?
This means that if 1000 similar studies were undertaken on the same population, only 1 out of 1000 studies would produce a sample result as extreme as the one obtained in this study through sampling variability or chance alone.
This means that if 1000 similar studies were undertaken on the same population, 999 out of 1000 studies would produce a sample result as extreme as the one obtained in this study through sampling variability or chance alone.
The study result is so rare that a chance factor can be ignored as the explanation for the difference from the hypothesised value.
The study result is so rare that a chance factor can't be ignored as the explanation for the difference from the hypothesised value.
Correct answers:
The study result is so rare that a chance factor can be ignored as the explanation for the difference from the hypothesised value.
This means that if 1000 similar studies were undertaken on the same population, only 1 out of 1000 studies would produce a sample result as extreme as the one obtained in this study through sampling variability or chance alone.
Statistical Hypotheses
Hypothesis testing is a statistical method that aims to determine whether the results of a study are due to chance alone.
p‐values and Confidence Intervals are used to determine whether a result is statistically significant.
A hypothesis may be defined simply as a statement about one or more study populations.
Example: A physician may hypothesise that a new drug will be more effective than a standard drug for reducing pain caused by prostate cancer.
Types of Statistical Hypotheses
There are two different statistical hypotheses involved in hypothesis testing, namely, null and alternative hypothesis.
Null hypothesis: No difference or association; H0
Alternative hypothesis: There is a difference or association; Ha
The majority of hypotheses in medicine are two-sided hypotheses, i.e.:
Null hypothesis: No difference
Alternative hypothesis: There is a difference - in either negative or positive direction
Occasionally a one-sided hypothesis is used. This is when the Alternative hypothesis is only one direction.
Option 1: Negative alternative hypothesis
Null hypothesis: No difference, or positive difference
Alternative hypothesis: Negative difference only
Option 2: Positive alternative hypothesis
Null hypothesis: No difference or negative difference
Alternative hypothesis: Positive difference only
Hypothesis testing process
Hypothesis testing is the process of deciding whether to retain or reject the null hypothesis based on sample data. Here’s a concise guide:
- State the Study:
- Outline study objectives, importance, and implications.
- Define null (H0) and alternative (Ha) hypotheses.
- Plan the Hypothesis:
- Clearly state hypotheses and justify the alternative choice.
- Check Assumptions:
- Ensure data follows approximate normality.
- Confirm random sample selection and independence.
- Analyze the Data:
- Calculate test statistic (e.g., t-score) using appropriate formula.
- Use tables to find p-value and assess significance.
- Discuss Results
- Interpret summary statistics and statistical significance.
- Draw conclusions regarding the study population and its implications.
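The steps above can be sketched end-to-end for a one-sample, two-sided test. The data and the null value mu0 = 50 are invented for illustration, and `scipy` is assumed to be available:

```python
# A minimal walk-through of the hypothesis testing process for a one-sample,
# two-sided test. The sample data and mu0 = 50 are invented for illustration.
from scipy.stats import ttest_1samp

sample = [48.2, 51.5, 49.9, 53.1, 50.4, 47.8, 52.6, 50.9]

# H0: mu = 50 vs Ha: mu != 50 (two-sided alternative)
result = ttest_1samp(sample, popmean=50)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")

# Decision at the 5% significance level
if result.pvalue <= 0.05:
    print("Reject H0: the data do not support the null hypothesis.")
else:
    print("Retain H0: any difference is attributable to sampling variability.")
```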
evaluating equality of group variances
Evaluating equality of group variances is required when performing a hypothesis test for two groups. The degrees of freedom for the t-score (t-test) and the SE depend on whether the variances are equal.
The three methods of evaluation
Method 1: Present the data in each group on parallel histograms or box-plots and compare the dispersion. If the dispersions are similar, assume equal variances, otherwise assume unequal variances. There is no cut off, so use your judgment.
Note: Large data should be presented on a histogram, whereas small data should be presented on a boxplot.
Method 2: Take the ratio of larger to smaller variances,
i.e., RATIO = (larger SD)² / (smaller SD)².
If this ratio ≥ 2, assume unequal variances, otherwise assume equal variances.
Note: Standard deviation is the square root of variance.
Method 3: Use a hypothesis test procedure known as Levene’s test for testing the null hypothesis that the groups have equal variances against the alternative hypothesis that the groups have unequal variances.
If the resulting p-value ≤ 0.05, reject the null hypothesis, i.e., consider unequal variances.
On the other hand, if the p-value > 0.05, retain the null hypothesis, i.e., consider equal variances.
Note: Method 3 is the most appropriate method for comparing equality of variances. There is no strong theoretical backup for Methods 1 and 2.
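Methods 2 and 3 can be sketched on two invented samples, assuming the `scipy` library for Levene's test:

```python
# Sketch of Methods 2 and 3 above on two invented groups of data.
from statistics import stdev
from scipy.stats import levene

group_a = [12.1, 14.3, 13.5, 15.0, 12.8, 14.1]
group_b = [11.9, 18.2, 10.5, 16.7, 13.4, 19.1]

# Method 2: ratio of larger to smaller variance (variance = SD squared)
var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
ratio = max(var_a, var_b) / min(var_a, var_b)
print(f"variance ratio = {ratio:.2f}")  # ratio >= 2 suggests unequal variances

# Method 3: Levene's test (H0: equal variances)
stat, p = levene(group_a, group_b)
print(f"Levene p = {p:.3f}")  # p <= 0.05 -> consider unequal variances
```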
There are two types of errors that are possible with hypothesis testing: type I and type II.
If you reduce the risk of Type I errors, Type II errors increase and vice versa.
Retain the null hypothesis when in reality the null is true – CORRECT decision
Reject the null hypothesis when in reality the null is false – CORRECT decision
Reject the null hypothesis when in reality the null is true – Type I error
Retain the null hypothesis when in reality the null is false – Type II error
A common practice is to fix the type I error at some threshold value (e.g. at 0.05), called the significance level of the hypothesis test, and then minimise type II error or maximise power.
The power of a hypothesis test is the probability of not committing a Type II error; it is affected by the sample size, the significance level, and the true value of the parameter.
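The effect of sample size on power can be illustrated for an upper one-sided z-test with known sigma. All the numbers here are invented, and `scipy` is assumed:

```python
# Illustrative sketch of how power grows with sample size for an upper
# one-sided z-test with known sigma. All numbers are invented.
from math import sqrt
from scipy.stats import norm

mu0, mu1 = 50.0, 52.0   # null value and assumed true population mean
sigma, alpha = 8.0, 0.05

powers = {}
for n in (16, 64, 100):
    se = sigma / sqrt(n)
    crit = mu0 + norm.ppf(1 - alpha) * se   # cut-off for rejecting H0
    powers[n] = norm.sf((crit - mu1) / se)  # P(reject H0 | true mean = mu1)
    print(f"n = {n:3d}: power = {powers[n]:.2f}")
```

Holding alpha and the true mean fixed, a larger sample shrinks the SE, so more of the sampling distribution under the true mean falls in the rejection region.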
What is the confidence interval?
A sample statistic is rarely the same as the parameter. A difference between the sample statistic and the parameter may occur purely by chance or sampling variability. So it is sensible to estimate the parameter by an interval centred on the sample statistic. This interval is called the confidence interval.
The key to obtaining the confidence interval is the sampling distribution of the sample statistic. The confidence interval has an associated confidence level, for example 95%, to show how confident we are that the interval contains the parameter.
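A 95% confidence interval for a population mean, centred on the sample mean, can be sketched as follows (invented sample data, t-distribution used since the population SD is unknown, `scipy` assumed):

```python
# Sketch: a 95% confidence interval for a population mean, centred on the
# sample mean. The sample data are invented for illustration.
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

sample = [48.2, 51.5, 49.9, 53.1, 50.4, 47.8, 52.6, 50.9]

n = len(sample)
xbar, sd = mean(sample), stdev(sample)
se = sd / sqrt(n)
tcrit = t.ppf(0.975, df=n - 1)   # 95% confidence -> 2.5% in each tail

lower, upper = xbar - tcrit * se, xbar + tcrit * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```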