B.2 Data and information analysis Flashcards
Employ the concepts of descriptive and inferential statistics and their utility to specific contexts.
What do health informaticians need to employ according to Australian health informatics competency B.2?
Descriptive and inferential statistics and their utility to specific contexts
What are the two main domains of statistics?
- Descriptive statistics
- Inferential statistics
What do descriptive statistics do?
Describe data
What do inferential statistics do?
Derive inferences from data samples about larger populations using probability theory
What are the two key kinds of descriptive statistics?
- Measures of central tendency
- Measures of dispersion
What is the mean?
Calculated by adding all data values and dividing by the number of data values
What is the median?
The middle value when the values are ordered by magnitude
What is the mode?
The most commonly occurring number in a dataset
What is the range in statistics?
The difference between the smallest and largest values in a dataset
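As a quick illustration of the mean, median, mode and range cards above, here is a minimal sketch using Python's standard `statistics` module. The data values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import statistics

# Hypothetical data, e.g. lengths of stay in days (illustrative only)
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)           # sum of the values / number of values
median = statistics.median(values)       # middle of the ordered values (average of the two middle ones here)
mode = statistics.mode(values)           # most frequently occurring value
value_range = max(values) - min(values)  # largest value minus smallest value

print(mean, median, mode, value_range)
```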
What are percentiles?
Values that divide an ordered dataset into 100 equal parts
What do quartiles do?
Divide the dataset into quarters
What do quintiles do?
Divide the dataset into fifths
What do deciles do?
Divide the dataset into tenths
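The cut points behind quartiles, quintiles, deciles and percentiles can be sketched with `statistics.quantiles`. The dataset is hypothetical, and different software packages use slightly different interpolation conventions, so exact cut points may vary between tools.

```python
import statistics

# Hypothetical dataset: the integers 1 to 100
values = list(range(1, 101))

# n equal parts are separated by n - 1 cut points
quartiles = statistics.quantiles(values, n=4)      # 3 cut points
quintiles = statistics.quantiles(values, n=5)      # 4 cut points
deciles = statistics.quantiles(values, n=10)       # 9 cut points
percentiles = statistics.quantiles(values, n=100)  # 99 cut points

print(quartiles)
```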
When is mean typically used as the measure of central tendency?
When a dataset is normally distributed
How is standard deviation calculated?
- Calculate variance by finding the deviation of each value from the mean, squaring it, adding these squared deviations, and dividing by the number of observations.
- Calculate the positive square root of this variance.
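The two steps above can be sketched directly in Python. The data are hypothetical. Note that dividing by the number of observations gives the population standard deviation; when estimating from a sample, the usual convention divides by one less than the number of observations (`statistics.stdev`).

```python
import math
import statistics

# Hypothetical data (illustrative only)
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Step 1: variance = mean of the squared deviations from the mean
variance = sum((v - mean) ** 2 for v in values) / len(values)

# Step 2: standard deviation = positive square root of the variance
std_dev = math.sqrt(variance)

print(variance, std_dev)
print(statistics.pstdev(values))  # library equivalent (population form)
```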
What does a smaller standard deviation indicate?
Values are very close to their mean
What does a larger standard deviation indicate?
Values are far from the mean
What is the probability of a specific event (Px)?
The number of possible instances of that event divided by the total number of possible outcomes, assuming all outcomes are equally likely
What does the addition rule of probability state?
If two events are mutually exclusive (they cannot both occur), the probability of one or the other occurring is equal to the sum of the probabilities of each event
What does the multiplication rule of probability state?
If two events are independent, the probability of their joint occurrence is equal to the product of their individual probabilities
What is the complement of an event?
All outcomes where that event is negated/does not occur
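The probability cards above can be checked with a small worked example. This is a sketch using a fair six-sided die as an assumed sample space; the `probability` helper is hypothetical and relies on all outcomes being equally likely. The addition rule is applied to events that cannot occur together (rolling a 1 or a 2 on one roll), and the multiplication rule to two separate rolls.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Favourable outcomes / all outcomes, assuming equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

die = set(range(1, 7))  # assumed sample space: one roll of a fair die

# Addition rule: rolling a 1 and rolling a 2 cannot both happen on one roll
p_one_or_two = probability({1}, die) + probability({2}, die)

# Multiplication rule: two separate rolls do not influence each other
p_double_six = probability({6}, die) * probability({6}, die)

# Complement: every outcome in which a 6 is not rolled
p_not_six = 1 - probability({6}, die)

print(p_one_or_two, p_double_six, p_not_six)
```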
What is probabilistic reasoning?
Logically deductive reasoning that argues from general assumptions to specific outcomes
What is statistical reasoning?
Inductive reasoning that argues from observed specifics to broader generalisations
What is a random sample?
A sample where every member of the target population has an equal probability of being selected
What is a confidence interval?
A range of values, calculated from sample data, within which we can be reasonably confident the population statistic lies
What characterizes a normal distribution?
- Symmetrical around the mean
- Mean = median = mode
- Approximately 68.27% of values lie within one standard deviation of the mean
- Approximately 95.45% lie within two standard deviations of the mean
- Approximately 99.73% lie within three standard deviations of the mean
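The coverage figures for a normal distribution can be reproduced from the standard normal cumulative distribution function. A minimal sketch using `statistics.NormalDist` (Python 3.8+); the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```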
What is the significance of a 95% confidence level?
The probability of observing a value greater than two standard deviations from the mean is less than 5%
What is the significance of a 99% confidence level?
The probability of observing a value more than about 2.58 standard deviations from the mean is less than 1% (three standard deviations is a conservative round figure)
Fill in the blank: The simplest measure of dispersion is _______.
range
Fill in the blank: A good sample will ensure that the numbers of respondents in all of the cells of interest are _______.
sufficient
What does a 95% confidence interval indicate?
The probability of observing a value greater than 1.96 standard deviations from the mean is less than 5%.
Confidence intervals provide a range of values about which we can be reasonably confident rather than a single estimate.
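A confidence interval for a mean can be sketched from summary statistics. This example assumes a sample large enough for the normal approximation to apply; the function name and the numbers (mean 120, standard deviation 15, n = 100) are hypothetical.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation interval: mean +/- z * (sd / sqrt(n))."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical summary statistics (illustrative only)
low, high = mean_confidence_interval(120, 15, 100)
print(round(low, 2), round(high, 2))
```

Raising the confidence level widens the interval, and increasing the sample size narrows it.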
What is Six Sigma?
The basis of Six Sigma is ‘six standard deviations’. It means attaining a sustained standard of fewer than 3.4 mistakes per million opportunities.
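This corresponds to 99.99966% of opportunities being defect-free. The figure conventionally allows for a 1.5 standard deviation long-term shift in the process mean, so it reflects a one-sided tail at 4.5 standard deviations rather than the proportion of values within six standard deviations of the mean.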
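The 3.4 per million figure is conventionally derived from a one-sided normal tail at 4.5 standard deviations (six, less an assumed 1.5 standard deviation long-term shift in the process mean). The sketch below checks that arithmetic; the 1.5 shift is the usual Six Sigma convention, not something stated in the cards.

```python
from statistics import NormalDist

z = NormalDist()

# One-sided tail beyond 4.5 standard deviations (6 minus the assumed 1.5 shift)
defects_per_million = z.cdf(-4.5) * 1_000_000

# For comparison: both tails beyond 6 standard deviations with no shift
two_sided_six_sigma = 2 * z.cdf(-6) * 1_000_000

print(round(defects_per_million, 1))
print(two_sided_six_sigma)
```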
What is a sampling distribution?
The distribution of a statistic that results when all possible random samples of a given size are drawn from the population.
It depends on the underlying distribution, the statistic considered, the sampling procedure, and the sample size.
What does the Central Limit Theorem state?
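If samples are large (a common rule of thumb is over 30), independent, and randomly drawn, the sampling distribution of the sample mean will be approximately normal, whatever the shape of the population distribution, and the mean of the sampling distribution will equal the population mean.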
The standard deviation of a sampling distribution is called the standard error.
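The Central Limit Theorem and the standard error can be illustrated with a small simulation. This is a sketch only: the skewed (exponential) population, the sample size of 50 and the 2,000 repetitions are all assumptions made for the demonstration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical, strongly skewed population
population = [random.expovariate(1.0) for _ in range(20_000)]
sample_size = 50

# Draw many random samples and record each sample mean
sample_means = [
    statistics.mean(random.sample(population, sample_size))
    for _ in range(2_000)
]

population_mean = statistics.mean(population)
# Theoretical standard error: population standard deviation / sqrt(n)
standard_error = statistics.pstdev(population) / sample_size ** 0.5

print(population_mean, statistics.mean(sample_means))
print(standard_error, statistics.stdev(sample_means))
```

The mean of the sample means should sit close to the population mean, and their standard deviation close to the theoretical standard error.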
What are inferential statistics used for?
To make inferences about the characteristics of a population based on sample data.
Inference involves drawing samples from the population and interpreting sample data against a sampling distribution.
What is estimation in statistics?
Inferring the characteristics of a population from sampled data, using sample mean to estimate population mean and sample standard deviation to estimate population standard deviation.
Estimation is simpler when the population standard deviation is known; when it is unknown and must be estimated from the sample, the Student’s t distribution is often used.
What is the Student’s t distribution?
A family of probability models that are bell-shaped and centered on 0, identified by degrees of freedom.
A t distribution becomes increasingly normal as degrees of freedom rise.
What is the formula for calculating the test statistic in hypothesis testing?
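t = (x̄ - μ) / (s / √n), where x̄ is the sample mean, μ is the population mean under the null hypothesis, s is the sample standard deviation, and n is the sample size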
This is used to test the null hypothesis against the alternative hypothesis.
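A minimal sketch of the one-sample t statistic, computed as (sample mean minus hypothesised mean) divided by (sample standard deviation over the square root of n). The function name and the data are hypothetical.

```python
import math
import statistics

def one_sample_t(sample, hypothesised_mean):
    """t = (sample mean - hypothesised mean) / (s / sqrt(n))."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (sample_mean - hypothesised_mean) / (s / math.sqrt(n))

# Hypothetical sample and null-hypothesis mean (illustrative only)
sample = [4, 6, 8, 10, 12]
t_statistic = one_sample_t(sample, hypothesised_mean=6)
degrees_of_freedom = len(sample) - 1

print(t_statistic, degrees_of_freedom)
```

The resulting value is compared against a t distribution with n - 1 degrees of freedom.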
What are the two types of hypotheses in hypothesis testing?
The null hypothesis and the alternative hypothesis.
The null hypothesis states there is no relationship, while the alternative hypothesis states there is a relationship.
What is a p-value?
The probability of obtaining a result at least as extreme as the one observed, when it is assumed that there is no relationship in the population (that is, that the null hypothesis is true).
It is compared against a significance level to determine if the null hypothesis should be rejected.
What are the possible outcomes from hypothesis testing?
- Correct decision when null is true and not rejected
- Type 1 error when null is true and rejected
- Correct decision when null is false and rejected
- Type 2 error when null is false and not rejected
These outcomes illustrate the potential errors in hypothesis testing.
What is a Z-Test?
A statistical test for which a normal distribution can approximate the distribution of the test statistic under the null hypothesis.
It is more convenient than the Student’s t-test for large samples.
How is the degree of freedom calculated?
One less than the sample size (n-1).
For example, for a sample size of 101, the degrees of freedom would be 100.
What is the significance level in hypothesis testing?
The decision rule chosen in advance to conclude whether to reject the null hypothesis or not (e.g., 5%, 1%).
Common significance levels in epidemiology are 0.05 and 0.01.
What does a t-table provide?
Selected values for t-distributions with various degrees of freedom and confidence levels.
It helps in determining the t value needed for hypothesis testing.
What happens if the probability value is less than or equal to the significance level?
The null hypothesis should be rejected in favour of the alternative hypothesis.
This indicates statistical significance in the findings.
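The decision rule can be sketched for a Z-test, where the p-value comes from the standard normal distribution. The function names and the observed statistic of 2.5 are hypothetical.

```python
from statistics import NormalDist

def two_tailed_p_value(z_statistic):
    """Probability of a result at least this extreme, in either direction, under the null."""
    return 2 * (1 - NormalDist().cdf(abs(z_statistic)))

def reject_null(p_value, significance_level=0.05):
    """Reject the null hypothesis when p <= the significance level chosen in advance."""
    return p_value <= significance_level

p = two_tailed_p_value(2.5)  # hypothetical observed z statistic
print(round(p, 4), reject_null(p, 0.05), reject_null(p, 0.01))
```

The same result can be significant at the 5% level and not at the 1% level, which is why the significance level is fixed before the data are examined.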
What is the significance level critical value for a Z-test at 5% two-tailed?
1.96
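The critical value can be recovered from the inverse of the standard normal cumulative distribution function. A minimal sketch; the helper name `z_critical` is hypothetical.

```python
from statistics import NormalDist

def z_critical(significance_level, two_tailed=True):
    """Critical z value: the significance level is split across both tails if two-tailed."""
    tail_area = significance_level / 2 if two_tailed else significance_level
    return NormalDist().inv_cdf(1 - tail_area)

print(round(z_critical(0.05), 2))  # two-tailed, 5%
print(round(z_critical(0.01), 2))  # two-tailed, 1%
```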
When is the Student’s t-test more appropriate than the Z-test?
When the population variance is unknown and the sample size is not large (n < 30)
What does statistical power refer to?
The likelihood that the null hypothesis would be rejected if a specified difference exists
What factors affect statistical power?
Sample size, the variance of individual observations, the size of the true difference, and the chosen significance level
True or False: Reducing the significance level from 0.05 to 0.01 increases statistical power.
False
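To see why a stricter significance level lowers power, here is a sketch of the power of a one-sided Z-test for a shift in the mean. It assumes a known standard deviation; the function name and the numbers are hypothetical.

```python
from statistics import NormalDist

def z_test_power(true_difference, sigma, n, significance_level):
    """Power of a one-sided Z-test: P(reject null | the specified difference exists)."""
    z = NormalDist()
    critical = z.inv_cdf(1 - significance_level)
    # How far the true difference shifts the test statistic, in standard errors
    shift = true_difference * n ** 0.5 / sigma
    return 1 - z.cdf(critical - shift)

# Hypothetical scenario: difference 0.5, sigma 1, n = 25
power_at_05 = z_test_power(0.5, 1.0, 25, 0.05)
power_at_01 = z_test_power(0.5, 1.0, 25, 0.01)
print(round(power_at_05, 3), round(power_at_01, 3))
```

Under these assumptions, power also rises with a larger sample size and falls with a larger variance.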
What does correlation quantify?
The degree to which two variables vary together
What happens if two variables are independent?
The value of one has no relationship to the value of the other
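One common way to quantify how two variables vary together is Pearson's correlation coefficient, which ranges from -1 to +1. The cards do not name a specific coefficient, so treating it as Pearson's r is an assumption; the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their individual variation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

# Hypothetical data (illustrative only)
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly positive relationship
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly negative relationship
```

A value near 0 indicates no linear relationship, which is weaker than full independence.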
What is regression used for?
To analyze how variables influence each other
In regression analysis, what is the dependent variable?
The value of interest that is influenced by independent variables
What type of regression model is used when the dependent variable is continuous and normally distributed?
Linear regression
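A minimal sketch of simple linear regression with one independent variable, fitted by ordinary least squares. The function name and the data are hypothetical; real analyses would normally use a statistics package and check the model assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 1 + 2x
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```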
What type of regression model is used when the dependent variable indicates the presence or absence of a characteristic?
Logistic regression
What are Cox proportional hazards models used for?
Modelling data in which the dependent variable represents the time until the occurrence of an event