Ipres 3 Flashcards
Descriptive statistics
is a set of methods used to describe data and their characteristics. For example, if you were investigating the number of visitors to a beach
in August (nice job if you can get it!), you might draw a graph to see how the
number of visitors varied each day, work out how many people visit on an average
day and calculate the proportion of visitors who were male/female or children/
adults. These would all be descriptive statistics.
Inferential statistics
involves using what we know to make inferences (estimates
or predictions) about what we don’t know. For example, if we asked 200 people
who they were going to vote for on the day before a local election we could try to
predict which party would win the election. Or if we asked 50 injecting drug users
whether they share injecting equipment such as needles with other users, we could
try to estimate the proportion of all injecting drug users who share equipment.
Continuous variables
include things like ‘weights of newborn babies’, ‘distance travelled to work by those in full-time employment’, or ‘percentage of children living in lone-parent families’. Such variables are measured in numbers, and an observation may take any value on a continuous scale. For example, distance travelled to work could take a value of 0 miles for people working at home, 1.6 miles,
4.8 miles, or any other value up to 100 miles or more for those commuting long
distances. Similarly, any variable measured as a percentage can take a value of 0%,
100% or anything in between. For continuous variables the standard rules of arithmetic apply, so it makes sense to say that if you commute for 4 miles then that is
twice the commute of someone who commutes 2 miles.
Discrete/ categorical variables
are not measured on a continuous
numerical scale. Examples of discrete variables are:
- sex: female/male
- religion: Buddhist/Christian/Hindu/Jewish/Muslim/Sikh/other
- degree subject studied: Politics/Sociology/Social Work
and so on. Such variables have no numeric value. We may assign them a number,
for example, Politics = 1, Sociology = 2, Social Work = 3 and so on, but the actual
numbers do not mean anything. For example, Social Work is not three times greater
than Politics!
types of categorical variables
Nominal variables
are variables that have two or more categories, but which do not have an intrinsic order or inherent numerical quality in themselves. So nominal variables include things like ‘marital status’, ‘ethnicity’ or ‘location’, in addition to the variables already identified including ‘religion’ and ‘degree studied’. In some cases there may be many attributes to a nominal variable. For instance, if we were classifying where people live in the USA by state there would be 50 attributes (or states).
types of categorical variables
Dichotomous variables
are nominal variables which have only two categories or levels. For example, if we were looking at gender, we would generally categorise somebody as ‘male’ or ‘female’. This is also a nominal variable. A further example would be if we asked somebody if they had ever smoked, giving them the possible answers of ‘yes’ or ‘no’.
types of categorical variables
Ordinal variables
have two or more categories, like nominal variables, but the categories
can also be ordered or ranked moving from greater to smaller values (or vice versa). So an opinion poll might ask how likely you were to vote for a particular party at the next election with the possible options ‘very likely’, ‘likely’, ‘not sure’, ‘unlikely’ or ‘very unlikely’. While the responses are in a particular order we cannot (or should not!) place a ‘value’ on them. This is despite our own views about which we prefer! Another example would be if you asked someone to provide an answer to the statement ‘I generally eat healthily’ and had a similar scale from ‘strongly agree’ to ‘strongly disagree’. It is difficult to say whether the distance between the two categories is equal as it will depend upon individual perception. So someone who has their ‘five-a-day’ and five chocolate bars too may consider themselves to eat healthily, whereas another person may consider that eating two chocolate bars in addition to lots of fruit and vegetables means they do not eat healthily. As a result, the distance between the points on the scale is not clear and continuous.
continuous variables
difference between interval and ratio variables
An interval variable has an arbitrary zero point, so 0 does not mean ‘none’: 0°C on the Celsius scale is not an absence of temperature. A ratio variable has a true zero (e.g., age or weight), so ratios such as ‘twice as much’ are meaningful.
Item non-response
A specific type of missing data that occurs when a respondent does not provide a valid answer to a question, which can reduce sample size and introduce bias.
Selection bias
A type of bias that arises when missing data is not random, meaning that certain groups of people or opinions are systematically underrepresented.
Listwise deletion
A method of handling missing data by removing all cases with missing values from the analysis, which can reduce sample size and potentially introduce bias.
Imagine you’re playing a board game with your friends, but one of them doesn’t answer an important question. Instead of guessing their answer, you decide to remove them from the game completely. That’s what listwise deletion does—it removes any data that has a missing answer, even if most of it is still useful.
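A minimal sketch of listwise deletion in pandas, using a small hypothetical survey (the data and column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical survey data: two respondents skipped a question (NaN).
survey = pd.DataFrame({
    "age":    [23, 31, np.nan, 45, 52],
    "income": [21000, 34000, 29000, np.nan, 48000],
})

# Listwise deletion: drop every row that has any missing value.
complete_cases = survey.dropna()
print(complete_cases)  # only 3 of the 5 respondents remain
```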
Imputation
A technique for handling missing data by predicting the missing responses using other available information from the questionnaire.
Let’s say you’re reading a storybook, but one page is missing. Instead of skipping that part, you try to guess what happened based on the rest of the story. That’s what imputation does—it fills in missing answers by using the information that’s already there.
Mean imputation
A basic form of imputation that replaces missing values with the mean of the observed values (or a similar midpoint measure such as the median), which is only reliable if the data are missing at random.
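A quick sketch of mean imputation with pandas, assuming a hypothetical series of scores:

```python
import pandas as pd
import numpy as np

scores = pd.Series([52, 61, np.nan, 70, np.nan, 58])

# Replace each missing value with the mean of the observed values.
imputed = scores.fillna(scores.mean())
print(imputed)  # the NaNs become 60.25, the mean of the four observed scores
```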
Regression imputation
A more advanced method that uses regression techniques to estimate missing values based on other answers from the respondent, making it more statistically sound but requiring good predictors.
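One simple way to sketch regression imputation is an ordinary least-squares fit via NumPy's polyfit; real analyses would typically use a dedicated imputation routine, and the data here are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical data: predict missing income from age.
df = pd.DataFrame({
    "age":    [23, 31, 38, 45, 52, 29],
    "income": [21000, 34000, 39000, np.nan, 48000, np.nan],
})

observed = df.dropna()
# Fit a simple linear regression: income = b0 + b1 * age.
b1, b0 = np.polyfit(observed["age"], observed["income"], 1)

# Fill each missing income with its predicted value.
missing = df["income"].isna()
df.loc[missing, "income"] = b0 + b1 * df.loc[missing, "age"]
print(df)
```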
Aggregate data
means collecting and combining lots of individual pieces of information to look at overall patterns instead of specific details.
Imagine a jar full of different-colored candies. Instead of counting each color one by one, you just say, “There are 100 candies in total, and about 40% are red.” That’s aggregation—you’re summarizing the data instead of focusing on each tiny piece.
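A small pandas sketch of the candy-jar idea, aggregating hypothetical individual records into totals and proportions:

```python
import pandas as pd

# Individual records: one row per candy.
candies = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Aggregate: overall count and proportions instead of individual pieces.
print(len(candies))                          # total: 6
print(candies.value_counts(normalize=True))  # red 0.5, blue ~0.33, green ~0.17
```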
Likert scale (interval data)
A Likert scale is a rating scale used in surveys to measure people’s opinions, attitudes, or behaviors. It typically consists of a series of statements with response options ranging from strongly agree to strongly disagree (or similar). It helps quantify subjective feelings in a structured way.
What is skewness?
Skewness is about how the data in a graph “leans” or is “stretched” to one side. Imagine stacking blocks from smallest to biggest, and then a few really big or really small blocks change the shape.
Positively skewed (right-skewed)
- Most data is small and clumped on the left side of the graph.
- A few really big values stretch the graph to the right.
- The mean (average) is pulled up by the big numbers.
- The median (middle value) stays closer to the center of the data.
👉 Result:
Mean > Median
📌 Why?
Because the mean adds everything up and divides, those few very large values make the total bigger, so the average increases.
Example:
Imagine these numbers: 2, 3, 4, 5, 30
Mean = (2+3+4+5+30)/5 = 8.8
Median = 4
→ Mean is higher than median.
Negatively skewed (left-skewed)
- Most data is large and clumped on the right side of the graph.
- A few really small values stretch the graph to the left.
- The mean gets pulled down by the small numbers.
- The median stays closer to the middle of the data.
👉 Result:
Mean < Median
📌 Why?
Because the mean is dragged down by those very small numbers, even if most values are higher.
Example:
Numbers: 1, 20, 21, 22, 23
Mean = (1+20+21+22+23)/5 = 17.4
Median = 21
→ Mean is lower than median.
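Both worked examples above can be checked with Python's statistics module:

```python
from statistics import mean, median

right_skewed = [2, 3, 4, 5, 30]
left_skewed  = [1, 20, 21, 22, 23]

print(mean(right_skewed), median(right_skewed))  # 8.8 > 4   -> mean > median
print(mean(left_skewed),  median(left_skewed))   # 17.4 < 21 -> mean < median
```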
Guidelines for choosing a measure of the average
- Modes are not used very often, though they can be useful in certain circumstances. Avoid them in general, or use them along with other measures!
- The median is more intelligible to the general public because it is the ‘middle’ observation.
- The mean uses all the data, but the median does not. Therefore the mean is ‘influenced’ more by unusual or extreme data. If the data are particularly subject to error, use the median.
- The mean is more useful when the distribution is symmetrical or normal. The median is more useful when the distribution is positively or negatively skewed, because outliers do not affect the median.
- The mean is best for minimising sampling variability. If we take repeated samples from a population, each sample will give us a slightly different mean and median. However, the means will vary less than the medians. (For example, if we took 10 different samples of 20 students and calculated the average weight for each of the 10 groups, the mean weights would differ less between the 10 groups than the median weights.)
other distributions of data
- deciles (which divide a distribution into ten equal parts)
- quintiles (which divide a distribution into five equal parts)
- quartiles (which divide a distribution into four equal parts)
Percentiles
Percentiles tell you how a value compares to the rest of the data by showing the percentage of values that fall below it. If you’re in the 90th percentile on a test, that means 90% of the class scored lower than you — and only 10% scored higher.
Suppose we had a set of exam marks and we knew that the 30th percentile was a mark of 52. This means that 30% of people who sat the exam got a mark of 52 or lower.
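A short NumPy sketch of quartiles, deciles and percentiles, using hypothetical exam marks:

```python
import numpy as np

marks = np.array([35, 42, 48, 52, 55, 60, 63, 68, 74, 81])

# Quartiles divide the distribution into four equal parts.
print(np.percentile(marks, [25, 50, 75]))

# Deciles use cut points at 10%, 20%, ...; any percentile is a single cut point.
print(np.percentile(marks, 30))  # the 30th percentile: 30% of marks fall at or below it
```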
Range
The difference between the highest and lowest values in a dataset, used to measure the spread of data.
Inter-Quartile Range (IQR)
The difference between the upper quartile (Q3) and lower quartile (Q1) in a dataset, representing the range of the middle 50% of the data.
To work it out we would place the observations in order and concentrate on the middle 50 per cent of the distribution. So the inter-quartile range is the range of the middle half of the observations: the difference between the upper quartile and the lower quartile.
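A minimal NumPy sketch of the IQR calculation, with made-up data:

```python
import numpy as np

data = np.array([3, 7, 8, 12, 13, 14, 18, 21, 23, 27])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # the range of the middle 50% of observations
print(q1, q3, iqr)
```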
Histogram vs. Density Plot
A histogram is a type of graph used for interval or ratio data. It works by grouping data into bins (ranges) and displaying these bins as bars. The height of each bar represents how many values fall into that particular range, giving a visual overview of the frequency distribution of the dataset.
A density plot, on the other hand, is a smoothed version of a histogram that shows the proportion of values at each point of a continuous random variable. Instead of grouping data into bars, it uses a curve to estimate the probability distribution, which helps to visualize where values are most concentrated and makes it easier to compare multiple distributions.
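A sketch of both plots with matplotlib and SciPy's kernel density estimator, using simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

values = np.random.default_rng(0).normal(loc=70, scale=10, size=500)

# Histogram: group the values into bins and count them.
plt.hist(values, bins=20, density=True, alpha=0.5, label="histogram")

# Density plot: a smoothed estimate of the same distribution.
xs = np.linspace(values.min(), values.max(), 200)
plt.plot(xs, gaussian_kde(values)(xs), label="density")
plt.legend()
plt.show()
```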
Summary Statistics
Summary statistics are used to describe patterns in data and make comparisons across datasets. They help simplify large amounts of information by highlighting key characteristics of a distribution.
There are two main types:
a. Central Tendency – Measures that describe where the center of a distribution lies, or what a “typical” value might be. This includes:
- Mode: The most frequent value
- Mean: The arithmetic average
- Median: The middle value when data is ordered
b. Dispersion – Measures that show how spread out or clustered the data values are, such as range, interquartile range (IQR), variance, and standard deviation.
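A quick pandas sketch computing both kinds of summary statistic for a hypothetical set of weights:

```python
import pandas as pd

weights = pd.Series([61, 64, 64, 68, 70, 72, 75, 90])

# Central tendency
print(weights.mode().iloc[0], weights.mean(), weights.median())

# Dispersion
print(weights.max() - weights.min())                     # range
print(weights.quantile(0.75) - weights.quantile(0.25))   # IQR
print(weights.var(), weights.std())                      # variance, standard deviation
```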
Properties of mean vs median (non-) resistance to outliers
- Median is resistant to outliers, mean is not.
- Perfectly symmetric: mean = median
- Skewed to the left / negatively skewed: mean < median
- Skewed to the right / positively skewed: mean > median
Why Use n - 1 in Standard Deviation
When calculating standard deviation for a sample, we divide by n - 1 instead of just n to correct for bias. This is called Bessel’s correction.
It helps us get a more accurate estimate of the population standard deviation because using just n tends to underestimate the variability in the population. Subtracting 1 accounts for the fact that we’re using the sample mean (an estimate) rather than the true population mean.
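NumPy's ddof argument makes the difference concrete; a small sketch with made-up sample values:

```python
import numpy as np

sample = np.array([4.0, 7.0, 6.0, 5.0, 8.0])

# ddof=0 divides by n (population formula); ddof=1 divides by n - 1 (Bessel's correction).
print(np.std(sample, ddof=0))  # tends to underestimate the population SD
print(np.std(sample, ddof=1))  # slightly larger, less biased sample estimate
```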
Z score
- A Z-score measures the number of standard deviations an observation is away from the mean.
- A positive Z-score shows that the observation is greater than the mean (above average).
- A negative Z-score shows that the observation is lower than the mean (below average).
- The Z-score will be zero if the observation equals the mean.
- Most Z-scores will lie in the range from Z = -2 to Z = 2. Values more than two standard deviations from the mean tend to be extreme values (outliers).
Difference Between Standard Deviation and Z-Score
A standard deviation (SD) measures how spread out the values in a dataset are—it tells us how much the data tends to deviate from the mean. In contrast, a z-score measures how far a single data point is from the mean, expressed in units of standard deviation. For example, if someone has a z-score of 1.5, it means their value is 1.5 standard deviations above the average. While SD describes the variability of the entire dataset, the z-score focuses on the relative position of one specific value within that distribution.
What does a Z-score of +1 mean. How much of the data lies between the mean and +1 SD on a normal distribution?
A Z-score of +1 means the value is one standard deviation above the mean. In a normal distribution (bell curve), approximately 34.1% of the data lies between the mean and one standard deviation above it. For example, if the mean is 70 and the SD is 10, then about 34.1% of values fall between 70 and 80.
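A sketch with SciPy confirming the ~34.1% figure, using the worked example's mean of 70 and SD of 10:

```python
from scipy.stats import norm

mean, sd = 70, 10
value = 80
z = (value - mean) / sd  # z = +1: one SD above the mean

# Area under the standard normal curve between the mean (z = 0) and z = +1.
print(z, norm.cdf(1) - norm.cdf(0))  # 1.0, ~0.3413
```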
normal distribution characteristics
- The mean lies in the middle and the curve is symmetrical about the mean. The mean is equal to the median.
- Most of the observations are close to the mean, so the frequency is high around the mean. There are fewer observations that are much greater or much smaller than the mean.
- The curve never actually touches the X axis on either side; it just gets closer and closer.
Types of errors in sampling
Sampling Variability:
Refers to the natural differences in results (like the mean or standard deviation) that occur when multiple random samples are taken from the same population. These differences are due to chance and will vary depending on which individuals happen to be selected.
Sampling Error:
This is the difference between the sample estimate (like the sample mean) and the actual population value. It happens because a sample is only a part of the population, and it may not perfectly represent it. Sampling error can be minimized by using better sampling methods, but it is not caused by mistakes.
Non-Sampling Error:
These are errors unrelated to how the sample is selected, such as data entry mistakes, misunderstood questions, or biased data collection procedures.
probability sampling
Simple Random Sampling
Definition: Every individual in the population has an equal chance of being selected.
Example: Assign numbers to all 1,000 employees in a company and use a random number generator to select 100 employees for a survey.
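A minimal sketch of the employee example using Python's random module:

```python
import random

employees = list(range(1, 1001))        # employee IDs 1..1000
sample = random.sample(employees, 100)  # each employee equally likely to be chosen
```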
probability sampling
Systematic Sampling
Definition: Select every nth individual from a list, starting from a randomly chosen point.
Example: In a list of 300 students, randomly choose a starting point between 1 and 15, then select every 15th student to form a sample of 20 students.
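A sketch of the student example; the starting point and step match the definition above:

```python
import random

students = list(range(1, 301))   # 300 students
start = random.randint(1, 15)    # random starting point between 1 and 15
sample = students[start - 1::15] # every 15th student -> 20 students
```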
probability sampling
Stratified Sampling
Definition: Divide the population into subgroups (strata) based on a specific characteristic, then randomly sample from each subgroup proportionally.
Example: In a company with 800 female and 200 male employees, to create a representative sample of 100 employees, randomly select 80 women and 20 men.
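A sketch of the proportional company example, with hypothetical employee labels:

```python
import random

women = [f"W{i}" for i in range(800)]
men   = [f"M{i}" for i in range(200)]

# Sample each stratum in proportion to its share of the population (80% / 20%).
sample = random.sample(women, 80) + random.sample(men, 20)
```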
probability sampling
Cluster Sampling
Definition: Divide the population into clusters, randomly select some clusters, and then survey all individuals within those clusters.
Example: To study university students across the UK, randomly select 15 universities (clusters) and survey all students within each selected university.
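A sketch of the university example, with hypothetical cluster data:

```python
import random

# Hypothetical mapping of universities (clusters) to their students.
universities = {f"uni_{i}": [f"uni_{i}_student_{j}" for j in range(50)]
                for i in range(100)}

chosen = random.sample(list(universities), 15)             # randomly pick 15 clusters
sample = [s for uni in chosen for s in universities[uni]]  # survey everyone in them
```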
non-probability sampling
convenience sampling
Definition:
Sampling individuals who are easiest to access. No attempt is made to ensure representativeness.
Example:
A teacher surveys their own students to study young adult habits.
🧠 Pros: Quick and easy. Legitimate in pilot studies and experiments
⚠️ Cons: High risk of bias, not generalizable.
non-probability sampling
snowball sample
Definition:
Start with a few participants who meet your criteria, then ask them to refer others.
Example:
Interviewing marijuana users by first contacting known users, who then refer friends.
🧠 Pros: Useful for hidden/hard-to-reach groups (e.g., drug users, undocumented migrants).
⚠️ Cons: Not representative of the larger population.
non-probability sampling
quota sample
Definition:
Select participants non-randomly to fill predefined quotas (e.g., gender, age).
Example:
You want 60 men and 40 women in your sample of 100. You stop collecting data once the quotas are filled.
🧠 Pros: Ensures inclusion of specific groups.
⚠️ Cons: Still non-random, subject to interviewer bias.
non-probability sampling
purposive sample
Definition:
Select participants intentionally because they fit a specific profile relevant to the research.
Example:
Interviewing women aged 30–40 on the street for a study about skincare habits.
🧠 Pros: Good for targeted research.
⚠️ Cons: Subjective judgment, potential for selection error.
central limit theorem
The Central Limit Theorem (CLT) is a fundamental principle in statistics stating that, regardless of the population’s distribution, the distribution of sample means will approximate a normal distribution as the sample size becomes large. This holds true even if the original data are not normally distributed. Typically, a sample size of 30 or more is considered sufficient for the CLT to apply. This theorem is crucial because it allows statisticians to make inferences about population parameters using the normal distribution, facilitating hypothesis testing and confidence interval estimation.
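A small simulation sketch of the CLT: even with a skewed (exponential) population, the means of samples of size 30 pile up in a roughly normal shape:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# A clearly non-normal population.
population = rng.exponential(scale=2.0, size=100_000)

# Distribution of 2000 sample means (n = 30 each): approximately normal.
sample_means = [rng.choice(population, size=30).mean() for _ in range(2000)]
plt.hist(sample_means, bins=40)
plt.show()
```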
confidence intervals
A confidence interval (CI) is a range of values, derived from sample data, that is likely to contain the true value of an unknown population parameter (such as a mean or proportion). It provides an estimated range believed to encompass the parameter with a certain level of confidence, commonly 95% or 99%.
For example, a 95% confidence interval for a population mean might be 70.3 kg to 72.2 kg. This means that if we were to take many random samples and compute a confidence interval from each, approximately 95% of those intervals would contain the true population mean.
It’s important to note that the confidence level (e.g., 95%) refers to the long-run success rate of the method used to construct the interval, not the probability that the true parameter lies within a specific interval calculated from a single sample.
Confidence intervals are widely used in statistics to express the uncertainty associated with sample estimates and to infer population parameters.
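A sketch of a 95% CI for a mean, using the normal critical value 1.96 (for a sample this small a t critical value would give a slightly wider interval; the weights are hypothetical):

```python
import numpy as np

weights = np.array([68.1, 72.4, 70.0, 69.3, 74.8, 71.2, 70.6, 73.0])

mean = weights.mean()
se = weights.std(ddof=1) / np.sqrt(len(weights))  # standard error of the mean

# Approximate 95% CI: mean plus/minus 1.96 standard errors.
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI: {low:.1f} to {high:.1f}")
```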
when to use SD,SE,CI
- Use SD when you want to describe the spread of data within your sample.
- Use SE when you’re interested in the precision of your sample mean as an estimate of the population mean.
- Use CI when you want to express the uncertainty around your sample estimate and make inferences about the population parameter.
one tailed test
A one-tailed test is used when the research hypothesis specifies a direction of the effect—either an increase or a decrease.
Purpose: Tests for the possibility of an effect in one direction only.
Null hypothesis (H₀): The parameter is equal to a specific value.
Alternative hypothesis (H₁): The parameter is either greater than or less than that value, depending on the research question.
Example: Testing if a new drug increases recovery rate compared to the standard treatment.
Critical Region: Allotted entirely to one tail of the distribution.
Consideration: One-tailed tests have more statistical power to detect an effect in one direction but cannot detect an effect in the opposite direction.
Use a one-tailed test when the research hypothesis predicts a specific direction of the effect.
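A sketch of a one-tailed comparison with SciPy (the alternative argument requires SciPy 1.6+; the recovery scores are invented):

```python
from scipy import stats

drug    = [14, 15, 16, 17, 18, 15, 16]  # recovery scores, new drug
control = [13, 14, 14, 15, 13, 14, 15]  # standard treatment

# H1: drug mean > control mean, so only the upper tail counts.
t, p = stats.ttest_ind(drug, control, alternative="greater")
print(t, p)
```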
two tailed test
A two-tailed test is appropriate when the research hypothesis does not specify a direction, meaning the effect could be in either direction. In a two-sided test, the aim is to discover whether the sample mean or proportion is different from the population mean or proportion, not whether it is either larger or smaller.
Purpose: Tests for the possibility of an effect in both directions.
Hypotheses:
Null hypothesis (H₀): The parameter is equal to a specific value.
Alternative hypothesis (H₁): The parameter is not equal to that value.
Example: Testing if a new teaching method leads to a different test score average, without specifying if it’s higher or lower.
Critical Region: Split between both tails of the distribution.
Consideration: Two-tailed tests are more conservative and are appropriate when deviations in both directions are of interest.
Use a two-tailed test when the research hypothesis does not predict a direction, or when deviations in both directions are important.
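The same kind of comparison two-tailed; SciPy's ttest_ind defaults to a two-sided alternative (the test scores are invented):

```python
from scipy import stats

new_method = [72, 75, 78, 70, 74, 77]
old_method = [71, 73, 69, 74, 72, 70]

# Default alternative="two-sided": detects a difference in either direction.
t, p = stats.ttest_ind(new_method, old_method)
print(t, p)
```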