L1 Flashcards
What’s a population?
The entire group of individuals or items, with common characteristics, that we want to study.
What’s a sample?
A subset of the population that we actually observe and use to draw conclusions about the whole population.
What’s a parameter?
An unknown population characteristic we want to estimate.
What is random sampling?
Every group of subjects of a given size from the population has the same chance of being selected, so no subject is systematically left out.
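A minimal sketch of simple random sampling, assuming a hypothetical population of 100 subject IDs and a sample size of 10 (neither comes from the notes). It uses the standard library's `random.sample`, which draws without replacement.

```python
import random

# Hypothetical population: subject IDs 1..100 (illustrative only).
population = list(range(1, 101))

# Fixed seed so the sketch is repeatable; real studies would not reuse a toy seed.
rng = random.Random(42)

# random.sample draws without replacement, so every subset of size 10
# has the same chance of being the selected sample.
sample = rng.sample(population, k=10)

print(sample)
```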
What is non-probability sampling?
Probability for each subject being selected is unknown and may reflect a selection bias.
Example: observational studies.
What is the hierarchy of study designs?
- RCT
- Quasi experimental
- Observational (cohort/case-control)
- Cross sectional
What are the sources of variation in data?
- Natural variation - Differences between people (experimental units) in the ‘true’ values of the variable of interest
- Measurement variation (error) - Variation due to measuring equipment or measuring technique
What are the sources of bias in data?
Bias is the difference between the true value and the average value of the actual data.
Example: Under-reporting of calories eaten in food diary
What is a variable?
A characteristic of interest in a study that can take different values for different subjects or objects (e.g., days).
What is a categorical variable?
Variables on a categorical scale are not measured numerically; instead, each subject is observed to belong to a category. The recorded value is then a particular category (label).
Examples: race, sex, physical activity.
Nominal - cannot be ordered
Ordinal - can be ordered
What is a numerical variable?
Variables with numerical scale are measured as numerical values. They are measurements or counts taken on each subject or object.
Discrete variable - consists of only fixed values and no fractions in between them (e.g., number of siblings)
Continuous variable - consists of any fraction value in a range of numbers. (e.g., grams of alcohol consumed)
What is the mode?
Mode is the most common value in a data set. It is the value that occurs the most frequently.
What is mean?
Mean is the sum of all the values divided by the number of values in the list. It always lies between the smallest and largest values. The sample mean is denoted x̄ (x bar). The mean is affected by extreme observations.
Population Mean: μ = (1/N) Σ xi
Sample Mean: x̄ = (1/n) Σ xi
Weighted mean example: X̅ = .5(10) + .3(20) + .2(15) = 5 + 6 + 3 = 14
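The weighted mean example above can be checked with a short sketch. The function names `mean` and `weighted_mean` are illustrative helpers, not part of the notes; the values and weights are the ones from the example.

```python
def mean(xs):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    """Weighted mean: each value counts in proportion to its weight."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


values = [10, 20, 15]
weights = [0.5, 0.3, 0.2]  # weights from the example; they sum to 1

plain = mean(values)                      # (10 + 20 + 15) / 3
weighted = weighted_mean(values, weights)  # about 14, matching 5 + 6 + 3

print(plain, weighted)
```

Dividing by `sum(ws)` makes the helper also work when the weights do not add up to 1.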
What is median?
Median is the middle number when the values are put in order. (see-saw point). The median is better when there are extreme observations.
How it is computed:
* Arrange data from smallest to largest
* Find the middle value if there is an odd number of data points
* Find the mean of the two middle values if there is an even number of data points
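The steps above can be sketched as a small function. The name `median` and the sample lists are illustrative assumptions; the last call shows how one extreme value barely moves the median.

```python
def median(xs):
    """Middle value of the sorted data (mean of the two middle values if even)."""
    if not xs:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(xs)          # step 1: arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: take the middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))        # 5
print(median([7, 1, 5, 3]))     # 4.0
print(median([1, 2, 3, 100]))   # 2.5, while the mean of this list is 26.5
```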
What is the range?
The largest value minus the smallest value in the data.
What is the variance?
Variance measures the spread of data around the mean.
Variance measures the spread in squared units.
Population Variance: σ² = (1/N) Σ (xi - μ)²
Sample Variance: s² = (1/(n-1)) Σ (xi - x̄)²
Population Variance:
First, find the mean (average) of the entire population.
Subtract the mean from each value to get the deviation of each value from the mean.
Square each of these deviations to remove any negative signs and to give more weight to larger deviations.
Sum all the squared deviations.
Divide this sum by the total number of values in the population. This gives you the average squared deviation from the mean, which is the population variance.
Sample Variance:
First, find the mean (average) of the sample.
Subtract the mean from each value in the sample to get the deviation of each value from the mean.
Square each of these deviations.
Sum all the squared deviations.
Divide this sum by one less than the number of values in the sample (n-1). Dividing by n-1 instead of n corrects for the fact that deviations are measured from the sample mean rather than the unknown population mean, which would otherwise make the result too small on average. The result is the sample variance.
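A minimal sketch of both procedures, assuming a small hypothetical data set (the numbers are not from the notes). The only difference between the two functions is the divisor: N for the population, n-1 for the sample.

```python
def population_variance(xs):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(data)   # 32 / 8
samp_var = sample_variance(data)      # 32 / 7, slightly larger

print(pop_var, samp_var)
```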
What is the standard deviation?
SD measures the spread of data around the mean.
It is the square root of the variance, so it is expressed in the same units as the data (not squared units).
Population Standard Deviation: σ = sqrt((1/N) Σ (xi - μ)²)
First, subtract the population mean (μ) from each value (xi) to find the deviation of each value from the mean.
Square each of these deviations.
Sum all the squared deviations.
Divide this sum by the total number of values in the population (N).
Take the square root of the result to get the population standard deviation.
Sample Standard Deviation: s = sqrt((1/(n-1)) Σ (xi - x̄)²)
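As a quick check of the two formulas, here is a sketch using Python's standard `statistics` module, where `pstdev` divides by N and `stdev` divides by n-1. The data set is a hypothetical example.

```python
import math
import statistics

# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_sd = statistics.pstdev(data)    # population SD: sqrt(32 / 8)
sample_sd = statistics.stdev(data)  # sample SD: sqrt(32 / 7)

# Each SD is the square root of the matching variance.
pop_sd_from_var = math.sqrt(statistics.pvariance(data))
sample_sd_from_var = math.sqrt(statistics.variance(data))

print(pop_sd, sample_sd)
```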
What are the properties of the SD?
- Large spread of data = large s (or σ)
- Small spread of data = small s (or σ)
- s (or σ) = 0 … no spread (All the data are the same)
- s (or σ) is never negative (always positive or zero)
- Units for s are the same as units for the data
- Measures how far the data tend to vary from the mean
- Provides, for a typical value, a likely “give or take” from the mean
What happens if you add a number (c) to each data point (a linear operation)?
New mean = old mean + c
New median = old median + c
New SD = old SD
What happens if you multiply each data point by c?
New mean = old mean x c
New median = old median x c
New SD = old SD x |c| (the SD is never negative, so the sign of c drops out)
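These shift and scale rules can be verified numerically. The sketch below assumes a hypothetical data set and c = 3, plus a negative multiplier to show that the SD scales by the absolute value of the multiplier.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
c = 3

shifted = [x + c for x in data]   # add c to every data point
scaled = [x * c for x in data]    # multiply every data point by c
flipped = [x * -2 for x in data]  # negative multiplier


def summary(xs):
    """Return (mean, median, sample SD) for a list of numbers."""
    return statistics.mean(xs), statistics.median(xs), statistics.stdev(xs)


base_mean, base_median, base_sd = summary(data)
shift_mean, shift_median, shift_sd = summary(shifted)
scale_mean, scale_median, scale_sd = summary(scaled)
flip_mean, flip_median, flip_sd = summary(flipped)

print(summary(data))
print(summary(shifted))  # mean and median move by c, SD unchanged
print(summary(scaled))   # mean, median, and SD all multiplied by c
print(summary(flipped))  # mean and median change sign, SD stays positive
```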
What are the properties of a histogram?
* Gives us a very good idea about the shape of the distribution
* Presents the measure of interest along the X-axis and the relative frequency (or frequency) on Y-axis
* Relative frequency should be used when two groups of subjects are being compared
* The area covered within a histogram represents 100% of the data
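To illustrate why relative frequency is preferred when comparing groups, here is a sketch with two hypothetical groups of different sizes. The helper `relative_frequencies` and the bin width of 5 are assumptions made for the example.

```python
def relative_frequencies(xs, bin_width):
    """Map each bin's left edge to the share of the data falling in that bin."""
    counts = {}
    for x in xs:
        left_edge = (x // bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    n = len(xs)
    return {edge: counts[edge] / n for edge in sorted(counts)}


# Hypothetical groups: B is twice as large as A but has the same shape.
group_a = [1, 2, 2, 3, 7, 8]
group_b = [1, 1, 2, 2, 3, 3, 4, 4, 6, 7, 8, 9]

rel_a = relative_frequencies(group_a, bin_width=5)
rel_b = relative_frequencies(group_b, bin_width=5)

# Raw counts differ (4 vs 8 in the first bin), but the shares match,
# and the shares in each group add up to 1 (100% of the data).
print(rel_a)
print(rel_b)
```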
What are the properties of a box plot?
Here are the key components and what you can see on a box plot:
- Median (Q2):
  - The line inside the box represents the median, which is the middle value of the dataset when the values are arranged in ascending order. It divides the dataset into two equal halves.
- Interquartile Range (IQR):
  - The box itself represents the interquartile range, which contains the middle 50% of the data. It extends from the first quartile (Q1) to the third quartile (Q3).
  - The first quartile (Q1) is the 25th percentile, meaning 25% of the data points are below this value.
  - The third quartile (Q3) is the 75th percentile, meaning 75% of the data points are below this value.
- Whiskers:
  - The “whiskers” extend from the edges of the box to the smallest and largest values within 1.5 times the IQR from Q1 and Q3, respectively. These whiskers represent the range of the bulk of the data.
  - The end of the lower whisker represents the smallest value within 1.5 times the IQR below Q1.
  - The end of the upper whisker represents the largest value within 1.5 times the IQR above Q3.
- Outliers:
  - Data points that fall outside the range of the whiskers are considered outliers and are typically represented as individual dots or small circles. These are values significantly higher or lower than the rest of the data.
- Minimum and Maximum:
  - The minimum value (excluding outliers) is at the end of the lower whisker.
  - The maximum value (excluding outliers) is at the end of the upper whisker.
What You Can See on a Box Plot:
- Central Tendency:
  - The median line inside the box shows where the center of the data lies.
- Spread and Variability:
  - The length of the box (IQR) indicates the spread of the middle 50% of the data.
  - The whiskers show the spread of the bulk of the data (excluding outliers).
- Skewness:
  - If the median is closer to the bottom or top of the box, it indicates skewness in the data.
  - If the whiskers are uneven in length, it suggests that the data is skewed to the left or right.
- Outliers:
  - Outliers are easily identifiable as individual points beyond the whiskers.
- Comparison Between Groups:
  - When comparing multiple box plots, you can quickly compare the central tendencies, variability, and spread of different datasets.
In summary, a box plot provides a visual summary of the distribution of a dataset, highlighting its central value, spread, skewness, and the presence of outliers.
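The numbers behind a box plot can be computed with a short sketch. It assumes a hypothetical data set and uses `statistics.quantiles` with `method="inclusive"`; quartile conventions differ between textbooks and software, so Q1 and Q3 may come out slightly different elsewhere.

```python
import statistics


def box_plot_summary(xs):
    """Quartiles, IQR, whisker ends, and outliers using the 1.5 x IQR rule."""
    ordered = sorted(xs)
    # One of several quartile conventions (assumption for this sketch).
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],    # smallest value within the lower fence
        "whisker_high": inside[-1],  # largest value within the upper fence
        "outliers": outliers,
    }


# Hypothetical data with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
box = box_plot_summary(data)
print(box)
```

In this example the value 30 lies above the upper fence, so it is reported as an outlier and the upper whisker ends at 9.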
What are the properties of bar charts?
1) Used for summarizing qualitative data
2) Bars usually do not touch each other (unlike in a histogram)
3) For each category we draw one bar
4) Height of the bar indicates percentage of data within given category or the number of data within given category
What is the definition of probability?
Probability expresses the chance that a certain event occurs in a given setting.
Assume that an experiment (a toss of a coin) can be repeated many times. The probability of a certain outcome (toss of a tail on a coin) is the proportion of trials in which that outcome occurs, i.e. the number of times it occurs divided by the total number of trials, as the number of trials grows large.
How do you calculate probability?
Probability of event A = P(A)
Frequentist (experimental, observed) probability: P(event) = # of times event occurs / # of trials
Theoretical probability: P(event) = size of the event subset / size of sample space (assuming all outcomes are equally likely)
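The two calculations can be compared in a small simulation. This sketch assumes a fair coin, a fixed random seed, and 10,000 simulated tosses; all of these are illustrative choices, not values from the notes.

```python
import random
from fractions import Fraction

sample_space = ["H", "T"]  # fair coin: both outcomes equally likely
event = {"T"}              # event of interest: tossing a tail

# Theoretical probability: size of the event subset / size of sample space.
theoretical = Fraction(len(event), len(sample_space))

# Frequentist probability: times the event occurs / number of trials.
rng = random.Random(0)     # fixed seed so the sketch is repeatable
trials = 10_000
hits = sum(1 for _ in range(trials) if rng.choice(sample_space) in event)
empirical = hits / trials

print(theoretical)  # 1/2
print(empirical)    # close to 0.5, but rarely exactly 0.5
```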
What is conditional probability?
Conditional probability measures the likelihood of one event occurring given that another event has already occurred.
For example, the probability of event A occurring given that event B has occurred is denoted as P(A|B). It’s calculated by dividing the probability of both events A and B occurring together (P(A and B)) by the probability of event B occurring (P(B)).
Example:
Imagine you have a deck of 52 cards. Suppose you want to know the probability of drawing a King given that the card drawn is a face card (Jack, Queen, or King).
- Define Events:
  - Event A: Drawing a King.
  - Event B: Drawing a face card.
- Calculate Probabilities:
  - The probability of drawing a face card (B) is 12 face cards out of 52: P(B) = 12/52 = 3/13.
  - The probability of drawing a King and it being a face card (A and B) is 4 Kings out of 52: P(A and B) = 4/52 = 1/13.
- Apply the Formula:
  - Divide the probability of drawing a King and a face card (4/52) by the probability of drawing a face card (12/52), which gives you 4/12 or 1/3.
So, the probability of drawing a King given that you have drawn a face card is 1/3 or about 33.33%.
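The card example can be reproduced by enumerating a standard 52-card deck. The helper names (`prob`, `is_king`, `is_face`) and the suit labels are illustrative; exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs


def prob(event):
    """Theoretical probability: cards satisfying the event / all cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_king(card):
    return card[0] == "K"


def is_face(card):
    return card[0] in {"J", "Q", "K"}


p_b = prob(is_face)                                   # P(B) = 12/52
p_a_and_b = prob(lambda c: is_king(c) and is_face(c))  # P(A and B) = 4/52
p_a_given_b = p_a_and_b / p_b                          # P(A|B) = P(A and B) / P(B)

print(p_a_given_b)  # 1/3
```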
What is sensitivity and how do you calculate it?
Definition: Sensitivity is the ratio of correctly identified positive cases to the total number of actual positive cases.
Calculation: You calculate sensitivity by dividing the number of true positives (correctly identified positives) by the sum of true positives and false negatives (missed positives).
Importance: Sensitivity is crucial when it’s important to catch as many positive cases as possible. For example, in medical testing, high sensitivity means fewer cases of a disease are missed.
What is specificity and how do you calculate it?
Definition: Specificity is the ratio of correctly identified negative cases to the total number of actual negative cases.
Calculation: You calculate specificity by dividing the number of true negatives (correctly identified negatives) by the sum of true negatives and false positives (incorrectly identified positives).
Importance: Specificity is important in situations where it’s crucial to accurately identify negative cases. For example, in medical testing, high specificity means that the test correctly identifies individuals who do not have a condition, reducing false positives.
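Both calculations can be sketched from the four cells of a test-result table. The counts below are hypothetical (100 people with the condition, 200 without) and serve only to show the arithmetic.

```python
def sensitivity(true_pos, false_neg):
    """True positives / all actual positives."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negatives / all actual negatives."""
    return true_neg / (true_neg + false_pos)


# Hypothetical test results (illustrative counts only).
tp, fn = 90, 10    # 100 people with the condition
tn, fp = 160, 40   # 200 people without the condition

sens = sensitivity(tp, fn)  # 90 / 100
spec = specificity(tn, fp)  # 160 / 200

print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```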
What is PPV?