Module 2 Flashcards
What is a case-control study?
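An observational study that compares people who have the outcome of interest (cases) with people who do not (controls), to see how their past exposures differ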
What is a retrospective case control study?
Researchers look back on how subjects behaved over time, comparing the case group and the control group
Draw the causal model (i.e. the “directed acyclic graph”). Include the mediator, and have arrows which show “what we already know”, “what we can prove”, and “what we want to know”
Refer to PCV 2.1
What is sampling variability/error?
When you draw a random and representative sample from a population, you will not get the exact same sample every time you draw it, so statistics calculated from the sample (like the mean) vary from sample to sample.
3 students draw a sample of 5 observations from the population. How many samples did each student draw from the population? What is the sample size used in the experiment?
1 sample with a sample size of 5
What is a sampling distribution?
The distribution of a statistic (e.g. the sample mean) across many samples of the same size. First you take several samples and take the mean of each sample. Then, you treat the sample means as the new data set and plot a histogram.
What is a sample distribution?
A sample distribution would be a histogram of the values within one sample
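A minimal Python sketch of the difference between the two, using only the standard library. The population (normal, mean 50, SD 10), the seed, and the number of repeats are all made up for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 values from a normal distribution
population = [random.gauss(50, 10) for _ in range(10_000)]

# Sample distribution: the values inside ONE sample of size n
n = 5
one_sample = random.sample(population, n)

# Sampling distribution: the means of MANY samples of size n
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]

print("values in one sample:", [round(x, 1) for x in one_sample])
print("number of sample means:", len(sample_means))
print("mean of the sample means:", round(statistics.mean(sample_means), 2))
```

A histogram of one_sample would be the sample distribution; a histogram of sample_means would approximate the sampling distribution.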
As you increase the sample size, what happens to the standard deviation, graph shape, and mean of the sampling distribution graph for a NORMAL distribution?
The graph tightens, variability decreases, standard deviation also decreases. The shape of the graph does not change.
The mean of the means does not change with an increased sample size.
What is n?
The sample size
As you increase the sample size, what happens to the standard deviation, graph shape, and mean of the sampling distribution graph for a UNIFORM distribution?
Mean of the means is unchanged
The standard deviation gets smaller
The shape of the sampling distribution becomes more and more normal
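A rough simulation of the first two points, assuming a Uniform(0, 1) population (true mean 0.5) and arbitrary sample sizes of 2, 10 and 50. It only checks the centre and spread; seeing the shape become more normal would need a histogram.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def sample_means(n, reps=3_000):
    """Means of `reps` samples of size n drawn from Uniform(0, 1)."""
    return [statistics.mean([random.random() for _ in range(n)]) for _ in range(reps)]

results = {}
for n in (2, 10, 50):
    means = sample_means(n)
    # (mean of the means, standard deviation of the means)
    results[n] = (statistics.mean(means), statistics.stdev(means))
    print(n, round(results[n][0], 3), round(results[n][1], 3))
```

The first number printed for each n should stay close to 0.5, while the second should shrink as n grows.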
What are some synonyms for a normal graph?
symmetric
Gaussian
bell-shaped
What is the central limit theorem?
When n is “sufficiently” large, the sampling distribution for a particular statistic (e.g. the sample mean) will tend towards a normal distribution even if the underlying population distribution is not Gaussian.
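One way to see this numerically is to track the skewness of the sample means as n grows. The sketch below assumes an Exponential(1) population (strongly right-skewed) and uses a simple moment-based skewness helper written for this example. A value near 0 means the distribution is close to symmetric.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

def skewness(values):
    """Moment-based skewness: about 0 for a symmetric distribution."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def means_of_samples(n, reps=4_000):
    """Means of `reps` samples of size n from a right-skewed Exponential(1) population."""
    return [statistics.mean([random.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

skew_by_n = {n: skewness(means_of_samples(n)) for n in (1, 5, 50)}
for n, skew in skew_by_n.items():
    print(n, round(skew, 2))
```

With n = 1 the "sample means" are just raw observations, so the skew is large; it should move towards 0 as n increases.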
What is the normal distribution? What are the two parameters for a normal distribution? If a random variable X is distributed normally, then we denote it as ….
The normal distribution is a continuous probability distribution for a real-valued random variable. The two parameters for a normal distribution are the mean (mu) and the variance (sigma squared).
X ~ N(mu, sigma squared) (notation also in notes)
How does the normal distribution differ from the binomial distribution?
The binomial distribution is a discrete probability distribution. This means that the values that our random variable X can take on are clearly delineated integers, like the number of heads in five coin flips. X cannot be a fraction like three and a half heads.
In comparison, the values that the random variable X can take on in a normal distribution can include fractions of a whole unit.
In a normal distribution, what do the mean and variance tell us about the graph?
- The mean tells you about the location of the distribution. The shape would remain the same if only the mean of a normal distribution is changed
- The variance tells us about how widely distributed the values are. The centre of the distributions would be the same but the peak of the graph and the width of the graph would change.
The probability of observing random, normally-distributed values within a given range is equal to
the associated area under the curve (AUC)
This works by considering intervals of potential values for X as defined on the x-axis of the plot, then calculating the proportion of the total area under the curve that falls within that interval. This gives us the probability of observing a random normal variable from this population with a value in that range.
Can you calculate the probability for a single value of X in a normal distribution?
No. When working with continuous distributions like the normal distribution, our probability will always be anchored to an interval; the probability of any single exact value is 0.
If a random variable X is distributed normally, what do we know about the standard deviations?
- There is an approximately 68% chance X falls within one standard deviation of the mean
- There is an approximately 95% chance X falls within two standard deviations of the mean
- There is an approximately 99.7% chance X falls within three standard deviations of the mean
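These three figures can be checked with Python's statistics.NormalDist. The mean of 100 and SD of 15 below are arbitrary; the same percentages come out for any normal distribution.

```python
from statistics import NormalDist

X = NormalDist(mu=100, sigma=15)  # hypothetical mean and standard deviation

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma), computed from the CDF."""
    return X.cdf(X.mean + k * X.stdev) - X.cdf(X.mean - k * X.stdev)

for k in (1, 2, 3):
    print(k, "SD:", round(prob_within(k), 4))
```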
The value you get for the probability density function is not a probability in and of itself. What do you have to do to calculate the actual probability?
Use calculus to integrate the PDF across the desired range and calculate the AUC, which tells you the probability that X falls in the range.
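A small numerical sketch of that integration, assuming a standard normal (mu = 0, sigma = 1) and an arbitrary range of -1 to 2. The trapezoid rule stands in for the calculus, and the result is compared with the difference of CDF values.

```python
from statistics import NormalDist

X = NormalDist(mu=0, sigma=1)  # standard normal, chosen for illustration

def area_under_pdf(a, b, steps=10_000):
    """Approximate the integral of the PDF from a to b with the trapezoid rule."""
    width = (b - a) / steps
    total = 0.5 * (X.pdf(a) + X.pdf(b))
    for i in range(1, steps):
        total += X.pdf(a + i * width)
    return total * width

approx = area_under_pdf(-1.0, 2.0)
exact = X.cdf(2.0) - X.cdf(-1.0)
print("PDF height at 0 (a density, not a probability):", X.pdf(0.0))
print("P(-1 < X < 2) by integration:", approx)
print("P(-1 < X < 2) from the CDF:  ", exact)
print("area over the single point x = 1:", area_under_pdf(1.0, 1.0))
```

The last line illustrates why a single value has probability 0: an interval of zero width has zero area.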
What is a transformation?
Transformations are just functions that map a value in one space to a value in a second space. Usually we can identify functions to “back-transform” the new data to the original space (useful transformations will allow us to do this).
What does the log transformation look like (i.e. the calculations)? What are log transformations useful for?
To go from original data to log transformed data: take the log (base 10) of the x value you are trying to transform, keep the y value the same. To back-transform, raise 10 to the power of the (new) x value.
Log transformations are useful for mapping right-skewed distributions into more normal distributions in the transformed space.
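The transform and back-transform can be sketched in a few lines, assuming base-10 logs and a made-up set of right-skewed values:

```python
import math

raw = [1, 10, 25, 100, 1000]  # hypothetical right-skewed values

logged = [math.log10(x) for x in raw]  # transform: take log base 10
back = [10 ** v for v in logged]       # back-transform: 10 to the power of the logged value

print("logged:", [round(v, 3) for v in logged])
print("back-transformed:", [round(v, 3) for v in back])
```

The back-transformed values match the originals (up to floating-point rounding), which is what makes the transformation useful.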
What is the formula for calculating the 95% confidence interval for a population mean? Define each variable.
Refer to notes page
Is the confidence interval random?
Yes. Our X bar could be different depending on the specific random sample we take, so the interval changes from sample to sample. Moreover, any one confidence interval will either cover or not cover the true mean.
What is the coverage probability?
The fraction of confidence intervals, calculated from repeated samples drawn from the population, that cover the true population mean
What is the difference between mu and X bar?
mu is the population mean
X bar is the sample mean
What is one reason why a CI might have a coverage probability lower than 95%? What are two ways this can be fixed?
If the population data is non-Gaussian (i.e. very skewed), then our confidence interval is not going to have 95% coverage probability. We can fix this by increasing the sample size. You can also use a log transformation to make the data more Gaussian, even with a small sample size.
How well the data conforms to the stated coverage probability depends on the ________(1)
(1) shape of the population distribution and the sample size
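A simulation sketch of coverage probability, assuming intervals of the form "sample mean +/- 2 standard errors". The populations (a standard normal and a lognormal with log-scale SD 1.5), the sample sizes and the seed are all chosen just for illustration.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

def coverage(draw, true_mean, n, reps=2_000):
    """Fraction of 'mean +/- 2*SE' intervals that contain true_mean."""
    hits = 0
    for _ in range(reps):
        sample = [draw() for _ in range(n)]
        xbar = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        if xbar - 2 * se <= true_mean <= xbar + 2 * se:
            hits += 1
    return hits / reps

# The mean of a lognormal with log-scale mean 0 and SD 1.5 is exp(1.5**2 / 2)
skewed_mean = math.exp(1.125)

normal_small = coverage(lambda: random.gauss(0, 1), 0.0, n=10)
skewed_small = coverage(lambda: random.lognormvariate(0, 1.5), skewed_mean, n=10)
skewed_large = coverage(lambda: random.lognormvariate(0, 1.5), skewed_mean, n=200)

print("normal population, n=10: ", normal_small)
print("skewed population, n=10: ", skewed_small)
print("skewed population, n=200:", skewed_large)
```

The skewed population with a small n should fall well short of the nominal coverage, and a larger n should pull it back up.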
The standard deviation of the sampling distribution of X bar is known as ______(1). What is the formula for this?
the standard error of X bar
formula in notes
What are the three differences between a distribution of gaussian observations vs a distribution of sample means?
Distribution of Gaussian observations:
- Made up of individual observations from the population
- Centered at population mean mu
- Variability quantified by standard deviation
Distribution of sample means
- Made up of sample means calculated from infinitely many samples of size n from the population
- Centered at population mean mu
- Variability quantified by standard error
When calculating confidence intervals, we assume that the true value of mu is equal to
X bar (that is, X bar is used as our best estimate of mu, so the interval is centered on X bar)
How do you calculate the margin of error? How do you calculate the width of the 95% confidence interval?
Refer to notes
As n increases, the width of the 95% confidence interval goes _____(1).
(1) down
What effect would lowering the standard deviation have on the standard error, the margin of error, and the width of the confidence interval?
It would lower all of them
What is the formula for calculating sample variance?
Refer to notes sheet
What are the three rules that are relevant to log transformed data?
- The mean of the logged data is almost equal to the median of the logged data (when the logged data is roughly symmetric)
- The log of the median (of the regular data) is equal to the median of the logged data. This is because the median is an observable value in the dataset
- The log of the mean (of the regular data) is NOT equal to the mean of the logged data.
When you take 10 raised to the power of the mean of the log data, what do you get?
the median of the raw data
You just calculated the confidence interval for the mean of the data in log dollars. If you exponentiate these values, what would the new interval tell you?
The confidence interval for the median of the raw data
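A tiny worked check of the log rules, assuming base-10 logs and a made-up dataset that is exactly symmetric once logged (so the "almost equal" becomes exactly equal here):

```python
import math
import statistics

raw = [1, 10, 100, 1000, 10000]        # hypothetical raw values, median 100
logged = [math.log10(x) for x in raw]  # symmetric around 2 after logging

median_raw = statistics.median(raw)
mean_raw = statistics.mean(raw)
mean_log = statistics.mean(logged)
median_log = statistics.median(logged)

print("mean vs median of logged data:    ", mean_log, median_log)
print("log(median raw) vs median of logs:", math.log10(median_raw), median_log)
print("log(mean raw) vs mean of logs:    ", math.log10(mean_raw), mean_log)
print("10 ** mean of logs vs median raw: ", 10 ** mean_log, median_raw)
```

With real data the logged values are only roughly symmetric, so 10 raised to the mean of the logs lands near the raw median rather than exactly on it.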
What does stratified mean?
Stratification is the process of dividing members of the population into homogeneous subgroups before sampling.
What is a histogram?
A graph with bars that count the values in the dataset falling in each bin, giving us an overview of the data distribution. It tells us the skewness of our data
What are the drawbacks of a histogram?
- Hard to see the centre of distribution to compare “typical” values
- Hard to plot the two distributions together on the same graph
What is a density plot? What is the drawback of a density plot?
Similar to a histogram, but you have a smooth line instead of bars. The height of the line represents how concentrated the points are around that particular x value in our dataset. It tells us the skewness of our data
It is hard to see the centre of distribution to compare typical values
The standard deviation of the sampling distribution is called the standard error. How do you calculate the standard error?
sigma/sqrt(n)
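As a quick sketch, the formula in Python with made-up values for sigma and n:

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)

print(standard_error(10, 25))   # sigma = 10, n = 25
print(standard_error(10, 100))  # four times the sample size halves the standard error
print(standard_error(5, 25))    # halving sigma also halves the standard error
```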
Write the formula to calculate variance.
On notes
As n increases, the width of the CI goes
down
What is a Q-Q plot?
A method for comparing the location (center), spread and shape of one distribution against another.
- Sample vs. theoretical distribution: Does the shape of my data look Gaussian?
- Sample vs. sample: Do these two data sets have the same distribution?
What are the benefits of boxplots?
Can be used to compare 2 different sets of data. You can see the center and spread, but it is somewhat harder to see the complete shape of the distribution.
How do you calculate the width of the 95% confidence interval? How do you calculate the margin of error?
Refer to notes
What is the general form for calculating the 95% confidence interval for one population mean?
Estimate +/- 2(SE)
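A short sketch of this general form applied to a sample mean. The data values are made up, and it assumes the margin of error is the 2(SE) term, so the width is twice the margin; check this against the formulas on the notes page.

```python
import math
import statistics

sample = [12.1, 9.8, 11.4, 10.3, 10.9, 9.5, 11.8, 10.2]  # hypothetical data

n = len(sample)
xbar = statistics.mean(sample)                # the estimate
se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
margin = 2 * se                               # assumed margin of error: 2 * SE
ci = (xbar - margin, xbar + margin)
width = ci[1] - ci[0]

print("estimate:", round(xbar, 3))
print("95% CI:", tuple(round(v, 3) for v in ci))
print("width:", round(width, 3))
```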
How do you calculate the 95% confidence interval for a difference in population means? Include the two ways of calculating the standard error of the difference in means
Refer to notes
What do you get if take 10 raised to the power of the difference in means ON THE LOG SCALE?
The ratio of the medians in real dollars!
i.e. how many times the median for one is greater than the median for the other
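A small worked example, assuming base-10 logs and two made-up groups whose logged values are exactly symmetric (so the match with the ratio of medians is exact here and only approximate with real data):

```python
import math
import statistics

group_a = [10, 100, 1000]  # hypothetical dollars, median 100
group_b = [1, 10, 100]     # hypothetical dollars, median 10

mean_log_a = statistics.mean([math.log10(x) for x in group_a])
mean_log_b = statistics.mean([math.log10(x) for x in group_b])

# Back-transform the difference in means on the log scale
ratio = 10 ** (mean_log_a - mean_log_b)
ratio_of_medians = statistics.median(group_a) / statistics.median(group_b)

print("10 ** (difference in log means):", ratio)
print("ratio of raw medians:           ", ratio_of_medians)
```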
What is paired data?
Data where every observation in one group has a corresponding observation in the other. The pairs are based on similar characteristics between individuals in the group, or you can also pair a person with themselves
When given paired data, how do you calculate the confidence interval?
You calculate the mean of the paired differences
What is independent data(unpaired)?
Different individuals in each group, independent of one another. There is a group with (e.g. the exposure or treatment) and there is a group without
When given unpaired data, how do you calculate the confidence interval?
Use the difference in population means (i.e. the longer version of the confidence interval)
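A sketch contrasting the two calculations on the same made-up numbers. For the unpaired case it assumes the unpooled standard error, sqrt(s1^2/n1 + s2^2/n2), which may or may not be the version on the notes page.

```python
import math
import statistics

before = [140, 152, 138, 145, 160, 155]  # hypothetical measurements
after = [135, 150, 130, 140, 158, 149]   # same people, measured again

# Paired: work with the within-person differences
diffs = [b - a for b, a in zip(before, after)]
mean_diff = statistics.mean(diffs)
se_paired = statistics.stdev(diffs) / math.sqrt(len(diffs))

# Unpaired: treat the two lists as independent groups (unpooled SE, an assumption)
diff_in_means = statistics.mean(before) - statistics.mean(after)
se_unpaired = math.sqrt(statistics.variance(before) / len(before)
                        + statistics.variance(after) / len(after))

print("paired:  ", round(mean_diff, 3), "+/-", round(2 * se_paired, 3))
print("unpaired:", round(diff_in_means, 3), "+/-", round(2 * se_unpaired, 3))
```

The point estimate is the same either way, but treating truly paired data as unpaired gives a much wider interval here.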
What would the null value be for the mean paired difference in a paired t-test?
0
What would the null value be for the difference in means in an unpaired t-test?
0
What are the 6 steps involved in hypothesis testing?
- Specify a precise null hypothesis about the population
- Specify the outcome variable
- Specify the Type I error rate (a), also called the significance level. It is usually 0.05 or 5%
- Choose an estimator from your data that is relevant to the hypothesis. This could be the mean for a continuous outcome
- Calculate the (1-a) x 100% confidence interval
- Reject (if the null value is not in the interval) or fail to reject the null hypothesis
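The steps above can be sketched as a small function. It assumes a = 0.05 and approximates the 95% interval as "mean +/- 2 SE"; the data and the null values are made up for illustration.

```python
import math
import statistics

def ci_test(sample, null_value, multiplier=2):
    """Return (interval, decision) for H0: population mean == null_value."""
    n = len(sample)
    xbar = statistics.mean(sample)                # estimator: the sample mean
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    interval = (xbar - multiplier * se, xbar + multiplier * se)
    inside = interval[0] <= null_value <= interval[1]
    return interval, ("fail to reject" if inside else "reject")

sample = [12.1, 9.8, 11.4, 10.3, 10.9, 9.5, 11.8, 10.2]  # hypothetical data

print(ci_test(sample, null_value=9.5))
print(ci_test(sample, null_value=11.0))
```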
How do you know a fake coin sequence?
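It will usually lack long streaks of H’s or T’s. People faking randomness switch between H and T too often, while truly random sequences contain longer streaks than we expect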
What is the p-value? How do we know when or when not to reject the null hypothesis?
The p-value is the probability, when the null hypothesis is true, of observing a test statistic as or more extreme than what occurred in the sample.
If the p-value is less than the alpha value, then reject the null. Otherwise, fail to reject it.
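A concrete sketch of "as or more extreme", using an exact two-sided p-value for coin flips under the null hypothesis of a fair coin. The 16 heads in 20 flips and alpha = 0.05 are made-up numbers, and this binomial test is just one simple example of a p-value, not the t-test from the course.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """P(a result at least as far from flips/2 as `heads`) for a fair coin."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    total = 0.0
    for k in range(flips + 1):
        if abs(k - expected) >= observed_distance:
            total += comb(flips, k) / 2 ** flips  # binomial probability of k heads
    return total

alpha = 0.05
p = two_sided_p_value(16, 20)
print("p-value:", round(p, 4))
print("decision:", "reject the null" if p < alpha else "fail to reject the null")
```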
What is a type I error?
Rejecting the null hypothesis when it is true
What is a type II error?
Failing to reject the null hypothesis when it is false
Are hypothesis tests constructed to control the rate of type I errors or type II errors?
Type I Errors
How can you minimize type II error?
Increase the size of the sample
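A simulation sketch of this effect. It assumes a normal population with true mean 0.5 and SD 1, a (false) null value of 0, and the "mean +/- 2 SE" interval as the test; all of these numbers are made up for illustration.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the sketch is repeatable

def type_two_rate(n, true_mean=0.5, null_value=0.0, reps=1_000):
    """Fraction of samples where the false null is NOT rejected."""
    misses = 0
    for _ in range(reps):
        sample = [random.gauss(true_mean, 1.0) for _ in range(n)]
        xbar = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        if xbar - 2 * se <= null_value <= xbar + 2 * se:
            misses += 1  # null value is inside the interval: fail to reject
    return misses / reps

rate_small = type_two_rate(10)
rate_large = type_two_rate(50)
print("type II error rate, n=10:", rate_small)
print("type II error rate, n=50:", rate_large)
```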