Ch.6 Field Flashcards
Population and Samples
Population and Samples
What are some points on samples you have to remember?
- The form of our model is the same for all samples and for the population (the model we get from one sample has the same form as the models from all the other samples and as the population model)
- Parameter estimates vary across samples -> a sample’s parameter estimates won’t match the true population values exactly
- The spread of scores around the population model is consistent -> the lines representing the limits are parallel to the model itself. In other words, at all values of the IV the spread of scores around the model is assumed to be similar (See image 1)
Population and Samples
What is error?
The difference between the value of the DV predicted by the model at a certain value of the IV and the observed value of the DV at that same value of the IV (e.g. if the model predicts 7 and we observe 9, the error is 2)
error = observed - model
- When referring to error in sample models, we use e
- When referring to error in population models, we use ε
!!! e and ε are the same concept (error = observed - model), they’re just used in different circumstances !!!
(See image 2; notice the difference in hats as well: with a sample model we want to estimate the population parameters (we can’t get the population values from our data), so we add hats. With the population model our data give us the actual numbers in the population, so we don’t estimate parameters and there are no hats. Also note the difference between e and ε as mentioned above)
Population and Samples
What are some general notes on error?
- Most errors in prediction will be close to 0 (this follows from the next two points; remember it as a note in case it comes up in a multiple-choice question)
- As the magnitude of errors increases, their frequency decreases.
~ !!! The opposite isn’t necessarily the case. Remember it in this direction only !!!
- The distribution of the errors is also normal, with a mean of 0 and a variance of σ^2
Errors vs Residuals
Errors vs Residuals
What is a Residual?
Residual = observed value - value predicted by the model = error, BUT SPECIFICALLY FOR A SAMPLE MODEL (simply put: a residual is the error for a sample model)
- Since we use a sample model to estimate the population model, the residuals from the sample model are likely a good approximation of the population errors
- If we plot the distribution of all the residuals, it’s normal with a mean of 0
Errors vs Residuals
What do we use Residuals for?
We use residuals to infer properties of the errors in the population model
Errors vs Residuals
What is the equation for Total error?
Total error = the sum of squared errors: Σ(observed_i - model_i)^2 (we square the errors so that positive and negative errors don’t cancel out). See image 3
Errors vs Residuals
What is the ordinary least squares (OLS) regression?
It’s a method that estimates the parameters (b-values) for which the total error is at its minimum (i.e. a method to minimize the total squared error; see the sketch after the next card)
Errors vs Residuals
How do you estimate the variance of the model errors?
Estimate of the error variance: s^2 = SS/(N - p), i.e. the sum of squared residuals divided by the degrees of freedom, where p is the number of parameters in the model. See image 4
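A minimal numpy sketch (the data values here are invented) tying the last three cards together: estimate the b-values by least squares, compute the total error as the sum of squared residuals, and estimate the error variance as s^2 = SS/(N - p):

```python
import numpy as np

# Invented data: one IV (x) and one DV (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Design matrix with an intercept column; OLS via least squares
X = np.column_stack([np.ones_like(x), x])
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # [b0, b1] minimizing total error

residuals = y - X @ b_hat                # e = observed - model
ss = np.sum(residuals ** 2)              # total error: sum of squared residuals
n, p = len(y), X.shape[1]                # p = number of parameters (2 here)
s2 = ss / (n - p)                        # estimated variance of the model errors

print(f"b0 = {b_hat[0]:.3f}, b1 = {b_hat[1]:.3f}, SS = {ss:.3f}, s^2 = {s2:.3f}")
```

Any other pair of b-values would give a larger SS; that is exactly what “least squares” means.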
Confidence Intervals and Significance Testing
CI & Significance Testing
General notes on Sampling Distribution
- It’s the distribution of parameter estimates across samples
- The width reflects the variability due to sampling error
~ The width is measured by the standard deviation -> in a sampling distribution the sd is called the standard error (SE)
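A quick simulation sketch of a sampling distribution (population parameters and sample size invented): fit the same model to many samples drawn from one population and look at how the slope estimates vary; the sd of those estimates is the standard error:

```python
import numpy as np

rng = np.random.default_rng(42)
b0, b1, sigma = 1.0, 0.5, 2.0     # hypothetical population parameters
n, n_samples = 30, 5000

slopes = []
for _ in range(n_samples):
    x = rng.uniform(0, 10, n)
    y = b0 + b1 * x + rng.normal(0, sigma, n)  # population model plus error
    slopes.append(np.polyfit(x, y, 1)[0])      # slope estimate for this sample

# The sd of the estimates across samples is the standard error of the slope
print(f"mean of slope estimates: {np.mean(slopes):.3f} (true b1 = {b1})")
print(f"sd of slope estimates (the SE): {np.std(slopes):.3f}")
```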
CI & Significance Testing
What is the general equation for any test-statistic?
effect/error
This also equals: size of parameter / sampling variation in the parameter
- Sampling variation in the parameter is measured by the standard error: how much the parameter estimate differs from sample to sample
CI & Significance Testing
What is the Central Limit Theorem?
!!! As the sample size increases, the sampling distribution of b^ approaches a normal distribution regardless of how the model errors are distributed (and if the model errors are normally distributed, the sampling distribution of b^ is normal even in small samples).
Therefore, we can estimate the SE of b^ and construct the CI and the test statistic
CI & Significance Testing
What is true about the relationship between sample size and sampling distribution?
As sample size increases, the sampling distribution approximates a normal distribution with a mean equal to the population mean and a variance equal to σ^2/n
(specifically, when model errors are normally distributed, the sampling distribution for b^ is normal)
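A hedged sketch of the CLT at work (the exponential population and the sizes are arbitrary choices): even with a heavily skewed population, the sample means pile up around the population mean with variance close to σ^2/n:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=1_000_000)  # skewed, clearly not normal
mu, sigma2 = population.mean(), population.var()

for n in (5, 30, 200):
    means = np.array([rng.choice(population, size=n).mean() for _ in range(3000)])
    print(f"n={n:>3}: mean of means = {means.mean():.3f} (mu = {mu:.3f}), "
          f"var of means = {means.var():.4f} (sigma^2/n = {sigma2 / n:.4f})")
```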
CI & Significance Testing
Based on the above flashcard, what are the steps for conducting NHST and constructing a CI?
- When the sampling distribution for b^ is normal, we can use s^2 (the estimated error variance) to estimate the SE of b^
- The sampling distribution of this variance estimate is a χ^2 distribution with n - p degrees of freedom
- Knowing the estimate of SE(b^) allows us to construct a CI and a hypothesis test (see the sketch below)
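A sketch of these steps for a simple regression slope (invented data; scipy’s t-distribution with n - p degrees of freedom is used for the critical value): estimate b^, its SE, the test statistic, and a 95% CI:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7])

X = np.column_stack([np.ones_like(x), x])
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

n, p = len(y), X.shape[1]
residuals = y - X @ b_hat
s2 = residuals @ residuals / (n - p)              # estimated error variance
se_b1 = np.sqrt(s2 / np.sum((x - x.mean())**2))   # SE of the slope estimate

t_stat = b_hat[1] / se_b1                 # effect / error, testing H0: b1 = 0
t_crit = stats.t.ppf(0.975, df=n - p)     # 95% CI uses the t-distribution
ci = (b_hat[1] - t_crit * se_b1, b_hat[1] + t_crit * se_b1)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)

print(f"b1 = {b_hat[1]:.3f}, SE = {se_b1:.3f}, t = {t_stat:.2f}, "
      f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), p = {p_value:.4f}")
```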
CI & Significance Testing
What is the Gauss-Markov Theorem?
When certain conditions are met, OLS is the best (lowest-variance) linear unbiased way to estimate the parameters. The conditions that need to be met are:
- Model errors are on average 0
- Homoscedasticity
- Independence of observations
Last two conditions are called spherical errors. See image 5
Bias
Bias
What is an unbiased estimator?
An estimator whose expected value equals the quantity it is trying to estimate (in other words: on average, the estimates across samples match the population value)
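A small simulation sketch of (un)biasedness, using the classic example (all numbers arbitrary): the variance estimator that divides by n underestimates σ^2 on average, while dividing by n - 1 is unbiased:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2_true = 4.0          # hypothetical population variance
n, n_samples = 10, 20000

biased, unbiased = [], []
for _ in range(n_samples):
    sample = rng.normal(0, np.sqrt(sigma2_true), n)
    biased.append(sample.var(ddof=0))    # divide by n      -> biased
    unbiased.append(sample.var(ddof=1))  # divide by n - 1  -> unbiased

print(f"true variance: {sigma2_true}")
print(f"average biased estimate:   {np.mean(biased):.3f}")    # ~3.6, too low
print(f"average unbiased estimate: {np.mean(unbiased):.3f}")  # ~4.0
```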
Bias
What is a consistent estimator?
An estimator that produces estimates which tend to the population value as the sample size increases
Bias
What is an efficient estimator?
An estimator that produces estimates that are in a way “the best” of the available methods of estimation
(best = lowest variance: the estimates are distributed more tightly around the population value)
Bias
What is the optimal estimate for any data set?
The mean (it’s the single value with the smallest sum of squared errors for the data; see the sketch below)
(Though if the data set contains an extreme score, the mean is pulled toward it, e.g. up and to the right for a large positive outlier)
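A tiny sketch (arbitrary scores) showing the sense in which the mean is optimal: of all single-value models, the mean gives the smallest sum of squared errors:

```python
import numpy as np

data = np.array([2.0, 4.0, 5.0, 7.0, 12.0])  # arbitrary scores

def ss(model_value):
    """Total error if a single value is used as the model for every score."""
    return np.sum((data - model_value) ** 2)

for candidate in (data.mean(), np.median(data), 4.0):
    print(f"model = {candidate:.1f} -> SS = {ss(candidate):.2f}")
# The mean (6.0) yields the lowest SS of any single-value model.
```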
Bias
What are outliers and why are they problematic?
Data points that differ markedly from the rest of the data
- They bias parameter estimates
- They greatly increase SSR (the sum of squared residuals)
If SSR is affected by outliers, the following happens:
1. SSR is biased
2. SD is biased
3. SE is biased
4. CI and test-statistic are biased
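A quick sketch (fabricated scores) of the bias chain above: one extreme score shifts the mean and inflates the SSR, sd, and SE:

```python
import numpy as np

clean = np.array([4.0, 5.0, 5.0, 6.0, 6.0, 5.5, 4.5, 5.5])
with_outlier = np.append(clean, 25.0)   # one extreme score

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    mean = data.mean()
    ss_r = np.sum((data - mean) ** 2)   # sum of squared residuals around the mean
    sd = data.std(ddof=1)
    se = sd / np.sqrt(len(data))        # SE of the mean
    print(f"{label:>12}: mean={mean:.2f}, SSR={ss_r:.1f}, sd={sd:.2f}, SE={se:.2f}")
```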
Bias
What should we do with outliers?
Keep them, unless you know they’re not representative of the population
Assumptions
Assumptions
What are Assumptions?
Conditions that ensure that what we’re attempting to do works as it should
Assumptions
What is the most important assumption?
Linearity and additivity: the process we’re trying to describe can be described by a linear model.
Even if all the other assumptions are met (next flashcard), if this one is violated the model is invalid, because our description of the process is wrong
Assumptions
What are the other general assumptions?
- Expected value of errors is 0
- Spherical errors
~ Homoscedasticity
~ Independence of errors
- Assumption of normality
- (No outliers) (not really considered an assumption by many, but it could still be thought of as one)
(See image 6, which summarizes some of the things we might want from a model and the assumptions required for them)
Assumptions
Notes on Homoscedasticity
- Homoscedasticity applies to the population errors, not your sample data. BUT if the sample residuals show homogeneity of variance, the population errors probably do too
- If violated, the SE, CI, and significance test associated with a parameter will be inaccurate
~ Even then, the method of least squares still gives us an unbiased estimate of the parameter; the CI and SE are still inaccurate, though
(See image 7 as well for another note)
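One common informal check, sketched here with invented heteroscedastic data: plot the sample residuals against the model’s fitted values and look for a constant vertical spread (a funnel shape suggests the assumption is violated):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.5 * x + rng.normal(0, 1 + 0.4 * x)  # error spread grows with x

b1, b0 = np.polyfit(x, y, 1)    # polyfit returns [slope, intercept]
fitted = b0 + b1 * x
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="black")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Residuals vs fitted: look for a funnel shape")
plt.show()
```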
Assumptions
Notes on Independence of Errors
If violated, same consequences as if Homoscedasticity was violated
Assumptions
Notes on Normality
(In general, the least damage if violated)
For the CI, SE and test statistic coming from a parameter to be accurate, the parameter estimate must come from a normal sampling distribution
- If sample residuals are normal -> Population error is normal -> Sampling distribution is normal
- In large samples, because the sampling distribution of the parameter will be normal (central limit theorem), this assumption can be ignored (it’ll hold either way)
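A short sketch of checking sample residuals for normality (the residuals here are simulated), using scipy’s Shapiro-Wilk test, one common choice among several:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
residuals = rng.normal(0, 1, 100)   # stand-in for a model's sample residuals

w, p = stats.shapiro(residuals)     # H0: the residuals are normally distributed
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")
# A small p-value suggests non-normal residuals; with large n, rely on the CLT.
```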
Assumptions
What is bootstrapping?
A robust method that tests use when normality is violated.
Lack of normality prevents us from inferring the shape of the sampling distribution unless we have big samples. Bootstrapping gets around this problem by estimating the properties of the sampling distribution empirically from the sample data.
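A minimal bootstrap sketch (invented sample data): resample the observed scores with replacement many times, recompute the statistic each time, and read a percentile CI straight off the resulting empirical distribution:

```python
import numpy as np

rng = np.random.default_rng(11)
sample = np.array([3.1, 4.7, 2.8, 5.9, 4.2, 3.3, 6.1, 4.8, 3.9, 5.2])

n_boot = 10_000
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(n_boot)
])

# Percentile bootstrap: 95% CI from the empirical sampling distribution
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, bootstrap 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```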