Chapter 6: The Beast of Bias Flashcards
Sources of bias
- outliers
- violations of assumptions (additivity/linearity, normality, homogeneity/homoscedasticity, independence)
things that can be affected by bias
- parameter estimates (including effect sizes)
- standard errors and CIs
- test statistics and p-values
- conclusions
there are methods of reducing bias
linear model and parameters
we can use the linear model to test theories or for prediction. in both cases, our interest is in estimating parameters
estimators
- estimation is the process of estimating parameters from sample data
- an estimator is a procedure, rule, or criterion that is used to estimate the parameters
- the results of estimation are estimates of the parameters
- estimates can fall below or above the actual parameter value. a value above is called an overestimate, and a value below is an underestimate.
- in practice, we never know whether our estimates are above or below the parameter
qualities that make a good estimator
- unbiasedness: on average, it gives you the population parameter; its sampling distribution doesn't lean to one side or the other
- consistency: as the sample gets bigger, the estimates become more precise
- efficiency: the estimates aren't too spread out (little sampling error). the mean is the most efficient, the median somewhat efficient, and the mode inefficient
a biased estimator is sometimes the preferred option; if the estimator is consistent, its bias can be overcome with a bigger sample size
bias does not mean bad. a biased estimator is a method whose estimates, on average, do not equal the parameter
bias is a property of an estimator, not an estimate
estimators and mean, median, mode
- on average, the mean gives you estimates that match the population parameter, so it is unbiased. the expected value of the sample means is the parameter
- the median is unbiased as long as the population is normally distributed
- the mode is unbiased as long as the population is normally distributed
mean is the best estimator because it is unbiased, consistent, and efficient (the simulation sketch below illustrates this)
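A minimal simulation sketch (not from the chapter; the population values mu = 100, sigma = 15 and the sample size are arbitrary assumptions) comparing the mean and median as estimators: both center on the population parameter, but the mean's sampling distribution is tighter, i.e., more efficient.

```python
# Sketch: compare mean vs. median as estimators of a normal population's center.
# Population values (mu=100, sigma=15) and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 100.0, 15.0, 50, 10_000

samples = rng.normal(mu, sigma, size=(reps, n))   # reps samples of size n
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

# Unbiasedness: both sampling distributions center on the parameter (100)
print(f"average sample mean:   {means.mean():.2f}")
print(f"average sample median: {medians.mean():.2f}")

# Efficiency: the mean's sampling distribution has the smaller standard error
print(f"SE of the mean:   {means.std(ddof=1):.3f}")    # ~ sigma/sqrt(n) = 2.12
print(f"SE of the median: {medians.std(ddof=1):.3f}")  # ~ 1.25 * sigma/sqrt(n)
```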
OLS method
- gives estimates of the parameters that make the sum of squared residuals as small as possible (sketched below)
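A hedged sketch (simulated data; the true values b0 = 2 and b1 = 0.5 are assumptions) showing the closed-form OLS estimates for a simple regression, which are exactly the values that minimize the sum of squared residuals:

```python
# Sketch: closed-form OLS for y = b0 + b1*x on simulated data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)  # assumed true b0=2, b1=0.5

# These formulas minimize the sum of squared residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
ss_resid = np.sum((y - (b0 + b1 * x)) ** 2)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, residual SS = {ss_resid:.1f}")
```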
what is an outlier?
- a score that is very different from the other scores
- there are different kinds
- outliers affect parameter estimates
- they affect the parameter estimates, but have an even bigger effect on the sum of squared errors (SS)
- the bias cascades: biased SS → biased SD → biased SE → biased CI (making CIs much wider, which is an issue for significance testing; see the sketch below)
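A small illustration of that cascade (the scores are made up): adding one outlier moves the mean a little but inflates the SS, SD, and SE a lot.

```python
# Sketch: one outlier nudges the mean but blows up SS (and hence SD and SE).
import numpy as np

scores = np.array([4, 5, 5, 6, 6, 7, 7, 8])
with_outlier = np.append(scores, 25)

for label, data in [("without outlier", scores), ("with outlier", with_outlier)]:
    ss = np.sum((data - data.mean()) ** 2)       # sum of squared errors
    se = data.std(ddof=1) / np.sqrt(len(data))   # standard error of the mean
    print(f"{label}: mean={data.mean():.2f}, SS={ss:.1f}, SE={se:.2f}")
```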
overview of assumptions
- if assumptions are violated, you can't trust the test statistic
- assumption violations vary by degree
- assumptions are about the characteristics of the data
- some statistical tests are robust to violations of an assumption, meaning the results are usually still valid even when the assumption is violated
- parametric tests: statistical tests that make assumptions about the distribution
- nonparametric tests: don't require assumptions about the distribution to be met
additivity and linearity
assumption
- the relationship between X and Y can be represented by a line
- linear relationship between the predictors and the outcome
- it's important that this is met, because fitting a linear model to nonlinear data would be inappropriate (see the residual sketch below)
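One informal check (a sketch with simulated data; splitting the residuals into thirds is just a crude stand-in for eyeballing a residual plot): force a straight line onto curved data and the residuals show a systematic pattern instead of hovering around zero.

```python
# Sketch: fit a line to deliberately nonlinear data, then inspect residuals.
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 200)
y = 0.3 * x**2 + rng.normal(0, 1, 200)   # truly curved relationship

b1, b0 = np.polyfit(x, y, 1)             # force-fit a straight line
residuals = y - (b0 + b1 * x)

# With a truly linear relationship these chunk means would all hover near 0;
# here the U-shaped pattern (+, -, +) reveals the violated assumption.
for chunk in np.array_split(residuals, 3):
    print(f"mean residual: {chunk.mean():+.2f}")
```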
normality
assumption
- the residuals of the model / the sampling distribution of the parameters (b’s) must be normally distributed
- for CIs around a parameter estimate to be accurate, the estimate must have a normal sampling distribution
- for significance tests of models to be accurate, the sampling distribution of what’s being tested must be normal
- in a linear model, we assume the residuals are normally distributed; the normality assumption also matters when choosing an estimation method. if the assumption is met, use the OLS method (a sketch of checking residuals follows below)
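One way to probe the assumption directly (a sketch; scipy's Shapiro-Wilk test applied to simulated "residuals", whose distributions are chosen purely for illustration):

```python
# Sketch: Shapiro-Wilk test of normality on two sets of simulated residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
residuals_ok = rng.normal(0, 1, 200)        # roughly normal residuals
residuals_skewed = rng.exponential(1, 200)  # clearly non-normal residuals

for label, res in [("normal", residuals_ok), ("skewed", residuals_skewed)]:
    w, p = stats.shapiro(res)
    print(f"{label}: W={w:.3f}, p={p:.4f}")  # small p suggests non-normality
```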
central limit theorem
- describes the relationship between a population of individual scores and the sampling distribution of the means (estimates)
- as the sample size increases, the shape of the sampling distribution approaches normality, no matter the shape of the individual-score distribution (parent distribution)
- rule of thumb: samples of about 30 are usually large enough for the sampling distribution to be approximately normal
- the sampling distribution depends on the sample size (simulated in the sketch below)
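A quick CLT simulation (the exponential parent distribution and the sample sizes are arbitrary choices): the skew of the sampling distribution of the mean shrinks toward zero, i.e., toward normality, as n grows.

```python
# Sketch: sampling distribution of the mean from a skewed parent distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
for n in (5, 30, 200):
    # 10,000 samples of size n from a right-skewed exponential population
    sample_means = rng.exponential(1.0, size=(10_000, n)).mean(axis=1)
    print(f"n={n:3d}: skew of sampling distribution = {stats.skew(sample_means):.3f}")
```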
homoscedasticity/homogeneity of variance
assumption
- homogeneity of variance: the assumption that different groups come from populations with exactly the same variance
- homoscedasticity: the same assumption, but for a continuous predictor (the variance of the errors is constant across all values of the predictor)
- if the assumption is violated, consider estimating the parameters using the weighted least squares (WLS) method
- CIs and NHST are considerably biased if the assumption is not met
- funneling in a plot of residuals against fitted values indicates heteroscedasticity / a violation of homogeneity (a Levene-test sketch follows below)
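For the grouped (homogeneity) version, a standard check is Levene's test; here's a sketch with simulated groups whose SDs (5 vs. 15) are deliberately, and artificially, unequal:

```python
# Sketch: Levene's test for equality of group variances on simulated groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(50, 5, 40)    # SD = 5
group_b = rng.normal(50, 15, 40)   # SD = 15: variances clearly unequal

stat, p = stats.levene(group_a, group_b)
print(f"Levene W={stat:.2f}, p={p:.4f}")  # small p -> variances differ
```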
independence
assumption
- the error terms in your model are unrelated to one another
- cannot trust CIs or NHST if this assumption is violated
- use robust methods or HLM if the assumption is violated
- if the errors aren't independent, the SE is underestimated, which affects CIs and NHST (a Durbin-Watson sketch follows below)
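A sketch of one common independence diagnostic, the Durbin-Watson statistic, computed by hand on simulated errors (values near 2 suggest independent errors; values near 0 suggest strong positive autocorrelation):

```python
# Sketch: Durbin-Watson statistic on independent vs. autocorrelated errors.
import numpy as np

rng = np.random.default_rng(9)
independent = rng.normal(0, 1, 500)
autocorrelated = np.cumsum(rng.normal(0, 1, 500)) * 0.1  # related errors

def durbin_watson(e):
    # Sum of squared successive differences over sum of squared errors
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(f"independent errors:    DW = {durbin_watson(independent):.2f}")   # ~2
print(f"autocorrelated errors: DW = {durbin_watson(autocorrelated):.2f}")  # ~0
```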