GLM and confidence intervals Flashcards
linear models
a statistical model is an equation
- summarizes, represents and predicts the values of a variable or variables
a linear model is a straight line that summarizes the relationship between x and y
linear model equation
y = b0 + b1x
b0 -> intercept
b1 -> slope
- b0 and b1 are the regression coefficients
linear model predict
for any value of xi, we can use the linear model to predict the value of yi (y-hat i)
y-hat i = b0 + b1xi
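A minimal sketch in Python (hypothetical intercept, slope and x values; numpy assumed) of using the linear model to predict y-hat i for each xi:

```python
import numpy as np

# hypothetical regression coefficients
b0 = 2.0   # intercept
b1 = 0.5   # slope

x = np.array([1.0, 2.0, 3.0, 4.0])   # observed predictor values x_i
y_hat = b0 + b1 * x                  # predicted values y-hat_i for each x_i
print(y_hat)                         # [2.5 3.  3.5 4. ]
```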
residual
difference between a score and the value predicted by the model
errori = yi - y-hat i
GLM components
model and error
yi = model + errori
yi = b0 + b1xi + errori
the model predicts the value of y-hat i
the error tells us the difference between each score and the value predicted by the model
errori = yi - model
errori = yi - y-hat i
errori = yi - (b0 + b1xi)
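Continuing the sketch above (hypothetical coefficients and observed scores), the residuals are the observed scores minus the model's predictions:

```python
import numpy as np

b0, b1 = 2.0, 0.5                       # hypothetical coefficients
x = np.array([1.0, 2.0, 3.0, 4.0])      # hypothetical predictor values
y = np.array([2.7, 2.9, 3.8, 3.9])      # hypothetical observed scores y_i

y_hat = b0 + b1 * x                     # model predictions y-hat_i
error = y - y_hat                       # residuals: error_i = y_i - y-hat_i
print(error)                            # roughly [ 0.2 -0.1  0.3 -0.1]
```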
general linear model equation
yi = (b0 + b1xi) + errori
writing the model without the error term gives the equation used to predict values of yi (y-hat i)
errori = yi - y-hat i
yi = model + errori
yi = y-hat i + errori
GLM fitted to a dataset of scores from a single variable
with a single variable (y) there is no predictor variable (x), so the model is just a constant
yi = b0 + errori
to fit a model that predicts the values of y, use the sample mean as the constant (b0 = y-bar)
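A small sketch (hypothetical scores, numpy assumed) showing that with no predictor the fitted model is just the constant b0 = y-bar:

```python
import numpy as np

y = np.array([4.0, 6.0, 5.0, 7.0, 3.0])   # hypothetical scores for a single variable
b0 = y.mean()                             # constant-only model: b0 = y-bar
y_hat = np.full_like(y, b0)               # every predicted value is the sample mean
print(b0)                                 # 5.0
```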
different statistical methods for defining the best model for a data set
- robust regression (M estimation, Huber regression, Theil-Sen regression)
- quantile regression
- Least Absolute Shrinkage and Selection Operator (LASSO)
most common = least squares error method (LSE)
- also called the ordinary least squares method (OLS)
LSE
LSE defines the best model as the one that generates the smallest total squared error
total squared error = Sum(yi - y-hat i)^2
this total is called the sum of squared residuals (SSr)
SSr = Sum(yi - y-hat i)^2; for the constant-only model, y-hat i = y-bar, so SSr = Sum(yi - y-bar)^2
- the model that best fits the data is the one that generates the smallest value of SSr
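A sketch (hypothetical scores, constant-only model) comparing SSr for a few candidate values of b0; the sample mean gives the smallest total squared error:

```python
import numpy as np

y = np.array([4.0, 6.0, 5.0, 7.0, 3.0])     # hypothetical scores

def ss_residual(y, y_hat):
    """Sum of squared residuals: SSr = sum((y_i - y-hat_i)^2)."""
    return np.sum((y - y_hat) ** 2)

# try a few candidate constants, including the sample mean (5.0)
for b0 in [4.0, 5.0, 6.0]:
    print(b0, ss_residual(y, b0))
# prints: 4.0 15.0 / 5.0 10.0 / 6.0 15.0 -> SSr is smallest when b0 = y.mean()
```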
why can't we just calculate the sum of the errors?
if we sum all of the deviation scores, positive and negative deviations cancel, so we always get a value of zero
we avoid this problem by squaring the deviation scores and then summing the squared deviations (just as when calculating the variance and standard deviation)
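A quick numeric check (hypothetical scores) that raw deviations from the mean always sum to zero, while squared deviations do not:

```python
import numpy as np

y = np.array([4.0, 6.0, 5.0, 7.0, 3.0])   # hypothetical scores
dev = y - y.mean()                        # deviation scores

print(dev.sum())          # 0.0  (positive and negative deviations cancel)
print((dev ** 2).sum())   # 10.0 (squaring removes the cancellation)
```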
estimating the population mean
a sample should be drawn at random from a population
- 2 samples from the same population will probably contain different individuals, with different scores
- 2 samples from the same population are unlikely to have identical means
- the mean of a single sample is unlikely to be identical to the mean of the underlying population
- this variation between sample statistics is called sampling variation
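A small simulation sketch (assuming a normally distributed population with mean 100 and SD 15; numpy's random generator) showing that two random samples from the same population rarely have identical means:

```python
import numpy as np

rng = np.random.default_rng(1)

# two independent random samples from the same population
sample_a = rng.normal(loc=100, scale=15, size=50)
sample_b = rng.normal(loc=100, scale=15, size=50)

# the two sample means differ from each other and from the population mean (100)
print(sample_a.mean(), sample_b.mean())
```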
confidence intervals
a 95% CI is a range of values that will overlap with (contain) the population parameter 95% of the time
distribution of sample mean (DSM)
shows the distribution of the means of all possible samples drawn from a population
- assume the sample is representative of the population
bootstrapping
if the sample is representative of the population, we can use the sample data to create a hypothetical population, which should approximate the real population
- the hypothetical population has the same composition as the sample but is infinitely large
- every score in the sample is equally represented in the hypothetical population
- for data with n = 50, each score in the sample represents 2% of all scores in the hypothetical population
then randomly sample n = 50 scores from the hypothetical population and calculate the sample mean
repeat the sampling many times to generate many sample means
- plotting a histogram of the sample means generates a DSM
bootstrapped 95% CI
identified by the values of y-bar at the boundaries of the 2.5% tails of the DSM (the 2.5th and 97.5th percentiles)
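A sketch of the bootstrap procedure described above (hypothetical data, numpy assumed): resample n scores with replacement from the observed sample, collect the sample means to build the DSM, then take the 2.5% and 97.5% boundaries as the bootstrapped 95% CI:

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=100, scale=15, size=50)   # hypothetical observed sample, n = 50

n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # sampling with replacement from the sample mimics sampling from the
    # infinitely large hypothetical population built from it
    resample = rng.choice(y, size=y.size, replace=True)
    boot_means[i] = resample.mean()

# a histogram of boot_means approximates the DSM;
# the bootstrapped 95% CI is bounded by the 2.5th and 97.5th percentiles
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(ci_low, ci_high)
```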
reproducibility of simulation-based results
bootstrapping involves the analysis of large numbers of randomly selected samples
if you repeat the analysis, you will obtain a different set of randomly selected samples, so you may obtain different results
- bootstrapping is not perfectly reproducible - you get slightly different results each time you run the analysis
to maximize reproducibility, you can use large numbers of iterations of the simulation
- at least 10,000
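A sketch of the two reproducibility levers (hypothetical helper, not a library routine): fixing the random seed makes a rerun give identical results, and a large number of iterations (10,000 or more) makes unseeded reruns agree closely:

```python
import numpy as np

def boot_ci(y, n_boot=10_000, seed=None):
    """Percentile bootstrap 95% CI for the mean (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(y, size=y.size, replace=True).mean() for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])

y = np.random.default_rng(0).normal(100, 15, 50)   # hypothetical sample

print(boot_ci(y, seed=1))   # identical on every rerun (seed fixed)
print(boot_ci(y))           # varies slightly between runs, less so with larger n_boot
```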