Unit 7 Flashcards
Why do we use scatterplots?
To graph the relationship between two quantitative variables.
We use one when we have two variables that we believe are related to one another. A histogram shows only one variable at a time, so it cannot show that relationship.
What is a scatterplot?
- each point on a scatterplot shows the values of two different variables measured on the same individual
- if you have an explanatory variable and a response variable, they should be plotted on the x-axis and y-axis respectively
- if there is no explanatory variable, it does not matter which variable goes on which axis
What are the two different types of models for modeling a variable y by a variable x?
- deterministic model: a model where y can always be predicted exactly by x alone
- probabilistic model: a model that includes both a deterministic portion and a random error term
What is the general form of a probabilistic linear model?
y = y-intercept + slope*x + random error term (the error term follows a normal distribution, N(0, sigma))
How does the scatterplot affect the sigma value used for the random error term?
points spread out -> sigma large
points close to the line -> sigma small
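A minimal sketch of this idea, assuming a made-up true line (intercept 2.0, slope 0.5) and two made-up sigma values. It simulates points from the probabilistic model and measures how far they fall from the line:

```python
import random
import statistics

# Hypothetical true intercept and slope, chosen only for illustration
beta0, beta1 = 2.0, 0.5
xs = list(range(1, 51))

def simulate(sigma, seed=0):
    """Generate y = beta0 + beta1*x + e, with e drawn from N(0, sigma)."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0, sigma) for x in xs]

def spread_about_line(ys):
    """Standard deviation of the vertical distances from the true line."""
    distances = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
    return statistics.pstdev(distances)

tight = spread_about_line(simulate(sigma=0.5))  # points close to the line
loose = spread_about_line(simulate(sigma=5.0))  # points spread out

print(tight, loose)
```

The larger sigma gives the larger spread about the line, which is what a more scattered plot looks like.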
What does the probabilistic linear model look like for average y?
mean-y = y-intercept + slope*x (the errors should average out to 0 since the error follows N(0, sigma))
What is the regression line?
The line we will use to model our data and make predictions.
the predicted value of y for a given x = the estimate of the intercept + the estimate of the slope*x
What is the difference between observed and predicted values of y?
Observed values of y are the points on our scatterplot. Our predicted values of y are the values along the line we fit.
What is the relationship between the observed and predicted values of y called?
The vertical difference between them is the “error”: how far off our real-world value is from our modeled/predicted value.
What do we want to minimize in fitting our curve?
The sum of (observed y - predicted y)^2. We do not use the sum of the raw differences because positives and negatives would cancel each other out, which is the same reason deviations are squared when computing a variance.
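A small illustration of the cancellation problem, using three made-up points that lie exactly on y = 2x. A flat line through the mean of y fits poorly, yet its raw errors still sum to zero. Only the squared errors tell the two lines apart:

```python
# Made-up data lying exactly on the line y = 2x
xs = [1, 2, 3]
ys = [2, 4, 6]

def errors(b0, b1):
    """Observed y minus predicted y for the line y = b0 + b1*x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

flat = errors(4, 0)    # poor fit: horizontal line y = 4
exact = errors(0, 2)   # perfect fit: y = 2x

sum_flat = sum(flat)                    # -2 + 0 + 2, the errors cancel
sse_flat = sum(e ** 2 for e in flat)    # 4 + 0 + 4
sse_exact = sum(e ** 2 for e in exact)  # every error is 0

print(sum_flat, sse_flat, sse_exact)    # 0 8 0
```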
What are the values of the estimate of the intercept and the estimate of the slope that minimize the squared errors?
estimate of slope: SSxy / SSxx
estimate of intercept: sample mean of y - estimate of slope * sample mean of x
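A worked sketch of these two formulas on a small made-up data set. It assumes the usual definitions SSxy = sum of (x - mean of x)(y - mean of y) and SSxx = sum of (x - mean of x)^2, which these cards do not spell out:

```python
# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)   # sample mean of x
y_bar = sum(ys) / len(ys)   # sample mean of y

# Assumed standard definitions of SSxy and SSxx
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)

slope = ss_xy / ss_xx                # estimate of slope
intercept = y_bar - slope * x_bar    # estimate of intercept

def predict(x):
    """Predicted y for a given x on the fitted regression line."""
    return intercept + slope * x

print(slope, intercept, predict(3))
```

Here SSxy = 6 and SSxx = 10, so the slope estimate is 0.6 and the intercept estimate is 4 - 0.6*3 = 2.2 (up to floating-point rounding). The fitted line passes through the point (mean of x, mean of y).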
What does the slope of the regression line (beta-1-hat) tell us?
How much the predicted y changes when x increases by one unit
What is the general form for an interpretation of beta1-hat?
For every one-unit increase in X, we predict that Y will increase (or decrease) by beta-1-hat units
What is the general form for an interpretation for beta0-hat?
The predicted value of Y when X is zero (if x = 0 is outside the range of the data, this interpretation may be meaningless or nonsensical)
What are the assumptions we need to make for our regression to be valid?
- the expected value (the mean) of the errors is 0
- the variance of the error terms is constant across all values of x
- the relationship between x and y is linear
- the probability distribution of the errors is normal
- the errors are independent between observations