Statistics 6 Flashcards
Outline 3 characteristics of explanatory statistics
- Most powerful form of statistical analysis
- Determines causation of relationship
- Strict assumptions
What does regression allow us to do with statistics?
It allows us to make a numerical prediction of how one variable linearly affects another
What does carrying out regression allow us to define?
The causal relationship: which variables are independent and which are dependent, i.e. which one affects the other and by how much
What is the regression line?
Numerical description of the line of best fit
What are the three assumptions/conditions of a regression line?
Continuous data
Parametric (normally distributed and n > 30)
There will always be scatter from the perfect relationship
What is the standard format for expressing a regression line? What are the components of this format?
y = a + bx [a = y-intercept and b = regression coefficient, i.e. the slope of the line]
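As a rough illustration of the format above, the sketch below estimates a and b from a small made-up dataset using the standard closed-form least-squares formulas. The function name `fit_line` and the data values are assumptions for the example, not part of the flashcards.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line y = a + b*x fitted by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # regression coefficient (slope)
    a = mean_y - b * mean_x  # y-intercept
    return a, b


# Hypothetical data that lie exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")
```

Because the example points fall exactly on a straight line, the fitted intercept and slope recover that line; real data would scatter around it.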
What criterion is the regression line based on? Explain this criterion.
“Least-squares criterion”: the line is chosen so that the sum of the squared vertical distances of the points from the line is as small as possible. Simply balancing the distances either side of the line (total distance of points above line = total distance below line) is not enough, because many lines satisfy that condition; least squares picks out the single line that is the best fit between all the points
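The sketch below tries to make the criterion concrete. It compares three lines that all pass through the point of means, so each balances the distances above and below the line, yet only one has the smallest sum of squared distances. The dataset and the helper names (`residuals`, `sse`) are assumptions made for the illustration.

```python
# Hypothetical data, not from the flashcards
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)


def residuals(a, b):
    """Vertical distances of each point from the line y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def sse(a, b):
    """Sum of squared vertical distances (the least-squares quantity)."""
    return sum(r ** 2 for r in residuals(a, b))


# Least-squares slope from the closed-form formula
b_best = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))

# Three slopes; each intercept forces the line through (mean_x, mean_y)
results = {}
for label, b in [("flat", 0.0), ("least squares", b_best), ("steep", 1.2)]:
    a = mean_y - b * mean_x
    results[label] = (sum(residuals(a, b)), sse(a, b))
    print(f"{label}: residual sum = {results[label][0]:+.3f}, "
          f"SSE = {results[label][1]:.3f}")
```

All three residual sums come out at essentially zero, while the squared-distance totals differ, with the least-squares line giving the smallest value (about 2.4 against about 6.0 for the other two lines on this data).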
What two characteristics do we need to look at to determine whether the regression line is useful?
Unexplained variance and explained variance
What is explained variance?
The variation in the y-values that is accounted for by the regression line, seen as the vertical distance between the regression line and the mean y-value. The closer the points are to the line, the more of their distance from the mean y-value lies between the mean and the line, and the higher the explained variance
What is unexplained variance?
The variation of the points from the line that is not accounted for by the regression line that has been drawn, seen as the vertical distance between each point and the line. The further away the points are from the regression line, the higher the unexplained variance
How do you define a useful regression line and a bad regression line based on the explained and unexplained variance?
The higher the explained variance and the lower the unexplained variance, the better the regression line is, because it accounts for a greater amount of the variation in the dataset. If this is reversed, the regression line is not very representative, as it does not explain much of the data.
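A minimal sketch of this comparison is given below, assuming the explained and unexplained parts are measured as sums of squared vertical distances (line to mean, and point to line). The dataset and variable names are hypothetical.

```python
# Hypothetical data, not from the flashcards
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Fit y = a + b*x by least squares
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x
predicted = [a + b * x for x in xs]

# Explained: squared distance from the regression line to the mean y-value
explained = sum((p - mean_y) ** 2 for p in predicted)
# Unexplained: squared distance from each point to the regression line
unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
total = sum((y - mean_y) ** 2 for y in ys)

share_explained = explained / total
print(f"explained = {explained:.2f}, unexplained = {unexplained:.2f}, "
      f"share explained = {share_explained:.2f}")
```

On this made-up data the line accounts for roughly 60% of the variation; a value close to 1 would indicate a very representative line, and a value close to 0 a poor one.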
What is another word for the unexplained variance?
Residuals (the vertical distances of the points from the regression line)
If you are visually determining the usefulness of the regression line, what do you need to do and what is another key component?
You need to compare the heights of the unexplained variance and the explained variance. To do this, measure the vertical distance from the point in question to the regression line (unexplained), then compare it against the vertical distance from the regression line, at the same x-value, to the mean y-value (explained).
What is the F-ratio?
The value which represents the ratio of the explained variance to the unexplained variance.
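The sketch below shows one way the ratio might be computed for a line with a single x variable. It assumes the usual degrees of freedom for that case (1 for the explained part, n - 2 for the unexplained part), which the flashcards do not state; the data and names are hypothetical.

```python
# Hypothetical data, not from the flashcards
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Fit y = a + b*x by least squares
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x
predicted = [a + b * x for x in xs]

# Sums of squares for the explained and unexplained parts
ss_explained = sum((p - mean_y) ** 2 for p in predicted)
ss_unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))

# Assumed degrees of freedom for one predictor: 1 and n - 2
explained_variance = ss_explained / 1
unexplained_variance = ss_unexplained / (n - 2)

f_ratio = explained_variance / unexplained_variance
print(f"F = {f_ratio:.2f}")
```

A larger F-ratio means the explained variance dominates the unexplained variance; whether a given value is significant would still need checking against an F table for the relevant degrees of freedom.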
What is the notation for the explained variance?
S with subscript y and superscript 2, i.e. $S_y^2$