Regression Flashcards
When would you use regression?
When considering relationships between a continuous predictor variable and a continuous response variable.
How do you plot Least Squares Regression?
1) Plot a point at the coordinate (mean of x, mean of y); the best fit line always passes through this point.
2) The best fit line is the line that minimises the sum of the squared vertical deviations of the data points from the line.
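As a minimal sketch of the two steps above, the gradient and intercept can be computed by hand in plain Python. The data are made up for illustration, and the function name `least_squares_line` is hypothetical rather than taken from any library.

```python
def least_squares_line(xs, ys):
    """Return (intercept, gradient) of the least squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Gradient = sum of cross-deviations / sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    # Choosing the intercept this way forces the line through (mean_x, mean_y).
    intercept = mean_y - gradient * mean_x
    return intercept, gradient


# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
intercept, gradient = least_squares_line(xs, ys)
print(round(intercept, 2), round(gradient, 2))  # 2.2 0.6
```

Plugging the mean of x (3) into the fitted line returns the mean of y (4), which is the point plotted in step 1.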
What is the equation of a line and an alternative notation?
1) y = mx + c
2) y = A0 + A1x
(Where A0 is the intercept c, A1 is the gradient m, and x is the predictor variable)
What is another term for the gradient?
The slope, or regression coefficient
1) What range of values can Pearson’s correlation coefficient take?
2) What does a Pearson’s coefficient of -1 or +1 mean?
1) Ranges from -1 to 1
2) -1 is a perfect linear negative correlation
+1 is a perfect linear positive correlation
What does Pearson’s correlation coefficient assume?
It assumes the relationship between x and y is linear.
What does a Pearson’s correlation coefficient of 0 indicate?
There is no linear relationship between x and y. A non-linear relationship (for example a curved one) may still exist.
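The cards above on Pearson’s coefficient can be checked with a short sketch. Note that a coefficient of 0 only describes the linear trend: in the made-up data below, y is completely determined by x (y = x^2) and the coefficient is still 0. The function name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up illustrative data, symmetric around x = 0.
xs = [-2, -1, 0, 1, 2]
perfect_positive = pearson_r(xs, [2 * x + 1 for x in xs])  # straight line going up
perfect_negative = pearson_r(xs, [-3 * x for x in xs])     # straight line going down
curved = pearson_r(xs, [x ** 2 for x in xs])                # U-shaped, not linear

print(perfect_positive, perfect_negative, curved)
```

The first two values are +1 and -1 (perfect linear correlations), and the third is 0 even though the U-shaped relationship is exact.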
1) What is Spearman’s rank?
2) What is it used for?
1) A non-parametric correlation coefficient which doesn’t assume the correlation is linear.
2) It is used to look at monotonic correlations (y consistently increases, or consistently decreases, as x increases).
How is a Spearman’s rank calculated?
The raw x and y data are converted into ranks. If the correlation is perfectly monotonic, the ranks will show a perfect linear relationship.
What is Spearman’s rank compared to Pearson’s correlation coefficient?
Spearman’s rank is simply Pearson’s correlation coefficient calculated on the ranked data rather than the raw data.
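A minimal sketch of this idea: rank each variable, then apply Pearson’s formula to the ranks. The helper names (`ranks`, `pearson_r`, `spearman_rho`) are hypothetical, the data are made up, and the ranking helper assumes there are no tied values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (smallest) upwards. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Made-up data: y = x^3 is monotonic but not linear.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(pearson_r(xs, ys))     # below 1, because the raw data are curved
print(spearman_rho(xs, ys))  # 1.0, because the ranks line up perfectly
```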
Can p-values be associated with correlation coefficients?
Yes. The p-value indicates whether the correlation is significantly different from 0.
What are 3 types of general linear models?
1) ANOVA
2) ANCOVA
3) Linear regression
What is the form of a general linear model?
Y = A0 + A1x1 + A2x2 + (B1 or B2 or -B1-B2) + E
Where: A0 is a constant (the intercept)
A1 is the gradient of predictor variable 1 (x1)
A2 is the gradient of predictor variable 2 (x2)
(B1 or B2 or -B1-B2) is the effect of a categorical predictor variable, here one with three levels whose effects sum to zero
E is the error, which is assumed to be normally distributed
What determines significance in general linear models?
F ratios
What is R^2?
The proportion of the variation in the response variable that the model explains.
R^2 = 1 - (residual sum of squares / total sum of squares)
What is the residual sum of squares / total sum of squares?
The proportion of variation that hasn’t been explained by the model.
How can you use Pearson’s coefficient to find R^2?
R^2 = r^2, where r is Pearson’s correlation coefficient (this holds for linear regression with a single predictor).
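The two routes to R^2 can be compared in a short sketch: one from the sums of squares of a fitted line, the other by squaring Pearson’s r. The data are made up for illustration, and the variable names are assumptions of this example.

```python
from math import sqrt

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Least squares line for a single predictor.
gradient = sxy / sxx
intercept = mean_y - gradient * mean_x

# Residual sum of squares: variation left over after fitting the line.
residual_ss = sum((y - (intercept + gradient * x)) ** 2 for x, y in zip(xs, ys))
# Total sum of squares: variation of y around its own mean.
total_ss = syy

r_squared = 1 - residual_ss / total_ss
r = sxy / sqrt(sxx * syy)

print(round(r_squared, 3), round(r ** 2, 3))  # 0.6 0.6
```

Here the unexplained proportion (residual sum of squares / total sum of squares) is 0.4, so the line explains 60% of the variation.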
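What is Simpson’s paradox?
This is when a trend seen in the combined data weakens, disappears or reverses once the data are split into groups. You come to the wrong conclusion because a lurking variable (the grouping) hasn’t been taken into account.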
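A small sketch with made-up numbers shows one common form of the paradox: the gradient is negative within each group but positive when the groups are pooled, because the group itself is the lurking variable. The function name `gradient` and the two groups are assumptions of this example.

```python
def gradient(xs, ys):
    """Gradient of the least squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up data: two groups, each with a downward trend.
group_a_x, group_a_y = [1, 2, 3], [5, 4, 3]
group_b_x, group_b_y = [6, 7, 8], [10, 9, 8]

slope_a = gradient(group_a_x, group_a_y)
slope_b = gradient(group_b_x, group_b_y)
# Ignoring the grouping and fitting one line to everything.
slope_pooled = gradient(group_a_x + group_b_x, group_a_y + group_b_y)

print(slope_a, slope_b, round(slope_pooled, 2))  # -1.0 -1.0 0.81
```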
What is interpolation?
Predicting values of the response variable within the range of the measured values.
What is extrapolation?
Predicting values of the response variable outside the range of the measured values.
What can be used if a relationship isn’t linear?
1) Linear regression using polynomial explanatory variables
2) Non-linear regression
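Option 1 can be sketched as follows: the model stays linear in its coefficients (A0, A1, A2), but the explanatory variables are x and x^2. This is a minimal illustration that solves the normal equations in plain Python; the function names are hypothetical and the data are an exact made-up quadratic, so the fit should recover the coefficients.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * coeffs = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix, so row operations act on both sides at once.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution, from the last coefficient to the first.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (aug[row][n] - known) / aug[row][row]
    return coeffs


def fit_polynomial(xs, ys, degree):
    """Least squares fit of y = A0 + A1*x + ... + Ad*x^d via the normal equations."""
    size = degree + 1
    # Each row of the design matrix holds 1, x, x^2, ... for one observation.
    design = [[x ** p for p in range(size)] for x in xs]
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(size)]
           for i in range(size)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)


# Made-up data lying exactly on y = 1 + 2x + 3x^2.
xs = [-2, -1, 0, 1, 2, 3]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
a0, a1, a2 = fit_polynomial(xs, ys, degree=2)
print(round(a0, 6), round(a1, 6), round(a2, 6))
```

The normal equations are convenient for a short example but can be numerically unstable for high-degree polynomials, so this should be read as a teaching sketch only.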
What is an example of non-linear regression?
Random forest regression
What is random forest regression?
This is a forest of decision trees. The trees are built on training data that you provide to the algorithm.
Randomness comes from building lots of trees, each based only on a subset of the data that is randomly sampled each time.
Each decision tree makes its own prediction, and the average prediction across the forest is used as the fitted regression value.
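The idea can be sketched with a deliberately simplified stand-in: each "tree" below is a one-split decision tree (a stump) fitted to a random sample of the data drawn with replacement, and the forest prediction is the average over all stumps. Real random forests grow much deeper trees and usually also sample the predictor variables at each split, so this is an illustration under those stated simplifications. All function names and the data are made up for this example.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split tree: choose the threshold that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Every sampled x was identical, so predict the mean everywhere.
        mean_y = sum(ys) / len(ys)
        return (float("inf"), mean_y, mean_y)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Fit n_trees stumps, each on its own random sample of the rows."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # Random sample of n row indices, drawn with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of every tree in the forest."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


# Made-up training data following a curve (y = x^2).
xs = list(range(10))
ys = [x ** 2 for x in xs]
forest = fit_forest(xs, ys)
print(predict_forest(forest, 0), predict_forest(forest, 9))
```

Because every tree predicts an average of training responses, the forest’s predictions stay within the range of the training data, so asking for a value far outside the measured x range (for example x = 100) returns the same prediction as x = 9.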
What are the advantages/disadvantages of random forest regression?
Advantages: It is based entirely on the data it is given, so we do not impose our own ideas about the nature of the relationship.
Disadvantages: 1) Can be slow
2) Can sometimes overfit the data