Prediction Flashcards
Two problem categories of prediction
- Regression -> Linear regression
- Classification -> Logistic regression
What is Linear Regression?
We have seen a number of cases in which a scatter plot displays a correlation between variables.
We use the linear regression model to formalize this correlation.
It's a method used to model the relationship between a dependent variable and one or more independent variables.
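In the simple one-variable case, the model is just a line: the predicted y is the intercept plus the slope times x. A minimal sketch, with hypothetical coefficient values chosen purely for illustration:

# Simple linear regression model: predicted y = intercept + slope * x
def predict(x, beta_0, beta_1):
    return beta_0 + beta_1 * x

# Hypothetical coefficients, just to show the form of the model
print(predict(3.0, beta_0=0.5, beta_1=2.0))  # 6.5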
Pearson’s correlation coefficient
Denoted by r, it is a measure of correlation between two variables (columns)
Specifically, the r-value measures the strength and direction of the correlation (see the sketch after the list below)
r has the following properties:
- -1 <= r <= 1
- The further r is from 0, the stronger the correlation.
- The slope of the regression line has the same sign as the r-value.
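A minimal sketch of computing r with numpy, using made-up data:

import numpy as np

# Hypothetical example data (not from the notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print(r)  # close to 1 -> strong positive correlation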
Method of least squares
A fundamental concept of linear regression is the method of least squares.
This method finds the mathematically “best” line for the data.
Minimize:
sum_{i=1}^{n} (y_i - y_hat_i)^2
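A small sketch of what is being minimized: the sum of squared errors for a candidate line, with made-up data and guessed values for the intercept b0 and slope b1:

# Sum of squared errors for a candidate line y_hat = b0 + b1 * x
def sum_squared_error(x, y, b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(sum_squared_error(x, y, 0.0, 2.0))     # an arbitrary guess
print(sum_squared_error(x, y, 0.14, 1.96))   # roughly the least-squares line for this data: smaller value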
Root Mean Squared Error
We define the best line to be the one with the lowest Root Mean Squared Error (RMSE)
To measure this:
- For each observed y value, subtract the y-value predicted by the line; this difference is the error term.
- Square each error term.
- Sum all the squared errors.
- Divide by the sample size.
- Take the square root (see the sketch after this list).
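A minimal Python sketch of these steps, assuming y holds the observed values and y_hat the line's predictions (both names are hypothetical):

import math

def rmse(y, y_hat):
    # Squared error for each observation
    squared_errors = [(yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)]
    # Mean of the squared errors, then the square root
    return math.sqrt(sum(squared_errors) / len(y))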
How to find the slope and intercept of the regression line
num, den = 0, 0  # X_mean and Y_mean are the means of X and Y
for i in range(len(X)):
    num += (X[i] - X_mean) * (Y[i] - Y_mean)
    den += (X[i] - X_mean) ** 2
beta_1 = num / den
beta_0 = Y_mean - beta_1 * X_mean
beta_1 = slope
beta_0 = intercept
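A quick cross-check sketch with made-up data: numpy's polyfit with degree 1 returns the same least-squares slope and intercept, which can then be used to predict:

import numpy as np

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]

# np.polyfit(X, Y, 1) returns the least-squares slope and intercept
beta_1, beta_0 = np.polyfit(X, Y, 1)
print(beta_1, beta_0)  # approximately 1.96 and 0.14 for this data

# Predict y for a new x value with the fitted line
print(beta_0 + beta_1 * 6.0)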
The logistic model
To model proportions, we use the logistic model, also known as the sigmoid function
The most basic logistic model has the form:
f(x) = 1 / (1 + e^-(x))
This model has an S-shape and is bounded between 0 and 1 on the y-axis.
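A minimal sketch of the basic sigmoid:

import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^-(x)); the output always lies strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-5), sigmoid(0), sigmoid(5))  # ~0.007, 0.5, ~0.993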
Fitting the logistic model
General logistic model has the form:
p(x) = 1/(1 + e^-(B_0 + B_1*x))
Finding these values is not as easy as it is for the linear model: there is no closed-form solution, so fitting requires multivariate calculus and iterative optimization.
Using scikit-learn we can get the fitted values ->
intercept = model.intercept_
coefficient = model.coef_
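A minimal end-to-end sketch with scikit-learn, using made-up 0/1 data (scikit-learn expects a 2-D feature array):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: x values and 0/1 outcomes
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

intercept = model.intercept_    # B_0
coefficient = model.coef_       # B_1
print(intercept, coefficient)

# Predicted probability p(x) for a new x value
print(model.predict_proba([[3.5]])[:, 1])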