Module 11: Machine Learning pt.2 Flashcards
Ordinary least squares regression
Now we will run an Ordinary Least Squares Regression, but what does that mean?
- There is rarely a perfect relationship between the IV and DV. There is no exact temperature that will cause a distressed O-ring in this case; there will always be some error, captured in the error term.
- OLS is a line of best fit with the least squared errors: it chooses the intercept and slopes that minimize the sum of squared residuals, Σ(yᵢ − ŷᵢ)², giving us the most accurate linear prediction possible.
- Here is a review of the variables making up our model:
- O-Rings: This is a constant (it’s always 6), so there is no point using it as a predictor. It doesn’t vary, so it can’t contribute to different cases having different outcomes.
- DistressedOrings: This is what we’re trying to predict so this is our target variable
- Temp: Our most important predictor
- Pressure: Might or might not be predictive. Include it and see what happens.
- TempOrderOfFlight: This is just the order of the flights (Flight #1, #2, etc.). If we were interested in whether the situation is getting better or worse over time, we would want to include this as a predictor. However, since we are only interested in the effects of temperature (and possibly test pressure), including it might lead the model to attribute the change in the number of distressed rings to the mere passage of time, masking the relationship we’re really interested in.
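Before selecting predictors, we need pandas and statsmodels imported and the data loaded into a DataFrame named df. A minimal sketch (the file name challenger.csv is an assumption; substitute whatever source actually holds the O-ring data):
import pandas as pd
import statsmodels.api as sm
# Load the O-ring data into a DataFrame (file name is a placeholder)
df = pd.read_csv('challenger.csv')
df.head()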
X = df[['Temp', 'Pressure']]
y = df['DistressedOrings']
Add a constant so the model will estimate an intercept (otherwise the model will fit a line through the origin).
X = sm.add_constant(X)
Fit the OLS model
est = sm.OLS(y, X).fit()
Check the results
est.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:       DistressedOrings   R-squared:                       0.354
Model:                            OLS   Adj. R-squared:                  0.290
Method:                 Least Squares   F-statistic:                     5.490
Date:                Thu, 16 Jan 2020   Prob (F-statistic):             0.0126
Time:                        01:44:21   Log-Likelihood:                -17.408
No. Observations:                  23   AIC:                             40.82
Df Residuals:                      20   BIC:                             44.22
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.3298      1.188      2.803      0.011       0.851       5.808
Temp          -0.0487      0.017     -2.910      0.009      -0.084      -0.014
Pressure       0.0029      0.002      1.699      0.105      -0.001       0.007
==============================================================================
Omnibus:                       19.324   Durbin-Watson:                   2.390
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               23.471
Skew:                           1.782   Prob(JB):                     8.00e-06
Kurtosis:                       6.433   Cond. No.                     1.84e+03
==============================================================================
With the regression results, you can use the params attribute to find the values of the slopes and intercept:
est.params
const 3.329831
Temp -0.048671
Pressure 0.002939
dtype: float64
We can then use these values, along with the line that best represents the relationship, to predict outcomes based on different levels of pressure:
# Intercept
constant = est.params['const']
# Coeff for Temp
coef1 = est.params['Temp']
# Coeff for Pressure
coef2 = est.params['Pressure']
# No. of O-rings in distress when temperature = 31 and pressure is 0, 50, 100, and 200
for pressure in [0, 50, 100, 200]:
    print("Temp=31 Pressure=", pressure, " Predicted # of O-Rings in distress:", constant +
          coef1 * 31 + coef2 * pressure)
Temp=31 Pressure= 0 Predicted # of O-Rings in distress: 1.8210269508611583
Temp=31 Pressure= 50 Predicted # of O-Rings in distress: 1.9679931836796445
Temp=31 Pressure= 100 Predicted # of O-Rings in distress: 2.114959416498131
Temp=31 Pressure= 200 Predicted # of O-Rings in distress: 2.4088918821351033
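Equivalently, statsmodels can compute these predictions directly with est.predict. A small sketch, assuming the fitted model est and the imports above; the new observations just need the same columns (including the constant) as the training design matrix:
# New observations at Temp=31 and the four pressure levels
new_obs = pd.DataFrame({'Temp': [31, 31, 31, 31],
                        'Pressure': [0, 50, 100, 200]})
new_obs = sm.add_constant(new_obs)  # prepend the const column, as we did for X
print(est.predict(new_obs))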
Note: extrapolating results outside the range of actual observations is always a dicey proposition. Linear regression provides insight, but its output here should be treated as indicative only, not as a precise prediction.
What is feature engineering
This is the process of identifying accurate and appropriate input features to arrive at a suitable output. For example, if you are looking at housing prices as the output, with number of rooms, longitude, and latitude as inputs, you could combine longitude and latitude into a single location feature, which is much more useful for predicting the output. Here are some of the techniques for feature engineering (a short pandas sketch follows the list):
- Missing values: handle them by dropping rows or columns with a significant number of missing values, or by numerical imputation, where you fill in missing values with the mean, median, or mode of the column
- Detecting outliers: outliers can skew results, so it is important to detect and remove (or cap) them
- Binning: categorizing ranges of data into logical bins or groups (low, medium, or high) to make them more meaningful for your analysis
- Variable transformation: typically you want your data to be normally distributed; if it is not, there are methods of transforming it
- Feature creation: using mathematical functions to create new features – you can combine, add, subtract, calculate the mean, min/max, or product, or use any other relevant method of creating a more meaningful variable
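A small pandas sketch of a few of these techniques (the housing columns Rooms, Longitude, Latitude, and Price are hypothetical examples, not part of our O-ring data):
import pandas as pd
import numpy as np
housing = pd.DataFrame({'Rooms': [3, 4, np.nan, 5],
                        'Longitude': [-122.1, -122.3, -121.9, -122.0],
                        'Latitude': [37.4, 37.5, 37.3, 37.6],
                        'Price': [450000, 620000, 380000, 910000]})
# Missing values: numerical imputation with the column median
housing['Rooms'] = housing['Rooms'].fillna(housing['Rooms'].median())
# Binning: categorize prices into logical groups
housing['PriceBand'] = pd.cut(housing['Price'], bins=3, labels=['low', 'medium', 'high'])
# Feature creation: combine longitude and latitude into a single rough location feature
housing['Location'] = housing['Longitude'].round(1).astype(str) + '_' + housing['Latitude'].round(1).astype(str)
housing.head()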
What are dummy variables
Datasets contain both categorical and numerical data – a line-of-best-fit regression model works well with numerical data, but not so well with categorical data. Consider grouping voters by which political party they vote for: party affiliation is categorical.
Dummy variables work by transforming the category of interest to 1 and all other categories to 0. For example, Liberal is coded 1 and all other political parties are set to 0.
You should be cautious, however, of multicollinearity – a problem of redundancy. If, for example, we have a dummy variable for smokers (smoker = 1, non-smoker = 0), then we don’t need a separate dummy variable for non-smokers (non-smoker = 1, smoker = 0). By virtue of the first dummy variable, everyone coded 0 is already a non-smoker.
The relevant function is pd.get_dummies(df['column name']), and you can merge the resulting dummy columns back into a DataFrame.
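A minimal sketch (the Party column and its values are hypothetical); passing drop_first=True drops one redundant dummy column, avoiding the multicollinearity issue described above:
import pandas as pd
votes = pd.DataFrame({'Party': ['Liberal', 'Conservative', 'NDP', 'Liberal']})
dummies = pd.get_dummies(votes['Party'], drop_first=True)  # one 0/1 column per party, minus one redundant column
votes = pd.concat([votes, dummies], axis=1)  # merge the dummy columns back into the DataFrame
votes.head()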
Other modelling techniques in brief - logistic regression and decision trees
- Logistic regression: this method is primarily used for classification problems, which is where dummy variables apply. Consider the probability of an explosion due to O-ring damage (1 = damaged, 0 = not damaged) – logistic regression computes the probability that an observation belongs to a particular class (see the sketch after this list).
- Decision trees: these are a versatile family of ML models that handle both classification and regression tasks – they work by splitting the dataset into smaller and smaller subsets until a decision is reached.
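A minimal sketch of both techniques, assuming the Challenger DataFrame df from earlier (scikit-learn is an assumption here; it is not used elsewhere in this module):
import pandas as pd
import statsmodels.api as sm
from sklearn.tree import DecisionTreeClassifier
# Logistic regression: probability that at least one O-ring is distressed
y = (df['DistressedOrings'] > 0).astype(int)  # 1 = damaged, 0 = not damaged
X = sm.add_constant(df[['Temp']])
logit_model = sm.Logit(y, X).fit()
print(logit_model.predict(X))  # predicted probability of damage for each flight
# Decision tree: the same classification task with a tree-based model
tree = DecisionTreeClassifier(max_depth=2).fit(df[['Temp', 'Pressure']], y)
new_flight = pd.DataFrame({'Temp': [31], 'Pressure': [50]})
print(tree.predict(new_flight))  # predicted class (0 or 1) for a cold launch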