Introduction to Predictive Analytics in Python Flashcards
Introduction to Predictive Analytics in Python - Introduction and base table structure
Predictive analytics is the process of predicting an event using historical data gathered in the analytical basetable. The basetable contains three components: the population (the objects to make predictions for), the candidate predictors (which describe the objects by providing information that can be used for prediction) and the target (the event to predict).
Introduction to Predictive Analytics in Python - Logistic regression
Logistic regression is a statistical model that estimates the probability of an event occurring, given some previous data. It works with a binary target, where the event either happens (1) or does not happen (0).
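As a rough illustration of how a logistic regression turns predictor values into a probability, the sketch below applies the logistic (sigmoid) function to a linear score. The intercept and coefficient values are made up for illustration and do not come from any fitted model.

```python
import math


def logistic(score):
    """Map a linear score (any real number) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical intercept and coefficients, chosen only for illustration.
intercept = -1.0
coefficients = {"age": 0.01, "gender_F": 0.2, "time_since_last_gift": -0.005}

# One observation: a 72-year-old woman, 120 days since her last gift.
observation = {"age": 72, "gender_F": 1, "time_since_last_gift": 120}

# Linear score: intercept plus the weighted sum of the predictor values.
score = intercept + sum(coefficients[name] * observation[name] for name in coefficients)
probability = logistic(score)
print(round(probability, 3))
```

A score of 0 maps to a probability of 0.5; larger scores move the probability towards 1 and smaller scores towards 0.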
Introduction to Predictive Analytics in Python - Using the logistic regression model
Once your model is ready, you can use it to make predictions for a campaign. It is important to always use the latest information to make predictions.
The predictions that result from the predictive model reflect how likely it is that someone is a target.
Making predictions in Python
Female (gender_F=1)
Age 72
120 days since last gift
logreg.predict_proba([[1, 72, 120]])
array([[ 0.8204144, 0.1795856]])
Making predictions in Python
new_data = current_data[["gender_F", "age", "time_since_last_gift"]]
predictions = logreg.predict_proba(new_data)
Introduction to Predictive Analytics in Python Code Summary
Exploring the predictive variables
# Count and print the number of females.
print(sum(basetable['gender'] == 'F'))
Logistic regression
Building a logistic regression model
• Import the module linear_model from sklearn.
• Construct a dataframe X that contains the predictors (e.g. age, gender_F and time_since_last_gift).
• Construct a dataframe y that contains the target.
• Create a logistic regression model: logreg = linear_model.LogisticRegression()
• Fit the logistic regression model on the given basetable.
Code:
# Import linear_model from sklearn.
from sklearn import linear_model
# Create a dataframe X that only contains the candidate predictors age, gender_F and time_since_last_gift.
X = basetable[['age', 'gender_F', 'time_since_last_gift']]
# Create a dataframe y that contains the target.
y = basetable[['target']]
# Create a logistic regression model logreg and fit it to the data.
logreg = linear_model.LogisticRegression()
logreg.fit(X, y)
Showing the coefficients and intercept
The coefficients can be retrieved using the attribute coef_.
coef = logreg.coef_
# predictors is the list of predictor names, in the same order as the columns of X.
for p, c in zip(predictors, list(coef[0])):
    print(p + '\t' + str(c))
The intercept can be retrieved using the attribute intercept_.
intercept = logreg.intercept_
Using the logistic regression model
Making predictions
predictions = logreg.predict_proba(df)
The predictions consist of two values. The second value is the probability that the observation is a target.
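A minimal sketch of pulling out that second value, assuming a tiny made-up training set (the numbers below are invented, not course data). The columns of predict_proba follow the order of logreg.classes_, so with a 0/1 target the second column is the probability of being a target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set: columns are age, gender_F, time_since_last_gift.
X = np.array([[25, 0, 400], [34, 1, 300], [47, 0, 350], [52, 1, 90],
              [63, 1, 60], [71, 0, 45], [38, 0, 500], [66, 1, 30]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)

# One row per observation, one column per class.
predictions = logreg.predict_proba(X)

# Column 0 is P(target = 0), column 1 is P(target = 1).
predictions_target = predictions[:, 1]
print(predictions_target)
```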
Forward stepwise variable selection for logistic regression - Motivation for variable selection
Drawbacks of models with too many variables
- Overfit
- Hard to implement and / or maintain
- Hard to interpret, because multicollinearity between the predictors can make the coefficients difficult or impossible to interpret.
Model evaluation - The AUC ranges from 0 to 1 and measures how well the model can order the objects from low to high chance of being a target. A perfect model has an AUC of 1, while a random model has an AUC of 0.5.
In Python, the roc_auc_score function can be used to calculate the AUC of the model. It takes the true values of the target and the predictions as arguments.
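A small sketch of the call, using made-up target values and predicted probabilities. The AUC equals the share of (target, non-target) pairs in which the target received the higher prediction.

```python
from sklearn.metrics import roc_auc_score

# Made-up true target values and predicted probabilities of being a target.
y_true = [0, 0, 1, 0, 1, 1]
predictions_target = [0.10, 0.35, 0.40, 0.20, 0.80, 0.30]

# True values first, predictions second.
auc = roc_auc_score(y_true, predictions_target)
print(round(auc, 2))

# A model that orders every target above every non-target scores 1.
perfect_auc = roc_auc_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print(perfect_auc)
```

Here 8 of the 9 target/non-target pairs are ordered correctly, so the first AUC is 8/9.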
Adding more variables and therefore more complexity to your logistic regression model does not automatically result in more accurate models.
Forward stepwise variable selection for logistic regression - Forward stepwise variable selection
Implementation of the forward stepwise procedure
Selecting the next best variable
The forward stepwise variable selection method starts with an empty variable set and proceeds in steps, where in each step the next best variable is added.
The auc function calculates, for a given set of variables, the AUC of the model that uses this set as predictors. The next_best function determines which variable should be added to the variable list in the next step. The loop continues until the desired number of variables is reached.
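The procedure can be sketched as follows. The function signatures, the helper name forward_stepwise and the synthetic basetable are assumptions made for illustration; only the names auc and next_best come from the card above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def auc(variables, target, basetable):
    """AUC of a logistic regression that uses `variables` as predictors."""
    X = basetable[variables]
    y = basetable[target]
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X, y)
    predictions_target = logreg.predict_proba(X)[:, 1]
    return roc_auc_score(y, predictions_target)


def next_best(current_variables, candidate_variables, target, basetable):
    """Candidate that gives the highest AUC when added to the current set."""
    best_auc = -1.0
    best_variable = None
    for variable in candidate_variables:
        auc_variable = auc(current_variables + [variable], target, basetable)
        if auc_variable > best_auc:
            best_auc = auc_variable
            best_variable = variable
    return best_variable


def forward_stepwise(candidate_variables, target, basetable, max_variables):
    """Order in which variables are added, up to `max_variables` (hypothetical helper)."""
    current_variables = []
    remaining = list(candidate_variables)
    while remaining and len(current_variables) < max_variables:
        best = next_best(current_variables, remaining, target, basetable)
        current_variables.append(best)
        remaining.remove(best)
    return current_variables


# Synthetic basetable: `signal` drives the target, `noise` does not.
rng = np.random.default_rng(0)
n = 500
basetable = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
basetable["target"] = (basetable["signal"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

order = forward_stepwise(["noise", "signal"], "target", basetable, max_variables=2)
print(order)
```

This sketch computes the AUC on the same data the model was fitted on, which keeps it short but tends to be optimistic.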
Finding the order of variables
The order in which the variables are added is found by implementing the forward stepwise variable selection procedure as a function.
Forward stepwise variable selection for logistic regression - Deciding on the number of variables
Partitioning
In order to properly evaluate a model, one can partition the data into a train set and a test set. The train set contains the data the model is built on, and the test set is used to evaluate the model. This division is done randomly, but when the target incidence is low, it can be necessary to stratify, that is, to make sure that the train and test data contain an equal percentage of targets.
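A sketch of a stratified partition using train_test_split from scikit-learn. The basetable below is synthetic, and the 60/40 split is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic basetable with a low target incidence of 5%.
rng = np.random.default_rng(1)
basetable = pd.DataFrame({
    "age": rng.integers(18, 90, size=1000),
    "target": [1] * 50 + [0] * 950,
})

X = basetable.drop("target", axis=1)
y = basetable["target"]

# stratify=y keeps the target incidence the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# Target incidence in the train and test set.
print(y_train.mean(), y_test.mean())
```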
Forward stepwise variable selection for logistic regression - Code Summary
Calculating AUC
The AUC value assesses how well a model can order observations from low to high probability of being a target. It is calculated using the roc_auc_score function.
auc = roc_auc_score(y, predictions_target)
Selecting the next best variable
Explaining model performance to business - The cumulative gains curve
The cumulative gains curve is an evaluation curve that assesses the performance of your model. It shows the percentage of targets reached when considering a certain percentage of your population with the highest probability to be target according to your model.
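The points of such a curve can be computed by hand, as in the sketch below. The function name cumulative_gains and the data are made up for illustration; plotting libraries may offer a ready-made version.

```python
import numpy as np


def cumulative_gains(y_true, predictions_target):
    """Return (fraction of population selected, fraction of targets reached)."""
    y_true = np.asarray(y_true)
    # Sort observations from highest to lowest predicted probability.
    order = np.argsort(-np.asarray(predictions_target), kind="stable")
    targets_reached = np.cumsum(y_true[order])
    perc_population = np.arange(1, len(y_true) + 1) / len(y_true)
    perc_targets = targets_reached / y_true.sum()
    return perc_population, perc_targets


# Made-up targets and predicted probabilities for 10 observations.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
predictions_target = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]

perc_population, perc_targets = cumulative_gains(y_true, predictions_target)
for p, t in zip(perc_population, perc_targets):
    print(f"top {p:.0%} of population -> {t:.0%} of targets")
```

In this example the top 30% of the population contains 2 of the 3 targets.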
Explaining model performance to business - The lift curve
The lift curve is an evaluation curve that assesses the performance of your model. For a given percentage of the population with the highest predicted probability, it shows how many times more targets that group contains than the population average.
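A single point of the lift curve can be sketched as the target incidence in the selected group divided by the overall target incidence. The function name lift_at and the data are assumptions for illustration.

```python
import numpy as np


def lift_at(y_true, predictions_target, perc_selected):
    """Lift when selecting the top `perc_selected` fraction of the population."""
    y_true = np.asarray(y_true)
    # Sort observations from highest to lowest predicted probability.
    order = np.argsort(-np.asarray(predictions_target), kind="stable")
    n_selected = max(1, int(round(perc_selected * len(y_true))))
    incidence_selected = y_true[order][:n_selected].mean()
    incidence_overall = y_true.mean()
    return incidence_selected / incidence_overall


# Made-up targets and predicted probabilities for 10 observations.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
predictions_target = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]

lift_top_30 = lift_at(y_true, predictions_target, 0.3)
lift_all = lift_at(y_true, predictions_target, 1.0)
print(round(lift_top_30, 2), lift_all)
```

Selecting the whole population always gives a lift of 1, since the selected group is then the average.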
Explaining model performance to business - Guiding Better Business Decisions
The cumulative gains graph can be used to estimate how many donors one should address to make a certain profit. Indeed, the cumulative gains graph shows which percentage of all targets is reached when addressing a certain percentage of the population. If one knows the reward of a campaign, it follows easily how many donors should be targeted to reach a certain profit.
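A back-of-the-envelope sketch of that calculation. The function name campaign_profit, its parameters and all numbers are hypothetical; the percentage of targets reached would be read off the cumulative gains graph.

```python
def campaign_profit(perc_selected, perc_targets_reached, population_size,
                    target_incidence, reward_per_target, cost_per_contact):
    """Expected profit of addressing the top `perc_selected` of the population."""
    # Number of donors addressed in the campaign.
    n_contacted = perc_selected * population_size
    # Number of targets among them, according to the cumulative gains graph.
    n_targets_reached = perc_targets_reached * target_incidence * population_size
    return reward_per_target * n_targets_reached - cost_per_contact * n_contacted


# Hypothetical campaign: 100,000 donors, 5% are targets, and the gains graph
# says that the top 20% of the population contains 60% of all targets.
profit = campaign_profit(
    perc_selected=0.20,
    perc_targets_reached=0.60,
    population_size=100_000,
    target_incidence=0.05,
    reward_per_target=50,
    cost_per_contact=2,
)
print(round(profit))
```

Evaluating this for several percentages of the population shows where the profit reaches the desired level.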
Explaining model performance to business - Code Summary
Interpreting and explaining models - Predictor insight graphs
The predictor insight graph shows the link between the predictor variables and the target we want to predict.
Interpreting and explaining models - Discretization of continuous variables
The predictor insight graph table contains all information needed to create a predictor insight graph. The most important column in the predictor insight graph table is the target incidence column. This column shows the average target value for each group.
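A sketch of building such a table for a continuous variable, assuming pandas. The variable is first discretized with pd.qcut, then the group size and target incidence are computed per group. The function name create_pig_table, the column names Size and Incidence, and the data are assumptions for illustration.

```python
import pandas as pd


def create_pig_table(basetable, target, variable):
    """Group size and target incidence for each value of `variable`."""
    groups = basetable[[target, variable]].groupby(variable, observed=True)
    # The incidence is the average target value within each group.
    pig_table = groups[target].agg(Size="size", Incidence="mean").reset_index()
    return pig_table


# Made-up basetable with a continuous predictor.
basetable = pd.DataFrame({
    "age": [23, 31, 38, 44, 52, 57, 63, 68, 74, 81, 29, 49],
    "target": [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1],
})

# Discretize age into 3 groups of equal size before grouping.
basetable["age_disc"] = pd.qcut(basetable["age"], 3)

pig_table = create_pig_table(basetable, "target", "age_disc")
print(pig_table)
```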
Interpreting and explaining models - Plotting the predictor insight graph
The most important elements of the predictor insight graph are the incidence values. For each group in the population with respect to a given variable, the incidence value reflects the percentage of targets in that group.