Introduction to Predictive Analytics in Python Flashcards

1
Q

Introduction to Predictive Analytics in Python - Introduction and base table structure

A

Predictive analytics is the process of predicting an event using historical data gathered in an analytical basetable. The basetable has three components: the population (the objects for which a prediction is made), the candidate predictors (information that describes those objects and can be used for prediction), and the target (the event to predict).

2
Q

Introduction to Predictive Analytics in Python - Logistic regression

A

Logistic regression is a statistical model that estimates the probability of an event occurring, given previously observed data. It works with a binary target, where the event either happens (1) or does not happen (0).

3
Q

Introduction to Predictive Analytics in Python - Using the logistic regression model

A

Once your model is ready, you can use it to make predictions for a campaign. It is important to always use the latest information to make predictions.

The predictions that result from the predictive model reflect how likely it is that someone is a target.

Making predictions in Python
Female (gender_F=1)
Age 72
120 days since last gift
logreg.predict_proba([[1, 72, 120]])
array([[ 0.8204144, 0.1795856]])

Making predictions in Python
new_data = current_data[["gender_F", "age", "time_since_last_gift"]]
predictions = logreg.predict_proba(new_data)
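The two snippets above can be tied together in a minimal, self-contained sketch. The basetable values below are made up purely for illustration, and the predictor order used for fitting is assumed to be the same as the order used for predicting. Note that predict_proba expects a two-dimensional input: one row per observation.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up basetable, for illustration only (not real donor data).
basetable = pd.DataFrame({
    "gender_F": [1, 0, 1, 0, 1, 0, 1, 0],
    "age": [72, 35, 60, 48, 29, 80, 55, 41],
    "time_since_last_gift": [120, 900, 200, 700, 1000, 100, 300, 800],
    "target": [1, 0, 1, 0, 0, 1, 1, 0],
})
predictors = ["gender_F", "age", "time_since_last_gift"]

# Fit the model on the historical data.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(basetable[predictors], basetable["target"])

# One new observation: female, age 72, 120 days since last gift.
# The columns must be in the same order as during fitting.
new_data = pd.DataFrame([[1, 72, 120]], columns=predictors)
predictions = logreg.predict_proba(new_data)

# One row per observation, one column per class (non-target, target).
print(predictions.shape)
# The second column holds the probability of being a target.
prob_target = predictions[:, 1]
print(prob_target)
```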

4
Q

Introduction to Predictive Analytics in Python Code Summary

A
Exploring the predictive variables 
# Count and print the number of females.
print(sum(basetable['gender'] == 'F'))

Logistic regression
Building a logistic regression model

• Import the module linear_model from sklearn.
• Construct a dataframe X that contains the predictors (e.g. age, gender_F and time_since_last_gift).
• Construct a dataframe y that contains the target.
• Create a logistic regression model.
• Fit the logistic regression model on the given basetable.

Code:
# Import linear_model from sklearn.
from sklearn import linear_model
# Create a dataframe X that only contains the candidate predictors age, gender_F and time_since_last_gift.
X = basetable[['age', 'gender_F', 'time_since_last_gift']]
# Create a dataframe y that contains the target.
y = basetable[['target']]
# Create a logistic regression model logreg and fit it to the data.
# y is a one-column dataframe, so flatten it to the 1-D shape that fit expects.
logreg = linear_model.LogisticRegression()
logreg.fit(X, y.values.ravel())

Showing the coefficients and intercept
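Retrieve the coefficients using the attribute coef_.
# Predictor names, in the same order as the columns of X.
predictors = ['age', 'gender_F', 'time_since_last_gift']
coef = logreg.coef_
for p, c in zip(predictors, list(coef[0])):
    print(p + '\t' + str(c))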

The intercept can be retrieved using the attribute intercept_.
intercept = logreg.intercept_

Using the logistic regression model
Making predictions
predictions = logreg.predict_proba(df)

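For each observation, the predictions consist of two values: the probability that the observation is not a target and the probability that it is a target. The second value is the probability that the observation is a target.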
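To make the link between the coefficients, the intercept and the predicted probabilities concrete, here is a minimal sketch on synthetic data (the data and variable names are assumptions for illustration). For a binary target, the second column of predict_proba should equal the logistic function applied to the intercept plus the weighted sum of the predictors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only: 3 predictors, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 2] + 0.5 * rng.normal(size=200) > 0).astype(int)

logreg = LogisticRegression()
logreg.fit(X, y)

# Linear part: intercept + coef_1 * x_1 + coef_2 * x_2 + coef_3 * x_3.
z = logreg.intercept_ + X @ logreg.coef_[0]
# Logistic function turns the linear score into a probability.
manual_prob_target = 1.0 / (1.0 + np.exp(-z))

model_prob_target = logreg.predict_proba(X)[:, 1]
same = np.allclose(manual_prob_target, model_prob_target)
print(same)
```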

5
Q

Forward stepwise variable selection for logistic regression - Motivation for variable selection

A

Drawbacks of models with too many variables

  1. Overfit
  2. Hard to implement and / or maintain
  3. Hard to interpret: multicollinearity among the predictors can make interpretation difficult or impossible.

Model evaluation: the AUC ranges from 0 to 1 and measures how well the model can order the objects from low to high chance of being a target. A perfect model has an AUC of 1, while a random model has an AUC of 0.5.

In Python, the roc_auc_score function can be used to calculate the AUC of the model. It takes the true values of the target and the predictions as arguments.
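A minimal sketch of calling roc_auc_score, using four made-up observations (the values are illustrative assumptions, not course data). Three of the four (target, non-target) pairs are ordered correctly by the predictions, which gives an AUC of 0.75.

```python
from sklearn.metrics import roc_auc_score

# True target values (1 = target, 0 = non-target); illustrative only.
y_true = [0, 0, 1, 1]
# Predicted probabilities of being a target; illustrative only.
predictions_target = [0.1, 0.4, 0.35, 0.8]

# First argument: true values. Second argument: predicted scores.
auc = roc_auc_score(y_true, predictions_target)
print(auc)  # 0.75
```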

Adding more variables and therefore more complexity to your logistic regression model does not automatically result in more accurate models.

6
Q

Forward stepwise variable selection for logistic regression - Forward stepwise variable selection

A

Implementation of the forward stepwise procedure
Selecting the next best variable
The forward stepwise variable selection method starts with an empty variable set and proceeds in steps, where in each step the next best variable is added.

The auc function calculates, for a given set of variables, the AUC of the model that uses those variables as predictors. The next_best function determines which variable should be added to the variable list in the next step. The loop continues until the desired number of variables is reached.

Finding the order of variables
The order in which variables enter the model is found by implementing the forward stepwise variable selection procedure as a function.
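The procedure described above could be sketched as follows. This is one possible implementation, not necessarily the course's own: the function signatures, the forward_stepwise helper and the synthetic basetable are all assumptions made for illustration, and the AUC is computed on the same data the model is fitted on to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def auc(variables, target, basetable):
    """AUC of a logistic regression that uses `variables` as predictors."""
    X = basetable[variables]
    y = basetable[target]
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X, y)
    predictions = logreg.predict_proba(X)[:, 1]
    return roc_auc_score(y, predictions)


def next_best(current_variables, candidate_variables, target, basetable):
    """Candidate whose addition to `current_variables` gives the highest AUC."""
    best_auc = -1.0
    best_variable = None
    for v in candidate_variables:
        auc_v = auc(current_variables + [v], target, basetable)
        if auc_v > best_auc:
            best_auc = auc_v
            best_variable = v
    return best_variable


def forward_stepwise(candidate_variables, target, basetable, max_variables):
    """Start from an empty set and add the next best variable at each step."""
    current = []
    candidates = list(candidate_variables)
    for _ in range(min(max_variables, len(candidates))):
        best = next_best(current, candidates, target, basetable)
        current.append(best)
        candidates.remove(best)
    return current


# Synthetic basetable (illustrative): only "signal" is related to the target.
rng = np.random.default_rng(0)
n = 500
basetable = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
basetable["target"] = (
    basetable["signal"] + 0.5 * rng.normal(size=n) > 0
).astype(int)

order = forward_stepwise(["noise_a", "signal", "noise_b"], "target", basetable, 2)
print(order)  # "signal" is expected to be selected first
```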

7
Q

Forward stepwise variable selection for logistic regression - Deciding on the number of variables

A

Partitioning
To evaluate a model properly, one can partition the data into a train set and a test set. The train set contains the data the model is built on, and the test set is used to evaluate the model. The division is done randomly, but when the target incidence is low it may be necessary to stratify, that is, to make sure that the train and test sets contain the same percentage of targets.
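A minimal sketch of a stratified partition using scikit-learn's train_test_split. The basetable below is synthetic with a 5% target incidence, and the 60/40 split is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic basetable with a low target incidence (50 targets out of 1000).
rng = np.random.default_rng(1)
n = 1000
basetable = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "target": [1] * 50 + [0] * 950,
})
X = basetable[["age"]]
y = basetable["target"]

# stratify=y keeps the target incidence the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())  # both close to the overall 5% incidence
```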

8
Q

Forward stepwise variable selection for logistic regression - Code Summary

A

Calculating AUC
The AUC value assesses how well a model can order observations from low to high probability of being a target. It can be calculated with the roc_auc_score function.
auc = roc_auc_score(y, predictions_target)

Selecting the next best variable

9
Q

Explaining model performance to business - The cumulative gains curve

A

The cumulative gains curve is an evaluation curve that assesses the performance of your model. It shows the percentage of all targets that is reached when considering a certain percentage of the population, namely the part with the highest probability of being a target according to your model.
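The points of a cumulative gains curve can be computed by hand, as in the sketch below. The cumulative_gains helper and the ten observations are hypothetical and only meant to show the calculation; plotting libraries offer ready-made versions.

```python
import numpy as np


def cumulative_gains(y_true, predictions_target):
    """Return (share of population selected, share of all targets reached)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(predictions_target, dtype=float)
    # Sort observations from highest to lowest predicted probability.
    order = np.argsort(-scores, kind="stable")
    sorted_targets = y_true[order]
    perc_population = np.arange(1, len(y_true) + 1) / len(y_true)
    perc_targets = np.cumsum(sorted_targets) / sorted_targets.sum()
    return perc_population, perc_targets


# Illustrative data: 10 observations, 3 of which are targets.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
predictions_target = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

pop, gains = cumulative_gains(y_true, predictions_target)
for p, g in zip(pop, gains):
    print(f"top {p:.0%} of population -> {g:.0%} of targets")
```

In this made-up example, the top 30% of the population contains two of the three targets, and all targets are reached after selecting the top 60%.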

10
Q

Explaining model performance to business - The lift curve

A

The lift curve is an evaluation curve that assesses the performance of your model. It shows how many times more than average the model reaches targets.
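As a sketch of the calculation: the lift at a given selection size is the target incidence among the selected observations divided by the target incidence in the whole population. The lift_curve helper and the data below are hypothetical, for illustration only.

```python
import numpy as np


def lift_curve(y_true, predictions_target):
    """Return (share of population selected, lift at that selection size)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(predictions_target, dtype=float)
    # Sort observations from highest to lowest predicted probability.
    order = np.argsort(-scores, kind="stable")
    sorted_targets = y_true[order]
    n_selected = np.arange(1, len(y_true) + 1)
    # Target incidence within the top-k selected observations.
    incidence_selected = np.cumsum(sorted_targets) / n_selected
    # Lift: incidence in the selection relative to the overall incidence.
    lift = incidence_selected / y_true.mean()
    return n_selected / len(y_true), lift


# Illustrative data: 10 observations, 3 targets (overall incidence 30%).
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
predictions_target = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

pop, lift = lift_curve(y_true, predictions_target)
for p, l in zip(pop, lift):
    print(f"top {p:.0%}: lift {l:.2f}")
```

When the whole population is selected, the lift is 1 by construction, because the selection then has exactly the average incidence.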

11
Q

Explaining model performance to business - Guiding Better Business Decisions

A

The cumulative gains graph can be used to estimate how many donors one should address to make a certain profit. Indeed, the cumulative gains graph shows which percentage of all targets is reached when addressing a certain percentage of the population. If one knows the reward of a campaign, it follows easily how many donors should be targeted to reach a certain profit.
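A worked example of that reasoning, as a hedged sketch. The profit function, its parameter names, the cost per addressed donor and all numbers are assumptions introduced for illustration; only the idea of combining the cumulative gains with a reward comes from the text above.

```python
def profit(perc_targets, perc_selected, population_size,
           population_incidence, reward_target, cost_per_contact):
    """Expected campaign profit (hypothetical helper).

    perc_targets: share of all targets reached (read from the gains curve)
    perc_selected: share of the population that is addressed
    population_incidence: share of targets in the whole population
    """
    n_addressed = perc_selected * population_size
    n_targets_total = population_incidence * population_size
    n_targets_reached = perc_targets * n_targets_total
    reward = reward_target * n_targets_reached
    cost = cost_per_contact * n_addressed
    return reward - cost


# Assumed numbers: 100,000 donors, 5% are targets, a reached target yields 50,
# addressing a donor costs 2. Suppose the gains curve says the top 20% of
# the population contains 60% of all targets.
expected = profit(
    perc_targets=0.60,
    perc_selected=0.20,
    population_size=100_000,
    population_incidence=0.05,
    reward_target=50,
    cost_per_contact=2,
)
print(round(expected))  # 3,000 targets * 50 - 20,000 contacts * 2 = 110000
```

Evaluating the same function at several points of the cumulative gains curve gives the profit for each selection size, from which the number of donors needed for a given profit can be read off.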

12
Q

Explaining model performance to business - Code Summary

A


13
Q

Interpreting and explaining models - Predictor insight graphs

A

The predictor insight graph shows the link between the predictor variables and the target we want to predict.

14
Q

Interpreting and explaining models - Discretization of continuous variables

A

The predictor insight graph table contains all information needed to create a predictor insight graph. The most important column in the predictor insight graph table is the target incidence column. This column shows the average target value for each group.
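A minimal sketch of building such a table with pandas. The basetable, the column names Size and Incidence, and the choice of two equal-size age groups are assumptions for illustration. The continuous variable age is first discretized with pd.qcut, then the target is averaged per group.

```python
import pandas as pd

# Made-up basetable with a continuous predictor and a binary target.
basetable = pd.DataFrame({
    "age": [25, 32, 41, 47, 53, 58, 64, 69, 75, 81],
    "target": [0, 0, 0, 1, 0, 1, 0, 1, 1, 1],
})

# Discretize the continuous variable into 2 groups of equal size.
basetable["age_group"] = pd.qcut(
    basetable["age"], q=2, labels=["younger", "older"]
)

# One row per group: number of observations and average target value.
pig_table = (
    basetable.groupby("age_group", observed=True)["target"]
    .agg(Size="count", Incidence="mean")
    .reset_index()
)
print(pig_table)
```

Here the younger group has an incidence of 0.2 (one target out of five) and the older group an incidence of 0.8 (four out of five).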

15
Q

Interpreting and explaining models - Plotting the predictor insight graph

A

The most important elements of the predictor insight graph are the incidence values. For each group in the population with respect to a given variable, the incidence value reflects the percentage of targets in that group.
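A minimal plotting sketch with matplotlib, assuming a predictor insight graph table with a group column and an Incidence column. The plot_incidence helper, the column names and the values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_incidence(pig_table, variable):
    """Plot the target incidence per group of `variable` and return the axes."""
    fig, ax = plt.subplots()
    # Group labels on the x-axis, incidence values on the y-axis.
    ax.plot(pig_table[variable].astype(str), pig_table["Incidence"], marker="o")
    ax.set_xlabel(variable)
    ax.set_ylabel("Incidence")
    ax.set_ylim(bottom=0)
    return ax


# Illustrative predictor insight graph table.
pig_table = pd.DataFrame({
    "age_group": ["younger", "older"],
    "Size": [5, 5],
    "Incidence": [0.2, 0.8],
})
ax = plot_incidence(pig_table, "age_group")
```

Calling plt.show() afterwards displays the figure in an interactive session.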
