Customer Analytics & Credit Risk Modelling Flashcards
How would you approach a churn prediction problem?
There are various ways to approach it.
e.g. Logistic Regression
Predicts a probability of a binary outcome using a logistic function.
Assumes:
- Linear relationship between independent variables and the log-odds of the dependent variable
- Observations are independent of each other
- No multi-collinearity of the predictors
- Works better on balanced datasets
Steps:
1. Clean the data
2. Feature engineering, e.g. Total Spend / Tenure = Avg Monthly Spend
encoding categorical variables, etc.
3. Define the target variable and the features
4. Split the dataset into training and testing sets
5. Fit the model on the training data
6. Evaluate the performance on the testing dataset
Accuracy
Precision
Recall
F1 score
+ The coefficients are expressed in log-odds, but can be converted into odds ratios (or a % change in the odds) by exponentiating them
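A minimal sketch of these steps with scikit-learn, assuming a hypothetical churn dataset (file name, total_spend, tenure, contract_type and churn columns are all illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical churn dataset with total_spend, tenure, contract_type and a binary churn column
df = pd.read_csv("churn.csv")

# 2. Feature engineering: avg monthly spend (tenure clipped to avoid division by zero),
#    plus one-hot encoding of the categorical variable
df["avg_monthly_spend"] = df["total_spend"] / df["tenure"].clip(lower=1)
df = pd.get_dummies(df, columns=["contract_type"], drop_first=True)

# 3.-4. Define target and features, split into training and testing sets
X = df.drop(columns=["churn"])
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 5.-6. Fit the model and evaluate accuracy, precision, recall and F1 on the test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# Coefficients are in log-odds; exponentiating gives odds ratios per feature
print(pd.Series(np.exp(model.coef_[0]), index=X.columns))
```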
How would you approach customer segmentation?
We can use **k-means clustering**.
Packages: matplotlib for visuals, scipy.stats for the Shapiro-Wilk test, sklearn.cluster and sklearn.preprocessing
Unsupervised machine learning algorithm which assigns each data point to the nearest cluster, which results in groups of similar data points.
Objective: minimize within-cluster variance, i.e. reduce the distance between points in the same cluster and their centroid.
Steps:
1. Data cleaning
2. Data normalization (z-score normalization)
used when the values are normally distributed; use the Shapiro-Wilk test to check
3. Use the elbow method to find the optimal number of clusters
4. Run k-means clustering with the chosen n_clusters
5. Denormalize the data
6. Analyze the centroids (mean values of features for each cluster)
e.g. Age:
Cluster 1: -0.86 (0.86 years younger than average)
Cluster 2: +1.5 (1.5 years older than average)
Cluster 3: …
Cluster 4: …
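A minimal sketch of this pipeline with scikit-learn and scipy, assuming a hypothetical customer dataset whose numeric columns (age, income, spend_score) and file name are illustrative:

```python
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer dataset with numeric columns age, income, spend_score
df = pd.read_csv("customers.csv")
features = ["age", "income", "spend_score"]

# 2. Check approximate normality before z-score normalization (Shapiro-Wilk test)
for col in features:
    stat, p = stats.shapiro(df[col])
    print(col, "Shapiro p-value:", p)

scaler = StandardScaler()              # z-score normalization
X = scaler.fit_transform(df[features])

# 3.-4. Run k-means with the number of clusters chosen via the elbow method (4 is illustrative)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X)

# 5.-6. Denormalize the centroids and analyze mean feature values per cluster
centroids = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=features)
print(centroids)
```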
Choosing the number of clusters:
The Elbow Method finds a sweet spot where adding more clusters stops making a meaningful difference.
For each candidate value of K, it calculates the within-cluster sum of squares (WCSS), also known as inertia, which measures how tightly the data points are clustered within their clusters.
It then plots the values of K on the x-axis and the WCSS on the y-axis.
The “elbow” is the point on the graph where the WCSS begins to level off.
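A short sketch of the elbow method, continuing from the normalized feature matrix X in the sketch above (the K range is arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Compute WCSS (inertia) for a range of K values on the normalized feature matrix X from above
wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

# Plot K vs. WCSS and look for the "elbow" where the curve starts to level off
plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()
```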
How would you approach a Credit Risk Modelling project?
Steps:
1. Data Cleaning & preparation
2. Feature Selection:
using a correlation matrix, visualized with a heatmap
shows which features are correlated with the dependent variable and which features exhibit multicollinearity
3. Use Random Forest feature importances for feature selection and visualize them in a bar chart
4. Select the features
5. Either use logistic regression or Gradient Boosting
Logistic regression:
6. In the case of logistic regression, handle the class imbalance with SMOTE or another technique (SMOTE = Synthetic Minority Over-sampling Technique)
7. Split the data into training and testing sets
8. Apply SMOTE to the training data only and fit the model on the resampled data
9. Test on the test data
10. Evaluate the performance & retrieve the coefficients
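A minimal sketch of the logistic regression path, assuming the imbalanced-learn package and a hypothetical credit dataset with a binary default column (SMOTE is applied to the training split only, so no synthetic samples leak into the test set):

```python
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE  # assumes the imbalanced-learn package is installed
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical credit dataset with a binary "default" target and numeric features
df = pd.read_csv("credit.csv")
X = df.drop(columns=["default"])
y = df["default"]

# Split first, then oversample only the training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Fit on the SMOTE-resampled data, test on the untouched test data
model = LogisticRegression(max_iter=1000)
model.fit(X_res, y_res)
print(classification_report(y_test, model.predict(X_test)))

# Retrieve the coefficients (log-odds) and convert to odds ratios
print(pd.Series(np.exp(model.coef_[0]), index=X.columns))
```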
Gradient Boosting:
Works by building a model in a sequential manner, where each new model tries to correct the errors of the previous model.
It starts with a weak model, analyzes where it went wrong, and builds a new model in each iteration.
Gradient descent to minimize loss: in each iteration the algorithm computes the gradient of the loss function (which measures the model's error), and this guides the next model in correcting the largest errors. By following the gradient, the model is able to progressively minimize the loss function and improve prediction accuracy.
Steps:
11. Select the features
12. Split the dataset
13. Fit the model (GradientBoostingClassifier(n_estimators, learning_rate, max_depth, random_state))
14. Make predictions on the test data and evaluate the performance
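A short sketch of the gradient boosting path with scikit-learn, reusing the hypothetical X and y from the credit dataset above (the hyperparameter values are illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# 12. Split the dataset (X, y come from the hypothetical credit data above)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 13. Fit the model: each new tree is fitted to the gradient of the loss,
#     correcting the errors of the previous trees
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=42)
gb.fit(X_train, y_train)

# 14. Predict on the test data and evaluate
print(classification_report(y_test, gb.predict(X_test)))
```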
How would you approach a survival analysis/CLV Prediction?
Steps:
1. Data cleaning and preparation
2. Handling outliers (the z-score is a standard score that tells you how many standard deviations a data point is from the mean; an absolute z-score above 3 is commonly treated as an outlier)
3. Cox Regression for predicting customer survival probability at each point in time. It predicts how long it will take for a specific event (e.g. churn) to happen, based on the features
4. Estimate the customer lifetime based on survival probabilities (sum of survival probs)
5. Estimate CLV = (Total Spend / Tenure) × Expected Lifetime
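A minimal sketch of these steps, assuming the lifelines library and hypothetical columns tenure, churned, age and monthly_spend; summing the survival probabilities approximates the expected lifetime on the model's time grid:

```python
import pandas as pd
from lifelines import CoxPHFitter  # assumes the lifelines package is installed

# Hypothetical dataset: tenure (months observed), churned (1 = churn event observed),
# plus covariates such as age and monthly_spend (avg monthly spend = total spend / tenure)
df = pd.read_csv("customers.csv")

# 3. Fit a Cox proportional hazards model on duration, event and covariates
cph = CoxPHFitter()
cph.fit(df[["tenure", "churned", "age", "monthly_spend"]],
        duration_col="tenure", event_col="churned")

# Survival probability for each customer at each point on the model's time grid
surv = cph.predict_survival_function(df[["age", "monthly_spend"]])  # rows = times, cols = customers

# 4. Expected lifetime ~ sum of survival probabilities (approximate, assumes a roughly
#    unit-spaced time grid in months)
expected_lifetime = surv.sum(axis=0)

# 5. CLV = avg monthly spend * expected lifetime
df["clv"] = df["monthly_spend"] * expected_lifetime.values
```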
How would you approach Lookalike marketing for customer acquisition?
Random Forest builds upon decision trees by creating an ensemble of multiple trees and combining their predictions. A single tree might overfit but averaging the predictions of many trees is more reliable.
Each decision tree is trained on a random subset of the data and randomly selects features to split on
Lookalike marketing:
Trying to find customers that are similar to our high-value customers.
Steps:
1. Data Cleaning
2. Split the data into training and testing sets
3. Perform 5-fold cross-validation on the training set:
splits the data into multiple subsets (folds), training the model on some of them and testing on the remaining ones
helps get a more reliable estimate of model performance on unseen data
4. Fit the model on the training set
5. Evaluate the model’s performance on the test set
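A minimal sketch of this workflow with scikit-learn, assuming a hypothetical high_value label that flags existing high-value customers:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

# Hypothetical dataset where high_value = 1 flags existing high-value customers
df = pd.read_csv("customers.csv")
X = df.drop(columns=["high_value"])
y = df["high_value"]

# 2. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 3. 5-fold cross-validation on the training set for a more reliable performance estimate
rf = RandomForestClassifier(n_estimators=300, random_state=42)
cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="roc_auc")
print("Mean CV ROC-AUC:", cv_scores.mean())

# 4.-5. Fit on the training set and evaluate on the held-out test set
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))

# Lookalike scores: predicted probability of resembling the high-value class
lookalike_scores = rf.predict_proba(X_test)[:, 1]
```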
Improving model performance:
* Tuning hyperparameters (n_estimators, max_features)
* Feature engineering e.g. combining some features, keeping only relevant features
* Handle missing values properly
* Handling Class imbalance (e.g. SMOTE)
* Cross-Validation to ensure it generalizes well to unseen data
How would you approach A/B Testing?
e.g. A/B testing for conversion
Steps:
1. Calculating conversion by group
2. Perform a chi-square test (chi-square to compare proportions between groups)
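A minimal sketch of both steps with scipy, assuming a hypothetical dataset with a group column ("A"/"B") and a binary converted column:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical A/B test data: group ("A"/"B") and converted (0/1)
df = pd.read_csv("ab_test.csv")

# 1. Conversion rate by group
print(df.groupby("group")["converted"].mean())

# 2. Chi-square test on the group x converted contingency table
contingency = pd.crosstab(df["group"], df["converted"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print("p-value:", p_value)  # p < 0.05 suggests the difference is unlikely to be chance alone
```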
Types of tests:
**T-Test:** comparing the means of two groups
**Z-Test:** similar, but used when the sample size is large (n > 30) or when the population variance is known
Chi-Square Test: comparing categorical variables (e.g. whether people from cities A, B, and C have the same preferences for a product) or how a proportion differs across groups
ANOVA (Analysis of Variance): comparing means for 3+ groups
F-Test: Compare variances between 2 groups
Statistical Significance:
A p-value < 0.05 means that the difference observed in the data is unlikely to be due to random chance alone. However, causality cannot be established (unless the experiment is properly randomized and controlled for potential confounding variables, etc.)
- Evaluate the **Effect Size** (practical significance) using Cohen's d
Assessing whether the difference is big enough to make a meaningful business impact
~0.2 Small
~0.5 Medium
~0.8+ Large
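A small sketch of Cohen's d for a continuous metric compared between two variants (e.g. revenue per user in A and B); the helper function and the sample values are illustrative, not library calls:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: standardized difference between two group means (illustrative helper)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Pooled standard deviation based on the two sample variances
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical revenue-per-user samples for variants A and B
print(cohens_d([12.1, 9.8, 14.3, 11.0], [10.2, 8.9, 9.5, 10.8]))  # ~0.2 small, ~0.5 medium, ~0.8+ large
```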