Week 4 - RFM (Logistic Regression) Flashcards
Examples of Large Databases
- Online transactions (e-commerce)
  - Amazon: 300 million customer accounts
- Web browsing / click-stream data
- Purchases at department / grocery / convenience stores
  - Albert Heijn: 16 million transactions per week
- Subscription data
  - Netflix: 200 million-plus subscribers
Why Data Mining?
- Lots of data being collected
- Computers and technology cheaper and more powerful now
- Gain a competitive edge
- Discover “hidden” info in the data
Data in the real world is dirty. Why?
Missing values
Errors
Discrepancies
Major Tasks in Data Preprocessing
Data cleaning: dealing with missing values, inconsistencies
Data integration: integration of multiple databases
Data transformation (e.g., date and time formats)
Data reduction: reduced volume, same result
What is data mining for?
- Pattern Discovery
  - finding new, useful patterns in datasets
- Relationship Analysis
  - uncovering unexpected relationships and summarizing them
Examples of Database Marketing Applications
Predicting customer response
* Likelihood of future purchase
* Likelihood of churn
* Marketing effectiveness
Market Basket Analysis
Click-stream Analytics
what does RFM stand for?
Recency = Time passed since last purchase
Frequency = Frequency of purchase in a given period
Monetary value = Amount spent on average in a given period
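As a minimal sketch of the three definitions above (the field names, dates, and amounts are made up for illustration, not from the course material), the scores can be computed from a transaction log like this:

```python
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date, amount)
transactions = [
    ("c1", date(2024, 1, 5), 20.0),
    ("c1", date(2024, 3, 1), 35.0),
    ("c2", date(2024, 2, 10), 50.0),
]
today = date(2024, 3, 15)  # assumed analysis date

def rfm(customer_id):
    rows = [t for t in transactions if t[0] == customer_id]
    recency = (today - max(r[1] for r in rows)).days   # days since last purchase
    frequency = len(rows)                              # purchases in the period
    monetary = sum(r[2] for r in rows) / len(rows)     # average amount spent
    return recency, frequency, monetary

# rfm("c1") -> (14, 2, 27.5)
```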
RFM + limitations (3)
a segmentation technique
- accurate
- easy to use
- can be computed for any database

limitations:
- does not take other factors into account
- predicts the next period only
- past behaviour may be due to PAST marketing activities
What is Logistic Regression + Types of Logistic Regression (2)
Predicts a categorical (non-metric) outcome with two or more categories, e.g. purchased = yes (1) / no (0)
if two categories = Binary Logistic Regression
if more than two = Multinomial Logistic Regression (not covered in this course!)
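The binary model turns a linear combination of the predictors into a 0-1 probability via the logistic function. A sketch with made-up coefficients (b0, bR, bF, bM are illustrative, not estimated from any data):

```python
import math

# Illustrative coefficients (in practice these are estimated from data)
b0, bR, bF, bM = -1.0, -0.02, 0.30, 0.01

def purchase_probability(recency, frequency, monetary):
    # Linear predictor, then logistic transform to a probability in (0, 1)
    z = b0 + bR * recency + bF * frequency + bM * monetary
    return 1 / (1 + math.exp(-z))

p = purchase_probability(recency=10, frequency=5, monetary=40.0)
```

Note the signs: here recency carries a negative coefficient (a longer time since the last purchase lowers the predicted probability), while frequency and monetary value raise it.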
Objectives of Logistic Regression
Identify
- finds which factors (RFM) influence the likelihood of an event. (purchasing)
Predict
- if a customer will buy based on their RFM scores
Logistic Regression Assumptions
No specific distribution of the predictors required
No equal variance (homoscedasticity) needed
Multicollinearity still matters (highly correlated IVs distort the estimates)
Omnibus Test
Is our model a better fit than Block 0? (baseline with no IVs)
sig. result = it's better to use this model than the Block 0 model with no IVs
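The omnibus test is a likelihood-ratio chi-square comparing the fitted model to Block 0. A sketch with made-up log-likelihood values (not real output):

```python
# Likelihood-ratio (omnibus) chi-square: -2 * (LL0 - LLm)
# Illustrative log-likelihoods, not from real output
ll_block0 = -120.0   # Block 0: model with no IVs
ll_model = -100.0    # model with the RFM predictors added
chi_square = -2 * (ll_block0 - ll_model)
# chi_square = 40.0, compared against a chi-square distribution
# with df = number of IVs added (here 3) to get the p-value
```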
Cox and Snell R square / Nagelkerke R square
similar to R square in linear regression
- usefulness of the model
- between (Cox & Snell no.) and (Nagelkerke no.) of the variability in the DV is explained by this set of IVs
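Both pseudo-R-square measures can be written in terms of the log-likelihoods of the null model and the fitted model. A sketch with illustrative values (the log-likelihoods and sample size are made up):

```python
import math

def pseudo_r2(ll_null, ll_model, n):
    # Cox & Snell: 1 - (L_null / L_model)^(2/n), written with log-likelihoods
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    # Nagelkerke rescales Cox & Snell so its maximum possible value is 1
    nagelkerke = cox_snell / (1 - math.exp(2 * ll_null / n))
    return cox_snell, nagelkerke

# Illustrative values, not from real output
cs, nk = pseudo_r2(ll_null=-120.0, ll_model=-100.0, n=200)
```

Nagelkerke is always at least as large as Cox & Snell, which is why the output is read as a range ("between ... and ... of the variability is explained").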
Hosmer and Lemeshow Test
How well the predicted values match the actual observed values of the DV
A non-significant p-value (greater than 0.05) is what you want.
- It means there is no significant difference between the predicted and actual values, indicating the model is a good fit
Classification Accuracy of the model (Predicted vs Observed table)
How well the model predicts whether a purchase is made or not.
(how accurate are the predictions)
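Accuracy comes straight from the predicted-vs-observed (classification) table: correct predictions on the diagonal, divided by the total. A sketch with made-up cell counts:

```python
# Hypothetical classification table: keys are (observed, predicted)
table = {(0, 0): 70, (0, 1): 10,   # observed no-purchase
         (1, 0): 15, (1, 1): 55}   # observed purchase

correct = table[(0, 0)] + table[(1, 1)]   # diagonal cells: 70 + 55 = 125
total = sum(table.values())               # all cases: 150
accuracy = correct / total                # 125 / 150, about 83.3%
```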
Exponentiated coefficients Exp(B)
shows the magnitude and direction of the effect of each IV on DV
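Exp(B) is just e raised to the raw coefficient B, and reads as an odds ratio: values above 1 increase the odds of the event, values below 1 decrease them. A sketch with an illustrative coefficient (not real output):

```python
import math

b_frequency = 0.30                  # illustrative coefficient for Frequency
odds_ratio = math.exp(b_frequency)  # about 1.35
# Each additional purchase multiplies the odds of buying by about 1.35,
# i.e. a roughly 35% increase in the odds per unit of the IV
```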
Wald
similar to the t-test in linear regression: computed as (B/SE)², it tests whether each individual coefficient differs significantly from zero