SUL Topic 3 - Logistic Regression Flashcards
Logistic Regression
A statistical method used to predict binary outcomes by analyzing the outcome’s relationship with one or more predictor variables.
1) Obtaining historical data
The first step in building a logistic regression model is to obtain historical data with a labeled column for the outcome you want to predict.
2) Partitioning data
Divide the data into a training set and a testing set, with the training set being larger, to build and test the model respectively.
3) Selecting relevant variables
Select relevant variables from the training data based on logic, domain expertise, or theory to build the model.
4) Building the model
The logistic regression algorithm creates a model, which is an equation, based on the selected variables and outcome.
5) Testing model accuracy
Test the accuracy of the model using the test set and a confusion matrix.
Training and test set division
Divide the data randomly into a training set (2/3) and a test set (1/3), unless it is a time series problem.
Cross-validation
Divide the data into k subsamples, use k-1 for training and the remaining one for testing, repeat k times to ensure all examples are used for both training and testing.
Data balancing
Balance the training data by deleting records from the most frequent category or duplicating records in the less frequent category to avoid poor models.
1st Consideration:
Overfitting
When a model is excessively complex and includes noise instead of underlying relationships, which can be avoided through cross-validation.
2nd Consideration:
Variable selection challenges
Deciding on the input variables to use can be challenging and is often based on logic, domain expertise, or theory.
3rd Consideration:
Minimum data requirements
A minimum of 50 cases per predictor is recommended, and larger datasets with balanced categories tend to produce better results.
Model performance evaluation
- Evaluate model performance using
ROC curve - Misclassification rate
- Accuracy rate
- Confusion matrix.
Improving model performance
- Adjust model parameters
- Try different modeling techniques
- Improve data quality
- Experiment with multiple models.
Selecting target and input variables (Summary #1)
Select a binary target variable to predict and relevant input variables with few missing values and low correlation.