SUL Topic 3 - Logistic Regression Flashcards

1
Q

Logistic Regression

A

A statistical method that predicts a binary outcome by modeling the probability of that outcome as a function of one or more predictor variables.

2
Q

1) Obtaining historical data

A

The first step in building a logistic regression model is to obtain historical data with a labeled column for the outcome you want to predict.

3
Q

2) Partitioning data

A

Divide the data into a training set, used to build the model, and a smaller testing set, used to test it.

4
Q

3) Selecting relevant variables

A

Select relevant variables from the training data based on logic, domain expertise, or theory to build the model.

5
Q

4) Building the model

A

The logistic regression algorithm creates a model, which is an equation, based on the selected variables and outcome.
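
As a minimal sketch of what "the model is an equation" means: the fitted model combines the selected variables linearly and passes the result through the logistic (sigmoid) function to get a probability. The function name, the coefficient values, and the 0.5 cutoff below are illustrative assumptions, not values from the course.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Return P(outcome = 1) from a logistic regression equation."""
    # Linear part: b0 + b1*x1 + ... + bn*xn
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # The logistic function maps z to a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and one record, for illustration only
p = predict_probability(-1.5, [0.8, 0.3], [2.0, 1.0])
# Assumed cutoff of 0.5 to turn the probability into a binary prediction
label = 1 if p >= 0.5 else 0
```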

6
Q

5) Testing model accuracy

A

Test the accuracy of the model using the test set and a confusion matrix.
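
A confusion matrix simply counts how predictions on the test set line up with the actual labels. The sketch below assumes labels coded as 1 (positive) and 0 (negative); the sample data is made up for illustration.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # correctly predicted positive
        elif a == 0 and p == 0:
            counts["TN"] += 1   # correctly predicted negative
        elif a == 0 and p == 1:
            counts["FP"] += 1   # predicted positive, actually negative
        else:
            counts["FN"] += 1   # predicted negative, actually positive
    return counts

# Made-up test-set labels and model predictions
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
# Accuracy = correct predictions / all predictions
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
```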

7
Q

Training and test set division

A

Divide the data randomly into a training set (2/3) and a test set (1/3), unless it is a time series problem.
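
One way to sketch the random 2/3 and 1/3 division, using only the standard library. The function name and the fixed seed are assumptions made for reproducibility; for a time series problem the records would instead be kept in chronological order and split at a cut-off point.

```python
import random

def train_test_split(records, train_fraction=2/3, seed=0):
    """Shuffle the records and split them into training and test sets."""
    shuffled = list(records)
    # Seeded generator so the same split can be reproduced
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 30 placeholder records: 20 go to training, 10 to testing
train, test = train_test_split(range(30))
```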

8
Q

Cross-validation

A

Divide the data into k subsamples, use k-1 for training and the remaining one for testing, repeat k times to ensure all examples are used for both training and testing.
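
The k-fold procedure can be sketched as a generator of index lists: each round holds out one subsample for testing and trains on the other k-1. The function name, k=5, and the seed are illustrative assumptions.

```python
import random

def k_fold_splits(n_records, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal subsamples
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Training data is every subsample except the held-out one
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_splits(10, k=5))
```

Across the k rounds every record appears in a test set exactly once and in a training set k-1 times.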

9
Q

Data balancing

A

Balance the training data by deleting records from the most frequent category (undersampling) or duplicating records in the less frequent category (oversampling); a heavily imbalanced training set tends to produce a poor model.
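
Both balancing options can be sketched in a few lines. The helper names and the `label_of` callback are hypothetical, and the sample records are made up; only the training data would be balanced this way.

```python
import random

def _group_by_label(records, label_of):
    groups = {}
    for r in records:
        groups.setdefault(label_of(r), []).append(r)
    return groups

def undersample(records, label_of, seed=0):
    """Delete records at random until every category matches the smallest."""
    rng = random.Random(seed)
    groups = _group_by_label(records, label_of)
    smallest = min(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, smallest))
    return balanced

def oversample(records, label_of, seed=0):
    """Duplicate records at random until every category matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_label(records, label_of)
    largest = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choices(g, k=largest - len(g)))
    return balanced

# Made-up (id, label) records: 8 negatives and 2 positives
records = [(i, 0) for i in range(8)] + [(i, 1) for i in range(8, 10)]
label_of = lambda r: r[1]
smaller = undersample(records, label_of)   # 2 of each category
larger = oversample(records, label_of)     # 8 of each category
```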

10
Q

1st Consideration:
Overfitting

A

When a model is excessively complex and fits noise in the training data instead of the underlying relationships. Cross-validation helps detect and guard against it.

11
Q

2nd Consideration:
Variable selection challenges

A

Deciding on the input variables to use can be challenging and is often based on logic, domain expertise, or theory.

12
Q

3rd Consideration:
Minimum data requirements

A

A minimum of 50 cases per predictor is recommended, and larger datasets with balanced categories tend to produce better results.
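
The rule of thumb is simple arithmetic: multiply the number of predictors by 50 and compare with the number of cases. A tiny sketch, with a hypothetical function name:

```python
def meets_minimum_cases(n_cases, n_predictors, cases_per_predictor=50):
    """Check the rule of thumb of at least 50 cases per predictor."""
    return n_cases >= n_predictors * cases_per_predictor

# 5 predictors call for at least 5 * 50 = 250 cases
enough = meets_minimum_cases(300, 5)
too_few = meets_minimum_cases(200, 5)
```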

13
Q

Model performance evaluation

A
Evaluate model performance using:
  • ROC curve
  • Misclassification rate
  • Accuracy rate
  • Confusion matrix
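
As a rough sketch of two of these measures: the misclassification rate is the share of wrong predictions (1 minus the accuracy rate), and a ROC curve plots the true positive rate against the false positive rate as the probability cutoff varies. The function names, scores, and thresholds below are made up for illustration, and labels are assumed to be coded 0/1.

```python
def misclassification_rate(actual, predicted):
    """Share of predictions that do not match the actual label."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

def roc_points(actual, scores, thresholds):
    """Return (false positive rate, true positive rate) per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        # Predict 1 when the model's probability reaches the cutoff
        predicted = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points

# Made-up actual labels and predicted probabilities
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
points = roc_points(actual, scores, [0.0, 0.5, 1.0])
error = misclassification_rate(actual, [1 if s >= 0.5 else 0 for s in scores])
```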
14
Q

Improving model performance

A
  • Adjust model parameters
  • Try different modeling techniques
  • Improve data quality
  • Experiment with multiple models.
15
Q

Selecting target and input variables (Summary #1)

A

Select a binary target variable to predict and relevant input variables with few missing values and low correlation.
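
One way to screen candidate inputs for low correlation is to compute the Pearson correlation for each pair and flag pairs above a cutoff. The 0.8 cutoff, the function names, and the column data are assumptions for illustration; the card itself does not specify a threshold.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def highly_correlated_pairs(columns, threshold=0.8):
    """List pairs of input variables whose |correlation| meets the cutoff."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                pairs.append((a, b))
    return pairs

# Made-up candidate input variables
columns = {
    "age": [25, 35, 45, 55, 65],
    "years_employed": [2, 11, 22, 30, 41],
    "visits": [3, 1, 4, 1, 5],
}
flagged = highly_correlated_pairs(columns)
```

A flagged pair suggests keeping only one of the two variables as an input.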

16
Q

Preparing data (Summary #2)

A

Deal with missing, dirty, or duplicate data, remove outliers, and split data into training and validation sets or use cross-validation.

17
Q

Building and running the model (Summary #3)

A

Observe the statistical significance of each predictor variable and use a confusion matrix to compare and select the best model.

18
Q

Model validation (Summary #4)

A

Apply the model to the validation set to assess its accuracy on new data and to check for overfitting.