SUL Topic 3 - Logistic Regression Flashcards

Question 1

Q

Logistic Regression

Answer

A

A statistical method that predicts a binary outcome by modeling the probability of that outcome as a function of one or more predictor variables.
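
As a rough illustration of the definition above, a logistic regression model passes a weighted sum of the predictors through the logistic (sigmoid) function to get a probability, then applies a cutoff to get a binary prediction. The coefficient values and the 0.5 cutoff below are illustrative assumptions, not values from the course.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, bias, features):
    """Probability that the outcome is 1 for a single record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for two predictors (illustrative values only).
p = predict_probability(weights=[0.8, -0.5], bias=0.1, features=[2.0, 1.0])

# Assumed cutoff of 0.5 to turn the probability into a binary prediction.
label = 1 if p >= 0.5 else 0
```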

Question 2

Q

1) Obtaining historical data

Answer

A

The first step in building a logistic regression model is to obtain historical data with a labeled column for the outcome you want to predict.

Question 3

Q

2) Partitioning data

Answer

A

Divide the data into a larger training set, used to build the model, and a smaller testing set, used to test it.

Question 4

Q

3) Selecting relevant variables

Answer

A

Select relevant variables from the training data based on logic, domain expertise, or theory to build the model.

Question 5

Q

4) Building the model

Answer

A

The logistic regression algorithm creates a model, which is an equation, based on the selected variables and outcome.
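
To make "a model, which is an equation" concrete, here is a minimal sketch that fits the equation's coefficients with plain gradient descent. The fitting method, the learning rate, the number of epochs, and the toy data are all assumptions for illustration; statistical packages typically use other optimizers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n = len(X)
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, target in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = p - target  # gradient of the log-loss with respect to z
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Toy training data (made up): hours studied -> passed (1) or failed (0).
X = [[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
weights, bias = fit_logistic(X, y)

# The resulting model is the equation:
#   P(pass) = sigmoid(bias + weights[0] * hours)
```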

Question 6

Q

5) Testing model accuracy

Answer

A

Test the accuracy of the model using the test set and a confusion matrix.
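
A confusion matrix for a binary outcome just counts the four combinations of actual and predicted labels. The sketch below assumes labels are coded 0 and 1; the example labels are made up.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts

def accuracy(counts):
    """Share of test records the model classified correctly."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

# Made-up test-set labels and model predictions.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_matrix(actual, predicted)  # TP=3, TN=3, FP=1, FN=1
acc = accuracy(counts)                        # 0.75
```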

Question 7

Q

Training and test set division

Answer

A

Divide the data randomly into a training set (2/3) and a test set (1/3), unless it is a time series problem.
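
A minimal sketch of the 2/3 and 1/3 random split. The card does not say what to do for time series; the chronological split shown second is an assumption about the usual alternative, where earlier records train the model and later records test it.

```python
import random

def train_test_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records and cut it into train and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def time_ordered_split(records, train_fraction=2 / 3):
    """Assumed alternative for time series: keep order, no shuffling."""
    cut = round(len(records) * train_fraction)
    return list(records[:cut]), list(records[cut:])

data = list(range(30))           # stand-in for 30 historical records
train, test = train_test_split(data)
early, late = time_ordered_split(data)
```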

Question 8

Q

Cross-validation

Answer

A

Divide the data into k subsamples. Use k-1 of them for training and the remaining one for testing, then repeat k times so that every example is used for both training and testing.
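
The procedure above can be sketched as a generator of index splits. Shuffling before assigning folds is an assumption here (it would not suit time series), and the function name is hypothetical.

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsamples
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_records=10, k=5))

# Each record lands in the test set exactly once across the k rounds.
tested = sorted(idx for _, test_idx in splits for idx in test_idx)
```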

Question 9

Q

Data balancing

Answer

A

Balance the training data by deleting records from the more frequent category or duplicating records in the less frequent category, since heavily imbalanced training data tends to produce poor models.
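
Both balancing options can be sketched in a few lines. The function names and the toy records are hypothetical, and balancing to exactly equal category sizes is an assumption; the card does not specify a target ratio.

```python
import random

def _group_by_label(records, label_of):
    groups = {}
    for record in records:
        groups.setdefault(label_of(record), []).append(record)
    return groups

def undersample(records, label_of, seed=0):
    """Delete random records from the larger categories."""
    rng = random.Random(seed)
    groups = _group_by_label(records, label_of)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced

def oversample(records, label_of, seed=0):
    """Duplicate random records in the smaller categories."""
    rng = random.Random(seed)
    groups = _group_by_label(records, label_of)
    largest = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=largest - len(group)))
    return balanced

# Toy training data: (record_id, outcome), with 8 zeros and 2 ones.
training = [(i, 0) for i in range(8)] + [(i, 1) for i in range(8, 10)]
smaller = undersample(training, label_of=lambda r: r[1])  # 2 of each
larger = oversample(training, label_of=lambda r: r[1])    # 8 of each
```

Only the training data is balanced here, as the card states; the test set keeps its original proportions.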

Question 10

Q

1st Consideration:
Overfitting

Answer

A

Overfitting occurs when a model is excessively complex and captures noise instead of the underlying relationships. Cross-validation helps detect and limit it.

Question 11

Q

2nd Consideration:
Variable selection challenges

Answer

A

Deciding on the input variables to use can be challenging and is often based on logic, domain expertise, or theory.

Question 12

Q

3rd Consideration:
Minimum data requirements

Answer

A

A minimum of 50 cases per predictor is recommended, and larger datasets with balanced categories tend to produce better results.
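
The 50-cases-per-predictor guideline is simple arithmetic. A small sketch, with hypothetical function names:

```python
def meets_minimum_cases(n_cases, n_predictors, cases_per_predictor=50):
    """True if the dataset satisfies the cases-per-predictor guideline."""
    return n_cases >= cases_per_predictor * n_predictors

def max_predictors(n_cases, cases_per_predictor=50):
    """Largest number of predictors the guideline allows for this dataset."""
    return n_cases // cases_per_predictor

enough = meets_minimum_cases(n_cases=300, n_predictors=5)      # 300 >= 250
too_few = meets_minimum_cases(n_cases=200, n_predictors=5)     # 200 < 250
```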

Question 13

Q

Model performance evaluation

Answer

A

Evaluate model performance using:
ROC curve
Misclassification rate
Accuracy rate
Confusion matrix.
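
Of the measures listed, the ROC curve is the least obvious to compute by hand. The sketch below assumes the model outputs a score per record, sweeps the cutoff over every distinct score, and records the false positive rate and true positive rate at each one. The area under the curve (AUC) is a common one-number summary, added here as an assumption since the card names only the curve. Labels and scores are made up.

```python
def roc_points(actual, scores):
    """Return (false_positive_rate, true_positive_rate) per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up test labels and predicted probabilities.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
area = auc(points)  # 8/9, about 0.889

# At a cutoff of 0.5: 5 of 6 correct, so the misclassification rate is 1/6.
predicted = [1 if s >= 0.5 else 0 for s in scores]
misclassification_rate = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
```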

Question 14

Q

Improving model performance

Answer

A

Adjust model parameters
Try different modeling techniques
Improve data quality
Experiment with multiple models.

Question 15

Q

Selecting target and input variables (Summary #1)

Answer

A

Select a binary target variable to predict and relevant input variables with few missing values and low correlation.
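
One way to screen input variables for "low correlation" is to compute the Pearson correlation for each pair and flag pairs above a cutoff. The 0.8 cutoff, the function names, and the toy columns are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.8):
    """List pairs of input variables whose |r| reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Toy inputs: the two height columns carry the same information.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30, 22, 41, 35, 28],
}
pairs = highly_correlated_pairs(columns)  # flags height_cm / height_in only
```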

Question 16

Q

Preparing data (Summary #2)

Answer

A

Deal with missing, dirty, or duplicate data, remove outliers, and split data into training and validation sets or use cross-validation.
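
A minimal sketch of the cleaning steps, assuming records are dictionaries, missing values are stored as None, and outliers are defined by the common 1.5 x IQR rule. The card does not name an outlier rule, so that choice, the function name, and the toy records are assumptions.

```python
import statistics

def clean(records, numeric_field):
    """Drop incomplete records, exact duplicates, then IQR outliers."""
    # 1) Missing data: keep only records with no None values.
    complete = [r for r in records if all(v is not None for v in r.values())]

    # 2) Duplicates: keep the first copy of each identical record.
    seen = set()
    unique = []
    for r in complete:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3) Outliers: drop values beyond 1.5 * IQR from the quartiles.
    values = [r[numeric_field] for r in unique]
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    return [r for r in unique if q1 - spread <= r[numeric_field] <= q3 + spread]

# Made-up customer records.
records = [
    {"customer_id": 1, "age": 21, "bought": 0},
    {"customer_id": 2, "age": 23, "bought": 0},
    {"customer_id": 3, "age": 25, "bought": 1},
    {"customer_id": 3, "age": 25, "bought": 1},    # duplicate
    {"customer_id": 4, "age": 27, "bought": 0},
    {"customer_id": 5, "age": 29, "bought": 1},
    {"customer_id": 6, "age": 31, "bought": 1},
    {"customer_id": 7, "age": 33, "bought": 1},
    {"customer_id": 8, "age": 400, "bought": 0},   # data-entry outlier
    {"customer_id": 9, "age": None, "bought": 1},  # missing value
]
cleaned = clean(records, numeric_field="age")
```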

Question 17

Q

Building and running the model (Summary #3)

Answer

A

Observe the statistical significance of each predictor variable and use a confusion matrix to compare and select the best model.

Question 18

Q

Model validation (Summary #4)

Answer

A

Apply the model to the validation set to assess its accuracy on new data and to check for overfitting.

(18 cards)