Predictive Data Mining Flashcards
Accuracy
Measure of classification success defined as 1 minus the overall error rate.
Average error
The average difference between the actual values and the predicted values of observations in a data set.
Bagging
An ensemble method that generates a committee of models based on random samples drawn with replacement and makes predictions based on the average prediction of the set of models.
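The sampling-and-averaging idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: each "model" is just the mean of its bootstrap sample, and the function name `bagged_prediction` and its parameters are hypothetical.

```python
import random

def bagged_prediction(values, n_models=25, seed=0):
    """Average the predictions of n_models trivial models (assumed: sample means)."""
    rng = random.Random(seed)
    model_predictions = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = rng.choices(values, k=len(values))
        # Stand-in "model": predict the mean of the sampled outcomes.
        model_predictions.append(sum(sample) / len(sample))
    # Committee prediction: the average of the individual predictions.
    return sum(model_predictions) / n_models

outcomes = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0]
bagged = bagged_prediction(outcomes)
```

In practice each committee member would be a real model (for example, a tree) fit to its own bootstrap sample; the averaging step stays the same.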
Bias
The tendency of a predictive model to overestimate or underestimate the value of a continuous outcome.
Boosting
An ensemble method that iteratively samples from the original training data to generate individual models that target observations that were mispredicted in previously generated models. Its predictions are based on the weighted average of the predictions of the individual models, where the weights are proportional to the individual models’ accuracy.
Class error rate
Percentage of observations of a given class misclassified by a model in a data set.
Classification confusion matrix
A matrix showing the counts of actual versus predicted class values.
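As a small illustration, the counts in a confusion matrix can be tallied directly from paired actual and predicted labels, and the overall error rate and accuracy follow from them. The function names and the sample labels below are made up for this sketch.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) class pair occurs."""
    return Counter(zip(actual, predicted))

def overall_error_rate(actual, predicted):
    """Fraction of observations whose predicted class differs from the actual class."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
error = overall_error_rate(actual, predicted)   # 2 of 8 misclassified
accuracy = 1 - error                            # accuracy = 1 - overall error rate
```

Here `counts[(1, 0)]` is the number of false negatives and `counts[(0, 1)]` the number of false positives, taking class 1 as positive.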
Classification tree
A tree that classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules.
Classification
A predictive data mining task requiring the prediction of an observation’s outcome class or category.
Cumulative lift chart
A chart used to present how well a model performs in identifying observations most likely to be in a given class as compared with random classification.
Cutoff value
The smallest value that the predicted probability of an observation can be for the observation to be classified as a given class.
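A minimal sketch of applying a cutoff value, assuming class 1 is the class of interest and that the probabilities come from some already-fitted model (the function name `classify` is hypothetical):

```python
def classify(probabilities, cutoff=0.5):
    """Assign class 1 when the predicted probability is at least the cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]

probabilities = [0.10, 0.50, 0.49, 0.85]
default_classes = classify(probabilities)             # cutoff of 0.5
strict_classes = classify(probabilities, cutoff=0.8)  # fewer observations labeled 1
```

Raising the cutoff generally trades false positives for false negatives.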
Decile-wise lift chart
A chart used to present how well a model performs at identifying observations for each of the top k deciles most likely to be in a given class versus a random selection.
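The values plotted in such a chart can be sketched as follows, assuming lift for a decile is the rate of class-1 observations in that decile divided by the overall class-1 rate. The function name and sample data are illustrative only.

```python
def decile_lifts(actual, probabilities):
    """Return the lift of each decile after ranking by predicted probability."""
    if len(actual) < 10 or sum(actual) == 0:
        raise ValueError("need at least 10 observations and one positive")
    # Rank observations from most to least likely to be in class 1.
    ranked = [a for _, a in sorted(zip(probabilities, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    overall_rate = sum(ranked) / n
    lifts = []
    for d in range(10):
        group = ranked[d * n // 10:(d + 1) * n // 10]
        lifts.append((sum(group) / len(group)) / overall_rate)
    return lifts

# 20 hypothetical observations, already listed from highest to lowest probability.
probabilities = [(20 - i) / 20 for i in range(20)]
actual = [1, 1, 1, 1, 1, 0, 1, 0] + [0] * 12
lifts = decile_lifts(actual, probabilities)
```

A lift above 1 for the first deciles means the model finds class-1 observations there more often than a random selection would.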
Ensemble method
A predictive data-mining approach in which a committee of individual classification or estimation models is generated and a prediction is made by combining these individual predictions.
Estimation
A predictive data mining task requiring the prediction of an observation’s continuous outcome value.
F1 score
A measure combining precision and sensitivity into a single metric.
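The usual way to combine the two is the harmonic mean, F1 = 2 × precision × sensitivity / (precision + sensitivity). A short sketch from confusion-matrix counts (the function name and the zero-count convention are choices made for this example):

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and sensitivity (recall)."""
    if tp == 0:
        return 0.0  # convention assumed here when there are no true positives
    precision = tp / (tp + fp)    # share of predicted positives that are correct
    sensitivity = tp / (tp + fn)  # share of actual positives that are found
    return 2 * precision * sensitivity / (precision + sensitivity)

balanced = f1_score(tp=3, fp=1, fn=1)   # precision 0.75, sensitivity 0.75
lopsided = f1_score(tp=1, fp=0, fn=9)   # precision 1.0, sensitivity 0.1
```

Because it is a harmonic mean, F1 stays low whenever either precision or sensitivity is low.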
False negative
The misclassification of a positive observation as negative.
False positive
The misclassification of a negative observation as positive.
Features
A set of input variables used to predict an observation’s outcome class or continuous outcome value.
Impurity
A measure of the heterogeneity of the observations within a group (node) of a classification tree.
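The definition above does not name a specific impurity measure. One common choice, assumed here purely for illustration, is the Gini index: 1 minus the sum of squared class proportions in a group.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini index of a group: 0 when all labels agree, larger when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure_group = gini_impurity(["yes", "yes", "yes", "yes"])
mixed_group = gini_impurity(["yes", "yes", "no", "no"])
```

A classification tree chooses splits that reduce impurity, so child groups are more homogeneous than their parent.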
K-nearest neighbor (K-NN)
A classification method that classifies an observation based on the class of the k observations most similar or nearest to it.
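A minimal sketch of the idea, assuming Euclidean distance as the similarity measure and a simple majority vote among the k neighbors (both are assumptions; the function name and data are hypothetical):

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Predict the most common class among the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_classify(train_points, train_labels, query=(2, 2))
near_b = knn_classify(train_points, train_labels, query=(6, 5))
```

Variables are usually standardized first so that no single feature dominates the distance.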
Logistic regression
A generalization of linear regression for predicting a categorical outcome variable.
Mallows’s Cp statistic
A measure in which small values approximately equal to the number of coefficients suggest promising logistic regression models.
Model overfitting
A situation in which a model explains random patterns in the data on which it is trained rather than just the underlying relationships, resulting in training-set accuracy that far exceeds its accuracy on new data.
Observation (record)
A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database.
Overall error rate
The percentage of observations misclassified by a model in a data set.
Precision
The percentage of observations predicted to be in a given class that actually are in that class.
Random forest
A variant of the bagging ensemble method that generates a committee of classification or regression trees based on different random samples but restricts each individual tree to a limited number of randomly selected features (variables).
Receiver operating characteristic (ROC) curve
A chart used to illustrate the tradeoff between a model’s ability to identify a given class’s observations and its complement class’s error rate.
Regression tree
A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules.
Root mean square error
A measure of the accuracy of an estimation method defined as the square root of the average of the squared deviations between the actual values and predicted values of observations.
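A short sketch of the calculation, using the usual mean-of-squares form and contrasting it with average error (function names and sample values are made up for illustration):

```python
import math

def average_error(actual, predicted):
    """Mean of (actual - predicted); positive and negative errors can cancel."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the mean squared deviation; errors cannot cancel."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

actual    = [3, 5, 7, 9]
predicted = [2, 5, 9, 9]

avg_err = average_error(actual, predicted)   # (1 + 0 - 2 + 0) / 4
root_err = rmse(actual, predicted)           # sqrt((1 + 0 + 4 + 0) / 4)
```

An average error near zero suggests little bias, while the root mean square error reflects the typical size of an individual error.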
Sensitivity (recall)
The percentage of actual observations of a given class correctly identified, usually the positive class.
Specificity
The percentage of actual observations of a given class correctly identified, usually the negative class.
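Sensitivity and specificity are the same calculation applied to the positive and negative classes, respectively. A minimal sketch from confusion-matrix counts (the function names and counts are illustrative):

```python
def sensitivity(tp, fn):
    """Share of actual positive observations that the model identifies as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negative observations that the model identifies as negative."""
    return tn / (tn + fp)

# Hypothetical counts: 50 actual positives and 50 actual negatives.
sens = sensitivity(tp=40, fn=10)
spec = specificity(tn=45, fp=5)
```

The quantity 1 minus specificity is the error rate for the negative class.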
Supervised learning
Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.
Test set
Data set used to compute an unbiased estimate of the final predictive model’s accuracy.
Training set
Data used to build candidate predictive models.
Validation set
Data used to evaluate candidate predictive models.
Variable (feature)
A characteristic or quantity of interest that can take on different values.