Predictive Data Mining Flashcards by Chris Huskey

Accuracy

Measure of classification success defined as 1 minus the overall error rate.

Average error

The average difference between the actual values and the predicted values of observations in a data set.
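A minimal sketch of the calculation, assuming the error for each observation is taken as actual minus predicted (the sign convention and the sample values are assumptions, not part of the card). Positive and negative errors cancel, so a value near zero does not by itself mean the individual predictions are close.

```python
def average_error(actual, predicted):
    """Mean of (actual - predicted) over all observations."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Illustrative values only.
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 11.0, 17.0, 19.0]

# Errors are -1, 1, -2, 1, so the average is -0.25:
# under this sign convention the model overestimates on average.
result = average_error(actual, predicted)
print(result)
```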

Bagging

An ensemble method that generates a committee of models based on random samples drawn with replacement and makes predictions based on the average prediction of the set of models.
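The idea can be sketched in a few lines. The base model here (a one-nearest-neighbor predictor named `fit_nearest`), the number of models, and the data are all hypothetical choices made for illustration; any model could be fitted to each bootstrap sample instead.

```python
import random

def fit_nearest(sample):
    """Hypothetical base model: predict the y of the closest x in the sample."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagging_predict(data, x, n_models=25, seed=0):
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        predictions.append(fit_nearest(sample)(x))
    # The committee's prediction is the average of the individual predictions.
    return sum(predictions) / len(predictions)

# Illustrative (x, y) pairs.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
estimate = bagging_predict(data, x=3.4)
print(estimate)
```

For a classification task, the averaging step would typically be replaced by a majority vote across the models.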

Bias

The tendency of a predictive model to overestimate or underestimate the value of a continuous outcome.

Boosting

An ensemble method that iteratively samples from the original training data to generate individual models that target observations that were mispredicted in previously generated models. Its predictions are based on the weighted average of the predictions of the individual models, where the weights are proportional to the individual models’ accuracy.

Class error rate

Percentage of observations of a given class misclassified by a model in a data set.

Classification confusion matrix

A matrix showing the counts of actual versus predicted class values.
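A small sketch of how the counts can be tallied and then used to compute the overall error rate and accuracy. The labels are made up for illustration, and storing the matrix as a `Counter` keyed by (actual, predicted) pairs is just one convenient representation.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count observations by (actual_class, predicted_class) pair."""
    return Counter(zip(actual, predicted))

# Illustrative labels: 1 = positive class, 0 = negative class.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]
cm = confusion_matrix(actual, predicted)

n = len(actual)
correct = sum(count for (a, p), count in cm.items() if a == p)
overall_error_rate = (n - correct) / n   # 2 of 8 misclassified -> 0.25
accuracy = 1 - overall_error_rate        # 0.75

print(cm[(1, 1)], cm[(1, 0)], cm[(0, 1)], cm[(0, 0)])
print(overall_error_rate, accuracy)
```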

Classification tree

A tree that classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules.

Classification

A predictive data mining task requiring the prediction of an observation’s outcome class or category.

Cumulative lift chart

A chart used to present how well a model performs in identifying observations most likely to be in a given class as compared with random classification.
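The points behind such a chart can be sketched as follows, assuming observations are ranked by predicted probability and the random baseline is the overall share of the class times the number of observations selected. The function name and data are illustrative.

```python
def cumulative_lift_points(actual, probabilities):
    """Return (n_selected, found_by_model, expected_at_random) tuples."""
    # Rank observations from most to least likely to be in the class of interest.
    ranked = sorted(zip(probabilities, actual), key=lambda pair: pair[0], reverse=True)
    base_rate = sum(actual) / len(actual)
    points, found = [], 0
    for n_selected, (_, label) in enumerate(ranked, start=1):
        found += label
        points.append((n_selected, found, n_selected * base_rate))
    return points

# Illustrative data: 1 marks membership in the class of interest.
actual        = [1, 0, 1, 0, 0, 1, 0, 0]
probabilities = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]
points = cumulative_lift_points(actual, probabilities)

# After selecting the top 3 observations the model has found all 3 members,
# whereas random selection would be expected to find 1.125.
print(points[2])
```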

Cutoff value

The smallest value that the predicted probability of an observation can be for the observation to be classified as a given class.
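As a quick illustration, the sketch below applies two cutoffs to the same predicted probabilities. The probabilities and the default of 0.5 are assumptions chosen for the example.

```python
def classify(probabilities, cutoff=0.5):
    """Assign class 1 when the predicted probability is at least the cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]

# Illustrative predicted probabilities of belonging to class 1.
probabilities = [0.15, 0.40, 0.50, 0.72, 0.91]

default_labels = classify(probabilities)        # [0, 0, 1, 1, 1]
strict_labels = classify(probabilities, 0.75)   # [0, 0, 0, 0, 1]
print(default_labels, strict_labels)
```

Raising the cutoff makes the model more conservative about assigning class 1, which generally trades false positives for false negatives.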

Decile-wise lift chart

A chart used to present how well a model performs at identifying observations for each of the top k deciles most likely to be in a given class versus a random selection.

Ensemble method

A predictive data-mining approach in which a committee of individual classification or estimation models is generated and a prediction is made by combining the individual models' predictions.

Estimation

A predictive data mining task requiring the prediction of an observation’s continuous outcome value.

F1 score

A measure combining precision and sensitivity into a single metric.
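The combination is the harmonic mean of precision and sensitivity (recall). A short sketch, with made-up counts of true positives, false positives, and false negatives:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall (sensitivity), and their harmonic mean."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, round(f1, 4))  # 0.75 0.6 0.6667
```

Because it is a harmonic mean, the F1 score is pulled toward the smaller of the two values, so a model cannot score well by excelling at only one of them.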

False negative

The misclassification of a positive observation as negative.

False positive

The misclassification of a negative observation as positive.

Features

A set of input variables used to predict an observation’s outcome class or continuous outcome value.

Impurity

Measure of the heterogeneity of observations in a classification tree.
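The card does not name a specific measure. One common choice is the Gini index, sketched below as an assumption: it equals 0 when every observation in a group belongs to the same class and grows as the classes become more mixed.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure group."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini_impurity(["a", "a", "a", "a"])    # 0.0
mixed = gini_impurity(["a", "a", "b", "b"])   # 0.5
print(pure, mixed)
```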

K-nearest neighbor (K-NN)

A classification method that classifies an observation based on the class of the k observations most similar or nearest to it.
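A minimal sketch of the method, assuming Euclidean distance as the similarity measure and a simple majority vote among the k neighbors. Both are common choices rather than something the card specifies, and the training points are invented.

```python
from collections import Counter
import math

def knn_classify(training, query, k=3):
    """training: list of (feature_vector, label) pairs; query: a feature vector."""
    # Sort the training observations by distance to the query point.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

# Illustrative training data with two well-separated classes.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
            ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B")]

print(knn_classify(training, (1.1, 1.0)))  # A
print(knn_classify(training, (3.9, 4.1)))  # B
```

In practice the features are usually put on a comparable scale first, since a variable with a large range would otherwise dominate the distance.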

Logistic regression

A generalization of linear regression for predicting a categorical outcome variable.
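For a binary outcome, the link to linear regression can be sketched as a linear combination of the features passed through the logistic function, which yields a probability between 0 and 1. The intercept and coefficients below are hypothetical; in practice they are estimated from training data.

```python
import math

def predict_probability(features, intercept, coefficients):
    """Logistic function applied to a linear combination of the features."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1 / (1 + math.exp(-z))

# Hypothetical fitted values, for illustration only.
intercept = -3.0
coefficients = [0.5, 1.0]

# z = -3.0 + 0.5*2.0 + 1.0*2.0 = 0, so the predicted probability is 0.5.
p = predict_probability([2.0, 2.0], intercept, coefficients)
print(p)
```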

Mallow’s Cp statistic

A measure in which small values approximately equal to the number of coefficients suggest promising logistic regression models.

Model overfitting

A situation in which a model explains random patterns in the data on which it is trained rather than just the relationships, resulting in training-set accuracy that far exceeds accuracy for the new data.

Observation (record)

A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database.

Overall error rate

The percentage of observations misclassified by a model in a data set.

Precision

The percentage of observations predicted to be in a given class that actually are in that class.

Random forest

A variant of the bagging ensemble method that generates a committee of classification or regression trees based on different random samples but restricts each individual tree to a limited number of randomly selected features (variables).

Receiver operating characteristic (ROC) curve

A chart used to illustrate the tradeoff between a model's ability to identify a given class's observations and its complement class's error rate.
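The points on such a curve can be sketched by sweeping the cutoff over the predicted probabilities and recording, at each cutoff, the false positive rate (1 minus specificity) against the true positive rate (sensitivity). The function name and data are illustrative.

```python
def roc_points(actual, probabilities):
    """Return (false_positive_rate, true_positive_rate) for each candidate cutoff."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # cutoff above every probability: nothing predicted positive
    for cutoff in sorted(set(probabilities), reverse=True):
        predicted = [1 if p >= cutoff else 0 for p in probabilities]
        tp = sum(1 for a, y in zip(actual, predicted) if a == 1 and y == 1)
        fp = sum(1 for a, y in zip(actual, predicted) if a == 0 and y == 1)
        points.append((fp / negatives, tp / positives))
    return points

# Illustrative data: 1 = positive class.
actual        = [1, 1, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(actual, probabilities)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```

A curve that rises quickly toward the top-left corner indicates a model that identifies most positives while misclassifying few negatives.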

Regression tree

A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules.

Root mean square error

A measure of the accuracy of an estimation method defined as the square root of the average of the squared deviations between the actual values and predicted values of observations.
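A minimal sketch of the calculation, assuming the usual convention in which the squared deviations are averaged over the observations before the square root is taken. The values are illustrative.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared deviation between actual and predicted."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative values only.
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 11.0, 17.0, 19.0]

# Squared deviations are 1, 1, 4, 1; their mean is 1.75.
value = rmse(actual, predicted)
print(round(value, 4))  # 1.3229
```

Unlike the average error, squaring prevents positive and negative deviations from canceling, and it penalizes large misses more heavily.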

Sensitivity; recall

The percentage of actual observations of a given class correctly identified; usually the positive class.

Specificity

The percentage of actual observations of a given class correctly identified; usually the negative class.
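Sensitivity and specificity can be computed side by side from the same predictions. The sketch below assumes class 1 is the positive class and class 0 the negative class; the labels are made up.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) with 1 as positive and 0 as negative."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative labels: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sensitivity, specificity = sensitivity_specificity(actual, predicted)
print(sensitivity, round(specificity, 4))  # 0.75 0.6667
```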

Supervised learning

Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.

Test set

Data set used to compute an unbiased estimate of the final predictive model's accuracy.

Training set

Data used to build candidate predictive models.

Validation set

Data used to evaluate candidate predictive models.
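The three data sets are usually produced by randomly partitioning the available observations. A minimal sketch, in which the 50/30/20 proportions, the function name, and the seed are assumptions chosen for illustration:

```python
import random

def split_data(records, train_frac=0.5, validation_frac=0.3, seed=0):
    """Shuffle the records and partition them into training, validation, and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test

# Illustrative records: 100 observation identifiers.
records = list(range(100))
train, validation, test = split_data(records)
print(len(train), len(validation), len(test))  # 50 30 20
```

Candidate models are built on the training set and compared on the validation set; the test set is held back until a final model has been chosen.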

Variable (feature)

A characteristic or quantity of interest that can take on different values.

Predictive Data Mining Flashcards

(37 cards)