Predictive Data Mining Flashcards

1. Q: Accuracy
A: Measure of classification success defined as 1 minus the overall error rate.

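As a quick check of this definition, a short Python sketch (the actual and predicted labels are made up) computes accuracy from the overall error rate:

```python
# Accuracy = 1 - overall error rate (illustrative labels only).
actual    = ["yes", "no", "yes", "no", "yes"]
predicted = ["yes", "no", "no",  "no", "yes"]

errors = sum(a != p for a, p in zip(actual, predicted))
overall_error_rate = errors / len(actual)  # 1/5 = 0.2
accuracy = 1 - overall_error_rate
print(accuracy)  # 0.8
```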
2. Q: Average error
A: The average difference between the actual values and the predicted values of observations in a data set.

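A minimal sketch of this definition in Python, using made-up actual and predicted values:

```python
# Average error: mean of (actual - predicted); nonzero values indicate bias.
actual    = [10.0, 12.0, 9.0, 11.0]
predicted = [ 9.0, 13.0, 9.0, 10.0]

average_error = sum(a - p for a, p in zip(actual, predicted)) / len(actual)
print(average_error)  # 0.25 -> on average the model slightly underpredicts
```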
3. Q: Bagging
A: An ensemble method (short for bootstrap aggregating) that generates a committee of models based on random samples drawn with replacement and makes predictions based on the average prediction of the set of models.

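The committee idea can be sketched in pure Python. The base model here is a deliberately trivial predict-the-sample-mean estimator, chosen only to keep the example self-contained; in practice the committee members are usually trees or other full models:

```python
import random

random.seed(42)  # for reproducibility
data = [4.0, 6.0, 5.0, 7.0, 3.0, 5.0]  # illustrative outcome values

def fit_mean_model(sample):
    """Trivial base model: always predicts the mean of its training sample."""
    mean = sum(sample) / len(sample)
    return lambda: mean

# Draw bootstrap samples (random sampling *with replacement*) and
# fit one committee member per sample.
committee = [fit_mean_model([random.choice(data) for _ in data])
             for _ in range(100)]

# Bagged prediction: average of the committee members' predictions.
bagged_prediction = sum(m() for m in committee) / len(committee)
print(round(bagged_prediction, 2))  # close to the overall mean of 5.0
```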
4. Q: Bias
A: The tendency of a predictive model to overestimate or underestimate the value of a continuous outcome.

5. Q: Boosting
A: An ensemble method that iteratively samples from the original training data to generate individual models that target observations that were mispredicted in previously generated models. Its predictions are based on the weighted average of the predictions of the individual models, where the weights are proportional to the individual models’ accuracy.

6. Q: Class error rate
A: The percentage of observations of a given class misclassified by a model in a data set.

7. Q: Classification confusion matrix
A: A matrix showing the counts of actual versus predicted class values.

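A small sketch (with made-up labels) that builds the matrix as counts of (actual, predicted) pairs; the overall and class error rates defined on earlier cards fall out of it directly:

```python
from collections import Counter

actual    = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
predicted = ["pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]

# Confusion matrix as counts of (actual, predicted) pairs.
confusion = Counter(zip(actual, predicted))
tp = confusion[("pos", "pos")]  # 3 true positives
fn = confusion[("pos", "neg")]  # 1 false negative
fp = confusion[("neg", "pos")]  # 1 false positive
tn = confusion[("neg", "neg")]  # 3 true negatives

overall_error_rate = (fn + fp) / len(actual)  # 2/8 = 0.25
pos_class_error = fn / actual.count("pos")    # 1/4 = 0.25 for class "pos"
print(overall_error_rate, pos_class_error)
```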
8. Q: Classification tree
A: A tree that classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules.

9. Q: Classification
A: A predictive data mining task requiring the prediction of an observation’s outcome class or category.

10. Q: Cumulative lift chart
A: A chart used to show how well a model identifies the observations most likely to be in a given class, compared with random classification.

11. Q: Cutoff value
A: The smallest predicted probability at which an observation is classified as belonging to a given class; observations with predicted probabilities at or above the cutoff are assigned to that class.

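In code, applying a cutoff is a one-line comparison (the probabilities below are made up). Lowering the cutoff classifies more observations into the given class, trading specificity for sensitivity:

```python
# Classify as "positive" when the predicted probability meets the cutoff.
cutoff = 0.5
probabilities = [0.91, 0.42, 0.50, 0.13, 0.77]

classes = ["positive" if p >= cutoff else "negative" for p in probabilities]
print(classes)  # ['positive', 'negative', 'positive', 'negative', 'positive']
```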
12. Q: Decile-wise lift chart
A: A chart used to present how well a model performs at identifying observations most likely to be in a given class, within each of the top k deciles, compared with a random selection.

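Both lift charts compare the class members a model finds among its highest-scored observations with what a random selection would find. A minimal cumulative-lift computation over made-up (probability, class) pairs:

```python
# (predicted probability, actual class) pairs -- illustrative data.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0)]

scored.sort(key=lambda t: t[0], reverse=True)           # rank by probability
overall_rate = sum(c for _, c in scored) / len(scored)  # 4/8 = 0.5

n = 4                                   # look at the top half of the ranking
found = sum(c for _, c in scored[:n])   # class members in the top n: 3
expected_random = n * overall_rate      # a random selection would expect 2.0
lift = found / expected_random
print(lift)  # 1.5 -> the model finds 1.5x as many as random selection
```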
13. Q: Ensemble method
A: A predictive data-mining approach in which a committee of individual classification or estimation models is generated and a prediction is made by combining the individual predictions.

14. Q: Estimation
A: A predictive data mining task requiring the prediction of an observation’s continuous outcome value.

15. Q: F1 score
A: A measure that combines precision and sensitivity (recall) into a single metric as their harmonic mean.

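Concretely, the F1 score is the harmonic mean of precision and recall; a sketch with made-up confusion-matrix counts:

```python
tp, fp, fn = 6, 2, 3  # illustrative counts

precision = tp / (tp + fp)                           # 6/8 = 0.75
recall    = tp / (tp + fn)                           # 6/9 ~ 0.667
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
print(round(f1, 3))  # 0.706
```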
16. Q: False negative
A: The misclassification of a positive observation as negative.

17. Q: False positive
A: The misclassification of a negative observation as positive.

18. Q: Features
A: A set of input variables used to predict an observation’s outcome class or continuous outcome value.

19. Q: Impurity
A: Measure of the heterogeneity of observations in a classification tree.

20. Q: K-nearest neighbor (K-NN)
A: A classification method that classifies an observation based on the classes of the k observations most similar or nearest to it.
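A tiny self-contained sketch of the idea with made-up 2-D training points, using squared Euclidean distance and a majority vote among the k nearest:

```python
from collections import Counter

train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B"), ((5.5, 4.5), "B")]

def knn_classify(point, train, k=3):
    """Majority vote among the k training observations nearest to `point`."""
    ranked = sorted(train, key=lambda obs: (obs[0][0] - point[0]) ** 2
                                         + (obs[0][1] - point[1]) ** 2)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((5.2, 5.0), train))  # B - all 3 nearest neighbors are B
print(knn_classify((1.2, 1.5), train))  # A - 2 of the 3 nearest are A
```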

21. Q: Logistic regression
A: A generalization of linear regression for predicting a categorical outcome variable.
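The link to linear regression is the logistic (sigmoid) function, which maps a linear combination of features to a probability. The coefficients below are made up for illustration, not fitted values:

```python
import math

b0, b1 = -4.0, 0.8  # hypothetical intercept and coefficient
x = 6.0             # a single feature value

log_odds = b0 + b1 * x                       # linear part: 0.8
probability = 1 / (1 + math.exp(-log_odds))  # logistic function
print(round(probability, 3))  # 0.69
```

The predicted probability can then be compared against a cutoff value to assign a class.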

22. Q: Mallows’ Cp statistic
A: A measure used in model selection; values that are small and approximately equal to the number of estimated coefficients suggest promising regression models.

23. Q: Model overfitting
A: A situation in which a model captures random patterns in the data on which it is trained rather than just the underlying relationships, so that its accuracy on the training set far exceeds its accuracy on new data.

24. Q: Observation (record)
A: A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database.

25. Q: Overall error rate
A: The percentage of observations misclassified by a model in a data set.

26. Q: Precision
A: The percentage of observations predicted to be in a given class that actually are in that class.

27. Q: Random forest
A: A variant of the bagging ensemble method that generates a committee of classification or regression trees based on different random samples but restricts each individual tree to a limited number of randomly selected features (variables).

28. Q: Receiver operating characteristic (ROC) curve
A: A chart used to illustrate the tradeoff between a model’s ability to identify a given class’s observations and its complement class’s error rate.

29. Q: Regression tree
A: A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules.

30. Q: Root mean square error
A: A measure of the accuracy of an estimation method, defined as the square root of the average of the squared deviations between the actual values and predicted values of observations.
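A direct translation of the formula into Python (made-up values):

```python
import math

actual    = [3.0, 5.0, 4.0, 6.0]
predicted = [2.0, 5.0, 6.0, 5.0]

# Mean squared deviation, then square root.
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(mse)
print(round(rmse, 3))  # 1.225
```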

31. Q: Sensitivity (recall)
A: The percentage of actual observations of a given class (usually the positive class) that a model correctly identifies.

32. Q: Specificity
A: The percentage of actual observations of a given class (usually the negative class) that a model correctly identifies.
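Both measures come straight from a 2x2 confusion matrix; a sketch with made-up counts:

```python
tp, fn = 8, 2   # actual positives: correctly / incorrectly classified
tn, fp = 15, 5  # actual negatives: correctly / incorrectly classified

sensitivity = tp / (tp + fn)  # 8/10  = 0.8  -> of the positive class
specificity = tn / (tn + fp)  # 15/20 = 0.75 -> of the negative class
print(sensitivity, specificity)
```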

33. Q: Supervised learning
A: Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.

34. Q: Test set
A: Data set used to compute an unbiased estimate of the final predictive model’s accuracy.

35. Q: Training set
A: Data used to build candidate predictive models.

36. Q: Validation set
A: Data used to evaluate candidate predictive models.
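The three data-set roles (training, validation, test) fit together as one partition of the available records. A sketch with a common, but not mandatory, 60/20/20 split of 100 made-up records:

```python
import random

random.seed(0)              # reproducible shuffle
records = list(range(100))  # stand-ins for 100 observations
random.shuffle(records)

train      = records[:60]    # build candidate models
validation = records[60:80]  # compare candidate models
test       = records[80:]    # unbiased accuracy estimate for the final model
print(len(train), len(validation), len(test))  # 60 20 20
```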

37. Q: Variable (feature)
A: A characteristic or quantity of interest that can take on different values.