Chapter 2 Flashcards
Data Mining Process Models
SEMMA
Sample, Explore, Modify, Model, Assess
CRISP-DM (SPSS/IBM)
Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Supervised Learning
Goal: Predict a single “target” or “outcome” variable
Training data, where target value is known
Score to data where value is not known
Methods: Classification and Prediction
Unsupervised Learning
Goal: Group data into meaningful segments; detect patterns
There is no target (outcome) variable to predict or classify
Methods: Association rules, data reduction & exploration, visualization
Supervised: Classification
Goal: Predict categorical target (outcome) variable
Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy…
Each row is a case (customer, tax return, applicant)
Each column is a variable
Target variable is often binary (yes/no)
Supervised: Prediction
Goal: Predict numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
Each row is a case (customer, tax return, applicant)
Each column is a variable
Taken together, classification and prediction constitute “predictive analytics”
Unsupervised: Association Rules
Goal: Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions
Used in recommender systems – “Our records show you bought X, you may also like Y”
Also called “affinity analysis”
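As a minimal sketch (using a hypothetical four-basket dataset invented for illustration), the support and confidence of a rule such as "if milk, then bread" can be computed directly:

```python
# Toy transactions: each row is one market basket
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: "if milk was purchased, bread was also purchased"
sup = support({"milk", "bread"}, transactions)   # 2 of 4 baskets
conf = confidence({"milk"}, {"bread"}, transactions)
```

Rules with both high support and high confidence are the ones worth reporting; real association-rule software (e.g., the Apriori algorithm) searches these combinations efficiently rather than by brute force.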
Unsupervised: Data Reduction
Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering)
Unsupervised: Data Visualization
Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables
Data Exploration
Use techniques of Reduction and Visualization
Steps in Data Mining
- Define/understand purpose
- Obtain data (may involve random sampling)
- Explore, clean, pre-process data
- Reduce the data; if supervised DM, partition it
- Specify task (classification, clustering, etc.)
- Choose the techniques (regression, CART, neural networks, etc.)
- Iterative implementation and “tuning”
- Assess results – compare models
- Deploy best model
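The partitioning step for supervised data mining can be sketched as a random split. The 60/40 training/validation proportion below is an illustrative assumption, not a rule:

```python
import random

records = list(range(100))   # stand-in for 100 data rows
random.seed(1)               # fixed seed for a reproducible shuffle

shuffled = records[:]
random.shuffle(shuffled)

cut = int(0.6 * len(shuffled))            # 60% training
train, valid = shuffled[:cut], shuffled[cut:]
```

Shuffling before splitting matters: if the data are sorted (e.g., by date or by outcome), a naive head/tail split would give the model an unrepresentative training set.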
Rare Event Oversampling
Often the event of interest is rare
Examples: response to mailing, fraud in taxes, …
Sampling may yield too few "interesting" cases to train a model effectively
A popular solution: oversample the rare cases to obtain a more balanced training set
Later, need to adjust results for the oversampling
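A minimal sketch of oversampling, assuming a hypothetical dataset of 95 common cases (label 0) and 5 rare cases (label 1); the rare class is resampled with replacement until the two classes are balanced:

```python
import random

random.seed(7)
# Hypothetical labels: 95 "no fraud" (0), 5 "fraud" (1)
cases = [0] * 95 + [1] * 5
rare = [c for c in cases if c == 1]
common = [c for c in cases if c == 0]

# Oversample the rare class (sampling WITH replacement) to match the common class
oversampled_rare = random.choices(rare, k=len(common))
balanced = common + oversampled_rare
```

Because the training set no longer reflects the true class proportions, predicted probabilities and error rates must later be adjusted back for the oversampling, as the card notes.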
Categorical
Ordered (ordinal): low, medium, high
Unordered (nominal): male, female
Most algorithms require categorical variables to be converted to binary dummies (number of dummies = number of categories – 1) [see Table 2.6 for R code]
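The textbook's R version is in Table 2.6; as a language-neutral sketch, dummy coding with k – 1 dummies (one level held out as the reference) can be done by hand in Python (the `is_*` column names here are invented for illustration):

```python
# Hypothetical ordinal variable with 3 categories
values = ["low", "medium", "high", "medium", "low"]

levels = sorted(set(values))   # ['high', 'low', 'medium']
reference = levels[0]          # 'high' is dropped as the reference level
dummy_levels = levels[1:]      # k - 1 = 2 dummies: 'low' and 'medium'

# One dict of 0/1 indicators per record; the reference level is all zeros
dummies = [
    {f"is_{lvl}": int(v == lvl) for lvl in dummy_levels}
    for v in values
]
```

The reference level is encoded implicitly as "all dummies zero," which is why only k – 1 columns are needed; including all k would make the columns linearly dependent (a problem for regression-type models).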
Numeric
Continuous - the magnitude of the value is meaningful (not merely an ordering such as 1st, 2nd, 3rd)
Integer
Most algorithms can handle numeric data
May occasionally need to “bin” into categories
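Binning can be sketched with hand-chosen cut points; the edges and labels below are arbitrary assumptions for illustration:

```python
# Hypothetical numeric variable and bin boundaries
ages = [15, 22, 37, 45, 68]
edges = [18, 40, 65]                          # assumed cut points
labels = ["minor", "young", "middle", "senior"]  # one more label than edge

def bin_value(x, edges, labels):
    """Return the label of the first bin whose upper edge exceeds x."""
    for edge, label in zip(edges, labels):
        if x < edge:
            return label
    return labels[-1]   # above the last edge

binned = [bin_value(a, ages_edges, labels) if False else bin_value(a, edges, labels) for a in ages]
```

Choosing the cut points is itself a modeling decision: equal-width bins, equal-frequency (quantile) bins, and domain-driven bins like these can all give different results.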
Outlier
An observation that is "extreme," lying far from the rest of the data (the definition of "distant" is deliberately vague)
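One simple (and deliberately crude) way to operationalize "distant" is a standard-deviation cutoff. The 2-standard-deviation threshold below is an arbitrary choice, and more robust definitions (e.g., based on the median and interquartile range) are often preferred because extreme values inflate the mean and standard deviation themselves:

```python
from statistics import mean, stdev

# Hypothetical data with one obviously extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 95]

m, s = mean(data), stdev(data)   # sample mean and standard deviation
outliers = [x for x in data if abs(x - m) > 2 * s]
```

Flagged points should be inspected, not automatically deleted: an outlier may be a data-entry error, but it may also be the most interesting case in the dataset (e.g., the fraud you are trying to detect).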