Chapter 2 Flashcards
SEMMA (SAS)
Sample, Explore, Modify, Model, Assess
CRISP-DM (SPSS/IBM)
Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Supervised Learning
Goal: Predict a single “target” or “outcome” variable
Train the model on data where the target value is known
Then score (apply) the model to data where the target value is not known
Methods: Classification and Prediction
Unsupervised Learning
Goal: Segment data into meaningful groups; detect patterns
There is no target (outcome) variable to predict or classify
Methods: Association rules, data reduction & exploration, visualization
Supervised: Classification
Goal: Predict categorical target (outcome) variable
Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy…
Each row is a case (customer, tax return, applicant)
Each column is a variable
Target variable is often binary (yes/no)
Supervised: Prediction
Goal: Predict numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
Each row is a case (customer, tax return, applicant)
Each column is a variable
Taken together, classification and prediction constitute “predictive analytics”
Unsupervised: Association Rules
Goal: Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions
Used in recommender systems – “Our records show you bought X, you may also like Y”
Also called “affinity analysis”
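A minimal sketch of mining such rules with the arules package (the package choice and the transaction file name are assumptions, not from the chapter):
```r
# Mine association rules from basket-format transaction data (hypothetical file)
library(arules)
trans <- read.transactions("transactions.csv", format = "basket", sep = ",")
rules <- apriori(trans, parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(rules, by = "lift"), 5))  # show the 5 strongest rules by lift
```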
Unsupervised: Data Reduction
Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering)
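A minimal sketch of both kinds of reduction, using R's built-in mtcars data for illustration:
```r
# Fewer columns: principal components on standardized variables
pcs <- prcomp(mtcars, scale. = TRUE)
summary(pcs)                              # variance explained by each component
# Fewer rows: k-means clustering groups similar records together
km <- kmeans(scale(mtcars), centers = 3)
table(km$cluster)                         # size of each cluster
```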
Unsupervised: Data Visualization
Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables
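A minimal sketch with base-R graphics on the built-in mtcars data:
```r
hist(mtcars$mpg)                   # distribution of one numeric variable
boxplot(mpg ~ cyl, data = mtcars)  # numeric variable broken out by category
plot(mtcars$wt, mtcars$mpg)        # scatterplot: relationship between a pair
```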
Data Exploration
Uses the techniques of data reduction and data visualization
Steps in Data Mining
- Define/understand purpose
- Obtain data (may involve random sampling)
- Explore, clean, pre-process data
- Reduce the data; if supervised DM, partition it
- Specify task (classification, clustering, etc.)
- Choose the techniques (regression, CART, neural networks, etc.)
- Iterative implementation and “tuning”
- Assess results – compare models
- Deploy best model
Rare Event Oversampling
Often the event of interest is rare
Examples: response to mailing, fraud in taxes, …
Sampling may yield too few “interesting” cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set
Later, need to adjust results for the oversampling
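A minimal sketch of oversampling with replacement, assuming a hypothetical data frame df whose rare outcome is coded df$y == 1:
```r
set.seed(1)
rare   <- df[df$y == 1, ]                 # the scarce "interesting" cases
common <- df[df$y == 0, ]
# resample the rare cases (with replacement) up to the size of the common class
rare_over <- rare[sample(nrow(rare), nrow(common), replace = TRUE), ]
balanced  <- rbind(common, rare_over)
table(balanced$y)   # roughly 50/50; adjust results for this later
```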
Categorical
Ordered (low, medium, high)
Unordered (male, female)
Most algorithms require converting categories to binary dummies (number of dummies = number of categories - 1) [see Table 2.6 for R code]
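A minimal sketch of creating such dummies with model.matrix() (an illustration, not the book's Table 2.6 code):
```r
df <- data.frame(size = factor(c("low", "medium", "high", "medium"),
                               levels = c("low", "medium", "high")))
# model.matrix() builds k-1 dummies for a k-level factor (first level dropped)
dummies <- model.matrix(~ size, data = df)[, -1]  # drop the intercept column
dummies   # two dummy columns for three categories
```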
Numeric
Continuous: any value in a range; the magnitude itself is meaningful (e.g., 3.7, 20.5)
Integer
Most algorithms can handle numeric data
May occasionally need to “bin” into categories
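A minimal sketch of binning a numeric variable with cut(), using hypothetical values:
```r
income <- c(12, 35, 58, 74, 91)   # hypothetical numeric values
bins <- cut(income, breaks = c(0, 30, 60, 100),
            labels = c("low", "medium", "high"))
table(bins)   # counts per category
```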
Outlier
An observation that is "extreme", being distant from the rest of the data (the definition of "distant" is deliberately vague)
Detecting Outliers
An important step in data pre-processing is detecting outliers
Once detected, domain knowledge is required to determine if it is an error, or truly extreme.
In some contexts, finding outliers is the purpose of the DM exercise (airport security screening). This is called “anomaly detection”.
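A minimal sketch of one common rule of thumb (flagging values more than 3 standard deviations from the mean; the threshold is a judgment call, not from the chapter):
```r
set.seed(1)
x <- c(rnorm(100), 12)        # 100 typical values plus one extreme value
z <- (x - mean(x)) / sd(x)    # standardize
x[abs(z) > 3]                 # candidate outliers, to review with domain knowledge
```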
Handling Missing Data
Solution 1: Omission
If a small number of records have missing values, can omit them
If many records are missing values on a small set of variables, can drop those variables (or use proxies)
If many records have missing values, omission is not practical
Solution 2: Imputation [see Table 2.7 for R code]
Replace missing values with reasonable substitutes
Lets you keep the record and use the rest of its (non-missing) information
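A minimal sketch of median imputation (an illustration, not the book's Table 2.7 code), assuming a data frame df with missing values in a hypothetical income column:
```r
med <- median(df$income, na.rm = TRUE)    # reasonable substitute value
df$income[is.na(df$income)] <- med        # keep the records, fill the gaps
```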
Normalizing (Standardizing) Data
Used in some techniques when variables with the largest scales would dominate and skew results
Puts all variables on same scale
Normalizing function: Subtract mean and divide by standard deviation
Alternative function: scale to 0-1 by subtracting minimum and dividing by the range
Useful when the data contain both dummies and numeric variables
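A minimal sketch of both rescalings, assuming a numeric data frame df:
```r
z <- scale(df)   # z-score: subtract each column's mean, divide by its sd
# min-max: rescale each column to the 0-1 range
rng01 <- apply(df, 2, function(x) (x - min(x)) / (max(x) - min(x)))
```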
Overfitting
Statistical models can produce highly complex explanations of relationships between variables
The “fit” may be excellent
When applied to new data, however, highly complex models often perform poorly.
Causes:
- Too many predictors
- A model with too many parameters
- Trying many different models
Consequence: Deployed model will not work as well as expected with completely new data.
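A minimal sketch of the phenomenon: a 15th-degree polynomial fits its training points almost perfectly but typically predicts new data from the same process worse than a simple line:
```r
set.seed(1)
x    <- runif(30); y    <- 2 * x    + rnorm(30, sd = 0.3)   # training data
xnew <- runif(30); ynew <- 2 * xnew + rnorm(30, sd = 0.3)   # completely new data
simple  <- lm(y ~ x)             # matches the true straight-line process
complex <- lm(y ~ poly(x, 15))   # far more parameters than the data support
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
rmse(ynew, predict(simple,  data.frame(x = xnew)))  # near the noise level
rmse(ynew, predict(complex, data.frame(x = xnew)))  # typically much larger
```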
Training partition
Training partition to develop the model
Validation partition
Validation partition to assess the model's performance on "new" data and to compare/tune models
Test partition
Test partition to give an unbiased estimate of the chosen model's performance on new data
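A minimal sketch of a 50/30/20 three-way split, assuming a data frame df:
```r
set.seed(1)
n   <- nrow(df)
idx <- sample(n)                                         # shuffle the row indices
train <- df[idx[1:floor(0.5 * n)], ]                     # 50% to develop the model
valid <- df[idx[(floor(0.5 * n) + 1):floor(0.8 * n)], ]  # 30% to compare models
test  <- df[idx[(floor(0.8 * n) + 1):n], ]               # 20% for the unbiased estimate
```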
RMSE
RMSE = root-mean-squared error = the square root of the average squared prediction error: sqrt(mean((predicted - actual)^2))
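A minimal sketch of computing it directly:
```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(c(3, 5, 7), c(2.5, 5.5, 8))   # errors 0.5, -0.5, -1  ->  about 0.71
```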