Chapter 2 Flashcards
SEMMA (SAS)
Sample, Explore, Modify, Model, Assess
CRISP-DM (SPSS/IBM)
Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Supervised Learning
Goal: Predict a single “target” or “outcome” variable
Train the model on data where the target value is known
Then score (apply) the model to data where the target value is not known
Methods: Classification and Prediction
Unsupervised Learning
Goal: Segment data into meaningful groups; detect patterns
There is no target (outcome) variable to predict or classify
Methods: Association rules, data reduction & exploration, visualization
Supervised: Classification
Goal: Predict categorical target (outcome) variable
Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy…
Each row is a case (customer, tax return, applicant)
Each column is a variable
Target variable is often binary (yes/no)
Supervised: Prediction
Goal: Predict numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
Each row is a case (customer, tax return, applicant)
Each column is a variable
Taken together, classification and prediction constitute “predictive analytics”
Unsupervised: Association Rules
Goal: Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions
Used in recommender systems – “Our records show you bought X, you may also like Y”
Also called “affinity analysis”
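A minimal sketch of mining such rules with the arules package (the package choice and the transaction file name are assumptions, not from the chapter):
```r
# Mine association rules from basket-format transaction data (hypothetical file)
library(arules)
trans <- read.transactions("transactions.csv", format = "basket", sep = ",")
rules <- apriori(trans, parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(rules, by = "lift"), 5))  # show the 5 strongest rules by lift
```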
Unsupervised: Data Reduction
Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering)
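A minimal sketch of both kinds of reduction, using R's built-in mtcars data for illustration:
```r
# Fewer columns: principal components on standardized variables
pcs <- prcomp(mtcars, scale. = TRUE)
summary(pcs)                              # variance explained by each component
# Fewer rows: k-means clustering groups similar records together
km <- kmeans(scale(mtcars), centers = 3)
table(km$cluster)                         # size of each cluster
```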
Unsupervised: Data Visualization
Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables
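A minimal sketch with base-R graphics on the built-in mtcars data:
```r
hist(mtcars$mpg)                   # distribution of one numeric variable
boxplot(mpg ~ cyl, data = mtcars)  # numeric variable broken out by category
plot(mtcars$wt, mtcars$mpg)        # scatterplot: relationship between a pair
```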
Data Exploration
Uses the techniques of data reduction and data visualization
Steps in Data Mining
- Define/understand purpose
- Obtain data (may involve random sampling)
- Explore, clean, pre-process data
- Reduce the data; if supervised DM, partition it
- Specify task (classification, clustering, etc.)
- Choose the techniques (regression, CART, neural networks, etc.)
- Iterative implementation and “tuning”
- Assess results – compare models
- Deploy best model
Rare Event Oversampling
Often the event of interest is rare
Examples: response to mailing, fraud in taxes, …
Sampling may yield too few “interesting” cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set
Later, need to adjust results for the oversampling
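A minimal sketch of oversampling with replacement, assuming a hypothetical data frame df whose rare outcome is coded df$y == 1:
```r
set.seed(1)
rare   <- df[df$y == 1, ]                 # the scarce "interesting" cases
common <- df[df$y == 0, ]
# resample the rare cases (with replacement) up to the size of the common class
rare_over <- rare[sample(nrow(rare), nrow(common), replace = TRUE), ]
balanced  <- rbind(common, rare_over)
table(balanced$y)   # roughly 50/50; adjust results for this later
```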
Categorical
Ordered (low, medium, high)
Unordered (male, female)
Most algorithms require converting categories to binary dummies (number of dummies = number of categories - 1) [see Table 2.6 for R code]
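A minimal sketch of creating such dummies with model.matrix() (an illustration, not the book's Table 2.6 code):
```r
df <- data.frame(size = factor(c("low", "medium", "high", "medium"),
                               levels = c("low", "medium", "high")))
# model.matrix() builds k-1 dummies for a k-level factor (first level dropped)
dummies <- model.matrix(~ size, data = df)[, -1]  # drop the intercept column
dummies   # two dummy columns for three categories
```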
Numeric
Continuous: any value in a range; the magnitude itself is meaningful (e.g., 3.7, 20.5)
Integer
Most algorithms can handle numeric data
May occasionally need to “bin” into categories
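A minimal sketch of binning a numeric variable with cut(), using hypothetical values:
```r
income <- c(12, 35, 58, 74, 91)   # hypothetical numeric values
bins <- cut(income, breaks = c(0, 30, 60, 100),
            labels = c("low", "medium", "high"))
table(bins)   # counts per category
```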
Outlier
An observation that is "extreme", being distant from the rest of the data (the definition of "distant" is deliberately vague)
Detecting Outliers
An important step in data pre-processing is detecting outliers
Once detected, domain knowledge is required to determine if it is an error, or truly extreme.
In some contexts, finding outliers is the purpose of the DM exercise (airport security screening). This is called “anomaly detection”.
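A minimal sketch of one common rule of thumb (flagging values more than 3 standard deviations from the mean; the threshold is a judgment call, not from the chapter):
```r
set.seed(1)
x <- c(rnorm(100), 12)        # 100 typical values plus one extreme value
z <- (x - mean(x)) / sd(x)    # standardize
x[abs(z) > 3]                 # candidate outliers, to review with domain knowledge
```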
Handling Missing Data
Solution 1: Omission
If a small number of records have missing values, can omit them
If many records are missing values on a small set of variables, can drop those variables (or use proxies)
If many records have missing values, omission is not practical
Solution 2: Imputation [see Table 2.7 for R code]
Replace missing values with reasonable substitutes
Lets you keep the record and use the rest of its (non-missing) information
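A minimal sketch of median imputation (an illustration, not the book's Table 2.7 code), assuming a data frame df with missing values in a hypothetical income column:
```r
med <- median(df$income, na.rm = TRUE)    # reasonable substitute value
df$income[is.na(df$income)] <- med        # keep the records, fill the gaps
```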
Normalizing (Standardizing) Data
Used in some techniques when variables with the largest scales would dominate and skew results
Puts all variables on same scale
Normalizing function: Subtract mean and divide by standard deviation
Alternative function: scale to 0-1 by subtracting minimum and dividing by the range
Useful when the data contain both dummies and numeric variables
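A minimal sketch of both rescalings, assuming a numeric data frame df:
```r
z <- scale(df)   # z-score: subtract each column's mean, divide by its sd
# min-max: rescale each column to the 0-1 range
rng01 <- apply(df, 2, function(x) (x - min(x)) / (max(x) - min(x)))
```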
Overfitting
Statistical models can produce highly complex explanations of relationships between variables
The “fit” may be excellent
When applied to new data, however, highly complex models often perform poorly.
Causes:
- Too many predictors
- A model with too many parameters
- Trying many different models
Consequence: Deployed model will not work as well as expected with completely new data.
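A minimal sketch of the phenomenon: a 15th-degree polynomial fits its training points almost perfectly but typically predicts new data from the same process worse than a simple line:
```r
set.seed(1)
x    <- runif(30); y    <- 2 * x    + rnorm(30, sd = 0.3)   # training data
xnew <- runif(30); ynew <- 2 * xnew + rnorm(30, sd = 0.3)   # completely new data
simple  <- lm(y ~ x)             # matches the true straight-line process
complex <- lm(y ~ poly(x, 15))   # far more parameters than the data support
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
rmse(ynew, predict(simple,  data.frame(x = xnew)))  # near the noise level
rmse(ynew, predict(complex, data.frame(x = xnew)))  # typically much larger
```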
Training partition
Training partition to develop the model
Validation partition
Validation partition to assess the model's performance on "new" data and to compare/tune models
Test partition
Test partition to give an unbiased estimate of the chosen model's performance on new data
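A minimal sketch of a 50/30/20 three-way split, assuming a data frame df:
```r
set.seed(1)
n   <- nrow(df)
idx <- sample(n)                                         # shuffle the row indices
train <- df[idx[1:floor(0.5 * n)], ]                     # 50% to develop the model
valid <- df[idx[(floor(0.5 * n) + 1):floor(0.8 * n)], ]  # 30% to compare models
test  <- df[idx[(floor(0.8 * n) + 1):n], ]               # 20% for the unbiased estimate
```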
RMSE
RMSE = root-mean-squared error = the square root of the average squared prediction error: sqrt(mean((predicted - actual)^2))
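A minimal sketch of computing it directly:
```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(c(3, 5, 7), c(2.5, 5.5, 8))   # errors 0.5, -0.5, -1  ->  about 0.71
```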