Chapter 2 Flashcards


1
Q

SEMMA

A
Sample
Explore
Modify
Model
Assess
2
Q

CRISP-DM (SPSS/IBM)

A
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
3
Q

Supervised Learning

A

Goal: Predict a single "target" or "outcome" variable

The model is trained on data where the target value is known

It is then applied ("scored") to data where the target value is unknown

Methods: Classification and Prediction

4
Q

Unsupervised Learning

A

Goal: Segment data into meaningful groups; detect patterns

There is no target (outcome) variable to predict or classify

Methods: Association rules, data reduction & exploration, visualization

5
Q

Supervised: Classification

A

Goal: Predict categorical target (outcome) variable
Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy…
Each row is a case (customer, tax return, applicant)
Each column is a variable
Target variable is often binary (yes/no)

6
Q

Supervised: Prediction

A

Goal: Predict numerical target (outcome) variable
Examples: sales, revenue, performance
As in classification:
Each row is a case (customer, tax return, applicant)
Each column is a variable
Taken together, classification and prediction constitute “predictive analytics”

7
Q

Unsupervised: Association Rules

A

Goal: Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions
Used in recommender systems – “Our records show you bought X, you may also like Y”
Also called “affinity analysis”
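As a toy illustration of "what goes with what", pair co-occurrence can be counted directly across transactions (the baskets below are made up; real association-rule mining uses algorithms such as Apriori with support and confidence thresholds):

```python
from itertools import combinations
from collections import Counter

# Each row is a transaction: the set of items bought together
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

# Count how often each unordered pair of items appears together
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Confidence of the rule "if bread was purchased, milk was also purchased":
# P(milk | bread) = #(bread and milk together) / #(bread)
bread_count = sum("bread" in b for b in transactions)
confidence = pair_counts[("bread", "milk")] / bread_count  # 2/3
```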

8
Q

Unsupervised: Data Reduction

A

Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering)

9
Q

Unsupervised: Data Visualization

A

Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables

10
Q

Data Exploration

A

Uses the techniques of data reduction and visualization

11
Q

Steps in Data Mining

A
  1. Define/understand purpose
  2. Obtain data (may involve random sampling)
  3. Explore, clean, pre-process data
  4. Reduce the data; if supervised DM, partition it
  5. Specify task (classification, clustering, etc.)
  6. Choose the techniques (regression, CART, neural networks, etc.)
  7. Iterative implementation and “tuning”
  8. Assess results – compare models
  9. Deploy best model
12
Q

Rare Event Oversampling

A

Often the event of interest is rare
Examples: response to a mailing, tax fraud, …
A random sample may yield too few "interesting" cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set
Results must later be adjusted to compensate for the oversampling
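A minimal sketch of the oversampling step, using made-up records: the rare class is sampled with replacement until the training set is balanced.

```python
import random

random.seed(1)

# Hypothetical imbalanced training data: y = 1 is the rare event (e.g., fraud)
records = [{"y": 0}] * 95 + [{"y": 1}] * 5

rare = [r for r in records if r["y"] == 1]
common = [r for r in records if r["y"] == 0]

# Oversample the rare class (sampling with replacement) to balance the set
oversampled_rare = random.choices(rare, k=len(common))
balanced = common + oversampled_rare
```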

13
Q

Categorical

A

Ordered (low, medium, high)
Unordered (male, female)
Most algorithms require converting categories into binary dummies (number of dummies = number of categories − 1) [see Table 2.6 for R code]
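Table 2.6 gives the R code; a minimal Python sketch of the same k − 1 dummy-coding idea (category names are made up, with "low" as the reference level that gets no column):

```python
# A categorical variable and its levels; "low" is the reference category
values = ["low", "medium", "high", "medium", "low"]
levels = ["low", "medium", "high"]

# One binary dummy per non-reference level: k categories -> k - 1 dummies
dummies = [
    {f"is_{lvl}": int(v == lvl) for lvl in levels[1:]}
    for v in values
]
```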

14
Q

Numeric

A

Continuous – the number itself is meaningful (values can be compared and differenced)
Integer
Most algorithms can handle numeric data
May occasionally need to “bin” into categories
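"Binning" a numeric variable into categories can be sketched with a few cut points (the cut points and labels below are made up for illustration):

```python
import bisect

# Boundaries between bins, and one label per resulting bin
cut_points = [18, 35, 65]
labels = ["minor", "young", "middle", "senior"]

def bin_value(x):
    """Map a numeric value to its category label via the cut points."""
    return labels[bisect.bisect_right(cut_points, x)]

ages = [12, 25, 40, 70]
binned = [bin_value(a) for a in ages]
```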

15
Q

outlier

A

an observation that is “extreme”, being distant from the rest of the data (definition of “distant” is deliberately vague)

16
Q

Detecting Outliers

A

An important step in data pre-processing is detecting outliers
Once detected, domain knowledge is required to determine if it is an error, or truly extreme.
In some contexts, finding outliers is the purpose of the DM exercise (airport security screening). This is called “anomaly detection”.
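One conventional screening rule (a standard convention, not specified in this card) flags points beyond 1.5 × IQR from the quartiles; a minimal sketch with made-up data:

```python
import statistics

data = [10, 12, 11, 13, 12, 14, 100]

# Quartiles and interquartile range (IQR)
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# 1.5*IQR rule: flag points far outside the middle 50% of the data
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
```

Whether a flagged point like 100 is an error or a genuine extreme still requires domain knowledge, as the card notes.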

17
Q

Handling Missing Data

A

Solution 1: Omission
If a small number of records have missing values, can omit them
If many records are missing values on a small set of variables, can drop those variables (or use proxies)
If many records have missing values, omission is not practical
Solution 2: Imputation [see Table 2.7 for R code]
Replace missing values with reasonable substitutes
Lets you keep the record and use the rest of its (non-missing) information
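Table 2.7 gives the R code; a minimal Python sketch of imputation with a made-up column, substituting the median of the observed values:

```python
import statistics

# Hypothetical column with missing values marked as None
income = [52.0, None, 61.5, 48.0, None, 55.0]

# Impute each missing value with the median of the observed values
observed = [x for x in income if x is not None]
median = statistics.median(observed)
imputed = [x if x is not None else median for x in income]
```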

18
Q

Normalizing (Standardizing) Data

A

Used in some techniques when variables with the largest scales would dominate and skew results
Puts all variables on same scale
Normalizing function: Subtract mean and divide by standard deviation
Alternative function: scale to 0-1 by subtracting minimum and dividing by the range
Useful when the data contain both dummy and numeric variables
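Both functions can be sketched in a few lines (the values are made up; the first is the z-score normalization the card describes, the second the 0–1 alternative):

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0]

# z-score: subtract the mean and divide by the standard deviation
mean = statistics.mean(values)
sd = statistics.stdev(values)
z_scores = [(x - mean) / sd for x in values]

# alternative: scale to 0-1 by subtracting the minimum, dividing by the range
lo, hi = min(values), max(values)
scaled = [(x - lo) / (hi - lo) for x in values]
```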

19
Q

Overfitting

A

Statistical models can produce highly complex explanations of relationships between variables
The "fit" to the training data may be excellent
But when applied to new data, highly complex models often perform poorly

Causes:

  • Too many predictors
  • A model with too many parameters
  • Trying many different models

Consequence: Deployed model will not work as well as expected with completely new data.

20
Q

Training partition

A

The training partition is used to develop (fit) the model

21
Q

Validation partition

A

The validation partition is used to implement the model and evaluate its performance on "new" data

22
Q

Test partition

A

The test partition is used to give an unbiased estimate of the model's performance on new data
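Taken together, the three partitions come from one random split of the rows; a minimal sketch (the 50/30/20 proportions are an assumption for illustration, not fixed by the text):

```python
import random

random.seed(42)

records = list(range(100))      # hypothetical row indices
random.shuffle(records)

# Split the shuffled rows: 50% training, 30% validation, 20% test
n = len(records)
train = records[: int(0.5 * n)]
valid = records[int(0.5 * n): int(0.8 * n)]
test = records[int(0.8 * n):]
```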

23
Q

RMSE

A

RMSE = Root-mean-squared error = Square root of average squared error
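A minimal sketch of the computation, with made-up actual and predicted values:

```python
import math

actual    = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

# RMSE: square root of the average squared error
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
```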