CRISP-DM Flashcards by Mari Weyand

What are the stages of CRISP-DM?

Project Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

What happens in the Project Understanding Phase?

Problem formulation
Mapping problem to a data mining task
Understanding resources

Define Problem Formulation

Defining:
-objectives/deliverables
-criteria for success
-potential benefits/risks
-constraints/assumptions

for a data analysis project.

Problem Formulation stage - time used vs importance

Time: 20%
Importance: 80%

What does ‘mapping problem to data mining task’ mean?

Part of the Project Understanding Phase.
Choosing which data analysis method to use, for example classification, regression, deviation analysis, or association analysis.

What can be done to reduce communication problems in Project Understanding phase?

For example:
Cognitive mapping
Rephrasing

What are cognitive maps?

A visual representation of a problem that supports domain understanding.

Nodes are domain variables, arrows show direction and type of influence between nodes.
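
A minimal sketch of how such a map could be stored, assuming a hypothetical customer-churn problem. The variable names and the "+"/"-" encoding of the influence type are illustrative assumptions, not part of the original cards.

```python
# Hypothetical cognitive map: nodes are domain variables,
# each arrow (source, target) carries the type of influence.
cognitive_map = {
    ("price", "customer_satisfaction"): "-",
    ("support_quality", "customer_satisfaction"): "+",
    ("customer_satisfaction", "churn"): "-",
}

def influences_on(target, edges):
    """Return the variables with an arrow pointing directly at `target`, with sign."""
    return {src: sign for (src, dst), sign in edges.items() if dst == target}

def nodes(edges):
    """Collect every variable that appears at either end of an arrow."""
    return sorted({name for pair in edges for name in pair})

direct_causes = influences_on("customer_satisfaction", cognitive_map)
```

Listing the direct influences on one node is a quick way to check with a domain expert whether the map matches their understanding.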

What happens in the Data Understanding Phase?

Collecting the data
Using exploratory data analysis to get familiar with data and gain insights.
Evaluating data quality
Selecting interesting subsets that may contain patterns.

What kind of goals can a data mining project have?

Type of data mining task (regression, classification, clustering)
Predictive accuracy, model flexibility, interpretability, run-time

What types of methods are available for solving different types of problems?

Classification, regression, clustering, association analysis, neural networks.

How can different requirements or desirable properties affect our choice of method?

If it needs to be highly interpretable, neural networks wouldn’t be a good fit. Some methods take more computational resources.

How can a project go wrong from the very start?

If the data does not support the project goals.

What is the main goal of Data Understanding phase?

Rule of thumb for Data Understanding?

Never trust any data before some plausibility tests!
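
A plausibility test can be as simple as counting missing and out-of-range values. The sketch below assumes records stored as dictionaries, with `None` marking a missing value; the function name, the sample records, and the allowed range are hypothetical.

```python
def plausibility_report(rows, column, low, high):
    """Count missing values and list out-of-range values for one numeric column."""
    missing = sum(1 for r in rows if r.get(column) is None)
    out_of_range = [
        r[column] for r in rows
        if r.get(column) is not None and not (low <= r[column] <= high)
    ]
    return {"missing": missing, "out_of_range": out_of_range}

# Hypothetical records: an age of 430 or -1 is most likely an entry error.
records = [{"age": 34}, {"age": None}, {"age": 430}, {"age": 27}, {"age": -1}]
report = plausibility_report(records, "age", low=0, high=120)
```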

What types of attributes are there?

Can categorical attributes be represented as numerical?

What are some typical problems with data quality?

What are some data visualization techniques and what can they reveal?

What does z-score standardization do?

Why might you need z-score standardization?

What is dimensionality reduction?

Compare dimensionality reduction to feature selection.

How to select the number of principal components to use?

What are outliers?

How can outliers be detected?

What are missing values?

Are some types of missing values more difficult to deal with than others?

When does data preparation happen?

Some of it has to happen already in the Data Understanding phase; most of it happens in the Data Preparation phase, in a more principled manner.

What happens in Data Preparation?

Feature extraction
Data quality problems are dealt with
Normalization of features
Dimensionality reduction
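
As one illustration of feature normalization, here is a minimal sketch of z-score standardization using only the Python standard library. It assumes the population standard deviation is used; the sample values are made up.

```python
from statistics import mean, pstdev

def z_score(values):
    """Rescale values so they have mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption of this sketch)
    if sigma == 0:
        raise ValueError("constant feature: z-score is undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical feature measured in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = z_score(heights_cm)
```

After scaling, features measured in different units become comparable, which matters for distance-based methods such as nearest neighbor.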

What is feature extraction?

Why do feature selection?

Why can it be beneficial to normalize feature values?

Overall problem types

Classification, regression, clustering, association analysis.

Model Structures

Nearest Neighbor predictor, linear model, if-else rule statements, decision trees, neural network, clustering.

What is the goal of a machine learning algorithm?

To find a good model that fits the problem.

How to choose between different models?

For example, cross-validation.
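
A minimal sketch of k-fold cross-validation used to compare two candidate models, written with the standard library only. The two models (a mean baseline and a least-squares line), the helper names, and the synthetic data are assumptions made for illustration. Folds are contiguous here for simplicity; real data would normally be shuffled first.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error of `fit` over k folds."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)
        fold_mse = sum((predict(xs[i]) - ys[i]) ** 2 for i in test_idx) / len(test_idx)
        errors.append(fold_mse)
    return sum(errors) / len(errors)

def fit_mean(train_x, train_y):
    """Baseline model: always predict the training mean."""
    m = sum(train_y) / len(train_y)
    return lambda x: m

def fit_line(train_x, train_y):
    """Least-squares straight line through the training points."""
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    sxx = sum((x - mx) ** 2 for x in train_x)
    slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / sxx
    return lambda x: my + slope * (x - mx)

# Synthetic data: a linear trend with a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

scores = {"mean": cross_val_mse(fit_mean, xs, ys), "line": cross_val_mse(fit_line, xs, ys)}
best = min(scores, key=scores.get)  # model with the lowest held-out error
```

Each model is scored only on data it did not see during fitting, so the comparison reflects generalization rather than fit to the training set.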

What two criteria need to be balanced in learning?

1) Fit to data (low error) 2) Model complexity

Define Overfitting

The model is too complex, and may simply model noise in data.

Underfitting

The model is too simple and may not capture the concept.

The more data you have, the more --

-- complex models you can afford to use.

Many theoretical approaches to defining what we mean by complex:

Regularization theory, minimum description length principle, Bayesian priors.
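
One common way to write the balance between fit and complexity, in the spirit of regularization, is a penalized objective. The symbols $L$, $\Omega$, and $\lambda$ are notation assumed here, not taken from the cards.

```latex
\hat{f} = \arg\min_{f} \; \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr) + \lambda \, \Omega(f)
```

The first term measures fit to the data, $\Omega(f)$ measures model complexity, and $\lambda \ge 0$ sets the trade-off between the two.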

What does a hyperparameter do?

Examples of hyperparameters?

-The maximum degree of the polynomial in polynomial regression
-The number of neighbors in the k-nearest neighbor method
-The regularization parameter value for many methods (regularized linear regression models, support vector machines, ...)
-The depth of a decision tree
-The number of layers and the number of neurons per hidden layer in neural networks

You might also be comparing models produced by different algorithms altogether.
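
A minimal sketch of choosing one such hyperparameter, the number of neighbors k in a k-nearest neighbor regressor, by comparing errors on a held-out validation set. The data, the candidate values, and the helper names are assumptions made for illustration.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, data, k):
    """Mean squared error of the k-NN predictor on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

# Hypothetical data: y = x^2 plus a small deterministic disturbance.
points = []
for i in range(40):
    x = i / 2
    noise = 0.15 * ((7 * i) % 5 - 2)
    points.append((x, x * x + noise))

train = points[0::2]       # used to make predictions
validation = points[1::2]  # used only to choose k

candidates = [1, 2, 3, 5, 10, 20]
val_error = {k: mse(train, validation, k) for k in candidates}
best_k = min(val_error, key=val_error.get)

# With k = 1 every training point is its own nearest neighbor,
# so the training error is zero regardless of how well the model generalizes.
train_error_k1 = mse(train, train, 1)
```

With k = 20 the predictor averages all 20 training points and returns a constant, which is the underfitting extreme; k = 1 is the opposite extreme. The validation error is what distinguishes useful settings from both.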