CRISP-DM Flashcards

1
Q

What are the stages of CRISP-DM?

A

Project Understanding (called Business Understanding in the CRISP-DM standard)
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

2
Q

What happens in the Project Understanding Phase?

A

Problem formulation
Mapping problem to a data mining task
Understanding resources

3
Q

Define Problem Formulation

A

Defining, for a data analysis project:
-objectives/deliverables
-criteria for success
-potential benefits/risks
-constraints/assumptions

4
Q

Problem Formulation stage - time used vs importance

A

Time: 20%
Importance: 80%

5
Q

What does ‘mapping problem to data mining task’ mean?

A

Part of the Project Understanding Phase.
Choosing which data analysis method to use: classification, regression, deviation analysis, association analysis.

6
Q

What can be done to reduce communication problems in the Project Understanding phase?

A

For example:
Cognitive mapping
Rephrasing

7
Q

What are cognitive maps?

A

A visual representation of a problem that supports domain understanding.

Nodes are domain variables, arrows show direction and type of influence between nodes.
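
A cognitive map can also be written down as a small signed, directed graph. The sketch below is only an illustration: the variable names (price, service_quality, customer_satisfaction, churn) and the +1/-1 sign encoding are assumptions, not part of the original card.

```python
# Hypothetical cognitive map for an imagined customer-churn project.
# Each edge is (source, target, sign): +1 = "increases", -1 = "decreases".
edges = [
    ("price", "customer_satisfaction", -1),
    ("service_quality", "customer_satisfaction", +1),
    ("customer_satisfaction", "churn", -1),
]

def influences_on(target, edges):
    """Return the direct influences on `target` as {source: sign}."""
    return {src: sign for src, dst, sign in edges if dst == target}

def path_sign(path, edges):
    """Multiply the edge signs along a path of nodes to get the net direction."""
    lookup = {(src, dst): sign for src, dst, sign in edges}
    sign = 1
    for src, dst in zip(path, path[1:]):
        sign *= lookup[(src, dst)]
    return sign

direct = influences_on("customer_satisfaction", edges)
net = path_sign(["price", "customer_satisfaction", "churn"], edges)
```

Here `net` is positive: in this made-up map a higher price lowers satisfaction, and lower satisfaction raises churn.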

8
Q

What happens in the Data Understanding Phase?

A

Collecting the data
Using exploratory data analysis to get familiar with data and gain insights.
Evaluating data quality
Selecting interesting subsets that may contain patterns.

9
Q

What kind of goals can a data mining project have?

A

Type of data mining task (regression, classification, clustering)
Predictive accuracy, model flexibility, interpretability, run-time

10
Q

What types of methods are available for solving different types of problems?

A

Classification, regression, clustering, association analysis, neural networks.

11
Q

How can different requirements or desirable properties affect our choice of method?

A

If it needs to be highly interpretable, neural networks wouldn’t be a good fit. Some methods take more computational resources.

12
Q

How can a project go wrong from the very start?

A

If the data does not support the project goals.

13
Q

What is the main goal of the Data Understanding phase?

A
14
Q

Rule of thumb for Data Understanding?

A

Never trust any data before some plausibility tests!
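
A plausibility test can be as simple as checking each attribute against a range that domain knowledge says is possible. This is a minimal sketch under assumed names: the records, the attribute names, and the ranges in `plausible_ranges` are all hypothetical.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "height_cm": 172.0},
    {"age": -5, "height_cm": 180.0},    # negative age is implausible
    {"age": 51, "height_cm": None},     # missing value
    {"age": 29, "height_cm": 1720.0},   # probably a unit or typing error
]

# Assumed plausible ranges; in practice these come from domain knowledge.
plausible_ranges = {"age": (0, 120), "height_cm": (30.0, 250.0)}

def plausibility_report(records, ranges):
    """Return (row_index, attribute, problem) for every failed check."""
    problems = []
    for i, row in enumerate(records):
        for attr, (low, high) in ranges.items():
            value = row.get(attr)
            if value is None:
                problems.append((i, attr, "missing"))
            elif not low <= value <= high:
                problems.append((i, attr, "out of range"))
    return problems

report = plausibility_report(records, plausible_ranges)
```

The report only flags suspicious cells; whether to correct, drop, or keep them is a separate decision.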

15
Q

What types of attributes are there?

A
16
Q

Can categorical attributes be represented as numerical?

A
17
Q

What are some typical problems with data quality?

A
18
Q

What are some data visualization techniques and what can they reveal?

A
19
Q

What does z-score standardization do?

A
20
Q

Why might you need z-score standardization?

A
21
Q

What is dimensionality reduction?

A
22
Q

Compare dimensionality reduction to feature selection.

A
23
Q

How to select the number of principal components to use?

A
24
Q

What are outliers?

A
25
Q

How can outliers be detected?

A
26
Q

What are missing values?

A
27
Q

Are some types of missing values more difficult to deal with than others?

A
28
Q

When does data preparation happen?

A

Some of it has to happen during the Data Understanding phase; the rest happens in the Data Preparation phase, in a more principled manner.

29
Q

What happens in Data Preparation?

A

Feature extraction
Dealing with data quality problems
Normalization of features
Dimensionality reduction
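
As one possible illustration of the normalization step, z-score standardization (asked about in the earlier cards) rescales a feature to mean 0 and standard deviation 1. This sketch uses only the Python standard library; the `income` and `age` values are made up, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no spread
    return [(v - mu) / sigma for v in values]

# Hypothetical features on very different scales.
income = [20000, 40000, 60000]
age = [20, 40, 60]

income_z = z_score(income)
age_z = z_score(age)
```

After standardization the two features are on the same scale, so a distance-based method would no longer be dominated by `income`.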

30
Q

What is feature extraction?

A
31
Q

Why do feature selection?

A
32
Q

Why can it be beneficial to normalize feature values?

A
33
Q

Overall problem types

A

Classification, regression, clustering, association analysis.

34
Q

Model Structures

A

Nearest Neighbor predictor, linear model, if-else rule statements, decision trees, neural network, clustering.

35
Q

What is the goal of a machine learning algorithm?

A

To find a good model that fits the problem.

36
Q

How to choose between different models?

A

For example, cross-validation.
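
A minimal sketch of k-fold cross-validation for comparing two candidate models. The two models (a constant mean predictor and a least-squares line), the data, and the helper names are illustrative assumptions; real projects usually shuffle the data before splitting and rely on a library implementation.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_mean(xs, ys):
    """Model A: always predict the mean of the training targets."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Model B: least-squares line y = a + b*x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def cv_mse(fit, xs, ys, k=3):
    """Mean squared error over all held-out points."""
    errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)  # train only on the other folds
        errors += [(model(xs[i]) - ys[i]) ** 2 for i in fold]
    return sum(errors) / len(errors)

# Hypothetical, roughly linear data.
xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
scores = {"mean": cv_mse(fit_mean, xs, ys), "line": cv_mse(fit_line, xs, ys)}
best = min(scores, key=scores.get)
```

Each model is always scored on points it did not see during fitting, which is what makes the comparison fair.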

37
Q

What two criteria need to be balanced in learning?

A

1) Fit to data (low error)
2) Model complexity
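
One common way to write this balance, assuming a regularization view of learning, is as a single objective. The symbols are assumptions introduced here: L is a loss function, f_theta the model with parameters theta, Omega a complexity penalty, and lambda >= 0 the weight of that penalty.

```latex
% Sketch only: L, f_\theta, \Omega and \lambda are notation assumed for this card.
\hat{\theta} = \arg\min_{\theta} \Big[
  \underbrace{\frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f_\theta(x_i)\big)}_{\text{fit to data}}
  + \lambda \, \underbrace{\Omega(\theta)}_{\text{model complexity}}
\Big]
```

A larger lambda favors simpler models; lambda = 0 optimizes the fit to the data alone.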

38
Q

Define Overfitting

A

The model is too complex and may simply model noise in the data.

39
Q

Underfitting

A

The model is too simple and may not capture the concept.
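
Overfitting and underfitting can be seen side by side with a k-nearest-neighbor regressor, used here purely as an illustration. With k = 1 the model is as flexible as it gets, and with k equal to the training set size it always predicts the overall mean. The data points are made up.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of a k-NN model built on `train`, measured on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

# Hypothetical noisy samples of a roughly linear trend, plus clean test points.
train = [(0, 0.3), (1, 0.8), (2, 2.4), (3, 2.7), (4, 4.3), (5, 4.8)]
test = [(0.5, 0.5), (2.5, 2.5), (4.5, 4.5)]

errors = {
    k: {"train": mse(train, train, k), "test": mse(train, test, k)}
    for k in (1, 2, len(train))
}
```

In this toy setup k = 1 reproduces the training data exactly yet does worse on the test points than k = 2 (it models the noise), while k = 6 has a large error on both sets (too simple to capture the trend).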

40
Q

The more data you have, the more –

A

– complex models you can afford to use.

41
Q

What are some theoretical approaches to defining what we mean by "complex"?

A

Regularization theory, minimum description length principle, Bayesian priors.

42
Q

What does a hyperparameter do?

A
43
Q

Examples of hyperparameters?

A

The maximum polynomial degree in polynomial regression
The number of neighbors in the k-nearest neighbor method
The regularization parameter value for many methods (regularized linear regression models, support vector machines,…)
The depth of a decision tree
The number of layers and the number of neurons per hidden layer in a neural network
You might also be comparing models produced by different algorithms altogether
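
As a sketch of how a hyperparameter is typically chosen, the example below tries several values of a regularization parameter and keeps the one with the lowest error on held-out validation data. Everything here is an assumption for illustration: the one-dimensional ridge model without intercept, the data, and the candidate grid.

```python
def fit_ridge(xs, ys, lam):
    """1-D ridge regression without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def validation_mse(w, xs, ys):
    """Mean squared error of the model y = w*x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training and validation splits.
train_x, train_y = [1, 2, 3], [1.2, 1.9, 3.2]
val_x, val_y = [4, 5], [4.0, 5.1]

# The hyperparameter is set before training; the weight w is learned from data.
candidates = [0.0, 0.1, 1.0, 10.0]
results = {
    lam: validation_mse(fit_ridge(train_x, train_y, lam), val_x, val_y)
    for lam in candidates
}
best_lam = min(results, key=results.get)
```

The same loop works for any of the hyperparameters listed above: swap in a different model and a different candidate grid.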