Data Mining Flashcards

1
Q

What is the first step in the CRISP-DM framework?

A

Business Understanding

2
Q

What is the second step in the CRISP-DM framework?

A

Data Understanding

3
Q

What is the third step in the CRISP-DM framework?

A

Data Preparation

4
Q

What is the fourth step in the CRISP-DM framework?

A

Modelling

5
Q

What is the fifth step in the CRISP-DM framework?

A

Evaluation

6
Q

What is the final step in the CRISP-DM framework?

A

Deployment

7
Q

What are the most important steps in the CRISP-DM model?

A

Data understanding and data preparation

8
Q

What is a key aspect of data preparation?

A

Data reduction

9
Q

What does data reduction do?

A

Removes unnecessary and misleading data

Reduces time taken for discovering knowledge

Improves quality of discovered knowledge

10
Q

What are the main techniques in data reduction?

A

Feature selection
Instance selection
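
The card does not name specific methods, so the following is only a minimal sketch of the two ideas. It assumes a variance filter for feature selection (dropping columns that never change) and duplicate removal for instance selection; the function names and the sample data are hypothetical.

```python
from statistics import pvariance


def select_features(rows, min_variance=0.0):
    """Return the indices of columns whose population variance exceeds min_variance."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns) if pvariance(col) > min_variance]


def select_instances(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical data: column 0 is constant and row 2 repeats row 0.
data = [
    (1.0, 5.0, 0.2),
    (1.0, 7.0, 0.4),
    (1.0, 5.0, 0.2),
    (1.0, 9.0, 0.9),
]

kept_columns = select_features(data)
reduced = [tuple(row[i] for i in kept_columns) for row in select_instances(data)]
```

Here the constant column and the repeated row are removed, leaving three instances with two features each.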

11
Q

What is the definition of classification?

A

Given a set of (training) data, we find a model for the class feature as a function of the values of the other features

12
Q

What is the goal of classification?

A

To assign new instances (i.e. real data) a class as accurately as possible

13
Q

What is an example of preprocessing?

A

When the data are images, separate the subject of interest from the background

14
Q

What may cause errors in classification?

A

Insufficient training data
Too few or too many features
Overfitting (fitting the training data too closely, so the model generalises poorly to new instances)

15
Q

How does the k-NN algorithm work?

A

First locate the k training instances nearest to the new instance, measured by Euclidean distance

Then take a majority vote amongst those k neighbours if the answer is discrete; otherwise take the mean of their values
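
The two steps above can be sketched as follows. This is a brute-force illustration, not a reference implementation: the function names and the sample data are hypothetical, training data is assumed to be a list of (features, target) pairs, and ties in the vote go to the label seen first among the neighbours.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to query by Euclidean distance."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]


def knn_classify(train, query, k):
    """Discrete target: majority vote amongst the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k):
    """Numeric target: mean of the k nearest neighbours' values."""
    neighbours = k_nearest(train, query, k)
    return sum(value for _, value in neighbours) / len(neighbours)


# Hypothetical training data.
labelled = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.5, 4.5), "b"),
]
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

predicted_class = knn_classify(labelled, (1.1, 1.0), k=3)
predicted_value = knn_regress(numeric, (2.1,), k=3)
```

With this data the three nearest neighbours of (1.1, 1.0) are all labelled "a", and the three nearest neighbours of 2.1 have values 10, 20 and 30.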

16
Q

What is the Euclidean distance?

A

d(x, y) = sqrt( (x1 - y1)^2 + … + (xn - yn)^2 )
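
As a worked example, the formula can be written out directly. The function name is hypothetical, and the points are chosen so that the differences form a 3-4-5 triangle.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared feature differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Differences are 3 and 4, so the distance is sqrt(9 + 16) = 5.
d = euclidean_distance((1.0, 2.0), (4.0, 6.0))
```

Python's standard library offers the same calculation as `math.dist` (Python 3.8 and later).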

17
Q

When should k-NN be considered?

A

Not more than ~20 features
Lots of training data

18
Q

What are the advantages of k-NN?

A
  • Training is very fast
  • Can learn complex target functions
  • Does not lose information
  • Can handle outliers if k is sufficiently large
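
The last point can be illustrated with a small sketch. The data is hypothetical: one point near the "a" cluster carries the label "b" and stands in for an outlier. With k = 1 the outlier decides the answer; with k = 3 it is outvoted.

```python
import math
from collections import Counter


def knn_vote(train, query, k):
    """Majority vote amongst the k training points nearest to query."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical one-feature data; (1.05,) labelled "b" is the outlier.
points = [
    ((0.8,), "a"),
    ((1.0,), "a"),
    ((1.2,), "a"),
    ((1.05,), "b"),
    ((5.0,), "b"),
    ((5.2,), "b"),
]

with_k1 = knn_vote(points, (1.06,), k=1)  # nearest single point is the outlier
with_k3 = knn_vote(points, (1.06,), k=3)  # outlier plus two genuine "a" points
```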
19
Q

What are disadvantages of k-NN?

A
  • Large overhead due to calculating the distance of the test instance from all training data
  • May be irrelevant features