Data Mining Flashcards
What is the first step in the CRISP-DM framework?
Business Understanding
What is the second step in the CRISP-DM framework?
Data Understanding
What is the third step in the CRISP-DM framework?
Data Preparation
What is the fourth step in the CRISP-DM framework?
Modelling
What is the fifth step in the CRISP-DM framework?
Evaluation
What is the final step in the CRISP-DM framework?
Deployment
What are the most important steps in the CRISP-DM model?
Data understanding and data preparation
What is a key aspect of data preparation?
Data reduction
What does data reduction do?
Removes unnecessary and misleading data
Reduces time taken for discovering knowledge
Improves quality of discovered knowledge
What are the main techniques in data reduction?
Feature selection
Instance selection
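A minimal sketch of both techniques, assuming one simple strategy for each: dropping constant (zero-variance) features for feature selection, and random sampling for instance selection. The cards do not say which methods are meant, and the names `select_features`, `select_instances`, and `min_variance` are hypothetical.

```python
import random
import statistics

def select_features(rows, min_variance=0.0):
    """Feature selection (assumed strategy): keep columns whose variance exceeds min_variance."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns)
            if statistics.pvariance(col) > min_variance]
    return [[row[i] for i in keep] for row in rows], keep

def select_instances(rows, n, seed=0):
    """Instance selection (assumed strategy): keep a random sample of n rows."""
    return random.Random(seed).sample(rows, n)

# Hypothetical toy data: the middle column is constant, so it carries no information.
rows = [[1.0, 7.0, 0.2], [2.0, 7.0, 0.1], [3.0, 7.0, 0.4], [4.0, 7.0, 0.3]]
reduced, kept = select_features(rows)
sample = select_instances(reduced, 2)
print(kept)         # indices of the columns that were kept
print(len(sample))  # number of rows that were kept
```

Both steps shrink the data before modelling: fewer columns in the first case, fewer rows in the second.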
What is the definition of classification?
Given a set of (training) data, we find a model for the class feature as a function of the values of the other features
What is the goal of classification?
That new instances (i.e. real data) are assigned a class as accurately as possible
What is an example of preprocessing?
When the data are images, separating the subject of interest from the background
What may cause errors in classification?
Insufficient training data
Too few or too many features
Overfitting (fitting the training data too closely)
How does the k-NN algorithm work?
First, locate the k training instances nearest to the new instance, measured by Euclidean distance
If the target is discrete, take a majority vote among those k; otherwise, take the mean of their values
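The two steps above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the function names and toy data are hypothetical, `math.dist` (Python 3.8+) supplies the Euclidean distance, and ties in the vote are broken arbitrarily by `Counter`.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to query."""
    # Sort every training instance by its Euclidean distance to the query.
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_classify(train, query, k):
    """Discrete target: majority vote among the k nearest neighbours."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k):
    """Numeric target: mean of the k nearest neighbours' values."""
    neighbours = k_nearest(train, query, k)
    return sum(value for _, value in neighbours) / len(neighbours)

# Hypothetical toy data.
labelled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
            ((5.0, 5.0), "b"), ((5.5, 4.5), "b")]
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_classify(labelled, (1.1, 1.0), k=3))  # prints a
print(knn_regress(numeric, (2.1,), k=3))        # prints 20.0
```

Note that there is no training step: all the work happens at query time, when the distance to every stored instance is computed.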
What is the Euclidean distance?
d(x, y) = sqrt( (x1 - y1)^2 + … + (xn - yn)^2 )
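As a worked example, the formula can be written directly in Python. The function name `euclidean_distance` is a hypothetical choice; the standard library's `math.dist` computes the same quantity.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared per-feature differences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# From (0, 0) to (3, 4): sqrt(3^2 + 4^2) = sqrt(25) = 5
print(euclidean_distance((0, 0), (3, 4)))  # prints 5.0
```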
When should k-NN be considered?
Not more than ~20 features
Lots of training data
What are the advantages of k-NN?
- Training is very fast
- Can learn complex target functions
- Does not lose information
- Can handle outliers if k is sufficiently large
What are the disadvantages of k-NN?
- Large overhead from calculating the distance of the test instance from all training data
- May be affected by irrelevant features