Practice Exam Solutions Flashcards
Draw a figure showing what the CRISP-DM process looks like and explain the steps in the process.
Explain the differences between characteristic description and differential description of models.
Characteristic description concentrates on intragroup commonalities: it describes what is typical or characteristic of the group (or cluster), ignoring whether other groups (or clusters) might share some of these characteristics. Such descriptions can be obtained by clustering trees.
Differential description concentrates on intergroup differences: it describes only what differentiates this group (or cluster) from the others, ignoring the characteristics that may be shared by objects within it. Such descriptions can be obtained by decision trees.
1.3. Explain what is an appropriate baseline model based on the same concept as a decision tree.
A baseline model is a simple model that is used as a basis for comparison with other models. A decision tree is based on the concept of information gain. Thus a simple baseline model is a decision tree limited to one internal node that splits on the most informative feature.
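As a rough illustration of this answer, the following sketch picks the feature with the highest information gain and builds a one-node tree (a decision stump) on it. The toy rows, the feature names `plan` and `region`, and the helper names are assumptions made for the example, not part of the exam material.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def fit_stump(rows, labels):
    """Return (best_feature, {feature_value: majority_label})."""
    best = max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[best], []).append(label)
    rule = {value: Counter(g).most_common(1)[0][0] for value, g in groups.items()}
    return best, rule


# Hypothetical toy data: 'plan' separates the classes, 'region' does not.
rows = [
    {"plan": "prepaid", "region": "north"},
    {"plan": "prepaid", "region": "south"},
    {"plan": "contract", "region": "north"},
    {"plan": "contract", "region": "south"},
]
labels = ["churn", "churn", "stay", "stay"]

best_feature, rule = fit_stump(rows, labels)
```

Any fuller tree trained on the same data should do at least as well as this stump on held-out data; if it does not, the extra complexity is probably not paying off.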
Q. 1.4
1.5. Explain the differences between clustering and classification.
Classification attempts to predict, for each individual in a population, which of a (small) set of classes that individual belongs to. In this case, a data mining procedure results in a model that determines which class a new individual belongs to. Classification is a supervised method. Clustering attempts to group individuals in a population together by their similarity, but without regard to any specific purpose. Clustering is an unsupervised method.
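The contrast can be sketched on one-dimensional data. The example below is only illustrative: the nearest-centroid classifier and the two-group k-means loop are stand-ins chosen for brevity, and the `usage` values and the `light`/`heavy` labels are invented.

```python
from statistics import mean


def fit_nearest_centroid(values, labels):
    """Supervised: learn one centroid per known class label."""
    by_class = {}
    for v, y in zip(values, labels):
        by_class.setdefault(y, []).append(v)
    return {y: mean(vs) for y, vs in by_class.items()}


def predict(centroids, value):
    """Assign a new individual to the class with the closest centroid."""
    return min(centroids, key=lambda y: abs(centroids[y] - value))


def two_means(values, iterations=10):
    """Unsupervised: split values into two groups by similarity alone.

    Assumes the input contains at least two distinct values.
    """
    c_low, c_high = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - c_low) <= abs(v - c_high)]
        high = [v for v in values if abs(v - c_low) > abs(v - c_high)]
        c_low, c_high = mean(low), mean(high)
    return low, high


usage = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Classification: labels are given, and the model predicts a class for new data.
labels = ["light", "light", "light", "heavy", "heavy", "heavy"]
model = fit_nearest_centroid(usage, labels)
predicted = predict(model, 9.0)

# Clustering: no labels are used; the groups have no predefined meaning.
clusters = two_means(usage)
```

The classifier returns a named class because it was trained with labels, whereas the clustering step only returns two unnamed groups that the analyst still has to interpret.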
2.1 The company is interested in finding the typical cell phone usage (call, sms and internet) of customers that are self-employed. Explain if given the above information you are able to find a solution, clearly identifying which canonical data mining method you would use and indicating if it is a supervised or unsupervised method.
The objective is to characterize the typical behavior of an individual, group, or population, which can be solved using profiling, an unsupervised data mining task. This task is possible to solve with the provided dataset, since we possess information on cell phone usage (call, sms and internet) and we assume that self-employed is one of the possible values of the field ‘occupation’.
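A minimal sketch of such a profile, assuming the dataset can be read as one record per customer: the records below are invented, and the field names `calls`, `sms` and `internet_mb` are hypothetical stand-ins for the usage columns (only ‘occupation’ is named in the exam text). The mean is used here as one possible notion of "typical"; a median or a full distribution would be equally valid choices.

```python
from statistics import mean

# Hypothetical customer records; values and usage field names are illustrative.
customers = [
    {"occupation": "self-employed", "calls": 120, "sms": 30, "internet_mb": 800},
    {"occupation": "self-employed", "calls": 80, "sms": 10, "internet_mb": 1200},
    {"occupation": "student", "calls": 20, "sms": 200, "internet_mb": 3000},
]


def profile(records, occupation, fields=("calls", "sms", "internet_mb")):
    """Average usage per field for customers with the given occupation."""
    group = [r for r in records if r["occupation"] == occupation]
    return {f: mean(r[f] for r in group) for f in fields}


self_employed_profile = profile(customers, "self-employed")
```

No target variable is involved: the profile only summarizes the behavior of the selected group, which is why the task counts as unsupervised.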
2.2. You are informed that at the moment they have a churn classifier in place that has an accuracy of about 86%. What can you conclude from this statement regarding the accuracy of this classifier?
There is no indication on how this result was obtained, namely:
- if any cross-validation (simple or k-fold) was performed;
- if this is an in-sample (training) or out-of-sample (validation/testing) value;
- what is the confusion matrix associated with this result (given that the dataset is unbalanced, with about 14% of customers churning and about 86% not churning, a classifier that labels every customer as not churning would obtain the same accuracy).
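The last point can be checked with a small sketch. It assumes a hypothetical sample of 100 customers with the 14%/86% split mentioned above and a trivial classifier that always predicts "stay"; the function name and label strings are illustrative.

```python
def confusion_counts(actual, predicted, positive="churn"):
    """Return (tp, fp, tn, fn), treating `positive` as the class of interest."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, tn, fn


# Hypothetical sample: 14 churners and 86 non-churners.
actual = ["churn"] * 14 + ["stay"] * 86
# Trivial classifier: predicts the majority class for everyone.
predicted = ["stay"] * len(actual)

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)  # share of actual churners that were detected
```

Here the accuracy is 0.86 while the recall on churners is 0, so the reported 86% alone does not show that the classifier is any better than always predicting the majority class.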
Q. 2.3
2.4. Draw the entropy chart for the variable ‘service subscription type’.
Q. 3.2