General Cards Flashcards
When you have calculated the information gain of each feature, how do you determine the highest node in the tree?
The feature with the highest information gain is the root node.
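As a minimal sketch of that selection step (the toy data and feature names below are illustrative assumptions, not from any exercise table):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, using log base 2."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Parent entropy minus the weighted entropy of the children after splitting."""
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in splits.values())
    return entropy(labels) - weighted

# Toy data: columns are (Employed, Balance); target is stay/churn.
rows = [("yes", "high"), ("yes", "low"), ("no", "low"), ("no", "high"), ("no", "low")]
labels = ["stay", "stay", "churn", "churn", "stay"]

gains = {i: information_gain(rows, labels, i) for i in range(2)}
print(gains, "-> root is feature", max(gains, key=gains.get))
```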
What do you need to do if a branch has an entropy greater than 0?
You need to split it further: an entropy of 0 means the branch is pure, so any branch with entropy above 0 still mixes classes. (For a binary target the entropy never exceeds 1, which it reaches at a 50/50 split.)
What do you need to remember about the entropy and information gain equations?
They use log2, not the natural log (ln) or log10.
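For reference, the two equations in their standard form, with the base-2 logarithm:

H(S) = -\sum_i p_i \log_2 p_i

IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)

With log2 the entropy of a 50/50 binary split is exactly 1 bit; with ln it would be about 0.693, so the base matters when comparing against worked answers.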
If you were to create a probability classification tree, would you require to use Laplace smoothing? Elaborate.
Yes, Laplace smoothing is meant to moderate the influence of segments (leaves) with only a few instances. Since the sample size is small, the raw frequency estimates at the leaves would be extreme and unreliable, so Laplace smoothing should be applied.
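For reference, the Laplace-corrected estimate as given by Provost & Fawcett, for a leaf holding n instances of the class of interest and m of the other class:

p(c) = \frac{n + 1}{n + m + 2}

For example, a leaf with 2 positives and 0 negatives gets p = 3/4 = 0.75 rather than the raw, overconfident 2/2 = 1.0.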
Session 1

Classify your problem into a type 1 or type 2 data-driven decision making (DDD) problem, e.g. according to Foster Provost & Tom Fawcett:
(1) decisions for which “discoveries” need to be made within data, and (2) decisions that repeat, especially at massive scale, and so decision-making can benefit from even small increases in decision-making accuracy based on data analysis.
Type 1: Discovery within the data. Discover patterns to better understand our customer group (e.g. why do they stay, or churn?).
Type 2: Apply a model at massive scale. Due to the scale, even a small increase in accuracy has a large effect.
Explain and elaborate which canonical data mining task you match to the task described.
Type 1: All of them (e.g. profiling, clustering), with the exception of predictive techniques like classification and regression.
Type 2: Classification, because of the categorical target variable (regression would be used for a numeric target variable instead).
Research how you can apply logistic regression for the case that your explanatory (or input) data has categorical variables.
To perform logistic regression, the variables have to be numeric. If the input data is categorical, it has to be converted into numeric values using dummy variables. If the categorical variable is binary, a single 0/1 column suffices. If it has more than two levels, each level is represented by its own 0/1 (one-hot) column.
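A hedged sketch with pandas/scikit-learn (the column names and data are made up): a binary categorical maps to a single 0/1 column, while a multi-level one becomes several dummy columns.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical customer data with a 3-level categorical input.
df = pd.DataFrame({
    "region":   ["north", "south", "west", "south", "north", "west"],
    "employed": [1, 0, 1, 1, 0, 0],   # already binary: usable as-is
    "churn":    [0, 1, 0, 0, 1, 1],
})

# One-hot encode the multi-level categorical; drop_first avoids a redundant column.
X = pd.get_dummies(df[["region", "employed"]], columns=["region"], drop_first=True)
y = df["churn"]

model = LogisticRegression().fit(X, y)
print(X.columns.tolist())
print(model.coef_)
```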
d) You are now provided with yet another smaller dataset, presented in Table 2. Sketch the observations and optimal separating hyperplane (including equation), as a maximal margin classifier (or maximum linear classifier). Include on the sketch the margin for the maximal margin hyperplane.
e) What is the effect of moving customer 117 for the maximal margin classifier? What would be the effect if you were using a logistic regression?
e) It shouldn’t change anything about the maximal margin classifier unless customer 117 moves inside the margin; in that case the margin shrinks and the boundary moves with it. If the point crosses to the other side of the boundary, a hard-margin SVM has no solution, while a soft-margin SVM counts it as an error; you can tune how many margin violations it will accept. Logistic regression, in contrast, is influenced by every observation, so moving the customer would shift its boundary in any case.
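A hedged scikit-learn sketch of that error-tolerance knob, on made-up 2-D points: the parameter C controls how strongly margin violations are penalised, i.e. the "how many possible errors it can accept" part.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D customers; the two classes are linearly separable.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard (maximal) margin classifier;
# a small C yields a soft margin that tolerates violations.
hard = SVC(kernel="linear", C=1e6).fit(X, y)
soft = SVC(kernel="linear", C=0.1).fit(X, y)

# Only the support vectors define the maximal margin boundary.
print(hard.support_vectors_)
```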
What is the purpose of the kernel approach in a support vector machine? How does changing the kernel change the location of the decision boundary?
The kernel approach allows you to apply an SVM in cases where no separating straight line exists, i.e. to apply the SVM non-linearly. The kernel function implicitly maps the original features into another, usually higher-dimensional, feature space. In other words: when the dataset is inseparable in the current number of dimensions, add another one.
Changing the kernel varies how the decision boundary is drawn. As the dimensionality of the implicit feature space increases, the decision boundary can become increasingly complex.
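A hedged illustration of the effect: on data that no straight line can separate (concentric circles), swapping the linear kernel for an RBF kernel moves the boundary from a failing straight line to a closed curve.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)  # straight-line boundary
rbf = SVC(kernel="rbf").fit(X, y)        # implicit higher-dimensional mapping

print("linear kernel accuracy:", linear.score(X, y))  # near chance
print("rbf kernel accuracy:   ", rbf.score(X, y))     # near perfect
```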
Using only pen and paper, create a probability classification tree dtree 1 using only the variable Employed presented in Table 1. Assuming you use the first 13 customers to train this model and the remaining customers for validation, calculate the accuracy, the confusion matrix, sensitivity and specificity for both the train and validation data. While creating the tree you may find two different possibilities, report them both.
How do you calculate the validation performance in this case?
In this case, the model is based only on Employed, so you check how well Employed predicts DS in the validation set. It's not more than that.
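A small sketch of that computation, with hypothetical labels (treating "yes" as the positive class is an assumption):

```python
def evaluate(actual, predicted, positive="yes"):
    """Confusion-matrix counts plus accuracy, sensitivity and specificity."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "confusion": [[tp, fn], [fp, tn]],  # rows: actual, columns: predicted
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
    }

# Hypothetical validation split: Employed alone drives the prediction.
actual    = ["yes", "no", "yes", "no", "no"]
predicted = ["yes", "yes", "yes", "no", "no"]
print(evaluate(actual, predicted))
```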
How do you determine what a leaf node classifies?
It classifies based on the class that dominates, in other words the one with the largest share (the majority class).
Why do decision trees overfit? And logistic regression or SVM? How can you counteract overfitting?
Decision trees overfit as the number of nodes increases: the tree acquires details of the training set that are not characteristic of the general population. Each deeper split is based on fewer and fewer instances, so the model generalises from less and less data, causing a large loss in holdout accuracy.
Logistic regression and SVMs overfit for the same underlying reason, but to different degrees; logistic regression is far more susceptible to extreme outliers than an SVM. Increasing the number of dimensions (features) increases overfitting for both.
You can counteract overfitting by reducing model complexity. For trees, specify a minimum number of instances that must be present in a leaf. A second strategy for trees is pruning: cutting off leaves and branches and replacing them with leaves.
More generally: cull the feature set (feature selection) and use regularisation.
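A sketch of those counter-measures in scikit-learn (the parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Trees: require a minimum leaf size, and/or prune with cost-complexity alpha.
tree = DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.01).fit(X, y)

# Logistic regression: regularisation; a smaller C means a stronger penalty.
logreg = LogisticRegression(C=0.1).fit(X, y)

print("tree depth after constraints:", tree.get_depth())
```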
What is an important thing folds need to be?
Folds need to be stratified, i.e. each fold is representative of the whole dataset (the class proportions are preserved).
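A quick check with scikit-learn's StratifiedKFold on a hypothetical imbalanced target:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)     # imbalanced: only 10% positives
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Stratification keeps roughly the same class proportions in every fold.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    print("positives in this test fold:", y[test_idx].sum())  # ~2 each time
```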
In k-fold cross-validation, what is the maximum number of folds that can be used? In what cases would this strategy be useful?
The maximum number of folds equals the number of observations; this is called leave-one-out cross-validation (LOOCV). LOOCV provides an approximately unbiased estimate of the test error, but the estimate has high variance, since each fold tests on a single observation and the n fitted models are highly correlated. This strategy is useful in cases where the amount of data is naturally limited, such as for rare medical conditions.
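A minimal LOOCV sketch (the synthetic 30-row dataset stands in for a genuinely scarce one):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical scarce-data setting: only 30 observations in total.
X, y = make_classification(n_samples=30, n_features=5, random_state=0)

# LOOCV fits n models, each trained on n-1 rows and tested on the one left out.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOOCV accuracy estimate:", scores.mean())
```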
Explain in detail the steps to build a fitting graph for a decision tree and for a logistic regression classification model.
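No answer is recorded on this card. As a sketch of the general recipe: train models of increasing complexity, score each on both the training data and a holdout set, and plot both curves against the complexity axis (tree depth or number of nodes for a tree; for logistic regression, e.g. the number of features or the inverse regularisation strength C). The details below are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

depths = range(1, 15)  # the model-complexity axis
train_acc, test_acc = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
    train_acc.append(tree.score(X_tr, y_tr))
    test_acc.append(tree.score(X_te, y_te))

plt.plot(depths, train_acc, label="training accuracy")
plt.plot(depths, test_acc, label="holdout accuracy")
plt.xlabel("model complexity (max tree depth)")
plt.ylabel("accuracy")
plt.legend()
plt.show()  # the gap between the curves widens where the tree starts to overfit
```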