General Cards Flashcards
When you have calculated the information gain of each feature, how do you determine the highest node in the tree?
The feature with the highest information gain is the root node.
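As a minimal sketch of that selection step (the toy data and feature names below are illustrative assumptions, not from any exercise table):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, using log base 2."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Parent entropy minus the weighted entropy of the children after splitting."""
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in splits.values())
    return entropy(labels) - weighted

# Toy data: columns are (Employed, Balance); target is stay/churn.
rows = [("yes", "high"), ("yes", "low"), ("no", "low"), ("no", "high"), ("no", "low")]
labels = ["stay", "stay", "churn", "churn", "stay"]

gains = {i: information_gain(rows, labels, i) for i in range(2)}
print(gains, "-> root is feature", max(gains, key=gains.get))
```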
What do you need to do if a branch has an entropy greater than 0?
You need to split it further: an entropy of 0 means the branch is pure, so any branch with entropy above 0 still mixes classes. (For a binary target the entropy never exceeds 1, which it reaches at a 50/50 split.)
What do you need to remember about the entropy and information gain equations?
They use log2, not the natural log (ln) or log10.
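For reference, the two equations in their standard form, with the base-2 logarithm:

H(S) = -\sum_i p_i \log_2 p_i

IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)

With log2 the entropy of a 50/50 binary split is exactly 1 bit; with ln it would be about 0.693, so the base matters when comparing against worked answers.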
If you were to create a probability classification tree, would you require to use Laplace smoothing? Elaborate.
Yes, Laplace smoothing is meant to moderate the influence of segments (leaves) with only a few instances. Since the sample size is small, the raw frequency estimates at the leaves would be extreme and unreliable, so Laplace smoothing should be applied.
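For reference, the Laplace-corrected estimate as given by Provost & Fawcett, for a leaf holding n instances of the class of interest and m of the other class:

p(c) = \frac{n + 1}{n + m + 2}

For example, a leaf with 2 positives and 0 negatives gets p = 3/4 = 0.75 rather than the raw, overconfident 2/2 = 1.0.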
Session 1

Classify your problem into a type 1 or type 2 data-driven decision making (DDD) problem, e.g. according to Foster Provost & Tom Fawcett:
(1) decisions for which “discoveries” need to be made within data, and (2) decisions that repeat, especially at massive scale, and so decision-making can benefit from even small increases in decision-making accuracy based on data analysis.
Type 1: Discovery within the data. Discover patterns to better understand our customer group (e.g. why do they stay, or churn?).
Type 2: Apply a model at massive scale. Due to the scale, even a small increase in accuracy has a large effect.
Explain and elaborate which canonical data mining task you match to the task described.
Type 1: All of them (e.g. profiling, clustering), with the exception of predictive techniques like classification and regression.
Type 2: Classification, because of the categorical target variable (regression would be used for a numeric target variable instead).
Research how you can apply logistic regression for the case that your explanatory (or input) data has categorical variables.
To perform logistic regression, the variables have to be numeric. If the input data is categorical, it has to be converted into numeric values using dummy variables. If the categorical variable is binary, a single 0/1 column suffices. If it has more than two levels, each level is represented by its own 0/1 (one-hot) column.
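A hedged sketch with pandas/scikit-learn (the column names and data are made up): a binary categorical maps to a single 0/1 column, while a multi-level one becomes several dummy columns.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical customer data with a 3-level categorical input.
df = pd.DataFrame({
    "region":   ["north", "south", "west", "south", "north", "west"],
    "employed": [1, 0, 1, 1, 0, 0],   # already binary: usable as-is
    "churn":    [0, 1, 0, 0, 1, 1],
})

# One-hot encode the multi-level categorical; drop_first avoids a redundant column.
X = pd.get_dummies(df[["region", "employed"]], columns=["region"], drop_first=True)
y = df["churn"]

model = LogisticRegression().fit(X, y)
print(X.columns.tolist())
print(model.coef_)
```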
d) You are now provided with yet another smaller dataset, presented in Table 2. Sketch the observations and optimal separating hyperplane (including equation), as a maximal margin classifier (or maximum linear classifier). Include on the sketch the margin for the maximal margin hyperplane.
e) What is the effect of moving customer 117 for the maximal margin classifier? What would be the effect if you were using a logistic regression?
e) It shouldn’t change anything about the maximal margin classifier unless customer 117 moves inside the margin; in that case the margin shrinks and the boundary moves with it. If the point crosses to the other side of the boundary, a hard-margin SVM has no solution, while a soft-margin SVM counts it as an error; you can tune how many margin violations it will accept. Logistic regression, in contrast, is influenced by every observation, so moving the customer would shift its boundary in any case.
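A hedged scikit-learn sketch of that error-tolerance knob, on made-up 2-D points: the parameter C controls how strongly margin violations are penalised, i.e. the "how many possible errors it can accept" part.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D customers; the two classes are linearly separable.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard (maximal) margin classifier;
# a small C yields a soft margin that tolerates violations.
hard = SVC(kernel="linear", C=1e6).fit(X, y)
soft = SVC(kernel="linear", C=0.1).fit(X, y)

# Only the support vectors define the maximal margin boundary.
print(hard.support_vectors_)
```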
What is the purpose of the kernel approach in a support vector machine? How does changing the kernel change the location of the decision boundary?
The kernel approach allows you to apply an SVM in cases where no separating straight line exists, i.e. to apply the SVM non-linearly. The kernel function implicitly maps the original features into another, usually higher-dimensional, feature space. In other words: when the dataset is inseparable in the current number of dimensions, add another one.
Changing the kernel varies how the decision boundary is drawn. As the dimensionality of the implicit feature space increases, the decision boundary can become increasingly complex.
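A hedged illustration of the effect: on data that no straight line can separate (concentric circles), swapping the linear kernel for an RBF kernel moves the boundary from a failing straight line to a closed curve.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)  # straight-line boundary
rbf = SVC(kernel="rbf").fit(X, y)        # implicit higher-dimensional mapping

print("linear kernel accuracy:", linear.score(X, y))  # near chance
print("rbf kernel accuracy:   ", rbf.score(X, y))     # near perfect
```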
Using only pen and paper, create a probability classification tree dtree 1 using only the variable Employed presented in Table 1. Assuming you use the first 13 customers to train this model and the remaining customers for validation, calculate the accuracy, the confusion matrix, sensitivity and specificity for both the train and validation data. While creating the tree you may find two different possibilities, report them both.
How do you calculate the validation performance in this case?
In this case, the model is based only on Employed, so you check how well Employed predicts DS in the validation set. It's not more than that.
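A small sketch of that computation, with hypothetical labels (treating "yes" as the positive class is an assumption):

```python
def evaluate(actual, predicted, positive="yes"):
    """Confusion-matrix counts plus accuracy, sensitivity and specificity."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "confusion": [[tp, fn], [fp, tn]],  # rows: actual, columns: predicted
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
    }

# Hypothetical validation split: Employed alone drives the prediction.
actual    = ["yes", "no", "yes", "no", "no"]
predicted = ["yes", "yes", "yes", "no", "no"]
print(evaluate(actual, predicted))
```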
How do you determine what a leaf node classifies?
It classifies based on the class that dominates, in other words the one with the largest share (the majority class).
Why do decision trees overfit? And logistic regression or SVM? How can you counteract overfitting?
Decision trees overfit as the number of nodes increases: the tree acquires details of the training set that are not characteristic of the general population. Each deeper split is based on fewer and fewer instances, so the model generalises from less and less data, causing a large loss in holdout accuracy.
Logistic regression and SVMs overfit for the same underlying reason, but to different degrees; logistic regression is far more susceptible to extreme outliers than an SVM. Increasing the number of dimensions (features) increases overfitting for both.
You can counteract overfitting by reducing model complexity. For trees, specify a minimum number of instances that must be present in a leaf. A second strategy for trees is pruning: cutting off leaves and branches and replacing them with leaves.
More generally: cull the feature set (feature selection) and use regularisation.
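A sketch of those counter-measures in scikit-learn (the parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Trees: require a minimum leaf size, and/or prune with cost-complexity alpha.
tree = DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=0.01).fit(X, y)

# Logistic regression: regularisation; a smaller C means a stronger penalty.
logreg = LogisticRegression(C=0.1).fit(X, y)

print("tree depth after constraints:", tree.get_depth())
```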
What is an important thing folds need to be?
Folds need to be stratified, i.e. each fold is representative of the whole dataset (the class proportions are preserved).
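A quick check with scikit-learn's StratifiedKFold on a hypothetical imbalanced target:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)     # imbalanced: only 10% positives
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Stratification keeps roughly the same class proportions in every fold.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    print("positives in this test fold:", y[test_idx].sum())  # ~2 each time
```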
In k-fold cross-validation, what is the maximum number of folds that can be used? In what cases would this strategy be useful?
The maximum number of folds equals the number of observations; this is called leave-one-out cross-validation (LOOCV). LOOCV provides an approximately unbiased estimate of the test error, but the estimate has high variance, since each fold tests on a single observation and the n fitted models are highly correlated. This strategy is useful in cases where the amount of data is naturally limited, such as for rare medical conditions.
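A minimal LOOCV sketch (the synthetic 30-row dataset stands in for a genuinely scarce one):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical scarce-data setting: only 30 observations in total.
X, y = make_classification(n_samples=30, n_features=5, random_state=0)

# LOOCV fits n models, each trained on n-1 rows and tested on the one left out.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("LOOCV accuracy estimate:", scores.mean())
```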
Explain in detail the steps to build a fitting graph for a decision tree and for a logistic regression classification model.
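No answer is recorded on this card. As a sketch of the general recipe: train models of increasing complexity, score each on both the training data and a holdout set, and plot both curves against the complexity axis (tree depth or number of nodes for a tree; for logistic regression, e.g. the number of features or the inverse regularisation strength C). The details below are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

depths = range(1, 15)  # the model-complexity axis
train_acc, test_acc = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
    train_acc.append(tree.score(X_tr, y_tr))
    test_acc.append(tree.score(X_te, y_te))

plt.plot(depths, train_acc, label="training accuracy")
plt.plot(depths, test_acc, label="holdout accuracy")
plt.xlabel("model complexity (max tree depth)")
plt.ylabel("accuracy")
plt.legend()
plt.show()  # the gap between the curves widens where the tree starts to overfit
```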