Knowledge based systems Flashcards
Describe supervised and unsupervised learning. Give examples of both.
Supervised learning (classification) is when the model is trained on data where the objects have predefined labels; rule-based learning and decision trees are examples of supervised learning. Unsupervised learning is when the model must find patterns in an unlabeled dataset; clustering methods are examples of unsupervised learning.
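A minimal sketch of both settings, assuming scikit-learn and NumPy are available (the toy data X and y are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 6 objects described by 2 features each.
X = np.array([[0, 1], [1, 1], [0, 0], [5, 6], [6, 5], [5, 5]])
y = np.array([0, 0, 0, 1, 1, 1])  # predefined labels (supervised setting)

# Supervised: a decision tree learns to map the features to known labels.
tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[0.5, 0.5]]))  # classify a new, unseen object

# Unsupervised: k-means groups the objects without ever seeing y.
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_)  # cluster assignments discovered from X alone
```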
When is it more suited to use supervised learning over unsupervised learning?
When we have data where the objects have known labels and we, for instance, want to learn to classify them based on a set of features, i.e., when we want to predict output from input.
When is it more suited to use unsupervised learning over supervised learning?
When we have data with unknown labels and we want to find patterns in the data, for example with k-means clustering or hierarchical clustering.
A problem we can face is whether we should prioritize interpretability or performance of our models. When should we prioritize what?
If the model's application affects human lives, then performance (sensitivity and specificity) should be prioritized, since very low performance means that we simply cannot trust the model's predictions. If we instead want to find patterns in very complex research data, the need for the model to be easily interpreted becomes more important; interpretability matters more in research.
What is undersampling? When is it necessary?
Undersampling is necessary when the distribution of classes is unequal. To get an equal representation of the classes, the algorithm randomly removes objects from the majority class without replacement; this avoids biased performance caused by the imbalance.
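A minimal NumPy sketch of random undersampling without replacement for a two-class dataset (the function name is hypothetical):

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop majority-class objects without replacement until
    both classes are equally represented (two-class sketch)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    keep_maj = rng.choice(np.flatnonzero(y == majority),
                          size=counts.min(), replace=False)
    keep = np.concatenate([np.flatnonzero(y == minority), keep_maj])
    return X[keep], y[keep]
```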
What is feature selection? When is it necessary?
External feature selection, with for example MCFS, is necessary if the dataset has more features than objects in the universe, and it is done to reduce the dimensionality to the features that most affect the classification. The reducts created in rule-based classification are also a type of feature selection that we can do when the number of features is smaller.
What is cutoffPermutations in MCFS? What should you set it to?
cutoffPermutations is the number of permutation runs used to compare the RI values produced by the feature selection with the RI values obtained when the objects are labeled stochastically, to see if the RI values from the feature selection are significantly better than random chance. It should be set to at least 20 to get significant results.
Explain MCFS, why is it needed and how is it performed?
Monte Carlo Feature Selection is required for datasets with many features, to reduce the dataset to only the features that affect the classification the most. MCFS draws s random subsets of size m from the original feature set; if m is 10% of the features, each subset contains only 10% of the original features. For each subset, the objects are repeatedly split into a training set and a test set with the usual distribution of 2/3 training and 1/3 testing, and a decision tree is trained on the training set and evaluated on the test set. The result over all decision trees is an RI (relative importance) value for each feature. To test the significance of the RI values we run a permutation test with at least 20 runs, to see if the obtained RI value is better than the RI value we could get by random chance.
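A heavily simplified sketch of the MCFS loop, assuming scikit-learn; the function name is hypothetical, and the real implementation (e.g., the rmcfs R package) computes RI from weighted contributions inside each tree, which is only approximated here by feature importances weighted by test accuracy:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def mcfs_ri(X, y, s=1000, m_frac=0.1, seed=0):
    """Accumulate a relative importance (RI) score per feature over s
    decision trees, each grown on a random subset of m_frac of the
    features and a random 2/3 training / 1/3 test split."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    m = max(1, int(m_frac * n_features))
    ri = np.zeros(n_features)
    for _ in range(s):
        cols = rng.choice(n_features, size=m, replace=False)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[:, cols], y, test_size=1/3,
            random_state=int(rng.integers(10**6)))
        tree = DecisionTreeClassifier().fit(X_tr, y_tr)
        # Approximate each feature's contribution by its importance in
        # the tree, weighted by the tree's accuracy on the test set.
        ri[cols] += tree.score(X_te, y_te) * tree.feature_importances_
    return ri  # higher RI = more important for the classification
```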
What is a confusion matrix in MCFS? What is the accuracy we expect? How do we calculate accuracy?
From every test of the constructed decision trees we get a summarized confusion matrix of the predictions. In this matrix we can see how many times the predictions were correct and incorrect, and from this we can calculate the performance (sensitivity and specificity, which give the overall accuracy) of the predictions. For a balanced two-class problem we want accuracy above the expected 0.5, because that indicates that the model is correct more often than random chance. Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e., the number of correct predictions divided by the total.
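A worked example with a hypothetical summarized confusion matrix:

```python
# Hypothetical summarized confusion matrix:
#              predicted +    predicted -
# actual +        40 (TP)        10 (FN)
# actual -        15 (FP)        35 (TN)
TP, FN, FP, TN = 40, 10, 15, 35

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 75 / 100 = 0.75
sensitivity = TP / (TP + FN)                   # 40 / 50  = 0.80
specificity = TN / (TN + FP)                   # 35 / 50  = 0.70
print(accuracy, sensitivity, specificity)
```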
What is the goal of rule based decision systems in rough set theory?
To predict output from input, i.e., to predict the decision attribute of an object based on the values of its features.
How do we define a decision system?
A = (U, A ∪ {d}). The decision system A is defined by the universe U of objects, the set of condition attributes A, and the decision attribute d (d ∉ A) in union with A.
What is Va in a decision system?
Va is the value set of attribute a, i.e., the set of values that attribute can take in the decision system.
What are indiscernibility classes?
Groups of objects in the universe that we cannot discern from each other by looking at a specific set of features, because they take on the same values for each of those features (excluding the decision).
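A small sketch of how the indiscernibility classes can be computed, using a hypothetical decision table:

```python
from collections import defaultdict

# Hypothetical decision table: one dict per object, "d" is the decision.
table = [
    {"fever": "yes", "cough": "no",  "d": "flu"},
    {"fever": "yes", "cough": "no",  "d": "cold"},
    {"fever": "no",  "cough": "yes", "d": "cold"},
    {"fever": "no",  "cough": "yes", "d": "cold"},
]

def indiscernibility_classes(table, features):
    """Group object indices that take the same values on the given
    features; the decision attribute is deliberately ignored."""
    classes = defaultdict(list)
    for i, obj in enumerate(table):
        classes[tuple(obj[f] for f in features)].append(i)
    return dict(classes)

print(indiscernibility_classes(table, ["fever", "cough"]))
# {('yes', 'no'): [0, 1], ('no', 'yes'): [2, 3]}
```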
What are the upper and lower approximations?
For a given decision value, the lower approximation is the union of the indiscernibility classes that certainly belong to that decision value, because all objects in each class take on that decision value. The upper approximation is the lower approximation plus the classes that may belong to that decision value, i.e., classes where we do not know because their objects take on different decision values.
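Continuing the hypothetical table and indiscernibility_classes sketch above, the approximations for one decision value:

```python
def approximations(table, features, decision_value):
    """Lower approximation: classes whose members all take the decision
    value. Upper approximation: classes where at least one member does."""
    lower, upper = [], []
    for members in indiscernibility_classes(table, features).values():
        decisions = {table[i]["d"] for i in members}
        if decisions == {decision_value}:
            lower += members   # certainly in the decision class
        if decision_value in decisions:
            upper += members   # possibly in the decision class
    return sorted(lower), sorted(upper)

print(approximations(table, ["fever", "cough"], "cold"))
# ([2, 3], [0, 1, 2, 3]): objects 2 and 3 are certain, 0 and 1 ambiguous
```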
What is the positive region?
The positive region is the union of the lower approximations of all decision classes, i.e., the indiscernibility classes whose objects certainly belong to one of the decision classes.
What is the rough membership function?
With the rough membership function we can decide whether the indiscernibility classes induced by a specific set of features form a good model for predicting a decision class. For each indiscernibility class, we take the number of its objects that have the given decision class divided by the total number of objects in the class. Looking at the range of these values over all classes tells us whether the feature set is associated with a specific decision class.
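Continuing the same sketch, the rough membership value for each indiscernibility class:

```python
def rough_membership(table, features, decision_value):
    """For each indiscernibility class: |[x]_B intersect X| / |[x]_B|,
    i.e., the fraction of its members with the given decision value."""
    result = {}
    for key, members in indiscernibility_classes(table, features).items():
        hits = sum(table[i]["d"] == decision_value for i in members)
        result[key] = hits / len(members)
    return result

print(rough_membership(table, ["fever", "cough"], "cold"))
# {('yes', 'no'): 0.5, ('no', 'yes'): 1.0}: only the second class is a
# reliable predictor of "cold"
```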
What is a discernibility matrix, boolean discernibility function and reducts?
From the equivalence classes we can create a matrix showing which attributes discern each pair of classes. From this matrix we define and compute the boolean discernibility function, which tells us in simplified form which attributes are needed to discern the classes: write the function as a conjunction of the matrix entries and simplify it, crossing out absorbed terms, to get the reducts. From the reducts we define the rules of the decision system.
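A brute-force sketch of the discernibility matrix and reduct computation for the hypothetical table above; real toolkits simplify the boolean function rather than enumerating subsets as done here:

```python
from itertools import combinations

def discernibility_matrix(table, features):
    """For every pair of objects, record the features that discern them."""
    entries = {}
    for i, j in combinations(range(len(table)), 2):
        diff = frozenset(f for f in features if table[i][f] != table[j][f])
        if diff:
            entries[(i, j)] = diff
    return entries

def reducts(table, features):
    """Brute-force sketch: a reduct is a minimal feature subset that shares
    at least one feature with every entry of the discernibility matrix."""
    entries = list(discernibility_matrix(table, features).values())
    hitting = [set(c) for r in range(1, len(features) + 1)
               for c in combinations(features, r)
               if all(set(c) & e for e in entries)]
    # Keep only the minimal hitting sets (the prime implicants).
    return [s for s in hitting if not any(t < s for t in hitting)]

print(reducts(table, ["fever", "cough"]))
# [{'fever'}, {'cough'}]: either feature alone discerns all object pairs
```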
What is the relative discernibility matrix?
The same as the discernibility matrix, except that for pairs of classes that have different attribute values but the same decision we enter a '-' (they do not need to be discerned). We then compute the boolean discernibility function and obtain the reducts and rules as before.
What is the workflow for creating and running rule based decision systems?
If the dataset has many features, perform MCFS to reduce the number of features. Define the decision table with the top-ranked features and discretize to get qualitative levels of the features. Define the equivalence classes. Generate the discernibility matrix and compute the boolean discernibility function to get the reducts. Define the rules from the reducts. Look at the quality of the rules and assess the performance of the classifier with a permutation test. Test the classifier on independent data / perform k-fold cross-validation. Read the literature to see if research agrees with the model.
Define accuracy, coverage, strength and support of constructed rules.
The support of a rule is the number of objects in the universe that match both the attribute values and the decision class given in the rule. The accuracy of a rule is support / (number of objects matching the attribute values given in the rule). The coverage of a rule is support / (number of objects matching the decision class given in the rule). The strength of a rule is support / (number of objects in the universe).
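A worked example on the hypothetical table from the earlier sketches, for the rule IF fever = yes THEN d = flu:

```python
# Hypothetical rule, evaluated on the table above: IF fever = yes THEN d = flu.
lhs = [i for i, o in enumerate(table) if o["fever"] == "yes"]  # LHS matches
rhs = [i for i, o in enumerate(table) if o["d"] == "flu"]      # RHS matches
support = len(set(lhs) & set(rhs))       # both sides match: 1 object

accuracy = support / len(lhs)    # 1/2 = 0.50: how reliable the rule is
coverage = support / len(rhs)    # 1/1 = 1.00: fraction of the class explained
strength = support / len(table)  # 1/4 = 0.25: how common the rule is overall
print(support, accuracy, coverage, strength)
```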
What are the performance variables when we look at the performance of a classifier? How are they calculated? What is a ROC curve?
Sensitivity = TP / (TP + FN). Specificity = TN / (TN + FP). If we put sensitivity on the y-axis and 1 − specificity (the false positive rate) on the x-axis, we get a ROC curve, where the AUC represents the performance of the classifier. We want an AUC above 0.5 (the diagonal line), because that indicates that the classifier performs better than random chance.
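A minimal sketch, assuming scikit-learn, with hypothetical labels and classifier scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and classifier scores for 8 objects.
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])

# x-axis = 1 - specificity (false positive rate), y-axis = sensitivity.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # AUC > 0.5 beats random chance
```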
Rough sets vs decision trees?
Decision trees and rough sets are two different approaches to classification. Rough sets treat all features in the system as equally important, whereas decision trees impose a hierarchical structure on the features.
Explain decision trees
The root of the tree represents the entire dataset, and the first split from the root is the most important because it divides the largest number of objects. Further down the tree, other nodes represent smaller splits of the data. The leaves represent the final classifications, and we can follow the branches from root to leaf to read off a kind of "rule" for the classification.
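A small illustration, assuming scikit-learn: export_text prints each root-to-leaf path, which reads like a rule (the data is the hypothetical toy set from the first sketch):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data (same as in the first sketch above).
X = [[0, 1], [1, 1], [0, 0], [5, 6], [6, 5], [5, 5]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier().fit(X, y)
# Each printed root-to-leaf path reads as one classification rule.
print(export_text(tree, feature_names=["f1", "f2"]))
```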
How can we calculate which is the most important split for a decision tree? What is information value and information gain?
The most important split is the one where the resulting subsets reduce the information value (entropy) of the entire dataset the most; this split becomes the root. First calculate the information value of the entire dataset, then calculate it for each subset produced by a split, and choose the split that reduces the original information value the most. If a split produces three subsets, we calculate the information value for each and take their size-weighted average to compare with the original information value. The information gain is the difference between these information values, and we want it to be as high as possible.
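A minimal sketch of entropy and information gain for a hypothetical split (the labels are illustrative):

```python
import numpy as np

def entropy(labels):
    """Information value of a set of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted average entropy
    of the child nodes produced by a split."""
    weighted = sum(len(c) / len(parent) * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["a"] * 4 + ["b"] * 4  # hypothetical labels at the parent node
print(information_gain(parent, [["a", "a", "a", "b"], ["b", "b", "b", "a"]]))
# 1.0 - 0.811 = 0.189 bits gained by this split
```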
Why is pruning of decision trees important?
Because the decision tree method risks overfitting the data (the last nodes will contain very few examples), we need to prune the tree, which means removing the nodes that worsen the classification results on unseen data.