Knowledge based systems for bioinformatics Flashcards
Describe supervised and unsupervised learning. Give examples of both.
Supervised learning is when the model is trained on data where the objects have predefined labels or known output values. Rule-based classification, regression and decision tree classification are examples of supervised learning.
Unsupervised learning is when the model must find patterns in an unlabeled dataset, for example with clustering methods.
When is it more suited to use supervised learning over unsupervised learning?
When we have data where the objects have known labels and we, for instance, want to learn to classify them based on a set of features, i.e. when we want to predict output from input.
When is it more suited to use unsupervised learning over supervised learning?
When we have data with unknown labels and we want to find patterns in the data, for example with k-means clustering or hierarchical clustering.
A problem we can face is whether we should prioritize interpretability or performance of our models. When should we prioritize what?
If the model's application is critical to human life, then performance (sensitivity and specificity) should be prioritized: very low performance means that we simply cannot trust the model's predictions.
If we want to find patterns in very complex research data, the need for the model to be easily interpreted becomes more important; interpretability matters more in research.
What is undersampling? When is it necessary?
Undersampling is necessary when the distribution of classes is unequal. To get an equal representation of the classes, the algorithm randomly removes objects from the majority class without replacement; this avoids biased performance caused by the class imbalance.
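A minimal sketch (not from the course material) of random undersampling in Python; the function and variable names are just placeholders:

import random

def undersample(objects, labels):
    """Randomly drop majority-class objects (without replacement)
    until all classes are equally represented."""
    by_class = {}
    for obj, lab in zip(objects, labels):
        by_class.setdefault(lab, []).append(obj)
    n_min = min(len(objs) for objs in by_class.values())
    balanced = []
    for lab, objs in by_class.items():
        balanced.extend((o, lab) for o in random.sample(objs, n_min))
    return balanced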
What is feature selection? When is it necessary?
External feature selection, for example with MCFS, is necessary if the dataset has more features than objects in the universe. It is done to reduce the dimensionality of the feature set to those features that affect the classification the most.
The reducts created in rule-based classification are also a type of feature selection that we can do for a smaller number of features.
What is CutOffPermutations in MCFS? What should you set it to?
cutoffPermutations is the number of permutation runs we perform when comparing the RI values produced by the feature selection with the RI values obtained when the object labels are randomly permuted, to see if the RI values from the real data are significantly better than random chance.
It should be set to at least 20 to get significant results.
Explain MCFS, why is it needed and how is it performed?
Monte Carlo Feature Selection (MCFS) is needed for datasets with many features, to reduce the dataset to only those features that affect the classification the most.
MCFS draws s randomly selected smaller subsets of the original feature set, each of size m. If m is 0.1, each subset contains 10% of the original features. For each subset, the objects are repeatedly split into a training set and a test set (with the usual distribution of 2/3 training and 1/3 testing), a decision tree is trained on the training set and evaluated on the test set. The results from all decision trees are aggregated into a relative importance (RI) value for each feature.
To test the significance of the RI values we run a permutation test with at least 20 runs, to see if the obtained RI values are better than the RI values we could get by random chance.
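A rough Python sketch of the MCFS idea, using scikit-learn decision trees; the way importances are aggregated here is a simplification of the real RI formula, and the function and parameter names are made up:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def mcfs_sketch(X, y, s=100, m=0.1, t=5, seed=0):
    """Crude Monte Carlo feature selection: draw s random feature subsets
    of size m * n_features, train t trees per subset on random 2/3 / 1/3
    splits, and sum each feature's importance weighted by test accuracy."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    k = max(1, int(m * n_feat))
    ri = np.zeros(n_feat)
    for _ in range(s):
        subset = rng.choice(n_feat, size=k, replace=False)
        for _ in range(t):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X[:, subset], y, test_size=1/3)
            tree = DecisionTreeClassifier().fit(X_tr, y_tr)
            acc = tree.score(X_te, y_te)
            ri[subset] += acc * tree.feature_importances_
    return ri  # higher value = feature more important in this crude scheme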
What is a confusion matrix in MCFS? What is the accuracy we expect? How do we calculate accuracy?
From each test of the constructed decision trees we get a summarized confusion matrix of the predictions. In this matrix we can see how many times the predictions were correct and incorrect, and from this we can calculate the performance (sensitivity and specificity, which together give the overall accuracy) of the predictions.
We want accuracy above the expected 0.5 (for a balanced two-class problem), because that indicates that the model is correct more often than random chance.
Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. the number of correct predictions divided by the total number of predictions.
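For example, with a hypothetical confusion matrix of TP = 40, TN = 45, FP = 5 and FN = 10, accuracy = (40 + 45) / (40 + 45 + 5 + 10) = 85 / 100 = 0.85.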
What is the goal of rule based decision systems in rough set theory?
To be able to predict output from input. Predict the decision attribute of an object based on the values of the features given.
How do we define a decision system?
A = (U, A ⋃ {d}). The decision system A is defined by the universe U and the attribute set A in union with the decision attribute d.
What is Va in a decision system?
Va is the value set of attribute a, i.e. the set of values that feature a can take in the decision system.
What are indiscernibility classes?
Groups of objects in the universe that we cannot discern from each other by looking at a specific set of features, because they take on the same values for each of those features (excluding the decision).
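A small Python sketch (with a made-up toy table) of how indiscernibility classes can be computed by grouping objects on their attribute-value tuples:

def indiscernibility_classes(table, attributes):
    """Group objects that take identical values on the given attributes."""
    classes = {}
    for i, obj in enumerate(table):
        key = tuple(obj[a] for a in attributes)
        classes.setdefault(key, []).append(i)
    return list(classes.values())

table = [{"G1": "high", "G2": "low"},
         {"G1": "high", "G2": "low"},
         {"G1": "low",  "G2": "low"}]
print(indiscernibility_classes(table, ["G1", "G2"]))  # [[0, 1], [2]]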
What are the upper and lower approximations?
For a given decision value, the lower approximation is the union of the indiscernibility classes that certainly belong to that decision value, because all objects in the class take on that decision value.
The upper approximation is the lower approximation plus the classes that may belong to that decision value, i.e. classes where we do not know, because the objects in the class take on different decision values.
What is the positive region?
The positive region is the union of the indiscernibility classes that certainly belong to one of the decision classes, i.e. the union of the lower approximations of all decision classes.
What is the rough membership function?
With the rough membership function we can decide whether the indiscernibility classes of a specific set of features form a good model for predicting a decision class.
To compute the rough membership we take the number of objects in an indiscernibility class that have the given decision class, divided by the total number of objects in that indiscernibility class: |[x]B ∩ X| / |[x]B|. We do this for each class and look at the range of values; from this we can interpret whether the feature set is associated with a specific decision class.
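A Python sketch of the rough membership computation on a made-up toy table; the attribute and decision names are hypothetical:

def rough_membership(table, attributes, decision, decision_value):
    """For each indiscernibility class, return the fraction of its
    objects whose decision equals decision_value."""
    classes = {}
    for i, obj in enumerate(table):
        key = tuple(obj[a] for a in attributes)
        classes.setdefault(key, []).append(i)
    return [sum(1 for i in cls if table[i][decision] == decision_value) / len(cls)
            for cls in classes.values()]

table = [{"G1": "high", "d": "sick"},
         {"G1": "high", "d": "healthy"},
         {"G1": "low",  "d": "healthy"}]
print(rough_membership(table, ["G1"], "d", "sick"))  # [0.5, 0.0]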
What is a discernibility matrix, boolean discernibility function and reducts?
From the equivalence (indiscernibility) classes created, we can build a matrix showing which attributes discern the classes from each other.
From this matrix we can define and compute the boolean discernibility function.
Line up all the attribute sets from the matrix as a conjunction of disjunctions, simplify using the laws of boolean algebra (absorption and distribution), and simplify again; the resulting minimal attribute sets are the reducts.
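A Python sketch that builds the discernibility matrix for a made-up toy table (the boolean simplification into reducts is left out):

from itertools import combinations

def discernibility_matrix(table, attributes):
    """For every pair of objects, list the attributes whose values differ."""
    return {(i, j): [a for a in attributes if table[i][a] != table[j][a]]
            for i, j in combinations(range(len(table)), 2)}

table = [{"G1": "high", "G2": "low"},
         {"G1": "high", "G2": "high"},
         {"G1": "low",  "G2": "low"}]
print(discernibility_matrix(table, ["G1", "G2"]))
# {(0, 1): ['G2'], (0, 2): ['G1'], (1, 2): ['G1', 'G2']}
# Boolean function: G2 AND G1 AND (G1 OR G2), which simplifies to the reduct {G1, G2}.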
What is the relative/modulo discernibility matrix?
Same thing as the discernibility matrix, but for those pairs of classes that have different attribute values yet the same decision we enter "-" (they do not need to be discerned). Then we compute the boolean discernibility function as before and get the reducts and, from them, the rules.
What is the workflow for creating and running rule based decision systems?
If the dataset has many features, then perform MCFS to reduce the number of features.
Define the decision table with the top-ranked features and perform discretization to get qualitative levels of the features.
Define equivalence classes.
Generate the discernibility matrix and compute the boolean discernibility function to get the reducts.
Define the rules from the reducts.
Look at the quality of the rules and assess the performance of the classifier, for example with a permutation test.
Test the classifier on independent data / perform k-fold cross-validation.
Read the literature to see if the research agrees with the model.
Define accuracy, coverage, strength and support of constructed rules.
The support of a rule is the number of objects in the universe that match both the attribute values and the decision class given in the rule.
The accuracy of a rule is support / the number of objects whose attribute values match those given in the rule.
The coverage of a rule is support / the number of objects whose decision class matches the one given in the rule.
The strength of a rule is support / the total number of objects in the universe.
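A Python sketch of the four rule-quality measures for a single rule "IF conditions THEN decision"; the toy table and all names are made up:

def rule_quality(table, conditions, decision, decision_value):
    """conditions: attribute -> required value on the rule's left-hand side.
    Returns (support, accuracy, coverage, strength) of the rule."""
    lhs = [o for o in table if all(o[a] == v for a, v in conditions.items())]
    rhs = [o for o in table if o[decision] == decision_value]
    support = sum(1 for o in lhs if o[decision] == decision_value)
    accuracy = support / len(lhs) if lhs else 0.0
    coverage = support / len(rhs) if rhs else 0.0
    strength = support / len(table)
    return support, accuracy, coverage, strength

table = [{"G1": "high", "d": "sick"}, {"G1": "high", "d": "sick"},
         {"G1": "high", "d": "healthy"}, {"G1": "low", "d": "healthy"}]
print(rule_quality(table, {"G1": "high"}, "d", "sick"))
# support 2, accuracy ≈ 0.67, coverage 1.0, strength 0.5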
What are the performance variables when we look at the performance of a classifier? How are they calculated? What is a ROC curve?
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
If we put sensitivity on the y-axis and 1 − specificity (the false positive rate) on the x-axis we get a ROC curve, where the area under the curve (AUC) represents the performance of the classifier. We want an AUC above 0.5 (the diagonal line), because that indicates that the performance of the classifier is better than random chance.
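A small Python sketch of sensitivity and specificity computed from hypothetical confusion-matrix counts:

def sensitivity_specificity(tp, tn, fp, fn):
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

print(sensitivity_specificity(tp=40, tn=45, fp=5, fn=10))  # (0.8, 0.9)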
Rough sets vs decision trees?
Decision trees and rough sets are two different approaches to classification. Rough sets treat all features in the system as equally important, whereas decision trees have a hierarchical structure.
Explain decision trees
The root of the tree represents the entire dataset, and the first split from the root is the most important because it divides the largest number of objects. Further down in the tree, other nodes represent smaller splits and divergences in the data. The leaves represent the final classifications, and we can follow the branches from root to leaf to get a sort of "rule" for the classification.
How can we calculate which is the most important split for a decision tree? What is information value and information gain?
The most important split is the one where the resulting subsets reduce the information value of the entire dataset the most; this split becomes the root.
First calculate the information value of the entire dataset, then calculate it for each subset created by a candidate split, and choose the split that reduces the original information value the most. If we split the root into 3 subsets we calculate the information value for each subset and take a weighted average of those (weighted by subset size) to compare with the original information value.
The information gain is the difference between these information values, and we want it to be as high as possible.
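A Python sketch of information value (entropy) and information gain for a candidate split, on made-up labels:

import math

def entropy(labels):
    """Information value (entropy) of a list of class labels."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent_labels, child_label_groups):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

# Hypothetical split of 6 labels into two pure children:
print(information_gain(["a", "a", "a", "b", "b", "b"],
                       [["a", "a", "a"], ["b", "b", "b"]]))  # 1.0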
Why is pruning of decision trees important?
Because of the risk of overfitting the data with the decision tree method (the last nodes will contain very few examples), we need to prune the tree, which means that we reduce the tree by removing the nodes that worsen the results of the classification.
What are approximation reducts? Why do we need them?
The aim of a decision system for classification is to produce reducts, which are the sets of features we need to look at when classifying. If we have very large decision tables we will get very long reducts, and that means we might overfit the model on the training data, meaning that the decision system is too specific to the training data and will have low performance on new data.
The remedy for this is to create approximate reducts that capture the patterns but not the noise of the specific training data. These "almost reducts" are shorter reducts: sets of attributes that discern the objects in U only to some degree.
How are approximation reducts obtained?
When computing the approximations we purposely choose a smaller attribute set and see how well it approximates a larger reduct, by looking at the error of reduct approximation. For example, how big is the error when using attribute set B to approximate reduct C? If the error is not too large we can keep the approximation, i.e. the shorter, more general reduct.
For example, suppose the universe contains 18 objects, the cardinality of the positive region for (G1, G2) is 2 + 3 + 2 = 7 (quality 7/18), and the cardinality of the positive region for (G1, G2, G3) is 2 + 2 + 4 + 3 + 1 + 1 = 13 (quality 13/18).
The error of using the reduct (G1, G2) to approximate (G1, G2, G3) is then 1 - (7 / 13) = 6/13 ≈ 0.46, and if the error is acceptably small we can keep the approximation instead of the longer reduct.
What is discretization and why do we do it? What is a partition?
The discretization step determines how strictly we want to view the world and involves transforming quantitative data into qualitative levels. For instance, temperature values can be transformed into a finite number of qualitative levels like high, medium and low. In the discretization process we search for a set of cuts with which we can divide the quantitative data into qualitative levels.
A partition P of the decision system A is a partition of every value set Va into non-overlapping intervals defined by a number of cuts. The size of the partition is the number of cuts we construct. The process of constructing this is called discretization: A = (U, A ⋃ {d}) becomes Ap = (U, Ap ⋃ {d}), the discretized system.
What are the different methods we can use to obtain partitions?
- User defined: good if we have domain knowledge
- Infer cuts from the data:
  - Equal frequency binning
  - Naive methods
  - Entropy based
  - Linear discriminant discretization
  - Boolean reasoning algorithm
Explain the boolean reasoning discretization
We define all intervals of values found in our decision table (excluding intervals where no values occur). Then we find the middle points of those intervals and define them as candidate cuts. To construct a minimal set of cuts that still discerns all objects we use boolean reasoning, i.e. we do not need all the cuts we originally constructed to discern the objects.
For this we list, for each pair of objects, the combinations of cuts that discern them (not including pairs with the same decision); these correspond to a discernibility matrix, but for the cuts. We then simplify this boolean function to get the final, minimal set of cuts.
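A Python sketch of the first step, generating candidate cuts as midpoints between consecutive distinct values (the boolean minimization of the cut set is omitted):

def candidate_cuts(values):
    """Candidate cuts: midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(candidate_cuts([0.8, 1.0, 1.0, 1.3, 1.4]))  # [0.9, 1.15, 1.35]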
Explain the equal frequency binning algorithm
Discretization method.
Not a very smart algorithm for discretization but it is sometimes used anyway because it is very simple.
- Choose number of bins
- Sort values in ascending order
- Divide the objects into the chosen number of bins so that there is an equal number of objects in each bin.
- Calculate the cuts by taking the average of the two values on either side of each bin boundary.
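A Python sketch of equal frequency binning on made-up values; for simplicity it assumes the number of values is divisible by the number of bins:

def equal_frequency_cuts(values, n_bins):
    """Sort the values, split them into n_bins equally sized groups, and
    place each cut halfway between the boundary values of adjacent groups."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    cuts = []
    for b in range(1, n_bins):
        left = ordered[b * size - 1]
        right = ordered[b * size]
        cuts.append((left + right) / 2)
    return cuts

print(equal_frequency_cuts([1, 2, 3, 4, 5, 6], 3))  # [2.5, 4.5]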