Knowledge based systems for bioinformatics Flashcards
Describe supervised and unsupervised learning. Give examples of both.
Supervised learning is when the model is trained on data where the objects have predefined labels or known output values. Rule-based classification, regression and decision tree classification are examples of supervised learning.
Unsupervised learning is when the model must find patterns in an unlabeled dataset, for example with clustering methods.
When is it more suited to use supervised learning over unsupervised learning?
When we have data where the objects have known labels and we, for instance, want to learn to classify them based on a set of features, i.e. when we want to predict output from input.
When is it more suited to use unsupervised learning over supervised learning?
When we have data with unknown labels and we want to find patterns in the data, for example with k-means clustering or hierarchical clustering.
A problem we can face is whether we should prioritize interpretability or performance of our models. When should we prioritize what?
If the model's application is critical to human life, then performance (sensitivity and specificity) should be prioritized: very low performance means that we simply cannot trust the model's predictions.
If we want to find patterns in very complex research data, the need for the model to be easily interpreted becomes more important; interpretability matters more in research.
What is undersampling? When is it necessary?
Undersampling is necessary when the distribution of classes is unequal. To get an equal representation of the classes, the algorithm randomly removes objects from the majority class without replacement; this avoids biased performance caused by the class imbalance.
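A minimal sketch (not from the course material) of random undersampling in Python; the function and variable names are just placeholders:

import random

def undersample(objects, labels):
    """Randomly drop majority-class objects (without replacement)
    until all classes are equally represented."""
    by_class = {}
    for obj, lab in zip(objects, labels):
        by_class.setdefault(lab, []).append(obj)
    n_min = min(len(objs) for objs in by_class.values())
    balanced = []
    for lab, objs in by_class.items():
        balanced.extend((o, lab) for o in random.sample(objs, n_min))
    return balanced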
What is feature selection? When is it necessary?
External feature selection, for example with MCFS, is necessary if the dataset has more features than objects in the universe. It is done to reduce the dimensionality of the feature set to those features that affect the classification the most.
The reducts created in rule-based classification are also a type of feature selection that we can do for a smaller number of features.
What is CutOffPermutations in MCFS? What should you set it to?
cutoffPermutations is the number of permutation runs we perform when comparing the RI values produced by the feature selection with the RI values obtained when the object labels are randomly permuted, to see if the RI values from the real data are significantly better than random chance.
It should be set to at least 20 to get significant results.
Explain MCFS, why is it needed and how is it performed?
Monte Carlo Feature Selection (MCFS) is needed for datasets with many features, to reduce the dataset to only those features that affect the classification the most.
MCFS draws s randomly selected smaller subsets of the original feature set, each of size m. If m is 0.1, each subset contains 10% of the original features. For each subset, the objects are repeatedly split into a training set and a test set (with the usual distribution of 2/3 training and 1/3 testing), a decision tree is trained on the training set and evaluated on the test set. The results from all decision trees are aggregated into a relative importance (RI) value for each feature.
To test the significance of the RI values we run a permutation test with at least 20 runs, to see if the obtained RI values are better than the RI values we could get by random chance.
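A rough Python sketch of the MCFS idea, using scikit-learn decision trees; the way importances are aggregated here is a simplification of the real RI formula, and the function and parameter names are made up:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def mcfs_sketch(X, y, s=100, m=0.1, t=5, seed=0):
    """Crude Monte Carlo feature selection: draw s random feature subsets
    of size m * n_features, train t trees per subset on random 2/3 / 1/3
    splits, and sum each feature's importance weighted by test accuracy."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    k = max(1, int(m * n_feat))
    ri = np.zeros(n_feat)
    for _ in range(s):
        subset = rng.choice(n_feat, size=k, replace=False)
        for _ in range(t):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X[:, subset], y, test_size=1/3)
            tree = DecisionTreeClassifier().fit(X_tr, y_tr)
            acc = tree.score(X_te, y_te)
            ri[subset] += acc * tree.feature_importances_
    return ri  # higher value = feature more important in this crude scheme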
What is a confusion matrix in MCFS? What is the accuracy we expect? How do we calculate accuracy?
From each test of the constructed decision trees we get a summarized confusion matrix of the predictions. In this matrix we can see how many times the predictions were correct and incorrect, and from this we can calculate the performance (sensitivity and specificity, which together give the overall accuracy) of the predictions.
We want accuracy above the expected 0.5 (for a balanced two-class problem), because that indicates that the model is correct more often than random chance.
Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. the number of correct predictions divided by the total number of predictions.
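For example, with a hypothetical confusion matrix of TP = 40, TN = 45, FP = 5 and FN = 10, accuracy = (40 + 45) / (40 + 45 + 5 + 10) = 85 / 100 = 0.85.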
What is the goal of rule based decision systems in rough set theory?
To be able to predict output from input. Predict the decision attribute of an object based on the values of the features given.
How do we define a decision system?
A = (U, A ⋃ {d}). The decision system A is defined by the universe U and the attribute set A in union with the decision attribute d.
What is Va in a decision system?
Va is the value set of attribute a, i.e. the set of values that feature a can take in the decision system.
What are indiscernibility classes?
Groups of objects in the universe that we cannot discern from each other by looking at a specific set of features, because they take on the same values for each of those features (excluding the decision).
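A small Python sketch (with a made-up toy table) of how indiscernibility classes can be computed by grouping objects on their attribute-value tuples:

def indiscernibility_classes(table, attributes):
    """Group objects that take identical values on the given attributes."""
    classes = {}
    for i, obj in enumerate(table):
        key = tuple(obj[a] for a in attributes)
        classes.setdefault(key, []).append(i)
    return list(classes.values())

table = [{"G1": "high", "G2": "low"},
         {"G1": "high", "G2": "low"},
         {"G1": "low",  "G2": "low"}]
print(indiscernibility_classes(table, ["G1", "G2"]))  # [[0, 1], [2]]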
What are the upper and lower approximations?
For a given decision value, the lower approximation is the union of the indiscernibility classes that certainly belong to that decision value, because all objects in the class take on that decision value.
The upper approximation is the lower approximation plus the classes that may belong to that decision value, i.e. classes where we do not know, because the objects in the class take on different decision values.
What is the positive region?
The positive region is the union of the indiscernibility classes that certainly belong to one of the decision classes, i.e. the union of the lower approximations of all decision classes.
What is the rough membership function?
With the rough membership function we can decide whether the indiscernibility classes of a specific set of features form a good model for predicting a decision class.
To compute the rough membership we take the number of objects in an indiscernibility class that have the given decision class, divided by the total number of objects in that indiscernibility class: |[x]B ∩ X| / |[x]B|. We do this for each class and look at the range of values; from this we can interpret whether the feature set is associated with a specific decision class.
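A Python sketch of the rough membership computation on a made-up toy table; the attribute and decision names are hypothetical:

def rough_membership(table, attributes, decision, decision_value):
    """For each indiscernibility class, return the fraction of its
    objects whose decision equals decision_value."""
    classes = {}
    for i, obj in enumerate(table):
        key = tuple(obj[a] for a in attributes)
        classes.setdefault(key, []).append(i)
    return [sum(1 for i in cls if table[i][decision] == decision_value) / len(cls)
            for cls in classes.values()]

table = [{"G1": "high", "d": "sick"},
         {"G1": "high", "d": "healthy"},
         {"G1": "low",  "d": "healthy"}]
print(rough_membership(table, ["G1"], "d", "sick"))  # [0.5, 0.0]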
What is a discernibility matrix, boolean discernibility function and reducts?
From the equivalence (indiscernibility) classes created, we can build a matrix showing which attributes discern the classes from each other.
From this matrix we can define and compute the boolean discernibility function.
Line up all the attribute sets from the matrix as a conjunction of disjunctions, simplify using the laws of boolean algebra (absorption and distribution), and simplify again; the resulting minimal attribute sets are the reducts.
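A Python sketch that builds the discernibility matrix for a made-up toy table (the boolean simplification into reducts is left out):

from itertools import combinations

def discernibility_matrix(table, attributes):
    """For every pair of objects, list the attributes whose values differ."""
    return {(i, j): [a for a in attributes if table[i][a] != table[j][a]]
            for i, j in combinations(range(len(table)), 2)}

table = [{"G1": "high", "G2": "low"},
         {"G1": "high", "G2": "high"},
         {"G1": "low",  "G2": "low"}]
print(discernibility_matrix(table, ["G1", "G2"]))
# {(0, 1): ['G2'], (0, 2): ['G1'], (1, 2): ['G1', 'G2']}
# Boolean function: G2 AND G1 AND (G1 OR G2), which simplifies to the reduct {G1, G2}.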
What is the relative/modulo discernibility matrix?
Same thing as the discernibility matrix, but for those pairs of classes that have different attribute values yet the same decision we enter "-" (they do not need to be discerned). Then we compute the boolean discernibility function as before and get the reducts and, from them, the rules.
What is the workflow for creating and running rule based decision systems?
If the dataset has many features, then perform MCFS to reduce the number of features.
Define the decision table with the top-ranked features and perform discretization to get qualitative levels of the features.
Define equivalence classes.
Generate the discernibility matrix and compute the boolean discernibility function to get the reducts.
Define the rules from the reducts.
Look at the quality of the rules and assess the performance of the classifier, for example with a permutation test.
Test the classifier on independent data / perform k-fold cross-validation.
Read the literature to see if the research agrees with the model.
Define accuracy, coverage, strength and support of constructed rules.
The support of a rule is the number of objects in the universe that match both the attribute values and the decision class given in the rule.
The accuracy of a rule is support / the number of objects whose attribute values match those given in the rule.
The coverage of a rule is support / the number of objects whose decision class matches the one given in the rule.
The strength of a rule is support / the total number of objects in the universe.
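A Python sketch of the four rule-quality measures for a single rule "IF conditions THEN decision"; the toy table and all names are made up:

def rule_quality(table, conditions, decision, decision_value):
    """conditions: attribute -> required value on the rule's left-hand side.
    Returns (support, accuracy, coverage, strength) of the rule."""
    lhs = [o for o in table if all(o[a] == v for a, v in conditions.items())]
    rhs = [o for o in table if o[decision] == decision_value]
    support = sum(1 for o in lhs if o[decision] == decision_value)
    accuracy = support / len(lhs) if lhs else 0.0
    coverage = support / len(rhs) if rhs else 0.0
    strength = support / len(table)
    return support, accuracy, coverage, strength

table = [{"G1": "high", "d": "sick"}, {"G1": "high", "d": "sick"},
         {"G1": "high", "d": "healthy"}, {"G1": "low", "d": "healthy"}]
print(rule_quality(table, {"G1": "high"}, "d", "sick"))
# support 2, accuracy ≈ 0.67, coverage 1.0, strength 0.5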
What are the performance variables when we look at the performance of a classifier? How are they calculated? What is a ROC curve?
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
If we put sensitivity on the y-axis and 1 − specificity (the false positive rate) on the x-axis we get a ROC curve, where the area under the curve (AUC) represents the performance of the classifier. We want an AUC above 0.5 (the diagonal line), because that indicates that the performance of the classifier is better than random chance.
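A small Python sketch of sensitivity and specificity computed from hypothetical confusion-matrix counts:

def sensitivity_specificity(tp, tn, fp, fn):
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

print(sensitivity_specificity(tp=40, tn=45, fp=5, fn=10))  # (0.8, 0.9)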
Rough sets vs decision trees?
Decision trees and rough sets are two different approaches to classification. Rough sets treat all features in the system as equally important, whereas decision trees have a hierarchical structure.
Explain decision trees
The root of the tree represents the entire dataset, and the first split from the root is the most important because it divides the largest number of objects. Further down in the tree, other nodes represent smaller splits and divergences in the data. The leaves represent the final classifications, and we can follow the branches from root to leaf to get a sort of "rule" for the classification.
How can we calculate which is the most important split for a decision tree? What is information value and information gain?
The most important split is the one where the resulting subsets reduce the information value of the entire dataset the most; this split becomes the root.
First calculate the information value of the entire dataset, then calculate it for each subset created by a candidate split, and choose the split that reduces the original information value the most. If we split the root into 3 subsets we calculate the information value for each subset and take a weighted average of those (weighted by subset size) to compare with the original information value.
The information gain is the difference between these information values, and we want it to be as high as possible.
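A Python sketch of information value (entropy) and information gain for a candidate split, on made-up labels:

import math

def entropy(labels):
    """Information value (entropy) of a list of class labels."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent_labels, child_label_groups):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

# Hypothetical split of 6 labels into two pure children:
print(information_gain(["a", "a", "a", "b", "b", "b"],
                       [["a", "a", "a"], ["b", "b", "b"]]))  # 1.0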
Why is pruning of decision trees important?
Because of the risk of overfitting the data with the decision tree method (the last nodes will contain very few examples), we need to prune the tree, which means that we reduce the tree by removing the nodes that worsen the results of the classification.
What are approximation reducts? Why do we need them?
The aim of a decision system for classification is to produce reducts, which are the sets of features we need to look at when classifying. If we have very large decision tables we will get very long reducts, and that means we might overfit the model on the training data, meaning that the decision system is too specific to the training data and will have low performance on new data.
The remedy for this is to create approximate reducts that capture the patterns but not the noise of the specific training data. These "almost reducts" are shorter reducts: sets of attributes that discern the objects in U only to some degree.
How are approximation reducts obtained?
When computing the approximations we purposely choose a smaller attribute set and see how well it approximates a larger reduct, by looking at the error of reduct approximation. For example, how big is the error when using attribute set B to approximate reduct C? If the error is not too large we can keep the approximation, i.e. the shorter, more general reduct.
For example, suppose the universe contains 18 objects, the cardinality of the positive region for (G1, G2) is 2 + 3 + 2 = 7 (quality 7/18), and the cardinality of the positive region for (G1, G2, G3) is 2 + 2 + 4 + 3 + 1 + 1 = 13 (quality 13/18).
The error of using the reduct (G1, G2) to approximate (G1, G2, G3) is then 1 - (7 / 13) = 6/13 ≈ 0.46, and if the error is acceptably small we can keep the approximation instead of the longer reduct.
What is discretization and why do we do it? What is a partition?
The discretization step determines how strictly we want to view the world and involves transforming quantitative data into qualitative levels. For instance, temperature values can be transformed into a finite number of qualitative levels like high, medium and low. In the discretization process we search for a set of cuts with which we can divide the quantitative data into qualitative levels.
A partition P of the decision system A is a partition of every value set Va into non-overlapping intervals defined by a number of cuts. The size of the partition is the number of cuts we construct. The process of constructing this is called discretization: A = (U, A ⋃ {d}) becomes Ap = (U, Ap ⋃ {d}), the discretized system.
What are the different methods we can use to obtain partitions?
- User defined: good if we have domain knowledge
- Infer cuts from the data:
  - Equal frequency binning
  - Naive methods
  - Entropy based
  - Linear discriminant discretization
  - Boolean reasoning algorithm
Explain the boolean reasoning discretization
We define all intervals of values found in our decision table (excluding intervals where no values occur). Then we find the middle points of those intervals and define them as candidate cuts. To construct a minimal set of cuts that still discerns all objects we use boolean reasoning, i.e. we do not need all the cuts we originally constructed to discern the objects.
For this we list, for each pair of objects, the combinations of cuts that discern them (not including pairs with the same decision); these correspond to a discernibility matrix, but for the cuts. We then simplify this boolean function to get the final, minimal set of cuts.
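A Python sketch of the first step, generating candidate cuts as midpoints between consecutive distinct values (the boolean minimization of the cut set is omitted):

def candidate_cuts(values):
    """Candidate cuts: midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(candidate_cuts([0.8, 1.0, 1.0, 1.3, 1.4]))  # [0.9, 1.15, 1.35]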
Explain the equal frequency binning algorithm
Discretization method.
Not a very smart algorithm for discretization but it is sometimes used anyway because it is very simple.
- Choose number of bins
- Sort values in ascending order
- Divide the objects into the chosen number of bins so that there is an equal number of objects in each bin.
- Calculate the cuts by taking the average of the two values on either side of each bin boundary.
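A Python sketch of equal frequency binning on made-up values; for simplicity it assumes the number of values is divisible by the number of bins:

def equal_frequency_cuts(values, n_bins):
    """Sort the values, split them into n_bins equally sized groups, and
    place each cut halfway between the boundary values of adjacent groups."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    cuts = []
    for b in range(1, n_bins):
        left = ordered[b * size - 1]
        right = ordered[b * size]
        cuts.append((left + right) / 2)
    return cuts

print(equal_frequency_cuts([1, 2, 3, 4, 5, 6], 3))  # [2.5, 4.5]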