Knowledge based systems Flashcards
Describe supervised and unsupervised learning. Give examples of both.
Supervised learning (classification) is when the model is trained on data where the objects have predefined labels; rule-based learning and decision trees are examples of supervised learning. Unsupervised learning is when the model must find patterns in an unlabeled dataset; clustering methods are examples of unsupervised learning.
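A minimal sketch of both settings, assuming scikit-learn and NumPy are available (the toy data X and y are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 6 objects described by 2 features each.
X = np.array([[0, 1], [1, 1], [0, 0], [5, 6], [6, 5], [5, 5]])
y = np.array([0, 0, 0, 1, 1, 1])  # predefined labels (supervised setting)

# Supervised: a decision tree learns to map the features to known labels.
tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[0.5, 0.5]]))  # classify a new, unseen object

# Unsupervised: k-means groups the objects without ever seeing y.
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_)  # cluster assignments discovered from X alone
```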
When is it more suited to use supervised learning over unsupervised learning?
When we have data where the objects have known labels and we, for instance, want to learn to classify them based on a set of features, i.e., when we want to predict output from input.
When is it more suited to use unsupervised learning over supervised learning?
When we have data with unknown labels and we want to find patterns in the data, for example with k-means clustering or hierarchical clustering.
A problem we can face is whether we should prioritize interpretability or performance of our models. When should we prioritize what?
If the model's application affects human lives, then performance (sensitivity and specificity) should be prioritized, since very low performance means that we simply cannot trust the model's predictions. If we instead want to find patterns in very complex research data, the need for the model to be easily interpreted becomes more important; interpretability matters more in research.
What is undersampling? When is it necessary?
Undersampling is necessary when the distribution of classes is unequal. To get an equal representation of the classes, the algorithm randomly removes objects from the majority class without replacement; this avoids biased performance caused by the imbalance.
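A minimal NumPy sketch of random undersampling without replacement for a two-class dataset (the function name is hypothetical):

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop majority-class objects without replacement until
    both classes are equally represented (two-class sketch)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    keep_maj = rng.choice(np.flatnonzero(y == majority),
                          size=counts.min(), replace=False)
    keep = np.concatenate([np.flatnonzero(y == minority), keep_maj])
    return X[keep], y[keep]
```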
What is feature selection? When is it necessary?
External feature selection, with for example MCFS, is necessary if the dataset has more features than objects in the universe, and it is done to reduce the dimensionality to the features that most affect the classification. The reducts created in rule-based classification are also a type of feature selection that we can do when the number of features is smaller.
What is cutoffPermutations in MCFS? What should you set it to?
cutoffPermutations is the number of permutation runs used to compare the RI values produced by the feature selection with the RI values obtained when the objects are labeled stochastically, to see if the RI values from the feature selection are significantly better than random chance. It should be set to at least 20 to get significant results.
Explain MCFS, why is it needed and how is it performed?
Monte Carlo Feature Selection is required for datasets with many features, to reduce the dataset to only the features that affect the classification the most. MCFS draws s random subsets of size m from the original feature set; if m is 10% of the features, each subset contains only 10% of the original features. For each subset, the objects are repeatedly split into a training set and a test set with the usual distribution of 2/3 training and 1/3 testing, and a decision tree is trained on the training set and evaluated on the test set. The result over all decision trees is an RI (relative importance) value for each feature. To test the significance of the RI values we run a permutation test with at least 20 runs, to see if the obtained RI value is better than the RI value we could get by random chance.
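A heavily simplified sketch of the MCFS loop, assuming scikit-learn; the function name is hypothetical, and the real implementation (e.g., the rmcfs R package) computes RI from weighted contributions inside each tree, which is only approximated here by feature importances weighted by test accuracy:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def mcfs_ri(X, y, s=1000, m_frac=0.1, seed=0):
    """Accumulate a relative importance (RI) score per feature over s
    decision trees, each grown on a random subset of m_frac of the
    features and a random 2/3 training / 1/3 test split."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    m = max(1, int(m_frac * n_features))
    ri = np.zeros(n_features)
    for _ in range(s):
        cols = rng.choice(n_features, size=m, replace=False)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[:, cols], y, test_size=1/3,
            random_state=int(rng.integers(10**6)))
        tree = DecisionTreeClassifier().fit(X_tr, y_tr)
        # Approximate each feature's contribution by its importance in
        # the tree, weighted by the tree's accuracy on the test set.
        ri[cols] += tree.score(X_te, y_te) * tree.feature_importances_
    return ri  # higher RI = more important for the classification
```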
What is a confusion matrix in MCFS? What is the accuracy we expect? How do we calculate accuracy?
From every test of the constructed decision trees we get a summarized confusion matrix of the predictions. In this matrix we can see how many times the predictions were correct and incorrect, and from this we can calculate the performance (sensitivity and specificity, which give the overall accuracy) of the predictions. For a balanced two-class problem we want accuracy above the expected 0.5, because that indicates that the model is correct more often than random chance. Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e., the number of correct predictions divided by the total.
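A worked example with a hypothetical summarized confusion matrix:

```python
# Hypothetical summarized confusion matrix:
#              predicted +    predicted -
# actual +        40 (TP)        10 (FN)
# actual -        15 (FP)        35 (TN)
TP, FN, FP, TN = 40, 10, 15, 35

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 75 / 100 = 0.75
sensitivity = TP / (TP + FN)                   # 40 / 50  = 0.80
specificity = TN / (TN + FP)                   # 35 / 50  = 0.70
print(accuracy, sensitivity, specificity)
```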
What is the goal of rule based decision systems in rough set theory?
To predict output from input, i.e., to predict the decision attribute of an object based on the values of its features.
How do we define a decision system?
A = (U, A ∪ {d}). The decision system A is defined by the universe U of objects, the set of condition attributes A, and the decision attribute d (d ∉ A) in union with A.
What is Va in a decision system?
Va is the value set of attribute a, i.e., the set of values that attribute can take in the decision system.
What are indiscernibility classes?
Groups of objects in the universe that we cannot discern from each other by looking at a specific set of features, because they take on the same values for each of those features (excluding the decision).
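A small sketch of how the indiscernibility classes can be computed, using a hypothetical decision table:

```python
from collections import defaultdict

# Hypothetical decision table: one dict per object, "d" is the decision.
table = [
    {"fever": "yes", "cough": "no",  "d": "flu"},
    {"fever": "yes", "cough": "no",  "d": "cold"},
    {"fever": "no",  "cough": "yes", "d": "cold"},
    {"fever": "no",  "cough": "yes", "d": "cold"},
]

def indiscernibility_classes(table, features):
    """Group object indices that take the same values on the given
    features; the decision attribute is deliberately ignored."""
    classes = defaultdict(list)
    for i, obj in enumerate(table):
        classes[tuple(obj[f] for f in features)].append(i)
    return dict(classes)

print(indiscernibility_classes(table, ["fever", "cough"]))
# {('yes', 'no'): [0, 1], ('no', 'yes'): [2, 3]}
```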
What are the upper and lower approximations?
For a given decision value, the lower approximation is the union of the indiscernibility classes that certainly belong to that decision value, because all objects in each class take on that decision value. The upper approximation is the lower approximation plus the classes that may belong to that decision value, i.e., classes where we do not know because their objects take on different decision values.
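Continuing the hypothetical table and indiscernibility_classes sketch above, the approximations for one decision value:

```python
def approximations(table, features, decision_value):
    """Lower approximation: classes whose members all take the decision
    value. Upper approximation: classes where at least one member does."""
    lower, upper = [], []
    for members in indiscernibility_classes(table, features).values():
        decisions = {table[i]["d"] for i in members}
        if decisions == {decision_value}:
            lower += members   # certainly in the decision class
        if decision_value in decisions:
            upper += members   # possibly in the decision class
    return sorted(lower), sorted(upper)

print(approximations(table, ["fever", "cough"], "cold"))
# ([2, 3], [0, 1, 2, 3]): objects 2 and 3 are certain, 0 and 1 ambiguous
```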
What is the positive region?
The positive region is the union of the lower approximations of all decision classes, i.e., the indiscernibility classes whose objects certainly belong to one of the decision classes.
What is the rough membership function?
With the rough membership function we can decide whether the indiscernibility classes induced by a specific set of features form a good model for predicting a decision class. For each indiscernibility class, we take the number of its objects that have the given decision class divided by the total number of objects in the class. Looking at the range of these values over all classes tells us whether the feature set is associated with a specific decision class.
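Continuing the same sketch, the rough membership value for each indiscernibility class:

```python
def rough_membership(table, features, decision_value):
    """For each indiscernibility class: |[x]_B intersect X| / |[x]_B|,
    i.e., the fraction of its members with the given decision value."""
    result = {}
    for key, members in indiscernibility_classes(table, features).items():
        hits = sum(table[i]["d"] == decision_value for i in members)
        result[key] = hits / len(members)
    return result

print(rough_membership(table, ["fever", "cough"], "cold"))
# {('yes', 'no'): 0.5, ('no', 'yes'): 1.0}: only the second class is a
# reliable predictor of "cold"
```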
What is a discernibility matrix, boolean discernibility function and reducts?
From the equivalence classes we can create a matrix showing which attributes discern each pair of classes. From this matrix we define and compute the boolean discernibility function, which tells us in simplified form which attributes are needed to discern the classes: write the function as a conjunction of the matrix entries and simplify it, crossing out absorbed terms, to get the reducts. From the reducts we define the rules of the decision system.
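A brute-force sketch of the discernibility matrix and reduct computation for the hypothetical table above; real toolkits simplify the boolean function rather than enumerating subsets as done here:

```python
from itertools import combinations

def discernibility_matrix(table, features):
    """For every pair of objects, record the features that discern them."""
    entries = {}
    for i, j in combinations(range(len(table)), 2):
        diff = frozenset(f for f in features if table[i][f] != table[j][f])
        if diff:
            entries[(i, j)] = diff
    return entries

def reducts(table, features):
    """Brute-force sketch: a reduct is a minimal feature subset that shares
    at least one feature with every entry of the discernibility matrix."""
    entries = list(discernibility_matrix(table, features).values())
    hitting = [set(c) for r in range(1, len(features) + 1)
               for c in combinations(features, r)
               if all(set(c) & e for e in entries)]
    # Keep only the minimal hitting sets (the prime implicants).
    return [s for s in hitting if not any(t < s for t in hitting)]

print(reducts(table, ["fever", "cough"]))
# [{'fever'}, {'cough'}]: either feature alone discerns all object pairs
```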
What is the relative discernibility matrix?
The same as the discernibility matrix, except that for pairs of classes that have different attribute values but the same decision we enter a '-' (they do not need to be discerned). We then compute the boolean discernibility function and obtain the reducts and rules as before.
What is the workflow for creating and running rule based decision systems?
If the dataset has many features, perform MCFS to reduce the number of features. Define the decision table with the top-ranked features and discretize to get qualitative levels of the features. Define the equivalence classes. Generate the discernibility matrix and compute the boolean discernibility function to get the reducts. Define the rules from the reducts. Look at the quality of the rules and assess the performance of the classifier with a permutation test. Test the classifier on independent data / perform k-fold cross-validation. Read the literature to see if research agrees with the model.
Define accuracy, coverage, strength and support of constructed rules.
The support of a rule is the number of objects in the universe that match both the attribute values and the decision class given in the rule. The accuracy of a rule is support / (number of objects matching the attribute values given in the rule). The coverage of a rule is support / (number of objects matching the decision class given in the rule). The strength of a rule is support / (number of objects in the universe).
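A worked example on the hypothetical table from the earlier sketches, for the rule IF fever = yes THEN d = flu:

```python
# Hypothetical rule, evaluated on the table above: IF fever = yes THEN d = flu.
lhs = [i for i, o in enumerate(table) if o["fever"] == "yes"]  # LHS matches
rhs = [i for i, o in enumerate(table) if o["d"] == "flu"]      # RHS matches
support = len(set(lhs) & set(rhs))       # both sides match: 1 object

accuracy = support / len(lhs)    # 1/2 = 0.50: how reliable the rule is
coverage = support / len(rhs)    # 1/1 = 1.00: fraction of the class explained
strength = support / len(table)  # 1/4 = 0.25: how common the rule is overall
print(support, accuracy, coverage, strength)
```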
What are the performance variables when we look at the performance of a classifier? How are they calculated? What is a ROC curve?
Sensitivity = TP / (TP + FN). Specificity = TN / (TN + FP). If we put sensitivity on the y-axis and 1 − specificity (the false positive rate) on the x-axis, we get a ROC curve, where the AUC represents the performance of the classifier. We want an AUC above 0.5 (the diagonal line), because that indicates that the classifier performs better than random chance.
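A minimal sketch, assuming scikit-learn, with hypothetical labels and classifier scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and classifier scores for 8 objects.
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])

# x-axis = 1 - specificity (false positive rate), y-axis = sensitivity.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # AUC > 0.5 beats random chance
```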
Rough sets vs decision trees?
Decision trees and rough sets are two different approaches to classification. Rough sets treat all features in the system as equally important, whereas decision trees impose a hierarchical structure on the features.
Explain decision trees
The root of the tree represents the entire dataset, and the first split from the root is the most important because it divides the largest number of objects. Further down the tree, other nodes represent smaller splits of the data. The leaves represent the final classifications, and we can follow the branches from root to leaf to read off a kind of "rule" for the classification.
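A small illustration, assuming scikit-learn: export_text prints each root-to-leaf path, which reads like a rule (the data is the hypothetical toy set from the first sketch):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data (same as in the first sketch above).
X = [[0, 1], [1, 1], [0, 0], [5, 6], [6, 5], [5, 5]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier().fit(X, y)
# Each printed root-to-leaf path reads as one classification rule.
print(export_text(tree, feature_names=["f1", "f2"]))
```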
How can we calculate which is the most important split for a decision tree? What is information value and information gain?
The most important split is the one where the resulting subsets reduce the information value (entropy) of the entire dataset the most; this split becomes the root. First calculate the information value of the entire dataset, then calculate it for each subset produced by a split, and choose the split that reduces the original information value the most. If a split produces three subsets, we calculate the information value for each and take their size-weighted average to compare with the original information value. The information gain is the difference between these information values, and we want it to be as high as possible.
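A minimal sketch of entropy and information gain for a hypothetical split (the labels are illustrative):

```python
import numpy as np

def entropy(labels):
    """Information value of a set of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted average entropy
    of the child nodes produced by a split."""
    weighted = sum(len(c) / len(parent) * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["a"] * 4 + ["b"] * 4  # hypothetical labels at the parent node
print(information_gain(parent, [["a", "a", "a", "b"], ["b", "b", "b", "a"]]))
# 1.0 - 0.811 = 0.189 bits gained by this split
```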
Why is pruning of decision trees important?
Because the decision tree method risks overfitting the data (the last nodes will contain very few examples), we need to prune the tree, which means removing the nodes that worsen the classification results on unseen data.