Knowledge-based systems Flashcards

1
Q

Describe supervised and unsupervised learning. Give examples of both.

A

Supervised learning (e.g. classification) is when the model is trained on data where the objects have predefined labels; rule-based learning and decision trees are examples of supervised learning. Unsupervised learning is when the model must find patterns in an unlabeled dataset; clustering methods are an example.

2
Q

When is it more suited to use supervised learning over unsupervised learning?

A

When we have data where the objects have known labels and we want, for instance, to learn to classify them based on a set of features, i.e. when we want to predict an output from an input.

3
Q

When is it more suited to use unsupervised learning over supervised learning?

A

When we have data with unknown labels and we want to find patterns in the data, for example with k-means or hierarchical clustering.

4
Q

A problem we can face is whether we should prioritize interpretability or performance of our models. When should we prioritize which?

A

If the model's application affects human lives, then performance (sensitivity and specificity) should be prioritized. If we want to find patterns in very complex research data, the need for the model to be easily interpreted becomes more important; interpretability thus matters more in research. However, very low performance means that we simply cannot trust the model's predictions.

5
Q

What is undersampling? When is it necessary?

A

Undersampling is necessary when the class distribution is unequal. To get an equal representation of the classes, the algorithm randomly removes objects from the majority class without replacement; this is done to avoid biased performance caused by the imbalance.
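A minimal sketch of random undersampling without replacement, assuming the data comes as parallel lists of objects and labels (the function name and interface are illustrative, not from any particular library):

```python
import random
from collections import Counter

def undersample(objects, labels, seed=1):
    """Randomly drop majority-class objects (without replacement)
    until every class has as many objects as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())                      # size of the minority class
    kept = []
    for cls in counts:
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        kept.extend(rng.sample(idx, target))           # sampling without replacement
    kept.sort()
    return [objects[i] for i in kept], [labels[i] for i in kept]
```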

6
Q

What is feature selection? When is it necessary?

A

External feature selection with, for example, MCFS is necessary if the dataset has more features than objects in the universe; it is done to reduce the dimensionality to the features that affect the classification the most. The reducts created in rule-based classification are also a type of feature selection that we can use when the number of features is smaller.

7
Q

What is CutOffPermutations in MCFS? What should you set it to?

A

CutOffPermutations is the number of permutations we run when we compare the RI values produced by the feature selection with the RI values obtained from stochastically (randomly) relabeled objects, to see whether the RI value of a feature is significantly better than random chance. It should be set to at least 20 to get significant results.

8
Q

Explain MCFS, why is it needed and how is it performed?

A

Monte Carlo Feature Selection (MCFS) is needed for datasets with many features, to reduce the dataset to only the features that affect the classification the most. MCFS draws s randomly selected subsets of the original feature set, each of size m; if m is 0.1, each subset contains 10% of the original features. Each subset is then split into a training set and a test set, with the usual distribution of 2/3 training and 1/3 testing. For each subset, decision trees are built on randomly chosen training sets and evaluated on randomly chosen test sets. The results of all decision trees are aggregated into a relative importance (RI) value for each feature. To test the significance of the RI values we run a permutation test with at least 20 runs, to see whether the obtained RI value is better than the RI value we could get by random chance.

9
Q

What is a confusion matrix in MCFS? What is the accuracy we expect? How do we calculate accuracy?

A

From each test of the constructed decision trees we get a summarized confusion matrix of the predictions. In this matrix we can see how many times the predictions were correct and incorrect, and from this we can calculate the performance of the predictions (sensitivity and specificity, which together give the overall accuracy). We want an accuracy above the expected 0.5 (for a balanced two-class problem), because that indicates that the model is correct more often than random chance. Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. the number of correct predictions divided by the total number of predictions.
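A small illustrative calculation of these metrics from a 2x2 confusion matrix (the counts are made up):

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

print(confusion_metrics(tp=40, tn=35, fp=15, fn=10))   # (0.75, 0.8, 0.7)
```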

10
Q

What is the goal of rule based decision systems in rough set theory?

A

To be able to predict an output from an input, i.e. to predict the decision attribute of an object based on the values of its features.

11
Q

How do we define a decision system?

A

A = (U, A ⋃ {d}). The decision system A is defined by the universe of objects U and the set of condition attributes A in union with the decision attribute d (where d is not in A).

12
Q

What is Va in a decision system?

A

Va is the value set of attribute a, i.e. the set of values that attribute a can take in the decision system.

13
Q

What are indiscernibility classes?

A

Groups of objects in the universe that we cannot discern from each other with respect to a specific set of features, because they take the same value for each of those features (excluding the decision).

14
Q

What are the upper and lower approximations?

A

For a given decision value, the lower approximation is the union of the indiscernibility classes that certainly belong to that decision value, because all objects in the class take that decision value. The upper approximation is the lower approximation plus the classes that may belong to that decision value, i.e. classes where we do not know, because the objects in the class take different decision values.

15
Q

What is the positive region?

A

The positive region is the union of the indiscernibility classes that certainly belong to one of the decision classes, i.e. the union of the lower approximations of all decision classes.

16
Q

What is the rough membership function?

A

With the rough membership function we can decide whether the indiscernibility classes of a specific set of features form a good model for predicting a decision class. We take the cardinality of the objects in an indiscernibility class that have the given decision class and divide it by the total number of objects in that indiscernibility class. We do this for each class, look at the range of values, and from this we can interpret whether the feature set is associated with a specific decision class.
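A minimal sketch of this calculation, assuming objects are given as tuples of attribute values with a parallel list of decisions (all names and values here are illustrative):

```python
from collections import defaultdict

def rough_membership(objects, decisions, target_class):
    """For each indiscernibility class (objects with identical attribute values),
    return |[x]_B intersect X| / |[x]_B| for the target decision class X."""
    classes = defaultdict(list)
    for obj, dec in zip(objects, decisions):
        classes[tuple(obj)].append(dec)
    return {key: sum(d == target_class for d in decs) / len(decs)
            for key, decs in classes.items()}

# two objects share the values ('high', 'yes'); only one of them is 'sick'
objs = [('high', 'yes'), ('high', 'yes'), ('low', 'no')]
decs = ['sick', 'healthy', 'healthy']
print(rough_membership(objs, decs, 'sick'))   # {('high', 'yes'): 0.5, ('low', 'no'): 0.0}
```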

17
Q

What is a discernibility matrix, boolean discernibility function and reducts?

A

From the equivalence (indiscernibility) classes we can create a matrix that lists, for each pair of classes, which attributes discern them from each other. From this matrix we can define and compute the boolean discernibility function, which tells us in a simplified way which attributes discern the groups: write the entries of the matrix as a conjunction of disjunctions and simplify (crossing out absorbed terms) to get the reducts. From the reducts we define the rules of the decision system.

18
Q

What is the relative discernibility matrix?

A

The same as the discernibility matrix, but for those pairs of classes that have different attribute values yet the same decision we put a '-' (they do not need to be discerned). We then compute the boolean discernibility function and obtain the reducts and rules, now relative to the decision.

19
Q

What is the workflow for creating and running rule based decision systems?

A

- If the dataset has many features, perform MCFS to reduce the number of features.
- Define the decision table with the top-ranked features and discretize to get qualitative levels of the features.
- Define the equivalence classes.
- Generate the discernibility matrix and compute the boolean discernibility function to get the reducts.
- Define the rules from the reducts.
- Look at the quality of the rules and assess the performance of the classifier with a permutation test.
- Test the classifier on independent data / perform k-fold cross-validation.
- Read the literature to see whether existing research agrees with the model.

20
Q

Define accuracy, coverage, strength and support of constructed rules.

A

The support of a rule is the number of objects in the universe that match both the attribute values and the decision class given in the rule. The accuracy of a rule is support / the number of objects whose attribute values match those given in the rule. The coverage of a rule is support / the number of objects whose decision class matches the one given in the rule. The strength of a rule is support / the total number of objects in the universe.
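A toy calculation of the four measures for a single rule, assuming objects are (attribute-values, decision) pairs and the left-hand side of the rule is given as a predicate (all names are illustrative):

```python
def rule_quality(objects, lhs_match, rule_decision):
    """objects: list of (values, decision) pairs; lhs_match: predicate on values."""
    n = len(objects)
    lhs = [o for o in objects if lhs_match(o[0])]
    rhs = [o for o in objects if o[1] == rule_decision]
    support = sum(1 for o in lhs if o[1] == rule_decision)
    return {
        "support": support,
        "accuracy": support / len(lhs),   # support / objects matching the LHS
        "coverage": support / len(rhs),   # support / objects in the decision class
        "strength": support / n,          # support / |U|
    }
```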

21
Q

What are the performance variables when we look at the performance of a classifier? How are they calculated? What is a ROC curve?

A

Sensitivity = TP / (TP + FN). Specificity = TN / (TN + FP). If we put sensitivity (the true positive rate) on the y-axis and 1 - specificity (the false positive rate) on the x-axis, we get a ROC curve, where the area under the curve (AUC) represents the performance of the classifier. We want an AUC above 0.5 (the diagonal line), because that indicates that the classifier performs better than random chance.

22
Q

Rough sets vs decision trees?

A

Decision trees and rough sets are two different approaches to classification. Rough sets treat all features in the system as equally important, whereas decision trees impose a hierarchical structure on the features.

23
Q

Explain decision trees

A

The root of the tree represents the entire dataset, and the first split from the root is the most important because it divides the largest number of objects. Further down the tree, other nodes represent smaller splits of the data. The leaves represent the final classifications, and we can follow the branches from root to leaf to get a kind of "rule" for the classification.

24
Q

How can we calculate which is the most important split for a decision tree? What is information value and information gain?

A

The most important split is the one whose resulting subsets reduce the information value of the entire dataset the most; this split becomes the root. First calculate the information value of the entire dataset, then calculate it for each subset produced by a candidate split, and choose the split that reduces the original information value the most. If a split divides the root into 3 subsets, we calculate the information value for each subset and take their weighted average to compare with the original information value. The information gain is the difference between these information values, and we want it to be as high as possible.

25
Q

Why is pruning of decision trees important?

A

Because of the risk of overfitting the data with the decision tree method (the last nodes will contain very few examples), we need to prune the tree, which means that we remove the nodes that worsen the results of the classification.

26
Q

What are approximation reducts? Why do we need them?

A

The aim of a decision system for classification is to produce reducts, i.e. the sets of features we need to look at when classifying. If we have really big decision tables we will get really long reducts, which means that we might overfit to the training data: the decision system becomes too specific to the training data and will have low performance on new data. The remedy is to create approximate reducts that capture the patterns but not the noise specific to the training data. These "almost reducts" are shorter attribute sets that discern the objects in U to some degree.

27
Q

How are approximation reducts obtained?

A

When obtaining the approximations we purposely choose a smaller attribute set and see how well it approximates a larger reduct by looking at the error of reduct approximation: for example, how big is the error when using attribute set B to approximate reduct C? If the error is not too large we keep the approximation, i.e. the shorter, more general reduct.
For example, if the cardinality of the equivalence classes in the positive region for (G1, G2) is 2 + 3 + 2 = 7 out of 18, the cardinality of the equivalence classes in the positive region for (G1, G2, G3) is 2 + 2 + 4 + 3 + 1 + 1 = 13 out of 18, and the universe has 18 objects, then the error of using the reduct (G1, G2) to approximate (G1, G2, G3) is 1 - 7/13 = 6/13 ≈ 0.46. If the error is relatively small we can keep the approximation instead of the long reduct.
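The same calculation written out as a tiny helper (the positive-region sizes are the ones from the example above):

```python
def reduct_approx_error(pos_B, pos_C):
    """Error of approximating reduct C by the smaller attribute set B,
    given the sizes of their positive regions."""
    return 1 - pos_B / pos_C

print(round(reduct_approx_error(pos_B=7, pos_C=13), 2))   # 0.46
```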

28
Q

What is discretization and why do we do it? What is a partition?

A

The discretization step determines how strictly we want to view the world; it involves transforming quantitative data into qualitative levels. For instance, temperature values can be transformed into a finite number of qualitative levels such as high, medium and low. In the discretization process we search for a set of cuts with which we can divide the quantitative data into qualitative levels. A partition of the universe of A is a partition of every value set Va into non-overlapping intervals defined by a number of cuts. The size of the partition is the number of cuts we construct, and the process of constructing it is called discretization. A = (U, A ⋃ {d}) becomes Ap = (U, Ap ⋃ {d}), which is the discretized system.

29
Q

What are the different methods we can use to obtain partitions?

A
• User-defined cuts (good if we have domain knowledge)
• Infer the cuts from the data:
- Equal frequency binning
- Naive methods
- Entropy based
- Linear discriminant discretization
- Boolean reasoning algorithm
30
Q

Explain the boolean reasoning discretization

A

We define all the intervals of values we can find in our decision table (excluding intervals where no values fall). We then find the midpoints of those intervals and define them as candidate cuts. To construct a minimal set of cuts that still discerns all objects we use boolean reasoning, i.e. we do not need all the cuts we originally constructed to discern the values. For this we list all combinations of cuts needed to discern each pair of objects from each other (not including pairs with the same decision); these correspond to a discernibility matrix, but for the cuts. We then simplify the resulting boolean function to get the final set of cuts.

31
Q

Explain the equal frequency binning algorithm

A

A discretization method. Not a very smart algorithm, but it is sometimes used anyway because it is very simple. It sorts the values of a feature in ascending order; you choose the number of bins, and the values are then placed into the bins so that each bin holds an equal number of objects. If you choose 3 bins you get 2 cuts.
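A minimal sketch of the idea; placing each cut halfway between neighbouring bins is a simplification, not a fixed convention:

```python
def equal_frequency_cuts(values, n_bins):
    """Sort the values and place cuts so that each bin gets (roughly) the same
    number of objects; n_bins bins give n_bins - 1 cuts."""
    v = sorted(values)
    size = len(v) / n_bins
    return [(v[int(round(size * i)) - 1] + v[int(round(size * i))]) / 2
            for i in range(1, n_bins)]

print(equal_frequency_cuts([64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85], 3))
# [69.5, 77.5]
```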

32
Q

Explain the naive discretization algorithm

A

A discretization method. This algorithm places a cut between two adjacent values whenever the corresponding objects have different decision classes. It often produces a lot of cuts.

33
Q

How do we calculate the relative importance (RI) value for a feature when doing MCFS?

A

The RI of a specific feature sums, over all trees in which the feature is used for a split, the weighted accuracy of the tree times the information gain of the node that splits on the feature times the fraction of objects in that node compared to the number of objects in the tree's root. The weighted accuracy (wAcc) is obtained by taking 1/c times the sum, over the c classes, of the fraction of objects of each class that were correctly classified.
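In the MCFS literature the score is usually written roughly as follows, where s·t is the total number of trees, n_g(τ) runs over the nodes of tree τ that split on feature g, and u, v are weighting parameters (often set to 1); treat the exact exponents as parameters of the method rather than something to derive:

```latex
RI_g = \sum_{\tau=1}^{s \cdot t} (wAcc)^{u}
       \sum_{n_g(\tau)} IG\big(n_g(\tau)\big)
       \left( \frac{\text{no. of objects in } n_g(\tau)}{\text{no. of objects in the root of } \tau} \right)^{v}
```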

34
Q

What is the information value? How do we calculate it?

A

The information value is calculated when we construct a decision tree; when we choose splits, we first choose the split that reduces the information value the most. info([x, y]) = -x/n * log2(x/n) - y/n * log2(y/n), where x and y are the number of objects from each decision class and n is the total number of objects, i.e. n = x + y. If we want to split the dataset on a feature with three values, we calculate the information value of each resulting subset and take the weighted average of those.
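The same formula, generalised to any number of classes, with the classic two-class example [9, 5] (the counts are only illustrative):

```python
from math import log2

def info(counts):
    """Information value (entropy) of a class distribution, e.g. info([9, 5])."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

print(round(info([9, 5]), 3))   # 0.94
print(round(info([2, 2]), 3))   # 1.0 (maximally mixed classes)
```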

35
Q

What is the information gain value?

A

The information value of the entire dataset minus the weighted average of the information values of the subsets created by the split.

36
Q

What is the information gain ratio? Why do we need it?

A

The information gain ratio is the information gain / split information value. We need this ratio because the information gain value by itself is biased towards features with many values. We then choose the split with the highest information gain ratio for the first node in the tree.

37
Q

What is the split information value?

A

Split information is a measure of how evenly the data is divided by a particular split; it is the information value computed over the branch sizes instead of the class counts. For example, if a feature splits the data into three branches of sizes 5, 4 and 5, then split_info = info([5, 4, 5]); if a feature splits the data into two branches with class distributions [2, 2] and [2, 2] (4 objects each), then split_info = info([4, 4]) = 1.

38
Q

What is k-fold cross-validation? Why do we need it?

A

Cross-validation is a way to test a classifier on data that it has not been trained on. We need to test the classifier on independent data because if we test it on the training data, the model could be overfitted to that data and give a skewed performance estimate. K-fold cross-validation means that we divide the dataset into k sections; the classifier is trained on k - 1 sections and tested on the remaining one. You vary the test section so that the classifier is trained and tested on all sections, and then you take the average of the performance values.
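A bare-bones sketch of the procedure, assuming generic train(X, y) and evaluate(model, X, y) callables (both are placeholders, not from a specific library):

```python
import random

def k_fold_cv(X, y, k, train, evaluate, seed=1):
    """Split the indices into k folds, train on k-1 folds, test on the held-out
    fold, and return the average performance over all k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]               # k roughly equal folds
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m in range(k) if m != i for j in folds[m]]
        model = train([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(evaluate(model, [X[j] for j in test_idx],
                               [y[j] for j in test_idx]))
    return sum(scores) / k
```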

39
Q

What is the boundary region?

A

Upper approximation - lower approximation.

40
Q

What is the accuracy of lower and upper approximations?

A

The cardinality of the lower approximation divided by the cardinality of the upper approximation.

41
Q

What is the generalized decision attribute of equivalence classes?

A

The set of decisions that the objects in the equivalence class can take. If any equivalence class has more than one possible decision, the decision table is not consistent.

42
Q

How can we decide what the root of a decision tree should be?

A

- Calculate the information value for the entire dataset.
- Calculate the information value for each class of each feature, then calculate the weighted average and the information gain for each split.
- Calculate the split information value for each feature and divide the information gain by the split information value.
- Choose the highest information gain ratio and construct the first layer of nodes.
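A compact sketch that combines the steps above, reusing the info() function from the entropy sketch earlier in this deck (the data layout, a list of (feature-dict, decision) pairs, is an assumption):

```python
from collections import Counter, defaultdict

def gain_ratio(dataset, feature, info):
    """dataset: list of (features_dict, decision). Returns the information
    gain ratio obtained by splitting on `feature`."""
    n = len(dataset)
    total_info = info(list(Counter(d for _, d in dataset).values()))
    branches = defaultdict(list)
    for feats, dec in dataset:
        branches[feats[feature]].append(dec)
    weighted = sum(len(b) / n * info(list(Counter(b).values()))
                   for b in branches.values())
    gain = total_info - weighted
    split_info = info([len(b) for b in branches.values()])
    return gain / split_info if split_info > 0 else 0.0

# the root is the feature with the highest gain ratio:
# root = max(feature_names, key=lambda f: gain_ratio(data, f, info))
```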

43
Q

How do you calculate the information value?

A

info([x,y]) = -x/n * log2(x/n) - y/n * log2(y/n) where x and y are the number of objects from each decision and n is the total number of objects i.e. (n = x + y).

44
Q

What are the different steps in creating a rule based model?

A

- Get the dataset.
- Data preprocessing: discretization of the data (could also be done after feature selection), removal of incomplete data, dealing with outliers, etc.
- Split the data into training and testing sets.
- Feature selection.
- Build the rule-based model with, for example, ROSETTA. Here we get the rules and assess the quality of the model after running a permutation test with at least 20 permutations.
- Test on independent data.
- Visualize the rule network.

45
Q

What are the pros and cons of using MCFS?

A

The pros are that you preserve the best features and that the features are ranked by statistical significance. The cons are that computing the p-values for each feature is computationally costly and that it is not possible to see the variability in the data.

46
Q

How can you increase the mean accuracy and AUC of your model?

A

- Use more samples
- Get more features
- Test a different discretization method

47
Q

Which information would you consider for highlighting the quality of a given rule?

A

- Support
- Accuracy
- Strength
- Coverage

48
Q

Explain how you interpret a VisuNet graph. What do the following parameters mean?
- node size
- lines between nodes
- border size

A

Node size tells you the decision and the coverage/support. The intensity of the node color tells us the relative importance of the feature. The lines between the nodes tell you how strongly connected the nodes are; red, thick lines indicate stronger connections. Border size tells you how many times that feature is included in a rule.

49
Q

Which technique would you consider to cope with an odd (unbalanced) distribution of objects, and which problem may you encounter in this particular case? What could be the problem with unbalanced datasets?

A

Undersampling, which is part of R.ROSETTA, deals with the imbalance, but the problem is that you can end up with too few samples. The class imbalance issue may otherwise lead to biased performance of the machine learning models.

50
Q

When should discretization be performed in the workflow?

A

After you have done the feature selection and before splitting the dataset into training and test sets. It is more efficient to discretize the smaller, significant dataset than the whole dataset, and we do it before the split because otherwise we would have to do the discretization twice.

51
Q

How would you support your findings from a rule based model?

A

Read the literature to find sources that give biological explanations for what your model shows.

52
Q

Which technique is used by MCFS to find a cut-off in the assessment of statistical significance of features?

A

permutation test

53
Q

What is the interpretation of two strongly connected nodes in VisuNet?

A

The two nodes often co-occur in rules and they therefore have some function in common.

54
Q

When examining VisuNet rule networks for a binary classifier we sometimes see the same descriptor be used in rules for opposite decision classes, why?

A

Because of noise in the data.

55
Q

Explain what reducts are.

A

The reducts are what we get from computing the boolean discernibility function; they tell us which (minimal sets of) features we can use to discern between the decision classes.

56
Q

When is it good to use cross validation?

A

If we do not have an external test set it is good to use cross validation because that means that we do not train and test the model on the same data.

57
Q

You build a model for classifying leukemia with 20 objects in your universe. 10 are classified as malignant and 10 as not malignant. The mean accuracy is 56% and the ROC AUC is 0.68. 10-fold cross-validation was performed. What do you think of this model?

A

20 objects is a rather small universe, and we could increase the accuracy by adding more samples, since an AUC of 0.68 is not very far from random classification. To increase accuracy we could also try another discretization method. The dataset is balanced, so the model is not biased in that respect, but with a mean accuracy of 56% we still cannot trust the model very much. With so few objects, leave-one-out cross-validation would have been a better choice than 10-fold cross-validation.

58
Q

What is the difference between 10-fold cross-validation and leave-one-out cross-validation? When is it better to use leave-one-out cross-validation?

A

Leave-one-out cross-validation means that you train your model on all data points in your set except for one and then test it on the left-out data point. This is repeated until the model has been trained and tested on all data points in the set. 10-fold cross-validation means that you divide your universe into 10 parts, train on 9 of them and test on the remaining one; this is done 10 times until the model has been trained and tested on all 10 parts. Leave-one-out cross-validation is better for small datasets; it is too computationally heavy to do on large datasets.

59
Q

Describe how randomization techniques can be used to assess the significance of a classifier?

A

A permutation test can be used to get a p-value for whether the model is significantly better than random chance. We first let the model classify the objects with the correct labels and obtain a measure of the model's performance. We then shuffle the labels on the objects and let the model classify the randomly relabeled objects again. Repeat this 100 or 1000 times and compare the model's performance on the real data with the distribution of performances on the randomized data.
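A minimal sketch of the idea, assuming a generic score(X, y) callable that trains and evaluates the model (a placeholder, not a real API):

```python
import random

def permutation_p_value(X, y, score, n_perm=1000, seed=1):
    """Empirical p-value: the fraction of label-shuffled runs that score
    at least as well as the model does on the real labels."""
    rng = random.Random(seed)
    observed = score(X, y)
    count = 0
    for _ in range(n_perm):
        shuffled = list(y)
        rng.shuffle(shuffled)
        if score(X, shuffled) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)   # add-one correction
```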

60
Q

When is the cross-validation done?

A

When we train and test the model.

61
Q

State the differences between supervised and unsupervised learning.

A

In supervised learning the labels of objects are known and the methods seek definitions of decision classes. In unsupervised learning the labels of the objects are unknown and the methods primarily seek the labels.

62
Q

Is reduct computation in an information system an example of supervised or unsupervised learning?

A

It is unsupervised learning, because the method is searching for the reducts, which we do not know beforehand.

63
Q

Say that you have a dataset with four objects and 5 features. How would you construct a distance matrix? How would you then do hierarchical clustering to divide the objects into two groups?

A

The distance matrix has the objects on the x- and y-axes, and each distance value is the sum of the (absolute) differences between the values of each attribute. In the hierarchical clustering we choose the smallest distance and merge those two objects into one group. We then calculate the new distances to the remaining objects by taking the average of the distances of the grouped objects to each remaining object. This is repeated until there are only two groups left.
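A compact sketch of exactly this procedure (Manhattan distance plus average linkage); the four example objects and their values are made up:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cluster(objects, n_groups=2):
    """Agglomerative clustering with average linkage until n_groups remain.
    Each cluster is a list of indices into `objects`."""
    clusters = [[i] for i in range(len(objects))]

    def dist(c1, c2):   # average pairwise distance between two clusters
        return (sum(manhattan(objects[i], objects[j]) for i in c1 for j in c2)
                / (len(c1) * len(c2)))

    while len(clusters) > n_groups:
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

print(cluster([(1, 2, 0, 1, 3), (1, 2, 1, 1, 3), (5, 7, 6, 5, 8), (5, 6, 6, 5, 9)]))
# [[0, 1], [2, 3]]
```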

64
Q

What is a reducer in r.rosetta? Johnson reducer vs genetic reducer?

A

Methods for computing reducts with different aims. The main aim of the Johnson algorithm is to find a feature a ∈ A that discerns the highest number of object pairs; it is a deterministic, greedy algorithm. The genetic reducer is a stochastic method based on Darwin's theory of natural selection; it is a heuristic algorithm for function optimization that follows the "survival of the fittest" idea.