Rough Sets & Machine Learning Flashcards
universe
all the objects go in U = {x1, x2, ...}
attributes
each attribute a has a value set Va
indiscernibility relation IND(A)
the equivalence classes: objects with identical attribute values are grouped together, written as {{x1, x2}, {x3}}
here you have to pay attention to what the question asks for
B lower is
what is 100% certain: the equivalence classes fully contained in the set
B upper
the "yes" and "maybe" objects: every equivalence class that overlaps the set
positive region is
everything that, based on what you are told, is certain; careful, what is certainly false also counts (the union of the lower approximations of all decision classes)
decision system
A = (U, A ∪ {d})
Boundary region is
B upper − B lower
accuracy of the approximation
|B lower| / |B upper|
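The approximation cards above can be checked with a small sketch; the toy objects, attribute values, and target set X below are made up for illustration:

```python
from collections import defaultdict

# Toy data (assumed for illustration): object id -> attribute-value tuple
objects = {1: ('a', 0), 2: ('a', 0), 3: ('b', 1), 4: ('b', 1), 5: ('c', 0)}
X = {1, 2, 3}  # the set we want to approximate

# Equivalence classes of IND(A): objects with identical attribute values
classes = defaultdict(set)
for obj, values in objects.items():
    classes[values].add(obj)

lower, upper = set(), set()
for c in classes.values():
    if c <= X:       # class fully inside X -> certainly in X
        lower |= c
    if c & X:        # class overlaps X -> possibly in X
        upper |= c

boundary = upper - lower                 # B-upper minus B-lower
accuracy = len(lower) / len(upper)       # |B-lower| / |B-upper|
```

Here objects 3 and 4 are indiscernible but straddle X, so both land in the boundary and the accuracy is 2/4 = 0.5.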
generalized decision
for each equivalence class of IND(A), you list the class together with the set of its decision values
the system is not consistent if any class has several decision values
decision-relative discernibility matrix
build the table over the equivalence classes and write the attributes on which each pair differs; this is DECISION-RELATIVE, so if two classes have the same decision you write θ (nothing to discern)
Boolean discernibility function
take all the differences, write each matrix entry as a disjunction in parentheses, and conjoin them:
fA(attributes) = (a ∨ b) ∧ (c ∨ d) ∧ ...
then do the simplification
careful with the simplification
the result has to keep all the values against which you simplify, so it is good to keep a small table of the clauses
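One way to sketch the simplification is absorption over clauses represented as sets: a clause that contains another clause as a subset is redundant, since (a) ∧ (a ∨ b) = (a). The clauses below are made-up examples:

```python
# Each clause is a disjunction of attributes, e.g. frozenset({'a', 'b'}) = (a v b)
clauses = [frozenset({'a'}), frozenset({'a', 'b'}), frozenset({'b', 'c'})]

# Absorption law: drop any clause that strictly contains another clause
simplified = [c for c in clauses if not any(other < c for other in clauses)]
# (a) absorbs (a v b); (b v c) survives because no clause is a subset of it
```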
support?
Number of objects that fulfill the rule
you will be given a rule; the support is all the objects that satisfy it completely (conditions and decision)
accuracy is…
the fraction of correctly classified objects among those matching the rule conditions
support / all the objects that match the rule conditions but not necessarily the decision, i.e. everyone in the same equivalence class
coverage is…
support / # of objects that belong to the rule's decision class
strength…
support / all the objects in the universe
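The four rule statistics above can be computed from a toy decision table (the data and the rule are assumptions for illustration):

```python
# Toy decision table: (attribute value, decision) for each object
table = [('x', 'yes'), ('x', 'yes'), ('x', 'no'), ('y', 'no'), ('y', 'no')]

# Rule under consideration: attribute == 'x'  ->  decision == 'yes'
matches_lhs = [row for row in table if row[0] == 'x']          # same conditions
support = sum(1 for row in matches_lhs if row[1] == 'yes')     # conditions + decision
accuracy = support / len(matches_lhs)                          # within the equivalence class
decision_class = sum(1 for row in table if row[1] == 'yes')    # size of the decision class
coverage = support / decision_class
strength = support / len(table)                                # relative to the universe
```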
what is the point of using Boolean reasoning in rough sets?
Boolean reasoning is used to obtain the reducts
Why do we need to do discretization?
Because in rough sets we use Boolean reasoning, and for that we need discrete data.
supervised learning
data with decision classes /labels
classification problems
case-control studies
algorithms: decision trees or rule-based learning
unsupervised
unknown decision classes
looking for patterns in data
e.g. hierarchical clustering
performance or interpretability?
performance for applications involving life (high-stakes decisions)
interpretability for data analysis and for understanding complex models
interpretable ML techniques aim at giving legible explanations for predictions
the permutation cut-off value needs to be set to at least…
20 to have significant results (with 20 permutations, p-values down to 0.05 become attainable)
when is undersampling necessary?
when the distribution of classes is unequal
e.g. 20 controls and 5 patients
What is the classification accuracy?
what is the expected value?
Accuracy is the strength of our model. We want accuracy above the expected value of 0.5 (for two balanced classes) because that indicates the model is correct more often than random chance.
Accuracy = (TP + TN) / (TP + TN + FP + FN): the number of correct predictions divided by the total.
can we trust the AUC with low sample sizes?
It is questionable whether we should trust the model performance with such low sample sizes.
when is it appropriate to do k-fold cross-validation?
when we don't have an external test set to evaluate our model on
sensitivity
TP/(TP+FN)
TRUE POSITIVE RATE
specificity
TN/(TN+FP)
TRUE NEGATIVE RATE
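The confusion-matrix formulas above are easy to sanity-check; the counts below are made up:

```python
# Hypothetical confusion-matrix counts
TP, TN, FP, FN = 40, 30, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of correct predictions
sensitivity = TP / (TP + FN)                 # true positive rate
specificity = TN / (TN + FP)                 # true negative rate
```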
AUC MEANING
it is the area under the ROC curve, obtained by varying the threshold (cut-off) and plotting the true positive rate against the false positive rate
How to improve a rule-based model (improving the accuracy and AUC)
- increase the number of permutations in MCFS
- increase the number of objects in both classes
- change the reducer, e.g. Johnson to genetic
- decrease or increase the number of features
- detect and remove objects that are wrongly classified
Explain how you interpret a VisuNet graph. What do the following parameters mean?
- node size
- lines between nodes
- border size
node size tells you the decision and the coverage (support).
Intensity of node color tells us the relative importance of the feature.
Lines between the nodes tell you how strongly connected the nodes are. Red, thick lines indicate stronger connections.
Border size tells you how many times that feature is included in a rule.
describe a strategy for constructing decision trees.
use a TOP-DOWN approach and construct the tree RECURSIVELY, one split at a time. For each split, the attribute with the highest gain ratio is chosen. The tree is finished when there is no possible split that reduces the information value further.
* Top-down
* recursive
* one at a time
Explain decision trees
The root of the tree represents the entire dataset, and the first split from the root is the most important because it divides the largest number of objects. Further down in the tree, other nodes represent smaller splits and divergences in the data. The leaves represent the final classifications, and we can follow the branches from root to leaf to get a sort of "rule" for the classification.
What is feature selection and when should we use it?
Identify an ordered list of attributes that best discriminates between/among decision classes.
Good for identifying the most important features for classification in very large sets of features.
External feature selection, with for example MCFS, is necessary if the dataset has more features than objects in the universe; it is done to reduce the dimensionality to the features that most affect the classification.
MCFS MAIN STEPS
- create s SUBSETS of m attributes chosen at random from the original d attributes
- s is chosen so that the difference in ranking between iterations is small and stable
- divide each subset's data into training and test sets t times
- for each training set, build a tree classifier
- evaluate it on the test set
- calculate the relative importance of each attribute
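The loop structure of these steps can be sketched as follows; the function name, parameters, and the scoring stub are assumptions, not the real MCFS implementation:

```python
import random

def mcfs_sketch(data, attributes, s, m, t, train_and_score):
    """Hypothetical sketch of the MCFS loop: s random attribute subsets of
    size m, each evaluated on t train/test splits; attributes accumulate
    credit from the classifier's test performance."""
    importance = {a: 0.0 for a in attributes}
    for _ in range(s):                              # s random subsets
        subset = random.sample(attributes, m)
        for _ in range(t):                          # t train/test splits
            score = train_and_score(data, subset)   # tree classifier + evaluation
            for a in subset:
                importance[a] += score              # relative importance credit
    return importance
```

The `train_and_score` callback stands in for the "build a tree classifier and evaluate on the test set" steps.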
DECISION TREE STEPS
- compute the info of the whole dataset
- split by each attribute and compute the info of each of its values over the decision classes, e.g. info([2, 3])
- compute the weighted info of all the value groups of, for example, gene 1: if a group has class counts [2, 3], it contributes 5 / (total objects) × its info, summed with the info of the other groups
- gain = original info − weighted info
- split_info = the info of the group sizes, i.e. info over the sums of the class counts of each group
- gain_ratio = gain / split_info
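The steps above, worked through in code with a made-up attribute that splits five objects into groups with class counts [2, 0] and [1, 2]:

```python
from math import log2

def info(counts):
    """Entropy of a class-count distribution, e.g. info([2, 3])."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

whole = info([3, 2])                        # info of the whole dataset
groups = [[2, 0], [1, 2]]                   # class counts within each value group
sizes = [sum(g) for g in groups]            # group sizes: [2, 3]
total = sum(sizes)

weighted = sum(sz / total * info(g) for sz, g in zip(sizes, groups))
gain = whole - weighted                     # original info - weighted info
split_info = info(sizes)                    # info of the [2, 3] split itself
gain_ratio = gain / split_info
```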
steps for creating a rule-based model
- put aside an external validation set of subject samples
- data preprocessing: remove incomplete data
- feature selection: perform a feature selection to select the most important features and reduce noise
advantages and disadvantages of MCFS
Advantages:
preservation of the features
ranking of the features and their statistical significance
little feature shadowing
Disadvantages:
not possible to explain variability in the data
computationally expensive
odd distribution of objects (e.g. 20 cases and 2 controls) requires undersampling
discretization: when should it be performed?
before the split into test and training sets
genetic algorithm
uses evolution-inspired search and keeps searching for better solutions
Boolean expression law used in simplification
(A ∨ B) ∧ A = A
technique used in MCFS to find a cutoff
permutation test
resources to interpret gene expression levels
Ensembl, Gene Ontology, KEGG