Rough Sets & Machine Learning Flashcards
universe
all the objects go in U = {x1, x2, ...}
attributes
each attribute a has a value set Va
indiscernibility relation IND(A)
the equivalence classes: objects with identical attribute values are grouped together, written as {{x1, x2}, {x3}}
here you have to pay attention to what the question asks for
B lower is
what is 100% certain: the equivalence classes fully contained in the set
B upper
the "yes" and "maybe" objects: every equivalence class that overlaps the set
positive region is
everything that, based on what you are told, is certain; careful, what is certainly false also counts (the union of the lower approximations of all decision classes)
decision system
A = (U, A ∪ {d})
Boundary region is
B upper − B lower
accuracy of the approximation
|B lower| / |B upper|
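The approximation cards above can be checked with a small sketch; the toy objects, attribute values, and target set X below are made up for illustration:

```python
from collections import defaultdict

# Toy data (assumed for illustration): object id -> attribute-value tuple
objects = {1: ('a', 0), 2: ('a', 0), 3: ('b', 1), 4: ('b', 1), 5: ('c', 0)}
X = {1, 2, 3}  # the set we want to approximate

# Equivalence classes of IND(A): objects with identical attribute values
classes = defaultdict(set)
for obj, values in objects.items():
    classes[values].add(obj)

lower, upper = set(), set()
for c in classes.values():
    if c <= X:       # class fully inside X -> certainly in X
        lower |= c
    if c & X:        # class overlaps X -> possibly in X
        upper |= c

boundary = upper - lower                 # B-upper minus B-lower
accuracy = len(lower) / len(upper)       # |B-lower| / |B-upper|
```

Here objects 3 and 4 are indiscernible but straddle X, so both land in the boundary and the accuracy is 2/4 = 0.5.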
generalized decision
for each equivalence class of IND(A), you list the class together with the set of its decision values
the system is not consistent if any class has several decision values
decision-relative discernibility matrix
build the table over the equivalence classes and write the attributes on which each pair differs; this is DECISION-RELATIVE, so if two classes have the same decision you write θ (nothing to discern)
Boolean discernibility function
take all the differences, write each matrix entry as a disjunction in parentheses, and conjoin them:
fA(attributes) = (a ∨ b) ∧ (c ∨ d) ∧ ...
then do the simplification
careful with the simplification
the result has to keep all the values against which you simplify, so it is good to keep a small table of the clauses
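One way to sketch the simplification is absorption over clauses represented as sets: a clause that contains another clause as a subset is redundant, since (a) ∧ (a ∨ b) = (a). The clauses below are made-up examples:

```python
# Each clause is a disjunction of attributes, e.g. frozenset({'a', 'b'}) = (a v b)
clauses = [frozenset({'a'}), frozenset({'a', 'b'}), frozenset({'b', 'c'})]

# Absorption law: drop any clause that strictly contains another clause
simplified = [c for c in clauses if not any(other < c for other in clauses)]
# (a) absorbs (a v b); (b v c) survives because no clause is a subset of it
```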
support?
Number of objects that fulfill the rule
you will be given a rule; the support is all the objects that satisfy it completely (conditions and decision)
accuracy is…
the fraction of correctly classified objects among those matching the rule conditions
support / all the objects that match the rule conditions but not necessarily the decision, i.e. everyone in the same equivalence class
coverage is…
support / # of objects that belong to the rule's decision class
strength…
support / all the objects in the universe
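The four rule statistics above can be computed from a toy decision table (the data and the rule are assumptions for illustration):

```python
# Toy decision table: (attribute value, decision) for each object
table = [('x', 'yes'), ('x', 'yes'), ('x', 'no'), ('y', 'no'), ('y', 'no')]

# Rule under consideration: attribute == 'x'  ->  decision == 'yes'
matches_lhs = [row for row in table if row[0] == 'x']          # same conditions
support = sum(1 for row in matches_lhs if row[1] == 'yes')     # conditions + decision
accuracy = support / len(matches_lhs)                          # within the equivalence class
decision_class = sum(1 for row in table if row[1] == 'yes')    # size of the decision class
coverage = support / decision_class
strength = support / len(table)                                # relative to the universe
```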
what is the point of using Boolean reasoning in rough sets?
Boolean reasoning is used to obtain the reducts
Why do we need to do discretization?
Because in rough sets we use Boolean reasoning, and for that we need discrete data.
supervised learning
data with decision classes /labels
classification problems
case-control studies
algorithms: decision trees or rule-based learning
unsupervised
unknown decision classes
looking for patterns in data
e.g. hierarchical clustering
performance or interpretability?
performance for applications involving life (high-stakes decisions)
interpretability for data analysis and for understanding complex models
interpretable ML techniques aim at giving legible explanations for predictions
the permutation cut-off value needs to be set to at least…
20 to have significant results (with 20 permutations, p-values down to 0.05 become attainable)
when is undersampling necessary?
when the distribution of classes is unequal
e.g. 20 controls and 5 patients
What is the classification accuracy?
what is the expected value?
Accuracy is the strength of our model. We want accuracy above the expected value of 0.5 (for two balanced classes) because that indicates the model is correct more often than random chance.
Accuracy = (TP + TN) / (TP + TN + FP + FN): the number of correct predictions divided by the total.
can we trust the AUC with low sample sizes?
It is questionable whether we should trust the model performance with such low sample sizes.
when is it appropriate to do k-fold cross-validation?
when we don't have an external test set to evaluate our model on
sensitivity
TP/(TP+FN)
TRUE POSITIVE RATE
specificity
TN/(TN+FP)
TRUE NEGATIVE RATE
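The confusion-matrix formulas above are easy to sanity-check; the counts below are made up:

```python
# Hypothetical confusion-matrix counts
TP, TN, FP, FN = 40, 30, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of correct predictions
sensitivity = TP / (TP + FN)                 # true positive rate
specificity = TN / (TN + FP)                 # true negative rate
```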
AUC MEANING
it is the area under the ROC curve, obtained by varying the threshold (cut-off) and plotting the true positive rate against the false positive rate
How to improve a rule-based model (improving the accuracy and AUC)
- increase the number of permutations in MCFS
- increase the number of objects in both classes
- change the reducer, e.g. Johnson to genetic
- decrease or increase the number of features
- detect and remove objects that are wrongly classified
Explain how you interpret a VisuNet graph. What do the following parameters mean?
- node size
- lines between nodes
- border size
node size tells you the decision and the coverage (support).
Intensity of node color tells us the relative importance of the feature.
Lines between the nodes tell you how strongly connected the nodes are. Red, thick lines indicate stronger connections.
Border size tells you how many times that feature is included in a rule.
describe a strategy for constructing decision trees.
use a TOP-DOWN approach and construct the tree RECURSIVELY, one split at a time. For each split, the attribute with the highest gain ratio is chosen. The tree is finished when there is no possible split that reduces the information value further.
* Top-down
* recursive
* one at a time
Explain decision trees
The root of the tree represents the entire dataset, and the first split from the root is the most important because it divides the largest number of objects. Further down in the tree, other nodes represent smaller splits and divergences in the data. The leaves represent the final classifications, and we can follow the branches from root to leaf to get a sort of "rule" for the classification.
What is feature selection and when should we use it?
Identify an ordered list of attributes that best discriminates between/among decision classes.
Good for identifying the most important features for classification in very large sets of features.
External feature selection, with for example MCFS, is necessary if the dataset has more features than objects in the universe; it is done to reduce the dimensionality to the features that most affect the classification.
MCFS MAIN STEPS
- create s SUBSETS of m attributes chosen at random from the original d attributes
- s is chosen so that the difference in ranking between iterations is small and stable
- divide each subset's data into training and test sets t times
- for each training set, build a tree classifier
- evaluate it on the test set
- calculate the relative importance of each attribute
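The loop structure of these steps can be sketched as follows; the function name, parameters, and the scoring stub are assumptions, not the real MCFS implementation:

```python
import random

def mcfs_sketch(data, attributes, s, m, t, train_and_score):
    """Hypothetical sketch of the MCFS loop: s random attribute subsets of
    size m, each evaluated on t train/test splits; attributes accumulate
    credit from the classifier's test performance."""
    importance = {a: 0.0 for a in attributes}
    for _ in range(s):                              # s random subsets
        subset = random.sample(attributes, m)
        for _ in range(t):                          # t train/test splits
            score = train_and_score(data, subset)   # tree classifier + evaluation
            for a in subset:
                importance[a] += score              # relative importance credit
    return importance
```

The `train_and_score` callback stands in for the "build a tree classifier and evaluate on the test set" steps.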
DECISION TREE STEPS
- compute the info of the whole dataset
- split by each attribute and compute the info of each of its values over the decision classes, e.g. info([2, 3])
- compute the weighted info of all the value groups of, for example, gene 1: if a group has class counts [2, 3], it contributes 5 / (total objects) × its info, summed with the info of the other groups
- gain = original info − weighted info
- split_info = the info of the group sizes, i.e. info over the sums of the class counts of each group
- gain_ratio = gain / split_info
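The steps above, worked through in code with a made-up attribute that splits five objects into groups with class counts [2, 0] and [1, 2]:

```python
from math import log2

def info(counts):
    """Entropy of a class-count distribution, e.g. info([2, 3])."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

whole = info([3, 2])                        # info of the whole dataset
groups = [[2, 0], [1, 2]]                   # class counts within each value group
sizes = [sum(g) for g in groups]            # group sizes: [2, 3]
total = sum(sizes)

weighted = sum(sz / total * info(g) for sz, g in zip(sizes, groups))
gain = whole - weighted                     # original info - weighted info
split_info = info(sizes)                    # info of the [2, 3] split itself
gain_ratio = gain / split_info
```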
steps for creating a rule-based model
- put aside an external validation set of subject samples
- data preprocessing: remove incomplete data
- feature selection: perform a feature selection to select the most important features and reduce noise
advantages and disadvantages of MCFS
Advantages:
preservation of the features
ranking of the features and their statistical significance
little feature shadowing
Disadvantages:
not possible to explain variability in the data
computationally expensive
odd distribution of objects (e.g. 20 cases and 2 controls) requires undersampling
discretization: when should it be performed?
before the split into test and training sets
genetic algorithm
uses evolution-inspired search and keeps searching for better solutions
Boolean expression law used in simplification
(A ∨ B) ∧ A = A
technique used in MCFS to find a cutoff
permutation test
resources to interpret gene expression levels
Ensembl, Gene Ontology, KEGG