Classification: Basic Concepts Flashcards
What is classification?
A model, or classifier, is constructed to predict class (categorical) labels
What is a numeric prediction?
The constructed model predicts a continuous-valued function, or ordered value, as opposed to a class label
What is regression analysis?
statistical methodology that is most often used for numeric prediction
What are the two major types of prediction problems?
Classification and numeric prediction
What are the two steps in data classification?
learning step and classification step
What is the learning step?
The training phase, where a classification algorithm builds the classifier by analyzing, or learning from, a training set and its associated class labels.
What is the classification step?
The model is used to predict class labels for given data
What is the accuracy of a classifier?
On a given test set, it is the percentage of test set tuples that are correctly classified by the classifier.
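A minimal sketch of the accuracy definition above, in Python. The label lists are made-up illustration data, and `accuracy` is a hypothetical helper name, not part of any library.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of test tuples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("test set must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical test set of five tuples; four of them are classified correctly.
y_true = ["yes", "no", "yes", "yes", "no"]
y_pred = ["yes", "no", "no", "yes", "no"]

acc = accuracy(y_true, y_pred)
print(f"accuracy = {acc:.0%}, error rate = {1 - acc:.0%}")
```

The error rate is the complement of accuracy: the percentage of tuples that are misclassified.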
What is decision tree induction?
learning of decision trees from class-labeled training tuples
What is a decision tree?
flowchart-like tree structure
What does each internal node (nonleaf node) denote?
a test on an attribute
What does each branch in a decision tree represent?
outcome of the test
What does each leaf node represent?
terminal node holds the class label
How are decision trees used for classification?
Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
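The path tracing described above can be sketched in Python as follows. The nested-dictionary tree layout, the attribute names, and the `classify` helper are all assumptions made for illustration; real libraries store trees differently.

```python
# Internal nodes test one attribute, each branch is one outcome of that test,
# and leaves hold a class label. The tree below is hypothetical.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"no": {"label": "no"}, "yes": {"label": "yes"}},
        },
        "middle_aged": {"label": "yes"},
        "senior": {
            "attribute": "credit_rating",
            "branches": {"fair": {"label": "yes"}, "excellent": {"label": "no"}},
        },
    },
}

def classify(node, x):
    """Trace a path from the root to a leaf and return the leaf's class label."""
    while "label" not in node:
        outcome = x[node["attribute"]]      # test the tuple's value for this attribute
        node = node["branches"][outcome]    # follow the branch for that outcome
    return node["label"]

# A tuple X whose class label is unknown.
X = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify(tree, X))
```

For this tuple the path is root (age = youth), then student = yes, ending at a leaf labeled "yes".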
Why do we use attribute selection measures?
used to select the attribute that best partitions the tuples into distinct classes.
What is an attribute selection method?
heuristic procedure for selecting the attribute that best discriminates the given tuples according to class
What are two examples of attribute selection measures?
information gain and Gini index
Does the Gini index enforce binary or nonbinary trees?
binary
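As a rough illustration of how the Gini index evaluates a binary split, here is a Python sketch. The helper names (`gini`, `gini_binary_split`) and the toy `income` data are assumptions for the example; the impurity formula is the usual one, 1 minus the sum of squared class proportions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_binary_split(values, labels, subset):
    """Weighted Gini index of the binary split 'value in subset' versus the rest."""
    left = [y for v, y in zip(values, labels) if v in subset]
    right = [y for v, y in zip(values, labels) if v not in subset]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical training tuples: one attribute (income) and a class label.
income = ["low", "medium", "high", "medium", "low", "high"]
label = ["no", "yes", "yes", "yes", "no", "no"]

before = gini(label)
after = gini_binary_split(income, label, {"low"})
print(f"Gini before split: {before:.3f}, after split on {{low}}: {after:.3f}")
```

A multivalued attribute such as `income` is handled by trying subsets of its values, so every candidate split still has exactly two partitions.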
What is another name for attribute selection measures?
splitting rules
What is the bias in information gain?
bias toward multivalued attributes (attributes with many distinct values)
What is the bias in gain ratio?
prefers unbalanced splits in which one partition is much smaller than the others
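The bias of information gain can be seen in a small Python sketch. It assumes the standard definitions: information gain is the entropy of the class labels minus the weighted entropy after the split, and gain ratio divides that gain by the split information (the entropy of the partition sizes). The toy data and function names are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of the value distribution in items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def gain_ratio(values, labels):
    """Information gain normalized by split information."""
    split_info = entropy(values)  # entropy of the partition sizes
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
tuple_id = list(range(8))                            # unique value per tuple
useful = ["a", "a", "b", "b", "a", "b", "a", "b"]    # two values, separates the classes

for name, attr in [("tuple_id", tuple_id), ("useful", useful)]:
    print(f"{name}: gain = {information_gain(attr, labels):.3f}, "
          f"gain ratio = {gain_ratio(attr, labels):.3f}")
```

Both attributes reach the maximum information gain of 1 bit, even though an ID-like attribute is useless for prediction. Gain ratio penalizes the eight-way split (split information of 3 bits), so the two-valued attribute scores higher.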
What is the Gini index biased toward?
biased toward multivalued attributes; it also favors tests that result in equal-sized partitions and purity in both partitions
When does the Gini index have difficulty?
when the number of classes is large
Which attribute selection measure has the least bias toward multivalued attributes?
Measures based on the minimum description length (MDL) principle
What does MDL stand for? What does it do?
Minimum description length. It uses encoding techniques to define the best decision tree as the one that requires the fewest bits to both encode the tree and encode the exceptions to the tree. Basically, the simplest solution is the best.
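What is a multivariate split?
A split in which the partitioning of tuples is based on a combination of attributes rather than on a single attribute. It is a form of attribute (feature) construction, where new attributes are built based on existing ones.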
What does tree pruning do?
addresses the problem of overfitting the data in decision trees
What types of measures does tree pruning use?
statistical measures to remove the least-reliable branches
What are the two most common tree pruning approaches?
prepruning and postpruning
How does prepruning work?
The tree is pruned by halting its construction early. Upon halting, the node becomes a leaf, which may hold the most frequent class among the subset tuples or the probability distribution of those tuples.
How does postpruning work?
It removes subtrees from a fully grown tree. A subtree is pruned by removing its branches and replacing it with a leaf, which is labeled with the most frequent class among the subtree being replaced.
What is the cost complexity pruning algorithm?
A pruning algorithm that considers the cost complexity of a tree to be a function of the number of leaves in the tree and the error rate of the tree
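One common way to combine the two quantities is error rate plus a penalty per leaf. The Python sketch below assumes that additive form; the penalty weight `alpha`, the function names, and the numbers are illustrative assumptions, not something stated in these notes.

```python
def cost_complexity(error_rate, n_leaves, alpha):
    """Cost complexity as error rate plus a per-leaf penalty (assumed additive form)."""
    return error_rate + alpha * n_leaves

def should_prune(subtree_error, subtree_leaves, leaf_error, alpha):
    """Prune when replacing the subtree with a single leaf gives a smaller cost complexity."""
    kept = cost_complexity(subtree_error, subtree_leaves, alpha)
    pruned = cost_complexity(leaf_error, 1, alpha)
    return pruned < kept

# Hypothetical subtree: 4 leaves with a 10% error rate.
# Replacing it with one leaf would raise the error rate to 12%.
print(should_prune(0.10, 4, 0.12, alpha=0.01))   # larger penalty per leaf
print(should_prune(0.10, 4, 0.12, alpha=0.001))  # smaller penalty per leaf
```

With the larger penalty the simpler tree wins despite its slightly higher error rate; with the smaller penalty the subtree is kept.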
What is the error rate?
percentage of tuples misclassified