Chapter 8 Flashcards
Supervised learning (classification)
the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Classification
predicts categorical class labels (discrete or nominal); constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
Numeric prediction
models continuous-valued functions, i.e., predicts unknown or missing values. Typical applications: credit/loan approval, medical diagnosis, fraud detection, web page categorization
Learning step (model construction)
describing a set of predetermined classes
each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
the set of tuples used for model construction is the training set
Classification step (model usage)
for classifying future or unknown objects
Estimate accuracy
the known label of each test sample is compared with the classified result from the model
accuracy rate
is the percentage of test set samples that are correctly classified by the model
test set
independent of the training set (otherwise overfitting occurs)
classify new data
if the accuracy is acceptable
validation (test) set
if the test set is used to select models
decision tree induction
the learning of decision trees from class-labeled training tuples
internal node
denotes a test on an attribute
branch
represents an outcome of the test
leaf node or terminal node
holds a class label
root node
the topmost node in a tree
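Putting the four parts together, a small tree in the style of the chapter's buys_computer example (root tests an attribute, branches carry outcomes, leaves hold class labels; the exact values here are illustrative):

    age?
    ├─ youth → student?
    │            ├─ no  → buys_computer = no
    │            └─ yes → buys_computer = yes
    ├─ middle_aged → buys_computer = yes
    └─ senior → credit_rating?
                 ├─ excellent → buys_computer = no
                 └─ fair      → buys_computer = yes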
D (a data partition)
initially, it is the complete set of training tuples and their associated class labels
attribute_list
the list of attributes describing the tuples
attribute_selection_method
a heuristic procedure for selecting the attribute that best discriminates the given tuples according to class
Gini index
forces the resulting tree to be binary
information gain
allows multiway splits
determine splitting criterion
ideally, the resulting partitions at each branch are as pure as possible
pure partition
when all the tuples in it belong to the same class
discrete-valued partitioning
a partition or branch is created for each known value of the attribute
continuous-valued partitioning
there is a split point resulting in two partitions: tuples at or below the split point, and tuples above it
discrete-valued and a binary tree must be produced
a yes/no question of the form "does the attribute's value belong to a given subset of values?" is asked, and the outcome determines the partition or branch
attribute selection measures
also known as splitting rules
The splitting attribute
the attribute having the best score for the measure (maximized or minimized, depending on the measure) is chosen as the splitting attribute for the given tuples
entropy
is the state of disorder, confusion, surprise, uncertainty, and disorganization (second law of thermodynamics: entropy increases over time)
The more heterogeneous the event, the more uncertainty
the less heterogeneous (more homogeneous) the event, the less uncertainty
x-axis
the probability of the event
y-axis
is the heterogeneity or impurity, denoted by H(X), the entropy (impurity measure)
uncertainty changes depending on the likelihood of an event
Pr(x=0) no uncertainty, Pr(x=0.5) maximum uncertainty, Pr(x=1) no uncertainty
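(For reference, the curve these cards describe is the binary entropy function: H(X) = -p·log2(p) - (1-p)·log2(1-p), which is 0 at p = 0 and p = 1 and peaks at 1 bit when p = 0.5.)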
higher entropy
higher uncertainty
lower entropy
lower uncertainty
Info(D)
is the average amount of information needed to identify the class label of a tuple in D
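(Standard formula, added for reference: Info(D) = -Σ_{i=1..m} p_i·log2(p_i), where p_i is the probability that a tuple in D belongs to class C_i.)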
High information gain
minimizes the information needed to classify tuples in the resulting partitions, and reflects the least randomness or impurity in the partitions
information gain
defined as the difference between the original information requirement (based on just the proportion of classes) and the new requirement (obtained after partitioning on A)
Gain(A)
tells us how much would be gained by branching on A; it is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain is chosen as the splitting attribute at node N
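For reference, the two formulas behind this card: Info_A(D) = Σ_{j=1..v} (|D_j|/|D|)·Info(D_j), the expected information still required after partitioning D into v partitions on A, and Gain(A) = Info(D) - Info_A(D). A minimal Python sketch of both, assuming discrete attribute values (the helper names info and gain are illustrative, not from the chapter):

    from collections import Counter
    from math import log2

    def info(labels):
        # Info(D): expected bits needed to identify the class of a tuple in D.
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def gain(values, labels):
        # Gain(A) = Info(D) - Info_A(D); `values` holds attribute A for each tuple.
        total = len(labels)
        parts = {}
        for v, y in zip(values, labels):
            parts.setdefault(v, []).append(y)
        info_a = sum(len(p) / total * info(p) for p in parts.values())
        return info(labels) - info_a

    # Toy usage: a perfectly discriminating attribute gains the full 1 bit.
    print(gain(["youth", "youth", "senior", "senior"], ["no", "no", "yes", "yes"]))  # 1.0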
gain ratio for attribute selection (C4.5)
the information gain measure is biased toward attributes with a large number of values. C4.5 uses gain ratio to overcome the problem (a normalization of information gain)
maximum gain ratio
the attribute with the maximum gain ratio is selected as the splitting attribute
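(For reference, the normalization C4.5 applies: SplitInfo_A(D) = -Σ_{j=1..v} (|D_j|/|D|)·log2(|D_j|/|D|), and GainRatio(A) = Gain(A) / SplitInfo_A(D).)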
Gini index
measures the impurity of D; if a data set D contains examples from n classes, the Gini index, Gini(D), is defined from the class proportions (see the formulas below)
The Gini index considers a binary split for each attribute
smallest Gini(D)
the attribute giving the largest reduction in impurity is chosen to split the node
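The formulas referenced above, for completeness: Gini(D) = 1 - Σ_{i=1..n} p_i², and for a binary split of D into D1 and D2 on attribute A, Gini_A(D) = (|D1|/|D|)·Gini(D1) + (|D2|/|D|)·Gini(D2); the reduction in impurity is Δgini(A) = Gini(D) - Gini_A(D). A minimal Python sketch (the names gini and gini_split are illustrative):

    from collections import Counter

    def gini(labels):
        # Gini(D) = 1 - sum(p_i^2): impurity of a class-labeled partition.
        total = len(labels)
        return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

    def gini_split(d1, d2):
        # Gini_A(D) for a binary split of D into partitions d1 and d2.
        total = len(d1) + len(d2)
        return len(d1) / total * gini(d1) + len(d2) / total * gini(d2)

    # Toy usage: a pure binary split drives the impurity to 0.
    print(gini(["yes", "no", "yes", "no"]))          # 0.5
    print(gini_split(["yes", "yes"], ["no", "no"]))  # 0.0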
information gain
biased toward multivalued attributes (attributes with a large number of values)
Gain ratio
tends to prefer unbalanced splits in which one partition is much smaller than the others
Gini index
biased toward multivalued attributes
has difficulty when the number of classes is large
tends to favor tests that result in equal-sized partitions and purity in both partitions
evaluation metrics
how can we measure accuracy? other metrics to consider?
positive tuples (samples)
tuples of the main class of interest
negative tuples
all other tuples
true positives
the positive tuples that were correctly labeled by the classifier
true negatives
the negative tuples that were correctly labeled by the classifier
false positives
the negative tuples that were incorrectly labeled as positive
false negatives
the positive tuples that were mislabeled as negative
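The four counts are conventionally arranged in a confusion matrix (standard layout, added for reference):

                       predicted positive   predicted negative
    actual positive    TP                   FN
    actual negative    FP                   TN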
resubstitution error
error rate on the training set instead of a test set
classifier accuracy or recognition rate
percentage of test set tuples that are correctly classified: (TP + TN)/All
error rate
misclassification rate: 1 - accuracy, or (FP + FN)/All
Class imbalance problem
one class may be rare (e.g., fraud or cancer): a significant majority of negative-class tuples and a minority of positive-class tuples
sensitivity
true positive recognition rate: TP/P
specificity
true negative recognition rate: TN/N
precision
exactness: what % of tuples that the classifier labeled as positive are actually positive
recall
completeness: what % of positive tuples the classifier labeled as positive; the perfect score is 1.0
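(For reference: precision = TP / (TP + FP); recall = TP / (TP + FN) = TP/P.)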
perfect precision score
1.0 for a class C means that every tuple the classifier labels as belonging to class C does indeed belong to class C. It does not tell us anything about the number of class C tuples that the classifier mislabeled
perfect recall score
1.0 for C means that every item from class C was labeled as such, but it does not tell us how many other tuples were incorrectly labeled as belonging to class C
F-measure
combines precision and recall in one formula: the harmonic mean of precision and recall, giving equal weight to both
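For reference: F = (2 × precision × recall) / (precision + recall). A minimal Python sketch tying the confusion-matrix counts to every measure above (the name metrics is illustrative):

    def metrics(tp, fp, tn, fn):
        # Derive the evaluation measures from the four confusion-matrix counts.
        p, n = tp + fn, tn + fp          # actual positives / actual negatives
        precision = tp / (tp + fp)
        recall = tp / p                  # recall = sensitivity
        return {
            "accuracy": (tp + tn) / (p + n),
            "error_rate": (fp + fn) / (p + n),
            "sensitivity": recall,
            "specificity": tn / n,
            "precision": precision,
            "f_measure": 2 * precision * recall / (precision + recall),
        }

    # Toy usage on a hypothetical test set of 100 tuples:
    print(metrics(tp=30, fp=10, tn=50, fn=10))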
accuracy
classifier accuracy: how well the model predicts the class label
speed
time to construct the model (training time)
time to use the model (classification or prediction time)
robustness
handling noise and missing values
scalability
efficiency in disk-resident databases
interpretability
understanding and insight provided by the model