Chapter 8 Flashcards

1
Q

supervised learning (classification)

A
the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
new data is classified based on the training set
2
Q

Unsupervised learning (clustering)

A
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
3
Q

Classification

A
predicts categorical class labels (discrete or nominal)
classifies data based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data
4
Q

Numeric prediction

A
models continuous-valued functions, i.e., predicts unknown or missing values
typical applications: credit/loan approval, medical diagnosis, fraud detection, web page categorization
5
Q

Learning step (model construction)

A

describing a set of predetermined classes
each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
the set of tuples used for model construction is the training set

6
Q

Classification step (model usage)

A

for classifying future or unknown objects

7
Q

Estimate accuracy

A

the known label of each test sample is compared with the classification result from the model
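
As a concrete illustration of the learning step, the classification step, and the accuracy estimate together (a minimal sketch, assuming scikit-learn and its bundled iris data; none of this code comes from the cards):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out a test set that is independent of the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Learning step: construct the model from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Estimate accuracy: compare the known test labels with the
# classification result from the model.
print(accuracy_score(y_test, model.predict(X_test)))
```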

8
Q

accuracy rate

A

is the percentage of test set samples that are correctly classified by the model

9
Q

test set

A

independent of the training set (otherwise overfitting)

10
Q

classify new data

A

if the accuracy is acceptable

11
Q

validation test set

A

if the test set is used to select models

12
Q

decision tree induction

A

the learning of decision trees from class-labeled training tuples

13
Q

internal node

A

denotes a test on an attribute

14
Q

branch

A

represents an outcome of the test

15
Q

leaf node or terminal node

A

holds a class label

16
Q

root node

A

topmost node in a tree

17
Q

D (data partition)

A

initially, it is the complete set of training tuples and their associated class labels

18
Q

attribute_list

A

a list of attributes describing the tuples

19
Q

attribute_selection_method

A

a heuristic procedure for selecting the attribute that best discriminates the given tuples according to class
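
These three parameters are the inputs of the basic induction procedure; below is a condensed, illustrative sketch of it, assuming tuples are represented as attribute dicts and the selection measure is passed in as a function (the names mirror the cards, but the code is not the textbook's):

```python
from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """D is a data partition: a list of (attribute-dict, class-label) pairs.
    Returns a nested dict; each leaf holds a class label."""
    labels = [c for _, c in D]
    # Pure partition: all tuples belong to the same class -> leaf node.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test -> leaf labeled with the majority class.
    if not attribute_list:
        return Counter(labels).most_common(1)[0][1]
    # Heuristic step: pick the attribute that best discriminates by class.
    best = attribute_selection_method(D, attribute_list)
    remaining = [a for a in attribute_list if a != best]
    tree = {best: {}}
    # One branch per outcome (known value) of the test on `best`.
    for v in {t[best] for t, _ in D}:
        Dv = [(t, c) for t, c in D if t[best] == v]
        tree[best][v] = generate_decision_tree(Dv, remaining,
                                               attribute_selection_method)
    return tree
```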

20
Q

gini index

A

enforces the resulting tree to be binary

21
Q

information gain

A

allows multiway splits

22
Q

determine splitting criterion

A

ideally, the resulting partitions at each branch are as pure as possible

23
Q

pure partition

A

when all the tuples in it belong to the same class

24
Q

discrete-valued partitioning

A

a separate branch/partition is created for each known value of the attribute

25
Q

continuous-valued partition

A
there is a split point resulting in two partitions: the tuples at or below the split point, and the tuples above it
26
Q

discrete-valued and a binary tree must be produced

A
a yes/no test of the form "does the attribute value belong to a given subset?" is asked, and each outcome forms one of the two partitions/branches
27
Q

attribute selection measures

A
splitting rules
28
Q

The splitting attribute

A
the attribute having the best score for the measure (maximized or minimized, depending on the measure) is chosen for the given tuples
29
Q

entropy

A
the state of disorder, confusion, surprise, uncertainty, and disorganization (by the second law of thermodynamics, entropy increases over time)
30
Q

The more heterogeneous the event, the more the uncertainty

A
the less heterogeneous (more homogeneous) the event, the less the uncertainty
31
Q

x-axis

A
the probability of the event
32
Q

y-axis

A
the heterogeneity or impurity, denoted by H(X), the entropy (impurity measure)
33
Q

uncertainty changes depending on the likelihood of an event

A
Pr(X) = 0: no uncertainty; Pr(X) = 0.5: maximum uncertainty; Pr(X) = 1: no uncertainty
34
Q

higher entropy

A
higher uncertainty
35
Q

lower entropy

A
lower uncertainty
36
Q

info(D)

A
just the average amount of information needed to identify the class label of a tuple in D
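
The formula behind this card is Info(D) = −Σᵢ pᵢ log₂(pᵢ). A minimal sketch (not from the cards), showing that a pure partition scores 0 and a 50/50 partition scores 1:

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes in D, written
    here as p_i * log2(1/p_i) to keep the sign positive."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

print(info(["yes"] * 6))               # 0.0 -> pure partition, no uncertainty
print(info(["yes"] * 3 + ["no"] * 3))  # 1.0 -> 50/50 split, maximum uncertainty
```
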
37
Q

high information gain

A
minimizes the information needed to classify the tuples in the resulting partitions, and reflects the least randomness or impurity in those partitions
38
Q

information gained

A
defined as the difference between the original information requirement (based on just the proportion of classes) and the new requirement (obtained after partitioning on A)
39
Q

Gain(A)

A
tells us how much would be gained by branching on A: the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain is chosen as the splitting attribute at node N
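
In symbols, Gain(A) = Info(D) − Info_A(D), where Info_A(D) is the size-weighted information of the partitions induced by A. A sketch building on the info() helper above (the toy data is hypothetical):

```python
def info_A(D, A):
    """Info_A(D) = sum(|Dv|/|D| * Info(Dv)) over the values v of A:
    the information still needed after partitioning D on A."""
    n = len(D)
    total = 0.0
    for v in {t[A] for t, _ in D}:
        Dv = [c for t, c in D if t[A] == v]
        total += (len(Dv) / n) * info(Dv)
    return total

def gain(D, A):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([c for _, c in D]) - info_A(D, A)

# D is a list of (attribute-dict, class-label) pairs; toy example.
D = [({"student": "yes"}, "buys"), ({"student": "yes"}, "buys"),
     ({"student": "no"}, "buys"), ({"student": "no"}, "no")]
print(gain(D, "student"))  # ~0.311 bits gained by branching on "student"
```
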
40
Q

gain ratio for attribute selection (C4.5)

A
the information gain measure is biased towards attributes with a large number of values; C4.5 uses gain ratio to overcome the problem (a normalization of information gain)
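
The normalization in question is GainRatio(A) = Gain(A) / SplitInfo_A(D); sketched here on top of the helpers above:

```python
def split_info(D, A):
    """SplitInfo_A(D) = -sum(|Dv|/|D| * log2(|Dv|/|D|)): the potential
    information generated by the split itself; it grows with the number
    of values of A, which is what penalizes many-valued attributes."""
    n = len(D)
    sizes = Counter(t[A] for t, _ in D).values()
    return sum((s / n) * log2(n / s) for s in sizes)

def gain_ratio(D, A):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain(D, A) / split_info(D, A)
```
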
41
Q

maximum gain ratio

A
the attribute with the maximum gain ratio is selected as the splitting attribute
42
Q

gini index (measures the impurity of D)

A
if a data set D contains examples from n classes, the gini index gini(D) is defined as 1 − Σᵢ pᵢ², where pᵢ is the relative frequency of class i in D; the gini index considers a binary split for each attribute
43
Q

smallest gini(D)

A
the split with the smallest gini index gives the largest reduction in impurity and is chosen to split the node
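
As a sketch of the last two cards: gini(D) = 1 − Σᵢ pᵢ², and a binary split is scored by the size-weighted gini of its two partitions (reusing Counter from the earlier sketch):

```python
def gini(labels):
    """gini(D) = 1 - sum(p_i^2): impurity of partition D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(D1, D2):
    """gini_A(D) for a binary split of D into partitions D1 and D2."""
    n = len(D1) + len(D2)
    return (len(D1) / n) * gini(D1) + (len(D2) / n) * gini(D2)

# The candidate split with the smallest gini_split, i.e. the largest
# reduction in impurity gini(D1 + D2) - gini_split(D1, D2), is chosen.
```
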
44
Q

information gain

A
biased towards multivalued attributes (attributes with a large number of values)
45
Q

gain ratio

A
tends to prefer unbalanced splits in which one partition is much smaller than the others
46
Q

gini index

A
biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both
47
Q

evaluation metrics

A
how can we measure accuracy? what other metrics should we consider?
48
Q

positive tuples (samples)

A
tuples of the main class of interest
49
Q

negative tuples

A
all other tuples
50
Q

true positives

A
the positive tuples that were correctly labeled by the classifier
51
Q

true negatives

A
the negative tuples that were correctly labeled by the classifier
52
Q

false positives

A
the negative tuples that were incorrectly labeled as positive
53
Q

false negatives

A
the positive tuples that were mislabeled as negative
54
Q

resubstitution error

A
the error rate on the training set instead of a test set
55
Q

classifier accuracy or recognition rate

A
percentage of test set tuples that are correctly classified: (TP + TN) / All
56
Q

error rate

A
misclassification rate: 1 − accuracy = (FP + FN) / All
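
Both rates fall out directly from the four confusion-matrix counts; a quick sketch with hypothetical numbers for a rare positive class:

```python
# Hypothetical confusion-matrix counts (positive class is rare).
TP, TN, FP, FN = 90, 9560, 140, 210
ALL = TP + TN + FP + FN

accuracy = (TP + TN) / ALL    # recognition rate
error_rate = (FP + FN) / ALL  # misclassification rate = 1 - accuracy
print(accuracy, error_rate)   # 0.965 0.035
```
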
57
Q

class imbalance problem

A
one class may be rare, e.g. fraud or cancer: a significant majority of negative-class tuples and a minority of positive-class tuples
58
Q

sensitivity

A
true positive recognition rate: TP/P
59
Q

specificity

A
true negative recognition rate: TN/N
60
Q

precision

A
exactness: what % of the tuples that the classifier labeled as positive are actually positive
61
Q

recall

A
completeness: what % of the positive tuples the classifier labeled as positive; the perfect score is 1.0
62
Q

perfect precision score

A
1.0 for a class C means that every tuple the classifier labels as belonging to class C does indeed belong to class C; it tells us nothing about the number of class C tuples that the classifier mislabeled
63
Q

perfect recall score

A
1.0 for C means that every item from class C was labeled as such, but it does not tell us how many other tuples were incorrectly labeled as belonging to class C
64
Q

F-measure

A
combines precision and recall in one formula: the harmonic mean of precision and recall, giving equal weight to each
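
The metrics from the last few cards, computed on the same hypothetical counts as above; note how a 96.5% accuracy hides a sensitivity of only 30% on the rare positive class:

```python
TP, TN, FP, FN = 90, 9560, 140, 210
P, N = TP + FN, TN + FP    # actual positives, actual negatives

sensitivity = TP / P              # true positive recognition rate: 0.30
specificity = TN / N              # true negative recognition rate: ~0.986
precision = TP / (TP + FP)        # exactness: ~0.391
recall = TP / (TP + FN)           # completeness (= sensitivity): 0.30
f_measure = 2 * precision * recall / (precision + recall)  # ~0.34
```
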
65
Q

accuracy

A
classifier accuracy: how well the model predicts the class label
66
Q

speed

A
time to construct the model (training time); time to use the model (classification/prediction time)
67
Q

robustness

A
handling noise and missing values
68
Q

scalability

A
efficiency for disk-resident databases
69
Q

interpretability

A
understanding and insight provided by the model