Chapter 8 Flashcards

1
Q

Supervised learning (classification)

A
The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
New data is classified based on the training set.
2
Q

Unsupervised learning (clustering)

A
The class labels of the training data are unknown.
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
3
Q

Classification

A
Predicts categorical class labels (discrete or nominal).
Constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
4
Q

Numeric prediction

A
Models continuous-valued functions, i.e., predicts unknown or missing values.
Typical applications (of classification and numeric prediction):
credit/loan approval
medical diagnosis
fraud detection
web page categorization
5
Q

Learning step (model construction)

A

Describes a set of predetermined classes.
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is the training set.

6
Q

Classification step (model usage)

A

for classifying future or unknown objects

7
Q

Estimate accuracy

A

The known label of each test sample is compared with the classification result from the model.

8
Q

Accuracy rate

A

The percentage of test-set samples that are correctly classified by the model.

9
Q

test set

A

Independent of the training set (otherwise overfitting would occur).

10
Q

Classify new data

A

if the accuracy is acceptable

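The learning and classification steps above can be sketched end to end. The following is a minimal illustration, assuming scikit-learn and its bundled iris data (none of which comes from the chapter itself):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)

    # Hold out a test set, independent of the training set (otherwise overfitting).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Learning step: construct the model from the training set.
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Estimate accuracy: compare the known test labels with the model's predictions.
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"accuracy rate: {acc:.1%}")  # if acceptable, classify new data with model.predict()
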
11
Q

validation test set

A

If the test set is used to select models, it is called a validation (test) set.

12
Q

decision tree induction

A

The learning of decision trees from class-labeled training tuples.

13
Q

internal node

A

Denotes a test on an attribute.

14
Q

branch

A

represents an outcome of the test

15
Q

leaf node or terminal node

A

holds a class label

16
Q

root node

A

The topmost node in a tree.

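To make the internal node / branch / leaf / root terminology concrete, here is a small sketch (again assuming scikit-learn; the dataset and parameters are illustrative, not from the chapter) that induces a tree from class-labeled tuples and prints its structure:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)

    # The outermost test is the root node, each indented test is an internal
    # node, each test outcome is a branch, and "class:" lines are leaf nodes.
    print(export_text(tree, feature_names=list(iris.feature_names)))
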
17
Q

D (data partition)

A

Initially, it is the complete set of training tuples and their associated class labels.

18
Q

attribute_list

A

The list of attributes describing the tuples.

19
Q

attribute_selection_method

A

A heuristic procedure for selecting the attribute that best discriminates the given tuples according to class.

20
Q

Gini index

A

Forces the resulting tree to be binary.

21
Q

Information gain

A

allows multiway splits

22
Q

determine splitting criterion

A

Ideally, the resulting partitions at each branch are as pure as possible.

23
Q

pure partition

A

when all the tuples in it belong to the same class

24
Q

Discrete-valued partitioning

A

A branch (partition) is created for each known value of the attribute.

25
Q

Continuous-valued partitioning

A

There is a split point resulting in two partitions: tuples at or below the split point, and tuples above it.

26
Q

Discrete-valued attribute when a binary tree must be produced

A

A test of the form "A in S_A?" is asked, where S_A is a subset of the known values of A; the two outcomes (yes/no) form the two partitions (branches).

27
Q

attribute selection measures

A

Splitting rules: they determine how the tuples at a given node are to be split.

28
Q

The splitting attribute

A

The attribute having the best score for the measure (either maximized or minimized, depending on the measure) is chosen as the splitting attribute for the given tuples.

29
Q

entropy

A

The state of disorder, confusion, surprise, uncertainty, and disorganization (by the second law of thermodynamics, entropy increases over time).

30
Q

The more heterogeneous the event, the more the uncertainty

A

The less heterogeneous (more homogeneous) the event, the less the uncertainty.

31
Q

x-axis

A

the probability of the event

32
Q

y-axis

A

The heterogeneity or impurity, denoted by H(X): the entropy (impurity measure).

33
Q

Uncertainty changes depending on the likelihood of an event

A

Pr(x) = 0: no uncertainty; Pr(x) = 0.5: maximum uncertainty; Pr(x) = 1: no uncertainty.

34
Q

higher entropy

A

higher uncertainty

35
Q

lower entropy

A

lower uncertainty

36
Q

Info(D)

A

The average amount of information needed to identify the class label of a tuple in D: Info(D) = -sum over i of p_i * log2(p_i), where p_i is the probability that a tuple in D belongs to class C_i.
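
As a sketch of this definition (the helper name and example data are mine; the 9/5 class split matches the classic AllElectronics buys_computer table), Info(D) can be computed directly from the class labels:

    import math
    from collections import Counter

    def info(labels):
        """Info(D) = -sum(p_i * log2(p_i)): the expected number of bits
        needed to identify the class label of a tuple in D."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    print(round(info(["yes"] * 9 + ["no"] * 5), 3))  # 0.94 bits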

37
Q

High information gain

A

Minimizes the information needed to classify tuples in the resulting partitions and reflects the least randomness or impurity in the partitions.

38
Q

information gained

A

Defined as the difference between the original information requirement (based on just the proportion of classes) and the new requirement (obtained after partitioning on A).

39
Q

Gain(A)

A

Tells us how much would be gained by branching on A: the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain is chosen as the splitting attribute at node N.
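
Continuing the sketch under card 36 (this reuses the info() helper defined there; the age/buys example data follow the AllElectronics table), Gain(A) can be computed as:

    def info_after_split(values, labels):
        """Info_A(D): the weighted average of Info over the partitions
        induced by attribute A (one partition per attribute value)."""
        n = len(labels)
        return sum(
            values.count(v) / n * info([l for x, l in zip(values, labels) if x == v])
            for v in set(values)
        )

    def gain(values, labels):
        """Gain(A) = Info(D) - Info_A(D): the expected reduction in the
        information requirement caused by knowing the value of A."""
        return info(labels) - info_after_split(values, labels)

    ages = ["youth"] * 5 + ["middle"] * 4 + ["senior"] * 5
    buys = ["no", "no", "yes", "yes", "no"] + ["yes"] * 4 + ["yes", "yes", "no", "yes", "no"]
    print(round(gain(ages, buys), 3))  # Gain(age) is about 0.247 bits here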

40
Q

Gain ratio for attribute selection (C4.5)

A

The information gain measure is biased toward attributes with a large number of values. C4.5 uses the gain ratio to overcome the problem (a normalization of information gain).

41
Q

maximum gain ratio

A

The attribute with the maximum gain ratio is selected as the splitting attribute.
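
A short continuation of the same sketch (reusing math, Counter, gain(), ages, and buys from the snippets above; the function names are mine):

    def split_info(values):
        """SplitInfo_A(D) = -sum(|D_j|/|D| * log2(|D_j|/|D|)) over the
        partitions produced by splitting on A; it normalizes Gain(A)."""
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

    def gain_ratio(values, labels):
        """GainRatio(A) = Gain(A) / SplitInfo_A(D); the attribute with the
        maximum gain ratio is selected as the splitting attribute."""
        si = split_info(values)
        return gain(values, labels) / si if si else 0.0

    print(round(gain_ratio(ages, buys), 3))  # penalizes many-valued attributes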

42
Q

Gini index measures the impurity of D

A

If a data set D contains examples from n classes, the Gini index Gini(D) is defined as Gini(D) = 1 - sum over i of p_i^2, where p_i is the probability that a tuple in D belongs to class i.
The Gini index considers a binary split for each attribute.

43
Q

Smallest gini(D)

A

The attribute giving the largest reduction in impurity (equivalently, the smallest Gini index after splitting) is chosen to split the node.
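
The Gini index in the same hedged style (self-contained; the function names are mine):

    from collections import Counter

    def gini(labels):
        """Gini(D) = 1 - sum(p_i^2) over the n classes in D."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(values, labels, subset):
        """Gini_A(D) for the binary split "value in subset?" vs. "not in
        subset"; the Gini index considers a binary split for each attribute."""
        n = len(labels)
        inside = [l for v, l in zip(values, labels) if v in subset]
        outside = [l for v, l in zip(values, labels) if v not in subset]
        return len(inside) / n * gini(inside) + len(outside) / n * gini(outside)

    # The reduction in impurity for a candidate split is Gini(D) - Gini_A(D).
    print(round(gini(["yes"] * 9 + ["no"] * 5), 3))  # 0.459 for a 9/5 split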

44
Q

information gain

A

Biased toward multivalued attributes (attributes with a large number of values).

45
Q

Gain ratio

A

Tends to prefer unbalanced splits in which one partition is much smaller than the others.

46
Q

Gini index

A

Biased toward multivalued attributes.
Has difficulty when the number of classes is large.
Tends to favor tests that result in equal-sized partitions and purity in both partitions.

47
Q

evaluation metrics

A

How can we measure accuracy? Are there other metrics to consider?

48
Q

Positive tuples (samples)

A

tuples of the main class of interest

49
Q

negative tuples

A

all other tuples

50
Q

True positives (TP)

A

The positive tuples that were correctly labeled by the classifier.

51
Q

True negatives (TN)

A

The negative tuples that were correctly labeled by the classifier.

52
Q

False positives (FP)

A

The negative tuples that were incorrectly labeled as positive.

53
Q

False negatives (FN)

A

The positive tuples that were mislabeled as negative.

54
Q

Resubstitution error

A

The error rate on the training set instead of a test set.

55
Q

classifier accuracy or recognition rate

A

The percentage of test-set tuples that are correctly classified by the model: accuracy = (TP + TN) / All.

56
Q

error rate

A

Misclassification rate: 1 - accuracy = (FP + FN) / All.

57
Q

Class imbalance problem

A

One class may be rare, e.g., fraud or cancer.
There is a significant majority of the negative class and a minority of the positive class.

58
Q

sensitivity

A

True positive recognition rate: TP / P.

59
Q

Specificity

A

True negative recognition rate: TN / N.

60
Q

precision

A

Exactness: what percentage of the tuples that the classifier labeled as positive are actually positive. Precision = TP / (TP + FP).

61
Q

recall

A

Completeness: what percentage of the positive tuples the classifier labeled as positive. Recall = TP / (TP + FN); the perfect score is 1.0.

62
Q

Perfect precision score

A

1.0 for a class C means that every tuple the classifier labels as belonging to class C does indeed belong to class C. It does not tell us anything about the number of class C tuples that the classifier mislabeled.

63
Q

perfect recall score

A

1.0 for C means that every tuple from class C was labeled as such, but it does not tell us how many other tuples were incorrectly labeled as belonging to class C.

64
Q

F-measure

A

Combines precision and recall in one formula: the harmonic mean, F = (2 * precision * recall) / (precision + recall), which gives equal weight to precision and recall.
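
The confusion-matrix metrics from the cards above in one sketch (plain arithmetic; assumes nonzero denominators; the counts are an illustrative imbalanced, cancer-screening style example):

    def evaluation_metrics(tp, tn, fp, fn):
        """Metrics computed from the confusion-matrix counts TP, TN, FP, FN."""
        total = tp + tn + fp + fn
        precision = tp / (tp + fp)  # exactness
        recall = tp / (tp + fn)     # completeness; also sensitivity, TP/P
        return {
            "accuracy": (tp + tn) / total,    # recognition rate
            "error rate": (fp + fn) / total,  # 1 - accuracy
            "sensitivity": recall,            # TP/P
            "specificity": tn / (tn + fp),    # TN/N
            "precision": precision,
            "recall": recall,
            "F-measure": 2 * precision * recall / (precision + recall),
        }

    print(evaluation_metrics(tp=90, tn=9560, fp=140, fn=210))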

65
Q

accuracy

A

Classifier accuracy in predicting the class label.

66
Q

speed

A

Time to construct the model (training time).

Time to use the model (classification or prediction time).

67
Q

robustness

A

handling noise and missing values

68
Q

scalability

A

Efficiency on disk-resident databases.

69
Q

interpretability

A

Understanding and insight provided by the model.