CSCI 343 Quiz 2 Flashcards

1
Q

K-nearest neighbor (KNN) is a ? learner

A

lazy

2
Q

a lazy learner has no…

A

no training, no model

3
Q

in a typical machine learning algorithm, ? takes more time; in a KNN algorithm, ? takes more time

A

typical: training takes time
KNN: testing/prediction takes time

4
Q

KNN uses a

A

similarity measure

5
Q

the parameter k for KNN is usually

A

odd & small

6
Q

K-nearest neighbor works by

A

finding the k closest samples and using a majority vote of their labels; a distance-weighted vote (or weighted average) can also be used

7
Q

the parameter k for KNN is defaulted to be

A

5

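To make the KNN cards above concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier (assuming scikit-learn is available); the toy data is made up. The default n_neighbors is 5, prediction uses a majority vote over the k nearest samples, and weights="distance" switches to a distance-weighted vote.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data: two features per sample, binary labels.
X_train = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
y_train = [0, 0, 0, 1, 1, 1]

# "Lazy learner": fit() essentially just stores/indexes the training data.
knn = KNeighborsClassifier()          # n_neighbors defaults to 5
knn.fit(X_train, y_train)

# Prediction is where the work happens: compute distances to every stored
# sample, take the 5 closest, and use a majority vote on their labels.
print(knn.predict([[2, 2], [6, 5]]))  # -> [0 1]

# Distance-weighted variant: closer neighbors count more toward the vote.
knn_w = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn_w.fit(X_train, y_train)
print(knn_w.predict([[2, 2]]))        # -> [0]
```
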
8
Q

Naive Bayes Classification can be used for (examples)

A

spam classification, medical diagnosis, weather prediction

9
Q

Naive Bayes: given ? predict ?

A

given features x1, x2, …, xn, predict label Y

10
Q

conditional probability

A

the probability that the event B will occur, given event A has already occurred; P(B|A)

11
Q

if A and B are independent, the conditional probability of B given A is

A

P(B)

12
Q

if A and B are NOT independent, the probability of both events occurring is

A

P(A,B) = P(A) * P(B|A)

13
Q

the conditional probability of B given A (NOT independent) is

A

P(B|A) = P(A,B) / P(A)

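A small worked check of the two formulas above, using made-up joint counts:

```python
# Made-up counts over events A and B, out of 100 trials.
n_total = 100
n_A = 40          # trials where A occurred
n_A_and_B = 10    # trials where both A and B occurred

P_A = n_A / n_total              # P(A)   = 0.4
P_A_and_B = n_A_and_B / n_total  # P(A,B) = 0.1

# Conditional probability: P(B|A) = P(A,B) / P(A)
P_B_given_A = P_A_and_B / P_A
print(P_B_given_A)               # 0.25

# Consistency check with the product rule: P(A,B) = P(A) * P(B|A)
print(P_A * P_B_given_A)         # 0.1
```
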
14
Q

Bayes Rule written out

A

P(Y | X1, …, Xn) = [P(X1, …, Xn | Y) * P(Y)] / P(X1, …, Xn)

15
Q

Bayes Rule in simpler terms

A

(likelihood * prior) / normalization constant

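A tiny numeric illustration of these pieces, using made-up numbers for a one-feature spam example (X = "message contains the word free", Y = spam or ham):

```python
# Made-up prior and likelihoods, for illustration only.
P_spam = 0.3                  # prior P(Y=spam)
P_ham = 0.7                   # prior P(Y=ham)
P_free_given_spam = 0.6       # likelihood P(X=free | Y=spam)
P_free_given_ham = 0.1        # likelihood P(X=free | Y=ham)

# Normalization constant P(X=free): sum over both values of Y.
P_free = P_free_given_spam * P_spam + P_free_given_ham * P_ham  # about 0.25

# Posterior = (likelihood * prior) / normalization constant
P_spam_given_free = P_free_given_spam * P_spam / P_free
print(P_spam_given_free)      # about 0.72
```
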
16
Q

likelihood in Bayes Rule

A

P(X1, …, Xn | Y): the probability of observing the features given that the class is Y

17
Q

prior in Bayes Rule

A

P(Y), the probability of the class before seeing any features; often assumed to be a uniform distribution

18
Q

how many parameters are required to specify the prior for the digit recognition example?

A

one – Y (so what number it is)

19
Q

how many parameters are required to specify the likelihood for the digit recognition example?

A

2(2^900 - 1): for each of the 2 values of Y, the joint distribution over the 900 binary (0 or 1) pixels X1, …, X900 has 2^900 - 1 free parameters

20
Q

the problem with explicitly modeling P(X1,…,Xn | Y) is

A

there are usually too many parameters (we’ll run out of space, time, and need tons of training data)

21
Q

Naive Bayes Assumption

A

assume that all features are independent given the class label Y

22
Q

Naive Bayes Model equation

A

P(X1,…,Xn | Y) = the product from i = 1 to n of P(Xi | Y)

23
Q

How to use the Naive Bayes Model:

A

assume conditional independence, compute each class's likelihood as the product of the individual P(Xi | Y) terms, and predict the class with the largest resulting posterior

24
Q

Why is the Naive Bayes Model useful?

A

reduces the number of parameters from exponential in the number of features to linear

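A quick arithmetic check of the parameter counts behind the digit-recognition cards above (2 classes, 900 binary pixels), showing the exponential-to-linear reduction:

```python
n_pixels = 900
n_classes = 2

# Explicitly modeling the joint P(X1,...,X900 | Y): for each class, a
# distribution over 2^900 pixel configurations has 2^900 - 1 free parameters.
full_joint = n_classes * (2 ** n_pixels - 1)

# Naive Bayes: one parameter P(Xi=1 | Y) per pixel per class.
naive_bayes = n_classes * n_pixels

print(len(str(full_joint)))   # 272 -- the full-joint count has 272 digits
print(naive_bayes)            # 1800 -- linear in the number of pixels
```
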
25
How to train in Naive Bayes
estimate the prior P(Y=v) as the fraction of records with Y=v & estimate P(Xi=u | Y=v) as the fraction of records with Y=v for which Xi=u
26
In practice, some counts can be zero. Fix this by adding "virtual" counts. This is called
smoothing
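A from-scratch sketch of the training recipe in the last two cards, for binary features, with add-one "virtual counts" as the smoothing; the records below are made up:

```python
from collections import Counter, defaultdict

# Made-up training records: binary feature vectors and their labels.
X = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0)]
y = ["spam", "spam", "ham", "ham"]

alpha = 1  # virtual count added to every cell (smoothing); avoids zero probabilities

# Prior: P(Y=v) = fraction of records with Y=v.
label_counts = Counter(y)
prior = {v: c / len(y) for v, c in label_counts.items()}

# Conditional: P(Xi=1 | Y=v) =
#   (records with Y=v and Xi=1 + alpha) / (records with Y=v + 2*alpha)
cond = defaultdict(dict)
for v in label_counts:
    rows = [x for x, label in zip(X, y) if label == v]
    for i in range(len(X[0])):
        ones = sum(row[i] for row in rows)
        cond[v][i] = (ones + alpha) / (len(rows) + 2 * alpha)

print(prior)         # {'spam': 0.5, 'ham': 0.5}
print(cond["spam"])  # e.g. P(X0=1 | spam) = (2+1)/(2+2) = 0.75
```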
27
What's nice about Naive Bayes is that it returns probabilities, which tell us ?
how confident the algorithm is
28
Naive Bayes assumption
all features are independent given the class label Y
29
does the Naive Bayes assumption hold for the digit recognition problem?
no, because if a pixel is black, the pixels next to it are more likely to be black as well
30
an example where conditional independence fails is
XOR (exclusive or -- either X1 or X2 but NOT both)
31
the Naive Bayes assumption is almost (never/always) true
never (but it still performs well)
32
numeric underflow
machine learning multiplies many very small probabilities; e.g., the probability of a specific sequence of 2000 independent fair coin flips is (0.5)^2000, which floating point would output as zero
33
to fix numeric underflow, instead of comparing P(Y=5 | X1, ..., Xn) with P(Y=6 | X1, ..., Xn), compare their ?
logarithms
34
it is better to ? of probabilities rather than multiplying probabilities
sum logs
35
log(xy) =
log(x) + log(y)
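A short demonstration of the underflow problem and the log fix described in the last few cards:

```python
import math

# Multiplying 2000 probabilities of 0.5 underflows to zero in floating point ...
print(0.5 ** 2000)           # 0.0

# ... but the equivalent sum of logs is perfectly representable.
print(2000 * math.log(0.5))  # about -1386.29

# So compare classes by their log-scores,
#   log P(Y=5) + sum_i log P(Xi | Y=5)   vs   log P(Y=6) + sum_i log P(Xi | Y=6),
# instead of by the underflowed products.
```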
36
can there be more than one decision tree that fits the same data?
yes
37
decision tree classification process
training set -> induction (tree induction algorithm learns the model) -> model (decision tree); then test set -> apply model (deduction) -> predicted labels
38
how to apply decision tree model to test data
start from the root of the tree and follow the branch matching the record's value of each splitting attribute until a leaf is reached
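A minimal sketch (not any particular library's API) of applying a tree model to a test record by walking from the root; the tree and record below are invented for illustration:

```python
# Internal nodes test one splitting attribute; leaves hold a class label.
# The attributes, values, and labels here are made up.
tree = {
    "attr": "CarType",
    "branches": {
        "Sports": {"label": "Yes"},
        "Luxury": {"label": "Yes"},
        "Family": {
            "attr": "Age",
            "branches": {"Young": {"label": "Yes"}, "Old": {"label": "No"}},
        },
    },
}

def classify(node, record):
    """Start at the root and follow the branch chosen by each splitting attribute."""
    while "label" not in node:  # stop once a leaf is reached
        node = node["branches"][record[node["attr"]]]
    return node["label"]

print(classify(tree, {"CarType": "Family", "Age": "Old"}))  # -> No
```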
39
Hunt's Algorithm
```
Dt = set of training records that reach node t
if Dt contains records that all belong to the same class:
    create a leaf node for that class
else if Dt is an empty set:
    create a leaf node for the default class
else:  # Dt contains records belonging to more than one class
    use an attribute test to split the data into smaller subsets
    recurse on each subset
```
40
greedy strategy for tree induction
split the records based on an attribute test that optimizes certain criterion
41
issues with tree induction
how to split records and how to know when to stop splitting
42
multi-way split
use as many partitions as distinct values | ex: CarType -> Family, Sports, Luxury
43
binary split
divide values into two subsets; need to find optimal partitioning | ex: CarType -> {Sports, Luxury} and {Family}
44
discretization forms
an ordinal categorical attribute
45
static discretization
discretize once at the beginning (ex: split at the median)
46
dynamic discretization
ranges can be found by equal interval bucketing, equal frequency bucketing, or clustering
47
equal interval bucketing
divide the value range into equal-width intervals (ex: a bucket every 10 points, or every $5,000)
48
equal frequency bucketing
take from percentiles so there's an equal number of records in each range (ex: each quartile)
49
clustering
look for natural gaps in the values and group nearby values together (ex: give scores that cluster together the same grade)
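A short illustration of the two bucketing strategies using pandas (assuming pandas is available): pd.cut gives equal-interval buckets, pd.qcut gives equal-frequency buckets. The values are made up.

```python
import pandas as pd

# Made-up continuous attribute (e.g., exam scores) to discretize.
scores = pd.Series([51, 55, 60, 72, 75, 80, 88, 95])

# Equal-interval bucketing: 4 bins of equal width across the value range.
print(pd.cut(scores, bins=4).value_counts().sort_index())

# Equal-frequency bucketing: 4 bins with (roughly) the same number of records,
# i.e., quartiles.
print(pd.qcut(scores, q=4).value_counts().sort_index())
```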
50
binary decision splitting
(A < v) or (A >= v) | consider all possible splits and find the best cut; can be computationally expensive
51
the best splits will result in
purer child nodes (ex: Class 0 has 1 record while Class 1 has 7 -- want the class counts to be far apart, with at least one count as close to zero as possible)
52
nodes with ? class distribution are preferred
homogeneous
53
ex: Class 0 - 5, Class 1 - 5
non-homogeneous, high degree of impurity
54
ex: Class 0 - 9, Class 1 - 1
homogeneous, low degree of impurity
55
measures of node impurity
Gini Index, Entropy, Misclassification Error
56
GINI(t) =
1 - the sum over classes j of p(j | t)^2, where p(j | t) is the relative frequency of class j at node t
57
you want to (minimize/maximize) the GINI
minimize
58
GINI's range is
0 to 0.5 (for a two-class problem)
59
GINI example: C1 - 2 C2 - 4
```
P(C1) = 2/6, P(C2) = 4/6
GINI = 1 - (2/6)^2 - (4/6)^2 = 0.444
```
60
GINI index for binary attributes
GINI(Node1) * (fraction of records that go to Node1) + GINI(Node2) * (fraction of records that go to Node2)
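A sketch that computes GINI(t) from class counts and the weighted GINI of a binary split; it reproduces the 2-vs-4 example above, and the split's child counts are made up:

```python
def gini(counts):
    """GINI(t) = 1 - sum over classes j of p(j|t)^2, from the class counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Example from the earlier card: C1 = 2, C2 = 4.
print(round(gini([2, 4]), 3))   # 0.444

# Weighted GINI of a binary split (made-up child counts):
# Node1 holds [5, 1] and Node2 holds [1, 5], so each gets 6 of the 12 records.
children = [[5, 1], [1, 5]]
n_total = sum(sum(c) for c in children)
split_gini = sum(sum(c) / n_total * gini(c) for c in children)
print(round(split_gini, 3))     # 0.278
```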
61
for efficient computation on where to split, for each attribute, ...
sort the records on the attribute's values, linearly scan these values, each time updating the count matrix and computing the Gini index, and **choose the split position that has the smallest Gini index**
62
Entropy based computations are similar to the ?
GINI index computations
63
Entropy's range is
0 to 1 (for a two-class problem, using log base 2)
64
want to (minimize/maximize) entropy
minimize
65
want to (minimize/maximize) GAIN
maximize
66
GAIN of a split is a function of entropy; to maximize it, we
choose the split that achieves the largest reduction in entropy (i.e., the highest information gain)
67
classification error at a node t Error(t) =
1 - max P(i | t)
68
want to (minimize/maximize) Error
minimize
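A sketch of entropy, GAIN, and classification error using the same count-vector convention as the GINI sketch above; all counts are made up:

```python
import math

def entropy(counts):
    """Entropy(t) = -sum over classes j of p(j|t) * log2 p(j|t); 0*log 0 is treated as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def error(counts):
    """Classification error(t) = 1 - max over classes i of P(i|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def gain(parent, children):
    """GAIN = Entropy(parent) - weighted average entropy of the child nodes."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

parent = [5, 5]                           # maximally impure two-class node
print(entropy(parent))                    # 1.0
print(error(parent))                      # 0.5
print(gain(parent, [[5, 1], [0, 4]]))     # about 0.61 -- pick the split with the largest gain
```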
69
when to stop splitting
when all records belong to the same class (or nearly all, e.g., at least 90%), when all records have similar attribute values, or by early termination based on tree depth ("level") or on how few records remain
70
advantages of decision tree based classification
inexpensive to construct, extremely fast classification, easy to interpret small trees, accuracy is comparable to other techniques for similar data sets