CSCI 343 Quiz 2 Flashcards

1
Q

K-nearest neighbor (KNN) is a ? learner

A

lazy

2
Q

a lazy learner has no…

A

no training, no model

3
Q

in a typical machine learning algorithm, ? takes more time; in a KNN algorithm, ? takes more time

A

typical: training takes time
KNN: testing/prediction takes time

4
Q

KNN uses a

A

similarity measure

5
Q

the parameter k for KNN is usually

A

odd & small

6
Q

K-nearest neighbor works by

A

finding the k closest training samples and using majority rule among their labels to pick the prediction; a distance-weighted vote or average can also be used
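
A minimal Python sketch of that procedure (the toy data and the helper name knn_predict are invented for illustration, not the course's reference code):

```python
# Minimal k-nearest-neighbor sketch: no training step, just store the data
# and vote among the k closest points at prediction time.
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Predict a label for `query` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every stored training sample
    dists = [(math.dist(query, x), y) for x, y in zip(train_X, train_y)]
    dists.sort(key=lambda pair: pair[0])           # closest first
    k_labels = [y for _, y in dists[:k]]           # labels of the k nearest samples
    return Counter(k_labels).most_common(1)[0][0]  # majority rule

# Toy data (made up): two 2-D classes
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, query=(2, 2), k=3))  # -> "A"
```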

7
Q

the parameter k for KNN is defaulted to be

A

5

8
Q

Naive Bayes Classification can be used for (examples)

A

spam classification, medical diagnosis, weather prediction

9
Q

Naive Bayes: given ? predict ?

A

given features x1, x2, …, xn, predict label Y

10
Q

conditional probability

A

the probability that the event B will occur, given event A has already occurred; P(B|A)

11
Q

if A and B are independent, the conditional probability of B given A is

A

P(B)

12
Q

if A and B are NOT independent, the probability of both events occurring is

A

P(A,B) = P(A) * P(B|A)

13
Q

the conditional probability of B given A (NOT independent) is

A

P(B|A) = P(A,B) / P(A)

14
Q

Bayes Rule written out

A

P(Y | X1, …, Xn) = [P(X1, …, Xn | Y) * P(Y)] / P(X1, …, Xn)
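
A numeric sketch of the rule with a single binary feature; the spam/ham numbers below are invented purely for illustration:

```python
# Bayes Rule with made-up numbers: P(Y | X) = P(X | Y) * P(Y) / P(X).
# Illustrative spam example with one binary feature X = "message contains a link".
prior = {"spam": 0.3, "ham": 0.7}       # P(Y), assumed values
likelihood = {"spam": 0.8, "ham": 0.1}  # P(X=1 | Y), assumed values

# Normalization constant P(X=1) = sum over Y of P(X=1 | Y) * P(Y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)  # {'spam': 0.774..., 'ham': 0.225...}
```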

15
Q

Bayes Rule in simpler terms

A

(likelihood * prior) / normalization constant

16
Q

likelihood in Bayes Rule

A

the probability of the observed features given the label, P(X1, …, Xn | Y)

17
Q

prior in Bayes Rule

A

P(Y), the probability of the label before seeing any features; often assumed to be a uniform distribution

18
Q

how many parameters are required to specify the prior for the digit recognition example?

A

just one: with two classes (e.g. digit 5 vs. digit 6), specifying P(Y=5) also fixes P(Y=6) = 1 - P(Y=5)

19
Q

how many parameters are required to specify the likelihood for the digit recognition example?

A

2(2^900 - 1): for each of the two values of Y, the 900 binary pixels X1, …, X900 have 2^900 possible joint configurations, and a distribution over them takes 2^900 - 1 free parameters to specify

20
Q

the problem with explicitly modeling P(X1,…,Xn | Y) is

A

there are usually too many parameters (we’ll run out of space, time, and need tons of training data)

21
Q

Naive Bayes Assumption

A

assume that all features are independent given the class label Y

22
Q

Naive Bayes Model equation

A

P(X1,…,Xn | Y) = the product from i = 1 to n of P(Xi | Y)

23
Q

How to use the Naive Bayes Model:

A

assume conditional independence, so each class's likelihood is just the product of the individual P(Xi | Y) terms; predict the class whose prior times that product is largest

24
Q

Why is the Naive Bayes Model useful?

A

reduces the number of parameters from exponential in the number of features to linear
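
For the binary-pixel digit example above, that is the difference between 2(2^900 - 1) likelihood parameters and just 2 × 900 = 1800 of them (one P(Xi = 1 | Y) per pixel per class), with the prior still needing only one parameter.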

25
Q

How to train in Naive Bayes

A

estimate the prior P(Y=v) as the fraction of records with Y=v & estimate P(Xi=u | Y=v) as the fraction of records with Y=v for which Xi=u

26
Q

In practice, some counts can be zero. Fix this by adding “virtual” counts. This is called

A

smoothing
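
A minimal sketch of the counting-based training from the previous card, with add-one "virtual count" (Laplace) smoothing folded in; the toy records and names are invented:

```python
# Train a Naive Bayes model by counting, with add-one ("virtual count") smoothing.
from collections import Counter, defaultdict

def train_naive_bayes(records, labels, smoothing=1):
    n = len(labels)
    label_counts = Counter(labels)
    # Prior: fraction of records with Y = v
    prior = {v: c / n for v, c in label_counts.items()}

    # Conditional counts: how often feature i takes value u among records with Y = v
    cond_counts = defaultdict(Counter)   # (i, v) -> Counter of feature values u
    feature_values = defaultdict(set)    # i -> set of values seen for feature i
    for x, y in zip(records, labels):
        for i, u in enumerate(x):
            cond_counts[(i, y)][u] += 1
            feature_values[i].add(u)

    def cond_prob(i, u, v):
        # P(Xi = u | Y = v) with `smoothing` virtual counts per possible value
        num = cond_counts[(i, v)][u] + smoothing
        den = label_counts[v] + smoothing * len(feature_values[i])
        return num / den

    return prior, cond_prob

# Toy data (made up): each record is (feature0, feature1)
records = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["spam", "spam", "ham", "ham"]
prior, cond_prob = train_naive_bayes(records, labels)
print(prior["spam"], cond_prob(0, 1, "spam"))  # 0.5 and (2+1)/(2+2) = 0.75
```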

27
Q

What’s nice about Naive Bayes is that it returns probabilities, which tell us ?

A

how confident the algorithm is

28
Q

Naive Bayes assumption

A

all features are independent given the class label Y

29
Q

does the Naive Bayes assumption hold for the digit recognition problem?

A

no because if a pixel is black, the ones next to it are more likely to be black as well

30
Q

an example where conditional independence fails is

A

XOR (exclusive or – either X1 or X2 but NOT both)

31
Q

the Naive Bayes assumption is almost (never/always) true

A

never (but it still performs well)

32
Q

numeric underflow

A

ML works with very small probabilities; a product such as the probability of 2000 independent fair coin flips, (0.5)^2000, is too small for floating point to represent and is computed as zero

33
Q

to fix numeric underflow, instead of comparing P(Y=5 | X1, …, Xn) with P(Y=6 | X1, …, Xn), compare their ?

A

logarithms

34
Q

it is better to ? of probabilities rather than multiplying probabilities

A

sum logs
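
A quick Python illustration of the underflow from card 32 and the log fix; the per-feature probabilities at the end are arbitrary:

```python
import math

# Multiplying many small probabilities underflows to exactly 0.0 ...
p = 0.5
print(p ** 2000)                     # 0.0  (numeric underflow)

# ... but summing logs stays representable.
log_prob = sum(math.log(p) for _ in range(2000))
print(log_prob)                      # about -1386.29

# To compare two classes, compare their log scores instead of the
# (underflowed) products; the larger log score wins, since log is monotonic.
probs_class_5 = [0.2, 0.9, 0.4]      # made-up per-feature probabilities
probs_class_6 = [0.3, 0.1, 0.5]
score_5 = sum(math.log(q) for q in probs_class_5)
score_6 = sum(math.log(q) for q in probs_class_6)
print("predict 5" if score_5 > score_6 else "predict 6")
```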

35
Q

log(xy) =

A

log(x) + log(y)

36
Q

can there be more than one decision tree that fits the same data?

A

yes

37
Q

decision tree classification process

A

induction: a tree induction algorithm learns a model (the decision tree) from the training set; deduction: the learned model is then applied to the test set to predict its labels

38
Q

how to apply decision tree model to test data

A

start from the root of the tree and move according to the splitting attributes

39
Q

Hunt’s Algorithm

A
Dt = set of training records that reach a node t
if Dt contains records that all belong to the same class, create a leaf node for that class
if Dt is an empty set, then create a leaf node for the default class
if Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
recurse
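
A minimal recursive sketch of the algorithm above; the splitting step is left abstract behind a hypothetical find_best_split callback (e.g., one that minimizes the Gini index), so treat this as an outline rather than a full induction algorithm:

```python
# Recursive outline of Hunt's algorithm. `records` is a list of (features, label)
# pairs; `find_best_split` is a caller-supplied function returning an attribute
# test and the resulting partitions of the records, or None if no test helps.
from collections import Counter

def hunts(records, default_class, find_best_split):
    # Empty set of records: leaf labeled with the default (e.g. parent majority) class
    if not records:
        return {"leaf": default_class}

    labels = [y for _, y in records]
    # All records belong to the same class: create a leaf node for that class
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}

    majority = Counter(labels).most_common(1)[0][0]
    split = find_best_split(records)
    if split is None:                      # no useful attribute test remains
        return {"leaf": majority}

    test, partitions = split               # test description, dict of outcome -> subset
    # More than one class: split into smaller subsets and recurse on each
    children = {outcome: hunts(subset, majority, find_best_split)
                for outcome, subset in partitions.items()}
    return {"test": test, "children": children}
```
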
40
Q

greedy strategy for tree induction

A

split the records based on an attribute test that optimizes a certain criterion

41
Q

issues with tree induction

A

how to split records and how to know when to stop splitting

42
Q

multi-way split

A

use as many partitions as distinct values

ex: CarType -> Family, Sports, Luxury

43
Q

binary split

A

divide values into two subsets; need to find optimal partitioning
ex: CarType -> {Sports, Luxury} and {Family}

44
Q

discretization forms

A

an ordinal categorical attribute

45
Q

static discretization

A

discretize once at the beginning (ex: split at the median)

46
Q

dynamic discretization

A

ranges can be found by equal interval bucketing, equal frequency bucketing, or clustering

47
Q

equal interval bucketing

A

split the range of values into equal-width intervals (ex: every 10 points, every $5000)

48
Q

equal frequency bucketing

A

split at percentiles so there is an equal number of records in each bucket (ex: quartiles)
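
If pandas happens to be available, its two binning helpers map onto these ideas (the score values below are invented):

```python
import pandas as pd

scores = pd.Series([55, 61, 64, 70, 72, 81, 88, 93])

# Equal interval bucketing: 4 bins of equal width over the value range
equal_width = pd.cut(scores, bins=4)

# Equal frequency bucketing: 4 bins with (roughly) the same number of records
equal_freq = pd.qcut(scores, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```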

49
Q

clustering

A

look for gaps in the values and group nearby values together (ex: scores that cluster together get the same grade)

50
Q

binary decision splitting

A

(A < v) or (A >= v)

considers all possible split values v and finds the best cut; can be computationally expensive

51
Q

the best splits will result in

A

purer child nodes (ex: a node with 1 record of Class 0 and 7 of Class 1; the class counts should be far apart, with at least one of them as close to zero as possible)

52
Q

nodes with ? class distribution are preferred

A

homogeneous

53
Q

ex: Class 0 - 5, Class 1 - 5

A

non-homogeneous, high degree of impurity

54
Q

ex: Class 0 - 9, Class 1 - 1

A

homogeneous, low degree of impurity

55
Q

measures of node impurity

A

Gini Index, Entropy, Misclassification Error

56
Q

GINI(t) =

A

1 - the sum over all classes j of p(j | t)^2, where p(j | t) is the fraction of records at node t that belong to class j

57
Q

you want to (minimize/maximize) the GINI

A

minimize

58
Q

GINI’s range is

A

0 to 0.5 for a two-class problem (0 = all records in one class, 0.5 = records split evenly between the classes)

59
Q

GINI example:
C1 - 2
C2 - 4

A
P(C1) = 2/6, P(C2) = 4/6
GINI = 1 - (2/6)^2 - (4/6)^2 = 0.444
60
Q

GINI index for binary attributes

A

the weighted average of the children's GINI values: GINI(N1) * (n1/n) + GINI(N2) * (n2/n), where ni is the number of records that go to child Ni and n is the number of records at the parent
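
A small sketch of both computations; it reproduces the 0.444 example above and then weights two hypothetical children:

```python
def gini(class_counts):
    """GINI(t) = 1 - sum over classes of p(j | t)^2, from the class counts at node t."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(children_counts):
    """Weighted average of the children's Gini values for a candidate split."""
    total = sum(sum(counts) for counts in children_counts)
    return sum(sum(counts) / total * gini(counts) for counts in children_counts)

print(round(gini([2, 4]), 3))                  # 0.444, matching the C1=2, C2=4 example
print(round(gini_split([[1, 7], [5, 1]]), 3))  # about 0.244 for this hypothetical binary split
```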

61
Q

for efficient computation on where to split, for each attribute, …

A

sort the records on that attribute's values, then linearly scan the sorted values, updating the count matrix and computing the Gini index at each candidate split position, and choose the split position with the smallest Gini index
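
A self-contained sketch of that sorted linear scan for one continuous attribute; the attribute values and labels are made up:

```python
# Linear scan over sorted attribute values: at each candidate cut, update the
# class counts on each side and compute the weighted Gini; keep the best cut.
def node_gini(counts):
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    classes = set(labels)
    data = sorted(zip(values, labels))                 # sort on the attribute values
    left = {c: 0 for c in classes}                     # counts below the cut
    right = {c: labels.count(c) for c in classes}      # counts at/above the cut
    best = (float("inf"), None)
    n = len(data)
    for i in range(1, n):
        v, y = data[i - 1]
        left[y] += 1                                   # move one record across the cut
        right[y] -= 1
        if data[i][0] == v:                            # only cut between distinct values
            continue
        cut = (v + data[i][0]) / 2
        weighted = (i / n) * node_gini(left) + ((n - i) / n) * node_gini(right)
        best = min(best, (weighted, cut))
    return best                                        # (smallest Gini, split position)

values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]  # made-up continuous attribute
labels = ["no", "no", "no", "yes", "yes", "yes", "no", "no", "no", "no"]
print(best_split(values, labels))                      # -> (0.3, 97.5) on this toy data
```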

62
Q

Entropy based computations are similar to the ?

A

GINI index computations

63
Q

Entropy’s range is

A

0 to 1 for a two-class problem (using log base 2)

64
Q

want to (minimize/maximize) entropy

A

minimize

65
Q

want to (minimize/maximize) GAIN

A

maximize

66
Q

GAIN of a split is a function of entropy; to maximize it, we

A

choose the split that achieves the largest reduction in entropy from the parent node to its children (i.e., the largest information gain)
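
A hedged sketch of entropy and the gain of a split (the class counts are arbitrary):

```python
import math

def entropy(class_counts):
    """Entropy(t) = -sum over classes of p(j | t) * log2 p(j | t)."""
    n = sum(class_counts)
    return -sum((c / n) * math.log2(c / n) for c in class_counts if c > 0)

def gain(parent_counts, children_counts):
    """Reduction in entropy from the parent node to the weighted children."""
    n = sum(parent_counts)
    weighted_children = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted_children

parent = [5, 5]                      # evenly split parent: entropy 1.0
split_a = [[5, 0], [0, 5]]           # perfect split: gain 1.0
split_b = [[4, 3], [1, 2]]           # weaker split: much smaller gain
print(gain(parent, split_a), gain(parent, split_b))
```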

67
Q

classification error at a node t Error(t) =

A

1 - max P(i | t)

68
Q

want to (minimize/maximize) Error

A

minimize

69
Q

when to stop splitting

A

when all records at a node belong to the same class; when all records have similar attribute values (or, relaxing slightly, when some threshold such as 90% are in the same class); or early termination based on tree depth ("level") or on the number of records remaining at the node

70
Q

advantages of decision tree based classification

A

inexpensive to construct, extremely fast at classifying unknown records, small trees are easy to interpret, and accuracy is comparable to other classification techniques for many simple data sets