CSCI 343 Quiz 2 Flashcards

1
Q

K-nearest neighbor (KNN) is a ? learner

A

lazy

2
Q

a lazy learner has no…

A

no training, no model

3
Q

in a typical machine learning algorithm, ? takes more time; in a KNN algorithm, ? takes more time

A

typical: training takes time
KNN: testing/prediction takes time

4
Q

KNN uses a

A

similarity measure

5
Q

the parameter k for KNN is usually

A

odd & small

6
Q

K-nearest neighbor works by

A

finding the k closest samples and using a majority vote of their labels; a distance-weighted vote (or weighted average) can also be used

7
Q

the parameter k for KNN is defaulted to be

A

5

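To make the KNN cards above concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier (assuming scikit-learn is available); the toy data is made up. The default n_neighbors is 5, prediction uses a majority vote over the k nearest samples, and weights="distance" switches to a distance-weighted vote.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data: two features per sample, binary labels.
X_train = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
y_train = [0, 0, 0, 1, 1, 1]

# "Lazy learner": fit() essentially just stores/indexes the training data.
knn = KNeighborsClassifier()          # n_neighbors defaults to 5
knn.fit(X_train, y_train)

# Prediction is where the work happens: compute distances to every stored
# sample, take the 5 closest, and use a majority vote on their labels.
print(knn.predict([[2, 2], [6, 5]]))  # -> [0 1]

# Distance-weighted variant: closer neighbors count more toward the vote.
knn_w = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn_w.fit(X_train, y_train)
print(knn_w.predict([[2, 2]]))        # -> [0]
```
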
8
Q

Naive Bayes Classification can be used for (examples)

A

spam classification, medical diagnosis, weather prediction

9
Q

Naive Bayes: given ? predict ?

A

given features x1, x2, …, xn, predict label Y

10
Q

conditional probability

A

the probability that the event B will occur, given event A has already occurred; P(B|A)

11
Q

if A and B are independent, the conditional probability of B given A is

A

P(B)

12
Q

if A and B are NOT independent, the probability of both events occurring is

A

P(A,B) = P(A) * P(B|A)

13
Q

the conditional probability of B given A (NOT independent) is

A

P(B|A) = P(A,B) / P(A)

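A small worked check of the two formulas above, using made-up joint counts:

```python
# Made-up counts over events A and B, out of 100 trials.
n_total = 100
n_A = 40          # trials where A occurred
n_A_and_B = 10    # trials where both A and B occurred

P_A = n_A / n_total              # P(A)   = 0.4
P_A_and_B = n_A_and_B / n_total  # P(A,B) = 0.1

# Conditional probability: P(B|A) = P(A,B) / P(A)
P_B_given_A = P_A_and_B / P_A
print(P_B_given_A)               # 0.25

# Consistency check with the product rule: P(A,B) = P(A) * P(B|A)
print(P_A * P_B_given_A)         # 0.1
```
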
14
Q

Bayes Rule written out

A

P(Y | X1, …, Xn) = [P(X1, …, Xn | Y) * P(Y)] / P(X1, …, Xn)

15
Q

Bayes Rule in simpler terms

A

(likelihood * prior) / normalization constant

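A tiny numeric illustration of these pieces, using made-up numbers for a one-feature spam example (X = "message contains the word free", Y = spam or ham):

```python
# Made-up prior and likelihoods, for illustration only.
P_spam = 0.3                  # prior P(Y=spam)
P_ham = 0.7                   # prior P(Y=ham)
P_free_given_spam = 0.6       # likelihood P(X=free | Y=spam)
P_free_given_ham = 0.1        # likelihood P(X=free | Y=ham)

# Normalization constant P(X=free): sum over both values of Y.
P_free = P_free_given_spam * P_spam + P_free_given_ham * P_ham  # about 0.25

# Posterior = (likelihood * prior) / normalization constant
P_spam_given_free = P_free_given_spam * P_spam / P_free
print(P_spam_given_free)      # about 0.72
```
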
16
Q

likelihood in Bayes Rule

A

P(X1, …, Xn | Y): the probability of observing the features given that the class is Y

17
Q

prior in Bayes Rule

A

P(Y), the probability of the class before seeing any features; often assumed to be a uniform distribution

18
Q

how many parameters are required to specify the prior for the digit recognition example?

A

one – Y (so what number it is)

19
Q

how many parameters are required to specify the likelihood for the digit recognition example?

A

2(2^900 - 1): for each of the 2 values of Y, the joint distribution over the 900 binary (0 or 1) pixels X1, …, X900 has 2^900 - 1 free parameters

20
Q

the problem with explicitly modeling P(X1,…,Xn | Y) is

A

there are usually too many parameters (we’ll run out of space, time, and need tons of training data)

21
Q

Naive Bayes Assumption

A

assume that all features are independent given the class label Y

22
Q

Naive Bayes Model equation

A

P(X1,…,Xn | Y) = the product from i = 1 to n of P(Xi | Y)

23
Q

How to use the Naive Bayes Model:

A

assume conditional independence, compute each class's likelihood as the product of the individual P(Xi | Y) terms, and predict the class with the largest resulting posterior

24
Q

Why is the Naive Bayes Model useful?

A

reduces the number of parameters from exponential in the number of features to linear

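A quick arithmetic check of the parameter counts behind the digit-recognition cards above (2 classes, 900 binary pixels), showing the exponential-to-linear reduction:

```python
n_pixels = 900
n_classes = 2

# Explicitly modeling the joint P(X1,...,X900 | Y): for each class, a
# distribution over 2^900 pixel configurations has 2^900 - 1 free parameters.
full_joint = n_classes * (2 ** n_pixels - 1)

# Naive Bayes: one parameter P(Xi=1 | Y) per pixel per class.
naive_bayes = n_classes * n_pixels

print(len(str(full_joint)))   # 272 -- the full-joint count has 272 digits
print(naive_bayes)            # 1800 -- linear in the number of pixels
```
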
25
How to train in Naive Bayes
estimate the prior P(Y=v) as the fraction of records with Y=v & estimate P(Xi=u | Y=v) as the fraction of records with Y=v for which Xi=u
26
In practice, some counts can be zero. Fix this by adding "virtual" counts. This is called
smoothing
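A from-scratch sketch of the training recipe in the last two cards, for binary features, with add-one "virtual counts" as the smoothing; the records below are made up:

```python
from collections import Counter, defaultdict

# Made-up training records: binary feature vectors and their labels.
X = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0)]
y = ["spam", "spam", "ham", "ham"]

alpha = 1  # virtual count added to every cell (smoothing); avoids zero probabilities

# Prior: P(Y=v) = fraction of records with Y=v.
label_counts = Counter(y)
prior = {v: c / len(y) for v, c in label_counts.items()}

# Conditional: P(Xi=1 | Y=v) =
#   (records with Y=v and Xi=1 + alpha) / (records with Y=v + 2*alpha)
cond = defaultdict(dict)
for v in label_counts:
    rows = [x for x, label in zip(X, y) if label == v]
    for i in range(len(X[0])):
        ones = sum(row[i] for row in rows)
        cond[v][i] = (ones + alpha) / (len(rows) + 2 * alpha)

print(prior)         # {'spam': 0.5, 'ham': 0.5}
print(cond["spam"])  # e.g. P(X0=1 | spam) = (2+1)/(2+2) = 0.75
```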
27
What's nice about Naive Bayes is that it returns probabilities, which tell us ?
how confident the algorithm is
28
Naive Bayes assumption
all features are independent given the class label Y
29
does the Naive Bayes assumption hold for the digit recognition problem?
no, because if a pixel is black, the pixels next to it are more likely to be black as well
30
an example where conditional independence fails is
XOR (exclusive or -- either X1 or X2 but NOT both)
31
the Naive Bayes assumption is almost (never/always) true
never (but it still performs well)
32
numeric underflow
machine learning multiplies many very small probabilities; e.g., the probability of a specific sequence of 2000 independent fair coin flips is (0.5)^2000, which floating point would output as zero
33
to fix numeric underflow, instead of comparing P(Y=5 | X1, ..., Xn) with P(Y=6 | X1, ..., Xn), compare their ?
logarithms
34
it is better to ? of probabilities rather than multiplying probabilities
sum logs
35
log(xy) =
log(x) + log(y)
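A short demonstration of the underflow problem and the log fix described in the last few cards:

```python
import math

# Multiplying 2000 probabilities of 0.5 underflows to zero in floating point ...
print(0.5 ** 2000)           # 0.0

# ... but the equivalent sum of logs is perfectly representable.
print(2000 * math.log(0.5))  # about -1386.29

# So compare classes by their log-scores,
#   log P(Y=5) + sum_i log P(Xi | Y=5)   vs   log P(Y=6) + sum_i log P(Xi | Y=6),
# instead of by the underflowed products.
```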
36
can there be more than one decision tree that fits the same data?
yes
37
decision tree classification process
training set -> induction (tree induction algorithm learns the model) -> model (decision tree); then test set -> apply model (deduction) -> predicted labels
38
how to apply decision tree model to test data
start from the root of the tree and follow the branch matching the record's value of each splitting attribute until a leaf is reached
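A minimal sketch (not any particular library's API) of applying a tree model to a test record by walking from the root; the tree and record below are invented for illustration:

```python
# Internal nodes test one splitting attribute; leaves hold a class label.
# The attributes, values, and labels here are made up.
tree = {
    "attr": "CarType",
    "branches": {
        "Sports": {"label": "Yes"},
        "Luxury": {"label": "Yes"},
        "Family": {
            "attr": "Age",
            "branches": {"Young": {"label": "Yes"}, "Old": {"label": "No"}},
        },
    },
}

def classify(node, record):
    """Start at the root and follow the branch chosen by each splitting attribute."""
    while "label" not in node:  # stop once a leaf is reached
        node = node["branches"][record[node["attr"]]]
    return node["label"]

print(classify(tree, {"CarType": "Family", "Age": "Old"}))  # -> No
```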
39
Hunt's Algorithm
```
Dt = set of training records that reach node t
if Dt contains records that all belong to the same class:
    create a leaf node for that class
else if Dt is an empty set:
    create a leaf node for the default class
else:  # Dt contains records belonging to more than one class
    use an attribute test to split the data into smaller subsets
    recurse on each subset
```
40
greedy strategy for tree induction
split the records based on an attribute test that optimizes certain criterion
41
issues with tree induction
how to split records and how to know when to stop splitting
42
multi-way split
use as many partitions as distinct values | ex: CarType -> Family, Sports, Luxury
43
binary split
divide values into two subsets; need to find optimal partitioning | ex: CarType -> {Sports, Luxury} and {Family}
44
discretization forms
an ordinal categorical attribute
45
static discretization
discretize once at the beginning (ex: split at the median)
46
dynamic discretization
ranges can be found by equal interval bucketing, equal frequency bucketing, or clustering
47
equal interval bucketing
divide the value range into equal-width intervals (ex: a bucket every 10 points, or every $5,000)
48
equal frequency bucketing
take from percentiles so there's an equal number of records in each range (ex: each quartile)
49
clustering
look for natural gaps in the values and group nearby values together (ex: give scores that cluster together the same grade)
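A short illustration of the two bucketing strategies using pandas (assuming pandas is available): pd.cut gives equal-interval buckets, pd.qcut gives equal-frequency buckets. The values are made up.

```python
import pandas as pd

# Made-up continuous attribute (e.g., exam scores) to discretize.
scores = pd.Series([51, 55, 60, 72, 75, 80, 88, 95])

# Equal-interval bucketing: 4 bins of equal width across the value range.
print(pd.cut(scores, bins=4).value_counts().sort_index())

# Equal-frequency bucketing: 4 bins with (roughly) the same number of records,
# i.e., quartiles.
print(pd.qcut(scores, q=4).value_counts().sort_index())
```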
50
binary decision splitting
(A < v) or (A >= v) | consider all possible splits and find the best cut; can be computationally expensive
51
the best splits will result in
purer child nodes (ex: Class 0 has 1 record while Class 1 has 7 -- want the class counts to be far apart, with at least one count as close to zero as possible)
52
nodes with ? class distribution are preferred
homogeneous
53
ex: Class 0 - 5, Class 1 - 5
non-homogeneous, high degree of impurity
54
ex: Class 0 - 9, Class 1 - 1
homogeneous, low degree of impurity
55
measures of node impurity
Gini Index, Entropy, Misclassification Error
56
GINI(t) =
1 - the sum over classes j of p(j | t)^2, where p(j | t) is the relative frequency of class j at node t
57
you want to (minimize/maximize) the GINI
minimize
58
GINI's range is
0 to 0.5 (for a two-class problem)
59
GINI example: C1 - 2 C2 - 4
```
P(C1) = 2/6, P(C2) = 4/6
GINI = 1 - (2/6)^2 - (4/6)^2 = 0.444
```
60
GINI index for binary attributes
GINI(Node1) * (fraction of records that go to Node1) + GINI(Node2) * (fraction of records that go to Node2)
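A sketch that computes GINI(t) from class counts and the weighted GINI of a binary split; it reproduces the 2-vs-4 example above, and the split's child counts are made up:

```python
def gini(counts):
    """GINI(t) = 1 - sum over classes j of p(j|t)^2, from the class counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Example from the earlier card: C1 = 2, C2 = 4.
print(round(gini([2, 4]), 3))   # 0.444

# Weighted GINI of a binary split (made-up child counts):
# Node1 holds [5, 1] and Node2 holds [1, 5], so each gets 6 of the 12 records.
children = [[5, 1], [1, 5]]
n_total = sum(sum(c) for c in children)
split_gini = sum(sum(c) / n_total * gini(c) for c in children)
print(round(split_gini, 3))     # 0.278
```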
61
for efficient computation on where to split, for each attribute, ...
sort the records on the attribute's values, linearly scan these values, each time updating the count matrix and computing the Gini index, and **choose the split position that has the smallest Gini index**
62
Entropy based computations are similar to the ?
GINI index computations
63
Entropy's range is
0 to 1 (for a two-class problem, using log base 2)
64
want to (minimize/maximize) entropy
minimize
65
want to (minimize/maximize) GAIN
maximize
66
GAIN of a split is a function of entropy; to maximize it, we
choose the split that achieves the largest reduction in entropy (i.e., the highest information gain)
67
classification error at a node t Error(t) =
1 - max P(i | t)
68
want to (minimize/maximize) Error
minimize
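A sketch of entropy, GAIN, and classification error using the same count-vector convention as the GINI sketch above; all counts are made up:

```python
import math

def entropy(counts):
    """Entropy(t) = -sum over classes j of p(j|t) * log2 p(j|t); 0*log 0 is treated as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def error(counts):
    """Classification error(t) = 1 - max over classes i of P(i|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def gain(parent, children):
    """GAIN = Entropy(parent) - weighted average entropy of the child nodes."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

parent = [5, 5]                           # maximally impure two-class node
print(entropy(parent))                    # 1.0
print(error(parent))                      # 0.5
print(gain(parent, [[5, 1], [0, 4]]))     # about 0.61 -- pick the split with the largest gain
```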
69
when to stop splitting
when all records belong to the same class (or nearly all, e.g., at least 90%), when all records have similar attribute values, or by early termination based on tree depth ("level") or on how few records remain
70
advantages of decision tree based classification
inexpensive to construct, extremely fast classification, easy to interpret small trees, accuracy is comparable to other techniques for similar data sets