Yet Another Deck Flashcards

1
Q

Regression is a data mining task of predicting the value of the target (_) by building a model based on one or more predictors

A

numerical variable

2
Q

regression options

A

* decision tree (frequency table)
* multiple linear regression (covariance matrix)
* k-nearest neighbor (similarity functions)
* artificial neural networks (other)
* support vector machine (other)

Mnemonic: Dinosaurs made Kites and Vikings. Natural theory of regression.

3
Q

The ID3 algorithm can be used to construct a decision tree for regression by

A

replacing information gain with standard deviation reduction

4
Q

the standard deviation reduction for ID3 regression is based on the

A

decrease in standard deviation after a dataset is split on an attribute
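
A minimal sketch of how that reduction could be computed for one candidate attribute (the function name and the use of NumPy are my own choices, not from the deck):

```python
import numpy as np

def std_dev_reduction(target, attribute):
    """Std dev of the target minus the size-weighted std dev of the
    target within each branch created by splitting on the attribute."""
    target, attribute = np.asarray(target, float), np.asarray(attribute)
    weighted_after = sum(
        (attribute == v).mean() * target[attribute == v].std()
        for v in np.unique(attribute))
    return target.std() - weighted_after

# ID3-style regression picks the attribute with the highest reduction.
```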

5
Q

constructing an ID3 decision tree is all about finding the attribute that returns the

A

highest standard deviation reduction

6
Q

when building an ID3 regression decision tree, a branch with standard deviation of more than zero _

A

requires further splitting

7
Q

Decision trees. To stop splitting forever we need some termination criteria, for example, when the _ becomes smaller than a certain fraction of the _

A

standard deviation, standard deviation for the full dataset (e.g. 5%)

8
Q

Decision trees. To stop splitting forever we need some termination criteria, for example, when too _

A

few instances remain in the branch (e.g. 3)

9
Q

Decision trees. To stop splitting forever we need some termination criteria. Then when the number of instances is more than one at a leaf node we _

A

calculate the average as the final value for the target

10
Q

Logistic regression predicts

A

the probability of an outcome that can only have two values (i.e. a dichotomy)

11
Q

The prediction for logistic regression is based on the use of one or several predictors (_ & _)

A

numerical, categorical

12
Q

A linear regression is not appropriate for predicting the value of a binary variable for two reasons (1) linear regression will

A

predict values outside the acceptable range (0 to 1)

13
Q

A linear regression is not appropriate for predicting the value of a binary variable for two reasons (2) since dichotomous experiments can only have one of two possible values for each experiment, the residuals will not

A

be normally distributed about the predicted line

14
Q

Logistic regression produces a _ which is _

A

logistic curve, limited to the values between 0 and 1

15
Q

logistic regression is similar to linear regression but the curve is constructed using the natural logarithm of the _ of the target variable, rather than the probability

A

odds
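
A tiny sketch of that relationship, assuming the natural logarithm as the card states (the helper names are mine):

```python
import math

def log_odds(p):
    # natural log of the odds p / (1 - p): the quantity the
    # logistic regression line is actually fitted to
    return math.log(p / (1 - p))

def logistic(z):
    # inverse mapping: squeezes any real value back into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

assert abs(logistic(log_odds(0.8)) - 0.8) < 1e-9
```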

16
Q

logistic regression is similar to linear regression but the predictors do not

A

have to be normally distributed or have equal variance in the group

17
Q

Just as ordinary least squares regression is the method used to estimate coefficients for the best-fit line in linear regression, logistic regression uses _ to obtain the model coefficients that relate predictors to the target

A

maximum likelihood estimation (MLE)

18
Q

An association rule is a pattern that states that when an event occurs, _

A

another event occurs with a certain probability

19
Q

Most instance-based learners use:

A

Euclidean distance

20
Q

Alternative to Euclidean distance

A

Manhattan (city-block) distance

21
Q

It is usual to normalize all attribute values to:

A

lie between 0 and 1.

22
Q

Normalizing Euclidean distance: symbolic attributes (non-numeric). The difference between two values is usually expressed as:

A

one (mismatch), zero (match)

23
Q

normalizing Euclidean distance formula - missing attributes are:

A

taken to be 1 (maximally different)

24
Q

normalizing Euclidean distance formula: For numeric attributes, the diff between two missing values is also taken as 1. However, if just one value is missing, the distance can be:

A

taken as the (normalized) size of the other value X, or 1−X, whichever is larger (i.e. as large as possible)
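
Putting the last few cards together, a per-attribute difference might look like the sketch below (assuming numeric values are already normalized to 0-1 and that None marks a missing value, both my own conventions):

```python
def attr_diff(a, b, numeric):
    """Per-attribute difference inside a normalized Euclidean distance."""
    if a is None and b is None:
        return 1.0                              # both missing: maximally different
    if a is None or b is None:
        other = a if b is None else b
        return max(other, 1 - other) if numeric else 1.0  # as large as possible
    if numeric:
        return abs(a - b)
    return 0.0 if a == b else 1.0               # symbolic: match 0, mismatch 1

def distance(x, y, numeric_flags):
    return sum(attr_diff(a, b, f) ** 2
               for a, b, f in zip(x, y, numeric_flags)) ** 0.5
```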

25
Q

instance-based learning is slow because you are:

A

calculating distance from every member of the training set

26
Q

Nearest neighbors can be found using a:

A

kD-tree

27
Q

kD-Tree

A

A binary tree that divides the input space with a hyperplane at each node; k = the number of attributes.

28
Q

kD-trees: note that the hyperplanes are not _

A

decision boundaries

29
Q

kD-tree: choosing the median value may yield skinny hyperrectangles. Instead,

A

use the mean and the point closest to it.

30
Q

Kd-tree: instance-based learning advantage; can update it incrementally. To do this for a kd-tree, determine which _ contains the new point and find its _.

A

leaf node, hyperrectangle

31
Q

Kd-tree: corners of rectangular regions awkward? Use:

A

hyperspheres, rather than hyperrectangles

32
Q

Unlike kD-trees, which depend on regions being disjoint, the _ defines k-dimensional hyperspheres (“_”) that cover the data points, and _

A

ball tree, balls, arranges them into a tree

33
Q

Ball Tree: regions can overlap, but points in the overlap are assigned to _

A

only one of the overlapping balls

34
Q

Nearest-neighbor instance-based learning: k-nearest-neighbor finds the k nearest neighbors and determines the class using:

A

a majority vote
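
As an illustration only (scikit-learn is not mentioned in the deck), its k-nearest-neighbor classifier combines a kD-tree or ball tree search with exactly this majority vote:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.85, 0.9]]
y = ["a", "a", "b", "b"]

# algorithm can be "kd_tree" or "ball_tree"; the predicted class is
# a majority vote over the k nearest training instances
knn = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree").fit(X, y)
print(knn.predict([[0.2, 0.2]]))   # ['a']
```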

35
Q

kD Trees: worthwhile only when the number of attributes is small:

A

up to 10

36
Q

Ball trees are an instance of a more general structure called _. Sophisticated algorithms can create trees that deal successfully with _ of dimensions.

A

a metric tree, thousands

37
Q

kD Trees: Instead of storing all training instances, you can _.

A

compress into regions

38
Q

kD trees: discretize numeric attributes into intervals (and, for nominal attributes, “intervals” consisting of a single point). Determine which intervals the test instance resides in and classify by voting. This is called:

A

voting feature intervals

39
Q

Discretizing into intervals, then determining which intervals test instances reside in, by voting.

A

voting feature intervals

40
Q

Different ways in which the result of clustering can be expressed: _ (instances belong to one group), _ (instances belong to several groups), _ (instances belong to groups with a probability), or _ (hierarchical).

A

exclusive, overlapping, probabilistic, hierarchical

41
Q

The classic clustering technique is called _

A

k-means

42
Q

K-means. What does the parameter k stand for?

A

how many clusters are being sought

43
Q

k-means - how are clusters chosen?

A

k cluster centers are chosen at random; instances are assigned using the Euclidean distance metric

44
Q

k-means - all instances are assigned _ according to the _

A

to their closest cluster center, Euclidean distance metric

45
Q

k-means: ‘means’ part: the _ of the instances in each cluster

A

centroid or mean; these are the new center values for their respective clusters.

46
Q

k-means: Iteration continues until the same points are assigned to each cluster in:

A

consecutive rounds

47
Q

k-means: choosing the cluster center to be the centroid _ from each of the cluster’s points to its center.

A

minimizes the total squared distance

48
Q

k-means: to increase the chance of finding a global minimum people often_ and choose the best final result—the one with the smallest _.

A

run the algorithm several times with different initial choices, total squared distance

49
Q

k-means clustering can be dramatically improved by careful choice of the initial cluster centers, often called _.

A

seeds

50
Q

k-means: Instead of beginning with an arbitrary set of seeds, choose an initial seed, then choose a second seed with a probability that is _ the distance from the first.

A

proportional to the square of

51
Q

k-means: seeding intelligently using this procedure, called _, improves both speed and accuracy over the original algorithm with random seeds.

A

k-means++
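
As an illustration only (scikit-learn is not mentioned in the deck), its KMeans estimator ties the last few cards together: k-means++ seeding plus repeated runs, keeping the run with the smallest total squared distance:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.1]])

# init="k-means++" seeds each new center with probability proportional to
# the squared distance from the nearest center already chosen; n_init
# reruns the whole algorithm and keeps the result with the smallest
# total squared distance (inertia)
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_, km.inertia_)
```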

52
Q

1R for 1-rule

A

a one-level decision tree, expressed as a set of rules that all test one particular attribute

53
Q

1R: deals with missing values in a simple but effective way. Missing is treated as

A

another attribute value

54
Q

1R: deals with numeric attributes in a simple but effective way. Use _

A

discretization

55
Q

1R may form an excessively large number of categories (e.g. an ID code attribute). This phenomenon is known as _

A

overfitting

56
Q

1R: to avoid overfitting when discretizing a numeric attribute, a minimum limit is imposed on

A

the number of instances of the majority class in each partition.
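
A minimal sketch of the 1R idea for a single nominal attribute (names and data layout are my own; discretization and the minimum-count limit are left out):

```python
from collections import Counter, defaultdict

def one_r_for_attribute(instances, classes, attr_index):
    """For each value of one attribute, predict the majority class of the
    instances with that value; return the rules and their error count."""
    by_value = defaultdict(list)
    for inst, cls in zip(instances, classes):
        by_value[inst[attr_index]].append(cls)
    rules = {v: Counter(lbls).most_common(1)[0][0] for v, lbls in by_value.items()}
    errors = sum(rules[inst[attr_index]] != cls
                 for inst, cls in zip(instances, classes))
    return rules, errors

# 1R evaluates every attribute this way and keeps the one with the fewest errors.
```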

57
Q

The gain ratio modification can overcompensate. One fix: choose the attribute that _ the gain ratio, provided that the information gain for that attribute is at least as great as the _ information gain for all the attributes examined.

A

maximizes, average

58
Q

Information gain is biased _ even though these won’t work on new data

A

towards attributes with many values

59
Q

Use the gain ratio to deal with the problem with information gain: look at the _

A

split entropy

60
Q

split entropy formula: relates the size of each subset to _

A

the size of the entire set

61
Q

gain ratio formula

A

Gain(S,A)/SplitEntropy(S,A)
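
For concreteness, a small sketch of those quantities (helper names are mine; each argument is a list of class counts):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_ratio(parent_counts, child_counts_per_branch):
    total = sum(parent_counts)
    branch_sizes = [sum(c) for c in child_counts_per_branch]
    gain = entropy(parent_counts) - sum(
        size / total * entropy(c)
        for size, c in zip(branch_sizes, child_counts_per_branch))
    split_entropy = entropy(branch_sizes)   # looks only at subset sizes, not classes
    return gain / split_entropy
```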

62
Q

A series of improvements to ID3 culminated in a practical and influential system for decision tree induction called _

A

C4.5

63
Q

covering algorithms: seek a way of covering all instances in a class and excluding instances not in it, at each stage identifying a rule that covers some of the instances.

A

covering approach

64
Q

Covering algorithms: constructing rules leads to

A

Rules, not trees

65
Q

covering algorithms: replicated subtree problem

A

rules can be symmetric, whereas trees must select one attribute to split on first

66
Q

covering algorithms: A decision tree split takes all classes into account in trying to maximize the purity of the split, whereas

A

rules concentrate on one class at a time, disregarding the other classes

67
Q

covering algorithms: accuracy/confidence

A

instances predicted correctly, as a proportion of the instances to which the rule applies

68
Q

covering algorithms: coverage also called

A

support

69
Q

covering algorithms: accuracy also called

A

confidence

70
Q

covering algorithms: In the event of a tie, we

A

choose the rule with the greater coverage

71
Q

covering algorithms: coverage

A

number of instances predicted correctly

72
Q

PRISM can be described as _

A

a separate-and-conquer algorithm

73
Q

PRISM method

A

Identify a rule that covers many instances, separate out the instances it covers, and continue on the remaining instances
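
A skeleton of that separate-and-conquer loop (find_best_rule is a hypothetical placeholder for a PRISM-style search that maximizes rule accuracy):

```python
def separate_and_conquer(instances, classes, target_class, find_best_rule):
    """Covering loop: find a rule for the target class, remove the
    instances it covers, and repeat on whatever remains."""
    rules, remaining = [], list(zip(instances, classes))
    while any(cls == target_class for _, cls in remaining):
        rule = find_best_rule(remaining, target_class)   # hypothetical helper
        if rule is None:
            break
        rules.append(rule)
        remaining = [(inst, cls) for inst, cls in remaining
                     if not rule(inst)]                  # keep only uncovered instances
    return rules
```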

74
Q

the information value for info([2, 3]) is _ bits

A

−2/5 × log₂(2/5) − 3/5 × log₂(3/5)
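
A quick check of that value (the base-2 logarithm follows from information being measured in bits):

```python
from math import log2
print(round(-(2/5) * log2(2/5) - (3/5) * log2(3/5), 3))   # 0.971
```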

75
Q

μ is the

A

mean

76
Q

Naïve Bayes: Word frequencies can be accommodated by applying a _

A

modified form of Naïve Bayes called multinomial Naïve Bayes.

77
Q

Naïve Bayes: a document can be viewed as a _

A

bag of words
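
As an illustration only (scikit-learn is not part of the deck), a bag-of-words document classifier using multinomial Naïve Bayes might look like:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs   = ["the cat sat on the mat", "stocks fell sharply today"]
topics = ["pets", "finance"]

# CountVectorizer builds the bag of words: per-document word counts, order ignored
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
model = MultinomialNB().fit(X, topics)
print(model.predict(vectorizer.transform(["the dog sat today"])))
```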

78
Q

Naïve Bayes. If you suspect an attribute’s distribution isn’t normal but don’t know the actual distribution, there are procedures for “_” that do not assume any particular distribution for the attribute values.

A

kernel density estimation

79
Q

the information gain of an ID code or day-of-the-week attribute would be the same as

A

the information at the root

80
Q

gain ratio modification can overcompensate. One fix is to choose the attribute that _, provided that the information gain for that attribute is at least as great as the _ for all the attributes examined.

A

maximizes the gain ratio, average information gain

81
Q

information: the amount of information obtained by making a decision. When the numbers of yes’s and no’s are equal:

A

information reaches a maximum

82
Q

choose best splitting attribute:

information gain
gain(temperature) = 0.571 bits
gain(humidity) = 0.971 bits
gain(windy) = 0.020 bits

A

humidity

83
Q

An important domain for machine learning is document classification, in which each instance _ and the instance’s class is the _

A

represents a document, document topic

84
Q

One of the really nice things about Naïve Bayes is that _.

A

missing values are no problem

85
Q

This method goes by the name of _ because it’s based on Bayes’ rule and assumes independence: it is only valid to multiply probabilities when the events are _.

A

Naïve Bayes, independent

86
Q

The gain ratio is derived by taking into account _ the dataset, disregarding any information about the class

A

the number and size of daughter nodes into which an attribute splits

87
Q

information is measured in:

A

units called bits
