Yet Another Deck Flashcards

1
Q

Regression is a data mining task of predicting the value of the target (_) by building a model based on one or more predictors

A

numerical variable

2
Q

regression options

A

* decision tree (frequency table)
* multiple linear regression (covariance matrix)
* k-nearest neighbor (similarity functions)
* artificial neural networks (other)
* support vector machine (other)

Mnemonic: Dinosaurs Made Kites and Vikings. Natural theory of regression.
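A minimal sketch of these options as scikit-learn estimators (assuming scikit-learn and NumPy are available); the synthetic data and hyperparameters are illustrative, not part of the cards.

```python
# Sketch: the regression options above, expressed as scikit-learn estimators.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))                  # two numeric predictors
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 200)   # numeric target

models = {
    "decision tree": DecisionTreeRegressor(max_depth=4),
    "multiple linear regression": LinearRegression(),
    "k-nearest neighbor": KNeighborsRegressor(n_neighbors=5),
    "artificial neural network": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000),
    "support vector machine": SVR(kernel="rbf"),
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: R^2 on training data = {model.score(X, y):.2f}")
```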

3
Q

The ID3 algorithm can be used to construct a decision tree for regression by

A

replacing information gain with standard deviation reduction

4
Q

the standard deviation reduction for ID3 regression is based on the

A

decrease in standard deviation after a dataset is split on an attribute

5
Q

constructing an ID3 decision tree is all about finding the attribute that returns the

A

highest standard deviation reduction

6
Q

when building an ID3 regression decision tree, a branch with a standard deviation of more than zero _

A

requires further splitting

7
Q

Decision trees. To stop splitting forever we need some termination criteria, for example, when the _ becomes smaller than a certain fraction of the _

A

standard deviation, standard deviation of the full dataset (e.g. 5%)

8
Q

Decision trees. To stop splitting forever we need some termination criteria, for example, when too _

A

few instances remain in the branch (e.g. 3)

9
Q

Decision trees. To stop splitting forever we need some termination criteria. Then when the number of instances is more than one at a leaf node we _

A

calculate the average as the final value for the target
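A minimal sketch pulling cards 3-9 together: split on the attribute with the highest standard deviation reduction, stop when a branch's standard deviation drops below 5% of the full dataset's or 3 or fewer instances remain, and return the average target value at each leaf. The helper names and toy data are assumptions, not the textbook's code.

```python
# Sketch of ID3-style regression: standard deviation reduction (SDR) as the
# splitting criterion, with the termination criteria from the cards above.
import math
from collections import defaultdict

def stdev(values):
    if len(values) < 2:
        return 0.0
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def sdr(rows, targets, attr):
    """Decrease in standard deviation after splitting on attr."""
    groups = defaultdict(list)
    for row, t in zip(rows, targets):
        groups[row[attr]].append(t)
    weighted = sum(len(g) / len(targets) * stdev(g) for g in groups.values())
    return stdev(targets) - weighted

def build(rows, targets, attrs, full_sd, min_instances=3, sd_fraction=0.05):
    # Termination: too few instances (e.g. 3 or fewer), branch SD below a
    # fraction of the full dataset's SD (e.g. 5%), or no attributes left.
    if len(targets) <= min_instances or stdev(targets) < sd_fraction * full_sd or not attrs:
        return sum(targets) / len(targets)      # leaf: average target value
    best = max(attrs, key=lambda a: sdr(rows, targets, a))
    groups = defaultdict(list)
    for row, t in zip(rows, targets):
        groups[row[best]].append((row, t))
    tree = {}
    for value, members in groups.items():
        sub_rows = [r for r, _ in members]
        sub_targets = [t for _, t in members]
        tree[value] = build(sub_rows, sub_targets, [a for a in attrs if a != best],
                            full_sd, min_instances, sd_fraction)
    return (best, tree)

# Toy nominal data: predict hours played from weather attributes.
rows = [{"outlook": o, "windy": w} for o, w in
        [("sunny", False), ("sunny", True), ("overcast", False),
         ("rainy", False), ("rainy", True), ("overcast", True)]]
targets = [25, 30, 46, 45, 52, 43]
print(build(rows, targets, ["outlook", "windy"], full_sd=stdev(targets)))
```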

10
Q

Logistic regression predicts

A

the probability of an outcome that can only have two values (i.e. a dichotomy)

11
Q

The prediction for logistic regression is based on the use of one or several predictors (_ & _)

A

numerical, categorical

12
Q

A linear regression is not appropriate for predicting the value of a binary variable for two reasons (1) linear regression will

A

predict values outside the acceptable range (0 to 1)

13
Q

A linear regression is not appropriate for predicting the value of a binary variable for two reasons (2) since dichotomous experiments can only have one of two possible values for each experiment, the residuals will not

A

be normally distributed about the predicted line

14
Q

Logistic regression produces a _ which is _

A

logistic curve, limited to the values between 0 and 1

15
Q

logistic regression is similar to linear regression but the curve is constructed using the natural logarithm of the _ of the target variable, rather than the probability

A

odds

16
Q

logistic regression is similar to linear regression but the predictors do not

A

have to be normally distributed or have equal variance in each group

17
Q

Just as ordinary least squares regression is the method used to estimate coefficients for the best fit line in linear regression, logistic regression uses _ to obtain the model coefficients that relate predictors to the target

A

maximum likelihood estimation (MLE)
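A minimal sketch of cards 10-17: a logistic curve bounds the output to (0, 1), the linear part models the log-odds, and the coefficients are obtained by maximum likelihood, here via plain gradient ascent on the log-likelihood (learning rate and iteration count are arbitrary choices).

```python
# Sketch: logistic regression fit by maximum likelihood estimation (MLE)
# using gradient ascent on the log-likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # logistic curve, bounded to (0, 1)

def fit_logistic(X, y, lr=0.1, iters=5000):
    X = np.column_stack([np.ones(len(X)), X])  # intercept column
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)                     # predicted probabilities
        w += lr * X.T @ (y - p) / len(y)       # gradient of the log-likelihood
    return w

def predict_proba(w, X):
    X = np.column_stack([np.ones(len(X)), X])
    return sigmoid(X @ w)                      # log-odds = X @ w; output stays in (0, 1)

# Toy dichotomous outcome driven by one numeric predictor.
rng = np.random.default_rng(1)
x = rng.normal(0, 1, size=(200, 1))
y = (rng.uniform(size=200) < sigmoid(2.0 * x[:, 0] - 0.5)).astype(float)
w = fit_logistic(x, y)
print("coefficients (intercept, slope):", w)
print("P(y=1 | x=1):", predict_proba(w, np.array([[1.0]]))[0])
```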

18
Q

An association rule is a pattern that states that when an event occurs, _

A

another event occurs with a certain probability
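A minimal sketch of that idea on made-up transactions: the rule {bread} => {butter} says that when bread occurs, butter occurs with a certain probability, quantified by the rule's support (coverage) and confidence (accuracy) as defined in the later cards.

```python
# Sketch: support (coverage) and confidence (accuracy) of the association
# rule {bread} => {butter} on made-up transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

antecedent, consequent = {"bread"}, {"butter"}
both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

print("support (instances the rule predicts correctly):", both)   # 3
print("confidence:", both / ante)                                  # 3 / 4 = 0.75
```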

19
Q

Most instance-based learners use:

A

Euclidean distance

20
Q

Alternative to Euclidean distance

A

Manhattan (city-block) distance

21
Q

It is usual to normalize all attribute values to:

A

lie between 0 and 1

22
Q

Normalizing Euclidean distance: symbolic attributes (non-numeric). The difference between two different values is usually expressed:

A

one (mismatch), zero (match)

23
Q

normalizing Euclidean distance formula - missing attributes are:

A

taken to be 1 (maximally different)

24
Q

normalizing Euclidean distance formula: For numeric attributes, the diff between two missing values is also taken as 1. However, if just one value is missing, the distance can be:

A

taken as the (normalized) size of the other value X or 1-X, whichever is larger (i.e. as large as possible)
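A minimal sketch tying cards 21-24 together: numeric attributes are normalized to 0-1, symbolic mismatches count as 1 and matches as 0, and missing values (None here) are treated as maximally different. The attribute ranges are assumed to be known.

```python
# Sketch: per-attribute differences for a normalized Euclidean distance,
# following the conventions in the cards above (None marks a missing value).
import math

def numeric_diff(a, b, lo, hi):
    """Difference for a numeric attribute with known range [lo, hi]."""
    if a is None and b is None:
        return 1.0                                  # both missing: maximally different
    if a is None or b is None:
        x = ((a if a is not None else b) - lo) / (hi - lo)   # normalize the known value
        return max(x, 1.0 - x)                      # as large as possible
    return abs(a - b) / (hi - lo)                   # normalized difference, in [0, 1]

def nominal_diff(a, b):
    """Difference for a symbolic attribute."""
    if a is None or b is None:
        return 1.0                                  # missing: maximally different
    return 0.0 if a == b else 1.0                   # match = 0, mismatch = 1

def distance(x, y, ranges):
    """Euclidean distance over mixed attributes; ranges[i] is (lo, hi) or None for nominal."""
    total = 0.0
    for xi, yi, r in zip(x, y, ranges):
        d = numeric_diff(xi, yi, *r) if r is not None else nominal_diff(xi, yi)
        total += d * d
    return math.sqrt(total)

# Example: temperature in [0, 40] (one value missing) plus a nominal outlook.
print(distance((30, "sunny"), (None, "rainy"), ranges=[(0, 40), None]))   # 1.25
```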

25
instance-based learning is slow because you are:
calculating distance from every member of the training set
26
Nearest neighbors can be found using a:
kD-tree
27
kD-Tree
Binary tree which divides the input space with a hyperplane. k = the number of attributes.
28
kD-trees: note that the hyperplanes are not _
decision boundaries
29
Kd-tree: choosing the median value may yield skinny hyperrectangles. Rather,
use the mean & point closest to that.
30
Kd-tree: instance-based learning advantage: can update it incrementally. To do this for a kd-tree, determine which _ contains the new point and find its _.
leaf node, hyperrectangle
31
Kd-tree: corners of rectangular regions awkward? Use:
hyperspheres, rather than hyperrectangles
32
Unlike kD-trees, this structure does not depend on regions being disjoint: the _ defines k-dimensional hyperspheres ("_") that cover the data points, and _
ball tree, balls, arranges them into a tree
33
Ball Tree: regions can overlap, but points in the overlap are assigned to _
only one of the overlapping balls
34
Nearest-neighbor instance-based learning: in k-nearest-neighbor, the class is determined from the k nearest neighbors using:
a majority vote
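A minimal sketch of cards 26-27 and 34, assuming SciPy is available: build a kD-tree over the training points, query the k nearest neighbors, and classify by majority vote. The toy data is made up.

```python
# Sketch: k-nearest-neighbor classification using a kD-tree for the
# neighbor search and a majority vote over the k neighbors' classes.
from collections import Counter
import numpy as np
from scipy.spatial import KDTree

X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.15, 0.9]])
y_train = ["a", "a", "b", "b", "a"]

tree = KDTree(X_train)                     # binary tree dividing the input space

def classify(x, k=3):
    _, idx = tree.query(x, k=k)            # indices of the k nearest training points
    votes = Counter(y_train[i] for i in np.atleast_1d(idx))
    return votes.most_common(1)[0][0]      # majority vote

print(classify([0.2, 0.3]))                # -> 'a'
```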
35
kD Trees: worthwhile only when the number of attributes is small:
up to 10
36
Ball trees are an instance of a more general structure called: _. Sophisticated algorithms can create trees that deal successfully with _ of dimensions.
a metric tree, thousands
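A minimal sketch of cards 31-36, assuming scikit-learn is available: a ball tree covers the points with hyperspheres and answers the same kind of nearest-neighbor query, coping with more dimensions than a kD-tree. The data and leaf size are illustrative.

```python
# Sketch: nearest-neighbor queries with a ball tree (hyperspheres rather
# than hyperrectangles).
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))            # moderately high-dimensional points

tree = BallTree(X, leaf_size=40)
dist, ind = tree.query(X[:1], k=3)         # 3 nearest neighbors of the first point
print(ind[0], dist[0])                     # the point itself is its own nearest neighbor
```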
37
kD Trees: Instead of storing all training instances, you can _.
compress into regions
38
kD trees: discretize numeric attributes into intervals (for nominal attributes, "intervals" consist of a single point). Determine the class of a test instance by voting over the intervals it resides in. This is called:
voting feature intervals
39
Discretizing into intervals, then determining which intervals test instances reside in, by voting.
voting feature intervals
40
"Different ways in which the result of clustering can be expressed: _ (instances belong in one group). _ (instances belong in several groups). _ (instances belongs to groups with a probability), or _ (hierarchical)."
exclusive overlapping probabilistic hierachical
41
The classic clustering technique is called _
k-means
42
K-means. What does the parameter k stand for?
how many clusters are being sought
43
k-means - how are clusters chosen?
k cluster centers are chosen at random; instances are assigned to them using the Euclidean distance metric
44
k-means - all instances are assigned _ according to the _
to their closest cluster center, Euclidean distance metric
45
k-means: 'means' part: the _ of the instances in each cluster
centroid or mean; these are the new center values for their respective clusters.
46
k-means: Iteration continues until the same points are assigned to each cluster in:
consecutive rounds
47
k-means: choosing the cluster center to be the centroid _ from each of the cluster's points to its center.
minimizes the total squared distance
48
k-means: to increase the chance of finding a global minimum people often _ and choose the best final result, the one with the smallest _.
run the algorithm several times with different initial choices, total squared distance
49
k-means clustering can be dramatically improved by careful choice of the initial cluster centers, often called _.
seeds
50
k-means: Instead of beginning with an arbitrary set of seeds, can choose initial seed, then choose a second seed with a probability that is _ the distance from the first.
proportional to the square of
51
k-means: seeding intelligently using this procedure, called _, improves both speed and accuracy over the original algorithm with random seeds.
k-means++
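A minimal sketch of cards 41-51: k-means++ seeding (each new seed picked with probability proportional to the squared distance from the nearest existing seed), iteration until the same points are assigned to each cluster in consecutive rounds, and several restarts keeping the run with the smallest total squared distance. Data and parameters are illustrative.

```python
# Sketch: k-means with k-means++ seeding and multiple restarts.
import numpy as np

def kmeans_pp_seeds(X, k, rng):
    """Next seed chosen with probability proportional to the squared distance
    from the nearest seed chosen so far."""
    seeds = [X[rng.integers(len(X))]]
    while len(seeds) < k:
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)

def kmeans(X, k, rng, max_iter=100):
    centers = kmeans_pp_seeds(X, k, rng)
    labels = None
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)               # assign to nearest cluster center
        if labels is not None and np.array_equal(new_labels, labels):
            break                                     # same assignments in consecutive rounds
        labels = new_labels
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])   # new centers = centroids
    total = ((X - centers[labels]) ** 2).sum()        # total squared distance
    return centers, labels, total

def best_of(X, k, restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda r: r[2])              # smallest total squared distance

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
centers, labels, total = best_of(X, k=2)
print(centers, total)
```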
52
1R for 1-rule
a one-level decision tree, expressed as a set of rules that all test one particular attribute
53
1R: deals with Missing Attributes in simple but effective ways. Missing is treated as
another attribute value
54
1R: deals with Numeric Attributes in simple but effective ways. Use _
discretization
55
1R may form an excessively large number of categories (e.g. an ID code attribute). This phenomenon is known as _
overfitting
56
1R: to avoid overfitting when discretizing a numeric attribute, a minimum limit is imposed on
the number of instances of the majority class in each partition.
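A minimal sketch of 1R (cards 52-56) for nominal attributes: for each attribute build one rule per value predicting that value's majority class, count the errors, and keep the attribute with the fewest; a missing value (None) is treated as just another attribute value. Discretization of numeric attributes and the minimum-bucket-size fix are left out.

```python
# Sketch: the 1R (one-rule) learner for nominal attributes.
from collections import Counter, defaultdict

def one_r(rows, classes):
    """rows: list of dicts (attribute -> value, None allowed); classes: class labels."""
    best_attr, best_rules, best_errors = None, None, len(rows) + 1
    for attr in rows[0]:
        counts = defaultdict(Counter)                 # value -> class frequency table
        for row, c in zip(rows, classes):
            counts[row[attr]][c] += 1                 # None is just another value
        rules = {v: cnt.most_common(1)[0][0] for v, cnt in counts.items()}
        errors = sum(sum(cnt.values()) - cnt.most_common(1)[0][1]
                     for cnt in counts.values())
        if errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules

rows = [{"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
        {"outlook": "overcast", "windy": False}, {"outlook": "rainy", "windy": False},
        {"outlook": "rainy", "windy": True}, {"outlook": None, "windy": True}]
classes = ["no", "no", "yes", "yes", "no", "yes"]
print(one_r(rows, classes))   # one-level 'decision tree' as rules on a single attribute
```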
57
One fix is to choose the attribute that _ the gain ratio, provided that the information gain for that attribute is at least as great as the _ information gain for all the attributes examined.
maximizes, average
58
Information gain is biased _ even though these won't work on new data
towards attributes with many values
59
Use the gain ratio to deal with the problem with information gain; it looks at the _
split entropy
60
The split entropy formula relates the size of each subset to _
the size of the entire set
61
gain ratio formula
Gain(S,A)/SplitEntropy(S,A)
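A minimal sketch of the quantities in cards 57-61 and 74: information (entropy) in bits, information gain, split entropy based only on subset sizes, and the gain ratio Gain(S,A)/SplitEntropy(S,A). The counts come from a toy attribute that splits a [9 yes, 5 no] set three ways.

```python
# Sketch: entropy ("information" in bits), information gain, split entropy,
# and gain ratio for a nominal attribute.
import math

def info(counts):
    """Entropy of a class distribution, e.g. info([2, 3]) ~ 0.971 bits."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain(parent_counts, subsets):
    n = sum(parent_counts)
    remainder = sum(sum(s) / n * info(s) for s in subsets)
    return info(parent_counts) - remainder

def split_entropy(subsets):
    """Based only on subset sizes relative to the whole set; ignores the class."""
    return info([sum(s) for s in subsets])

def gain_ratio(parent_counts, subsets):
    return gain(parent_counts, subsets) / split_entropy(subsets)

# Toy attribute splitting a [9 yes, 5 no] set into three subsets.
parent = [9, 5]
subsets = [[2, 3], [4, 0], [3, 2]]
print(round(info([2, 3]), 3))                 # 0.971
print(round(gain(parent, subsets), 3))        # information gain
print(round(gain_ratio(parent, subsets), 3))  # Gain(S,A) / SplitEntropy(S,A)
```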
62
A series of improvements to ID3 culminated in a practical and influential system for decision tree induction called _
C4.5
63
covering algorithms: seek a way of covering all instances in a class & excluding all else, at each stage identifying a rule that covers some of the instances. This is called the:
covering approach
64
covering algorithms: constructing rules leads to
Rules, not trees
65
covering algorithms: replicated subtree problem
rules can be symmetric whereas trees must select one attribute to split on first
66
covering algorithms: A decision tree split takes all classes into account in trying to maximize the purity of the split, whereas
rules concentrate on one class at a time, disregarding the other classes
67
covering algorithms: accuracy/confidence
instances predicted correctly, as a proportion of the instances to which the rule applies
68
covering algorithms: coverage also called
support
69
covering algorithms: accuracy also called
confidence
70
covering algorithms: In the event of a tie, we
choose the rule with the greater coverage
71
covering algorithms: coverage
number of instances the rule predicts correctly
72
PRISM can be described as a _
separate-and-conquer algorithm
73
PRISM method
Identify a rule that covers many instances, separate them out, and continue on the remaining instances
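A minimal sketch of the separate-and-conquer idea behind PRISM (cards 63-73): for a chosen class, grow a rule by greedily adding the attribute-value test with the best accuracy p/t (ties broken by greater coverage), separate out the instances it covers, and repeat on the rest. PRISM's refinements are omitted and the toy data is made up.

```python
# Sketch: separate-and-conquer rule learning in the spirit of PRISM.
# A rule is a dict of attribute -> required value; accuracy = p / t where
# t = instances the rule applies to and p = those of the target class.

def applies(rule, row):
    return all(row.get(a) == v for a, v in rule.items())

def stats(rule, rows, classes, target):
    covered = [(r, c) for r, c in zip(rows, classes) if applies(rule, r)]
    p = sum(1 for _, c in covered if c == target)
    return p, len(covered)

def grow_rule(rows, classes, target):
    rule = {}
    while True:
        p, t = stats(rule, rows, classes, target)
        if t == 0 or p == t:                         # rule is perfect (or empty): stop
            return rule
        best, best_key = None, (-1.0, -1)
        for row in rows:
            for attr, value in row.items():
                if attr in rule:
                    continue
                cand = dict(rule, **{attr: value})
                cp, ct = stats(cand, rows, classes, target)
                if ct and (cp / ct, ct) > best_key:  # accuracy first, then greater coverage
                    best, best_key = cand, (cp / ct, ct)
        if best is None:
            return rule
        rule = best

def prism(rows, classes, target):
    rules, remaining = [], list(zip(rows, classes))
    while any(c == target for _, c in remaining):
        rule = grow_rule([r for r, _ in remaining], [c for _, c in remaining], target)
        rules.append(rule)
        remaining = [(r, c) for r, c in remaining if not applies(rule, r)]  # separate out
    return rules

rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
        {"outlook": "overcast", "windy": "no"}, {"outlook": "rainy", "windy": "no"},
        {"outlook": "rainy", "windy": "yes"}]
classes = ["no", "no", "yes", "yes", "no"]
print(prism(rows, classes, target="yes"))
```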
74
the information value for info([2, 3]) is _ bits
−2/5 × log2(2/5) − 3/5 × log2(3/5) ≈ 0.971
75
μ (mu) is the
mean
76
Naïve Bayes: Word frequencies can be accommodated by applying a _
modified form of Naïve Bayes called multinomial Naïve Bayes.
77
Naïve Bayes: a document can be viewed as a _
bag of words
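A minimal sketch of cards 76-77 and 83-85, assuming scikit-learn is available: each document becomes a bag of words (word counts), and multinomial Naïve Bayes multiplies per-word probabilities as if the words were independent. The tiny corpus is made up.

```python
# Sketch: document classification with a bag-of-words representation and
# multinomial Naive Bayes (word frequencies, independence assumption).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the match was a great win for the home team",
        "the team lost the away match",
        "parliament passed the new budget bill",
        "the minister announced the budget vote"]
topics = ["sport", "sport", "politics", "politics"]

vectorizer = CountVectorizer()             # bag of words: word order is ignored
X = vectorizer.fit_transform(docs)         # document-term count matrix
model = MultinomialNB().fit(X, topics)

test = vectorizer.transform(["the budget vote split the team"])
print(model.predict(test))                 # predicted document topic
```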
78
Naïve Bayes: If you suspect an attribute's distribution isn't normal but don't know the actual distribution, there are procedures for "_" that do not assume any particular distribution for the attribute values.
kernel density estimation
79
the information gain of an ID code or day-of-the-week attribute would be the same as
the information at the root
80
the gain ratio modification can overcompensate. One fix is to choose the attribute that _, provided that the information gain for that attribute is at least as great as the _ for all the attributes examined.
maximizes the gain ratio, average information gain
81
information: the amount of information obtained by making a decision. When the numbers of yes's and no's are equal:
information reaches a maximum
82
Choose the best splitting attribute by information gain: gain(temperature) = 0.571 bits, gain(humidity) = 0.971 bits, gain(windy) = 0.020 bits
humidity
83
An important domain for machine learning is document classification, in which each instance _ and the instance's class is the _
represents a document, document topic
84
One of the really nice things about Naïve Bayes is that _.
missing values are no problem
85
This method goes by the name of _ because it's based on Bayes' rule and assumes independence: it is only valid to multiply probabilities when the events are _.
Naïve Bayes, independent
86
The gain ratio is derived by taking into account _ the dataset, disregarding any information about the class
the number and size of daughter nodes into which an attribute splits
87
information is measured in:
units called bits