Yet Another Deck Flashcards
Regression is a data mining task of predicting the value of the target (_) by building a model based on one or more predictors
numerical variable
regression options
* decision tree (frequency table) * multiple linear regression (covariance matrix) * k-nearest neighbor (similarity functions) * artificial neural networks (other) * support vector machine (other). Mnemonic: Dinosaurs made Kites and Vikings. Natural theory of regression.
The ID3 algorithm can be used to construct a decision tree for regression by
replacing information gain with standard deviation reduction
the standard deviation reduction for ID3 regression is based on the
decrease in standard deviation after a dataset is split on an attribute
constructing an ID3 decision tree is all about finding the attribute that returns the
highest standard deviation reduction
when building an ID3 regression decision tree, a branch with standard deviation of more than zero _
requires further splitting
Decision trees. To stop splitting forever we need some termination criteria, for example, when the _ becomes smaller than a certain fraction of the _
standard deviation, standard deviation of the full dataset (e.g. 5%)
Decision trees. To stop splitting forever we need some termination criteria, for example, when too _
few instances remain in the branch (e.g. 3)
Decision trees. To stop splitting forever we need some termination criteria. Then when the number of instances is more than one at a leaf node we _
calculate the average as the final value for the target
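A minimal sketch of the standard deviation reduction (SDR) idea from the cards above: split on an attribute, and prefer the attribute whose split most reduces the target's standard deviation. The toy rows, the attribute name "outlook", and the target name "hours" are made up for illustration.

```python
# Sketch: standard deviation reduction for ID3-style regression trees.
import statistics
from collections import defaultdict

def sdr(rows, attribute, target):
    """SDR = sd(target over all rows) - weighted sd(target within each attribute value)."""
    total_sd = statistics.pstdev(r[target] for r in rows)
    groups = defaultdict(list)
    for r in rows:
        groups[r[attribute]].append(r[target])
    weighted_sd = sum(len(vals) / len(rows) * statistics.pstdev(vals)
                      for vals in groups.values())
    return total_sd - weighted_sd

data = [  # hypothetical "hours played" regression data
    {"outlook": "sunny", "hours": 25}, {"outlook": "sunny", "hours": 30},
    {"outlook": "overcast", "hours": 46}, {"outlook": "rainy", "hours": 45},
    {"outlook": "rainy", "hours": 52},
]
print(sdr(data, "outlook", "hours"))  # split on the attribute with the highest SDR
```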
Logistic regression predicts
the probability of an outcome that can only have two values (i.e. a dichotomy)
The prediction for logistic regression is based on the use of one or several predictors (_ & _)
numerical, categorical
A linear regression is not appropriate for predicting the value of a binary variable for two reasons (1) linear regression will
predict values outside the acceptable range (0 to 1)
A linear regression is not appropriate for predicting the value of a binary variable for two reasons (2) since dichotomous experiments can only have one of two possible values for each experiment, the residuals will not
be normally distributed about the predicted line
Logistic regression produces a _ which is _
logistic curve, limited to the values between 0 and 1
logistic regression is similar to linear regression but the curve is constructed using the natural logarithm of the _ of the target variable, rather than the probability
odds
logistic regression is similar to linear regression but the predictors do not
have to be normally distributed or have equal variance in each group
Just as ordinary least squares regression is the method used to estimate coefficients for the best-fit line in linear regression, logistic regression uses _ to obtain the model coefficients that relate predictors to the target
maximum likelihood estimation (MLE)
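A minimal sketch of the logistic regression cards above: predicted probabilities stay between 0 and 1, the model is linear in the log-odds, and the coefficients are obtained by maximum likelihood. It assumes scikit-learn is available; the single-predictor dataset is made up.

```python
# Sketch: logistic regression as a linear model on the log-odds.
import numpy as np
from sklearn.linear_model import LogisticRegression  # coefficients fitted by MLE

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # one numeric predictor
y = np.array([0, 0, 0, 1, 1, 1])                          # dichotomous target

model = LogisticRegression().fit(X, y)
p = model.predict_proba([[3.5]])[0, 1]   # probability of class 1, always in (0, 1)
log_odds = np.log(p / (1 - p))           # equals intercept + coefficient * 3.5
print(p, log_odds)
```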
An association rule is a pattern that states that when an event occurs, _
another event occurs with a certain probability
Most instance-based learners use:
Euclidean distance
Alternative to Euclidean distance
Manhattan (city-block) distance
It is usual to normalize all attribute values to:
values between 0 and 1
Normalizing Euclidean distance: symbolic attributes (non-numeric). The difference between two different values is usually expressed:
one (mismatch), zero (match)
normalizing Euclidean distance formula - missing attributes are:
taken to be 1 (maximally different)
normalizing Euclidean distance formula: For numeric attributes, the difference between two missing values is also taken as 1. However, if just one value is missing, the distance can be:
taken as the (normalized) size of the other value X, or 1 − X, whichever is larger (i.e. as large as possible)
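A minimal sketch of a distance function following the normalization conventions in the last few cards (numeric values scaled to 0-1, match/mismatch for symbolic attributes, and the missing-value rules). The helper names and the example instances are hypothetical.

```python
# Sketch: normalized Euclidean distance with missing-value handling.
import math

def attr_distance(a, b, numeric):
    if a is None and b is None:
        return 1.0                       # both missing: maximally different
    if a is None or b is None:
        if not numeric:
            return 1.0
        x = b if a is None else a        # one numeric value missing:
        return max(x, 1.0 - x)           # take X or 1 - X, whichever is larger
    if numeric:
        return abs(a - b)                # values already normalized to [0, 1]
    return 0.0 if a == b else 1.0        # symbolic: match = 0, mismatch = 1

def distance(inst1, inst2, numeric_flags):
    return math.sqrt(sum(attr_distance(a, b, n) ** 2
                         for a, b, n in zip(inst1, inst2, numeric_flags)))

# e.g. one normalized numeric attribute and one symbolic attribute
print(distance([0.2, "red"], [None, "blue"], [True, False]))
```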
instance-based learning is slow because you are:
calculating distance from every member of the training set
Nearest neighbors can be found using a:
kD-tree
kD-Tree
Binary tree that divides the input space with a hyperplane at each node. k = the number of attributes.
kD-trees: note that the hyperplanes are not _
decision boundaries
Kd-tree: choosing the median value may yield skinny hyperrectangles. Rather,
use the mean and the point closest to it.
Kd-tree: instance-based learning advantage; can update it incrementally. To do this for a kd-tree, determine which _ contains the new point and find its _.
leaf node, hyperrectangle
Kd-tree: corners of rectangular regions awkward? Use:
hyperspheres, rather than hyperrectangles
It is not necessary for regions to be disjoint: the _ defines k-dimensional hyperspheres (“_”) that cover the data points and _
ball tree, balls, arranges them into a tree
Ball Tree: regions can overlap, but points in the overlap are assigned to _
only one of the overlapping balls
Nearest-neighbor instance-based learning: k-nearest-neighbor: with the k nearest neighbors, determine the class using:
a majority vote
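A minimal sketch combining the last few cards: a kD-tree (here via scipy) speeds up the search for the k nearest neighbors, and a majority vote over their classes gives the prediction. The training points and labels are made up.

```python
# Sketch: k-nearest-neighbor classification using a kD-tree and a majority vote.
import numpy as np
from collections import Counter
from scipy.spatial import KDTree

X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8], [0.85, 0.7]])
y_train = np.array(["no", "no", "yes", "yes", "yes"])

tree = KDTree(X_train)                     # binary tree splitting the input space
_, idx = tree.query([0.75, 0.75], k=3)     # indices of the 3 nearest neighbors
prediction = Counter(y_train[idx]).most_common(1)[0][0]   # majority vote
print(prediction)                          # -> "yes"
```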
kD Trees: worthwhile only when the number of attributes is small:
up to 10
Ball trees, an instance of a more general structure called: _. Sophisticated algorithms can create trees that deal successfully with _ of dimensions.
a metric tree, thousands
kD Trees: Instead of storing all training instances, you can _.
compress into regions
kD trees: discretizing numeric attributes into intervals (and, for nominal attributes, “intervals” consisting of a single point), then determining which intervals the test instance resides in by voting. This is called:
voting feature intervals
Discretizing into intervals, then determining which intervals test instances reside in, by voting.
voting feature intervals
Different ways in which the result of clustering can be expressed: _ (instances belong to one group), _ (instances belong to several groups), _ (instances belong to groups with a probability), or _ (hierarchical).
exclusive, overlapping, probabilistic, hierarchical
The classic clustering technique is called _
k-means
K-means. What does the parameter k stand for?
how many clusters are being sought
k-means - how are clusters chosen?
k cluster centers are chosen at random; instances are assigned using the Euclidean distance metric
k-means - all instances are assigned _ according to the _
to their closest cluster center, Euclidean distance metric
k-means: ‘means’ part: the _ of the instances in each cluster
Centroid or mean, these are the new center values for their respective clusters.
k-means: Iteration continues until the same points are assigned to each cluster in:
consecutive rounds
k-means: choosing the cluster center to be the centroid _ from each of the cluster’s points to its center.
minimizes the total squared distance
k-means: to increase the chance of finding a global minimum people often_ and choose the best final result—the one with the smallest _.
run the algorithm several times with different initial choices, total squared distance
k-means clustering can be dramatically improved by careful choice of the initial cluster centers, often called _.
seeds
k-means: Instead of beginning with an arbitrary set of seeds, can choose an initial seed, then choose a second seed with a probability that is _ its distance from the first.
proportional to the square of
k-means: seeding intelligently using this procedure, called _, improves both speed and accuracy over the original algorithm with random seeds.
k-means++
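A minimal sketch of k-means with k-means++ seeding, assuming scikit-learn is available; the two-blob dataset is made up. Setting n_init reruns the algorithm with different seeds and keeps the result with the smallest total squared distance, as described above.

```python
# Sketch: k-means clustering with k-means++ seeding.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [8, 8], [8.2, 7.9], [7.8, 8.1]])

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # centroids (means) of the two clusters
print(km.inertia_)           # total squared distance to the assigned centers
print(km.labels_)            # cluster assignment for each instance
```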
1R for 1-rule
a one-level decision tree, expressed as a set of rules that all test one attribute
1R: deals with Missing Attributes in simple but effective ways. Missing is treated as
another attribute value
1R: deals with Numeric Attributes in simple but effective ways, by using _
discretization
1R may form an excessively large number of categories (e.g. an ID code attribute). This phenomenon is known as _
overfitting
1R: to avoid overfitting when discretizing a numeric attribute, a minimum limit is imposed on
the number of instances of the majority class in each partition.
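A minimal sketch of 1R for categorical attributes: build one rule per attribute value (predict the majority class for that value), count the errors each attribute's rule set makes, and keep the attribute with the fewest errors. The toy weather-style rows are made up, and the overfitting guard for discretized numeric attributes is omitted.

```python
# Sketch: the 1R (one-rule) algorithm for categorical attributes.
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    best = None
    for attr in attributes:
        # class counts for each value of this attribute
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[attr]][r[target]] += 1
        # rule: each attribute value predicts its majority class
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (attribute, {value: predicted class}, total errors)

data = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
print(one_r(data, ["outlook", "windy"], "play"))
```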
One fix is to choose the attribute that _ the gain ratio, provided that the information gain for that attribute is at least as great as the _ information gain for all the attributes examined.
maximizes, average
Information gain is biased _ even though these won’t work on new data
towards attributes with many values
Use the gain ratio to deal with this problem of information gain: look at the _
split entropy
the split entropy formula relates the size of each subset to _
the size of the entire set
gain ratio formula
Gain(S,A)/SplitEntropy(S,A)
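A minimal sketch of the quantities in the last few cards: info() (entropy in bits), information gain, split entropy, and their ratio, all computed from class counts. The parent counts [9, 5] split into three branches correspond to the familiar 14-instance weather example, but here they are just illustrative numbers.

```python
# Sketch: information gain, split entropy, and gain ratio from class counts.
import math

def info(counts):
    """Entropy in bits of a list of class counts, e.g. info([2, 3])."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    n = sum(parent_counts)
    gain = info(parent_counts) - sum(sum(c) / n * info(c) for c in child_counts)
    split_entropy = info([sum(c) for c in child_counts])  # ignores class labels
    return gain / split_entropy

# a [9, 5] parent split three ways by a hypothetical attribute
print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))
```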
A series of improvements to ID3 culminated in a practical and influential system for decision tree induction called _
C4.5
covering algorithms: seek a way of covering all instances in a class and excluding all others, at each stage identifying a rule that covers some of the instances. This is called the _
covering approach
covering algorithms: constructing rules leads to
Rules, not trees
covering algorithms: replicated subtree problem
rules can be symmetric, whereas trees must select one attribute to split on first
covering algorithms: A decision tree split takes all classes into account in trying to maximize the purity of the split, whereas
rules concentrate on one class at a time, disregarding the other classes
covering algorithms: accuracy/confidence
instances predicted correctly, as a proportion of the instances to which the rule applies
covering algorithms: coverage also called
support
covering algorithms: accuracy also called
confidence
covering algorithms: In the event of a tie, we
choose the rule with the greater coverage
covering algorithms: coverage
number of instances the rule predicts correctly
PRISM can be described as _
a separate-and-conquer algorithm
PRISM method
Identify rule that covers many instances, separate out, continue on remaining instances
the information value for info([2, 3]) is _ bits
−2/5 × log₂(2/5) − 3/5 × log₂(3/5) ≈ 0.971
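A quick numeric check of this card (logarithms are base 2):

```python
# info([2, 3]) = -2/5*log2(2/5) - 3/5*log2(3/5)
import math
print(-2/5 * math.log2(2/5) - 3/5 * math.log2(3/5))   # ~0.971 bits
```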
μ is the
mean
Naïve Bayes: Word frequencies can be accommodated by applying a _
modified form of Naïve Bayes called multinomial Naïve Bayes.
Naïve Bayes: a document can be viewed as a _
bag of words
Naïve Bayes. If you suspect it isn’t normal but don’t know the actual distribution, there are procedures for “_” that do not assume any particular distribution for the attribute values.
kernel density estimation
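A minimal sketch of multinomial Naïve Bayes over a bag-of-words representation, assuming scikit-learn is available; the tiny corpus and its spam/ham labels are made up.

```python
# Sketch: document classification with multinomial Naïve Bayes on word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()              # bag-of-words word frequencies
X = vectorizer.fit_transform(docs)
model = MultinomialNB().fit(X, labels)      # multinomial Naïve Bayes

X_new = vectorizer.transform(["cheap meeting pills"])
print(model.predict(X_new))                 # predicted document class
```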
the information gain of an ID code or day-of-the-week attribute would be the same as
the information at the root
gain ratio modification can overcompensate. One fix is to choose the attribute that _, provided that the information gain for that attribute is at least as great as the _ for all the attributes examined.
maximizes the gain ratio, average information gain
information: amount of information obtained by making a decision. When the numbers of yes’s and no’s are equal:
information reaches a maximum
choose best splitting attribute:
information gain
gain(temperature) = 0.571 bits
gain(humidity) = 0.971 bits
gain(windy) = 0.020 bits
humidity
An important domain for machine learning is document classification, in which each instance _ and the instance’s class is the_
represents a document, document topic
One of the really nice things about Naïve Bayes is that _.
missing values are no problem
This method goes by the name of _ because it is based on Bayes’ rule and assumes independence; it is only valid to multiply probabilities when the events are _.
Naïve Bayes, independent
The gain ratio is derived by taking into account _ the dataset, disregarding any information about the class
the number and size of daughter nodes into which an attribute splits
information is measured in:
units called bits