Yet Another Deck Flashcards
Regression is a data mining task of predicting the value of the target (_) by building a model based on one or more predictors
numerical variable
regression options
* decision tree (frequency table) * multiple linear regression (covariance matrix) * k-nearest neighbor (similarity functions) * artificial neural networks (other) * support vector machine (other). Mnemonic: Dinosaurs made Kites and Vikings. Natural theory of regression.
The ID3 algorithm can be used to construct a decision tree for regression by
replacing information gain with standard deviation reduction
the standard deviation reduction for ID3 regression is based on the
decrease in standard deviation after a dataset is split on an attribute
constructing an ID3 decision tree is all about finding the attribute that returns the
highest standard deviation reduction
when building an ID3 regression decision tree, a branch with standard deviation of more than zero _
requires further splitting
Decision trees. To stop splitting forever we need some termination criteria, for example, when the _ becomes smaller than a certain fraction of the _
standard deviation, standard deviation of the full dataset (e.g. 5%)
Decision trees. To stop splitting forever we need some termination criteria, for example, when too _
few instances remain in the branch (e.g. 3)
Decision trees. To stop splitting forever we need some termination criteria. Then when the number of instances is more than one at a leaf node we _
calculate the average as the final value for the target
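A minimal sketch of the standard deviation reduction (SDR) idea from the cards above: split on an attribute, and prefer the attribute whose split most reduces the target's standard deviation. The toy rows, the attribute name "outlook", and the target name "hours" are made up for illustration.

```python
# Sketch: standard deviation reduction for ID3-style regression trees.
import statistics
from collections import defaultdict

def sdr(rows, attribute, target):
    """SDR = sd(target over all rows) - weighted sd(target within each attribute value)."""
    total_sd = statistics.pstdev(r[target] for r in rows)
    groups = defaultdict(list)
    for r in rows:
        groups[r[attribute]].append(r[target])
    weighted_sd = sum(len(vals) / len(rows) * statistics.pstdev(vals)
                      for vals in groups.values())
    return total_sd - weighted_sd

data = [  # hypothetical "hours played" regression data
    {"outlook": "sunny", "hours": 25}, {"outlook": "sunny", "hours": 30},
    {"outlook": "overcast", "hours": 46}, {"outlook": "rainy", "hours": 45},
    {"outlook": "rainy", "hours": 52},
]
print(sdr(data, "outlook", "hours"))  # split on the attribute with the highest SDR
```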
Logistic regression predicts
the probability of an outcome that can only have two values (i.e. a dichotomy)
The prediction for logistic regression is based on the use of one or several predictors (_ & _)
numerical, categorical
A linear regression is not appropriate for predicting the value of a binary variable for two reasons (1) linear regression will
predict values outside the acceptable range (0 to 1)
A linear regression is not appropriate for predicting the value of a binary variable for two reasons (2) since dichotomous experiments can only have one of two possible values for each experiment, the residuals will not
be normally distributed about the predicted line
Logistic regression produces a _ which is _
logistic curve, limited to the values between 0 and 1
logistic regression is similar to linear regression but the curve is constructed using the natural logarithm of the _ of the target variable, rather than the probability
odds
logistic regression is similar to linear regression but the predictors do not
have to be normally distributed or have equal variance in each group
Just as ordinary least squares regression is the method used to estimate coefficients for the best-fit line in linear regression, logistic regression uses _ to obtain the model coefficients that relate predictors to the target
maximum likelihood estimation (MLE)
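A minimal sketch of the logistic regression cards above: predicted probabilities stay between 0 and 1, the model is linear in the log-odds, and the coefficients are obtained by maximum likelihood. It assumes scikit-learn is available; the single-predictor dataset is made up.

```python
# Sketch: logistic regression as a linear model on the log-odds.
import numpy as np
from sklearn.linear_model import LogisticRegression  # coefficients fitted by MLE

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # one numeric predictor
y = np.array([0, 0, 0, 1, 1, 1])                          # dichotomous target

model = LogisticRegression().fit(X, y)
p = model.predict_proba([[3.5]])[0, 1]   # probability of class 1, always in (0, 1)
log_odds = np.log(p / (1 - p))           # equals intercept + coefficient * 3.5
print(p, log_odds)
```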
An association rule is a pattern that states that when an event occurs, _
another event occurs with a certain probability
Most instance-based learners use:
Euclidean distance
Alternative to Euclidean distance
Manhattan (city-block) distance
It is usual to normalize all attribute values to:
values between 0 and 1
Normalizing Euclidean distance: symbolic attributes (non-numeric). The difference between two different values is usually expressed:
one (mismatch), zero (match)
normalizing Euclidean distance formula - missing attributes are:
taken to be 1 (maximally different)
normalizing Euclidean distance formula: For numeric attributes, the difference between two missing values is also taken as 1. However, if just one value is missing, the distance can be:
taken as the (normalized) size of the other value X, or 1 − X, whichever is larger (i.e. as large as possible)
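A minimal sketch of a distance function following the normalization conventions in the last few cards (numeric values scaled to 0-1, match/mismatch for symbolic attributes, and the missing-value rules). The helper names and the example instances are hypothetical.

```python
# Sketch: normalized Euclidean distance with missing-value handling.
import math

def attr_distance(a, b, numeric):
    if a is None and b is None:
        return 1.0                       # both missing: maximally different
    if a is None or b is None:
        if not numeric:
            return 1.0
        x = b if a is None else a        # one numeric value missing:
        return max(x, 1.0 - x)           # take X or 1 - X, whichever is larger
    if numeric:
        return abs(a - b)                # values already normalized to [0, 1]
    return 0.0 if a == b else 1.0        # symbolic: match = 0, mismatch = 1

def distance(inst1, inst2, numeric_flags):
    return math.sqrt(sum(attr_distance(a, b, n) ** 2
                         for a, b, n in zip(inst1, inst2, numeric_flags)))

# e.g. one normalized numeric attribute and one symbolic attribute
print(distance([0.2, "red"], [None, "blue"], [True, False]))
```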
instance-based learning is slow because you are:
calculating distance from every member of the training set
Nearest neighbors can be found using a:
kD-tree
kD-Tree
Binary tree that divides the input space with a hyperplane at each node. k = the number of attributes.
kD-trees: note that the hyperplanes are not _
decision boundaries
Kd-tree: choosing the median value may yield skinny hyperrectangles. Rather,
use the mean and the point closest to it.
Kd-tree: instance-based learning advantage; can update it incrementally. To do this for a kd-tree, determine which _ contains the new point and find its _.
leaf node, hyperrectangle
Kd-tree: corners of rectangular regions awkward? Use:
hyperspheres, rather than hyperrectangles
It is not necessary for regions to be disjoint: the _ defines k-dimensional hyperspheres (“_”) that cover the data points and _
ball tree, balls, arranges them into a tree
Ball Tree: regions can overlap, but points in the overlap are assigned to _
only one of the overlapping balls
Nearest-neighbor instance-based learning: k-nearest-neighbor: with the k nearest neighbors, determine the class using:
a majority vote
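A minimal sketch combining the last few cards: a kD-tree (here via scipy) speeds up the search for the k nearest neighbors, and a majority vote over their classes gives the prediction. The training points and labels are made up.

```python
# Sketch: k-nearest-neighbor classification using a kD-tree and a majority vote.
import numpy as np
from collections import Counter
from scipy.spatial import KDTree

X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8], [0.85, 0.7]])
y_train = np.array(["no", "no", "yes", "yes", "yes"])

tree = KDTree(X_train)                     # binary tree splitting the input space
_, idx = tree.query([0.75, 0.75], k=3)     # indices of the 3 nearest neighbors
prediction = Counter(y_train[idx]).most_common(1)[0][0]   # majority vote
print(prediction)                          # -> "yes"
```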
kD Trees: worthwhile only when the number of attributes is small:
up to 10
Ball trees, an instance of a more general structure called: _. Sophisticated algorithms can create trees that deal successfully with _ of dimensions.
a metric tree, thousands
kD Trees: Instead of storing all training instances, you can _.
compress into regions
kD trees: discretizing numeric attributes into intervals (and, for nominal attributes, “intervals” consisting of a single point), then determining which intervals the test instance resides in by voting. This is called:
voting feature intervals
Discretizing into intervals, then determining which intervals test instances reside in, by voting.
voting feature intervals
Different ways in which the result of clustering can be expressed: _ (instances belong to one group), _ (instances belong to several groups), _ (instances belong to groups with a probability), or _ (hierarchical).
exclusive, overlapping, probabilistic, hierarchical
The classic clustering technique is called _
k-means
K-means. What does the parameter k stand for?
how many clusters are being sought
k-means - how are clusters chosen?
k cluster centers are chosen at random; instances are assigned using the Euclidean distance metric
k-means - all instances are assigned _ according to the _
to their closest cluster center, Euclidean distance metric
k-means: ‘means’ part: the _ of the instances in each cluster
Centroid or mean, these are the new center values for their respective clusters.
k-means: Iteration continues until the same points are assigned to each cluster in:
consecutive rounds
k-means: choosing the cluster center to be the centroid _ from each of the cluster’s points to its center.
minimizes the total squared distance
k-means: to increase the chance of finding a global minimum people often_ and choose the best final result—the one with the smallest _.
run the algorithm several times with different initial choices, total squared distance
k-means clustering can be dramatically improved by careful choice of the initial cluster centers, often called _.
seeds
k-means: Instead of beginning with an arbitrary set of seeds, can choose an initial seed, then choose a second seed with a probability that is _ its distance from the first.
proportional to the square of
k-means: seeding intelligently using this procedure, called _, improves both speed and accuracy over the original algorithm with random seeds.
k-means++
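A minimal sketch of k-means with k-means++ seeding, assuming scikit-learn is available; the two-blob dataset is made up. Setting n_init reruns the algorithm with different seeds and keeps the result with the smallest total squared distance, as described above.

```python
# Sketch: k-means clustering with k-means++ seeding.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [8, 8], [8.2, 7.9], [7.8, 8.1]])

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # centroids (means) of the two clusters
print(km.inertia_)           # total squared distance to the assigned centers
print(km.labels_)            # cluster assignment for each instance
```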
1R for 1-rule
a one-level decision tree, expressed as a set of rules that all test one attribute
1R: deals with Missing Attributes in simple but effective ways. Missing is treated as
another attribute value
1R: deals with Numeric Attributes in simple but effective ways, by using _
discretization
1R may form an excessively large number of categories (e.g. an ID code attribute). This phenomenon is known as _
overfitting
1R: to avoid overfitting when discretizing a numeric attribute, a minimum limit is imposed on
the number of instances of the majority class in each partition.
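A minimal sketch of 1R for categorical attributes: build one rule per attribute value (predict the majority class for that value), count the errors each attribute's rule set makes, and keep the attribute with the fewest errors. The toy weather-style rows are made up, and the overfitting guard for discretized numeric attributes is omitted.

```python
# Sketch: the 1R (one-rule) algorithm for categorical attributes.
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    best = None
    for attr in attributes:
        # class counts for each value of this attribute
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[attr]][r[target]] += 1
        # rule: each attribute value predicts its majority class
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (attribute, {value: predicted class}, total errors)

data = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
print(one_r(data, ["outlook", "windy"], "play"))
```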
One fix is to choose the attribute that _ the gain ratio, provided that the information gain for that attribute is at least as great as the _ information gain for all the attributes examined.
maximizes, average
Information gain is biased _ even though these won’t work on new data
towards attributes with many values
Use the gain ratio to deal with this problem of information gain: look at the _
split entropy
the split entropy formula relates the size of each subset to _
the size of the entire set
gain ratio formula
Gain(S,A)/SplitEntropy(S,A)
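A minimal sketch of the quantities in the last few cards: info() (entropy in bits), information gain, split entropy, and their ratio, all computed from class counts. The parent counts [9, 5] split into three branches correspond to the familiar 14-instance weather example, but here they are just illustrative numbers.

```python
# Sketch: information gain, split entropy, and gain ratio from class counts.
import math

def info(counts):
    """Entropy in bits of a list of class counts, e.g. info([2, 3])."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    n = sum(parent_counts)
    gain = info(parent_counts) - sum(sum(c) / n * info(c) for c in child_counts)
    split_entropy = info([sum(c) for c in child_counts])  # ignores class labels
    return gain / split_entropy

# a [9, 5] parent split three ways by a hypothetical attribute
print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))
```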
A series of improvements to ID3 culminated in a practical and influential system for decision tree induction called _
C4.5
covering algorithms: seek a way of covering all instances in a class and excluding all others, at each stage identifying a rule that covers some of the instances. This is called the _
covering approach
covering algorithms: constructing rules leads to
Rules, not trees
covering algorithms: replicated subtree problem
rules can be symmetric, whereas trees must select one attribute to split on first
covering algorithms: A decision tree split takes all classes into account in trying to maximize the purity of the split, whereas
rules concentrate on one class at a time, disregarding the other classes
covering algorithms: accuracy/confidence
instances predicted correctly, as a proportion of the instances to which the rule applies
covering algorithms: coverage also called
support
covering algorithms: accuracy also called
confidence
covering algorithms: In the event of a tie, we
choose the rule with the greater coverage
covering algorithms: coverage
number of instances the rule predicts correctly
PRISM can be described as _
a separate-and-conquer algorithm
PRISM method
Identify rule that covers many instances, separate out, continue on remaining instances
the information value for info([2, 3]) is _ bits
−2/5 × log₂(2/5) − 3/5 × log₂(3/5) ≈ 0.971
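A quick numeric check of this card (logarithms are base 2):

```python
# info([2, 3]) = -2/5*log2(2/5) - 3/5*log2(3/5)
import math
print(-2/5 * math.log2(2/5) - 3/5 * math.log2(3/5))   # ~0.971 bits
```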
μ is the
mean
Naïve Bayes: Word frequencies can be accommodated by applying a _
modified form of Naïve Bayes called multinomial Naïve Bayes.
Naïve Bayes: a document can be viewed as a _
bag of words
Naïve Bayes. If you suspect it isn’t normal but don’t know the actual distribution, there are procedures for “_” that do not assume any particular distribution for the attribute values.
kernel density estimation
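A minimal sketch of multinomial Naïve Bayes over a bag-of-words representation, assuming scikit-learn is available; the tiny corpus and its spam/ham labels are made up.

```python
# Sketch: document classification with multinomial Naïve Bayes on word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()              # bag-of-words word frequencies
X = vectorizer.fit_transform(docs)
model = MultinomialNB().fit(X, labels)      # multinomial Naïve Bayes

X_new = vectorizer.transform(["cheap meeting pills"])
print(model.predict(X_new))                 # predicted document class
```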
the information gain of an ID code or day-of-the-week attribute would be the same as
the information at the root
gain ratio modification can overcompensate. One fix is to choose the attribute that _, provided that the information gain for that attribute is at least as great as the _ for all the attributes examined.
maximizes the gain ratio, average information gain
information: amount of information obtained by making a decision. When the numbers of yes’s and no’s are equal:
information reaches a maximum
choose best splitting attribute:
information gain
gain(temperature) = 0.571 bits
gain(humidity) = 0.971 bits
gain(windy) = 0.020 bits
humidity
An important domain for machine learning is document classification, in which each instance _ and the instance’s class is the_
represents a document, document topic
One of the really nice things about Naïve Bayes is that _.
missing values are no problem
This method goes by the name of _ because it is based on Bayes’ rule and assumes independence; it is only valid to multiply probabilities when the events are _.
Naïve Bayes, independent
The gain ratio is derived by taking into account _ the dataset, disregarding any information about the class
the number and size of daughter nodes into which an attribute splits
information is measured in:
units called bits