Yet Another Deck Flashcards
Regression is a data mining task of predicting the value of the target (_) by building a model based on one or more predictors
numerical variable
regression options
* decision tree (frequency table)
* multiple linear regression (covariance matrix)
* k-nearest neighbor (similarity functions)
* artificial neural networks (other)
* support vector machine (other)
Mnemonic: Dinosaurs Made Kites and Vikings (a natural theory of regression).
The ID3 algorithm can be used to construct a decision tree for regression by
replacing information gain with standard deviation reduction
the standard deviation reduction for ID3 regression is based on the
decrease in standard deviation after a dataset is split on an attribute
constructing an ID3 decision tree is all about finding the attribute that returns the
highest standard deviation reduction
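The standard deviation reduction above can be sketched in a few lines. This is a minimal illustration, not a full ID3 implementation: the function names and the tiny example data are my own, and it assumes a categorical splitting attribute.

```python
import math

def stdev(values):
    # Population standard deviation of a list of numeric target values.
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sd_reduction(targets, attribute_values):
    # SDR = S(full dataset) - sum over attribute values of
    #       (subset size / total size) * S(subset).
    full_sd = stdev(targets)
    n = len(targets)
    weighted = 0.0
    for value in set(attribute_values):
        subset = [t for t, a in zip(targets, attribute_values) if a == value]
        weighted += (len(subset) / n) * stdev(subset)
    return full_sd - weighted
```

Splitting on an attribute that perfectly separates the targets drives every subset's standard deviation to zero, so the reduction equals the full dataset's standard deviation; a split that leaves each subset as spread out as the whole gives a reduction of zero.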
when building an ID3 regression decision tree, a branch with a standard deviation of more than zero _
requires further splitting
Decision trees. To stop splitting forever we need some termination criteria, for example, when the _ becomes smaller than a certain fraction of the _
standard deviation, standard deviation of the full dataset (e.g. 5%)
Decision trees. To stop splitting forever we need some termination criteria, for example, when too _
few instances remain in the branch (e.g. 3)
Decision trees. To stop splitting forever we need some termination criteria. Then when the number of instances is more than one at a leaf node we _
calculate the average as the final value for the target
Logistic regression predicts
the probability of an outcome that can only have two values (i.e. a dichotomy)
The prediction for logistic regression is based on the use of one or more predictors (_ & _)
numerical, categorical
A linear regression is not appropriate for predicting the value of a binary variable for two reasons (1) linear regression will
predict values outside the acceptable range (0 to 1)
A linear regression is not appropriate for predicting the value of a binary variable for two reasons (2) since a dichotomous experiment can only have one of two possible values, the residuals will not
be normally distributed about the predicted line
Logistic regression produces a _ which is _
logistic curve, limited to the values between 0 and 1
logistic regression is similar to linear regression but the curve is constructed using the natural logarithm of the _ of the target variable, rather than the probability itself
odds
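The two ideas on the cards above (a curve limited to 0-1, built from the log of the odds) can be shown directly. A minimal sketch; the function names are my own:

```python
import math

def sigmoid(z):
    # Logistic function: maps any real number z to a probability in (0, 1),
    # which is why the logistic curve can never leave the 0-1 range.
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    # Logit: natural log of the odds p / (1 - p); the inverse of the sigmoid.
    # Logistic regression is linear in this quantity, not in p itself.
    return math.log(p / (1.0 - p))
```

For example, `sigmoid(0)` is exactly 0.5, and `log_odds(sigmoid(z))` recovers `z`, showing the two functions are inverses.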
logistic regression is similar to linear regression but the predictors do not
have to be normally distributed or have equal variance in each group
Just as ordinary least squares regression is the method used to estimate coefficients for the best-fit line in linear regression, logistic regression uses _ to obtain the model coefficients that relate predictors to the target
maximum likelihood estimation (MLE)
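MLE for a one-predictor logistic model can be sketched with plain gradient ascent on the log-likelihood. This is an illustrative toy, not how production libraries fit the model (they use second-order or quasi-Newton solvers); the function name, learning rate, and step count are my own choices:

```python
import math

def fit_logistic(xs, ys, lr=0.05, steps=2000):
    # Maximum likelihood estimation by gradient ascent.
    # Model: P(y = 1 | x) = sigmoid(b0 + b1 * x).
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p        # d(log-likelihood)/d b0
            g1 += (y - p) * x  # d(log-likelihood)/d b1
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1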
An association rule is a pattern that states that when an event occurs, _
another event occurs with a certain probability
Most instance-based learners use:
Euclidean distance
Alternative to Euclidean distance
Manhattan, City-Block
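The two distance measures on these cards differ only in how coordinate differences are combined. A minimal sketch, with function names of my own choosing:

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences,
    # like walking along a street grid instead of cutting across.
    return sum(abs(x - y) for x, y in zip(a, b))
```

On the points (0, 0) and (3, 4) the Euclidean distance is 5 while the Manhattan distance is 7, so Manhattan is never smaller than Euclidean.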
It is usual to normalize all attribute values to:
lie between 0 and 1
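The usual way to rescale an attribute into the 0-1 range is min-max normalization. A minimal sketch (the function name is mine; the constant-attribute fallback to 0.0 is one common convention, not the only one):

```python
def min_max_normalize(values):
    # Rescale a numeric attribute so every value lies in [0, 1]:
    # (v - min) / (max - min).
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: no spread, so map everything to 0.0 (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Without this step, an attribute measured in large units (e.g. income in dollars) would dominate the Euclidean distance over one in small units (e.g. age in years).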
Normalizing Euclidean distance: symbolic attributes (non-numeric). The difference between two values is usually expressed as:
one (mismatch), zero (match)
normalizing Euclidean distance formula - missing attributes are:
taken to be 1 (maximally different)
normalizing Euclidean distance formula: For numeric attributes, the difference between two missing values is also taken as 1. However, if just one value is missing, the difference can be:
taken as the (normalized) size of the other value X or 1 - X, whichever is larger (as large as possible)
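The per-attribute rules on the last few cards (symbolic mismatch = 1, both missing = 1, one missing = the larger of X and 1 - X) can be collected into one difference function. A sketch under my own assumptions: missing values are represented as `None`, numeric values are already normalized to [0, 1], and the function names are mine:

```python
import math

MISSING = None  # assumed sentinel for a missing attribute value

def attr_diff(x, y, symbolic=False):
    # Per-attribute difference for a normalized Euclidean distance.
    if x is MISSING and y is MISSING:
        return 1.0                     # both missing: maximally different
    if symbolic:
        if x is MISSING or y is MISSING:
            return 1.0
        return 0.0 if x == y else 1.0  # match = 0, mismatch = 1
    if x is MISSING or y is MISSING:
        v = y if x is MISSING else x   # the one value that is present
        return max(v, 1.0 - v)         # as large as possible
    return abs(x - y)                  # both present, already in [0, 1]

def distance(a, b, symbolic_flags):
    # Euclidean combination of the per-attribute differences.
    return math.sqrt(sum(attr_diff(x, y, s) ** 2
                         for x, y, s in zip(a, b, symbolic_flags)))
```

For instance, if one numeric value is missing and the other is 0.3, the difference is max(0.3, 0.7) = 0.7, the pessimistic "as large as possible" choice.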