Exam Flashcards
Knowledge Discovery
about understanding a domain with interpretable models
Prediction
stream in which the methods you use do not matter, only the quality of the predictions
Black box setting
you don’t care about how the model works
Classification
- predicts future cases of a binary class.
- models the dependency of the target on the other attributes.
- Sometimes a black-box classifier.
- some attributes may not appear because of overshadowing in decision trees.
- Supervised learning
Regression
tries to predict a numeric target variable
Clustering
divides a dataset into groups of similar cases
Frequent Patterns/Association
finds dependencies between variables
Support Vector Machine
a single line (linear boundary) through the dataset that tries to find a good division between positive and negative examples
Neural Network
same idea as the Support Vector Machine, but separates the classes with a curve instead of a straight line
iid
independent and identically distributed
Nominal
categorical/discrete, can only test for equality
Numeric
can test for inequalities and can use arithmetic or distance measures
Ordinal
can test for inequalities as well, but cannot use arithmetic or distance measures
Binary
Nominal variable with only two values
Entropy
measure of the amount of information/chaos; highest when the probability is distributed equally over the values (each value has probability 1/m); unsupervised
Entropy formula
–p lg(p) – (1–p) lg(1–p), where p is the probability of a value; general form: Σ –p_i lg(p_i)
Max Entropy formula
–m · (1/m) · lg(1/m) = –lg(1/m) = lg(m)
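A minimal Python sketch of the two formulas above (the function name and the example distributions are illustrative, not from the course material):

```python
from math import log2

def entropy(probabilities):
    """H = -sum(p_i * lg(p_i)), ignoring zero probabilities (0 * lg 0 = 0)."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 1.0  (binary, uniform: the maximum for m = 2)
print(entropy([0.9, 0.1]))    # ~0.47 (skewed distribution, less chaos)
m = 8
print(entropy([1 / m] * m))   # 3.0 = lg(8): the maximum lg(m) for m values
```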
Cumulative distribution function(CDF)
accumulates the distribution of a dataset: for each value x it gives the fraction of the data that is at most x
Probability density function(PDF)
the derivative of the CDF: the relative density of points at each value. Density is not a probability; the peak is where most values lie and thus where the density is highest
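A small illustrative sketch of the CDF/density relationship on made-up sample values; the helper name empirical_cdf is an assumption, not a standard function:

```python
# The empirical CDF at x is the fraction of data points <= x; the PDF is its
# derivative, so regions where the CDF rises steeply have high density.
def empirical_cdf(data):
    xs = sorted(data)
    n = len(xs)
    return lambda x: sum(1 for v in xs if v <= x) / n

sample = [1.0, 1.1, 1.2, 1.3, 5.0]
cdf = empirical_cdf(sample)
print(cdf(1.25))   # 0.6 -> steep rise around 1.0-1.3, where density is high
print(cdf(4.0))    # 0.8 -> flat between 1.3 and 5.0, where density is low
```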
Histograms
estimate density in a discrete way by defining cut points and counting occurrences per bin; unsupervised method
Histograms Equal width
the bins are cut in equal size intervals
Histograms Equal height
the bins are cut so that every bin contains about the same number of data points
Kernel(Gaussian) Density Estimation
estimates the density of the population from a sample
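A hedged sketch of the three density estimators above, assuming NumPy and SciPy are available; the sample data is randomly generated purely for illustration:

```python
import numpy as np
from scipy.stats import gaussian_kde

data = np.random.normal(loc=10, scale=2, size=1000)

# Equal-width histogram: bins span equal-size intervals.
counts_w, edges_w = np.histogram(data, bins=10)

# Equal-height (equal-frequency) histogram: cut points at quantiles,
# so each bin holds roughly the same number of points.
edges_h = np.quantile(data, np.linspace(0, 1, 11))
counts_h, _ = np.histogram(data, bins=edges_h)

# Gaussian kernel density estimation: a smooth estimate of the population density.
kde = gaussian_kde(data)
print(kde.evaluate([10.0]))   # density is highest near the mean of the sample
```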
Downside Entropy
the entropy concept does not apply well to numeric data, only to nominal data
Confusion Matrix/Contingency table
describes the frequencies of the four combinations of subgroup and target: positive within the subgroup, negative within the subgroup, positive outside the subgroup, and negative outside the subgroup
Confusion matrix distribution
from the joint distribution you implicitly also get the univariate (marginal) distributions
methods of quantifying information between attributes
joint entropy, mutual information, information gain
joint distributions dependency
if the counts are concentrated along the diagonal of the confusion matrix, then one variable is (somewhat) dependent on the other. The relationship is fully determined when all probability mass lies on the diagonal and none off it
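A sketch of joint entropy and mutual information computed from a made-up 2x2 contingency table, assuming NumPy; counts concentrated on the diagonal indicate dependency:

```python
import numpy as np

table = np.array([[40, 10],    # rows: subgroup yes/no, columns: target +/-
                  [ 5, 45]])   # counts concentrated on the diagonal -> dependency

def H(probs):
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

joint = table / table.sum()    # joint distribution
p_row = joint.sum(axis=1)      # marginal (univariate) distributions
p_col = joint.sum(axis=0)

joint_entropy = H(joint.ravel())
mutual_information = H(p_row) + H(p_col) - joint_entropy   # I(X;Y) = H(X)+H(Y)-H(X,Y)
print(joint_entropy, mutual_information)
```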
Density problem visualisation
with scatter plots, you cannot see the density because data points overlap at the same spot
Information gain
in decision trees, it is the entropy before the split minus the (weighted) entropy after the split; can be any non-negative number for an attribute; supervised
Uncertainty
a class attribute contains uncertainty over values. this is captured by the entropy of the target
top-down tree construction
all training examples are at the root at the start of building. partition the examples recursively by choosing one attribute each time
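A very small sketch of top-down construction for nominal attributes only, using plain information gain to pick the attribute at each node; the data layout (rows as dicts), the helper names, and the stopping conditions are assumptions, and real methods such as ID3/C4.5 add pruning, numeric splits, and the gain ratio:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    after = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        after += len(subset) / n * entropy(subset)
    return entropy(labels) - after

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1 or not attributes:          # pure node or nothing left to split on
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != best])
    return tree

# usage: tree = build_tree(rows, labels, ["outlook", "windy"])  # rows: list of dicts
```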
bottom-up tree pruning
- removes subtrees or branches in a bottom-up manner.
- goal is to improve the estimated accuracy on new cases
What does the goodness function do?
- at each node, ATTRIBUTES are SELECTED based on the SEPARATION of the CLASSES of the training examples.
- Examples: information gain (ID3/C4.5), information gain ratio, Gini index
Heuristic attribute selection
chooses the attribute that produces the purest nodes
How does Information gain do attribute selection?
- uses entropy of the class attribute.
- Information gain increases with the average purity of the subsets that an attribute produces.
- chooses the attribute that results in the highest information gain
Entropy pure and mixed nodes
H(0) = 0 and H(1) = 0 gives pure nodes with a skewed distribution, H(0.5) = 1 gives a mixed node with a uniform distribution
Information gain formula
Gain = H(p) – Σ (n_i/n)·H(p_i): the information before the split minus the weighted information after the split; the maximum information gain is lg(#values)
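A worked instance of the formula, using a small example split (14 cases, 9 positive and 5 negative, split into subsets of sizes 5, 4 and 5):

```python
# Parent node: 9 positive / 5 negative; subsets: 5 (2+/3-), 4 (4+/0-), 5 (3+/2-).
from math import log2

def H(p):                         # binary entropy of a positive-class fraction p
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

before = H(9 / 14)
after = 5/14 * H(2/5) + 4/14 * H(4/4) + 5/14 * H(3/5)
print(before - after)             # ~0.247 bits of information gain
```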
Information gain bias
biased towards choosing attributes with a large number of values, this may result in overfitting
Overfitting
selection of an attribute that is non-optimal for prediction
Gain ratio + when large/small?
- Is a modification of the information gain that reduces its bias on high-branching attributes.
- large when data is divided into few, even groups.
- small when each example belongs to a separate branch
Intrinsic information
entropy of distribution of instances into branches
gain ratio formula
Gain(S, A) / Intrinsic Info (S, A)
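Continuing the example 5/4/5 split from the information-gain sketch above: a short check that intrinsic information is the entropy of the branch sizes, and that the gain ratio is the gain divided by it:

```python
from math import log2

sizes, n, gain = [5, 4, 5], 14, 0.247
intrinsic_info = -sum(s / n * log2(s / n) for s in sizes)   # ~1.577
print(gain / intrinsic_info)                                # gain ratio ~0.157
```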
Intrinsic information assumption
the importance of the attribute decreases as the intrinsic information gets larger in the gain ratio formula
Standard fix
an additional test that prevents the gain ratio from overcompensating, i.e. from choosing an attribute just because its intrinsic information is low
Splitting numeric attributes
- standard method: binary splits (e.g. temp < 45); evaluate the info gain (or another measure) for every possible split point of the attribute.
- the info gain of the best split point is the info gain of the attribute.
- numeric attributes are computationally more demanding.
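A sketch of the binary-split evaluation described above for a single numeric attribute; the values, labels and helper names are made up for illustration:

```python
# Every midpoint between sorted distinct values is a candidate split point;
# the one with the highest information gain wins.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    xs = sorted(set(values))
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2                                   # candidate threshold
        left = [lab for v, lab in pairs if v < t]
        right = [lab for v, lab in pairs if v >= t]
        gain = entropy(labels) - (len(left) / len(labels) * entropy(left)
                                  + len(right) / len(labels) * entropy(right))
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

print(best_split([64, 65, 68, 69, 70, 71], ['y', 'n', 'y', 'y', 'y', 'n']))
```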
If an attribute A has high info gain at the root of the tree, does it always appear in a decision tree?
No.
If it is highly correlated with another attribute B, and gain(B) > gain(A), then B will appear in the tree, and further splitting on A will not be useful.
Can an attribute appear more than once in a decision tree?
Yes
If a test is not at the root of the tree, it can appear in different branches
Can an attribute appear more than once on a single path in the tree (from root to leaf)?
Yes
Numeric attributes can appear more than once, but only with very different numeric conditions
If an attribute A has gain (A)=0 (at the root), can it ever appear in a decision tree?
Yes.
- All attributes may have zero info gain
- info gain often changes when splitting on another attribute, think of the XOR-problem
Is a tree with only pure leaves always the best classifier you can have?
No.
This tree is the best classifier on the training set, but possibly not on new and unseen data. Because of overfitting, the tree may not generalize very well.
Goal Pruning
to prevent overfitting to noise in the data, strategies: pre-pruning and post-pruning
pre-pruning
stop growing a branch when information becomes unreliable
post-pruning
take a fully-grown decision tree and discard unreliable parts
chi-squared test
- most popular statistical significance test for pre-pruning.
- tests for a statistically significant dependency between an attribute and the class in a node
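A hedged example of such a test using SciPy's chi-square test on a contingency table of made-up counts:

```python
from scipy.stats import chi2_contingency

table = [[30, 10],    # attribute value A: 30 positive, 10 negative
         [12, 28]]    # attribute value B: 12 positive, 28 negative
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value < 0.05)   # True here: the dependency is statistically significant
```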
ID3
- method that uses the chi-squared test in addition to information gain
- only selects statistically significant attributes
XOR/Parity-problem
- classic example where early stopping fails
- pre-pruning stops the growth process prematurely because no single attribute is informative on its own
subtree replacement
post-pruning bottom-up approach. considers replacing a tree only after considering all of its subtrees
C4.5
decision tree method whose post-pruning derives a confidence interval for the error rate from the training data
Observed error rate
error rate from the training set
Actual error rate
error rate on new, unseen cases (e.g. a test set), which is unknown at training time
Confidence interval
used to calculate whether the actual error rate will be different from the observed error rate
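A sketch, under the assumption that the textbook normal-approximation upper bound is used, of how an observed training error can be turned into a pessimistic estimate; the exact procedure in C4.5 may differ in its details:

```python
# f is the observed error rate on N cases; z is the normal deviate for the
# chosen confidence level (roughly 0.69 for a 25% confidence level).
from math import sqrt

def pessimistic_error(f, N, z=0.69):
    return (f + z * z / (2 * N)
            + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)

print(pessimistic_error(f=2/6, N=6))   # noticeably larger than the observed ~0.33
```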
Covering approach
for each class in turn, find the rule set that covers all examples in it (excluding examples not in the class)
PRISM
algorithm based on the covering approach; adds the test that maximizes a rule's accuracy.
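A minimal PRISM-style sketch for a single class: greedily add the attribute-value test with the highest rule accuracy, then remove the covered examples and start a new rule. The data layout and helper names are assumptions, and real PRISM also breaks accuracy ties by coverage:

```python
def covers(rule, example):
    return all(example[a] == v for a, v in rule)

def learn_rules_for_class(examples, labels, target_class):
    rules, remaining = [], [(e, l) for e, l in zip(examples, labels)]
    while any(l == target_class for _, l in remaining):
        rule, candidates = [], list(remaining)
        # Grow the rule until it only covers examples of the target class.
        while any(l != target_class for e, l in candidates if covers(rule, e)):
            best, best_acc = None, -1.0
            for attr in examples[0]:
                if any(a == attr for a, _ in rule):
                    continue
                for value in set(e[attr] for e, _ in candidates):
                    test = rule + [(attr, value)]
                    covered = [l for e, l in candidates if covers(test, e)]
                    if covered:
                        acc = covered.count(target_class) / len(covered)
                        if acc > best_acc:
                            best, best_acc = (attr, value), acc
            if best is None:                 # no attribute left to add
                break
            rule.append(best)
        rules.append(list(rule))
        # Remove the examples covered by the finished rule.
        remaining = [(e, l) for e, l in remaining if not covers(rule, e)]
    return rules
```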