Exam Flashcards
supervised learning
- how does the output depend on the input?
- can we predict the output, given the input?
- classification, regression
unsupervised learning
- no target
- clustering and frequent pattern mining (find dependencies between variables)
univariate distribution
which values occur for an attribute and how often; the data is a sample of a population
entropy H(A)
how informative is an attribute (about the value of another attribute)?
a measure of the amount of information/chaos in a distribution
distribution of a binary attribute
H(A) = –p*lg(p) – (1–p)*lg(1–p)
H(A) is maximal (1 bit) when p = 1/2 = 1/m (m = number of possible values)
distribution of a nominal attribute
H(A) = Σ_{i} –p_{i}*lg(p_{i})
H is maximal when all p_{i} = 1/m
H_max = lg(m)
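A minimal Python sketch of the entropy formulas above (the helper name `entropy` is my own):

```python
import math

def entropy(probs):
    """Entropy of a distribution: H = sum of -p_i * lg(p_i)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# binary attribute: maximal (1 bit) at p = 1/2
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([0.9, 0.1]))  # ~0.469

# nominal attribute with m = 4 equally likely values: H_max = lg(4) = 2
print(entropy([0.25] * 4))  # 2.0
```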
histograms
define cut points and count occurrences within bins
- equal width: cut the domain into k equal-sized intervals
- equal height: select k cut points such that all bins contain n/k data points
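A sketch of both binning strategies with NumPy; equal-height edges are just quantiles:

```python
import numpy as np

data = np.array([1, 2, 2, 3, 5, 8, 13, 21, 34, 55], dtype=float)
k = 4

# equal width: cut the domain into k equal-sized intervals
width_edges = np.linspace(data.min(), data.max(), k + 1)

# equal height: cut points chosen so each bin holds ~n/k points
height_edges = np.quantile(data, np.linspace(0, 1, k + 1))

print(np.histogram(data, bins=width_edges)[0])   # [7 1 1 1]: skewed data piles up
print(np.histogram(data, bins=height_edges)[0])  # [3 2 2 3]: ~n/k per bin
```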
information gain
gain(A) = H(p) – Σ_{i} (n_{i}/n)*H(p_{i})
p probability of positive in current set (before split)
n number of examples in current set (before split)
p_{i} probability of positive in branch i (after split)
n_{i} number of examples in branch i (after split)
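A sketch of the formula; the demo numbers correspond to a three-way split of 14 examples with 9 positives (the classic weather/outlook example):

```python
import math

def H(p):
    """Entropy of a binary distribution with positive probability p."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_gain(p, branches):
    """gain = H(p) - sum over branches i of (n_i / n) * H(p_i).
    branches: list of (p_i, n_i) pairs, one per branch after the split."""
    n = sum(n_i for _, n_i in branches)
    return H(p) - sum((n_i / n) * H(p_i) for p_i, n_i in branches)

print(info_gain(9 / 14, [(0.4, 5), (1.0, 4), (0.6, 5)]))  # ~0.247
```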
problem: highly-branching attributes
attributes with a large number of values are more likely to create pure subsets => information gain is biased towards choosing attributes with many values (may result in overfitting: not optimal for prediction)
=> solution: gain ratio
Gain ratio
modification of the information gain that reduces its bias toward high-branching attributes => large when the data is divided into a few, even groups, small when each example belongs to a separate branch
GainRatio(S,A) = Gain(S,A)/IntrinsicInfo(S,A)
intrinsic information
the entropy of the distribution of instances into branches: the more branches, the higher this entropy
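A sketch of intrinsic information and gain ratio, reusing the numbers from the information-gain sketch above:

```python
import math

def intrinsic_info(branch_sizes):
    """Entropy of the distribution of instances into branches."""
    n = sum(branch_sizes)
    return -sum((n_i / n) * math.log2(n_i / n) for n_i in branch_sizes if n_i > 0)

def gain_ratio(gain, branch_sizes):
    return gain / intrinsic_info(branch_sizes)

# few, even branches: moderate intrinsic info (~1.58)
print(gain_ratio(0.247, [5, 4, 5]))  # ~0.157
# one branch per example (e.g. an ID attribute): the perfect gain of 0.940
# is divided by lg(14) ~ 3.81, sharply penalizing the many branches
print(gain_ratio(0.940, [1] * 14))   # ~0.247
```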
entropy over numeric attributes
many possible split points: evaluate the information gain for every possible split point and choose the best one; its info gain becomes the information gain for the attribute (breakpoints between values of the same class cannot be optimal)
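A brute-force sketch: try a breakpoint between every pair of adjacent distinct values and keep the best gain (for simplicity it also evaluates the same-class breakpoints that the note above rules out as optimal):

```python
import math

def label_entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_numeric_split(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = label_entropy([l for _, l in pairs])
    best = (None, -1.0)  # (split point, info gain)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no breakpoint between equal values
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        gain = (base - (len(left) / n) * label_entropy(left)
                     - (len(right) / n) * label_entropy(right))
        if gain > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best

print(best_numeric_split([64, 65, 68, 69, 70, 71],
                         ["yes", "no", "yes", "yes", "yes", "no"]))  # (70.5, ~0.32)
```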
pruning
prevent overfitting to noise in the data
prepruning, postpruning
prepruning
stop growing the tree when there is no statistically significant dependence between an attribute and the class at a particular node (chi-squared test)
problem: prepruning might stop the process too soon (early stopping), e.g. the XOR problem (no individual attribute exhibits any significant association to the class; it only becomes visible in the expanded tree)
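A sketch of the significance test at a node, assuming SciPy is available; rows of the contingency table are attribute values, columns are class counts:

```python
from scipy.stats import chi2_contingency

table = [[20, 5],   # attribute value 1: 20 positive, 5 negative
         [4, 21]]   # attribute value 2: 4 positive, 21 negative
chi2, p_value, dof, _ = chi2_contingency(table)

# prepruning heuristic: only split if the dependence is significant
print("split" if p_value < 0.05 else "stop growing here")
```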
postpruning
first, build full tree
then, prune: subtree replacement (bottom up, consider replacing a tree only after considering all its subtrees)
estimating error rates
avoid overfitting: don’t model details that will produce errors on future data => estimate such errors
- prune only if it reduces the estimated error
- compute confidence interval around observed error rate (take high end as worst case)
- C4.5: error estimate for subtree is weighted sum of error estimates for all its leaves
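A sketch of the worst-case estimate: the upper end of a confidence interval around the observed error rate f on N examples (z = 0.69 corresponds to C4.5's default 25% confidence level; formula as given in Witten & Frank):

```python
import math

def pessimistic_error(f, N, z=0.69):
    """Upper confidence bound on the true error rate, given observed rate f."""
    return (f + z * z / (2 * N)
            + z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)

# a leaf with 2 errors out of 6 examples: observed f = 0.33, worst case ~0.47
print(pessimistic_error(2 / 6, 6))
```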
covering algorithms
for each class in turn, identify a rule set that covers all examples in it
- a single rule describes a convex concept: it covers only examples of the same class
- generate a rule by adding tests that maximize the rule’s accuracy
- each additional test reduces the rule’s coverage
selecting a test (covering algorithms)
goal: maximize accuracy
- t total number of examples covered by the rule
- p positive examples of the class covered by the rule
- select test that maximizes p/t (finished when p/t = 1)
PRISM algorithm
(separate-and-conquer: all examples covered by a rule are separated out, remaining examples are conquered)
for each class, generate rules by adding tests (maximizing p/t) until each rule is perfect, and keep generating rules until every example of the class is covered
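A self-contained sketch of PRISM for one class (PRISM breaks p/t ties by preferring larger p; this sketch breaks them arbitrarily):

```python
def prism_for_class(examples, cls, attributes):
    """examples: list of dicts with a 'class' key; returns rules as
    dicts of attribute -> value tests."""
    rules, remaining = [], list(examples)
    while any(e["class"] == cls for e in remaining):      # conquer all positives
        rule, covered = {}, remaining
        while any(e["class"] != cls for e in covered):    # grow until perfect
            best = None  # (p/t, attribute, value, covered subset)
            for a in attributes:
                if a in rule:
                    continue
                for v in {e[a] for e in covered}:
                    subset = [e for e in covered if e[a] == v]
                    p = sum(e["class"] == cls for e in subset)
                    if best is None or p / len(subset) > best[0]:
                        best = (p / len(subset), a, v, subset)
            if best is None:
                break  # no tests left: accept an imperfect rule
            _, a, v, covered = best
            rule[a] = v
        rules.append(rule)
        # separate out the examples covered by the finished rule
        remaining = [e for e in remaining
                     if not all(e[a] == v for a, v in rule.items())]
    return rules

data = [{"outlook": "sunny", "windy": "false", "class": "yes"},
        {"outlook": "sunny", "windy": "true",  "class": "no"},
        {"outlook": "rainy", "windy": "false", "class": "yes"},
        {"outlook": "rainy", "windy": "true",  "class": "no"}]
print(prism_for_class(data, "yes", ["outlook", "windy"]))  # [{'windy': 'false'}]
```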
instance-based / rote / lazy learning
the training set is searched for the example that most closely resembles the new example; the examples themselves represent the knowledge
- (k-)NN
distance function
- simplest case: 1 numeric attribute (difference between the two values)
- several numeric attributes: Euclidean distance + normalization (square root of the sum of squared differences between the two values, over all attributes)
1-NN discussion
- can be accurate, but is slow
- assumes all attributes are equally important (solution: attribute selection or add weights)
- noisy examples are a problem (solution: take a majority vote over the k nearest neighbours)
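A sketch combining the Euclidean distance function with the k-NN majority vote (attributes are assumed to be normalized already):

```python
import math
from collections import Counter

def euclidean(a, b):
    # square root of the sum of squared per-attribute differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); majority vote over k nearest."""
    nearest = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((5.0, 5.0), "b"), ((5.1, 4.8), "b")]
print(knn_predict(train, (1.1, 1.0)))  # 'a'
```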
proper test and training procedure uses 3 sets
training set: train models
validation set: optimize algorithm parameters
test set: evaluate final model
trade-off: performance vs evaluation accuracy
more training data => better model
more test data => more accurate error estimate
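A sketch of the three-way split; the 60/20/20 fractions are illustrative, not prescribed:

```python
import random

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle, then cut into training / validation / test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```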
ROC analysis
Receiver Operating Characteristic
compute TPR = TP/(TP+FN) and FPR = FP/(FP+TN)
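A sketch computing one ROC point (TPR, FPR) from hard predictions:

```python
def tpr_fpr(actual, predicted, positive=1):
    """TPR = TP/(TP+FN), FPR = FP/(FP+TN)."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp / (tp + fn), fp / (fp + tn)

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 0, 0, 1]
print(tpr_fpr(actual, predicted))  # (0.75, 0.25)
```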
data science
extract insights from large data
less emphasis on algorithms, more on outreach (application)