Exam Flashcards
Knowledge Discovery
about understanding a domain with interpretable models
Prediction
setting where only accurate prediction matters; the methods you use (and how they work) do not matter
Black box setting
you don’t care about how the model works
Classification
- predicts future cases of a binary class.
- models the dependency of the target on the other attributes.
- Sometimes a black-box classifier.
- some attributes may not appear because of overshadowing in decision trees.
- Supervised learning
Regression
tries to predict a numeric target variable
Clustering
divides a dataset into groups of similar cases
Frequent Patterns/Association
finds dependencies between variables
Support Vector Machine
finds a single line (hyperplane) through the dataset that gives a good boundary between the positive and negative examples
Neural Network
like the Support Vector Machine, but the decision boundary can be a curve (non-linear)
iid
independent and identically distributed
Nominal
categorical/discrete, can only test for equality
Numeric
can test for inequalities and can use arithmetic or distance measure
Ordinal
can compare inequalities as well, but not use arithmetic or distance measure
Binary
Nominal variable with only two values
Entropy
measure of the amount of information/chaos; highest when the probability is distributed equally over the values (each value has probability 1/m); unsupervised
Entropy formula
H(p) = –p·lg(p) – (1–p)·lg(1–p), where p is the probability of a value; general form: H = –Σi pi·lg(pi)
Max Entropy formula
–m·(1/m)·lg(1/m) = –lg(1/m) = lg m (for m equally likely values)
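A minimal Python sketch (not from the cards; the counts are made up) of the entropy formula, checking that a uniform distribution over m values gives lg m:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a distribution given by value counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

print(entropy([5, 5]))        # uniform over 2 values -> 1.0 bit (= lg 2)
print(entropy([2, 2, 2, 2]))  # uniform over 4 values -> 2.0 bits (= lg 4)
print(entropy([8, 0]))        # pure node -> 0.0 bits
```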
Cumulative distribution function(CDF)
gives, for each value, the fraction of data points less than or equal to that value (the density summed up from the left)
Probability density function(PDF)
the derivative of the CDF: the relative density of points at each value. Density is not probability; the peak is where most values lie and thus where the density is highest
Histograms
estimates density in a discrete way by defining cut points and counting occurrences per created bin; unsupervised method
Histograms Equal width
the bins are cut in equal size intervals
Histograms Equal height
the bins are cut so every bin contains about the same amount of data points
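A small sketch (using numpy; the data and bin count are made up) contrasting the two binning strategies: equal width cuts the value range into same-sized intervals, equal height cuts at quantiles so each bin holds roughly the same number of points:

```python
import numpy as np

values = np.array([1, 2, 2, 3, 4, 4, 4, 5, 9, 20], dtype=float)
k = 4  # number of bins

# Equal width: cut the range [min, max] into k equally sized intervals
width_edges = np.linspace(values.min(), values.max(), k + 1)

# Equal height (equal frequency): cut at quantiles so counts per bin are ~equal
height_edges = np.quantile(values, np.linspace(0, 1, k + 1))

print("equal width bins :", np.histogram(values, bins=width_edges)[0])
print("equal height bins:", np.histogram(values, bins=height_edges)[0])
```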
Kernel(Gaussian) Density Estimation
estimates the density of the population from a sample by placing a kernel (e.g., a Gaussian) on each data point and summing them
Downside Entropy
the entropy concept does not apply well to numeric data, only to nominal data
Confusion Matrix/Contingency table
describes the frequencies of the four combinations of subgroup and target: positive and negative within the subgroup, and positive and negative outside the subgroup
Confusion matrix distribution
if you have the joint distribution, you implicitly also have the univariate (marginal) distributions
methods of quantifying information between attributes
joint entropy, mutual information, information gain
joint distributions dependency
if the counts along the diagonal of the confusion matrix are high, one value is (somewhat) dependent on the other. Fully determined means all probability mass lies on the diagonal: one variable completely fixes the other
Density problem visualisation
with scatterplots, you cannot see the density because of the datapoints overlapping at the same spot
Information gain
in decision trees: the entropy before the split compared to the entropy after the split; a non-negative number for each attribute; supervised
Uncertainty
a class attribute contains uncertainty over values. this is captured by the entropy of the target
top-down tree construction
all training examples are at the root at the start of building. partition the examples recursively by choosing one attribute each time
bottom-up tree pruning
- removes subtrees or branches in a bottom-up manner.
- goal is to improve the estimated accuracy on new cases
What does the goodness function do?
- at each node, ATTRIBUTES are SELECTED based on how well they SEPARATE the CLASSES of the training examples.
- Examples: information gain (ID3/C4.5), information gain ratio, gini index
Heuristic attribute selection
chooses the attribute that produces the purest nodes
How does Information gain do attribute selection?
- uses entropy of the class attribute.
- Information gain increases with the average purity of the subsets that an attribute produces.
- chooses the attribute that results in the highest information gain
Entropy pure and mixed nodes
H(0) = 0 and H(1) = 0 give pure nodes (skewed distribution); H(0.5) = 1 gives a maximally mixed node (uniform distribution)
Information gain formula
gain = H(p) – Σi (ni/n)·H(pi): the information before the split minus the weighted information after the split; the maximum information gain is lg(#values)
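A sketch of the formula in Python (the class counts are the classic weather-data split on outlook, used only as an illustration): gain is the parent entropy minus the size-weighted entropy of the children:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, child_counts_list):
    """H(parent) minus the size-weighted entropy of the child nodes."""
    n = sum(parent_counts)
    after = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - after

# 9 positives / 5 negatives, split into subsets of 5, 4 and 5 examples
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.247 bits
```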
Information gain bias
biased towards choosing attributes with a large number of values, this may result in overfitting
Overfitting
selection of an attribute that is non-optimal for prediction
Gain ratio + when large/small?
- Is a modification of the information gain that reduces its bias on high-branching attributes.
- large when data is divided into few, even groups.
- small when each example belongs to a separate branch
Intrinsic information
entropy of distribution of instances into branches
gain ratio formula
Gain(S, A) / Intrinsic Info (S, A)
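A sketch of the same split with the gain ratio (illustrative counts as above): the intrinsic information is the entropy of the branch sizes, ignoring the class:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts_list):
    n = sum(parent_counts)
    sizes = [sum(child) for child in child_counts_list]
    gain = entropy(parent_counts) - sum(s / n * entropy(child)
                                        for s, child in zip(sizes, child_counts_list))
    intrinsic = entropy(sizes)  # entropy of the split sizes, ignoring the class
    return gain / intrinsic

# gain ~0.247, intrinsic info ~1.577 -> gain ratio ~0.157
print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))
```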
Intrinsic information assumption
the importance of the attribute decreases as its intrinsic information gets larger in the gain ratio formula
Standard fix
an ad hoc test to prevent the gain ratio from overcompensating and choosing an attribute just because its intrinsic information is low (e.g., only consider attributes with at least average information gain)
Splitting numeric attributes
- standard method: binary splits (e.g., temp < 45); evaluate the info gain (or another measure) for every possible split point of the attribute.
- the info gain of the best split point is the info gain of the attribute.
- numeric attributes are computationally more demanding.
If an attribute A has high info gain at the root of the tree, does it always appear in a decision tree?
No.
If it is highly correlated with another attribute B, and gain(B) > gain(A), then B will appear in the tree, and further splitting on A will not be useful.
Can an attribute appear more than once in a decision tree?
Yes
If a test is not at the root of the tree, it can appear in different branches
Can an attribute appear more than once on a single path in the tree (from root to leaf)?
Yes
Numeric attributes can appear more than once, but only with very different numeric conditions
If an attribute A has gain (A)=0 (at the root), can it ever appear in a decision tree?
Yes.
- All attributes may have zero info gain
- info gain often changes when splitting on another attribute, think of the XOR-problem
Is a tree with only pure leaves always the best classifier you can have?
No.
This tree is the best classifier on the training set, but possibly not on new and unseen data. Because of overfitting, the tree may not generalize very well.
Goal Pruning
to prevent overfitting to noise in the data, strategies: pre-pruning and post-pruning
pre-pruning
stop growing a branch when information becomes unreliable
post-pruning
take a fully-grown decision tree and discard unreliable parts
chi-squared test
- most popular statistical significance test for pre-pruning.
- tests for a statistically significant dependency between an attribute and the class in a node
ID3
- method that uses the chi-squared test in addition to information gain
- only selects statistically significant attributes
XOR/Parity-problem
- the classic failure case of early stopping
- pre-pruning stops the growth process prematurely because no individual attribute shows significant gain, even though combinations of attributes do
subtree replacement
post-pruning bottom-up approach. considers replacing a tree only after considering all of its subtrees
C4.5
decision tree method that derives the confidence interval from the training data
Observed error rate
error rate from the training set
Actual error rate
error rate from the test set, which is unknown
Confidence interval
used to estimate how much the actual error rate may differ from the observed error rate
Covering approach
for each class in turn, find the rule set that covers all examples in it (excluding examples not in the class)
PRISM
algorithm based on the covering approach; adds tests that maximize a rule's accuracy.
goal of a test for rules
- to maximize accuracy
- p/t = 1 when the rule covers only positive examples
- t = total number of examples covered by the rule
- p = positive examples covered by the rule
separate-and-conquer algorithms
PRISM is an example; it handles one class at a time. It starts with a rule that separates (covers) some examples of the class, then the remaining examples are conquered by further rules
divide-and-conquer algorithms
decision tree induction is an example: the data is split into subsets that are each explored further (in contrast, a subset covered by a rule in separate-and-conquer does not need to be explored any further)
rote(routine) learning
- classifies a test example by searching the training set for similar (identical) examples
- related methods: nearest neighbour (k-NN)
instance-based learning
- lazy learning
- similarity function defines what is learned.
- methods include nearest neighbour, k-nearest neighbours
Euclidean distance
measures the distance (difference) between two examples over their attribute values: d(x, y) = √(Σi (xi – yi)²)
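A minimal 1-NN sketch in Python (the training points and labels are made up): compute the Euclidean distance to every training example and return the label of the closest one:

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def one_nn(train, query):
    """Label of the training example closest to the query point."""
    features, label = min(train, key=lambda ex: euclidean(ex[0], query))
    return label

train = [((1.0, 1.0), "pos"), ((1.2, 0.8), "pos"), ((5.0, 5.0), "neg")]
print(euclidean((1.0, 1.0), (5.0, 5.0)))  # ~5.66
print(one_nn(train, (0.9, 1.1)))          # 'pos'
```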
Curse of 1-NN
- accurate, but suffers from the curse of dimensionality: added dimensions increase the distances and exponentially more training data is needed.
- it is also slow
- remedy: weighted attributes or attribute selection
Simple models advantage
a simpler model should perform well on unseen data drawn from the same distribution
error rate formula
#errors/#examples
accuracy formula
#successes/#examples
rules data sets 1
never evaluate on training data
rules data sets 2
never train on test data(that includes parameter setting or feature selection)
proper procedure testing
- training set to train models
- validation set to optimize algorithm parameters
- test set to evaluate the final model
trade-off evaluation data sets
- all the data can be used to build the final classifier, however:
- trade-off between performance evaluation and accuracy
k-fold Cross-validation
splits the data (stratified) into k folds; trains and tests k times and averages the results. k = 10 is usually enough to reduce the sampling bias
Leave-One-Out Cross-validation
the number of folds equals the number of examples. It makes the best use of the data, no sampling bias. However, it is computationally expensive
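A sketch of the splitting logic behind k-fold cross-validation (no stratification, plain index lists): every example lands in the test fold exactly once, and setting k equal to the number of examples gives leave-one-out:

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

for train, test in k_fold_indices(n=10, k=5):
    print("train:", train, "test:", test)
# Average the k test scores; k = n turns this into leave-one-out.
```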
ROC
- method that illustrates the diagnostic ability of a binary classifier.
- created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
- the higher the ROC curve lies (towards the top-left corner), the better the classifier performs.
TPR(True Positive Rate)
true positives/total positives
FPR(False Positive Rate)
false positives/total negatives
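A tiny sketch computing both rates from confusion-matrix counts (counts are made up); each classifier threshold yields one (FPR, TPR) point on the ROC curve:

```python
def rates(tp, fp, fn, tn):
    tpr = tp / (tp + fn)  # true positives / all actual positives
    fpr = fp / (fp + tn)  # false positives / all actual negatives
    return tpr, fpr

print(rates(tp=40, fp=10, fn=10, tn=40))  # (0.8, 0.2): one point on the ROC curve
```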
Significance tests
- quantify the confidence that an observed difference is real.
- enough evidence to reject the null hypothesis?
- with cross-validation scores, is B better than A?
Subgroup discovery Task
- a mix between regression and classification
- find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute
Inductive constraints examples
- MAXIMUM COVERAGE,
- base rate when compared to the sample size,
- minimum quality,
- information gain (χ², WRAcc)
Maximum complexity rule
the fewer attributes, the better
Confusion matrix diagonals
- High numbers across the TT-FF diagonal mean a positive correlation between subgroup and target.
- High numbers across the TF-FT diagonal mean a negative correlation between subgroup and target
quality measure
- expresses the interestingness of a subgroup's confusion matrix in a single number
- popular is the WRAcc(weighted relative accuracy)
WRAcc formula
WRAcc(Subgroup, Target) = p(ST) – p(S)·p(T)
- compares p(ST) to what would be expected under independence
- balances the coverage (size of the subgroup) against the unexpectedness (deviation of the target)
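A sketch of WRAcc from the four subgroup/target counts (numbers made up): the observed p(ST) is compared to the product p(S)·p(T) expected under independence:

```python
def wracc(st, s_not_t, not_s_t, not_s_not_t):
    """WRAcc = p(ST) - p(S) * p(T), computed from confusion-matrix counts."""
    n = st + s_not_t + not_s_t + not_s_not_t
    p_st = st / n
    p_s = (st + s_not_t) / n
    p_t = (st + not_s_t) / n
    return p_st - p_s * p_t

# Subgroup covers 30 of 100 examples, 25 of them positive; 40 positives overall
print(wracc(st=25, s_not_t=5, not_s_t=15, not_s_not_t=55))  # 0.25 - 0.3*0.4 = 0.13
```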
examples quality measures
WRAcc, information gain, χ², correlation coefficient, Laplace, Jaccard, specificity
Regression subgroup discovery
numeric target has an order and a scale
ordinal subgroup discovery
numeric target has an order
ranked subgroup discovery
numeric target has an order or a scale
Numeric Subgroup Discovery intuitions
- Larger subgroups are more reliable
- the majority of objects appear at the top(yes/true)
- the middle of the subgroup should differ from the middle of the ranking
- objects should have a similar rank
Classical subgroup discovery
- Nominal targets(classification)
- Numeric targets(regression)
Exceptional model mining(extension subgroup discovery)
- multiple targets
- regression, correlation
- multi-label classification
Statistical validation determines:
- the DISTRIBUTION of RANDOM RESULTS (random subsets, conditions, swap-randomization)
- minimum QUALITY
- SIGNIFICANCE of INDIVIDUAL results
Why Attribute Selection?
- to remove attributes with little/no predictive information
- irrelevant attributes often slow down algorithms.
- it also avoids huge decision trees, therefore easier to interpret
Attribute Selection: kNN curse
dimensionality: increasing the number of attributes -> exponentially more training instances needed
Attribute Selection: C4.5 curse
data fragmentation problem: attributes are selected on less and less data after every split
filter approach
- attribute selection approach that is learner independent.
- based on simple models built by other learners.
- Example C4.5: selects features tested in the top-level nodes.
- Example kNN: weights features by capability to separate classes
Wrapper approach
- attribute selection approach that is learner dependent.
- rerun the learner with different attributes, select based on performance.
filter approach: recursive
select 1 attribute, remove, repeat
filter approach: produce a ranking
the cut-off is defined by the user
Attribute discretization
- converting numeric attributes to nominal ones.
- Downside: you always lose information.
- plus: you can use learners that can’t handle numeric data
unsupervised discretization
determines intervals without knowing the class labels; uses histograms. Two ways to do this:
- equal interval binning (equal width)
- equal frequency binning (equal height)
Supervised discretization
- determines intervals with knowledge of the class labels.
- usually works better than unsupervised
- less predictive information is lost
- Two approaches: entropy-based and bottom-up merging
Advantage Data Transformations
can lead to new insights in the data and better performance.
Linear regression
- can have one or multiple predictors
- computed directly from the data
- Lasso regression selects features by setting parameters to 0.
Rsquared/explained variance
what percentage of the variance in the target is explained by the regression model
Regression trees
variant of the decision tree that uses a top-down induction. two options:
- constant value in leaf(piecewise constant) -> regression trees
- Local linear model in leaf (piecewise linear) -> model trees
criteria for regression trees
- splitting criterion: standard deviation reduction (SDR)
- stopping criterion: standard deviation below some threshold, or too few examples in the node
Estimate error formula
- (n+v)/(n-v) * absolute error in node
- n=examples in node, v=parameters in the model.
maximally informative k-itemsets (MIKI) Motivation
subgroup discovery typically produces very many patterns with high levels of redundancy.
maximally informative k-itemsets (MIKI) Dissimilar patterns
- optimizes the dissimilarity of the reported patterns.
- the additional value of each individual pattern is reported.
maximally informative k-itemsets (MIKI)
an itemset of size k that maximizes the joint entropy
properties of joint entropy
- MONOTONICITY: suppose X and Y are two itemsets, such that X ⊆ Y. Then: H(X) ≤ H(Y).
- UNIT GROWTH: suppose X and Y are two itemsets, such that X ⊆ Y. Then: H(Y) ≤ H(X) + |Y\X|.
- INDEPENDENCE BOUND: suppose that X = {x1, …, xk} is an itemset. Then: H(X) ≤ Σi H(xi)
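A sketch of joint entropy over binary items (the 0/1 rows are made up): count how often each combination of values occurs and apply the entropy formula; the independence bound H(X) ≤ Σi H(xi) can be checked directly:

```python
from collections import Counter
from math import log2

def entropy_of_counts(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def joint_entropy(rows, items):
    """Joint entropy of the selected item columns over the data rows."""
    combos = Counter(tuple(row[i] for i in items) for row in rows)
    return entropy_of_counts(list(combos.values()))

rows = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 1), (1, 0, 0), (0, 1, 0)]
print(joint_entropy(rows, [0]))                             # H(x1) = 1.0
print(joint_entropy(rows, [0, 1]))                          # H({x1, x2}) ~ 1.92
print(joint_entropy(rows, [0]) + joint_entropy(rows, [1]))  # independence bound 2.0
```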
partitions of itemsets
- suppose that P = {B1, …, Bm} is a partition of an itemset.
- The joint entropy of P is defined as: H(P) = Σi H(Bi)
Joint entropy partition properties
P = {B1, …, Bm} -> partition of itemset X
- Partition bound: H(X) ≤ H(P)
P = {B1, …, Bm} -> partition of itemset X = {x1, …, xk}
- Independence bound: H(P) ≤ Σi H(xi)
Itemset
an itemset is a collection of one or more items.
Support count (σ)
the number of times a certain itemset occurs in the transactions (its frequency of occurrence)
Support
the fraction of transactions that contain an itemset; for a rule X → Y, the fraction that contains both X and Y
Frequent Itemset
a frequent itemset is an itemset whose support is greater than or equal to a minsup(minimum support) threshold
Association Rule Mining
given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction
Association Rule
an association rule is an expression of the form X → Y (X and Y are itemsets)
Two Rule Evaluation Measures
support and confidence
Confidence
measures how often items in Y appear in transactions that contain X: confidence(X → Y) = support(X ∪ Y) / support(X)
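A sketch over a toy transaction list (items and transactions made up): support counts how many transactions contain the whole itemset, and confidence of X → Y divides the support of X ∪ Y by the support of X:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"milk", "diapers", "beer"}))       # 2/5 = 0.4
print(confidence({"milk", "diapers"}, {"beer"}))  # 2/3 ~ 0.67
```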
Computational Complexity
the amount of resources required to run an algorithm
possible association rules formula
R = 3^d – 2^(d+1) + 1
*d is the number of items
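For example, with d = 6 items: R = 3^6 – 2^7 + 1 = 729 – 128 + 1 = 602 possible rules. A one-line check:

```python
d = 6
print(3**d - 2**(d + 1) + 1)  # 602
```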
Brute-force approach
- an approach to mining association rules.
- lists all possible rules and computes the support and confidence of each rule.
- rules that fail the minimum confidence and support thresholds are then discarded.
- computationally very prohibitive
Two-step approach association rules
- Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
- Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Apriori principle
- any subset of a frequent itemset must also be frequent; used to prune the candidate search:
- generate length (k+1) candidate itemsets from length k frequent itemsets
- prune candidates that have an infrequent subset of length k
- count the support of each remaining candidate
- eliminate the infrequent candidates, so only the frequent candidates are left
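A compact sketch of one Apriori level (toy transactions and a made-up minimum support count): join frequent k-itemsets into (k+1)-candidates, prune candidates with an infrequent k-subset, then count support and keep the frequent ones:

```python
from itertools import combinations

def apriori_step(frequent_k, transactions, minsup_count):
    """One level of Apriori: generate, prune and count (k+1)-candidates."""
    k = len(next(iter(frequent_k)))
    # Generate: union of two frequent k-itemsets that yields a (k+1)-itemset
    candidates = {a | b for a in frequent_k for b in frequent_k if len(a | b) == k + 1}
    # Prune: every k-subset of a candidate must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent_k for s in combinations(c, k))}
    # Count support and keep only the frequent candidates
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= minsup_count}

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_1 = {frozenset({"a"}), frozenset({"b"}), frozenset({"c"})}
print(apriori_step(frequent_1, transactions, minsup_count=3))
```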
Factors Affecting Complexity
- minimum support threshold: lower support threshold -> more frequent itemsets -> more candidate itemsets
- dimensionality (number of items): you need more space to store support counts
- size of database(number of transactions): bigger database -> increase run time
- Transaction width (density of datasets): wider transaction width -> increase #subsets -> increase length frequent itemsets
Maximal frequent itemset
a frequent itemset none of whose immediate supersets is frequent; the maximal frequent itemsets fix the border between frequent and infrequent itemsets
Closed itemset
an itemset none of whose supersets has the same support as it; the support of every superset is strictly lower
Regression
- the ensemble prediction is the average of the individual predictions
- more diversity -> better performance.
Bootstrap Aggregating (Bagging)
- takes RANDOM SAMPLES with REPLACEMENT: given a training set D of size n, bagging generates m new training sets D1…Dm, each of size n.
- sampling with replacement -> OBSERVATIONS can be REPEATED in a sample
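A sketch of the sampling step (data and number of samples made up): each bootstrap sample is drawn with replacement and has the same size as the original, so some observations repeat and the left-out ones form the out-of-bag set:

```python
import random

def bootstrap_samples(data, m, seed=0):
    """Generate m bootstrap samples of the same size as `data`."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in range(len(data))] for _ in range(m)]

data = list(range(10))
for sample in bootstrap_samples(data, m=3):
    out_of_bag = sorted(set(data) - set(sample))
    print(sample, "OOB:", out_of_bag)
```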
unstable learners
learners whose models change considerably when the training data changes slightly (e.g., decision trees); such learners de-correlate when trained on different datasets
Out-of-bag error(OOB)
- measures the prediction error of machine-learning models (e.g., random forests) that use bootstrap aggregating (bagging).
- it is the MEAN PREDICTION ERROR on each training sample xi
- using only the trees that did not have xi in their bootstrap sample
Methods that use randomization
- Random Subspace Method
- bootstrapping
- boosting
- C4.5
Random Forests downsides
- not very transparent
- have to fit the Random Forest to the data set
- you have to record the average OOB error over all the trees
Benefits of Random Forests
- good theoretical basis (ensembles).
- fairly easy to run.
- often used as a baseline for testing.
- easy to run in parallel due to its inherent parallelism; it already runs faster on multi-core computers.
clustering
- has no labels and finds ‘natural’ grouping of instances.
- if labels are present, then they are ignored.
- unsupervised learning,
- Examples include k-means, hierarchical clustering, k-medoids, Self-Organizing Maps
k-means steps
1. pick k random points as initial cluster centres
2. assign each point to the nearest cluster centre
3. move each cluster centre to the mean of its cluster
4. reassign the points to the nearest cluster centre
5. repeat steps 3-4 until the cluster centres converge (hardly move)
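A minimal k-means sketch (pure Python, 1-D points for brevity; random restarts and empty-cluster handling are simplified):

```python
import random

def k_means(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)                 # step 1: k random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                            # steps 2/4: assign to nearest centre
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        new_centres = [sum(c) / len(c) if c else centres[i]  # step 3: move to the mean
                       for i, c in enumerate(clusters)]
        if new_centres == centres:                  # step 5: stop once centres converge
            break
        centres = new_centres
    return centres, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1]
print(k_means(points, k=3))
```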
k-means advantages
simple, understandable and fast. The instances are automatically assigned to clusters.
k-means disadvantages(name 5)
- k must be determined beforehand,
- sensitive to outliers as instances are forced to a single cluster.
- random algorithm -> random results.
- not intuitive(higher dimensions)
- cluster central point is not always an observed data point
global optimum
point where the function value is smaller than at all other feasible points
local minimum
- a point where the function value is smaller than at nearby points
- but possibly greater than at a distant point.
k-medoids
- uses the most centrally located data point (medoid) of each cluster instead of the mean
- the cluster representative is always an observed data point
Hierarchical clustering
structures the clusters into a tree called a dendrogram: the leaves hold the individual clusters and each internal node is the union of its child clusters
bottom-up/agglomerative clustering
starts with single-instance clusters and joins the two closest clusters at each step
the top-down approach(clustering)
starts with one universal cluster and is split into two clusters recursively
Cluster distance (linkage)
the distance between clusters can be measured by:
- single link: smallest distance between points
- complete link: largest distance between points
- average link: average distance between points
- centroid: distance between the cluster centres
Self-Organising Maps(SOM)
- clustering method that groups similar data together.
- reduces dimensionality and is a data visualisation technique.
- the neurons try to mimic the input vectors
Quality measures subgroup discovery
- WRAcc
- z-score
- Explained Variance
- information gain
histograms disadvantage
bin boundaries can be placed at unfortunate locations, causing (nearly) empty bins or overly full bins
k as a parameter in k-NN algorithm determines:
- Decision Boundary: the smoothness of the decision boundary
- Neighbours: the number of neighbours considered when classifying a new example
- Fit: how well the model fits the training data
k-means possible desirable outcomes
- The cluster centres move to the circular data
- The algorithm gets stuck in a local optimum
- The algorithm doesn’t converge
Maximum MIKI
the upper bound on a miki's joint entropy: the combined (summed) entropy of its individual items
MIKI
- the itemset of size k whose joint entropy is higher than that of all other itemsets of size k (the highest joint entropy).
- if an itemset is the only one with a given number of elements, it is automatically the miki of that size
identifiers downside
identifiers have a high entropy, but splitting on them gives a higher chance of overfitting