Exam flashcards
supervised learning
- how does the output depend on the input?
- can we predict the output, given the input?
- classification, regression
unsupervised learning
- no target
- clustering and frequent pattern mining (find dependencies between variables)
univariate distribution
what values occur for an attribute and how often, data is a sample of a population
entropy H(A)
how informative is an attribute? (about the value of another attribute)
a measure of the amount of information/chaos
distribution of a binary attribute
H(A) = – plg(p) – (1–p)lg(1–p)
H(A) maximal (1) when p = 1/2 = 1/m (m nr of possible values)
distribution of a nominal attribute
H(A) = Σ–p_{i}*lg(p_{i})
H maximal when all p_{i} = 1/m
H_max = lg(m)
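A minimal Python sketch of the binary and nominal entropy formulas above (lg = log base 2; the example probabilities are illustrative):

```python
from math import log2

def entropy(probs):
    """H = sum of -p * lg(p) over all values (terms with p = 0 contribute 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def binary_entropy(p):
    """H(A) = -p*lg(p) - (1-p)*lg(1-p) for a binary attribute."""
    return entropy([p, 1 - p])

# maximal when all m values are equally likely: H_max = lg(m)
print(binary_entropy(0.5))            # 1.0
print(entropy([1/4] * 4), log2(4))    # 2.0 2.0
```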
histograms
define cut points and count occurrences within bins
- equal width: cut domain in k equal sized intervals
- equal height: select k cut points such that all bins contain n/k data points
information gain
gain(A) = H(p) – ΣH(p_{i})*n_{i}/n
p probability of positive in current set (before split)
n number of examples in current set (before split)
p_{i} probability of positive in branch i (after split)
n_{i} number of examples in branch i (after split)
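A hedged sketch of the gain computation; the split in the example (9/14 positive before the split, branches with 2/5, 4/4 and 3/5 positive) is purely illustrative:

```python
from math import log2

def H(p):
    """Entropy of a binary distribution with positive probability p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def information_gain(p, n, branches):
    """gain(A) = H(p) - sum(H(p_i) * n_i / n) over branches given as [(p_i, n_i), ...]."""
    return H(p) - sum(H(p_i) * n_i / n for p_i, n_i in branches)

# illustrative split: 9 of 14 positive before, three branches after
print(information_gain(9/14, 14, [(2/5, 5), (4/4, 4), (3/5, 5)]))   # ≈ 0.247
```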
problem highly-branching attributes
attributes with a large number of values are more likely to create pure subsets => information gain is biased towards choosing attributes with many values (may result in overfitting: not optimal for prediction)
=> solution: gain ratio
Gain ratio
modification of the information gain that reduces its bias on high-branching attributes => large when data is divided in few, even groups, small when each example belongs to a separate branch
GainRatio(S,A) = Gain(S,A)/IntrinsicInfo(S,A)
intrinsic information
the entropy of distribution of instances into branches: the more branches, the higher this entropy
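A small sketch of intrinsic information and gain ratio, assuming the gain value has already been computed as above; the branch sizes are illustrative:

```python
from math import log2

def intrinsic_info(branch_sizes):
    """Entropy of the split itself: -sum(n_i/n * lg(n_i/n)) over the branches."""
    n = sum(branch_sizes)
    return -sum(n_i / n * log2(n_i / n) for n_i in branch_sizes if n_i > 0)

def gain_ratio(gain, branch_sizes):
    """GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)."""
    return gain / intrinsic_info(branch_sizes)

# splitting 14 examples into branches of size 5, 4 and 5
print(intrinsic_info([5, 4, 5]))       # ≈ 1.577
print(gain_ratio(0.247, [5, 4, 5]))    # ≈ 0.157
```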
entropy over numeric attributes
many possible split points: evaluate information gain for every possible split point, choose best one and its info gain is the information gain for the attribute (breakpoints between values of the same class cannot be optimal)
pruning
prevent overfitting to noise in the data
prepruning, postpruning
prepruning
stop growing the tree when there is no statistically significant dependence between an attribute and the class at a particular node - chi-squared test
problem: prepruning might stop the process too soon (early stopping), eg. the XOR problem (no individual attribute exhibits any significant association to the class; it only becomes visible in the expanded tree)
postpruning
first, build full tree
then, prune: subtree replacement (bottom up, consider replacing a tree only after considering all its subtrees)
estimating error rates
avoid overfitting: don’t model details that will produce errors on future data => estimate such errors
- prune only if it reduces the estimated error
- compute confidence interval around observed error rate (take high end as worst case)
- C4.5: error estimate for subtree is weighted sum of error estimates for all its leaves
covering algorithms
for each class in turn, identify a rule set that covers all examples in it
- a single rule describes a convex concept: it covers only examples of the same class
- generate a rule by adding tests that maximize the rule’s accuracy
- each additional test reduces the rule’s coverage
selecting a test (covering algorithms)
goal: maximize accuracy
- t total number of examples covered by the rule
- p positive examples of the class covered by the rule
- select test that maximizes p/t (finished when p/t = 1)
PRISM algorithm
(separate-and-conquer: all examples covered by a rule are separated out, remaining examples are conquered)
for each class, make rules by adding tests (maximizing p/t) until the rule is perfect and until each example of the class has been covered by a rule
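A minimal sketch of the test-selection step of a PRISM-style covering algorithm (maximize p/t; breaking ties by larger p is one common choice, not prescribed above). The candidate tests are hypothetical:

```python
def best_test(candidates):
    """candidates: list of (test, p, t), with p positives out of t examples covered.
    Returns the test maximizing accuracy p/t, breaking ties by coverage p."""
    return max(candidates, key=lambda c: (c[1] / c[2], c[1]))

# hypothetical candidate tests for one rule-growing step
tests = [("outlook=sunny", 2, 5), ("humidity=high", 3, 7), ("windy=false", 6, 8)]
print(best_test(tests))   # ('windy=false', 6, 8) with p/t = 0.75
```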
instance-based / rote / lazy learning
training set is searched for example that most closely resembles the new example, examples themselves represent the knowledge
- (k-)NN
distance function
- simplest case: 1 numeric attribute (difference between the two values)
- several numeric attributes: Euclidean distance + normalization (square root of the sum of squares of the differences between the two values, over all attributes)
1-NN discussion
- can be accurate, but is slow
- assumes all attributes are equally important (solution: attribute selection or add weights)
- sensitive to noisy examples (solution: take a majority vote over the k nearest neighbours)
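A minimal sketch of the distance function and k-NN vote described above, with min-max normalization so no attribute dominates; the helper names and toy data are illustrative:

```python
from collections import Counter
from math import sqrt

def normalize(dataset):
    """Min-max scale every numeric attribute to [0, 1]."""
    lows  = [min(col) for col in zip(*dataset)]
    highs = [max(col) for col in zip(*dataset)]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in dataset]

def euclidean(a, b):
    """Square root of the sum of squared per-attribute differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote over the k nearest training examples."""
    nearest = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

X = normalize([[180, 80], [160, 50], [170, 65], [182, 78]])
print(knn_predict(X, ["tall", "short", "medium", "tall"], X[0], k=3))   # 'tall'
```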
proper test and training procedure uses 3 sets
training set: train models
validation set: optimize algorithm parameters
test set: evaluate final model
trade-off: performance vs evaluation accuracy
more training data => better model
more test data => more accurate error estimate
ROC analysis
Receiver Operating Characteristic
compute TPR = TP/(TP+FN) and FPR = FP/(FP+TN)
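A one-function sketch of a point in ROC space; the confusion-matrix counts are made up:

```python
def roc_point(tp, fn, fp, tn):
    """One point in ROC space from a confusion matrix."""
    tpr = tp / (tp + fn)   # true positive rate (sensitivity)
    fpr = fp / (fp + tn)   # false positive rate
    return fpr, tpr

print(roc_point(tp=40, fn=10, fp=5, tn=45))   # (0.1, 0.8)
```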
data science
extract insights from large data
less emphasis on algorithms, more on outreach (application)
big data
- very large data (Google, Facebook, Twitter)
- it’s possible: cheap storage
- analytics
- internet-gathered data
- social data
- heterogeneous, unstructured data
exploratory data analysis
- scan data without much prior focus
- find unusual parts in the data
- analyse attribute dependencies: if X=x and Y=y (subgroup) then Z is unusual
- understanding the effects of all attributes on the target
classification
model the dependency of the target attribute on the remaining attributes
subgroup discovery
find all subgroups within the inductive constraints (minimum/maximum coverage, minimum quality, maximum complexity) that show a significant deviation in the distribution of the target attribute
- a quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number, eg. WRAcc, weighted relative accuracy:
WRAcc(S,T) = P(ST) - P(S) * P(T)
(compare to independence: 0 means uninteresting)
(typically produces many patterns with high levels of redundancy)
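A small sketch of WRAcc estimated from counts; the subgroup sizes are hypothetical:

```python
def wracc(n_st, n_s, n_t, n):
    """WRAcc(S, T) = P(ST) - P(S) * P(T), estimated from counts:
    n_st examples in subgroup S with target T, n_s in S, n_t with T, n total."""
    return n_st / n - (n_s / n) * (n_t / n)

# hypothetical subgroup: 60 of its 80 members are positive, 400 of 1000 overall
print(wracc(n_st=60, n_s=80, n_t=400, n=1000))   # 0.028 (> 0, so interesting)
```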
numeric subgroup discovery
numeric target: find subgroups with significantly higher or lower average value
- trade-off between size of subgroup and average target value
we cannot just apply a learner
- algorithms are biased
- considering all possible data distributions, no algorithm works better than another
- algorithms make assumptions about data (eg. all features are equally relevant (kNN, k-means), all features are discrete (ID3), numeric (k-means))
so adapt data to the algorithm (data engineering) or adapt the algorithm to data (selection/parameter tuning)
data engineering
- attribute selection: remove features with little predictive information
- attribute discretization: convert numeric attributes to nominal ones
- data transformations: transform data to another representation
attribute selection
- filter approach: learner independent, based on data properties or simple models built by other learners
- wrapper approach: learner dependent, rerun learner with different attributes, select attributes based on performance (will also consider attributes that are only interesting combined with other attributes, unlike the filter approach, but greedy: O(k^2) for k attributes)
attribute discretization
- discretize values into small intervals
- always loses information: try to preserve as much as possible
- transform into 1 k-valued nominal attribute
- replace with k-1 new binary attributes (preserve order of classes)
unsupervised discretization
determine intervals without knowing class labels
- equal-interval binning (equal-width)
- equal-frequency binning: bins of equal size
- proportional k-interval discretization: equal-frequency binning with # bins = sqrt(dataset size)
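A minimal sketch of equal-width and equal-frequency cut points (also matching the histogram card above); the data values are illustrative:

```python
def equal_width_bins(values, k):
    """Cut the domain into k intervals of equal width; return the k-1 cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_bins(values, k):
    """Choose cut points so each bin holds roughly len(values)/k points."""
    ordered = sorted(values)
    step = len(ordered) / k
    return [ordered[int(i * step)] for i in range(1, k)]

data = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_width_bins(data, 3))       # [71.0, 78.0]
print(equal_frequency_bins(data, 3))   # cut points near the 1/3 and 2/3 quantiles
```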
supervised discretization
class labels are known
- best if all examples in a bin have the same class
eg. entropy discretization: split data in same way C4.5 would (each leaf = bin), use entropy as splitting criterion
data transformations
can lead to new insights in the data and better performance (eg. subtract two “date” attributes to get new “age” attribute)
how good is a regression model?
(for linear data)
coefficient of determination:
R^2 = 1 − SSres/SStot (between 0 and 1)
SSres = Σ(y_i − f_i)^2 (difference between actual and predicted values)
SStot = Σ(y_i − y_bar)^2 (scaling term: difference between the actual values and a horizontal line at the average of all y values)
compare a model where x is used to a model where x does not play a role
explained variance: what percentage of the variance is explained by the model
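A short sketch of the coefficient of determination; the example values are made up:

```python
def r_squared(y_actual, y_predicted):
    """R^2 = 1 - SSres / SStot: fraction of the variance explained by the model."""
    y_bar = sum(y_actual) / len(y_actual)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_actual, y_predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in y_actual)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))   # 0.98
```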
regression trees
- regression variant of decision tree
1. regression tree: constant value in leaf
2. model tree: local linear model in leaf
M5 model (regression trees)
we cannot use entropy (no classification): use standard deviation
- splitting criterion: standard deviation reduction (sdr: compare sd before and after splitting)
SDR = sd(T) – Σ(sd(T_i) * |T_i| / |T|)
- stopping criterion: sd below some threshold or too few examples in a node
- pruning (bottom up): estimate amount of error by splitting
all splits are binary
- numeric: as usual
- nominal: order all values according to average and introduce k-1 indicator variables in this order
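A minimal sketch of the standard deviation reduction used as the splitting criterion; the target values and the split are illustrative:

```python
from statistics import pstdev

def sdr(parent, branches):
    """Standard deviation reduction: SDR = sd(T) - sum(sd(T_i) * |T_i| / |T|)."""
    n = len(parent)
    return pstdev(parent) - sum(pstdev(b) * len(b) / n for b in branches)

# hypothetical split of the target values at one node into two branches
target = [5.0, 5.5, 6.0, 9.0, 9.5, 10.0]
print(sdr(target, [[5.0, 5.5, 6.0], [9.0, 9.5, 10.0]]))   # ≈ 1.63
```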
dissimilar patterns
make sure every additional pattern adds extra value
- consider extent of patterns (what do they represent)
- add binary column for every subgroup
- joint entropy of itemset captures informativeness of a set of items (= set of subgroups)
maximally informative k-itemset (miki)
itemset of size k (k subgroups in it) that maximizes the joint entropy (maximal when all parts are of equal size)
properties:
- p(Xi) = 0.5 has highest entropy
- items in miki are independent
- every individual item adds at most 1 bit of information
- monotonicity: adding an item will always increase the joint entropy
- every item adds at most H(X_i), so a candidate itemset can be discarded if this bound is not above the current maximum
partitions of itemsets
- group items that share information into blocks
- obtain a tighter bound
- precompute the joint entropy of small (2 or 3) itemsets
joint entropy
- H = Σ–p_{i}*lg(p_{i}), summed over the joint value combinations
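A small sketch of the joint entropy of a set of (binary) columns, one row per example; the rows are illustrative:

```python
from collections import Counter
from math import log2

def joint_entropy(rows):
    """H(X1..Xk) = -sum(p * lg(p)) over the observed value combinations."""
    counts = Counter(map(tuple, rows))
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

# two binary columns (e.g. membership in two subgroups), one row per example
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(joint_entropy(rows))   # 2.0 bits: independent and evenly split
```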
itemset
a collection of one or more items, eg. {bread, milk, eggs}
Support count (σ)
frequency of occurrence of an itemset, eg. σ({Bread, Milk, Diapers}) = 2
association rule mining
given a set of transactions, find rules that predict the occurrence of an item based on occurrences of other items in the transaction, X –> Y with X and Y itemsets
support (s) = fraction of transactions that contain both X and Y
confidence (c) = how often items in Y appear in transactions that contain X (how many times does the right hand side occur together with the left hand side)
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
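A minimal sketch of support and confidence over a toy set of transactions (the baskets are made up):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """c(X -> Y): how often Y appears in transactions that contain X."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

T = [{"bread", "milk"}, {"bread", "diapers", "beer"},
     {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"},
     {"bread", "milk", "beer"}]
print(support(T, {"bread", "milk"}))        # 0.6
print(confidence(T, {"bread"}, {"milk"}))   # 0.75
```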
frequent itemset
itemset whose support >= minsup
maximal frequent itemset
if none of its immediate supersets are frequent
closed itemset
if none of its (immediate) supersets have the same support
ensemble
a collection of models that works by aggregating the individual predictions
- classification and regression (supervised learning)
- typically more accurate than base model
- diversity between models improves performance (different models, randomness)
- more randomized models work better
bootstrap aggregating (bagging)
build different models by bootstrapping:
- given a training set D of size n, bagging generates m new training sets, each of size n
- sampling with replacement: for large n, each bootstrap sample contains about 36.8% duplicates –> creates randomness, each time a slightly different model
- performance improves with more learners (large m)
- de-correlate learners by learning from different datasets
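A quick check of the 36.8% / 63.2% figure by drawing one bootstrap sample (the sample size is arbitrary):

```python
import random

def bootstrap_sample(data):
    """Sample n examples with replacement from a training set of size n."""
    return [random.choice(data) for _ in data]

random.seed(0)
n = 10_000
sample = bootstrap_sample(list(range(n)))
print(len(set(sample)) / n)   # ≈ 0.632 unique: about 36.8% of positions are repeats
```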
random subspace method
each tree is built from a random subset of the attributes –> individual trees do not over-focus on attributes that appear highly informative in the training set
works better in high-dimensional problems
random forest
combine bagging and random attribute samples: at each node, take a sample of the attributes and pick the best one
Out-Of-Bag error (OOB)
mean prediction error on each training sample Xi, using only those trees that did not have Xi in their bootstrap sample
measure an attribute’s importance
- make a random forest
- record average OOB error over all the trees => e_0
- for each independent attribute j: randomly permute the values of j, re-evaluate the OOB predictions, record the average OOB error e_j => importance of j = e_j – e_0
benefits of random forests
- good theoretical basis (ensembles)
- easy to run: works well without tuning
- easy to run in parallel: inherent parallelism
clustering: overlapping vs non-overlapping
a point can be a member of more than one cluster vs every point is in one cluster only
clustering: top down vs bottom up
take entire dataset and cut somewhere vs first assume every individual point is a different cluster
clustering: deterministic vs probabilistic
100% sure which cluster a point is in vs the probability of being in a cluster is between 0 and 1
k-means algorithm
1. pick k random points: initial cluster centres
2. assign each point to the nearest cluster centre
3. move the cluster centres to the mean of each cluster
4. reassign points to the nearest cluster centre
repeat steps 3-4 until the cluster centres converge
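A minimal sketch of the four steps above for 2-D points (plain Python, no libraries); function and variable names are illustrative:

```python
import random

def kmeans(points, k, iters=100):
    """Plain k-means on 2-D points: assign to the nearest centre, move centres to means."""
    def dist2(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    centres = random.sample(points, k)          # 1. pick k random points as initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # 2./4. assign each point to the nearest centre
            clusters[min(range(k), key=lambda i: dist2(p, centres[i]))].append(p)
        new_centres = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else centres[i]
            for i, cl in enumerate(clusters)    # 3. move each centre to the mean of its cluster
        ]
        if new_centres == centres:              # stop when the centres no longer move
            break
        centres = new_centres
    return centres, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centres, clusters = kmeans(pts, k=2)            # two well-separated clusters
```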
k-means: discussion
+ simple, understandable
+ fast
+ instances automatically assigned to clusters
- results can vary depending on initial choice of centres
- can get trapped in local minimum (restart with different random seed)
- must pick number of clusters beforehand
- all instances forced into a single cluster
- sensitive to outliers
- random algorithm, random results
distance between clusters (hierarchical clustering)
- centroid: distance between the cluster centres
- single link: smallest distance between points
- complete link: largest distance between points
- average link: average distance between points
hierarchical clustering advantages
- you don’t have to choose the number of clusters beforehand (as in k-means); you can choose it afterwards and see what amount of detail you want (draw a line in the dendrogram)
- can be applied to nominal data more easily than k-means
complexity of association rules
given d unique items
- total nr possible itemsets: 2^d
- total nr possible association rules: 3^d - 2^(d+1) + 1
in case of a frequent k-itemset, the total nr of possible association rules = 2^k - 2
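A quick arithmetic check of the two formulas for d = 6 items and a frequent 3-itemset:

```python
d, k = 6, 3
print(2 ** d)                      # 64 possible itemsets
print(3 ** d - 2 ** (d + 1) + 1)   # 602 possible association rules
print(2 ** k - 2)                  # 6 rules from one frequent 3-itemset
```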