exam Flashcards

1
Q

supervised learning

A
  • how does the output depend on the input?
  • can we predict the output, given the input?
  • classification, regression
2
Q

unsupervised learning

A
  • no target
  • clustering and frequent pattern mining (find dependencies between variables)

3
Q

univariate distribution

A

what values occur for an attribute and how often; the data is a sample of a population

4
Q

entropy H(A)

A

how informative is an attribute (about the value of another attribute)?
a measure of the amount of information/chaos

5
Q

distribution of a binary attribute

A

H(A) = –p*lg(p) – (1–p)*lg(1–p)

H(A) is maximal (1) when p = 1/2 = 1/m (m = nr of possible values)

6
Q

distribution of a nominal attribute

A

H(A) = Σ–p_{i}*lg(p_{i})
H maximal when every p_{i} = 1/m
H_max = lg(m)

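A minimal sketch in Python (standard library only; the example proportions are made up) of this entropy formula, covering both the binary case and the general nominal case:

```python
from math import log2

def entropy(probs):
    """H(A) = -sum(p_i * lg(p_i)) over the value proportions of attribute A."""
    return -sum(p * log2(p) for p in probs if p > 0)

# binary attribute: maximal entropy of 1 bit at p = 1/2
print(entropy([0.5, 0.5]))        # 1.0
print(entropy([0.9, 0.1]))        # ~0.469

# nominal attribute with m = 4 equally likely values: H_max = lg(4) = 2 bits
print(entropy([0.25] * 4))        # 2.0
```
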
7
Q

histograms

A

define cut points and count occurrences within bins

  • equal width: cut the domain into k equal-sized intervals
  • equal height: select k cut points such that all bins contain n/k data points

8
Q

information gain

A

gain(A) = H(p) – ΣH(p_{i})*n_{i}/n

p probability of positive in current set (before split)
n number of examples in current set (before split)
p_{i} probability of positive in branch i (after split)
n_{i} number of examples in branch i (after split)

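A small sketch of the gain formula (plain Python; the branch counts are invented, chosen to mimic a classic weather-data split):

```python
from math import log2

def H(p):
    """Binary entropy of a set with positive proportion p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def info_gain(pos, neg, branches):
    """gain(A) = H(p) - sum over branches of H(p_i) * n_i / n."""
    n = pos + neg
    after = 0.0
    for pos_i, neg_i in branches:      # per-branch positive/negative counts
        n_i = pos_i + neg_i
        after += H(pos_i / n_i) * n_i / n
    return H(pos / n) - after

# 9 positives and 5 negatives split into three branches
print(info_gain(9, 5, [(2, 3), (4, 0), (3, 2)]))   # ~0.247
```
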
9
Q

problem highly-branching attributes

A

attributes with a large number of values are more likely to create pure subsets => information gain is biased towards choosing attributes with many values (may result in overfitting: not optimal for prediction)
=> solution: gain ratio

10
Q

Gain ratio

A

modification of the information gain that reduces its bias towards high-branching attributes => large when the data is divided into a few even groups, small when each example belongs to a separate branch

GainRatio(S,A) = Gain(S,A)/IntrinsicInfo(S,A)

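A sketch of the correction (plain Python; the gain value and branch sizes are illustrative): the intrinsic information only depends on how many examples fall into each branch, so many tiny branches are penalized.

```python
from math import log2

def intrinsic_info(branch_sizes, n):
    """Entropy of the split itself: -sum(n_i/n * lg(n_i/n))."""
    return -sum(n_i / n * log2(n_i / n) for n_i in branch_sizes if n_i > 0)

def gain_ratio(gain, branch_sizes):
    n = sum(branch_sizes)
    return gain / intrinsic_info(branch_sizes, n)

# same gain, different split shapes: many tiny branches are penalized heavily
print(gain_ratio(0.247, [5, 4, 5]))   # even 3-way split -> modest penalty (~0.157)
print(gain_ratio(0.247, [1] * 14))    # one example per branch -> heavy penalty (~0.065)
```
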
11
Q

intrinsic information

A

the entropy of distribution of instances into branches: the more branches, the higher this entropy

12
Q

entropy over numeric attributes

A

many possible split points:
evaluate the information gain for every possible split point and choose the best one; its info gain becomes the information gain for the attribute (breakpoints between values of the same class cannot be optimal)
13
Q

pruning

A

prevent overfitting to noise in the data

prepruning, postpruning

14
Q

prepruning

A
stop growing the tree when there is no statistically significant dependence between an attribute and the class at a particular node
- chi-squared test

problem: prepruning might stop the process too soon (early stopping), eg. the XOR problem (no individual attribute exhibits any significant association to the class, only visible in the expanded tree)

15
Q

postpruning

A

first, build full tree

then, prune: subtree replacement (bottom up, consider replacing a tree only after considering all its subtrees)

16
Q

estimating error rates

A

avoid overfitting: don’t model details that will produce errors on future data => estimate such errors

  • prune only if it reduces the estimated error
  • compute confidence interval around observed error rate (take high end as worst case)
  • C4.5: error estimate for subtree is weighted sum of error estimates for all its leaves
17
Q

covering algorithms

A

for each class in turn, identify a rule set that covers all examples in it

  • a single rule describes a convex concept: describes only examples of the same class
  • generate a rule by adding tests that maximize the rule’s accuracy
  • each additional test reduces the rule’s coverage
18
Q

selecting a test (covering algorithms)

A

goal: maximize accuracy
- t total number of examples covered by the rule
- p positive examples of the class covered by the rule
- select test that maximizes p/t (finished when p/t = 1)

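A minimal sketch of the selection step (the candidate tests and their (p, t) counts are invented):

```python
# candidate tests for the current rule, each with (p, t) = (positives covered, total covered)
candidates = {
    "astigmatism = yes": (4, 12),
    "tear rate = normal": (4, 12),
    "age = young": (2, 8),
}

def accuracy(pt):
    p, t = pt
    return p / t

best = max(candidates, key=lambda test: accuracy(candidates[test]))
print(best, accuracy(candidates[best]))   # the test with the highest p/t is added to the rule
```
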
19
Q

PRISM algorithm

A

(separate-and-conquer: all examples covered by a rule are separated out, remaining examples are conquered)

for each class, make rules by adding tests (maximizing p/t) until the rule is perfect and until each example of the class has been covered by a rule

20
Q

instance-based / rote / lazy learning

A

training set is searched for example that most closely resembles the new example, examples themselves represent the knowledge
- (k-)NN

21
Q

distance function

A
  • simplest case: 1 numeric attribute (difference between the two values)
  • several numeric attributes: Euclidean distance + normalization (square root of the sum of the squared differences between the two values, over all attributes)
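
A sketch (plain Python, invented example rows) of the normalized Euclidean distance: rescale every attribute to [0, 1] first so no single attribute dominates the sum.

```python
from math import sqrt

def min_max_normalize(dataset):
    """Rescale every numeric attribute to [0, 1] (assumes no constant columns)."""
    lows  = [min(col) for col in zip(*dataset)]
    highs = [max(col) for col in zip(*dataset)]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in dataset]

def euclidean(a, b):
    """Square root of the sum of squared attribute-wise differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

data = [[180, 75000], [160, 30000], [175, 52000]]   # e.g. height (cm), income
norm = min_max_normalize(data)
print(euclidean(norm[0], norm[1]))   # both attributes now contribute comparably
```
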
22
Q

1-NN discussion

A
  • can be accurate, but is slow
  • assumes all attributes are equally important (solution: attribute selection or add weights)
  • take majority vote over k nearest neighbours (solution for noisy examples)
23
Q

proper test and training procedure uses 3 sets

A

training set: train models
validation set: optimize algorithm parameters
test set: evaluate final model

24
Q

trade-off: performance vs evaluation accuracy

A

more training data => better model

more test data => more accurate error estimate

25
Q

ROC analysis

A

Receiver Operating Characteristic

compute TPR = TP/(TP+FN) and FPR = FP/(FP+TN)

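A tiny sketch computing both ROC coordinates from confusion-matrix counts (the counts are invented):

```python
def roc_point(tp, fp, fn, tn):
    """One point in ROC space: (FPR, TPR)."""
    tpr = tp / (tp + fn)   # true positive rate (sensitivity, recall)
    fpr = fp / (fp + tn)   # false positive rate
    return fpr, tpr

print(roc_point(tp=60, fp=10, fn=20, tn=110))   # (~0.083, 0.75)
```
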
26
Q

data science

A

extract insights from large data

less emphasis on algorithms, more on outreach (application)

27
Q

big data

A
  • very large data (Google, Facebook, Twitter)
  • it’s possible: cheap storage
  • analytics
  • internet-gathered data
  • social data
  • heterogeneous, unstructured data
28
Q

exploratory data analysis

A
  • scan data without much prior focus
  • find unusual parts in the data
  • analyse attribute dependencies: if X=x and Y=y (subgroup) then Z is unusual
  • understanding the effects of all attributes on the target
29
Q

classification

A

model the dependency of the target attribute on the remaining attributes

30
Q

subgroup discovery

A

find all subgroups within the inductive constraints (minimum/maximum coverage, minimum quality, maximum complexity) that show a significant deviation in the distribution of the target attribute

  • a quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number, eg. WRAcc, weighted relative accuracy:
    WRAcc(S,T) = P(ST) - P(S) * P(T)
    (compare to independence: 0 means uninteresting)

(typically produces many patterns with high levels of redundancy)

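A sketch of WRAcc from the counts behind a subgroup's confusion matrix (the counts are invented):

```python
def wracc(n_subgroup_and_target, n_subgroup, n_target, n_total):
    """WRAcc(S, T) = P(S and T) - P(S) * P(T); 0 means S and T are independent."""
    p_st = n_subgroup_and_target / n_total
    p_s  = n_subgroup / n_total
    p_t  = n_target / n_total
    return p_st - p_s * p_t

# 1000 examples, 200 in the subgroup, 300 positives overall, 120 positives in the subgroup
print(wracc(120, 200, 300, 1000))   # 0.12 - 0.2 * 0.3 = 0.06
```
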
31
Q

numeric subgroup discovery

A

numeric target: find subgroups with significantly higher or lower average value
- trade-off between size of subgroup and average target value

32
Q

we cannot just apply a learner

A
  • algorithms are biased
  • considering all possible data distributions, no algorithm works better than another
  • algorithms make assumptions about the data (eg. all features are equally relevant (kNN, k-means), all features are discrete (ID3) or numeric (k-means))

so adapt data to the algorithm (data engineering) or adapt the algorithm to data (selection/parameter tuning)

33
Q

data engineering

A
  • attribute selection: remove features with little predictive information
  • attribute discretization: convert numeric attributes to nominal ones
  • data transformations: transform data to another representation
34
Q

attribute selection

A
  • filter approach: learner independent, based on data properties or simple models built by other learners
  • wrapper approach: learner dependent, rerun learner with different attributes, select attributes based on performance (will also consider attributes that are only interesting combined with other attributes, unlike the filter approach, but greedy: O(k^2) for k attributes)
35
Q

attribute discretization

A
  • discretize values into small intervals
  • always loses information: try to preserve as much as possible
  1. transform into 1 k-valued nominal attribute
  2. replace with k-1 new binary attributes (preserve order of classes)
36
Q

unsupervised discretization

A

determine intervals without knowing class labels

  1. equal-interval binning (equal-width)
  2. equal-frequency binning: bins of equal size
    2b. proportional k-interval discretization: equal-frequency binning with # bins = sqrt(dataset size)
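
A sketch of both unsupervised binning schemes (plain Python; the value list and k are illustrative):

```python
def equal_width_cut_points(values, k):
    """Cut the domain into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cut_points(values, k):
    """Choose k-1 cut points so every bin holds roughly n/k values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k] for i in range(1, k)]

values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_width_cut_points(values, 3))       # [71.0, 78.0]
print(equal_frequency_cut_points(values, 3))   # [70, 75]
```
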
37
Q

supervised discretization

A

class labels are known

  • best if all examples in a bin have the same class
    eg. entropy discretization: split data in same way C4.5 would (each leaf = bin), use entropy as splitting criterion
38
Q

data transformations

A

can lead to new insights in the data and better performance (eg. subtract two “date” attributes to get new “age” attribute)

39
Q

how good is a regression model?

A

(for linear data)
coefficient of determination:

R^2 = 1 - SSres/SStot (between 0 and 1)
SSres =  Σ(yi−fi)^2  (difference between actual and predicted value)
SStot = Σ(yi−y_bar)^2  (scaling term: difference between actual value and horizontal line at average level of all y values)

compare a model where x is used to a model where x does not play a role

explained variance: what percentage of the variance is explained by the model

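A sketch of the coefficient of determination (the target and prediction values are made up):

```python
def r_squared(y, predictions):
    """R^2 = 1 - SSres / SStot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, predictions))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

y    = [3.0, 5.0, 7.0, 9.0]
pred = [2.8, 5.3, 6.9, 9.1]
print(r_squared(y, pred))   # ~0.99: almost all variance is explained by the model
```
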
40
Q

regression trees

A
  • regression variant of decision tree
    1. regression tree: constant value in leaf
    2. model tree: local linear model in leaf
41
Q

M5 model (regression trees)

A

we cannot use entropy (no classification): use standard deviation

  • splitting criterion: standard deviation reduction (sdr: compare sd before and after splitting)
    SDR = sd(T) – Σ(sd(Ti) * |Ti| / |T|)
  • stopping criterion: sd below some threshold or too few examples in node
  • pruning (bottom up): estimate amount of error by splitting

all splits are binary

  • numeric: as usual
  • nominal: order all values according to the average target value and introduce k-1 indicator variables in this order
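
A sketch of the standard deviation reduction criterion (plain Python; the target values and the candidate split are invented):

```python
from statistics import pstdev   # population standard deviation

def sdr(targets, branches):
    """SDR = sd(T) - sum(sd(T_i) * |T_i| / |T|)."""
    n = len(targets)
    return pstdev(targets) - sum(pstdev(b) * len(b) / n for b in branches)

targets  = [1.0, 1.2, 1.1, 5.0, 5.3, 4.9]
branches = [[1.0, 1.2, 1.1], [5.0, 5.3, 4.9]]   # a candidate binary split
print(sdr(targets, branches))   # large reduction: the split separates the values well
```
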
42
Q

dissimilar patterns

A

make sure every additional pattern adds extra value

  • consider extent of patterns (what do they represent)
  • add binary column for every subgroup
  • joint entropy of itemset captures informativeness of a set of items (= set of subgroups)
43
Q

maximally informative k-itemset (miki)

A

itemset of size k (k subgroups in it) that maximizes the joint entropy (maximal when all parts are of equal size)
properties:
- p(Xi) = 0.5 has highest entropy
- items in miki are independent
- every individual item adds at most 1 bit of information
- monotonicity: adding an item never decreases the joint entropy
- every item adds at most H(Xi), so a candidate itemset can be discarded if this upper bound does not exceed the current maximum

44
Q

partitions of itemsets

A
  • group items that share information into blocks
  • obtain a tighter bound
  • precompute the joint entropy of small (2 or 3) itemsets
45
Q

joint entropy

A
  • –Σ p_i*lg(p_i), where the p_i are the probabilities of the joint value combinations of the items
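
A sketch of the joint entropy of a set of binary items/columns (the toy rows are invented): count how often each combination of values occurs and apply the entropy formula to those joint probabilities.

```python
from collections import Counter
from math import log2

def joint_entropy(rows):
    """H(X_1, ..., X_k) = -sum(p * lg(p)) over the observed value combinations."""
    counts = Counter(tuple(row) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

# two binary items (columns) over six transactions
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (0, 0)]
print(joint_entropy(rows))   # ~1.92, at most lg(4) = 2 bits for two binary items
```
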
46
Q

itemset

A

a collection of one or more items, eg. {bread, milk, eggs}

47
Q

Support count (σ)

A

frequency of occurrence of an itemset, eg. σ({Bread, Milk, Diapers}) = 2

48
Q

association rule mining

A

given a set of transactions, find rules that predict the occurrence of an item based on occurrences of other items in the transaction, X –> Y with X and Y itemsets

support (s) = fraction of transactions that contain both X and Y

confidence (c) = how often items in Y appear in transactions that contain X (how many times does the right hand side occur together with the left hand side)
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

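A sketch computing support and confidence of a rule X → Y from a toy transaction list (the transactions are invented):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions containing all items in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """How often Y appears in transactions that contain X: s(X u Y) / s(X)."""
    return support(x | y) / support(x)

print(support({"milk", "diapers", "beer"}))        # 2/5 = 0.4
print(confidence({"milk", "diapers"}, {"beer"}))   # 2/3
```
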
49
Q

frequent itemset

A

itemset whose support >= minsup

50
Q

maximal frequent itemset

A

if none of its immediate supersets are frequent

51
Q

closed itemset

A

if none of its (immediate) supersets have the same support

52
Q

ensemble

A

a collection of models that works by aggregating the individual predictions

  • classification and regression (supervised learning)
  • typically more accurate than base model
  • diversity between models improves performance (different models, randomness)
  • more randomized models work better
53
Q

bootstrap aggregating (bagging)

A

build different models by bootstrapping:

  • given a training set D of size n, bagging generates m new training sets, each of size n
  • sampling with replacement: for large n, each set has 36.8% duplicates –> creates randomness, each time a slightly different model
  • performance improves with more learners (large m)
  • de-correlate learners by learning from different datasets
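
A sketch of the bootstrap step only (plain Python; the dataset size and m are illustrative); one base model would then be trained per sample:

```python
import random

def bootstrap_samples(dataset, m, seed=0):
    """Generate m training sets of size n = len(dataset), sampling with replacement."""
    rng = random.Random(seed)
    n = len(dataset)
    return [[rng.choice(dataset) for _ in range(n)] for _ in range(m)]

dataset = list(range(1000))              # stand-in for 1000 training examples
samples = bootstrap_samples(dataset, m=10)
distinct = len(set(samples[0])) / len(dataset)
print(distinct)                          # ~0.632: about 36.8% of the originals are missing
```
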
54
Q

random subspace method

A

each tree is built from a random subset of the attributes –> individual trees do not over-focus on attributes that appear highly informative in the training set

works better in high-dimensional problems

55
Q

random forest

A

combine bagging and random attribute samples: at each node, take a sample of the attributes and pick the best one

56
Q

Out-Of-Bag error (OOB)

A

mean prediction error on each training sample Xi, using only those trees that did not have Xi in their bootstrap sample

57
Q

measure an attribute’s importance

A
  • make a random forest
  • record average OOB error over all the trees => e_0
  • for each independent attribute j: swap-randomize j, refit the random forest, record the average OOB error e_j => importance of j = e_j - e_0
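
A hedged sketch of a common variant of this procedure (assuming scikit-learn and numpy, with synthetic data): instead of refitting, the fitted forest is re-evaluated with attribute j permuted, and a held-out validation set stands in for the OOB samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
e_0 = 1 - rf.score(X_val, y_val)                   # baseline error e_0

rng = np.random.default_rng(0)
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # swap-randomize attribute j
    e_j = 1 - rf.score(X_perm, y_val)
    print(f"attribute {j}: importance = {e_j - e_0:.3f}")
```
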
58
Q

benefits of random forests

A
  • good theoretical basis (ensembles)
  • easy to run: works well without tuning
  • easy to run in parallel: inherent parallelism
59
Q

clustering: overlapping vs non-overlapping

A

a point can be a member of more than one cluster vs every point is in one cluster only

60
Q

clustering: top down vs bottom up

A

take entire dataset and cut somewhere vs first assume every individual point is a different cluster

61
Q

clustering: deterministic vs probabilistic

A

100% sure about in which cluster you are vs probability of being in a cluster is between 0 and 1

62
Q

k-means algorithm

A
  1. pick k random points: initial cluster centres
  2. assign each point to nearest cluster centre
  3. move cluster centres to mean of each cluster
  4. reassign points to nearest cluster centre
    repeat steps 3-4 until cluster centres converge
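
A compact sketch of these steps (assuming numpy; k and the synthetic 2-D points are illustrative):

```python
import numpy as np

def k_means(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. pick k random points as initial cluster centres
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # 2./4. assign each point to its nearest cluster centre
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. move each centre to the mean of its cluster (keep old centre if empty)
        new_centres = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centres[c]
            for c in range(k)
        ])
        if np.allclose(new_centres, centres):   # centres converged
            break
        centres = new_centres
    return centres, labels

points = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centres, labels = k_means(points, k=2)
print(centres)   # roughly (0, 0) and (5, 5), in some order
```
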
63
Q

k-means: discussion

A

+ simple, understandable
+ fast
+ instances automatically assigned to clusters

  • results can vary depending on initial choice of centres
  • can get trapped in local minimum (restart with different random seed)
  • must pick number of clusters beforehand
  • all instances forced into a single cluster
  • sensitive to outliers
  • random algorithm, random results
64
Q

centroid: distance between clusters

A

single link: smallest distance between points
complete link: largest distance between points
average link: average distance between points

65
Q

hierarchical clustering advantages

A
  • you don’t have to choose the number of clusters beforehand (as with k-means); you can choose it afterwards and see what amount of detail you want (draw a line in the dendrogram)
  • can be applied to nominal data more easily than k-means
66
Q

complexity of association rules

A

given d unique items

  • total nr possible itemsets: 2^d
  • total nr possible association rules: 3^d - 2^(d+1) + 1

in case of a frequent k-itemset, the total nr of possible association rules = 2^k - 2
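
A quick check of these counts (d and k are illustrative):

```python
def n_itemsets(d):
    return 2 ** d

def n_association_rules(d):
    return 3 ** d - 2 ** (d + 1) + 1

def n_rules_from_frequent_itemset(k):
    return 2 ** k - 2

print(n_itemsets(6))                     # 64
print(n_association_rules(6))            # 729 - 128 + 1 = 602
print(n_rules_from_frequent_itemset(3))  # 6
```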