Exam Flashcards
Knowledge Discovery
about understanding a domain with interpretable models
Prediction
setting where only accurate prediction matters; the methods you use (and how they work) do not matter
Black box setting
you don’t care about how the model works
Classification
- predicts future cases of a binary class.
- models the dependency of the target on the other attributes.
- Sometimes a black-box classifier.
- some attributes may not appear because of overshadowing in decision trees.
- Supervised learning
Regression
tries to predict a numeric target variable
Clustering
divides a dataset into groups of similar cases
Frequent Patterns/Association
finds dependencies between variables
Support Vector Machine
finds a single line (hyperplane) through the dataset that gives a good boundary between the positive and negative examples
Neural Network
like the Support Vector Machine, but the decision boundary can be a curve (non-linear)
iid
independent and identically distributed
Nominal
categorical/discrete, can only test for equality
Numeric
can test for inequalities and can use arithmetic or distance measure
Ordinal
can compare inequalities as well, but not use arithmetic or distance measure
Binary
Nominal variable with only two values
Entropy
measure of the amount of information/chaos; highest when the probability is distributed equally over the values (each value has probability 1/m); unsupervised
Entropy formula
H(p) = –p·lg(p) – (1–p)·lg(1–p), where p is the probability of a value; general form: H = –Σi pi·lg(pi)
Max Entropy formula
–m·(1/m)·lg(1/m) = –lg(1/m) = lg m (for m equally likely values)
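A minimal Python sketch (not from the cards; the counts are made up) of the entropy formula, checking that a uniform distribution over m values gives lg m:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a distribution given by value counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

print(entropy([5, 5]))        # uniform over 2 values -> 1.0 bit (= lg 2)
print(entropy([2, 2, 2, 2]))  # uniform over 4 values -> 2.0 bits (= lg 4)
print(entropy([8, 0]))        # pure node -> 0.0 bits
```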
Cumulative distribution function(CDF)
gives, for each value, the fraction of data points less than or equal to that value (the density summed up from the left)
Probability density function(PDF)
the derivative of the CDF: the relative density of points at each value. Density is not probability; the peak is where most values lie and thus where the density is highest
Histograms
estimates density in a discrete way by defining cut points and counting occurrences per created bin; unsupervised method
Histograms Equal width
the bins are cut in equal size intervals
Histograms Equal height
the bins are cut so every bin contains about the same amount of data points
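A small sketch (using numpy; the data and bin count are made up) contrasting the two binning strategies: equal width cuts the value range into same-sized intervals, equal height cuts at quantiles so each bin holds roughly the same number of points:

```python
import numpy as np

values = np.array([1, 2, 2, 3, 4, 4, 4, 5, 9, 20], dtype=float)
k = 4  # number of bins

# Equal width: cut the range [min, max] into k equally sized intervals
width_edges = np.linspace(values.min(), values.max(), k + 1)

# Equal height (equal frequency): cut at quantiles so counts per bin are ~equal
height_edges = np.quantile(values, np.linspace(0, 1, k + 1))

print("equal width bins :", np.histogram(values, bins=width_edges)[0])
print("equal height bins:", np.histogram(values, bins=height_edges)[0])
```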
Kernel(Gaussian) Density Estimation
estimates the density of the population from a sample by placing a kernel (e.g., a Gaussian) on each data point and summing them
Downside Entropy
the entropy concept does not apply well to numeric data, only to nominal data
Confusion Matrix/Contingency table
describes the frequencies of the four combinations of subgroup and target: positive and negative within the subgroup, and positive and negative outside the subgroup
Confusion matrix distribution
if you have the joint distribution, you implicitly also have the univariate (marginal) distributions
methods of quantifying information between attributes
joint entropy, mutual information, information gain
joint distributions dependency
if the counts along the diagonal of the confusion matrix are high, one value is (somewhat) dependent on the other. Fully determined means all probability mass lies on the diagonal: one variable completely fixes the other
Density problem visualisation
with scatterplots, you cannot see the density because of the datapoints overlapping at the same spot
Information gain
in decision trees: the entropy before the split compared to the entropy after the split; a non-negative number for each attribute; supervised
Uncertainty
a class attribute contains uncertainty over values. this is captured by the entropy of the target
top-down tree construction
all training examples are at the root at the start of building. partition the examples recursively by choosing one attribute each time
bottom-up tree pruning
- removes subtrees or branches in a bottom-up manner.
- goal is to improve the estimated accuracy on new cases
What does the goodness function do?
- at each node, ATTRIBUTES are SELECTED based on how well they SEPARATE the CLASSES of the training examples.
- Examples: information gain (ID3/C4.5), information gain ratio, gini index
Heuristic attribute selection
chooses the attribute that produces the purest nodes
How does Information gain do attribute selection?
- uses entropy of the class attribute.
- Information gain increases with the average purity of the subsets that an attribute produces.
- chooses the attribute that results in the highest information gain
Entropy pure and mixed nodes
H(0) = 0 and H(1) = 0 give pure nodes (skewed distribution); H(0.5) = 1 gives a maximally mixed node (uniform distribution)
Information gain formula
gain = H(p) – Σi (ni/n)·H(pi): the information before the split minus the weighted information after the split; the maximum information gain is lg(#values)
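A sketch of the formula in Python (the class counts are the classic weather-data split on outlook, used only as an illustration): gain is the parent entropy minus the size-weighted entropy of the children:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, child_counts_list):
    """H(parent) minus the size-weighted entropy of the child nodes."""
    n = sum(parent_counts)
    after = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - after

# 9 positives / 5 negatives, split into subsets of 5, 4 and 5 examples
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.247 bits
```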
Information gain bias
biased towards choosing attributes with a large number of values, this may result in overfitting
Overfitting
selection of an attribute that is non-optimal for prediction
Gain ratio + when large/small?
- Is a modification of the information gain that reduces its bias on high-branching attributes.
- large when data is divided into few, even groups.
- small when each example belongs to a separate branch
Intrinsic information
entropy of distribution of instances into branches
gain ratio formula
Gain(S, A) / Intrinsic Info (S, A)
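A sketch of the same split with the gain ratio (illustrative counts as above): the intrinsic information is the entropy of the branch sizes, ignoring the class:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts_list):
    n = sum(parent_counts)
    sizes = [sum(child) for child in child_counts_list]
    gain = entropy(parent_counts) - sum(s / n * entropy(child)
                                        for s, child in zip(sizes, child_counts_list))
    intrinsic = entropy(sizes)  # entropy of the split sizes, ignoring the class
    return gain / intrinsic

# gain ~0.247, intrinsic info ~1.577 -> gain ratio ~0.157
print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))
```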
Intrinsic information assumption
the importance of the attribute decreases as its intrinsic information gets larger in the gain ratio formula
Standard fix
an ad hoc test to prevent the gain ratio from overcompensating and choosing an attribute just because its intrinsic information is low (e.g., only consider attributes with at least average information gain)
Splitting numeric attributes
- standard method: binary splits (e.g., temp < 45); evaluate the info gain (or another measure) for every possible split point of the attribute.
- the info gain of the best split point is the info gain of the attribute.
- numeric attributes are computationally more demanding.
If an attribute A has high info gain at the root of the tree, does it always appear in a decision tree?
No.
If it is highly correlated with another attribute B, and gain(B) > gain(A), then B will appear in the tree, and further splitting on A will not be useful.
Can an attribute appear more than once in a decision tree?
Yes
If a test is not at the root of the tree, it can appear in different branches
Can an attribute appear more than once on a single path in the tree (from root to leaf)?
Yes
Numeric attributes can appear more than once, but only with very different numeric conditions
If an attribute A has gain (A)=0 (at the root), can it ever appear in a decision tree?
Yes.
- All attributes may have zero info gain
- info gain often changes when splitting on another attribute, think of the XOR-problem
Is a tree with only pure leaves always the best classifier you can have?
No.
This tree is the best classifier on the training set, but possibly not on new and unseen data. Because of overfitting, the tree may not generalize very well.
Goal Pruning
to prevent overfitting to noise in the data, strategies: pre-pruning and post-pruning
pre-pruning
stop growing a branch when information becomes unreliable
post-pruning
take a fully-grown decision tree and discard unreliable parts
chi-squared test
- most popular statistical significance test for pre-pruning.
- tests for a statistically significant dependency between an attribute and the class in a node
ID3
- method that uses the chi-squared test in addition to information gain
- only selects statistically significant attributes
XOR/Parity-problem
- the classic failure case of early stopping
- pre-pruning stops the growth process prematurely because no individual attribute shows significant gain, even though combinations of attributes do
subtree replacement
post-pruning bottom-up approach. considers replacing a tree only after considering all of its subtrees
C4.5
decision tree method that derives the confidence interval from the training data
Observed error rate
error rate from the training set
Actual error rate
error rate from the test set, which is unknown
Confidence interval
used to estimate how much the actual error rate may differ from the observed error rate
Covering approach
for each class in turn, find the rule set that covers all examples in it (excluding examples not in the class)
PRISM
algorithm based on the covering approach; adds tests that maximize a rule's accuracy.
goal of a test for rules
- to maximize accuracy
- p/t = 1 when the rule covers only positive examples
- t = total number of examples covered by the rule
- p = positive examples covered by the rule
separate-and-conquer algorithms
PRISM is an example; it handles one class at a time. It starts with a rule that separates (covers) some examples of the class, then the remaining examples are conquered by further rules
divide-and-conquer algorithms
decision tree induction is an example: the data is split into subsets that are each explored further (in contrast, a subset covered by a rule in separate-and-conquer does not need to be explored any further)
rote(routine) learning
- classifies a test example by searching the training set for similar (identical) examples
- related methods: nearest neighbour (k-NN)
instance-based learning
- lazy learning
- similarity function defines what is learned.
- methods include nearest neighbour, k-nearest neighbours
Euclidean distance
measures the distance (difference) between two examples over their attribute values: d(x, y) = √(Σi (xi – yi)²)
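A minimal 1-NN sketch in Python (the training points and labels are made up): compute the Euclidean distance to every training example and return the label of the closest one:

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def one_nn(train, query):
    """Label of the training example closest to the query point."""
    features, label = min(train, key=lambda ex: euclidean(ex[0], query))
    return label

train = [((1.0, 1.0), "pos"), ((1.2, 0.8), "pos"), ((5.0, 5.0), "neg")]
print(euclidean((1.0, 1.0), (5.0, 5.0)))  # ~5.66
print(one_nn(train, (0.9, 1.1)))          # 'pos'
```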
Curse of 1-NN
- accurate, but suffers from the curse of dimensionality: added dimensions increase the distances and exponentially more training data is needed.
- it is also slow
- remedy: weighted attributes or attribute selection
Simple models advantage
a simpler model should perform well on unseen data drawn from the same distribution
error rate formula
#errors/#examples
accuracy formula
#successes/#examples
rules data sets 1
never evaluate on training data
rules data sets 2
never train on test data(that includes parameter setting or feature selection)
proper procedure testing
- training set to train models
- validation set to optimize algorithm parameters
- test set to evaluate the final model
trade-off evaluation data sets
- all the data can be used to build the final classifier, however:
- trade-off between performance evaluation and accuracy
k-fold Cross-validation
splits the data (stratified) into k folds; trains and tests k times and averages the results. k = 10 is usually enough to reduce the sampling bias
Leave-One-Out Cross-validation
the number of folds equals the number of examples. It makes the best use of the data, no sampling bias. However, it is computationally expensive
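A sketch of the splitting logic behind k-fold cross-validation (no stratification, plain index lists): every example lands in the test fold exactly once, and setting k equal to the number of examples gives leave-one-out:

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

for train, test in k_fold_indices(n=10, k=5):
    print("train:", train, "test:", test)
# Average the k test scores; k = n turns this into leave-one-out.
```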
ROC
- method that illustrates the diagnostic ability of a binary classifier.
- created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
- the higher the ROC curve lies (towards the top-left corner), the better the classifier performs.
TPR(True Positive Rate)
true positives/total positives
FPR(False Positive Rate)
false positives/total negatives
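A tiny sketch computing both rates from confusion-matrix counts (counts are made up); each classifier threshold yields one (FPR, TPR) point on the ROC curve:

```python
def rates(tp, fp, fn, tn):
    tpr = tp / (tp + fn)  # true positives / all actual positives
    fpr = fp / (fp + tn)  # false positives / all actual negatives
    return tpr, fpr

print(rates(tp=40, fp=10, fn=10, tn=40))  # (0.8, 0.2): one point on the ROC curve
```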
Significance tests
- quantify the confidence that an observed difference is real.
- enough evidence to reject the null hypothesis?
- with cross-validation scores, is B better than A?
Subgroup discovery Task
- a mix between regression and classification
- find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute
Inductive constraints examples
- MAXIMUM COVERAGE,
- base rate when compared to the sample size,
- minimum quality,
- information gain (χ², WRAcc)
Maximum complexity rule
the fewer attributes, the better
Confusion matrix diagonals
- High numbers across the TT-FF diagonal mean a positive correlation between subgroup and target.
- High numbers across the TF-FT diagonal mean a negative correlation between subgroup and target
quality measure
- expresses the interestingness of a subgroup's confusion matrix in a single number
- popular is the WRAcc(weighted relative accuracy)
WRAcc formula
WRAcc(Subgroup, Target) = p(ST) – p(S)·p(T)
- compares p(ST) to what would be expected under independence
- balances the coverage (size of the subgroup) against the unexpectedness (deviation of the target)
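A sketch of WRAcc from the four subgroup/target counts (numbers made up): the observed p(ST) is compared to the product p(S)·p(T) expected under independence:

```python
def wracc(st, s_not_t, not_s_t, not_s_not_t):
    """WRAcc = p(ST) - p(S) * p(T), computed from confusion-matrix counts."""
    n = st + s_not_t + not_s_t + not_s_not_t
    p_st = st / n
    p_s = (st + s_not_t) / n
    p_t = (st + not_s_t) / n
    return p_st - p_s * p_t

# Subgroup covers 30 of 100 examples, 25 of them positive; 40 positives overall
print(wracc(st=25, s_not_t=5, not_s_t=15, not_s_not_t=55))  # 0.25 - 0.3*0.4 = 0.13
```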
examples quality measures
WRAcc, information gain, χ², correlation coefficient, Laplace, Jaccard, specificity
Regression subgroup discovery
numeric target has an order and a scale
ordinal subgroup discovery
numeric target has an order
ranked subgroup discovery
numeric target has an order or a scale
Numeric Subgroup Discovery intuitions
- Larger subgroups are more reliable
- the majority of objects appear at the top(yes/true)
- the middle of the subgroup should differ from the middle of the ranking
- objects should have a similar rank
Classical subgroup discovery
- Nominal targets(classification)
- Numeric targets(regression)
Exceptional model mining(extension subgroup discovery)
- multiple targets
- regression, correlation
- multi-label classification
Statistical validation determines:
- the DISTRIBUTION of RANDOM RESULTS (random subsets, conditions, swap-randomization)
- minimum QUALITY
- SIGNIFICANCE of INDIVIDUAL results
Why Attribute Selection?
- to remove attributes with little/no predictive information
- irrelevant attributes often slow down algorithms.
- it also avoids huge decision trees, therefore easier to interpret
Attribute Selection: kNN curse
dimensionality: increasing the number of attributes -> exponentially more training instances needed
Attribute Selection: C4.5 curse
data fragmentation problem: attributes are selected on less and less data after every split
filter approach
- attribute selection approach that is learner independent.
- based on simple models built by other learners.
- Example C4.5: selects features tested in the top-level nodes.
- Example kNN: weights features by capability to separate classes
Wrapper approach
- attribute selection approach that is learner dependent.
- rerun the learner with different attributes, select based on performance.
filter approach: recursive
select 1 attribute, remove, repeat
filter approach: produce a ranking
the cut-off is defined by the user
Attribute discretization
- converting numeric attributes to nominal ones.
- Downside: you always lose information.
- plus: you can use learners that can’t handle numeric data
unsupervised discretization
determines intervals without knowing the class labels; uses histograms. Two ways to do this:
- equal interval binning (equal width)
- equal frequency binning (equal height)
Supervised discretization
- determines intervals with knowledge of the class labels.
- usually works better than unsupervised
- less predictive information is lost
- Two approaches: entropy-based and bottom-up merging
Advantage Data Transformations
can lead to new insights in the data and better performance.
Linear regression
- can have one or multiple predictors
- computed directly from the data
- Lasso regression selects features by setting parameters to 0.
Rsquared/explained variance
what percentage of the variance in the target is explained by the regression model
Regression trees
variant of the decision tree that uses a top-down induction. two options:
- constant value in leaf(piecewise constant) -> regression trees
- Local linear model in leaf (piecewise linear) -> model trees
criteria for regression trees
- splitting criterion: standard deviation reduction (SDR)
- stopping criterion: standard deviation below some threshold, or too few examples in the node
Estimate error formula
- (n+v)/(n-v) * absolute error in node
- n=examples in node, v=parameters in the model.
maximally informative k-itemsets (MIKI) Motivation
subgroup discovery typically produces very many patterns with high levels of redundancy.
maximally informative k-itemsets (MIKI) Dissimilar patterns
- optimizes the dissimilarity of the reported patterns.
- the additional value of each individual pattern is reported.
maximally informative k-itemsets (MIKI)
an itemset of size k that maximizes the joint entropy
properties of joint entropy
- MONOTONICITY: suppose X and Y are two itemsets, such that X ⊆ Y. Then: H(X) ≤ H(Y).
- UNIT GROWTH: suppose X and Y are two itemsets, such that X ⊆ Y. Then: H(Y) ≤ H(X) + |Y\X|.
- INDEPENDENCE BOUND: suppose that X = {x1, …, xk} is an itemset. Then: H(X) ≤ Σi H(xi)
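A sketch of joint entropy over binary items (the 0/1 rows are made up): count how often each combination of values occurs and apply the entropy formula; the independence bound H(X) ≤ Σi H(xi) can be checked directly:

```python
from collections import Counter
from math import log2

def entropy_of_counts(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def joint_entropy(rows, items):
    """Joint entropy of the selected item columns over the data rows."""
    combos = Counter(tuple(row[i] for i in items) for row in rows)
    return entropy_of_counts(list(combos.values()))

rows = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 1), (1, 0, 0), (0, 1, 0)]
print(joint_entropy(rows, [0]))                             # H(x1) = 1.0
print(joint_entropy(rows, [0, 1]))                          # H({x1, x2}) ~ 1.92
print(joint_entropy(rows, [0]) + joint_entropy(rows, [1]))  # independence bound 2.0
```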
partitions of itemsets
- suppose that P = {B1, …, Bm} is a partition of an itemset.
- The joint entropy of P is defined as: H(P) = Σi H(Bi)
Joint entropy partition properties
P = {B1, …, Bm} -> partition of itemset X
- Partition bound: H(X) ≤ H(P)
P = {B1, …, Bm} -> partition of itemset X = {x1, …, xk}
- Independence bound: H(P) ≤ Σi H(xi)
Itemset
an itemset is a collection of one or more items.
Support count (σ)
the number of times a certain itemset occurs in the transactions (its frequency of occurrence)
Support
the fraction of transactions that contain an itemset; for a rule X → Y, the fraction that contains both X and Y
Frequent Itemset
a frequent itemset is an itemset whose support is greater than or equal to a minsup(minimum support) threshold
Association Rule Mining
given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction
Association Rule
an association rule is an expression of the form X → Y (X and Y are itemsets)
Two Rule Evaluation Measures
support and confidence
Confidence
measures how often items in Y appear in transactions that contain X: confidence(X → Y) = support(X ∪ Y) / support(X)
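A sketch over a toy transaction list (items and transactions made up): support counts how many transactions contain the whole itemset, and confidence of X → Y divides the support of X ∪ Y by the support of X:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"milk", "diapers", "beer"}))       # 2/5 = 0.4
print(confidence({"milk", "diapers"}, {"beer"}))  # 2/3 ~ 0.67
```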
Computational Complexity
the amount of resources required to run an algorithm
possible association rules formula
R = 3^d – 2^(d+1) + 1
*d is the number of items
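For example, with d = 6 items: R = 3^6 – 2^7 + 1 = 729 – 128 + 1 = 602 possible rules. A one-line check:

```python
d = 6
print(3**d - 2**(d + 1) + 1)  # 602
```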
Brute-force approach
- an approach to mining association rules.
- lists all possible rules and computes the support and confidence of each rule.
- rules that fail the minimum confidence and support thresholds are then discarded.
- computationally very prohibitive
Two-step approach association rules
- Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
- Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Apriori principle
- any subset of a frequent itemset must also be frequent; used to prune the candidate search:
- generate length (k+1) candidate itemsets from length k frequent itemsets
- prune candidates that have an infrequent subset of length k
- count the support of each remaining candidate
- eliminate the infrequent candidates, so only the frequent candidates are left
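A compact sketch of one Apriori level (toy transactions and a made-up minimum support count): join frequent k-itemsets into (k+1)-candidates, prune candidates with an infrequent k-subset, then count support and keep the frequent ones:

```python
from itertools import combinations

def apriori_step(frequent_k, transactions, minsup_count):
    """One level of Apriori: generate, prune and count (k+1)-candidates."""
    k = len(next(iter(frequent_k)))
    # Generate: union of two frequent k-itemsets that yields a (k+1)-itemset
    candidates = {a | b for a in frequent_k for b in frequent_k if len(a | b) == k + 1}
    # Prune: every k-subset of a candidate must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent_k for s in combinations(c, k))}
    # Count support and keep only the frequent candidates
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= minsup_count}

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_1 = {frozenset({"a"}), frozenset({"b"}), frozenset({"c"})}
print(apriori_step(frequent_1, transactions, minsup_count=3))
```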
Factors Affecting Complexity
- minimum support threshold: lower support threshold -> more frequent itemsets -> more candidate itemsets
- dimensionality (number of items): you need more space to store support counts
- size of database(number of transactions): bigger database -> increase run time
- Transaction width (density of datasets): wider transaction width -> increase #subsets -> increase length frequent itemsets
Maximal frequent itemset
a frequent itemset none of whose immediate supersets is frequent; the maximal frequent itemsets fix the border between frequent and infrequent itemsets
Closed itemset
an itemset none of whose supersets has the same support as it; the support of every superset is strictly lower
Regression
- the ensemble prediction is the average of the individual predictions
- more diversity -> better performance.
Bootstrap Aggregating (Bagging)
- takes RANDOM SAMPLES with REPLACEMENT: given a training set D of size n, bagging generates m new training sets D1…Dm, each of size n.
- sampling with replacement -> OBSERVATIONS can be REPEATED in a sample
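A sketch of the sampling step (data and number of samples made up): each bootstrap sample is drawn with replacement and has the same size as the original, so some observations repeat and the left-out ones form the out-of-bag set:

```python
import random

def bootstrap_samples(data, m, seed=0):
    """Generate m bootstrap samples of the same size as `data`."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in range(len(data))] for _ in range(m)]

data = list(range(10))
for sample in bootstrap_samples(data, m=3):
    out_of_bag = sorted(set(data) - set(sample))
    print(sample, "OOB:", out_of_bag)
```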
unstable learners
learners whose models change considerably when the training data changes slightly (e.g., decision trees); such learners de-correlate when trained on different datasets
Out-of-bag error(OOB)
- measures the prediction error of machine-learning models (e.g., random forests) that use bootstrap aggregating (bagging).
- it is the MEAN PREDICTION ERROR on each training sample xi
- using only the trees that did not have xi in their bootstrap sample
Methods that use randomization
- Random Subspace Method
- bootstrapping
- boosting
- C4.5
Random Forests downsides
- not very transparent
- have to fit the Random Forest to the data set
- you have to record the average OOB error over all the trees
Benefits of Random Forests
- good theoretical basis (ensembles).
- fairly easy to run.
- often used as a baseline for testing.
- easy to run in parallel due to its inherent parallelism; it already runs faster on multi-core computers.
clustering
- has no labels and finds ‘natural’ grouping of instances.
- if labels are present, then they are ignored.
- unsupervised learning,
- Examples include k-means, hierarchical clustering, k-medoids, Self-Organizing Maps
k-means steps
1. pick k random points as initial cluster centres
2. assign each point to the nearest cluster centre
3. move each cluster centre to the mean of its cluster
4. reassign the points to the nearest cluster centre
5. repeat steps 3-4 until the cluster centres converge (hardly move)
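A minimal k-means sketch (pure Python, 1-D points for brevity; random restarts and empty-cluster handling are simplified):

```python
import random

def k_means(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)                 # step 1: k random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                            # steps 2/4: assign to nearest centre
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        new_centres = [sum(c) / len(c) if c else centres[i]  # step 3: move to the mean
                       for i, c in enumerate(clusters)]
        if new_centres == centres:                  # step 5: stop once centres converge
            break
        centres = new_centres
    return centres, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1]
print(k_means(points, k=3))
```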
k-means advantages
simple, understandable and fast. The instances are automatically assigned to clusters.
k-means disadvantages(name 5)
- k must be determined beforehand,
- sensitive to outliers as instances are forced to a single cluster.
- random algorithm -> random results.
- not intuitive(higher dimensions)
- cluster central point is not always an observed data point
global optimum
point where the function value is smaller than at all other feasible points
local minimum
- a point where the function value is smaller than at nearby points
- but possibly greater than at a distant point.
k-medoids
- uses the most centrally located data point (medoid) of each cluster instead of the mean
- the cluster representative is always an observed data point
Hierarchical clustering
structures the clusters into a tree called a dendrogram: the leaves hold the individual clusters and each internal node is the union of its child clusters
bottom-up/agglomerative clustering
starts with single-instance clusters and joins the two closest clusters at each step
the top-down approach(clustering)
starts with one universal cluster and is split into two clusters recursively
Cluster distance (linkage)
the distance between clusters can be measured by:
- single link: smallest distance between points
- complete link: largest distance between points
- average link: average distance between points
- centroid: distance between the cluster centres
Self-Organising Maps(SOM)
- clustering method that groups similar data together.
- reduces dimensionality and is a data visualisation technique.
- the neurons try to mimic the input vectors
Quality measures subgroup discovery
- WRAcc
- z-score
- Explained Variance
- information gain
histograms disadvantage
bin boundaries can be placed at unfortunate locations, causing (nearly) empty bins or overly full bins
k as a parameter in k-NN algorithm determines:
- Decision Boundary: the smoothness of the decision boundary
- Neighbours: the number of neighbours considered when classifying a new example
- Fit: how well the model fits the training data
k-means possible desirable outcomes
- The cluster centres move to the circular data
- The algorithm gets stuck in a local optimum
- The algorithm doesn’t converge
Maximum MIKI
the upper bound on a miki's joint entropy: the combined (summed) entropy of its individual items
MIKI
- the itemset of size k whose joint entropy is higher than that of all other itemsets of size k (the highest joint entropy).
- if an itemset is the only one with a given number of elements, it is automatically the miki of that size
identifiers downside
identifiers have a high entropy, but splitting on them gives a higher chance of overfitting