Exam Flashcards

1
Q

Knowledge Discovery

A

about understanding a domain with interpretable models

2
Q

Prediction

A

the strand of data mining where only predictive performance matters; the methods or models you use (interpretable or not) do not matter

3
Q

Black box setting

A

you don’t care about how the model works

4
Q

Classification

A
  • predicts the class of future cases (often a binary class).
  • models the dependency of the target on the other attributes.
  • sometimes a black-box classifier.
  • some attributes may not appear in a decision tree because they are overshadowed by other attributes.
  • supervised learning
5
Q

Regression

A

tries to predict a numeric target variable

6
Q

Clustering

A

divides a dataset into groups of similar cases

7
Q

Frequent Patterns/Association

A

finds dependencies between variables

8
Q

Support Vector Machine

A

a single line (hyperplane) through the dataset that tries to find a good boundary between the positive and negative examples

9
Q

Neural Network

A

like the Support Vector Machine, but the decision boundary can be a curve (non-linear)

10
Q

iid

A

independent and identically distributed

11
Q

Nominal

A

categorical/discrete, can only test for equality

12
Q

Numeric

A

can test for inequalities and can use arithmetic or distance measure

13
Q

Ordinal

A

can compare inequalities as well, but not use arithmetic or distance measure

14
Q

Binary

A

Nominal variable with only two values

15
Q

Entropy

A

measure of the amount of information/chaos; highest when the probability is spread equally over the values (each of the m values has probability 1/m); unsupervised

16
Q

Entropy formula

A

–p·lg(p) – (1–p)·lg(1–p) for a binary variable, where p is the probability of one value; general form: –Σi pi·lg(pi)
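A minimal Python sketch of this formula (the helper function and the example probabilities are my own, not from the course):

```python
import math

def entropy(probabilities):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return 0.0 - sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0: a fifty-fifty binary variable carries one full bit
print(entropy([1.0]))        # 0.0: a pure (certain) variable carries no information
print(entropy([0.9, 0.1]))   # ~0.469: a skewed distribution carries less than one bit
```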

17
Q

Max Entropy formula

A

–m · (1/m) · lg(1/m) = –lg(1/m) = lg(m)

18
Q

Cumulative distribution function(CDF)

A

gives, for each value x, the fraction (probability) of data points that are at most x; it accumulates the distribution from left to right

19
Q

Probability density function(PDF)

A

the derivative of the CDF; gives the relative density of points at each value. Density is not probability; the peak is where most values lie and thus where the density is greatest

20
Q

Histograms

A

estimates density in a discrete way by defining cut points and counting occurrences per created bin; unsupervised method

21
Q

Histograms Equal width

A

the bins are cut in equal size intervals

22
Q

Histograms Equal height

A

the bins are cut so that every bin contains about the same number of data points

23
Q

Kernel(Gaussian) Density Estimation

A

estimates the density of the population from a sample

24
Q

Downside Entropy

A

the entropy concept does not apply well to numeric data, only to nominal data

25
Q

Confusion Matrix/Contingency table

A

describes the frequencies of the four combinations of subgroup membership and target value: inside the subgroup and positive, inside and negative, outside the subgroup and positive, outside and negative.

26
Q

Confusion matrix distribution

A

from the joint distribution you implicitly also get the univariate (marginal) distributions

27
Q

methods of quantifying information between attributes

A

joint entropy, mutual information, information gain

28
Q

joint distributions dependency

A

if the counts are higher along the diagonal of the confusion matrix, then one variable is (somewhat) dependent on the other. One variable is fully determined by the other when, within each row, one cell carries all of the probability and the other cell none

29
Q

Density problem visualisation

A

with scatterplots, you cannot see the density because of the datapoints overlapping at the same spot

30
Q

Information gain

A

in decision trees, it is the entropy before the split minus the (weighted) entropy after the split; it can be any non-negative number for an attribute; supervised

31
Q

Uncertainty

A

a class attribute contains uncertainty over values. this is captured by the entropy of the target

32
Q

top-down tree construction

A

all training examples are at the root at the start of building. partition the examples recursively by choosing one attribute each time

33
Q

bottom-up tree pruning

A
  • removes subtrees or branches in a bottom-up manner.

- goal is to improve the estimated accuracy on new cases

34
Q

What does the goodness function do?

A
  • at each node, ATTRIBUTES are SELECTED based on how well they SEPARATE the CLASSES of the training examples.
  • Examples: information gain (ID3/C4.5), information gain ratio, Gini index
35
Q

Heuristic attribute selection

A

chooses the attribute that produces the purest nodes

36
Q

How does Information gain do attribute selection?

A
  • uses entropy of the class attribute.
  • Information gain increases with the average purity of the subsets that an attribute produces.
  • chooses the attribute that results in the highest information gain
37
Q

Entropy pure and mixed nodes

A

H(0) = 0 and H(1) = 0 gives pure nodes with a skewed distribution, H(0.5) = 1 gives a mixed node with a uniform distribution

38
Q

Information gain formula

A

Gain = H(p) – Σi (ni/n)·H(pi): the information before the split minus the weighted information after the split; the maximum information gain is log2(#values)
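A hedged Python sketch of this computation; the helper names and the example counts below are my own illustration:

```python
import math

def entropy(counts):
    """Entropy (bits) of a class distribution given as a list of counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """H(parent) minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    after = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - after

# Toy split: a parent node with 9 yes / 5 no examples split into three children.
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.247
```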

39
Q

Information gain bias

A

biased towards choosing attributes with a large number of values, this may result in overfitting

40
Q

Overfitting

A

selection of an attribute that is non-optimal for prediction

41
Q

Gain ratio + when large/small?

A
  • Is a modification of the information gain that reduces its bias on high-branching attributes.
  • large when data is divided into few, even groups.
  • small when each example belongs to a separate branch
42
Q

Intrinsic information

A

entropy of distribution of instances into branches

43
Q

gain ratio formula

A

Gain(S, A) / Intrinsic Info (S, A)
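A small Python sketch (function names and counts are my own) showing how the intrinsic information of a split penalizes the gain:

```python
import math

def entropy_of_fractions(fractions):
    return -sum(f * math.log2(f) for f in fractions if f > 0)

def gain_ratio(gain, branch_sizes):
    """Gain(S, A) divided by the intrinsic information of the split:
    the entropy of how the n examples are distributed over the branches."""
    n = sum(branch_sizes)
    intrinsic_info = entropy_of_fractions([s / n for s in branch_sizes])
    return gain / intrinsic_info

# A split into three branches of 5, 4 and 5 examples with gain 0.247:
print(gain_ratio(0.247, [5, 4, 5]))   # ~0.157
# The same gain spread over 14 single-example branches is heavily penalized:
print(gain_ratio(0.247, [1] * 14))    # ~0.065
```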

44
Q

Intrinsic information assumption

A

the importance of the attribute decreases as the intrinsic information gets larger in the gain ratio formula

45
Q

Standard fix

A

a check to prevent the gain ratio from overcompensating and choosing an attribute just because its intrinsic information is low (e.g. only consider attributes with at least average information gain)

46
Q

Splitting numeric attributes

A
  • standard method: binary splits (e.g. temp < 45); evaluate the info gain (or another measure) for every possible split point of the attribute.
  • the info gain of the best split point is used as the info gain of the attribute.
  • numeric attributes are computationally more demanding.
47
Q

If an attribute A has high info gain at the root of the tree, does it always appear in a decision tree?

A

No.
If it is highly correlated with another attribute B, and gain(B) > gain(A), then B will appear in the tree, and further splitting on A will not be useful.

48
Q

Can an attribute appear more than once in a decision tree?

A

Yes

If a test is not at the root of the tree, it can appear in different branches

49
Q

Can an attribute appear more than once on a single path in the tree (from root to leaf)?

A

Yes

Numeric attributes can appear more than once, but only with very different numeric conditions

50
Q

If an attribute A has gain (A)=0 (at the root), can it ever appear in a decision tree?

A

Yes.

  1. All attributes may have zero info gain
  2. info gain often changes when splitting on another attribute, think of the XOR-problem
51
Q

Is a tree with only pure leaves always the best classifier you can have?

A

No.
This tree is the best classifier on the training set, but possibly not on new and unseen data. Because of overfitting, the tree may not generalize very well.

52
Q

Goal Pruning

A

to prevent overfitting to noise in the data, strategies: pre-pruning and post-pruning

53
Q

pre-pruning

A

stop growing a branch when information becomes unreliable

54
Q

post-pruning

A

take a fully-grown decision tree and discard unreliable parts

55
Q

chi-squared test

A
  • most popular statistical significance test for pre-pruning.
  • tests for a statistically significant dependency between an attribute and the class in a node
56
Q

ID3

A
  • method that uses the chi-squared test in addition to information gain
  • only selects statistically significant attributes
57
Q

XOR/Parity-problem

A
  • the problem with early stopping (stopping the growth process prematurely during pre-pruning): on XOR/parity data no individual attribute shows any information gain, so growth stops too soon and the interaction between attributes is missed

58
Q

subtree replacement

A

post-pruning bottom-up approach. considers replacing a tree only after considering all of its subtrees

59
Q

C4.5

A

decision tree method that derives the confidence interval from the training data

60
Q

Observed error rate

A

error rate from the training set

61
Q

Actual error rate

A

error rate from the test set, which is unknown

62
Q

Confidence interval

A

used to calculate whether the actual error rate will be different from the observed error rate

63
Q

Covering approach

A

for each class in turn, find the rule set that covers all examples in it (excluding examples not in the class)

64
Q

PRISM

A

algorithm based on the covering approach; adds tests that maximize a rule's accuracy.

65
Q

goal of a test for rules

A
  • to maximize accuracy
  • t is the total number of examples covered by the rule
  • p is the number of positive examples covered by the rule
  • p/t is 1 when every example the rule covers is positive
66
Q

separate-and-conquer algorithms

A

PRISM is an example; it deals with one class at a time. It starts with a rule that separates out (covers) some examples of that class; the remaining examples are then conquered by further rules

67
Q

divide-and-conquer algorithms

A

a subset covered by a rule doesn’t need to be explored any further

68
Q

rote (memorization) learning

A
  • memorize the training set and classify a test instance by searching the training set for the most similar example
  • the basis of nearest-neighbour (k-NN) methods

69
Q

instance-based learning

A
  • lazy learning
  • similarity function defines what is learned.
  • methods include nearest neighbour, k-nearest neighbours
70
Q

Euclidean distance

A

measures the distance or difference between two attribute values
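A minimal sketch, with a toy dataset of my own, of Euclidean distance and the 1-nearest-neighbour rule it supports:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two numeric attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_classify(train, query):
    """1-NN: return the label of the training example closest to the query."""
    _, label = min(train, key=lambda example: euclidean(example[0], query))
    return label

# Toy training set of (attributes, label) pairs.
train = [([1.0, 1.0], "yes"), ([1.2, 0.8], "yes"), ([5.0, 5.0], "no")]
print(nn_classify(train, [0.9, 1.1]))  # "yes"
```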

71
Q

Curse of 1-NN

A
  • accurate, but suffers from the curse of dimensionality: added dimensions increase distances and exponentially more training data is needed.
  • it is also slow
  • remedy: weighted attributes or attribute selection
72
Q

Simple models advantage

A

a simpler model should perform well on unseen data drawn from the same distribution

73
Q

error rate formula

A

#errors / #examples

74
Q

accuracy formula

A

#successes / #examples

75
Q

rules data sets 1

A

never evaluate on training data

76
Q

rules data sets 2

A

never train on test data(that includes parameter setting or feature selection)

77
Q

proper procedure testing

A
  1. training set to train models
  2. validation set to optimize algorithm parameters
  3. test set to evaluate the final model
78
Q

trade-off evaluation data sets

A
  • all the data can be used to build the final classifier, however:
  • trade-off between performance evaluation and accuracy
79
Q

k-fold Cross-validation

A

splits the data (stratified) into k folds; each fold is used as the test set once and the results are averaged over the k runs. k = 10 is usually enough to reduce the sampling bias
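A minimal, unstratified sketch of the fold bookkeeping (indices only; the helper name is my own). In practice a stratified split, e.g. scikit-learn's StratifiedKFold, is usually preferred:

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k folds; each fold is the test set once,
    the remaining k-1 folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With n = 20 examples and k = 5, every example is tested exactly once.
for train, test in k_fold_indices(20, k=5):
    print(len(train), len(test))   # 16 4 on every round
```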

80
Q

Leave-One-Out Cross-validation

A

the number of folds equals the number of examples. It makes the best use of the data, no sampling bias. However, it is computationally expensive

81
Q

ROC

A
  • method that illustrates the diagnostic ability of a binary classifier system.
  • it is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
  • the further the curve lies above the diagonal (towards the top-left corner), the better the classifier performs.
82
Q

TPR(True Positive Rate)

A

true positives/total positives

83
Q

FPR(False Positive Rate)

A

false positives / total negatives
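A small sketch (counts are illustrative, the function name is my own) computing the two rates that make up one point on a ROC curve:

```python
def roc_point(tp, fp, tn, fn):
    """True positive rate and false positive rate from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # fraction of actual positives correctly flagged
    fpr = fp / (fp + tn)   # fraction of actual negatives wrongly flagged
    return tpr, fpr

# A classifier that catches 80 of 100 positives and wrongly flags 10 of 200 negatives:
print(roc_point(tp=80, fp=10, tn=190, fn=20))   # (0.8, 0.05)
```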

84
Q

Significance tests

A
  • express the confidence that an observed difference is real.
  • is there enough evidence to reject the null hypothesis?
  • e.g. given cross-validation scores, is classifier B really better than A?
85
Q

Subgroup discovery Task

A
  • a mix between regression and classification
  • find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute
86
Q

Inductive constraints examples

A
  • MAXIMUM COVERAGE,
  • base rate when compared to the sample size,
  • minimum quality,
  • information gain (X2, WRAcc)
87
Q

Maximum complexity rule

A

the fewer attributes, the better

88
Q

Confusion matrix diagonals

A

High numbers across the TT-FF diagonal means a positive correlation between subgroup and target.
- High numbers across the TF-FT diagonal means a negative correlation between subgroup and target

89
Q

quality measure

A
  • interestingness of subgroup confusion matrix in a number

- popular is the WRAcc(weighted relative accuracy)

90
Q

WRAcc formula

A

WRACC(Subgroup, Target) = p(ST) – p(S)p(T)

  • compare p(ST) to independence
  • shows the balance between the coverage (size of subgroup) and the unexpectedness (deviance of the target)
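A minimal sketch of this formula on toy counts of my own choosing:

```python
def wracc(subgroup_and_target, subgroup_size, target_size, n):
    """WRAcc(S, T) = p(ST) - p(S) * p(T), computed from counts."""
    return subgroup_and_target / n - (subgroup_size / n) * (target_size / n)

# 100 examples, 40 positives overall; a subgroup of 20 examples contains 15 positives.
# Under independence we would expect 20 * 0.4 = 8 positives, so the subgroup deviates.
print(wracc(subgroup_and_target=15, subgroup_size=20, target_size=40, n=100))  # 0.07
```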
91
Q

examples quality measures

A

WRAcc, information gain, X2, correlation coefficient, Laplace, Jaccard, specificity

92
Q

Regression subgroup discovery

A

numeric target has an order and a scale

93
Q

ordinal subgroup discovery

A

numeric target has an order

94
Q

ranked subgroup discovery

A

numeric target has an order or a scale

95
Q

Numeric Subgroup Discovery intuitions

A
  • Larger subgroups are more reliable
  • the majority of objects appear at the top(yes/true)
  • the middle of the subgroup should differ from the middle of the ranking
  • objects should have a similar rank
96
Q

Classical subgroup discovery

A
  • Nominal targets(classification)

- Numeric targets(regression)

97
Q

Exceptional model mining(extension subgroup discovery)

A
  • multiple targets
  • regression, correlation
  • multi-label classification
98
Q

Statistical validation determines:

A
  • the DISTRIBUTION of RANDOM RESULTS (random subsets, conditions, swap-randomization)
  • minimum QUALITY
  • SIGNIFICANCE of INDIVIDUAL results
99
Q

Why Attribute Selection?

A
  • to remove attributes with little/no predictive information
  • irrelevant attributes often slow down algorithms.
  • it also avoids huge decision trees, therefore easier to interpret
100
Q

Attribution Selection: kNN curse

A

dimensionality: increase number attributes -> exponential increase training instances

101
Q

Attribution Selection: C4.5 curse

A

data fragmentation problem: select attributes on less and less data after every split

102
Q

filter approach

A
  • attribute selection approach that is learner independent.
  • based on simple models built by other learners.
  • Example C4.5: selects features tested in the top-level nodes.
  • Example kNN: weights features by capability to separate classes
103
Q

Wrapper approach

A
  • attribute selection approach that is learner dependent.

- rerun the learner with different attributes, select based on performance.

104
Q

filter approach: recursive

A

select 1 attribute, remove, repeat

105
Q

filter approach: produce a ranking

A

the cut-off is defined by the user

106
Q

Attribute discretization

A
  • converting numeric attributes to nominal ones.
  • Downside: you always lose information.
  • plus: you can use learners that can’t handle numeric data
107
Q

unsupervised discretization

A

determining intervals without knowing the class labels. uses histograms. Two ways to do this:

  1. equal interval binning(equal width)
  2. equal frequency binning (equal size)
108
Q

Supervised discretization

A
  • determines intervals with knowledge of the class labels.
  • usually works better than unsupervised
  • less predictive information is lost
  • Two approaches: entropy-based and bottom-up merging
109
Q

Advantage Data Transformations

A

can lead to new insights in the data and better performance.

110
Q

Linear regression

A
  • can have one or multiple predictors
  • computed directly from the data
  • Lasso regression selects features by setting parameters to 0.
111
Q

Rsquared/explained variance

A

what percentage of the variance in regression is explained by the model

112
Q

Regression trees

A

variant of the decision tree that uses a top-down induction. two options:

  1. constant value in leaf(piecewise constant) -> regression trees
  2. Local linear model in leaf (piecewise linear) -> model trees
113
Q

criteria for regression trees

A
  1. splitting criterion: standard deviation reduction(SDR)

2. stopping criterion: standard deviation below some threshold, too few examples in node

114
Q

Estimate error formula

A
  • (n+v)/(n-v) * absolute error in node

- n=examples in node, v=parameters in the model.

115
Q

maximally informative k-itemsets (MIKI) Motivation

A

subgroup discovery typically produces very many patterns with high levels of redundancy.

116
Q

maximally informative k-itemsets (MIKI) Dissimilar patterns

A
  • optimize dissimilarity of patterns reported.

- The additional value of individual patterns is reported.

117
Q

maximally informative k-itemsets (MIKI)

A

an itemset of size k that maximizes the joint entropy
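A hedged Python sketch (the toy 0/1 data and function names are my own) that computes joint entropy and brute-forces a miki of size 2:

```python
import math
from collections import Counter
from itertools import combinations

def joint_entropy(rows, items):
    """Joint entropy (bits) of the binary columns named in `items`."""
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Toy 0/1 transaction data over items A, B, C (hypothetical).
rows = [{"A": 1, "B": 1, "C": 0}, {"A": 1, "B": 0, "C": 0},
        {"A": 0, "B": 1, "C": 1}, {"A": 0, "B": 0, "C": 1}]

# Brute-force search for the maximally informative 2-itemset (miki of size k=2).
best = max(combinations(["A", "B", "C"], 2), key=lambda s: joint_entropy(rows, s))
print(best, joint_entropy(rows, best))   # ('A', 'B') 2.0 — C mirrors A, so it adds nothing
```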

118
Q

properties of joint entropy

A
  • MONOTONICITY: suppose X and Y are two itemsets, such that X ⊆ Y. Then: H(X) ≤ H(Y).
  • UNIT GROWTH: suppose X and Y are two itemsets, such that X ⊆ Y. Then: H(Y) ≤ H(X) + |Y\X|.
  • INDEPENDENCE BOUND: suppose that X = {x1, …, xk} is an itemset.
    Then: H(X) ≤ Σi H(xi)
119
Q

partitions of itemsets

A
  • suppose that P = {B1, …, Bm} is a partition of an itemset.
  • The joint entropy of P is defined as: H(P) = Σi H(Bi)
120
Q

Joint entropy partition properties

A

P = {B1, …, Bm} -> partition of itemset X
- Partition bound: H(X) ≤ H(P)

P = {B1, …, Bm} -> partition of itemset X = {x1, …, xk}
- Independence bound: H(P) ≤ Σi H(xi)

121
Q

Itemset

A

an itemset is a collection of one or more items.

122
Q

Support count (σ)

A

a support count is the frequency of occurrence of a certain itemset (the number of transactions that contain it).

123
Q

Support

A

support is the fraction of transactions that contain an itemset (for a rule X → Y: the fraction containing both X and Y).

124
Q

Frequent Itemset

A

a frequent itemset is an itemset whose support is greater than or equal to a minsup(minimum support) threshold

125
Q

Association Rule Mining

A

find rules that predict the occurrence of an item based on occurrences of other items in the transaction in a given set of transactions.

126
Q

Association Rule

A

an association rule is an expression of the form X → Y (X and Y are itemsets)

127
Q

Two Rule Evaluation Measures

A

support and confidence

128
Q

Confidence

A

measures how often items in Y appear in transactions that contain X:
conf(X → Y) = support(X ∪ Y) / support(X)
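A minimal sketch (toy transactions and helper names are my own) of support and this confidence formula:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """conf(X -> Y) = support(X and Y) / support(X)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Toy market-basket data (hypothetical).
transactions = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}, {"milk"}]
print(support(transactions, {"bread", "milk"}))        # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))   # ~0.667
```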

129
Q

Computational Complexity

A

the amount of resources required to run an algorithm

130
Q

possible association rules formula

A

R = 3^d – 2^(d+1) + 1

*d is the number of items
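A small brute-force check of the formula (the helper name and the small values of d are my own):

```python
from itertools import combinations

def count_rules(d):
    """Brute-force count of rules X -> Y with X, Y non-empty and disjoint,
    over d items; should equal 3**d - 2**(d + 1) + 1."""
    items = range(d)
    total = 0
    for r in range(1, d + 1):
        for lhs in combinations(items, r):
            rest = [i for i in items if i not in lhs]
            # any non-empty subset of the remaining items can be the consequent
            total += 2 ** len(rest) - 1
    return total

for d in (2, 3, 6):
    print(count_rules(d), 3 ** d - 2 ** (d + 1) + 1)   # pairs agree: 2 2, 12 12, 602 602
```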

131
Q

Brute-force approach

A
  • example approach to mining association rules
  • lists all possible rules and computes the support and confidence of each rule.
  • rules that fail the minimum confidence and support thresholds are then discarded.
  • computationally very prohibitive
132
Q

Two-step approach association rules

A
  1. Frequent Itemset Generation - generate all itemsets whose support ≥ minsup
  2. Rule Generation - generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
133
Q

Apriori principle

A
  1. Generate length (k+1) candidate itemsets from length k frequent itemsets.
  2. Prune itemsets with infrequent subsets of length k.
  3. Count the support of each candidate
  4. Eliminate all the infrequent candidates
  5. only the frequent candidates are left.
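A hedged sketch of the generate-and-prune step (steps 1-2 above); representing itemsets as frozensets and the toy frequent 2-itemsets are my own choices:

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Join frequent k-itemsets into (k+1)-candidates, then prune every candidate
    that has an infrequent k-subset (the Apriori principle)."""
    frequent_k = set(frequent_k)
    k = len(next(iter(frequent_k)))
    candidates = {a | b for a in frequent_k for b in frequent_k if len(a | b) == k + 1}
    return {c for c in candidates
            if all(frozenset(s) in frequent_k for s in combinations(c, k))}

frequent_2 = {frozenset(p) for p in
              [("bread", "milk"), ("bread", "beer"), ("milk", "beer"), ("milk", "eggs")]}
print(generate_candidates(frequent_2))   # only {bread, milk, beer} survives pruning
```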
134
Q

Factors Affecting Complexity

A
  • minimum support threshold: lower support threshold -> more frequent itemsets -> more candidate itemsets
  • dimensionality (number of items): more space needed to store support counts
  • size of database (number of transactions): bigger database -> longer run time
  • transaction width (density of datasets): wider transactions -> more subsets -> longer frequent itemsets
135
Q

Maximal frequent itemset

A

a frequent itemset none of whose immediate supersets is frequent; the maximal frequent itemsets fix the border between frequent and infrequent itemsets

136
Q

Closed itemset

A

an itemset none of whose supersets has the same support as the itemset itself; the support of every superset is strictly lower

137
Q

Regression

A
  • average of the individual predictions

- more diversity -> better performance.

138
Q

Bootstrap Aggregating (Bagging)

A
  • from RANDOM SAMPLES drawn WITH REPLACEMENT: given a training set D of size n, bagging generates m new training sets D1…Dm, each of size n.
  • sampling with replacement -> some OBSERVATIONS are REPEATED in each sample
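A minimal sketch (toy data and helper name are my own) of drawing bootstrap samples with replacement; the examples left out of each sample form the out-of-bag set used for OOB error:

```python
import random

def bootstrap_samples(data, m, seed=0):
    """Generate m bootstrap training sets: each is a sample of len(data)
    examples drawn with replacement, so some examples repeat and others are left out."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in range(len(data))] for _ in range(m)]

data = list(range(10))
for sample in bootstrap_samples(data, m=3):
    out_of_bag = set(data) - set(sample)
    print(sorted(set(sample)), "out-of-bag:", sorted(out_of_bag))
```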
139
Q

unstable learners

A

learners whose models change a lot when the training data changes slightly; they can be de-correlated by learning from different (bootstrap) datasets.

140
Q

Out-of-bag error(OOB)

A
  • measures the prediction error of machine learning models (e.g. random forests) that use bootstrap aggregating (bagging).
  • it is the MEAN PREDICTION ERROR on each training sample xi
  • only uses the trees that did not have xi in their bootstrap sample
141
Q

Methods that use randomization

A
  • Random Subspace Method
  • bootstrapping
  • boosting
  • C4.5
142
Q

Random Forests downsides

A
  • not very transparent
  • have to fit the Random Forest to the data set
  • you have to record the average OOB error over all the trees
143
Q

Benefits of Random Forests

A
  1. good theoretical basis (ensembles).
  2. fairly easy to run.
  3. often used as a baseline for testing.
  4. easy to run in parallel due to its inherent parallelism; it already runs faster on multi-core computers.
144
Q

clustering

A
  • has no labels and finds ‘natural’ grouping of instances.
  • if labels are present, then they are ignored.
  • unsupervised learning,
  • Examples include k-means, hierarchical clustering, k-medoids, Self-Organizing Maps
145
Q

k-means steps

A
  1. pick k random points: initial cluster centres
  2. Assign each point to the nearest cluster centre
  3. Move cluster centres to the mean of each cluster
  4. Reassign points to the nearest cluster centre
  5. Repeat steps 3-4 until the cluster centres converge (barely move)
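A hedged, minimal implementation of the steps above on toy 2-D points (all names and data are my own, not the course's reference code):

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means: random initial centres, then alternate assignment and mean updates."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)                     # step 1: k random points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                # steps 2/4: assign to nearest centre
            i = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[i].append(p)
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centres[i]
               for i, cl in enumerate(clusters)]        # step 3: move centres to cluster means
        if new == centres:                              # step 5: stop once centres converge
            break
        centres = new
    return centres

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(points, k=2))   # two centres near (1.33, 1.33) and (8.33, 8.33), in some order
```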
146
Q

k-means advantages

A

simple, understandable and fast. The instances are automatically assigned to clusters.

147
Q

k-means disadvantages(name 5)

A
  • k must be determined beforehand,
  • sensitive to outliers as instances are forced to a single cluster.
  • random algorithm -> random results.
  • not intuitive(higher dimensions)
  • cluster central point is not always an observed data point
148
Q

global optimum

A

point where the function value is smaller than at all other feasible points

149
Q

local minimum

A
  • a point where the function value is smaller than at nearby points
  • but possibly greater than at a distant point.
150
Q

k-medoids

A
  • uses the medoid (the most centrally located observed point) of each cluster instead of the mean
  • the cluster representative is therefore always an observed data point

151
Q

Hierarchical clustering

A

structures clusters into a tree called a dendrogram: single-instance clusters sit in the leaves, and each internal node is the union of its child clusters

152
Q

bottom-up/agglomerative clustering

A

starts with single-instance clusters and joins the two closest clusters at each step

153
Q

the top-down approach(clustering)

A

starts with one universal cluster and is split into two clusters recursively

154
Q

Centroid

A

the distance between clusters:

  • single link: smallest distance between points
  • complete link: largest distance between points
  • average link: average distance between points
155
Q

Self-Organising Maps(SOM)

A
  • clustering method that groups similar data together.
  • reduces dimensionality and is a data visualisation technique.
  • the neurons try to mimic the input vectors
156
Q

Quality measures subgroup discovery

A
  • WRAcc
  • z-score
  • Explained Variance
  • information gain
157
Q

histograms disadvantage

A

bin boundaries can be placed at unfortunate locations, causing empty bins or too full bins

158
Q

k as a parameter in k-NN algorithm determines:

A
  • Decision Boundary: the smoothness of the decision boundary
  • Neighbours: the number of neighbours considered when classifying a new example
  • Fit: how well the model fits the training data
159
Q

k-means possible desirable outcomes

A
  • The cluster centres move to the circular data
  • The algorithm gets stuck in a local optimum
  • The algorithm doesn’t converge
160
Q

Maximum MIKI

A

entropy of the items combined

161
Q

MIKI

A
  • the itemset of size k whose joint entropy is higher than that of all other itemsets of size k -> the highest joint entropy.
  • if an itemset is the only one with a specific number of elements, then it is trivially the miki of that size
162
Q

identifiers downside

A

identifiers have a high entropy, but they also have a higher chance of overfitting.