Exam flashcards
supervised learning
- how does the output depend on the input?
- can we predict the output, given the input?
- classification, regression
unsupervised learning
- no target
- clustering and frequent pattern mining (find dependencies between variables)
univariate distribution
what values occur for an attribute and how often, data is a sample of a population
entropy H(A)
how informative is an attribute? (about the value of another attribute)
a measure of the amount of information/chaos
distribution of a binary attribute
H(A) = – plg(p) – (1–p)lg(1–p)
H(A) maximal (1) when p = 1/2 = 1/m (m nr of possible values)
distribution of a nominal attribute
H(A) = Σ–p_{i}*lg(p_{i})
H maximal when all p_{i} = 1/m
H_max = lg(m)
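A minimal Python sketch of the binary and nominal entropy formulas above (lg = log base 2; the example probabilities are illustrative):

```python
from math import log2

def entropy(probs):
    """H = sum of -p * lg(p) over all values (terms with p = 0 contribute 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def binary_entropy(p):
    """H(A) = -p*lg(p) - (1-p)*lg(1-p) for a binary attribute."""
    return entropy([p, 1 - p])

# maximal when all m values are equally likely: H_max = lg(m)
print(binary_entropy(0.5))            # 1.0
print(entropy([1/4] * 4), log2(4))    # 2.0 2.0
```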
histograms
define cut points and count occurrences within bins
- equal width: cut domain in k equal sized intervals
- equal height: select k cut points such that all bins contain n/k data points
information gain
gain(A) = H(p) – ΣH(p_{i})*n_{i}/n
p probability of positive in current set (before split)
n number of examples in current set (before split)
p_{i} probability of positive in branch i (after split)
n_{i} number of examples in branch i (after split)
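A hedged sketch of the gain computation; the split in the example (9/14 positive before the split, branches with 2/5, 4/4 and 3/5 positive) is purely illustrative:

```python
from math import log2

def H(p):
    """Entropy of a binary distribution with positive probability p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def information_gain(p, n, branches):
    """gain(A) = H(p) - sum(H(p_i) * n_i / n) over branches given as [(p_i, n_i), ...]."""
    return H(p) - sum(H(p_i) * n_i / n for p_i, n_i in branches)

# illustrative split: 9 of 14 positive before, three branches after
print(information_gain(9/14, 14, [(2/5, 5), (4/4, 4), (3/5, 5)]))   # ≈ 0.247
```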
problem highly-branching attributes
attributes with a large number of values are more likely to create pure subsets => information gain is biased towards choosing attributes with many values (may result in overfitting: not optimal for prediction)
=> solution: gain ratio
Gain ratio
modification of the information gain that reduces its bias on high-branching attributes => large when data is divided in few, even groups, small when each example belongs to a separate branch
GainRatio(S,A) = Gain(S,A)/IntrinsicInfo(S,A)
intrinsic information
the entropy of distribution of instances into branches: the more branches, the higher this entropy
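A small sketch of intrinsic information and gain ratio, assuming the gain value has already been computed as above; the branch sizes are illustrative:

```python
from math import log2

def intrinsic_info(branch_sizes):
    """Entropy of the split itself: -sum(n_i/n * lg(n_i/n)) over the branches."""
    n = sum(branch_sizes)
    return -sum(n_i / n * log2(n_i / n) for n_i in branch_sizes if n_i > 0)

def gain_ratio(gain, branch_sizes):
    """GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)."""
    return gain / intrinsic_info(branch_sizes)

# splitting 14 examples into branches of size 5, 4 and 5
print(intrinsic_info([5, 4, 5]))       # ≈ 1.577
print(gain_ratio(0.247, [5, 4, 5]))    # ≈ 0.157
```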
entropy over numeric attributes
many possible split points: evaluate information gain for every possible split point, choose best one and its info gain is the information gain for the attribute (breakpoints between values of the same class cannot be optimal)
pruning
prevent overfitting to noise in the data
prepruning, postpruning
prepruning
stop growing the tree when there is no statistically significant dependence between an attribute and the class at a particular node - chi-squared test
problem: prepruning might stop the process too soon (early stopping), eg. the XOR problem (no individual attribute exhibits any significant association to the class; it only becomes visible in the expanded tree)
postpruning
first, build full tree
then, prune: subtree replacement (bottom up, consider replacing a tree only after considering all its subtrees)
estimating error rates
avoid overfitting: don’t model details that will produce errors on future data => estimate such errors
- prune only if it reduces the estimated error
- compute confidence interval around observed error rate (take high end as worst case)
- C4.5: error estimate for subtree is weighted sum of error estimates for all its leaves
covering algorithms
for each class in turn, identify a rule set that covers all examples in it
- a single rule describes a convex concept: it covers only examples of the same class
- generate a rule by adding tests that maximize the rule’s accuracy
- each additional test reduces the rule’s coverage
selecting a test (covering algorithms)
goal: maximize accuracy
- t total number of examples covered by the rule
- p positive examples of the class covered by the rule
- select test that maximizes p/t (finished when p/t = 1)
PRISM algorithm
(separate-and-conquer: all examples covered by a rule are separated out, remaining examples are conquered)
for each class, make rules by adding tests (maximizing p/t) until the rule is perfect and until each example of the class has been covered by a rule
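A minimal sketch of the test-selection step of a PRISM-style covering algorithm (maximize p/t; breaking ties by larger p is one common choice, not prescribed above). The candidate tests are hypothetical:

```python
def best_test(candidates):
    """candidates: list of (test, p, t), with p positives out of t examples covered.
    Returns the test maximizing accuracy p/t, breaking ties by coverage p."""
    return max(candidates, key=lambda c: (c[1] / c[2], c[1]))

# hypothetical candidate tests for one rule-growing step
tests = [("outlook=sunny", 2, 5), ("humidity=high", 3, 7), ("windy=false", 6, 8)]
print(best_test(tests))   # ('windy=false', 6, 8) with p/t = 0.75
```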
instance-based / rote / lazy learning
training set is searched for example that most closely resembles the new example, examples themselves represent the knowledge
- (k-)NN
distance function
- simplest case: 1 numeric attribute (difference between the two values)
- several numeric attributes: Euclidean distance + normalization (square root of the sum of squares of the differences between the two values, over all attributes)
1-NN discussion
- can be accurate, but is slow
- assumes all attributes are equally important (solution: attribute selection or add weights)
- sensitive to noisy examples (solution: take a majority vote over the k nearest neighbours)
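A minimal sketch of the distance function and k-NN vote described above, with min-max normalization so no attribute dominates; the helper names and toy data are illustrative:

```python
from collections import Counter
from math import sqrt

def normalize(dataset):
    """Min-max scale every numeric attribute to [0, 1]."""
    lows  = [min(col) for col in zip(*dataset)]
    highs = [max(col) for col in zip(*dataset)]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in dataset]

def euclidean(a, b):
    """Square root of the sum of squared per-attribute differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote over the k nearest training examples."""
    nearest = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

X = normalize([[180, 80], [160, 50], [170, 65], [182, 78]])
print(knn_predict(X, ["tall", "short", "medium", "tall"], X[0], k=3))   # 'tall'
```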
proper test and training procedure uses 3 sets
training set: train models
validation set: optimize algorithm parameters
test set: evaluate final model
trade-off: performance vs evaluation accuracy
more training data => better model
more test data => more accurate error estimate
ROC analysis
Receiver Operating Characteristic
compute TPR = TP/(TP+FN) and FPR = FP/(FP+TN)
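A one-function sketch of a point in ROC space; the confusion-matrix counts are made up:

```python
def roc_point(tp, fn, fp, tn):
    """One point in ROC space from a confusion matrix."""
    tpr = tp / (tp + fn)   # true positive rate (sensitivity)
    fpr = fp / (fp + tn)   # false positive rate
    return fpr, tpr

print(roc_point(tp=40, fn=10, fp=5, tn=45))   # (0.1, 0.8)
```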
data science
extract insights from large data
less emphasis on algorithms, more on outreach (application)
big data
- very large data (Google, Facebook, Twitter)
- it’s possible: cheap storage
- analytics
- internet-gathered data
- social data
- heterogeneous, unstructured data
exploratory data analysis
- scan data without much prior focus
- find unusual parts in the data
- analyse attribute dependencies: if X=x and Y=y (subgroup) then Z is unusual
- understanding the effects of all attributes on the target
classification
model the dependency of the target attribute on the remaining attributes
subgroup discovery
find all subgroups within the inductive constraints (minimum/maximum coverage, minimum quality, maximum complexity) that show a significant deviation in the distribution of the target attribute
- a quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number, eg. WRAcc, weighted relative accuracy:
WRAcc(S,T) = P(ST) - P(S) * P(T)
(compare to independence: 0 means uninteresting)
(typically produces many patterns with high levels of redundancy)
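A small sketch of WRAcc estimated from counts; the subgroup sizes are hypothetical:

```python
def wracc(n_st, n_s, n_t, n):
    """WRAcc(S, T) = P(ST) - P(S) * P(T), estimated from counts:
    n_st examples in subgroup S with target T, n_s in S, n_t with T, n total."""
    return n_st / n - (n_s / n) * (n_t / n)

# hypothetical subgroup: 60 of its 80 members are positive, 400 of 1000 overall
print(wracc(n_st=60, n_s=80, n_t=400, n=1000))   # 0.028 (> 0, so interesting)
```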
numeric subgroup discovery
numeric target: find subgroups with significantly higher or lower average value
- trade-off between size of subgroup and average target value
we cannot just apply a learner
- algorithms are biased
- considering all possible data distributions, no algorithm works better than another
- algorithms make assumptions about data (eg. all features are equally relevant (kNN, k-means), all features are discrete (ID3), numeric (k-means))
so adapt data to the algorithm (data engineering) or adapt the algorithm to data (selection/parameter tuning)
data engineering
- attribute selection: remove features with little predictive information
- attribute discretization: convert numeric attributes to nominal ones
- data transformations: transform data to another representation
attribute selection
- filter approach: learner independent, based on data properties or simple models built by other learners
- wrapper approach: learner dependent, rerun learner with different attributes, select attributes based on performance (will also consider attributes that are only interesting combined with other attributes, unlike the filter approach, but greedy: O(k^2) for k attributes)
attribute discretization
- discretize values into small intervals
- always loses information: try to preserve as much as possible
- transform into 1 k-valued nominal attribute
- replace with k-1 new binary attributes (preserve order of classes)
unsupervised discretization
determine intervals without knowing class labels
- equal-interval binning (equal-width)
- equal-frequency binning: bins of equal size
- proportional k-interval discretization: equal-frequency binning with # bins = sqrt(dataset size)
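A minimal sketch of equal-width and equal-frequency cut points (also matching the histogram card above); the data values are illustrative:

```python
def equal_width_bins(values, k):
    """Cut the domain into k intervals of equal width; return the k-1 cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_bins(values, k):
    """Choose cut points so each bin holds roughly len(values)/k points."""
    ordered = sorted(values)
    step = len(ordered) / k
    return [ordered[int(i * step)] for i in range(1, k)]

data = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_width_bins(data, 3))       # [71.0, 78.0]
print(equal_frequency_bins(data, 3))   # cut points near the 1/3 and 2/3 quantiles
```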
supervised discretization
class labels are known
- best if all examples in a bin have the same class
eg. entropy discretization: split data in same way C4.5 would (each leaf = bin), use entropy as splitting criterion
data transformations
can lead to new insights in the data and better performance (eg. subtract two “date” attributes to get new “age” attribute)
how good is a regression model?
(for linear data)
coefficient of determination:
R^2 = 1 − SSres/SStot (between 0 and 1)
SSres = Σ(y_i − f_i)^2 (difference between actual and predicted values)
SStot = Σ(y_i − y_bar)^2 (scaling term: difference between the actual values and a horizontal line at the average of all y values)
compare a model where x is used to a model where x does not play a role
explained variance: what percentage of the variance is explained by the model
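A short sketch of the coefficient of determination; the example values are made up:

```python
def r_squared(y_actual, y_predicted):
    """R^2 = 1 - SSres / SStot: fraction of the variance explained by the model."""
    y_bar = sum(y_actual) / len(y_actual)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_actual, y_predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in y_actual)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))   # 0.98
```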
regression trees
- regression variant of decision tree
1. regression tree: constant value in leaf
2. model tree: local linear model in leaf
M5 model (regression trees)
we cannot use entropy (no classification): use standard deviation
- splitting criterion: standard deviation reduction (sdr: compare sd before and after splitting)
SDR = sd(T) – Σ(sd(T_i) * |T_i| / |T|)
- stopping criterion: sd below some threshold or too few examples in a node
- pruning (bottom up): estimate amount of error by splitting
all splits are binary
- numeric: as usual
- nominal: order all values according to average and introduce k-1 indicator variables in this order
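A minimal sketch of the standard deviation reduction used as the splitting criterion; the target values and the split are illustrative:

```python
from statistics import pstdev

def sdr(parent, branches):
    """Standard deviation reduction: SDR = sd(T) - sum(sd(T_i) * |T_i| / |T|)."""
    n = len(parent)
    return pstdev(parent) - sum(pstdev(b) * len(b) / n for b in branches)

# hypothetical split of the target values at one node into two branches
target = [5.0, 5.5, 6.0, 9.0, 9.5, 10.0]
print(sdr(target, [[5.0, 5.5, 6.0], [9.0, 9.5, 10.0]]))   # ≈ 1.63
```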
dissimilar patterns
make sure every additional pattern adds extra value
- consider extent of patterns (what do they represent)
- add binary column for every subgroup
- joint entropy of itemset captures informativeness of a set of items (= set of subgroups)
maximally informative k-itemset (miki)
itemset of size k (k subgroups in it) that maximizes the joint entropy (maximal when all parts are of equal size)
properties:
- p(Xi) = 0.5 has highest entropy
- items in miki are independent
- every individual item adds at most 1 bit of information
- monotonicity: adding an item will always increase the joint entropy
- every item adds at most H(X_i), so a candidate itemset can be discarded if this bound is not above the current maximum
partitions of itemsets
- group items that share information into blocks
- obtain a tighter bound
- precompute the joint entropy of small (2 or 3) itemsets
joint entropy
- H = Σ–p_{i}*lg(p_{i}), summed over the joint value combinations
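A small sketch of the joint entropy of a set of (binary) columns, one row per example; the rows are illustrative:

```python
from collections import Counter
from math import log2

def joint_entropy(rows):
    """H(X1..Xk) = -sum(p * lg(p)) over the observed value combinations."""
    counts = Counter(map(tuple, rows))
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

# two binary columns (e.g. membership in two subgroups), one row per example
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(joint_entropy(rows))   # 2.0 bits: independent and evenly split
```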
itemset
a collection of one or more items, eg. {bread, milk, eggs}
Support count (σ)
frequency of occurrence of an itemset, eg. σ({Bread, Milk, Diapers}) = 2
association rule mining
given a set of transactions, find rules that predict the occurrence of an item based on occurrences of other items in the transaction, X –> Y with X and Y itemsets
support (s) = fraction of transactions that contain both X and Y
confidence (c) = how often items in Y appear in transactions that contain X (how many times does the right hand side occur together with the left hand side)
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
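A minimal sketch of support and confidence over a toy set of transactions (the baskets are made up):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """c(X -> Y): how often Y appears in transactions that contain X."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

T = [{"bread", "milk"}, {"bread", "diapers", "beer"},
     {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"},
     {"bread", "milk", "beer"}]
print(support(T, {"bread", "milk"}))        # 0.6
print(confidence(T, {"bread"}, {"milk"}))   # 0.75
```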
frequent itemset
itemset whose support >= minsup
maximal frequent itemset
if none of its immediate supersets are frequent
closed itemset
if none of its (immediate) supersets have the same support
ensemble
a collection of models that works by aggregating the individual predictions
- classification and regression (supervised learning)
- typically more accurate than base model
- diversity between models improves performance (different models, randomness)
- more randomized models work better
bootstrap aggregating (bagging)
build different models by bootstrapping:
- given a training set D of size n, bagging generates m new training sets, each of size n
- sampling with replacement: for large n, each bootstrap sample contains about 36.8% duplicates –> creates randomness, each time a slightly different model
- performance improves with more learners (large m)
- de-correlate learners by learning from different datasets
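A quick check of the 36.8% / 63.2% figure by drawing one bootstrap sample (the sample size is arbitrary):

```python
import random

def bootstrap_sample(data):
    """Sample n examples with replacement from a training set of size n."""
    return [random.choice(data) for _ in data]

random.seed(0)
n = 10_000
sample = bootstrap_sample(list(range(n)))
print(len(set(sample)) / n)   # ≈ 0.632 unique: about 36.8% of positions are repeats
```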
random subspace method
each tree is built from a random subset of the attributes –> individual trees do not over-focus on attributes that appear highly informative in the training set
works better in high-dimensional problems
random forest
combine bagging and random attribute samples: at each node, take a sample of the attributes and pick the best one
Out-Of-Bag error (OOB)
mean prediction error on each training sample Xi, using only those trees that did not have Xi in their bootstrap sample
measure an attribute’s importance
- make a random forest
- record average OOB error over all the trees => e_0
- for each independent attribute j: randomly permute the values of j, re-evaluate the OOB predictions, record the average OOB error e_j => importance of j = e_j – e_0
benefits of random forests
- good theoretical basis (ensembles)
- easy to run: works well without tuning
- easy to run in parallel: inherent parallelism
clustering: overlapping vs non-overlapping
a point can be a member of more than one cluster vs every point is in one cluster only
clustering: top down vs bottom up
take entire dataset and cut somewhere vs first assume every individual point is a different cluster
clustering: deterministic vs probabilistic
100% sure which cluster a point is in vs the probability of being in a cluster is between 0 and 1
k-means algorithm
1. pick k random points: initial cluster centres
2. assign each point to the nearest cluster centre
3. move the cluster centres to the mean of each cluster
4. reassign points to the nearest cluster centre
repeat steps 3-4 until the cluster centres converge
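A minimal sketch of the four steps above for 2-D points (plain Python, no libraries); function and variable names are illustrative:

```python
import random

def kmeans(points, k, iters=100):
    """Plain k-means on 2-D points: assign to the nearest centre, move centres to means."""
    def dist2(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    centres = random.sample(points, k)          # 1. pick k random points as initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # 2./4. assign each point to the nearest centre
            clusters[min(range(k), key=lambda i: dist2(p, centres[i]))].append(p)
        new_centres = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else centres[i]
            for i, cl in enumerate(clusters)    # 3. move each centre to the mean of its cluster
        ]
        if new_centres == centres:              # stop when the centres no longer move
            break
        centres = new_centres
    return centres, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centres, clusters = kmeans(pts, k=2)            # two well-separated clusters
```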
k-means: discussion
+ simple, understandable
+ fast
+ instances automatically assigned to clusters
- results can vary depending on initial choice of centres
- can get trapped in local minimum (restart with different random seed)
- must pick number of clusters beforehand
- all instances forced into a single cluster
- sensitive to outliers
- random algorithm, random results
distance between clusters (hierarchical clustering)
- centroid: distance between the cluster centres
- single link: smallest distance between points
- complete link: largest distance between points
- average link: average distance between points
hierarchical clustering advantages
- you don’t have to choose the number of clusters beforehand (as in k-means); you can choose it afterwards and see what amount of detail you want (draw a line in the dendrogram)
- can be applied to nominal data more easily than k-means
complexity of association rules
given d unique items
- total nr possible itemsets: 2^d
- total nr possible association rules: 3^d - 2^(d+1) + 1
in case of a frequent k-itemset, the total nr of possible association rules = 2^k - 2
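A quick arithmetic check of the two formulas for d = 6 items and a frequent 3-itemset:

```python
d, k = 6, 3
print(2 ** d)                      # 64 possible itemsets
print(3 ** d - 2 ** (d + 1) + 1)   # 602 possible association rules
print(2 ** k - 2)                  # 6 rules from one frequent 3-itemset
```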