Final Flashcards
Entity
Object, instance, observation, element, example, line, row, feature vector
Attribute
characteristic, (independent/dependent) variable, column, feature
Unsupervised setting
to identify a pattern (descriptive)
Supervised setting
to predict (predictive)
Induction
Generalizing from a specific case to general rules
Deduction
Applying general rules to derive other specific facts
Induction is developing …
Classification and regression models
Deduction is using
Classification and regression models (apply induction)
Supervised segmentation
How can we segment the population into groups that differ from each other with respect to some quantity of interest?
Entropy
A measure of disorder (impurity) in a set of instances
Entropy is a measure
that tells us how ordered (pure) a set is
Information gain
the difference between the parent's entropy and the weighted sum of the children's entropies
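The two cards above can be made concrete with a short sketch; the helper names (`entropy`, `information_gain`) are illustrative, not from any particular library:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p)) over the classes."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, children):
    """Parent entropy minus the weighted sum of the children's entropies."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5                        # maximally mixed: entropy = 1.0
children = [["yes"] * 4 + ["no"], ["no"] * 4 + ["yes"]]  # a purer split
gain = information_gain(parent, children)                # ≈ 0.278
```

A split that produces purer children yields a positive gain; a split that leaves the children as mixed as the parent yields a gain of zero.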
Laplace correction
smooths frequency-based probability estimates to better learn the underlying distribution that generated the data we are working with
Support vector machine
computes the line (hyperplane) that best separates the data points which are closest to the decision boundary
Support Vector Machines (SVM) –>
If your data does not give you probabilities but only a ranking
Overfitting
Tendency of methods to tailor models exactly to the training data/ finding false patterns through chance occurrences
Overfitting leads to
lack of generalization: model cannot predict on new cases (out-of-sample)
Bias
difference between predicted and real data (when missing the real trends = underfitting)
Variance
variation caused by random noise (modeling by random noise –> overfitting)
SVM sensitive to outliers?
No
Logistic regression sensitive to outliers?
Yes
Increase of complexity in classification trees
number of nodes and small leaf size
Increase of complexity in regressions
number of variables, complex functional forms
Avoiding overfitting (ex ante)
Min size of leaves, max number of leaves, max length of paths, statistical tests
Avoiding overfitting (ex post, based on holdout & cross-validation)
Pruning, sweet spot, ensemble methods (bagging, boosting, random forest)
Ensemble methods
one model can never fully reduce overfitting –> use multiple models
Avoid overfitting: logistic regression –> solution for a too complex relationship
Regularization –> Ridge regression (L2-norm penalty) & Lasso regression (L1-norm penalty)
Distance (measures)
Manhattan, Euclidean, Jaccard, Cosine
Clustering
Use methods to see if elements fall into natural groupings (hierarchical clustering, k-means clustering)
Accuracy
number of correct decisions made / total number of decisions made
(TP+TN)/(P+N)
Problems with accuracy
Unbalanced classes & Problems with unequal costs and benefits
Classification
model is used to classify instances in one category
Ranking
model is used to rank-order instances by the likelihood of belonging to a category
Visualization: Profit curves
When you know the base rate, the classifiers, and the costs and benefits. Determines the best classifier to obtain the maximum expected profit
Visualization: ROC graphs
When we don't know the costs/benefits or the base rate (ROC graphs are not sensitive to class balance).
Compares the classification performance of models, compare the rank-order performance of models
They plot false positive and true positive rate for the different classifiers
The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR).
Visualization: Cumulative response curves
Are intuitive, demonstrate model performance
Visualization: Lift curve
Shows the effectiveness of classifiers: performance of rank-ordering classifiers compared to random
Naive Bayes’ rule different from Bayes’ rule
Naive Bayes assumes that the probability of testing positive on one test, given that someone has cancer (for example), is independent of all other test results
Bag of words approach
treat every document as a collection of individual tokens. Pre-process the text –> term frequencies –> normalized frequency –> determine outcome
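A minimal sketch of the term-frequency step described above, using only the standard library (the function name and tokenization are illustrative assumptions; real pipelines also strip punctuation and stopwords):

```python
from collections import Counter

def bag_of_words(document):
    """Tokenize, lowercase, count term frequencies, normalize by document length."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = bag_of_words("Data science needs data and more data")
# "data" accounts for 3 of the 7 tokens
```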
Advanced text analysis methods
N-gram sequences & named entity extraction & topic models
co-occurrence and association rules
idea: to measure the tendency of events to (not) occur together. co-occurrence: measures the relation between one X and one Y. association rules: measure the relation between multiple X’s and one Y
Profiling
to find a typical behavior/ features of an individual/ group/ entity. To predict future behavior, to detect abnormal behavior
Link prediction
to predict connections between entities based upon dyadic similarities (similarities between pairs of entities) and existing links to others. Levels of analysis: dyadic: firms or people (not single). Methods: various: regression, social network analysis etc
Latent dimensions and data reductions
to replace a large dataset with a long list of variables with a smaller dataset minimizing information loss. statistical techniques allow us to reduce the original list of variables to fewer key dimensions or ‘factors’
sustaining competitive advantage with data science
VRIO questions –> valuable, rare, imitability, organization
Sustainability factors
Historical advantage, Unique IP, Unique complementary assets, Superior data scientist, Superior data science management
Laplace correction –> formula to learn the distribution/nature of the data
P(c) = (n+1)/(n+m+2), where n = instances of class c and m = instances of the other class
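The formula above in code form (the function name is illustrative); note how it keeps a leaf with very few observations from claiming certainty:

```python
def laplace_probability(n, m):
    """Laplace-corrected class probability for a leaf with
    n instances of class c and m instances of the other class."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw frequency says p = 1.0,
# but with so little evidence the corrected estimate is more cautious.
laplace_probability(2, 0)  # 0.75
```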
Support Vector Machines (pros + cons)
Pro: simple fast/ flexible loss function/ non-linear functions. Cons: relatively unknown/ may not give solutions/ may require large sample size
Logistic regression (pros + cons)
Pro: shows importance of individual factors/ pretty well-known. Cons: time consuming/ may not converge to a solution/ requires a minimum number of observations
Avoid overfitting - hold out
Use part of the data set to train model (training data set) –> use remaining part of the data to test predictive performance of the model (holdout data)
Avoid overfitting - cross-validation
use different parts of the data set as holdout data –> repeat the holdout method many times on different parts
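The fold-splitting idea behind cross-validation can be sketched in a few lines (the helper name `k_fold_splits` is illustrative; libraries like scikit-learn provide production versions):

```python
def k_fold_splits(data, k):
    """Split the data into k folds; each fold serves once as holdout
    while the remaining folds form the training set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        holdout = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, holdout

data = list(range(10))
splits = list(k_fold_splits(data, 5))  # 5 (training, holdout) pairs
```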
Pruning
Grow large tree with training data –> cut branches that do not improve accuracy based on hold-out data (and replace them with a leaf)
Sweet spot
Create many trees with increasing complexity (number of nodes) –> evaluate predictive performance. Find the optimal complexity based on performance on hold-out data
Bagging (ensemble method)
Repeatedly select random subsets of the observations in the dataset –> create separate trees with each of these subsets. Combine all predictions
Boosting (ensemble method)
Select a random subset of the observations in the dataset to create first tree –> select another random subset + wrong predictions of first model to create second model and third etc. combine predictions from all trees
Random forest (ensemble method)
Repeatedly select random subset of the variables in the dataset, create separate trees for each of these subsets. combine predictions from all trees
Manhattan distance
Given by sum of absolute differences
Euclidean distance
Square root of sum squared differences
Jaccard distance
Similarity equals the intersection divided by the union (can range from 0 to 1).
Distance = 1 - (divisions they have in common / all divisions)
Cosine distance
Similarity based on the angle between two frequency vectors (can range from 0 to 1).
Distance = 1 - cosine similarity: 1 if the companies have no overlap, 0 if they have the same distribution.
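The four distance measures on these cards, sketched with stdlib only (function names are illustrative):

```python
import math

def manhattan(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(a, b):
    """a, b are sets; distance = 1 - |intersection| / |union|."""
    return 1 - len(a & b) / len(a | b)

def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

manhattan([0, 0], [3, 4])                  # 7
euclidean([0, 0], [3, 4])                  # 5.0
jaccard_distance({1, 2, 3}, {2, 3, 4})     # 0.5
```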
Hierarchical clustering
compute distances among all objects/clusters, group the closest objects/clusters together, repeat.
distance measure: Manhattan, Euclidean.
linkage function: distance between cluster centres, distance between nearest objects, etc.
k-means clustering
determine the number of clusters (k) and put k 'centroids' at random positions. Assign each element to its closest centroid. Move each centroid to its cluster's mean. Repeat.
How to choose the optimal complexity
Nested hold-out method / nested cross- validation
Precision
If you want to minimize false positives
TP/(TP+FP)
Recall
If you want to minimize false negatives
True positive rate = TPR = TP/(TP+FN)
= same as sensitivity
Specificity
TNR = TN/(FP+TN)
F-measure
2 × (precision × recall)/(precision + recall)
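The evaluation metrics on the cards above, computed from one confusion matrix (the function name and the example counts are illustrative):

```python
def metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f_measure

# hypothetical confusion matrix
acc, prec, rec, spec, f1 = metrics(tp=80, fp=20, tn=90, fn=10)
```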
Maximizing number of correctly classified
Accuracy: keep in mind the base rate/ F-measure: more refined measure that also balances false positives and false negatives
When optimizing the cost/benefit trade-off
Area under the curve (AUC): correctly predict 'true positives' (profits) and minimize 'false negatives' (losses)
Do not use accuracy
When resources are limited:
Profit curve: find maximum within resource (budget) constraints.
Lift curve: similar, but might go beyond maximum profit point
When maximizing profits
Profit curve: only way to see maximum profits and fraction to target
Conviction
how many more times x would occur without y if x and y were independent, compared to how often x actually occurs without y
Correlation
how likely x and y are to occur (or not occur) together
Support
what is the probability of x and y occurring together
Confidence/strength
given x, how likely is y to occur
lift
how many more times do x and y occur together than we would expect by chance
Leverage
how much more likely do x and y occur together than we would expect by chance
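The association-rule measures above can be computed from a list of transactions; a sketch assuming transactions are sets of items (names are illustrative):

```python
def rule_metrics(transactions, x, y):
    """Support, confidence, lift, leverage, conviction for the rule x -> y."""
    n = len(transactions)
    p_x = sum(1 for t in transactions if x in t) / n
    p_y = sum(1 for t in transactions if y in t) / n
    p_xy = sum(1 for t in transactions if x in t and y in t) / n
    support = p_xy                        # P(x and y)
    confidence = p_xy / p_x               # P(y | x)
    lift = p_xy / (p_x * p_y)             # ratio to independence
    leverage = p_xy - p_x * p_y           # difference from independence
    # conviction: expected vs observed frequency of x without y
    conviction = (p_x * (1 - p_y)) / (p_x - p_xy) if p_x > p_xy else float("inf")
    return support, confidence, lift, leverage, conviction

baskets = [{"beer", "chips"}, {"beer", "chips"}, {"beer"}, {"chips"}, {"milk"}]
results = rule_metrics(baskets, "beer", "chips")
```

Lift and leverage both compare the observed co-occurrence to what independence would predict; lift as a ratio, leverage as a difference.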