Final Flashcards

1
Q

Entity

A

Object, instance, observation, element, example, line, row, feature vector

2
Q

Attribute

A

characteristic, (independent/dependent) variable, column, feature

3
Q

Unsupervised setting

A

to identify a pattern (descriptive)

4
Q

Supervised setting

A

to predict (predictive)

5
Q

Induction

A

Generalizing from a specific case to general rules

6
Q

Deduction

A

Applying general rules to create other specific facts

7
Q

Induction is developing …

A

Classification and regression models

8
Q

Deduction is using

A

Classification and regression models (apply induction)

9
Q

Supervised segmentation

A

How can we segment the population into groups that differ from each other with respect to some quantity of interest?

10
Q

Entropy

A

A mathematical measure used to separate different data points: it quantifies how mixed (impure) a group is

11
Q

Entropy is a method

A

That tells us how ordered a system is

12
Q

Information gain

A

the difference between the parent's entropy and the weighted sum of the children's entropies

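The two entropy cards can be made concrete with a short Python sketch (function names are my own; this illustrates the definitions, not course code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted sum of the children's entropies."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# A perfectly pure split recovers all of the parent's entropy as gain:
parent = ["yes"] * 5 + ["no"] * 5      # entropy = 1.0 (50/50 mix)
split = [["yes"] * 5, ["no"] * 5]      # each child entropy = 0.0
print(information_gain(parent, split)) # 1.0
```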
13
Q

Laplace correction

A

learn the underlying distribution that generated the data we are working with

14
Q

Support vector machine

A

computes the line (hyperplane) that best separates the data points which are closest to the decision boundary

15
Q

Support Vector Machines (SVM) –>

A

If you don’t have data which gives you probabilities but only gives you ranking

16
Q

Overfitting

A

Tendency of methods to tailor models exactly to the training data/ finding false patterns through chance occurrences

17
Q

Overfitting leads to

A

lack of generalization: model cannot predict on new cases (out-of-sample)

18
Q

Bias

A

difference between predicted and real data (when missing the real trends = underfitting)

19
Q

Variance

A

variation caused by random noise (modeling by random noise –> overfitting)

20
Q

SVM sensitive to outliers?

A

No

21
Q

Logistic regression sensitive to outliers?

A

Yes

22
Q

Increase of complexity in classification trees

A

number of nodes and small leaf size

23
Q

Increase of complexity in regressions

A

number of variables, complex functional forms

24
Q

Avoiding overfitting (ex ante)

A

Min size of leaves, max number of leaves, max length of paths, statistical tests

25
Q

Avoiding overfitting (ex post, based on holdout & cross-validation)

A

Pruning, sweet spot, ensemble methods (bagging, boosting, random forest)

26
Q

Ensemble methods

A

one model can never fully reduce overfitting –> use multiple models

27
Q

Avoid overfitting: logistic regression –> solution for a too complex relationship

A

Regularization –> Ridge regression (L2-norm penalty) & Lasso regression (L1-norm penalty)

28
Q

Distance (measures)

A

Manhattan, Euclidean, Jaccard, Cosine

29
Q

Clustering

A

Use methods to see if elements fall into natural groupings (hierarchical clustering, k-means clustering)

30
Q

Accuracy

A

number of correct decisions made / total number of decisions made

(TP+TN)/(P+N)

31
Q

Problems with accuracy

A

Unbalanced classes & Problems with unequal costs and benefits

32
Q

Classification

A

model is used to classify instances in one category

33
Q

Ranking

A

model is used to rank-order instances by the likelihood of belonging to a category

34
Q

Visualization: Profit curves

A

When you know the base rate, the classifiers, and the costs and benefits. Determines the best classifier to obtain maximum expected profit

35
Q

Visualization: ROC graphs

A

Used when we don’t have the costs/benefits or the base rate, or when the sample is unbalanced.

Compares the classification performance of models, compare the rank-order performance of models

They plot false positive and true positive rate for the different classifiers

The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR).

36
Q

Visualization: Cumulative response curves

A

Are intuitive, demonstrate model performance

37
Q

Visualization: Lift curve

A

Shows the effectiveness of classifiers: performance of rank-ordering classifiers compared to random

38
Q

Naive Bayes’ rule different from Bayes’ rule

A

Naive Bayes assumes conditional independence: e.g., the probability of testing positive on one test, given someone has cancer, is independent of all other test results

39
Q

Bag of words approach

A

treat every document as a collection of individual tokens. Pre-process the text –> term frequencies –> normalized frequency –> determine outcome

40
Q

Advanced text analysis methods

A

N-gram sequences & named entity extraction & topic models

41
Q

co-occurrence and association rules

A

idea: to measure the tendency of events to (not) occur together. Co-occurrence measures the relation between one X and one Y; association rules measure the relation between multiple X’s and one Y

42
Q

Profiling

A

to find a typical behavior/ features of an individual/ group/ entity. To predict future behavior, to detect abnormal behavior

43
Q

Link prediction

A

to predict connections between entities based upon dyadic similarities (similarities between pairs of entities) and existing links to others. Level of analysis: dyadic (pairs of firms or people, not single entities). Methods: various (regression, social network analysis, etc.)

44
Q

Latent dimensions and data reductions

A

to replace a large dataset with a long list of variables by a smaller dataset, minimizing information loss. Statistical techniques allow us to reduce the original list of variables to fewer key dimensions or ‘factors’

45
Q

sustaining competitive advantage with data science

A

VRIO questions –> Value, Rarity, Imitability, Organization

46
Q

Sustainability factors

A

Historical advantage, Unique IP, Unique complementary assets, Superior data scientist, Superior data science management

47
Q

Laplace correction –> formula to learn the distribution/nature of the data

A

P(c)= (n+1)/(n+m+2)
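A minimal Python sketch of this formula (function name is mine; per the binary formula on the card, n = instances of the class in the leaf, m = instances of the other class):

```python
def laplace_estimate(n, m):
    """Laplace-corrected probability of class c for a leaf containing
    n instances of c and m instances of the other class (binary case)."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw frequency estimate
# would be 100%; the Laplace correction tempers it toward 50%.
print(laplace_estimate(2, 0))   # 0.75
print(laplace_estimate(0, 0))   # 0.5 (no data -> uniform prior)
```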

48
Q

Support Vector Machines (pros + cons)

A

Pro: simple, fast/ flexible loss function/ non-linear functions. Cons: relatively unknown/ may not give solutions/ may require large sample size

49
Q

Logistic regression (pros + cons)

A

Pro: shows the importance of individual factors/ pretty well-known. Cons: time-consuming, may not find a solution, requires a minimum number of observations

50
Q

Avoid overfitting - hold out

A

Use part of the data set to train the model (training data) –> use the remaining part of the data to test the predictive performance of the model (holdout data)

51
Q

Avoid overfitting - cross-validation

A

use different parts of the data set as hold-out data –> repeat the hold-out method many times on different parts
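The hold-out and cross-validation cards can be sketched as index splitting (an illustrative helper, not course code):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold serves once as the
    hold-out set while the remaining folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, holdout in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, holdout

# 3-fold cross-validation over 6 observations
for train, holdout in k_fold_indices(6, 3):
    print("train:", train, "holdout:", holdout)
```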

52
Q

Pruning

A

Grow large tree with training data –> cut branches that do not improve accuracy based on hold-out data (and replace them with a leaf)

53
Q

Sweet spot

A

Create many trees with increasing complexity (number of nodes) –> evaluate predictive performance. Find the optimal complexity based on performance on hold-out data

54
Q

Bagging (ensemble method)

A

Repeatedly select random subsets of the observations in the dataset –> create a separate tree with each of these subsets –> combine all predictions

55
Q

Boosting (ensemble method)

A

Select a random subset of the observations in the dataset to create the first tree –> select another random subset plus the wrong predictions of the first model to create the second model, third, etc. Combine predictions from all trees

56
Q

Random forest (ensemble method)

A

Repeatedly select random subset of the variables in the dataset, create separate trees for each of these subsets. combine predictions from all trees

57
Q

Manhattan distance

A

Given by sum of absolute differences

58
Q

Euclidean distance

A

Square root of the sum of squared differences

59
Q

Jaccard distance

A

Similarity equals the intersection divided by the union (can range from 0 to 1).
Distance = 1 - (divisions they have in common / all divisions)

60
Q

Cosine distance

A

Measure of similarity based on how similar the frequency distributions are (distance ranges from 0 to 1).
Computed from the products of each division’s percentages, normalized by the square roots of the sums of squares. Distance is 1 if the companies have no overlap, 0 if they have the same distribution.
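The four distance cards can be sketched in pure Python (function names are my own; Jaccard works on sets, the others on numeric vectors):

```python
from math import sqrt

def manhattan(a, b):
    # Sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Square root of the sum of squared differences
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(a, b):
    # 1 - |intersection| / |union|, for sets
    return 1 - len(a & b) / len(a | b)

def cosine_distance(a, b):
    # 1 - (a . b) / (|a| * |b|): 0 for identical direction, 1 for no overlap
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1 - dot / norm

print(manhattan([0, 0], [3, 4]))                  # 7
print(euclidean([0, 0], [3, 4]))                  # 5.0
print(jaccard_distance({"a", "b"}, {"b", "c"}))   # 1 - 1/3
print(cosine_distance([1, 0], [0, 1]))            # 1.0
```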

61
Q

Hierarchical clustering

A

compute distances among all objects/clusters, group the closest objects/clusters together, repeat.
distance measure: Manhattan, Euclidean.
linkage function: distance between cluster centres, distance between nearest objects, etc.

62
Q

k-means clustering

A

determine the number of clusters (k) and put k ‘centroids’ at random positions. Assign each element to the closest cluster (centroid). Move each centroid to its cluster mean. Repeat.
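The k-means steps on this card can be sketched in a few lines of Python (an illustration of the card's loop, not an optimized implementation; it seeds the centroids from random data points):

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def k_means(points, k, iters=20, seed=0):
    """Place k centroids, assign each point to its closest centroid,
    move each centroid to its cluster mean, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k random data points
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups: centroids converge to (0, 0.5) and (10, 10.5)
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = k_means(points, k=2)
```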

63
Q

How to choose the optimal complexity

A

Nested hold-out method / nested cross-validation

64
Q

Precision

A

If you want to minimize false positives

TP/(TP+FP)

65
Q

Recall

A

If you want to minimize false negatives

True positive rate = TPR = TP/(TP+FN)
= same as sensitivity

66
Q

Specificity

A

TNR = TN/(TN+FP)

67
Q

F-measure

A

2 × (precision × recall) / (precision + recall)
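The accuracy, precision, recall, specificity, and F-measure cards combine into one confusion-matrix sketch (illustrative function, names my own; counts below are made up):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Evaluation measures from the cards, from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # (TP+TN)/(P+N)
    precision = tp / (tp + fp)                   # minimize false positives
    recall = tp / (tp + fn)                      # TPR / sensitivity
    specificity = tn / (tn + fp)                 # TNR
    f_measure = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, specificity, f_measure

# 100 cases, 13 of them actually positive
acc, prec, rec, spec, f1 = confusion_metrics(tp=8, fp=2, tn=85, fn=5)
print(acc, prec)   # 0.93 0.8
```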

68
Q

Maximizing number of correctly classified

A

Accuracy: keep the base rate in mind / F-measure: a more refined measure that also balances false positives and false negatives

69
Q

When optimizing the cost/benefit trade-off

A

Area under the curve (AUC): correctly predict true positives (profits) and minimize false negatives (losses).
Do not use accuracy

70
Q

When resources are limited:

A

Profit curve: find maximum within resource (budget) constraints.
Lift curve: similar, but might go beyond maximum profit point

71
Q

When maximizing profits

A

Profit curve: only way to see maximum profits and fraction to target

72
Q

Conviction

A

how many more times x without y would occur by chance, compared to how many times x without y actually occurred

73
Q

Correlation

A

how likely x and y are to occur (or not occur) together

74
Q

Support

A

what is the probability of x and y occurring together

75
Q

Confidence/strength

A

given x, how likely is y to occur

76
Q

lift

A

how many more times do x and y occur together than we would expect by chance

77
Q

Leverage

A

how much more likely do x and y occur together than we would expect by chance
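The five association-rule measures above (support, confidence, lift, leverage, conviction) can be computed from a list of transactions; a sketch with hypothetical baskets, function name my own:

```python
def association_measures(baskets, x, y):
    """Support, confidence, lift, leverage, and conviction for the rule
    x -> y, computed from a list of transaction sets."""
    n = len(baskets)
    p_x = sum(x in b for b in baskets) / n
    p_y = sum(y in b for b in baskets) / n
    p_xy = sum(x in b and y in b for b in baskets) / n
    support = p_xy                          # P(x and y)
    confidence = p_xy / p_x                 # P(y | x)
    lift = p_xy / (p_x * p_y)               # ratio vs. independence
    leverage = p_xy - p_x * p_y             # difference vs. independence
    # Conviction: expected frequency of "x without y" under independence
    # over the observed frequency (undefined when x always implies y).
    conviction = (p_x * (1 - p_y)) / (p_x - p_xy)
    return support, confidence, lift, leverage, conviction

# Hypothetical baskets, purely for illustration
baskets = [{"beer", "chips"}, {"beer", "chips"}, {"beer"}, {"milk"}]
print(association_measures(baskets, "beer", "chips"))
```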