Final Flashcards
Entity
Object, instance, observation, element, example, line, row, feature vector
Attribute
characteristic, (independent/dependent) variable, column, feature
Unsupervised setting
to identify a pattern (descriptive)
Supervised setting
to predict (predictive)
Induction
Generalizing from a specific case to general rules
Deduction
Applying general rules to derive other specific facts
Induction is developing …
Classification and regression models
Deduction is using
Classification and regression models (apply induction)
Supervised segmentation
How can we segment the population into groups that differ from each other with respect to some quantity of interest?
Entropy
A measure of disorder (impurity) in a set of instances
Entropy is a measure
that tells us how ordered (pure) a set is
Information gain
the difference between the parent's entropy and the weighted sum of the children's entropies
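The two cards above can be made concrete with a short sketch; the helper names (`entropy`, `information_gain`) are illustrative, not from any particular library:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p)) over the classes."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, children):
    """Parent entropy minus the weighted sum of the children's entropies."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5                        # maximally mixed: entropy = 1.0
children = [["yes"] * 4 + ["no"], ["no"] * 4 + ["yes"]]  # a purer split
gain = information_gain(parent, children)                # ≈ 0.278
```

A split that produces purer children yields a positive gain; a split that leaves the children as mixed as the parent yields a gain of zero.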
Laplace correction
smooths frequency-based probability estimates to better learn the underlying distribution that generated the data we are working with
Support vector machine
computes the line (hyperplane) that best separates the data points which are closest to the decision boundary
Support Vector Machines (SVM) –>
If your data does not give you probabilities but only a ranking
Overfitting
Tendency of methods to tailor models exactly to the training data/ finding false patterns through chance occurrences
Overfitting leads to
lack of generalization: model cannot predict on new cases (out-of-sample)
Bias
difference between predicted and real data (when missing the real trends = underfitting)
Variance
variation caused by random noise (modeling by random noise –> overfitting)
SVM sensitive to outliers?
No
Logistic regression sensitive to outliers?
Yes
Increase of complexity in classification trees
number of nodes and small leaf size
Increase of complexity in regressions
number of variables, complex functional forms
Avoiding overfitting (ex ante)
Min size of leaves, max number of leaves, max length of paths, statistical tests
Avoiding overfitting (ex post, based on holdout & cross-validation)
Pruning, sweet spot, ensemble methods (bagging, boosting, random forest)
Ensemble methods
one model can never fully reduce overfitting –> use multiple models
Avoid overfitting: logistic regression –> solution for a too complex relationship
Regularization –> Ridge regression (L2-norm penalty) & Lasso regression (L1-norm penalty)
Distance (measures)
Manhattan, Euclidean, Jaccard, Cosine
Clustering
Use methods to see if elements fall into natural groupings (hierarchical clustering, k-means clustering)
Accuracy
number of correct decisions made / total number of decisions made
(TP+TN)/(P+N)
Problems with accuracy
Unbalanced classes & Problems with unequal costs and benefits
Classification
model is used to classify instances in one category
Ranking
model is used to rank-order instances by the likelihood of belonging to a category
Visualization: Profit curves
When you know the base rate, the classifiers, and the costs and benefits. Determines the best classifier to obtain the maximum expected profit
Visualization: ROC graphs
When we don't know the costs/benefits or the base rate (ROC graphs are not sensitive to class balance).
Compares the classification performance of models, compare the rank-order performance of models
They plot false positive and true positive rate for the different classifiers
The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR).
Visualization: Cumulative response curves
Are intuitive, demonstrate model performance
Visualization: Lift curve
Shows the effectiveness of classifiers: performance of rank-ordering classifiers compared to random
Naive Bayes’ rule different from Bayes’ rule
Naive Bayes assumes that the probability of testing positive on one test, given that someone has cancer (for example), is independent of all other test results
Bag of words approach
treat every document as a collection of individual tokens. Pre-process the text –> term frequencies –> normalized frequency –> determine outcome
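A minimal sketch of the term-frequency step described above, using only the standard library (the function name and tokenization are illustrative assumptions; real pipelines also strip punctuation and stopwords):

```python
from collections import Counter

def bag_of_words(document):
    """Tokenize, lowercase, count term frequencies, normalize by document length."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tf = bag_of_words("Data science needs data and more data")
# "data" accounts for 3 of the 7 tokens
```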
Advanced text analysis methods
N-gram sequences & named entity extraction & topic models
co-occurrence and association rules
idea: to measure the tendency of events to (not) occur together. co-occurrence: measures the relation between one X and one Y. association rules: measure the relation between multiple X’s and one Y
Profiling
to find a typical behavior/ features of an individual/ group/ entity. To predict future behavior, to detect abnormal behavior
Link prediction
to predict connections between entities based upon dyadic similarities (similarities between pairs of entities) and existing links to others. Levels of analysis: dyadic: firms or people (not single). Methods: various: regression, social network analysis etc
Latent dimensions and data reductions
to replace a large dataset with a long list of variables with a smaller dataset minimizing information loss. statistical techniques allow us to reduce the original list of variables to fewer key dimensions or ‘factors’
sustaining competitive advantage with data science
VRIO questions –> valuable, rare, imitability, organization
Sustainability factors
Historical advantage, Unique IP, Unique complementary assets, Superior data scientist, Superior data science management
Laplace correction –> formula to learn the distribution/nature of the data
P(c) = (n+1)/(n+m+2), where n = instances of class c and m = instances of the other class
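The formula above in code form (the function name is illustrative); note how it keeps a leaf with very few observations from claiming certainty:

```python
def laplace_probability(n, m):
    """Laplace-corrected class probability for a leaf with
    n instances of class c and m instances of the other class."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw frequency says p = 1.0,
# but with so little evidence the corrected estimate is more cautious.
laplace_probability(2, 0)  # 0.75
```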
Support Vector Machines (pros + cons)
Pro: simple fast/ flexible loss function/ non-linear functions. Cons: relatively unknown/ may not give solutions/ may require large sample size
Logistic regression (pros + cons)
Pro: shows importance of individual factors/ pretty well-known. Cons: time consuming/ may not converge to a solution/ requires a minimum number of observations
Avoid overfitting - hold out
Use part of the data set to train model (training data set) –> use remaining part of the data to test predictive performance of the model (holdout data)
Avoid overfitting - cross-validation
use different parts of the data set as holdout data –> repeat the holdout method many times on different parts
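The fold-splitting idea behind cross-validation can be sketched in a few lines (the helper name `k_fold_splits` is illustrative; libraries like scikit-learn provide production versions):

```python
def k_fold_splits(data, k):
    """Split the data into k folds; each fold serves once as holdout
    while the remaining folds form the training set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        holdout = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, holdout

data = list(range(10))
splits = list(k_fold_splits(data, 5))  # 5 (training, holdout) pairs
```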
Pruning
Grow large tree with training data –> cut branches that do not improve accuracy based on hold-out data (and replace them with a leaf)
Sweet spot
Create many trees with increasing complexity (number of nodes) –> evaluate predictive performance. Find the optimal complexity based on performance on hold-out data
Bagging (ensemble method)
Repeatedly select random subsets of the observations in the dataset –> create separate trees with each of these subsets. Combine all predictions
Boosting (ensemble method)
Select a random subset of the observations in the dataset to create first tree –> select another random subset + wrong predictions of first model to create second model and third etc. combine predictions from all trees
Random forest (ensemble method)
Repeatedly select random subset of the variables in the dataset, create separate trees for each of these subsets. combine predictions from all trees
Manhattan distance
Given by sum of absolute differences
Euclidean distance
Square root of sum squared differences
Jaccard distance
Similarity equals the intersection divided by the union (can range from 0 to 1).
Distance = 1 - (divisions they have in common / all divisions)
Cosine distance
Similarity based on the angle between two frequency vectors (can range from 0 to 1).
Distance = 1 - cosine similarity: 1 if the companies have no overlap, 0 if they have the same distribution.
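The four distance measures on these cards, sketched with stdlib only (function names are illustrative):

```python
import math

def manhattan(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(a, b):
    """a, b are sets; distance = 1 - |intersection| / |union|."""
    return 1 - len(a & b) / len(a | b)

def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

manhattan([0, 0], [3, 4])                  # 7
euclidean([0, 0], [3, 4])                  # 5.0
jaccard_distance({1, 2, 3}, {2, 3, 4})     # 0.5
```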
Hierarchical clustering
compute distances among all objects/clusters, group the closest objects/clusters together, repeat.
distance measure: Manhattan, Euclidean.
linkage function: distance between cluster centres, distance between nearest objects, etc.
k-means clustering
determine the number of clusters (k) and put k 'centroids' at random positions. Assign each element to its closest centroid. Move each centroid to its cluster's mean. Repeat.
How to choose the optimal complexity
Nested hold-out method / nested cross- validation
Precision
If you want to minimize false positives
TP/(TP+FP)
Recall
If you want to minimize false negatives
True positive rate = TPR = TP/(TP+FN)
= same as sensitivity
Specificity
TNR = TN/(FP+TN)
F-measure
2 × (precision × recall)/(precision + recall)
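The evaluation metrics on the cards above, computed from one confusion matrix (the function name and the example counts are illustrative):

```python
def metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f_measure

# hypothetical confusion matrix
acc, prec, rec, spec, f1 = metrics(tp=80, fp=20, tn=90, fn=10)
```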
Maximizing number of correctly classified
Accuracy: keep in mind the base rate/ F-measure: more refined measure that also balances false positives and false negatives
When optimizing the cost/benefit trade-off
Area under the curve (AUC): correctly predict 'true positives' (profits) and minimize 'false negatives' (losses)
Do not use accuracy
When resources are limited:
Profit curve: find maximum within resource (budget) constraints.
Lift curve: similar, but might go beyond maximum profit point
When maximizing profits
Profit curve: only way to see maximum profits and fraction to target
Conviction
how many more times x would occur without y if x and y were independent, compared to how often x actually occurs without y
Correlation
how likely x and y are to occur (or not occur) together
Support
what is the probability of x and y occurring together
Confidence/strength
given x, how likely is y to occur
lift
how many more times do x and y occur together than we would expect by chance
Leverage
how much more likely do x and y occur together than we would expect by chance
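The association-rule measures above can be computed from a list of transactions; a sketch assuming transactions are sets of items (names are illustrative):

```python
def rule_metrics(transactions, x, y):
    """Support, confidence, lift, leverage, conviction for the rule x -> y."""
    n = len(transactions)
    p_x = sum(1 for t in transactions if x in t) / n
    p_y = sum(1 for t in transactions if y in t) / n
    p_xy = sum(1 for t in transactions if x in t and y in t) / n
    support = p_xy                        # P(x and y)
    confidence = p_xy / p_x               # P(y | x)
    lift = p_xy / (p_x * p_y)             # ratio to independence
    leverage = p_xy - p_x * p_y           # difference from independence
    # conviction: expected vs observed frequency of x without y
    conviction = (p_x * (1 - p_y)) / (p_x - p_xy) if p_x > p_xy else float("inf")
    return support, confidence, lift, leverage, conviction

baskets = [{"beer", "chips"}, {"beer", "chips"}, {"beer"}, {"chips"}, {"milk"}]
results = rule_metrics(baskets, "beer", "chips")
```

Lift and leverage both compare the observed co-occurrence to what independence would predict; lift as a ratio, leverage as a difference.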