Final Flashcards
Entity
Object, instance, observation, element, example, line, row, feature vector
Attribute
characteristic, (independent/dependent) variable, column, feature
Unsupervised setting
to identify a pattern (descriptive)
Supervised setting
to predict (predictive)
Induction
Generalizing from a specific case to general rules
Deduction
Applying general rules to derive specific facts
Induction is developing …
Classification and regression models
Deduction is using
Classification and regression models (developed by induction)
Supervised segmentation
How can we segment the population into groups that differ from each other with respect to some quantity of interest?
Entropy
A measure of the disorder (impurity) of a set of data points
Entropy is a measure
that tells us how ordered a set is (0 when all members share one class, maximal when classes are evenly mixed)
Information gain
the difference between the parent's entropy and the size-weighted sum of the children's entropies
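A minimal sketch of both definitions in Python (the function names are mine): entropy of a list of class labels, and information gain as parent entropy minus the size-weighted child entropies.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Disorder of a set: -sum of p_i * log2(p_i) over class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted sum of child entropies."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(information_gain(parent, split))  # ~0.278: the split reduces disorder
```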
Laplace correction
smooths frequency-based probability estimates (e.g., at small tree leaves) so that a few observations do not produce extreme probabilities: p = (n_c + 1) / (n + k) for k classes
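A minimal sketch of the correction (binary case by default; the helper name is mine):

```python
def laplace_estimate(n_c: int, n: int, k: int = 2) -> float:
    """Laplace-corrected frequency estimate: (n_c + 1) / (n + k).

    A leaf with 2 positives out of 2 yields (2 + 1) / (2 + 2) = 0.75
    rather than the raw, overconfident 2 / 2 = 1.0.
    """
    return (n_c + 1) / (n + k)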
Support vector machine
computes the line (hyperplane) that best separates the classes by maximizing the margin to the data points closest to the decision boundary (the support vectors)
Support Vector Machines (SVM) –>
use when you only need a ranking of cases, not probability estimates: SVM scores rank instances but are not probabilities
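A minimal sketch, assuming scikit-learn: SVC's decision_function returns signed margin scores, which rank instances but are not calibrated probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

# Margin scores: usable for ranking, not probabilities.
print(clf.decision_function(X[:5]))
```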
Overfitting
Tendency of methods to tailor models too closely to the training data, finding false patterns in chance occurrences
Overfitting leads to
a lack of generalization: the model cannot predict well on new (out-of-sample) cases
Bias
systematic difference between predicted and real values (missing the real trend = underfitting)
Variance
variation in the fitted model caused by random noise in the training data (modeling the noise –> overfitting)
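A minimal sketch of both definitions, assuming NumPy: fit an underfitting (degree-1) and an overfitting (degree-15) polynomial to many noisy draws of the same curve; the low-degree fit shows high bias, the high-degree fit high variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
truth = np.sin(2 * np.pi * x)

for degree in (1, 15):
    fits = []
    for _ in range(200):  # 200 training sets differing only in their noise
        y = truth + rng.normal(0, 0.3, x.size)
        fits.append(np.polyval(np.polyfit(x, y, degree), x))
    fits = np.array(fits)
    bias_sq = np.mean((fits.mean(axis=0) - truth) ** 2)  # missing the real trend
    variance = fits.var(axis=0).mean()                   # tracking the noise
    print(f"degree {degree}: bias^2 {bias_sq:.3f}, variance {variance:.3f}")
```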
SVM sensitive to outliers?
No
Logistic regression sensitive to outliers?
Yes
Increase of complexity in classification trees
more nodes and a smaller leaf size
Increase of complexity in regressions
more variables and more complex functional forms
Avoiding overfitting (ex ante)
Min size of leaves, max number of leaves, max length of paths, statistical tests
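A minimal sketch, assuming scikit-learn: the first three ex-ante controls map directly onto DecisionTreeClassifier parameters (the values here are arbitrary).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
tree = DecisionTreeClassifier(
    min_samples_leaf=20,  # min size of leaves
    max_leaf_nodes=16,    # max number of leaves
    max_depth=5,          # max length of paths
).fit(X, y)
```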
Avoiding overfitting (ex post, based on holdout & cross-validation)
Pruning, sweet spot, ensemble methods (bagging, boosting, random forest)
Ensemble methods
a single model can never fully avoid overfitting –> combine multiple models and aggregate their predictions
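A minimal sketch, assuming scikit-learn: bagging, boosting, and a random forest each combine many models; the in-sample scores below are for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

X, y = make_classification(n_samples=300, random_state=0)
for model in (BaggingClassifier(), GradientBoostingClassifier(), RandomForestClassifier()):
    print(type(model).__name__, model.fit(X, y).score(X, y))  # in-sample accuracy
```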
Avoiding overfitting in logistic regression –> solution for an overly complex relationship
Regularization –> Ridge regression (L2-norm penalty) & Lasso regression (L1-norm penalty)
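A minimal sketch, assuming scikit-learn: in LogisticRegression, C is the inverse regularization strength (smaller C = stronger penalty), and the L1 penalty needs a compatible solver.

```python
from sklearn.linear_model import LogisticRegression

ridge_style = LogisticRegression(penalty="l2", C=0.1)                      # L2-norm penalty
lasso_style = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")  # L1-norm penalty
```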
Distance (measures)
Manhattan, Euclidean, Jaccard, Cosine
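A minimal sketch, assuming SciPy, of all four measures on toy binary vectors (SciPy returns distances, so cosine here is 1 minus cosine similarity).

```python
from scipy.spatial.distance import cityblock, cosine, euclidean, jaccard

a, b = [1, 0, 1, 1], [1, 1, 0, 1]
print(cityblock(a, b))  # Manhattan: sum of absolute coordinate differences
print(euclidean(a, b))  # Euclidean: straight-line distance
print(jaccard(a, b))    # Jaccard distance: share of disagreeing nonzero positions
print(cosine(a, b))     # cosine distance: 1 - cosine similarity
```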
Clustering
Use methods to see whether elements fall into natural groupings (hierarchical clustering, k-means clustering)
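A minimal sketch, assuming scikit-learn: k-means and hierarchical (agglomerative) clustering applied to the same toy blobs.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
print(KMeans(n_clusters=3, n_init=10).fit_predict(X)[:10])
print(AgglomerativeClustering(n_clusters=3).fit_predict(X)[:10])
```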
Accuracy
number of correct decisions made / total number of decisions made
(TP+TN)/(P+N)
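A minimal worked example with hypothetical confusion-matrix counts:

```python
tp, tn, fp, fn = 40, 45, 5, 10              # hypothetical counts
accuracy = (tp + tn) / (tp + tn + fp + fn)  # (TP+TN)/(P+N)
print(accuracy)                             # 0.85
```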