definitions and terms Flashcards
data science
a set of fundamental principles that guide the extraction of knowledge from data, the main aim of which in the business community is to improve decision making; a combination of analytical engineering and exploration; data science is a broader term than “data mining”
data mining
data mining focuses on the automated search for knowledge, patterns, or regularities in data; aka KDD: Knowledge Discovery and Data Mining; relative to formal statistical techniques, data mining might be considered partly as hypothesis generation (vs testing)–ie can we find patterns in the data in the first place (to which ordinary statistical hypothesis testing might then be applied)?
machine learning
machine learning is, in part, the collection of methods for extracting (predictive) models from data; ML is concerned with the analysis of data to find useful or informative patterns; the methods are drawn from machine learning itself, applied statistics, and pattern recognition
vs. data mining, ML is more general in application, eg to robotics or computer vision, and may put more emphasis on theory (than on real-world application), while data mining is more narrowly concerned with practical, commercial, and business applications
model (for machine learning)
a model is an abstraction that can perform a prediction, (re)action, or transformation on or in respect of an instance of input values; a simplified representation of reality created to serve a purpose; it is usually simplified by fitting it to a specific purpose, or perhaps simplified because of constraints on information or tractability
predictive vs descriptive models
predictive model: a formula for estimating the unknown value of interest: the target; the formula is often mathematical or a logical statement (such as a rule), and usually it is a hybrid of the two
descriptive model: a model whose primary purpose is to gain insight into the underlying phenomenon or process (eg a descriptive model of customer churn behavior would reveal what attributes customers who churn (leave) typically have); in some sense, we have all the data, and are trying to understand it
induction
the process of generalizing from specific cases to general rules, laws, or truths
the creation of models from data is termed model induction; the tool or procedure that creates the model from the data is aka the induction algorithm, or learner; eg a linear regression procedure will induce a ~parametrized-surface model to fit the data
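as a concrete sketch of induction (toy data invented here; scikit-learn is one choice of library), a linear-regression learner inducing a model from examples:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # toy training instances (invented): two attributes, one numeric target
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    y = np.array([5.0, 4.0, 11.0, 10.0])

    # the induction algorithm ("learner") generalizes from the specific
    # cases in (X, y) to a parametrized model: y ~= w*x + b
    learner = LinearRegression()
    model = learner.fit(X, y)

    print(model.coef_, model.intercept_)  # the induced parameters
    print(model.predict([[5.0, 5.0]]))    # apply the model to a new instance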
instance
an instance, aka an example, represents a data point; it is described by a set of attributes or predictors, and conventionally corresponds to a row in a database or spreadsheet; an instance is sometimes called a feature vector, and in the context of statistics may be called a “case”
loss function
a loss function determines how much penalty should be assigned to an instance based on the error in the model’s predicted value (some aggregate of the per-instance penalties may be used for training the model, or for evaluating it after it’s trained)
some types:
- zero-one function: penalty=0 for correct decision; penalty=1 for incorrect decision; often used for classification problems
- hinge function / hinge loss: penalizes an instance according to how far it sits on the wrong side of a desired separation boundary in the attribute space (the loss graph looks like a hinge); often used for classification problems; the penalty increases the farther on the wrong side of the dividing line an instance is
- squared function: squared error is the square of the distance from the desired value; often used in regression contexts
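a minimal numpy sketch of the three losses above (function names are my own; the hinge version assumes labels in {-1, +1} and a signed distance from the boundary):

    import numpy as np

    def zero_one_loss(y_true, y_pred):
        # penalty 0 for a correct decision, 1 for an incorrect one
        return np.where(y_true == y_pred, 0.0, 1.0)

    def hinge_loss(y_true, score):
        # no penalty when safely on the correct side (margin >= 1);
        # penalty grows linearly the further the instance sits on the
        # wrong side of the boundary; the graph looks like a hinge
        return np.maximum(0.0, 1.0 - y_true * score)

    def squared_loss(y_true, y_pred):
        # square of the distance from the desired value (regression)
        return (y_true - y_pred) ** 2

    print(zero_one_loss(np.array([1, -1]), np.array([1, 1])))   # [0. 1.]
    print(hinge_loss(np.array([1, -1]), np.array([2.0, 0.5])))  # [0. 1.5]
    print(squared_loss(np.array([3.0]), np.array([2.5])))       # [0.25]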
generalization
the property of a model or modeling process, whereby the model applies to data that were not used to build the model
overfitting
finding chance occurrences in the dataset (that seem to fit interesting patterns but) that do not generalize is called overfitting the data
the underlying reasons for overfitting when building models from data are essentially problems of multiple comparisons
mutual information
the amount of information one can obtain about one random variable by observing another; a measure of dependence or “mutual dependence” between two random variables
I(X;Y) = H(X)-H(X|Y) = H(Y)-H(Y|X)
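a quick numpy check of the identity (joint counts invented; using the equivalent form I(X;Y) = H(X) + H(Y) - H(X,Y)):

    import numpy as np

    joint = np.array([[30.0, 10.0],   # invented joint counts for X, Y
                      [10.0, 50.0]])
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    def H(p):
        p = p[p > 0]                  # entropy in bits
        return -(p * np.log2(p)).sum()

    # I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to H(X) - H(X|Y)
    print(H(p_x) + H(p_y) - H(p_xy.ravel()))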
classifiers and “positive” vs “negative” examples
for a classifier, a bad outcome (the event worth detecting, eg churn or fraud) is conventionally regarded as a “positive” example, and a good outcome as a “negative” example
data leakage
in supervised learning, when the training data instances inadvertently contain information about those same instances’ target (eg putting the value of the target variable “in” the attribute vector accidentally)–the target / label value leaks into the attribute / feature vectors we’re training on
base rate
- for classification problems, the base rate classifier (usually) predicts the majority class in the dataset; the base rate is then the number of times that class appears in the dataset, divided by the dataset size
- for regression problems, the baseline is simply the mean or median value of the numeric target variable–a simple model that always predicts this average value exhibits base rate performance
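scikit-learn ships base-rate models directly (one way to do it; data invented):

    import numpy as np
    from sklearn.dummy import DummyClassifier, DummyRegressor

    X = np.zeros((6, 1))                  # attributes are ignored anyway
    y_cls = np.array([0, 0, 0, 0, 1, 1])  # majority class is 0
    y_reg = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

    clf = DummyClassifier(strategy="most_frequent").fit(X, y_cls)
    print(clf.score(X, y_cls))   # base rate: 4/6

    reg = DummyRegressor(strategy="mean").fit(X, y_reg)  # or "median"
    print(reg.predict(X[:1]))    # always predicts the mean, 3.5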
validation set
an intermediate holdout set used to compare different classes of models or, say, to tune over a region in parameter space; after the model is finalized, an outer holdout or final test set may be used for performance metrics
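a minimal sketch of the nesting (split sizes arbitrary), via two successive splits:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(100).reshape(50, 2), np.arange(50)

    # first carve off the outer (final) test set...
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # ...then split the remainder into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
    # tune/compare models on (X_val, y_val); report performance once,
    # on (X_test, y_test), only after the model is finalized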
sequential forward selection (SFS) / sequential backward selection (SBS)
SFS is a method for choosing relevant features for model building; it uses an iterative process with holdout nesting / cross-validation, starting with a single feature, optimizing, then selecting another feature to pair with it, and so on
SBS goes “backwards” from some oversized set of features, paring them down
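scikit-learn’s SequentialFeatureSelector implements both directions (the estimator, dataset, and feature count here are arbitrary choices for illustration):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # direction="forward" gives SFS; direction="backward" gives SBS
    sfs = SequentialFeatureSelector(
        KNeighborsClassifier(n_neighbors=3),
        n_features_to_select=2,
        direction="forward",
        cv=5,  # cross-validation supplies the holdout nesting
    )
    sfs.fit(X, y)
    print(sfs.get_support())  # boolean mask of the selected features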
regularization (aka shrinkage)
when fitting a function-based model to the data, we optimize not just the accuracy of the fit but also the simplicity of the model; we’re fitting on both accuracy and simplicity
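eg ridge (L2) and lasso (L1) regression add a penalty on coefficient size to the fitting objective; a sketch with invented data:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(20, 5)
    y = 2.0 * X[:, 0] + 0.1 * rng.randn(20)  # only feature 0 matters

    # alpha controls the accuracy/simplicity trade-off: larger alpha
    # shrinks the coefficients harder (a "simpler" model)
    print(Ridge(alpha=1.0).fit(X, y).coef_)  # small but nonzero weights
    print(Lasso(alpha=0.1).fit(X, y).coef_)  # L1 can zero weights out entirely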
the curse of dimensionality
when there are very many feature attributes, ie the attribute space is of very high dimension, many methods degrade: data become sparse relative to the space, and distance measures lose their discriminating power
eg for nearest neighbor methods, there may be so many irrelevant dimensions that the distance metric is swamped by all the extraneous distance components
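a quick numpy demo of the effect (uniform random data; the exact numbers don’t matter): as dimension grows, nearest and farthest neighbors become nearly indistinguishable:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        points = rng.random((1000, d))     # 1000 random points in d dimensions
        query = rng.random(d)
        dist = np.linalg.norm(points - query, axis=1)
        print(d, dist.min() / dist.max())  # ratio creeps toward 1 as d grows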
Jaccard distance
used eg for nearest neighbor models–treat each instance object as a set of elements (eg whiskey attributes, like “light yellow”, “salty”, “peaty”, etc)
the distance treats two instance objects as sets of characteristics, with binary “in or out” tags for each attribute (as with the whiskey case study); the distance is one minus the cardinality of the set intersection (logical AND) of the two instance sets, divided by the cardinality of the set union (logical OR) of the two instance sets; it is close to 1 if the sets have little in common, and goes to 0 if the sets are identical
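in code (whiskey attribute names invented for illustration):

    # two whiskeys as sets of binary characteristics
    a = {"light yellow", "salty", "peaty"}
    b = {"light yellow", "peaty", "fruity", "smoky"}

    # 1 - |intersection| / |union|
    jaccard_distance = 1 - len(a & b) / len(a | b)
    print(jaccard_distance)  # 1 - 2/5 = 0.6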
cosine distance
used eg in text classification contexts, to measure the similarity of two documents
for feature vectors x and y, it’s: 1 - (x·y) / (||x||_2 ||y||_2); ie 1 minus the dot product of the feature vectors divided by the product of their L2 norms (so really 1 minus the cosine of the angle between the vectors)
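in code (toy term-count vectors invented):

    import numpy as np

    def cosine_distance(x, y):
        # 1 minus the cosine of the angle between the feature vectors
        return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    x = np.array([1.0, 2.0, 0.0])  # eg term counts for document 1
    y = np.array([2.0, 4.0, 0.0])  # document 2 points in the same direction

    print(cosine_distance(x, y))                          # 0.0 (same angle)
    print(cosine_distance(x, np.array([0.0, 0.0, 3.0])))  # 1.0 (orthogonal)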
class priors
in classification problems, the class prior is an estimate of the probability that randomly sampling an instance from the population will yield the given class (regardless of any attributes of the instance); this has overtones of Bayes, where the “prior probability distribution” is the distribution under consideration “prior” to receiving “new information” (such as that gleaned by a predictive model)
precision/recall, sensitivity/specificity, PPV/NPV
                       true
                 pos     neg     (measure)
    pred  pos    TP      FP      PPV
          neg    FN      TN      NPV
    (measure)    SN      SP      AC
PPV: TP/(TP+FP) (aka precision)
NPV: TN/(FN+TN)
SN: TP/(TP+FN) (aka recall; aka true positive rate)
SP: TN/(FP+TN) (aka true negative rate)
AC: (TP+TN)/(TP+FP+TN+FN) (aka accuracy)
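computing these from a confusion matrix (labels invented; note scikit-learn orders the 2x2 matrix with row/col 0 = negative):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
    y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    ppv = tp / (tp + fp)  # precision
    npv = tn / (fn + tn)
    sn = tp / (tp + fn)   # recall / sensitivity / true positive rate
    sp = tn / (fp + tn)   # specificity / true negative rate
    ac = (tp + tn) / (tp + fp + tn + fn)
    print(ppv, npv, sn, sp, ac)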
F-measure, aka F1 score
harmonic mean of precision and recall (sensitivity): 2*(precision*recall)/(precision + recall)
features:
- if both sensitivity (recall) and precision are “perfect” at 1.0, then the F-measure is 1.0 as well
- if both sensitivity (recall) and precision are “worst” at some small epsilon, then the F-measure is 2*eps^2/(2*eps) = eps
- the F-measure may be useful for imbalanced data sets, where there is the risk of a classifier (when there are very few positive cases) “hedging” by just predicting all cases to be negative–the F1 score may be used to score the model(s) for training (see the sketch after this list)
- NB: this is not to be confused with the F statistic or F test or F value (used eg in ANOVA)
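a tiny sketch of the imbalance point (data invented): an all-negative “hedging” classifier scores high on accuracy but zero on F1:

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score

    y_true = np.array([0] * 95 + [1] * 5)  # only 5% positives
    y_all_negative = np.zeros(100, dtype=int)

    print(accuracy_score(y_true, y_all_negative))             # 0.95, misleading
    print(f1_score(y_true, y_all_negative, zero_division=0))  # 0.0, recall is zero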
ranking classifier
a ranking classifier does more than just predict a class for the test instance; along with the predicted class membership it returns a “certainty” level (eg a score in [0,1]; note this may differ from a true probability estimate of class membership); the instances may then be ranked in order of predicted certainty
features:
* ranking classifiers allow setting better cutoffs for accuracy, and allow more detailed performance analysis through eg confusion matrices and profit curves, ROC curves, and lift curves
* note, a confusion matrix for a ranking classifier with binary categories generally means we’ve assumed the “top” n instances are “positive”; then, in the 2x2 classification table, the top row has n entries and the bottom row has the rest of the instances in the test set
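a sketch of ranking and a top-n cutoff (model, data, and n are arbitrary choices for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)
    y = (X[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)  # invented labels

    clf = LogisticRegression().fit(X, y)

    # certainty scores for the positive class; rank instances by them
    scores = clf.predict_proba(X)[:, 1]
    ranking = np.argsort(-scores)  # most-certain positives first

    n = 10                         # call the top n instances "positive"
    top_n = ranking[:n]
    tp = y[top_n].sum()
    fp = n - tp
    print(tp, fp)                  # the top row of the 2x2 table has n entries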