definitions and terms Flashcards
data science
a set of fundamental principles that guide the extraction of knowledge from data, the main aim of which in the business community is to improve decision making; a combination of analytical engineering and exploration; data science is a broader term than “data mining”
data mining
data mining focuses on the automated search for knowledge, patterns, or regularities in data; aka KDD: Knowledge Discovery and Data mining; relative to formal statistical techniques, data mining might be considered partly as hypothesis generation (vs testing), ie can we find patterns in the data in the first place (to which ordinary statistical hypothesis testing might then be applied)?
machine learning
machine learning is, in part, the collection of methods for extracting (predictive) models from data; ML is concerned with the analysis of data to find useful or informative patterns; these methods were developed within the fields of machine learning, applied statistics, and pattern recognition
vs. data mining, ML is more general in application, eg to robotics or computer vision, and may put more emphasis on theory (than real-world application), while data mining is more narrowly concerned with practical, commercial and business applications
model (for machine learning)
a model is an abstraction that can perform a prediction, (re)action, or transformation with respect to an instance of input values; a simplified representation of reality created to serve a purpose; it is usually simplified by fitting it to a specific purpose, or perhaps simplified because of constraints on information or tractability
predictive vs descriptive models
predictive model: a formula for estimating the unknown value of interest: the target; the formula is often mathematical or a logical statement (such as a rule), and is usually a hybrid of the two
descriptive model: a model whose primary purpose is to gain insight into the underlying phenomenon or process (eg a descriptive model of customer churn behavior would reveal what feature attributes customers who churn (leave) typically have); in some sense, we have all the data, and are trying to understand it
induction
the process of generalizing from specific cases to general rules, laws, or truths
the creation of models from data is termed model induction; the tool or procedure that creates the model from the data is aka the induction algorithm, or learner; eg a linear regression procedure will induce a parameterized linear model (a line, plane, or hyperplane) to fit the data
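a minimal sketch (not from the source) of model induction, using numpy's least-squares routine as the "learner" and made-up data for illustration:

```python
import numpy as np

# Toy "induction": the learner (np.linalg.lstsq) induces a parameterized
# linear model from instances (X) and their target values (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))               # 100 instances, 2 features
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

X1 = np.column_stack([X, np.ones(len(X))])  # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

def model(x):
    """The induced model: predicts a target value for a new instance x."""
    return x @ coef[:-1] + coef[-1]

print(coef)                      # learned parameters, approx. [3.0, -1.5, 0.0]
print(model(np.array([1.0, 2.0])))
```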
instance
an instance, aka an example, represents a data point; it is described by a set of attributes or predictors, and conventionally corresponds to a row in a database table or spreadsheet; an instance is sometimes called a feature vector, and in the context of statistics may be called a "case"
loss function
a loss function determines how much penalty should be assigned to an instance based on the error in the model’s predicted value (some form of aggregate penalty may be used for training the model, or evaluating the model after it’s trained)
some types (a short code sketch follows this list):
- zero-one function: penalty=0 for correct decision; penalty=1 for incorrect decision; often used for classification problems
- hinge function / hinge loss: penalizes an instance based on how far it is on the "wrong side" of a desired separation boundary in the feature space (the loss graph looks like a hinge); often used for classification problems; the penalty increases the farther on the wrong side of the dividing line an instance is
- squared function: squared error is the square of the distance from the desired value; often used in regression contexts
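a minimal sketch (not from the source) of the three loss types above, with made-up inputs:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0 for a correct decision, 1 for an incorrect one (classification)."""
    return np.where(y_true == y_pred, 0.0, 1.0)

def hinge_loss(y_true, score):
    """y_true in {-1, +1}; score is the signed distance from the boundary.
    Zero when safely on the correct side, growing linearly on the wrong side."""
    return np.maximum(0.0, 1.0 - y_true * score)

def squared_loss(y_true, y_pred):
    """Square of the distance from the desired value (regression)."""
    return (y_true - y_pred) ** 2

# some form of aggregate penalty, eg the mean over instances:
print(zero_one_loss(np.array([1, 0, 1]), np.array([1, 1, 1])).mean())
print(hinge_loss(np.array([1, -1]), np.array([2.0, 0.5])).mean())
print(squared_loss(np.array([1.0, 2.0]), np.array([1.5, 1.0])).mean())
```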
generalization
the property of a model or modeling process, whereby the model applies to data that were not used to build the model
overfitting
finding chance occurrences in the dataset (that seem to fit interesting patterns but) that do not generalize is called overfitting the data
the underlying reasons for overfitting when building models from data are essentially problems of multiple comparisons
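an illustrative sketch (not from the source, data made up): a high-degree polynomial fits chance fluctuations in the training data but fails to generalize to a holdout set:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
y = x + rng.normal(scale=0.3, size=30)      # true relationship is linear

x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

for degree in (1, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # the overfit model has lower training error but higher holdout error
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, holdout MSE {test_err:.3f}")
```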
mutual information
the amount of information one can obtain from one random variable given another; a measure of dependence or “mutual dependence” between two random variables
I(X;Y) = H(X)-H(X|Y) = H(Y)-H(Y|X)
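a minimal sketch (not from the source) computing I(X;Y) from a made-up joint distribution, using the equivalent identity I(X;Y) = H(X) + H(Y) - H(X,Y):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zeros ignored)."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# hypothetical joint distribution P(X, Y) as a table (rows = x, cols = y)
pxy = np.array([[0.25, 0.05],
                [0.05, 0.65]])
px = pxy.sum(axis=1)            # marginal P(X)
py = pxy.sum(axis=0)            # marginal P(Y)

mi = entropy(px) + entropy(py) - entropy(pxy.ravel())
print(mi)                       # mutual information I(X;Y) in bits
```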
classifiers and “positive” vs “negative” examples
for a classifier, the class of primary interest is conventionally labeled "positive"; often a bad outcome (eg churn or fraud) is regarded as a "positive" example, and a good outcome as a "negative" example
data leakage
in supervised learning, when the training data instances inadvertently contain information about those same instances' target (eg putting the value of the target variable "in" the attribute vector accidentally); the target / label value leaks into the attribute / feature vectors we're training on
base rate
- for classification problems, the base rate classifier is (usually) the majority class in the dataset; the base rate is then the number of times that class appears in the dataset, divided by dataset size
- for regression problems, the baseline is simply the mean or median value of the numeric target variable–a simple model that always predicts this average value exhibits base rate performance (see the sketch below)
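a minimal sketch (assuming scikit-learn is available, data made up) of base rate performance via the "dummy" estimators:

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

X = np.zeros((8, 1))                            # features are ignored by the dummies
y_class = np.array([1, 1, 1, 1, 1, 0, 0, 0])    # majority class is 1
y_reg = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

base_clf = DummyClassifier(strategy="most_frequent").fit(X, y_class)
print(base_clf.score(X, y_class))               # base rate: 5/8 = 0.625

base_reg = DummyRegressor(strategy="mean").fit(X, y_reg)
print(base_reg.predict(X[:1]))                  # always predicts the mean, 5.5
```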
validation set
an intermediate holdout set used to compare different classes of models or to tune over a region in parameter space (eg hyperparameters); after the model is finalized, an outer holdout or final test set may be used for the reported performance metrics
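a minimal sketch (assuming scikit-learn, data made up) of carving the data into training, validation, and final test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100)

# outer split: hold out a final test set for reported performance metrics
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# inner split: a validation set for model / hyperparameter selection
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))    # 60 / 20 / 20
```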
sequential forward selection (SFS) / sequential backward selection (SBS)
SFS is a method for choosing relevant features for model building, using an iterative process with nested holdout sets / cross-validation: start with the best single feature, then select another feature to pair with it, re-evaluate, and so on
SBS goes "backwards" from some oversized set of features, paring them down one at a time
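a hedged sketch using scikit-learn's SequentialFeatureSelector (assuming sklearn >= 0.24 and its bundled breast cancer dataset); each step evaluates candidate features by cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# forward selection: grow the feature set one feature at a time
sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="forward", cv=5).fit(X, y)
print("forward :", sfs.get_support(indices=True))

# backward selection: start with all features and pare them down
sbs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="backward", cv=5).fit(X, y)
print("backward:", sbs.get_support(indices=True))
```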
regularization (aka shrinkage)
when fitting a function-based model to the data, we optimize not just the accuracy of the fit but also the simplicity of the model; ie we fit on a combination of accuracy and a penalty on model complexity
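a minimal sketch (assuming scikit-learn, data made up): Ridge and Lasso add a penalty on coefficient size to the fitting objective, shrinking the coefficients relative to an unregularized fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
y = X[:, 0] + 0.1 * rng.normal(size=50)     # only the first feature matters

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    # regularized fits shrink the irrelevant coefficients toward zero
    print(type(model).__name__, np.round(model.coef_, 2))
```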
the curse of dimensionality
when there are very many feature attributes, ie the feature space is of very high dimension, many methods degrade: the data become sparse and distances become less informative
eg for nearest neighbor methods, there may be so many irrelevant dimensions that the distance metric is swamped by the extraneous distance components
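an illustrative sketch (not from the source, random data): as the number of dimensions grows, the nearest and farthest neighbors of a point end up at nearly the same distance, so distance-based methods lose discriminating power:

```python
import numpy as np

rng = np.random.default_rng(3)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))          # 500 random points in d dimensions
    q = rng.uniform(size=d)                 # a query point
    dist = np.linalg.norm(X - q, axis=1)
    # the ratio approaches 1 as d grows: distances "concentrate"
    print(f"d={d:4d}  nearest/farthest distance ratio = {dist.min() / dist.max():.3f}")
```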