ch1 Flashcards
big data
data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them
machine learning
a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty
the long tail
the property of many data sets that a few instances are highly common/likely while most other instances are quite rare
this property apparently shows up in data generated across a wide variety of domains (why?)
supervised learning
the machine learning task of inferring a function from labeled training data
Formally put, the task is to learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs D = {(x_i, y_i)} for i = 1, …, N.
Here, D is the training set, and N is the number of training examples.
features, attributes, covariates
the training inputs x_i in a supervised learning problem, often represented as fixed-length vectors of numbers
response variable
the output y as a function of the input in a supervised learning problem
categorical/nominal variable
a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property
classification vs regression
When the response variable is categorical (i.e. takes on values from a finite set), supervised learning is called classification. When the response variable is real-valued, the problem is known as regression.
ordinal regression
supervised learning when the response variable takes values from a finite set that has some natural ordering (e.g. grades A-F)
descriptive/unsupervised learning, or knowledge discovery
learning where only inputs D = {x_i} for i = 1, …, N are provided and the goal is to find “interesting patterns” in the data.
A much less well-defined problem, since we aren’t told what kinds of patterns to look for and no obvious error metric exists
reinforcement learning
learning how to act or behave when given only occasional reward or punishment signals
binary classification
classification between two classes
multiclass classification
classification between more than two classes
multi-label classification
classification where class labels aren’t mutually exclusive
often best viewed as predicting multiple related binary class labels from the same input (a multiple output model)
multiple output model
model of multi-label classification treating problem as involving prediction of multiple related binary class labels
function approximation
formalization of classification with the assumption that y = f(x) for some unknown function f; the goal is to estimate f from a labeled training set and then predict using ŷ = f̂(x)
maximum a posteriori probability (MAP) estimate
an estimate of an unknown quantity, that equals the mode of the posterior distribution
posterior probability
the conditional probability that is assigned after the relevant evidence or background is taken into account.
posterior probability distribution
the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey
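Tying the MAP and posterior cards together (θ for the unknown quantity and D for the observed data are illustrative notation, not fixed by the cards): by Bayes’ rule, p(θ | D) = p(D | θ) p(θ) / p(D), and the MAP estimate is θ̂_MAP = argmax_θ p(θ | D) = argmax_θ p(D | θ) p(θ), since the denominator p(D) does not depend on θ.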
central idea of bayesian statistics
probability is orderly opinion, and inference from data is nothing other than the revision of such opinion in the light of relevant new information
bag-of-words / vector space model
a simplifying representation used in natural language processing and information retrieval (IR), also known as the vector space model. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
Documents and queries are represented as vectors. Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero.
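A minimal bag-of-words sketch in Python (the toy documents and vocabulary are made up for illustration):

```python
# Toy bag-of-words sketch: each document becomes a count vector over a shared
# vocabulary; grammar and word order are discarded, multiplicity is kept.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted(set(w for d in docs for w in d.split()))

def bow_vector(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]   # one dimension per vocabulary term

print(vocab)
print([bow_vector(d) for d in docs])    # "the" has count 2 in the first document
```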
tf–idf or TFIDF, short for term frequency–inverse document frequency
often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
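A hedged sketch of one common tf-idf variant (raw term frequency times log(N / document frequency); real libraries differ in smoothing and normalization details):

```python
# Toy tf-idf sketch: raw term frequency times log(N / document frequency).
import math
from collections import Counter

docs = [d.split() for d in ["the cat sat on the mat", "the dog sat", "the dog barked"]]
N = len(docs)
df = Counter(w for d in docs for w in set(d))        # number of documents containing w

def tfidf(doc):
    tf = Counter(doc)
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

print(tfidf(docs[0]))   # "the" gets weight 0: it appears in every document
```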
stop words
words which are filtered out before or after processing of natural language data (text)
feature extraction
starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations
dimensionality reduction or dimension reduction
the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
feature selection, also known as variable selection, attribute selection or variable subset selection
the process of selecting a subset of relevant features (variables, predictors) for use in model construction
curse of dimensionality
various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience
The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
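A small numeric illustration of this sparsity, under the toy assumption that the data are uniform in the unit hypercube: the edge length of a sub-cube needed to capture a fixed fraction of the data approaches 1 as the dimension grows.

```python
# Toy illustration: if data are uniform in the unit hypercube [0, 1]^d, a
# sub-cube containing a fraction f of the data must have edge length f**(1/d).
f = 0.10                      # we want the "nearest" 10% of the data
for d in (1, 2, 10, 100):
    print(f"d = {d:>3}: required edge length = {f ** (1 / d):.2f}")
# d = 1: 0.10, d = 2: 0.32, d = 10: 0.79, d = 100: 0.98
```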
overfitting
the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably
The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.
Underfitting
occurs when a statistical model cannot adequately capture the underlying structure of the data. An underfitted model is a model where some parameters or terms that would appear in a correctly specified model are missing. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model will tend to have poor predictive performance.
density estimation
formulation of unsupervised learning as involving the construction of an estimate, based on observed data, of an unobservable underlying probability density function
Cluster analysis or clustering
the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)
latent variables
variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).
model based clustering
fitting a probabilistic model to the data, rather than running some ad hoc algorithm
advantages include that one can compare different kinds of models in an objective way (in terms of the likelihood that they assign to the data)
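A sketch of this idea using Gaussian mixture models (assumes scikit-learn is installed; the data and the candidate component counts are made up), comparing candidate models by the average log-likelihood they assign to held-out data:

```python
# Sketch of model-based clustering with Gaussian mixtures; candidate models are
# compared by held-out average log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
rng.shuffle(X)
train, test = X[:300], X[300:]

for k in (1, 2, 3):
    gm = GaussianMixture(n_components=k, random_state=0).fit(train)
    print(k, gm.score(test))   # average log-likelihood per held-out point
```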
imputation
the process of replacing missing data with substituted values
matrix completion
the task of filling in the missing entries of a partially observed matrix
collaborative filtering
In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
In the more general sense, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc
parametric model
A statistical model with a fixed number of parameters
non-parametric model
statistical model whose number of parameters grows with the amount of training data. More flexible than parametric models (i.e. they make weaker assumptions), but often computationally intractable for large data sets.
K nearest neighbor classifier
Non-parametric classifier that looks at the K points of the training set nearest to the test input x, counts how many members of each class are among them, and returns that empirical fraction as the estimate of the class probabilities.
Loses effectiveness in high dimensions or whenever the nearest neighbors are in fact distant from the test input.
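A minimal KNN sketch in plain NumPy (Euclidean distance and tiny toy data assumed), returning the empirical class fractions among the K nearest training points:

```python
# Minimal KNN sketch: return the empirical fraction of each class among the
# K nearest training points.
import numpy as np

def knn_predict_proba(X_train, y_train, x_test, K=3):
    dists = np.linalg.norm(X_train - x_test, axis=1)    # distance to every training point
    nearest = y_train[np.argsort(dists)[:K]]            # labels of the K closest points
    return {c: float(np.mean(nearest == c)) for c in np.unique(y_train)}

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict_proba(X_train, y_train, np.array([0.2, 0.1])))  # mostly class 0
```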
instance-based or memory-based learning
a family of learning algorithms that, instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory.
inductive bias
the set of assumptions that a learning algorithm uses to predict outputs for inputs it has not encountered
linear regression
parametric model asserting that the response is a linear function of the inputs
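A short least-squares sketch in NumPy (synthetic data; a column of ones is appended so the fit includes an intercept):

```python
# Least-squares linear regression sketch on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)     # noisy line with slope 2, intercept 1

X = np.column_stack([x, np.ones_like(x)])      # design matrix [x, 1]
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                                       # roughly [2.0, 1.0]
```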
error
the deviation of the observed value from the (unobservable) true value of a quantity of interest
residual
the difference between the observed value and the estimated value of the quantity of interest
scalar product
dot product of two vectors
basis function expansion
using linear regression to model nonlinear relationships by replacing x with some nonlinear function of the inputs, φ(x)
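A sketch of a polynomial basis function expansion φ(x) = [1, x, x², …] (degree and data chosen arbitrarily): the model stays linear in the weights even though it is nonlinear in x.

```python
# Polynomial basis function expansion: ordinary least squares on phi(x) fits a
# nonlinear curve with a model that is still linear in its weights.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 100)
y = np.sin(x) + rng.normal(0, 0.1, 100)        # nonlinear target

Phi = np.vander(x, 6, increasing=True)         # columns 1, x, x^2, ..., x^5
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)                                       # weights of the expanded model
```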
bernoulli distribution
the probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 − p
sigmoid function, aka logistic or logit function
S-shaped “squashing function” that maps the whole real line to [0,1], which is necessary for the output to be interpreted as a probability
logistic regression
a regression model where the dependent variable (DV) is categorical (usually binary)
obtained from linear regression by replacing the Gaussian distribution over the response with a Bernoulli distribution and passing the linear combination of the inputs through the sigmoid function
actually classification, not regression, despite its name
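A hedged sketch of logistic regression fitted by plain gradient descent on the negative log-likelihood (synthetic, well-separated two-class data; learning rate and step count are arbitrary):

```python
# Logistic regression by gradient descent: the sigmoid squashes the linear
# score w.x + b into a probability in [0, 1].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)                     # predicted P(y = 1 | x)
    w -= lr * X.T @ (p - y) / len(y)           # gradient step on the weights
    b -= lr * np.mean(p - y)                   # gradient step on the bias

print(((sigmoid(X @ w + b) > 0.5) == y).mean())   # training accuracy, near 1.0
```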
decision rule
a function which maps an observation to an appropriate action
linearly separable
two sets of points are linearly separable if there exists at least one line (more generally, a hyperplane) with all points of the first set on one side and all points of the second set on the other side
generalization error
a measure of how accurately an algorithm is able to predict outcome values for previously unseen data
cross validation
partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to estimate a final predictive model.
often used when there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability
LOOCV, leave-one-out cross-validation
(or, more generally, leave-p-out cross-validation)
using p observations as the validation set and the remaining observations as the training set. This is repeated over all ways of splitting the original sample into a validation set of p observations and a training set; LOOCV is the special case p = 1.
k-fold cross-validation
the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimation.
The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.
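A hand-rolled k-fold sketch in NumPy, with a trivial majority-class predictor standing in for a real learner just to show the split / fit / validate / average mechanics:

```python
# Hand-rolled k-fold cross-validation with a toy majority-class "model".
import numpy as np

def k_fold_score(X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        majority = np.bincount(y[train]).argmax()     # "fit" on the training folds
        scores.append(np.mean(y[val] == majority))    # validate on the held-out fold
    return float(np.mean(scores))

y = np.array([0] * 70 + [1] * 30)
X = np.zeros((100, 1))                                # features unused by the toy model
print(k_fold_score(X, y, k=10))                       # about 0.7
```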
no free lunch theorem
a result that states that, for certain classes of mathematical problems, the computational cost of finding a solution, averaged over all problems in the class, is the same for any solution method, so no method offers a universal “short cut”. In machine learning this is usually read as: there is no single model or learning algorithm that works best across all problems.