Intro Flashcards
What is Classification?
Assign a category to each item. For example, document classification may assign items to categories such as politics, business, sports, or weather, while image classification may assign items to categories such as landscape, portrait, or animal. The number of categories in such tasks is often relatively small, but it can be large in some difficult tasks and even unbounded, as in OCR (Optical Character Recognition), text classification, or speech recognition.
What is Regression?
Predict a real value for each item. Examples of regression include prediction of stock values or variations of economic variables. In this problem, the penalty for an incorrect prediction depends on the magnitude of the difference between the true and predicted values, in contrast with the classification problem, where there is typically no notion of closeness between various categories.
What is Ranking?
Order items according to some criterion. Web search, e.g., returning web pages relevant to a search query, is the canonical ranking example. Many other similar ranking problems arise in the context of the design of information extraction or natural language processing systems.
What is Clustering?
Partition items into homogeneous regions. Clustering is often performed to analyze very large data sets. For example, in the context of social network analysis, clustering algorithms attempt to identify “communities” within large groups of people.
What is Dimensionality Reduction or Manifold Learning?
Transform an initial representation of items into a lower-dimensional representation of these items while preserving some properties of the initial representation. A common example involves preprocessing digital images in computer vision tasks.
What are Examples?
Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.
What are Features?
The set of attributes, often represented as a vector, associated with an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.
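As a rough illustration of turning an email into a feature vector, here is a minimal sketch. The function name `extract_features`, the keyword list, and the specific attributes chosen are hypothetical and not part of the original definitions.

```python
# Hypothetical feature extractor for the spam example. The keyword list and
# the chosen attributes are illustrative assumptions, not from the source.
KEYWORDS = ("free", "winner", "urgent")


def extract_features(sender: str, body: str) -> list[float]:
    """Map one email (an example) to a fixed-length feature vector."""
    words = body.lower().split()
    features = [
        float(len(body)),                          # length of the message
        1.0 if sender.endswith(".com") else 0.0,   # crude sender attribute
    ]
    # One binary feature per keyword: 1.0 if the keyword occurs in the body.
    features.extend(1.0 if kw in words else 0.0 for kw in KEYWORDS)
    return features


body = "You are a WINNER claim your free prize"
vector = extract_features("promo@example.com", body)
```

Every email is mapped to a vector of the same length, which is what lets a learning algorithm treat all examples uniformly.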
What are Labels?
Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.
What are Training Samples?
Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels.
What are Validation Samples?
Examples used to tune the parameters of a learning algorithm when working with labeled data. Learning algorithms typically have one or more free parameters, and the validation sample is used to select appropriate values for these model parameters.
What are Test Samples?
Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage. In the spam problem, the test sample consists of a collection of email examples for which the learning algorithm must predict labels based on features. These predictions are then compared with the labels of the test sample to measure the performance of the algorithm.
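One possible way to carve a labeled collection into training, validation, and test samples is sketched below. The helper `split_sample`, the 60/20/20 proportions, and the toy email data are assumptions made for illustration only.

```python
import random


def split_sample(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle labeled examples and split them into three disjoint samples.

    The fractions are illustrative defaults, not values from the source.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # never shown to the learner
    return train, validation, test


# Toy labeled examples: (email text, label) pairs.
emails = [(f"email {i}", "spam" if i % 3 == 0 else "non-spam") for i in range(10)]
train, validation, test = split_sample(emails)
```

The point of the sketch is that the three samples are disjoint: parameters are fit on `train`, free parameters are chosen using `validation`, and `test` is held back until the final evaluation.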
What is a Loss Function?
A function that measures the difference, or loss, between a predicted label and a true label. Denoting the set of all labels as Y and the set of possible predictions as Y’, a loss function L is a mapping L : Y × Y’ → R₊. In most cases, Y’ = Y and the loss function is bounded, but these conditions do not always hold. Common examples of loss functions include the zero-one (or misclassification) loss, defined over {−1, +1} × {−1, +1} by L(y, y’) = 1_{y’ ≠ y} (equal to 1 when y’ ≠ y and 0 otherwise), and the squared loss, defined over I × I by L(y, y’) = (y’ − y)², where I ⊆ R is typically a bounded interval.
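The two losses named above can be written out directly. This is a minimal sketch; the helper `average_loss`, which averages a loss over a list of predictions, is an illustrative addition and not part of the definition.

```python
def zero_one_loss(y_true: int, y_pred: int) -> float:
    """Zero-one (misclassification) loss over labels in {-1, +1}."""
    return 1.0 if y_pred != y_true else 0.0


def squared_loss(y_true: float, y_pred: float) -> float:
    """Squared loss: grows with the distance between prediction and truth."""
    return (y_pred - y_true) ** 2


def average_loss(loss, y_true, y_pred):
    """Illustrative helper: mean of a loss over paired labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    return sum(loss(y, yp) for y, yp in pairs) / len(pairs)


# Two of the four predictions are wrong, so the average zero-one loss is 0.5.
error_rate = average_loss(zero_one_loss, [1, -1, 1, 1], [1, 1, 1, -1])
```

The zero-one loss treats every mistake the same, while the squared loss penalizes a prediction more the further it lies from the true value, which matches the classification versus regression distinction.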
What is the Hypothesis Set?
A set of functions mapping features (feature vectors) to the set of labels Y. In our example, these may be a set of functions mapping email features to Y = {spam, non-spam}. More generally, hypotheses may be functions mapping features to a different set Y’ . They could be linear functions mapping email feature vectors to real numbers interpreted as scores (Y’ = R), with higher score values more indicative of spam than lower ones.
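To make the idea of linear scoring hypotheses concrete, here is a small sketch. The names `make_linear_hypothesis` and `to_label`, the weights, and the threshold of 0 are all hypothetical choices for illustration.

```python
def make_linear_hypothesis(weights, bias=0.0):
    """Build one hypothesis h: feature vector -> real-valued score (Y' = R)."""
    def h(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return h


def to_label(score, threshold=0.0):
    """Assumed decision rule: higher scores are more indicative of spam."""
    return "spam" if score > threshold else "non-spam"


# A tiny hypothesis set: each choice of weights and bias is one hypothesis.
hypothesis_set = [
    make_linear_hypothesis([1.0, -1.0]),
    make_linear_hypothesis([0.5, 2.0], bias=-1.0),
]

x = [2.0, 1.0]                          # feature vector of one email
scores = [h(x) for h in hypothesis_set]  # one score per hypothesis
```

A learning algorithm would select one function from such a set. In practice the set is usually infinite (all weight vectors), and the two-element list here is only for illustration.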
What is Supervised Learning?
The learner receives a set of labeled examples as training data and makes predictions for all unseen points. This is the most common scenario associated with classification, regression, and ranking problems. The spam detection problem discussed in the previous section is an instance of supervised learning.
What is Unsupervised Learning?
The learner exclusively receives unlabeled training data and makes predictions for all unseen points. Since in general no labeled example is available in that setting, it can be difficult to quantitatively evaluate the performance of a learner. Clustering and dimensionality reduction are examples of unsupervised learning problems.