Module 2: Chapter 5 - Semi-supervised Learning Flashcards

1
Q

When is semi-supervised learning used?

A

Semi-supervised learning is a generic term used to describe the situation where a dataset contains both labeled and unlabeled observations. The objective is to identify labels for the unlabeled data.

2
Q

Which three assumptions must hold for semi-supervised learning?

A

The “clustering assumption” – the unlabeled data fall naturally into separable clusters (locally dense regions in feature space)

The “smoothness assumption” or “continuity assumption” – there is a smooth and continuous boundary separating the classes that can be used for deciding the classes of unlabeled instances

The “manifold assumption” – the observed datapoints in the high-dimensional feature space are often concentrated along lower-dimensional substructures that are topological manifolds. A topological manifold is a topological space that locally resembles the Euclidean space ℝⁿ. A way to understand the manifold assumption is to think about a sphere (a three-dimensional object) where all the datapoints are concentrated on the surface (a two-dimensional object). The surface of a sphere is a two-dimensional manifold embedded in a three-dimensional space. The manifold assumption states that the input space is composed of multiple manifolds on which all the datapoints lie, and that all the datapoints in the same manifold belong to the same class.

3
Q

What is the difference between transductive vs. inductive methods in the context of semi-supervised models?

A

Transductive methods do not aim to build a generalizable model and are therefore sometimes considered to arise from a “closed world view”. In this case, because there is no model, the objective is solely to identify labels for the unlabeled data already observed. All instances need to be specified at the time of conducting the analysis, and no new instances can be incorporated into the study and classified at a later stage, so there is no separate test data. One transductive technique is label propagation, which is a graphical technique that assigns labels to unlabeled instances based on how close they are to labeled data points using a metric such as the Euclidean distance applied to the features.

Inductive methods, on the other hand, involve building a model that links the features to the labels, and that can then be applied to other instances. Common inductive methods include self-training and co-training.
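
As an illustration of the transductive label propagation technique mentioned above, here is a minimal Python sketch using scikit-learn on a synthetic dataset in which only the first 20 of 200 observations carry labels; the dataset and parameter choices (a k-nearest-neighbors kernel with 7 neighbors) are purely illustrative assumptions, not part of the source.

# Minimal sketch of transductive label propagation with scikit-learn.
# By convention, unlabeled observations are marked with the label -1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

# Synthetic data: 200 observations, only the first 20 treated as labeled
X, y_true = make_classification(n_samples=200, n_features=5, random_state=0)
y = np.copy(y_true)
y[20:] = -1  # -1 marks an unlabeled instance

# Assign labels to unlabeled points based on proximity in feature space
model = LabelPropagation(kernel="knn", n_neighbors=7)
model.fit(X, y)

# Transductive output: a label for every observation already in the dataset
pseudo_labels = model.transduction_
print("Agreement on previously unlabeled points:",
      (pseudo_labels[20:] == y_true[20:]).mean())

Because the method is transductive, there is no separate test set here: the output is simply the set of labels assigned to the instances that were already observed.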

4
Q

What is self-training (an inductive method)?

A

It is sometimes referred to as a heuristic technique because it employs unlabeled data from a supervised perspective, using methods and models from supervised learning, rather than using both labeled and unlabeled data together in learning.

5
Q

What are the steps in self-training?

A

1) Generate a classification model using any preferred technique (e.g., K nearest neighbors or logistic regression) applied to the labeled part of the data.

2) Apply the model generated in step 1 to all the unlabeled data and generate predicted labels for each instance in the unlabeled part of the data.

3) Select the single instance for which the model’s predicted label has the highest probability of being correct, based on the probability estimates output by the model (e.g., logistic regression or a neural network).

4) Apply the predicted label to the instance selected at stage 3 and shift that datapoint from the unlabeled to the labeled portion of the dataset.

5) Return to stage 1 with the labeled set having now been enlarged by one observation and the unlabeled set reduced by one.

6) Repeat stages 1 to 5 until all the unlabeled data have been labeled, then stop; the model from the final iteration is the final classification model.

The labels applied to the previously unlabeled data are known as pseudo-labels.
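
To make the loop concrete, here is a minimal Python sketch of stages 1 to 6, assuming a logistic-regression base classifier (any classifier that outputs class probabilities would do) and NumPy arrays X_lab, y_lab and X_unlab holding the labeled features, their labels and the unlabeled features; these names and the scikit-learn choice are illustrative assumptions.

# Minimal sketch of self-training (stages 1-6 above).
# Assumes the labeled data contain at least two classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab):
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    while len(X_unlab) > 0:
        # Stage 1: fit the classifier on the current labeled data
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        # Stage 2: predicted class probabilities for every unlabeled instance
        proba = clf.predict_proba(X_unlab)
        # Stage 3: the single instance predicted with the highest confidence
        best = int(np.argmax(proba.max(axis=1)))
        pseudo_label = clf.classes_[int(np.argmax(proba[best]))]
        # Stages 4-5: move that instance, with its pseudo-label, to the labeled set
        X_lab = np.vstack([X_lab, X_unlab[best:best + 1]])
        y_lab = np.append(y_lab, pseudo_label)
        X_unlab = np.delete(X_unlab, best, axis=0)
    # Stage 6: the final model, trained once all data carry (pseudo-)labels
    return LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

Shifting the k most confident instances per round instead of a single one (as noted in the next card on the disadvantages of self-training) only requires selecting the k largest values of proba.max(axis=1) at stage 3.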

6
Q

What are the disadvantages of self-training?

A

It is very computationally intensive because the model is retrained as many times as there are instances in the unlabeled data. If computational resources are constrained, this problem can be mitigated by selecting the k best-predicted observations at stage 3 (where k is a positive integer) and shifting all k observations, along with their predicted labels, at stage 4. For instance, if k = 10, the number of rounds of training required is reduced by a factor of ten.

Retraining the model after the addition of each individual datapoint can result in severe overfitting. Overfitting can be guarded against by a process known as co-training, discussed in the next sub-section.

7
Q

What is co-training?

A

Co-training is another useful method that can be applied when we have two different “views” of an example.

For instance, in building a credit risk model, a borrower’s financial information (income, assets, liabilities etc.) and non-financial information (marital status, tenure of employment, education level etc.) are considered. In this case, the financial and non-financial features can be considered as two different “views” of a borrower. Both are important descriptions of the borrower and provide complementary information. Co-training can utilize both “views” to build two classifiers that teach each other on unlabeled data.

8
Q

What are the steps implemented in co-training?

A

Let us divide the feature set x into two disjoint subsets, xA and xB, representing two different views of the dataset. Co-training assumes that xA and xB are each individually sufficient for learning if we have enough labeled data, and thus a classifier can be built for each of them.

(1) Split feature set x into two disjoint subsets, xA and xB, corresponding to two different views A and B, both for the labeled and unlabeled data.

(2) Generate classification models (Model A and Model B) for the two feature sets of the labeled data.

(3) Apply the models generated in step 2 to the two unlabeled subsets of data and generate predicted labels for each instance in the unlabeled subsets.

(4) For each model, select the observation from its unlabeled subset whose predicted label has the highest probability score (based on the probability output of, e.g., logistic regression or a neural network with a softmax layer).

(5) Assign the predicted labels to the instances selected at stage 4 and shift those data points from the unlabeled to the labeled sets. (The key difference with co-training compared with self-training is that the data points move to the labeled dataset of the other feature subset.) So, the best predicted instance from unlabeled subset A moves to the labeled subset B and vice-versa.

(6) Return to stage 2 with the labeled sets having now been enlarged by one observation each and the unlabeled sets reduced by one each.

(7) Repeat stages 2 to 6 until all the unlabeled data have been labeled, then stop; the models from the final iteration are the final classification models, one for each of the two disjoint feature sets.

(8) Estimate a single supervised model that reunites the two subsets A and B now that all instances have been labeled.
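
The following Python sketch illustrates stages 1 to 8 under some simplifying assumptions: the two views arrive as aligned arrays X_A and X_B (row i of each describes the same instance), unlabeled rows are marked with the label -1, logistic regression is used for both view classifiers, and any conflict when reuniting the pseudo-labels at stage 8 is resolved in favor of view A. All of these names and choices are illustrative rather than prescribed by the method.

# Minimal sketch of co-training (stages 1-8 above).
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_A, X_B, y):
    n = len(y)
    X = {"A": X_A, "B": X_B}
    # Stage 1: per-view labeled sets {instance index: label} and unlabeled index lists
    labeled = {v: {i: y[i] for i in range(n) if y[i] != -1} for v in ("A", "B")}
    unlabeled = {v: [i for i in range(n) if y[i] == -1] for v in ("A", "B")}
    models = {}
    while unlabeled["A"] and unlabeled["B"]:
        picks = {}
        for v in ("A", "B"):
            # Stage 2: fit the view-v classifier on its current labeled set
            idx = list(labeled[v])
            clf = LogisticRegression(max_iter=1000).fit(
                X[v][idx], [labeled[v][i] for i in idx])
            models[v] = clf
            # Stages 3-4: most confident prediction among view v's unlabeled rows
            proba = clf.predict_proba(X[v][unlabeled[v]])
            best = int(np.argmax(proba.max(axis=1)))
            i = unlabeled[v][best]
            picks[v] = (i, clf.classes_[int(np.argmax(proba[best]))])
        # Stages 5-6: each view's best pseudo-labeled instance teaches the OTHER view
        for v, other in (("A", "B"), ("B", "A")):
            i, label = picks[v]
            labeled[other][i] = label
            unlabeled[v].remove(i)
    # Stage 8: reunite the views (conflicts resolved in favor of view A) and
    # fit a single supervised model on the full feature set
    y_final = np.array([labeled["A"].get(i, labeled["B"].get(i)) for i in range(n)])
    final = LogisticRegression(max_iter=1000).fit(np.hstack([X_A, X_B]), y_final)
    return final, models

Note how, because each view's most confident instance is removed from its own unlabeled pool but placed in the other view's labeled pool, the two classifiers teach each other until both unlabeled pools are exhausted (stage 7).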

9
Q

What is the relationship between co-training and overfitting?

A

Because the co-training technique uses different subsets of features to build two different models to augment the training set, it reduces the risk of overfitting. Co-training is sometimes referred to as a disagreement-based method, because it exploits differences in the predictions based on the two subsets of features to improve the training classifications of both as they learn from one another.

10
Q

Which unsupervised pre-processing techniques are used in semi-supervised learning?

A

Unsupervised Pre-processing

(1) Feature extraction – this involves employing techniques such as principal components analysis or autoencoders, discussed in previous chapters, to reduce the dimensionality of the unlabeled data and to represent it more efficiently.

(2) “Cluster-then-label” – as the name suggests, the combined labeled and unlabeled data are subjected to a clustering algorithm such as k-means, and then the resulting clusters are used to train a classifier model. If most of the labeled instances with a given label appear in the same cluster, then that label is assigned to all the unlabeled points in the same cluster. This is another example of pseudo-labeling (a minimal sketch is given after this list). Alternatively, a supervised learning model can be built on the clusters rather than the individual labeled data points.

(3) Pre-training – here, the unlabeled data are formed into clusters that are useful to develop preliminary decision boundaries prior to applying supervised learning.
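
As an illustration of the “cluster-then-label” idea in item (2), here is a minimal Python sketch. It assumes a feature matrix X covering all observations, a label vector y in which -1 marks unlabeled rows, and k-means with an illustrative choice of five clusters; each cluster's majority label among its labeled points becomes the pseudo-label for the unlabeled points in that cluster.

# Minimal sketch of "cluster-then-label" pseudo-labeling.
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_label(X, y, n_clusters=5, random_state=0):
    y = np.asarray(y).copy()
    # Cluster the combined labeled + unlabeled data
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=random_state).fit_predict(X)
    for c in range(n_clusters):
        in_cluster = clusters == c
        known = y[in_cluster & (y != -1)]
        if len(known) == 0:
            continue  # no labeled points in this cluster; leave it unlabeled
        # Majority label among the labeled points in the cluster ...
        values, counts = np.unique(known, return_counts=True)
        majority = values[np.argmax(counts)]
        # ... becomes the pseudo-label for the cluster's unlabeled points
        y[in_cluster & (y == -1)] = majority
    return y  # pseudo-labeled vector, ready for training a supervised classifier

The pseudo-labeled output can then be passed to any supervised classifier, or, as noted above, a model can instead be built on the clusters themselves.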
