10 - Representation Learning Flashcards
What is Semi-Supervised Learning?
- Often there is a lot of data but only a small subset is labeled
- Training only on the small labeled dataset will most likely cause overfitting
- Semi-supervised learning can be a good solution, as it trains on both the labeled and the unlabeled data (as sketched below)
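A minimal sketch of this idea using scikit-learn's `SelfTrainingClassifier` (self-training / pseudo-labeling, one of several semi-supervised strategies). The dataset and the 5% labeled fraction are made-up placeholders; unlabeled points are marked with the label `-1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Pretend only ~5% of the data is labeled; the rest gets the "unlabeled" marker -1.
y_partial = y.copy()
rng = np.random.default_rng(0)
unlabeled_mask = rng.random(len(y)) > 0.05
y_partial[unlabeled_mask] = -1

# The wrapped classifier is trained on the labeled points and then iteratively
# pseudo-labels confident unlabeled points, so both parts of the data are used.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)
print(model.score(X, y))  # evaluate against the true labels
```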
What is a good representation?
Information processing tasks can vary in difficulty depending on the representation of the information (e.g. arithmetic is much easier with Arabic numerals than with Roman numerals).
Generally, one representation is better than another if it makes the subsequent learning task easier → so the choice of representation obviously depends on the learning task.
LLMs & unsupervised learning
LLMs actually do this: they take sentences, cut off a word, and then guess it. Afterwards, though, they use supervised alignment (reinforcement learning from human feedback).
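A tiny illustration (not a real LLM) of the self-supervised objective described above: from raw text alone we can build (context, next word) training pairs, so no human labels are needed. The example sentence and whitespace tokenization are simplifications.

```python
sentence = "representation learning makes downstream tasks easier".split()

# Build (context, next-word) pairs directly from the raw text.
pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]
for context, target in pairs:
    print(f"predict {target!r} from {context}")
```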
What is Greedy Layer-wise Unsupervised Pretraining
→ also shortened to “Unsupervised Pretraining”
Originally used because it was problematic to jointly train all layers of a deep neural net for a supervised task. It was later found to be very useful for finding a good initialization for a joint learning procedure, and it even made it possible to successfully train fully connected architectures. It thus finally allowed training deep supervised nets without architectural specializations like convolution or recurrence.
Unsupervised pretraining relies on a single-layer representation learning algorithm
- E.g. an RBM, a single-layer autoencoder, a sparse coding model, etc.
- Each layer is pretrained using unsupervised learning, and its output becomes the next layer's input
- Hopefully each layer produces output whose distribution (or relation to other variables, e.g. the categories to predict) is simpler than before
Greedy: The algorithm optimizes each solution piece independently instead of jointly, which is why it is called greedy
Layer-Wise: the independent pieces are the layers of the network. When one layer is trained, the others (previous layers) stay fixed
Unsupervised: layers are trained with an unsupervised representation learning algorithm
Pretraining: because afterwards another, joint (supervised) training algorithm is applied to fine-tune all layers together, so it is just a first step. Sometimes "pretraining" refers to both phases.
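A minimal sketch of greedy layer-wise unsupervised pretraining using single-layer autoencoders in PyTorch. The layer sizes, number of epochs, and the 10-way supervised head are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

layer_sizes = [784, 256, 64]          # input dim followed by two hidden layers
encoders = [nn.Linear(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]

def pretrain_layer(encoder, data, epochs=5):
    """Train one layer as an autoencoder; earlier layers stay fixed."""
    decoder = nn.Linear(encoder.out_features, encoder.in_features)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
    for _ in range(epochs):
        recon = decoder(torch.relu(encoder(data)))
        loss = nn.functional.mse_loss(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()
    # The layer's output becomes the next layer's input.
    return torch.relu(encoder(data)).detach()

data = torch.rand(512, 784)           # stand-in for real unlabeled inputs
for enc in encoders:                  # greedy: one layer at a time
    data = pretrain_layer(enc, data)

# Afterwards: stack the pretrained encoders, add a supervised head, and
# fine-tune everything jointly on the labeled data.
model = nn.Sequential(*[m for enc in encoders for m in (enc, nn.ReLU())],
                      nn.Linear(layer_sizes[-1], 10))
```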
When and why does unsupervised pretraining work?
Unsupervised pretraining can yield great improvements in some cases, but it can also change nothing, or even cause harm in other situations.
When used to learn a representation
- good when the initial representation is bad
- Example: Word embeddings
- Bad when the input already has a good representation
- Example: images
When used to regularize:
- Most helpful when the number of labeled examples is small (a handful to a dozen)
Other factors include, for example:
- There is a good chance that unsupervised pretraining is most useful when the function to be learned is very complicated
What is transfer learning?
→ transferring a learned model to a new task (a task with new outputs while the inputs stay similar)
The general feature-learning approach:
- The trained feature extractor stays as it is, but a new classifier is trained
- Example: train a CNN to classify handwritten digits. Afterwards, reuse the feature extractor in another task, adapting the model to classify handwritten letters by training a new 26-way classifier, one class per letter (see the sketch below).
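A minimal sketch of this recipe in PyTorch: the feature extractor is kept frozen and only a new classification head is trained. The small `feature_extractor` here is just a randomly initialized stand-in for whatever CNN was actually pretrained on digits.

```python
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(      # stand-in for the pretrained digit CNN
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

for p in feature_extractor.parameters():
    p.requires_grad = False             # keep the learned features as they are

new_head = nn.Linear(16, 26)            # new classifier for 26 letter classes
model = nn.Sequential(feature_extractor, new_head)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-3)

x = torch.rand(8, 1, 28, 28)            # dummy batch of letter images
logits = model(x)                       # shape: (8, 26)
```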
Visualizing representations with t-SNE
Pretraining goal: provide high-dimensional features that are expected to separate the classes of the inputs well, which makes it easier to train a new classification head.
t-SNE maps a set of high-dimensional data points down to 2D or 3D in one shot, with no labels required. It tries to preserve relative neighbor distances while projecting down.
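A short sketch of visualizing learned features with scikit-learn's t-SNE. The `features` array stands in for the high-dimensional representations produced by a pretrained network; the labels are only used to color the plot, not to compute the embedding.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.rand(500, 64)      # placeholder for real network features
labels = np.random.randint(0, 10, 500)  # placeholder class labels for coloring

# Map 64-dimensional features down to 2D while preserving local neighborhoods.
embedded = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5)
plt.show()
```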
Few-Shot and One-Shot Learning
One-shot learning – each new class has one labeled example. The goal is to make predictions for the new classes based on this single example.
Few-shot learning – there is a limited number of labeled examples for each new class. The goal is to make predictions for new classes based on just a few examples of labeled data.
Zero-shot learning – there is absolutely no labeled data available for new classes. The goal is for the algorithm to make predictions about new classes by using prior knowledge about the relationships between classes it already knows. In the case of large language models (LLMs) like ChatGPT, for example, this prior knowledge likely includes semantic similarities.
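One common way to do one-/few-shot classification on top of learned features is a nearest-prototype scheme (roughly in the spirit of prototypical networks): average the few labeled embeddings per new class and assign queries to the nearest class prototype. The random embeddings below are placeholders for features from a pretrained encoder.

```python
import numpy as np

support = {                              # a few labeled examples per new class
    "cat": np.random.rand(5, 64),
    "dog": np.random.rand(5, 64),
}
# One prototype per class: the mean embedding of its support examples.
prototypes = {cls: emb.mean(axis=0) for cls, emb in support.items()}

def predict(query_embedding):
    # Assign the query to the class whose prototype is closest (Euclidean distance).
    return min(prototypes, key=lambda c: np.linalg.norm(query_embedding - prototypes[c]))

print(predict(np.random.rand(64)))
```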