Quiz 5 Flashcards
Definitions of MDPs (states/actions/environment)
An MDP consists of
States (a complete description of the system),
Actions (the choices available in each state),
a Transition Model (the probability of the next state given the current state and action),
a Reward Function (the reward for taking an action in a state), and
a discount factor. A Policy (a mapping from states to actions) is the solution we seek, not part of the MDP itself.
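In the usual notation, with P the transition model, R the reward function, and gamma the discount factor:

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s,\, A_t = a), \qquad
\pi : \mathcal{S} \to \mathcal{A}
```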
Dynamic programming methods for solving MDPs
Policy evaluation computes the value function of a fixed policy,
policy improvement makes the policy greedy with respect to that value function,
policy iteration alternates evaluation and improvement until the policy stabilizes, and
value iteration applies the Bellman optimality backup directly to the values.
Policy iteration and value iteration both converge to an optimal policy (value iteration is sketched below).
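A minimal value-iteration sketch. The MDP layout here is an assumption for illustration: states are integers 0..num_states-1 and P[s][a] is a list of (prob, next_state, reward) tuples.

```python
import numpy as np

def value_iteration(P, num_states, gamma=0.99, tol=1e-6):
    """Repeat the Bellman optimality backup until the values stop changing,
    then read off the greedy policy."""
    V = np.zeros(num_states)
    while True:
        # V_new[s] = max_a sum_{s'} P(s'|s,a) * (r + gamma * V[s'])
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            for s in range(num_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # Greedy policy extraction from the converged value function
    policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in range(num_states)
    }
    return V, policy
```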
Exploration vs. exploitation
Exploration is trying new actions to learn their rewards, while
exploitation uses known good actions.
Balancing the two is crucial in RL; an epsilon-greedy rule (sketched below) is one common way to do it.
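A minimal epsilon-greedy sketch, assuming q_values holds the current value estimate for each action:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```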
Challenges of RL
RL challenges include:
high variance,
sample inefficiency,
stability issues,
delayed rewards,
exploration-exploitation trade-offs, and
partial observability.
DQN/REINFORCE
DQN uses a neural network to approximate Q-values, relying on experience replay and a periodically updated target network for stability.
REINFORCE uses Monte Carlo policy gradients, directly optimizing the expected return from sampled trajectories (loss sketches below).
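A sketch of the two loss computations in PyTorch. The batch layout (states, actions, rewards, next_states, dones as tensors) and the names q_net / target_net are assumptions for illustration, not a fixed API.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss on a replay batch; target_net is a frozen, periodically
    refreshed copy of q_net (the target-network trick)."""
    states, actions, rewards, next_states, dones = batch  # actions: int64, dones: float
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next
    return F.mse_loss(q_pred, q_target)

def reinforce_loss(log_probs, returns):
    """Monte Carlo policy gradient: maximize sum_t log pi(a_t|s_t) * G_t,
    i.e. minimize its negative. log_probs is a list of per-step log-prob tensors."""
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return -(torch.stack(log_probs) * returns).sum()
```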
Policy gradients derivation
Policy gradients use the log-derivative trick to express the gradient of the expected return as an expectation, enabling estimation from sampled trajectories.
The final form is a sum over time of the gradient of the log action probability, weighted by the return (or the reward-to-go from that step).
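Spelled out: the log-derivative trick (grad p = p * grad log p) turns the gradient of an expectation into an expectation of a gradient, and the transition dynamics in log p_theta(tau) do not depend on theta, so only the policy terms survive:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta\, \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \right]
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right]
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
```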
Difference between types of learning (semi-supervised, few-shot, self-supervised) and what type of data they assume
Semi-supervised learning assumes a small labeled set plus a large unlabeled set, few-shot learning assumes only a handful of labeled examples per class, and self-supervised learning assumes only unlabeled data, generating pseudo-labels from the data itself.
Types of self-supervised tasks and inputs/outputs/losses
Self-supervised tasks include contrastive learning and pretext tasks such as rotation prediction and patch ordering.
Inputs are raw unlabeled data, the targets are pseudo-labels generated from the data itself, and losses include contrastive (e.g., InfoNCE, sketched below), cross-entropy, or triplet loss.
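A minimal contrastive (InfoNCE-style) loss sketch, assuming z1 and z2 are batches of embeddings of two augmented views of the same images, matched row by row:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Each row of z1 should match the same row of z2 (its positive);
    all other rows in the batch serve as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                      # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```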
GANs and VAEs: Process of training, objectives/losses, and how they work
GANs train a generator and a discriminator in a minimax game, producing realistic samples without an explicit density model.
VAEs learn a latent space with an encoder-decoder pair, optimizing the ELBO to balance reconstruction accuracy against regularization of the latent distribution toward the prior (both objectives written out below).
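The two objectives in standard form:

```latex
% GAN minimax objective
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\!\left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p(z)}\!\left[ \log\bigl(1 - D(G(z))\bigr) \right]

% VAE evidence lower bound (ELBO), maximized over encoder q_phi and decoder p_theta
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\big\|\, p(z) \right)
```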