Quiz 5 Flashcards
Definitions of MDPs (states/actions/environment)
An MDP consists of
States (a complete description of the system),
Actions (the choices available in each state),
a Transition Model (the probability of the next state given the current state and action),
a Reward Function (the reward for taking an action in a state), and
a discount factor. A Policy (a mapping from states to actions) is the solution we seek, not part of the MDP itself.
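In the usual notation, with P the transition model, R the reward function, and gamma the discount factor:

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s,\, A_t = a), \qquad
\pi : \mathcal{S} \to \mathcal{A}
```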
Dynamic programming methods for solving MDPs
Policy evaluation computes the value function of a fixed policy,
policy improvement makes the policy greedy with respect to that value function,
policy iteration alternates evaluation and improvement until the policy stabilizes, and
value iteration applies the Bellman optimality backup directly to the values.
Policy iteration and value iteration both converge to an optimal policy (value iteration is sketched below).
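A minimal value-iteration sketch. The MDP layout here is an assumption for illustration: states are integers 0..num_states-1 and P[s][a] is a list of (prob, next_state, reward) tuples.

```python
import numpy as np

def value_iteration(P, num_states, gamma=0.99, tol=1e-6):
    """Repeat the Bellman optimality backup until the values stop changing,
    then read off the greedy policy."""
    V = np.zeros(num_states)
    while True:
        # V_new[s] = max_a sum_{s'} P(s'|s,a) * (r + gamma * V[s'])
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            for s in range(num_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # Greedy policy extraction from the converged value function
    policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in range(num_states)
    }
    return V, policy
```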
Exploration vs. exploitation
Exploration is trying new actions to learn their rewards, while
exploitation uses known good actions.
Balancing the two is crucial in RL; an epsilon-greedy rule (sketched below) is one common way to do it.
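A minimal epsilon-greedy sketch, assuming q_values holds the current value estimate for each action:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```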
Challenges of RL
RL challenges include:
high variance,
sample inefficiency,
stability issues,
delayed rewards,
exploration-exploitation trade-offs, and
partial observability.
DQN/REINFORCE
DQN uses a neural network to approximate Q-values, relying on experience replay and a periodically updated target network for stability.
REINFORCE uses Monte Carlo policy gradients, directly optimizing the expected return from sampled trajectories (loss sketches below).
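A sketch of the two loss computations in PyTorch. The batch layout (states, actions, rewards, next_states, dones as tensors) and the names q_net / target_net are assumptions for illustration, not a fixed API.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss on a replay batch; target_net is a frozen, periodically
    refreshed copy of q_net (the target-network trick)."""
    states, actions, rewards, next_states, dones = batch  # actions: int64, dones: float
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next
    return F.mse_loss(q_pred, q_target)

def reinforce_loss(log_probs, returns):
    """Monte Carlo policy gradient: maximize sum_t log pi(a_t|s_t) * G_t,
    i.e. minimize its negative. log_probs is a list of per-step log-prob tensors."""
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return -(torch.stack(log_probs) * returns).sum()
```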
Policy gradients derivation
Policy gradients use the log-derivative trick to express the gradient of the expected return as an expectation, enabling estimation from sampled trajectories.
The final form is a sum over time of the gradient of the log action probability, weighted by the return (or the reward-to-go from that step).
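Spelled out: the log-derivative trick (grad p = p * grad log p) turns the gradient of an expectation into an expectation of a gradient, and the transition dynamics in log p_theta(tau) do not depend on theta, so only the policy terms survive:

```latex
\nabla_\theta J(\theta)
  = \nabla_\theta\, \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \right]
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right]
  = \mathbb{E}_{\tau \sim p_\theta}\!\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
```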
Difference between types of learning (semi-supervised, few-shot, self-supervised) and what type of data they assume
Semi-supervised learning assumes a small labeled set plus a large unlabeled set, few-shot learning assumes only a handful of labeled examples per class, and self-supervised learning assumes only unlabeled data, generating pseudo-labels from the data itself.
Types of self-supervised tasks and inputs/outputs/losses
Self-supervised tasks include contrastive learning and pretext tasks such as rotation prediction and patch ordering.
Inputs are raw unlabeled data, the targets are pseudo-labels generated from the data itself, and losses include contrastive (e.g., InfoNCE, sketched below), cross-entropy, or triplet loss.
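A minimal contrastive (InfoNCE-style) loss sketch, assuming z1 and z2 are batches of embeddings of two augmented views of the same images, matched row by row:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Each row of z1 should match the same row of z2 (its positive);
    all other rows in the batch serve as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                      # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```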
GANs and VAEs: Process of training, objectives/losses, and how they work
GANs train a generator and a discriminator in a minimax game, producing realistic samples without an explicit density model.
VAEs learn a latent space with an encoder-decoder pair, optimizing the ELBO to balance reconstruction accuracy against regularization of the latent distribution toward the prior (both objectives written out below).
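The two objectives in standard form:

```latex
% GAN minimax objective
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\!\left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p(z)}\!\left[ \log\bigl(1 - D(G(z))\bigr) \right]

% VAE evidence lower bound (ELBO), maximized over encoder q_phi and decoder p_theta
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\big\|\, p(z) \right)
```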