Quiz 5 Flashcards

1
Q

Definitions of MDPs (states/actions/environment)

A

An MDP consists of:

States (a complete description of the system),

Actions (the choices available in each state),

a Transition Model (the probability of the next state given the current state and action),

a Reward Function (the immediate reward for taking an action in a state), usually together with a discount factor, and a

Policy (a mapping from states to actions; finding a good policy is the goal of solving the MDP).
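
As a concrete illustration, here is a minimal Python sketch of those ingredients for a made-up two-state maintenance problem; the state names, probabilities, and rewards are hypothetical placeholders, not part of the card.

    # Hypothetical two-state MDP, written out explicitly (all numbers are illustrative).
    states = ["healthy", "broken"]
    actions = ["use", "repair"]

    # Transition model: P[s][a][s_next] = probability of s_next given state s and action a.
    P = {
        "healthy": {"use":    {"healthy": 0.9, "broken": 0.1},
                    "repair": {"healthy": 1.0, "broken": 0.0}},
        "broken":  {"use":    {"healthy": 0.0, "broken": 1.0},
                    "repair": {"healthy": 0.8, "broken": 0.2}},
    }

    # Reward function: R[s][a] = immediate reward for taking action a in state s.
    R = {
        "healthy": {"use": 1.0,  "repair": -0.5},
        "broken":  {"use": -1.0, "repair": -0.5},
    }

    gamma = 0.9  # discount factor

    # A policy maps each state to an action; finding a good one is the goal.
    policy = {"healthy": "use", "broken": "repair"}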

2
Q

Dynamic programming methods for solving MDPs

A

Policy evaluation computes the value function of a fixed policy,

policy improvement makes the policy greedy with respect to that value function,

policy iteration alternates evaluation and improvement until the policy stops changing, and

value iteration updates the value function directly with the Bellman optimality backup, skipping full evaluation.

All aim to find an optimal policy.
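
A short sketch of value iteration in Python, reusing the hypothetical states/actions/P/R/gamma dictionaries from card 1; policy iteration would instead alternate a full evaluation sweep for the current policy with a greedy improvement step.

    # Value iteration: repeatedly apply the Bellman optimality backup until values converge.
    def value_iteration(states, actions, P, R, gamma, tol=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                q = [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
                     for a in actions]
                best = max(q)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        # One final policy-improvement step: act greedily with respect to V.
        policy = {s: max(actions, key=lambda a: R[s][a] +
                         gamma * sum(P[s][a][s2] * V[s2] for s2 in states))
                  for s in states}
        return V, policy

Calling V, pi = value_iteration(states, actions, P, R, gamma) returns the converged values and the corresponding greedy policy.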

3
Q

Exploration vs. exploitation

A

Exploration means trying new actions to learn about their rewards, while

exploitation chooses the actions currently known to give good rewards.

Balancing the two is crucial in RL.
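
One standard way to balance the two is epsilon-greedy action selection; a minimal sketch, assuming a dictionary Q of estimated action values (the action names and numbers are illustrative).

    import random

    def epsilon_greedy(Q, epsilon=0.1):
        """With probability epsilon pick a random action (explore);
        otherwise pick the action with the highest estimated value (exploit)."""
        if random.random() < epsilon:
            return random.choice(list(Q))     # explore
        return max(Q, key=Q.get)              # exploit

    # Example: current value estimates for three actions (made-up numbers).
    Q = {"left": 0.2, "right": 0.5, "stay": 0.1}
    action = epsilon_greedy(Q, epsilon=0.1)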

4
Q

Challenges of RL

A

RL challenges include:

high variance,
sample inefficiency,
stability issues,
delayed rewards,
exploration-exploitation trade-offs, and
partial observability.

5
Q

DQN/REINFORCE

A

DQN approximates Q-values with a neural network, using experience replay and a target network for stability.

REINFORCE is a Monte Carlo policy-gradient method that directly optimizes the expected return.
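
A minimal PyTorch sketch of the REINFORCE update for one sampled episode; the function name and interface are illustrative, and it assumes the caller has already collected the per-step log-probabilities and rewards during the rollout.

    import torch

    def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
        """One REINFORCE step from a single episode.
        log_probs: list of log pi(a_t | s_t) tensors saved during the rollout.
        rewards:   list of scalar rewards r_t from the same rollout."""
        # Monte Carlo returns: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        # The policy-gradient objective is sum_t log pi(a_t|s_t) * G_t (to maximize),
        # so we minimize its negative.
        loss = -(torch.stack(log_probs) * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

DQN training differs: minibatches sampled from the replay buffer are regressed toward r + gamma * max_a' Q_target(s', a'), with the target network's weights copied from the online network only periodically.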

6
Q

Policy gradients derivation

A

Policy gradients use the log-derivative trick to rewrite the gradient of the expected return as an expectation, so it can be estimated from sampled trajectories.

The final form is an expectation of a sum over time of the gradient of the log action probability, weighted by the return.
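
Written out, the standard derivation (tau is a trajectory sampled from the policy pi_theta, and R(tau) its total reward):

    \nabla_\theta J(\theta)
      = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
      = \nabla_\theta \int \pi_\theta(\tau)\, R(\tau)\, d\tau
      = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\, d\tau
      = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right]
      = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]

The third line is the log-derivative trick, \nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau); the last line uses the fact that \log \pi_\theta(\tau) decomposes into a sum of \log \pi_\theta(a_t \mid s_t) plus dynamics terms that do not depend on \theta. The expectation is then estimated by averaging over sampled trajectories, which is exactly what REINFORCE does.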

7
Q

Difference between types of learning (semi-supervised, few-shot, self-supervised) and what type of data they assume

A

Semi-supervised learning assumes a small labeled set plus a large unlabeled set, few-shot learning assumes only a handful of labeled examples per class, and self-supervised learning assumes only unlabeled data, from which it generates its own pseudo-labels.

8
Q

Types of self-supervised tasks and inputs/outputs/losses

A

Self-supervised tasks include contrastive learning and pretext tasks such as rotation prediction and patch ordering.

Inputs are raw, unlabeled data; outputs are automatically generated pseudo-labels; and losses include contrastive, cross-entropy, and triplet losses.
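
A minimal NumPy sketch of a contrastive (InfoNCE-style) loss, one of the losses named above; the helper name, array shapes, and temperature are illustrative assumptions. Each row of z_a is pulled toward the matching row of z_b (its positive pair) and pushed away from all other rows (negatives).

    import numpy as np

    def info_nce_loss(z_a, z_b, temperature=0.5):
        """z_a[i] and z_b[i] are embeddings of two augmented views of example i."""
        # L2-normalize so the dot product is cosine similarity.
        z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
        z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
        logits = z_a @ z_b.T / temperature                      # (N, N) similarity matrix
        # Cross-entropy where the "label" for row i is column i (its positive pair).
        logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Example with random embeddings (illustrative only).
    rng = np.random.default_rng(0)
    loss = info_nce_loss(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))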

9
Q

GANs and VAEs: Process of training, objectives/losses, and how they work

A

GANs train a generator and a discriminator in a minimax game to produce realistic samples without modeling an explicit density.

VAEs learn a latent space with encoder-decoder networks, optimizing the ELBO (evidence lower bound) to balance reconstruction quality against regularization of the latent distribution.
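
Minimal PyTorch sketches of the two objectives, assuming the networks (encoder/decoder for the VAE, generator/discriminator for the GAN) already exist; the Bernoulli reconstruction term and the non-saturating generator loss are common but not the only choices.

    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_recon, mu, logvar):
        """Negative ELBO for a VAE with a diagonal-Gaussian encoder.
        x_recon is decoded from z = mu + exp(0.5 * logvar) * eps (reparameterization trick)."""
        recon = F.binary_cross_entropy(x_recon, x, reduction="sum")    # reconstruction term
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q(z|x) || N(0, I))
        return recon + kl

    def gan_losses(d_real, d_fake):
        """Standard GAN losses given discriminator probabilities d_real = D(x), d_fake = D(G(z))."""
        d_loss = -(torch.log(d_real) + torch.log(1 - d_fake)).mean()   # discriminator step
        g_loss = -torch.log(d_fake).mean()                             # non-saturating generator step
        return d_loss, g_loss

In training, the two GAN losses are minimized alternately (a discriminator step, then a generator step), while the VAE loss is minimized end-to-end by backpropagating through the reparameterized sample.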
