week 8 - reinforcement learning Flashcards

1
Q

what is the difference between supervised learning, unsupervised learning and reinforcement learning?

A

supervised = learns a mapping between data and labels

unsupervised = discovers patterns in the data

reinforcement = learns how to act from a reward signal, through trial-and-error interaction with an environment, rather than from explicit labels

2
Q

what is a pro of reinforcement learning?

A

it can succeed at solving very complex problems that other ML models cannot

e.g. it can be trained to play games like Go

it can also be used to explain human learning: a lot of evidence suggests that components of reinforcement learning algorithms are represented in the brain

3
Q

how do we represent a reinforcement learning problem?

A

As a Markov decision process (MDP)

An MDP has 4 components:
State
Action
Transition probabilities (the probability of transitioning to another state given a particular action)
Reward (the reward received for taking a particular action in a state)

the goal of the MDP is to find the best way to act in each state (the optimal policy)
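
a minimal sketch of how an MDP can be written down as plain data (the two-state example and all numbers are made up for illustration):

# minimal MDP sketch: a made-up two-state example, all names and numbers illustrative
states = ["hungry", "full"]
actions = ["eat", "wait"]

# transition probabilities: P[state][action] = {next_state: probability}
P = {
    "hungry": {"eat": {"full": 0.9, "hungry": 0.1}, "wait": {"hungry": 1.0}},
    "full":   {"eat": {"full": 1.0},                "wait": {"hungry": 0.3, "full": 0.7}},
}

# rewards: R[state][action] = immediate reward
R = {
    "hungry": {"eat": 1.0,  "wait": -1.0},
    "full":   {"eat": -0.5, "wait": 0.0},
}

# a policy maps each state to an action; solving the MDP means finding the optimal one
policy = {"hungry": "eat", "full": "wait"}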

4
Q

what can be represented as a markov decision process?

A

basically anything that involves making a sequence of decisions

e.g. it can be used in motion control problems, in social interactions, and in games

5
Q

what is model free reinforcement learning?

A

the process of learning which actions produce the most reward, through trial and error, without learning a model of the environment's dynamics

this can be achieved with temporal difference (TD) learning
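
a minimal sketch of tabular TD(0) value learning in Python (the env_step interface and the parameter values are assumptions, purely illustrative):

# tabular TD(0) sketch: learn state values V(s) directly from sampled transitions
# env_step(s) is assumed to return (next_state, reward, done) under some fixed policy
def td0(env_step, states, start_state, episodes=100, alpha=0.1, gamma=0.9):
    V = {s: 0.0 for s in states}              # value estimates, initialised to zero
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            s_next, r, done = env_step(s)     # sample one transition from the environment
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # move V(s) toward the TD target
            s = s_next
    return V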

6
Q

what is the sarsa algorithm?

A

SARSA: State-Action-Reward-State-Action, an on-policy reinforcement learning algorithm.
Model-free: Does not require knowledge of the environment’s dynamics.
Action-value function Q(s, a): Estimates the expected cumulative reward of taking action a in state s and following the policy.
Update Rule: Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]
On-policy: Updates Q(s_t, a_t) based on the action actually taken, considering the next action a_{t+1}.
Exploration-exploitation: Uses ε-greedy policy to balance exploration and exploitation.
Goal: Learn an optimal policy by iteratively updating Q-values through interaction with the environment.
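
a minimal sketch of tabular SARSA with an ε-greedy policy in Python (the env_reset/env_step interface and the hyperparameter values are illustrative assumptions):

import random

def epsilon_greedy(Q, s, actions, eps):
    # explore with probability eps, otherwise exploit the current Q estimates
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env_reset, env_step, states, actions,
          episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    # tabular SARSA: on-policy TD control over action values Q(s, a)
    # env_reset() is assumed to return a start state;
    # env_step(s, a) is assumed to return (next_state, reward, done)
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = env_reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env_step(s, a)
            a_next = epsilon_greedy(Q, s_next, actions, eps)
            # on-policy update: bootstrap from the action actually chosen next
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q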

7
Q

SARSA

A

go over the SARSA update again step by step to understand how it works better
