Reinforcement Learning Flashcards
Value Function
Value function: expected return (discounted sum of rewards) starting from state ‘s’
v(s) = E[R_t + g R_{t+1} + g^2 R_{t+2} + … | S_t = s]
v(s) = E[R_t + g v(S_{t+1}) | S_t = s] (Bellman equation)
     = R_s + g \sum_{s'} P_{ss'} v(s')
State-value function: expected return starting from state ‘s’ if I follow policy pi, where
v_pi(s) = E[R_t + g R_{t+1} + g^2 R_{t+2} + … | S_t = s, pi]
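A minimal sketch of this definition, assuming a hypothetical env.step(state, action) -> (next_state, reward) interface and a policy(state) -> action callable: estimate v_pi(s) by averaging sampled discounted returns.
```python
def estimate_value(env, policy, start_state, gamma=0.99, n_episodes=1000, horizon=100):
    """Monte Carlo estimate of v_pi(s): average of sampled discounted returns.

    env.step(state, action) -> (next_state, reward) and policy(state) -> action
    are assumed interfaces for illustration, not a real library API.
    """
    total = 0.0
    for _ in range(n_episodes):
        state, ret, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward = env.step(state, action)
            ret += discount * reward      # accumulate g^t * R_t
            discount *= gamma
        total += ret
    return total / n_episodes             # ~ E[R_t + g R_{t+1} + ... | s_t = s, pi]
```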
Q-function (action-value function)
Q_pi(s, a) = E[R_0 + g R_1 + g^2 R_2 + … | s_0 = s, a_0 = a, pi], i.e., expected return starting from state ‘s’ if I take initial action ‘a’ and then follow policy ‘pi’
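A hedged tabular sketch of how Q_pi relates to v_pi via one Bellman step; R, P, and v are hypothetical NumPy arrays, not from the source.
```python
import numpy as np

def q_from_v(R, P, v, gamma=0.99):
    """Q_pi(s, a) = R(s, a) + g * sum_s' P(s' | s, a) * v_pi(s').

    R: (S, A) reward array, P: (S, A, S) transition probabilities,
    v: (S,) state values under pi -- hypothetical inputs for illustration.
    """
    return R + gamma * P @ v   # result has shape (S, A)
```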
Bellman’s recursion
Value of a state = reward at that state + discounted expected value of the successor states
v_pi(s) = R(s) + g \sum_{s'} P_pi(s' | s) v_pi(s')
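A minimal sketch of iterative policy evaluation, i.e., applying this recursion until it stops changing; R and P_pi are hypothetical tabular arrays.
```python
import numpy as np

def policy_evaluation(R, P_pi, gamma=0.99, tol=1e-8):
    """Iterate v <- R + g * P_pi v until convergence.

    R: (S,) expected reward per state under pi, P_pi: (S, S) state-to-state
    transition matrix induced by pi -- hypothetical tabular inputs.
    """
    v = np.zeros_like(R, dtype=float)
    while True:
        v_new = R + gamma * P_pi @ v          # Bellman backup for a fixed policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```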
Value Iteration
Banach Fixed Point Theorem: the Bellman optimality backup is a g-contraction, so repeatedly applying it converges to the unique fixed point v* (sketch below)
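A minimal value-iteration sketch under that fixed-point argument; R (S x A) and P (S x A x S) are hypothetical tabular arrays.
```python
import numpy as np

def value_iteration(R, P, gamma=0.99, tol=1e-8):
    """Repeat v <- max_a [ R(s, a) + g * sum_s' P(s' | s, a) v(s') ].

    The backup is a g-contraction (Banach fixed point theorem), so the
    iterates converge to v*. R: (S, A), P: (S, A, S) -- hypothetical inputs.
    """
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * P @ v                 # shape (S, A)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```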
Bellman’s Optimality for the optimal value function (v*) and policy (pi*)
Optimal value function: v*(s) = max_a Q*(s, a), i.e., obtained by taking the best action as judged by Q*
Optimal policy: pi*(s) = argmax_a Q*(s, a), i.e., take the action that maximizes Q*
But how is Q* provided? e.g., computed from v* after value iteration, or learned directly with Q-learning (sketch below)
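One hedged answer, assuming v* comes from value iteration as in the sketch above: Q* follows from one Bellman optimality step, and pi* is greedy in Q*. The arrays are hypothetical tabular inputs.
```python
import numpy as np

def greedy_policy_from_v_star(R, P, v_star, gamma=0.99):
    """Q*(s, a) = R(s, a) + g * sum_s' P(s' | s, a) v*(s'); pi*(s) = argmax_a Q*(s, a).

    R: (S, A), P: (S, A, S), v_star: (S,) e.g. from value_iteration above --
    hypothetical inputs. Q* can also be learned directly, e.g. with Q-learning.
    """
    q_star = R + gamma * P @ v_star           # shape (S, A)
    return q_star.argmax(axis=1)              # pi*(s) for each state s
```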
Policy
Function from state to action. Defines the agent’s behavior
Goal of RL
Find a policy that maximizes expected return (cumulative discounted reward)
Monte Carlo Tree Search
Tree = (node=state), (edge=action leading to another state)
Search Algo:
Act according to the current tree policy until you reach a state (s) with untried actions (i.e., the node is not fully expanded)
Expand an untried action and simulate (roll out) from the resulting state, e.g., with a random policy
Estimate Q(s, a) from the rollout return and back it up to the ancestor nodes of the tree (see the sketch below)
SELECT, EXPAND, SIMULATE, BACKUP, repeat
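A compact UCT-style MCTS sketch; env.actions(s), env.step(s, a) -> (s', r), and env.is_terminal(s) are an assumed deterministic-environment interface, not a real library API. The four phases above map onto the commented blocks.
```python
import math
import random

class Node:
    """One tree node per state; edges are (action -> child Node)."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}        # action -> Node
        self.visits = 0
        self.value_sum = 0.0      # sum of backed-up returns

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def ucb(child, parent_visits, c=1.4):
    """UCT score: exploit the average value, explore rarely-tried actions."""
    if child.visits == 0:
        return float("inf")
    return child.value() + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(env, root_state, n_iter=1000, gamma=1.0, rollout_depth=50):
    """SELECT, EXPAND, SIMULATE, BACKUP loop; returns the most-visited root action."""
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # SELECT: descend while the node is fully expanded
        while node.children and len(node.children) == len(env.actions(node.state)):
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.visits))
        # EXPAND: try one untried action, if any
        untried = [a for a in env.actions(node.state) if a not in node.children]
        if untried and not env.is_terminal(node.state):
            a = random.choice(untried)
            next_state, _ = env.step(node.state, a)
            child = Node(next_state, parent=node)
            node.children[a] = child
            node = child
        # SIMULATE: random rollout to estimate the value of the new state
        state, ret, discount = node.state, 0.0, 1.0
        for _ in range(rollout_depth):
            if env.is_terminal(state):
                break
            a = random.choice(env.actions(state))
            state, r = env.step(state, a)
            ret += discount * r
            discount *= gamma
        # BACKUP: propagate the return to the ancestor nodes
        while node is not None:
            node.visits += 1
            node.value_sum += ret
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)
```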
Nucleus Sampling (Top-p Sampling)
Sample from the smallest set of options whose probability masses sum to at least p
e.g., word1 = 0.6, word2=0.3, word3=0.05,….
if p = 0.9, sample from words #1 and #2
e.g., for generation
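A minimal top-p sampling sketch over a hypothetical 1-D array of next-token probabilities (NumPy assumed):
```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Nucleus sampling: keep the smallest set of tokens whose probabilities
    sum to at least p, renormalize, and sample from that set.

    probs is a hypothetical 1-D array of next-token probabilities.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                    # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    weights = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=weights))

# Usage, mirroring the card's example: top_p_sample(np.array([0.6, 0.3, 0.05, 0.05]), p=0.9)
```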
Beam Search
Given a beam width (w), keep the top-w candidates (initially single tokens); predict the next token for each candidate conditioned on its prior tokens (now each candidate has two tokens), again keep the top-w sequences by joint conditional probability, and repeat this process until the sequences are n tokens long
Say, y_2 is the next token to be predicted given y_1
Feed in y_1 to the network to obtain p(y_2 | x, y_1) and use p(y_1 | x) to get joint conditional p(y_1, y_2 | x)
p(y_1, y_2| x) = p(y_1 | x) p (y_2 | x, y_1)
param = beam width
e.g., for translation or transcription
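A hedged beam-search sketch; next_token_logprobs(prefix) is a hypothetical model interface returning log p(token | x, prefix) with the input x baked in. Summing log-probs implements the joint conditional p(y_1, y_2 | x) = p(y_1 | x) p(y_2 | x, y_1).
```python
def beam_search(next_token_logprobs, beam_width=3, max_len=20, eos=None):
    """Keep the top-w partial sequences by total log-probability at each step."""
    beams = [([], 0.0)]                              # (token sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if eos is not None and seq and seq[-1] == eos:
                candidates.append((seq, score))      # finished beam: carry it over
                continue
            for token, logp in next_token_logprobs(seq).items():
                candidates.append((seq + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]                                  # best (sequence, log-probability)
```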
Discount factor
Return (G_t) = R_t + g R_{t+1} + g^2 R_{t+2} …
g near 0 = short-sighted
g near 1 = long-sighted, i.e., higher weight to future rewards
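A tiny numeric illustration of the return formula with a hypothetical constant reward stream:
```python
def discounted_return(rewards, gamma):
    """G_t = R_t + g R_{t+1} + g^2 R_{t+2} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0] * 10                        # hypothetical reward of 1 at every step
print(discounted_return(rewards, 0.1))      # ~1.11: short-sighted, mostly the first reward
print(discounted_return(rewards, 0.99))     # ~9.56: long-sighted, future rewards still count
```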
Value function vs State-value function vs action-value function