Reinforcement Learning Flashcards
Value Function
Value function: expected return (discounted sum of rewards) starting from state ‘s’
v(s) = E[R_t + g R_{t+1} + g^2 R_{t+2} + … | S_t = s]
v(s) = E[R_t + g v(S_{t+1}) | S_t = s] (Bellman equation)
     = R_s + g \sum_{s'} P_{ss'} v(s')
State-value function: expected return starting from state ‘s’ if I follow policy pi, where
v_pi(s) = E[R_t + g R_{t+1} + g^2 R_{t+2} + … | S_t = s, pi]
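A minimal sketch of this definition, assuming a hypothetical env.step(state, action) -> (next_state, reward) interface and a policy(state) -> action callable: estimate v_pi(s) by averaging sampled discounted returns.
```python
def estimate_value(env, policy, start_state, gamma=0.99, n_episodes=1000, horizon=100):
    """Monte Carlo estimate of v_pi(s): average of sampled discounted returns.

    env.step(state, action) -> (next_state, reward) and policy(state) -> action
    are assumed interfaces for illustration, not a real library API.
    """
    total = 0.0
    for _ in range(n_episodes):
        state, ret, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward = env.step(state, action)
            ret += discount * reward      # accumulate g^t * R_t
            discount *= gamma
        total += ret
    return total / n_episodes             # ~ E[R_t + g R_{t+1} + ... | s_t = s, pi]
```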
Q-function (action-value function)
Q_pi(s, a) = E[R_0 + g R_1 + g^2 R_2 + … | s_0 = s, a_0 = a, pi], i.e., expected return starting from state ‘s’ if I take initial action ‘a’ and then follow policy ‘pi’
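A hedged tabular sketch of how Q_pi relates to v_pi via one Bellman step; R, P, and v are hypothetical NumPy arrays, not from the source.
```python
import numpy as np

def q_from_v(R, P, v, gamma=0.99):
    """Q_pi(s, a) = R(s, a) + g * sum_s' P(s' | s, a) * v_pi(s').

    R: (S, A) reward array, P: (S, A, S) transition probabilities,
    v: (S,) state values under pi -- hypothetical inputs for illustration.
    """
    return R + gamma * P @ v   # result has shape (S, A)
```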
Bellman’s recursion
Value of a state = reward at that state + discounted expected value of the successor states
v_pi(s) = R(s) + g \sum_{s'} P_pi(s' | s) v_pi(s')
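A minimal sketch of iterative policy evaluation, i.e., applying this recursion until it stops changing; R and P_pi are hypothetical tabular arrays.
```python
import numpy as np

def policy_evaluation(R, P_pi, gamma=0.99, tol=1e-8):
    """Iterate v <- R + g * P_pi v until convergence.

    R: (S,) expected reward per state under pi, P_pi: (S, S) state-to-state
    transition matrix induced by pi -- hypothetical tabular inputs.
    """
    v = np.zeros_like(R, dtype=float)
    while True:
        v_new = R + gamma * P_pi @ v          # Bellman backup for a fixed policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```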
Value Iteration
Banach Fixed Point Theorem: the Bellman optimality backup is a g-contraction, so repeatedly applying it converges to the unique fixed point v* (sketch below)
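A minimal value-iteration sketch under that fixed-point argument; R (S x A) and P (S x A x S) are hypothetical tabular arrays.
```python
import numpy as np

def value_iteration(R, P, gamma=0.99, tol=1e-8):
    """Repeat v <- max_a [ R(s, a) + g * sum_s' P(s' | s, a) v(s') ].

    The backup is a g-contraction (Banach fixed point theorem), so the
    iterates converge to v*. R: (S, A), P: (S, A, S) -- hypothetical inputs.
    """
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * P @ v                 # shape (S, A)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```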
Bellman’s Optimality for the optimal value function (v*) and policy (pi*)
Optimal value function: v*(s) = max_a Q*(s, a), i.e., obtained by taking the best action as judged by Q*
Optimal policy: pi*(s) = argmax_a Q*(s, a), i.e., take the action that maximizes Q*
But how is Q* provided? e.g., computed from v* after value iteration, or learned directly with Q-learning (sketch below)
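One hedged answer, assuming v* comes from value iteration as in the sketch above: Q* follows from one Bellman optimality step, and pi* is greedy in Q*. The arrays are hypothetical tabular inputs.
```python
import numpy as np

def greedy_policy_from_v_star(R, P, v_star, gamma=0.99):
    """Q*(s, a) = R(s, a) + g * sum_s' P(s' | s, a) v*(s'); pi*(s) = argmax_a Q*(s, a).

    R: (S, A), P: (S, A, S), v_star: (S,) e.g. from value_iteration above --
    hypothetical inputs. Q* can also be learned directly, e.g. with Q-learning.
    """
    q_star = R + gamma * P @ v_star           # shape (S, A)
    return q_star.argmax(axis=1)              # pi*(s) for each state s
```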
Policy
Function from state to action. Defines the agent’s behavior
Goal of RL
Find a policy that maximizes expected return (cumulative discounted reward)
Monte Carlo Tree Search
Tree = (node=state), (edge=action leading to another state)
Search Algo:
Act according to the current tree policy until you reach a state (s) with untried actions (i.e., the node is not fully expanded)
Expand an untried action and simulate (roll out) from the resulting state, e.g., with a random policy
Estimate Q(s, a) from the rollout return and back it up to the ancestor nodes of the tree (see the sketch below)
SELECT, EXPAND, SIMULATE, BACKUP, repeat
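A compact UCT-style MCTS sketch; env.actions(s), env.step(s, a) -> (s', r), and env.is_terminal(s) are an assumed deterministic-environment interface, not a real library API. The four phases above map onto the commented blocks.
```python
import math
import random

class Node:
    """One tree node per state; edges are (action -> child Node)."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}        # action -> Node
        self.visits = 0
        self.value_sum = 0.0      # sum of backed-up returns

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def ucb(child, parent_visits, c=1.4):
    """UCT score: exploit the average value, explore rarely-tried actions."""
    if child.visits == 0:
        return float("inf")
    return child.value() + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(env, root_state, n_iter=1000, gamma=1.0, rollout_depth=50):
    """SELECT, EXPAND, SIMULATE, BACKUP loop; returns the most-visited root action."""
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # SELECT: descend while the node is fully expanded
        while node.children and len(node.children) == len(env.actions(node.state)):
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.visits))
        # EXPAND: try one untried action, if any
        untried = [a for a in env.actions(node.state) if a not in node.children]
        if untried and not env.is_terminal(node.state):
            a = random.choice(untried)
            next_state, _ = env.step(node.state, a)
            child = Node(next_state, parent=node)
            node.children[a] = child
            node = child
        # SIMULATE: random rollout to estimate the value of the new state
        state, ret, discount = node.state, 0.0, 1.0
        for _ in range(rollout_depth):
            if env.is_terminal(state):
                break
            a = random.choice(env.actions(state))
            state, r = env.step(state, a)
            ret += discount * r
            discount *= gamma
        # BACKUP: propagate the return to the ancestor nodes
        while node is not None:
            node.visits += 1
            node.value_sum += ret
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)
```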
Nucleus Sampling (Top-p Sampling)
Sample from the smallest set of options whose probability masses sum to at least p
e.g., word1 = 0.6, word2=0.3, word3=0.05,….
if p = 0.9, sample from words #1 and #2
e.g., for generation
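A minimal top-p sampling sketch over a hypothetical 1-D array of next-token probabilities (NumPy assumed):
```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Nucleus sampling: keep the smallest set of tokens whose probabilities
    sum to at least p, renormalize, and sample from that set.

    probs is a hypothetical 1-D array of next-token probabilities.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                    # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    weights = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=weights))

# Usage, mirroring the card's example: top_p_sample(np.array([0.6, 0.3, 0.05, 0.05]), p=0.9)
```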
Beam Search
Given a beam width (w), keep the top-w candidates (initially single tokens); predict the next token for each candidate conditioned on its prior tokens (now each candidate has two tokens), again keep the top-w sequences by joint conditional probability, and repeat this process until the sequences are n tokens long
Say, y_2 is the next token to be predicted given y_1
Feed in y_1 to the network to obtain p(y_2 | x, y_1) and use p(y_1 | x) to get joint conditional p(y_1, y_2 | x)
p(y_1, y_2| x) = p(y_1 | x) p (y_2 | x, y_1)
param = beam width
e.g., for translation or transcription
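A hedged beam-search sketch; next_token_logprobs(prefix) is a hypothetical model interface returning log p(token | x, prefix) with the input x baked in. Summing log-probs implements the joint conditional p(y_1, y_2 | x) = p(y_1 | x) p(y_2 | x, y_1).
```python
def beam_search(next_token_logprobs, beam_width=3, max_len=20, eos=None):
    """Keep the top-w partial sequences by total log-probability at each step."""
    beams = [([], 0.0)]                              # (token sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if eos is not None and seq and seq[-1] == eos:
                candidates.append((seq, score))      # finished beam: carry it over
                continue
            for token, logp in next_token_logprobs(seq).items():
                candidates.append((seq + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]                                  # best (sequence, log-probability)
```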
Discount factor
Return (G_t) = R_t + g R_{t+1} + g^2 R_{t+2} …
g near 0 = short-sighted
g near 1 = long-sighted, i.e., higher weight to future rewards
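A tiny numeric illustration of the return formula with a hypothetical constant reward stream:
```python
def discounted_return(rewards, gamma):
    """G_t = R_t + g R_{t+1} + g^2 R_{t+2} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0] * 10                        # hypothetical reward of 1 at every step
print(discounted_return(rewards, 0.1))      # ~1.11: short-sighted, mostly the first reward
print(discounted_return(rewards, 0.99))     # ~9.56: long-sighted, future rewards still count
```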
Value function vs State-value function vs action-value function