Week 8: Introduction to Reinforcement Learning Flashcards
Reinforcement learning deals with the question of how
to make decisions in sequential problems such as chess or solving a maze
Learning to make decisions in sequential problems, as reinforcement learning aims to do, is difficult because of
the temporal credit assignment problem: you don’t know which of your past actions was pivotal for a good outcome (e.g., winning a game of chess)
General setup of model-free reinforcement learning (5)
An agent in an environment
Receives reward intermittently
Rewards are positive or negative, but otherwise carry no specific instruction (unlike the labels in supervised learning)
But the agent doesn’t know the rules of the environment. It has no ’model’ of the environment or task.
A model of the environment could be, e.g., the rules of chess or the strategy of the opponent
Model-free reinforcement learning is also called
trial and error learning
General setup for model-free RL: the agent in the environment could be a robot, and there are 3 things in this conceptual model of how an artificial agent may learn model-free reinforcement learning
- Policy P
- Reward
- State S
Diagram of model-free RL setup: agent (robot), policy, reward and state
The agent (robot) has a policy P
set of rules to determine which actions to take for different states of myself in the environment … such that I (hopefully) maximize my future reward
Reward R
the environment provides the reward, which affects the state of the agent
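As a rough illustration of this policy/state/reward loop, here is a minimal Python sketch (the `Environment`-style interface, the `run_one_episode` and `random_policy` names, and the action list are placeholder assumptions, not part of the lecture material):

```python
import random

# Hypothetical sketch of the agent-environment loop in model-free RL.
# The agent only sees states and intermittent rewards; it has no model
# of how the environment works internally.

def run_one_episode(env, policy, max_steps=100):
    state = env.reset()                          # environment tells the agent its starting state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # policy P: state -> action
        state, reward, done = env.step(action)   # environment returns new state and reward R
        total_reward += reward
        if done:
            break
    return total_reward

# A purely random policy: the agent starts out knowing nothing.
def random_policy(state, actions=("up", "down", "left", "right", "kick")):
    return random.choice(actions)
```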
For a given policy, a value V can be associated
with being in a given state
At any given moment
the agent is in a given state
Value V of a given state is usually defined as
the expected sum of future rewards, defined for each possible state
We would write V like this:
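In standard notation (a reconstruction of the usual discounted-return definition, assuming a discount factor gamma; the lecture's exact notation may differ):

```latex
V(s) \;=\; \mathbb{E}\!\left[\, r_{t} + \gamma\, r_{t+1} + \gamma^{2} r_{t+2} + \dots \,\middle|\, s_t = s \right]
\;=\; \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t = s \right]
```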
The discount factor (gamma) in the formula for V
The further in the future you expect a reward,
the less you value it
For a simple game like tic tac toe we might be able to write down
all the states
For chess, there is an astronomical number of possible
states, so we can’t write them all out
The goal of all reinforcement learning (esp model-free) is to (2)
optimise the policy to maximise the future reward!
This is a hard problem! (compare chess vs tic tac toe)
Toy example (3)
A robot is supposed to learn to kick a football
When the football is in the line of sight, it can correctly estimate the distance and direction to the ball => 2 numbers (distance and direction to ball = labels for our state)
We call this number pair the ‘state’ the robot is in. It completely characterises its ‘mental repertoire’.
In a real organism, the ‘state’ might comprise all or a subset of the information that is momentarily coded in the brain, such as
The emotional state, hunger, thirst, current thoughts, how tired it is …
A robot has its state and a set of possible actions
actions it can take (move up/move down/move left/move right/kick ball)
When the robot kicks the ball it (2)
gets a reward! (operant conditioning style)
Up until then it does not get anything (temporal credit assignment = which of the actions leading up to kicking the ball were the good ones?)
A reward for the robot is
A number in memory it ‘wants’ to maximise
The learning algorithm says (3)
we want to maximize reward.
Of course the robot has no intrinsic desire here.
Our algorithm guides the robot’s learning
The learning algorithm guides the robot’s learning (4)
In a learning episode the robot evaluates its own state, takes an action and observes the reward (maybe the robot gets a reward, maybe it does not)
If a reward is received, something about the environment has been learned! (these are intermittent signals)
If no reward is received, take another action
Say a learning episode lasts until the ball is kicked, at which point a reward is received.
All the actions taken together until a reward is received are called a
learning episode
The football-kicking robot (5)
- Robot is in a given state
- Robot does not know which actions lead to reward
- Does not know which state leads to which other state
- Performs random actions initially
- Only learns at the end of each learning episode
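To make the toy example concrete, here is a minimal sketch of such a grid-world environment in Python. The grid size, the (row, column) mapping of field labels (C7 ≈ (2, 6)), and the reward of 10 are assumptions based on the example, not an exact copy of the lecture's setup:

```python
# Hypothetical grid-world for the football-kicking robot.
# States are grid cells; the reward is only given when the robot
# kicks while standing on the cell that contains the ball.

ACTIONS = ["up", "down", "left", "right", "kick"]

class BallWorld:
    def __init__(self, n_rows=5, n_cols=8, ball_cell=(2, 6)):
        self.n_rows, self.n_cols = n_rows, n_cols
        self.ball_cell = ball_cell          # roughly "field C7" in the slides (assumed mapping)
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        r, c = self.pos
        if action == "up":
            r = max(r - 1, 0)
        elif action == "down":
            r = min(r + 1, self.n_rows - 1)
        elif action == "left":
            c = max(c - 1, 0)
        elif action == "right":
            c = min(c + 1, self.n_cols - 1)
        elif action == "kick":
            if self.pos == self.ball_cell:
                return self.pos, 10.0, True   # reward only here; the episode ends
            return self.pos, 0.0, False
        self.pos = (r, c)
        return self.pos, 0.0, False           # no reward for mere movement
```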
First Learning Episode of Robot (7)
For all of these actions it receives no reward
Because the ball has not been kicked
Only when it kicks the ball does it receive a reward
At some random point in the future
Robot manages to kick the ball
Finally gets a reward
The robot has now learned:
When you are zero steps from the ball
=> kick!
It does not realise it has been in this state before (orange arrow in the slide diagram)
At the end of the first learning episode, the robot has not learned anything else (4)
E.g., what if it’s in front of the ball and takes a step back?
For all it knows, that could lead to a bigger reward
The robot also doesn’t know that, when it’s
1 step away from the field with the ball, stepping into that field is a good idea
That can only be learned in the next learning episode!
A key concept in RL is the value (V)
Value V can be thought of as: (2)
the value of performing action (A) in a given state (S)
the current prediction of how much reward the agent will eventually obtain if, in state S, it performs action A and then subsequent high-value actions
The goal of RL is to learn values (2)
that are good predictions of upcoming reward
Learning values takes many steps of trial and error
The robot has only learned about the last
state-action pair (kicking the ball)
We couldn’t have assigned value to the bunch of states before that
because of the risk of superstition
Superstition is (2)
Maybe I did some useless motions before (e.g. going back and forth 10 times).
If we assigned value to those actions, we would repeat them in the future, even though they are useless and have no causal impact on the outcome. This is called superstition.
Superstition can be observed in animals
e.g., in a Skinner box
We keep track of the progress of the robot learning to kick the ball in the environment by using the
Q-table
For each action we make a Q-table which (2)
mimics the layout of the environment and assigns a value to taking that action in each state
e.g., KICK
For the KICK table, (4)
When in a certain field of the environment, we assign a value of 10 in the KICK Q-table, since it is the field where the ball is
Later on, when the robot is in this field, we can look up the number in the KICK table
If it is non-zero, it predicts a reward for this action (here KICK)
So the robot will kick the ball and get its reward, rather than taking a random action
The Q-table summarises the
value of an action (here KICK) in a given state (the red square in the slide diagram)
Q-learning is a
model-free, off-policy reinforcement learning algorithm that will find the best course of action, given the current state of the agent.
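One simple way to represent the Q-tables from the example is one array per action, with one entry per grid cell. A minimal sketch (numpy, the 5x8 grid size, and the cell coordinates are my assumptions):

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right", "kick"]
N_ROWS, N_COLS = 5, 8

# One table per action; Q[action][row, col] is the value of taking
# that action in that grid cell. Everything starts at zero.
Q = {a: np.zeros((N_ROWS, N_COLS)) for a in ACTIONS}

# After the first episode: kicking on the ball cell (assumed to be C7 = (2, 6)) is worth 10.
Q["kick"][2, 6] = 10.0

# Looking up the value of an action in a state:
print(Q["kick"][2, 6])   # -> 10.0, predicts reward for KICK here
```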
Second learning episode (4)
- This time we reached the ball from above
- We find ourselves in the field with the ball and know to kick it, because we wrote that down in the KICK Q-table
- At the beginning the robot still takes random actions (e.g., kicking when there is no ball), since the Q-table gives it no guidance until it reaches the field with the ball
- We can now assign a value to the state-action pair that led us into the field with the ball: a value of 8 (smaller than the 10 in the KICK table, since it is one time step away from the reward) for field B7 in the MOVE DOWN Q-table
Q-Learning episode 3 (2)
A partial value is assigned to the immediately preceding state-action pair whenever one steps into a field with a previously learned value
So we assign a smaller value of 6 to turning right
The trial-and-error part of the robot’s behaviour, across Q-learning episodes, gets
shorter and shorter
We have a Q-table
for each possible action
There will be some actions the robot performs only once and never does again, because
through trial and error in later episodes the robot discovers a more direct path to the ball and the Q-table is updated accordingly
We basically assign values to all these states for the different actions (each action has its own Q-table) across Q-learning episodes, and together these values encode
a route to the ball so it can kick it
After the robot has learned the route to the ball and how to kick it, the robot can
just look up the actions in its Q-table
By looking up actions in its Q-tables the robot can perform a
learned action sequence; it does not need to rely on trial and error anymore
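Once the Q-tables are filled in, "looking up the actions" just means picking the action with the highest value in the current state. A minimal sketch reusing the hypothetical `Q` dictionary from above:

```python
# Greedy action selection from the learned Q-tables (pure exploitation).
def best_action(Q, state):
    row, col = state
    return max(Q, key=lambda action: Q[action][row, col])

# e.g. best_action(Q, (2, 6)) -> "kick", once the KICK table holds the value 10 there
```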
We can take all the values we assigned to the state-action pairs (one Q-table per possible action) and, to keep track of the ‘quality’ of state-action pairs, (2)
write each of them as a single column
When in field B7 (state), MOVE DOWN (action) had the highest value
What did we do when we entered values into the Q-table, for example in learning episode 2? (7)
In episode 1 we learned to kick in field C7; that gave us a reward of 10
State S = being in field C7
Action A = KICK
In the field above (B7), the best action was MOVE DOWN (found by trial and error)
We then assigned a fraction of the reward value from C7 (not 10 but 8),
because C7 is a state in which we already know what to do if we find ourselves there.
The value for Q(S,A) = Q(B7,MOVE DOWN) was zero prior to learning
To change the values in the Q-table in episode 2 (updating the value of MOVE DOWN in B7 to 8 in the MOVE DOWN table), we used this change-in-Q formula:
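The formula itself is not reproduced in these notes; the standard Q-learning update that the wording below describes (with a learning rate alpha, which may simply be 1 in the lecture's version) is:

```latex
\Delta Q(s_k, a_k) \;=\; \alpha \Big[\, r_k \;+\; \gamma \max_{a'} Q(s_{k+1}, a') \;-\; Q(s_k, a_k) \,\Big],
\qquad Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \Delta Q(s_k, a_k)
```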
The change-in-Q formula in words (how values in a Q-table column are updated) (4)
We take the maximum Q (belonging to the best state-action pair) in the next state, and multiply it by a number (gamma) between 0 and 1 (discount factor, future reward counts less than instant reward).
The difference between this value and the current Q value (for the state I am in) is added to the reward I already know I receive in my current state (r_k).
The result is added to the table value for my current state-action pair (here B7,DOWN)
If I got more reward than expected I should increase my Q for the current state-action pair!
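A minimal Python sketch of this update, applied to the hypothetical `Q` dictionary sketched above. The learning rate of 1.0, the gamma of 0.8, and the (row, column) mapping B7 ≈ (1, 6), C7 ≈ (2, 6) are assumptions; with these values an empty entry one step before the reward of 10 is pulled towards 8, matching the B7/MOVE DOWN example:

```python
GAMMA = 0.8   # discount factor: future reward counts less than instant reward
ALPHA = 1.0   # learning rate (assumed; smaller values give more gradual updates)

def q_update(Q, state, action, reward, next_state):
    r, c = state
    nr, nc = next_state
    # Best value obtainable from the next state, over all actions.
    best_next = max(Q[a][nr, nc] for a in Q)
    # Prediction error: reward received plus discounted future value,
    # minus what we currently predict for this state-action pair.
    delta = reward + GAMMA * best_next - Q[action][r, c]
    Q[action][r, c] += ALPHA * delta

# Episode 2 in code: stepping down from B7 into the ball cell gives no reward yet,
# but the next state's KICK value is 10, so Q["down"] at B7 moves from 0 to 0.8 * 10 = 8.
q_update(Q, state=(1, 6), action="down", reward=0.0, next_state=(2, 6))
```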
Q is the Quality function where
Q(S, A) = the expected future reward given my state S and action A = current reward + discounted future reward expected for being in the next state
Q tells us the
’joint quality’ of taking an action A, given a state S
Q is slightly different from the value V, as it is
the value of taking a specific action A in the state you are in, assuming you take the best actions from there on out
Q formula simply means
the probability of entering state S’ (given S, A) times the total reward, summed over the possible next states. So: probability of a state transition times reward
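Written out in its standard form (a reconstruction; I'm assuming this matches the slide), with the sum running over possible next states S':

```latex
Q(s, a) \;=\; \sum_{s'} P(s' \mid s, a)\,\Big[\, R(s, a, s') \;+\; \gamma \max_{a'} Q(s', a') \,\Big]
```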
In our toy robot example, we did not know the probabilities of state transitions (2)
= how likely it is to go from one state to the next
If we do know them, we have a model of the environment (i.e., the rules of the game) = model-based reinforcement learning
Formula of Q means (2)
transitioning from state S to S’ is not completely deterministic as there is an element of randomness
In Q-learning we don’t know the probabilities of state transitions
Summary of Q-learning (4)
The agent has learned based on intermittent rewards
Equipped with the Q-table it can now navigate to the ball and kick it.
Note, this may apply to an agent moving in an environment (like a robot or a rat), but the ideas can equally be applied to games (e.g., a state would be a given position in, say, chess or checkers, and the actions would be the permissible moves).
The path is not optimal, but if we introduce some randomness, the robot will discover the optimal path: add some “exploration”, do not always 100% “exploit” the first thing you learned to be positive (i.e., do not rely 100% on your old Q-table)
Exploration vs Exploitation (2)
Do your parents always go to the same place to vacation? vs
Do you always order the same food at the restaurant?
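A common way to mix exploration and exploitation is an epsilon-greedy rule: with a small probability, ignore the Q-table and try something random. A minimal sketch (the `epsilon` value of 0.1 and the `best_action` helper from above are my assumptions):

```python
import random

def epsilon_greedy(Q, state, epsilon=0.1):
    # With probability epsilon: explore (take a random action).
    if random.random() < epsilon:
        return random.choice(list(Q))
    # Otherwise: exploit what the Q-table already predicts to be best.
    return best_action(Q, state)
```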
Benefits of Q-learning (2)
Can learn complex behaviours
No need for an explicit teaching signal as in supervised learning; only intermittent reward is required
Downsides of Q-learning/RL (3)
Takes time, lots of trial and error, especially if the state space (number of possible state-action pairs) is large (e.g., as in chess)
Cannot apply in all types of situations: Can’t fall randomly off a million cliffs to learn optimal behaviour. Sounds funny until you consider self-driving cars!
Training in the real-world (e.g., a robot) would take a long time
Q-learning takes a lot of time if the state space is large, but… (2)
We can approximate the Q-table with deep learning
Deep Q-learning: e.g., DeepMind’s Atari game-playing AI uses a DQN (Deep Q-Network)
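The rough idea of a DQN is to replace the table lookup Q[state][action] with a neural network that maps a state to one Q-value per action. A minimal PyTorch sketch of that idea only (the layer sizes are arbitrary assumptions; DeepMind's actual DQN uses convolutional layers on game frames plus experience replay and a target network for training):

```python
import torch
import torch.nn as nn

# A tiny network that approximates the Q-table: input is a state vector,
# output is one estimated Q-value per possible action.
class TinyQNetwork(nn.Module):
    def __init__(self, state_dim=2, n_actions=5, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Usage sketch: q_values = TinyQNetwork()(torch.tensor([[1.0, 6.0]]))
# and the greedy action index is q_values.argmax(dim=1).
```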
The change-in-Q formula is like an (2)
update equation
We update Q based on experience and change the value in the Q-table
S’ is
the next state we enter