Algorithms Lecture 2 Flashcards
Reinforcement Learning
Learning from experience, through rewards and punishments
Diagram: the environment gives the agent a state; the agent has control over which actions it takes; each action returns a new state and a reward from the environment.
Markov decision process
Model for decision-making, consisting of:
a set of states S
a set of actions ACTIONS(s)
a transition model P(s' | s, a)
a reward function R(s, a, s')
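A minimal Python sketch of these four components, using an invented two-state weather example (all names and numbers below are illustrative, not from the lecture):

S = ["sunny", "rainy"]                               # set of states S

def ACTIONS(s):                                      # set of actions ACTIONS(s)
    return ["walk", "drive"]

P = {                                                # transition model P(s' | s, a)
    ("sunny", "walk"):  {"sunny": 0.8, "rainy": 0.2},
    ("sunny", "drive"): {"sunny": 0.9, "rainy": 0.1},
    ("rainy", "walk"):  {"sunny": 0.3, "rainy": 0.7},
    ("rainy", "drive"): {"sunny": 0.4, "rainy": 0.6},
}

def R(s, a, s_next):                                 # reward function R(s, a, s')
    return 1 if s_next == "sunny" else -1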
reward
r can be positive (a reward) or negative (a punishment)
Q-learning
Q = the learned value function; s = state; a = action
Method for learning a function Q(s, a):
an estimate of the value of performing action a in state s
Overview of Q-learning
Start with Q(s, a) = 0 for all s, a.
When we take an action and receive a reward:
estimate the value of Q(s, a) based on the current reward and expected future rewards
update Q(s, a) to take into account both the old estimate and the new one
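A runnable toy sketch of this loop; the one-state "game", the action names, and the 0.1 learning rate are all invented just to show the mechanics (with a single state there are no future rewards, so only the blending of old and new estimates is visible):

import random

Q = {}                                        # start with Q(s, a) = 0 for all s, a
actions = ["safe", "risky"]
expected = {"safe": 1, "risky": 3}            # hypothetical average reward per action

for step in range(1000):
    a = random.choice(actions)                # take an action
    r = expected[a] + random.uniform(-2, 2)   # receive a noisy reward
    old = Q.get(("start", a), 0)              # old estimate
    Q[("start", a)] = old + 0.1 * (r - old)   # blend old estimate with the new one

print(Q)                                      # values approach roughly 1 and 3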
formula Q-learning
Start with Q(s, a) = 0 for all s, a.
Every time we take an action a in state s and observe a reward r, we update:
Q(s, a) ← Q(s, a) + α * ((r + max over a' of Q(s', a')) − Q(s, a))
where α is the learning rate and s' is the resulting state.
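One way to implement that update in Python; the dict representation of Q, the default alpha, and the toy values in the usage lines are assumptions:

def update(Q, s, a, r, s_next, actions, alpha=0.5):
    old = Q.get((s, a), 0)                                    # old estimate
    # best expected future reward achievable from the next state
    best_future = max((Q.get((s_next, a2), 0) for a2 in actions), default=0)
    Q[(s, a)] = old + alpha * ((r + best_future) - old)       # blended new estimate

Q = {}
update(Q, s="start", a="left", r=1, s_next="end", actions=["left", "right"])
print(Q)                                                      # {('start', 'left'): 0.5}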
Greedy Decision-making
When in state s, choose action a with highest Q(s, a)
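As a sketch, with Q kept as a dict defaulting to 0 (the example values are made up):

def greedy(Q, s, actions):
    # pick the action with the highest Q(s, a); ties go to the first action
    return max(actions, key=lambda a: Q.get((s, a), 0))

Q = {("start", "left"): 0.5, ("start", "right"): 0.8}
print(greedy(Q, "start", ["left", "right"]))   # right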
Explore vs. exploit
Exploit: the AI uses the knowledge it already has, taking the path it knows leads to the reward.
Explore: the AI tries other actions, since there may be better ways to get to the reward.
epsilon
ɛ: how often we want the AI to move randomly
ɛ-greedy
Set ɛ equal to how often we want to move randomly.
With probability 1 − ɛ, choose the estimated best move.
With probability ɛ, choose a random move.
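A sketch of that rule, reusing the dict-based Q from the cards above (epsilon = 0.1 is just an example value):

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                     # explore: random move
    return max(actions, key=lambda a: Q.get((s, a), 0))   # exploit: best known move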
Code NIM
from nim import train, play

ai = train(10000)  # number of games the AI trains on (train(0) would skip training)
play(ai)