q-learning intuition Flashcards
V(s) = max_a [ R(s,a) + γV(s') ]
s = the current state; s' = the following state; a = an action; R(s,a) = the reward for taking action a in state s; γ (gamma) = the discount factor, which weights the value of the next state, γV(s'); max = the maximum over the many available actions. Taking one action in state s automatically earns the reward R(s,a) plus the discounted value of the new state, γV(s'). The sum R(s,a) + γV(s') is computed for every action, and the maximum is taken.
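As a sketch of the equation above, here is value iteration on a toy three-state chain (the states, actions, and rewards are invented for illustration): each sweep applies V(s) = max_a [ R(s,a) + γV(s') ] until the values settle.

```python
gamma = 0.9  # discount factor (γ)

# transitions[state][action] = (reward, next_state); toy values, not from the card
transitions = {
    "s0": {"left": (0.0, "s0"), "right": (0.0, "s1")},
    "s1": {"left": (0.0, "s0"), "right": (1.0, "s2")},
    "s2": {},  # terminal state: no actions, value stays 0
}

V = {s: 0.0 for s in transitions}
for _ in range(100):  # repeat the Bellman update until the values converge
    for s, actions in transitions.items():
        if actions:
            # V(s) = max over actions of (reward + gamma * value of next state)
            V[s] = max(r + gamma * V[s2] for r, s2 in actions.values())

print(V)  # s1 is worth 1.0 (immediate reward), s0 is worth 0.9 (one step away)
```

Note how the discount factor makes s0 worth slightly less than s1: the reward is one step further away.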
The Bellman equation creates ____________ for an agent to get to ___________
incentive
reward
An algorithm which, given a particular input, will always produce the same output, with the underlying machine always passing through the same sequence of states.
deterministic algorithm
Factors that cause a non-deterministic search
A variety of factors can cause an algorithm to behave non-deterministically:
If it uses external state other than the input, such as user input, a global variable, a hardware timer value, a random value, or stored disk data.
If it operates in a way that is timing-sensitive, for example if it has multiple processors writing to the same data at the same time. In this case, the precise order in which each processor writes its data will affect the result.
If a hardware error causes its state to change in an unexpected way.
What is the dynamic programming equation/Bellman equation
It writes the value of a decision problem at a certain point in time in terms of the payoff from some initial choices and the value of the remaining decision problem that results from those initial choices.
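A tiny numeric sketch of that decomposition (the choices and numbers are hypothetical): the value now is the payoff from the best initial choice plus the discounted value of the remaining decision problem.

```python
gamma = 0.9
remaining_value = 10.0  # value of the decision problem left after the first choice
payoffs = {"invest": 2.0, "wait": 0.0}  # hypothetical initial choices

# value now = best (immediate payoff + gamma * value of what remains)
value = max(p + gamma * remaining_value for p in payoffs.values())
print(value)  # 2.0 + 0.9 * 10.0 = 11.0
```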
Markov decision processes (MDPs)
Markov decision processes (MDPs) provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. The next state depends only upon the present state and action, not on the sequence of events that preceded it (the Markov property).
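The ideas on these cards can be combined in a sketch of tabular Q-learning on a toy MDP (the states, actions, and probabilities are invented for illustration): the transition is partly random (action "go" succeeds only 80% of the time) and partly controlled, and the next state depends only on the current state and action.

```python
import random

random.seed(0)
ACTIONS = ["go", "stay"]
alpha, gamma = 0.1, 0.9  # learning rate and discount factor

def step(state, action):
    """Markov property: the next state depends only on (state, action)."""
    if state == "A" and action == "go":
        # Partly random: reach the rewarding state "B" only 80% of the time
        return ("B", 1.0) if random.random() < 0.8 else ("A", 0.0)
    return ("A", 0.0)  # "stay" in A, or any action in B, leads back to A, no reward

Q = {(s, a): 0.0 for s in ["A", "B"] for a in ACTIONS}
state = "A"
for _ in range(5000):
    action = random.choice(ACTIONS)  # explore uniformly
    next_state, reward = step(state, action)
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    # Q-learning update: nudge Q(s,a) toward the Bellman target R + gamma * max Q(s',a')
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

print(Q[("A", "go")], Q[("A", "stay")])  # "go" from A scores higher than "stay"
```

After enough updates the learned Q-values reflect the Bellman equation: "go" from A is worth more than "stay" because it usually leads to the reward.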