
## Markov Decision Process (MDP)

Here, we consider a discrete-time Markov Decision Process where time is slotted into intervals of equal duration indexed by $n = 0, 1, 2, \ldots$.

Definition 1. A Markov Decision Process $\mathcal{M}$ is a 5-tuple $(\mathcal{S}, \mathcal{A}, p, c, \gamma)$, where $\mathcal{S}$ is the set of states $i$, $\mathcal{A}$ is the set of actions $a$, $p(j \mid i, a)$ is the transition probability function, which gives the probability of transitioning to state $j$ after performing action $a$ in state $i$, $c(i, a)$ is the cost function, which gives the expected cost incurred when performing action $a$ in state $i$, and $\gamma \in [0, 1)$ is the discount factor.

A policy is a rule that the agent follows for choosing actions given its state. Let us assume that we have a stationary policy $\pi$ that assigns an action to each state, i.e., action $a = \pi(i)$ is chosen each time the agent is in state $i$. Then at time $n$ the agent performs action $a = \pi(i)$ in state $i$, incurs a cost $c(i, a)$, and reaches the next state $j$ distributed according to $p(\cdot \mid i, a)$. All these costs are discounted by the factor $\gamma$ and combined into a single value for each state $i$. Hence, the discounted value of a policy $\pi$ starting in state $i$ is

$$V^{\pi}(i) = \mathbb{E}\left[\sum_{n=0}^{\infty} \gamma^{n}\, c\big(i_n, \pi(i_n)\big) \;\middle|\; i_0 = i\right],$$

which is bounded by $c_{\max} / (1 - \gamma)$ since the cost is bounded by $c_{\max}$. Let $\pi^{*}$ be an optimal policy that minimizes this sequence of discounted costs from any state. This implies that for any policy $\pi$ and any state $i$ we have $V^{\pi^{*}}(i) \le V^{\pi}(i)$. We say that a policy $\pi$ is an $\varepsilon$-approximation of the optimal policy if $V^{\pi}(i) \le V^{\pi^{*}}(i) + \varepsilon$ for every state $i$.
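To make the definition of $V^{\pi}$ concrete, the following sketch evaluates a fixed policy on a small finite MDP by solving the linear system $V^{\pi} = c_{\pi} + \gamma P_{\pi} V^{\pi}$. The array layout (`P[a][i][j]`, `c[i][a]`) and the NumPy-based implementation are illustrative assumptions, not part of the original development.

```python
import numpy as np

def policy_value(P, c, policy, gamma):
    """Exact discounted value V^pi of a fixed stationary policy.

    P[a][i][j] : probability of reaching state j after action a in state i
    c[i][a]    : expected cost of action a in state i
    policy[i]  : action chosen in state i
    gamma      : discount factor in [0, 1)
    """
    n_states = c.shape[0]
    # Transition matrix and cost vector induced by the policy
    P_pi = np.array([P[policy[i]][i] for i in range(n_states)])
    c_pi = np.array([c[i, policy[i]] for i in range(n_states)])
    # V^pi solves the linear system (I - gamma * P_pi) V = c_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
```

Because every entry of $c$ is bounded by $c_{\max}$ and the rows of $P_{\pi}$ sum to one, each component of the returned vector is bounded by $c_{\max}/(1-\gamma)$, matching the bound above.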

TABLE 1: Value iteration algorithm.
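As a concrete illustration of the value iteration scheme referenced in Table 1, here is a minimal sketch of the standard algorithm for a finite cost-minimizing MDP, using the same hypothetical array layout as above. The stopping threshold $\varepsilon(1-\gamma)/(2\gamma)$ is the usual one guaranteeing that the greedy policy is an $\varepsilon$-approximation of the optimal policy; the table in the original may present a different variant.

```python
import numpy as np

def value_iteration(P, c, gamma, eps):
    """Value iteration for a finite cost-minimizing MDP.

    P[a][i][j] : probability of reaching state j from state i under action a
    c[i][a]    : expected cost of action a in state i
    gamma      : discount factor in [0, 1)
    eps        : target accuracy; the returned policy is eps-optimal
    Returns (V, policy).
    """
    n_states, n_actions = c.shape
    V = np.zeros(n_states)
    # Stopping threshold ensuring an eps-approximation of the optimal policy
    threshold = eps * (1 - gamma) / (2 * gamma) if gamma > 0 else eps
    while True:
        # Bellman backup: Q[i, a] = c(i, a) + gamma * sum_j p(j | i, a) V(j)
        Q = c + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < threshold:
            return V_new, Q.argmin(axis=1)
        V = V_new
```

Each iteration applies the Bellman optimality operator, which is a $\gamma$-contraction in the sup-norm, so the successive value vectors converge geometrically to $V^{\pi^{*}}$.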