Markov Decision Process (MDP)
Here, we consider a discrete-time Markov Decision Process where time is slotted into intervals of equal duration indexed by $n \in \mathbb{N}$.

Definition 1. A Markov Decision Process $M$ is a 5-tuple $(S, A, p, c, \gamma)$, where $S$ is the set of states $i$, $A$ is the set of actions $a$, $p(j \mid i, a)$ is the transition probability function, which gives the probability of transitioning to state $j$ after performing action $a$ in state $i$, $c(i, a)$ is the cost function, which gives the expected cost incurred when performing action $a$ in state $i$, and $\gamma \in [0, 1)$ is the discount factor.
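For concreteness, a finite MDP of this form can be represented by two arrays, one for $p$ and one for $c$. A minimal sketch in Python follows; the sizes, the NumPy layout, and the random entries are illustrative assumptions, not part of the definition.

```python
import numpy as np

# Illustrative sizes (assumptions for the example only).
num_states, num_actions = 4, 2
rng = np.random.default_rng(0)

# p[i, a, j] = p(j | i, a): each distribution over next states sums to 1.
p = rng.random((num_states, num_actions, num_states))
p /= p.sum(axis=2, keepdims=True)

# c[i, a] = c(i, a): expected cost of performing action a in state i.
c = rng.random((num_states, num_actions))

gamma = 0.9  # discount factor, in [0, 1)
```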
A policy is a rule that the agent follows for choosing actions given its state. Let us assume that we have a policy $\pi$ that assigns an action to each state, i.e., action $a = \pi(i)$ will be chosen each time the agent is in state $i$. Then at time $n$ it performs action $a$ in state $i$, incurs a cost $c(i, a)$, and reaches the next state $j$ distributed according to $p(\cdot \mid i, a)$. All these costs are discounted by the discount factor $\gamma$ and combined into a single value for each state $i$. Hence, the discounted value of a policy $\pi$ starting in state $i$ is
\[
V^{\pi}(i) = \mathbb{E}\!\left[\left.\sum_{n=0}^{\infty} \gamma^{n}\, c\bigl(i_n, \pi(i_n)\bigr) \,\right|\, i_0 = i\right],
\]
which is bounded by $\frac{c_{\max}}{1 - \gamma}$ since each per-step cost is bounded by $c_{\max}$ and the discounted costs form a geometric series, $\sum_{n=0}^{\infty} \gamma^{n} c_{\max} = \frac{c_{\max}}{1 - \gamma}$.
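Under the tabular representation sketched above, $V^{\pi}$ can be computed exactly by solving the linear system $V^{\pi} = c_{\pi} + \gamma P_{\pi} V^{\pi}$, which has a unique solution whenever $\gamma < 1$. The function name and array layout below are our assumptions, not notation from the text.

```python
import numpy as np

def policy_value(p, c, gamma, policy):
    """Evaluate V^pi exactly by solving (I - gamma * P_pi) V = c_pi."""
    num_states = p.shape[0]
    idx = np.arange(num_states)
    p_pi = p[idx, policy]  # P_pi[i, j] = p(j | i, pi(i))
    c_pi = c[idx, policy]  # c_pi[i]    = c(i, pi(i))
    return np.linalg.solve(np.eye(num_states) - gamma * p_pi, c_pi)
```

For example, `policy_value(p, c, gamma, np.zeros(num_states, dtype=int))` evaluates the policy that always chooses action 0.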
Let $\pi^*$ be an optimal policy that minimizes the sequence of costs from any state. This implies that for any policy $\pi$ and any state $i$ we have $V^{\pi^*}(i) \le V^{\pi}(i)$. We say that a policy $\pi$ is an $\epsilon$-approximation of the optimal policy if
\[
V^{\pi}(i) - V^{\pi^*}(i) \le \epsilon \quad \text{for every state } i.
\]
TABLE 1: Value iteration algorithm.
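Since Table 1 describes standard value iteration, a minimal sketch of it under the tabular representation above may help; the stopping threshold $\epsilon(1-\gamma)/(2\gamma)$ is the classical criterion guaranteeing that the greedy policy is an $\epsilon$-approximation, assumed here rather than copied from the table.

```python
import numpy as np

def value_iteration(p, c, gamma, eps=1e-6):
    """Value iteration for a finite, cost-minimizing MDP.

    Repeatedly applies the Bellman optimality operator
        V(i) <- min_a [ c(i, a) + gamma * sum_j p(j | i, a) * V(j) ]
    and stops once successive iterates are close enough that the
    greedy policy is within eps of optimal.
    """
    v = np.zeros(p.shape[0])
    threshold = eps * (1 - gamma) / (2 * gamma)
    while True:
        # q[i, a] = c(i, a) + gamma * E[V(j) | i, a]
        q = c + gamma * np.einsum("iaj,j->ia", p, v)
        v_new = q.min(axis=1)
        if np.max(np.abs(v_new - v)) <= threshold:
            return v_new, q.argmin(axis=1)  # value estimate and greedy policy
        v = v_new
```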