Markov Decision Process (MDP)

Here, we consider a discrete-time Markov Decision Process where time is slotted into intervals of equal duration indexed by n \in \{0, 1, 2, ...\}.

Definition 1. A Markov Decision Process M is a 5-tuple (S, A, p, c, \gamma), where S is the set of states i, A is the set of actions a, p_{ij}(a) : S \times A \times S \rightarrow [0,1] is the transition probability function, which gives the probability of transitioning to state j after performing action a in state i, c : S \times A \rightarrow \mathbb{R} is the cost function, which gives the expected cost incurred when performing action a in state i, and \gamma \in [0,1) is the discount factor.

A policy is a rule that the agent follows for choosing actions given its state. Let us assume that we have a policy \pi that assigns an action to each state, i.e., action a = \pi(i) is chosen each time the agent is in state i under that policy. Then, at time n, the agent performs action a in state i, incurs a cost c, and moves to the next state j distributed according to p_{ij}(a). All these costs are discounted by the discount factor \gamma and combined into a single value for each state i. Hence, the discounted value of a policy starting in state i is V^{\pi}(i) = \sum_{n=0}^{\infty} \gamma^n c_n(i, \pi(i)), which is bounded by

V^{\pi}(i) \leq \frac{c_{max}}{1 - \gamma}

since the cost is bounded by c_{max}. Let \pi^* be an optimal policy that minimizes the sequence of discounted costs from any state. This implies that for any policy \pi and any state i we have V^{\pi^*}(i) \leq V^{\pi}(i). We say that a policy \pi is an \epsilon-approximation of the optimal policy if V^{\pi}(i) \leq V^{\pi^*}(i) + \epsilon for every state i.
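As an illustration of the definitions above, the following Python sketch evaluates the discounted value V^\pi of a fixed policy on a small hypothetical MDP and checks the c_{max}/(1-\gamma) bound. The transition probabilities, costs, and policy are invented purely for this example and are not taken from this paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, used only for illustration.
gamma = 0.9                                   # discount factor
costs = np.array([[1.0, 2.0],                 # c(i, a): cost of action a in state i
                  [0.5, 1.5]])
# p[a, i, j] = p_ij(a): probability of moving from state i to state j under action a
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])

def evaluate_policy(policy, tol=1e-8):
    """Compute V^pi as the fixed point of V(i) = c(i, pi(i)) + gamma * sum_j p_ij(pi(i)) V(j)."""
    n_states = costs.shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([costs[i, policy[i]] + gamma * p[policy[i], i] @ V
                          for i in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

policy = [0, 1]                               # pi(i): fixed action chosen in each state
V_pi = evaluate_policy(policy)
c_max = costs.max()
# Every discounted value is bounded by c_max / (1 - gamma), as stated above.
assert np.all(V_pi <= c_max / (1 - gamma) + 1e-6)
print("V^pi =", V_pi, " bound =", c_max / (1 - gamma))
```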

TABLE 1: Value iteration algorithm.
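The contents of Table 1 are not reproduced here. As a rough sketch of what a value iteration procedure looks like, the following Python snippet performs Bellman backups on the same hypothetical MDP until successive iterates differ by less than \epsilon(1-\gamma)/(2\gamma), a standard stopping rule under which the greedy policy is an \epsilon-approximation of \pi^*. It is a generic illustration under assumed data, not the exact algorithm of Table 1.

```python
import numpy as np

# Hypothetical MDP (same shapes as the sketch above); cost-minimizing agent.
gamma = 0.9
costs = np.array([[1.0, 2.0],                 # c(i, a)
                  [0.5, 1.5]])
p = np.array([[[0.8, 0.2], [0.3, 0.7]],      # p[a, i, j] = p_ij(a)
              [[0.5, 0.5], [0.9, 0.1]]])

def value_iteration(epsilon=1e-3):
    """Repeat Bellman backups until successive value functions differ by less than
    epsilon * (1 - gamma) / (2 * gamma); the greedy policy is then epsilon-optimal."""
    V = np.zeros(costs.shape[0])
    threshold = epsilon * (1 - gamma) / (2 * gamma)
    while True:
        # Q(i, a) = c(i, a) + gamma * sum_j p_ij(a) V(j)
        Q = costs + gamma * np.einsum('aij,j->ia', p, V)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < threshold:
            return V_new, Q.argmin(axis=1)    # value estimate and greedy policy
        V = V_new

V_star, pi_star = value_iteration()
print("V* ~", V_star, " greedy policy:", pi_star)
```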