Markov Decision Process (MDP)

Here, we consider a discrete-time Markov Decision Process where time is slotted into intervals of equal duration indexed by n \in \{0, 1, 2, ...\}.

Definition 1. A Markov Decision Process M is a 5-tuple (S, A, p, c, \gamma), where S is the set of states i, A is the set of actions a, p_{ij}(a) : S \times A \times S \rightarrow [0,1] is the transition probability function, which gives the probability of transitioning to state j after performing action a in state i, c : S \times A \rightarrow \mathbb{R} is the cost function, which gives the expected cost incurred when performing action a in state i, and \gamma \in [0,1) is the discount factor.

A policy is a rule that the agent follows for choosing actions given its state. Let us assume that we have a policy \pi that assigns an action to each state, i.e., action a = \pi(i) is chosen each time the agent is in state i under that policy. Then, at time n, the agent performs action a in state i, incurs a cost c, and moves to the next state j distributed according to p_{ij}(a). All these costs are discounted by the discount factor \gamma and combined into a single value for each state i. Hence, the discounted value of a policy starting in state i is V^{\pi}(i) = \sum_{n=0}^{\infty} \gamma^n c_n(i, \pi(i)), which is bounded by

V^{\pi}(i) \leq \frac{c_{max}}{1 - \gamma}

since the cost is bounded by c_{max}. Let \pi^* be an optimal policy that minimizes the sequence of discounted costs from any state. This implies that for any policy \pi and any state i we have V^{\pi^*}(i) \leq V^{\pi}(i). We say that a policy \pi is an \epsilon-approximation of the optimal policy if V^{\pi}(i) \leq V^{\pi^*}(i) + \epsilon for every state i.
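As an illustration of the definitions above, the following Python sketch evaluates the discounted value V^\pi of a fixed policy on a small hypothetical MDP and checks the c_{max}/(1-\gamma) bound. The transition probabilities, costs, and policy are invented purely for this example and are not taken from this paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, used only for illustration.
gamma = 0.9                                   # discount factor
costs = np.array([[1.0, 2.0],                 # c(i, a): cost of action a in state i
                  [0.5, 1.5]])
# p[a, i, j] = p_ij(a): probability of moving from state i to state j under action a
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])

def evaluate_policy(policy, tol=1e-8):
    """Compute V^pi as the fixed point of V(i) = c(i, pi(i)) + gamma * sum_j p_ij(pi(i)) V(j)."""
    n_states = costs.shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([costs[i, policy[i]] + gamma * p[policy[i], i] @ V
                          for i in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

policy = [0, 1]                               # pi(i): fixed action chosen in each state
V_pi = evaluate_policy(policy)
c_max = costs.max()
# Every discounted value is bounded by c_max / (1 - gamma), as stated above.
assert np.all(V_pi <= c_max / (1 - gamma) + 1e-6)
print("V^pi =", V_pi, " bound =", c_max / (1 - gamma))
```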

TABLE 1: Value iteration algorithm.
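The contents of Table 1 are not reproduced here. As a rough sketch of what a value iteration procedure looks like, the following Python snippet performs Bellman backups on the same hypothetical MDP until successive iterates differ by less than \epsilon(1-\gamma)/(2\gamma), a standard stopping rule under which the greedy policy is an \epsilon-approximation of \pi^*. It is a generic illustration under assumed data, not the exact algorithm of Table 1.

```python
import numpy as np

# Hypothetical MDP (same shapes as the sketch above); cost-minimizing agent.
gamma = 0.9
costs = np.array([[1.0, 2.0],                 # c(i, a)
                  [0.5, 1.5]])
p = np.array([[[0.8, 0.2], [0.3, 0.7]],      # p[a, i, j] = p_ij(a)
              [[0.5, 0.5], [0.9, 0.1]]])

def value_iteration(epsilon=1e-3):
    """Repeat Bellman backups until successive value functions differ by less than
    epsilon * (1 - gamma) / (2 * gamma); the greedy policy is then epsilon-optimal."""
    V = np.zeros(costs.shape[0])
    threshold = epsilon * (1 - gamma) / (2 * gamma)
    while True:
        # Q(i, a) = c(i, a) + gamma * sum_j p_ij(a) V(j)
        Q = costs + gamma * np.einsum('aij,j->ia', p, V)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < threshold:
            return V_new, Q.argmin(axis=1)    # value estimate and greedy policy
        V = V_new

V_star, pi_star = value_iteration()
print("V* ~", V_star, " greedy policy:", pi_star)
```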