When the agent is in state i at time slot n and takes action a, it transitions to the next state j at time slot n+1 with probability and incurs expected cost c(i, a). However, given the available actions , it is not enough to select the action that minimizes the immediate expected cost; instead, the agent should take into account the desirability of the next state j. Thus, all
the next states should be ranked. This can be done by using the optimal cost (over all remaining stages) starting from state j which is denoted by and is referred to as the optimal cost-to-go of the state j (we will also interchangeably refer to it as the value function). The cost-to-go satisfies the following Bellman’s equation
where j is the state subsequent to i. Generally, in each state i there is an optimal action a that attains the minimum above. Thus actions are ranked based on the sum of the expected immediate cost and the discounted costto- go of all subsequent stages.