
When reading recent papers on reinforcement learning (RL) or inverse reinforcement learning (IRL), one notices that each author uses a slightly different notation for MDPs. The biggest difference between them is often the reward function: sometimes it is defined as a function of states ($\mathcal{S} \rightarrow \mathbb{R}$), and other times it maps state-action pairs to rewards ($\mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$).

While this is of course not a huge difference, I think it is good to have an intuition about how the two formulations differ and to know how one can change from one to the other. Most papers take this as a trivial prerequisite, such as Ng and Russell in their 2000 paper "Algorithms for Inverse Reinforcement Learning":

"we have written rewards as $\mathcal{R}(s)$ rather than $\mathcal{R}(s, a)$; the extension is trivial."

I found that many resources teaching RL strictly use one notation and in fact do not show how to switch to the other. So for some of you this extension might not look all that trivial at first sight. Not to worry: after reading this post, it should be quite obvious!

What is implied when using $\mathcal{R} : \mathcal{S} \rightarrow \mathbb{R}$? In this formulation, the reward depends only on the state the agent ends up in, regardless of which action was taken to get there.

And with $\mathcal{R} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$? Here, the reward is assigned to performing a particular action in a particular state; as we will see below, it can be read as the expected state reward over the successor states the action can lead to.

In their standard RL book, Sutton and Barto state that while the MDP community often uses $\mathcal{R}(s, a)$, they prefer $\mathcal{R}(s)$ for reinforcement learning, since there "we more often have to refer to individual actual or sample rewards (rather than just their expected values)."

Switching between notations

Switching between the two notations is actually quite easy. To see how, let's use a very basic and small MDP, depicted in the following figure:

A simple example MDP. There are three states ($s_0$, $s_1$, $s_2$) and two actions (blue and orange). The transitions are deterministic except for performing blue in $s_0$, with $T(s_1 | s_0, \text{blue})=40\%$ and $T(s_2 | s_0, \text{blue})=60\%$. For the color blind: blue action arrows are dotted lines.

In the MDP shown above, we have three states and two actions: blue / b and orange / o. The only 'interesting' decision in this MDP occurs when the agent is in state $s_0$: it can choose to perform o to land deterministically in $s_2$, or use action b to land in $s_1$ with $40\%$ probability or in $s_2$ with $60\%$ probability.

We are already given the rewards for each state (using the $\mathcal{S} \rightarrow \mathbb{R}$ formulation):

| $s_i$ | $\mathcal{R}(s_i)$ |
| --- | --- |
| $s_0$ | $0$ |
| $s_1$ | $2$ |
| $s_2$ | $1$ |
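To make the two conversions concrete, here is a minimal sketch of this example MDP in Python. The names `R_s` and `T` are just illustrative choices for this post; the transitions out of $s_1$ and $s_2$ are deterministic as stated in the figure caption, and their targets (both actions from $s_1$ lead to $s_2$, both actions from $s_2$ lead back to $s_0$) can be read off the linear system further below.

```python
# State rewards R(s), exactly as given in the table above.
R_s = {"s0": 0.0, "s1": 2.0, "s2": 1.0}

# Transition probabilities T(s' | s, a), stored as {(s, a): {s': prob}}.
# Only (s0, blue) is stochastic; all other transitions are deterministic.
T = {
    ("s0", "blue"):   {"s1": 0.4, "s2": 0.6},
    ("s0", "orange"): {"s2": 1.0},
    ("s1", "blue"):   {"s2": 1.0},
    ("s1", "orange"): {"s2": 1.0},
    ("s2", "blue"):   {"s0": 1.0},
    ("s2", "orange"): {"s0": 1.0},
}
```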

From $\mathcal{R}(s)$ to $\mathcal{R}(s, a)$:

Now, how can we turn this into a reward function over state-action pairs? To do this, we need the transition probabilities $T$ for all state-action pairs. Then we can simply compute $\mathcal{R}(s,a)$ as the expected reward when performing action $a$ in state $s$: $$ \mathcal{R}(s, a) = \mathbb{E}_{s'\sim T(s, a)}[\mathcal{R}(s')] $$

In our basic MDP example introduced above, let's do this for performing action blue / b in state $s_0$: $$ \mathcal{R}(s_0, b) \\ = T(s_0|s_0, b) \cdot \mathcal{R}(s_0) + T(s_1|s_0, b) \cdot \mathcal{R}(s_1) + T(s_2|s_0, b) \cdot \mathcal{R}(s_2) \\ = 0 \cdot \mathcal{R}(s_0) + 0.4 \cdot \mathcal{R}(s_1) + 0.6 \cdot \mathcal{R}(s_2) \\ = 0 \cdot 0 + 0.4 \cdot 2 + 0.6 \cdot 1 \\ = 1.4 $$

We can do this for the other state-action pairs as well and arrive at the following results: $$ \mathcal{R}(s_0, b) = 1.4 \\ \mathcal{R}(s_0, o), \mathcal{R}(s_1, b), \mathcal{R}(s_1, o) = 1 \\ \mathcal{R}(s_2, b), \mathcal{R}(s_2, o) = 0 $$
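Reusing the `R_s` and `T` dictionaries sketched earlier, this calculation is a one-liner per state-action pair, and the small loop below should reproduce exactly the values listed above.

```python
def reward_sa(s, a, T, R_s):
    """R(s, a) = sum over s' of T(s' | s, a) * R(s')."""
    return sum(p * R_s[s_next] for s_next, p in T[(s, a)].items())

for (s, a) in T:
    print(f"R({s}, {a}) = {reward_sa(s, a, T, R_s):.1f}")
# R(s0, blue) = 1.4, R(s0, orange) = 1.0, R(s1, blue) = 1.0,
# R(s1, orange) = 1.0, R(s2, blue) = 0.0, R(s2, orange) = 0.0
```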

From $\mathcal{R}(s, a)$ to $\mathcal{R}(s)$:

Now that we have fully derived $\mathcal{R}(s, a)$ from $\mathcal{R}(s)$ and the transition probabilities $T$, let's go in the other direction. We already know that $$ \mathcal{R}(s, a) = \mathbb{E}_{s'\sim T(s, a)}[\mathcal{R}(s')] $$ Assuming that our state space is finite, this is equivalent to $$ \mathcal{R}(s, a) = \sum_{s'}T(s'| s, a) \cdot \mathcal{R}(s') $$

Given that we know all $\mathcal{R}(s, a)$ and have full knowledge of the transition probabilities $T$, we can set up a system of linear equations in the unknowns $\mathcal{R}(s')$, with the first three columns corresponding to $\mathcal{R}(s_0)$, $\mathcal{R}(s_1)$, and $\mathcal{R}(s_2)$ respectively: $$ \left[ \begin{array}{ccc|c} 0 & 0.4 & 0.6 & 1.4 = \mathcal{R}(s_0, b) \\ 0 & 0 & 1 & 1 = \mathcal{R}(s_0, o) \\ 0 & 0 & 1 & 1 = \mathcal{R}(s_1, b) \\ 0 & 0 & 1 & 1 = \mathcal{R}(s_1, o) \\ 1 & 0 & 0 & 0 = \mathcal{R}(s_2, b) \\ 1 & 0 & 0 & 0 = \mathcal{R}(s_2, o) \\ \end{array} \right] $$

We can easily extract from the equations above that $\mathcal{R}(s_0) = 0$ and $\mathcal{R}(s_2) = 1$. From the first equation we can then find that $0.4\cdot \mathcal{R}(s_1) + 0.6 = 1.4$, so $\mathcal{R}(s_1) = 2$.
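This bookkeeping can also be left to a linear solver. The sketch below reuses `T`, `R_s`, and `reward_sa` from above to build the matrix from this section and recovers the state rewards with a least-squares solve; in a real application, the right-hand side would of course be your already-known $\mathcal{R}(s, a)$ values rather than ones recomputed from `R_s`.

```python
import numpy as np

states = ["s0", "s1", "s2"]
pairs = list(T)  # one equation per state-action pair

# A[i, j] = T(states[j] | s_i, a_i); b[i] = R(s_i, a_i)
A = np.array([[T[pair].get(s, 0.0) for s in states] for pair in pairs])
b = np.array([reward_sa(s, a, T, R_s) for (s, a) in pairs])

R_recovered, *_ = np.linalg.lstsq(A, b, rcond=None)
for s, r in zip(states, R_recovered):
    print(f"R({s}) = {r:.1f}")
# R(s0) = 0.0, R(s1) = 2.0, R(s2) = 1.0  (up to floating-point rounding)
```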

With this, we have successfully returned to our original reward formulation, defined for each state.
