Hi Rishabh,

Thanks for your comment.

I am not sure I understood what you mean, but this proof does not need to refer to immediate rewards explicitly. I would be happy to hear your comments in more detail. In the meantime, it may help if I try to make the proof clearer:

1. The first line is simply an application of Eq. 1.

2. For the second line, we claim that the expectation of q_\pi with respect to \pi' is greater than or equal to its expectation with respect to \pi. To see this, consider two cases:

(i) s \neq s*: in this case, \pi' is the same as \pi, and hence the inequality holds (more precisely, it holds with equality).

(ii) s = s*: in this case, the expectation with respect to \pi' is equal to q_\pi(s*,a*), which is by assumption greater than v_\pi(s*), that is, greater than the expectation with respect to \pi.

3. So far, we have proved the first equality and the first inequality. To go from line 2 to line 3, we can use the Bellman equations, with some care.

4. After step 3, we repeat the reasoning of step 2 at each subsequent time step, again and again.
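
If a numerical sanity check helps, here is a minimal sketch of the claim on a toy problem. Everything in it is an assumption made for illustration and is not part of the proof: the two-state, two-action MDP and its numbers, the discount factor 0.9, the uniform policy \pi, the choice of s* and a*, and the helper names `evaluate` and `action_values`. It checks the assumption of case (ii), then builds \pi' by changing \pi only at s*, and compares the two value functions.

```python
import numpy as np

# Hypothetical toy MDP (illustration only): 2 states, 2 actions.
gamma = 0.9
# P[s, a, t] = probability of moving to state t after action a in state s.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a] = expected immediate reward for action a in state s.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])


def evaluate(pi):
    """Return v_pi by solving the linear Bellman system (I - gamma P_pi) v = r_pi."""
    r_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)


def action_values(v):
    """Return q[s, a] = R[s, a] + gamma * sum_t P[s, a, t] * v[t]."""
    return R + gamma * P @ v


# pi: uniform random policy, pi[s, a] = probability of action a in state s.
pi = np.full((2, 2), 0.5)
v_pi = evaluate(pi)
q_pi = action_values(v_pi)

# Assumption of case (ii): q_pi(s*, a*) > v_pi(s*) for the chosen pair.
s_star, a_star = 0, 1
assert q_pi[s_star, a_star] > v_pi[s_star]

# pi' equals pi everywhere except at s*, where it always picks a*.
pi_new = pi.copy()
pi_new[s_star] = 0.0
pi_new[s_star, a_star] = 1.0
v_new = evaluate(pi_new)

# Step 2: the expectation of q_pi under pi' is at least the one under pi.
exp_under_new = (pi_new * q_pi).sum(axis=1)
exp_under_old = (pi * q_pi).sum(axis=1)
print("E_pi'[q_pi] >= E_pi[q_pi]:", np.all(exp_under_new >= exp_under_old - 1e-12))

# Conclusion of the proof: v_pi' >= v_pi in every state.
print("v_pi' >= v_pi:", np.all(v_new >= v_pi - 1e-12))
```

Of course, one example does not prove anything; it only illustrates what steps 2 to 4 claim.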

Please let me know if you still think it is unclear or incorrect.

Best,

Alireza

CS PhD student in the Laboratory of Computational Neuroscience at EPFL || Personal website: https://sites.google.com/view/modirsha