Hi Rishabh,

Thanks for your comment.

I am not sure I understood what you mean, but this proof does not need to refer to immediate rewards explicitly. I would be happy to hear your comments in more detail; in the meantime, it may help if I try to make the proof clearer:

1. The first line is a direct application of Eq. 1.

2. For the 2nd line, we claim that the expectation of q_\pi with respect to \pi' is greater than or equal to its expectation with respect to \pi. To see this fact, let us consider two cases:

(i) s \neq s*: in this case, \pi' is the same as \pi, and hence the inequality holds (more precisely, it holds with equality).

(ii) s = s*: in this case, the expectation with respect to \pi' is equal to q_\pi(s*,a*), which by assumption is greater than v_\pi(s*), i.e., greater than the expectation with respect to \pi.

3. So far, we have proved the first equality and the first inequality. To go from line 2 to line 3, we can basically use the Bellman equations, with some care.

4. After step 3, we repeat the reasoning of step 2 again and again.
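
As a numerical sanity check of the steps above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and is not taken from the post: the two-state toy MDP, the discount factor GAMMA, the uniform policy \pi, the choice s* = 0 and a* = 1, and the helper names q_from_v and evaluate. It evaluates \pi, confirms that q_\pi(s*,a*) is greater than v_\pi(s*), builds \pi' by changing \pi only at s*, and then checks that v_\pi' is at least v_\pi in every state.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical toy MDP: MDP[s][a] is a list of (probability, next_state, reward).
MDP = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}


def q_from_v(v, s, a):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in MDP[s][a])


def evaluate(policy, sweeps=2000):
    """Iterative policy evaluation with synchronous sweeps."""
    v = {s: 0.0 for s in MDP}
    for _ in range(sweeps):
        v = {
            s: sum(policy[s][a] * q_from_v(v, s, a) for a in MDP[s])
            for s in MDP
        }
    return v


# pi: uniformly random in both states.
pi = {s: {0: 0.5, 1: 0.5} for s in MDP}
v_pi = evaluate(pi)

# Assumed improvement point: s* = 0, a* = 1.
s_star, a_star = 0, 1
q_star = q_from_v(v_pi, s_star, a_star)  # q_pi(s*, a*)

# pi' agrees with pi everywhere except at s*, where it always picks a*.
pi_new = {s: dict(pi[s]) for s in MDP}
pi_new[s_star] = {0: 0.0, 1: 1.0}
v_new = evaluate(pi_new)

print("v_pi  :", v_pi)
print("q_pi(s*, a*):", q_star)
print("v_pi' :", v_new)
```

In this toy example the improvement shows up in both states, not only at s*, which is what repeating step 2 at every later visit to s* suggests.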

Please let me know if you still find it unclear or think it is wrong.

Best,

Alireza

--

CS PhD student in the Laboratory of Computational Neuroscience at EPFL || Personal website: https://sites.google.com/view/modirsha
