Thanks a lot for your feedback.
For this particular proof, my main resources were the PhD thesis of a friend (Johanni Brea) and the appendix of “Algorithms for Reinforcement Learning” by Csaba Szepesvári.
But in general, for resources on the rigorous mathematical treatment of RL, you can look at "Markov Decision Processes: Discrete Stochastic Dynamic Programming" by Martin Puterman or one of the many books by Dimitri Bertsekas on the topic. Michael Littman also has quite a few amazing works (including his PhD thesis) if you are interested in POMDPs as well.