Linde Institute/Social and Information Sciences Laboratory (SISL) Seminar
In this talk, I will present a natural framework for simulation-based optimization and control of Markov decision process (MDP) models. The idea is simple: replace the Bellman operator with its "empirical" variant, in which the expectation is replaced by a sample average approximation. This leads to a random Bellman operator in the dynamic programming equations. We introduce several notions of probabilistic fixed points of such random operators and show their asymptotic equivalence. We establish convergence of empirical value iteration and policy iteration algorithms by a stochastic dominance argument. The mathematical technique is useful for analyzing iterated random operators other than the empirical Bellman operator, and may also be useful in random matrix theory. The idea can be generalized to asynchronous dynamic programming and is also useful for computing equilibria of zero-sum stochastic games. Preliminary numerical results show a better convergence rate than stochastic approximation/reinforcement learning schemes. If time permits, I will also briefly discuss Blackwell's approachability for MDPs and stochastic games, in particular a learning scheme for approachability in MDPs and games.
This is joint work with Dileep Kalathil, William Haskell and Vivek Borkar.
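
As a rough illustration of the central idea in the abstract (replacing the expectation in the Bellman operator with a sample average), the following is a minimal sketch of empirical value iteration. It is not the speakers' implementation: the two-state toy MDP, the discount factor, the sample sizes, and all function names (empirical_bellman, empirical_value_iteration, sample_next, exact_value_iteration) are assumptions made here for illustration only.

```python
import numpy as np

# Hypothetical toy MDP (illustrative assumption, not from the talk):
# P[s, a, s'] are transition probabilities, reward[s, a] are rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
reward = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
gamma = 0.9


def sample_next(s, a, n, rng):
    """Simulator: draw n next states from P[s, a, :]."""
    return rng.choice(P.shape[2], size=n, p=P[s, a])


def empirical_bellman(v, n_samples, rng):
    """One application of the empirical Bellman operator.

    The expectation over next states is replaced by an average over
    fresh simulator samples, so the operator itself is random.
    """
    n_states, n_actions = reward.shape
    q = np.empty((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            nxt = sample_next(s, a, n_samples, rng)
            q[s, a] = reward[s, a] + gamma * v[nxt].mean()
    return q.max(axis=1)


def empirical_value_iteration(n_samples, n_iters, seed=0):
    """Iterate the random operator starting from v = 0."""
    rng = np.random.default_rng(seed)
    v = np.zeros(reward.shape[0])
    for _ in range(n_iters):
        v = empirical_bellman(v, n_samples, rng)
    return v


def exact_value_iteration(n_iters):
    """Classical value iteration with the exact expectation, for comparison."""
    v = np.zeros(reward.shape[0])
    for _ in range(n_iters):
        v = (reward + gamma * (P @ v)).max(axis=1)
    return v


v_emp = empirical_value_iteration(n_samples=2000, n_iters=200)
v_exact = exact_value_iteration(n_iters=200)
print("empirical:", v_emp)
print("exact:    ", v_exact)
```

Because fresh samples are drawn at every iteration, the iterates do not settle at a deterministic fixed point; they keep fluctuating around the exact value function. This is one way to see why probabilistic notions of a fixed point are needed for such random operators.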