Control Meets Learning Seminar
A fundamental challenge in contextual bandits is to develop flexible, general-purpose algorithms with computational requirements no worse than those of classical supervised learning. In this talk, we describe a universal and optimal reduction from contextual bandits to online regression. We characterize the minimax rates for contextual bandits with general, potentially nonparametric function classes, and show that our algorithm is minimax optimal whenever the online regression method is optimal. We then turn to the case of i.i.d. data and present an adaptive method that attains fast instance-dependent rates whenever certain disagreement-based notions of problem complexity are bounded. Time permitting, we extend these ideas to the setting of block MDPs and provide oracle-efficient algorithms for reinforcement learning with rich observations that obtain optimal gap-dependent sample complexity.
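To give a flavor of how such a reduction can work, the sketch below implements an inverse-gap-weighting action-selection rule of the kind used in regression-oracle-based contextual bandit algorithms (e.g., SquareCB). It assumes the online regression oracle has already produced a vector of predicted rewards for the current context; the function name and the learning-rate parameter `gamma` are illustrative placeholders, not the talk's exact construction.

```python
import numpy as np

def igw_action_distribution(predictions, gamma):
    """Inverse-gap-weighted exploration over K actions.

    Given predicted rewards (one per action) from a regression oracle
    and a learning-rate parameter gamma > 0, assign each suboptimal
    action probability 1 / (K + gamma * gap), where gap is its
    predicted shortfall from the greedy action, and give the greedy
    action the remaining mass. Larger gamma means less exploration.
    """
    predictions = np.asarray(predictions, dtype=float)
    K = len(predictions)
    best = int(np.argmax(predictions))          # greedy action
    gaps = predictions[best] - predictions      # nonnegative gaps
    probs = 1.0 / (K + gamma * gaps)            # small prob on bad arms
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()             # remainder to greedy arm
    return probs
```

An action can then be sampled with `np.random.choice(len(probs), p=probs)`; as the oracle's regret shrinks, the resulting bandit regret is controlled by the reduction's guarantee.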
Joint work with Dylan Foster, David Simchi-Levi, and Yunzong Xu.