IST Lunch Bunch
The talk considers the problem of offline policy learning for automated decision systems under the contextual bandits model, where we aim to evaluate the performance of a given policy (a decision algorithm) and to learn a better policy from logged historical data consisting of contexts, actions, rewards, and the probabilities of the actions taken. This is a generalization of the Average Treatment Effect (ATE) estimation problem and introduces an interesting new set of desiderata.
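To make the logged-data setup concrete, here is a minimal sketch (not taken from the talk; the function names and data layout are illustrative assumptions) of the standard inverse propensity scoring (IPS) estimator, which evaluates a target policy from records of contexts, actions, rewards, and logging probabilities:

```python
import numpy as np

def ips_value(target_policy, contexts, actions, rewards, logging_probs):
    """Estimate the value of target_policy from logged bandit data.

    target_policy(context) is assumed to return a probability vector
    over actions; logging_probs[i] is the probability with which the
    logged action actions[i] was taken at contexts[i].
    """
    weights = np.array([
        target_policy(x)[a] / p                     # importance weight
        for x, a, p in zip(contexts, actions, logging_probs)
    ])
    return float(np.mean(weights * np.asarray(rewards)))
```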
In the first part of the talk, I will compare and contrast off-policy evaluation and ATE estimation and clarify how different assumptions change the corresponding minimax risk of estimating the "causal effect". In addition, I will discuss how one can achieve significantly better finite-sample performance than asymptotically optimal estimators through the SWITCH estimator.
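The idea behind SWITCH is to trade the low bias of importance weighting against the low variance of a fitted reward model: keep the importance-weighted term when the weight is below a threshold, and fall back to the model for actions whose weight would be large. A rough sketch in that spirit follows; the names, data layout, and the assumption that logging probabilities are available for all actions are mine, not the talk's implementation.

```python
import numpy as np

def switch_value(target_policy, logging_policy, q_hat,
                 contexts, actions, rewards, tau):
    """target_policy(x), logging_policy(x): probability vectors over actions;
    q_hat(x, a): estimated mean reward of action a in context x;
    tau: importance-weight threshold."""
    total = 0.0
    for x, a, r in zip(contexts, actions, rewards):
        pi = np.asarray(target_policy(x))
        mu = np.asarray(logging_policy(x))
        rho = pi / mu                        # importance weights for all actions
        if rho[a] <= tau:                    # IPS term for the logged action
            total += rho[a] * r
        # direct-model term covers the actions whose weight exceeds tau
        total += sum(pi[b] * q_hat(x, b) for b in np.flatnonzero(rho > tau))
    return total / len(contexts)
```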
In the second part of the talk, I will turn to off-policy learning in the real world. I will highlight some of the real-world challenges, including missing logging probabilities, confounding variables (Simpson's paradox), and model misspecification. We will demonstrate that a commonly used naive approach, direct cross-entropy minimization, implicitly optimizes a causal objective without requiring knowledge of the probabilities of the actions taken. We then propose policy imitation, which can be used both as a regularizer and as a test for confounders or model misspecification.
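As one illustrative reading of the "naive" recipe (my assumption of a common form, not necessarily the talk's exact objective): fit a softmax policy by reward-weighted cross-entropy on the logged (context, action) pairs. Note that the logging probabilities never appear in the loss.

```python
import numpy as np

def reward_weighted_ce_loss(policy_probs, actions, rewards):
    """policy_probs: (n, n_actions) action probabilities from the learned policy;
    actions: (n,) logged action indices; rewards: (n,) observed rewards.
    Returns the reward-weighted cross-entropy of the logged actions."""
    logp_logged = np.log(policy_probs[np.arange(len(actions)), actions])
    return -float(np.mean(np.asarray(rewards) * logp_logged))
```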