Offline Policy Evaluation with New Arms


We study offline policy evaluation in a setting where the target policy can take actions that were not available when the data was logged. We analyze two popular regression-based estimators in this setting and upper-bound their bias by a quantity we refer to as the reward regression risk. We show that the estimators are asymptotically unbiased and uniformly convergent if the reward regression risk goes to zero asymptotically. We then upper-bound the reward regression risk using tools from domain adaptation. This analysis motivates using domain adaptation algorithms to train reward predictors for offline policy evaluation. It also suggests future directions for developing improved offline policy optimization algorithms.
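To make the setting concrete, the following is a minimal sketch of a regression-based value estimate when the target policy plays an arm that never appears in the logged data. The abstract does not name its two estimators, so the sketch assumes a direct-method style estimate (average the predicted reward under the target policy). The linear reward model, the one-dimensional arm features, and all function and variable names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Assumption: each arm is described by a feature vector, so a reward model
# trained on logged arms can be queried on an arm that was never logged.
arm_feats = np.array([[0.0], [1.0], [2.0]])  # arm 2 is the new arm

def design(X, A_feats):
    """Stack context features, arm features, and an intercept column."""
    return np.hstack([X, A_feats, np.ones((len(X), 1))])

def fit_reward_model(X, A_feats, r):
    """Least-squares reward predictor trained on logged (context, arm) pairs."""
    w, *_ = np.linalg.lstsq(design(X, A_feats), r, rcond=None)
    return w

def direct_method_value(w, X, arm_feats, target_probs):
    """Average predicted reward under the target policy.

    target_probs[i, k] is the target policy's probability of arm k in context X[i].
    """
    preds = np.column_stack(
        [design(X, np.tile(f, (len(X), 1))) @ w for f in arm_feats]
    )
    return float(np.mean(np.sum(target_probs * preds, axis=1)))

# Synthetic logged data: the logging policy only ever plays arms 0 and 1.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
logged_arms = rng.integers(0, 2, size=n)

# Assumption: the true reward is exactly linear and noiseless, so the
# regression error on the new arm is zero in this toy case.
theta, phi, bias = np.array([0.5, -0.3]), 0.8, 0.1
r = X @ theta + phi * arm_feats[logged_arms, 0] + bias

w = fit_reward_model(X, arm_feats[logged_arms], r)

# Target policy: always play the new arm.
target_probs = np.zeros((n, 3))
target_probs[:, 2] = 1.0

estimate = direct_method_value(w, X, arm_feats, target_probs)
true_value = float(np.mean(X @ theta + phi * arm_feats[2, 0] + bias))
```

Here the estimate matches the true value only because the reward model is correctly specified and extrapolates exactly to the new arm. When the model's error on the new arm is nonzero, that error carries over into the value estimate, which is the role the reward regression risk plays in the abstract.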

NIPS Workshop on Offline Reinforcement Learning