A Practical Guide to Off-Policy Evaluation (In Progress)
What Is Off-Policy Evaluation?
Off-Policy Evaluation (OPE) is a critical tool for both one-shot decision making (bandits) and sequential decision making (reinforcement learning) when only observational data is available, i.e., no active experimentation is required. It involves assessing the performance of a given policy using data that was typically generated by a different policy (known as the logging or behavior policy). This is especially important in scenarios where deploying a new policy to gather data is expensive, risky, or impractical.
Consider the scenario of personalized cancer treatment. Suppose researchers propose a new policy for recommending individualized chemotherapy dosages based on each patient’s genetic profile and medical history. Ideally, we would deploy this new policy in clinical practice and compare its effectiveness to the current standardized chemotherapy protocols (active experimentation, i.e., RCTs or A/B tests). However, because the new policy’s outcomes are unknown and potentially inferior to the current protocol, we may want to estimate its effectiveness before real-world deployment.
OPE is a set of techniques for such estimation. It relies on data generated by the existing policy (for example, the current standardized treatment protocols followed by oncologists) to serve as a reference for learning and inference. By using OPE, we can evaluate how well the new personalized treatment policy might perform compared to the existing standard without the need for immediate deployment. This approach helps in making informed decisions about adopting new medical treatments in a cost-effective and safe manner.
Notation
To facilitate the understanding of the various OPE methods and concepts discussed later, here is the notation used throughout the text:
- $N$: Number of samples (bandit setting) or trajectories (sequential setting).
- $T$: Length of the decision horizon of a trajectory.
- $\pi(a|s)$: Probability of taking action $a$ in state $s$ under the evaluation policy.
- $\mu(a|s)$: Probability of taking action $a$ in state $s$ under the behavior policy.
- $s_i$: State of the $i$-th sample; in the sequential setting, $s_{i,t}$ denotes the state at step $t$ of the $i$-th trajectory.
- $a_i$: Action taken in the $i$-th sample; $a_{i,t}$ is its sequential counterpart.
- $r_i$: Reward observed in the $i$-th sample; $r_{i,t}$ is its sequential counterpart.
- $\hat{r}(s, a)$: Predicted reward for state $s$ and action $a$ from a fitted model.
- $\gamma$: Discount factor used in sequential decision settings to account for the time value of rewards.
- $\mathbb{E}[r|s,a]$: Expected reward given state $s$ and action $a$.
OPE Methods in Bandit Settings
A Toy Example
To illustrate OPE methods in bandit settings, we can consider a simple toy dataset generated from a bandit problem with a known reward distribution. Note that all of the following OPE methods implicitly assume the causal structure depicted in Figure 1.
blablabla
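Below is a minimal sketch of the kind of logged bandit dataset the estimators in this section operate on, assuming a small discrete state space, tabular policies, and Bernoulli rewards; the sizes, probabilities, and array names are illustrative choices rather than anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5_000                        # number of logged samples (illustrative)
n_states, n_actions = 3, 2       # illustrative sizes

# Behavior policy mu(a|s): favors action 0 in every state (illustrative numbers).
mu = np.array([[0.7, 0.3],
               [0.6, 0.4],
               [0.8, 0.2]])

# Evaluation policy pi(a|s): favors action 1 everywhere (illustrative numbers).
pi = np.array([[0.2, 0.8],
               [0.2, 0.8],
               [0.2, 0.8]])

# True expected rewards E[r|s,a]; used only to simulate data, unknown to the estimators.
true_reward = np.array([[0.2, 0.8],
                        [0.5, 0.5],
                        [0.9, 0.1]])

# Log data by acting with the behavior policy.
states = rng.integers(n_states, size=N)
actions = np.array([rng.choice(n_actions, p=mu[s]) for s in states])
rewards = rng.binomial(1, true_reward[states, actions]).astype(float)
```

The estimators below consume only the logged triples (states, actions, rewards) together with the two policy tables.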
Importance Sampling
Importance Sampling (IS) is a popular method for OPE in bandit settings. It reweights the observed rewards according to the ratio of the probabilities of actions under the evaluation policy and the behavior policy. Formally, the IS estimator is given by:
\[\hat{V}_{\text{IS}} = \frac{1}{N} \sum_{i=1}^N \frac{\pi(a_i | s_i)}{\mu(a_i | s_i)} r_i\]
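Here is a minimal NumPy sketch of this estimator, assuming tabular policies stored as (n_states, n_actions) arrays; the function name and the tiny hand-made dataset are purely illustrative.

```python
import numpy as np

def is_estimate(states, actions, rewards, pi, mu):
    """Importance sampling estimate of the evaluation policy's value.

    pi and mu are (n_states, n_actions) arrays holding pi(a|s) and mu(a|s).
    """
    weights = pi[states, actions] / mu[states, actions]   # per-sample importance ratios
    return np.mean(weights * rewards)

# Tiny hand-made example: 3 logged samples over 2 states and 2 actions.
states  = np.array([0, 1, 0])
actions = np.array([1, 0, 0])
rewards = np.array([1.0, 0.0, 1.0])
pi = np.array([[0.2, 0.8], [0.5, 0.5]])
mu = np.array([[0.7, 0.3], [0.6, 0.4]])
print(is_estimate(states, actions, rewards, pi, mu))
```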
Direct Methods
Direct Methods (DM) involve fitting a model $\hat{r}(s, a)$ of the expected reward for each state–action pair and then averaging the model’s predictions under the evaluation policy. The DM estimator can be expressed as:
\[\hat{V}_{\text{DM}} = \frac{1}{N} \sum_{i=1}^N \sum_{a} \pi(a | s_i)\, \hat{r}(s_i, a)\]
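A corresponding sketch, assuming the reward model has already been fit and is available as a tabular array r_hat approximating $\hat{r}(s, a)$ (in practice it would come from, e.g., a regression on the logged data):

```python
import numpy as np

def dm_estimate(states, pi, r_hat):
    """Direct method estimate: average the fitted reward model under pi.

    r_hat is an (n_states, n_actions) array approximating E[r|s,a]; how it
    was fit (e.g. a regression on the logged data) is outside this sketch.
    """
    return np.mean(np.sum(pi[states] * r_hat[states], axis=1))

states = np.array([0, 1, 0])
pi     = np.array([[0.2, 0.8], [0.5, 0.5]])
r_hat  = np.array([[0.3, 0.7], [0.4, 0.6]])   # assumed output of some fitted model
print(dm_estimate(states, pi, r_hat))
```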
Doubly Robust
The Doubly Robust (DR) method combines the strengths of IS and DM: it uses the reward model as a baseline and applies importance weighting only to the residual, which keeps the estimator consistent if either the behavior policy $\mu$ or the reward model $\hat{r}$ is correctly specified. The DR estimator is defined as:
\[\hat{V}_{\text{DR}} = \frac{1}{N} \sum_{i=1}^N \left( \sum_{a} \pi(a | s_i)\, \hat{r}(s_i, a) + \frac{\pi(a_i | s_i)}{\mu(a_i | s_i)} \left( r_i - \hat{r}(s_i, a_i) \right) \right)\]
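A sketch that combines the two previous pieces, again assuming tabular pi, mu, and r_hat arrays; all names are illustrative:

```python
import numpy as np

def dr_estimate(states, actions, rewards, pi, mu, r_hat):
    """Doubly robust estimate: direct-method baseline plus an
    importance-weighted correction on the logged actions."""
    dm_term = np.sum(pi[states] * r_hat[states], axis=1)        # sum_a pi(a|s_i) * r_hat(s_i, a)
    weights = pi[states, actions] / mu[states, actions]         # pi(a_i|s_i) / mu(a_i|s_i)
    correction = weights * (rewards - r_hat[states, actions])
    return np.mean(dm_term + correction)

states  = np.array([0, 1, 0])
actions = np.array([1, 0, 0])
rewards = np.array([1.0, 0.0, 1.0])
pi    = np.array([[0.2, 0.8], [0.5, 0.5]])
mu    = np.array([[0.7, 0.3], [0.6, 0.4]])
r_hat = np.array([[0.3, 0.7], [0.4, 0.6]])
print(dr_estimate(states, actions, rewards, pi, mu, r_hat))
```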
OPE Methods in Sequential Decision Settings
Another Toy Dataset
In sequential decision settings, we use a different toy dataset that involves sequences of decisions and rewards. This dataset can be generated from a Markov Decision Process (MDP).
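Here is a minimal sketch of such a logged dataset, assuming a small tabular MDP with a uniform behavior policy; the transition probabilities, reward function, horizon, and array names are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 500, 5                    # number of trajectories and horizon (illustrative)
n_states, n_actions = 3, 2

# Illustrative MDP: P[s, a] is a distribution over next states, rewards are noisy.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
true_reward = rng.uniform(size=(n_states, n_actions))
mu = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform behavior policy

states  = np.zeros((N, T), dtype=int)
actions = np.zeros((N, T), dtype=int)
rewards = np.zeros((N, T))
for i in range(N):
    s = rng.integers(n_states)
    for t in range(T):
        a = rng.choice(n_actions, p=mu[s])
        states[i, t], actions[i, t] = s, a
        rewards[i, t] = rng.normal(true_reward[s, a], 0.1)
        s = rng.choice(n_states, p=P[s, a])
```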
Importance Sampling
In sequential decision settings, IS can be extended to account for the sequential nature of the data: each trajectory’s discounted return is reweighted by the product of its per-step importance ratios. The (trajectory-wise) IS estimator for sequential settings is:
\[\hat{V}_{\text{IS, seq}} = \frac{1}{N} \sum_{i=1}^N \left( \prod_{t=1}^T \frac{\pi(a_{i,t} | s_{i,t})}{\mu(a_{i,t} | s_{i,t})} \right) \sum_{t=1}^T \gamma^{t-1} r_{i,t}\]
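A sketch for logged data stored as (N, T) arrays like the toy MDP dataset above; the discounting follows the $\gamma^{t-1}$ convention used in the formulas here, and the function name is illustrative:

```python
import numpy as np

def seq_is_estimate(states, actions, rewards, pi, mu, gamma=0.95):
    """Trajectory-wise importance sampling for (N, T)-shaped logged arrays.

    Each trajectory's discounted return is weighted by the product of its
    per-step ratios pi(a_t|s_t) / mu(a_t|s_t).
    """
    N, T = rewards.shape
    ratios = pi[states, actions] / mu[states, actions]    # (N, T) per-step ratios
    traj_weights = np.prod(ratios, axis=1)                # (N,) cumulative weights
    discounts = gamma ** np.arange(T)                     # gamma^(t-1) for t = 1..T
    returns = np.sum(discounts * rewards, axis=1)         # (N,) discounted returns
    return np.mean(traj_weights * returns)
```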
Direct Methods
Direct Methods in sequential settings involve learning a model $\hat{Q}^{\pi}(s, a)$ of the expected discounted cumulative reward obtained by taking action $a$ in state $s$ and following the evaluation policy thereafter. The policy’s value is then estimated by averaging this model under $\pi$ at the initial states of the logged trajectories:
\[\hat{V}_{\text{DM, seq}} = \frac{1}{N} \sum_{i=1}^N \sum_{a} \pi(a | s_{i,1})\, \hat{Q}^{\pi}(s_{i,1}, a)\]
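A sketch assuming a tabular estimate q_hat of $\hat{Q}^{\pi}$ (shape n_states × n_actions) has already been fit, for example by fitted Q evaluation; the fitting step itself is not shown:

```python
import numpy as np

def seq_dm_estimate(initial_states, pi, q_hat):
    """Sequential direct method: average the fitted Q-function under pi at
    the initial state of each logged trajectory.

    q_hat is an (n_states, n_actions) array assumed to approximate Q^pi(s, a);
    fitting it (e.g. via fitted Q evaluation) is not shown here.
    """
    return np.mean(np.sum(pi[initial_states] * q_hat[initial_states], axis=1))

# With the toy MDP arrays above this would be called as:
# seq_dm_estimate(states[:, 0], pi, q_hat)
```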
Doubly Robust
The DR method can also be applied in sequential settings, combining the cumulative importance weights with the fitted model $\hat{Q}^{\pi}$ to correct for bias and reduce variance. Writing $\rho_{i,1:t} = \prod_{t'=1}^{t} \frac{\pi(a_{i,t'} | s_{i,t'})}{\mu(a_{i,t'} | s_{i,t'})}$ (with $\rho_{i,1:0} = 1$) and $\hat{V}^{\pi}(s) = \sum_a \pi(a | s)\, \hat{Q}^{\pi}(s, a)$, the DR estimator for sequential settings is:
\[\hat{V}_{\text{DR, seq}} = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \gamma^{t-1} \left( \rho_{i,1:t} \left( r_{i,t} - \hat{Q}^{\pi}(s_{i,t}, a_{i,t}) \right) + \rho_{i,1:t-1}\, \hat{V}^{\pi}(s_{i,t}) \right)\]
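A sketch implementing the estimator above for (N, T) arrays, again assuming a tabular q_hat approximating $\hat{Q}^{\pi}$; all names are illustrative:

```python
import numpy as np

def seq_dr_estimate(states, actions, rewards, pi, mu, q_hat, gamma=0.95):
    """Sequential doubly robust estimate for (N, T)-shaped logged arrays,
    assuming a tabular q_hat approximating Q^pi(s, a)."""
    N, T = rewards.shape
    ratios = pi[states, actions] / mu[states, actions]                 # (N, T) per-step ratios
    rho = np.cumprod(ratios, axis=1)                                   # rho_{1:t}
    rho_prev = np.concatenate([np.ones((N, 1)), rho[:, :-1]], axis=1)  # rho_{1:t-1}, rho_{1:0} = 1
    v_hat = np.sum(pi[states] * q_hat[states], axis=2)                 # V_hat^pi(s_t) = sum_a pi(a|s_t) q_hat(s_t, a)
    q_sa = q_hat[states, actions]                                      # Q_hat^pi(s_t, a_t)
    discounts = gamma ** np.arange(T)                                  # gamma^(t-1) for t = 1..T
    per_step = rho * (rewards - q_sa) + rho_prev * v_hat
    return np.mean(np.sum(discounts * per_step, axis=1))
```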
Some Important Notions of OPE Estimators
Bias
The bias of an OPE estimator is the difference between its expected value (over the randomness in the logged data) and the true value of the evaluation policy. TODO: add visuals
Variance
The variance of an OPE estimator measures how much its estimate fluctuates across different logged datasets; importance-sampling-based estimators in particular can have very high variance when the evaluation and behavior policies differ substantially.
Efficiency Bounds
The efficiency bound is the smallest asymptotic variance that any (regular) estimator can achieve for a given OPE problem; it serves as a benchmark for how far a particular estimator is from optimal.
Why Are All OPE Estimators Practically Biased?
Curse of Horizon
The curse of horizon refers to the difficulty of OPE over long decision horizons: the number of possible trajectories grows exponentially with the horizon, and the cumulative importance weight used by IS-style estimators is a product of $T$ per-step ratios whose variance can also grow exponentially, making it difficult to obtain accurate estimates.
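For intuition, the small simulation below (with arbitrary two-action policies) shows how the variance of the cumulative importance weight used by trajectory-wise IS grows with the horizon:

```python
import numpy as np

rng = np.random.default_rng(0)
pi_probs = np.array([0.8, 0.2])   # evaluation policy over two actions (arbitrary)
mu_probs = np.array([0.5, 0.5])   # behavior policy (arbitrary)

for T in [1, 5, 10, 20, 40]:
    # Sample actions from mu and form the cumulative ratio over a horizon of length T.
    actions = rng.choice(2, p=mu_probs, size=(100_000, T))
    weights = np.prod(pi_probs[actions] / mu_probs[actions], axis=1)
    print(T, weights.var())
```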
Model-Free Methods Aren’t Really Model-Free
Many so-called model-free methods still rely on implicit modeling assumptions, such as a correctly specified (or well-estimated) behavior policy or a particular function-approximation class, and violations of these assumptions introduce bias into the evaluation.