A Practical Guide to Off-Policy Evaluation (In Progress)
What Is Off-Policy Evaluation?
Off-Policy Evaluation (OPE) is a critical tool for both one-shot decision making (bandits) and sequential decision making (reinforcement learning) when only observational data is available, i.e., no active experimentation is required. It involves assessing the performance of a given policy using data that was typically generated by a different policy (known as the logging or behavior policy). This is especially important in scenarios where deploying a new policy to gather data is expensive, risky, or impractical.
Consider the scenario of personalized cancer treatment. Suppose researchers propose a new policy for recommending individualized chemotherapy dosages based on each patient’s genetic profile and medical history. Ideally, we would deploy this new policy in clinical practice and compare its effectiveness to the current standardized chemotherapy protocols (active experimentation, i.e., RCTs or A/B tests). However, because the new policy’s outcomes are unknown and potentially inferior to the current protocol, we may want to estimate its effectiveness before real-world deployment.
OPE is a set of techniques for such estimation. It relies on data generated by the existing policy (for example, the current standardized treatment protocols followed by oncologists) to serve as a reference for learning and inference. By using OPE, we can evaluate how well the new personalized treatment policy might perform compared to the existing standard without the need for immediate deployment. This approach helps in making informed decisions about adopting new medical treatments in a cost-effective and safe manner.
Notation
To facilitate the understanding of the various OPE methods and concepts discussed later, here is the notation used throughout the text:
- $N$: Number of samples (bandit setting) or trajectories (sequential setting).
- $T$: Length of the decision horizon of a trajectory.
- $\pi(a|s)$: Probability of taking action $a$ in state $s$ under the evaluation policy.
- $\mu(a|s)$: Probability of taking action $a$ in state $s$ under the behavior policy.
- $s_i$: State of the $i$-th sample; in the sequential setting, $s_{i,t}$ denotes the state at step $t$ of the $i$-th trajectory.
- $a_i$: Action taken in the $i$-th sample; $a_{i,t}$ is its sequential counterpart.
- $r_i$: Reward observed in the $i$-th sample; $r_{i,t}$ is its sequential counterpart.
- $\hat{r}(s, a)$: Predicted reward for state $s$ and action $a$ from a fitted model.
- $\gamma$: Discount factor used in sequential decision settings to account for the time value of rewards.
- $\mathbb{E}[r|s,a]$: Expected reward given state $s$ and action $a$.
OPE Methods in Bandit Settings
A Toy Example
To illustrate OPE methods in bandit settings, we can consider a simple toy dataset generated from a bandit problem with a known reward distribution. Note that all of the following OPE methods implicitly assume the causal structure depicted in Figure 1.
blablabla
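Below is a minimal sketch of the kind of logged bandit dataset the estimators in this section operate on, assuming a small discrete state space, tabular policies, and Bernoulli rewards; the sizes, probabilities, and array names are illustrative choices rather than anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5_000                        # number of logged samples (illustrative)
n_states, n_actions = 3, 2       # illustrative sizes

# Behavior policy mu(a|s): favors action 0 in every state (illustrative numbers).
mu = np.array([[0.7, 0.3],
               [0.6, 0.4],
               [0.8, 0.2]])

# Evaluation policy pi(a|s): favors action 1 everywhere (illustrative numbers).
pi = np.array([[0.2, 0.8],
               [0.2, 0.8],
               [0.2, 0.8]])

# True expected rewards E[r|s,a]; used only to simulate data, unknown to the estimators.
true_reward = np.array([[0.2, 0.8],
                        [0.5, 0.5],
                        [0.9, 0.1]])

# Log data by acting with the behavior policy.
states = rng.integers(n_states, size=N)
actions = np.array([rng.choice(n_actions, p=mu[s]) for s in states])
rewards = rng.binomial(1, true_reward[states, actions]).astype(float)
```

The estimators below consume only the logged triples (states, actions, rewards) together with the two policy tables.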
Importance Sampling
Importance Sampling (IS) is a popular method for OPE in bandit settings. It reweights the observed rewards according to the ratio of the probabilities of actions under the evaluation policy and the behavior policy. Formally, the IS estimator is given by:
\[\hat{V}_{\text{IS}} = \frac{1}{N} \sum_{i=1}^N \frac{\pi(a_i | s_i)}{\mu(a_i | s_i)} r_i\]
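Here is a minimal NumPy sketch of this estimator, assuming tabular policies stored as (n_states, n_actions) arrays; the function name and the tiny hand-made dataset are purely illustrative.

```python
import numpy as np

def is_estimate(states, actions, rewards, pi, mu):
    """Importance sampling estimate of the evaluation policy's value.

    pi and mu are (n_states, n_actions) arrays holding pi(a|s) and mu(a|s).
    """
    weights = pi[states, actions] / mu[states, actions]   # per-sample importance ratios
    return np.mean(weights * rewards)

# Tiny hand-made example: 3 logged samples over 2 states and 2 actions.
states  = np.array([0, 1, 0])
actions = np.array([1, 0, 0])
rewards = np.array([1.0, 0.0, 1.0])
pi = np.array([[0.2, 0.8], [0.5, 0.5]])
mu = np.array([[0.7, 0.3], [0.6, 0.4]])
print(is_estimate(states, actions, rewards, pi, mu))
```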
Direct Methods
Direct Methods (DM) involve fitting a model $\hat{r}(s, a)$ of the expected reward for each state–action pair and then averaging the model’s predictions under the evaluation policy. The DM estimator can be expressed as:
\[\hat{V}_{\text{DM}} = \frac{1}{N} \sum_{i=1}^N \sum_{a} \pi(a | s_i)\, \hat{r}(s_i, a)\]
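A corresponding sketch, assuming the reward model has already been fit and is available as a tabular array r_hat approximating $\hat{r}(s, a)$ (in practice it would come from, e.g., a regression on the logged data):

```python
import numpy as np

def dm_estimate(states, pi, r_hat):
    """Direct method estimate: average the fitted reward model under pi.

    r_hat is an (n_states, n_actions) array approximating E[r|s,a]; how it
    was fit (e.g. a regression on the logged data) is outside this sketch.
    """
    return np.mean(np.sum(pi[states] * r_hat[states], axis=1))

states = np.array([0, 1, 0])
pi     = np.array([[0.2, 0.8], [0.5, 0.5]])
r_hat  = np.array([[0.3, 0.7], [0.4, 0.6]])   # assumed output of some fitted model
print(dm_estimate(states, pi, r_hat))
```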
Doubly Robust
The Doubly Robust (DR) method combines the strengths of IS and DM: it uses the reward model as a baseline and applies importance weighting only to the residual, which keeps the estimator consistent if either the behavior policy $\mu$ or the reward model $\hat{r}$ is correctly specified. The DR estimator is defined as:
\[\hat{V}_{\text{DR}} = \frac{1}{N} \sum_{i=1}^N \left( \sum_{a} \pi(a | s_i)\, \hat{r}(s_i, a) + \frac{\pi(a_i | s_i)}{\mu(a_i | s_i)} \left( r_i - \hat{r}(s_i, a_i) \right) \right)\]
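A sketch that combines the two previous pieces, again assuming tabular pi, mu, and r_hat arrays; all names are illustrative:

```python
import numpy as np

def dr_estimate(states, actions, rewards, pi, mu, r_hat):
    """Doubly robust estimate: direct-method baseline plus an
    importance-weighted correction on the logged actions."""
    dm_term = np.sum(pi[states] * r_hat[states], axis=1)        # sum_a pi(a|s_i) * r_hat(s_i, a)
    weights = pi[states, actions] / mu[states, actions]         # pi(a_i|s_i) / mu(a_i|s_i)
    correction = weights * (rewards - r_hat[states, actions])
    return np.mean(dm_term + correction)

states  = np.array([0, 1, 0])
actions = np.array([1, 0, 0])
rewards = np.array([1.0, 0.0, 1.0])
pi    = np.array([[0.2, 0.8], [0.5, 0.5]])
mu    = np.array([[0.7, 0.3], [0.6, 0.4]])
r_hat = np.array([[0.3, 0.7], [0.4, 0.6]])
print(dr_estimate(states, actions, rewards, pi, mu, r_hat))
```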
OPE Methods in Sequential Decision Settings
Another Toy Dataset
In sequential decision settings, we use a different toy dataset that involves sequences of decisions and rewards. This dataset can be generated from a Markov Decision Process (MDP).
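Here is a minimal sketch of such a logged dataset, assuming a small tabular MDP with a uniform behavior policy; the transition probabilities, reward function, horizon, and array names are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 500, 5                    # number of trajectories and horizon (illustrative)
n_states, n_actions = 3, 2

# Illustrative MDP: P[s, a] is a distribution over next states, rewards are noisy.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
true_reward = rng.uniform(size=(n_states, n_actions))
mu = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform behavior policy

states  = np.zeros((N, T), dtype=int)
actions = np.zeros((N, T), dtype=int)
rewards = np.zeros((N, T))
for i in range(N):
    s = rng.integers(n_states)
    for t in range(T):
        a = rng.choice(n_actions, p=mu[s])
        states[i, t], actions[i, t] = s, a
        rewards[i, t] = rng.normal(true_reward[s, a], 0.1)
        s = rng.choice(n_states, p=P[s, a])
```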
Importance Sampling
In sequential decision settings, IS can be extended to account for the sequential nature of the data: each trajectory’s discounted return is reweighted by the product of its per-step importance ratios. The (trajectory-wise) IS estimator for sequential settings is:
\[\hat{V}_{\text{IS, seq}} = \frac{1}{N} \sum_{i=1}^N \left( \prod_{t=1}^T \frac{\pi(a_{i,t} | s_{i,t})}{\mu(a_{i,t} | s_{i,t})} \right) \sum_{t=1}^T \gamma^{t-1} r_{i,t}\]
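A sketch for logged data stored as (N, T) arrays like the toy MDP dataset above; the discounting follows the $\gamma^{t-1}$ convention used in the formulas here, and the function name is illustrative:

```python
import numpy as np

def seq_is_estimate(states, actions, rewards, pi, mu, gamma=0.95):
    """Trajectory-wise importance sampling for (N, T)-shaped logged arrays.

    Each trajectory's discounted return is weighted by the product of its
    per-step ratios pi(a_t|s_t) / mu(a_t|s_t).
    """
    N, T = rewards.shape
    ratios = pi[states, actions] / mu[states, actions]    # (N, T) per-step ratios
    traj_weights = np.prod(ratios, axis=1)                # (N,) cumulative weights
    discounts = gamma ** np.arange(T)                     # gamma^(t-1) for t = 1..T
    returns = np.sum(discounts * rewards, axis=1)         # (N,) discounted returns
    return np.mean(traj_weights * returns)
```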
Direct Methods
Direct Methods in sequential settings involve learning a model $\hat{Q}^{\pi}(s, a)$ of the expected discounted cumulative reward obtained by taking action $a$ in state $s$ and following the evaluation policy thereafter. The policy’s value is then estimated by averaging this model under $\pi$ at the initial states of the logged trajectories:
\[\hat{V}_{\text{DM, seq}} = \frac{1}{N} \sum_{i=1}^N \sum_{a} \pi(a | s_{i,1})\, \hat{Q}^{\pi}(s_{i,1}, a)\]
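A sketch assuming a tabular estimate q_hat of $\hat{Q}^{\pi}$ (shape n_states × n_actions) has already been fit, for example by fitted Q evaluation; the fitting step itself is not shown:

```python
import numpy as np

def seq_dm_estimate(initial_states, pi, q_hat):
    """Sequential direct method: average the fitted Q-function under pi at
    the initial state of each logged trajectory.

    q_hat is an (n_states, n_actions) array assumed to approximate Q^pi(s, a);
    fitting it (e.g. via fitted Q evaluation) is not shown here.
    """
    return np.mean(np.sum(pi[initial_states] * q_hat[initial_states], axis=1))

# With the toy MDP arrays above this would be called as:
# seq_dm_estimate(states[:, 0], pi, q_hat)
```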
Doubly Robust
The DR method can also be applied in sequential settings, combining the cumulative importance weights with the fitted model $\hat{Q}^{\pi}$ to correct for bias and reduce variance. Writing $\rho_{i,1:t} = \prod_{t'=1}^{t} \frac{\pi(a_{i,t'} | s_{i,t'})}{\mu(a_{i,t'} | s_{i,t'})}$ (with $\rho_{i,1:0} = 1$) and $\hat{V}^{\pi}(s) = \sum_a \pi(a | s)\, \hat{Q}^{\pi}(s, a)$, the DR estimator for sequential settings is:
\[\hat{V}_{\text{DR, seq}} = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \gamma^{t-1} \left( \rho_{i,1:t} \left( r_{i,t} - \hat{Q}^{\pi}(s_{i,t}, a_{i,t}) \right) + \rho_{i,1:t-1}\, \hat{V}^{\pi}(s_{i,t}) \right)\]
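A sketch implementing the estimator above for (N, T) arrays, again assuming a tabular q_hat approximating $\hat{Q}^{\pi}$; all names are illustrative:

```python
import numpy as np

def seq_dr_estimate(states, actions, rewards, pi, mu, q_hat, gamma=0.95):
    """Sequential doubly robust estimate for (N, T)-shaped logged arrays,
    assuming a tabular q_hat approximating Q^pi(s, a)."""
    N, T = rewards.shape
    ratios = pi[states, actions] / mu[states, actions]                 # (N, T) per-step ratios
    rho = np.cumprod(ratios, axis=1)                                   # rho_{1:t}
    rho_prev = np.concatenate([np.ones((N, 1)), rho[:, :-1]], axis=1)  # rho_{1:t-1}, rho_{1:0} = 1
    v_hat = np.sum(pi[states] * q_hat[states], axis=2)                 # V_hat^pi(s_t) = sum_a pi(a|s_t) q_hat(s_t, a)
    q_sa = q_hat[states, actions]                                      # Q_hat^pi(s_t, a_t)
    discounts = gamma ** np.arange(T)                                  # gamma^(t-1) for t = 1..T
    per_step = rho * (rewards - q_sa) + rho_prev * v_hat
    return np.mean(np.sum(discounts * per_step, axis=1))
```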
Some Important Notions of OPE Estimators
Bias
The bias of an OPE estimator is the difference between its expected value (over the randomness in the logged data) and the true value of the evaluation policy. TODO: add visuals
Variance
The variance of an OPE estimator measures how much its estimate fluctuates across different logged datasets; importance-sampling-based estimators in particular can have very high variance when the evaluation and behavior policies differ substantially.
Efficiency Bounds
The efficiency bound is the smallest asymptotic variance that any (regular) estimator can achieve for a given OPE problem; it serves as a benchmark for how far a particular estimator is from optimal.
Why Are All OPE Estimators Practically Biased?
Curse of Horizon
The curse of horizon refers to the difficulty of OPE over long decision horizons: the number of possible trajectories grows exponentially with the horizon, and the cumulative importance weight used by IS-style estimators is a product of $T$ per-step ratios whose variance can also grow exponentially, making it difficult to obtain accurate estimates.
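For intuition, the small simulation below (with arbitrary two-action policies) shows how the variance of the cumulative importance weight used by trajectory-wise IS grows with the horizon:

```python
import numpy as np

rng = np.random.default_rng(0)
pi_probs = np.array([0.8, 0.2])   # evaluation policy over two actions (arbitrary)
mu_probs = np.array([0.5, 0.5])   # behavior policy (arbitrary)

for T in [1, 5, 10, 20, 40]:
    # Sample actions from mu and form the cumulative ratio over a horizon of length T.
    actions = rng.choice(2, p=mu_probs, size=(100_000, T))
    weights = np.prod(pi_probs[actions] / mu_probs[actions], axis=1)
    print(T, weights.var())
```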
Model-Free Methods Aren’t Really Model-Free
Many so-called model-free methods still rely on implicit modeling assumptions, such as a correctly specified (or well-estimated) behavior policy or a particular function-approximation class, and violations of these assumptions introduce bias into the evaluation.