Commit 288a5c9

Merge pull request #151 from st-tech/update-version
Update version to 0.5.2
2 parents 73e065c + ff82deb commit 288a5c9

Some content is hidden: large commits have some content hidden by default, so only a subset of the changed files appears below.

50 files changed: +2172 −4512 lines changed

examples/README.md

Lines changed: 2 additions & 2 deletions
@@ -1,10 +1,10 @@
 # Open Bandit Pipeline Examples
 
-This page contains a list of example codes written with the Open Bandit Pipeline.
+This page contains a list of examples written with Open Bandit Pipeline.
 
 - [`obd/`](./obd/): example implementations for evaluating standard off-policy estimators with the small sample Open Bandit Dataset.
 - [`synthetic/`](./synthetic/): example implementations for evaluating several off-policy estimators with synthetic bandit datasets.
 - [`multiclass/`](./multiclass/): example implementations for evaluating several off-policy estimators with multi-class classification datasets.
 - [`online/`](./online/): example implementations for evaluating Replay Method with online bandit algorithms.
 - [`opl/`](./opl/): example implementations for comparing the performance of several off-policy learners with synthetic bandit datasets.
-- [`quickstart/`](./quickstart/): some quickstart notebooks to guide the usage of the Open Bandit Pipeline.
+- [`quickstart/`](./quickstart/): some quickstart notebooks to guide the usage of Open Bandit Pipeline.

examples/multiclass/README.md

Lines changed: 32 additions & 24 deletions
@@ -1,14 +1,14 @@
-# Example with Multi-class Classification Data
+# Example Experiment with Multi-class Classification Data
 
 
 ## Description
 
-Here, we use multi-class classification datasets to evaluate OPE estimators.
-Specifically, we evaluate the estimation performances of well-known off-policy estimators using the ground-truth policy value of an evaluation policy calculable with multi-class classification data.
+We use multi-class classification datasets to evaluate OPE estimators. Specifically, we evaluate the estimation performance of some well-known OPE estimators using the ground-truth policy value of an evaluation policy calculable with multi-class classification data.
 
 ## Evaluating Off-Policy Estimators
 
-In the following, we evaluate the estimation performances of
+In the following, we evaluate the estimation performance of
+
 - Direct Method (DM)
 - Inverse Probability Weighting (IPW)
 - Self-Normalized Inverse Probability Weighting (SNIPW)
@@ -17,12 +17,12 @@ In the following, we evaluate the estimation performances of
 - Switch Doubly Robust (Switch-DR)
 - Doubly Robust with Optimistic Shrinkage (DRos)
 
-For Switch-DR and DRos, we try some different values of hyperparameters.
+For Switch-DR and DRos, we tune the built-in hyperparameters using SLOPE (Su et al., 2020; Tucker et al., 2021), a data-driven hyperparameter tuning method for OPE estimators.
 See [our documentation](https://zr-obp.readthedocs.io/en/latest/estimators.html) for the details about these estimators.
 
 ### Files
 - [`./evaluate_off_policy_estimators.py`](./evaluate_off_policy_estimators.py) implements the evaluation of OPE estimators using multi-class classification data.
-- [`./conf/hyperparams.yaml`](./conf/hyperparams.yaml) defines hyperparameters of some machine learning methods used to define regression model.
+- [`./conf/hyperparams.yaml`](./conf/hyperparams.yaml) defines hyperparameters of some ML methods used to define regression model.
 
 ### Scripts
 
@@ -50,38 +50,46 @@ python evaluate_off_policy_estimators.py\
 - `$base_model_for_reg_model` specifies the base ML model for defining regression model and should be one of "logistic_regression", "random_forest", or "lightgbm".
 - `$n_jobs` is the maximum number of concurrently running jobs.
 
-For example, the following command compares the estimation performances (relative estimation error; relative-ee) of the OPE estimators using the digits dataset.
+For example, the following command compares the estimation performance (relative estimation error; relative-ee) of the OPE estimators using the digits dataset.
 
 ```bash
 python evaluate_off_policy_estimators.py\
-    --n_runs 20\
+    --n_runs 30\
     --dataset_name digits\
     --eval_size 0.7\
     --base_model_for_behavior_policy logistic_regression\
-    --alpha_b 0.8\
-    --base_model_for_evaluation_policy logistic_regression\
+    --alpha_b 0.4\
+    --base_model_for_evaluation_policy random_forest\
     --alpha_e 0.9\
-    --base_model_for_reg_model logistic_regression\
+    --base_model_for_reg_model lightgbm\
    --n_jobs -1\
    --random_state 12345
 
 # relative-ee of OPE estimators and their standard deviations (lower is better).
-# It appears that the performances of some OPE estimators depend on the choice of their hyperparameters.
 # =============================================
 # random_state=12345
 # ---------------------------------------------
-#                               mean       std
-# dm                        0.093439  0.015391
-# ipw                       0.013286  0.008496
-# snipw                     0.006797  0.004094
-# dr                        0.007780  0.004492
-# sndr                      0.007210  0.004089
-# switch-dr (lambda=1)      0.173282  0.020025
-# switch-dr (lambda=100)    0.007780  0.004492
-# dr-os (lambda=1)          0.079629  0.014008
-# dr-os (lambda=100)        0.008031  0.004634
+#                 mean       std
+# dm          0.436541  0.017629
+# ipw         0.030288  0.024506
+# snipw       0.022764  0.017917
+# dr          0.016156  0.012679
+# sndr        0.022082  0.016865
+# switch-dr   0.034657  0.018575
+# dr-os       0.015868  0.012537
 # =============================================
 ```
 
-The above result can change with different situations.
-You can try the evaluation of OPE with other experimental settings easily.
+The above result can change with different situations. You can try the evaluation of OPE with other experimental settings easily.
+
+
+## References
+
+- Yi Su, Pavithra Srinath, Akshay Krishnamurthy. [Adaptive Estimator Selection for Off-Policy Evaluation](https://arxiv.org/abs/2002.07729), ICML2020.
+- Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, Miroslav Dudík. [Doubly Robust Off-policy Evaluation with Shrinkage](https://arxiv.org/abs/1907.09623), ICML2020.
+- George Tucker and Jonathan Lee. [Improved Estimator Selection for Off-Policy Evaluation](https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf), Workshop on Reinforcement Learning Theory at ICML2021.
+- Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudik. [Optimal and Adaptive Off-policy Evaluation in Contextual Bandits](https://arxiv.org/abs/1612.01205), ICML2017.
+- Miroslav Dudik, John Langford, Lihong Li. [Doubly Robust Policy Evaluation and Learning](https://arxiv.org/abs/1103.4601). ICML2011.
+- Yuta Saito, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita. [Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation](https://arxiv.org/abs/2008.07146). NeurIPS2021 Track on Datasets and Benchmarks.

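The README change above replaces fixed hyperparameter values for Switch-DR and DRos with SLOPE-based tuning. As a minimal sketch of what the tuned estimators look like in use (not part of this commit; it assumes obp >= 0.5.2, a synthetic dataset, a uniform-random evaluation policy, and a logistic-regression reward model, all of which are illustrative choices rather than anything prescribed by the diff):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

from obp.dataset import SyntheticBanditDataset, logistic_reward_function
from obp.ope import (
    DoublyRobustWithShrinkageTuning,
    OffPolicyEvaluation,
    RegressionModel,
    SwitchDoublyRobustTuning,
)

# synthetic logged bandit feedback (stands in for the multi-class reduction used in the example)
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    reward_function=logistic_reward_function,
    random_state=12345,
)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)

# a deliberately simple evaluation policy: uniform over actions, shape (n_rounds, n_actions, 1)
n_rounds, n_actions = bandit_feedback["n_rounds"], dataset.n_actions
action_dist = np.ones((n_rounds, n_actions, 1)) / n_actions

# reward regression model required by the model-dependent estimators
regression_model = RegressionModel(
    n_actions=n_actions,
    base_model=LogisticRegression(max_iter=1000, random_state=12345),
)
estimated_rewards = regression_model.fit_predict(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    random_state=12345,
)

# the tuning variants take a grid of candidate lambdas and select one from the data
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[
        SwitchDoublyRobustTuning(lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]),
        DoublyRobustWithShrinkageTuning(
            lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]
        ),
    ],
)
print(
    ope.estimate_policy_values(
        action_dist=action_dist,
        estimated_rewards_by_reg_model=estimated_rewards,
    )
)
```

Because each tuning estimator picks a single lambda from the candidate grid, the example output above now reports one `switch-dr` and one `dr-os` row instead of the per-lambda rows shown in the old README.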
examples/multiclass/evaluate_off_policy_estimators.py

Lines changed: 17 additions & 16 deletions
@@ -17,13 +17,13 @@
 from obp.dataset import MultiClassToBanditReduction
 from obp.ope import DirectMethod
 from obp.ope import DoublyRobust
-from obp.ope import DoublyRobustWithShrinkage
+from obp.ope import DoublyRobustWithShrinkageTuning
 from obp.ope import InverseProbabilityWeighting
 from obp.ope import OffPolicyEvaluation
 from obp.ope import RegressionModel
 from obp.ope import SelfNormalizedDoublyRobust
 from obp.ope import SelfNormalizedInverseProbabilityWeighting
-from obp.ope import SwitchDoublyRobust
+from obp.ope import SwitchDoublyRobustTuning
 
 
 # hyperparameters of the regression model used in model dependent OPE estimators
@@ -50,10 +50,10 @@
     SelfNormalizedInverseProbabilityWeighting(),
     DoublyRobust(),
     SelfNormalizedDoublyRobust(),
-    SwitchDoublyRobust(lambda_=1.0, estimator_name="switch-dr (lambda=1)"),
-    SwitchDoublyRobust(lambda_=100.0, estimator_name="switch-dr (lambda=100)"),
-    DoublyRobustWithShrinkage(lambda_=1.0, estimator_name="dr-os (lambda=1)"),
-    DoublyRobustWithShrinkage(lambda_=100.0, estimator_name="dr-os (lambda=100)"),
+    SwitchDoublyRobustTuning(lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]),
+    DoublyRobustWithShrinkageTuning(
+        lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]
+    ),
 ]
 
 if __name__ == "__main__":
@@ -161,7 +161,7 @@ def process(i: int):
         ground_truth_policy_value = dataset.calc_ground_truth_policy_value(
             action_dist=action_dist
         )
-        # estimate the mean reward function of the evaluation set of multi-class classification data with ML model
+        # estimate the reward function of the evaluation set of multi-class classification data with ML model
         regression_model = RegressionModel(
             n_actions=dataset.n_actions,
             base_model=base_model_dict[base_model_for_reg_model](
@@ -180,34 +180,35 @@ def process(i: int):
             bandit_feedback=bandit_feedback,
             ope_estimators=ope_estimators,
         )
-        relative_ee_i = ope.evaluate_performance_of_estimators(
+        metric_i = ope.evaluate_performance_of_estimators(
             ground_truth_policy_value=ground_truth_policy_value,
             action_dist=action_dist,
             estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
+            metric="relative-ee",
         )
 
-        return relative_ee_i
+        return metric_i
 
     processed = Parallel(
         n_jobs=n_jobs,
         verbose=50,
     )([delayed(process)(i) for i in np.arange(n_runs)])
-    relative_ee_dict = {est.estimator_name: dict() for est in ope_estimators}
-    for i, relative_ee_i in enumerate(processed):
+    metric_dict = {est.estimator_name: dict() for est in ope_estimators}
+    for i, metric_i in enumerate(processed):
         for (
             estimator_name,
             relative_ee_,
-        ) in relative_ee_i.items():
-            relative_ee_dict[estimator_name][i] = relative_ee_
-    relative_ee_df = DataFrame(relative_ee_dict).describe().T.round(6)
+        ) in metric_i.items():
+            metric_dict[estimator_name][i] = relative_ee_
+    result_df = DataFrame(metric_dict).describe().T.round(6)
 
     print("=" * 45)
    print(f"random_state={random_state}")
    print("-" * 45)
-    print(relative_ee_df[["mean", "std"]])
+    print(result_df[["mean", "std"]])
     print("=" * 45)
 
     # save results of the evaluation of off-policy estimators in './logs' directory.
     log_path = Path(f"./logs/{dataset_name}")
     log_path.mkdir(exist_ok=True, parents=True)
-    relative_ee_df.to_csv(log_path / "relative_ee_of_ope_estimators.csv")
+    result_df.to_csv(log_path / "evaluation_of_ope_results.csv")

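The new `metric="relative-ee"` argument above makes the evaluation metric explicit. As a reminder of what that metric computes (a hand-rolled helper for illustration only; obp computes this internally inside `evaluate_performance_of_estimators`):

```python
# Relative estimation error (relative-ee) of an OPE estimate V_hat against the
# ground-truth policy value V: |V_hat - V| / |V|. Lower is better.
def relative_estimation_error(v_hat: float, v_true: float) -> float:
    return abs(v_hat - v_true) / abs(v_true)


# illustrative numbers only
print(relative_estimation_error(v_hat=0.0042, v_true=0.0048))  # 0.125
```

In the script, this quantity is returned per estimator for each run, and the runs are aggregated with `DataFrame(...).describe()`, which is where the `mean` and `std` columns in the README output come from.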
examples/obd/README.md

Lines changed: 40 additions & 14 deletions
@@ -1,16 +1,27 @@
-# Example with the Open Bandit Dataset (OBD)
+# Example Experiment with Open Bandit Dataset
 
 ## Description
 
-Here, we use the open bandit dataset and pipeline to implement and evaluate OPE. Specifically, we evaluate the estimation performances of well-known off-policy estimators using the ground-truth policy value of an evaluation policy, which is calculable with our data using on-policy estimation.
+We use Open Bandit Dataset to implement the evaluation of OPE. Specifically, we evaluate the estimation performance of some well-known OPE estimators using the on-policy policy value of an evaluation policy, which is calculable with the dataset.
 
 ## Evaluating Off-Policy Estimators
 
-We evaluate the estimation performances of off-policy estimators, including Direct Method (DM), Inverse Probability Weighting (IPW), and Doubly Robust (DR).
+In the following, we evaluate the estimation performance of
+
+- Direct Method (DM)
+- Inverse Probability Weighting (IPW)
+- Self-Normalized Inverse Probability Weighting (SNIPW)
+- Doubly Robust (DR)
+- Self-Normalized Doubly Robust (SNDR)
+- Switch Doubly Robust (Switch-DR)
+- Doubly Robust with Optimistic Shrinkage (DRos)
+
+For Switch-DR and DRos, we tune the built-in hyperparameters using SLOPE, a data-driven hyperparameter tuning method for OPE estimators.
+See [our documentation](https://zr-obp.readthedocs.io/en/latest/estimators.html) for the details about these estimators.
 
 ### Files
-- [`./evaluate_off_policy_estimators.py`](./evaluate_off_policy_estimators.py) implements the evaluation of OPE estimators.
-- [`.conf/hyperparams.yaml`](./conf/hyperparams.yaml) defines hyperparameters of some machine learning models used as the regression model in model dependent estimators (such as DM and DR).
+- [`./evaluate_off_policy_estimators.py`](./evaluate_off_policy_estimators.py) implements the evaluation of OPE estimators using Open Bandit Dataset.
+- [`.conf/hyperparams.yaml`](./conf/hyperparams.yaml) defines hyperparameters of some ML models used as the regression model in model dependent estimators (such as DM and DR).
 
 ### Scripts
 
@@ -34,28 +45,43 @@ They should be either 'bts' or 'random'.
 - `$n_sim_to_compute_action_dist` is the number of monte carlo simulation to compute the action distribution of a given evaluation policy.
 - `$n_jobs` is the maximum number of concurrently running jobs.
 
-For example, the following command compares the estimation performances of the three OPE estimators by using Bernoulli TS as evaluation policy and Random as behavior policy in "All" campaign.
+For example, the following command compares the estimation performance of the three OPE estimators by using Bernoulli TS as evaluation policy and Random as behavior policy in "All" campaign.
 
 ```bash
 python evaluate_off_policy_estimators.py\
-    --n_runs 20\
+    --n_runs 30\
     --base_model logistic_regression\
     --evaluation_policy bts\
    --behavior_policy random\
    --campaign all\
    --n_jobs -1
 
 # relative estimation errors of OPE estimators and their standard deviations.
-# our evaluation of OPE procedure suggests that DM performs best among the three OPE estimators, because it has low variance property.
-# (Note that this result is with the small sample data, and please use the full size data for a more reasonable experiment)
 # ==============================
 # random_state=12345
 # ------------------------------
-#          mean       std
-# dm   0.180269  0.114716
-# ipw  0.333113  0.350425
-# dr   0.304422  0.347866
+#                 mean       std
+# dm          0.156876  0.109898
+# ipw         0.311082  0.311170
+# snipw       0.311795  0.334736
+# dr          0.292464  0.315485
+# sndr        0.302407  0.328434
+# switch-dr   0.258410  0.160598
+# dr-os       0.159520  0.109660
 # ==============================
 ```
 
-Please refer to [this page](https://zr-obp.readthedocs.io/en/latest/evaluation_ope.html) for the evaluation of OPE protocol using our real-world data. Please visit [synthetic](../synthetic/) to try the evaluation of OPE estimators with synthetic bandit datasets. Moreover, in [benchmark/ope](https://github.com/st-tech/zr-obp/tree/master/benchmark/ope), we performed the benchmark experiments on several OPE estimators using the full size Open Bandit Dataset.
+Please refer to [this page](https://zr-obp.readthedocs.io/en/latest/evaluation_ope.html) for the evaluation of OPE protocol using our real-world data. Please visit [synthetic](../synthetic/) to try the evaluation of OPE estimators with synthetic bandit data. Moreover, in [benchmark/ope](https://github.com/st-tech/zr-obp/tree/master/benchmark/ope), we performed the benchmark experiments on several OPE estimators using the full size Open Bandit Dataset.
+
+
+## References
+
+- Yi Su, Pavithra Srinath, Akshay Krishnamurthy. [Adaptive Estimator Selection for Off-Policy Evaluation](https://arxiv.org/abs/2002.07729), ICML2020.
+- Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, Miroslav Dudík. [Doubly Robust Off-policy Evaluation with Shrinkage](https://arxiv.org/abs/1907.09623), ICML2020.
+- George Tucker and Jonathan Lee. [Improved Estimator Selection for Off-Policy Evaluation](https://lyang36.github.io/icml2021_rltheory/camera_ready/79.pdf), Workshop on Reinforcement Learning Theory at ICML2021.
+- Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudik. [Optimal and Adaptive Off-policy Evaluation in Contextual Bandits](https://arxiv.org/abs/1612.01205), ICML2017.
+- Miroslav Dudik, John Langford, Lihong Li. [Doubly Robust Policy Evaluation and Learning](https://arxiv.org/abs/1103.4601). ICML2011.
+- Yuta Saito, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita. [Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation](https://arxiv.org/abs/2008.07146). NeurIPS2021 Track on Datasets and Benchmarks.

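For readers who want the protocol this README describes without running the full script, the following sketch (not part of the commit) follows the obp quickstart for the small sample of Open Bandit Dataset bundled with obp. The calls not visible in the hunks above, in particular `BernoulliTS.compute_batch_action_dist` and `OpenBanditDataset.calc_on_policy_policy_value_estimate`, are taken from the obp documentation and should be checked against the installed version; the choice of three estimators and `n_sim=100000` is illustrative.

```python
from sklearn.linear_model import LogisticRegression

from obp.dataset import OpenBanditDataset
from obp.ope import DirectMethod, DoublyRobust, InverseProbabilityWeighting
from obp.ope import OffPolicyEvaluation, RegressionModel
from obp.policy import BernoulliTS

# logged feedback collected by the Random policy on the "all" campaign (bundled sample)
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# evaluation policy: Bernoulli TS with the production (ZOZOTOWN) prior
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,
    campaign="all",
    random_state=12345,
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

# regression model for the model-dependent estimators (DM, DR)
regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    base_model=LogisticRegression(max_iter=1000, random_state=12345),
)
estimated_rewards = regression_model.fit_predict(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    position=bandit_feedback["position"],
    random_state=12345,
)

# "ground truth": on-policy value of Bernoulli TS, estimated from its own logged data
ground_truth = OpenBanditDataset.calc_on_policy_policy_value_estimate(
    behavior_policy="bts", campaign="all"
)

# relative estimation error of each estimator against the on-policy value
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[DirectMethod(), InverseProbabilityWeighting(), DoublyRobust()],
)
print(
    ope.evaluate_performance_of_estimators(
        ground_truth_policy_value=ground_truth,
        action_dist=action_dist,
        estimated_rewards_by_reg_model=estimated_rewards,
        metric="relative-ee",
    )
)
```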
examples/obd/evaluate_off_policy_estimators.py

Lines changed: 24 additions & 8 deletions
@@ -13,9 +13,13 @@
 from obp.dataset import OpenBanditDataset
 from obp.ope import DirectMethod
 from obp.ope import DoublyRobust
+from obp.ope import DoublyRobustWithShrinkageTuning
 from obp.ope import InverseProbabilityWeighting
 from obp.ope import OffPolicyEvaluation
 from obp.ope import RegressionModel
+from obp.ope import SelfNormalizedDoublyRobust
+from obp.ope import SelfNormalizedInverseProbabilityWeighting
+from obp.ope import SwitchDoublyRobustTuning
 from obp.policy import BernoulliTS
 from obp.policy import Random
 
@@ -32,8 +36,19 @@
     random_forest=RandomForestClassifier,
 )
 
-# OPE estimators compared
-ope_estimators = [DirectMethod(), InverseProbabilityWeighting(), DoublyRobust()]
+# compared OPE estimators
+ope_estimators = [
+    DirectMethod(),
+    InverseProbabilityWeighting(),
+    SelfNormalizedInverseProbabilityWeighting(),
+    DoublyRobust(),
+    SelfNormalizedDoublyRobust(),
+    SwitchDoublyRobustTuning(lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]),
+    DoublyRobustWithShrinkageTuning(
+        lambdas=[10, 50, 100, 500, 1000, 5000, 10000, np.inf]
+    ),
+]
+
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="evaluate off-policy estimators.")
@@ -123,7 +138,7 @@
     def process(b: int):
         # sample bootstrap from batch logged bandit feedback
         bandit_feedback = obd.sample_bootstrap_bandit_feedback(random_state=b)
-        # estimate the mean reward function with an ML model
+        # estimate the reward function with an ML model
         regression_model = RegressionModel(
             n_actions=obd.n_actions,
             len_list=obd.len_list,
@@ -151,6 +166,7 @@ def process(b: int):
             ground_truth_policy_value=ground_truth_policy_value,
             action_dist=action_dist,
             estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
+            metric="relative-ee",
         )
 
         return relative_ee_b
@@ -159,22 +175,22 @@ def process(b: int):
         n_jobs=n_jobs,
         verbose=50,
     )([delayed(process)(i) for i in np.arange(n_runs)])
-    relative_ee_dict = {est.estimator_name: dict() for est in ope_estimators}
+    metric_dict = {est.estimator_name: dict() for est in ope_estimators}
     for b, relative_ee_b in enumerate(processed):
         for (
             estimator_name,
             relative_ee_,
         ) in relative_ee_b.items():
-            relative_ee_dict[estimator_name][b] = relative_ee_
-    relative_ee_df = DataFrame(relative_ee_dict).describe().T.round(6)
+            metric_dict[estimator_name][b] = relative_ee_
+    results_df = DataFrame(metric_dict).describe().T.round(6)
 
     print("=" * 30)
     print(f"random_state={random_state}")
     print("-" * 30)
-    print(relative_ee_df[["mean", "std"]])
+    print(results_df[["mean", "std"]])
     print("=" * 30)
 
     # save results of the evaluation of off-policy estimators in './logs' directory.
     log_path = Path("./logs") / behavior_policy / campaign
     log_path.mkdir(exist_ok=True, parents=True)
-    relative_ee_df.to_csv(log_path / "relative_ee_of_ope_estimators.csv")
+    results_df.to_csv(log_path / "evaluation_of_ope_results.csv")

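The script evaluates each estimator across `n_runs` bootstrap replicates of the logged data. A stripped-down version of that outer loop (a sketch, not the committed code; it relies only on `sample_bootstrap_bandit_feedback` as shown in the hunks above, and the three-replicate loop is illustrative):

```python
import numpy as np

from obp.dataset import OpenBanditDataset

# small sample of Open Bandit Dataset bundled with obp (Random policy, "all" campaign)
obd = OpenBanditDataset(behavior_policy="random", campaign="all")

for b in np.arange(3):
    # each run re-samples the logged rounds with replacement, so the mean/std
    # reported by the script reflect sampling variability of the estimators
    bandit_feedback = obd.sample_bootstrap_bandit_feedback(random_state=b)
    print(f"run {b}: n_rounds={bandit_feedback['n_rounds']}")
```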