**README.md**

The following figure presents examples of displayed fashion items as actions.

We collected the data in a 7-day experiment in late November 2019 on three “campaigns,” corresponding to all, men's, and women's items, respectively.
Each campaign randomly used either the Uniform Random policy or the Bernoulli Thompson Sampling (Bernoulli TS) policy, which was pre-trained for about a month before the data collection period.

A small-sized version of our data is available at [./obd](https://github.com/st-tech/zr-obp/tree/master/obd).
This can be used for running some [examples](https://github.com/st-tech/zr-obp/tree/master/examples).

We release the full-sized version of our data at [https://research.zozo.com/data.html](https://research.zozo.com/data.html).
Please download the full-sized version for research use.
Please see [./obd/README.md](https://github.com/st-tech/zr-obp/blob/master/obd/README.md) for a description of the dataset.

## Open Bandit Pipeline (OBP)

*Open Bandit Pipeline* is a series of implementations of dataset preprocessing, policy learning methods, OPE estimators, and the evaluation of OPE protocols.
This pipeline allows researchers to focus on building their own OPE estimator and easily compare it with others’ methods in realistic and reproducible ways.
Thus, it facilitates reproducible research on bandit algorithms and off-policy evaluation.

Open Bandit Pipeline consists of the following main modules (a minimal import sketch follows the list).

- **dataset module**: This module provides a data loader for Open Bandit Dataset and a flexible interface for handling logged bandit feedback. It also provides tools to generate synthetic bandit datasets.
- **policy module**: This module provides interfaces for training online and offline bandit policies. It also implements several standard policy learning methods.
- **simulator module**: This module provides functions for conducting offline bandit simulation.
- **ope module**: This module provides interfaces for OPE estimators. It also implements several standard and advanced OPE estimators.
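
As a quick orientation, the sketch below pairs each module with one representative import; the names are taken from obp's documented public API, so treat the exact import paths as assumptions rather than guarantees.

```python
# one representative import per module (names assumed from obp's documented API)
from obp.dataset import OpenBanditDataset, SyntheticBanditDataset     # dataset module
from obp.policy import BernoulliTS, IPWLearner                        # policy module
from obp.simulator import run_bandit_simulation                       # simulator module
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting  # ope module
```
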
Currently, Open Bandit Dataset & Pipeline facilitate evaluation and comparison related to the following research topics.

- **Off-Policy Evaluation**: We present implementations of behavior policies used when collecting datasets as a part of our pipeline. Our open data also contains logged bandit feedback data generated by *multiple* different bandit policies. Therefore, it enables the evaluation of off-policy evaluation with the ground-truth performance of evaluation policies.

Please refer to our [documentation](https://zr-obp.readthedocs.io/en/latest/ope.html) for the basic formulation of OPE.

# Installation

To install OBP from source, run `python setup.py install`.

# Usage

We show an example of conducting offline evaluation of the performance of BernoulliTS as an evaluation policy using Inverse Probability Weighting (IPW) and logged bandit feedback generated by the Random policy (behavior policy).
We see that only ten lines of code are sufficient to complete OPE from scratch.
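
The code itself is elided in this diff hunk, so the block below reconstructs the example as a sketch of the flow described above (data loading, defining BernoulliTS as the evaluation policy, and OPE with IPW). The class and method names follow obp's documented API, while argument values such as `n_sim` and `random_state` are illustrative.
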
```python
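# a sketch reconstructing the example described above; argument values are illustrative
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) load and preprocess the "All" campaign data collected by the Random policy
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# (2) define BernoulliTS as the evaluation policy and compute its action choice probabilities
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,  # use the hyperparameters set during data collection on ZOZOTOWN
    campaign="all",
    random_state=12345,
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

# (3) estimate the policy value of BernoulliTS with the IPW estimator
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)

# compare the estimated performance of BernoulliTS with the ground-truth performance of Random
policy_value_improvement = estimated_policy_value["ipw"] / bandit_feedback["reward"].mean()
print(policy_value_improvement)
```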

Below, we explain some important features in the example.

We prepare an easy-to-use data loader for Open Bandit Dataset.

```python
# load and preprocess raw data in "All" campaign collected by the Random policy
# (completed here as a sketch; the rest of this snippet was elided in the diff)
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()
```

Users can implement their own feature engineering in the `pre_process` method of the `obp.dataset.OpenBanditDataset` class.
We show an example of implementing some new feature engineering processes in [`custom_dataset.py`](https://github.com/st-tech/zr-obp/blob/master/benchmark/cf_policy_search/custom_dataset.py).
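
A minimal sketch of that pattern is shown below (the subclass name is hypothetical; it extends rather than replaces the default preprocessing).

```python
from obp.dataset import OpenBanditDataset

class CustomizedOpenBanditDataset(OpenBanditDataset):  # hypothetical class name
    def pre_process(self) -> None:
        """Run the default preprocessing, then add custom feature engineering."""
        super().pre_process()
        # e.g., derive additional context features from the raw data here (sketch only)
```
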
Moreover, by following the interface of the `obp.dataset.BaseBanditDataset` class, one can handle future open datasets for bandit algorithms other than our Open Bandit Dataset.
The `dataset` module also provides a class to generate synthetic bandit datasets.
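
For instance, a small sketch of generating synthetic logged bandit feedback (class and function names as documented in obp; the argument values are illustrative):

```python
from obp.dataset import SyntheticBanditDataset, logistic_reward_function

# synthetic logged bandit feedback with 10 actions and 5-dimensional contexts
synthetic_dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    reward_function=logistic_reward_function,  # defines the expected reward given (context, action)
    random_state=12345,
)
synthetic_bandit_feedback = synthetic_dataset.obtain_batch_bandit_feedback(n_rounds=10000)
```
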
The `compute_batch_action_dist` method of `BernoulliTS` computes the action choice probabilities based on the given hyperparameters of the beta distribution.
When `is_zozotown_prior=True`, hyperparameters used during the data collection process on the ZOZOTOWN platform are set.
Otherwise, non-informative prior hyperparameters are used.
`action_dist` is an array representing the action choice probabilities made by the evaluation policy.
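
For comparison, a sketch of defining the same policy without the ZOZOTOWN prior, in which case the non-informative prior hyperparameters described above are used (argument values are illustrative):

```python
evaluation_policy_uninformative = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=False,  # fall back to the non-informative prior hyperparameters
    random_state=12345,
)
action_dist_uninformative = evaluation_policy_uninformative.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)
```
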
Users can implement their own bandit algorithms by following the interfaces implemented in [`./obp/policy/base.py`](https://github.com/st-tech/zr-obp/blob/master/obp/policy/base.py).

Our final step is **off-policy evaluation** (OPE), which attempts to estimate the performance of evaluation policies using only the logged bandit feedback generated by the behavior policy.
Our pipeline also provides an easy procedure for doing OPE as follows.

```python
# estimate the policy value of BernoulliTS based on its action choice probabilities
# it is possible to set multiple OPE estimators to the `ope_estimators` argument
# (the estimator calls were elided in this diff and are completed here as a sketch)
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)

# compare the estimated performance of BernoulliTS with the ground-truth performance of Random
policy_value_improvement = estimated_policy_value["ipw"] / bandit_feedback["reward"].mean()
# our OPE procedure suggests that BernoulliTS improves Random by 19.81%
print(policy_value_improvement)
1.198126...
```

Users can implement their own OPE estimator by following the interface of the `obp.ope.BaseOffPolicyEstimator` class.
The `obp.ope.OffPolicyEvaluation` class summarizes and compares the policy values estimated by several different estimators.
A detailed usage of this class can be found in the [quickstart](https://github.com/st-tech/zr-obp/tree/master/examples/quickstart). `bandit_feedback['reward'].mean()` is the empirical mean of factual rewards (the on-policy estimate of the policy value) in the log and thus is the ground-truth performance of the behavior policy (the Random policy in this example).

**benchmark/ope/README.md**

Please download the full [open bandit dataset](https://research.zozo.com/data.html) to run this benchmark.

Model-dependent estimators such as DM and DR need a pre-trained regression model.
Here, we train a regression model with some machine learning methods.
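
Conceptually, this training step corresponds to fitting obp's `RegressionModel` on the logged data. The following is a minimal Python sketch, assuming the `bandit_feedback` dictionary from the usage example above; it is not the benchmark script's actual code.

```python
from sklearn.linear_model import LogisticRegression
from obp.ope import RegressionModel

# fit a reward regressor q(x, a) on the logged bandit feedback
regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    base_model=LogisticRegression(max_iter=1000, random_state=12345),
)
regression_model.fit(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    position=bandit_feedback["position"],
)
estimated_rewards_by_reg_model = regression_model.predict(context=bandit_feedback["context"])
```
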
[train_regression_model.py](https://github.com/st-tech/zr-obp/blob/master/benchmark/ope/train_regression_model.py) implements the training process of the regression model. ([`conf/hyperparams.yaml`](https://github.com/st-tech/zr-obp/blob/master/benchmark/ope/conf/hyperparams.yaml) defines the hyperparameters for the machine learning methods.)

```
python train_regression_model.py\
    ...
```

where

- `$campaign` specifies the campaign considered in ZOZOTOWN and should be one of "all", "men", or "women".
- `$n_sim_to_compute_action_dist` is the number of Monte Carlo simulations used to compute the action choice probabilities of a given evaluation policy.
- `$is_timeseries_split` is whether the data is split based on timestamp or not. If true, the out-sample performance of OPE is tested. See the relevant paper for details.
- `$test_size` specifies the proportion of the dataset to include in the test split when `$is_timeseries_split=True`.
- `$is_mrdr` is whether the regression model is trained by the more robust doubly robust (MRDR) method. See the relevant paper for details.
- `$n_jobs` is the maximum number of concurrently running jobs.

For example, one can train the regression model with logistic regression on the logged bandit feedback collected by the Random policy (as a behavior policy) in the "All" campaign by setting the arguments above accordingly.