
Commit 01823f7

Merge pull request #26 from st-tech/feat/update-version-to-0.3.3
Update: version 0.3.3
2 parents fc7d0c2 + e185a4d commit 01823f7

File tree: 4 files changed (+30, -33 lines)


README.md

Lines changed: 23 additions & 19 deletions
@@ -50,23 +50,22 @@ The following figure presents examples of displayed fashion items as actions.
</figcaption>
</p>

-We collected the data in a 7-day experiment in late November 2019 on three “campaigns,” corresponding to all, men's, and women's items, respectively.
-Each campaign randomly used either the Uniform Random algorithm or the Bernoulli Thompson Sampling (Bernoulli TS) algorithm, which was pre-trained for about a month before the data collection period.
+We collected the data in a 7-days experiment in late November 2019 on three “campaigns,” corresponding to all, men's, and women's items, respectively.
+Each campaign randomly used either the Uniform Random policy or the Bernoulli Thompson Sampling (Bernoulli TS) policy, which was pre-trained for about a month before the data collection period.

<p align="center">
<img width="70%" src="./images/statistics_of_obd.png" />
</p>

The small size version of our data is available at [./obd](https://github.com/st-tech/zr-obp/tree/master/obd).
-This can be used for running [examples](https://github.com/st-tech/zr-obp/tree/master/examples).
+This can be used for running some [examples](https://github.com/st-tech/zr-obp/tree/master/examples).
We release the full size version of our data at [https://research.zozo.com/data.html](https://research.zozo.com/data.html).
Please download the full size version for research uses.
Please see [./obd/README.md](https://github.com/st-tech/zr-obp/blob/master/obd/README.md) for the description of the dataset.

## Open Bandit Pipeline (OBP)

-
-*Open Bandit Pipeline* is a series of implementations of dataset preprocessing, OPE estimators, and the evaluation of OPE estimators.
+*Open Bandit Pipeline* is a series of implementations of dataset preprocessing, policy learning methods, OPE estimators, and the evaluation of OPE protocols.
This pipeline allows researchers to focus on building their own OPE estimator and easily compare it with others’ methods in realistic and reproducible ways.
Thus, it facilitates reproducible research on bandit algorithms and off-policy evaluation.

@@ -82,7 +81,7 @@ Thus, it facilitates reproducible research on bandit algorithms and off-policy e
Open Bandit Pipeline consists of the following main modules.

- **dataset module**: This module provides a data loader for Open Bandit Dataset and a flexible interface for handling logged bandit feedback. It also provides tools to generate synthetic bandit datasets.
-- **policy module**: This module provides interfaces for online and offline bandit algorithms. It also implements several standard policy learning methods.
+- **policy module**: This module provides interfaces for training online and offline bandit policies. It also implements several standard policy learning methods.
- **simulator module**: This module provides functions for conducting offline bandit simulation.
- **ope module**: This module provides interfaces for OPE estimators. It also implements several standard and advanced OPE estimators.
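
As a rough illustration of how these modules surface in code, the sketch below uses only class names that appear elsewhere in this README; the exact import paths are assumptions, not part of the diff.

```python
# illustration only: import paths assumed from the class names used in this README
from obp.dataset import OpenBanditDataset    # dataset module: data loader for Open Bandit Dataset
from obp.policy import BernoulliTS           # policy module: online/offline bandit policies
from obp.ope import OffPolicyEvaluation      # ope module: OPE estimators and their evaluation
# the simulator module provides functions for offline bandit simulation (no class is named in this diff)
```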

@@ -131,6 +130,8 @@ Currently, Open Bandit Dataset & Pipeline facilitate evaluation and comparison r

- **Off-Policy Evaluation**: We present implementations of behavior policies used when collecting datasets as a part of our pipeline. Our open data also contains logged bandit feedback data generated by *multiple* different bandit policies. Therefore, it enables the evaluation of off-policy evaluation with ground-truth for the performance of evaluation policies.

+Please refer to our [documentation](https://zr-obp.readthedocs.io/en/latest/ope.html) for the basic formulation of OPE.

# Installation

@@ -162,7 +163,7 @@ python setup.py install

# Usage

-We show an example of conducting offline evaluation of the performance of Bernoulli Thompson Sampling (BernoulliTS) as an evaluation policy using the *Inverse Probability Weighting (IPW)* and logged bandit feedback generated by the Random policy (behavior policy).
+We show an example of conducting offline evaluation of the performance of BernoulliTS as an evaluation policy using Inverse Probability Weighting (IPW) and logged bandit feedback generated by the Random policy (behavior policy).
We see that only ten lines of code are sufficient to complete OPE from scratch.

```python
@@ -206,17 +207,17 @@ Below, we explain some important features in the example.
We prepare an easy-to-use data loader for Open Bandit Dataset.

```python
-# load and preprocess raw data in "ALL" campaign collected by the Random policy
+# load and preprocess raw data in "All" campaign collected by the Random policy
dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
-# obtain logged bandit feedback generated by the behavior policy
+# obtain logged bandit feedback
bandit_feedback = dataset.obtain_batch_bandit_feedback()

print(bandit_feedback.keys())
dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'pscore', 'context', 'action_context'])
```

Users can implement their own feature engineering in the `pre_process` method of `obp.dataset.OpenBanditDataset` class.
-We show an example of implementing some new feature engineering processes in [`./examples/examples_with_obd/custom_dataset.py`](https://github.com/st-tech/zr-obp/blob/master/benchmark/cf_policy_search/custom_dataset.py).
+We show an example of implementing some new feature engineering processes in [`custom_dataset.py`](https://github.com/st-tech/zr-obp/blob/master/benchmark/cf_policy_search/custom_dataset.py).

Moreover, by following the interface of `obp.dataset.BaseBanditDataset` class, one can handle future open datasets for bandit algorithms other than our Open Bandit Dataset.
The `dataset` module also provides a class to generate synthetic bandit datasets.
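
A minimal sketch of this customization point, assuming `pre_process` can be overridden as described above; the extra feature and the `self.context` attribute it touches are illustrative assumptions, not taken from this diff.

```python
# illustration only: a dataset subclass with custom feature engineering
import numpy as np
from obp.dataset import OpenBanditDataset


class MyOpenBanditDataset(OpenBanditDataset):
    def pre_process(self) -> None:
        """Run the default preprocessing, then append a hand-crafted feature."""
        super().pre_process()
        # hypothetical extra feature: a constant bias column appended to the context vectors
        bias = np.ones((self.context.shape[0], 1))
        self.context = np.concatenate([self.context, bias], axis=1)


# usage would mirror OpenBanditDataset itself:
# dataset = MyOpenBanditDataset(behavior_policy="random", campaign="all")
```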
@@ -236,16 +237,18 @@ evaluation_policy = BernoulliTS(
    campaign="all",
    random_state=12345
)
-# compute the distribution over actions by the evaluation policy using Monte Carlo simulation
+# compute the action choice probabilities by the evaluation policy using Monte Carlo simulation
# action_dist is an array of shape (n_rounds, n_actions, len_list)
# representing the distribution over actions made by the evaluation policy
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)
```

-When `is_zozotown_prior=False`, non-informative prior distribution is used.
-The `compute_batch_action_dist` method of `BernoulliTS` computes the action choice probabilities based on given hyperparameters of the beta distribution. `action_dist` is an array representing the distribution over actions made by the evaluation policy.
+The `compute_batch_action_dist` method of `BernoulliTS` computes the action choice probabilities based on given hyperparameters of the beta distribution.
+When `is_zozotown_prior=True`, hyperparameters used during the data collection process on the ZOZOTOWN platform are set.
+Otherwise, non-informative prior hyperparameters are used.
+`action_dist` is an array representing the action choice probabilities made by the evaluation policy.

Users can implement their own bandit algorithms by following the interfaces implemented in [`./obp/policy/base.py`](https://github.com/st-tech/zr-obp/blob/master/obp/policy/base.py).

@@ -255,21 +258,22 @@ Our final step is **off-policy evaluation** (OPE), which attempts to estimate th
Our pipeline also provides an easy procedure for doing OPE as follows.

```python
-# estimate the policy value of BernoulliTS based on the distribution over actions by that policy
+# estimate the policy value of BernoulliTS based on its action choice probabilities
# it is possible to set multiple OPE estimators to the `ope_estimators` argument
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
print(estimated_policy_value)
-{'ipw': 0.004553...} # dictionary containing estimated policy values by each OPE estimator.
+{'ipw': 0.004553...} # dictionary containing policy values estimated by each OPE estimator.

# compare the estimated performance of BernoulliTS (evaluation policy)
-# with the ground-truth performance of Random (behavior policy)
-relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
+# with the ground-truth performance of the Random policy (behavior policy)
+policy_value_improvement = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
# our OPE procedure suggests that BernoulliTS improves Random by 19.81%
-print(relative_policy_value_of_bernoulli_ts)
+print(policy_value_improvement)
1.198126...
```

-Users can implement their own OPE estimator by following the interface of `obp.ope.BaseOffPolicyEstimator` class. `obp.ope.OffPolicyEvaluation` class summarizes and compares the estimated policy values by several off-policy estimators.
+Users can implement their own OPE estimator by following the interface of `obp.ope.BaseOffPolicyEstimator` class.
+`obp.ope.OffPolicyEvaluation` class summarizes and compares the policy values estimated by several different estimators.
A detailed usage of this class can be found at [quickstart](https://github.com/st-tech/zr-obp/tree/master/examples/quickstart). `bandit_feedback['reward'].mean()` is the empirical mean of factual rewards (on-policy estimate of the policy value) in the log and thus is the ground-truth performance of the behavior policy (the Random policy in this example).
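
Putting the snippets above together, a minimal end-to-end sketch of the described workflow looks roughly as follows; the import paths and the constructor arguments not shown in the diff (such as `n_actions` and `len_list`) are assumptions.

```python
# illustration only: assembled from the snippets in this README
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) load and preprocess the "All" campaign data logged by the Random policy
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# (2) define BernoulliTS as the evaluation policy, reusing the ZOZOTOWN priors
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True,
    campaign="all",
    random_state=12345,
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

# (3) estimate the policy value of BernoulliTS with IPW and compare it
#     with the on-policy estimate of the behavior (Random) policy
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
print(estimated_policy_value["ipw"] / bandit_feedback["reward"].mean())
```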

benchmark/ope/README.md

Lines changed: 3 additions & 10 deletions
@@ -11,8 +11,7 @@ Please download the full [open bandit dataset](https://research.zozo.com/data.ht
Model-dependent estimators such as DM and DR need a pre-trained regression model.
Here, we train a regression model with some machine learning methods.

-We define hyperparameters for the machine learning methods in [`conf/hyperparams.yaml`](https://github.com/st-tech/zr-obp/blob/master/benchmark/ope/conf/hyperparams.yaml).
-[train_regression_model.py](https://github.com/st-tech/zr-obp/blob/master/benchmark/ope/train_regression_model.py) implements the training process of the regression model.
+[train_regression_model.py](https://github.com/st-tech/zr-obp/blob/master/benchmark/ope/train_regression_model.py) implements the training process of the regression model. ([`conf/hyperparams.yaml`](https://github.com/st-tech/zr-obp/blob/master/benchmark/ope/conf/hyperparams.yaml) defines hyperparameters for the machine learning methods.)

```
python train_regression_model.py\
@@ -34,8 +33,8 @@ where
- `$campaign` specifies the campaign considered in ZOZOTOWN and should be one of "all", "men", or "women".
- `$n_sim_to_compute_action_dist` is the number of Monte Carlo simulations used to compute the action choice probabilities of a given evaluation policy.
- `$is_timeseries_split` is whether the data is split based on timestamp or not. If true, the out-sample performance of OPE is tested. See the relevant paper for details.
-- - `$test_size` specifies the proportion of the dataset to include in the test split when `$is_timeseries_split=True`.
-- `$is_mrdr` is whether the regression model is trained by the more robust doubly robust way or not. See the relevant paper for details.
+- `$test_size` specifies the proportion of the dataset to include in the test split when `$is_timeseries_split=True`.
+- `$is_mrdr` is whether the regression model is trained by the more robust doubly robust way. See the relevant paper for details.
- `$n_jobs` is the maximum number of concurrently running jobs.

For example, the following command trains the regression model based on logistic regression on the logged bandit feedback data collected by the Random policy (as a behavior policy) in the "All" campaign.
@@ -158,9 +157,3 @@ do
done
```
-->
-
-<!-- ## Results
-
-We report the results of the benchmark experiments on the three campaigns (all, men, women) in the following tables.
-We describe **Random -> Bernoulli TS** to represent the OPE situation where we use Bernoulli TS as a hypothetical evaluation policy and Random as a hypothetical behavior policy.
-In contrast, we use **Bernoulli TS -> Random** to represent the situation where we use Random as a hypothetical evaluation policy and Bernoulli TS as a hypothetical behavior policy. -->

obp/policy/offline.py

Lines changed: 3 additions & 3 deletions
@@ -245,8 +245,8 @@ def sample_action(
.. math::

-    & P (A_1 = a_1 | x) = \\frac{e^{f(x,a_1,1) / \\tau}}{\\sum_{a^{\\prime} \\in \\mathcal{A}} e^{f(x,a^{\\prime},1) / \\tau}} , \\\\
-    & P (A_2 = a_2 | A_1 = a_1, x) = \\frac{e^{f(x,a_2,2) / \\tau}}{\\sum_{a^{\\prime} \\in \\mathcal{A} \\backslash \\{a_1\\}} e^{f(x,a^{\\prime},2) / \\tau}} ,
+    & P (A_1 = a_1 | x) = \\frac{\\mathrm{exp}(f(x,a_1,1) / \\tau)}{\\sum_{a^{\\prime} \\in \\mathcal{A}} \\mathrm{exp}( f(x,a^{\\prime},1) / \\tau)} , \\\\
+    & P (A_2 = a_2 | A_1 = a_1, x) = \\frac{\\mathrm{exp}(f(x,a_2,2) / \\tau)}{\\sum_{a^{\\prime} \\in \\mathcal{A} \\backslash \\{a_1\\}} \\mathrm{exp}(f(x,a^{\\prime},2) / \\tau )} ,
    \\ldots

where :math:`A_k` is a random variable representing an action at a position :math:`k`.
@@ -304,7 +304,7 @@ def predict_proba(
.. math::

-    P (A = a | x) = \\frac{e^{f(x,a) / \\tau}}{\\sum_{a^{\\prime} \\in \\mathcal{A}} e^{f(x,a^{\\prime}) / \\tau}},
+    P (A = a | x) = \\frac{\\mathrm{exp}(f(x,a) / \\tau)}{\\sum_{a^{\\prime} \\in \\mathcal{A}} \\mathrm{exp}(f(x,a^{\\prime}) / \\tau)},

where :math:`A` is a random variable representing an action, and :math:`\\tau` is a temperature hyperparameter.
:math:`f: \\mathcal{X} \\times \\mathcal{A} \\rightarrow \\mathbb{R}_{+}`
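
The docstring formula above is a standard temperature-scaled softmax. As a worked illustration (not part of the diff), a small NumPy sketch of such probabilities:

```python
import numpy as np


def softmax_with_temperature(f_values: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax: P(A = a | x) is proportional to exp(f(x, a) / tau)."""
    z = f_values / tau
    z = z - z.max()            # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()


# three actions; a smaller temperature concentrates probability on the highest-scoring action
scores = np.array([1.0, 2.0, 0.5])
print(softmax_with_temperature(scores, tau=1.0))  # roughly [0.23, 0.63, 0.14]
print(softmax_with_temperature(scores, tau=0.1))  # almost all mass on the second action
```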

obp/version.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.3.2"
+__version__ = "0.3.3"
