
Examples for Online Bandit Algorithms with Replay Method #67


Merged
merged 11 commits into st-tech:master on Feb 9, 2021

Conversation

Kurorororo
Contributor

@Kurorororo Kurorororo commented Feb 7, 2021

I added a quickstart notebook and an example script for OPE with online bandit algorithms using the Replay Method (RM).
To calculate ground-truth policy values for online bandit algorithms, I implemented SyntheticBanditDataset.sample_reward and simulator.calc_ground_truth_policy_value.
I added test code for SyntheticBanditDataset.sample_reward but not for simulator.calc_ground_truth_policy_value, because test_simulator.py does not exist. If you want me to create test_simulator.py, I will do it.
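
For context, the overall flow of the new example is roughly the following. This is a simplified sketch only: the import paths, the EpsilonGreedy policy, and logistic_reward_function are assumed here for illustration, while calc_ground_truth_policy_value and sample_reward are the functions added in this PR (see the notebook and script for the exact code).

# Simplified sketch of the example flow (import paths and some arguments assumed).
from obp.dataset import SyntheticBanditDataset, logistic_reward_function
from obp.policy import EpsilonGreedy
from obp.simulator import run_bandit_simulation, calc_ground_truth_policy_value
from obp.ope import OffPolicyEvaluation, ReplayMethod

# synthetic logged bandit feedback
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    reward_function=logistic_reward_function,
    random_state=12345,
)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)

# simulate an online bandit policy on the logged data to get its action choices
epsilon_greedy = EpsilonGreedy(n_actions=dataset.n_actions, epsilon=0.1, random_state=12345)
action_dist = run_bandit_simulation(bandit_feedback=bandit_feedback, policy=epsilon_greedy)

# OPE of the online policy with the Replay Method (RM)
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[ReplayMethod()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)

# ground-truth policy value via repeated simulation with the new reward sampler
ground_truth = calc_ground_truth_policy_value(
    bandit_feedback=bandit_feedback,
    reward_sampler=dataset.sample_reward,  # p(r|x,a)
    policy=epsilon_greedy,
    n_sim=3,  # the number of simulations
)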

@Kurorororo Kurorororo marked this pull request as draft February 7, 2021 20:44
@Kurorororo Kurorororo marked this pull request as ready for review February 8, 2021 18:08
@usaito
Contributor

usaito commented Feb 9, 2021

@Kurorororo Thanks! Can you address the following points?

[nits]

  • You use distribution of rewards several times in several files. It is correct, but I think reward distribution is better, as it is just as meaningful and simpler.

  • In the quickstart example, you use evaluation_policy_epsilon_greedy, evaluation_policy_lin_ucb, and evaluation_policy_lin_ts to define the (online) evaluation policies. I think just epsilon_greedy, lin_ucb, and lin_ts are sufficient here.

  • When calculating the ground-truth policy values, can you spell out the keyword arguments of calc_ground_truth_policy_value? This will make it easier for users to learn how to use the function.
    That is, I recommend:

policy_value_epsilon_greedy = calc_ground_truth_policy_value(
    bandit_feedback=bandit_feedback,
    reward_sampler=dataset.sample_reward, # p(r|x,a)
    policy=evaluation_policy_epsilon_greedy,
    n_sim=3 # the number of simulations
)

The same comment applies to run_bandit_simulation in the quickstart.
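
For run_bandit_simulation, that would look something like this (keyword names assumed from the simulator API):

action_dist_epsilon_greedy = run_bandit_simulation(
    bandit_feedback=bandit_feedback,  # logged bandit feedback to replay
    policy=evaluation_policy_epsilon_greedy,  # online policy to simulate
)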

[ask]

  • Why did you implement sample_reward and sample_reward_given_expected_reward separately?

@Kurorororo
Contributor Author

Thank you for the review.

  • You use distribution of rewards several times in several files. It is correct, but I think reward distribution is better, as it is just as meaningful and simpler.
  • In the quickstart example, you use evaluation_policy_epsilon_greedy, evaluation_policy_lin_ucb, and evaluation_policy_lin_ts to define the (online) evaluation policies. I think just epsilon_greedy, lin_ucb, and lin_ts are sufficient here.
  • When calculating the ground-truth policy values, can you spell out the keyword arguments of calc_ground_truth_policy_value? This will make it easier for users to learn how to use the function.
  • I think :math:`r \sim p(r \mid x, a)` is more understandable than just :math:`p(r \mid x, a)`.
  • There is a typo in the title: Bndit -> Bandit

I addressed these comments in the latest commit.

  • Why did you implement sample_reward and sample_reward_given_expected_reward separately?

This is because obtain_batch_bandit_feedback needs expected_reward when the reward type is 'continuous'.
Splitting them into two functions lets obtain_batch_bandit_feedback extract expected_reward here:

expected_reward_ = self.calc_expected_reward(context)
reward = self.sample_reward_given_expected_reward(expected_reward_, action)
if self.reward_type == "continuous":
    # correct expected_reward_, as we use a truncated normal distribution here
    mean = expected_reward_
    a = (self.reward_min - mean) / self.reward_std
    b = (self.reward_max - mean) / self.reward_std
    expected_reward_ = truncnorm.stats(
        a=a, b=b, loc=mean, scale=self.reward_std, moments="m"
    )
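
With that correction handled inside obtain_batch_bandit_feedback, sample_reward itself stays a thin wrapper, roughly like this (a simplified sketch, not the exact code):

def sample_reward(self, context, action):
    """Sample rewards given contexts and actions, i.e., r ~ p(r|x,a)."""
    expected_reward_ = self.calc_expected_reward(context)
    return self.sample_reward_given_expected_reward(expected_reward_, action)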

@usaito
Contributor

usaito commented Feb 9, 2021

@Kurorororo Thanks! LGTM!

@usaito usaito merged commit 1254058 into st-tech:master Feb 9, 2021