
Examples for Online Bandit Algorithms with Replay Method #67


Merged
merged 11 commits into st-tech:master on Feb 9, 2021

Conversation

Kurorororo
Contributor

@Kurorororo Kurorororo commented Feb 7, 2021

I added a quickstart notebook and an example script for OPE with online bandit algorithms using the Replay Method (RM).
To calculate ground-truth policy values for online bandit algorithms, I implemented SyntheticBanditDataset.sample_reward and simulator.calc_ground_truth_policy_value.
I added test code for SyntheticBanditDataset.sample_reward but not for simulator.calc_ground_truth_policy_value, because test_simulator.py does not exist. If you want me to create test_simulator.py, I will do it.
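
For context, the overall flow of the new example is roughly the following. This is a simplified sketch only: the import paths, the EpsilonGreedy policy, and logistic_reward_function are assumed here for illustration, while calc_ground_truth_policy_value and sample_reward are the functions added in this PR (see the notebook and script for the exact code).

# Simplified sketch of the example flow (import paths and some arguments assumed).
from obp.dataset import SyntheticBanditDataset, logistic_reward_function
from obp.policy import EpsilonGreedy
from obp.simulator import run_bandit_simulation, calc_ground_truth_policy_value
from obp.ope import OffPolicyEvaluation, ReplayMethod

# synthetic logged bandit feedback
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    reward_function=logistic_reward_function,
    random_state=12345,
)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)

# simulate an online bandit policy on the logged data to get its action choices
epsilon_greedy = EpsilonGreedy(n_actions=dataset.n_actions, epsilon=0.1, random_state=12345)
action_dist = run_bandit_simulation(bandit_feedback=bandit_feedback, policy=epsilon_greedy)

# OPE of the online policy with the Replay Method (RM)
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[ReplayMethod()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)

# ground-truth policy value via repeated simulation with the new reward sampler
ground_truth = calc_ground_truth_policy_value(
    bandit_feedback=bandit_feedback,
    reward_sampler=dataset.sample_reward,  # p(r|x,a)
    policy=epsilon_greedy,
    n_sim=3,  # the number of simulations
)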

@Kurorororo Kurorororo marked this pull request as draft February 7, 2021 20:44
@Kurorororo Kurorororo marked this pull request as ready for review February 8, 2021 18:08
@usaito
Contributor

usaito commented Feb 9, 2021

@Kurorororo Thanks! Can you address the following points?

[nits]

  • You use distribution of rewards several times in several files. It is correct, but I think reward distribution is better, as it is just as meaningful and simpler.

  • In the quickstart example, you use evaluation_policy_epsilon_greedy, evaluation_policy_lin_ucb, and evaluation_policy_lin_ts to define the (online) evaluation policies. I think just epsilon_greedy, lin_ucb, and lin_ts are sufficient here.

  • When calculating the ground-truth policy values, can you spell out the keyword arguments of calc_ground_truth_policy_value? This will make it easier for users to learn how to use the function.
    That is, I recommend:

policy_value_epsilon_greedy = calc_ground_truth_policy_value(
    bandit_feedback=bandit_feedback,
    reward_sampler=dataset.sample_reward, # p(r|x,a)
    policy=evaluation_policy_epsilon_greedy,
    n_sim=3 # the number of simulations
)

The same comment applies to run_bandit_simulation in the quickstart.
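
For run_bandit_simulation, that would look something like this (keyword names assumed from the simulator API):

action_dist_epsilon_greedy = run_bandit_simulation(
    bandit_feedback=bandit_feedback,  # logged bandit feedback to replay
    policy=evaluation_policy_epsilon_greedy,  # online policy to simulate
)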

[ask]

  • Why did you implement sample_reward and sample_reward_given_expected_reward separately?

@Kurorororo
Contributor Author

Thank you for the review.

  • You use distribution of rewards several times in several files. It is correct, but I think reward distribution is better, as it is just as meaningful and simpler.
  • In the quickstart example, you use evaluation_policy_epsilon_greedy, evaluation_policy_lin_ucb, and evaluation_policy_lin_ts to define the (online) evaluation policies. I think just epsilon_greedy, lin_ucb, and lin_ts are sufficient here.
  • When calculating the ground-truth policy values, can you spell out the keyword arguments of calc_ground_truth_policy_value? This will make it easier for users to learn how to use the function.
  • I think :math:`r \sim p(r \mid x, a)` is more understandable than just :math:`p(r \mid x, a)`.
  • There is a typo in the title: Bndit -> Bandit

I addressed these comments in the latest commit.

  • Why did you implement sample_reward and sample_reward_given_expected_reward separately?

This is because obtain_batch_bandit_feedback needs expected_reward when the reward type is 'continuous'.
Splitting them into two functions lets obtain_batch_bandit_feedback extract expected_reward here:

expected_reward_ = self.calc_expected_reward(context)
reward = self.sample_reward_given_expected_reward(expected_reward_, action)
if self.reward_type == "continuous":
    # correct expected_reward_, as we use a truncated normal distribution here
    mean = expected_reward_
    a = (self.reward_min - mean) / self.reward_std
    b = (self.reward_max - mean) / self.reward_std
    expected_reward_ = truncnorm.stats(
        a=a, b=b, loc=mean, scale=self.reward_std, moments="m"
    )
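
With that correction handled inside obtain_batch_bandit_feedback, sample_reward itself stays a thin wrapper, roughly like this (a simplified sketch, not the exact code):

def sample_reward(self, context, action):
    """Sample rewards given contexts and actions, i.e., r ~ p(r|x,a)."""
    expected_reward_ = self.calc_expected_reward(context)
    return self.sample_reward_given_expected_reward(expected_reward_, action)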

@usaito
Contributor

usaito commented Feb 9, 2021

@Kurorororo Thanks! LGTM!

@usaito usaito merged commit 1254058 into st-tech:master Feb 9, 2021