Examples for Online Bandit Algorithms with Replay Method #67
Conversation
@Kurorororo Thanks! Can you address the following points? [nits]
```python
policy_value_epsilon_greedy = calc_ground_truth_policy_value(
    bandit_feedback=bandit_feedback,
    reward_sampler=dataset.sample_reward,  # p(r|x,a)
    policy=evaluation_policy_epsilon_greedy,
    n_sim=3,  # the number of simulations
)
```

The same comment applies to
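For context, `reward_sampler` is expected to be a callable that samples rewards from p(r|x,a) given contexts and actions. A hypothetical sketch of such a sampler (not the actual `SyntheticBanditDataset.sample_reward` implementation):

```python
import numpy as np

def sample_reward(context: np.ndarray, action: np.ndarray) -> np.ndarray:
    # Hypothetical reward sampler matching the reward_sampler contract:
    # given contexts x of shape (n_rounds, dim_context) and actions a of
    # shape (n_rounds,), draw binary rewards r ~ p(r|x,a).
    # The expected reward below is a toy placeholder, not obp's actual
    # synthetic reward function.
    logits = context.sum(axis=1) + action
    expected_reward = 1.0 / (1.0 + np.exp(-logits))  # p(r=1|x,a) via sigmoid
    return np.random.binomial(n=1, p=expected_reward)
```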
[ask]
Thank you for the review.
I addressed these comments in the latest commit.
This is because of zr-obp/obp/dataset/synthetic.py, lines 290 to 300 in 44bec59.
@Kurorororo Thanks! LGTM!
I added a quickstart notebook and an example script for OPE of online bandit algorithms using the Replay Method (RM). To calculate the ground-truth policy values of online bandit algorithms, I implemented `SyntheticBanditDataset.sample_reward` and `simulator.calc_ground_truth_policy_value`.
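As a reader's aid, here is a minimal sketch of the end-to-end flow such an example would follow, assuming obp's public `SyntheticBanditDataset`, `EpsilonGreedy`, `run_bandit_simulation`, `OffPolicyEvaluation`, and `ReplayMethod` APIs plus the `calc_ground_truth_policy_value` helper added here (the actual notebook and script may differ in details):

```python
from obp.dataset import SyntheticBanditDataset
from obp.policy import EpsilonGreedy
from obp.simulator import run_bandit_simulation, calc_ground_truth_policy_value
from obp.ope import OffPolicyEvaluation, ReplayMethod

# synthetic logged bandit feedback
dataset = SyntheticBanditDataset(n_actions=10, dim_context=5, random_state=12345)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)

# run an online bandit algorithm over the logged data to get its action choices
evaluation_policy = EpsilonGreedy(n_actions=dataset.n_actions, epsilon=0.1, random_state=12345)
action_dist = run_bandit_simulation(bandit_feedback=bandit_feedback, policy=evaluation_policy)

# off-policy evaluation with the Replay Method (RM)
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[ReplayMethod()])
estimated = ope.estimate_policy_values(action_dist=action_dist)

# ground-truth policy value via Monte Carlo simulation with p(r|x,a)
ground_truth = calc_ground_truth_policy_value(
    bandit_feedback=bandit_feedback,
    reward_sampler=dataset.sample_reward,  # p(r|x,a)
    policy=evaluation_policy,
    n_sim=3,  # the number of simulations
)
print(estimated, ground_truth)
```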
I added test code for `SyntheticBanditDataset.sample_reward` but not for `simulator.calc_ground_truth_policy_value`, because `test_simulator.py` does not exist. If you want me to create `test_simulator.py`, I will do it.
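Should `test_simulator.py` be created, a minimal test along these lines might work (the parameters and assertions are illustrative assumptions, not part of this PR):

```python
import numpy as np

from obp.dataset import SyntheticBanditDataset
from obp.policy import EpsilonGreedy
from obp.simulator import calc_ground_truth_policy_value


def test_calc_ground_truth_policy_value():
    dataset = SyntheticBanditDataset(n_actions=3, dim_context=2, random_state=0)
    bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=100)
    policy = EpsilonGreedy(n_actions=dataset.n_actions, epsilon=0.1, random_state=0)
    policy_value = calc_ground_truth_policy_value(
        bandit_feedback=bandit_feedback,
        reward_sampler=dataset.sample_reward,  # p(r|x,a)
        policy=policy,
        n_sim=3,
    )
    # with binary rewards (the dataset default), the value must lie in [0, 1]
    assert np.isfinite(policy_value)
    assert 0.0 <= policy_value <= 1.0
```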