Modify Synthetic Reward/Behavior Policy Functions #145
New Features
Add new synthetic reward functions and behavior policy functions to `obp.dataset.SyntheticBanditDataset`. These new functions provide more natural and flexible reward functions and behavior policies. Specifically, the previous reward and behavior policy functions were specified by a simple linear model, which is often unrealistic. Some of the new functions are based on polynomial feature transformations, making them more complex and thus more realistic. The added (or modified) functions are described below.
Note that the new reward functions are based on `_base_reward_function` and the new behavior policy functions are based on `_base_behavior_policy_function`. These base functions include a more detailed description of the inner workings of the new functions.
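As a rough illustration of the polynomial idea, here is a minimal sketch; the helper `polynomial_expected_reward`, the use of scikit-learn's `PolynomialFeatures`, and all shapes are illustrative assumptions, not obp's internal implementation:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def polynomial_expected_reward(context: np.ndarray, theta: np.ndarray, degree: int = 3) -> np.ndarray:
    """Sigmoid of a linear map applied to polynomial features of the context.

    context: (n_rounds, dim_context), theta: (n_poly_features, n_actions).
    Returns an expected reward for each (round, action) pair.
    """
    phi = PolynomialFeatures(degree=degree).fit_transform(context)  # (n_rounds, n_poly_features)
    logits = phi @ theta                                            # (n_rounds, n_actions)
    return 1.0 / (1.0 + np.exp(-logits))                            # binary-reward case

rng = np.random.default_rng(12345)
context = rng.normal(size=(5, 3))
n_poly_features = PolynomialFeatures(degree=3).fit_transform(context).shape[1]
theta = rng.normal(size=(n_poly_features, 4))  # 4 actions
print(polynomial_expected_reward(context, theta).shape)  # (5, 4)
```

The point of the transformation is simply that the expected reward is no longer linear in the raw context, which is what makes the synthetic data more realistic.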
Also, the new reward functions can take `action_context`, as in the example sketched below. Note that when `action_context=None`, a one-hot action representation is used by default.
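A minimal usage sketch; the exact constructor arguments (in particular `action_context`) are assumptions based on the description above, while `logistic_reward_function` and `obtain_batch_bandit_feedback` are existing parts of obp:

```python
import numpy as np
from obp.dataset import SyntheticBanditDataset, logistic_reward_function

n_actions = 10
dataset = SyntheticBanditDataset(
    n_actions=n_actions,
    dim_context=5,
    reward_function=logistic_reward_function,
    # one row of pre-defined features per action; when action_context=None,
    # a one-hot action representation is used by default
    action_context=np.random.normal(size=(n_actions, 3)),
    random_state=12345,
)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)
```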
When `behavior_policy_function=None` as above, the behavior policy is generated from the expected rewards and the inverse temperature parameter `beta` (see the sketch below), which controls the optimality and entropy of the behavior policy. A large absolute value of `beta` leads to a near-deterministic behavior policy, while a small absolute value leads to a near-uniform behavior policy. A positive value leads to a near-optimal behavior policy, while a negative value leads to a sub-optimal behavior policy.
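The following is a minimal sketch of that generation rule, assuming a softmax over the expected rewards, i.e. pi_b(a|x) ∝ exp(beta * q(x, a)), which matches the description of `beta` above; it is not obp's internal code:

```python
import numpy as np
from scipy.special import softmax

# sketch: when behavior_policy_function=None, the behavior policy is assumed to be
# a softmax over the expected rewards q(x, a) with inverse temperature beta
def softmax_behavior_policy(expected_reward: np.ndarray, beta: float) -> np.ndarray:
    """expected_reward: array of shape (n_rounds, n_actions)."""
    return softmax(beta * expected_reward, axis=1)

q = np.array([[0.1, 0.5, 0.9]])  # expected rewards for a single context
print(softmax_behavior_policy(q, beta=1.0))    # mildly skewed toward the best action
print(softmax_behavior_policy(q, beta=10.0))   # near-deterministic and near-optimal
print(softmax_behavior_policy(q, beta=0.0))    # uniform over actions
print(softmax_behavior_policy(q, beta=-10.0))  # concentrates on the worst action (sub-optimal)
```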
Minor Fixes
- Add tests regarding the new features to `tests/dataset/test_synthetic.py`.
- Minor fixes to `bandit_feedback` (the output of the `obtain_batch_bandit_feedback` method).
- A minor fix involving `scipy.special.logit`.
- Use `NaiveEstimator` instead of `RandomOffPolicyEstimator`, which involves randomness in its estimation and is therefore an odd choice as a baseline estimator.