Modify Synthetic Reward/Behavior Policy Functions #145


Merged: 10 commits merged into master from feature/synthetic-dataset on Nov 13, 2021

Conversation


@usaito usaito commented Nov 11, 2021

New Features

  • define new functions for generating expected rewards and behavior policies in obp.dataset.SyntheticBanditDataset. These provide more natural and flexible reward functions and behavior policies: the previous ones were specified by simple linear models, which is often unrealistic, while some of the new functions apply polynomial feature transformations to produce more complex, and thus more realistic, settings. The added (or modified) functions are as follows.
    • logistic_reward_function
    • logistic_polynomial_reward_function
    • linear_reward_function
    • polynomial_reward_function
    • linear_behavior_policy
    • polynomial_behavior_policy

Note that the new reward functions are based on _base_reward_function and the behavior policy functions are based on _base_behavior_policy_function. These base functions include a more detailed description of the inner workings of the new functions above.
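
As a rough usage sketch (not code from this PR; the exact exported names and constructor signature should be checked against the released obp API), the new functions would be passed to SyntheticBanditDataset roughly like this:

from obp.dataset import (
    SyntheticBanditDataset,
    logistic_polynomial_reward_function,
    polynomial_behavior_policy,
)

# Hypothetical illustration: plug the new nonlinear reward and behavior policy
# functions into the synthetic data generator.
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    reward_function=logistic_polynomial_reward_function,  # nonlinear expected reward
    behavior_policy_function=polynomial_behavior_policy,  # nonlinear logging policy
    random_state=12345,
)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)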

  • users are now able to specify their own action_context as in the following example.
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=10,
    action_context=np.random.normal(size=(10, 5)),  # (n_actions, dim_action_context)
    ....
)

Note that when action_context=None, a one-hot action representation will be used by default.
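
As an illustration of that default (not the library's exact code), the one-hot representation amounts to using the rows of an identity matrix as action contexts:

import numpy as np

# Illustration only: with action_context=None, action a is represented by the
# a-th row of an identity matrix, i.e. a one-hot vector of length n_actions.
n_actions = 10
default_action_context = np.eye(n_actions, dtype=int)  # shape: (n_actions, n_actions)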

  • users are now able to control the optimality and entropy of the behavior policy in a flexible manner.
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=10,
    reward_function=logistic_reward_function,
    beta=1.0, 
    behavior_policy_function=None,
    ....
)

When behavior_policy_function=None as above, the behavior policy will be generated as follows.

\pi(a|x) = \frac{\exp(\beta \cdot q(x,a))}{\sum_{a' \in \mathcal{A}} \exp(\beta \cdot q(x,a'))}

where beta is the inverse temperature parameter, which controls the optimality and entropy of the behavior policy.
A large absolute value of beta leads to a near-deterministic behavior policy, while a value close to zero leads to a near-uniform behavior policy.
A positive value leads to a near-optimal behavior policy, while a negative value leads to a sub-optimal behavior policy.
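
A quick numerical sketch (not part of this PR) of how beta shapes the resulting policy, using hypothetical q-values for a single context:

import numpy as np
from scipy.special import softmax

q_x = np.array([0.1, 0.4, 0.9])  # hypothetical q(x, a) for three actions

for beta in (-10.0, 0.0, 1.0, 10.0):
    pi_b = softmax(beta * q_x)
    print(f"beta={beta:>6}: {np.round(pi_b, 3)}")

# beta = 0.0   -> uniform policy
# beta = 10.0  -> almost all probability on the best action (near-optimal, near-deterministic)
# beta = -10.0 -> almost all probability on the worst action (sub-optimal)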

Minor Fixes

  • add some tests to tests/dataset/test_synthetic.py regarding the new features
  • add the action distribution of the behavior policy (pi_b) to bandit_feedback (the output of the obtain_batch_bandit_feedback method); see the sketch after this list
  • avoid the overflow warning in the logit function by using scipy.special.logit
  • fix some tests regarding OPE performance. In particular, define NaiveEstimator instead of RandomOffPolicyEstimator, whose estimates involve randomness and which is therefore an odd choice of baseline estimator.
  • fix some typos that were not addressed in the previous PR (Feature Implement QLearner #144)
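
For instance, the logged behavior policy distribution added in this PR should be retrievable from the returned dictionary roughly as below (the key name follows the PR description; the exact array shape is an assumption):

from obp.dataset import SyntheticBanditDataset, logistic_reward_function

# Sketch: read the behavior policy's action distribution from bandit_feedback.
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    reward_function=logistic_reward_function,
    random_state=12345,
)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=1000)
pi_b = bandit_feedback["pi_b"]  # presumably shape (n_rounds, n_actions, 1)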

@usaito usaito changed the title [WIP] Modify Synthetic Reward/Behaivor Policy Functions Modify Synthetic Reward/Behaivor Policy Functions Nov 13, 2021
@usaito usaito changed the title Modify Synthetic Reward/Behaivor Policy Functions Modify Synthetic Reward/Behavior Policy Functions Nov 13, 2021
@usaito usaito merged commit 621720b into master Nov 13, 2021
@usaito usaito deleted the feature/synthetic-dataset branch November 13, 2021 09:10