Modify Synthetic Reward/Behavior Policy Functions #145


Merged: 10 commits merged into master from feature/synthetic-dataset on Nov 13, 2021

Conversation


@usaito usaito commented Nov 11, 2021

New Features

  • define new functions for generating expected rewards and behavior policies in obp.dataset.SyntheticBanditDataset. These provide more natural and flexible reward functions and behavior policies: the previous ones were specified by simple linear models, which is often unrealistic, while some of the new functions apply polynomial feature transformations to produce more complex, and thus more realistic, settings. The added (or modified) functions are as follows.
    • logistic_reward_function
    • logistic_polynomial_reward_function
    • linear_reward_function
    • polynomial_reward_function
    • linear_behavior_policy
    • polynomial_behavior_policy

Note that the new reward functions are based on _base_reward_function and the behavior policy functions are based on _base_behavior_policy_function. These base functions include a more detailed description of the inner workings of the new functions above.
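
As a rough usage sketch (not code from this PR; the exact exported names and constructor signature should be checked against the released obp API), the new functions would be passed to SyntheticBanditDataset roughly like this:

from obp.dataset import (
    SyntheticBanditDataset,
    logistic_polynomial_reward_function,
    polynomial_behavior_policy,
)

# Hypothetical illustration: plug the new nonlinear reward and behavior policy
# functions into the synthetic data generator.
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    reward_function=logistic_polynomial_reward_function,  # nonlinear expected reward
    behavior_policy_function=polynomial_behavior_policy,  # nonlinear logging policy
    random_state=12345,
)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)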

  • users are now able to specify their own action_context as in the following example.
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=10,
    action_context=np.random.normal(size=(10, 5)),  # (n_actions, dim_action_context)
    ....
)

Note that when action_context=None, a one-hot action representation will be used by default.
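
As an illustration of that default (not the library's exact code), the one-hot representation amounts to using the rows of an identity matrix as action contexts:

import numpy as np

# Illustration only: with action_context=None, action a is represented by the
# a-th row of an identity matrix, i.e. a one-hot vector of length n_actions.
n_actions = 10
default_action_context = np.eye(n_actions, dtype=int)  # shape: (n_actions, n_actions)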

  • users are now able to control the optimality and entropy of the behavior policy in a flexible manner.
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=10,
    reward_function=logistic_reward_function,
    beta=1.0, 
    behavior_policy_function=None,
    ....
)

When behavior_policy_function=None as above, the behavior policy will be generated as follows.

\pi(a|x) = \frac{\exp(\beta \cdot q(x,a))}{\sum_{a' \in \mathcal{A}} \exp(\beta \cdot q(x,a'))}

where beta is the inverse temperature parameter, which controls the optimality and entropy of the behavior policy.
A large absolute value of beta leads to a near-deterministic behavior policy, while a value close to zero leads to a near-uniform behavior policy.
A positive value leads to a near-optimal behavior policy, while a negative value leads to a sub-optimal behavior policy.
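
A quick numerical sketch (not part of this PR) of how beta shapes the resulting policy, using hypothetical q-values for a single context:

import numpy as np
from scipy.special import softmax

q_x = np.array([0.1, 0.4, 0.9])  # hypothetical q(x, a) for three actions

for beta in (-10.0, 0.0, 1.0, 10.0):
    pi_b = softmax(beta * q_x)
    print(f"beta={beta:>6}: {np.round(pi_b, 3)}")

# beta = 0.0   -> uniform policy
# beta = 10.0  -> almost all probability on the best action (near-optimal, near-deterministic)
# beta = -10.0 -> almost all probability on the worst action (sub-optimal)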

Minor Fixes

  • add some tests to tests/dataset/test_synthetic.py regarding the new features
  • add the action distribution of the behavior policy (pi_b) to bandit_feedback (the output of the obtain_batch_bandit_feedback method); see the sketch after this list
  • avoid the overflow warning in the logit function by using scipy.special.logit
  • fix some tests regarding OPE performance. In particular, define NaiveEstimator instead of RandomOffPolicyEstimator, whose estimates involve randomness and which is therefore an odd choice of baseline estimator.
  • fix some typos that were not addressed in the previous PR (Feature Implement QLearner #144)
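
For instance, the logged behavior policy distribution added in this PR should be retrievable from the returned dictionary roughly as below (the key name follows the PR description; the exact array shape is an assumption):

from obp.dataset import SyntheticBanditDataset, logistic_reward_function

# Sketch: read the behavior policy's action distribution from bandit_feedback.
dataset = SyntheticBanditDataset(
    n_actions=10,
    dim_context=5,
    reward_function=logistic_reward_function,
    random_state=12345,
)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=1000)
pi_b = bandit_feedback["pi_b"]  # presumably shape (n_rounds, n_actions, 1)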

@usaito usaito changed the title [WIP] Modify Synthetic Reward/Behaivor Policy Functions Modify Synthetic Reward/Behaivor Policy Functions Nov 13, 2021
@usaito usaito changed the title Modify Synthetic Reward/Behaivor Policy Functions Modify Synthetic Reward/Behavior Policy Functions Nov 13, 2021
@usaito usaito merged commit 621720b into master Nov 13, 2021
@usaito usaito deleted the feature/synthetic-dataset branch November 13, 2021 09:10