Feature policy learner #132

Merged
merged 16 commits into master from feature-policy-learner on Sep 6, 2021

Conversation


@usaito usaito commented Sep 4, 2021

new features

  • modify obp.policy.NNPolicyLearner so that it estimates the policy value internally, without taking an estimate_policy_value_tensor method as its input (see the before/after comparison below)
# before
## import the OPE estimator whose method is passed to the policy learner
from obp.ope import DoublyRobust
from obp.policy import NNPolicyLearner

## define DR
dr = DoublyRobust()

## define NNPolicyLearner with DR as its objective function
nn_dr = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=dataset.dim_context,
    off_policy_objective=dr.estimate_policy_value_tensor, # a method of OPEEstimator
    random_state=12345,
)

## train NNPolicyLearner on the training set of logged bandit data
nn_dr.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"],
    estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
)
# after
## no OPE estimator import is needed; the objective is specified by a string
from obp.policy import NNPolicyLearner

## define NNPolicyLearner with DR as its objective function
nn_dr = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=dataset.dim_context,
    off_policy_objective="dr", # a string value
    batch_size=64,
    random_state=12345,
)

## train NNPolicyLearner on the training set of logged bandit data
nn_dr.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"],
)

Thus, we no longer have to

  • import the corresponding OPEEstimator to define NNPolicyLearner
  • train a reward estimator before training NNPolicyLearner (estimated_rewards_by_reg_model is no longer needed)
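
After fitting, the learned policy is used in the usual way. For example, a minimal sketch of obtaining its action choices (assuming the existing predict method of NNPolicyLearner and a hypothetical bandit_feedback_test split):

## obtain the action choices of the trained policy on the test set
action_dist = nn_dr.predict(
    context=bandit_feedback_test["context"],
)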

As a result

  • OPEEstimator classes no longer need to have an estimate_policy_value_tensor method

  • There are also two new arguments of obp.policy.NNPolicyLearner, as shown below.

# new arguments
## define NNPolicyLearner with DR as its objective function
nn_dr = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=dataset.dim_context,
    policy_reg_param=0.01, # new argument
    var_reg_param=0.01, # new argument
    off_policy_objective="dr", 
)

By setting positive values for policy_reg_param and var_reg_param, we can activate policy regularization and variance regularization during policy training. The training objective can then be written as:

\hat{V}(\pi_\theta; \mathcal{D}) \;-\; \lambda_{\mathrm{pol}} \, \Omega_{\mathrm{pol}}(\pi_\theta; \mathcal{D}) \;-\; \lambda_{\mathrm{var}} \, \widehat{\mathrm{Var}}\big(\hat{V}(\pi_\theta; \mathcal{D})\big)

where the first term is "off_policy_objective" (the estimated policy value \hat{V}), the second term is the policy regularization, and the third term is the variance regularization; \lambda_{\mathrm{pol}} and \lambda_{\mathrm{var}} correspond to policy_reg_param and var_reg_param, respectively. The second term penalizes a policy that deviates greatly from the behavior policy. The third term penalizes a policy whose policy value estimate has large variance.
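
As a rough illustration of how these terms combine into the training loss (a minimal sketch only; policy_value, policy_constraint, and policy_value_var are hypothetical names standing in for the three terms above, not the actual variables in the implementation):

## loss minimized by gradient descent = negative of the objective above
loss = -policy_value  # "off_policy_objective" term, e.g., the DR estimate of the policy value
loss += policy_reg_param * policy_constraint  # penalizes deviation from the behavior policy
loss += var_reg_param * policy_value_var  # penalizes a high-variance policy value estimate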

minor fixes

  • apply isort
  • remove the estimate_policy_value_tensor method from OPEEstimator, as it is no longer necessary
  • re-run the examples to ensure that they work with the current version, and update the reported results to those obtained with the latest version
  • update examples/quickstart/obl.ipynb to reflect the change in obp.policy.NNPolicyLearner
  • fix the English in many error messages
  • add and remove some test cases to reflect the above changes

@usaito usaito changed the title [WIP] Feature policy learner Feature policy learner Sep 6, 2021
@usaito usaito merged commit 867eebe into master Sep 6, 2021
@usaito usaito deleted the feature-policy-learner branch September 6, 2021 23:31