Feature policy learner #132

Merged
merged 16 commits into master from feature-policy-learner on Sep 6, 2021

Conversation


@usaito usaito commented Sep 4, 2021

new features

  • modify obp.policy.NNPolicyLearner so that it estimates the policy value internally, without taking an estimate_policy_value_tensor method as its input (see the before/after comparison below)
# before
## import the OPE estimator whose method is passed to the policy learner
from obp.ope import DoublyRobust
from obp.policy import NNPolicyLearner

## define DR
dr = DoublyRobust()

## define NNPolicyLearner with DR as its objective function
nn_dr = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=dataset.dim_context,
    off_policy_objective=dr.estimate_policy_value_tensor, # a method of OPEEstimator
    random_state=12345,
)

## train NNPolicyLearner on the training set of logged bandit data
nn_dr.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"],
    estimated_rewards_by_reg_model=estimated_rewards_by_reg_model,
)
# after
## no OPE estimator import is needed; the objective is specified by a string
from obp.policy import NNPolicyLearner

## define NNPolicyLearner with DR as its objective function
nn_dr = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=dataset.dim_context,
    off_policy_objective="dr", # a string value
    batch_size=64,
    random_state=12345,
)

## train NNPolicyLearner on the training set of logged bandit data
nn_dr.fit(
    context=bandit_feedback_train["context"],
    action=bandit_feedback_train["action"],
    reward=bandit_feedback_train["reward"],
    pscore=bandit_feedback_train["pscore"],
)

Thus, we no longer have to

  • import the corresponding OPEEstimator to define NNPolicyLearner
  • train a reward estimator before training NNPolicyLearner (estimated_rewards_by_reg_model is no longer needed)
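
After fitting, the learned policy is used in the usual way. For example, a minimal sketch of obtaining its action choices (assuming the existing predict method of NNPolicyLearner and a hypothetical bandit_feedback_test split):

## obtain the action choices of the trained policy on the test set
action_dist = nn_dr.predict(
    context=bandit_feedback_test["context"],
)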

As a result

  • OPEEstimator classes no longer need to have an estimate_policy_value_tensor method

  • There are also two new arguments of obp.policy.NNPolicyLearner, as shown below.

# new arguments
## define NNPolicyLearner with DR as its objective function
nn_dr = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=dataset.dim_context,
    policy_reg_param=0.01, # new argument
    var_reg_param=0.01, # new argument
    off_policy_objective="dr", 
)

By setting positive values for policy_reg_param and var_reg_param, we can activate policy regularization and variance regularization during policy training. The training objective can then be written as:

\hat{V}(\pi_\theta; \mathcal{D}) \;-\; \lambda_{\mathrm{pol}} \, \Omega_{\mathrm{pol}}(\pi_\theta; \mathcal{D}) \;-\; \lambda_{\mathrm{var}} \, \widehat{\mathrm{Var}}\big(\hat{V}(\pi_\theta; \mathcal{D})\big)

where the first term is "off_policy_objective" (the estimated policy value \hat{V}), the second term is the policy regularization, and the third term is the variance regularization; \lambda_{\mathrm{pol}} and \lambda_{\mathrm{var}} correspond to policy_reg_param and var_reg_param, respectively. The second term penalizes a policy that deviates greatly from the behavior policy. The third term penalizes a policy whose policy value estimate has large variance.
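
As a rough illustration of how these terms combine into the training loss (a minimal sketch only; policy_value, policy_constraint, and policy_value_var are hypothetical names standing in for the three terms above, not the actual variables in the implementation):

## loss minimized by gradient descent = negative of the objective above
loss = -policy_value  # "off_policy_objective" term, e.g., the DR estimate of the policy value
loss += policy_reg_param * policy_constraint  # penalizes deviation from the behavior policy
loss += var_reg_param * policy_value_var  # penalizes a high-variance policy value estimate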

minor fixes

  • apply isort
  • remove the estimate_policy_value_tensor method from OPEEstimator, as it is no longer necessary
  • re-run the examples to ensure that they work with the current version, and update the reported results to those obtained with the latest version
  • update examples/quickstart/obl.ipynb to reflect the change in obp.policy.NNPolicyLearner
  • fix the English in many error messages
  • add and remove some test cases to reflect the above changes

@usaito usaito changed the title [WIP] Feature policy learner Feature policy learner Sep 6, 2021
@usaito usaito merged commit 867eebe into master Sep 6, 2021
@usaito usaito deleted the feature-policy-learner branch September 6, 2021 23:31