Feature policy learner #132
Merged
new features
- Update `obp.policy.NNPolicyLearner` to estimate the policy value without having the `estimate_policy_value_tensor` method of an OPE estimator as its input. Thus, we no longer have to prepare such a method to construct `NNPolicyLearner` (`estimated_rewards_by_reg_model` is no longer necessary to train it). As a result, OPEEstimator classes do not need to have the `estimate_policy_value_tensor` method.
- There are also two new arguments of `obp.policy.NNPolicyLearner`: `policy_reg_param` and `var_reg_param`. By setting positive values for them, we can activate the policy regularization and the variance regularization during policy training (see the usage sketch after this list). The policy training objective can then be written as

  off_policy_objective − policy_reg_param × (policy regularization) − var_reg_param × (variance regularization),

  where the first term is `off_policy_objective`, the second term is the policy regularization, and the third term is the variance regularization. The second term penalizes a policy that is greatly different from the behavior policy. The third term penalizes a policy that has a large uncertainty in its policy value estimate.
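Below is a minimal, hypothetical usage sketch of the updated learner on synthetic data. The `off_policy_objective="ipw"` string argument and the exact constructor signature are assumptions based on this PR's description and may differ across obp versions; `policy_reg_param` and `var_reg_param` are the two new arguments introduced here.

```python
# Minimal sketch (not the exact PR diff): training NNPolicyLearner after this
# change, i.e., without `estimated_rewards_by_reg_model` and without passing
# any estimator's `estimate_policy_value_tensor` method.
from obp.dataset import SyntheticBanditDataset
from obp.policy import NNPolicyLearner

# synthetic logged bandit feedback for illustration
dataset = SyntheticBanditDataset(n_actions=10, dim_context=5, random_state=12345)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)

learner = NNPolicyLearner(
    n_actions=dataset.n_actions,
    dim_context=5,
    off_policy_objective="ipw",  # assumed string option; first term of the objective
    policy_reg_param=0.1,  # > 0 activates the policy regularization
    var_reg_param=0.1,  # > 0 activates the variance regularization
)
learner.fit(
    context=bandit_feedback["context"],
    action=bandit_feedback["action"],
    reward=bandit_feedback["reward"],
    pscore=bandit_feedback["pscore"],
)
action_dist = learner.predict(context=bandit_feedback["context"])
```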
minor fixes
- apply `isort`
- remove the `estimate_policy_value_tensor` method from `OPEEstimator` classes, as it is no longer necessary
- update `examples/quickstart/obl.ipynb` to adjust to the change in `obp.policy.NNPolicyLearner`