
SoftDeterministicPolicyNetwork: fixed inconsistent return types in training and eval mode. #165


Closed
michalgregor wants to merge 4 commits

Conversation

michalgregor
Contributor

I have discovered a slight bug on develop regarding SAC in eval mode. SoftDeterministicPolicyNetwork's forward returns a bare tensor in eval mode but a tuple in training mode. The agent then indexes the returned value with [0] in both cases – so in eval mode, SAC fails, returning a scalar instead of a vector as the action.

Also, this used to work by accident before, because the network returned a 2D tensor: in training mode, the [0] selected the action tensor from the tuple, and in eval mode it selected the first row of the 2D tensor (which still worked, even though it was not really intended).
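To make the mismatch concrete, here is a minimal stand-in sketch (simplified toy class, not the actual library code) of how indexing with [0] behaves under the two return types:

import torch

class PolicyBeforeFix(torch.nn.Module):
    # Stand-in for the old SoftDeterministicPolicyNetwork behavior.
    def forward(self, state):
        action = torch.tensor([0.1, -0.2])     # 1D action vector
        if self.training:
            return action, torch.tensor(-1.3)  # tuple of (action, log_prob)
        return action                          # bare tensor in eval mode

policy = PolicyBeforeFix()
policy.train()
assert policy(None)[0].shape == (2,)  # [0] selects the action from the tuple
policy.eval()
assert policy(None)[0].shape == ()    # [0] indexes into the action tensor: a scalar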

Finally, the pull request is showing me 3 commits instead of 1. I don't know precisely why, since the first 2 commits are already on develop as far as I can see. If you know why this is happening, please let me know and I will fix it before you merge.

michalgregor and others added 4 commits August 11, 2020 14:29
… mode.

* SoftDeterministicPolicyNetwork now returns a tuple in eval mode as
  well as in training mode, the way the interface in the SAC agent expects.
* The line self._name = env only worked when env was a string; the class
  name is now used in place of env when this is not the case.
@@ -43,7 +43,7 @@ def forward(self, state):
         if self.training:
             action, log_prob = self._sample(normal)
             return action, log_prob
-        return self._squash(normal.loc)
+        return self._squash(normal.loc), torch.as_tensor(0.0, device=self.device)
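After this change, both branches return an (action, log_prob) pair, so the agent's [0] indexing yields the action tensor in both modes. A minimal sketch mirroring the stand-in above:

import torch

class PolicyAfterFix(torch.nn.Module):
    # Stand-in mirroring the fixed forward above.
    def forward(self, state):
        action = torch.tensor([0.1, -0.2])
        if self.training:
            return action, torch.tensor(-1.3)  # sampled action and its log_prob
        return action, torch.as_tensor(0.0)    # tuple in eval mode as well

policy = PolicyAfterFix()
policy.eval()
action, log_prob = policy(None)  # unpacking now works in eval mode too
assert action.shape == (2,) and log_prob.item() == 0.0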
Owner


Is this the right fix, or would the right fix be to change SAC? The log_prob isn't really 0.0, per se. A third option would be to actually compute the log_prob for the greedy action? Still sort of misleading, perhaps.

Contributor Author


I can totally change that, but my reasoning was this: unless I am misreading the code, the policy is deterministic in eval mode, so the probability of the selected action is really 1, right? So its log would then be 0?
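(Spelled out: if the eval-mode policy puts all its mass on the greedy action a, then π(a|s) = 1 and log π(a|s) = log 1 = 0, reading "probability" in the discrete sense, which is presumably what is meant here.)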

# Model construction
feature_model_constructor=nature_features,
value_model_constructor=nature_value_head,
policy_model_constructor=nature_policy_head
Owner


Not sure why GitHub is showing these as changes; it looks okay in the file view, though.

    env = gym.make(env)
else:
    self._name = env.__class__.__name__
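For context, the surrounding constructor logic is presumably along these lines (a hypothetical reconstruction from the diff; the class name and attributes are assumptions):

import gym

class GymEnvironment:
    def __init__(self, env):
        if isinstance(env, str):
            self._name = env                     # a string id doubles as the name
            env = gym.make(env)                  # build the environment from its id
        else:
            self._name = env.__class__.__name__  # fall back to the class name
        self._env = env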

Owner


Nice catch!

Contributor Author


Yeah, I wanted to do this in a separate PR originally, but apparently I am missing something crucial as to how PRs work here on GitHub. :D :D I will need to look into that...

@cpnota
Owner

cpnota commented Sep 29, 2020

Ended up going with a slightly different fix. The fixes were handled by #169 and #170. Thanks again for identifying these bugs!

cpnota closed this Sep 29, 2020