Make ROM use more robust #193

Open
nczempin opened this issue Mar 26, 2017 · 2 comments
@nczempin
Contributor

Building on the ideas in issue #192, my primary goal is to make it easier for myself and others to use ROMs beyond those that have their own classes compiled in — hopefully in a purely data-driven way, so that no compilation is necessary.

The process of mapping a ROM name from the command line to the correct code is not very robust: if you don't use exactly the expected name, you can run into a segmentation fault in loadROM(). Since the ROMs already seem to be recognized (their names are printed on the command line), it should be possible in many if not most cases to work from that information and more or less ignore the filename.

For most ROMs, all that is really needed is to specify "where do I find the score" and "what value do I compare to what to determine the end of the game". Since those can be expressed as plain numbers, there is no need for a separate subclass each time. The common cases ("use BCD from these n locations", where n is usually 1 to 3, and "the game ends when the content of this memory address reaches m", where m is usually 0 or 0xFF; there will be others) can be enabled by default, and common variations can be handled by one more parameter (e.g. "use plain integer representation instead of BCD") that routes into a different part of the generic code.
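As a sketch of what such a data-driven description could look like (the config shape, field names, and RAM addresses below are all hypothetical, not part of ALE), here is a generic score/terminal spec together with the packed-BCD decoding it implies:

```python
# Hypothetical data-driven ROM description; none of these field names
# exist in ALE today -- this only illustrates the idea.
EXAMPLE_ROM_CONFIG = {
    "score": {
        "addresses": [0x81, 0x82],  # most significant byte first (made-up addresses)
        "encoding": "bcd",          # the common case; "int" would be a variation
    },
    "terminal": {
        "address": 0xBA,            # made-up "lives" address
        "value": 0x00,              # game over when RAM[address] == value
    },
}

def decode_bcd(ram_bytes):
    """Decode a packed-BCD score: each byte holds two decimal digits."""
    score = 0
    for b in ram_bytes:
        score = score * 100 + (b >> 4) * 10 + (b & 0x0F)
    return score

def is_terminal(ram, config):
    """Check the generic 'RAM[address] goes to value' terminal condition."""
    t = config["terminal"]
    return ram[t["address"]] == t["value"]
```

With this, bytes [0x01, 0x23] decode to a score of 123, matching the "score 23 shows up as 0x23" observation below.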

For figuring out the scores and lives, the approach I described in issue #192 has its advantages, but it is not trivial to implement. In the meantime, we can use ALE itself to help us semi-automatically find which addresses contain these values, by filtering the memory contents for addresses that mostly increase (scores) or mostly decrease (lives). A user can let an agent play, or play manually (there is a minor complication in that we don't return to Python until we re-disable manual control, but it's manageable), watch some memory dumps, and fairly quickly find e.g. the memory address that goes to 0x23 when the score reaches 23; usually one right next to it will go from 0 to 1 when the score passes 100, and so on.
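That filtering step might be sketched as follows, assuming we can collect periodic RAM dumps (e.g. via ALE's getRAM()); the function name and thresholds here are hypothetical:

```python
def find_monotonic_addresses(ram_dumps, decreasing=False, tolerance=0.9):
    """Return RAM addresses whose value mostly increases (or decreases)
    across a sequence of RAM dumps: score candidates mostly increase,
    lives candidates mostly decrease.

    ram_dumps: list of equal-length byte sequences sampled over time.
    """
    steps = len(ram_dumps) - 1
    candidates = []
    for addr in range(len(ram_dumps[0])):
        values = [dump[addr] for dump in ram_dumps]
        if len(set(values)) == 1:
            continue  # never changed during play: not interesting
        if decreasing:
            good = sum(values[t + 1] <= values[t] for t in range(steps))
        else:
            good = sum(values[t + 1] >= values[t] for t in range(steps))
        if good / steps >= tolerance:
            candidates.append(addr)
    return candidates
```

Running several episodes and intersecting the candidate sets would narrow things down further; the tolerance allows for resets and occasional mid-game drops.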

I have a bunch of other ideas; no need to describe them all in detail here. Suffice it to say that I'm working on making these changes myself, except for those where someone says they're already 90 % done :-)

@nczempin
Contributor Author

nczempin commented Mar 26, 2017

The most important part of reverse-engineering the ROMs is actually not the score, but the "lives" (or similar) metric that determines the terminal state: many "arcade-like" games have the implicit goal of surviving as long as possible, so even a simple +1 per step (or normalized to once per second, etc.) would move an agent in the right direction for many games.

Of course, this is not true for all games; in Pong it would optimize for 21-20 or 20-21 scores that take a very long time. However, to achieve that, the agent has to learn to return the ball, which is a good start: not as good as actually finding where the score is kept in memory, but better than giving out no reward at all. And in Space Invaders the agent won't go for motherships, but it still needs to learn to fight the regular aliens.
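The "+1 per step" idea amounts to a thin wrapper around any environment whose step() reports a terminal flag; the env interface below is a hypothetical sketch, not ALE's actual API:

```python
class SurvivalReward:
    """Wrap an environment so the agent earns +1 per step survived,
    regardless of the (unknown) in-game score."""

    def __init__(self, env):
        self.env = env

    def step(self, action):
        # Assumed interface: the wrapped env returns (observation, terminal).
        observation, terminal = self.env.step(action)
        return observation, 1.0, terminal  # constant survival reward
```

The total return of an episode is then exactly its length, so maximizing return means surviving as long as possible.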

nczempin added a commit to nczempin/Arcade-Learning-Environment that referenced this issue Mar 27, 2017
@nczempin
Contributor Author

Ideally we would like to be able to reverse-engineer both the score and the terminal condition.

I already described that, for many games, just being able to detect the terminal condition is enough: we can use the number of steps it takes to reach it as the reward.

It also works the other way round: if we don't know the terminal condition, we can simply terminate after an arbitrary number of steps, and agents that maximize the score under that cutoff should also do well in the actual game, where the terminal condition is detected.
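Conversely, the fixed-step cutoff can be sketched as a truncation wrapper (again a hypothetical env interface, with a made-up default horizon):

```python
class FixedHorizon:
    """Terminate after max_steps when the real terminal condition is
    unknown. An agent that maximizes score under this cutoff should
    transfer reasonably to the real game, whose true terminal state
    this wrapper never observes."""

    def __init__(self, env, max_steps=10000):
        self.env = env
        self.max_steps = max_steps
        self.t = 0

    def step(self, action):
        # Assumed interface: the wrapped env returns (observation, reward).
        observation, reward = self.env.step(action)
        self.t += 1
        return observation, reward, self.t >= self.max_steps
```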

nczempin added a commit to nczempin/Arcade-Learning-Environment that referenced this issue Apr 10, 2017
nczempin added a commit to nczempin/Arcade-Learning-Environment that referenced this issue May 3, 2017