modify Params to support adding arbitrary files to the archive #590
Conversation
also I know some of the names are goofy
actually, hold off looking at this, I thought of some ways to make it better
Something to keep in mind as you're trying to figure this out: some of the things we want to archive are large files that we may not want to keep in memory the whole time, or even ever read into memory. The …
so the use case you're describing is … ? is my use case ("stick arbitrary in-memory objects in the archive") also necessary?
I can't think of an obvious use-case where you want to store in-memory objects. Maybe the … @DeNeutoy, @matt-peters, I know you were thinking you'd want some kind of file access for ELMo at test time - am I missing something here? And sorry @joelgrus, I didn't look closely enough earlier at the …
The ELMo use case is to provide a configuration file at test time, so that the user doesn't have to include it in their own configuration, and it makes swapping out different ELMo models easy. However, there is a much lighter workaround than this PR for this in place now: just specify the required file as an S3 path and use the existing …

A secondary use case for generic model serialization that this PR does address is allowing model components to specify which of their parameters to serialize with the saved model, since for ELMo the pretrained, fixed parameters are ~350 MB, so writing them out to every checkpoint is quite inefficient.
to be clear, this sort of serialization wouldn't happen at every checkpoint (at least not the way archiving currently works), it would only happen when training stops |
(edit: as @joelgrus beat me to it) this PR doesn't actually address the inefficiency of writing parameters to disk, because (at least so far) it's not concerned with which things need to be saved when you use …

allennlp/allennlp/training/trainer.py, lines 435 to 436 in f700584
So, I think the only use case we want this for right now is to fix how we're handling the …
ok, attempt #2. this is not fully tested (the unit test tests like half of it), here is the idea
then the code for archiving models (this part I haven't tested yet) searches for all …

on deserialization, if such a mapping file exists in the archive, it gets used to create a bunch of overrides for those params, which are passed into the new Params object.

anyway, take a look and see what you think about this approach. I modified the …
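The mechanics described above could be sketched roughly like this. This is a simplified toy version, not the actual allennlp code: `add_file_to_archive` and the shared `files_to_archive` dict follow the PR's description, but the history-prefixed keys and everything else here are assumptions for illustration.

```python
class Params:
    """Toy sketch of a nested-config wrapper that tracks files to archive."""

    def __init__(self, params: dict, history: str = "", files_to_archive: dict = None):
        self.params = params
        self.history = history
        # one dict shared between a parent Params and all of its children
        self.files_to_archive = {} if files_to_archive is None else files_to_archive

    def add_file_to_archive(self, name: str) -> None:
        # record the file under its fully qualified key, e.g. "model.elmo.options_file"
        self.files_to_archive[self.history + name] = self.params[name]

    def pop(self, key: str):
        value = self.params.pop(key)
        if isinstance(value, dict):
            # a child Params shares the same files_to_archive dict
            return Params(value, self.history + key + ".", self.files_to_archive)
        return value


params = Params({"model": {"elmo": {"options_file": "/path/options.json"}}})
elmo_params = params.pop("model").pop("elmo")
elmo_params.add_file_to_archive("options_file")
# the top-level Params now knows which files the archiver should copy in
assert params.files_to_archive == {"model.elmo.options_file": "/path/options.json"}
```

On deserialization, such a mapping could then be turned into overrides that point each recorded key at the file's location inside the archive.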
The general approach looks good to me. I just have three questions:
Actually, taking a step back on this point... The only reason we remove the filename key for the …

This is a long way of saying that I originally thought there was something missing here, but, now that I think about it a bit more, this will work fine as you have it for …
re (2) the idea was that the …

re (3) another possibility is to require …
I mean, we're already doing multiple inheritance in a bunch of places, e.g. `class TokenEmbedder(torch.nn.Module, Registrable):`
IMO it's fine in the case when all but one of the base classes are essentially "mixins", which is what we have here |
Yeah, it's just unfortunate when one of those mix-ins really wants to have a constructor. But it's fine.
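For illustration, here is a minimal sketch of the mixin pattern being discussed; `Registrable` and `Module` below are simplified stand-ins for the real allennlp and torch classes, not their actual implementations.

```python
class Registrable:
    # stand-in for allennlp's Registrable mixin: a by-name registry of subclasses
    _registry = {}

    @classmethod
    def register(cls, name):
        def decorator(subclass):
            cls._registry[name] = subclass
            return subclass
        return decorator


class Module:
    # stand-in for torch.nn.Module, the one base class with a real constructor
    def __init__(self):
        self.training = True


class TokenEmbedder(Module, Registrable):
    pass


@TokenEmbedder.register("dummy")
class DummyEmbedder(TokenEmbedder):
    def __init__(self, dim):
        super().__init__()  # runs Module.__init__; Registrable contributes no constructor
        self.dim = dim


emb = DummyEmbedder(dim=5)
assert emb.training and emb.dim == 5
assert Registrable._registry["dummy"] is DummyEmbedder
```

Because only `Module` defines `__init__`, the MRO is unambiguous and `super().__init__()` does the right thing, which is what makes the "all but one base is a mixin" case comfortable.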
ok, I think this is ready for another look, with the new …
This looks great. I really like how this turned out. Great work!
```python
self.params = _replace_none(params)
self.history = history
self.loading_from_archive = loading_from_archive
self.files_to_archive = {} if files_to_archive is None else files_to_archive
```
How do you feel about `self.files_to_archive = files_to_archive or {}`?
I had it that way originally, but that creates a bug. basically, we want `files_to_archive` shared down through the children, but if you write it that way then when the parent has an empty dict the child gets a different empty dict
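A minimal demonstration of the bug being described, using toy classes rather than the actual `Params` code:

```python
class Buggy:
    def __init__(self, files_to_archive=None):
        # when the caller passes an empty dict, `{} or {}` evaluates to a
        # brand-new dict, silently breaking the parent/child sharing
        self.files_to_archive = files_to_archive or {}


class Fixed:
    def __init__(self, files_to_archive=None):
        # explicit None check: an empty dict passed in is still the shared one
        self.files_to_archive = {} if files_to_archive is None else files_to_archive


parent = Buggy()
child = Buggy(files_to_archive=parent.files_to_archive)
child.files_to_archive["weights"] = "w.th"
assert parent.files_to_archive == {}  # lost: the child got a different dict

parent = Fixed()
child = Fixed(files_to_archive=parent.files_to_archive)
child.files_to_archive["weights"] = "w.th"
assert parent.files_to_archive == {"weights": "w.th"}  # shared as intended
```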
Ah, great, good point.
allennlp/common/params.py (Outdated)
```
Any class in its ``from_params`` method can request that some of its
input files be added to the archive by calling this method.

For example, if some class ``A`` had a ``input_file`` parameter, it could call
```
s/had a/had an/
fixed
@@ -121,11 +121,16 @@ def forward(self, # pylint: disable=arguments-differ

```python
@classmethod
def from_params(cls, params: Params) -> 'Elmo':
    # Add files to archive
    params.add_file_to_archive('options_file')
    params.add_file_to_archive('weight_file')
```
We shouldn't add the weight file to the archive here, I think. Instead the ELMo code should be able to load itself without the weight file if `params.loading_from_archive` (because the weight file will be redundant with the model weights, and is just taking up space). But that's a bigger change that probably belongs in a different PR, so this is fine for now.
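A rough sketch of what that suggestion might look like; the function, dict keys, and control flow here are hypothetical illustrations, not the PR's actual API:

```python
def elmo_from_params(params):
    # hypothetical sketch: skip the ~350 MB weight file when restoring from
    # an archive, since the real weights arrive with the archived model state
    options_file = params["options_file"]
    if params.get("loading_from_archive"):
        weight_file = None  # weights will come from the saved model instead
    else:
        weight_file = params["weight_file"]
    return {"options_file": options_file, "weight_file": weight_file}


restored = elmo_from_params({"options_file": "options.json",
                             "weight_file": "weights.hdf5",
                             "loading_from_archive": True})
assert restored == {"options_file": "options.json", "weight_file": None}
```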
…ai#590)

* wip archivable
* clean up
* updates
* revamp
* modify elmo
* revert some changes
* update docstring
* update tests
* workaround pyhocon bug
* add one more assert
* pylint
* add back line
* change to dict
* rename Archivable to FileArchiver
* rename some things
* move files_to_archive over to Params
* update docs etc
* add end to end test
* return cls directly
* address pr feedback
I took backward compatibility as a design goal, so here's what it does at a high level.

* the archive gets an extra `shelf` file that's a key-value store generated using the `shelve` module. (if there's no such file, like in our current models, then there's just not any additional persisted state)
* a class opts in to the `Archivable` abstraction, which requires it to be able to generate a dictionary of stuff it wants archived, and to be able to populate its state from a corresponding dict
* a `Model` instance … and get archiving for free for all its members and so on.

let me know if this is too much craziness, we could probably make it more explicit somehow. this is mostly for discussion right now. let me know what you think of this approach and whether it would satisfy our needs
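The `shelve`-based idea above can be sketched like this; the state keys and layout are illustrative assumptions, not the PR's actual format:

```python
import os
import shelve
import tempfile

# each "Archivable" component contributes a dict of state, keyed by its
# position in the model (keys and structure here are made up for illustration)
state = {"model.elmo": {"options": {"num_layers": 2}},
         "model.vocab": {"size": 10000}}

shelf_path = os.path.join(tempfile.mkdtemp(), "shelf")

# serialization: write each component's dict into the on-disk key-value store
with shelve.open(shelf_path) as shelf:
    for key, value in state.items():
        shelf[key] = value  # values are pickled by the shelve module

# deserialization: each component repopulates itself from its dict
with shelve.open(shelf_path) as shelf:
    restored = {key: shelf[key] for key in shelf}

assert restored == state
```

If no such shelf file exists in an archive (as with previously saved models), loading simply proceeds with no additional persisted state, which is what keeps the scheme backward compatible.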
Fixes #554.