This repository was archived by the owner on Dec 16, 2022. It is now read-only.

modify Params to support adding arbitrary files to the archive #590

Merged
merged 25 commits into master on Dec 15, 2017

Conversation

joelgrus
Contributor

@joelgrus joelgrus commented Dec 7, 2017

I took backward compatibility as a design goal, so here's what it does at a high level.

  • every model.tar.gz optionally contains an additional shelf file that's a key-value store generated using the shelve module. (if there's no such file, like in our current models, then there's just no additional persisted state)
  • any class can implement the Archivable abstraction, which requires it to be able to generate a dictionary of stuff it wants archived, and to be able to populate its state from a corresponding dict
  • I used some voodoo to basically inspect class members and recurse; the idea is that you could pass in a Model instance and get archiving for free for all its members and so on. let me know if this is too much craziness, we could probably make it more explicit somehow.
  • I wrote a unit test to give an example of how this works

this is mostly for discussion right now. let me know what you think of this approach and whether it would satisfy our needs

Fixes #554.
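The shelf-based design above could look something like this minimal sketch (the names `Archivable`, `items_to_archive`, `write_shelf`, and `read_shelf` are illustrative, not the PR's actual API):

```python
import shelve
from typing import Any, Dict

class Archivable:
    """Hypothetical mixin: a subclass says what to persist and how to restore it."""

    def items_to_archive(self) -> Dict[str, Any]:
        raise NotImplementedError

    def restore_from_archive(self, state: Dict[str, Any]) -> None:
        raise NotImplementedError

def write_shelf(path: str, obj: Archivable) -> None:
    # Persist the object's chosen state as a key-value shelf file.
    with shelve.open(path) as shelf:
        for key, value in obj.items_to_archive().items():
            shelf[key] = value

def read_shelf(path: str, obj: Archivable) -> None:
    # Populate the object's state from a previously written shelf.
    with shelve.open(path) as shelf:
        obj.restore_from_archive(dict(shelf))
```

If no shelf file exists in the archive (as with current models), loading would simply skip this step, which is what keeps it backward compatible.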

@joelgrus
Contributor Author

joelgrus commented Dec 7, 2017

also I know some of the names are goofy

@joelgrus
Contributor Author

joelgrus commented Dec 7, 2017

actually, hold off looking at this, I thought of some ways to make it better

@matt-gardner
Contributor

Something to keep in mind as you're trying to figure this out: some of the things we want to archive are large files that we may not want to keep in memory the whole time, or even ever read into memory. The Embedding class needs to use this mechanism to get the embedding file into the archive, and preferably also to remove the hackiness around modifying Params on load (they still need to be modified, but it's less hacky if Embedding does the modification itself).

@joelgrus
Contributor Author

joelgrus commented Dec 7, 2017

so the use case you're describing is

  • some object has a path to a file
  • we want to add that file to model.tar.gz
  • when restoring a saved model, we want that file extracted and used (either by modifying Params to point at it or by directly telling the object to use it instead of what's in Params)

?

is my use case ("stick arbitrary in-memory objects in the archive") also necessary?

@matt-gardner
Contributor

I can't think of an obvious use-case where you want to store in-memory objects. Maybe the Vocabulary? But that makes the vocab harder to look at manually, and probably isn't a good idea. The original motivation (#554) was that you get file inputs in your training configuration that you won't have access to during evaluation, so you need to save whatever files are necessary in the model archive. If the object is just stored in memory, you probably constructed it without file input from the configuration, and you can do that again with no trouble.

@DeNeutoy, @matt-peters, I know you were thinking you'd want some kind of file access for ELMo at test time - am I missing something here? And sorry @joelgrus, I didn't look closely enough earlier at the shelve stuff you shared to realize that it wasn't actually solving the file use case.

@matt-peters
Contributor

The ELMo use case is to provide a configuration file at test time so that the user doesn't have to include it in their own configuration, and to make swapping out different ELMo models easy. However, there is a much lighter workaround than this PR for this in place now: just specify the required file as an S3 path and use the existing cached_path functionality to download the file at test time if it isn't already present. This to me is preferable to mangling the user's Params since it's explicit, easy to implement, and easy to grok from a user's perspective.
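That workaround can be sketched roughly like this (a simplified stand-in for allennlp's `cached_path`; the real implementation also handles S3 paths via boto3 and uses HTTP ETags for cache invalidation):

```python
import hashlib
import os
import urllib.request

CACHE_DIR = os.path.expanduser("~/.cache/mymodels")  # illustrative location

def cached_path(url_or_filename: str) -> str:
    """Return a local path; download to a content-addressed cache if it's a URL."""
    if not url_or_filename.startswith(("http://", "https://", "s3://")):
        return url_or_filename  # already a local file, nothing to do
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_file = os.path.join(
        CACHE_DIR, hashlib.sha256(url_or_filename.encode()).hexdigest())
    if not os.path.exists(cache_file):
        # (sketch only handles plain HTTP; s3:// would need boto3)
        urllib.request.urlretrieve(url_or_filename, cache_file)
    return cache_file
```

The config then just stores the S3/HTTP path, and every machine that evaluates the model resolves it lazily.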

A secondary use case for generic model serialization that this PR does address is allowing model components to specify which of their parameters to serialize with the saved model, since for ELMo the pretrained, fixed parameters are ~350 MB so writing them out to every checkpoint is quite inefficient.

@joelgrus
Contributor Author

joelgrus commented Dec 7, 2017

to be clear, this sort of serialization wouldn't happen at every checkpoint (at least not the way archiving currently works), it would only happen when training stops

@matt-gardner
Contributor

matt-gardner commented Dec 7, 2017

(edit: @joelgrus beat me to it.) This PR doesn't actually address the inefficiency of writing parameters to disk, because (at least so far) it's not concerned with which things need to be saved when you use torch.save():

model_state = self._model.state_dict()
torch.save(model_state, model_path)

So, I think the only use case we want this for right now is to fix how we're handling the Embedding, so that you can actually expand your vocabulary at test time. I can think of a couple of other use cases that we might get to (e.g., a pre-computed byte-pair vocabulary, making ELMo option specification a little nicer, so you don't have to upload changes to S3 if you're training your own ELMo model), so it's good to do this in a general way like you're doing, but those aren't as high priority.

@joelgrus
Contributor Author

joelgrus commented Dec 8, 2017

ok, attempt #2. this is not fully tested (the unit test covers about half of it); here's the idea:

  • a class inherits from Archivable only if its constructor takes files that it wants to archive
  • in that case in its from_params method it has to set an instance variable _param_history to params.history (possibly there's some way to make this happen automatically?)
  • it also needs to override files_to_archive() which just returns a dict { params_key -> filename }

then the code for archiving models (this part I haven't tested yet) searches for all Archivable members, collects the files to archive, and sticks them in an archive along with the params_key -> filename mapping.

on deserialization, if such a mapping file exists in the archive, it gets used to create a bunch of overrides for those params, which are passed into the new Params object.

anyway, take a look and see what you think about this approach. I modified the Elmo code to show how it could work.
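Concretely, the archive/restore flow described above might look like this sketch (the function names and the `files_to_archive.json` / `fta/` archive layout are made up for illustration, not the PR's actual code):

```python
import json
import os
import tarfile
import tempfile

def archive_model(archive_path: str, files_to_archive: dict) -> None:
    """files_to_archive maps a params key (hocon path) -> local filename."""
    with tarfile.open(archive_path, "w:gz") as tar:
        mapping = {}
        for key, filename in files_to_archive.items():
            arcname = "fta/" + key  # keys are unique hocon paths, so no clobbering
            tar.add(filename, arcname=arcname)
            mapping[key] = arcname
        # store the mapping itself, so load time knows which params to override
        meta_path = os.path.join(tempfile.mkdtemp(), "files_to_archive.json")
        with open(meta_path, "w") as meta_file:
            json.dump(mapping, meta_file)
        tar.add(meta_path, arcname="files_to_archive.json")

def load_overrides(archive_path: str, extract_dir: str) -> dict:
    """Extract the archive and return {params_key: extracted_path} overrides."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(extract_dir)
    meta_path = os.path.join(extract_dir, "files_to_archive.json")
    if not os.path.exists(meta_path):
        return {}  # old-style archive: nothing extra to override
    with open(meta_path) as meta_file:
        mapping = json.load(meta_file)
    return {key: os.path.join(extract_dir, arcname)
            for key, arcname in mapping.items()}
```

On deserialization, the returned dict would become the overrides passed into the new Params object, so each archived class transparently sees the extracted file path instead of the original one.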

@matt-gardner
Contributor

The general approach looks good to me. I just have three questions:

  1. We still need some way of telling Embedding (or any other similar class) whether we're loading for training vs. loading for test. We currently do that by removing the key for the embedding file, but if we want to actually use the embedding file at test time to expand our vocab, we can't remove it anymore. One option here, instead of (or in addition to) passing a mapping from hocon keys to filenames, is to have a function that modifies the Params object on load. You currently have that in load_archive, but I wonder if we can pass that logic on to the FileArchiver somehow, so that Embedding can do more than just replace a filename. Seems a little tricky...

Actually, taking a step back on this point... The only reason we remove the filename key for the Embedding is because we might not have access to the original file, and we don't need it anyway, because the weights are saved in the model. If we did always have the file, we'd just be spending a few seconds loading the embeddings before replacing them with the loaded model state. So, removing the key from the parameters isn't actually necessary, it just might save a few seconds on loading.

This is a long way of saying that I originally thought there was something missing here, but, now that I think about it a bit more, this will work fine as you have it for Embedding. We'll just need to implement some "expand vocab" functionality now that we're guaranteed to have access to the embedding file at test time.

  2. Can FileArchivers clobber each other in the archive? It looks like they currently can, if two different archivers pick the same key.

  3. Is there an easy way to avoid doing this with multiple inheritance? The class-level _param_history made me cringe a little - you're doing that so mypy is happy, because you can't have a constructor (you can't construct it due to the multiple inheritance issue). If this is the best solution, that's fine; I can't think of a better way to do this right now, either.

@joelgrus
Contributor Author

joelgrus commented Dec 8, 2017

re (2) the idea was that the key will always be the hocon path to the object you're trying to replace, so unless you screw that up you shouldn't have a collision

re (3) another possibility is to require files_to_archive to return the full hocon path for each file. in that case you'd still need to store the history somehow but it wouldn't be part of the FileArchiver class, it's up to you.

@matt-gardner
Contributor

  2. I must have missed something in my brief look. If this is already taken care of, then great.
  3. It's not a super strong preference, and that small change won't get rid of the multiple inheritance. What you have is fine; I think using multiple inheritance here is a fine tradeoff given the alternatives.

@joelgrus
Contributor Author

joelgrus commented Dec 9, 2017

I mean, we're already doing multiple inheritance in a bunch of places, e.g.

class TokenEmbedder(torch.nn.Module, Registrable):

IMO it's fine in the case when all but one of the base classes are essentially "mixins", which is what we have here
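A toy illustration of that pattern (using stand-ins for torch.nn.Module and allennlp's Registrable, not the real implementations): only one base class carries constructor state, while the mixin contributes purely class-level behavior, so the MRO stays unproblematic.

```python
class Module:
    """Stand-in for torch.nn.Module: the one base class with constructor state."""
    def __init__(self) -> None:
        self.training = True

class Registrable:
    """Mixin with no __init__ of its own, so it composes safely."""
    _registry: dict = {}

    @classmethod
    def register(cls, name: str):
        def decorator(subclass):
            Registrable._registry[name] = subclass
            return subclass
        return decorator

@Registrable.register("token_embedder")
class TokenEmbedder(Module, Registrable):
    pass

# Module.__init__ runs normally; Registrable adds only class-level machinery.
embedder = TokenEmbedder()
```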

@matt-gardner
Contributor

matt-gardner commented Dec 9, 2017 via email

@joelgrus joelgrus changed the title from "[WIP] [for discussion] adding arbitrary objects to the archive" to "modify Params to support adding arbitrary files to the archive" on Dec 13, 2017
@joelgrus
Contributor Author

ok, I think this is ready for another look, with the new Params approach

Contributor

@matt-gardner matt-gardner left a comment


This looks great. I really like how this turned out. Great work!

self.params = _replace_none(params)
self.history = history
self.loading_from_archive = loading_from_archive
self.files_to_archive = {} if files_to_archive is None else files_to_archive
Contributor


How do you feel about self.files_to_archive = files_to_archive or {}?

Contributor Author


I had it that way originally, but that creates a bug. basically, we want files_to_archive shared down through the children, but if you write it that way, then when the parent has an empty dict (which is falsey) the child gets a different empty dict instead of sharing the parent's.
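The difference is easy to demonstrate: with `or`, an empty parent dict (which is falsey) silently forks into a new object, so writes from child Params never propagate back.

```python
parent = {}  # the parent Params' files_to_archive, still empty at construction

shared = {} if parent is None else parent  # the PR's version: same object
forked = parent or {}                      # empty dict is falsey -> new object

shared["model.options_file"] = "options.json"

assert "model.options_file" in parent      # child writes reach the parent
assert "model.options_file" not in forked  # the `or` version silently diverges
```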

Contributor


Ah, great, good point.

Any class in its ``from_params`` method can request that some of its
input files be added to the archive by calling this method.

For example, if some class ``A`` had a ``input_file`` parameter, it could call
Contributor


s/had a/had an/

Contributor Author


fixed

@@ -121,11 +121,16 @@ def forward(self, # pylint: disable=arguments-differ

@classmethod
def from_params(cls, params: Params) -> 'Elmo':
# Add files to archive
params.add_file_to_archive('options_file')
params.add_file_to_archive('weight_file')
Contributor


We shouldn't add the weight file to the archive here, I think. Instead the ELMo code should be able to load itself without the weight file, if params.loading_from_archive (because the weight file will be redundant with the model weights, and is just taking up space). But that's a bigger change that probably belongs in a different PR, so this is fine for now.

@joelgrus joelgrus merged commit 1793768 into master Dec 15, 2017
@joelgrus joelgrus deleted the archivable branch December 18, 2017 23:32
gabrielStanovsky pushed a commit to gabrielStanovsky/allennlp that referenced this pull request Sep 7, 2018
…ai#590)

* wip archivable

* clean up

* updates

* revamp

* modify elmo

* revert some changes

* update docstring

* update tests

* workaround pyhocon bug

* add one more assert

* pylint

* add back line

* change to dict

* rename Archivable to FileArchiver

* rename some things

* move files_to_archive over to Params

* update docs etc

* add end to end test

* return cls directly

* address pr feedback