This repository was archived by the owner on Dec 16, 2022. It is now read-only.

modify Params to support adding arbitrary files to the archive #590

Merged
merged 25 commits into master on Dec 15, 2017

Conversation

joelgrus
Contributor

@joelgrus joelgrus commented Dec 7, 2017

I took backward compatibility as a design goal, so here's what it does at a high level.

  • every model.tar.gz optionally contains an additional shelf file that's a key-value store generated using the shelve module. (if there's no such file, like in our current models, then there's just no additional persisted state)
  • any class can implement the Archivable abstraction, which requires it to be able to generate a dictionary of stuff it wants archived, and to be able to populate its state from a corresponding dict
  • I used some voodoo to basically inspect class members and recurse; the idea is that you could pass in a Model instance and get archiving for free for all its members and so on. let me know if this is too much craziness, we could probably make it more explicit somehow.
  • I wrote a unit test to give an example of how this works

this is mostly for discussion right now. let me know what you think of this approach and whether it would satisfy our needs

Fixes #554.
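The shelf-based design above could look something like this minimal sketch (the names `Archivable`, `items_to_archive`, `write_shelf`, and `read_shelf` are illustrative, not the PR's actual API):

```python
import shelve
from typing import Any, Dict

class Archivable:
    """Hypothetical mixin: a subclass says what to persist and how to restore it."""

    def items_to_archive(self) -> Dict[str, Any]:
        raise NotImplementedError

    def restore_from_archive(self, state: Dict[str, Any]) -> None:
        raise NotImplementedError

def write_shelf(path: str, obj: Archivable) -> None:
    # Persist the object's chosen state as a key-value shelf file.
    with shelve.open(path) as shelf:
        for key, value in obj.items_to_archive().items():
            shelf[key] = value

def read_shelf(path: str, obj: Archivable) -> None:
    # Populate the object's state from a previously written shelf.
    with shelve.open(path) as shelf:
        obj.restore_from_archive(dict(shelf))
```

If no shelf file exists in the archive (as with current models), loading would simply skip this step, which is what keeps it backward compatible.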

@joelgrus
Contributor Author

joelgrus commented Dec 7, 2017

also I know some of the names are goofy

@joelgrus
Contributor Author

joelgrus commented Dec 7, 2017

actually, hold off looking at this, I thought of some ways to make it better

@matt-gardner
Contributor

Something to keep in mind as you're trying to figure this out: some of the things we want to archive are large files that we may not want to keep in memory the whole time, or even ever read into memory. The Embedding class needs to use this mechanism to get the embedding file into the archive, and preferably also to remove the hackiness around modifying Params on load (they still need to be modified, but it's less hacky if Embedding does the modification itself).

@joelgrus
Contributor Author

joelgrus commented Dec 7, 2017

so the use case you're describing is

  • some object has a path to a file
  • we want to add that file to model.tar.gz
  • when restoring a saved model, we want that file extracted and used (either by modifying Params to point at it or by directly telling the object to use it instead of what's in Params)

?

is my use case ("stick arbitrary in-memory objects in the archive") also necessary?

@matt-gardner
Contributor

I can't think of an obvious use-case where you want to store in-memory objects. Maybe the Vocabulary? But that makes the vocab harder to look at manually, and probably isn't a good idea. The original motivation (#554) was that you get file inputs in your training configuration that you won't have access to during evaluation, so you need to save whatever files are necessary in the model archive. If the object is just stored in memory, you probably constructed it without file input from the configuration, and you can do that again with no trouble.

@DeNeutoy, @matt-peters, I know you were thinking you'd want some kind of file access for ELMo at test time - am I missing something here? And sorry @joelgrus, I didn't look closely enough earlier at the shelve stuff you shared to realize that it wasn't actually solving the file use case.

@matt-peters
Contributor

The ELMo use case is to provide a configuration file at test time so that the user doesn't have to include it in their own configuration, and to make swapping out different ELMo models easy. However, there is a much lighter workaround than this PR for this in place now: just specify the required file as an S3 path and use the existing cached_path functionality to download the file at test time if it isn't already present. This to me is preferable to mangling the user's Params since it's explicit, easy to implement, and easy to grok from a user's perspective.
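That workaround can be sketched roughly like this (a simplified stand-in for allennlp's `cached_path`; the real implementation also handles S3 paths via boto3 and uses HTTP ETags for cache invalidation):

```python
import hashlib
import os
import urllib.request

CACHE_DIR = os.path.expanduser("~/.cache/mymodels")  # illustrative location

def cached_path(url_or_filename: str) -> str:
    """Return a local path; download to a content-addressed cache if it's a URL."""
    if not url_or_filename.startswith(("http://", "https://", "s3://")):
        return url_or_filename  # already a local file, nothing to do
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_file = os.path.join(
        CACHE_DIR, hashlib.sha256(url_or_filename.encode()).hexdigest())
    if not os.path.exists(cache_file):
        # (sketch only handles plain HTTP; s3:// would need boto3)
        urllib.request.urlretrieve(url_or_filename, cache_file)
    return cache_file
```

The config then just stores the S3/HTTP path, and every machine that evaluates the model resolves it lazily.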

A secondary use case for generic model serialization that this PR does address is allowing model components to specify which of their parameters to serialize with the saved model, since for ELMo the pretrained, fixed parameters are ~350 MB so writing them out to every checkpoint is quite inefficient.

@joelgrus
Contributor Author

joelgrus commented Dec 7, 2017

to be clear, this sort of serialization wouldn't happen at every checkpoint (at least not the way archiving currently works), it would only happen when training stops

@matt-gardner
Contributor

matt-gardner commented Dec 7, 2017

(edit: @joelgrus beat me to it.) This PR doesn't actually address the inefficiency of writing parameters to disk, because (at least so far) it's not concerned with which things need to be saved when you use torch.save():

model_state = self._model.state_dict()
torch.save(model_state, model_path)

So, I think the only use case we want this for right now is to fix how we're handling the Embedding, so that you can actually expand your vocabulary at test time. I can think of a couple of other use cases that we might get to (e.g., a pre-computed byte-pair vocabulary, making ELMo option specification a little nicer, so you don't have to upload changes to S3 if you're training your own ELMo model), so it's good to do this in a general way like you're doing, but those aren't as high priority.

@joelgrus
Contributor Author

joelgrus commented Dec 8, 2017

ok, attempt #2. this is not fully tested (the unit test covers about half of it); here's the idea:

  • a class inherits from Archivable only if its constructor takes files that it wants to archive
  • in that case in its from_params method it has to set an instance variable _param_history to params.history (possibly there's some way to make this happen automatically?)
  • it also needs to override files_to_archive() which just returns a dict { params_key -> filename }

then the code for archiving models (this part I haven't tested yet) searches for all Archivable members, collects the files to archive, and sticks them in an archive along with the params_key -> filename mapping.

on deserialization, if such a mapping file exists in the archive, it gets used to create a bunch of overrides for those params, which are passed into the new Params object.

anyway, take a look and see what you think about this approach. I modified the Elmo code to show how it could work.
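Concretely, the archive/restore flow described above might look like this sketch (the function names and the `files_to_archive.json` / `fta/` archive layout are made up for illustration, not the PR's actual code):

```python
import json
import os
import tarfile
import tempfile

def archive_model(archive_path: str, files_to_archive: dict) -> None:
    """files_to_archive maps a params key (hocon path) -> local filename."""
    with tarfile.open(archive_path, "w:gz") as tar:
        mapping = {}
        for key, filename in files_to_archive.items():
            arcname = "fta/" + key  # keys are unique hocon paths, so no clobbering
            tar.add(filename, arcname=arcname)
            mapping[key] = arcname
        # store the mapping itself, so load time knows which params to override
        meta_path = os.path.join(tempfile.mkdtemp(), "files_to_archive.json")
        with open(meta_path, "w") as meta_file:
            json.dump(mapping, meta_file)
        tar.add(meta_path, arcname="files_to_archive.json")

def load_overrides(archive_path: str, extract_dir: str) -> dict:
    """Extract the archive and return {params_key: extracted_path} overrides."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(extract_dir)
    meta_path = os.path.join(extract_dir, "files_to_archive.json")
    if not os.path.exists(meta_path):
        return {}  # old-style archive: nothing extra to override
    with open(meta_path) as meta_file:
        mapping = json.load(meta_file)
    return {key: os.path.join(extract_dir, arcname)
            for key, arcname in mapping.items()}
```

On deserialization, the returned dict would become the overrides passed into the new Params object, so each archived class transparently sees the extracted file path instead of the original one.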

@matt-gardner
Contributor

The general approach looks good to me. I just have three questions:

  1. We still need some way of telling Embedding (or any other similar class) whether we're loading for training vs. loading for test. We currently do that by removing the key for the embedding file, but if we want to actually use the embedding file at test time to expand our vocab, we can't remove it anymore. One option here, instead of (or in addition to) passing a mapping from hocon keys to filenames, is to have a function that modifies the Params object on load. You currently have that in load_archive, but I wonder if we can pass that logic on to the FileArchiver somehow, so that Embedding can do more than just replace a filename. Seems a little tricky...

Actually, taking a step back on this point... The only reason we remove the filename key for the Embedding is because we might not have access to the original file, and we don't need it anyway, because the weights are saved in the model. If we did always have the file, we'd just be spending a few seconds loading the embeddings before replacing them with the loaded model state. So, removing the key from the parameters isn't actually necessary, it just might save a few seconds on loading.

This is a long way of saying that I originally thought there was something missing here, but, now that I think about it a bit more, this will work fine as you have it for Embedding. We'll just need to implement some "expand vocab" functionality now that we're guaranteed to have access to the embedding file at test time.

  2. Can FileArchivers clobber each other in the archive? It looks like they currently can, if two different archivers pick the same key.

  3. Is there an easy way to avoid doing this with multiple inheritance? The class-level _param_history made me cringe a little - you're doing that so mypy is happy, because you can't have a constructor (you can't construct it due to the multiple inheritance issue). If this is the best solution, that's fine; I can't think of a better way to do this right now, either.

@joelgrus
Contributor Author

joelgrus commented Dec 8, 2017

re (2) the idea was that the key will always be the hocon path to the object you're trying to replace, so unless you screw that up you shouldn't have a collision

re (3) another possibility is to require files_to_archive to return the full hocon path for each file. in that case you'd still need to store the history somehow but it wouldn't be part of the FileArchiver class, it's up to you.

@matt-gardner
Contributor

  2. I must have missed something in my brief look. If this is already taken care of, then great.
  3. It's not a super strong preference, and that small change won't get rid of the multiple inheritance. What you have is fine; I think using multiple inheritance here is a fine tradeoff given the alternatives.

@joelgrus
Contributor Author

joelgrus commented Dec 9, 2017

I mean, we're already doing multiple inheritance in a bunch of places, e.g.

class TokenEmbedder(torch.nn.Module, Registrable):

IMO it's fine in the case when all but one of the base classes are essentially "mixins", which is what we have here
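A toy illustration of that pattern (using stand-ins for torch.nn.Module and allennlp's Registrable, not the real implementations): only one base class carries constructor state, while the mixin contributes purely class-level behavior, so the MRO stays unproblematic.

```python
class Module:
    """Stand-in for torch.nn.Module: the one base class with constructor state."""
    def __init__(self) -> None:
        self.training = True

class Registrable:
    """Mixin with no __init__ of its own, so it composes safely."""
    _registry: dict = {}

    @classmethod
    def register(cls, name: str):
        def decorator(subclass):
            Registrable._registry[name] = subclass
            return subclass
        return decorator

@Registrable.register("token_embedder")
class TokenEmbedder(Module, Registrable):
    pass

# Module.__init__ runs normally; Registrable adds only class-level machinery.
embedder = TokenEmbedder()
```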

@matt-gardner
Contributor

matt-gardner commented Dec 9, 2017 via email

@joelgrus joelgrus changed the title from "[WIP] [for discussion] adding arbitrary objects to the archive" to "modify Params to support adding arbitrary files to the archive" on Dec 13, 2017
@joelgrus
Contributor Author

ok, I think this is ready for another look, with the new Params approach

Contributor

@matt-gardner matt-gardner left a comment


This looks great. I really like how this turned out. Great work!

self.params = _replace_none(params)
self.history = history
self.loading_from_archive = loading_from_archive
self.files_to_archive = {} if files_to_archive is None else files_to_archive
Contributor


How do you feel about self.files_to_archive = files_to_archive or {}?

Contributor Author


I had it that way originally, but that creates a bug. basically, we want files_to_archive shared down through the children, but if you write it that way, then when the parent has an empty dict (which is falsey) the child gets a different empty dict instead of sharing the parent's.
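The difference is easy to demonstrate: with `or`, an empty parent dict (which is falsey) silently forks into a new object, so writes from child Params never propagate back.

```python
parent = {}  # the parent Params' files_to_archive, still empty at construction

shared = {} if parent is None else parent  # the PR's version: same object
forked = parent or {}                      # empty dict is falsey -> new object

shared["model.options_file"] = "options.json"

assert "model.options_file" in parent      # child writes reach the parent
assert "model.options_file" not in forked  # the `or` version silently diverges
```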

Contributor


Ah, great, good point.

Any class in its ``from_params`` method can request that some of its
input files be added to the archive by calling this method.

For example, if some class ``A`` had a ``input_file`` parameter, it could call
Contributor


s/had a/had an/

Contributor Author


fixed

@@ -121,11 +121,16 @@ def forward(self, # pylint: disable=arguments-differ

@classmethod
def from_params(cls, params: Params) -> 'Elmo':
# Add files to archive
params.add_file_to_archive('options_file')
params.add_file_to_archive('weight_file')
Contributor


We shouldn't add the weight file to the archive here, I think. Instead the ELMo code should be able to load itself without the weight file, if params.loading_from_archive (because the weight file will be redundant with the model weights, and is just taking up space). But that's a bigger change that probably belongs in a different PR, so this is fine for now.

@joelgrus joelgrus merged commit 1793768 into master Dec 15, 2017
@joelgrus joelgrus deleted the archivable branch December 18, 2017 23:32
gabrielStanovsky pushed a commit to gabrielStanovsky/allennlp that referenced this pull request Sep 7, 2018
…ai#590)

* wip archivable

* clean up

* updates

* revamp

* modify elmo

* revert some changes

* update docstring

* update tests

* workaround pyhocon bug

* add one more assert

* pylint

* add back line

* change to dict

* rename Archivable to FileArchiver

* rename some things

* move files_to_archive over to Params

* update docs etc

* add end to end test

* return cls directly

* address pr feedback