Skip to content
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Make load_archive operate on serialization directories. #2554

Merged
merged 5 commits into from
Mar 4, 2019

Conversation

brendan-ai2
Copy link
Contributor

@brendan-ai2 brendan-ai2 commented Mar 1, 2019

@joelgrus
Copy link
Contributor

joelgrus commented Mar 1, 2019

I don't think this is the right solution, for a couple of reasons:

  1. this problem applies to other commands too
  2. we do want to support loading from a directory that really corresponds to the contents of an archive file; the problem here was that it didn't

for #1 that suggests that the right place for the fix is in load_archive
for #2 I think a better fix is to go where it's doing the fta replacement

https://github.com/allenai/allennlp/blob/master/allennlp/models/archival.py#L200

and actually checking if those files exist, and giving a very explicit warning if they don't

I would also write a test to verify this behavior

@joelgrus
Copy link
Contributor

joelgrus commented Mar 1, 2019

also, probably this belongs in a different PR, but it's not good that using overrides didn't work here

@brendan-ai2 brendan-ai2 changed the title Add warning on loading a directory with evaluate. Make load_archive operate on serialization directories. Mar 1, 2019
@brendan-ai2
Copy link
Contributor Author

Good points! Given those, it seemed about as easy to make load_archive work on directories, so I did that. Hopefully that wasn't misguided... Fixed the overrides bug too.

@julianmichael
Copy link
Contributor

Files only appear at fta/{key} inside the archive—not in the serialization directory—right? Unless I'm mistaken or this has changed, that means this fix will fail when running from a normal serialization directory. I think it should fall back to original_filename (as an absolute path) when the file is not found to cover this case.

I had to work around the bug myself recently and this behavior was very convenient for when I had to fix up some config files (because it was over NFS and editing/expanding/compressing huge tarballs was very slow).

@brendan-ai2
Copy link
Contributor Author

@julianmichael
Copy link
Contributor

Ah I see now. It also seems the trainer behavior has changed so it puts the best weights at weights.th instead of best.th. Now I can see that this would work. Sorry about that—thanks!

@brendan-ai2
Copy link
Contributor Author

No worries! Would have been an easy bug to write, so such feedback is appreciated. :)

@@ -175,8 +175,10 @@ def load_archive(archive_file: str,
logger.info(f"loading archive file {archive_file} from cache at {resolved_archive_file}")

if os.path.isdir(resolved_archive_file):
loading_dir = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would slightly prefer a more descriptive name like loading_from_directory or archive_is_directory or something like that. loading_dir sounds like it's the directory itself

Copy link
Contributor Author

@brendan-ai2 brendan-ai2 Mar 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Var removed.

@@ -190,18 +192,22 @@ def load_archive(archive_file: str,

# Check for supplemental files in archive
fta_filename = os.path.join(serialization_dir, _FTA_NAME)
if os.path.exists(fta_filename):
if not loading_dir and os.path.exists(fta_filename):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the right logic? imagine that I have a model.tar.gz, and then I untar it somewhere and then point to that location. in that case we'd have loading_dir == True but we'd still want to grab the files from the "archive" (which is a directory)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. It does seem possible that some users are relying on this. I've changed the behavior again to act as it did before if the files are present, but to load their serialization directory equivalents otherwise.

(I also considered touching an ARCHIVE_MARKER or SERIALIZATION_DIR_MARKER that we could check explicitly, but it seem desirable to have this work for existing dirs and archives.)

Copy link
Contributor

@DeNeutoy DeNeutoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this issue describes the problem in more depth. I haven't reviewed the PR so it might fix it, but I hadn't seen the issue crop up in the discussions so maybe there's something there:

#1052

Copy link
Contributor

@joelgrus joelgrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good, thanks

@brendan-ai2 brendan-ai2 merged commit d0f7170 into allenai:master Mar 4, 2019
@brendan-ai2
Copy link
Contributor Author

Thanks @DeNeutoy for the link and @joelgrus for the review!

@DeNeutoy
Copy link
Contributor

DeNeutoy commented Mar 4, 2019

Nice one, this has been a long running annoying issue, good to clear it out!

Whu-wxy added a commit to Whu-wxy/allennlp that referenced this pull request Mar 6, 2019
Make `load_archive` operate on serialization directories. (allenai#2554)
reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
- Fixes allenai#1052.
- Files like ELMo weight files aren't added to the serialization dir, but are added to the archive with `add_file_to_archive`.
- This results in misleading errors when using `load_archive` as it will attempt to load a serialization dir.
- Solution: Retain original paths when we're loading from a directory.
- Drive by fix for overrides not overriding.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using a serialisation directory as a model archive doesn't work with files_to_archive.json
4 participants