Skip to content
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

fix race condition when extracting files with cached_path #5227

Merged
merged 4 commits into from
May 27, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- When `PretrainedTransformerIndexer` folds long sequences, it no longer loses the information from token type ids.
- Fixed documentation for `GradientDescentTrainer.cuda_device`.
- Fixed the potential for a race condition with `cached_path()` when extracting archives. Although the race condition
is still possible if used with `force_extract=True`.
- Fixed `wandb` callback to work in distributed training.


Expand Down
18 changes: 17 additions & 1 deletion allennlp/common/file_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -247,6 +247,10 @@ def cached_path(
force_extract : `bool`, optional (default = `False`)
If `True` and the file is an archive file, it will be extracted regardless
of whether or not the extracted directory already exists.

!!! Warning
Use this flag with caution! This can lead to race conditions if used
from multiple processes on the same file.
"""
if cache_dir is None:
cache_dir = CACHE_DIRECTORY
Expand Down Expand Up @@ -325,12 +329,24 @@ def cached_path(

if extraction_path is not None:
# If the extracted directory already exists (and is non-empty), then no
# need to extract again unless `force_extract=True`.
# need to create a lock file and extract again unless `force_extract=True`.
if os.path.isdir(extraction_path) and os.listdir(extraction_path) and not force_extract:
return extraction_path

# Extract it.
with FileLock(extraction_path + ".lock"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make sure that only one process/thread makes it into the block below, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeup

# Check again if the directory exists now that we've acquired the lock.
if os.path.isdir(extraction_path) and os.listdir(extraction_path):
if force_extract:
logger.warning(
"Extraction directory for %s (%s) already exists, "
"overwriting it since 'force_extract' is 'True'",
url_or_filename,
extraction_path,
)
else:
return extraction_path

logger.info("Extracting %s to %s", url_or_filename, extraction_path)
shutil.rmtree(extraction_path, ignore_errors=True)

Expand Down