-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Hf hub #5052
Hf hub #5052
Changes from 6 commits
8500c9d
de772d8
d5ed2b2
c636215
8783d3c
4d1a30f
fae94c3
f972a3d
eb7ec2f
c0bafdd
3ee4b85
a8277a1
e0874b5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,14 +9,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 | |
|
||
### Added | ||
|
||
- Added support for the HuggingFace Hub as an alternative way to handle loading files. | ||
- The test for distributed metrics now takes a parameter specifying how often you want to run it. | ||
|
||
|
||
## [v2.3.0](https://github.com/allenai/allennlp/releases/tag/v2.3.0) - 2021-04-14 | ||
|
||
### Added | ||
|
||
- Ported the following Huggingface `LambdaLR`-based schedulers: `ConstantLearningRateScheduler`, `ConstantWithWarmupLearningRateScheduler`, `CosineWithWarmupLearningRateScheduler`, `CosineHardRestartsWithWarmupLearningRateScheduler`. | ||
- Ported the following HuggingFace `LambdaLR`-based schedulers: `ConstantLearningRateScheduler`, `ConstantWithWarmupLearningRateScheduler`, `CosineWithWarmupLearningRateScheduler`, `CosineHardRestartsWithWarmupLearningRateScheduler`. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, good catch! 😉 |
||
- Added new `sub_token_mode` parameter to `pretrained_transformer_mismatched_embedder` class to support first sub-token embedding | ||
- Added a way to run a multi task model with a dataset reader as part of `allennlp predict`. | ||
- Added new `eval_mode` in `PretrainedTransformerEmbedder`. If it is set to `True`, the transformer is _always_ run in evaluation mode, which, e.g., disables dropout and does not update batch normalization statistics. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -50,10 +50,12 @@ | |
from overrides import overrides | ||
import requests | ||
from requests.adapters import HTTPAdapter | ||
from requests.exceptions import ConnectionError | ||
from requests.exceptions import ConnectionError, HTTPError | ||
from requests.packages.urllib3.util.retry import Retry | ||
import lmdb | ||
from torch import Tensor | ||
from huggingface_hub import hf_hub_url, cached_download, snapshot_download | ||
from allennlp.version import VERSION | ||
|
||
from allennlp.common.tqdm import Tqdm | ||
|
||
|
@@ -233,9 +235,45 @@ def cached_path( | |
cache_dir = os.path.expanduser(cache_dir) | ||
os.makedirs(cache_dir, exist_ok=True) | ||
|
||
extraction_path: Optional[str] = None | ||
|
||
if not isinstance(url_or_filename, str): | ||
url_or_filename = str(url_or_filename) | ||
|
||
if not os.path.isfile(url_or_filename) and not urlparse(url_or_filename).scheme in ("http", "https", "s3"): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am a little worried about this. It's not transparent to a user that the behavior of this hinges on the presence of absence of a local file in their current working directory. It also makes the AllenNLP configuration files a lot less self-contained. Do you think we could treat this like a URL, and make it There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, we could definitely treat it like an URL and have it preceded by either of the three propositions you have made. Is there one you favor? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think |
||
# If can't be resolved to a path or a URL, | ||
# let's try to find it on Hugging Face model hub | ||
# e.g. lysandre/pair-classification-roberta-mnli is a valid model id | ||
# and lysandre/pair-classification-roberta-mnli@main supports specifying a commit/branch/tag. | ||
if len(url_or_filename.split("/")) > 2: | ||
filename = url_or_filename.split("/")[2] | ||
url_or_filename = "/".join(url_or_filename.split("/")[:2]) | ||
else: | ||
filename = None | ||
|
||
if "@" in url_or_filename: | ||
repo_id = url_or_filename.split("@")[0] | ||
revision = url_or_filename.split("@")[1] | ||
else: | ||
repo_id = url_or_filename | ||
revision = None | ||
|
||
try: | ||
if filename is not None: | ||
url = hf_hub_url( | ||
repo_id=repo_id, filename=filename, revision=revision | ||
) | ||
url_or_filename = cached_download( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a bit confusing to me. We set |
||
url=url, | ||
library_name="allennlp", | ||
library_version=VERSION, | ||
cache_dir=CACHE_DIRECTORY, | ||
) | ||
else: | ||
extraction_path = Path(snapshot_download(repo_id, revision=revision, cache_dir=CACHE_DIRECTORY)) | ||
except HTTPError as e: | ||
logger.warning(f"Tried to download from Hugging Face Hub but failed with {e}") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why warn here and not just let the exception propagate? I think it would be better to emit a |
||
|
||
file_path: str | ||
|
||
# If we're using the /a/b/foo.zip!c/d/file.txt syntax, handle it here. | ||
|
@@ -261,9 +299,7 @@ def cached_path( | |
|
||
parsed = urlparse(url_or_filename) | ||
|
||
extraction_path: Optional[str] = None | ||
|
||
if parsed.scheme in ("http", "https", "s3"): | ||
if parsed.scheme in ("http", "https", "s3") and extraction_path is None: | ||
# URL, so get it from the cache (downloading if necessary) | ||
file_path = get_from_cache(url_or_filename, cache_dir) | ||
|
||
|
@@ -272,7 +308,7 @@ def cached_path( | |
# For example ~/.allennlp/cache/234234.21341 -> ~/.allennlp/cache/234234.21341-extracted | ||
extraction_path = file_path + "-extracted" | ||
|
||
else: | ||
elif extraction_path is None: | ||
url_or_filename = os.path.expanduser(url_or_filename) | ||
|
||
if os.path.exists(url_or_filename): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go with
huggingface://
orhf://
type urls, we should mention that here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with
hf://
! I updated the changelog.