
Prevent serialization/pickling errors when using cloud-based datasets (e.g., S3). #4690

Open
SajidAlamQB opened this issue Apr 25, 2025 · 2 comments

Comments

@SajidAlamQB
Contributor

SajidAlamQB commented Apr 25, 2025

Description

Related to: #2162

Datasets backed by cloud storage (e.g., S3 via s3fs) often hold non-picklable internal references (e.g., S3FileSystem._glob), which cause ParallelRunner to fail during multiprocessing pickling.
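A minimal sketch of the failure mode, using a locally defined callable as a stand-in for the non-picklable fsspec reference (the class here is hypothetical, not a real Kedro dataset):

```python
import pickle

class CloudDataset:
    """Hypothetical stand-in for a dataset backed by cloud storage."""

    def __init__(self):
        # fsspec-style pattern: cache a callable on the instance.
        # A locally defined callable cannot be serialized by the stdlib
        # pickle, which is the same failure mode ParallelRunner hits
        # when it sends datasets to worker processes.
        self._glob = lambda pattern: [pattern]

ds = CloudDataset()
try:
    pickle.dumps(ds)
except (pickle.PicklingError, AttributeError, TypeError) as exc:
    print(f"Pickling failed: {exc}")
```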

Potential Fix:

  • Add a validation layer in ParallelRunner._validate_catalog() that detects datasets whose _protocol is a cloud protocol ("s3", "gcs", "adl", etc.) and raises a warning suggesting alternatives such as ThreadRunner (see the sketch after this list).

  • Update the documentation with guidance and caveats on cloud-based storage and multiprocessing, explaining why cloud datasets cause issues with ParallelRunner.
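As a rough sketch of that check (the helper name, the protocol list, and the use of catalog.list() / _get_dataset() are assumptions here, not the final design):

```python
import logging

from kedro.io import DataCatalog

logger = logging.getLogger(__name__)

CLOUD_PROTOCOLS = ("s3", "s3a", "s3n", "gcs", "gs", "adl", "abfs", "abfss")

def _warn_on_cloud_datasets(catalog: DataCatalog) -> None:
    """Hypothetical helper for ParallelRunner._validate_catalog()."""
    for name in catalog.list():
        dataset = catalog._get_dataset(name)  # private API; illustration only
        protocol = getattr(dataset, "_protocol", None)
        if protocol in CLOUD_PROTOCOLS:
            logger.warning(
                "Dataset '%s' uses cloud protocol '%s' and may fail to "
                "pickle under ParallelRunner; consider ThreadRunner instead.",
                name,
                protocol,
            )
```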

@datajoely
Contributor

I wonder if we can recommend a better pickling library like cloudpickle
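For illustration, cloudpickle serializes by value things the stdlib pickler rejects; wiring it into ParallelRunner's multiprocessing machinery would be a separate question:

```python
import pickle

import cloudpickle

fn = lambda x: x + 1  # stdlib pickle cannot serialize a lambda

try:
    pickle.dumps(fn)
except (pickle.PicklingError, AttributeError) as exc:
    print(f"stdlib pickle failed: {exc}")

payload = cloudpickle.dumps(fn)   # cloudpickle serializes it by value
restored = pickle.loads(payload)  # the payload loads with stdlib pickle
print(restored(41))               # 42
```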

@astrojuanlu
Member

Rather than looking at this from the Runners perspective, can we maybe do so from the Datasets perspective? #1936

According to the original issue #2162 this has to do with fsspec. Custom datasets can of course introduce non-pickleable attributes, but our official ones are unpickleable because of fsspec, so it's something we can address.
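A sketch of what the datasets-side fix could look like (hypothetical class, not Kedro's actual dataset base): drop the fsspec filesystem handle when pickling and rebuild it after unpickling.

```python
import fsspec

class CloudBackedDataset:
    """Hypothetical dataset that survives pickling across processes."""

    def __init__(self, filepath: str):
        self._filepath = filepath
        self._protocol = fsspec.utils.get_protocol(filepath)
        self._fs = fsspec.filesystem(self._protocol)

    def __getstate__(self):
        # The filesystem handle is what fails to pickle; drop it.
        state = self.__dict__.copy()
        state["_fs"] = None
        return state

    def __setstate__(self, state):
        # Rebuild the filesystem handle in the worker process.
        self.__dict__.update(state)
        self._fs = fsspec.filesystem(self._protocol)
```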
