Datasets backed by cloud storage (e.g., S3 via s3fs) often hold non-picklable internal references (e.g., S3FileSystem._glob), which cause ParallelRunner to fail during multiprocessing pickling.
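The failure mode can be reproduced without any cloud dependency. The sketch below is only an illustration: `FakeCloudDataset` and `is_picklable` are hypothetical names, and a `threading.Lock` stands in for whatever non-picklable state a real filesystem object might hold.

```python
import pickle
import threading


class FakeCloudDataset:
    """Hypothetical stand-in for a dataset that keeps a live filesystem handle."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath
        # Assumption: a lock plays the role of the non-picklable internal reference.
        self._fs_handle = threading.Lock()


def is_picklable(obj: object) -> bool:
    """Return True if obj survives pickle.dumps, which multiprocessing relies on."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


print(is_picklable(FakeCloudDataset("s3://bucket/data.csv")))
print(is_picklable({"filepath": "data.csv"}))
```

Any object that has to cross a process boundary goes through the same pickling step, so a single attribute like this is enough to break a process-based runner.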
Potential Fix:
Add a validation layer in ParallelRunner._validate_catalog() that detects cloud-backed datasets, i.e. those whose _protocol is one of "s3", "gcs", "adl", etc., and emits a warning that suggests alternatives such as ThreadRunner.
Update the documentation with guidance and caveats for cloud-based storage and multiprocessing, and explain why cloud datasets cause issues with ParallelRunner.
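A minimal sketch of the proposed check, assuming each dataset exposes a string `_protocol` attribute as described above. The helper names, the `CLOUD_PROTOCOLS` set, and the warning text are hypothetical and not part of Kedro's API; the real check would live inside `ParallelRunner._validate_catalog()`.

```python
import warnings

# Assumption: the protocols named in this issue; the real list may be longer.
CLOUD_PROTOCOLS = {"s3", "gcs", "adl"}


def find_cloud_datasets(datasets: dict) -> list:
    """Return the sorted names of datasets whose _protocol is a cloud protocol."""
    names = []
    for name, dataset in datasets.items():
        protocol = getattr(dataset, "_protocol", None)
        # Only compare plain strings; other shapes are ignored in this sketch.
        if isinstance(protocol, str) and protocol in CLOUD_PROTOCOLS:
            names.append(name)
    return sorted(names)


def warn_about_cloud_datasets(datasets: dict) -> list:
    """Warn once if any cloud-backed datasets are present, and return their names."""
    cloud = find_cloud_datasets(datasets)
    if cloud:
        warnings.warn(
            f"Datasets {cloud} use cloud storage and may not be picklable; "
            "consider using ThreadRunner instead of ParallelRunner.",
            UserWarning,
            stacklevel=2,
        )
    return cloud
```

The sketch warns rather than raises, so pipelines whose cloud datasets happen to pickle correctly keep working.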
Rather than looking at this from the Runners perspective, can we maybe do so from the Datasets perspective? #1936
According to the original issue #2162 this has to do with fsspec. Custom datasets can of course introduce non-pickleable attributes, but our official ones are unpickleable because of fsspec, so it's something we can address.
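One possible shape for a dataset-side fix is to drop the filesystem handle when pickling and recreate it lazily afterwards. This is a sketch under assumptions rather than how Kedro's datasets are implemented: `LazyFsDataset` and `_make_fs` are hypothetical, and a `threading.Lock` again stands in for the non-picklable filesystem object.

```python
import pickle
import threading


class LazyFsDataset:
    """Hypothetical dataset that never pickles its filesystem handle."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath
        self._fs = None  # created on first access

    def _make_fs(self):
        # Assumption: stand-in for constructing the real filesystem client.
        return threading.Lock()

    @property
    def fs(self):
        """Create the handle on demand so each process builds its own."""
        if self._fs is None:
            self._fs = self._make_fs()
        return self._fs

    def __getstate__(self) -> dict:
        # Copy the instance state and drop the handle before pickling.
        state = self.__dict__.copy()
        state["_fs"] = None
        return state


if __name__ == "__main__":
    dataset = LazyFsDataset("s3://bucket/data.csv")
    _ = dataset.fs  # force creation of the non-picklable handle
    clone = pickle.loads(pickle.dumps(dataset))
    print(clone._filepath, clone.fs is not None)
```

With this pattern the runner needs no special handling, because the dataset itself decides which parts of its state are safe to send to a worker process.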
Description
Related to: #2162