
Prevent serialization/pickling errors when using cloud-based datasets (e.g., S3). #4690

Open
SajidAlamQB opened this issue Apr 25, 2025 · 2 comments

Comments

@SajidAlamQB
Contributor

SajidAlamQB commented Apr 25, 2025

Description

Related to: #2162

Datasets backed by cloud storage (e.g., S3 via s3fs) often hold non-picklable internal references (e.g., S3FileSystem._glob), which cause ParallelRunner to fail during multiprocessing pickling.
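A minimal sketch of the failure mode, using a locally defined callable as a stand-in for the non-picklable fsspec reference (the class here is hypothetical, not a real Kedro dataset):

```python
import pickle

class CloudDataset:
    """Hypothetical stand-in for a dataset backed by cloud storage."""

    def __init__(self):
        # fsspec-style pattern: cache a callable on the instance.
        # A locally defined callable cannot be serialized by the stdlib
        # pickle, which is the same failure mode ParallelRunner hits
        # when it sends datasets to worker processes.
        self._glob = lambda pattern: [pattern]

ds = CloudDataset()
try:
    pickle.dumps(ds)
except (pickle.PicklingError, AttributeError, TypeError) as exc:
    print(f"Pickling failed: {exc}")
```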

Potential Fix:

  • Add a validation layer in ParallelRunner._validate_catalog() that detects datasets whose _protocol is a cloud protocol ("s3", "gcs", "adl", etc.) and raises a warning suggesting alternatives such as ThreadRunner (see the sketch after this list).

  • Update the documentation with guidance and caveats on cloud-based storage and multiprocessing, explaining why cloud datasets cause issues with ParallelRunner.
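As a rough sketch of that check (the helper name, the protocol list, and the use of catalog.list() / _get_dataset() are assumptions here, not the final design):

```python
import logging

from kedro.io import DataCatalog

logger = logging.getLogger(__name__)

CLOUD_PROTOCOLS = ("s3", "s3a", "s3n", "gcs", "gs", "adl", "abfs", "abfss")

def _warn_on_cloud_datasets(catalog: DataCatalog) -> None:
    """Hypothetical helper for ParallelRunner._validate_catalog()."""
    for name in catalog.list():
        dataset = catalog._get_dataset(name)  # private API; illustration only
        protocol = getattr(dataset, "_protocol", None)
        if protocol in CLOUD_PROTOCOLS:
            logger.warning(
                "Dataset '%s' uses cloud protocol '%s' and may fail to "
                "pickle under ParallelRunner; consider ThreadRunner instead.",
                name,
                protocol,
            )
```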

@datajoely
Contributor

I wonder if we can recommend a better pickling library like cloudpickle
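For illustration, cloudpickle serializes by value things the stdlib pickler rejects; wiring it into ParallelRunner's multiprocessing machinery would be a separate question:

```python
import pickle

import cloudpickle

fn = lambda x: x + 1  # stdlib pickle cannot serialize a lambda

try:
    pickle.dumps(fn)
except (pickle.PicklingError, AttributeError) as exc:
    print(f"stdlib pickle failed: {exc}")

payload = cloudpickle.dumps(fn)   # cloudpickle serializes it by value
restored = pickle.loads(payload)  # the payload loads with stdlib pickle
print(restored(41))               # 42
```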

@astrojuanlu
Member

Rather than looking at this from the Runners perspective, can we maybe do so from the Datasets perspective? #1936

According to the original issue #2162 this has to do with fsspec. Custom datasets can of course introduce non-pickleable attributes, but our official ones are unpickleable because of fsspec, so it's something we can address.
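A sketch of what the datasets-side fix could look like (hypothetical class, not Kedro's actual dataset base): drop the fsspec filesystem handle when pickling and rebuild it after unpickling.

```python
import fsspec

class CloudBackedDataset:
    """Hypothetical dataset that survives pickling across processes."""

    def __init__(self, filepath: str):
        self._filepath = filepath
        self._protocol = fsspec.utils.get_protocol(filepath)
        self._fs = fsspec.filesystem(self._protocol)

    def __getstate__(self):
        # The filesystem handle is what fails to pickle; drop it.
        state = self.__dict__.copy()
        state["_fs"] = None
        return state

    def __setstate__(self, state):
        # Rebuild the filesystem handle in the worker process.
        self.__dict__.update(state)
        self._fs = fsspec.filesystem(self._protocol)
```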
