feat(map): reuse unchanged columns when input_columns specified to reduce disk usage (#6013) #7626

ArjunJagdale · 2025-06-19T07:41:45Z

Summary

This PR addresses #6013 by reusing unchanged columns from the original dataset in the map() method when input_columns is specified.

What’s Implemented

Added logic at the end of Dataset.map() to:
- Identify untouched columns (those in neither input_columns nor remove_columns)
- Select those columns from the original dataset
- Concatenate them with the transformed result using pyarrow.concat_tables
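
The three steps above can be sketched in plain Python, with a dict of lists standing in for an Arrow table. This is only an illustration of the column bookkeeping, not the code in this PR: `untouched_columns` and `reuse_and_merge` are hypothetical helper names, and `transformed` is assumed to hold whatever columns the mapped function produced.

```python
def untouched_columns(column_names, input_columns, remove_columns):
    """Return the columns that are neither read via input_columns nor removed."""
    skipped = set(input_columns) | set(remove_columns)
    return [name for name in column_names if name not in skipped]


def reuse_and_merge(original, transformed, input_columns, remove_columns):
    """Combine untouched original columns with the transformed columns.

    `original` and `transformed` are dicts mapping column name -> list of values
    (a stand-in for Arrow tables; hypothetical, for illustration only).
    """
    keep = untouched_columns(list(original), input_columns, remove_columns)
    # Reference the original column objects instead of copying their data.
    merged = {name: original[name] for name in keep}
    merged.update(transformed)
    return merged


original = {"a": [1, 2], "b": [3, 4]}
# Stand-in for the output of the mapped function applied to column "a".
transformed = {"c": [value + 10 for value in original["a"]]}

result = reuse_and_merge(original, transformed, input_columns=["a"], remove_columns=["a"])
print(list(result))  # ['b', 'c']
```

In this sketch the column `b` in `result` is the same list object as in `original`, which mirrors the idea of reusing existing data rather than writing it again.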

Example Behavior

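from datasets import Dataset

ds = Dataset.from_dict({"a": [1, 2], "b": [3, 4]})
# With input_columns=["a"], the function receives the value of "a" directly, not a dict of columns
ds2 = ds.map(lambda a: {"c": a + 10}, input_columns=["a"], remove_columns=["a"])
print(ds2.column_names)  # Output: ['b', 'c']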

Column b is reused from the original dataset.

Notes

This avoids duplicating the full dataset, which reduces disk usage and cache size.
The reuse logic is only triggered when input_columns is set.

cc @lhoestq @mariosasko for review 🙂

ArjunJagdale added 2 commits June 19, 2025 13:06

Update arrow_dataset.py

ad2955e

Merge branch 'huggingface:main' into main

1ee0c99
