Huggingface Multimodal Universe dataset #274


Draft
wants to merge 2 commits into
base: main

Conversation

@mtauraso (Collaborator) commented Apr 8, 2025

This is a prototype that allows Hugging Face datasets, including Multimodal Universe, to be used within Hyrax.

To use it, add the following to your config:

[data_set]
name = "HyraxHFMapDataset" # or "HyraxHFIterableDataset"

["data_set.HuggingFace"]
# Which dataset to use
dataset = "MultiModalUniverse/hsc"

# Sequence of keys in nested dict to use as tensor model input
dict_keys = ["image","flux"] # this will get item["image"]["flux"] as the output tensor

# Limit on size of data used
max_size = 100               # A good idea if you like your runs to finish
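
As a rough illustration of what dict_keys and max_size are described as doing, here is a minimal Python sketch. The helper names select_nested and limited and the sample records are hypothetical and are not taken from this PR; the actual dataset classes may implement this differently.

```python
from functools import reduce
from itertools import islice
from typing import Any, Iterable, Iterator, Optional, Sequence


def select_nested(item: dict, dict_keys: Sequence[str]) -> Any:
    """Follow dict_keys one level at a time, e.g. item["image"]["flux"]."""
    return reduce(lambda node, key: node[key], dict_keys, item)


def limited(items: Iterable[dict], max_size: Optional[int]) -> Iterator[dict]:
    """Yield at most max_size items; None means no limit."""
    return islice(items, max_size)


# Hypothetical records shaped the way the config comment describes.
rows = [{"image": {"flux": [[float(i)]], "band": "g"}} for i in range(5)]

# Equivalent of dict_keys = ["image", "flux"] and max_size = 3.
fluxes = [select_nested(row, ["image", "flux"]) for row in limited(rows, 3)]
```

In a real dataset class the selected value would presumably be converted to a tensor before being handed to the model; that step is left out here to keep the sketch dependency-free.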

The iterable implementation is very slow and heavily instrumented, and may need to be removed. See the comments in the code for the state of the perf investigation. Streaming is not supported, but could be added easily.

I'm not sure if we should merge this, but I'm putting it out for consideration.

I think there's a case for creating this sort of support after we set up dataset <-> model communication to be more flexible than just a tensor.

mtauraso added 2 commits April 8, 2025 13:00
- Adds an example CIFAR iterable dataset
- Changes to pytorch_ignite.py to work around an ignite bug in iterable datasets
  pytorch/ignite#3372
- New is_iterable() and is_map() interface on dataset base class to unify
  discernment logic
- Support for abstract base classes that derive from HyraxDataset not being
  themselves checked for required methods.
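
One way an is_iterable() / is_map() pair on a base class could look is sketched below. This is an assumption-laden illustration, not the code in these commits: DatasetBase, MapExample, and IterExample are hypothetical names, and the check is based on dunder methods rather than on whatever the real base class uses.

```python
class DatasetBase:
    """Hypothetical stand-in for the dataset base class."""

    def is_map(self) -> bool:
        # Map-style datasets support random access and report a length.
        return callable(getattr(self, "__getitem__", None)) and callable(
            getattr(self, "__len__", None)
        )

    def is_iterable(self) -> bool:
        # Iterable-style datasets only promise sequential iteration.
        return callable(getattr(self, "__iter__", None)) and not self.is_map()


class MapExample(DatasetBase):
    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class IterExample(DatasetBase):
    def __init__(self, data):
        self.data = list(data)

    def __iter__(self):
        return iter(self.data)


map_ds = MapExample(range(3))
iter_ds = IterExample(range(3))
```

Callers can then branch on ds.is_map() or ds.is_iterable() in one place instead of repeating type checks wherever a data loader is built.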
@mtauraso mtauraso marked this pull request as draft April 8, 2025 20:15
@mtauraso mtauraso self-assigned this Apr 10, 2025
@mtauraso mtauraso force-pushed the iterable_dataset branch 2 times, most recently from 8fc5c7f to fea8cda Compare April 17, 2025 20:51
Base automatically changed from iterable_dataset to main April 17, 2025 20:55