Feature request: ability to truncate datasets on split rather than pad #2019

@bghira

Description

As a developer working on latent diffusion model training via SimpleTuner, I've found that the built-in mechanism for splitting datasets across processes doesn't work well when a robust sample-tracking mechanism is in use.

SimpleTuner uses a 'seen' list to track samples per epoch so that we do not inadvertently oversample. A side effect is that padding does not actually work: the repeated samples added as padding are simply discarded as already seen.

What happens next is that one of the GPUs runs out of data just before the others would have, causing a deadlock: the main process waits for a backward pass that will never come.

My solution was to truncate the sets I'm splitting to a multiple of batch_size * gradient_steps * num_processes before splitting them. But it occurred to me that having this built in would be nice.
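A minimal sketch of the truncation described above (the helper name and the worked numbers are illustrative, not part of SimpleTuner or Accelerate): trim the sample list down to the nearest multiple of the effective step size so every process receives the same number of full gradient-accumulation cycles, rather than padding the shorter shards.

```python
def truncate_for_split(samples, batch_size, gradient_steps, num_processes):
    """Drop trailing samples so the list divides evenly across processes.

    Hypothetical helper: effective_step is the number of samples consumed
    per optimizer step across all processes, so truncating to a multiple of
    it guarantees no process runs out of data before the others.
    """
    effective_step = batch_size * gradient_steps * num_processes
    usable = (len(samples) // effective_step) * effective_step
    return samples[:usable]

# Example: 1003 samples, batch_size=4, 2 accumulation steps, 2 GPUs
# -> effective step of 16, so 62 * 16 = 992 samples are kept and
#    the trailing 11 are dropped instead of being padded.
samples = list(range(1003))
truncated = truncate_for_split(samples, batch_size=4, gradient_steps=2, num_processes=2)
print(len(truncated))  # 992
```

Dropping at most `effective_step - 1` samples per epoch is the trade-off for avoiding both oversampling and the end-of-epoch deadlock.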

Metadata

Labels: enhancement (New feature or request), feature request (Request for a new feature to be added to Accelerate)
