Feature request: ability to truncate datasets on split rather than pad #2019

@bghira

Description

As a developer working on latent diffusion model training via SimpleTuner, I've found that the built-in mechanism for splitting datasets across processes doesn't work well when a robust sample-tracking mechanism is in use.

SimpleTuner uses a 'seen' list to track samples per epoch so that we do not inadvertently oversample. A side effect is that padding does not actually work: the repeated samples added as padding are simply discarded as already seen.

What happens next is that one of the GPUs runs out of data just before the others would have, causing a deadlock: the main process waits for a backward pass that will never come.

My solution was to truncate the sets I'm splitting to a multiple of batch_size * gradient_steps * num_processes before splitting them. But it occurred to me that having this built in would be nice.
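A minimal sketch of the truncation described above (the helper name and the worked numbers are illustrative, not part of SimpleTuner or Accelerate): trim the sample list down to the nearest multiple of the effective step size so every process receives the same number of full gradient-accumulation cycles, rather than padding the shorter shards.

```python
def truncate_for_split(samples, batch_size, gradient_steps, num_processes):
    """Drop trailing samples so the list divides evenly across processes.

    Hypothetical helper: effective_step is the number of samples consumed
    per optimizer step across all processes, so truncating to a multiple of
    it guarantees no process runs out of data before the others.
    """
    effective_step = batch_size * gradient_steps * num_processes
    usable = (len(samples) // effective_step) * effective_step
    return samples[:usable]

# Example: 1003 samples, batch_size=4, 2 accumulation steps, 2 GPUs
# -> effective step of 16, so 62 * 16 = 992 samples are kept and
#    the trailing 11 are dropped instead of being padded.
samples = list(range(1003))
truncated = truncate_for_split(samples, batch_size=4, gradient_steps=2, num_processes=2)
print(len(truncated))  # 992
```

Dropping at most `effective_step - 1` samples per epoch is the trade-off for avoiding both oversampling and the end-of-epoch deadlock.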

Metadata

Labels: enhancement (New feature or request), feature request (Request for a new feature to be added to Accelerate)
