
Add validation and batched inference to flux #1205


Open · wants to merge 22 commits into base: flux-train

Conversation

@CarlosGomes98 (Contributor) commented May 19, 2025

  • Add val loss
  • Add batched inference

Ideally we would also add COCO2014 as a dataset. However, I haven't been able to find an HF dataset containing both the images and the captions. So, for now, I've added a dataset which is just the first 30k samples of the training dataset, for functional verification.

This also includes changes from #1138

@facebook-github-bot

Hi @CarlosGomes98!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Contributor

If we take the first 30,000 samples as the validation dataset, won't it overlap with the training dataset?

An alternative is to specify the data_files explicitly, e.g. dataset = load_dataset("json", data_files={"train": base_url + "train-v1.1.json", "validation": base_url + "dev-v1.1.json"}, field="data") (see https://huggingface.co/docs/datasets/en/loading), if we are loading the dataset from the Hugging Face Hub directly.

If we are loading data locally, we could keep an _info.json (https://huggingface.co/datasets/pixparse/cc12m-wds/blob/main/_info.json) to specify the train/validation split.
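
As a concrete sketch of the first option: the base_url value below is filled in from the SQuAD example in the linked Hugging Face docs, not from this repo.

```python
from datasets import load_dataset

# Explicit data_files give `datasets` a real train/validation split up front.
base_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
dataset = load_dataset(
    "json",
    data_files={
        "train": base_url + "train-v1.1.json",
        "validation": base_url + "dev-v1.1.json",
    },
    field="data",
)
train_ds, val_ds = dataset["train"], dataset["validation"]
```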

Contributor Author

Yes, it will. This was just a temporary solution to functionally verify the validation loop. I wanted to ask if you had some insights on how we should include the COCO2014 dataset, given that it's not easily available on the HF Hub.

Would we add download instructions to the README and load it locally?

Contributor

Do you want to use the COCO dataset because of the Stable Diffusion paper? I think we should keep it simple and just cut out part of the cc12m dataset to serve as the validation set.
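
For instance, a minimal sketch of that carve-out, assuming the datasets split-slicing syntax; the slices are disjoint, so validation never overlaps training (the 30k size is only illustrative):

```python
from datasets import load_dataset

# Disjoint slices of the same split: no train/validation overlap.
val_ds = load_dataset("pixparse/cc12m-wds", split="train[:30000]")
train_ds = load_dataset("pixparse/cc12m-wds", split="train[30000:]")
```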


return result

def _coco2014_data_processor(
Contributor

If we are not using the COCO dataset as the validation set right now, we should remove this function.

continue
except (UnicodeDecodeError, SyntaxError, OSError) as e:
Contributor

In my training, I added this line to capture data loading errors, e.g. a corrupted image header when PIL.Image is reading, a corrupted .tar file header, etc.
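
Roughly the shape of that guard, as a sketch rather than the PR's exact code (the "jpg" key is the usual webdataset field name, assumed here):

```python
from io import BytesIO

from PIL import Image


def decode_or_skip(sample):
    try:
        img = Image.open(BytesIO(sample["jpg"]))
        img.load()  # force a full decode so truncated files fail here
        return img
    except (UnicodeDecodeError, SyntaxError, OSError):
        return None  # caller skips the corrupted sample and continues
```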

Contributor Author

Makes sense, I must have removed it by accident

@@ -176,9 +237,9 @@ def denoise(
# create positional encodings
POSITION_DIM = 3
latent_pos_enc = create_position_encoding_for_latents(
-        bsz, latent_height, latent_width, POSITION_DIM
+        1, latent_height, latent_width, POSITION_DIM
Contributor

Can you explain this line a little more: why do we change the batch size to 1 here?

Contributor Author

The changes in this method from bsz to 1 allow the denoise method to deal with batches of images, plus the possible doubling of the batch size due to classifier-free guidance.

In this case, since all samples share the same position encoding, we can set the batch dimension to 1 and let PyTorch broadcasting expand it to whatever the batch dimension turns out to be.
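
To illustrate the broadcasting argument (shapes below are hypothetical; only the batch-dim mechanics matter):

```python
import torch

bsz = 4                                    # CFG may double this to 8
latents = torch.randn(2 * bsz, 1024, 64)   # (batch, seq, dim)
pos_enc = torch.randn(1, 1024, 64)         # built once with batch dim 1

# The size-1 batch dim broadcasts to whatever the runtime batch is,
# so the same encoding works with or without CFG doubling.
out = latents + pos_enc
assert out.shape == latents.shape
```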

Contributor Author

@wwwjn would it be possible for you to test this batched inference code, with and without classifier-free guidance, on your trained model? The code functionally runs, but I have not verified that it produces correct images, as I don't have a properly trained checkpoint.

output_name = os.path.join(output_dir, name)
# bring into PIL format and save
x = x.clamp(-1, 1)
x = rearrange(x[0], "c h w -> h w c")
if len(x.shape) == 4:
Contributor

Why do we add len(x.shape) == 4 here? In which cases will this happen?

Contributor Author

This change allows the save_image method to correctly handle being passed a single image with or without the batch dimension. In the current code, the image to be saved must always be passed with 4 dimensions, from which we take x[0].
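
A sketch of the described behavior (the body is simplified; the [-1, 1] → [0, 255] mapping is an assumption about the output convention):

```python
import torch
from einops import rearrange
from PIL import Image


def save_image(x: torch.Tensor, output_name: str) -> None:
    if len(x.shape) == 4:  # batched input: save the first image
        x = x[0]
    x = x.clamp(-1, 1)
    x = rearrange(x, "c h w -> h w c")
    img = Image.fromarray((127.5 * (x + 1.0)).byte().cpu().numpy())
    img.save(output_name)
```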

time.perf_counter() - data_load_start
)

def batch_generator(
Contributor

I think it's better to separate this diff out, as we did before with #1138; it's easier to track changes separately.

Contributor Author

Agree, will revert and we can merge this in later

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on May 19, 2025
if (
parallel_dims.dp_replicate_enabled
or parallel_dims.dp_shard_enabled
or parallel_dims.cp_enabled
Contributor

nit: currently we are not enabling CP for the Flux model. We could remove this line.

Contributor Author

I just copied this from the train step. For consistency I would either keep it or remove it in both places.

self.step, force=(self.step == job_config.training.steps)
)

if self.step % job_config.eval.eval_freq == 0 and job_config.eval.dataset:
Contributor

I think we could wrap all these parts into eval(), and make the main train() loop easier to read.
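
For example, one reading of the suggestion (the method name is hypothetical; the condition is the one from the snippet above):

```python
class FluxTrainer:
    def maybe_eval(self, job_config) -> None:
        # Pull the periodic validation check out of train() so the
        # main loop stays flat and easy to read.
        if self.step % job_config.eval.eval_freq == 0 and job_config.eval.dataset:
            self.eval()  # the validation loop added in this PR
```

train() would then call self.maybe_eval(job_config) where the inlined block sits today.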

return global_loss_per_timestep, global_timestep_counts

@record
def inference(self, prompts: list[str], bs: int = 1):
Contributor

Can we add a line in the README, or add a file similar to run_train.sh, to run the inference?

Contributor

Or a better way is to move the inference code out of train.py and create another subclass of FluxTrainer() to do so.

Contributor Author

With simplicity in mind, I think I agree with your second suggestion.

Contributor Author

I'm somewhat conflicted about using Trainer for inference. On one hand, we definitely would like to re-use all the logic for model loading and parallelization.

On the other hand, it forces us to do things like loading the training dataset, which doesn't really make much sense.

For now I have left it like that, but in the future creating a more lightweight Trainer-like class for inference only may be better.


results = torch.cat(results, dim=0)
return results

def generate_and_save_images(self, inputs) -> torch.Tensor:
Contributor

Can we move this function to sampling.py, and reuse some of the functions there? We could calculate empty_batch in train.py, and pass it to the function call.

In general, we want to keep train.py simple and clean to read.

)
return images

def generate_val_timesteps(self, cur_val_timestep, samples):
Contributor

Also, this can be moved to sampling.py

Contributor Author

This one I would argue belongs here: it is really only relevant for validation during the training process, not for sampling in general.
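
For readers following along, one plausible shape for such a helper, purely as a guess at its intent (the bucket count and the continuous timesteps in (0, 1) are illustrative assumptions, not the PR's code):

```python
import torch


def generate_val_timesteps(cur_val_timestep: int, samples: int,
                           num_buckets: int = 100) -> tuple[torch.Tensor, int]:
    # Sweep timestep buckets deterministically across validation batches,
    # wrapping around, so per-timestep val loss is comparable across runs.
    idx = (cur_val_timestep + torch.arange(samples)) % num_buckets
    timesteps = (idx.float() + 0.5) / num_buckets  # bucket centers in (0, 1)
    next_start = int(cur_val_timestep + samples) % num_buckets
    return timesteps, next_start
```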

Contributor

@tianyu-l left a comment

Thanks for working on this!

It seems a lot of good stuff is being added. While I can clearly sense the value of most changes, to be honest it's a bit difficult for reviewers to keep track of all the changes and their motivations.

Do you think it's doable to split the changes into several PRs, each with its own theme and documentation as PR summary / doc string / comments?

@@ -21,6 +21,36 @@
from torchtitan.tools.utils import device_module, device_type


def dist_collect(
Contributor

If the only difference is the .item() call, we should just move it out of _dist_reduce and reuse that function where you'd use this dist_collect.
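
A sketch of that refactor (_dist_reduce and dist_collect are from the diff; the mean-style scalar wrapper name and the "avg" reduce op are assumed): _dist_reduce returns the reduced tensor, and scalar callers add .item() themselves.

```python
import torch
import torch.distributed._functional_collectives as funcol
from torch.distributed.device_mesh import DeviceMesh


def _dist_reduce(x: torch.Tensor, reduceOp: str, mesh: DeviceMesh) -> torch.Tensor:
    # Return the reduced tensor itself; no .item() here.
    return funcol.all_reduce(x, reduceOp=reduceOp, group=mesh)


def dist_mean(x: torch.Tensor, mesh: DeviceMesh) -> float:
    return _dist_reduce(x, "avg", mesh).item()  # scalar, as before


def dist_collect(x: torch.Tensor, mesh: DeviceMesh) -> torch.Tensor:
    return _dist_reduce(x, "avg", mesh)  # tensor-valued variant
```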

name="flux",
cls=FluxModel,
config=flux_configs,
parallelize_fn=parallelize_flux,
pipelining_fn=None,
build_optimizers_fn=build_optimizers,
build_lr_schedulers_fn=build_lr_schedulers,
-    build_dataloader_fn=build_flux_dataloader,
+    build_dataloader_fn=build_flux_train_dataloader,
Contributor

I think this aligns with my proposal to do validation in torchtitan (not just for Flux but also for other models); see #1210.
I would hope we can take a more principled approach and make general improvements, instead of doing an ad hoc change here.

Contributor Author

Absolutely. I wanted to enable this functionality for flux asap, so this is hacky.

Since it will involve changes to some central components in torchtitan, I didn't want to attempt a full implementation just yet, and I'm not sure I'd have the bandwidth for this, especially if it's work that someone is already doing / plans on doing.

I'm happy to remove the validation dataset bit and wait on a proper implementation being added to main. Until then, the validation metrics I added in this PR could instead target a subset of the training set, for example.

) -> Optional[torch.Tensor]:
"""Process CC12M image to the desired size."""

width, height = img.size
# Skip low resolution images
-    if width < output_size or height < output_size:
+    if skip_low_resolution and (width < output_size or height < output_size):
Contributor

The code still seems right with this flag set to False: the smaller dimension will be enlarged to output_size and the other dimension will be enlarged proportionally and then cropped.

But is this used anywhere?
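
For reference, the resize path being described, as a standalone sketch (function name assumed):

```python
from PIL import Image


def resize_and_center_crop(img: Image.Image, output_size: int) -> Image.Image:
    # Scale so the *smaller* side reaches output_size (this enlarges
    # low-resolution images when skipping is disabled), then center-crop.
    width, height = img.size
    scale = output_size / min(width, height)
    img = img.resize((round(width * scale), round(height * scale)))
    left = (img.width - output_size) // 2
    top = (img.height - output_size) // 2
    return img.crop((left, top, left + output_size, top + output_size))
```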

@@ -106,14 +109,14 @@ def _cc12m_wds_data_processor(
result = {
"image": img,
"clip_tokens": clip_tokens, # type: List[int]
"t5_tokens": t5_tokens, # type: List[int]
"t5_tokens": t5_tokens, # type: List[int],
"txt": sample["txt"],
Contributor

Hmm, why add this?

@@ -285,43 +292,50 @@ def __init__(

# Variables for checkpointing
self._sample_idx = 0
self._all_samples: list[dict[str, Any]] = []
self._epoch = 0
Contributor

What's the purpose of adding this variable, and in general, what's the purpose of these changes around the data loader?

dp_world_size: int,
dp_rank: int,
job_config: JobConfig,
# This parameter is not used, keep it for compatibility
tokenizer: FluxTokenizer | None,
infinite: bool = True,
include_sample_id: bool = False,
batch_size: int = 4,
Contributor

Why this magic number?


"""Common MSE loss function for Transformer models training."""
-    return torch.nn.functional.mse_loss(pred.float(), labels.float().detach())
+    return torch.nn.functional.mse_loss(pred.float(), labels.float().detach(), reduction=reduction)
Contributor

I'm not 100% sure, but I think FSDP / PP don't work with a reduction other than "mean".
Also, why do we need to alter this?
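
For context on why a reduction argument helps at all: with reduction="none" the loss stays per-element, which is what per-timestep validation bucketing needs (a sketch with illustrative shapes):

```python
import torch
import torch.nn.functional as F

pred = torch.randn(8, 16, 64)    # (batch, seq, dim), shapes illustrative
labels = torch.randn(8, 16, 64)

# reduction="none" keeps one loss value per element...
per_elem = F.mse_loss(pred.float(), labels.float().detach(), reduction="none")
# ...which can then be reduced per sample, e.g. to bucket loss by timestep.
per_sample = per_elem.mean(dim=(1, 2))
assert per_sample.shape == (8,)
```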

@CarlosGomes98 (Contributor Author)

> Thanks for working on this!
>
> It seems a lot of good stuff is being added. While I can clearly sense the value of most changes, to be honest it's a bit difficult for reviewers to keep track of all the changes and their motivations.
>
> Do you think it's doable to split the changes into several PRs, each with its own theme and documentation as PR summary / doc string / comments?

Yes, it did grow a bit out of hand. I can definitely split it at least into inference and validation. I'll see if I can make it more granular than that.

@wwwjn (Contributor) commented May 21, 2025

@CarlosGomes98 one quick note: flux-train is a little behind the main branch, so let's just resolve the comments and create a PR against the main branch instead.
