Description
Hi there, thank you very much for this fantastic framework. In the Nemotron-H paper, I noticed there is a brief description of the throughput experiment:
"We use an input sequence length of 65536 and ask the models to generate 1024 output tokens. We use an initial Megatron-LM implementation for Nemotron-H inference and vLLM v0.7.3 for baselines. In these experiments, we try to maximize per-GPU inference throughput by using as large a batch size as possible, and we run all experiments on NVIDIA H100 GPUs."
Basically, it says Megatron-LM was used to measure the models' throughput. Following the official NeMo docs, I used the NeMo Docker image (nemo:25.04.nemotron-h) to test inference throughput, but I couldn't get a similar result (200 tokens/s for Nemotron-H-8B).
May I request a more detailed description of the experimental setup, such as the batch size, warmup procedure, and any other acceleration methods used? And is there a script available for throughput testing?
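For reference, here is the kind of measurement I have in mind, as a minimal sketch. Everything here is an assumption on my part, not the paper's actual harness: `generate_fn` is a hypothetical placeholder for whatever blocking inference call the framework exposes, and the warmup/iteration counts are arbitrary. Throughput is reported as output tokens per second, matching the 65536-input / 1024-output setup described in the paper.

```python
import time

def measure_throughput(generate_fn, batch_size, input_len, output_len,
                       warmup=1, iters=3):
    """Measure generation throughput in output tokens per second.

    `generate_fn(batch_size, input_len, output_len)` is a hypothetical
    stand-in for the real inference call; it is assumed to block until
    all `output_len` tokens are generated for every sequence in the batch.
    """
    # Warmup iterations are excluded from timing (e.g. to amortize
    # CUDA graph capture or kernel autotuning on the first call).
    for _ in range(warmup):
        generate_fn(batch_size, input_len, output_len)

    start = time.perf_counter()
    for _ in range(iters):
        generate_fn(batch_size, input_len, output_len)
    elapsed = time.perf_counter() - start

    # Only output tokens count toward generation throughput here;
    # prefill cost of the 65536-token input is folded into elapsed time.
    total_output_tokens = batch_size * output_len * iters
    return total_output_tokens / elapsed

if __name__ == "__main__":
    # Dummy generate function standing in for the real model call.
    def dummy_generate(bs, in_len, out_len):
        time.sleep(0.01)  # stand-in for actual GPU work

    tps = measure_throughput(dummy_generate, batch_size=8,
                             input_len=65536, output_len=1024)
    print(f"{tps:.1f} output tokens/s")
```

If this roughly matches how the numbers in the paper were produced, knowing the actual batch size and warmup settings would let me reproduce them directly.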