Description
Hi there, thank you very much for this fantastic framework. In the Nemotron-H paper, I noticed there is a brief description of the throughput experiment:
"We use an input sequence length of 65536 and ask the models to generate 1024 output tokens. We use an initial Megatron-LM implementation for Nemotron-H inference and vLLM v0.7.3 for baselines. In these experiments, we try to maximize per-GPU inference throughput by using as large a batch size as possible, and we run all experiments on NVIDIA H100 GPUs."
Basically, it says Megatron-LM was used to measure the models' throughput. Following the official NeMo docs, I used the NeMo Docker image (nemo:25.04.nemotron-h) to test inference throughput, but I couldn't get a similar result (200 tokens/s for Nemotron-H-8B).
May I request a more detailed description of the experimental setup, such as the batch size, warmup procedure, and any other acceleration methods used? And is there a script available for throughput testing?
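For reference, here is the kind of measurement I have in mind, as a minimal sketch. Everything here is an assumption on my part, not the paper's actual harness: `generate_fn` is a hypothetical placeholder for whatever blocking inference call the framework exposes, and the warmup/iteration counts are arbitrary. Throughput is reported as output tokens per second, matching the 65536-input / 1024-output setup described in the paper.

```python
import time

def measure_throughput(generate_fn, batch_size, input_len, output_len,
                       warmup=1, iters=3):
    """Measure generation throughput in output tokens per second.

    `generate_fn(batch_size, input_len, output_len)` is a hypothetical
    stand-in for the real inference call; it is assumed to block until
    all `output_len` tokens are generated for every sequence in the batch.
    """
    # Warmup iterations are excluded from timing (e.g. to amortize
    # CUDA graph capture or kernel autotuning on the first call).
    for _ in range(warmup):
        generate_fn(batch_size, input_len, output_len)

    start = time.perf_counter()
    for _ in range(iters):
        generate_fn(batch_size, input_len, output_len)
    elapsed = time.perf_counter() - start

    # Only output tokens count toward generation throughput here;
    # prefill cost of the 65536-token input is folded into elapsed time.
    total_output_tokens = batch_size * output_len * iters
    return total_output_tokens / elapsed

if __name__ == "__main__":
    # Dummy generate function standing in for the real model call.
    def dummy_generate(bs, in_len, out_len):
        time.sleep(0.01)  # stand-in for actual GPU work

    tps = measure_throughput(dummy_generate, batch_size=8,
                             input_len=65536, output_len=1024)
    print(f"{tps:.1f} output tokens/s")
```

If this roughly matches how the numbers in the paper were produced, knowing the actual batch size and warmup settings would let me reproduce them directly.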