[QUESTION] Setting for testing Nemotron-H throughput #1638

Open
@iamtonymwt

Description
Hi there, thank you very much for this fantastic framework. In the Nemotron-H paper, I noticed there is a brief description of the throughput experiment:

"We use an input sequence length of 65536 and ask the models to generate 1024 output tokens. We use an initial Megatron-LM implementation for Nemotron-H inference and vLLM v0.7.312 for baselines. In these experiments, we try to maximize per-GPU inference throughput by using as large a batch size as possible, and we run all experiments on NVIDIA H100 GPUs."

Basically, it says Megatron-LM was used to measure the models' throughput. Following the official NeMo documentation, I used the NeMo Docker image (nemo:25.04.nemotron-h) to test inference throughput, but I couldn't reproduce a similar result (I got about 200 tokens/s for Nemotron-H-8B).

Could you share a more detailed description of the experimental setup, such as the batch size, warmup procedure, and any other acceleration methods used? And is there a script for throughput testing?
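For reference, here is how I compute the per-GPU output throughput I am comparing against the paper's numbers. This is just an illustrative helper based on my reading of the quoted setup (batch of requests, 1024 output tokens each, as large a batch as possible per GPU); `output_throughput` is my own function, not part of NeMo or Megatron-LM:

```python
def output_throughput(batch_size: int, output_tokens: int,
                      elapsed_s: float, num_gpus: int = 1) -> float:
    """Per-GPU generation throughput in output tokens per second.

    batch_size    -- number of concurrent requests in the batch
    output_tokens -- tokens generated per request (1024 in the paper)
    elapsed_s     -- wall-clock time for the generation phase
    num_gpus      -- GPUs the batch was sharded across
    """
    return batch_size * output_tokens / (elapsed_s * num_gpus)

# Example: 8 requests x 1024 output tokens finishing in 40.96 s on one
# H100 corresponds to 200 output tokens/s per GPU.
print(output_throughput(8, 1024, 40.96))  # -> 200.0
```

If the paper's throughput is computed differently (e.g., including prefill time for the 65536-token input), please correct me, since that alone could explain the gap.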
