llama-bench: enhance benchmark with improved token throughput measurements #12874
base: master
Conversation
My personal opinion is that a rate of tokens over both prompt processing and token generation is not a useful metric. This is because you are calculating the average of two clearly different phases of execution. I think a better metric would be just the total runtime of the test. Related discussion: #7199. In any case, I think the way the information is presented with this PR is an improvement over master and I would still be willing to review and merge it unless someone else objects.
Other considerations:
- With these changes the documentation in the README file has become outdated; please update it prior to merging.
- The line width of the default prints is becoming too long, I think. I would be fine with dropping the model size and number of parameters.
- I assume this PR will have broken `scripts/compare_llama_bench.py`. It would be nice if this was fixed, but I'm also fine with doing the fix myself.
"embeddings", "n_prompt", "n_gen", "test_time", | ||
"avg_e2e_ns", "stddev_e2e_ns", "avg_e2e_ts", "stddev_e2e_ts", | ||
"avg_prompt_ns", "stddev_prompt_ns", "avg_prompt_ts", "stddev_prompt_ts", | ||
"avg_gen_ns", "stddev_gen_ns", "avg_gen_ts", "stddev_gen_ts" |
Please preserve vertical alignment.
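For illustration only, an aligned version of the new field list might look like the sketch below; the surrounding declaration is an assumption mirroring the existing llama-bench style, not code taken from the PR:

```cpp
#include <string>
#include <vector>

// Sketch only: the new fields from the diff above with vertical alignment preserved.
// The enclosing declaration is assumed, not copied from llama-bench.cpp.
static const std::vector<std::string> fields = {
    "embeddings",    "n_prompt",          "n_gen",          "test_time",
    "avg_e2e_ns",    "stddev_e2e_ns",     "avg_e2e_ts",     "stddev_e2e_ts",
    "avg_prompt_ns", "stddev_prompt_ns",  "avg_prompt_ts",  "stddev_prompt_ts",
    "avg_gen_ns",    "stddev_gen_ns",     "avg_gen_ts",     "stddev_gen_ts",
};
```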
I agree that there is room for improvement here. I have never used the pp+tg tests because the output didn't give me any useful information, so I would like that to change. The way this is handled by other applications is by calculating the combined pp+tg number as the number of tokens generated divided by the total test time. This gives you a useful metric of how fast you can generate tokens in back-to-back requests with a specific prompt size to process each time. I don't think we should extend the table with separate pp and tg t/s counts, since the default tests keep them on separate rows anyway. That would only make sense if we wanted to default to pp+tg tests (which can also be discussed).
I disagree; that is, in my view, not a useful metric for comparison, because the value that the rate is normalized to doesn't make sense.
What I think would be useful as a default for a table is generating some amount of tokens on an empty context and then the same amount of tokens with a non-empty context. From that you can roughly estimate both the maximum speed and how that speed declines with more context. What I think would be best, but also high-effort, would be to first record the prompt processing and generation evaluation times in a differential way. Then, in a second step, you could fit a polynomial to the runtime as a function of context size and plot the results. A t/s value as a function of context size can be obtained by transforming the y axis.
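To make the polynomial-fit idea concrete, here is a rough sketch (not part of the PR or llama-bench; the quadratic degree and all names are assumptions) of fitting runtime as a function of context size by ordinary least squares:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Fit t(n) ~ a + b*n + c*n^2 to measured runtimes t at context sizes n,
// by solving the 3x3 normal equations with Gaussian elimination.
// A t/s curve can then be derived as tokens / t(n).
static std::array<double, 3> fit_quadratic(const std::vector<double> & n, const std::vector<double> & t) {
    double s[5] = {0, 0, 0, 0, 0}; // sums of n^0 .. n^4
    double r[3] = {0, 0, 0};       // sums of t * n^0 .. t * n^2
    for (size_t i = 0; i < n.size(); i++) {
        double p = 1.0;
        for (int k = 0; k < 5; k++) {
            s[k] += p;
            if (k < 3) {
                r[k] += t[i] * p;
            }
            p *= n[i];
        }
    }
    // normal equations A * coef = r, with A[j][k] = sum(n^(j+k))
    double A[3][4] = {
        { s[0], s[1], s[2], r[0] },
        { s[1], s[2], s[3], r[1] },
        { s[2], s[3], s[4], r[2] },
    };
    // forward elimination (no pivoting; adequate for this illustration)
    for (int i = 0; i < 3; i++) {
        for (int j = i + 1; j < 3; j++) {
            const double f = A[j][i] / A[i][i];
            for (int k = i; k < 4; k++) {
                A[j][k] -= f * A[i][k];
            }
        }
    }
    // back substitution
    std::array<double, 3> coef = {0, 0, 0};
    for (int i = 2; i >= 0; i--) {
        double acc = A[i][3];
        for (int k = i + 1; k < 3; k++) {
            acc -= A[i][k] * coef[k];
        }
        coef[i] = acc / A[i][i];
    }
    return coef; // {a, b, c}
}
```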
Thanks for the review. I agree that e2e t/s is not a very useful metric. Separate pp and tg metrics are more useful to understand, as these are two distinct phases. Total runtime is also not very helpful IMO, since it will vary with the prompt length and the number of tokens generated, and it doesn't give much insight into performance for either the prompt or the generation phase. Instead of total runtime, a better metric is time to first token (TTFT); this is an alternative to pp t/s. We can use TTFT if no one has any objection. IMO, the separate pp and tg tests don't make sense either. However, we should keep pp+tg tests as the default (if others agree). This is also consistent with other LLM-related libraries. My final recommendation will be:
No, I think for pp and tg on their own it makes more sense to provide t/s instead of the runtime; I only think it doesn't make sense to provide a t/s value for a mixture of pp and tg.
Co-authored-by: Johannes Gäßler <[email protected]>
Having separate metrics for pp/tg and pp+tg tests is confusing, and I don't think we should do that.
Why? Tokens generated is the metric that the user cares about. Sure, it's less relevant than splitting up the metrics, but it is not useless. I agree with a text generation test for empty and full context to get min and max expected speeds. A graph is even better, but would take too long to measure to make it the default.
The relevant metrics for a good user experience, as I see them, are a low latency until the first token is generated and a high rate of tokens during generation. But because the initial latency is relative to the length of the prompt, it makes more sense to instead provide a rate at which tokens are processed. On a fundamental level, if more metrics are to be added they need to be justified in some way, either by providing useful information on their own or by facilitating comparisons. I don't see a situation where a rate of tokens relative to the runtime of pp + tg is ever useful information in isolation. And for comparisons of some pp + tg runs, the total runtime is a better metric because lower/higher values correlate better with better/worse performance.
This PR adds separate measurements for end-to-end, prompt processing, and token generation throughput in llama-bench. The changes allow for more detailed performance analysis by separately tracking and reporting:
The current implementation of the `t/s` throughput metric is incorrect when the `-pg` flag is specified. It uses the formula `(n_prompt + n_gen) / e2e_time`, which does not accurately represent throughput and leads to misleading interpretation. The correct e2e throughput should be calculated as `n_gen / e2e_time`.
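For reference, a minimal sketch of the per-phase and end-to-end rates discussed in this thread; the struct and function names are illustrative assumptions, not llama-bench's actual identifiers:

```cpp
#include <cstdint>

// Illustrative timing data for a single pp+tg test run.
struct test_timings {
    int      n_prompt;   // prompt tokens processed
    int      n_gen;      // tokens generated
    uint64_t prompt_ns;  // prompt processing time in nanoseconds
    uint64_t gen_ns;     // token generation time in nanoseconds
};

// prompt processing rate: prompt tokens per second
static double pp_ts(const test_timings & t) {
    return 1e9 * t.n_prompt / t.prompt_ns;
}

// token generation rate: generated tokens per second
static double tg_ts(const test_timings & t) {
    return 1e9 * t.n_gen / t.gen_ns;
}

// corrected end-to-end rate for -pg tests: generated tokens over total runtime,
// i.e. n_gen / e2e_time rather than (n_prompt + n_gen) / e2e_time
static double e2e_ts(const test_timings & t) {
    return 1e9 * t.n_gen / double(t.prompt_ns + t.gen_ns);
}
```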
Benefits
Old output
New output