
whisper : fix "bench-all outputs an invalid result on larger models" #3002


Merged
1 commit merged into ggml-org:master from sf/bench-all on Apr 4, 2025

Conversation

fujimotos (Contributor)

When I run scripts/bench-all.sh on AWS c8g.xlarge, it outputs an invalid value ("ms") in the result table.

Look at the 'Enc.' column in the benchmark result below:

Running bench-all.sh (commit 6e7629b)

$ ./scripts/bench-all.sh 4 
...
|    CPU |     OS |           Config |         Model |  Th |  FA |    Enc. |    Dec. |    Bch5 |      PP |  Commit |
|    --- |    --- |              --- |           --- | --- | --- |     --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |          tiny |   4 |   0 |  389.04 |    1.17 |    0.68 |    0.55 | 6e7629b |
| <todo> | <todo> |             NEON |          base |   4 |   0 |  879.42 |    2.05 |    1.22 |    0.98 | 6e7629b |
| <todo> | <todo> |             NEON |         small |   4 |   0 | 3290.80 |    5.45 |    3.36 |    2.82 | 6e7629b |
| <todo> | <todo> |             NEON |        medium |   4 |   0 |      ms |   14.85 |    9.51 |    8.05 | 6e7629b |
| <todo> | <todo> |             NEON |      large-v2 |   4 |   0 |      ms |   28.67 |   17.79 |   15.08 | 6e7629b |
| <todo> | <todo> |             NEON | large-v3-turbo |   4 |   0 |      ms |    4.95 |    3.15 |    2.70 | 6e7629b |

The reason is that the benchmark script assumes that the 11th whitespace-delimited field of the timing output is the per-run time, but this assumption breaks when the target model takes longer to process.

This is a trivial fix for the issue: add an explicit space before the per-run time field.
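
For illustration, here is a minimal, self-contained sketch of the kind of format-string change this implies. The variable names and the exact format strings are hypothetical; the real code lives in `whisper_print_timings()` in whisper.cpp and may differ in detail.

```c
// Hypothetical sketch of the format-string change; not the actual
// whisper.cpp source.
#include <stdio.h>

int main(void) {
    const double encode_ms = 20320.76; // example: total encode time for large-v2
    const int    n_runs    = 1;

    // Before: "%8.2f" fills its 8-character field for values this large,
    // so the number fuses with the opening "(" into one token: "(20320.76".
    printf("whisper_print_timings:   encode time = %8.2f ms / %5d runs (%8.2f ms per run)\n",
           encode_ms, n_runs, encode_ms / n_runs);

    // After: an explicit space before the per-run value keeps "(" and the
    // number as separate whitespace-delimited fields, however wide the
    // number gets.
    printf("whisper_print_timings:   encode time = %8.2f ms / %5d runs ( %8.2f ms per run)\n",
           encode_ms, n_runs, encode_ms / n_runs);
    return 0;
}
```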

Running bench-all.sh (commit a7cf427)

$ ./scripts/bench-all.sh 4 
...
|    CPU |     OS |           Config |         Model |  Th |  FA |    Enc. |    Dec. |    Bch5 |      PP |  Commit |
|    --- |    --- |              --- |           --- | --- | --- |     --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |          tiny |   4 |   0 |  389.81 |    1.22 |    0.70 |    0.56 | a7cf427 |
| <todo> | <todo> |             NEON |          base |   4 |   0 |  883.28 |    2.12 |    1.25 |    0.99 | a7cf427 |
| <todo> | <todo> |             NEON |         small |   4 |   0 | 3302.36 |    5.61 |    3.43 |    2.86 | a7cf427 |
| <todo> | <todo> |             NEON |        medium |   4 |   0 | 10561.90 |   15.42 |    9.71 |    8.14 | a7cf427 |
| <todo> | <todo> |             NEON |      large-v2 |   4 |   0 | 20608.38 |   29.33 |   18.36 |   15.26 | a7cf427 |
| <todo> | <todo> |             NEON | large-v3-turbo |   4 |   0 | 18801.69 |    5.10 |    3.27 |    2.73 | a7cf427 |

The benchmark script 'scripts/bench-all.sh' assumes that the 11th
field of the output line is a timestamp. This assumption does not
hold when the target model takes a bit longer to process.

Fix this issue by introducing an explicit whitespace to the output
lines of `whisper_print_timings()`.

Signed-off-by: Fujimoto Seiji <[email protected]>
fujimotos (Contributor, Author)

Note: I needed this fix to compute the benchmark result in #89 (comment).

To illustrate the point, this PR changes the following raw output:

$ ./build/bin/whisper-bench -m ./models/ggml-large-v2.bin -t 4
...
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 20320.76 ms /     1 runs (20320.76 ms per run)
whisper_print_timings:   decode time =  7299.80 ms /   256 runs (   28.51 ms per run)
whisper_print_timings:   batchd time =  5664.58 ms /   320 runs (   17.70 ms per run)
whisper_print_timings:   prompt time = 61682.86 ms /  4096 runs (   15.06 ms per run)

... to this:

$ ./build/bin/whisper-bench -m ./models/ggml-large-v2.bin -t 4
...
whisper_print_timings:   sample time =     0.00 ms /     1 runs (     0.00 ms per run)
whisper_print_timings:   encode time = 20625.59 ms /     1 runs ( 20625.59 ms per run)
whisper_print_timings:   decode time =  7580.05 ms /   256 runs (    29.61 ms per run)
whisper_print_timings:   batchd time =  5967.29 ms /   320 runs (    18.65 ms per run)
whisper_print_timings:   prompt time = 62751.71 ms /  4096 runs (    15.32 ms per run)

... which ensures that `awk '{print $11}'` always works.
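
To make the field counting concrete, here is a small standalone sketch (not part of the PR) that splits the two example lines the way awk's default whitespace splitting does and prints field 11. The helper `field()` is hypothetical, written just for this illustration.

```c
// Standalone illustration; the two input lines are copied from the example
// output above.
#include <stdio.h>
#include <string.h>

// Return the n-th whitespace-delimited field (1-based), similar to awk's $n.
static char *field(char *line, int n) {
    char *tok = strtok(line, " \t");
    for (int i = 1; tok != NULL && i < n; i++) {
        tok = strtok(NULL, " \t");
    }
    return tok;
}

int main(void) {
    char before[] = "whisper_print_timings:   encode time = 20320.76 ms /     1 runs (20320.76 ms per run)";
    char after[]  = "whisper_print_timings:   encode time = 20625.59 ms /     1 runs ( 20625.59 ms per run)";

    // Before the fix, "(20320.76" is one token, so field 11 is "ms".
    printf("before: $11 = %s\n", field(before, 11));
    // After the fix, "(" and the number are separate tokens, so field 11
    // is the per-run encode time.
    printf("after:  $11 = %s\n", field(after, 11));
    return 0;
}
```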

ggerganov merged commit e6234cd into ggml-org:master on Apr 4, 2025.
fujimotos deleted the sf/bench-all branch on April 5, 2025.
fujimotos added a commit to fujimotos/whisper.cpp that referenced this pull request on Apr 20, 2025:

whisper : fix "bench-all outputs an invalid result on larger models" (ggml-org#3002)

The benchmark script 'scripts/bench-all.sh' assumes that the 11th
field of the output line is a timestamp. This assumption does not
hold when the target model takes a bit longer to process.

Fix this issue by introducing an explicit whitespace to the output
lines of `whisper_print_timings()`.

Signed-off-by: Fujimoto Seiji <[email protected]>