
whisper : fix "bench-all outputs an invalid result on larger models" #3002


Merged
1 commit merged into ggml-org:master from sf/bench-all on Apr 4, 2025

Conversation

fujimotos (Contributor)

When I run scripts/bench-all.sh on AWS c8g.xlarge, it outputs an invalid value ("ms") in the result table.

Look at the 'Enc.' column in the benchmark result below:

Running bench-all.sh (commit 6e7629b)

$ ./scripts/bench-all.sh 4 
...
|    CPU |     OS |           Config |         Model |  Th |  FA |    Enc. |    Dec. |    Bch5 |      PP |  Commit |
|    --- |    --- |              --- |           --- | --- | --- |     --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |          tiny |   4 |   0 |  389.04 |    1.17 |    0.68 |    0.55 | 6e7629b |
| <todo> | <todo> |             NEON |          base |   4 |   0 |  879.42 |    2.05 |    1.22 |    0.98 | 6e7629b |
| <todo> | <todo> |             NEON |         small |   4 |   0 | 3290.80 |    5.45 |    3.36 |    2.82 | 6e7629b |
| <todo> | <todo> |             NEON |        medium |   4 |   0 |      ms |   14.85 |    9.51 |    8.05 | 6e7629b |
| <todo> | <todo> |             NEON |      large-v2 |   4 |   0 |      ms |   28.67 |   17.79 |   15.08 | 6e7629b |
| <todo> | <todo> |             NEON | large-v3-turbo |   4 |   0 |      ms |    4.95 |    3.15 |    2.70 | 6e7629b |

The reason is that the benchmark script assumes that the 11th whitespace-delimited field of the timing output is the per-run time, but this assumption breaks when the target model takes longer to process.

This is a trivial fix for the issue: add an explicit space before the per-run time field.
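
For illustration, here is a minimal, self-contained sketch of the kind of format-string change this implies. The variable names and the exact format strings are hypothetical; the real code lives in `whisper_print_timings()` in whisper.cpp and may differ in detail.

```c
// Hypothetical sketch of the format-string change; not the actual
// whisper.cpp source.
#include <stdio.h>

int main(void) {
    const double encode_ms = 20320.76; // example: total encode time for large-v2
    const int    n_runs    = 1;

    // Before: "%8.2f" fills its 8-character field for values this large,
    // so the number fuses with the opening "(" into one token: "(20320.76".
    printf("whisper_print_timings:   encode time = %8.2f ms / %5d runs (%8.2f ms per run)\n",
           encode_ms, n_runs, encode_ms / n_runs);

    // After: an explicit space before the per-run value keeps "(" and the
    // number as separate whitespace-delimited fields, however wide the
    // number gets.
    printf("whisper_print_timings:   encode time = %8.2f ms / %5d runs ( %8.2f ms per run)\n",
           encode_ms, n_runs, encode_ms / n_runs);
    return 0;
}
```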

Running bench-all.sh (commit a7cf427)

$ ./scripts/bench-all.sh 4 
...
|    CPU |     OS |           Config |         Model |  Th |  FA |    Enc. |    Dec. |    Bch5 |      PP |  Commit |
|    --- |    --- |              --- |           --- | --- | --- |     --- |     --- |     --- |     --- |     --- |
| <todo> | <todo> |             NEON |          tiny |   4 |   0 |  389.81 |    1.22 |    0.70 |    0.56 | a7cf427 |
| <todo> | <todo> |             NEON |          base |   4 |   0 |  883.28 |    2.12 |    1.25 |    0.99 | a7cf427 |
| <todo> | <todo> |             NEON |         small |   4 |   0 | 3302.36 |    5.61 |    3.43 |    2.86 | a7cf427 |
| <todo> | <todo> |             NEON |        medium |   4 |   0 | 10561.90 |   15.42 |    9.71 |    8.14 | a7cf427 |
| <todo> | <todo> |             NEON |      large-v2 |   4 |   0 | 20608.38 |   29.33 |   18.36 |   15.26 | a7cf427 |
| <todo> | <todo> |             NEON | large-v3-turbo |   4 |   0 | 18801.69 |    5.10 |    3.27 |    2.73 | a7cf427 |

The benchmark script 'scripts/bench-all.sh' assumes that the 11th
field of the output line is a timestamp. This assumption does not
hold when the target model takes a bit longer to process.

Fix this issue by introducing an explicit whitespace to the output
lines of `whisper_print_timings()`.

Signed-off-by: Fujimoto Seiji <[email protected]>
fujimotos (Contributor, Author)

Note: I needed this fix to compute the benchmark result in #89 (comment).

To illustrate the point, this PR changes the following raw output:

$ ./build/bin/whisper-bench -m ./models/ggml-large-v2.bin -t 4
...
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time = 20320.76 ms /     1 runs (20320.76 ms per run)
whisper_print_timings:   decode time =  7299.80 ms /   256 runs (   28.51 ms per run)
whisper_print_timings:   batchd time =  5664.58 ms /   320 runs (   17.70 ms per run)
whisper_print_timings:   prompt time = 61682.86 ms /  4096 runs (   15.06 ms per run)

... to this:

$ ./build/bin/whisper-bench -m ./models/ggml-large-v2.bin -t 4
...
whisper_print_timings:   sample time =     0.00 ms /     1 runs (     0.00 ms per run)
whisper_print_timings:   encode time = 20625.59 ms /     1 runs ( 20625.59 ms per run)
whisper_print_timings:   decode time =  7580.05 ms /   256 runs (    29.61 ms per run)
whisper_print_timings:   batchd time =  5967.29 ms /   320 runs (    18.65 ms per run)
whisper_print_timings:   prompt time = 62751.71 ms /  4096 runs (    15.32 ms per run)

... which ensures that `awk '{print $11}'` always works.
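
To make the field counting concrete, here is a small standalone sketch (not part of the PR) that splits the two example lines the way awk's default whitespace splitting does and prints field 11. The helper `field()` is hypothetical, written just for this illustration.

```c
// Standalone illustration; the two input lines are copied from the example
// output above.
#include <stdio.h>
#include <string.h>

// Return the n-th whitespace-delimited field (1-based), similar to awk's $n.
static char *field(char *line, int n) {
    char *tok = strtok(line, " \t");
    for (int i = 1; tok != NULL && i < n; i++) {
        tok = strtok(NULL, " \t");
    }
    return tok;
}

int main(void) {
    char before[] = "whisper_print_timings:   encode time = 20320.76 ms /     1 runs (20320.76 ms per run)";
    char after[]  = "whisper_print_timings:   encode time = 20625.59 ms /     1 runs ( 20625.59 ms per run)";

    // Before the fix, "(20320.76" is one token, so field 11 is "ms".
    printf("before: $11 = %s\n", field(before, 11));
    // After the fix, "(" and the number are separate tokens, so field 11
    // is the per-run encode time.
    printf("after:  $11 = %s\n", field(after, 11));
    return 0;
}
```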

ggerganov merged commit e6234cd into ggml-org:master on Apr 4, 2025.
fujimotos deleted the sf/bench-all branch on April 5, 2025.
fujimotos added a commit to fujimotos/whisper.cpp that referenced this pull request on Apr 20, 2025:

whisper : fix "bench-all outputs an invalid result on larger models" (ggml-org#3002)

The benchmark script 'scripts/bench-all.sh' assumes that the 11th
field of the output line is a timestamp. This assumption does not
hold when the target model takes a bit longer to process.

Fix this issue by introducing an explicit whitespace to the output
lines of `whisper_print_timings()`.

Signed-off-by: Fujimoto Seiji <[email protected]>